From: Mel Gorman on 7 Apr 2010 12:10

On Tue, Apr 06, 2010 at 05:06:03PM -0700, Andrew Morton wrote:
> On Fri, 2 Apr 2010 17:02:45 +0100
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > Ordinarily when a high-order allocation fails, direct reclaim is
> > entered to free pages to satisfy the allocation. With this patch, it
> > is determined if an allocation failed due to external fragmentation
> > instead of low memory and if so, the calling process will compact
> > until a suitable page is freed. Compaction by moving pages in memory
> > is considerably cheaper than paging out to disk and works where there
> > are locked pages or no swap. If compaction fails to free a page of a
> > suitable size, then reclaim will still occur.
>
> Does this work?

Well, yes, or there would not be a marked reduction in the latency to
allocate a huge page as linked to in the leader, and the difference in
allocation success rates on ppc64 would not be so marked.

> > Direct compaction returns as soon as possible. As each block is
> > compacted, it is checked if a suitable page has been freed and if so,
> > it returns.
>
> So someone else can get in and steal it. How is that resolved?

It isn't; lumpy reclaim has a similar problem. The freed pages could be
captured of course, but so far stealing has only been a problem under
very heavy memory pressure.

> Please expound upon the relationship between the icky pageblock_order
> and the caller's desired allocation order here.

Compaction works on the same units as anti-fragmentation does - the
pageblock_order. It could work on units smaller than that when selecting
pages to migrate from and to, but there would be little advantage for
some additional complexity. The caller's desired allocation order
determines whether compaction has finished after a pageblock of pages
has been migrated.

> The compaction design
> seems fairly fixated upon pageblock_order - what happens if the caller
> wanted something larger than pageblock_order?

Then it would get tricky.
Selecting for migration stays simple, but there would be additional
complexity in finding two or more adjacent naturally-aligned
MIGRATE_MOVABLE blocks to migrate to. As pageblock_order is related to
the default huge page size, I'd wonder what caller would be routinely
allocating larger pages?

> The
> less-than-pageblock_order case seems pretty obvious, although perhaps
> wasteful?

compact_finished() could be called more regularly, but the waste is
minimal. At worst, a few more pages get migrated than were necessary for
the caller to successfully allocate. This is not massively dissimilar to
how direct reclaim can reclaim slightly more pages than necessary.

> >
> > ...
> >
> > +static unsigned long compact_zone_order(struct zone *zone,
> > +						int order, gfp_t gfp_mask)
> > +{
> > +	struct compact_control cc = {
> > +		.nr_freepages = 0,
> > +		.nr_migratepages = 0,
> > +		.order = order,
> > +		.migratetype = allocflags_to_migratetype(gfp_mask),
> > +		.zone = zone,
> > +	};
>
> yeah, like that.
>
> > +	INIT_LIST_HEAD(&cc.freepages);
> > +	INIT_LIST_HEAD(&cc.migratepages);
> > +
> > +	return compact_zone(zone, &cc);
> > +}
> > +
> > +/**
> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> > + * @zonelist: The zonelist used for the current allocation
> > + * @order: The order of the current allocation
> > + * @gfp_mask: The GFP mask of the current allocation
> > + * @nodemask: The allowed nodes to allocate from
> > + *
> > + * This is the main entry point for direct page compaction.
> > + */
> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > +	int may_enter_fs = gfp_mask & __GFP_FS;
> > +	int may_perform_io = gfp_mask & __GFP_IO;
> > +	unsigned long watermark;
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +	int rc = COMPACT_SKIPPED;
> > +
> > +	/*
> > +	 * Check whether it is worth even starting compaction. The order check is
> > +	 * made because an assumption is made that the page allocator can satisfy
> > +	 * the "cheaper" orders without taking special steps
> > +	 */
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER
>
> Was that a correct decision? If we perform compaction when smaller
> allocation attempts fail, will the kernel get better, or worse?

I think better, but there are concerns about LRU churn and it might
encourage increased use of high-order allocations. The desire is to try
compaction out first with huge pages and move towards lifting this
restriction on order later.

> And how do we save my order-4-allocating wireless driver?

Ultimately, it could perform a subset of compaction that doesn't go to
sleep, but migration isn't up to that right now.

> That would
> require that kswapd perform the compaction for me, perhaps?
>
> > +		|| !may_enter_fs || !may_perform_io)
>
> Would be nice to add some comments explaining this a bit more.
> Compaction doesn't actually perform IO, nor enter filesystems, does it?

Compaction doesn't, but migration can, and you don't know in advance
whether it will need to. Migration would itself need to take a GFP mask
of what was and wasn't allowed during the course of migration before
these checks could be moved. Not impossible, just not done as of this
time.
> > +		return rc;
> > +
> > +	count_vm_event(COMPACTSTALL);
> > +
> > +	/* Compact each zone in the list */
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> > +								nodemask) {
> > +		int fragindex;
> > +		int status;
> > +
> > +		/*
> > +		 * Watermarks for order-0 must be met for compaction. Note
> > +		 * the 2UL. This is because during migration, copies of
> > +		 * pages need to be allocated and for a short time, the
> > +		 * footprint is higher
> > +		 */
> > +		watermark = low_wmark_pages(zone) + (2UL << order);
> > +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > +			continue;
>
> ooh, so that starts to explain split_free_page(). But
> split_free_page() didn't do the 2UL thing.

No, but split_free_page() knows exactly how much it is removing at that
time. At this point, there is a worst-case expectation that the pages
being migrated from and to are both isolated. At no point should they
all be allocated at any given time, but it's not checking against
deadlocks.

> Surely these things are racy? So we'll deadlock less often :(

It won't deadlock; this is a heuristic only that guesses whether
compaction is likely to succeed. The watermarks are rechecked every
time pages are taken off the free list.

> > +		/*
> > +		 * fragmentation index determines if allocation failures are
> > +		 * due to low memory or external fragmentation
> > +		 *
> > +		 * index of -1 implies allocations might succeed depending
> > +		 * on watermarks
> > +		 * index towards 0 implies failure is due to lack of memory
> > +		 * index towards 1000 implies failure is due to fragmentation
> > +		 *
> > +		 * Only compact if a failure would be due to fragmentation.
> > +		 */
> > +		fragindex = fragmentation_index(zone, order);
> > +		if (fragindex >= 0 && fragindex <= 500)
> > +			continue;
> > +
> > +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> > +			rc = COMPACT_PARTIAL;
> > +			break;
> > +		}
>
> Why are we doing all this handwavy stuff? Why not just try a
> compaction run and see if it worked?

Because if that index is not matched, it really is a waste of time to
try compacting. It just won't work, but it'll do a full scan of the
zone figuring that out.

> That would be more robust/reliable, surely?

We'd also eventually get a bug report on low-memory situations causing
large amounts of CPU to be consumed in compaction without the pages
being allocated. Granted, we wouldn't get them until compaction was
also working for the lower orders, but we'd get the report eventually.

> > +		status = compact_zone_order(zone, order, gfp_mask);
> > +		rc = max(status, rc);
> > +
> > +		if (zone_watermark_ok(zone, order, watermark, 0, 0))
> > +			break;
> > +	}
> > +
> > +	return rc;
> > +}
> > +
> > +
> >  /* Compact all zones within a node */
> >  static int compact_node(int nid)
> >  {
> >
> > ...
> >
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -561,7 +561,7 @@ static int unusable_show(struct seq_file *m, void *arg)
> >   * The value can be used to determine if page reclaim or compaction
> >   * should be used
> >   */
> > -int fragmentation_index(unsigned int order, struct contig_page_info *info)
> > +int __fragmentation_index(unsigned int order, struct contig_page_info *info)
> >  {
> >  	unsigned long requested = 1UL << order;
> >
> > @@ -581,6 +581,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
> >  	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
> >  }
> >
> > +/* Same as __fragmentation_index but allocs contig_page_info on stack */
> > +int fragmentation_index(struct zone *zone, unsigned int order)
> > +{
> > +	struct contig_page_info info;
> > +
> > +	fill_contig_page_info(zone, order, &info);
> > +	return __fragmentation_index(order, &info);
> > +}
> >
> >  static void extfrag_show_print(struct seq_file *m,
> >  					pg_data_t *pgdat, struct zone *zone)
> > @@ -596,7 +604,7 @@ static void extfrag_show_print(struct seq_file *m,
> >  			zone->name);
> >  	for (order = 0; order < MAX_ORDER; ++order) {
> >  		fill_contig_page_info(zone, order, &info);
> > -		index = fragmentation_index(order, &info);
> > +		index = __fragmentation_index(order, &info);
> >  		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> >  	}
> >
> > @@ -896,6 +904,9 @@ static const char * const vmstat_text[] = {
> >  	"compact_blocks_moved",
> >  	"compact_pages_moved",
> >  	"compact_pagemigrate_failed",
> > +	"compact_stall",
> > +	"compact_fail",
> > +	"compact_success",
>
> CONFIG_COMPACTION=n?

Yeah, it should be.

> >
> > ...
> >

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on 7 Apr 2010 14:40

On Tue, Apr 06, 2010 at 05:06:03PM -0700, Andrew Morton wrote:
> > @@ -896,6 +904,9 @@ static const char * const vmstat_text[] = {
> >  	"compact_blocks_moved",
> >  	"compact_pages_moved",
> >  	"compact_pagemigrate_failed",
> > +	"compact_stall",
> > +	"compact_fail",
> > +	"compact_success",
>
> CONFIG_COMPACTION=n?

This patch goes on top of the series. It looks big, but it's mainly
moving code.

==== CUT HERE ====
mm,compaction: Do not display compaction-related stats when !CONFIG_COMPACTION

Although compaction can be disabled from .config, the vmstat entries
still exist. This patch removes the vmstat entries. As page_alloc.c
refers directly to the counters, the patch introduces
__alloc_pages_direct_compact() to isolate use of the counters.

Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
---
 include/linux/vmstat.h |    2 +
 mm/page_alloc.c        |   92 ++++++++++++++++++++++++++++++++---------------
 mm/vmstat.c            |    2 +
 3 files changed, 66 insertions(+), 30 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index b4b4d34..7f43ccd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,8 +43,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_COMPACTION
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46f6be4..514cc96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1756,6 +1756,59 @@ out:
 	return page;
 }

+#ifdef CONFIG_COMPACTION
+/* Try memory compaction for high-order allocations before reclaim */
+static struct page *
+__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	int migratetype, unsigned long *did_some_progress)
+{
+	struct page *page;
+
+	if (!order)
+		return NULL;
+
+	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
+								nodemask);
+	if (*did_some_progress != COMPACT_SKIPPED) {
+
+		/* Page migration frees to the PCP lists but we want merging */
+		drain_pages(get_cpu());
+		put_cpu();
+
+		page = get_page_from_freelist(gfp_mask, nodemask,
+				order, zonelist, high_zoneidx,
+				alloc_flags, preferred_zone,
+				migratetype);
+		if (page) {
+			__count_vm_event(COMPACTSUCCESS);
+			return page;
+		}
+
+		/*
+		 * It's bad if compaction run occurs and fails.
+		 * The most likely reason is that pages exist,
+		 * but not enough to satisfy watermarks.
+		 */
+		count_vm_event(COMPACTFAIL);
+
+		cond_resched();
+	}
+
+	return NULL;
+}
+#else
+static inline struct page *
+__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	int migratetype, unsigned long *did_some_progress)
+{
+	return NULL;
+}
+#endif /* CONFIG_COMPACTION */
+
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1769,36 +1822,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,

 	cond_resched();

-	/* Try memory compaction for high-order allocations before reclaim */
-	if (order) {
-		*did_some_progress = try_to_compact_pages(zonelist,
-						order, gfp_mask, nodemask);
-		if (*did_some_progress != COMPACT_SKIPPED) {
-
-			/* Page migration frees to the PCP lists but we want merging */
-			drain_pages(get_cpu());
-			put_cpu();
-
-			page = get_page_from_freelist(gfp_mask, nodemask,
-					order, zonelist, high_zoneidx,
-					alloc_flags, preferred_zone,
-					migratetype);
-			if (page) {
-				__count_vm_event(COMPACTSUCCESS);
-				return page;
-			}
-
-			/*
-			 * It's bad if compaction run occurs and fails.
-			 * The most likely reason is that pages exist,
-			 * but not enough to satisfy watermarks.
-			 */
-			count_vm_event(COMPACTFAIL);
-
-			cond_resched();
-		}
-	}
-
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	p->flags |= PF_MEMALLOC;
@@ -1972,6 +1995,15 @@ rebalance:
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;

+	/* Try direct compaction */
+	page = __alloc_pages_direct_compact(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask,
+					alloc_flags, preferred_zone,
+					migratetype, &did_some_progress);
+	if (page)
+		goto got_pg;
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2780a36..0a58cbe 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,12 +901,14 @@ static const char * const vmstat_text[] = {
 	"pgrotated",

+#ifdef CONFIG_COMPACTION
 	"compact_blocks_moved",
 	"compact_pages_moved",
 	"compact_pagemigrate_failed",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
+#endif

 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",