From: Mel Gorman on 28 Oct 2009 06:30 On Tue, Oct 27, 2009 at 01:19:05PM -0700, Andrew Morton wrote: > On Tue, 27 Oct 2009 13:40:33 +0000 > Mel Gorman <mel(a)csn.ul.ie> wrote: > > > When a high-order allocation fails, kswapd is kicked so that it reclaims > > at a higher-order to avoid direct reclaimers stall and to help GFP_ATOMIC > > allocations. Something has changed in recent kernels that affect the timing > > where high-order GFP_ATOMIC allocations are now failing with more frequency, > > particularly under pressure. This patch forces kswapd to notice sooner that > > high-order allocations are occuring. > > > > "something has changed"? Shouldn't we find out what that is? > We've been trying but the answer right now is "lots". There were some changes in the allocator itself which were unintentional and fixed in patches 1 and 2 of this series. The two other major changes are iwlagn is now making high order GFP_ATOMIC allocations which didn't help. This is being addressed separetly and I believe the relevant patches are now in mainline. The other major change appears to be in page writeback. Reverting commits 373c0a7e + 8aa7e847 significantly helps one bug reporter but it's still unknown as to why that is. The latter is still being investigated but as the patches in this series are known to help some bug reporters with their GFP_ATOMIC failures and it is being reported against latest mainline and -stable, I felt it was best to help some of the bug reporters now to reduce duplicate reports. > > --- > > mm/vmscan.c | 9 +++++++++ > > 1 files changed, 9 insertions(+), 0 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 64e4388..7eceb02 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2016,6 +2016,15 @@ loop_again: > > priority != DEF_PRIORITY) > > continue; > > > > + /* > > + * Exit the function now and have kswapd start over > > + * if it is known that higher orders are required > > + */ > > + if (pgdat->kswapd_max_order > order) { > > + all_zones_ok = 1; > > + goto out; > > + } > > + > > if (!zone_watermark_ok(zone, order, > > high_wmark_pages(zone), end_zone, 0)) > > all_zones_ok = 0; > > So this handles the case where some concurrent thread or interrupt > increases pgdat->kswapd_max_order while kswapd was running > balance_pgdat(), yes? > Right. > Does that actually happen much? Enough for this patch to make any > useful difference? > Apparently, yes. Wireless drivers in particularly seem to be very high-order GFP_ATOMIC happy. > If one where to whack a printk in that `if' block, how often would it > trigger, and under what circumstances? I don't know the frequency. The circumstances are "under load" when there are drivers depending on high-order allocations but the reproduction cases are unreliable. Do you want me to slap together a patch that adds a vmstat counter for this? I can then ask future bug reporters to examine that counter and see if it really is a major factor for a lot of people or not. > If the -stable maintainers were to ask me "why did you send this" then > right now my answer would have to be "I have no idea". Help. > -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on 2 Nov 2009 11:10 On Wed, Oct 28, 2009 at 12:47:56PM -0700, Andrew Morton wrote: > On Wed, 28 Oct 2009 10:29:36 +0000 > Mel Gorman <mel(a)csn.ul.ie> wrote: > > > On Tue, Oct 27, 2009 at 01:19:05PM -0700, Andrew Morton wrote: > > > On Tue, 27 Oct 2009 13:40:33 +0000 > > > Mel Gorman <mel(a)csn.ul.ie> wrote: > > > > > > > When a high-order allocation fails, kswapd is kicked so that it reclaims > > > > at a higher-order to avoid direct reclaimers stall and to help GFP_ATOMIC > > > > allocations. Something has changed in recent kernels that affect the timing > > > > where high-order GFP_ATOMIC allocations are now failing with more frequency, > > > > particularly under pressure. This patch forces kswapd to notice sooner that > > > > high-order allocations are occuring. > > > > > > > > > > "something has changed"? Shouldn't we find out what that is? > > > > > > > We've been trying but the answer right now is "lots". There were some > > changes in the allocator itself which were unintentional and fixed in > > patches 1 and 2 of this series. The two other major changes are > > > > iwlagn is now making high order GFP_ATOMIC allocations which didn't > > help. This is being addressed separetly and I believe the relevant > > patches are now in mainline. > > > > The other major change appears to be in page writeback. Reverting > > commits 373c0a7e + 8aa7e847 significantly helps one bug reporter but > > it's still unknown as to why that is. > > Peculiar. Those changes are fairly remote from large-order-GFP_ATOMIC > allocations. > Indeed. The significance of the patch seems to be how long and how often processes sleep in the page allocator and what kswapd is doing. > > ... > > > > Wireless drivers in particularly seem to be very > > high-order GFP_ATOMIC happy. > > It would be nice if we could find a way of preventing people from > attempting high-order atomic allocations in the first place - it's a bit > of a trap. > True. > Maybe add a runtime warning which is suppressable by GFP_NOWARN (or a > new flag), then either fix existing callers or, after review, add the > flag. > > Of course, this might just end up with people adding these hopeless > allocation attempts and just setting the nowarn flag :( > That's the difficulty but we should consider adding such warnings or maintaining in-kernel the unique GFP_ATOMIC callers and their frequency. It would require a lot of monitoring though and a fair amount of stick beatings to get the callers corrected. > > > If one where to whack a printk in that `if' block, how often would it > > > trigger, and under what circumstances? > > > > I don't know the frequency. The circumstances are "under load" when > > there are drivers depending on high-order allocations but the > > reproduction cases are unreliable. > > > > Do you want me to slap together a patch that adds a vmstat counter for > > this? I can then ask future bug reporters to examine that counter and see > > if it really is a major factor for a lot of people or not. > > Something like that, if it will help us understand what's going on. I > don't see a permanent need for that instrumentation but while this > problem is still in the research stage, sure, lard it up with debug > stuff? > I have a candidate patch below. One of the reasons it took so long to get out is what I found on the way developing the patch. I had added a debugging patch to printk what kswapd was doing. One massive difference I noted was that in 2.6.30 kswapd often went to sleep for 25 jiffies (HZ/10) in balance_pgdat(). In 2.6.31 and particularly in mainline, it sleeps less and for shorter intervals. When the sleep interval is low, kswapd notices the watermarks are ok and goes back to sleep far quicker than 2.6.30 did. One consequence of this is that kswapd is going back to sleep just as the high watermark is clear but if it had slept for longer it would have found that the zone quickly went back below the high watermark due to parallel allocators. i.e. in 2.6.30, kswapd worked for longer than current mainline. To see if there is any merit to this, the patch below also counts the number of times that kswapd prematurely went to sleep. If kswapd is routinely going to sleep with watermarks not being met, one correction might be to make balance_pgdat() unconditionally sleep for HZ/10 instead of sleeping based on congestion as this would bring kswapd closer in line with 2.6.30. Of course, the pain in the neck is that the premature-sleep-check itself is happening too quickly. > It's very important to understand _why_ the VM got worse. And, of > course, to fix that up. But, separately, we should find a way of > preventing developers from using these very unreliable allocations. > Agreed. I think the main thing that has changed is timing. congestion_wait() is now doing the "right" thing and sleeping until congestion is cleared. Unfortunately, it feels like some users of congestion_wait(), such as kswapd, really wanted to sleep for a fixed interval and not based on congestion. The comment in balance_pgdat() appears to indicate this was the expected behaviour. ==== CUT HERE ==== vmscan: Help debug kswapd issues by counting number of rewakeups and premature sleeps There is a growing amount of anedotal evidence that high-order atomic allocation failures have been increasing since 2.6.31-rc1. The two strongest possibilities are a marked increase in the number of GFP_ATOMIC allocations and alterations in timing. Debugging printk patches have shown for example that kswapd is sleeping for shorter intervals and going to sleep when watermarks are still not being met. This patch adds two kswapd counters to help identify if timing is an issue. The first counter kswapd_highorder_rewakeup counts the number of times that kswapd stops reclaiming at one order and restarts at a higher order. The second counter kswapd_slept_prematurely counts the number of times kswapd went to sleep when the high watermark was not met. Signed-off-by: Mel Gorman <mel(a)csn.ul.ie> --- include/linux/vmstat.h | 1 + mm/vmscan.c | 17 ++++++++++++++++- mm/vmstat.c | 2 ++ 3 files changed, 19 insertions(+), 1 deletion(-) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 2d0f222..2e0d18d 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -40,6 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PGSCAN_ZONE_RECLAIM_FAILED, #endif PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL, + KSWAPD_HIGHORDER_REWAKEUP, KSWAPD_PREMATURE_SLEEP, PAGEOUTRUN, ALLOCSTALL, PGROTATED, #ifdef CONFIG_HUGETLB_PAGE HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL, diff --git a/mm/vmscan.c b/mm/vmscan.c index 7eceb02..cf40136 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2021,6 +2021,7 @@ loop_again: * if it is known that higher orders are required */ if (pgdat->kswapd_max_order > order) { + count_vm_event(KSWAPD_HIGHORDER_REWAKEUP); all_zones_ok = 1; goto out; } @@ -2124,6 +2125,17 @@ out: return sc.nr_reclaimed; } +static int kswapd_sleeping_prematurely(int order) +{ + struct zone *zone; + for_each_populated_zone(zone) + if (!zone_watermark_ok(zone, order, high_wmark_pages(zone), + 0, 0)) + return 1; + + return 0; +} + /* * The background pageout daemon, started as a kernel thread * from the init process. @@ -2183,8 +2195,11 @@ static int kswapd(void *p) */ order = new_order; } else { - if (!freezing(current)) + if (!freezing(current)) { + if (kswapd_sleeping_prematurely(order)) + count_vm_event(KSWAPD_PREMATURE_SLEEP); schedule(); + } order = pgdat->kswapd_max_order; } diff --git a/mm/vmstat.c b/mm/vmstat.c index c81321f..fa881c5 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -683,6 +683,8 @@ static const char * const vmstat_text[] = { "slabs_scanned", "kswapd_steal", "kswapd_inodesteal", + "kswapd_highorder_rewakeup", + "kswapd_slept_prematurely", "pageoutrun", "allocstall", -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on 2 Nov 2009 12:40 On Mon, Nov 02, 2009 at 06:32:54PM +0100, Frans Pop wrote: > On Monday 02 November 2009, Mel Gorman wrote: > > vmscan: Help debug kswapd issues by counting number of rewakeups and > > premature sleeps > > > > There is a growing amount of anedotal evidence that high-order atomic > > allocation failures have been increasing since 2.6.31-rc1. The two > > strongest possibilities are a marked increase in the number of > > GFP_ATOMIC allocations and alterations in timing. Debugging printk > > patches have shown for example that kswapd is sleeping for shorter > > intervals and going to sleep when watermarks are still not being met. > > > > This patch adds two kswapd counters to help identify if timing is an > > issue. The first counter kswapd_highorder_rewakeup counts the number of > > times that kswapd stops reclaiming at one order and restarts at a higher > > order. The second counter kswapd_slept_prematurely counts the number of > > times kswapd went to sleep when the high watermark was not met. > > What testing would you like done with this patch? > Same reproduction as before except post what the contents of /proc/vmstat were after the problem was triggered. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on 2 Nov 2009 15:40 On Mon, Nov 02, 2009 at 05:38:38PM +0000, Mel Gorman wrote: > On Mon, Nov 02, 2009 at 06:32:54PM +0100, Frans Pop wrote: > > On Monday 02 November 2009, Mel Gorman wrote: > > > vmscan: Help debug kswapd issues by counting number of rewakeups and > > > premature sleeps > > > > > > There is a growing amount of anedotal evidence that high-order atomic > > > allocation failures have been increasing since 2.6.31-rc1. The two > > > strongest possibilities are a marked increase in the number of > > > GFP_ATOMIC allocations and alterations in timing. Debugging printk > > > patches have shown for example that kswapd is sleeping for shorter > > > intervals and going to sleep when watermarks are still not being met. > > > > > > This patch adds two kswapd counters to help identify if timing is an > > > issue. The first counter kswapd_highorder_rewakeup counts the number of > > > times that kswapd stops reclaiming at one order and restarts at a higher > > > order. The second counter kswapd_slept_prematurely counts the number of > > > times kswapd went to sleep when the high watermark was not met. > > > > What testing would you like done with this patch? > > > > Same reproduction as before except post what the contents of > /proc/vmstat were after the problem was triggered. > In the event there is a positive count for kswapd_slept_prematurely after the error is produced, can you also check if the following patch makes a difference and what the contents of vmstat are please? It alters how kswapd behaves and when it goes to sleep. Thanks ==== CUT HERE ==== vmscan: Have kswapd sleep for a short interval and double check it should be asleep After kswapd balances all zones in a pgdat, it goes to sleep. In the event of no IO congestion, kswapd can go to sleep very shortly after the high watermark was reached. If there are a constant stream of allocations from parallel processes, it can mean that kswapd went to sleep too quickly and the high watermark is not being maintained for sufficient length time. This patch makes kswapd go to sleep as a two-stage process. It first tries to sleep for HZ/10. If it is woken up by another process or the high watermark is no longer met, it's considered a premature sleep and kswapd continues work. Otherwise it goes fully to sleep. This adds more counters to distinguish between fast and slow breaches of watermarks. A "fast" premature sleep is one where the low watermark was hit in a very short time after kswapd going to sleep. A "slow" premature sleep indicates that the high watermark was breached after a very short interval. Signed-off-by: Mel Gorman <mel(a)csn.ul.ie> --- include/linux/vmstat.h | 3 ++- mm/vmscan.c | 31 +++++++++++++++++++++++++++---- mm/vmstat.c | 3 ++- 3 files changed, 31 insertions(+), 6 deletions(-) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 2e0d18d..f344878 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -40,7 +40,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PGSCAN_ZONE_RECLAIM_FAILED, #endif PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL, - KSWAPD_HIGHORDER_REWAKEUP, KSWAPD_PREMATURE_SLEEP, + KSWAPD_HIGHORDER_REWAKEUP, + KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW, PAGEOUTRUN, ALLOCSTALL, PGROTATED, #ifdef CONFIG_HUGETLB_PAGE HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL, diff --git a/mm/vmscan.c b/mm/vmscan.c index 11a69a8..70aeb05 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1905,10 +1905,14 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, #endif /* is kswapd sleeping prematurely? */ -static int sleeping_prematurely(int order) +static int sleeping_prematurely(int order, long remaining) { struct zone *zone; + /* If a direct reclaimer woke kswapd within HZ/10, it's premature */ + if (remaining) + return 1; + /* If after HZ/10, a zone is below the high mark, it's premature */ for_each_populated_zone(zone) if (!zone_watermark_ok(zone, order, high_wmark_pages(zone), @@ -2209,9 +2213,28 @@ static int kswapd(void *p) order = new_order; } else { if (!freezing(current)) { - if (sleeping_prematurely(order)) - count_vm_event(KSWAPD_PREMATURE_SLEEP); - schedule(); + long remaining = 0; + + /* Try to sleep for a short interval */ + if (!sleeping_prematurely(order, remaining)) { + remaining = schedule_timeout(HZ/10); + finish_wait(&pgdat->kswapd_wait, &wait); + prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); + } + + /* + * After a short sleep, check if it was a + * premature sleep. If not, then go fully + * to sleep until explicitly woken up + */ + if (!sleeping_prematurely(order, remaining)) + schedule(); + else { + if (remaining) + count_vm_event(KSWAPD_PREMATURE_FAST); + else + count_vm_event(KSWAPD_PREMATURE_SLOW); + } } order = pgdat->kswapd_max_order; diff --git a/mm/vmstat.c b/mm/vmstat.c index fa881c5..47a6914 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -684,7 +684,8 @@ static const char * const vmstat_text[] = { "kswapd_steal", "kswapd_inodesteal", "kswapd_highorder_rewakeup", - "kswapd_slept_prematurely", + "kswapd_slept_prematurely_fast", + "kswapd_slept_prematurely_slow", "pageoutrun", "allocstall", -- 1.6.3.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on 3 Nov 2009 17:10 On Tue, Nov 03, 2009 at 11:01:50PM +0100, Frans Pop wrote: > On Monday 02 November 2009, Mel Gorman wrote: > > On Mon, Nov 02, 2009 at 06:32:54PM +0100, Frans Pop wrote: > > > On Monday 02 November 2009, Mel Gorman wrote: > > > > vmscan: Help debug kswapd issues by counting number of rewakeups and > > > > premature sleeps > > > > > > > > There is a growing amount of anedotal evidence that high-order > > > > atomic allocation failures have been increasing since 2.6.31-rc1. > > > > The two strongest possibilities are a marked increase in the number > > > > of GFP_ATOMIC allocations and alterations in timing. Debugging > > > > printk patches have shown for example that kswapd is sleeping for > > > > shorter intervals and going to sleep when watermarks are still not > > > > being met. > > > > > > > > This patch adds two kswapd counters to help identify if timing is an > > > > issue. The first counter kswapd_highorder_rewakeup counts the number > > > > of times that kswapd stops reclaiming at one order and restarts at a > > > > higher order. The second counter kswapd_slept_prematurely counts the > > > > number of times kswapd went to sleep when the high watermark was not > > > > met. > > > > > > What testing would you like done with this patch? > > > > Same reproduction as before except post what the contents of > > /proc/vmstat were after the problem was triggered. > > With a representative test I get 0 for kswapd_slept_prematurely. > Tested with .32-rc6 + patches 1-3 + this patch. > Assuming the problem actually reproduced, can you still retest with the patch I posted as a follow-up and see if fast or slow premature sleeps are happening and if the problem still occurs please? It's still possible with the patch as-is could be timing related. After I posted this patch, I continued testing and found I could get counts fairly reliably if kswapd was calling printk() before making the premature check so the window appears to be very small. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
|
Next
|
Last
Pages: 1 2 Prev: x86, mce: disable MCE if cpu has no MCE banks Next: ark3116: (v2) driver rework after review |