From: Mel Gorman on 12 Mar 2010 04:20

On Thu, Mar 11, 2010 at 03:41:24PM -0800, Andrew Morton wrote:
> On Mon, 8 Mar 2010 11:48:20 +0000
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > Under memory pressure, the page allocator and kswapd can go to sleep using
> > congestion_wait(). In two of these cases, it may not be the appropriate
> > action as congestion may not be the problem.
>
> clear_bdi_congested() is called each time a write completes and the
> queue is below the congestion threshold.
>

Where you appear to get a kicking is if you wait on "congestion" but no
writes are involved. In that case you potentially sleep for the whole
timeout waiting on an event that is not going to occur.

> So if the page allocator or kswapd call congestion_wait() against a
> non-congested queue, they'll wake up on the very next write completion.
>
> Hence the above-quoted claim seems to me to be a significant mis-analysis and
> perhaps explains why the patchset didn't seem to help anything?
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
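For context, the behaviour described above follows from how congestion_wait()
was implemented at the time. The sketch below is a simplified reconstruction
of the 2.6.3x-era mm/backing-dev.c rather than verbatim source, and details
may vary between releases. The key point is that a sleeper on congestion_wqh[]
is only woken early by clear_bdi_congested(), which is driven by write
completions, so with no writes in flight the caller sleeps for the full
timeout:

/* Simplified sketch of the era's congestion_wait() machinery */
static wait_queue_head_t congestion_wqh[2] = {
        __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
        __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};

long congestion_wait(int sync, long timeout)
{
        long ret;
        DEFINE_WAIT(wait);
        wait_queue_head_t *wqh = &congestion_wqh[sync];

        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
        /* Only a wake_up() on wqh or expiry of the timeout ends this sleep */
        ret = io_schedule_timeout(timeout);
        finish_wait(wqh, &wait);
        return ret;
}

/* Called when a write completes and the queue drops below the threshold */
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
        clear_bit(sync ? BDI_sync_congested : BDI_async_congested,
                  &bdi->state);
        smp_mb__after_clear_bit();
        if (waitqueue_active(&congestion_wqh[sync]))
                wake_up(&congestion_wqh[sync]);
}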
From: Mel Gorman on 12 Mar 2010 05:50

On Fri, Mar 12, 2010 at 02:05:26AM -0500, Andrew Morton wrote:
> On Fri, 12 Mar 2010 07:39:26 +0100 Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:
>
> >
> > Andrew Morton wrote:
> > > On Mon, 8 Mar 2010 11:48:20 +0000
> > > Mel Gorman <mel(a)csn.ul.ie> wrote:
> > >
> > >> Under memory pressure, the page allocator and kswapd can go to sleep using
> > >> congestion_wait(). In two of these cases, it may not be the appropriate
> > >> action as congestion may not be the problem.
> > >
> > > clear_bdi_congested() is called each time a write completes and the
> > > queue is below the congestion threshold.
> > >
> > > So if the page allocator or kswapd call congestion_wait() against a
> > > non-congested queue, they'll wake up on the very next write completion.
> >
> > Well the issue came up in all kinds of loads where you don't have any
> > writes at all that can wake up congestion_wait.
> > That's true for several benchmarks, but also for real workloads as well,
> > e.g. a backup job reading almost all files sequentially and pumping out
> > stuff via network.
>
> Why is reclaim going into congestion_wait() at all if there's heaps of
> clean reclaimable pagecache lying around?
>
> (I don't think the read side of the congestion_wqh[] has ever been used, btw)
>

I believe it's a race, albeit one that has been there a long time. In
__alloc_pages_direct_reclaim, a process does approximately the following

1. Enters direct reclaim
2. Calls cond_resched()
3. Drains pages if necessary
4. Attempts to allocate a page

Between steps 2 and 3, it's possible for the pages that were just reclaimed
to be allocated by another process. The reclaiming process then proceeds,
fails the allocation and decides to try again, but calls congestion_wait()
before it loops around. Plenty of read cache reclaimed but no forward
progress.

> > > Hence the above-quoted claim seems to me to be a significant mis-analysis and
> > > perhaps explains why the patchset didn't seem to help anything?
> >
> > While I might have misunderstood you and it is a mis-analysis in your
> > opinion, it fixes a -80% throughput regression on sequential read
> > workloads - that's not nothing, it's more like absolutely required :-)
> >
> > You might check out the discussion with the subject "Performance
> > regression in scsi sequential throughput (iozone) due to "e084b -
> > page-allocator: preserve PFN ordering when __GFP_COLD is set"".
> > While the original subject is misleading from today's point of view, it
> > contains a lengthy discussion about exactly when/why/where time is lost
> > due to congestion wait, with a lot of traces, counters, data attachments
> > and such stuff.
>
> Well if we're not encountering lots of dirty pages in reclaim then we
> shouldn't be waiting for writes to retire, of course.
>
> But if we're not encountering lots of dirty pages in reclaim, we should
> be reclaiming pages, normally.
>

We probably are.

> I could understand reclaim accidentally going into congestion_wait() if
> it hit a large pile of pages which are unreclaimable for reasons other
> than being dirty, but is that happening in this case?
>

Probably not. It's almost certainly the race I described above.

> If not, we broke it again.
>

We were broken with respect to this in the first place. That cond_resched()
is badly placed, and waiting on congestion when congestion might not be
involved is also a bit odd.

It's possible that Christian's specific problem would also be addressed by
the following patch. Christian, willing to test?
It still feels a bit unnatural though that the page allocator waits on
congestion when what it really cares about is watermarks. Even if this
patch works for Christian, I think it still has merit so will kick it a
few more times.

==== CUT HERE ====
page-allocator: Attempt page allocation immediately after direct reclaim

After a process completes direct reclaim it calls cond_resched() as
potentially it has been running a long time. When it wakes up, it attempts
to allocate a page. There is a large window during which another process
can allocate the pages reclaimed by direct reclaim. This patch attempts to
allocate a page immediately after direct reclaim but will still go to
sleep afterwards if its quantum has expired.

Signed-off-by: Mel Gorman <mel(a)csn.ul.ie>
---
 mm/page_alloc.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8182c8..973b7fc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1721,8 +1721,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	lockdep_clear_current_reclaim_state();
 	p->flags &= ~PF_MEMALLOC;
 
-	cond_resched();
-
 	if (order != 0)
 		drain_all_pages();
 
@@ -1731,6 +1729,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 				zonelist, high_zoneidx,
 				alloc_flags, preferred_zone,
 				migratetype);
+
+	cond_resched();
+
 	return page;
 }
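To make the race described above more concrete: before this patch, the tail
of __alloc_pages_direct_reclaim() looked roughly as follows. This is a
simplified paraphrase of the 2.6.33-era mm/page_alloc.c rather than verbatim
source (the reclaim_state and lockdep handling are omitted), with comments
marking the window in which another task can steal the freed pages:

/* Simplified paraphrase of the pre-patch flow, not verbatim kernel source */
static struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
        struct zonelist *zonelist, enum zone_type high_zoneidx,
        nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
        int migratetype, unsigned long *did_some_progress)
{
        struct page *page = NULL;
        struct task_struct *p = current;

        /* Step 1: enter direct reclaim and free some pages */
        p->flags |= PF_MEMALLOC;
        *did_some_progress = try_to_free_pages(zonelist, order,
                                               gfp_mask, nodemask);
        p->flags &= ~PF_MEMALLOC;

        /*
         * Step 2: possibly schedule away.  Any task that runs here can
         * allocate the pages this task just reclaimed.
         */
        cond_resched();

        /* Step 3: drain per-cpu lists for high-order allocations */
        if (order != 0)
                drain_all_pages();

        /*
         * Step 4: attempt the allocation.  If the reclaimed pages were
         * stolen during the window above, this fails; the caller then
         * sees progress-but-no-page and falls into congestion_wait()
         * even though nothing is congested.
         */
        if (likely(*did_some_progress))
                page = get_page_from_freelist(gfp_mask, nodemask, order,
                                        zonelist, high_zoneidx, alloc_flags,
                                        preferred_zone, migratetype);
        return page;
}

The patch simply moves the cond_resched() to after the allocation attempt,
so the freshly reclaimed pages are tried for before this task can be
scheduled out.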
From: Mel Gorman on 15 Mar 2010 08:40

On Fri, Mar 12, 2010 at 09:37:55AM -0500, Andrew Morton wrote:
> On Fri, 12 Mar 2010 13:15:05 +0100 Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:
>
> > > It still feels a bit unnatural though that the page allocator waits on
> > > congestion when what it really cares about is watermarks. Even if this
> > > patch works for Christian, I think it still has merit so will kick it a
> > > few more times.
> >
> > In whatever way I can look at it watermark_wait should be superior to
> > congestion_wait. Because as Mel points out waiting for watermarks is
> > what is semantically correct there.
>
> If a direct-reclaimer waits for some thresholds to be achieved then what
> task is doing reclaim?
>
> Ultimately, kswapd.
>

Well, not quite. The direct reclaimer will still wake up after a timeout
and try again regardless of whether watermarks have been met or not. The
intention is to back off after direct reclaim has failed. Granted, the
window between a direct reclaim finishing and an allocation attempt
occurring is unnecessarily large. This may be addressed by the patch that
changes where cond_resched() is called.

> This will introduce a hard dependency upon kswapd
> activity. This might introduce scalability problems. And latency
> problems if kswapd is off doodling with a slow device (say), or doing a
> journal commit. And perhaps deadlocks if kswapd tries to take a lock
> which one of the waiting-for-watermark direct reclaimers holds.
>

What lock could they be holding? Even if that is the case, the direct
reclaimers do not wait indefinitely.

> Generally, kswapd is an optional, best-effort latency optimisation
> thing and we haven't designed for it to be a critical service.
> Probably stuff would break were we to do so.
>

No disagreements there.

> This is one of the reasons why we avoided creating such dependencies in
> reclaim. Instead, what we do when a reclaimer is encountering lots of
> dirty or in-flight pages is
>
>	msleep(100);
>
> then try again. We're waiting for the disks, not kswapd.
>
> Only the hard-wired 100 is a bit silly, so we made the "100" variable,
> inversely dependent upon the number of disks and their speed. If you
> have more and faster disks then you sleep for less time.
>
> And that's what congestion_wait() does, in a very simplistic fashion.
> It's a facility which direct-reclaimers use to ratelimit themselves in
> inverse proportion to the speed with which the system can retire writes.
>

The problem being hit is a direct reclaimer going to sleep waiting on
congestion when in reality there were not lots of dirty or in-flight
pages. It goes to sleep for the wrong reasons and doesn't get woken up
again until the timeout expires.

Bear in mind that even if congestion clears, it just means that dirty
pages are now clean, although I admit that the next direct reclaim it does
is going to encounter clean pages and should succeed.

Let's see how the other patch that changes when cond_resched() gets called
gets on. If it also works out, then it's harder to justify this patch. If
it doesn't work out then it'll need to be kicked another few times.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
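The "wait on watermarks" behaviour being debated here - wake after a bounded
timeout and retry regardless of whether kswapd ever ran - can be pictured as
something like the sketch below. This is purely illustrative and is not the
patch from the thread; the helper name, the polling interval and the use of
the low watermark are all assumptions made for the example:

/*
 * Illustrative only: wait until a zone's low watermark is met or a
 * timeout expires, whichever comes first.  The caller retries its
 * allocation either way, so there is no hard dependency on kswapd
 * making progress.
 */
static long wait_on_zone_watermark(struct zone *zone, unsigned int order,
                                   int alloc_flags, long timeout)
{
        unsigned long expire = jiffies + timeout;

        while (time_before(jiffies, expire)) {
                if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
                                      zone_idx(zone), alloc_flags))
                        return expire - jiffies;
                /* Poll rather than rely on a wakeup from kswapd */
                schedule_timeout_uninterruptible(HZ / 50);
        }
        return 0;
}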
From: Mel Gorman on 16 Mar 2010 06:20

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:
>
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> >
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the
> > behavior for cases with write and avoid the "running always in full
> > timeout" issue without writes.
>
> They're pretty much equivalent and would work. But there are two
> things I still don't understand:
>

Unfortunately, this regression is very poorly understood. I haven't been
able to reproduce it locally, and while Christian has provided various
debugging information, it still isn't clear why the problem occurs now.

> 1: Why is direct reclaim calling congestion_wait() at all? If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed. What's preventing it from doing so?
>

Memory pressure, I think. The workload involves 16 processes (see
http://lkml.org/lkml/2009/12/7/237). I suspect they are all direct
reclaimers and some processes are getting their pages stolen before they
have a chance to allocate them. It is known that adding a small amount of
memory "fixes" this problem.

> 2: This is, I think, new behaviour. A regression. What caused it?
>

Short answer: I don't know.

Longer answer: initially, this was reported as being caused by commit
e084b2d: page-allocator: preserve PFN ordering when __GFP_COLD is set, but
it was never established why, and reverting it was unpalatable because it
fixed another performance problem. According to Christian, the controller
does nothing with the merging of IO requests, and he was very sure about
this. As all the patch does is change the order in which pages are
returned, and the timing slightly due to differences in cache hotness, it
was surprising that such a small change could make a big difference to
reclaim later.

There were other bugs that might have complicated this, such as errors in
free page counters, but they were fixed up and the problem still did not
go away.

It was after much debugging that it was found that direct reclaim was
returning, the subsequent allocation attempt was failing and
congestion_wait() was being called - but without dirty pages, congestion
or writes, it waits for the full timeout. congestion_wait() was also being
called a lot more frequently, so something was causing reclaim to fail
more often (http://lkml.org/lkml/2009/12/18/150). Again, I couldn't figure
out why e084b2d would make a difference.

Later, it got even worse because patches e084b2d and 5f8dcc21 had to be
reverted in 2.6.33 to "resolve" the problem. 5f8dcc21 was more plausible
as it affected how many pages were on the per-cpu lists, but making it
behave like 2.6.32 did not help the situation. Again, it looked like a
very small timing problem, but exactly why reclaim would fail could not be
isolated. Again, other bugs were found and fixed but made no difference.

What led to this patch was recognising that we could enter
congestion_wait() and wait the entire timeout because no writes were in
progress and there were no dirty pages to be cleaned.
As what was really of interest in this path was watermarks, the patch was
intended to make the page allocator care about watermarks instead of
congestion. We know it was treating symptoms rather than understanding the
underlying problem, but I was somewhat at a loss to explain why small
changes in timing made such a large difference. Any new insight is
welcome.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
From: Mel Gorman on 18 Mar 2010 13:50

On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com> wrote:
>
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> >
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the
> > behavior for cases with write and avoid the "running always in full
> > timeout" issue without writes.
>
> They're pretty much equivalent and would work. But there are two
> things I still don't understand:
>
> 1: Why is direct reclaim calling congestion_wait() at all? If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed. What's preventing it from doing so?
>
> 2: This is, I think, new behaviour. A regression. What caused it?
>

I looked at this a bit closer using an iozone test very similar to
Christian's. Despite buying a number of disks, I still can't reproduce his
problem, but I instrumented congestion_wait counts and times similar to
what he did.

2.6.29-instrument:      congestion_waittime    990
2.6.30-instrument:      congestion_waittime   2823
2.6.31-instrument:      congestion_waittime 193169
2.6.32-instrument:      congestion_waittime 228890
2.6.33-instrument:      congestion_waittime 785529
2.6.34-rc1-instrument:  congestion_waittime 797178

So in the problem window, there were *definite* increases in the time
spent in congestion_wait and the number of times it was called. I'll look
closer at this tomorrow and Monday and see if I can pin down what is
happening.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
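Neither Christian's nor Mel's exact instrumentation patch appears in this
thread. For illustration, numbers of this shape could be collected with a
simple counter patch to congestion_wait() along the following lines; the
counter names are made up for the example, and the totals would still need
to be exported through /proc or debugfs to be read out per kernel:

/*
 * Illustrative instrumentation only -- not the patch used for the figures
 * above.  Counts calls to congestion_wait() and accumulates the jiffies
 * spent sleeping in it.
 */
static atomic_long_t congestion_wait_calls = ATOMIC_LONG_INIT(0);
static atomic_long_t congestion_wait_jiffies = ATOMIC_LONG_INIT(0);

long congestion_wait(int sync, long timeout)
{
        long ret;
        unsigned long start = jiffies;
        DEFINE_WAIT(wait);
        wait_queue_head_t *wqh = &congestion_wqh[sync];

        atomic_long_inc(&congestion_wait_calls);

        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
        ret = io_schedule_timeout(timeout);
        finish_wait(wqh, &wait);

        /* Accumulate wall time spent waiting, in jiffies */
        atomic_long_add(jiffies - start, &congestion_wait_jiffies);
        return ret;
}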