Prev: 2.6.34 Northbridge Chipset Errors on HP Proliant 4 x Opteron in x86_64 mode
Next: x86: enlightenment for ticket spin locks - eliminate NOPs introduced by first patch
From: Chris Wilson on 1 Jul 2010 07:20 On Wed, 30 Jun 2010 18:24:04 -0700, Linus Torvalds <torvalds(a)linux-foundation.org> wrote: > On Wed, Jun 30, 2010 at 4:07 PM, Linus Torvalds > <torvalds(a)linux-foundation.org> wrote: > > > > That commit changes the page cache allocation to use > > > > + mapping_gfp_mask (mapping) | > > + __GFP_COLD | > > + gfpmask); > > > > if I read it right. And the default mapping_gfp_mask() is > > GFP_HIGHUSER_MOVABLE, so I think you get all of > > (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | __GFP_HIGHMEM) > > set by default. > > .. and then I left out the one flag I _meant_ to have there, namely > __GFP_MOVABLE. > > > The old code didn't just play games with ~__GFP_NORETRY and change > > that at runtime (which was buggy - no locking, no protection, no > > nothing), it also initialized the gfp mask. And that code also got > > removed: That code I added with the original shrinker patch, and the flags lifted from the shmem defaults, tweaked to what seemed sane with the addition of NORETRY and friends. I see that i915 is unique in using shmem as the page allocator, which perhaps explains why this failure is not observed with the ttm drivers. ttm uses two sets of gfp mask: HIGHUSER and USER | DMA32. So replacing the mapping_gfp_mask() with HIGHUSER would seem appropriate. And the interaction of MOVABLE could explain why hibernate broke with the introduction of GEM. * turns to his trusty copy of LDD to explain the various meanings of gfp_t... -- Chris Wilson, Intel Open Source Technology Centre -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: M. Vefa Bicakci on 1 Jul 2010 18:50 On 01/07/10 04:24 AM, Linus Torvalds wrote: > On Wed, Jun 30, 2010 at 4:07 PM, Linus Torvalds > <torvalds(a)linux-foundation.org> wrote: >> >> That commit changes the page cache allocation to use >> >> + mapping_gfp_mask (mapping) | >> + __GFP_COLD | >> + gfpmask); >> >> if I read it right. And the default mapping_gfp_mask() is >> GFP_HIGHUSER_MOVABLE, so I think you get all of >> (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | __GFP_HIGHMEM) >> set by default. > > .. and then I left out the one flag I _meant_ to have there, namely > __GFP_MOVABLE. > >> The old code didn't just play games with ~__GFP_NORETRY and change >> that at runtime (which was buggy - no locking, no protection, no >> nothing), it also initialized the gfp mask. And that code also got >> removed: > > In fact, I don't really see why we should use that mapping_gfp_mask() > at all, since all allocations should be going through that > i915_gem_object_get_pages() function anyway. So why not just change > that function to ignore the default gfp mask for the mapping, and just > use the mask that the o915 driver wants? > > Btw, why did it want to mark the pages reclaimable? > > Anyway, what I'm suggesting somebody who sees this test is just > something like the patch below (whitespace-damage - I'm cutting and > pasting, it's a trivial one-liner). Does this change any behavior? > Vefa? > > Linus Dear Linus, I made the code change you documented below to a vanilla 2.6.34 tree, compiled it and tested hibernate/thaw cycles. In total, I tested 16 cycles, with 8 consecutive cycles in one installation (Debian Sid) and 8 consecutive cycles in another one (Fedora 13). For every cycle, I tried to run "old" and "new" programs, in terms of whether they were run in previous cycles. I tried a few extra cycles with uswsusp as well. Based on my testing, I am happy to report that the change you suggest fixes the "memory corruption (segfaults) after thaw" issue for me. I can't thank you enough times for this. Now, the obligatory question: Could we have this fix applied to 2.6.32, 2.6.33 and 2.6.34 ? Thanks a lot again! M. Vefa Bicakci --- linux-2.6.34/drivers/gpu/drm/i915/i915_gem.c.orig 2010-07-01 17:47:30.000000000 +0000 +++ linux-2.6.34/drivers/gpu/drm/i915/i915_gem.c 2010-07-01 17:54:03.000000000 +0000 @@ -2312,7 +2312,7 @@ mapping = inode->i_mapping; for (i = 0; i < page_count; i++) { page = read_cache_page_gfp(mapping, i, - mapping_gfp_mask (mapping) | + __GFP_HIGHMEM | __GFP_COLD | gfpmask); if (IS_ERR(page)) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 1 Jul 2010 21:40 On Thu, Jul 1, 2010 at 5:49 PM, Dave Airlie <airlied(a)gmail.com> wrote: > > RECLAIMABLE added also seems fine, of course you can't have > RECLAIMABLE and MOVABLE (I find this out when it oopses on boot). Yes. They are both flags for the anti-fragmentation code, and I think I'll leave the decision as to whether the i915 driver should use __GFP_RECLAIMABLE to the people who work with and care about the fragmentation issues. I doubt it matters much in practice, at least not for the loads that the fragmentation people tend to care most about. > So I suspect MOVABLE is the problem. but I don't know enough about gfp > flags to know what RECLAIMABLE buys us, and where it �might bite us so > I can test some more. I think I'll just apply your previous tested patch - GFP_HIGHUSER should take care of all the flags that matter fundamentally, and then the reclaimable flag is really just a small detail for others to worry about. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: M. Vefa Bicakci on 17 Jul 2010 15:00 On 02/07/10 04:28 AM, Linus Torvalds wrote: > On Thu, Jul 1, 2010 at 5:49 PM, Dave Airlie <airlied(a)gmail.com> wrote: >> >> RECLAIMABLE added also seems fine, of course you can't have >> RECLAIMABLE and MOVABLE (I find this out when it oopses on boot). > > Yes. They are both flags for the anti-fragmentation code, and I think > I'll leave the decision as to whether the i915 driver should use > __GFP_RECLAIMABLE to the people who work with and care about the > fragmentation issues. I doubt it matters much in practice, at least > not for the loads that the fragmentation people tend to care most > about. > >> So I suspect MOVABLE is the problem. but I don't know enough about gfp >> flags to know what RECLAIMABLE buys us, and where it might bite us so >> I can test some more. > > I think I'll just apply your previous tested patch - GFP_HIGHUSER > should take care of all the flags that matter fundamentally, and then > the reclaimable flag is really just a small detail for others to worry > about. > Dear Linus, I have bad news regarding your fix for self-reclaim and i915. Apparently, I haven't tried enough hibernate/thaw cycles while initially testing your fix. After applying your fix to 2.6.34.1 and using it for two weeks, I noticed that every now and then I get a black screen or random kernel errors after thawing. I thought maybe this might be the same problem caused by d8e0902806c0bd2ccc4f6a267ff52565a3ec933b . (It turns out that my guess was right.) So I compiled two vanilla 2.6.34.1 kernels. One with d8e0902806c0bd2ccc4f6a267ff52565a3ec933b reverted to get back to pre 2.6.32.8 state, and another one with your fix applied. Then I set up an automated process where the computer would hibernate, and reboot at the end of the hibernation sequence (by setting /sys/power/disk to reboot) and then thaw back. I made this process loop at least 20 times. The kernel with d8e0902806c0bd2ccc4f6a267ff52565a3ec933b reverted was able to hibernate/thaw at least 40 times in one go, while the one with your fix applied was able to hibernate/thaw at most 17 times (in two separate trials) after which it crashed during the next thaw. Is there anything I can do find out the correct flags to use in addition to GFP_HIGHUSER ? Can I do something like a bisection for the flags one by one starting from the pre 2.6.32.8 state? If you could outline a procedure to do this, I would be glad to follow it. Sorry to bug you again about this problem because of incomplete testing on my part. Regards, M. Vefa Bicakci -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 17 Jul 2010 15:20
On Sat, Jul 17, 2010 at 11:58 AM, M. Vefa Bicakci <bicave(a)superonline.com> wrote: > > The kernel with d8e0902806c0bd2ccc4f6a267ff52565a3ec933b reverted > was able to hibernate/thaw at least 40 times in one go, while > the one with your fix applied was able to hibernate/thaw at most > 17 times (in two separate trials) after which it crashed during > the next thaw. Ok. I do wonder if the bug is possibly something entirely different, and the allocation patterns just happen to expose/hide it. Reverting the original commit should be pretty darn close to applying my fix. Any remaining issues would seem to be more about the actual bug in the original code (racing on changing that mapping->gfp_mask witthout any locking) than about anything else. > Is there anything I can do find out the correct flags to use > in addition to GFP_HIGHUSER ? Can I do something like a bisection > for the flags one by one starting from the pre 2.6.32.8 state? > If you could outline a procedure to do this, I would be glad to > follow it. You can try adding __GFP_RECLAIMABLE | __GFP_NOMEMALLOC to the set of flags in i915_gem_object_get_pages(). That's what the old code had (and then it played games with NORETRY|NOWARN). I've attached a patch (UNTESTED! Maybe it won't compile). Now, I don't see why those flags would matter, but NOMEMALLOC in particular does make a difference for memory allocation patterns under low memory conditions, so maybe it could make a difference. And if it _does_ make a difference, it would be interesting to know which of the two flags matter. So try both flags first, and see if that gets you something reliable. And if it does, remove one of them and try again - just to see _which_ flag it is that the i915 driver would care about. That would hopefully give us a hint. > Sorry to bug you again about this problem because of incomplete > testing on my part. Oh, never be sorry for testing even more, and testing something nobody else bothered to test. Linus |