From: Peter Zijlstra on 12 Apr 2010 10:50

On Sat, 2010-04-10 at 11:21 -0700, Linus Torvalds wrote:
>
> Ho humm.
>
> Maybe I'm crazy, but something started bothering me. And I started
> wondering: when is the 'page->mapping' of an anonymous page actually
> cleared?
>
> The thing is, the mapping of an anonymous page is actually cleared only
> when the page is _freed_, in "free_hot_cold_page()".
>
> Now, let's think about that. And in particular, let's think about how
> that relates to the freeing of the 'anon_vma' that the page->mapping
> points to.
>
> The way the anon_vma is freed is when the mapping is torn down, and we
> do roughly:
>
>	tlb = tlb_gather_mmu(mm,..)
>	..
>	unmap_vmas(&tlb, vma ..
>	..
>	free_pgtables()
>	..
>	tlb_finish_mmu(tlb, start, end);
>
> and we actually unmap all the pages in "unmap_vmas()", and then _after_
> unmapping all the pages we do the "unlink_anon_vmas(vma);" in
> "free_pgtables()". Fine so far - the anon_vma stays around until after
> the page has been happily unmapped.
>
> But "unmapped all the pages" is _not_ actually the same as "free'd all
> the pages". The actual _freeing_ of the page happens generally in
> tlb_finish_mmu(), because we can free the page only after we've flushed
> any TLB entries.
>
> So what we have in that tlb_gather structure is a list of _pending_
> pages to be freed, while we already actually free'd the anon_vmas
> earlier!
>
> Now, the thing is, tlb_gather_mmu() begins a preempt-safe region
> (because we use a per-cpu variable), but as far as I can tell it is
> _not_ an RCU-safe region.
>
> So I think we might actually get a real RCU freeing event while this
> all happens. So now the 'anon_vma' that 'page->mapping' points to has
> not just been released back to the SLUB caches, the page itself might
> have been released too.
>
> I dunno. Does the above sound at all sane? Or am I just raving?
>
> Something hacky like the above might fix it if I'm not just raving. I
> really might be missing something here.

Right, so unless you have CONFIG_TREE_PREEMPT_RCU=y, the
preempt-disable == RCU read lock assumption does hold.

But even with your patch it doesn't close all holes, because while
zap_pte_range() can remove the last mapcount of the page,
tlb_remove_page() et al. don't need to drop the last use count of the
page. Concurrent reclaim/gup/whatever could still have a count out on
the page, delaying the actual free beyond the tlb gather RCU section.

So the reason page->mapping isn't cleared in page_remove_rmap() isn't
detailed beyond a (possible) race with page_add_anon_rmap() (which I
guess would be reclaim trying to unmap the page and a fault
re-instating it).

This also complicates the whole page_lock_anon_vma() thing, so it would
be nice to be able to remove this race and clear page->mapping in
page_remove_rmap().
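To see the window described above laid out in code, here is a toy,
single-threaded C model of the teardown ordering. It is not kernel code:
the struct names and the "freed" flags are stand-ins invented for
illustration, and actual freeing is replaced by flags so nothing truly
dangles when it runs.

/* Toy model: the anon_vma is "freed" while the page is still queued in
 * the tlb gather and page->mapping still points at it. */
#include <stdbool.h>
#include <stdio.h>

struct toy_anon_vma { bool freed; };

struct toy_page {
	struct toy_anon_vma *mapping;	/* stays set after unmap */
	bool queued_in_gather;		/* on the tlb gather's pending list */
	bool freed;
};

int main(void)
{
	struct toy_anon_vma avma = { .freed = false };
	struct toy_page page = { .mapping = &avma, .queued_in_gather = false,
				 .freed = false };

	/* unmap_vmas(): pte cleared, page queued for a deferred free */
	page.queued_in_gather = true;

	/* free_pgtables() -> unlink_anon_vmas(): the anon_vma goes first */
	avma.freed = true;

	/* The window: the page is not yet freed, but its mapping dangles */
	if (page.queued_in_gather && !page.freed && page.mapping->freed)
		printf("window: page->mapping points at a freed anon_vma\n");

	/* tlb_finish_mmu(): TLB flushed, queued pages finally released */
	page.freed = true;
	return 0;
}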
From: Peter Zijlstra on 12 Apr 2010 12:10

On Mon, 2010-04-12 at 11:19 -0400, Rik van Riel wrote:
> On 04/12/2010 10:40 AM, Peter Zijlstra wrote:
>
> > So the reason page->mapping isn't cleared in page_remove_rmap() isn't
> > detailed beyond a (possible) race with page_add_anon_rmap() (which I
> > guess would be reclaim trying to unmap the page and a fault
> > re-instating it).
> >
> > This also complicates the whole page_lock_anon_vma() thing, so it
> > would be nice to be able to remove this race and clear page->mapping
> > in page_remove_rmap().
>
> For anonymous pages, I don't see where the race comes from.
>
> Both do_swap_page and the reclaim code hold the page lock
> across the entire operation, so they are already excluding
> each other.
>
> Hugh, do you remember what the race between page_remove_rmap
> and page_add_anon_rmap is/was all about?
>
> I don't see a race in the current code...

Something like the below would be nice if possible.

---
 mm/rmap.c |   44 +++++++++++++++++++++++++++++++-------------
 1 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index eaa7a09..241f75d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -286,7 +286,22 @@ void __init anon_vma_init(void)
 
 /*
  * Getting a lock on a stable anon_vma from a page off the LRU is
- * tricky: page_lock_anon_vma rely on RCU to guard against the races.
+ * tricky:
+ *
+ *	page_add_anon_rmap()
+ *		atomic_add_negative(page->_mapcount);
+ *		page->mapping = anon_vma;
+ *
+ *	page_remove_rmap()
+ *		atomic_add_negative(page->_mapcount);
+ *		page->mapping = NULL;
+ *
+ * So we have to first read page->mapping, and then verify _mapcount,
+ * and make sure we order them correctly.
+ *
+ * We take anon_vma->lock in between so that if we see the anon_vma
+ * with a mapcount we know it won't go away on us.
  */
 struct anon_vma *page_lock_anon_vma(struct page *page)
 {
@@ -294,14 +309,24 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
-	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		goto out;
-	if (!page_mapped(page))
-		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	spin_lock(&anon_vma->lock);
+
+	/*
+	 * Order the reading of page->mapping and page->_mapcount against
+	 * the mb() implied by the atomic_add_negative() in
+	 * page_remove_rmap().
+	 */
+	smp_rmb();
+	if (!page_mapped(page)) {
+		spin_unlock(&anon_vma->lock);
+		anon_vma = NULL;
+		goto out;
+	}
+
 	return anon_vma;
 out:
 	rcu_read_unlock();
@@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_update_file_mapped(page, -1);
 	}
-	/*
-	 * It would be tidy to reset the PageAnon mapping here,
-	 * but that might overwrite a racing page_add_anon_rmap
-	 * which increments mapcount after us but sets mapping
-	 * before us: so leave the reset to free_hot_cold_page,
-	 * and remember that it's only reliable while mapped.
-	 * Leaving it set also helps swapoff to reinstate ptes
-	 * faster for those pages still in swapcache.
-	 */
+
+	page->mapping = NULL;
 }
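For readers less used to this barrier pairing, here is a small userspace
C model of the ordering the patch leans on. It is not kernel code: the
atomics stand in for page->_mapcount and page->mapping, the seq_cst
fence stands in for the mb() implied by atomic_add_negative(), and the
acquire fence stands in for smp_rmb(). The assertion checks the outcome
the pairing forbids: observing the mapping already cleared while the
page still looks mapped.

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define ITERS 200000

static atomic_int mapcount;		/* 1 = mapped, 0 = last mapping gone */
static atomic_uintptr_t mapping;	/* non-zero = "points at an anon_vma" */
static atomic_int phase;		/* per-iteration handshake */

static void *unmap_side(void *arg)	/* models page_remove_rmap() */
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		while (atomic_load_explicit(&phase, memory_order_acquire) != 2 * i + 1)
			;	/* wait for this round to start */
		atomic_fetch_sub_explicit(&mapcount, 1, memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst);	/* "atomic_add_negative()" mb */
		atomic_store_explicit(&mapping, 0, memory_order_relaxed); /* mapping = NULL */
		atomic_store_explicit(&phase, 2 * i + 2, memory_order_release);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, unmap_side, NULL);
	for (int i = 0; i < ITERS; i++) {
		/* reset: page mapped and mapping set */
		atomic_store_explicit(&mapcount, 1, memory_order_relaxed);
		atomic_store_explicit(&mapping, 1, memory_order_relaxed);
		atomic_store_explicit(&phase, 2 * i + 1, memory_order_release);

		/* lookup side, as in the patched page_lock_anon_vma() */
		uintptr_t m = atomic_load_explicit(&mapping, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);	/* "smp_rmb()" */
		int mapped = atomic_load_explicit(&mapcount, memory_order_relaxed) > 0;

		/* Forbidden by the pairing: mapping seen cleared, yet the
		 * page still looks mapped. */
		assert(!(m == 0 && mapped));

		while (atomic_load_explicit(&phase, memory_order_acquire) != 2 * i + 2)
			;	/* let the unmap side finish this round */
	}
	pthread_join(t, NULL);
	printf("pairing held for %d iterations\n", ITERS);
	return 0;
}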
From: Peter Zijlstra on 12 Apr 2010 14:50

On Mon, 2010-04-12 at 09:46 -0700, Linus Torvalds wrote:
>
> On Mon, 12 Apr 2010, Rik van Riel wrote:
>
> > On 04/12/2010 12:01 PM, Peter Zijlstra wrote:
> >
> > > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> > >  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> > >  		mem_cgroup_update_file_mapped(page, -1);
> > >  	}
> > > -	/*
> > > -	 * It would be tidy to reset the PageAnon mapping here,
> > > -	 * but that might overwrite a racing page_add_anon_rmap
> > > -	 * which increments mapcount after us but sets mapping
> > > -	 * before us: so leave the reset to free_hot_cold_page,
> > > -	 * and remember that it's only reliable while mapped.
> > > -	 * Leaving it set also helps swapoff to reinstate ptes
> > > -	 * faster for those pages still in swapcache.
> > > -	 */
> > > +
> > > +	page->mapping = NULL;
> > >  }
> >
> > That would be a bug for file pages :)
> >
> > I could see how it could work for anonymous memory, though.
>
> I think it's scary for anonymous pages too. The _common_ case of
> page_remove_rmap() is from unmap/exit, which holds no locks on the page
> what-so-ever. So assuming the page could be reachable some other way
> (swap cache etc), I think the above is pretty scary.

Fully agreed.

> Also do note that the bug we've been chasing has _always_ had that test
> for "page_mapped(page)". See my other email about why the unmapped case
> isn't even interesting, because it's so easy to see how page->mapping
> can be stale for unmapped pages.
>
> It's the _mapped_ case that is interesting, not the unmapped one. So
> setting page->mapping to NULL when unmapping is perhaps a nice
> consistency issue ("never have stale pointers"), but it's missing the
> fact that it's not really the case we care about.

Yes, I don't think this is the problem that has been plaguing us for
over a week now. But while staring at that code it did get me worried
that the current code (page_lock_anon_vma):

 - is missing the smp_read_barrier_depends() after the ACCESS_ONCE

 - isn't properly ordered wrt page->mapping and page->_mapcount.

 - doesn't appear to guarantee much at all when returning an anon_vma,
   since it locks after checking page->_mapcount, so:

   * it can return !NULL for an unmapped page (your patch cures that)

   * it can return !NULL but for a different anon_vma (my earlier patch
     checking page_rmapping() after the spin_lock cures that, but
     doesn't cure the above):

	[ highly unlikely but not impossible race ]

	page_referenced(page_A)    try_to_unmap(page_A)  unrelated fault     fault page_A
	CPU0                       CPU1                  CPU2                CPU3

	rcu_read_lock()
	anon_vma = page->mapping;
	if (!anon_vma & ANON_BIT)
		goto out
	if (!page_mapped(page))
		goto out
	                           page_remove_rmap()
	                           ...
	                           anon_vma_free() ---\
	                                              v
	                                              anon_vma_alloc()
	                                                                  anon_vma_alloc()
	                                                                  page_add_anon_rmap()
	                                              ^
	spin_lock(anon_vma->lock) --------------------/

Now I don't think the above can happen due to how our slab allocators
work, they won't share a slab page between cpus like that, but once we
make the whole thing preemptible this race becomes a lot more likely.

So a page_lock_anon_vma() that looks a little like the below should (I
think) cure all our problems with it.
struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
again:
	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);

	/*
	 * The RCU read lock ensures we can safely dereference anon_vma
	 * since it ensures the backing slab won't go away. It will however
	 * not guarantee it's the right object.
	 *
	 * First take the anon_vma->lock, this will, per anon_vma_unlink()
	 * avoid this anon_vma from being freed if it is a valid object.
	 */
	spin_lock(&anon_vma->lock);

	/*
	 * Secondly, we have to re-read page->mapping, so ensure it
	 * has not changed, rely on spin_lock() being at least a
	 * compiler barrier to force the re-read.
	 */
	if (unlikely(page_rmapping(page) != anon_vma)) {
		spin_unlock(&anon_vma->lock);
		goto again;
	}

	/*
	 * Ensure we read page->mapping before page->_mapcount,
	 * orders against atomic_add_negative() in page_remove_rmap().
	 */
	smp_rmb();

	/*
	 * Finally check that the page is still mapped,
	 * if not, this can't possibly be the right anon_vma.
	 */
	if (!page_mapped(page))
		goto unlock;

	return anon_vma;

unlock:
	spin_unlock(&anon_vma->lock);
out:
	rcu_read_unlock();
	return NULL;
}

With this, I think we can actually drop the RCU read lock when returning,
since if this is indeed a valid anon_vma for this page, then the page is
still mapped, and hence the anon_vma was not deleted, and a possible
future delete will be held back by us holding the anon_vma->lock.

Now I could be totally wrong and have confused myself thoroughly, but
how does this look?
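For readers who want to play with the lock-then-revalidate idea outside
the kernel, here is a small userspace C sketch (pthreads plus C11
atomics; every name in it is invented for illustration and none of it is
mm/ code). It models only the "take the object's lock, re-check the
pointer you used to find it, retry if it changed" part; it deliberately
does not model slab reuse of freed objects, which is exactly what the
follow-up message below pokes at.

/*
 * Readers find an object through a published pointer, lock it, then
 * re-check that the pointer still refers to the locked object.  The
 * updater retires an object only while holding that object's lock, so
 * a reader that passes the re-check holds a pinned, current object.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 200000

struct obj {
	pthread_mutex_t lock;
	atomic_int live;	/* 0 once the object has been retired */
};

static struct obj objs[2] = {
	{ PTHREAD_MUTEX_INITIALIZER, 1 },
	{ PTHREAD_MUTEX_INITIALIZER, 0 },
};
static struct obj *_Atomic published = &objs[0];	/* like page->mapping */

static struct obj *lookup_locked(void)
{
	struct obj *o;
again:
	o = atomic_load(&published);	/* like reading page->mapping */
	pthread_mutex_lock(&o->lock);	/* like spin_lock(&anon_vma->lock) */
	if (atomic_load(&published) != o) {
		/* changed between the load and the lock: retry, like the
		 * page_rmapping(page) != anon_vma re-check */
		pthread_mutex_unlock(&o->lock);
		goto again;
	}
	return o;	/* still published; can't be retired while we hold its lock */
}

static void *reader(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		struct obj *o = lookup_locked();

		assert(atomic_load(&o->live));	/* the re-check pinned a live object */
		pthread_mutex_unlock(&o->lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, reader, NULL);
	for (int i = 0; i < ITERS; i++) {
		struct obj *old = atomic_load(&published);
		struct obj *new = (old == &objs[0]) ? &objs[1] : &objs[0];

		/* retire "old" only under its own lock, the way
		 * anon_vma_unlink() takes anon_vma->lock before the
		 * anon_vma can go away */
		pthread_mutex_lock(&old->lock);
		atomic_store(&new->live, 1);
		atomic_store(&published, new);
		atomic_store(&old->live, 0);
		pthread_mutex_unlock(&old->lock);
	}
	pthread_join(t, NULL);
	printf("reader always pinned the currently published object\n");
	return 0;
}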
From: Borislav Petkov on 12 Apr 2010 15:10

From: Rik van Riel <riel@redhat.com>
Date: Mon, Apr 12, 2010 at 02:40:22PM -0400

> On 04/12/2010 12:26 PM, Linus Torvalds wrote:
>
> > But there is a _much_ more subtle case that involved swapping.
> >
> > So guys, here's my fairly simple theory on what happens:
>
> That bug looks entirely possible. Given that Borislav
> has heavy swapping going on, it is quite possible that
> this is the bug he has been triggering.

Yeah, about that. I dunno whether you guys saw that, but the machine
has 8GB of RAM and shouldn't be swapping, AFAIK. The largest mem usage
I saw was 5GB used, most of which was pagecache. So I was kinda
doubtful when Linus came up with the swapping theory earlier.

I'll pay more attention to SwapCached in /proc/meminfo to see whether
we do any swapping. It could be that there is a small amount which is
swapped out for whatever reason... Maybe that's the bug...

But I'll give the patch a run in an hour or so anyway.

--
Regards/Gruss,
Boris.
From: Peter Zijlstra on 12 Apr 2010 15:40
On Mon, 2010-04-12 at 20:40 +0200, Peter Zijlstra wrote:

Hmm, if interleaved like so:

> struct anon_vma *page_lock_anon_vma(struct page *page)
> {
> 	struct anon_vma *anon_vma;
> 	unsigned long anon_mapping;

	page_remove_rmap()
	anon_vma_unlink()
	anon_vma_free()

So that the below will all observe the old page->mapping:

> 	rcu_read_lock();
> again:
> 	anon_mapping = (unsigned long)rcu_dereference(page->mapping);
> 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> 		goto out;
> 	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
>
> 	/*
> 	 * The RCU read lock ensures we can safely dereference anon_vma
> 	 * since it ensures the backing slab won't go away. It will however
> 	 * not guarantee it's the right object.
> 	 *
> 	 * First take the anon_vma->lock, this will, per anon_vma_unlink()
> 	 * avoid this anon_vma from being freed if it is a valid object.
> 	 */
> 	spin_lock(&anon_vma->lock);
>
> 	/*
> 	 * Secondly, we have to re-read page->mapping, so ensure it
> 	 * has not changed, rely on spin_lock() being at least a
> 	 * compiler barrier to force the re-read.
> 	 */
> 	if (unlikely(page_rmapping(page) != anon_vma)) {
> 		spin_unlock(&anon_vma->lock);
> 		goto again;
> 	}

	page_add_anon_rmap()

so that the page_mapped() test below would be positive:

> 	/*
> 	 * Ensure we read page->mapping before page->_mapcount,
> 	 * orders against atomic_add_negative() in page_remove_rmap().
> 	 */
> 	smp_rmb();
>
> 	/*
> 	 * Finally check that the page is still mapped,
> 	 * if not, this can't possibly be the right anon_vma.
> 	 */
> 	if (!page_mapped(page))
> 		goto unlock;

We could here return a non-valid and already freed anon_vma.

> 	return anon_vma;
>
> unlock:
> 	spin_unlock(&anon_vma->lock);
> out:
> 	rcu_read_unlock();
> 	return NULL;
> }