Prev: [Bug #15615] NULL pointer deref in task_is_waking
Next: [Bug #15603] lockdep warning at boot time when determining whether to resume
From: Rik van Riel on 12 Apr 2010 12:00 On 04/12/2010 11:44 AM, Linus Torvalds wrote: > On Sun, 11 Apr 2010, Rik van Riel wrote: >> >> Looking around the code some more, zap_pte_range() >> calls page_remove_rmap(), which leaves the >> page->mapping in place and has this comment: > > See my earlier email about this exact issue. It's well-known that there > are stale page->mapping pointers. The "page_mapped()" check _should_ have > meant that in that case we never follow them, though. Good point. I wonder if we have some SMP reordering issue then? >> I wonder if we can clear page->mapping here, if >> list_is_singular(anon_vma->head). That way we >> will not leave stale pointers behind. > > What does that help? What if list _isn't_ singular? Yeah, that was a bad idea. Looking at the same code for 11 days straight seems to have put some knots in my brain :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Rik van Riel on 12 Apr 2010 12:10 On 04/12/2010 12:01 PM, Peter Zijlstra wrote: > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page) > __dec_zone_page_state(page, NR_FILE_MAPPED); > mem_cgroup_update_file_mapped(page, -1); > } > - /* > - * It would be tidy to reset the PageAnon mapping here, > - * but that might overwrite a racing page_add_anon_rmap > - * which increments mapcount after us but sets mapping > - * before us: so leave the reset to free_hot_cold_page, > - * and remember that it's only reliable while mapped. > - * Leaving it set also helps swapoff to reinstate ptes > - * faster for those pages still in swapcache. > - */ > + > + page->mapping = NULL; > } That would be a bug for file pages :) I could see how it could work for anonymous memory, though. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 12 Apr 2010 12:10 On Mon, 12 Apr 2010, Borislav Petkov wrote: > > > > If the warnings do happen, they are not going to be printing out any > > hugely informative data apart from the fact that the bad case happened at > > all. But If they do trigger, I can try to improve on them - it's just not > > worth trying to make them any more interesting if they never trigger. > > Haa, I think you're gonna want to improve them :) > > WARN_ONCE(1, "page->mapping does not exist in vma chain"); > > triggered on the first resume showing a rather messy 4 WARN_ONCEs. Had I > more cores, there maybe would've been more of them :) Maybe need locking > if clean output is of interest (see below). Goodie. I can't trigger this on my machine (not that I tried very hard - but I did do some swapping loads etc by limiting my memory to just 1GB etc). So I'm pretty sure my verification code is "correct", and verifies things that should be right. And the fact that it triggers under the exact load that you use to then trigger the bug is a damn good thing. That means that we are finally on the right track, and we have somethign that correlates well with the actual bug. > So, anyway, if I can read this correctly, there is a page->mapping > anon_vma which is _not_ in the anon_vmas chain of the vma > (avc->same_vma). Yes, and that is supposed to be a no-no. The page is clearly associated with the vma in question (since we are unmapping it through that vma), but the vma list of 'anon_vma's doesn't actually have the one that 'page->mapping' points to. And that, in turn, means that we've lost sight of the 'page->mapping' anon_vma, and THAT in turn means that it could well have been free'd as being no longer referenced. And if it was free'd, it could be re-allocated as something else (after the RCU grace period), and that directly explains your oops. > By the way, I completely understand when you say that your head hurts > from looking at this :). Well, I have to say that I'm happy I've spent the time on it, because this way I got to learn all the new rules. It's just that I really wish I wouldn't have _had_ to. Anyway, I'll have to think way more about this to see if I can come up with a debugging patch that shows more details about what actually caused this to happen in the first place. But we definitely have a smoking gun. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 12 Apr 2010 12:40 On Mon, 12 Apr 2010, Linus Torvalds wrote: > > Yes, and that is supposed to be a no-no. The page is clearly associated > with the vma in question (since we are unmapping it through that vma), but > the vma list of 'anon_vma's doesn't actually have the one that > 'page->mapping' points to. > > And that, in turn, means that we've lost sight of the 'page->mapping' > anon_vma, and THAT in turn means that it could well have been free'd as > being no longer referenced. > > And if it was free'd, it could be re-allocated as something else (after > the RCU grace period), and that directly explains your oops. I have a new theory. And this new theory is completely different from all the other things we've been looking at. The new theory is really simple: 'page->mapping' has been re-set to the wrong mapping. Now, there is one case where we reset page->mapping _intentionally_, namely in the COW-breaking case of having the last user ("page_move_anon_rmap"). And that looks fine, and happens under normal loads all the time. We _want_ to do it there. But there is a _much_ more subtle case that involved swapping. So guys, here's my fairly simple theory on what happens: - page gets allocated/mapped by process A. Let's call the anon_vma we associate the page with 'A' to keep it easy to track. - Process A forks, creating process B. The anon_vma in B is 'B', and has a chain that looks like 'B' -> 'A'. Everything is fine. - Swapping happens. The page (with mapping pointing to 'A') gets swapped out (perhaps not to disk - it's enough to assume that it's just not mapped any more, and lives entirely in the swap-cache) - Process B pages it in, which goes like this: do_swap_page -> page = lookup_swap_cache(entry); ... set_pte_at(mm, address, page_table, pte); page_add_anon_rmap(page, vma, address); And think about what happens here! In particular, what happens is that this will now be the "first" mapping of that page, so page_add_anon_rmap() will do if (first) __page_set_anon_rmap(page, vma, address); and notice what anon_vma it will use? It will use the anon_vma for process B! So now page->mapping actually points to anon_vma 'B', not 'A' like it used to. What happens then? Trivial: process 'A' also pages it in (nothing happens, it's not the first mapping), and then process 'B' execve's or exits or unmaps, making anon_vma B go away. End result: process A has a page that points to anon_vma B, but anon_vma B does not exist any more. This can go on forever. Forget about RCU grace periods, forget about locking, forget anything like that. The bug is simply that page->mapping points to an anon_vma that was correct at one point, but was _not_ the one that was shared by all users of that possible mapping. The patch below is my largely mindless try at fixing this. It's untested. I'm not entirely sure that it actually works. But it makes some amount of conceptual sense. No? Linus --- mm/rmap.c | 15 +++++++++++++-- 1 files changed, 13 insertions(+), 2 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index ee97d38..4bad326 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -734,9 +734,20 @@ void page_move_anon_rmap(struct page *page, static void __page_set_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) { - struct anon_vma *anon_vma = vma->anon_vma; + struct anon_vma_chain *avc; + struct anon_vma *anon_vma; + + BUG_ON(!vma->anon_vma); + + /* + * We must use the _oldest_ possible anon_vma for the page mapping! + * + * So take the last AVC chain entry in the vma, which is the deepest + * ancestor, and use the anon_vma from that. + */ + avc = list_entry(vma->anon_vma_chain.prev, struct anon_vma_chain, same_vma); + anon_vma = avc->anon_vma; - BUG_ON(!anon_vma); anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; page->mapping = (struct address_space *) anon_vma; page->index = linear_page_index(vma, address); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 12 Apr 2010 13:00
On Mon, 12 Apr 2010, Rik van Riel wrote: > On 04/12/2010 12:01 PM, Peter Zijlstra wrote: > > > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page) > > __dec_zone_page_state(page, NR_FILE_MAPPED); > > mem_cgroup_update_file_mapped(page, -1); > > } > > - /* > > - * It would be tidy to reset the PageAnon mapping here, > > - * but that might overwrite a racing page_add_anon_rmap > > - * which increments mapcount after us but sets mapping > > - * before us: so leave the reset to free_hot_cold_page, > > - * and remember that it's only reliable while mapped. > > - * Leaving it set also helps swapoff to reinstate ptes > > - * faster for those pages still in swapcache. > > - */ > > + > > + page->mapping = NULL; > > } > > That would be a bug for file pages :) > > I could see how it could work for anonymous memory, though. I think it's scary for anonymous pages too. The _common_ case of page_remove_rmap() is from unmap/exit, which holds no locks on the page what-so-ever. So assuming the page could be reachable some other way (swap cache etc), I think the above is pretty scary. Also do note that the bug we've been chasing has _always_ had that test for "page_mapped(page)". See my other email about why the unmapped case isn't even interesting, because it's so easy to see how page->mapping can be stale for unmapped pages. It's the _mapped_ case that is interesting, not the unmapped one. So setting page->mapping to NULL when unmapping is perhaps a nice consistency issue ("never have stale pointers"), but it's missing the fact that it's not really the case we care about. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |