Prev: memcontrol: fix potential null deref
Next: [BUGFIX][PATCH] memcg: avoid use cmpxchg in swap cgroup maintainance (Was Re: 34-rc1-git3 build failure with CGROUP_MEM_RES_CTLR_SWAP=y
From: Minchan Kim on 16 Mar 2010 23:10 On Wed, Mar 17, 2010 at 11:12 AM, KAMEZAWA Hiroyuki > BTW, I doubt freeing anon_vma can happen even when we check mapcount. > > "unmap" is 2-stage operation. > 1. unmap_vmas() => modify ptes, free pages, etc. > 2. free_pgtables() => free pgtables, unlink vma and free it. > > Then, if migration is enough slow. > > Migration(): Exit(): > check mapcount > rcu_read_lock > pte_lock > replace pte with migration pte > pte_unlock > pte_lock > copy page etc... zap pte (clear pte) > pte_unlock > free_pgtables > ->free vma > ->free anon_vma > pte_lock > remap pte with new pfn(fail) > pte_unlock > > lock anon_vma->lock # modification after free. > check list is empty check list is empty? Do you mean anon_vma->head? If it is, is it possible that that list isn't empty since anon_vma is used by others due to SLAB_DESTROY_BY_RCU? but such case is handled by page_check_address, vma_address, I think. > unlock anon_vma->lock > free anon_vma > rcu_read_unlock > > > Hmm. IIUC, anon_vma is allocated as SLAB_DESTROY_BY_RCU. Then, while > rcu_read_lock() is taken, anon_vma is anon_vma even if freed. But it > may reused as anon_vma for someone else. > (IOW, it may be reused but never pushed back to general purpose memory > until RCU grace period.) > Then, touching anon_vma->lock never cause any corruption. > > Does use-after-free check for SLAB_DESTROY_BY_RCU correct behavior ? Could you elaborate your point? > Above case is not use-after-free. It's safe and expected sequence. > > Thanks, > -Kame > > > >> > --- >> > mm/migrate.c | 13 +++++++++++++ >> > 1 files changed, 13 insertions(+), 0 deletions(-) >> > >> > diff --git a/mm/migrate.c b/mm/migrate.c >> > index 98eaaf2..6eb1efe 100644 >> > --- a/mm/migrate.c >> > +++ b/mm/migrate.c >> > @@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private, >> > */ >> > if (PageAnon(page)) { >> > rcu_read_lock(); >> > + >> > + /* >> > + * If the page has no mappings any more, just bail. An >> > + * unmapped anon page is likely to be freed soon but worse, >> > + * it's possible its anon_vma disappeared between when >> > + * the page was isolated and when we reached here while >> > + * the RCU lock was not held >> > + */ >> > + if (!page_mapcount(page)) { >> > + rcu_read_unlock(); >> > + goto uncharge; >> > + } >> > + >> > rcu_locked = 1; >> > anon_vma = page_anon_vma(page); >> > atomic_inc(&anon_vma->migrate_refcount); >> > >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo(a)kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: <a href=mailto:"dont(a)kvack.org"> email(a)kvack.org </a> >> > > -- Kind regards, Minchan Kim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Minchan Kim on 17 Mar 2010 00:20 On Wed, Mar 17, 2010 at 12:15 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> wrote: > On Wed, 17 Mar 2010 12:00:15 +0900 > Minchan Kim <minchan.kim(a)gmail.com> wrote: > >> On Wed, Mar 17, 2010 at 11:12 AM, KAMEZAWA Hiroyuki >> > BTW, I doubt freeing anon_vma can happen even when we check mapcount. >> > >> > "unmap" is 2-stage operation. >> > 1. unmap_vmas() => modify ptes, free pages, etc. >> > 2. free_pgtables() => free pgtables, unlink vma and free it. >> > >> > Then, if migration is enough slow. >> > >> > Migration(): Exit(): >> > check mapcount >> > rcu_read_lock >> > pte_lock >> > replace pte with migration pte >> > pte_unlock >> > pte_lock >> > copy page etc... zap pte (clear pte) >> > pte_unlock >> > free_pgtables >> > ->free vma >> > ->free anon_vma >> > pte_lock >> > remap pte with new pfn(fail) >> > pte_unlock >> > >> > lock anon_vma->lock # modification after free. >> > check list is empty >> >> check list is empty? >> Do you mean anon_vma->head? >> > yes. > >> If it is, is it possible that that list isn't empty since anon_vma is >> used by others due to >> SLAB_DESTROY_BY_RCU? >> > There are 4 cases. > A) anon_vma->list is not empty because anon_vma is not freed. > B) anon_vma->list is empty because it's freed. > C) anon_vma->list is empty but it's reused. > D) anon_vma->list is not empty but it's reused. E) anon_vma is used for other object. That's because we don't hold rcu_read_lock. I think Mel met this E) situation. AFAIU, even slab page of SLAB_BY_RCU can be freed after grace period. Do I miss something? > >> but such case is handled by page_check_address, vma_address, I think. >> > yes. Then, this corrupt nothing, as I wrote. We just modify anon_vma->lock > and it's safe because of SLAB_DESTROY_BY_RCU. > > >> > unlock anon_vma->lock >> > free anon_vma >> > rcu_read_unlock >> > >> > >> > Hmm. IIUC, anon_vma is allocated as SLAB_DESTROY_BY_RCU. Then, while >> > rcu_read_lock() is taken, anon_vma is anon_vma even if freed. But it >> > may reused as anon_vma for someone else. >> > (IOW, it may be reused but never pushed back to general purpose memory >> > until RCU grace period.) >> > Then, touching anon_vma->lock never cause any corruption. >> > >> > Does use-after-free check for SLAB_DESTROY_BY_RCU correct behavior ? >> >> Could you elaborate your point? >> > > Ah, my point is "how use-after-free is detected ?" > > If use-after-free is detected by free_pages() (DEBUG_PGALLOC), it seems > strange because DESTROY_BY_RCU guarantee that never happens. > > So, I assume use-after-free is detected in SLAB layer. If so, > in above B), C), D) case, it seems there is use-after free in slab's point > of view but it works as expected, no corruption. > > Then, my question is > "Does use-after-free check for SLAB_DESTROY_BY_RCU work correctly ?" > I am not sure Mel found that by DEBUG_PGALLOC. But, E) case can be founded by DEBUG_PGALLOC. > and implies we need this patch ? > (But this will prevent unnecessary page copy etc. by easy check.) > > Thanks, > -Kame > > > > -- Kind regards, Minchan Kim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Minchan Kim on 25 Mar 2010 21:40
On Fri, Mar 26, 2010 at 9:58 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu(a)jp.fujitsu.com> wrote: > On Fri, 26 Mar 2010 00:29:01 +0900 > Minchan Kim <minchan.kim(a)gmail.com> wrote: > >> Hi, Kame. > <snip> > >> Which case do we have PageAnon && (page_mapcount == 0) && PageSwapCache ? >> With looking over code which add_to_swap_cache, I found somewhere. >> >> 1) shrink_page_list >> I think this case doesn't matter by isolate_lru_xxx. >> >> 2) shmem_swapin >> It seems to be !PageAnon >> >> 3) shmem_writepage >> It seems to be !PageAnon. >> >> 4) do_swap_page >> page_add_anon_rmap increases _mapcount before setting page->mapping to anon_vma. >> So It doesn't matter. > >> >> >> I think following codes in unmap_and_move seems to handle 3) case. >> >> --- >> * Corner case handling: >> * 1. When a new swap-cache page is read into, it is added to the LRU >> * and treated as swapcache but it has no rmap yet. >> ... >> if (!page->mapping) { >> if (!PageAnon(page) && page_has_private(page)) { >> .... >> } >> goto skip_unmap; >> } >> >> --- >> >> Do we really check PageSwapCache in there? >> Do I miss any case? >> > > When a page is fully unmapped, page->mapping is not cleared. > > from rmap.c > == > 734 void page_remove_rmap(struct page *page) > 735 { > .... > 758 /* > 759 * It would be tidy to reset the PageAnon mapping here, > 760 * but that might overwrite a racing page_add_anon_rmap > 761 * which increments mapcount after us but sets mapping > 762 * before us: so leave the reset to free_hot_cold_page, > 763 * and remember that it's only reliable while mapped. > 764 * Leaving it set also helps swapoff to reinstate ptes > 765 * faster for those pages still in swapcache. > 766 */ > 767 } > == > > What happens at memory reclaim is... > > the first vmscan > 1. isolate a page from LRU. > 2. add_to_swap_cache it. > 3. try_to_unmap it > 4. pageout it (PG_reclaim && PG_writeback) > 5. move page to the tail of LRU. > .....<after some time> > 6. I/O ends and PG_writeback is cleared. > > Here, in above cycle, the page is not freed. Still in LRU list. > next vmscan > 7. isolate a page from LRU. > 8. finds a unmapped clean SwapCache > 9. drop it. > > So, to _free_ unmapped SwapCache, sequence 7-9 should happen. > If enough memory is freed by the first itelation of vmscan before I/O end, > next vmscan doesn't happen. Then, we have "unmmaped clean Swapcache which has > anon_vma pointer on page->mapping" on LRU. Thanks for open my eye. Kame. :) -- Kind regards, Minchan Kim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |