From: Mel Gorman on 28 Apr 2010 10:30 On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote: > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote: > > Doing some check in move_ptes() after vma_adjust() is not safe. > > IOW, when vma's information and information in page-table is incosistent...objrmap > > is broken and migartion will cause panic. > > > > Then...I think there are 2 ways. > > 1. use seqcounter in "mm_struct" as previous patch and lock it at mremap. > > or > > 2. get_user_pages_fast() when do remap. > > 3 take the anon_vma->lock > I've been looking at ways during the day that the anon_vma lock can be held while the page tables are being allocated. The schemes were way too hairy just to cover a migration corner case. As this is particular to exec, I'm wondering if Kamezawa's additional proposal of just skipping migration of pages within the temporary stack might be the best solution overall in terms of effectiveness and simplicity. His patch introduced a new variable to the VMA but it shouldn't be necessary and it altered vma_address which is unnecessary. Here is a different version of the same basic idea to skip temporary VMAs during migration. Maybe go with this? (As a heads-up, I'll also be going offline in about 24 hours until Tuesday morning. The area I'm in has zero internet access) ==== CUT HERE ==== mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Page migration requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it is possible to leave one behind if the second rmap_walk fails to find the page. If this page is later faulted, migration_entry_to_page() will call BUG because the page is locked indicating the page was migrated by the migration PTE not cleaned up. There is a race between shift_arg_pages and migration that allows this bug to trigger. A temporary stack is setup during exec and later moved. If migration moves a page in the temporary stack and the VMA is then removed, the migration PTE may not be found leading to a BUG when the stack is faulted. Ideally, shift_arg_pages must run atomically with respect of rmap_walk by holding the anon_vma lock but this is problematic as pages must be allocated for page tables. Instead, this patch identifies when it is about to migrate pages from a temporary stack and leaves them alone. Memory hot-remove will try again, sys_move_pages() wouldn't be operating during exec() time and memory compaction will just continue to another page without concern. [kamezawa.hiroyu(a)jp.fujitsu.com: Idea for having migration skip the stacks] Signed-off-by: Mel Gorman <mel(a)csn.ul.ie> --- mm/rmap.c | 31 ++++++++++++++++++++++++++++++- 1 files changed, 30 insertions(+), 1 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 85f203e..5aaf4df 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1141,6 +1141,21 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, return ret; } +static bool is_vma_temporary_stack(struct vm_area_struct *vma) +{ + if (vma->vm_flags != VM_STACK_FLAGS) + return false; + + /* + * Only during exec will the total VM consumed by a process + * be exacly the same as the stack + */ + if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1) + return true; + + return false; +} + /** * try_to_unmap_anon - unmap or unlock anonymous page using the object-based * rmap method @@ -1169,7 +1184,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags) list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { struct vm_area_struct *vma = avc->vma; - unsigned long address = vma_address(page, vma); + unsigned long address; + + /* + * During exec, a temporary VMA is setup and later moved. + * The VMA is moved under the anon_vma lock but not the + * page tables leading to a race where migration cannot + * find the migration ptes. Rather than increasing the + * locking requirements of exec(), migration skips + * temporary VMAs until after exec() completes. + */ + if (PAGE_MIGRATION && (flags & TTU_MIGRATION) && + is_vma_temporary_stack(vma)) + continue; + + address = vma_address(page, vma); if (address == -EFAULT) continue; ret = try_to_unmap_one(page, vma, address, flags); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on 28 Apr 2010 11:00 On Wed, Apr 28, 2010 at 03:23:56PM +0100, Mel Gorman wrote: > On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote: > > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote: > > > Doing some check in move_ptes() after vma_adjust() is not safe. > > > IOW, when vma's information and information in page-table is incosistent...objrmap > > > is broken and migartion will cause panic. > > > > > > Then...I think there are 2 ways. > > > 1. use seqcounter in "mm_struct" as previous patch and lock it at mremap. > > > or > > > 2. get_user_pages_fast() when do remap. > > > > 3 take the anon_vma->lock > > > > <SNIP> > > Here is a different version of the same basic idea to skip temporary VMAs > during migration. Maybe go with this? > > +static bool is_vma_temporary_stack(struct vm_area_struct *vma) > +{ > + if (vma->vm_flags != VM_STACK_FLAGS) > + return false; > + > + /* > + * Only during exec will the total VM consumed by a process > + * be exacly the same as the stack > + */ > + if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1) > + return true; > + > + return false; > +} > + The assumptions on the vm flags is of course totally wrong. VM_EXEC might be applied as well as default flags from the mm. The following is the same basic idea, skip VMAs belonging to processes in exec rather than trying to hold anon_vma->lock across move_page_tables(). Not tested yet. ==== CUT HERE ==== mm,migration: Avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Page migration requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it is possible to leave one behind if the second rmap_walk fails to find the page. If this page is later faulted, migration_entry_to_page() will call BUG because the page is locked indicating the page was migrated by the migration PTE not cleaned up. There is a race between shift_arg_pages and migration that allows this bug to trigger. A temporary stack is setup during exec and later moved. If migration moves a page in the temporary stack and the VMA is then removed, the migration PTE may not be found leading to a BUG when the stack is faulted. Ideally, shift_arg_pages must run atomically with respect of rmap_walk by holding the anon_vma lock but this is problematic as pages must be allocated for page tables. Instead, this patch skips processes in exec by making an assumption that a mm with stack_vm == 1 and total_vm == 1 is a process in exec() that hasn't finalised the temporary stack yet. Memory hot-remove will try again, sys_move_pages() wouldn't be operating during exec() time and memory compaction will just continue to another page without concern. [kamezawa.hiroyu(a)jp.fujitsu.com: Idea for having migration skip temporary stacks] Signed-off-by: Mel Gorman <mel(a)csn.ul.ie> --- mm/rmap.c | 28 +++++++++++++++++++++++++++- 1 files changed, 27 insertions(+), 1 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 85f203e..9e20188 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1141,6 +1141,18 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, return ret; } +static bool is_vma_in_exec(struct vm_area_struct *vma) +{ + /* + * Only during exec will the total VM consumed by a process + * be exacly the same as the stack and both equal to 1 + */ + if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1) + return true; + + return false; +} + /** * try_to_unmap_anon - unmap or unlock anonymous page using the object-based * rmap method @@ -1169,7 +1181,21 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags) list_for_each_entry(avc, &anon_vma->head, same_anon_vma) { struct vm_area_struct *vma = avc->vma; - unsigned long address = vma_address(page, vma); + unsigned long address; + + /* + * During exec, a temporary VMA is setup and later moved. + * The VMA is moved under the anon_vma lock but not the + * page tables leading to a race where migration cannot + * find the migration ptes. Rather than increasing the + * locking requirements of exec, migration skips + * VMAs in processes calling exec. + */ + if (PAGE_MIGRATION && (flags & TTU_MIGRATION) && + is_vma_in_exec(vma)) + continue; + + address = vma_address(page, vma); if (address == -EFAULT) continue; ret = try_to_unmap_one(page, vma, address, flags); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli on 28 Apr 2010 11:20 On Wed, Apr 28, 2010 at 03:57:38PM +0100, Mel Gorman wrote: > On Wed, Apr 28, 2010 at 03:23:56PM +0100, Mel Gorman wrote: > > On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote: > > > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote: > > > > Doing some check in move_ptes() after vma_adjust() is not safe. > > > > IOW, when vma's information and information in page-table is incosistent...objrmap > > > > is broken and migartion will cause panic. > > > > > > > > Then...I think there are 2 ways. > > > > 1. use seqcounter in "mm_struct" as previous patch and lock it at mremap. > > > > or > > > > 2. get_user_pages_fast() when do remap. > > > > > > 3 take the anon_vma->lock > > > > > > > <SNIP> > > > > Here is a different version of the same basic idea to skip temporary VMAs > > during migration. Maybe go with this? > > > > +static bool is_vma_temporary_stack(struct vm_area_struct *vma) > > +{ > > + if (vma->vm_flags != VM_STACK_FLAGS) > > + return false; > > + > > + /* > > + * Only during exec will the total VM consumed by a process > > + * be exacly the same as the stack > > + */ > > + if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1) > > + return true; > > + > > + return false; > > +} > > + > > The assumptions on the vm flags is of course totally wrong. VM_EXEC might > be applied as well as default flags from the mm. The following is the same > basic idea, skip VMAs belonging to processes in exec rather than trying > to hold anon_vma->lock across move_page_tables(). Not tested yet. This is better than the other, that made it look like people could set broken rmap at arbitrary times, at least this shows it's ok only during execve before anything run, but if we can't take the anon-vma lock really better than having to make these special checks inside every rmap_walk that has to be accurate, we should just delay the linkage of the stack vma into its anon-vma, until after the move_pages. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on 28 Apr 2010 11:30 On Wed, Apr 28, 2010 at 05:16:14PM +0200, Andrea Arcangeli wrote: > On Wed, Apr 28, 2010 at 03:57:38PM +0100, Mel Gorman wrote: > > On Wed, Apr 28, 2010 at 03:23:56PM +0100, Mel Gorman wrote: > > > On Wed, Apr 28, 2010 at 02:20:56AM +0200, Andrea Arcangeli wrote: > > > > On Wed, Apr 28, 2010 at 09:13:45AM +0900, KAMEZAWA Hiroyuki wrote: > > > > > Doing some check in move_ptes() after vma_adjust() is not safe. > > > > > IOW, when vma's information and information in page-table is incosistent...objrmap > > > > > is broken and migartion will cause panic. > > > > > > > > > > Then...I think there are 2 ways. > > > > > 1. use seqcounter in "mm_struct" as previous patch and lock it at mremap. > > > > > or > > > > > 2. get_user_pages_fast() when do remap. > > > > > > > > 3 take the anon_vma->lock > > > > > > > > > > <SNIP> > > > > > > Here is a different version of the same basic idea to skip temporary VMAs > > > during migration. Maybe go with this? > > > > > > +static bool is_vma_temporary_stack(struct vm_area_struct *vma) > > > +{ > > > + if (vma->vm_flags != VM_STACK_FLAGS) > > > + return false; > > > + > > > + /* > > > + * Only during exec will the total VM consumed by a process > > > + * be exacly the same as the stack > > > + */ > > > + if (vma->vm_mm->stack_vm == 1 && vma->vm_mm->total_vm == 1) > > > + return true; > > > + > > > + return false; > > > +} > > > + > > > > The assumptions on the vm flags is of course totally wrong. VM_EXEC might > > be applied as well as default flags from the mm. The following is the same > > basic idea, skip VMAs belonging to processes in exec rather than trying > > to hold anon_vma->lock across move_page_tables(). Not tested yet. > > This is better than the other, that made it look like people could set > broken rmap at arbitrary times, at least this shows it's ok only > during execve before anything run, but if we can't take the anon-vma > lock really better than having to make these special checks inside > every rmap_walk that has to be accurate, we should just delay the > linkage of the stack vma into its anon-vma, until after the move_pages. > Is it possible to delay the linkage like that? As arguments get copied into the temporary stack before it gets moved, I'd have expected the normal fault path to prepare and attach the anon_vma. We could special case it but that isn't very palatable either. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli on 28 Apr 2010 11:50
On Wed, Apr 28, 2010 at 04:23:54PM +0100, Mel Gorman wrote: > Is it possible to delay the linkage like that? As arguments get copied into > the temporary stack before it gets moved, I'd have expected the normal fault > path to prepare and attach the anon_vma. We could special case it but > that isn't very palatable either. I'm not sure what is more palatable, but I feel it should be fixed in execve. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |