From: Mel Gorman on
On Sat, May 01, 2010 at 08:56:18PM -0500, Christoph Lameter wrote:
> On Thu, 29 Apr 2010, Mel Gorman wrote:
>
> > There is a race between shift_arg_pages and migration that triggers this bug.
> > A temporary stack is setup during exec and later moved. If migration moves
> > a page in the temporary stack and the VMA is then removed before migration
> > completes, the migration PTE may not be found leading to a BUG when the
> > stack is faulted.
>
> A simpler solution would be to not allow migration of the temporary stack?
>

The patch's intention is to not migrate pages within the temporary
stack. What are you suggesting that is different?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on
On Thu, Apr 29, 2010 at 06:21:20PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> did you see my proposed fix?

I did when I got back - sorry for the delay. The patchset I sent out was what
I had fully tested and was confident worked. I picked up the version of the
patch that was sent to Linus by Rik for merging.

> I'm running with it applied, I'd be
> interested if you can test it.

Unfortunately, the same bug triggers after about 18 minutes. The objective of
your fix is very simple - have a VMA covering the new range so that rmap can
find it. However, no lock is held during move_page_tables() because of the
need to call the page allocator. Due to the lack of locking, is it possible
that something like the following is happening?

Exec Process Migration Process
begin move_page_tables
begin rmap walk
take anon_vma locks
find new location of pte (do nothing)
copy migration pte to new location
#### Bad PTE now in place
find old location of pte
remove old migration pte
release anon_vma locks
remove temporary VMA
some time later, bug on migration pte

Even with the care taken, a migration PTE got copied and then left behind. What
I haven't confirmed at this point is if the ordering of the walk in "migration
process" is correct in the above scenario. The order is important for
the race as described to happen.

If the above is wrong, there is still a race somewhere else.

> Surely it will also work for new
> anon-vma code in upstream, because at that point there's just 1
> anon-vma and nothing else attached to the vma.
>
> http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=6efa1dfa5152ef8d7f26beb188d6877525a9dd03
>
> I think it's wrong to try to handle the race in rmap walk by making
> magic checks on vm_flags VM_GROWSDOWN|GROWSUP and
> vma->vm_mm->map_count == 1,

How bad is that magic check really? Is there a scenario when it's
the wrong thing to do?

I agree that migration skipping specific pages of the temporary stack is
unfortunate and having exec-aware informtion in migration is an odd dependency
at best. On the other hand, it's not as bad as skipping other regions as exec
will finish and allow the pages to be moved again. The impact to compaction
or transparent support would appear to be minimal.

> when we can fix it fully and simply in
> exec.c by indexing two vmas in the same anon-vma with a different
> vm_start so the pages will be found at all times by the rmap_walk.
>

If it can be simply fixed in exec, then I'll agree. Your patch looked simple
but unfortunately it doesn't fix the problem and it does introduce another
call to kmalloc() in the exec path. It's probably something that would only
be noticed by microbenchmarks though so I'm less concerned about that aspect.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on
On Tue, May 04, 2010 at 02:56:06PM +0200, Andrea Arcangeli wrote:
> On Tue, May 04, 2010 at 11:32:13AM +0100, Mel Gorman wrote:
> > Unfortunately, the same bug triggers after about 18 minutes. The objective of
> > your fix is very simple - have a VMA covering the new range so that rmap can
> > find it. However, no lock is held during move_page_tables() because of the
> > need to call the page allocator. Due to the lack of locking, is it possible
> > that something like the following is happening?
> >
> > Exec Process Migration Process
> > begin move_page_tables
> > begin rmap walk
> > take anon_vma locks
> > find new location of pte (do nothing)
> > copy migration pte to new location
> > #### Bad PTE now in place
> > find old location of pte
> > remove old migration pte
> > release anon_vma locks
> > remove temporary VMA
> > some time later, bug on migration pte
> >
> > Even with the care taken, a migration PTE got copied and then left behind. What
> > I haven't confirmed at this point is if the ordering of the walk in "migration
> > process" is correct in the above scenario. The order is important for
> > the race as described to happen.
>
> Ok so this seems the ordering dependency on the anon_vma list that
> strikes again, I didn't realize the ordering would matter here, but it
> does as shown above, great catch! The destination vma of the
> move_page_tables has to be at the tail of the anon_vma list like the
> child vma have to be at the end to avoid the equivalent race in
> fork. This has to be a requirement for mremap too. We just want to
> enforce the same invariants that mremap already enforces, to avoid
> adding new special cases to the VM.
>

Agreed. To be honest, I found the problems ordering of the anon_vma a little
confusing but as long as it's consistent everywhere, it's manageable. If
this ever burns us in the future though, we might want DEBUG_VM option that
somehow verifies the ordering of the anon_vma list.

> == for new anon-vma code ==
> Subject: fix race between shift_arg_pages and rmap_walk
>
> From: Andrea Arcangeli <aarcange(a)redhat.com>
>
> migrate.c requires rmap to be able to find all ptes mapping a page at
> all times, otherwise the migration entry can be instantiated, but it
> can't be removed if the second rmap_walk fails to find the page.
>
> And split_huge_page() will have the same requirements as migrate.c
> already has.
>
> Signed-off-by: Andrea Arcangeli <aarcange(a)redhat.com>

Reviewed-by: Mel Gorman <mel(a)csn.ul.ie>

I'm currently testing this and have seen no problems after an hour which
is typically good. To be absolutly sure, it needs 24 hours but so far so
good. The changelog is a tad on the light side so maybe you'd like to take
this one instead and edit it to your liking?

==== CUT HERE ===

mm,migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack

Page migration requires rmap to be able to find all migration ptes
created by migration. If the second rmap_walk clearing migration PTEs
misses an entry, it is left dangling causing a BUG_ON to trigger during
fault. For example;

[ 511.201534] kernel BUG at include/linux/swapops.h:105!
[ 511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[ 511.201534] last sysfs file: /sys/block/sde/size
[ 511.201534] CPU 0
[ 511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[ 511.888526]
[ 511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[ 511.888526] RIP: 0010:[<ffffffff811094ff>] [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 512.173545] RSP: 0018:ffff880037b979d8 EFLAGS: 00010246
[ 512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[ 512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[ 512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[ 512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[ 512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[ 512.380001] FS: 00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[ 512.380001] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[ 512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
[ 513.068415] Stack:
[ 513.068415] ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
[ 513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
[ 513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
[ 513.228068] Call Trace:
[ 513.228068] [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[ 513.228068] [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
[ 513.228068] [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[ 513.720755] [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 513.789278] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff813099b5>] page_fault+0x25/0x30
[ 513.851506] [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[ 513.851506] [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
[ 513.851506] [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[ 513.851506] [<ffffffff8111329b>] search_binary_handler+0x173/0x313
[ 513.851506] [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
[ 513.851506] [<ffffffff81114896>] do_execve+0x219/0x30a
[ 513.851506] [<ffffffff8111887f>] ? getname+0x14d/0x1b3
[ 513.851506] [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[ 514.483501] [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[ 514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[ 514.704292] RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[ 514.808221] RSP <ffff880037b979d8>
[ 514.906179] ---[ end trace 4f88495edc224d6b ]---

This particular BUG_ON is caused by a race between shift_arg_pages and
migration. During exec, a temporary stack is created and later moved to its
final location. If migration selects a page within the temporary stack,
the page tables and migration PTE can be copied to the new location
before rmap_walk is able to find the copy. This leaves a dangling
migration PTE behind that later triggers the bug.

This patch fixes the problem by using two VMAs - one which covers the temporary
stack and the other which covers the new location. This guarantees that rmap
can always find the migration PTE even if it is copied while rmap_walk is
taking place.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on
On Mon, May 10, 2010 at 12:41:07PM -0500, Christoph Lameter wrote:
> On Tue, 4 May 2010, Mel Gorman wrote:
>
> > On Sat, May 01, 2010 at 08:56:18PM -0500, Christoph Lameter wrote:
> > > On Thu, 29 Apr 2010, Mel Gorman wrote:
> > >
> > > > There is a race between shift_arg_pages and migration that triggers this bug.
> > > > A temporary stack is setup during exec and later moved. If migration moves
> > > > a page in the temporary stack and the VMA is then removed before migration
> > > > completes, the migration PTE may not be found leading to a BUG when the
> > > > stack is faulted.
> > >
> > > A simpler solution would be to not allow migration of the temporary stack?
> > >
> >
> > The patch's intention is to not migrate pages within the temporary
> > stack. What are you suggesting that is different?
>
> A simple way to disallow migration of pages is to increment the refcount
> of a page.
>

I guess it could be done by walking the page-tables in advance of the move
and elevating the page count of any pages faulted and then finding those
pages afterwards. The fail path would be a bit of a pain though if the page
tables are partially moved though. It's unnecessarily complicated when the
temporary stack can be easily avoided.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Mel Gorman on
On Tue, May 11, 2010 at 08:59:12AM -0500, Christoph Lameter wrote:
> On Mon, 10 May 2010, Mel Gorman wrote:
>
> > > A simple way to disallow migration of pages is to increment the refcount
> > > of a page.
> > I guess it could be done by walking the page-tables in advance of the move
> > and elevating the page count of any pages faulted and then finding those
> > pages afterwards. The fail path would be a bit of a pain though if the page
> > tables are partially moved though. It's unnecessarily complicated when the
> > temporary stack can be easily avoided.
>
> Faulting during exec?

Copying in arguments and the like

> Dont we hold mmap_sem for write? A get_user_pages()
> or so on the range will increment the refcount.
>

Or just identify the temporary stack from the migration side instead of
adding to the cost of exec?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/