From: KAMEZAWA Hiroyuki on
On Wed, 28 Apr 2010 02:19:11 +0200
Andrea Arcangeli <aarcange(a)redhat.com> wrote:

> On Wed, Apr 28, 2010 at 02:18:21AM +0200, Andrea Arcangeli wrote:
> > On Wed, Apr 28, 2010 at 08:52:03AM +0900, KAMEZAWA Hiroyuki wrote:
> > > I already explained this doesn't happend and said "I'm sorry".
> >
> > Oops I must have overlooked it sorry! I just seen the trace quoted in
> > the comment of the patch and that at least would need correction
> > before it can be pushed in mainline, or it creates huge confusion to
> > see a reverse trace for CPU A for an already tricky piece of code.
> >
> > > But considering maintainance, it's not necessary to copy migration ptes
> > > and we don't have to keep a fundamental risks of migration circus.
> > >
> > > So, I don't say "we don't need this patch."
> >
> > split_huge_page also has the same requirement and there is no bug to
> > fix, so I don't see why to make special changes for just migrate.c
> > when we still have to list_add_tail for split_huge_page.
> >
> > Furthermore this patch isn't fixing anything in any case and it looks
> > a noop to me. If the order ever gets inverted, and process2 ptes are
> > scanned before process1 ptes in the rmap_walk, sure the
> > copy-page-tables will break and stop until the process1 rmap_walk will
> > complete, but that is not enough! You have to repeat the rmap_walk of
> > process1 if the order ever gets inverted and this isn't happening in
> ^^^^^^^2

why we have to remove migration_pte by rmap_walk() which doesnt' exist ?

Anyway, I agree there are no oops. But there are risks because migration is
a feature which people don't tend to take care of (as memcg ;)
I like conservative approach for this kind of features.

Thanks,
-Kame












--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Andrea Arcangeli on
On Wed, Apr 28, 2010 at 09:28:02AM +0900, KAMEZAWA Hiroyuki wrote:
> why we have to remove migration_pte by rmap_walk() which doesnt' exist ?

I already thought this for split_huge_page and because split_huge_page
that already waits inside copy_huge_pmd if there's any
pmd_trans_splitting bit set, isn't removing the requirement for
list_add_tail in anon_vma_[chain_]link this patch also isn't removing
it.

But it's more complex than my prev explanation now that I think about
it more so no wonder it's not clear why yet (my fault).

So the thing is, if you patch it this way, the second rmap_walk run by
remove_migration_ptes will be safe even if the ordering is inverted,
but then the first rmap_walk that establishes the migration entry, will
still break if the rmap walk scans the child before the parent.

The only objective of the patch is to remove the list_add_tail
requirement but I'll show that it's still required by the first
rmap_walk in try_to_unmap below:

CPU A CPU B
fork
try_to_unmap
try to set migration entry in child (null pte still, nothing done)
copy pte and map page in child pte (no
migration entry set on child)
set migration entry in parent

parent ends up with migration entry but child not, breaking migration
in another way. so even with the patch, the only way to be safe is
that rmap_walk always scans the parent _first_.

So this patch is a noop from every angle, and it just moves the
ordering requirement from the remove_migration_ptes to try_to_unmap,
but it doesn't remove it as far as I can tell. So it only slowdown
fork a bit for no good.

split_huge_page is about the same. If the ordering is inverted,
split_huge_page could set a splitting pmd in the parent only, like the
above try_to_unmap would set the migration pte in the parent only.

> Anyway, I agree there are no oops. But there are risks because migration is
> a feature which people don't tend to take care of (as memcg ;)
> I like conservative approach for this kind of features.

Don't worry, migrate will run 24/7 the moment THP is in ;).

What migration really changed (and it's beneficial so split_huge_page
isn't introducing new requirement) is that rmap has to be exact at all
times, if it was only swap using it, nothing would happen because of
the vma_adjust/move_page_tables/vma_adjust not being in an atomic
section against the rmap_walk (the anon_vma lock of the process stack
vma is enough to make it atomic against rmap_walk luckily).

On a side note: split_huge_page isn't using migration because:

1) migrate.c can't deal with pmd or huge pmd (or compound pages at all)
2) we need it to split in place for gup (used by futex, o_direct,
vmware workstation etc... and they definitely can't be slowed down
by having to split the hugepage before gup returns, the only one
that wouldn't be able to prevent gup splitting the hugepage without
causing issues to split_huge_page is kvm thanks to the mmu
notifier, but to do so it'd require to introduce
a gup_fast variant that wouldn't be increasing the page count, but
because of o_direct and friends we can't avoid to deal with staying
in place in split_huge_page), while migrate
by definition is about migrating something from here to there,
not doing anything in place
3) we need to convert page structures from compound to not compound in
place (beisdes the fact the physical memory is in place too),
inside of the two rmap walks which is definitely specialized
enough not to fit into the migrate framework

But memory compaction is the definitive user of migration code, as
it's only moving not-yet-huge pages from here to there, and THP
will run memory compaction every time we miss a hugepage in the buddy.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/