From: Mel Gorman on
On Thu, Mar 25, 2010 at 08:08:20PM +0900, KOSAKI Motohiro wrote:
> > > Hmm..Hmmm...........
> > >
> > > Today, I've reviewed this patch and [11/11] carefully twice, but it is hard to ack.
> > >
> > > This patch seems to assume page compaction is faster than direct
> > > reclaim, but it often isn't, because dropping useless page cache is a
> > > very lightweight operation,
> >
> > Two points with that;
> >
> > 1. It's very hard to know in advance how often direct reclaim of clean page
> > cache would be enough to satisfy the allocation.
>
> Yeah. This is the main reason why I'd suggest tightly integrating vmscan
> and compaction.
>
>
> > 2. Even if it was faster to discard page cache, it's not necessarily
> > faster when the cost of reading that page cache back-in is taken into
> > account
>
> _If_ this is a useful page, you are right. But please remember, in the
> typical case the system has lots of no-longer-used pages.
>

But we don't *know* that for sure. Lumpy reclaim for example can take an
unused clean page that happened to be surrounded by active hot pages and
throw out the whole lot.

I am not against integrating compaction with lumpy reclaim ultimately,
but I think we should have a good handle on the simple case first before
altering reclaim. In particular, I have concerns about how to efficiently
select pageblocks to migrate to when integrated with lumpy reclaim and how it
should be decided when to reclaim and when to compact with an integrated path.

I think it would be best if we had this basis of comparison and a workload
that turned out to be compaction-intensive to gain a full understanding of
the best integration between reclaim and compaction.

> > Lumpy reclaim tries to avoid dumping useful page cache but it is perfectly
> > possible for hot data to be discarded because it happened to be located
> > near cold data.
>
> Yeah, I fully agree.
>
> > It's impossible to know in general how much unnecessary IO
> > takes place as a result of lumpy reclaim because it depends heavily on the
> > system-state when lumpy reclaim starts.
>
> I think this explains why vmscan and compaction shouldn't be separated.
>
> Yes, only vmscan can know it.
>

vmscan doesn't know how much unnecessary IO it generated as a result of
its actions. We can make a guess at it indirectly from vmstat but
that's about it.

> > > but page compaction makes a lot of memcpy (i.e. cpu cache
> > > pollution). IOW this patch focuses on hugepage allocation very
> > > aggressively, but it doesn't seem to take enough care to reduce the
> > > damage to typical workloads.
> > >
> >
> > What typical workload is making aggressive use of high order
> > allocations? Typically when such a user is found, effort is spent on
> > finding alternatives to high-orders as opposed to worrying about the cost
> > of allocating them. There was a focus on huge page allocation because it
> > was the most useful test case that was likely to be encountered in practice.
> >
> > I can adjust the allocation levels to some other value but it's not typical
> > for a system to make very aggressive use of other orders. I could have it
> > use random orders but that also is not very typical.
>
> If this compaction is triggered only by order-9 allocations, I don't oppose
> it at all. Also PAGE_ALLOC_COSTLY_ORDER is probably acceptable. I agree
> huge page allocation has made lots of trouble. But for low orders, when the
> system has lots of no-longer-used pages, your logic is worse than the
> current one. I worry about it.
>

If you insist, I can limit direct compaction to orders > PAGE_ALLOC_COSTLY_ORDER.
The allocator is already meant to be able to handle these orders without special
assistance and it'd avoid compaction becoming a crutch for subsystems that
suddenly decide it's a great idea to use order-1 or order-2 heavily.
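
In concrete terms, that would just be an early exit on the direct compaction
path, something like this (illustrative only; the same check appears in the
posted try_to_compact_pages() later in the thread):

	/*
	 * Illustrative sketch: only enter direct compaction for orders the
	 * page allocator is not already expected to satisfy on its own.
	 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return COMPACT_SKIPPED;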

> My point is, We have to consider to disard useful cached pages and to
> discard no longer accessed pages. latter is nearly zero cost.

I am not opposed to moving in this sort of direction, particularly if we
disable compaction for the lower orders. I believe what you are suggesting
is that the allocator would take the following steps (a rough sketch in C
follows the list):

1. Try allocate from lists
2. If that fails, do something like zone_reclaim_mode and lumpy reclaim
only pages which are cheap to discard
3. If that fails, try compaction to move around the active pages
4. If that fails, lumpy reclaim
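
Something like the following, where every helper name is hypothetical and
simply stands in for one of the four steps above; this is a sketch of the
suggested ordering, not code from this series:

/* Illustrative only: all helpers below are hypothetical placeholders. */
static struct page *alloc_high_order(gfp_t gfp_mask, unsigned int order)
{
	struct page *page;

	/* 1. Try the free lists as normal */
	page = try_freelists(gfp_mask, order);
	if (page)
		return page;

	/* 2. Cheap, zone_reclaim-style lumpy reclaim of easily discarded pages */
	page = try_cheap_lumpy_reclaim(gfp_mask, order);
	if (page)
		return page;

	/* 3. Compaction: move active pages around rather than discard them */
	page = try_direct_compaction(gfp_mask, order);
	if (page)
		return page;

	/* 4. Full lumpy reclaim as the last resort */
	return try_full_lumpy_reclaim(gfp_mask, order);
}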

> Please
> don't consider page discard itself bad; it is the correct page life cycle.
> Protecting useless cached pages from being discarded can reduce IO
> throughput.
>

I don't consider it bad as such but I had generally considered compaction to
be better than discarding pages. I take your point though that if we compact
many old pages, it might be a net loss.

> >
> > > First, I would like to clarify the current reclaim corner cases and
> > > what vmscan should do in this mail.
> > >
> > > Now we have lumpy reclaim. It is an excellent solution for external
> > > fragmentation.
> >
> > In some situations, it can grind a system into thrashing for a time. What is far
> > more likely is to be dealing with a machine with no swap - something that
> > is common in clusters. In this case, lumpy is a lot less likely to succeed
> > unless the machine is very quiet. It's just not going to find the contiguous
> > page cache it needs to discard and anonymous pages get in the way.
> >
> > > but unfortunately it has lots of corner cases.
> > >
> > > Viewpoint 1. Unnecessary IO
> > >
> > > isolate_pages() for lumpy reclaim frequently grabs very young pages,
> > > which are often still dirty, so pageout() is called a lot.
> > >
> > > Unfortunately, page-sized granular IO is _very_ inefficient. It can cause
> > > lots of disk seeks and kill disk IO bandwidth.
> > >
> >
> > Page-based IO like this has also been reported as being a problem for some
> > filesystems. When this happens, lumpy reclaim potentially stalls for a long
> > time waiting for the dirty data to be flushed by a flusher thread. Compaction
> > does not suffer from the same problem.
> >
> > > Viewpoint 2. Unevictable pages
> > >
> > > isolate_pages() for lumpy reclaim can pick up unevictable pages, which are
> > > obviously undroppable. So if the zone has plenty of mlocked pages (not a
> > > rare case for server use), lumpy reclaim can become quite useless.
> > >
> >
> > Also true. Potentially, compaction can deal with unevictable pages but it's
> > not done in this series as it's significant enough as it is and useful in
> > its current form.
> >
> > > Viewpoint 3. GFP_ATOMIC allocation failure
> > >
> > > Obviously lumpy reclaim can't help GFP_ATOMIC issue.
> > >
> >
> > Also true, although right now it's not possible to compact for GFP_ATOMIC
> > either. I think it could be done in some cases but I didn't try for it.
> > High-order GFP_ATOMIC allocations are still something we simply try and
> > avoid rather than deal with within the page allocator.
> >
> > > Viewpoint 4. reclaim latency
> > >
> > > Reclaim latency directly affects page allocation latency. So if lumpy
> > > reclaim with a lot of pageout IO is slow (it often is), it affects page
> > > allocation latency and can degrade the end-user experience.
> > >
> >
> > Also true. When allocating huge pages on a normal desktop for example,
> > it can stall the machine for a number of seconds while reclaim kicks
> > in.
> >
> > With direct compaction, this does not happen to anywhere near the same
> > degree. There are still some stalls because as huge pages get allocated,
> > free memory drops until pages have to be reclaimed anyway. The effects
> > are a lot less pronounced and the operation finishes a lot faster.
> >
> > > I really hope that automatic page migration helps to solve the above
> > > issues, but sadly this patch seems not to.
> > >
> >
> > How do you figure? I think it goes a long way to mitigating the worst of
> > the problems you laid out above.
>
> Both lumpy reclaim and page compaction have some advantages and some
> disadvantages. However, we already have lumpy reclaim. I hope you remember
> we are attacking a very narrow corner case; we have to consider reducing
> the downside of compaction as the first priority.
> A big benefit that comes with a big downside seems no good.
>
> So, I'd suggest either:
> 1) don't change the caller sites, but invoke compaction only in very limited
>    situations, or

I'm ok with enabling compaction only for >= PAGE_ALLOC_COSTLY_ORDER.
This will likely limit it to just huge pages for the moment but even
that would be very useful to me on swapless systems.

> 2) invoke compaction only in situations where lumpy reclaim is a poor fit
>
> In my last mail, I proposed (2), but you seem to have a bad impression of
> it. So now I propose (1).

1 would be my preference to start with.

After merge, I'd look into "cheap" lumpy reclaim which is used as a
first option, then compaction, then full direct reclaim. Would that be
satisfactory?

> I mean we will _start_ by treating compaction as a hugepage allocation
> assistance feature, not a generic allocation change.
>

Agreed.

> btw, I hope patch 11/11 will be dropped or improved ;-)
>

I expect it to be improved over time. The compactfail counter is there to
identify when a bad situation occurs so that the workload can be better
understood. There are different heuristics that could be applied there to
avoid the wait but all of them have disadvantages.

> > > Honestly, I think this patch would have been very impressive and useful
> > > 2-3 years ago, because 1) we didn't have lumpy reclaim and 2) we didn't
> > > have sane reclaim bail-out. Back then the old vmscan was a very
> > > heavyweight and inefficient operation for high-order reclaim, so the
> > > downside of adding this page migration would have been relatively
> > > hidden. but...
> > >
> > > We have to make an effort to reduce reclaim latency, not add new latency
> > > sources.
> >
> > I recognise that reclaim latency has been reduced but there is a wall.
>
> If it is a wall, we have to fix this! :)

Well, the wall I had in mind was IO bandwidth :)

>
> > The cost of reading the data back in will always be there and on
> > swapless systems, it might simply be impossible for lumpy reclaim to do
> > what it needs.
>
> Well, I didn't and don't think compaction is useless. I haven't said
> compaction is useless. I've been talking about how to avoid the downside
> mess.
>
> > > Instead, I would recommend tightly integrating page compaction and lumpy
> > > reclaim. I mean 1) reusing lumpy reclaim's neighbor-pfn page picking
> > > logic
> >
> > There are a number of difficulties with this. I'm not saying it's impossible,
> > but the win is not very clear-cut and there are some disadvantages.
> >
> > One, there would have to be exceptions for kswapd in the path because it
> > really should continue reclaiming. The reclaim path is already very dense
> > and this would add significant complexity to that path.
> >
> > The second difficulty is that the migration and free block selection
> > algorithm becomes a lot harder, more expensive and identifying the exit
> > conditions presents a significant difficultly. Right now, the selection is
> > based on linear scans with straight-forward selection and the exit condition
> > is simply when the scanners meet. With the migration scanner based on LRU,
> > significant care would have to be taken to ensure that appropriate free blocks
> > were chosen to migrate to so that we didn't "migrate from" a block in one
> > pass and "migrate to" in another (the reason why I went with linear scans
> > in the first place). Identifying when the zone has been compacted and should
> > just stop is no longer as straight-forward either. You'd have to track what
> > blocks had been operated on in the past which is potentially a lot of state. To
> > maintain this state, an unknown number of structures would have to be allocated
> > which may re-enter the allocator presenting its own class of problems.
> >
> > Third, right now it's very easy to identify when compaction is not going
> > to work in advance - simply check the watermarks and make a calculation
> > based on fragmentation. With a combined reclaim/compaction step, these
> > type of checks would need to be made continually - potentially
> > increasing the latency of reclaim albeit very slightly.
> >
> > Lastly, with this series, there is very little difference between direct
> > compaction and proc-triggered compaction. They share the same code paths
> > and all that differs is the exit conditions. If it was integrated into
> > reclaim, it becomes a lot less straight-forward to share the code.
> >
> > > 2) do page
> > > migration instead of pageout when the page meets some condition (for
> > > example active, dirty, referenced or swapbacked).
> > >
> >
> > Right now, it is identified when pageout should happen instead of page
> > migration. It's known before compaction starts if it's likely to be
> > successful or not.
> >
>
> Patch 11/11 says it's known whether it's likely to be successful or not,
> but not exactly.

Indeed. For example, it might not have been possible to migrate the necessary
pages because they were pagetables, slab etc. It might also be simply memory
pressure. It might look like there should be enough pages to compact but
there are too many processes allocating at the same time.

> I think you and I don't have such different analyses of the current
> behavior. I am merely more pessimistic than you.
>

Of course I'm optimistic :)

>
>
> > > This patch seems shoot me! /me die. R.I.P. ;-)
> >
> > That seems a bit dramatic. Your alternative proposal has some significant
> > difficulties and is likely to be very complicated. Also, there is nothing
> > to say that this mechanism could not be integrated with lumpy reclaim over
> > time once it was shown that useless migration was going on or latencies were
> > increased for some workload.
> >
> > This patch seems like a far more rational starting point to me than adding
> > more complexity to reclaim at the outset.
> >
> > > btw please don't use 'hugeadm --set-recommended-min_free_kbytes' when testing.
> >
> > It's somewhat important for the type of stress tests I do for huge page
> > allocation. Without it, fragmentation avoidance has trouble and the
> > results become a lot less repeatable.
> >
> > > Evaluating the case of free memory starvation is very important for this
> > > patch series, I think. I slightly suspect this patch might invoke useless
> > > compaction in such a case.
> > >
> >
> > I can drop the min_free_kbytes change but the likely result will be that
> > allocation success rates will simply be lower. The calculations on
> > whether compaction should be used or not are based on watermarks which
> > adjust to the value of min_free_kbytes.
>
> Then, do we need a min_free_kbytes auto-adjustment trick?
>

I have considered this in the past. Specifically that it would be auto-adjusted
the first time a huge page was allocated. I never got around to it though.
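
The rough shape of the idea, as a sketch only (the helper name, the trigger
and the scaling heuristic are all assumptions on my part, not something from
this series):

extern int min_free_kbytes;

static bool min_free_kbytes_adjusted;

/*
 * Hypothetical sketch: raise min_free_kbytes the first time a huge page
 * is allocated so the watermarks that compaction relies on are sane,
 * then recompute the per-zone watermarks. The heuristic is made up.
 */
static void adjust_min_free_kbytes_for_hugepages(void)
{
	int recommended;

	if (min_free_kbytes_adjusted)
		return;
	min_free_kbytes_adjusted = true;

	/* e.g. keep a couple of huge pages worth of memory free per node */
	recommended = 2 * (HPAGE_SIZE / 1024) * num_online_nodes();
	if (min_free_kbytes < recommended) {
		min_free_kbytes = recommended;
		setup_per_zone_wmarks();
	}
}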

>
> > > At bottom line, the explict compaction via /proc can be merged soon, I think.
> > > but this auto compaction logic seems need more discussion.
> >
> > My concern would be that the compaction paths would then be used very
> > rarely in practice and we'd get no data on how direct compaction should
> > be done.
>
> Agreed, almost.
>
> Again, I think this patch is attacking a corner-case issue, so I hope it
> won't create new corner cases. I don't think your approach is completely
> broken.
>
> But please remember that compaction might cause very large LRU shuffling
> in the compaction-failure case. That means vmscan might discard exactly
> the wrong pages. I worry a lot about that.
>

Would disabling compaction for the lower orders alleviate your concerns?
I have also taken note to investigate how much LRU churn can be avoided.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: KAMEZAWA Hiroyuki on
On Thu, 25 Mar 2010 10:16:54 +0000
Mel Gorman <mel(a)csn.ul.ie> wrote:

> On Thu, Mar 25, 2010 at 06:50:21PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 25 Mar 2010 09:48:26 +0000
> > Mel Gorman <mel(a)csn.ul.ie> wrote:
> >
> > > > In that case, compact_finished() can't
> > > > find there is a free chunk and do more work. How about using a function like
> > > > free_pcppages_bulk(zone, pcp->batch, pcp);
> > > > to bypass pcp list and freeing pages at once ?
> > > >
> > >
> > > I think you mean to drain the PCP lists while compaction is happening
> > > but is it justified? It's potentially a lot of IPI calls just to check
> > > if compaction can finish a little earlier. If the pages on the PCP lists
> > > are making that much of a difference to high-order page availability, it
> > > implies that the zone is pretty full and it's likely that compaction was
> > > avoided and we direct reclaimed.
> > >
> > Ah, sorry for my terseness again. I mean draining the "local" pcp list
> > because the thread which ran direct compaction freed the pages. IPIs are
> > not necessary and would be overkill.
> >
>
> Ah, I see now. There are two places that pages get freed. release_freepages()
> at the end of compaction when it's too late for compact_finished() to be
> helped, and within migration itself. Migration frees with either
> free_page() or, more commonly, put_page(). As free_page() is called on
> failure to migrate (rare),
> there is little help in changing it and I'd rather not modify how
> put_page() works.
>
> I could add a variant of drain_local_pages() that drains just the local PCP of
> a given zone before compact_finished() is called. The cost would be a doubling
> of the number of times zone->lock is taken to do the drain. Is it
> justified? It seems overkill to me to take the zone->lock just in case
> compaction can finish a little earlier. It feels like it would be adding
> a guaranteed cost for a potential saving.
>
If you want to keep the code compact, I won't ask for more.

I worried about that just because memory hot-unplug used to suffer from the
pagevec and pcp lists before it used MIGRATE_ISOLATE and a proper
lru_add_drain().

Thanks,
-Kame



From: KOSAKI Motohiro on
> If you insist, I can limit direct compaction to orders > PAGE_ALLOC_COSTLY_ORDER.
> The allocator is already meant to be able to handle these orders without special
> assistance and it'd avoid compaction becoming a crutch for subsystems that
> suddenly decide it's a great idea to use order-1 or order-2 heavily.
>
> > My point is, we have to distinguish between discarding useful cached pages
> > and discarding no-longer-accessed pages. The latter is nearly zero cost.
>
> I am not opposed to moving in this sort of direction, particularly if we
> disable compaction for the lower orders. I believe what you are suggesting
> is that the allocator would take the following steps:
>
> 1. Try allocate from lists
> 2. If that fails, do something like zone_reclaim_mode and lumpy reclaim
> only pages which are cheap to discard
> 3. If that fails, try compaction to move around the active pages
> 4. If that fails, lumpy reclaim

This seems to make a lot of sense.
I think the TODOs are:

1) Almost no system uses zone_reclaim now. We need to consider whether to
   enable zone_reclaim by default or not.
2) The current zone_reclaim doesn't have a light reclaim mode; it starts
   reclaim at priority=5. We need to consider whether to add a new zone
   reclaim mode or not.


> > Please
> > don't consider page discard itself bad; it is the correct page life cycle.
> > Protecting useless cached pages from being discarded can reduce IO
> > throughput.
>
> I don't consider it bad as such but I had generally considered compaction to
> be better than discarding pages. I take your point though that if we compact
> many old pages, it might be a net loss.

thanks.


> > > How do you figure? I think it goes a long way to mitigating the worst of
> > > the problems you laid out above.
> >
> > Both lumpy reclaim and page compaction have some advantages and some
> > disadvantages. However, we already have lumpy reclaim. I hope you remember
> > we are attacking a very narrow corner case; we have to consider reducing
> > the downside of compaction as the first priority.
> > A big benefit that comes with a big downside seems no good.
> >
> > So, I'd suggest either:
> > 1) don't change the caller sites, but invoke compaction only in very limited
> >    situations, or
>
> I'm ok with enabling compaction only for >= PAGE_ALLOC_COSTLY_ORDER.
> This will likely limit it to just huge pages for the moment but even
> that would be very useful to me on swapless systems.

Agreed! thanks.

Sidenote: I don't think this is only a feature for swapless systems. For
example, btrfs doesn't have a pageout implementation, which means btrfs can't
use lumpy reclaim. Page compaction can help to solve this issue.


> > 2) invoke compaction only in situations where lumpy reclaim is a poor fit
> >
> > In my last mail, I proposed (2), but you seem to have a bad impression of
> > it. So now I propose (1).
>
> 1 would be my preference to start with.
>
> After merge, I'd look into "cheap" lumpy reclaim which is used as a
> first option, then compaction, then full direct reclaim. Would that be
> satisfactory?

Yeah! this is very nice for me!


> > I mean we will _start_ by treating compaction as a hugepage allocation
> > assistance feature, not a generic allocation change.
> >
>
> Agreed.
>
> > btw, I hope patch 11/11 will be dropped or improved ;-)
>
> I expect it to be improved over time. The compactfail counter is there to
> identify when a bad situation occurs so that the workload can be better
> understood. There are different heuristics that could be applied there to
> avoid the wait but all of them have disadvantages.

great!


> > > > Honestly, I think this patch would have been very impressive and useful
> > > > 2-3 years ago, because 1) we didn't have lumpy reclaim and 2) we didn't
> > > > have sane reclaim bail-out. Back then the old vmscan was a very
> > > > heavyweight and inefficient operation for high-order reclaim, so the
> > > > downside of adding this page migration would have been relatively
> > > > hidden. but...
> > > >
> > > > We have to make an effort to reduce reclaim latency, not add new latency
> > > > sources.
> > >
> > > I recognise that reclaim latency has been reduced but there is a wall.
> >
> > If it is a wall, we have to fix this! :)
>
> Well, the wall I had in mind was IO bandwidth :)

OK, I got your point.

> > > Right now, it is identified when pageout should happen instead of page
> > > migration. It's known before compaction starts if it's likely to be
> > > successful or not.
> > >
> >
> > Patch 11/11 says it's known whether it's likely to be successful or not,
> > but not exactly.
>
> Indeed. For example, it might not have been possible to migrate the necessary
> pages because they were pagetables, slab etc. It might also be simply memory
> pressure. It might look like there should be enough pages to compact but
> there are too many processes allocating at the same time.

agreed.


> > > I can drop the min_free_kbytes change but the likely result will be that
> > > allocation success rates will simply be lower. The calculations on
> > > whether compaction should be used or not are based on watermarks which
> > > adjust to the value of min_free_kbytes.
> >
> > Then, do we need a min_free_kbytes auto-adjustment trick?
>
> I have considered this in the past. Specifically that it would be auto-adjusted
> the first time a huge page was allocated. I never got around to it though.

Hmhm, ok.
We can discuss it as a separate patch in a separate thread.


> > But please remember that compaction might cause very large LRU shuffling
> > in the compaction-failure case. That means vmscan might discard exactly
> > the wrong pages. I worry a lot about that.
> >
>
> Would disabling compaction for the lower orders alleviate your concerns?
> I have also taken note to investigate how much LRU churn can be avoided.

that's really great.

I'm looking forward to your v6 post :)


From: Mel Gorman on
On Fri, Mar 26, 2010 at 10:03:08AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 25 Mar 2010 10:16:54 +0000
> Mel Gorman <mel(a)csn.ul.ie> wrote:
>
> > On Thu, Mar 25, 2010 at 06:50:21PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 25 Mar 2010 09:48:26 +0000
> > > Mel Gorman <mel(a)csn.ul.ie> wrote:
> > >
> > > > > In that case, compact_finished() can't
> > > > > find there is a free chunk and do more work. How about using a function like
> > > > > free_pcppages_bulk(zone, pcp->batch, pcp);
> > > > > to bypass pcp list and freeing pages at once ?
> > > > >
> > > >
> > > > I think you mean to drain the PCP lists while compaction is happening
> > > > but is it justified? It's potentially a lot of IPI calls just to check
> > > > if compaction can finish a little earlier. If the pages on the PCP lists
> > > > are making that much of a difference to high-order page availability, it
> > > > implies that the zone is pretty full and it's likely that compaction was
> > > > avoided and we direct reclaimed.
> > > >
> > > Ah, sorry for my terseness again. I mean draining the "local" pcp list
> > > because the thread which ran direct compaction freed the pages. IPIs are
> > > not necessary and would be overkill.
> > >
> >
> > Ah, I see now. There are two places that pages get freed. release_freepages()
> > at the end of compaction when it's too late for compact_finished() to be
> > helped, and within migration itself. Migration frees with either
> > free_page() or, more commonly, put_page(). As free_page() is called on
> > failure to migrate (rare),
> > there is little help in changing it and I'd rather not modify how
> > put_page() works.
> >
> > I could add a variant of drain_local_pages() that drains just the local PCP of
> > a given zone before compact_finished() is called. The cost would be a doubling
> > of the number of times zone->lock is taken to do the drain. Is it
> > justified? It seems overkill to me to take the zone->lock just in case
> > compaction can finish a little earlier. It feels like it would be adding
> > a guaranteed cost for a potential saving.
> >
> If you want to keep the code compact, I won't ask for more.
>
> I worried about that just because memory hot-unplug used to suffer from the
> pagevec and pcp lists before it used MIGRATE_ISOLATE and a proper
> lru_add_drain().
>

What I can do to cover that situation without costing much is to call
drain_local_pages() after compaction completes.
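
Roughly along these lines (a sketch only, not the actual patch; the wrapper
name is made up, and it assumes the drain_local_pages(void *) signature of
this era, with the CPU pinned because that function uses smp_processor_id()):

/*
 * Illustrative sketch: spill the compacting task's own pcp pages back
 * to the buddy lists once compaction completes, without sending IPIs.
 */
static unsigned long compact_zone_order_drained(struct zone *zone,
						int order, gfp_t gfp_mask)
{
	unsigned long ret = compact_zone_order(zone, order, gfp_mask);

	get_cpu();
	drain_local_pages(NULL);	/* local CPU only, no IPIs */
	put_cpu();

	return ret;
}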

Thanks


--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Andrew Morton on
On Fri, 2 Apr 2010 17:02:45 +0100
Mel Gorman <mel(a)csn.ul.ie> wrote:

> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation. With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.

Does this work?

> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.

So someone else can get in and steal it. How is that resolved?

Please expound upon the relationship between the icky pageblock_order
and the caller's desired allocation order here. The compaction design
seems fairly fixated upon pageblock_order - what happens if the caller
wanted something larger than pageblock_order? The
less-than-pageblock_order case seems pretty obvious, although perhaps
wasteful?

>
> ...
>
> +static unsigned long compact_zone_order(struct zone *zone,
> +						int order, gfp_t gfp_mask)
> +{
> +	struct compact_control cc = {
> +		.nr_freepages = 0,
> +		.nr_migratepages = 0,
> +		.order = order,
> +		.migratetype = allocflags_to_migratetype(gfp_mask),
> +		.zone = zone,
> +	};

yeah, like that.

> +	INIT_LIST_HEAD(&cc.freepages);
> +	INIT_LIST_HEAD(&cc.migratepages);
> +
> +	return compact_zone(zone, &cc);
> +}
> +
> +/**
> + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> + * @zonelist: The zonelist used for the current allocation
> + * @order: The order of the current allocation
> + * @gfp_mask: The GFP mask of the current allocation
> + * @nodemask: The allowed nodes to allocate from
> + *
> + * This is the main entry point for direct page compaction.
> + */
> +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> +	int may_enter_fs = gfp_mask & __GFP_FS;
> +	int may_perform_io = gfp_mask & __GFP_IO;
> +	unsigned long watermark;
> +	struct zoneref *z;
> +	struct zone *zone;
> +	int rc = COMPACT_SKIPPED;
> +
> +	/*
> +	 * Check whether it is worth even starting compaction. The order check is
> +	 * made because an assumption is made that the page allocator can satisfy
> +	 * the "cheaper" orders without taking special steps
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER

Was that a correct decision? If we perform compaction when smaller
allocation attempts fail, will the kernel get better, or worse?

And how do we save my order-4-allocating wireless driver? That would
require that kswapd perform the compaction for me, perhaps?

> || !may_enter_fs || !may_perform_io)

Would be nice to add some comments explaining this a bit more.
Compaction doesn't actually perform IO, nor enter filesystems, does it?

> +		return rc;
> +
> +	count_vm_event(COMPACTSTALL);
> +
> +	/* Compact each zone in the list */
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> +								nodemask) {
> +		int fragindex;
> +		int status;
> +
> +		/*
> +		 * Watermarks for order-0 must be met for compaction. Note
> +		 * the 2UL. This is because during migration, copies of
> +		 * pages need to be allocated and for a short time, the
> +		 * footprint is higher
> +		 */
> +		watermark = low_wmark_pages(zone) + (2UL << order);
> +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> +			continue;

ooh, so that starts to explain split_free_page(). But
split_free_page() didn't do the 2UL thing.

Surely these things are racy? So we'll deadlock less often :(

> +		/*
> +		 * fragmentation index determines if allocation failures are
> +		 * due to low memory or external fragmentation
> +		 *
> +		 * index of -1 implies allocations might succeed depending
> +		 * on watermarks
> +		 * index towards 0 implies failure is due to lack of memory
> +		 * index towards 1000 implies failure is due to fragmentation
> +		 *
> +		 * Only compact if a failure would be due to fragmentation.
> +		 */
> +		fragindex = fragmentation_index(zone, order);
> +		if (fragindex >= 0 && fragindex <= 500)
> +			continue;
> +
> +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> +			rc = COMPACT_PARTIAL;
> +			break;
> +		}

Why are we doing all this handwavy stuff? Why not just try a
compaction run and see if it worked? That would be more
robust/reliable, surely?

> +		status = compact_zone_order(zone, order, gfp_mask);
> +		rc = max(status, rc);
> +
> +		if (zone_watermark_ok(zone, order, watermark, 0, 0))
> +			break;
> +	}
> +
> +	return rc;
> +}
> +
> +
> /* Compact all zones within a node */
> static int compact_node(int nid)
> {
>
> ...
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -561,7 +561,7 @@ static int unusable_show(struct seq_file *m, void *arg)
> * The value can be used to determine if page reclaim or compaction
> * should be used
> */
> -int fragmentation_index(unsigned int order, struct contig_page_info *info)
> +int __fragmentation_index(unsigned int order, struct contig_page_info *info)
> {
> 	unsigned long requested = 1UL << order;
>
> @@ -581,6 +581,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
> 	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
> }
>
> +/* Same as __fragmentation index but allocs contig_page_info on stack */
> +int fragmentation_index(struct zone *zone, unsigned int order)
> +{
> +	struct contig_page_info info;
> +
> +	fill_contig_page_info(zone, order, &info);
> +	return __fragmentation_index(order, &info);
> +}
>
> static void extfrag_show_print(struct seq_file *m,
> 					pg_data_t *pgdat, struct zone *zone)
> @@ -596,7 +604,7 @@ static void extfrag_show_print(struct seq_file *m,
> 				zone->name);
> 	for (order = 0; order < MAX_ORDER; ++order) {
> 		fill_contig_page_info(zone, order, &info);
> -		index = fragmentation_index(order, &info);
> +		index = __fragmentation_index(order, &info);
> 		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> 	}
>
> @@ -896,6 +904,9 @@ static const char * const vmstat_text[] = {
> 	"compact_blocks_moved",
> 	"compact_pages_moved",
> 	"compact_pagemigrate_failed",
> +	"compact_stall",
> +	"compact_fail",
> +	"compact_success",

CONFIG_COMPACTION=n?

>
> ...
>
