oom: avoid oom killer for lowmem allocations [Kernel]


From: KOSAKI Motohiro on 16 Feb 2010 00:40

> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > > If memory has been depleted in lowmem zones even with the protection
> > > afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
> > > killing current users will help. The memory is either reclaimable (or
> > > migratable) already, in which case we should not invoke the oom killer at
> > > all, or it is pinned by an application for I/O. Killing such an
> > > application may leave the hardware in an unspecified state and there is
> > > no guarantee that it will be able to make a timely exit.
> > >
> > > Lowmem allocations are now failed in oom conditions so that the task can
> > > perhaps recover or try again later. Killing current is an unnecessary
> > > result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
> > > lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
> > > unnecessary.
> > >
> > > Previously, the heuristic provided some protection for those tasks with
> > > CAP_SYS_RAWIO, but this is no longer necessary since we will not be
> > > killing tasks for the purposes of ISA allocations.
> > >
> > > high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
> > > default for all allocations that are not __GFP_DMA, __GFP_DMA32,
> > > __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
> > > flags. Testing for high_zoneidx being less than ZONE_NORMAL will only
> > > return true for allocations that have either __GFP_DMA or __GFP_DMA32.
> > >
> > > Acked-by: Rik van Riel <riel(a)redhat.com>
> > > Reviewed-by: KOSAKI Motohiro <kosaki.motohiro(a)jp.fujitsu.com>
> > > Signed-off-by: David Rientjes <rientjes(a)google.com>
> > > ---
> > > mm/page_alloc.c | 3 +++
> > > 1 files changed, 3 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1914,6 +1914,9 @@ rebalance:
> > > >  			 * running out of options and have to consider going OOM
> > > >  			 */
> > > >  			if (!did_some_progress) {
> > > > +				/* The oom killer won't necessarily free lowmem */
> > > > +				if (high_zoneidx < ZONE_NORMAL)
> > > > +					goto nopage;
> > > >  				if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> > > >  					if (oom_killer_disabled)
> > > >  						goto nopage;
> >
> > WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
> > plz.
> >
>
> As I already explained when you first brought this up, the possibility of
> not invoking the oom killer is not unique to GFP_DMA, it is also possible
> for GFP_NOFS. Since __GFP_NOFAIL is deprecated and there are no current
> users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.
> We're not adding any additional __GFP_NOFAIL allocations.

No current user? I don't think so.

int bio_integrity_prep(struct bio *bio)
{
(snip)
	buf = kmalloc(len, GFP_NOIO | __GFP_NOFAIL | q->bounce_gfp);

and

void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
{
(snip)
	if (dma) {
		init_emergency_isa_pool();
		q->bounce_gfp = GFP_NOIO | GFP_DMA;
		q->limits.bounce_pfn = b_pfn;
	}

I don't like rumor-based discussion; I prefer a fact-based one.

Thanks.
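
The two snippets above combine into a lowmem request that also carries __GFP_NOFAIL. A minimal userspace sketch of that flag arithmetic follows. The MODEL_* names, their bit values, and struct model_queue are assumptions made for illustration only; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed, illustrative bit values; not copied from the kernel headers. */
#define MODEL_GFP_DMA    0x01u
#define MODEL_GFP_WAIT   0x10u
#define MODEL_GFP_NOFAIL 0x800u
#define MODEL_GFP_NOIO   MODEL_GFP_WAIT

/* Hypothetical stand-in for the bounce_gfp field of struct request_queue. */
struct model_queue {
	unsigned int bounce_gfp;
};

/* Mirrors the "if (dma)" branch quoted from blk_queue_bounce_limit(). */
static void model_set_isa_bounce(struct model_queue *q)
{
	assert(q != NULL);
	q->bounce_gfp = MODEL_GFP_NOIO | MODEL_GFP_DMA;
}

/* Mirrors the mask passed to kmalloc() in the bio_integrity_prep() quote. */
static unsigned int model_integrity_gfp(const struct model_queue *q)
{
	assert(q != NULL);
	return MODEL_GFP_NOIO | MODEL_GFP_NOFAIL | q->bounce_gfp;
}

/* True when a mask asks for ZONE_DMA memory and also forbids failure. */
static bool model_is_lowmem_nofail(unsigned int gfp_mask)
{
	return (gfp_mask & MODEL_GFP_DMA) && (gfp_mask & MODEL_GFP_NOFAIL);
}
```

In this model, a queue that has gone through model_set_isa_bounce() produces a mask for which model_is_lowmem_nofail() returns true, which is the combination the patch description said had no users.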

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 16 Feb 2010 01:50

On Mon, Feb 15, 2010 at 04:10:15PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> >
> > WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
> > plz.
> >
>
> As I already explained when you first brought this up, the possibility of
> not invoking the oom killer is not unique to GFP_DMA, it is also possible
> for GFP_NOFS. Since __GFP_NOFAIL is deprecated and there are no current
> users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.
> We're not adding any additional __GFP_NOFAIL allocations.

Completely agree with this request. Actually, I think it would be even
better to just add && !(gfp_mask & __GFP_NOFAIL) to the check. Deprecated
doesn't mean it is OK to break the API (callers *will* oops or corrupt
memory if a __GFP_NOFAIL allocation returns NULL).
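
The suggested condition can be sketched in userspace as below. This is only a model under stated assumptions: the MODEL_* bit values are illustrative, model_gfp_zone() is a simplified stand-in for gfp_zone() that ignores zones above ZONE_NORMAL, and model_fail_lowmem() is a hypothetical helper name, not a function in mm/page_alloc.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed, illustrative bit values; not copied from the kernel headers. */
#define MODEL_GFP_DMA    0x01u
#define MODEL_GFP_DMA32  0x04u
#define MODEL_GFP_NOFAIL 0x800u

/* Zones ordered low to high; zones above NORMAL are omitted for brevity. */
enum model_zone { MODEL_ZONE_DMA, MODEL_ZONE_DMA32, MODEL_ZONE_NORMAL };

/* Simplified stand-in for gfp_zone(): highest zone the request may use. */
static enum model_zone model_gfp_zone(unsigned int gfp_mask)
{
	if (gfp_mask & MODEL_GFP_DMA)
		return MODEL_ZONE_DMA;
	if (gfp_mask & MODEL_GFP_DMA32)
		return MODEL_ZONE_DMA32;
	return MODEL_ZONE_NORMAL;
}

/* True when the allocation should fail rather than reach the oom killer. */
static bool model_fail_lowmem(unsigned int gfp_mask)
{
	enum model_zone high_zoneidx = model_gfp_zone(gfp_mask);

	assert(high_zoneidx <= MODEL_ZONE_NORMAL);
	/* Lowmem requests fail early, unless the caller cannot handle NULL. */
	return high_zoneidx < MODEL_ZONE_NORMAL &&
	       !(gfp_mask & MODEL_GFP_NOFAIL);
}
```

With the extra term, a DMA or DMA32 request still fails early, while the same request with the no-fail bit set falls through to the existing retry path instead of returning NULL.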


From: Nick Piggin on 16 Feb 2010 03:00

On Mon, Feb 15, 2010 at 11:41:49PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, Nick Piggin wrote:
>
> > > As I already explained when you first brought this up, the possibility of
> > > not invoking the oom killer is not unique to GFP_DMA, it is also possible
> > > for GFP_NOFS. Since __GFP_NOFAIL is deprecated and there are no current
> > > users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.
> > > We're not adding any additional __GFP_NOFAIL allocations.
> >
> > Completely agree with this request. Actually, I think even better you
> > should just add && !(gfp_mask & __GFP_NOFAIL). Deprecated doesn't mean
> > it is OK to break the API (callers *will* oops or corrupt memory if
> > __GFP_NOFAIL returns NULL).
> >
>
> ... unless it's used with GFP_ATOMIC, which we've always returned NULL
> for when even ALLOC_HARDER can't find pages, right?

Yes, it has never worked with GFP_ATOMIC.

> I'm wondering where this strong argument in favor of continuing to support
> __GFP_NOFAIL was when I insisted we call the oom killer for them even for
> allocations over PAGE_ALLOC_COSTLY_ORDER when __alloc_pages_nodemask() was
> refactored back in 2.6.31. The argument was that nobody is allocating
> that high of orders of __GFP_NOFAIL pages so we don't need to free memory
> for them and that's where the deprecation of the modifier happened in the
> first place. Ultimately, we did invoke the oom killer for those
> allocations because there's no chance of forward progress otherwise and,
> unlike __GFP_DMA, GFP_KERNEL | __GFP_NOFAIL actually is popular.

I don't know. IMO we should never just randomly weaken or break a flag
like this in the page allocator API.

>
> I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask &
> __GFP_NOFAIL) path since we're all content with endlessly looping.

Thanks. Yes, endlessly looping is far preferable to randomly oopsing
or corrupting memory.


From: KAMEZAWA Hiroyuki on 16 Feb 2010 19:00

On Tue, 16 Feb 2010 00:25:22 -0800 (PST)
David Rientjes <rientjes(a)google.com> wrote:

> On Tue, 16 Feb 2010, Nick Piggin wrote:
>
> > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask &
> > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> >
> > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > or corrupting memory.
> >
>
> Here's the new patch for your consideration.
>

Then, can we take a kdump in this endlessly looping situation?

panic_on_oom=always + kdump can do that.

Thanks,
-Kame


From: KAMEZAWA Hiroyuki on 16 Feb 2010 19:10

On Tue, 16 Feb 2010 16:03:23 -0800 (PST)
David Rientjes <rientjes(a)google.com> wrote:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > > > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask &
> > > > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> > > >
> > > > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > > > or corrupting memory.
> > > >
> > >
> > > Here's the new patch for your consideration.
> > >
> >
> > Then, can we take kdump in this endlessly looping situaton ?
> >
> > panic_on_oom=always + kdump can do that.
> >
>
> The endless loop is only helpful if something is going to free memory
> external to the current page allocation: either another task with
> __GFP_WAIT | __GFP_FS that invokes the oom killer, a task that frees
> memory, or a task that exits.
>
> The most notable endless loop in the page allocator is the one when a task
> has been oom killed, gets access to memory reserves, and then cannot find
> a page for a __GFP_NOFAIL allocation:
>
> 	do {
> 		page = get_page_from_freelist(gfp_mask, nodemask, order,
> 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
> 			preferred_zone, migratetype);
>
> 		if (!page && gfp_mask & __GFP_NOFAIL)
> 			congestion_wait(BLK_RW_ASYNC, HZ/50);
> 	} while (!page && (gfp_mask & __GFP_NOFAIL));
>
> We don't expect any such allocations to happen during the exit path, but
> we could probably find some in the fs layer.
>
> I don't want to check sysctl_panic_on_oom in the page allocator because it
> would start panicking the machine unnecessarily for the integrity
> metadata GFP_NOIO | __GFP_NOFAIL allocation, for any
> order > PAGE_ALLOC_COSTLY_ORDER, or for users who can't lock the zonelist
> for oom kill that wouldn't have panicked before.
>

Then why don't you check high_zoneidx in oom_kill.c?

Thanks,
-Kame
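
The retry loop quoted in this message only terminates if something outside the allocation frees memory. The userspace sketch below models that behavior under stated assumptions: struct model_pool, model_get_page(), and model_wait() are hypothetical stand-ins for the free lists, get_page_from_freelist(), and congestion_wait(), and the MODEL_GFP_NOFAIL bit value is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed, illustrative bit value; not copied from the kernel headers. */
#define MODEL_GFP_NOFAIL 0x800u

struct model_pool {
	int free_pages;       /* pages currently available */
	int waits_until_free; /* back-offs left before another task frees a page */
	int waits;            /* number of times the allocator backed off */
};

static int model_page; /* dummy object standing in for a struct page */

/* Stand-in for get_page_from_freelist(): NULL when nothing is free. */
static void *model_get_page(struct model_pool *pool)
{
	if (pool->free_pages <= 0)
		return NULL;
	pool->free_pages--;
	return &model_page;
}

/* Stand-in for congestion_wait(): time passes, another task may free memory. */
static void model_wait(struct model_pool *pool)
{
	pool->waits++;
	if (pool->waits_until_free > 0 && --pool->waits_until_free == 0)
		pool->free_pages++;
}

/*
 * Same shape as the quoted loop. With the no-fail bit set, this spins
 * forever if waits_until_free is 0 and no page is ever freed.
 */
static void *model_alloc(struct model_pool *pool, unsigned int gfp_mask)
{
	void *page;

	assert(pool != NULL);
	do {
		page = model_get_page(pool);
		if (!page && (gfp_mask & MODEL_GFP_NOFAIL))
			model_wait(pool);
	} while (!page && (gfp_mask & MODEL_GFP_NOFAIL));
	return page;
}
```

Without the no-fail bit the model returns NULL on the first miss; with it, the call returns only after the simulated external free, which is why the loop helps only when another task can release memory.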

