From: Vedran Furač on 29 Oct 2009 07:20

David Rientjes wrote:
> Right, because in Vedran's latest oom log it shows that Xorg is preferred
> more than any other thread, other than the memory-hogging test program,
> with your patch than without. I pointed out a clear distinction in the
> killing order using both total_vm and rss in that log, and in my opinion
> killing Xorg as opposed to krunner would be undesirable.

But then you should rename the OOM killer to TRIPK: Totally Random Innocent Process Killer.

If you have an OOM situation and Xorg is the first victim, that means it's leaking memory badly and the system is probably already frozen/FUBAR. Killing krunner in that situation wouldn't do any good. From a user perspective, nothing changes: the system is still FUBAR and (s)he would probably reboot, cursing Linux in the process.
From: David Rientjes on 29 Oct 2009 15:50

On Thu, 29 Oct 2009, Vedran Furac wrote:

> [ 1493.064458] Out of memory: kill process 6304 (kdeinit4) score 1190231 or a child
> [ 1493.064467] Killed process 6409 (konqueror)
> [ 1493.261149] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1493.261166] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1493.276528] Out of memory: kill process 6304 (kdeinit4) score 1161265 or a child
> [ 1493.276538] Killed process 6411 (krusader)
> [ 1499.221160] akregator invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1499.221178] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.236431] Out of memory: kill process 6304 (kdeinit4) score 1067593 or a child
> [ 1499.236441] Killed process 6412 (irexec)
> [ 1499.370192] firefox-bin invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1499.370209] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.385417] Out of memory: kill process 6304 (kdeinit4) score 1066861 or a child
> [ 1499.385427] Killed process 6420 (xchm)
> [ 1499.458304] kio_file invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1499.458333] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.458367] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
> [ 1499.473573] Out of memory: kill process 6304 (kdeinit4) score 1043690 or a child
> [ 1499.473582] Killed process 6425 (kio_file)
> [ 1500.250746] korgac invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1500.250765] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.266186] Out of memory: kill process 6304 (kdeinit4) score 1020350 or a child
> [ 1500.266196] Killed process 6464 (icedove)
> [ 1500.349355] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1500.349371] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.364689] Out of memory: kill process 6304 (kdeinit4) score 1019864 or a child
> [ 1500.364699] Killed process 6477 (kio_http)
> [ 1500.452151] kded4 invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1500.452167] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.452196] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
> [ 1500.467307] Out of memory: kill process 6304 (kdeinit4) score 993142 or a child
> [ 1500.467316] Killed process 6478 (kio_http)
> [ 1500.780222] akregator invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1500.780239] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.796280] Out of memory: kill process 6304 (kdeinit4) score 966331 or a child
> [ 1500.796290] Killed process 6484 (kio_http)
> [ 1501.065374] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1501.065390] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1501.080579] Out of memory: kill process 6304 (kdeinit4) score 939434 or a child
> [ 1501.080587] Killed process 6486 (kio_http)
> [ 1501.381188] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1501.381204] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1501.396338] Out of memory: kill process 6304 (kdeinit4) score 912691 or a child
> [ 1501.396346] Killed process 6487 (firefox-bin)
> [ 1502.661294] icedove-bin invoked oom-killer: gfp_mask=0x201da, order=0, oomkilladj=0
> [ 1502.661311] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1502.676563] Out of memory: kill process 7580 (test) score 708945 or a child
> [ 1502.676575] Killed process 7580 (test)

OK, so this is the forkbomb problem caused by adding half of each child's total_vm into the badness score of the parent. We should address that specific part of the heuristic completely separately, not change what we consider to be the baseline.

The rationale is quite simple: we'll still experience the same problem with rss as we did with total_vm in the forkbomb scenario above on certain workloads (maybe not yours, but others). The oom killer always kills a child first if it has a different mm than the selected parent, so the amount of memory freed as a result is entirely dependent on the order of the child list. The killed child may free very little, yet be chosen because its siblings had large total_vm values. So instead of focusing on rss, we simply need a better heuristic for the forkbomb issue, for which I've already proposed a very trivial solution. Then, afterwards, we can debate how the scoring heuristic can be changed to select better tasks (and perhaps remove a lot of the clutter that's there currently!).

> > Can you explain why Xorg is preferred as a baseline to kill rather than
> > krunner in your example?
>
> Krunner is a small app for running other apps and doing similar things. It
> shouldn't use a lot of memory. OTOH, Xorg has to hold all the pixmaps and
> so on. That was the expected result: first Xorg, then firefox and
> thunderbird.

You're making all these claims and assertions based _solely_ on the theory that killing the application with the most resident RAM is always the optimal solution. That's just not true, especially if we're only allocating small amounts of order-0 memory. Much better is to allow the user to decide at what point, regardless of swap usage, their application is using much more memory than expected or required. They can do that right now pretty well with /proc/pid/oom_adj, without the outlandish claim that they should be expected to know the rss of their applications at the time of oom to effectively tune oom_adj.

What would you suggest? A script that sits in a loop checking each task's current rss from /proc/pid/stat, or its current oom priority through /proc/pid/oom_score, and adjusting oom_adj preemptively just in case the oom killer is invoked in the next second?

And that "small app" has 30MB of rss which could be freed, if killed, and used for subsequent page allocations.
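In sketch form, the child accumulation described above looks roughly like the following. This is only an illustration of the behavior under discussion, not a verbatim copy of mm/oom_kill.c; the function name is ours, and the real badness code also takes locking, CPU time, nice level and oom_adj into account.

	#include <linux/sched.h>
	#include <linux/list.h>
	#include <linux/mm_types.h>

	/*
	 * Sketch: the parent's score starts at its own total_vm and then
	 * grows by half of each child's total_vm.  A task with many forked
	 * children (kdeinit4 in the log above) therefore outranks the real
	 * memory hog, even though killing any single child frees almost
	 * nothing.
	 */
	static unsigned long badness_sketch(struct task_struct *p)
	{
		struct task_struct *child;
		unsigned long points = p->mm ? p->mm->total_vm : 0;

		list_for_each_entry(child, &p->children, sibling) {
			if (child->mm && child->mm != p->mm)
				points += child->mm->total_vm / 2 + 1;
		}
		return points;
	}

The child that actually gets killed is then whichever entry with a different mm happens to come first in p->children, which is the ordering described as effectively random in the following message.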
From: David Rientjes on 29 Oct 2009 16:00

On Thu, 29 Oct 2009, Vedran Furac wrote:

> But then you should rename the OOM killer to TRIPK: Totally Random
> Innocent Process Killer.

The randomness here is the order of the child list when the oom killer selects a task based on the badness score and then tries to kill a child with a different mm before the parent. The problem you identified in http://pastebin.com/f3f9674a0, however, is a forkbomb issue where the badness score should never have been so high for kdeinit4 compared to "test". That is a direct consequence of adding the scores of all disjoint child total_vm values into the badness score for the parent and then killing the children instead. That is the problem, not using total_vm as a baseline. Replacing total_vm with rss is not going to solve that issue, and reducing the user's ability to specify a rough oom priority from userspace is simply not an option.

> If you have an OOM situation and Xorg is the first victim, that means
> it's leaking memory badly and the system is probably already frozen/FUBAR.
> Killing krunner in that situation wouldn't do any good. From a user
> perspective, nothing changes: the system is still FUBAR and (s)he would
> probably reboot, cursing Linux in the process.

It depends on what you're running; we need to have the option of protecting very large tasks on production servers. Imagine that "test" here is actually a critical application that we need to protect (it's not solely mlocked anonymous memory) but still want to kill if it is leaking memory beyond your approximate 2.5GB. How do you do that when using rss as the baseline?
From: KAMEZAWA Hiroyuki on 29 Oct 2009 20:00

On Thu, 29 Oct 2009 12:53:42 -0700 (PDT) David Rientjes <rientjes(a)google.com> wrote:

> > If you have an OOM situation and Xorg is the first victim, that means
> > it's leaking memory badly and the system is probably already frozen/FUBAR.
> > Killing krunner in that situation wouldn't do any good. From a user
> > perspective, nothing changes: the system is still FUBAR and (s)he would
> > probably reboot, cursing Linux in the process.
>
> It depends on what you're running; we need to have the option of
> protecting very large tasks on production servers. Imagine that "test"
> here is actually a critical application that we need to protect (it's not
> solely mlocked anonymous memory) but still want to kill if it is leaking
> memory beyond your approximate 2.5GB. How do you do that when using rss
> as the baseline?

As I wrote repeatedly:

- The OOM killer itself is a bad thing, a bad situation.
- The kernel can't know whether a program is bad or not; it can only guess.
- So there is no "correct" OOM killer other than a fork-bomb killer.
- The user has a knob, oom_adj. This is very strong.

That leaves only a "reasonable" or "easy-to-understand" OOM kill. "The current biggest memory eater is killed" sounds reasonable and easy to understand. And if total_vm worked well, overcommit_guess would catch the situation; please improve overcommit_guess if you want to stay with total_vm.

Thanks,
-Kame
From: David Rientjes on 30 Oct 2009 05:20
On Fri, 30 Oct 2009, KAMEZAWA Hiroyuki wrote:

> As I wrote repeatedly:
>
> - The OOM killer itself is a bad thing, a bad situation.

Not necessarily; the memory controller and cpusets use it quite often to enforce their policies, and it is standard runtime behavior there. We'd like to imagine that our cpuset will never be too small to run all the attached jobs, but that happens, and we can easily recover from it by killing a task.

> - The kernel can't know whether a program is bad or not; it can only guess.

Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We can tell the kernel what the oom killer behavior should be if the situation arises.

> - So there is no "correct" OOM killer other than a fork-bomb killer.

Well, of course there is; you're seeing this in a WAY too simplistic manner. If we are oom, we want to be able to influence how the oom killer behaves and responds to that situation. You are proposing that we change the baseline for how the oom killer selects tasks, which we use CONSTANTLY as part of our normal production environment. I'd appreciate it if you'd take it a little more seriously.

> - The user has a knob, oom_adj. This is very strong.

Agreed.

> That leaves only a "reasonable" or "easy-to-understand" OOM kill. "The
> current biggest memory eater is killed" sounds reasonable and easy to
> understand. And if total_vm worked well, overcommit_guess would catch the
> situation; please improve overcommit_guess if you want to stay with
> total_vm.

I don't necessarily want to stay on total_vm, but I also don't want to move to rss as a baseline, as you would probably agree. We disagree about a very fundamental principle: you are coming from the perspective of always wanting to kill the biggest resident memory eater, even for a single order-0 allocation that fails, while I'm coming from the perspective of wanting to ensure that our machines know how the oom killer will react when it is used.

Moving to rss reduces the user's ability to specify an expected oom priority, other than polarizing it by either disabling it completely with an oom_adj value of -17 or choosing the definite next victim with +15. That's my objection to it: the user cannot possibly be expected to predict what proportion of each application's memory will be resident at the time of oom.

I understand you want to totally rewrite the oom killer for whatever reason, but I think you need to spend a lot more time understanding the needs that the Linux community has for its behavior instead of insisting on your point of view.
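For concreteness, the knob both sides keep returning to is the per-task file /proc/<pid>/oom_adj, which accepts values from -17 to +15: -17 exempts the task from oom killing entirely, positive values inflate its badness score, and negative values shrink it. A minimal userspace sketch of setting it (the helper name is ours):

	#include <stdio.h>
	#include <sys/types.h>

	/* Write an oom_adj value (-17..15) for the given pid; -17 disables
	 * oom killing for that task, +15 makes it the preferred victim. */
	static int set_oom_adj(pid_t pid, int adj)
	{
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%d/oom_adj", (int)pid);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%d\n", adj);
		return fclose(f);
	}

From a shell the equivalent is simply "echo -17 > /proc/<pid>/oom_adj", which is how a production job would typically be protected before any oom condition ever arises.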