From: Mathieu Desnoyers on
A huge CFS vruntime spread (18 minutes) has been observed with LTTng while simply
running Xorg on a uniprocessor machine with a 2.6.33.2 kernel. Detailed explanation in
my ELC2010 presentation at:

http://www.efficios.com/elc2010

(includes slides, ad-hoc CFS instrumentation patches and wakeup latency test
program)

I've torn the CFS scheduler apart over the past few days to figure out what is causing
this weird behavior, and the culprit seems to be place_entity(). The problem
appears to be the cumulative effect of letting the min_vruntime go backward when
putting sleepers back on the runqueue. It lets the vruntime spread grow to
"entertaining" values (it is supposed to be in the 5ms range, not 18 minutes!).

In the original code, a max between the sched entity vruntime and the calculated
vruntime was supposed to "ensure that the thread time never go backward". But I
don't see why we even care about that. The key point is that the min_vruntime
of the runqueue should not go backward.

I propose to fix this by calculating the relative offset from
min_vruntime + sysctl_sched_latency rather than directly from min_vruntime. I
also ensure that the value never goes below min_vruntime.
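
To make the difference concrete, here is a minimal userspace sketch of the two
placement rules for a woken sleeper. It is a model under stated assumptions, not
the kernel code: the function names, the `_sketch` helper and the 5 ms latency
constant are illustrative, and `thresh` stands for the sleeper threshold after
conversion to virtual time.

```c
#include <stdint.h>

/* Assumed illustrative value standing in for sysctl_sched_latency (5 ms in ns). */
#define SKETCH_SCHED_LATENCY_NS 5000000ULL

/* Wrap-safe "later of two vruntimes", using a signed delta comparison. */
static uint64_t max_vruntime_sketch(uint64_t a, uint64_t b)
{
	return (int64_t)(b - a) > 0 ? b : a;
}

/*
 * Original rule: place the sleeper at min_vruntime - thresh, but never
 * earlier than the entity's own previous vruntime. The result can be
 * below min_vruntime.
 */
static uint64_t place_sleeper_original(uint64_t min_vruntime,
				       uint64_t se_vruntime, uint64_t thresh)
{
	uint64_t vruntime = min_vruntime - thresh;

	return max_vruntime_sketch(se_vruntime, vruntime);
}

/*
 * Proposed rule: start from min_vruntime + latency, subtract thresh,
 * and clamp so the result is never below min_vruntime.
 */
static uint64_t place_sleeper_proposed(uint64_t min_vruntime, uint64_t thresh)
{
	uint64_t vruntime = min_vruntime + SKETCH_SCHED_LATENCY_NS;

	vruntime -= thresh;
	return vruntime > min_vruntime ? vruntime : min_vruntime;
}
```

With min_vruntime at 1 s and a 2.5 ms threshold, the original rule can place a
long sleeper 2.5 ms before min_vruntime, while the proposed rule places it
2.5 ms after it.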

Under the Xorg workload, moving a few windows around and starting firefox while
executing the wakeup-latency.c program (program waking up every 10ms and
reporting wakeup latency), this patch brings the worst-case latency from 60ms down
to 12ms. Even with a kernel compilation running at the same time, the worst-case
latency now stays around 20ms.
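
For readers who want to try a similar measurement, the following is a rough
sketch of a periodic wakeup-latency probe. It is not the wakeup-latency.c from
the presentation; the function names are hypothetical, and it assumes a POSIX
system providing clock_nanosleep() with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

static int64_t ts_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/*
 * Sleep until successive absolute deadlines spaced period_ns apart and
 * return the largest observed lateness in nanoseconds, or -1 on error.
 */
static int64_t worst_wakeup_latency_ns(int iterations, int64_t period_ns)
{
	struct timespec now, deadline;
	int64_t next, late, worst = 0;
	int ret;

	assert(iterations > 0 && period_ns > 0);

	if (clock_gettime(CLOCK_MONOTONIC, &now))
		return -1;
	next = ts_to_ns(&now);

	for (int i = 0; i < iterations; i++) {
		next += period_ns;
		deadline.tv_sec = next / NSEC_PER_SEC;
		deadline.tv_nsec = next % NSEC_PER_SEC;

		/* clock_nanosleep() returns the error number directly. */
		do {
			ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
					      &deadline, NULL);
		} while (ret == EINTR);
		if (ret)
			return -1;

		if (clock_gettime(CLOCK_MONOTONIC, &now))
			return -1;

		/* Lateness is how far past the deadline we actually woke up. */
		late = ts_to_ns(&now) - next;
		if (late > worst)
			worst = late;
	}
	return worst;
}
```

Calling worst_wakeup_latency_ns(1000, 10000000) would approximate the 10ms
period described above; the result depends on the machine and its load.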

I'm submitting this patch ASAP, since it seems to fix CFS issues that many
people have been complaining about. I'm sending it as RFC because testing its
effect on more workloads would be welcome.

I can see that place_entity() has stayed more or less the same since 2.6.24 (and
maybe even before, as code has just been reorganised between 2.6.23 and 2.6.24),
so we can expect this to be a problem people have been experiencing for a while.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
CC: Ingo Molnar <mingo(a)elte.hu>
CC: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
CC: Mike Galbraith <efault(a)gmx.de>
CC: Andrew Morton <akpm(a)linux-foundation.org>
CC: Linus Torvalds <torvalds(a)linux-foundation.org>
CC: Greg Kroah-Hartman <greg(a)kroah.com>
CC: Steven Rostedt <rostedt(a)goodmis.org>
CC: Jarkko Nikula <jhnikula(a)gmail.com>
CC: Tony Lindgren <tony(a)atomide.com>
---
kernel/sched_fair.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c 2010-04-18 01:44:19.000000000 -0400
+++ linux-2.6-lttng.git/kernel/sched_fair.c 2010-04-18 01:47:38.000000000 -0400
@@ -738,6 +738,14 @@
 		unsigned long thresh = sysctl_sched_latency;

 		/*
+		 * Place the woken up task relative to
+		 * min_vruntime + sysctl_sched_latency.
+		 * We must _never_ decrement min_vruntime, because the effect is
+		 * that spread increases progressively under the Xorg workload.
+		 */
+		vruntime += sysctl_sched_latency;
+
+		/*
 		 * Convert the sleeper threshold into virtual time.
 		 * SCHED_IDLE is a special sub-class. We care about
 		 * fairness only relative to other SCHED_IDLE tasks,
@@ -755,11 +763,10 @@
 			thresh >>= 1;

 		vruntime -= thresh;
+		/* Ensure min_vruntime never goes backwards. */
+		vruntime = max_t(u64, vruntime, cfs_rq->min_vruntime);
 	}

-	/* ensure we never gain time by being placed backwards. */
-	vruntime = max_vruntime(se->vruntime, vruntime);
-
 	se->vruntime = vruntime;
 }

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com