From: Mike Galbraith on 9 Sep 2009 08:50

On Wed, 2009-09-09 at 13:54 +0200, Jens Axboe wrote:

> Things are much better with this patch on the notebook! I cannot compare
> with BFS as that still doesn't run anywhere I want it to run, but it's
> way better than -rc9-git stock. latt numbers on the notebook have 1/3
> the max latency, average is lower, and stddev is much smaller too.

That patch has a bit of bustage in it. We definitely want to turn down
sched_latency though, and LAST_BUDDY also wants some examination it seems.

taskset -c 3 ./xx 1
(100% cpu 1 sec interval perturbation measurement proggy. overhead is
what it is not getting)

xx says
2392.52 MHZ CPU
perturbation threshold 0.057 usecs.
....

'nuther terminal
taskset -c 3 make -j2 vmlinux

xx output

current (fixed breakage) patched tip tree
pert/s: 153 >18842.18us: 11 min: 0.50 max:36010.37 avg:4354.06 sum/s:666171us overhead:66.62%
pert/s: 160 >18767.18us: 12 min: 0.13 max:32011.66 avg:4172.69 sum/s:667631us overhead:66.66%
pert/s: 156 >18499.43us: 9 min: 0.13 max:27883.24 avg:4296.08 sum/s:670189us overhead:66.49%
pert/s: 146 >18480.71us: 10 min: 0.50 max:32009.38 avg:4615.19 sum/s:673818us overhead:67.26%
pert/s: 154 >18433.20us: 17 min: 0.14 max:31537.12 avg:4474.14 sum/s:689018us overhead:67.68%
pert/s: 158 >18520.11us: 9 min: 0.50 max:34328.86 avg:4275.66 sum/s:675554us overhead:66.76%
pert/s: 154 >18683.74us: 12 min: 0.51 max:35949.23 avg:4363.67 sum/s:672005us overhead:67.04%
pert/s: 154 >18745.53us: 8 min: 0.51 max:34203.43 avg:4399.72 sum/s:677556us overhead:67.03%

bfs209
pert/s: 124 >18681.88us: 17 min: 0.15 max:27274.74 avg:4627.36 sum/s:573793us overhead:56.70%
pert/s: 106 >18702.52us: 20 min: 0.55 max:32022.07 avg:5754.48 sum/s:609975us overhead:59.80%
pert/s: 116 >19082.42us: 17 min: 0.15 max:39835.34 avg:5167.69 sum/s:599452us overhead:59.95%
pert/s: 109 >19289.41us: 22 min: 0.14 max:36818.95 avg:5485.79 sum/s:597951us overhead:59.64%
pert/s: 108 >19238.97us: 19 min: 0.14 max:32026.74 avg:5543.17 sum/s:598662us overhead:59.87%
pert/s: 106 >19415.76us: 20 min: 0.54 max:36011.78 avg:6001.89 sum/s:636201us overhead:62.95%
pert/s: 115 >19341.89us: 16 min: 0.08 max:32040.83 avg:5313.45 sum/s:611047us overhead:59.98%
pert/s: 101 >19527.53us: 24 min: 0.14 max:36018.37 avg:6378.06 sum/s:644184us overhead:64.42%

stock tip (ouch ouch ouch)
pert/s: 153 >48453.23us: 5 min: 0.12 max:144009.85 avg:4688.90 sum/s:717401us overhead:70.89%
pert/s: 172 >47209.49us: 3 min: 0.48 max:68009.05 avg:4022.55 sum/s:691879us overhead:67.05%
pert/s: 148 >51139.18us: 5 min: 0.53 max:168094.76 avg:4918.14 sum/s:727885us overhead:71.65%
pert/s: 171 >51350.64us: 6 min: 0.12 max:102202.79 avg:4304.77 sum/s:736115us overhead:69.24%
pert/s: 153 >57686.54us: 5 min: 0.12 max:224019.85 avg:5399.31 sum/s:826094us overhead:74.50%
pert/s: 172 >55886.47us: 2 min: 0.11 max:75378.18 avg:3993.52 sum/s:686885us overhead:67.67%
pert/s: 157 >58819.31us: 3 min: 0.12 max:165976.63 avg:4453.16 sum/s:699146us overhead:69.91%
pert/s: 149 >58410.21us: 5 min: 0.12 max:104663.89 avg:4792.73 sum/s:714116us overhead:71.41%

sched_latency=20ms min_granularity=4ms
pert/s: 162 >30152.07us: 2 min: 0.49 max:60011.85 avg:4272.97 sum/s:692221us overhead:68.13%
pert/s: 147 >29705.33us: 8 min: 0.14 max:46577.27 avg:4792.03 sum/s:704428us overhead:70.44%
pert/s: 162 >29344.16us: 2 min: 0.49 max:48010.50 avg:4176.75 sum/s:676633us overhead:67.40%
pert/s: 155 >29109.69us: 2 min: 0.49 max:49575.08 avg:4423.87 sum/s:685700us overhead:68.30%
pert/s: 153 >30627.66us: 3 min: 0.13 max:84005.71 avg:4573.07 sum/s:699680us overhead:69.42%
pert/s: 142 >30652.47us: 5 min: 0.49 max:56760.06 avg:4991.61 sum/s:708808us overhead:70.88%
pert/s: 152 >30101.12us: 2 min: 0.49 max:45757.88 avg:4519.92 sum/s:687028us overhead:67.89%
pert/s: 161 >29303.50us: 3 min: 0.12 max:40011.73 avg:4238.15 sum/s:682342us overhead:67.43%

NO_LAST_BUDDY
pert/s: 154 >15257.87us: 28 min: 0.13 max:42004.05 avg:4590.99 sum/s:707013us overhead:70.41%
pert/s: 162 >15392.05us: 34 min: 0.12 max:29021.79 avg:4177.47 sum/s:676750us overhead:66.81%
pert/s: 162 >15665.11us: 33 min: 0.13 max:32008.34 avg:4237.10 sum/s:686410us overhead:67.90%
pert/s: 159 >15914.89us: 31 min: 0.56 max:32056.86 avg:4268.87 sum/s:678751us overhead:67.47%
pert/s: 166 >15858.94us: 26 min: 0.13 max:26655.84 avg:4055.02 sum/s:673134us overhead:66.65%
pert/s: 165 >15878.96us: 32 min: 0.13 max:28010.44 avg:4107.86 sum/s:677798us overhead:66.68%
pert/s: 164 >16213.55us: 29 min: 0.14 max:34263.04 avg:4186.64 sum/s:686610us overhead:68.04%
pert/s: 149 >16764.54us: 20 min: 0.13 max:38688.64 avg:4758.26 sum/s:708981us overhead:70.23%
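The "xx" proggy itself is not included in this excerpt. A rough, hypothetical sketch of the idea Mike describes above (pin a 100% CPU spinner to one core, timestamp every loop iteration, and count any gap above a calibrated threshold as time stolen by something else the scheduler ran) could look like the following; the threshold, reporting interval and output format here are invented, not the real tool's.

/*
 * Hypothetical sketch of a perturbation-measurement loop in the spirit of
 * the "xx" proggy described above: spin at 100% CPU, timestamp every
 * iteration, and treat any gap larger than a threshold as a perturbation.
 * Run it pinned to one CPU, e.g. "taskset -c 3 ./pert".
 */
#include <stdio.h>
#include <time.h>

static double now_usecs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
	const double threshold = 0.5;	/* usecs; the real tool calibrates this */
	const double interval = 1e6;	/* report roughly once per second */
	double start = now_usecs(), last = start;
	double sum = 0, max = 0;
	long count = 0;

	for (;;) {
		double t = now_usecs();
		double delta = t - last;

		if (delta > threshold) {	/* somebody ran in between */
			count++;
			sum += delta;
			if (delta > max)
				max = delta;
		}
		last = t;

		if (t - start >= interval) {
			printf("perturbations: %ld  max: %.2f us  stolen: %.0f us  overhead: %.2f%%\n",
			       count, max, sum, 100.0 * sum / (t - start));
			count = 0;
			sum = max = 0;
			start = t;
		}
	}
	return 0;
}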
From: Pavel Machek on 9 Sep 2009 10:10

Hi!

> > So ... to get to the numbers - i've tested both BFS and the tip of
> > the latest upstream scheduler tree on a testbox of mine. I
> > intentionally didnt test BFS on any really large box - because you
> > described its upper limit like this in the announcement:
>
> I ran a simple test as well, since I was curious to see how it performed
> wrt interactiveness. One of my pet peeves with the current scheduler is
> that I have to nice compile jobs, or my X experience is just awful while
> the compile is running.
>
> Now, this test case is something that attempts to see what
> interactiveness would be like. It'll run a given command line while at
> the same time logging delays. The delays are measured as follows:
>
> - The app creates a pipe, and forks a child that blocks on reading from
>   that pipe.
> - The app sleeps for a random period of time, anywhere between 100ms
>   and 2s. When it wakes up, it gets the current time and writes that to
>   the pipe.
> - The child then gets woken, checks the time on its own, and logs the
>   difference between the two.
>
> The idea here being that the delay between writing to the pipe and the
> child reading the data and comparing should (in some way) be indicative
> of how responsive the system would seem to a user.
>
> The test app was quickly hacked up, so don't put too much into it. The
> test run is a simple kernel compile, using -jX where X is the number of
> threads in the system. The files are cache hot, so little IO is done.
> The -x2 run is using the double number of processes as we have threads,
> eg -j128 on a 64 thread box.

Could you post the source? Someone else might get us numbers...
preferably on dualcore box or something...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
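The actual latt.c source isn't reproduced in this excerpt (Ingo attaches a fixed version further down), but a minimal sketch of the measurement Jens describes above, with one client and made-up constants, might look like this:

/*
 * Minimal sketch of the latency measurement described above -- not the
 * actual latt.c. Parent sleeps a random 100ms-2s, writes a timestamp
 * into a pipe; the blocked child wakes up and logs how long the wakeup
 * took. Iteration count and output format are made up here.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static double now_usecs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
	const int iterations = 20;
	int pfd[2];

	if (pipe(pfd))
		return 1;

	if (fork() == 0) {
		/* child: block on the pipe, log the wakeup delay */
		double sent;

		close(pfd[1]);		/* so read() sees EOF when the parent is done */
		while (read(pfd[0], &sent, sizeof(sent)) == sizeof(sent))
			printf("delay %8.0f usec\n", now_usecs() - sent);
		_exit(0);
	}
	close(pfd[0]);

	srand(getpid());
	for (int i = 0; i < iterations; i++) {
		/* sleep a random 100ms - 2s, then poke the child with a timestamp */
		long ms = 100 + rand() % 1901;
		struct timespec delay = { .tv_sec = ms / 1000,
					  .tv_nsec = (ms % 1000) * 1000000L };
		double sent;

		nanosleep(&delay, NULL);
		sent = now_usecs();
		if (write(pfd[1], &sent, sizeof(sent)) != sizeof(sent))
			break;
	}
	close(pfd[1]);			/* child's read() returns 0 and it exits */
	wait(NULL);
	return 0;
}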
From: Ingo Molnar on 9 Sep 2009 14:10

* Jens Axboe <jens.axboe(a)oracle.com> wrote:

> On Wed, Sep 09 2009, Jens Axboe wrote:
> > On Wed, Sep 09 2009, Jens Axboe wrote:
> > > On Wed, Sep 09 2009, Mike Galbraith wrote:
> > > > On Wed, 2009-09-09 at 08:13 +0200, Ingo Molnar wrote:
> > > > > * Jens Axboe <jens.axboe(a)oracle.com> wrote:
> > > > >
> > > > > > On Tue, Sep 08 2009, Peter Zijlstra wrote:
> > > > > > > On Tue, 2009-09-08 at 11:13 +0200, Jens Axboe wrote:
> > > > > > > > And here's a newer version.
> > > > > > >
> > > > > > > I tinkered a bit with your proglet and finally found the problem.
> > > > > > >
> > > > > > > You used a single pipe per child, this means the loop in run_child() would consume what it just wrote out until it got force preempted by the parent which would also get woken.
> > > > > > >
> > > > > > > This results in the child spinning a while (its full quota) and only reporting the last timestamp to the parent.
> > > > > >
> > > > > > Oh doh, that's not well thought out. Well it was a quick hack :-)
> > > > > > Thanks for the fixup, now it's at least usable to some degree.
> > > > >
> > > > > What kind of latencies does it report on your box?
> > > > >
> > > > > Our vanilla scheduler default latency targets are:
> > > > >
> > > > >   single-core: 20 msecs
> > > > >   dual-core: 40 msecs
> > > > >   quad-core: 60 msecs
> > > > >   opto-core: 80 msecs
> > > > >
> > > > > You can enable CONFIG_SCHED_DEBUG=y and set it directly as well via /proc/sys/kernel/sched_latency_ns:
> > > > >
> > > > >   echo 10000000 > /proc/sys/kernel/sched_latency_ns
> > > >
> > > > He would also need to lower min_granularity, otherwise, it'd be larger than the whole latency target.
> > > >
> > > > I'm testing right now, and one thing that is definitely a problem is the amount of sleeper fairness we're giving. A full latency is just too much short term fairness in my testing. While sleepers are catching up, hogs languish. That's the biggest issue going on.
> > > >
> > > > I've also been doing some timings of make -j4 (looking at idle time), and find that child_runs_first is mildly detrimental to fork/exec load, as are buddies.
> > > >
> > > > I'm running with the below at the moment. (the kthread/workqueue thing is just because I don't see any reason for it to exist, so consider it to be a waste of perfectly good math;)
> > >
> > > Using latt, it seems better than -rc9. The below are entries logged while running make -j128 on a 64 thread box. I did two runs on each, and latt is using 8 clients.
> > >
> > > -rc9
> > >
> > > Max 23772 usec
> > > Avg 1129 usec
> > > Stdev 4328 usec
> > > Stdev mean 117 usec
> > >
> > > Max 32709 usec
> > > Avg 1467 usec
> > > Stdev 5095 usec
> > > Stdev mean 136 usec
> > >
> > > -rc9 + patch
> > >
> > > Max 11561 usec
> > > Avg 1532 usec
> > > Stdev 1994 usec
> > > Stdev mean 48 usec
> > >
> > > Max 9590 usec
> > > Avg 1550 usec
> > > Stdev 2051 usec
> > > Stdev mean 50 usec
> > >
> > > max latency is way down, and much smaller variation as well.
> >
> > Things are much better with this patch on the notebook! I cannot compare with BFS as that still doesn't run anywhere I want it to run, but it's way better than -rc9-git stock. latt numbers on the notebook have 1/3 the max latency, average is lower, and stddev is much smaller too.
>
> BFS210 runs on the laptop (dual core intel core duo). With make -j4 running, I clock the following latt -c8 'sleep 10' latencies:
>
> -rc9
>
> Max 17895 usec
> Avg 8028 usec
> Stdev 5948 usec
> Stdev mean 405 usec
>
> Max 17896 usec
> Avg 4951 usec
> Stdev 6278 usec
> Stdev mean 427 usec
>
> Max 17885 usec
> Avg 5526 usec
> Stdev 6819 usec
> Stdev mean 464 usec
>
> -rc9 + mike
>
> Max 6061 usec
> Avg 3797 usec
> Stdev 1726 usec
> Stdev mean 117 usec
>
> Max 5122 usec
> Avg 3958 usec
> Stdev 1697 usec
> Stdev mean 115 usec
>
> Max 6691 usec
> Avg 2130 usec
> Stdev 2165 usec
> Stdev mean 147 usec

At least in my tests these latencies were mainly due to a bug in latt.c - i've attached the fixed version.

The other reason was wakeup batching. If you do this:

  echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns

... then you can switch on insta-wakeups on -tip too.

With a dual-core box and a make -j4 background job running, on latest -tip i get the following latencies:

 $ ./latt -c8 sleep 30
 Entries: 656 (clients=8)

 Averages:
 ------------------------------
 Max 158 usec
 Avg 12 usec
 Stdev 10 usec

Thanks,

	Ingo
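Ingo's fixed latt.c isn't reproduced in this excerpt, so as an illustration only, here is the shape of the issue Peter describes earlier in the thread: with a single pipe used in both directions, run_child() can read back the result it just wrote instead of blocking for the parent's next timestamp. Giving the results their own pipe (as sketched below, with hypothetical names, building on the earlier single-pipe sketch) makes the child block until the parent really writes something.

/*
 * Illustration of the single-pipe problem Peter describes, not the actual
 * latt.c fix: wakeup timestamps travel parent -> child on one pipe, and
 * measured delays travel child -> parent on a separate one, so the child
 * never consumes its own writes.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static double now_usecs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

static void run_child(int wakeup_fd, int result_fd)
{
	double sent, delay;

	/* blocks here until the parent writes; never sees its own writes */
	while (read(wakeup_fd, &sent, sizeof(sent)) == sizeof(sent)) {
		delay = now_usecs() - sent;
		if (write(result_fd, &delay, sizeof(delay)) != sizeof(delay))
			break;
	}
	_exit(0);
}

int main(void)
{
	int wakeup[2], result[2];
	double sent, delay;

	if (pipe(wakeup) || pipe(result))
		return 1;

	if (fork() == 0) {
		close(wakeup[1]);	/* so EOF works when the parent closes its end */
		close(result[0]);
		run_child(wakeup[0], result[1]);
	}

	for (int i = 0; i < 10; i++) {
		sleep(1);		/* stand-in for the random 100ms-2s sleep */
		sent = now_usecs();
		if (write(wakeup[1], &sent, sizeof(sent)) != sizeof(sent))
			break;
		if (read(result[0], &delay, sizeof(delay)) == sizeof(delay))
			printf("wakeup delay: %.0f usec\n", delay);
	}
	close(wakeup[1]);		/* child's read() returns 0 and it exits */
	wait(NULL);
	return 0;
}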
From: Nikos Chantziaras on 9 Sep 2009 16:20

On 09/09/2009 09:04 PM, Ingo Molnar wrote:
> [...]
> * Jens Axboe<jens.axboe(a)oracle.com> wrote:
>
>> On Wed, Sep 09 2009, Jens Axboe wrote:
>> [...]
>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
>> running, I clock the following latt -c8 'sleep 10' latencies:
>>
>> -rc9
>>
>> Max 17895 usec
>> Avg 8028 usec
>> Stdev 5948 usec
>> Stdev mean 405 usec
>>
>> Max 17896 usec
>> Avg 4951 usec
>> Stdev 6278 usec
>> Stdev mean 427 usec
>>
>> Max 17885 usec
>> Avg 5526 usec
>> Stdev 6819 usec
>> Stdev mean 464 usec
>>
>> -rc9 + mike
>>
>> Max 6061 usec
>> Avg 3797 usec
>> Stdev 1726 usec
>> Stdev mean 117 usec
>>
>> Max 5122 usec
>> Avg 3958 usec
>> Stdev 1697 usec
>> Stdev mean 115 usec
>>
>> Max 6691 usec
>> Avg 2130 usec
>> Stdev 2165 usec
>> Stdev mean 147 usec
>
> At least in my tests these latencies were mainly due to a bug in
> latt.c - i've attached the fixed version.
>
> The other reason was wakeup batching. If you do this:
>
>   echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns
>
> ... then you can switch on insta-wakeups on -tip too.
>
> With a dual-core box and a make -j4 background job running, on
> latest -tip i get the following latencies:
>
> $ ./latt -c8 sleep 30
> Entries: 656 (clients=8)
>
> Averages:
> ------------------------------
> Max 158 usec
> Avg 12 usec
> Stdev 10 usec

With your version of latt.c, I get these results with 2.6-tip vs
2.6.31-rc9-bfs:

(mainline)
Averages:
------------------------------
Max 50 usec
Avg 12 usec
Stdev 3 usec

(BFS)
Averages:
------------------------------
Max 474 usec
Avg 11 usec
Stdev 16 usec

However, the interactivity problems still remain. Does that mean it's
not a latency issue?
From: Jens Axboe on 9 Sep 2009 17:00
On Wed, Sep 09 2009, Nikos Chantziaras wrote:
> On 09/09/2009 09:04 PM, Ingo Molnar wrote:
>> [...]
>> * Jens Axboe<jens.axboe(a)oracle.com> wrote:
>>
>>> On Wed, Sep 09 2009, Jens Axboe wrote:
>>> [...]
>>> BFS210 runs on the laptop (dual core intel core duo). With make -j4
>>> running, I clock the following latt -c8 'sleep 10' latencies:
>>>
>>> -rc9
>>>
>>> Max 17895 usec
>>> Avg 8028 usec
>>> Stdev 5948 usec
>>> Stdev mean 405 usec
>>>
>>> Max 17896 usec
>>> Avg 4951 usec
>>> Stdev 6278 usec
>>> Stdev mean 427 usec
>>>
>>> Max 17885 usec
>>> Avg 5526 usec
>>> Stdev 6819 usec
>>> Stdev mean 464 usec
>>>
>>> -rc9 + mike
>>>
>>> Max 6061 usec
>>> Avg 3797 usec
>>> Stdev 1726 usec
>>> Stdev mean 117 usec
>>>
>>> Max 5122 usec
>>> Avg 3958 usec
>>> Stdev 1697 usec
>>> Stdev mean 115 usec
>>>
>>> Max 6691 usec
>>> Avg 2130 usec
>>> Stdev 2165 usec
>>> Stdev mean 147 usec
>>
>> At least in my tests these latencies were mainly due to a bug in
>> latt.c - i've attached the fixed version.
>>
>> The other reason was wakeup batching. If you do this:
>>
>>   echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns
>>
>> ... then you can switch on insta-wakeups on -tip too.
>>
>> With a dual-core box and a make -j4 background job running, on
>> latest -tip i get the following latencies:
>>
>> $ ./latt -c8 sleep 30
>> Entries: 656 (clients=8)
>>
>> Averages:
>> ------------------------------
>> Max 158 usec
>> Avg 12 usec
>> Stdev 10 usec
>
> With your version of latt.c, I get these results with 2.6-tip vs
> 2.6.31-rc9-bfs:
>
> (mainline)
> Averages:
> ------------------------------
> Max 50 usec
> Avg 12 usec
> Stdev 3 usec
>
> (BFS)
> Averages:
> ------------------------------
> Max 474 usec
> Avg 11 usec
> Stdev 16 usec
>
> However, the interactivity problems still remain. Does that mean it's
> not a latency issue?

It probably just means that latt isn't a good measure of the problem.
Which isn't really too much of a surprise.

--
Jens Axboe