From: Miklos Szeredi on
On Thu, 2010-04-22 at 09:59 +0200, Corrado Zoccolo wrote:
> Hi Miklos,
> On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi(a)suse.cz> wrote:
> > Jens, Corrado,
> >
> > Here's a graph showing the number of issued but not yet completed
> > requests versus time for CFQ and NOOP schedulers running the tiobench
> > benchmark with 8 threads:
> >
> > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
> >
> > It shows pretty clearly that the performance problem is caused by CFQ
> > not issuing enough requests to fill the available bandwidth.
> >
> > Is this the correct behavior of CFQ or is this a bug?
> This is the expected behavior from CFQ, even if it is not optimal,
> since we aren't able to identify multi-spindle disks yet. Can you
> post the result of "grep -r . ." in your /sys/block/*/queue directory,
> to see if we can find some parameter that can help identify your
> hardware as a multi-spindle disk.

../iosched/quantum:8
../iosched/fifo_expire_sync:124
../iosched/fifo_expire_async:248
../iosched/back_seek_max:16384
../iosched/back_seek_penalty:2
../iosched/slice_sync:100
../iosched/slice_async:40
../iosched/slice_async_rq:2
../iosched/slice_idle:8
../iosched/low_latency:0
../iosched/group_isolation:0
../nr_requests:128
../read_ahead_kb:512
../max_hw_sectors_kb:32767
../max_sectors_kb:512
../max_segments:64
../max_segment_size:65536
../scheduler:noop deadline [cfq]
../hw_sector_size:512
../logical_block_size:512
../physical_block_size:512
../minimum_io_size:512
../optimal_io_size:0
../discard_granularity:0
../discard_max_bytes:0
../discard_zeroes_data:0
../rotational:1
../nomerges:0
../rq_affinity:1

> >
> > This is on a vanilla 2.6.34-rc4 kernel with two tunables modified:
> >
> > read_ahead_kb=512
> > low_latency=0 (for CFQ)
> You should get much better throughput by setting
> /sys/block/_your_disk_/queue/iosched/slice_idle to 0, or
> /sys/block/_your_disk_/queue/rotational to 0.

slice_idle=0 definitely helps. rotational=0 seems to help on 2.6.34-rc
but not on 2.6.32.

As far as I understand, setting slice_idle to zero is just a workaround
to make CFQ look at all the other queues instead of serving one
exclusively for a long time.

I have very little understanding of I/O scheduling, but my idea of
what's really needed here is for the scheduler to realize that one
queue alone cannot saturate the device while a large backlog of
requests is waiting on other queues. Is something like that
implementable?
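Roughly, I am thinking of a heuristic along these lines (a standalone
sketch; the types, names and threshold are invented for illustration
and are not actual CFQ code):

/*
 * Sketch of the heuristic above: dispatch from other queues when the
 * active queue cannot keep the device busy on its own and other
 * queues have work waiting. Invented names, not CFQ code.
 */
#include <stdbool.h>

struct queue_state {
	int in_flight;	/* requests issued but not yet completed */
	int backlog;	/* requests queued but not yet issued */
};

static bool should_dispatch_from_others(const struct queue_state *active,
					const struct queue_state *others,
					int nr_others, int device_depth)
{
	int waiting = 0;
	int i;

	if (active->in_flight >= device_depth)
		return false;	/* the active queue saturates the device */

	for (i = 0; i < nr_others; i++)
		waiting += others[i].backlog;

	return waiting > 0;
}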

Thanks,
Miklos

From: Miklos Szeredi on
Corrado,

On Tue, 2010-04-20 at 22:50 +0200, Corrado Zoccolo wrote:
> can you give more information about the setup?
> How much memory do you have, what is the disk configuration (is this a
> hw raid?) and so on.

8G of memory, an 8-way Xeon CPU, and a Fibre Channel attached storage
array (HP HSV200). I don't know the configuration of the array.

> > low_latency is set to zero in all tests.
> >
> > The layout difference doesn't explain why setting the scheduler to
> > "noop" consistently almost doubles read throughput in 8-thread
> > tiobench. This fact alone pretty clearly indicates that the I/O
> > scheduler is the culprit.
> From the attached btt output, I see that a lot of time is spent
> waiting to allocate new request structures.
> > S2G 0.022460311 6.581680621 23.144763751 15
> Since noop doesn't attach fancy data to each request, it can save
> those allocations, thus resulting in no sleeps.
> The delays in allocation, though, may not be entirely attributable to
> the I/O scheduler, and working under constrained memory conditions
> will negatively affect it.

I verified with the simple dd test that this happens even when there is
no memory pressure from the page cache: dd-ing only 5G of files after
clearing the cache leaves ~2G of memory completely free throughout the
test.

> > I've also tested with plain "dd" instead of tiobench where the
> > filesystem layout stayed exactly the same between tests. Still the
> > speed difference is there.
> Does dropping caches before the read test change the situation?

In all my tests I drop caches before each run.

Please let me know if you need more information.

Thanks,
Miklos

From: Corrado Zoccolo on
On Fri, Apr 23, 2010 at 12:57 PM, Miklos Szeredi <mszeredi(a)suse.cz> wrote:
> On Thu, 2010-04-22 at 16:31 -0400, Vivek Goyal wrote:
>> On Thu, Apr 22, 2010 at 09:59:14AM +0200, Corrado Zoccolo wrote:
>> > Hi Miklos,
>> > On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi(a)suse.cz> wrote:
>> > > Jens, Corrado,
>> > >
>> > > Here's a graph showing the number of issued but not yet completed
>> > > requests versus time for CFQ and NOOP schedulers running the tiobench
>> > > benchmark with 8 threads:
>> > >
>> > > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
>> > >
>> > > It shows pretty clearly that the performance problem is caused by CFQ
>> > > not issuing enough requests to fill the available bandwidth.
>> > >
>> > > Is this the correct behavior of CFQ or is this a bug?
>> > This is the expected behavior from CFQ, even if it is not optimal,
>> > since we aren't able to identify multi-spindle disks yet.
>>
>> In the past we were of the opinion that for sequential workloads,
>> multi-spindle disks would not matter much, as readahead logic (in the
>> OS and possibly in hardware too) would help. For random workloads we
>> don't idle on the single cfqq anyway, so that case is fine. But my
>> tests now seem to be telling a different story.
>>
>> I also have one FC link to one of the HP EVAs, and I am running an
>> increasing number of sequential readers to see if throughput goes up
>> as the number of readers goes up. The results are with noop and cfq.
>> I do flush OS caches across the runs, but I have no control over the
>> caching on the HP EVA.
>>
>> Kernel=2.6.34-rc5
>> DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe
>> Workload=bsr      iosched=cfq     Filesz=2G   bs=4K
>> =========================================================================
>> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
>> ---       --- --  ------------   -----------    -------------  -----------
>> bsr       1   1   135366         59024          0              0
>> bsr       1   2   124256         126808         0              0
>> bsr       1   4   132921         341436         0              0
>> bsr       1   8   129807         392904         0              0
>> bsr       1   16  129988         773991         0              0
>>
>> Kernel=2.6.34-rc5
>> DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe
>> Workload=bsr      iosched=noop    Filesz=2G   bs=4K
>> =========================================================================
>> job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
>> ---       --- --  ------------   -----------    -------------  -----------
>> bsr       1   1   126187         95272          0              0
>> bsr       1   2   185154         72908          0              0
>> bsr       1   4   224622         88037          0              0
>> bsr       1   8   285416         115592         0              0
>> bsr       1   16  348564         156846         0              0
>>
>
> These numbers are very similar to what I got.
>
>> So in the case of NOOP, throughput shot up to 348MB/s, but CFQ
>> remains more or less constant at about 130MB/s.
>>
>> So at least in this case, a single sequential CFQ queue is not
>> keeping the disk busy enough.
>>
>> I am wondering why my test results were different in the past. Maybe
>> it was a different piece of hardware, and the behavior varies across
>> hardware?
>
> Probably.  I haven't seen this type of behavior on other hardware.
>
>> Anyway, if that's the case, then we probably need to allow IO from
>> multiple sequential readers and keep a watch on throughput. If
>> throughput drops, then reduce the number of parallel sequential
>> readers. Not sure how much code that is, but with multiple cfqqs
>> going in parallel, the ioprio logic will more or less stop working in
>> CFQ (on multi-spindle hardware).
Hi Vivek,
I tried to implement exactly what you are proposing; see the attached
patches. I leverage the queue merging feature to let multiple cfqqs
share the disk in the same timeslice. I changed the queue split code to
trigger on a throughput drop instead of on a seeky pattern, so diverging
queues can remain merged if they have good throughput. Moreover, I
measure the max bandwidth reached by single queues and by merged queues
(you can see the values in the bandwidth sysfs file).
If merged queues can outperform non-merged ones, the queue merging code
will try to opportunistically merge together queues that cannot submit
enough requests to fill half of the NCQ slots. I'd like to know whether
you see any improvement from this on your hardware. There are some
magic numbers in the code; you may want to try tuning them. Note that,
since the opportunistic queue merging will start happening only after
merged queues have been shown to reach higher bandwidth than non-merged
queues, you should exercise the disk for a while before running the
test (and you can check sysfs), or the merging will not happen.
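In miniature, the decision logic looks something like this (a
simplified standalone sketch; the names and the one-half thresholds
are illustrative, not the actual patch code):

/*
 * Sketch of the merge/split decisions described above. The struct,
 * function names and divisors are invented for illustration.
 */
#include <stdbool.h>

struct bw_stats {
	unsigned long max_bw_single;	/* best bandwidth seen by a lone queue */
	unsigned long max_bw_merged;	/* best bandwidth seen by merged queues */
};

/* Opportunistically merge queues that cannot fill half of the NCQ
 * slots, but only once merged queues have proven to reach higher
 * bandwidth than single queues. */
static bool should_merge_queues(const struct bw_stats *stats,
				int queue_in_flight, int ncq_slots)
{
	if (stats->max_bw_merged <= stats->max_bw_single)
		return false;

	return queue_in_flight < ncq_slots / 2;
}

/* Split a merged queue back when its throughput drops well below what
 * a single queue achieved (the divisor is one of the magic numbers). */
static bool should_split_queue(unsigned long current_bw,
			       const struct bw_stats *stats)
{
	return current_bw < stats->max_bw_single / 2;
}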

>
> Have you tested on older kernels?  Around 2.6.16 it seemed to allow more
> parallel reads, but that might have been just accidental (due to I/O
> being submitted in a different pattern).
Is the BW for a single reader also better on 2.6.16, or is the
improvement only seen with more concurrent readers?

Thanks,
Corrado
>
> Thanks,
> Miklos
>
>
From: Vivek Goyal on
On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:

[..]
> >
> >> Anyway, if that's the case, then we probably need to allow IO from
> >> multiple sequential readers and keep a watch on throughput. If
> >> throughput drops, then reduce the number of parallel sequential
> >> readers. Not sure how much code that is, but with multiple cfqqs
> >> going in parallel, the ioprio logic will more or less stop working
> >> in CFQ (on multi-spindle hardware).
> Hi Vivek,
> I tried to implement exactly what you are proposing; see the attached
> patches. I leverage the queue merging feature to let multiple cfqqs
> share the disk in the same timeslice. I changed the queue split code
> to trigger on a throughput drop instead of on a seeky pattern, so
> diverging queues can remain merged if they have good throughput.
> Moreover, I measure the max bandwidth reached by single queues and by
> merged queues (you can see the values in the bandwidth sysfs file).
> If merged queues can outperform non-merged ones, the queue merging
> code will try to opportunistically merge together queues that cannot
> submit enough requests to fill half of the NCQ slots. I'd like to know
> whether you see any improvement from this on your hardware. There are
> some magic numbers in the code; you may want to try tuning them. Note
> that, since the opportunistic queue merging will start happening only
> after merged queues have been shown to reach higher bandwidth than
> non-merged queues, you should exercise the disk for a while before
> running the test (and you can check sysfs), or the merging will not
> happen.

Thanks Corrado. Using the queue split code sounds like the right place
to do it. I will test it and report back my results.

>
> >
> > Have you tested on older kernels? Around 2.6.16 it seemed to allow
> > more parallel reads, but that might have been just accidental (due
> > to I/O being submitted in a different pattern).
> Is the BW for a single reader also better on 2.6.16, or is the
> improvement only seen with more concurrent readers?

I will also test 2.6.16. I am anyway curious how 2.6.16 could have
performed better; were we dispatching requests from multiple cfqqs and
driving deeper queue depths? To me it is fundamental to the cfq design
that at any one time a single queue gets to use the disk (at least on
the sync-idle tree). So something must have been different in 2.6.16.
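That design, in miniature, is something like the following (a toy
sketch with invented names, not the real cfq-iosched code):

/*
 * Toy model of CFQ's sync-idle service: the active queue keeps
 * exclusive access to the device until its timeslice expires, then
 * the next queue is selected. All names invented for illustration.
 */
#include <stddef.h>

struct cfqq_model {
	unsigned long slice_end;
	struct cfqq_model *next;	/* circular round-robin list */
};

struct sched_model {
	struct cfqq_model *active;
	unsigned long slice_len;
};

static struct cfqq_model *select_queue(struct sched_model *sd,
				       unsigned long now)
{
	/* one queue at a time: keep the active queue until its slice ends */
	if (sd->active && now < sd->active->slice_end)
		return sd->active;

	sd->active = sd->active ? sd->active->next : NULL;
	if (sd->active)
		sd->active->slice_end = now + sd->slice_len;
	return sd->active;
}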

Thanks
Vivek