From: Jan Kara
On Thu 22-04-10 12:23:29, Miklos Szeredi wrote:
> > >
> > > This is on a vanilla 2.6.34-rc4 kernel with two tunables modified:
> > >
> > > read_ahead_kb=512
> > > low_latency=0 (for CFQ)
> > You should get much better throughput by setting
> > /sys/block/_your_disk_/queue/iosched/slice_idle to 0, or
> > /sys/block/_your_disk_/queue/rotational to 0.
>
> slice_idle=0 definitely helps. rotational=0 seems to help on 2.6.34-rc
> but not on 2.6.32.
>
> As far as I understand setting slice_idle to zero is just a workaround
> to make cfq look at all the other queues instead of serving one
> exclusively for a long time.
Yes, basically it disables idling (i.e., waiting to see whether a thread
sends more IO so that we can get better IO locality).
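Roughly, the decision looks like this (just a toy model, not the real
cfq-iosched code):

#include <stdio.h>

/*
 * Toy model of the "should we idle?" decision. Not the real
 * cfq-iosched.c logic, just the shape of it.
 */
static int should_idle(int queue_is_sync, int slice_idle_ms,
                       int device_is_rotational)
{
        if (slice_idle_ms == 0)         /* slice_idle=0 switches idling off */
                return 0;
        if (!queue_is_sync)             /* async IO: no point in waiting */
                return 0;
        if (!device_is_rotational)      /* nonrot device: seeks are cheap */
                return 0;
        return 1;                       /* wait briefly for more nearby IO */
}

int main(void)
{
        printf("sync reader, defaults:      idle=%d\n", should_idle(1, 8, 1));
        printf("sync reader, slice_idle=0:  idle=%d\n", should_idle(1, 0, 1));
        printf("sync reader, rotational=0:  idle=%d\n", should_idle(1, 8, 0));
        return 0;
}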

> I have very little understanding of I/O scheduling but my idea of what's
> really needed here is to realize that one queue is not able to saturate
> the device and there's a large backlog of requests on other queues that
> are waiting to be served. Is something like that implementable?
I see a problem with defining "saturate the device", but maybe we could
measure something like "completed requests / sec" and try autotuning
slice_idle to maximize this value (hopefully the utility function is
concave, so we can just use local optimization).
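To make the idea a bit more concrete, here is a completely untested
userspace sketch of such a loop (the default device name, the 1-second
sampling window and the 0..16 ms clamp are arbitrary placeholders):

#include <stdio.h>
#include <unistd.h>

/* Completed requests on the device: fields 1 (reads completed) and
 * 5 (writes completed) of /sys/block/<dev>/stat. */
static unsigned long long read_completions(const char *dev)
{
        unsigned long long rd = 0, wr = 0, dummy;
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
        f = fopen(path, "r");
        if (!f)
                return 0;
        fscanf(f, "%llu %llu %llu %llu %llu", &rd, &dummy, &dummy, &dummy, &wr);
        fclose(f);
        return rd + wr;
}

static void set_slice_idle(const char *dev, int val)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/block/%s/queue/iosched/slice_idle", dev);
        f = fopen(path, "w");
        if (!f)
                return;
        fprintf(f, "%d\n", val);
        fclose(f);
}

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "sda";   /* placeholder */
        int slice = 8, step = -2;       /* start from the default 8 ms */
        unsigned long long before, after, rate, last_rate = 0;

        for (;;) {
                set_slice_idle(dev, slice);
                before = read_completions(dev);
                sleep(1);
                after = read_completions(dev);
                rate = after - before;  /* completed requests / sec */
                if (rate < last_rate)
                        step = -step;   /* got worse, walk back */
                last_rate = rate;
                slice += step;
                if (slice < 0)
                        slice = 0;
                if (slice > 16)
                        slice = 16;
        }
        return 0;
}

Of course this only behaves if the curve really is concave; otherwise it
can happily sit on a local bump.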

Honza
--
Jan Kara <jack(a)suse.cz>
SUSE Labs, CR
From: Vivek Goyal
On Thu, Apr 22, 2010 at 09:59:14AM +0200, Corrado Zoccolo wrote:
> Hi Miklos,
> On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi(a)suse.cz> wrote:
> > Jens, Corrado,
> >
> > Here's a graph showing the number of issued but not yet completed
> > requests versus time for CFQ and NOOP schedulers running the tiobench
> > benchmark with 8 threads:
> >
> > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
> >
> > It shows pretty clearly that the performance problem is that CFQ is not
> > issuing enough requests to fill the bandwidth.
> >
> > Is this the correct behavior of CFQ or is this a bug?
> This is the expected behavior from CFQ, even if it is not optimal,
> > since we aren't able to identify multi-spindle disks yet.

In the past we were of the opinion that for sequential workloads, multi-spindle
disks would not matter much, as the readahead logic (in the OS and possibly in
the hardware as well) would help. For random workloads we don't idle on the
single cfqq anyway, so it is fine. But my tests now seem to be telling a
different story.

I also have an FC link to one of the HP EVA arrays, and I am running an
increasing number of sequential readers to see if throughput goes up as the
number of readers goes up. The results below are with noop and cfq. I do
flush the OS caches across the runs, but I have no control over caching on
the HP EVA.

Kernel=2.6.34-rc5
DIR=/mnt/iostestmnt/fio DEV=/dev/mapper/mpathe
Workload=bsr iosched=cfq Filesz=2G bs=4K
=========================================================================
job  Set  NR  ReadBW(KB/s)  MaxClat(us)  WriteBW(KB/s)  MaxClat(us)
---  ---  --  ------------  -----------  -------------  -----------
bsr    1   1        135366        59024              0            0
bsr    1   2        124256       126808              0            0
bsr    1   4        132921       341436              0            0
bsr    1   8        129807       392904              0            0
bsr    1  16        129988       773991              0            0

Kernel=2.6.34-rc5
DIR=/mnt/iostestmnt/fio DEV=/dev/mapper/mpathe
Workload=bsr iosched=noop Filesz=2G bs=4K
=========================================================================
job  Set  NR  ReadBW(KB/s)  MaxClat(us)  WriteBW(KB/s)  MaxClat(us)
---  ---  --  ------------  -----------  -------------  -----------
bsr    1   1        126187        95272              0            0
bsr    1   2        185154        72908              0            0
bsr    1   4        224622        88037              0            0
bsr    1   8        285416       115592              0            0
bsr    1  16        348564       156846              0            0

So in the case of NOOP, throughput shot up to 348MB/s, but CFQ remains more
or less constant at about 130MB/s.

So at least in this case, a single sequential CFQ queue is not keeping the
disk busy enough.

I am wondering why my testing results were different in the past. Maybe it
was a different piece of hardware, and the behavior varies across hardware?

Anyway, if that's the case, then we probably need to allow IO from multiple
sequential readers and keep a watch on throughput. If throughput drops, then
reduce the number of parallel sequential readers. Not sure how much code that
is, but with multiple cfqqs going in parallel, the ioprio logic will more or
less stop working in CFQ (on multi-spindle hardware).
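Roughly the kind of feedback loop I have in mind (not real cfq code; the
structure/function names and the 10% back-off threshold below are made up):

#include <stdio.h>

struct seq_autotune {
        unsigned int max_parallel_sync;         /* sync cfqqs allowed to dispatch */
        unsigned long long last_sectors;
        unsigned long long last_rate;
};

/* Call once per sampling period with the device-wide completed sector count. */
static void seq_autotune_sample(struct seq_autotune *t,
                                unsigned long long completed_sectors)
{
        unsigned long long rate = completed_sectors - t->last_sectors;

        if (rate + rate / 10 < t->last_rate) {
                /* throughput dropped by >10%: back off towards strict idling */
                if (t->max_parallel_sync > 1)
                        t->max_parallel_sync--;
        } else if (rate > t->last_rate) {
                /* still scaling up: admit one more sequential reader */
                t->max_parallel_sync++;
        }
        t->last_sectors = completed_sectors;
        t->last_rate = rate;
}

/* Toy driver with made-up numbers, just to show how it would react. */
int main(void)
{
        unsigned long long sectors[] = { 0, 250000, 600000, 1050000,
                                         1400000, 1700000, 1950000 };
        struct seq_autotune t = { 1, 0, 0 };
        int i;

        for (i = 0; i < 7; i++) {
                seq_autotune_sample(&t, sectors[i]);
                printf("sample %d: max_parallel_sync=%u\n", i,
                       t.max_parallel_sync);
        }
        return 0;
}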

FWIW, I also ran tiobench on the same HP EVA with NOOP and CFQ, and indeed
read throughput is bad with CFQ.

With NOOP
=========
# /usr/bin/tiotest -t 8 -f 2000 -r 4000 -b 4096 -d /mnt/mpathe
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write       16000 MBs |   44.1 s | 362.410 MB/s |  25.3 %  | 1239.4 % |
| Random Write  125 MBs |    0.8 s | 156.182 MB/s |  19.7 %  | 484.8 % |
| Read        16000 MBs |   59.9 s | 267.008 MB/s |  12.4 %  | 197.1 % |
| Random Read   125 MBs |   16.7 s |   7.478 MB/s |   1.0 %  |  23.7 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.083 ms |      834.092 ms |  0.00000 |   0.00000 |
| Random Write |        0.021 ms |       21.024 ms |  0.00000 |   0.00000 |
| Read         |        0.115 ms |      105.830 ms |  0.00000 |   0.00000 |
| Random Read  |        4.088 ms |      295.605 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.114 ms |      834.092 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

With CFQ
========
# /usr/bin/tiotest -t 8 -f 2000 -r 4000 -b 4096 -d /mnt/mpathe
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write       16000 MBs |   49.5 s | 323.086 MB/s |  21.7 %  | 1175.6 % |
| Random Write  125 MBs |    2.2 s |  57.148 MB/s |   5.0 %  | 188.1 % |
| Read        16000 MBs |  162.7 s |  98.311 MB/s |   4.7 %  |  71.0 % |
| Random Read   125 MBs |   17.0 s |   7.344 MB/s |   0.8 %  |  26.5 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.093 ms |      832.680 ms |  0.00000 |   0.00000 |
| Random Write |        0.017 ms |       12.031 ms |  0.00000 |   0.00000 |
| Read         |        0.316 ms |      561.623 ms |  0.00000 |   0.00000 |
| Random Read  |        4.126 ms |      273.156 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.219 ms |      832.680 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

Thanks
Vivek
From: Miklos Szeredi
On Thu, 2010-04-22 at 17:53 +0200, Jan Kara wrote:
> On Thu 22-04-10 12:23:29, Miklos Szeredi wrote:
> > I have very little understanding of I/O scheduling but my idea of what's
> > really needed here is to realize that one queue is not able to saturate
> > the device and there's a large backlog of requests on other queues that
> > are waiting to be served. Is something like that implementable?
> I see a problem with defining "saturate the device", but maybe we could
> measure something like "completed requests / sec" and try autotuning
> slice_idle to maximize this value (hopefully the utility function is
> concave, so we can just use local optimization).

Yeah, detecting saturation may be difficult.

I guess that function depends on a lot of other things as well,
including seek times, etc. Not easy to optimize.

I'm still wondering what makes such a difference between CFQ on 2.6.16
and CFQ on 2.6.27-34: why is the one in older kernels performing so much
better in this situation?

What should we tell our customers? The default settings should at least
handle these systems a bit better.

Thanks,
Miklos

From: Miklos Szeredi
On Thu, 2010-04-22 at 16:31 -0400, Vivek Goyal wrote:
> On Thu, Apr 22, 2010 at 09:59:14AM +0200, Corrado Zoccolo wrote:
> > Hi Miklos,
> > On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi(a)suse.cz> wrote:
> > > Jens, Corrado,
> > >
> > > Here's a graph showing the number of issued but not yet completed
> > > requests versus time for CFQ and NOOP schedulers running the tiobench
> > > benchmark with 8 threads:
> > >
> > > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
> > >
> > > It shows pretty clearly that the performance problem is that CFQ is not
> > > issuing enough requests to fill the bandwidth.
> > >
> > > Is this the correct behavior of CFQ or is this a bug?
> > This is the expected behavior from CFQ, even if it is not optimal,
> > since we aren't able to identify multi-spindle disks yet.
>
> In the past we were of the opinion that for sequential workloads, multi-spindle
> disks would not matter much, as the readahead logic (in the OS and possibly in
> the hardware as well) would help. For random workloads we don't idle on the
> single cfqq anyway, so it is fine. But my tests now seem to be telling a
> different story.
>
> I also have an FC link to one of the HP EVA arrays, and I am running an
> increasing number of sequential readers to see if throughput goes up as the
> number of readers goes up. The results below are with noop and cfq. I do
> flush the OS caches across the runs, but I have no control over caching on
> the HP EVA.
>
> Kernel=2.6.34-rc5
> DIR=/mnt/iostestmnt/fio DEV=/dev/mapper/mpathe
> Workload=bsr iosched=cfq Filesz=2G bs=4K
> =========================================================================
> job  Set  NR  ReadBW(KB/s)  MaxClat(us)  WriteBW(KB/s)  MaxClat(us)
> ---  ---  --  ------------  -----------  -------------  -----------
> bsr    1   1        135366        59024              0            0
> bsr    1   2        124256       126808              0            0
> bsr    1   4        132921       341436              0            0
> bsr    1   8        129807       392904              0            0
> bsr    1  16        129988       773991              0            0
>
> Kernel=2.6.34-rc5
> DIR=/mnt/iostestmnt/fio DEV=/dev/mapper/mpathe
> Workload=bsr iosched=noop Filesz=2G bs=4K
> =========================================================================
> job  Set  NR  ReadBW(KB/s)  MaxClat(us)  WriteBW(KB/s)  MaxClat(us)
> ---  ---  --  ------------  -----------  -------------  -----------
> bsr    1   1        126187        95272              0            0
> bsr    1   2        185154        72908              0            0
> bsr    1   4        224622        88037              0            0
> bsr    1   8        285416       115592              0            0
> bsr    1  16        348564       156846              0            0
>

These numbers are very similar to what I got.

> So in the case of NOOP, throughput shot up to 348MB/s, but CFQ remains more
> or less constant at about 130MB/s.
>
> So at least in this case, a single sequential CFQ queue is not keeping the
> disk busy enough.
>
> I am wondering why my testing results were different in the past. Maybe it
> was a different piece of hardware, and the behavior varies across hardware?

Probably. I haven't seen this type of behavior on other hardware.

> Anyway, if that's the case, then we probably need to allow IO from multiple
> sequential readers and keep a watch on throughput. If throughput drops, then
> reduce the number of parallel sequential readers. Not sure how much code that
> is, but with multiple cfqqs going in parallel, the ioprio logic will more or
> less stop working in CFQ (on multi-spindle hardware).

Have you tested on older kernels? Around 2.6.16 it seemed to allow more
parallel reads, but that might have been just accidental (due to I/O
being submitted in a different pattern).

Thanks,
Miklos

From: Corrado Zoccolo
Hi Miklos,
On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi(a)suse.cz> wrote:
> Jens, Corrado,
>
> Here's a graph showing the number of issued but not yet completed
> requests versus time for CFQ and NOOP schedulers running the tiobench
> benchmark with 8 threads:
>
> http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
>
> It shows pretty clearly that the performance problem is that CFQ is not
> issuing enough requests to fill the bandwidth.
>
> Is this the correct behavior of CFQ or is this a bug?
This is the expected behavior from CFQ, even if it is not optimal,
since we aren't able to identify multi-spindle disks yet. Can you
post the result of "grep -r . ." in your /sys/block/*/queue directory,
to see if we can find some parameter that can help identify your
hardware as a multi-spindle disk?
>
> This is on a vanilla 2.6.34-rc4 kernel with two tunables modified:
>
> read_ahead_kb=512
> low_latency=0 (for CFQ)
You should get much better throughput by setting
/sys/block/_your_disk_/queue/iosched/slice_idle to 0, or
/sys/block/_your_disk_/queue/rotational to 0.
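For a quick test, an "echo 0 >" into those two files from a shell is
enough; programmatically it would look something like this (sdb is just
an example device name, substitute your disk):

#include <stdio.h>

static void set_tunable(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return;
        }
        fprintf(f, "%s\n", val);
        fclose(f);
}

int main(void)
{
        /* sdb is only an example; point this at your own device */
        set_tunable("/sys/block/sdb/queue/iosched/slice_idle", "0");
        set_tunable("/sys/block/sdb/queue/rotational", "0");
        return 0;
}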

Thanks,
Corrado
>
> Thanks,
> Miklos
>
>
>