From: Christian Ehrhardt on
This is related to our discussion from October 09 e.g.
http://lkml.indiana.edu/hypermail/linux/kernel/0910.1/01468.html

I work on s390 where, being a mainframe platform, all our environments
benefit from 512k readahead, but I still expect that some embedded devices won't.
While my idea of making it configurable was not liked in the past, it
may still be useful when introducing this default change, to let
small devices choose without patching the source (a number field defaulting
to 512, with help text explaining the history of that value, would be really nice).

For the discussion of 512 vs. 128 I can add from my measurements that I
have seen the following:
- 512 is by far superior to 128 for sequential reads
- improvements of up to +35% with iozone sequential read, scaling from
1 to 64 parallel processes
- readahead sizes larger than 512 turned out not to be "more useful", but
increased the chance of thrashing on low-memory systems

So I appreciate this change with a little note that I would prefer a
config option.
-> tested & acked-by Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com>

Wu Fengguang wrote:
>
> Use 512kb max readahead size, and 32kb min readahead size.
>
> The former helps io performance for common workloads.
> The latter will be used in the thrashing safe context readahead.
>
> -- Rationale for the 512kb size --
>
> I believe it yields more I/O throughput without noticeably increasing
> I/O latency for today's HDD.
>
> For example, for a 100MB/s and 8ms access time HDD, its random IO or
> highly concurrent sequential IO would in theory be:
>
> io_size(KB)  access(ms)  transfer(ms)  latency(ms)   util%  throughput(KB/s)
>           4           8          0.04         8.04   0.49%            497.57
>           8           8          0.08         8.08   0.97%            990.33
>          16           8          0.16         8.16   1.92%           1961.69
>          32           8          0.31         8.31   3.76%           3849.62
>          64           8          0.62         8.62   7.25%           7420.29
>         128           8          1.25         9.25  13.51%          13837.84
>         256           8          2.50        10.50  23.81%          24380.95
>         512           8          5.00        13.00  38.46%          39384.62
>        1024           8         10.00        18.00  55.56%          56888.89
>        2048           8         20.00        28.00  71.43%          73142.86
>        4096           8         40.00        48.00  83.33%          85333.33
>
> The 128KB => 512KB readahead size boosts IO throughput from ~13MB/s to
> ~39MB/s, while merely increasing the (minimal) IO latency from 9.25ms to 13ms.
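
The table above follows from a simple per-request model: every I/O pays
the fixed access time plus a transfer time proportional to its size. A
minimal Python sketch of that arithmetic is below. The function name and
the unit convention (1 MB = 1024 KB) are assumptions on my side, chosen
because they reproduce the posted figures; nothing here is kernel code.

```python
def hdd_model(io_size_kb, access_ms=8.0, bandwidth_mb_s=100.0):
    """Return (transfer_ms, latency_ms, util, throughput_kb_s) for one I/O.

    Assumes each request costs one full access (seek + rotation) plus a
    transfer at the disk's streaming bandwidth, with 1 MB = 1024 KB.
    """
    transfer_ms = io_size_kb / (bandwidth_mb_s * 1024.0) * 1000.0
    latency_ms = access_ms + transfer_ms
    util = transfer_ms / latency_ms            # fraction of time spent transferring
    throughput_kb_s = io_size_kb / latency_ms * 1000.0
    return transfer_ms, latency_ms, util, throughput_kb_s


if __name__ == "__main__":
    for size_kb in (4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096):
        t, lat, util, tput = hdd_model(size_kb)
        print(f"{size_kb:5d} {t:6.2f} {lat:6.2f} {util * 100:6.2f}% {tput:10.2f}")
```

Changing bandwidth_mb_s or access_ms shows how the picture shifts for
slower or faster devices than the 100MB/s, 8ms disk assumed in the table.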
>
> As for SSD, I find that Intel X25-M SSD desires large readahead size
> even for sequential reads:
>
> rasize     1st run     2nd run
> ----------------------------------
>   4k      123 MB/s    122 MB/s
>  16k      153 MB/s    153 MB/s
>  32k      161 MB/s    162 MB/s
>  64k      167 MB/s    168 MB/s
> 128k      197 MB/s    197 MB/s
> 256k      217 MB/s    217 MB/s
> 512k      238 MB/s    234 MB/s
>   1M      251 MB/s    248 MB/s
>   2M      259 MB/s    257 MB/s
>   4M      269 MB/s    264 MB/s
>   8M      266 MB/s    266 MB/s
>
> The two other impacts of an enlarged readahead size are
>
> - memory footprint (caused by readahead miss)
>     Sequential readahead hit ratio is pretty high regardless of max
>     readahead size; the extra memory footprint is mainly caused by
>     enlarged mmap read-around.
>     I measured my desktop:
>     - under Xwindow:
>         128KB readahead hit ratio = 143MB/230MB = 62%
>         512KB readahead hit ratio = 138MB/248MB = 55%
>           1MB readahead hit ratio = 130MB/253MB = 51%
>     - under console: (seems more stable than the Xwindow data)
>         128KB readahead hit ratio = 30MB/56MB = 53%
>           1MB readahead hit ratio = 30MB/59MB = 51%
>     So the impact to memory footprint looks acceptable.
>
> - readahead thrashing
>     It will now cost 1MB readahead buffer per stream. Memory tight
>     systems typically do not run multiple streams; but if they do
>     so, it should help I/O performance as long as we can avoid
>     thrashing, which can be achieved with the following patches.
>
> -- Benchmarks by Vivek Goyal --
>
> I have two paths to the HP EVA and a multipath device set up (dm-3).
> I ran an increasing number of sequential readers. The file system is ext3
> and the file size is 1G.
> I ran the tests 3 times (3 sets) and took the average.
>
> Workload=bsr      iosched=cfq     Filesz=1G   bs=32K
> ======================================================================
>                 2.6.33-rc5                  2.6.33-rc5-readahead
> job  Set  NR  ReadBW(KB/s)  MaxClat(us)    ReadBW(KB/s)  MaxClat(us)
> ---  ---  --  ------------  -----------    ------------  -----------
> bsr  3    1   141768        130965         190302        97937.3
> bsr  3    2   131979        135402         185636        223286
> bsr  3    4   132351        420733         185986        363658
> bsr  3    8   133152        455434         184352        428478
> bsr  3    16  130316        674499         185646        594311
>
> I ran the same test on a different piece of hardware. There are a few SATA
> disks (5-6) in a striped configuration behind a hardware RAID controller.
>
> Workload=bsr      iosched=cfq     Filesz=1G   bs=32K
> ======================================================================
>                 2.6.33-rc5                  2.6.33-rc5-readahead
> job  Set  NR  ReadBW(KB/s)  MaxClat(us)    ReadBW(KB/s)  MaxClat(us)
> ---  ---  --  ------------  -----------    ------------  -----------
> bsr  3    1   147569        14369.7        160191        22752
> bsr  3    2   124716        243932         149343        184698
> bsr  3    4   123451        327665         147183        430875
> bsr  3    8   122486        455102         144568        484045
> bsr  3    16  117645        1.03957e+06    137485        1.06257e+06
>
> Tested-by: Vivek Goyal <vgoyal(a)redhat.com>
> CC: Jens Axboe <jens.axboe(a)oracle.com>
> CC: Chris Mason <chris.mason(a)oracle.com>
> CC: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> CC: Martin Schwidefsky <schwidefsky(a)de.ibm.com>
> CC: Christian Ehrhardt <ehrhardt(a)linux.vnet.ibm.com>
> Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
> ---
> include/linux/mm.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- linux.orig/include/linux/mm.h 2010-01-30 17:38:49.000000000 +0800
> +++ linux/include/linux/mm.h 2010-01-30 18:09:58.000000000 +0800
> @@ -1184,8 +1184,8 @@ int write_one_page(struct page *page, in
> void task_dirty_inc(struct task_struct *tsk);
>
> /* readahead.c */
> -#define VM_MAX_READAHEAD 128 /* kbytes */
> -#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
> +#define VM_MAX_READAHEAD 512 /* kbytes */
> +#define VM_MIN_READAHEAD 32 /* kbytes (includes current page) */
>
> int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
> 			pgoff_t offset, unsigned long nr_to_read);
>
>

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Wu Fengguang on
Chris,

Firstly inform the linux-embedded maintainers :)

I think it's a good suggestion to add a config option
(CONFIG_READAHEAD_SIZE). Will update the patch..

Thanks,
Fengguang

From: Matt Mackall on
On Mon, 2010-02-08 at 21:46 +0800, Wu Fengguang wrote:
> Chris,
>
> Firstly inform the linux-embedded maintainers :)
>
> I think it's a good suggestion to add a config option
> (CONFIG_READAHEAD_SIZE). Will update the patch..

I don't have a strong opinion here beyond the nagging feeling that we
should be using a per-bdev scaling window scheme rather than something
static.

--
http://selenic.com : development and support for Mercurial and Linux


From: Jamie Lokier on
Matt Mackall wrote:
> On Mon, 2010-02-08 at 21:46 +0800, Wu Fengguang wrote:
> > Chris,
> >
> > Firstly inform the linux-embedded maintainers :)
> >
> > I think it's a good suggestion to add a config option
> > (CONFIG_READAHEAD_SIZE). Will update the patch..
>
> I don't have a strong opinion here beyond the nagging feeling that we
> should be using a per-bdev scaling window scheme rather than something
> static.

I agree with both. 100MB/s isn't typical on little devices, even if a
fast ATA disk is attached. I've got something here where the ATA
interface itself (on a SoC) gets about 10MB/s max when doing nothing
else, or 4MB/s when talking to the network at the same time.
It's not a modern design, but you know, it's junk we try to use :-)

It sounds like a calculation based on throughput and seek time or IOP
rate, and maybe clamped if memory is small, would be good.

Is the window size something that could be meaningfully adjusted
according to live measurements?
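
To make the idea concrete, here is a rough Python sketch of such a
calculation. It is purely hypothetical: the function name, the 40%
target utilisation, the power-of-two rounding and the "at most 1% of
memory" clamp are all assumptions picked for illustration, not anything
from the patchset. It inverts the model util = transfer / (access +
transfer) to find the size that reaches the target, then clamps it.

```python
def suggest_readahead_kb(bandwidth_mb_s, access_ms, mem_mb,
                         target_util=0.4, min_kb=32, max_kb=4096,
                         mem_fraction=0.01):
    """Hypothetical readahead size from throughput, access time and memory."""
    # Transfer time needed so that transfer / (access + transfer) == target_util
    transfer_ms = target_util / (1.0 - target_util) * access_ms
    ideal_kb = transfer_ms / 1000.0 * bandwidth_mb_s * 1024.0

    # Largest power-of-two multiple of min_kb not exceeding ideal_kb or max_kb
    size_kb = min_kb
    while size_kb * 2 <= ideal_kb and size_kb * 2 <= max_kb:
        size_kb *= 2

    # Clamp on small-memory systems (assumed cap: a fraction of total RAM)
    mem_cap_kb = mem_mb * 1024 * mem_fraction
    while size_kb > min_kb and size_kb > mem_cap_kb:
        size_kb //= 2
    return size_kb


if __name__ == "__main__":
    print(suggest_readahead_kb(100, 8.0, 4096))  # desktop-class disk
    print(suggest_readahead_kb(10, 8.0, 64))     # slow SoC ATA interface
    print(suggest_readahead_kb(100, 8.0, 16))    # fast disk, very little RAM
```

With these assumed parameters the 100MB/s, 8ms disk lands on 512KB, while
a 10MB/s interface stays at the 32KB floor. The 8ms access time for the
slow device is a guess on my side, not a measurement.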

-- Jamie



From: Wu Fengguang on
On Fri, Feb 12, 2010 at 07:42:49AM +0800, Jamie Lokier wrote:
> Matt Mackall wrote:
> > On Mon, 2010-02-08 at 21:46 +0800, Wu Fengguang wrote:
> > > Chris,
> > >
> > > Firstly inform the linux-embedded maintainers :)
> > >
> > > I think it's a good suggestion to add a config option
> > > (CONFIG_READAHEAD_SIZE). Will update the patch..
> >
> > I don't have a strong opinion here beyond the nagging feeling that we
> > should be using a per-bdev scaling window scheme rather than something
> > static.

It's good to do dynamic scaling -- in fact this patchset has code to do
- scale down readahead size (per-bdev) for small devices
- scale down readahead size (per-stream) to thrashing threshold

At the same time, I'd prefer
- to _only_ do scale down (below the default size) for low end
- and have a uniform default readahead size for the mainstream

IMHO scaling up automatically
- would be risky
- makes it harder to build one common expectation of Linux behavior
(not only developers, but also admins will run into the question:
"what on earth is the readahead size?")
- and still not likely to please the high end guys ;)

> I agree with both. 100MB/s isn't typical on little devices, even if a
> fast ATA disk is attached. I've got something here where the ATA
> interface itself (on a SoC) gets about 10MB/s max when doing nothing
> else, or 4MB/s when talking to the network at the same time.
> It's not a modern design, but you know, it's junk we try to use :-)

Good to know. I guess the same holds for some USB-capable wireless
routers: they typically don't have hardware powerful enough to exploit
the full 100MB/s disk speed.

> It sounds like a calculation based on throughput and seek time or IOP
> rate, and maybe clamped if memory is small, would be good.
>
> Is the window size something that could be meaningfully adjusted
> according to live measurements?

We currently have live adjustment for
- small devices
- thrashed read streams

We could add new adjustments based on throughput (estimation is the
problem) and memory size.

Note that it does not really hurt to have a big _readahead_ size under
low throughput or small memory conditions, because it's merely the _max_
readahead size: the actual readahead size scales up step-by-step and
scales down if thrashed, and the sequential readahead hit ratio is
pretty high (so no memory/bandwidth is wasted).
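
As a rough illustration of the step-by-step ramp-up, here is a simplified
Python sketch. It assumes the window simply doubles on each sequential
hit until it reaches the max; the real policy differs in its details, and
the 32KB initial size and the function name are assumptions made only for
this example.

```python
def windows_for_read(total_kb, max_kb=512, init_kb=32):
    """List the readahead window sizes used to cover a sequential read.

    Simplified model: start at init_kb and double after every window,
    never exceeding max_kb. Thrashing-driven scale-down is not modelled.
    """
    windows = []
    size_kb = init_kb
    covered_kb = 0
    while covered_kb < total_kb:
        windows.append(size_kb)
        covered_kb += size_kb
        size_kb = min(size_kb * 2, max_kb)
    return windows


if __name__ == "__main__":
    print(windows_for_read(100))    # short read: never reaches the max
    print(windows_for_read(2048))   # long read: ramps up, then stays at max
```

In this model a 100KB read only ever sees windows of 32, 64 and 128KB, so
raising the max from 128KB to 512KB changes nothing for it; only streams
that keep reading sequentially grow into the larger window.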

What may hurt is a big mmap _readaround_ size. The larger the
readaround size, the higher the readaround miss ratio (though still not
disastrous), and hence the more memory pages and bandwidth wasted. It's
not a big problem for mainstream systems, but embedded systems may be
more sensitive.

I would guess most embedded systems put executables on MTD devices
(anyone to confirm this?). And I wonder if MTDs have general
characteristics that are suitable for smaller readahead/readaround
size (the two sizes are bundled for simplicity)?

Thanks,
Fengguang