From: Wu Fengguang on 1 Mar 2010 22:20

Dave,

Here is one more test on a big ext4 disk file:

           16k  39.7 MB/s
           32k  54.3 MB/s
           64k  63.6 MB/s
          128k  72.6 MB/s
          256k  71.7 MB/s
rsize ==> 512k  71.7 MB/s
         1024k  72.2 MB/s
         2048k  71.0 MB/s
         4096k  73.0 MB/s
         8192k  74.3 MB/s
        16384k  74.5 MB/s

It shows that >=128k client side readahead is enough for single disk
case :) As for RAID configurations, I guess big server side readahead
should be enough.

#!/bin/sh

file=/mnt/ext4_test/zero
BDI=0:24

for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
do
        echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
        echo readahead_size=${rasize}k
        fadvise $file 0 0 dontneed
        ssh p9 "fadvise $file 0 0 dontneed"
        dd if=$file of=/dev/null bs=4k count=402400
done

Thanks,
Fengguang

On Fri, Feb 26, 2010 at 03:49:16PM +0800, Wu Fengguang wrote:
> On Wed, Feb 24, 2010 at 03:39:40PM +0800, Dave Chinner wrote:
> > On Wed, Feb 24, 2010 at 02:12:47PM +0800, Wu Fengguang wrote:
> > > On Wed, Feb 24, 2010 at 01:22:15PM +0800, Dave Chinner wrote:
> > > > What I'm trying to say is that while I agree with your premise that
> > > > a 7.8MB readahead window is probably far larger than was ever
> > > > intended, I disagree with your methodology and environment for
> > > > selecting a better default value. The default readahead value needs
> > > > to work well in as many situations as possible, not just in perfect
> > > > 1:1 client/server environment.
> > >
> > > Good points. It's imprudent to change a default value based on one
> > > single benchmark. Need to collect more data, which may take time..
> >
> > Agreed - better to spend time now to get it right...
>
> I collected more data with large network latency as well as rsize=32k,
> and updated the readahead size accordingly, to 4*rsize.
>
> ===
> nfs: use 4*rsize readahead size
>
> With default rsize=512k and NFS_MAX_READAHEAD=15, the current NFS
> readahead size 512k*15=7680k is larger than necessary for typical
> clients.
>
> On an e1000e--e1000e connection, I got the following numbers
> (this reads a sparse file from the server and involves no disk IO)
>
> With rsize=512k:
>
> readahead size     normal       1ms+1ms      5ms+5ms    10ms+10ms(*)
>            16k   35.5 MB/s     4.8 MB/s     2.1 MB/s     1.2 MB/s
>            32k   54.3 MB/s     6.7 MB/s     3.6 MB/s     2.3 MB/s
>            64k   64.1 MB/s    12.6 MB/s     6.5 MB/s     4.7 MB/s
>           128k   70.5 MB/s    20.1 MB/s    11.9 MB/s     8.7 MB/s
>           256k   74.6 MB/s    38.6 MB/s    21.3 MB/s    15.0 MB/s
> rsize ==> 512k   77.4 MB/s    59.4 MB/s    39.8 MB/s    25.5 MB/s
>          1024k   85.5 MB/s    77.9 MB/s    65.7 MB/s    43.0 MB/s
>          2048k   86.8 MB/s    81.5 MB/s    84.1 MB/s    59.7 MB/s
>          4096k   87.9 MB/s    77.4 MB/s    56.2 MB/s    59.2 MB/s
>          8192k   89.0 MB/s    81.2 MB/s    78.0 MB/s    41.2 MB/s
>         16384k   87.7 MB/s    85.8 MB/s    62.0 MB/s    56.5 MB/s
>
> With rsize=32k:
>
> readahead size     normal       1ms+1ms      5ms+5ms    10ms+10ms(*)
>            16k   37.2 MB/s     6.4 MB/s     2.1 MB/s     1.2 MB/s
> rsize ==>  32k   56.6 MB/s     6.8 MB/s     3.6 MB/s     2.3 MB/s
>            64k   66.1 MB/s    12.7 MB/s     6.6 MB/s     4.7 MB/s
>           128k   69.3 MB/s    22.0 MB/s    12.2 MB/s     8.9 MB/s
>           256k   69.6 MB/s    41.8 MB/s    20.7 MB/s    14.7 MB/s
>           512k   71.3 MB/s    54.1 MB/s    25.0 MB/s    16.9 MB/s
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>          1024k   71.5 MB/s    48.4 MB/s    26.0 MB/s    16.7 MB/s
>          2048k   71.7 MB/s    53.2 MB/s    25.3 MB/s    17.6 MB/s
>          4096k   71.5 MB/s    50.4 MB/s    25.7 MB/s    17.1 MB/s
>          8192k   71.1 MB/s    52.3 MB/s    26.3 MB/s    16.9 MB/s
>         16384k   70.2 MB/s    56.6 MB/s    27.0 MB/s    16.8 MB/s
>
> (*) 10ms+10ms means to add delay on both client & server sides with
>     # /sbin/tc qdisc change dev eth0 root netem delay 10ms
>     The total >=20ms delay is so large for NFS that a simple `vi some.sh`
>     command takes a dozen seconds. Note that the actual delay reported
>     by ping is larger, e.g. for the 1ms+1ms case:
>         rtt min/avg/max/mdev = 7.361/8.325/9.710/0.837 ms
>
>
> So it seems that readahead_size=4*rsize (i.e. keep 4 RPC requests in
> flight) is able to get near full NFS bandwidth. Reducing the multiple
> from 15 to 4 not only makes the client side readahead size more sane
> (2MB by default), but also reduces the disorder of the server side
> RPC read requests, which yields better server side readahead behavior.
>
> To avoid small readahead when the client mounts with "-o rsize=32k" or
> the server only supports rsize <= 32k, we take the max of 4*rsize and
> default_backing_dev_info.ra_pages. The latter defaults to 512K, and can
> be explicitly changed by the user with the kernel parameter "readahead="
> and the runtime tunable "/sys/devices/virtual/bdi/default/read_ahead_kb"
> (which takes effect for future NFS mounts).
>
> The test script is:
>
> #!/bin/sh
>
> file=/mnt/sparse
> BDI=0:15
>
> for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> do
>         echo 3 > /proc/sys/vm/drop_caches
>         echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
>         echo readahead_size=${rasize}k
>         dd if=$file of=/dev/null bs=4k count=1024000
> done
>
> CC: Dave Chinner <david(a)fromorbit.com>
> CC: Trond Myklebust <Trond.Myklebust(a)netapp.com>
> Signed-off-by: Wu Fengguang <fengguang.wu(a)intel.com>
> ---
>  fs/nfs/client.c   |    4 +++-
>  fs/nfs/internal.h |    8 --------
>  2 files changed, 3 insertions(+), 9 deletions(-)
>
> --- linux.orig/fs/nfs/client.c  2010-02-26 10:10:46.000000000 +0800
> +++ linux/fs/nfs/client.c       2010-02-26 11:07:22.000000000 +0800
> @@ -889,7 +889,9 @@ static void nfs_server_set_fsinfo(struct
>          server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>
>          server->backing_dev_info.name = "nfs";
> -        server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
> +        server->backing_dev_info.ra_pages = max_t(unsigned long,
> +                                                  default_backing_dev_info.ra_pages,
> +                                                  4 * server->rpages);
>          server->backing_dev_info.capabilities |= BDI_CAP_ACCT_UNSTABLE;
>
>          if (server->wsize > max_rpc_payload)
> --- linux.orig/fs/nfs/internal.h        2010-02-26 10:10:46.000000000 +0800
> +++ linux/fs/nfs/internal.h     2010-02-26 11:07:07.000000000 +0800
> @@ -10,14 +10,6 @@
>
>  struct nfs_string;
>
> -/* Maximum number of readahead requests
> - * FIXME: this should really be a sysctl so that users may tune it to suit
> - *        their needs. People that do NFS over a slow network, might for
> - *        instance want to reduce it to something closer to 1 for improved
> - *        interactive response.
> - */
> -#define NFS_MAX_READAHEAD      (RPC_DEF_SLOT_TABLE - 1)
> -
>  /*
>   * Determine if sessions are in use.
>   */
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
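To make the arithmetic in the patch concrete, here is a minimal shell sketch of the new readahead calculation, expressed in KB instead of pages. It assumes a 4 KB page size and the 512 KB default readahead mentioned in the patch description. The function name nfs_ra_kb and its arguments are made up for illustration and are not a kernel or nfs-utils interface.

```sh
#!/bin/sh
# Sketch of the patched nfs_server_set_fsinfo() arithmetic, in KB.
# nfs_ra_kb is a hypothetical helper, not a real interface.
nfs_ra_kb() {
    rsize_kb=$1          # negotiated rsize, in KB
    default_ra_kb=$2     # default readahead, in KB (512 per the patch text)
    page_kb=${3:-4}      # page size in KB; 4 KB is an assumption

    # rpages = rsize rounded up to whole pages
    rpages=$(( (rsize_kb + page_kb - 1) / page_kb ))
    ra_kb=$(( 4 * rpages * page_kb ))

    # take the larger of 4*rsize and the default readahead
    if [ "$ra_kb" -lt "$default_ra_kb" ]; then
        ra_kb=$default_ra_kb
    fi
    echo "$ra_kb"
}

nfs_ra_kb 512 512   # rsize=512k
nfs_ra_kb 32 512    # rsize=32k
```

With rsize=512k this gives 2048 KB, against 15*512k = 7680k before the patch. With rsize=32k, 4*rsize is only 128 KB, so the 512 KB default is used instead.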
From: Trond Myklebust on 2 Mar 2010 09:20

On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
> Dave,
>
> Here is one more test on a big ext4 disk file:
>
>            16k  39.7 MB/s
>            32k  54.3 MB/s
>            64k  63.6 MB/s
>           128k  72.6 MB/s
>           256k  71.7 MB/s
> rsize ==> 512k  71.7 MB/s
>          1024k  72.2 MB/s
>          2048k  71.0 MB/s
>          4096k  73.0 MB/s
>          8192k  74.3 MB/s
>         16384k  74.5 MB/s
>
> It shows that >=128k client side readahead is enough for single disk
> case :) As for RAID configurations, I guess big server side readahead
> should be enough.

There are lots of people who would like to use NFS on their company WAN,
where you typically have high bandwidths (up to 10GigE), but often a
high latency too (due to geographical dispersion).
My ping latency from here to a typical server in NetApp's Bangalore
office is ~ 312ms. I read your test results with 10ms delays, but have
you tested with higher than that?

Cheers
  Trond
From: John Stoffel on 2 Mar 2010 12:40

>>>>> "Trond" == Trond Myklebust <Trond.Myklebust(a)netapp.com> writes:

Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
>> Dave,
>>
>> Here is one more test on a big ext4 disk file:
>>
>>            16k  39.7 MB/s
>>            32k  54.3 MB/s
>>            64k  63.6 MB/s
>>           128k  72.6 MB/s
>>           256k  71.7 MB/s
>> rsize ==> 512k  71.7 MB/s
>>          1024k  72.2 MB/s
>>          2048k  71.0 MB/s
>>          4096k  73.0 MB/s
>>          8192k  74.3 MB/s
>>         16384k  74.5 MB/s
>>
>> It shows that >=128k client side readahead is enough for single disk
>> case :) As for RAID configurations, I guess big server side readahead
>> should be enough.

Trond> There are lots of people who would like to use NFS on their
Trond> company WAN, where you typically have high bandwidths (up to
Trond> 10GigE), but often a high latency too (due to geographical
Trond> dispersion). My ping latency from here to a typical server in
Trond> NetApp's Bangalore office is ~ 312ms. I read your test results
Trond> with 10ms delays, but have you tested with higher than that?

If you have that high a latency, the low level TCP protocol is going
to kill your performance before you get to the NFS level. You really
need to open up the TCP window size at that point. And it only gets
worse as the bandwidth goes up too.

There's no good solution, because while you can get good throughput at
points, latency is going to suffer no matter what.

John
From: Trond Myklebust on 2 Mar 2010 13:50

On Tue, 2010-03-02 at 12:33 -0500, John Stoffel wrote:
> >>>>> "Trond" == Trond Myklebust <Trond.Myklebust(a)netapp.com> writes:
>
> Trond> On Tue, 2010-03-02 at 11:10 +0800, Wu Fengguang wrote:
> >> Dave,
> >>
> >> Here is one more test on a big ext4 disk file:
> >>
> >>            16k  39.7 MB/s
> >>            32k  54.3 MB/s
> >>            64k  63.6 MB/s
> >>           128k  72.6 MB/s
> >>           256k  71.7 MB/s
> >> rsize ==> 512k  71.7 MB/s
> >>          1024k  72.2 MB/s
> >>          2048k  71.0 MB/s
> >>          4096k  73.0 MB/s
> >>          8192k  74.3 MB/s
> >>         16384k  74.5 MB/s
> >>
> >> It shows that >=128k client side readahead is enough for single disk
> >> case :) As for RAID configurations, I guess big server side readahead
> >> should be enough.
>
> Trond> There are lots of people who would like to use NFS on their
> Trond> company WAN, where you typically have high bandwidths (up to
> Trond> 10GigE), but often a high latency too (due to geographical
> Trond> dispersion). My ping latency from here to a typical server in
> Trond> NetApp's Bangalore office is ~ 312ms. I read your test results
> Trond> with 10ms delays, but have you tested with higher than that?
>
> If you have that high a latency, the low level TCP protocol is going
> to kill your performance before you get to the NFS level. You really
> need to open up the TCP window size at that point. And it only gets
> worse as the bandwidth goes up too.

Yes. You need to open the TCP window in addition to reading ahead
aggressively.

> There's no good solution, because while you can get good throughput at
> points, latency is going to suffer no matter what.

It depends upon your workload. Sequential read and write should still be
doable if you have aggressive readahead and open up for lots of parallel
write RPCs.

Cheers
  Trond
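As a rough illustration of why a 312ms path needs both a large TCP window and aggressive readahead, the sketch below computes the bandwidth-delay product: the number of bytes that must be in flight to keep a link busy. It is a back-of-envelope model that ignores protocol overhead and server latency, and the helper names bdp_bytes and rpcs_in_flight are hypothetical.

```sh
#!/bin/sh
# Back-of-envelope bandwidth-delay product (BDP) calculator.
# bdp_bytes and rpcs_in_flight are hypothetical helper names.

# bytes in flight needed to fill a link: bandwidth * round-trip time
bdp_bytes() {
    mbit_per_s=$1    # link bandwidth, Mbit/s
    rtt_ms=$2        # round-trip time, ms
    # 1 Mbit/s * 1 ms = 1000 bits = 125 bytes
    echo $(( mbit_per_s * rtt_ms * 125 ))
}

# number of rsize-sized READ RPCs needed to cover the BDP (rounded up)
rpcs_in_flight() {
    rsize_bytes=$(( $3 * 1024 ))    # $3 is rsize in KB
    bdp=$(bdp_bytes "$1" "$2")
    echo $(( (bdp + rsize_bytes - 1) / rsize_bytes ))
}

bdp_bytes 10000 312            # 10 Gbit/s, 312 ms RTT
rpcs_in_flight 10000 312 512   # same link, rsize=512k
```

For a 10 Gbit/s link at 312 ms this comes to about 390 MB in flight, or roughly 744 outstanding 512k READ RPCs, which is far beyond a readahead multiple of either 4 or 15.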
From: Bret Towe on 2 Mar 2010 15:20

On Mon, Mar 1, 2010 at 7:10 PM, Wu Fengguang <fengguang.wu(a)intel.com> wrote:
> Dave,
>
> Here is one more test on a big ext4 disk file:
>
>            16k  39.7 MB/s
>            32k  54.3 MB/s
>            64k  63.6 MB/s
>           128k  72.6 MB/s
>           256k  71.7 MB/s
> rsize ==> 512k  71.7 MB/s
>          1024k  72.2 MB/s
>          2048k  71.0 MB/s
>          4096k  73.0 MB/s
>          8192k  74.3 MB/s
>         16384k  74.5 MB/s
>
> It shows that >=128k client side readahead is enough for single disk
> case :) As for RAID configurations, I guess big server side readahead
> should be enough.
>
> #!/bin/sh
>
> file=/mnt/ext4_test/zero
> BDI=0:24
>
> for rasize in 16 32 64 128 256 512 1024 2048 4096 8192 16384
> do
>         echo $rasize > /sys/devices/virtual/bdi/$BDI/read_ahead_kb
>         echo readahead_size=${rasize}k
>         fadvise $file 0 0 dontneed
>         ssh p9 "fadvise $file 0 0 dontneed"
>         dd if=$file of=/dev/null bs=4k count=402400
> done

How do you determine which BDI to use? I skimmed through the
filesystem in /sys and didn't see anything that says which is which.
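On the question of which BDI to use: one possible approach, assuming the directory names under /sys/devices/virtual/bdi/ are the major:minor device numbers of the mounted filesystem (which is what the 0:15 and 0:24 values in the test scripts look like), is to read the device number of the mount from /proc/self/mountinfo. The helper name bdi_for_mount is hypothetical, and its optional second argument exists only so it can be tried against a saved copy of the file.

```sh
#!/bin/sh
# Print the major:minor device number for a mount point, read from mountinfo.
# bdi_for_mount is a hypothetical helper, not part of any existing tool.
bdi_for_mount() {
    mnt=$1
    mountinfo=${2:-/proc/self/mountinfo}
    # mountinfo fields: 3 = major:minor, 5 = mount point.
    # If several entries share the mount point, the last one listed is printed.
    awk -v mp="$mnt" '$5 == mp { dev = $3 } END { if (dev != "") print dev }' "$mountinfo"
}
```

A script could then set BDI=$(bdi_for_mount /mnt) and check that /sys/devices/virtual/bdi/$BDI/read_ahead_kb exists before writing to it. Where available, `mountpoint -d /mnt` prints the same major:minor pair.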