From: Anoop
Hey All,

I'm having trouble with NFS write performance.

The NFS server is a 35 TB Fileserver with a 10 GbE Myricom card. The
filesystem is ZFS with ZIL disabled for the time being.

The local writes to disk occur at 595 MBps.
The network speed (with proper TCP tuning and jumbo frames) is 9.8
Gbps (full capacity) on both reception and transmission. This was
benchmarked using nuttcp.

However, when the filesystem is exported over NFS on the 10GbE card,
NFS write performance drops to about 90 MBps, less than 16% of the
local write rate. I'm at a loss on how to tune it.

My benchmarking tool is pretty rudimentary. I'm using "dd" with a 128k
block-size to write files.
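Concretely, the command is something along these lines (the mount
point and count here are just placeholders):

dd if=/dev/zero of=/mnt/nfs/testfile bs=128k count=80000    # roughly 10 GB of zeroes over the NFS mount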

Any pointers on how to help me get more out of the NFS system?

Thanks,
-Anoop
From: Rick Jones
Anoop <anoop.rajendra(a)gmail.com> wrote:
> I'm having trouble with NFS write performance.

> The NFS server is a 35 TB Fileserver with a 10 GbE Myricom card. The
> filesystem is ZFS with ZIL disabled for the time being.

> The local writes to disk occur at 595 MBps. The network speed (with
> proper TCP tuning and jumbo frames) is 9.8 Gbps (full capacity) on
> both reception and transmission. This was benchmarked using nuttcp.

> However, when filesystem is exported over NFS over the 10GbE card,
> the NFS write performance seems to drop to 90 MBps. I'm getting less
> than 16% of peak performance. I'm at a loss on how to tune it.

NFS is actually a request/response protocol. That is, among many other
reasons, why netperf includes request/response tests. Simply taking a
bulk, unidirectional bandwidth measurement like the one above does not
give you enough of a picture, particularly if there is some aggressive
interrupt coalescing/avoidance going on on either side. That is
sometimes done in the name of reducing CPU overhead for bulk transfers,
to try to reach higher bulk transfer rates, but if it is not done
"well" it will trash latency.

netperf -t TCP_RR -H <remote>

is a good way to see whether such aggressive interrupt coalescing is
going on.
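If either end happens to be running Linux (an assumption on my part),
ethtool is one way to look at the NIC's interrupt coalescing settings;
the interface name below is just a placeholder, and which knobs are
adjustable depends on the driver:

ethtool -c eth0                  # show the current coalescing parameters
ethtool -C eth0 rx-usecs 30      # example: shorten the receive coalescing delay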

> My benchmarking tool is pretty rudimentary. I'm using "dd" with a
> 128k block-size to write files.

> Any pointers on how to help me get more out of the NFS system?

You need to know how many writes your client system will have
outstanding at one time (one way to check that on a Linux client is
sketched after the loop below). Then, to see how many you actually
need/want it to have outstanding at one time, I would probably:

download netperf from http://www.netperf.org/
unpack the source on the server
unpack the source on the client
./configure --enable-burst --prefix=<where you want make to stick it>
make install

On the server:

netserver

On the client:

NO_HDR="-P 1"
for b in 0 1 2 3    # etc. etc. - extend with whatever burst values you want to try
do
    netperf -H <server> -t TCP_RR -f m -c -C -l 20 $NO_HDR -- -r 32K,256 -D -b $b
    NO_HDR="-P 0"
done
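As for knowing how many writes the client will keep outstanding: on a
Linux client (an assumption; I don't know what Sun's client does) the
number of concurrent RPC requests is bounded by the sunrpc slot table,
which you can read like so:

sysctl sunrpc.tcp_slot_table_entries
# equivalently:
cat /proc/sys/sunrpc/tcp_slot_table_entries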

I took a guess at the NFS mount's write size (32K, i.e. 32768 bytes)
but didn't include the NFS header; if I had, netperf would have
counted it as goodput. I guessed that the write replies would be ~256
bytes. Adjust as you see fit.
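If the client is Linux (again, an assumption), the write size the
mount actually negotiated can be read off the client rather than
guessed at:

nfsstat -m              # per-mount options, including rsize/wsize
# or
grep nfs /proc/mounts   # wsize= appears in the mount options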

Actually, there should probably also be "test-specific" -s and -S
options (after the "--") to set the socket buffer sizes and thus the
TCP window size. I don't know what Sun's NFS stuff uses; I would start
with a guess of 1M just for grins: "-s 1M -S 1M". Whatever you do, you
don't want the request size (32K) times the value of $b (the number of
additional transactions in flight at one time) to be larger than the
socket buffer. Given the way --enable-burst abuses the netperf TCP_RR
test, that would lead to test deadlock :)
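To make that concrete: with 1M socket buffers and 32K requests, keep
$b at or below about 32 (32K * 32 = 1M). A single invocation along
those lines might look like this (the server name is a placeholder):

netperf -H <server> -t TCP_RR -f m -c -C -l 20 -- -r 32K,256 -D -b 16 -s 1M -S 1M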

happy benchmarking,

rick jones

BTW, -c and -C will cause netperf to report local (netperf side) and
remote (netserver side) CPU utilization. It will also calculate a
"service demand" which is a measure of how much active CPU time was
consumed per unit of work performed. Smaller is better for service
demand. The -l option is telling netperf to run for 20 seconds.

http://www.netperf.org/svn/netperf2/tags/netperf-2.4.5/doc/netperf.html

--
oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...