From: Udo Grabowski on 18 Nov 2008 10:35

Hello,

we get 'Device busy (error 16)' errors when hammering a powerful server
(X4500) with about 40 clients reading the same (unchanging) file at the
same time, at a frequency of about 5 in 1 million reads (too high for our
applications). In our case this happens on a Fortran inquire for file
existence. The server and the clients run Solaris SXDE 1/08 (B79b), the
filesystem is ZFS, and the clients mount with

  vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,
  rsize=1048576,wsize=1048576,retrans=5,timeo=3000,
  acregmin=1,acregmax=1,acdirmin=1,acdirmax=1

We have not seen these problems with NFSv3. Both the server and the
clients have an /etc/default/nfs file with these entries changed:

  # Maximum number of concurrent NFS requests.
  # Equivalent to last numeric argument on nfsd command line.
  NFSD_SERVERS=256

  # Set connection queue length for lockd over a connection-oriented
  # transport. Default and minimum value is 32.
  LOCKD_LISTEN_BACKLOG=256

  # Maximum number of concurrent lockd requests.
  # Default is 20.
  LOCKD_SERVERS=256

  # Determines if the NFS version 4 delegation feature will be enabled
  # for the server. If it is enabled, the server will attempt to
  # provide delegations to the NFS version 4 client. The default is on.
  NFS_SERVER_DELEGATION=off

We need v4; otherwise we would have a hard time constructing large
hierarchical autofs maps (without exceeding the 4096-character limit!)
because of the nested ZFS filesystems.
nfsstat -s shows nothing unusual on the server:

  Server rpc:
  Connection oriented:
  calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
  101095267  0          0          0          0          2514000    1

but the clients show a hell of badcalls, badxids, timeouts, interrupts,
and cltoomany:

  nfsstat -c
  Client rpc:
  calls      badcalls   badxids    timeouts   newcreds   badverfs   timers
  10619334   10009      982        9492       0          0          0
  cantconn   nomem      interrupts
  0          0          79

  Client nfs:
  calls      badcalls   clgets     cltoomany
  10609528   530        10609732   49

Any ideas what goes wrong or what should be tuned elsewhere?
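[Editor's note: a quick way to put a number on how sick the client side is
is to compute the badcall rate from the counters quoted above. This is a
hedged sketch: the numbers are pasted in literally, and the awk field
positions assume the header/value layout shown; on a live client you would
pipe `nfsstat -c` through a similar filter.]

```shell
# Compute the client RPC badcall rate from the counters quoted above.
# Values are hard-coded from the post; field positions are an assumption.
printf '%s\n' \
  'calls    badcalls badxids  timeouts' \
  '10619334 10009    982      9492' |
awk 'NR == 2 { printf "badcall rate: %.4f%%\n", 100 * $2 / $1 }'
```

That works out to roughly one bad call per thousand, which matches the
"5 in 1 million" failure rate being noticeable at this request volume.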
From: edcrosbys on 18 Nov 2008 11:19

> we get 'Device busy (error 16)' errors when hammering a powerful server
> (X4500) with about 40 clients reading the same file (not changing) at
> the same time,

Are you getting these errors on the client or the server? Do you have the
NFS filesystem mounted, or are you letting automount take care of it?
Approximately how big is the file?
From: Udo Grabowski on 18 Nov 2008 11:42

edcrosbys wrote:
> Are you getting these errors on the client or server?
> Do you have the NFS fs mounted, or are you letting automount take care
> of it?
> Approx. how big is the file?

The errors are on the client. The filesystem is automounted; it is a ZFS
sub-filesystem (first level) of a master ZFS filesystem (NFSv4 in this
build traverses and mounts these automatically, so no extra autofs entries
are needed). It probably never gets unmounted, since it is in use all the
time during the tests. There is nothing in /var/adm/messages, neither on
the server nor on the client.

The file is 250 kB. The error occurs (when it fails) not on read but
always on the existence inquiry (which is probably an fstat; I don't know
exactly how Sun Studio implements inquire(file,exist=...)), but certainly
while the other clients are reading the same file.
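[Editor's note: a minimal reproduction sketch of the failing access
pattern. The Fortran inquire(file=...,exist=...) boils down to a metadata
lookup, so a tight loop of shell existence tests approximates the load.
FILE and the iteration count are assumptions: point FILE at a file on the
NFSv4 automount to exercise the actual failure path.]

```shell
# Hypothetical stress sketch: repeat an existence check on one file and
# count failures, mimicking the Fortran inquire loop. On a healthy local
# filesystem this should report zero failures.
FILE=${FILE:-/etc/hosts}   # assumption: substitute a file on the NFS mount
N=${N:-1000}
i=0 fails=0
while [ "$i" -lt "$N" ]; do
  [ -e "$FILE" ] || fails=$((fails + 1))
  i=$((i + 1))
done
echo "existence check failures: $fails / $N"
```

Running several instances of this in parallel against the automounted path
should reproduce the single-client failure mode described below far faster
than spreading the load over 40 machines.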
From: Udo Grabowski on 19 Nov 2008 05:43

Udo Grabowski wrote:
> Hello,
>
> we get 'Device busy (error 16)' errors when hammering a powerful
> server (X4500) with about 40 clients reading the same file (not
> changing) at the same time, with a frequency of about 5 in 1 million
> reads (this is too large for our applications). This happens in our
> case with a Fortran inquire on file existence. The server and the
> clients run Solaris SXDE 1/08 (B79b), the filesystem is ZFS, the
> clients mount with
>
>   vers=4,proto=tcp,sec=sys,hard,intr,link,symlink,acl,mirrormount,
>   rsize=1048576,wsize=1048576,retrans=5,timeo=3000,
>   acregmin=1,acregmax=1,acdirmin=1,acdirmax=1
>
> ...
> Any ideas what goes wrong or what should be tuned elsewhere?

I dug a bit deeper: we can produce this error with a single client (U20
M2 dual Opteron) alone by starting a few (5 to 10) instances of the
program. ALL instances but one fail after a short time with 'device
busy', even much faster than when calling from different clients.
Setting back to NFSv3 makes everything work again.

Since v3 uses only 32 kB read/write windows, I checked NFSv4 again with
rsize and wsize set to 32768, and the problem went away (I have not
checked the many-client scenario yet). Setting them to 131072 also gives
quick failures. So something still depends on the old 32 kB Solaris NFS
limit. We would like to use the larger windows, since they put much less
stress on the server and the network (1 Gb/s switched fiber), but where
to tune? We already set ncsize=1048576 for better DNLC cache hits (now
at 94%), but there seems to be a harder limit somewhere.

So far all I can recommend is not to use NFSv4 with the default
parameters for real-world production.
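[Editor's note: a sketch of the workaround described above, with
hypothetical server and path names. It pins the NFSv4 read/write windows
back to 32 kB on an explicit mount; for an automounted filesystem the same
options would go into the autofs map entry instead.]

```shell
# Workaround sketch (server:/export/data and /mnt/data are made-up names):
# remount with 32 kB windows, which made the EBUSY failures disappear in
# the single-client test. Solaris mount syntax:
umount /mnt/data
mount -F nfs -o vers=4,proto=tcp,rsize=32768,wsize=32768 \
    server:/export/data /mnt/data
```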
From: Udo Grabowski on 19 Nov 2008 08:03

Udo Grabowski wrote:
> Hello,
>
> we get 'Device busy (error 16)' errors when hammering a powerful
> server (X4500) with about 40 clients reading the same file (not
> changing) at the same time, with a frequency of about 5 in 1 million
> reads (this is too large for our applications).
> ...
> Any ideas what goes wrong or what should be tuned elsewhere?

(I hate always answering myself...)

It looks like we hit a bug here. I mixed up test cases somehow, so I
didn't notice a second difference in my last reply: the v4 test with
32 kB windows took place on a plain filesystem, not on an automatic
submount. Here's the clue: regardless of the window sizes, the failure
occurs always, and only, when loading an NFSv4 AUTOMATIC submount on the
same client with more than one accessing program. It does not happen when
the submount is mounted explicitly via a hierarchical autofs table.
Although nfsstat -m shows it mounted with the same parameters, something
seems to be different internally. Maybe an issue with the callback
daemon.

So we are back at the problem of how to construct a large hierarchical
autofs table without hitting the 4096-character table-size limit...
Solaris is something for the devotees.
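[Editor's note: the hierarchical autofs maps in question list every nested
filesystem as an offset entry under one map key, which is why deep ZFS
hierarchies blow past the 4096-character line limit. A hedged sketch of
how such a line could be generated; the dataset and server names are made
up, and a real script would feed it from `zfs list -H -o name`.]

```shell
# Hypothetical sketch: turn a list of nested ZFS filesystems into one
# hierarchical autofs map line, one offset entry per nested fs.
printf '%s\n' export/data export/data/run1 export/data/run2 |
awk -v srv=nfssrv '
{
  off = $0
  sub("^export/data", "", off)   # offset relative to the map key
  if (off == "") off = "/"
  printf " %s %s:/%s", off, srv, $0
}
END { print "" }'
```

Each additional nested filesystem adds another " /offset server:/path"
pair, so the generated line grows linearly with the number of datasets,
which is exactly what runs into the limit.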