From: Jonathan Joseph on 1 Nov 2007 12:51

I need help diagnosing a recent problem. I'm not really sure what the cause of this problem is, but the messages I'm seeing seem to pertain to NFS and NIS.

Here's the setup (the names have been changed to protect the innocent). I have a Solaris 8 box running as an NIS server (sun1) and two other Solaris 8 boxes running as NIS clients (sun2, sun3). All of the Sun boxes (though predominantly sun1) export disks using NFS over TCP. I also have a third-party NAS storage device (added a couple of months ago) running NFS over UDP which all of the Suns can see (nasraid).

In the past week, sun2 has hung itself three times (twice in the past two days) and required rebooting (only achievable via stop-a, sync). The other two Suns have had no obvious related problems. When sun2 comes back, everything seems to be working fine.

The messages on sun2 leading up to each event have always been similar. Here's an example from the most recent hang:

Oct 31 20:44:41 sun2 nfs: [ID 333984 kern.notice] NFS server nasraid not responding still trying
Oct 31 20:55:23 sun2 ypbind[14963]: [ID 337329 daemon.error] NIS server not responding for domain "dom"; still trying
Oct 31 20:57:08 sun2 ypbind[14964]: [ID 337329 daemon.error] NIS server not responding for domain "dom"; still trying

[the "NIS server not responding" errors repeat ad infinitum until reboot]

In each hang, the NIS server errors are always preceded by the NFS error pertaining to nasraid. But that could be a red herring.

On sun1, I recently saw the following set of errors, but they do not appear to be temporally related to any observed problems, and I'm not quite sure exactly what they mean. Sun1 continued to operate correctly, and the disks it exports via NFS could still be seen by others.

Oct 29 13:52:27 sun1 mountd[15746]: [ID 367356 daemon.error] svc_tli_create: could not bind to requested address: address mismatch
Oct 29 13:52:27 sun1 mountd[15746]: [ID 436217 daemon.error] svc_create: svc_tli_create failed
Oct 29 13:52:27 sun1 mountd[15746]: [ID 367356 daemon.error] svc_tli_create: could not bind to requested address: address mismatch
Oct 29 13:52:27 sun1 mountd[15746]: [ID 436217 daemon.error] svc_create: svc_tli_create failed
Oct 29 13:52:27 sun1 mountd[15746]: [ID 367356 daemon.error] svc_tli_create: could not bind to requested address: address mismatch
Oct 29 13:52:27 sun1 mountd[15746]: [ID 436217 daemon.error] svc_create: svc_tli_create failed
Oct 29 13:52:27 sun1 mountd[15746]: [ID 882487 daemon.error] unable to create nfsauth service
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 408793 daemon.error] t_bind to wrong address
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 128213 daemon.error] Cannot establish NFS service over /dev/udp: transport setup problem.
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 408793 daemon.error] t_bind to wrong address
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 128213 daemon.error] Cannot establish NFS service over /dev/tcp: transport setup problem.
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 679034 daemon.error] Could not start NFS service for any protocol. Exiting.

In its hung state, you cannot ssh into sun2 or even ping it; however, it's clearly not totally dead. During one of the recent hangs, I happened to have some windows open on sun2 and was able to poke around a bit. Some commands I tried hung my terminal, but I was often able to ctrl-c and get a prompt back. Some processes were still clearly running fine, while others were hung.
I was able to "su root". I tried a ypwhich and, not unexpectedly, got the result "domain not bound." The most unexpected result was that sun2 could still see and navigate around in the disks that were NFS exported by sun1, as well as its local disks, though it could not see nasraid. Other computers could not see the disks that were NFS exported by sun2, but could see nasraid and sun1 just fine. During all of the problems, sun3 had no apparent problems at all with NIS or NFS.

I was thinking that the problem was network related (flaky cable or switch), but if so, I can't understand why sun2 could still see disks exported by sun1 when it could not bind to sun1 as an NIS server or see nasraid. Is it possible that a temporary network interruption could cause the NIS binding and the NFS/UDP connection to go away and not return even when the network was fine again, while the NFS/TCP connections remained fine?

Any help appreciated. Thanks. I have the most recent patch cluster downloaded and ready to install.

-Jonathan
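As a rough checklist (not part of the original post; the hostnames are just the ones used above), if a terminal is still live the next time sun2 wedges, a handful of stock Solaris commands can help separate "NIS and UDP are broken" from "the network is down entirely":

    ypwhich                      # which NIS server, if any, ypbind is currently bound to
    nfsstat -m                   # per-mount options; shows whether each NFS mount uses tcp or udp
    rpcinfo -p sun1              # list RPC services registered with sun1's rpcbind
    rpcinfo -t sun1 nfs          # RPC "ping" of the NFS service over TCP
    rpcinfo -u nasraid nfs       # RPC "ping" of the NFS service over UDP
    ping sun1 ; ping nasraid     # plain ICMP reachability, for comparison

If the TCP checks succeed while the UDP and NIS ones time out, that would point at something more selective than a dead cable or switch port.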
From: James Carlson on 5 Nov 2007 07:52

Jonathan Joseph <jj21(a)cornell.edu> writes:
> I'm not really sure what the cause of this problem is, but the
> messages I'm seeing seem to pertain to NFS and NIS.

I'm not sure those messages aren't just red herrings -- they look like victims of the underlying problem (causing a loss of connectivity) rather than causes of the problem.

> In the past week, sun2 has hung itself three times (twice in the past
> two days) and required rebooting (only achievable via stop-a, sync).

If you did a "sync" you should have a system dump, right? What does the dump show?

--
James Carlson, Solaris Networking              <james.d.carlson(a)sun.com>
Sun Microsystems / 35 Network Drive            71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757       42.496N Fax +1 781 442 1677
From: Jonathan Joseph on 5 Nov 2007 13:51

> If you did a "sync" you should have a system dump, right? What does
> the dump show?

Can you tell me where I can find the "system dump" file? I presume it's not a text file. What utility can I use to get any useful information from it?

Thanks.

-Jonathan
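For reference (assuming a stock Solaris 8 setup, not anything confirmed in the thread): crash-dump handling is configured with dumpadm(1M), and savecore writes the dump into the configured directory, by default /var/crash/`hostname`:

    dumpadm                        # show the dump device and savecore directory
    ls /var/crash/`hostname`       # saved dumps appear as numbered pairs, e.g. unix.0 and vmcore.0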
From: Jonathan Joseph on 5 Nov 2007 16:41

> See /etc/init.d/savecore and /etc/dumpadm.conf for configuration/setup.
> See http://www.sun.com/download/products.xml?id=3fce7df0 for crash dump analysis.

Thanks to the person that emailed me this info. I found the dump files (unix.? and vmcore.?) and I installed the analysis tool, but I'm afraid this may be beyond my level of sys-admining. I ran "blast" for the GUI interface, and was greeted with more buttons and tabs than I've yet seen in a single application. I tried loading the core files and poking around (especially in the Network/IPC tab) but have no idea what I'm looking for or if I'd know it when I found it.

I did an "analysis with more detail" from the "general" tab and got the following thread summary (below), along with lots of other information. Not really sure what I'm looking for. Any helpful hints appreciated.

Thanks
-Jonathan

==== reporting thread summary ====
reference clock = panic_lbolt: 0xd45f38
 37 threads ran since 1 second before current tick (20 user, 17 kernel)
 99 threads ran since 1 minute before current tick (67 user, 32 kernel)
  2 TS_RUN threads (0 user, 2 kernel)
  2 TS_STOPPED threads (0 user, 2 kernel)
 19 TS_FREE threads (0 user, 19 kernel)
  0 !TS_LOAD (swapped) threads
  0 threads trying to get a mutex
  0 threads trying to get an rwlock
231 threads waiting for a condition variable (149 user, 82 kernel)
  1 threads sleeping on a semaphore (0 user, 1 kernel)
 87 threads sleeping on a user-level sobj (87 user, 0 kernel)
 25 threads sleeping on a shuttle (door) (25 user, 0 kernel)
  0 threads in biowait()
  8* threads in nfs:rfscall() (0 user, 8 kernel)
  2 threads in dispatch queues (0 user, 2 kernel)
  1* threads in dispq of cpu running idle thread (0 user, 1 kernel)
  1* interrupt threads running (0 user, 1 kernel)
371 total threads in allthreads list (261 user, 110 kernel)
  8 thread_reapcnt
  0 lwp_reapcnt
378 nthread
From: James Carlson on 7 Nov 2007 10:47

Jonathan Joseph <jj21(a)cornell.edu> writes:
> > See /etc/init.d/savecore and /etc/dumpadm.conf for configuration/setup.
> > See http://www.sun.com/download/products.xml?id=3fce7df0 for crash
> > dump analysis.
>
> Thanks to the person that emailed me this info. I found the dump
> files (unix.? and vmcore.?) and I installed the analysis tool, but I'm
> afraid this may be beyond my level of sys-admining. I ran "blast" for

I've never used any of the GUIs.

> I did an "analysis with more detail" from the "general" tab and got
> the following thread summary (below), along with lots of other
> information.

The first thing to look for would probably be the mdb $c (stack trace) output:

    echo '$c' | mdb unix.? vmcore.?

... but if you're in this deep, and you're using one of the commercial versions of Solaris, you should probably have a service contract and file a bug with Sun's support group.

--
James Carlson, Solaris Networking              <james.d.carlson(a)sun.com>
Sun Microsystems / 35 Network Drive            71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757       42.496N Fax +1 781 442 1677
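A minimal sketch of that suggestion (assuming the first saved dump pair, unix.0 and vmcore.0; the extra dcmds are standard mdb, not part of Carlson's reply):

    echo '$c' | mdb unix.0 vmcore.0          # stack trace of the thread that took the panic/forced dump
    echo '::status' | mdb unix.0 vmcore.0    # panic string and basic status of the dump
    echo '::msgbuf' | mdb unix.0 vmcore.0    # kernel message buffer captured in the dump

For a dump forced by stop-a and sync, the stack mostly shows the sync path itself, so the message buffer and the thread summary above tend to be more informative about what the machine was stuck on.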