From: Jonathan Joseph on
I need help diagnosing a recent problem.

I'm not really sure what the cause of this problem is, but the messages
I'm seeing seem to pertain to NFS and NIS.

Here's the setup (the names have been changed to protect the innocent).
I have a Solaris 8 box running as an NIS server (sun1) and 2 other
Solaris 8 boxes running as NIS clients (sun2, sun3). All of the Sun
boxes (though predominantly sun1) export disks using NFS over tcp. I
also have a 3rd party NAS storage device (added a couple of months ago)
running NFS over udp which all of the suns can see (nasraid).
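
For what it's worth, nfsstat -m on a client should show which transport
each of these mounts is actually using, e.g.:

   # list each NFS mount with its options; the proto= flag shows
   # whether the mount is over tcp or udp
   nfsstat -m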

In the past week, sun2 has hung itself 3 times (twice in the past 2
days) and required rebooting (only achievable via stop-a, sync). The
other two suns have had no obvious related problems. When sun2 comes
back, everything seems to be working fine.

The messages on sun2 leading up to each event have always been similar.
Here's an example from the most recent hang.

Oct 31 20:44:41 sun2 nfs: [ID 333984 kern.notice] NFS server nasraid not
responding still trying
Oct 31 20:55:23 sun2 ypbind[14963]: [ID 337329 daemon.error] NIS server
not responding for domain "dom"; still trying
Oct 31 20:57:08 sun2 ypbind[14964]: [ID 337329 daemon.error] NIS server
not responding for domain "dom"; still trying
[the NIS server not responding errors repeat ad infinitum until reboot]

In each hang, the NIS server errors are always preceded by the NFS
error pertaining to nasraid. But that could be a red herring.
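
Next time it hangs, I figure a few rpcinfo probes from sun2 might show
whether it's just the udp services (nasraid's NFS, ypserv) that stop
answering while NFS over tcp to sun1 still does; something like:

   # does nasraid still answer NFS over udp?
   rpcinfo -u nasraid nfs

   # does the NIS server still answer ypserv requests over udp?
   rpcinfo -u sun1 ypserv

   # does sun1 still answer NFS over tcp?
   rpcinfo -t sun1 nfs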

On sun1, I recently saw the following set of errors, but they do not
appear to be temporally related to any of the observed problems. I'm not
quite sure what they mean; sun1 continued to operate correctly, and the
disks it exports via NFS could still be seen by the other machines.

Oct 29 13:52:27 sun1 mountd[15746]: [ID 367356 daemon.error]
svc_tli_create: could not bind to requested address: address mismatch
Oct 29 13:52:27 sun1 mountd[15746]: [ID 436217 daemon.error] svc_create:
svc_tli_create failed
Oct 29 13:52:27 sun1 mountd[15746]: [ID 367356 daemon.error]
svc_tli_create: could not bind to requested address: address mismatch
Oct 29 13:52:27 sun1 mountd[15746]: [ID 436217 daemon.error] svc_create:
svc_tli_create failed
Oct 29 13:52:27 sun1 mountd[15746]: [ID 367356 daemon.error]
svc_tli_create: could not bind to requested address: address mismatch
Oct 29 13:52:27 sun1 mountd[15746]: [ID 436217 daemon.error] svc_create:
svc_tli_create failed
Oct 29 13:52:27 sun1 mountd[15746]: [ID 882487 daemon.error] unable to
create nfsauth service
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 408793 daemon.error]
t_bind to wrong address
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 128213 daemon.error]
Cannot establish NFS service over /dev/udp: transport setup problem.
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 408793 daemon.error]
t_bind to wrong address
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 128213 daemon.error]
Cannot establish NFS service over /dev/tcp: transport setup problem.
Oct 29 13:52:27 sun1 /usr/lib/nfs/nfsd[15748]: [ID 679034 daemon.error]
Could not start NFS service for any protocol. Exiting.
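
On the face of it they read like nfsd and mountd failing to bind their
listen addresses when they were (re)started, but I don't know why that
would happen. If it does happen again, I suppose I can at least check
what's registered and running on sun1 at the time:

   # which NFS-related services are registered with rpcbind on sun1?
   rpcinfo -p sun1 | egrep 'nfs|mountd'

   # are nfsd and mountd processes actually running?
   ps -ef | egrep 'nfsd|mountd'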

In its hung state, you cannot ssh into sun2 or even ping it; however,
it's clearly not totally dead.

During one of the recent hangs, I happened to have some windows open on
sun2 and was able to poke around a bit. Some commands I tried hung my
terminal - but I was often able to ctrl-c and get a prompt back.

Some processes were still clearly running fine, while others were hung.

I was able to "su root"

I tried a ypwhich, and not unexpectedly got the result "domain not bound."

The most unexpected result was that sun2 could still see and navigate
around in the disks that were NFS exported by sun1 as well as its local
disks, though it could not see nasraid. Other computers could not see
the disks that were NFS exported by sun2, but could see nasraid and sun1
just fine.

During all of the problems, sun3 had no apparent problems at all with
NIS or NFS.

I was thinking that the problem was network related (flaky cable or
switch), but if so, I can't understand why sun2 could still see disks
exported by sun1 when it could not bind to sun1 as an NIS server or see
nasraid. Is it possible that a temporary network interruption could
cause the NIS connection and NFS/udp connection to go away and not
return even when the network was fine again - while the NFS/tcp
connections remained fine?

Any help appreciated. Thanks.

I have the most recent patch cluster downloaded and ready to install.

-Jonathan

From: James Carlson on
Jonathan Joseph <jj21(a)cornell.edu> writes:
> I'm not really sure what the cause of this problem is, but the
> messages I'm seeing seem to pertain to NFS and NIS.

I'm not sure those messages aren't just red herrings -- they look like
victims of the underlying problem (causing a loss of connectivity)
rather than causes of the problem.

> In the past week, sun2 has hung itself 3 times (twice in the past 2
> days) and required rebooting (only acheivable via stop-a, sync). The

If you did a "sync" you should have a system dump, right? What does
the dump show?

--
James Carlson, Solaris Networking <james.d.carlson(a)sun.com>
Sun Microsystems / 35 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677
From: Jonathan Joseph on

>
> If you did a "sync" you should have a system dump, right? What does
> the dump show?
>

Can you tell me where I can find the "system dump" file? I presume it's
not a text file. What utility can I use to get any useful information
from it?

Thanks.

-Jonathan
From: Jonathan Joseph on


> See /etc/init.d/savecore and /etc/dumpadm.conf for configuration/setup.
> See http://www.sun.com/download/products.xml?id=3fce7df0 for crash
> dump analysis.
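
(For the archives: dumpadm with no arguments shows the configured
savecore directory, /var/crash/`uname -n` by default, which is where
the saved dumps end up as unix.N / vmcore.N pairs.)

   # show crash dump configuration, including the savecore directory
   dumpadm

   # list the saved crash dumps
   ls -l /var/crash/`uname -n`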


Thanks to the person who emailed me this info. I found the dump files
(unix.? and vmcore.?) and installed the analysis tool, but I'm afraid
this may be beyond my level of sys-admining. I ran "blast" for the GUI
interface, and was greeted with more buttons and tabs than I've yet seen
in a single application. I tried loading the core files and poking
around (especially in the Network/IPC tab) but have no idea what I'm
looking for or if I'd know it when I found it.

I did an "analysis with more detail" from the "general" tab and got the
following thread summary (below), along with lots of other information.

Not really sure what I'm looking for. Any helpful hints appreciated.

Thanks

-Jonathan

==== reporting thread summary ====



reference clock = panic_lbolt: 0xd45f38
37 threads ran since 1 second before current tick (20 user, 17 kernel)
99 threads ran since 1 minute before current tick (67 user, 32 kernel)

2 TS_RUN threads (0 user, 2 kernel)
2 TS_STOPPED threads (0 user, 2 kernel)
19 TS_FREE threads (0 user, 19 kernel)
0 !TS_LOAD (swapped) threads

0 threads trying to get a mutex
0 threads trying to get an rwlock
231 threads waiting for a condition variable (149 user, 82 kernel)
1 threads sleeping on a semaphore (0 user, 1 kernel)
87 threads sleeping on a user-level sobj (87 user, 0 kernel)
25 threads sleeping on a shuttle (door) (25 user, 0 kernel)

0 threads in biowait()
8* threads in nfs:rfscall() (0 user, 8 kernel)

2 threads in dispatch queues (0 user, 2 kernel)
1* threads in dispq of cpu running idle thread (0 user, 1 kernel)
1* interrupt threads running (0 user, 1 kernel)

371 total threads in allthreads list (261 user, 110 kernel)
8 thread_reapcnt
0 lwp_reapcnt
378 nthread
From: James Carlson on
Jonathan Joseph <jj21(a)cornell.edu> writes:
> > See /etc/init.d/savecore and /etc/dumpadm.conf for configuration/setup.
> > See http://www.sun.com/download/products.xml?id=3fce7df0 for crash
> > dump analysis.
>
>
> Thanks to the person that emailed me this info. I found the dump
> files (unix.? and vmcore.?) and I installed the analysis tool, but I'm
> afraid this may be beyond my level of sys-admining. I ran "blast" for

I've never used any of the GUIs.

> I did an "analysis with more detail" from the "general" tab and got
> the following thread summary (below), along with lots of other
> information.

The first thing to look for would probably be the mdb $c (stack trace)
output.

echo '$c' | mdb unix.? vmcore.?
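
If $c on the panic thread doesn't point at anything obvious, ::status
should at least print the panic string, and dumping a stack for every
thread may show where those nfs:rfscall() threads are parked:

   # basic dump info, including the panic message
   echo '::status' | mdb unix.? vmcore.?

   # stack trace for every thread in the dump
   echo '::walk thread | ::findstack' | mdb unix.? vmcore.?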

.... but if you're in this deep, and you're using one of the commercial
versions of Solaris, you should probably have a service contract and
file a bug with Sun's support group.

--
James Carlson, Solaris Networking <james.d.carlson(a)sun.com>
Sun Microsystems / 35 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677