From: Doug on
I have a Solaris 10u8 (10/09) Sun X4500 system that became completely
unresponsive for about 30 minutes. It did respond to ping, but my
active ssh sessions did not respond, nor was I able to get a login
prompt from the ILOM /SP/console.

After 30 minutes, the system did begin responding and I immediately
noticed the load average reported by prstat was >2000. In a few
minutes, it came down to its usual <1 value. "fmadm faulty -a"
reports no hardware faults, and "svcs -x" reports no problems. There
was nothing unusual in /var/adm/messages.

Just before it hung, in addition to its usual load, the system was
running /usr/bin/sort, which was using about 4GB (the system has 16 GB
RAM + 20 GB disk swap). I know sort puts temp files in /var/tmp, which
currently reports >40GB of free space. /var/tmp is in the root
filesystem.

Any suggestions in finding what caused the hang? Could an out of
control sort fill up /var/tmp and cause the system to hang for 30
minutes and then recover? I have never had a system hang for so long
only to recover on its own--I was about to "reset /SYS" to force it to
restart.

Thanks for your advice
From: Richard B. Gilbert on
Doug wrote:
> I have a Solaris 10u8 (10/09) Sun X4500 system that became completely
> unresponsive for about 30 minutes. It did respond to ping. But, my
> active ssh sessions did not respond nor was I able to get a login
> prompt from the ILOM /SP/console.
>
> After 30 minutes, the system did begin responding and I immediately
> noticed the load average reported by prstat was >2000. In a few
> minutes, it came down to its usual <1 value. "fmadm faulty -a"
> reports no hardware faults. "svcs -x" reports no problems. Nothing
> unusual in /var/adm/messages
>
> Just before it hung, in addition to its usual load, the system was
> running /usr/bin/sort and was using about 4GB (system has 16 GB RAM +
> 20 GB disk swap). I know sort puts temp files in /var/tmp which
> currently reports >40GB of free space. The /var/tmp is in the root
> filesystem.
>
> Any suggestions in finding what caused the hang? Could an out of
> control sort fill up /var/tmp and cause the system to hang for 30
> minutes and then recover? I have never had a system hang for so long
> only to recover on its own--I was about to "reset /SYS" to force it to
> restart.
>
> Thanks for your advice

The only thing that occurs to me is that a process looping at high
priority would cause the system to ignore just about anything you could
do short of dumping power or rebooting.

Please let us know the details when you find the problem!
From: Tim Bradshaw on
On 2010-02-24 18:40:35 +0000, Doug said:

> Any suggestions in finding what caused the hang? Could an out of
> control sort fill up /var/tmp and cause the system to hang for 30
> minutes and then recover? I have never had a system hang for so long
> only to recover on its own--I was about to "reset /SYS" to force it to
> restart.

Filling /var/tmp should not cause this. Catastrophic memory shortage
just might.

A load average of 2000, however, implies that at some point there were
at least 2000 processes in the run queue.
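
To illustrate why a reading that high points to a very long run queue, here
is a minimal Python sketch of an exponentially damped average, the classic
Unix-style way a load average is smoothed. It is an assumption that Solaris
behaves closely enough to this model for the purpose; the function name, the
60-second window and the 5-second sampling step are illustrative choices, not
taken from the system. Because the damped value can never exceed the queue
length being averaged, a reported load of 2000 needs a queue of at least
roughly that size.

```python
import math


def damped_load(runnable, seconds, window=60.0, step=5.0, start=0.0):
    """Exponentially damped average of a constant run-queue length.

    runnable: number of runnable threads, held constant (hypothetical input)
    seconds:  how long that queue length persists
    window:   averaging window in seconds (60 = "1-minute" load average)
    step:     sampling interval in seconds (an assumed value)
    """
    decay = math.exp(-step / window)
    load = start
    for _ in range(int(seconds / step)):
        # Each sample pulls the average a little closer to the queue length.
        load = load * decay + runnable * (1.0 - decay)
    return load


if __name__ == "__main__":
    # 2000 runnable threads held for 1, 5 and 20 minutes, starting from idle.
    for minutes in (1, 5, 20):
        print(minutes, "min:", round(damped_load(2000, minutes * 60)))
```

In this model the 1-minute figure only approaches 2000 after the queue has
stayed that long for several minutes, which fits a stall of tens of minutes
better than a brief spike.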

From: ITguy on
The X4500 generally runs ZFS on all the SATA disks. Did you have a
zpool scrub operation in progress? You should be able to see when the
last scrub completed by reviewing the output of "zpool status". It's
recommended to scrub the pools regularly, but it does tax the system
I/O and should be done at off-peak times.
From: Doug on
Thanks for your suggestions so far.

I do run zpool scrub periodically, but it was not running when the
system hung. It usually takes around 12 hours to scrub around 12TB of
disk data on a relatively quiescent system. The load average is
between 4-5 when it is scrubbing.

I was running "prstat -Z" on the system when it hung. It has 5
zones. The process running the sort was from a non-global zone and
the last thing printed by prstat before the hang was that it was using
about 4GB of RSS. I am pretty sure it was /usr/bin/sort, which is a
32-bit binary, using that memory. I didn't see any temp files in
/var/tmp nor any messages that any filesystem filled up.
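
One detail worth quantifying: a 32-bit process can address at most 2^32
bytes, so an RSS of about 4GB puts sort at or near the ceiling of its address
space. The short Python sketch below just does that arithmetic. The 4.0 figure
is the rounded value prstat displayed, so the real headroom is unknown, and
what sort did on reaching the limit is not established anywhere in this
thread.

```python
ADDRESS_BITS = 32                      # /usr/bin/sort is a 32-bit binary
LIMIT_GIB = 2 ** ADDRESS_BITS / 2 ** 30  # addressable bytes, in GiB


def headroom_gib(rss_gib, limit_gib=LIMIT_GIB):
    """Address space left above the given resident set size, in GiB."""
    return limit_gib - rss_gib


if __name__ == "__main__":
    print("32-bit address-space limit:", LIMIT_GIB, "GiB")
    # 4.0 is the approximate RSS reported by prstat before the hang.
    print("headroom at ~4GB RSS:", headroom_gib(4.0), "GiB")
```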

When the system did start responding again after 20 minutes, the load
average reported by prstat was >2000. It seems that >2000 processes
would normally need service if the system hung for 20 minutes.
I'm frustrated that there were no messages left behind as to what
caused the hang, though.
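
As a rough check on that reasoning, the Python sketch below works out how
fast new work would have to arrive for a backlog of 2000 runnable processes
to build up if nothing completed during the stall. It assumes a constant
arrival rate and zero completions, which is a simplification; the function
name is illustrative, and both stall lengths mentioned in this thread (20 and
30 minutes) are tried.

```python
def arrival_rate(backlog, stall_minutes):
    """Average arrivals per second needed to build `backlog` during a stall.

    Assumes nothing is serviced while the system is stalled.
    """
    return backlog / (stall_minutes * 60.0)


if __name__ == "__main__":
    for stall in (20, 30):
        rate = arrival_rate(2000, stall)
        print(f"{stall} min stall: {rate:.2f} new runnable processes per second")
```

Under those assumptions the backlog needs only one to two new processes or
wakeups per second, a rate that cron jobs, daemons and incoming connections
across five zones could plausibly produce.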