From: Doug on 24 Feb 2010 13:40

I have a Solaris 10u8 (10/09) Sun X4500 system that became completely unresponsive for about 30 minutes. It did respond to ping. But, my active ssh sessions did not respond nor was I able to get a login prompt from the ILOM /SP/console.

After 30 minutes, the system did begin responding and I immediately noticed the load average reported by prstat was >2000. In a few minutes, it came down to its usual <1 value. "fmadm faulty -a" reports no hardware faults. "svcs -x" reports no problems. Nothing unusual in /var/adm/messages.

Just before it hung, in addition to its usual load, the system was running /usr/bin/sort and was using about 4GB (system has 16 GB RAM + 20 GB disk swap). I know sort puts temp files in /var/tmp which currently reports >40GB of free space. The /var/tmp is in the root filesystem.

Any suggestions in finding what caused the hang? Could an out of control sort fill up /var/tmp and cause the system to hang for 30 minutes and then recover? I have never had a system hang for so long only to recover on its own--I was about to "reset /SYS" to force it to restart.

Thanks for your advice
From: Richard B. Gilbert on 24 Feb 2010 15:11

Doug wrote:
> I have a Solaris 10u8 (10/09) Sun X4500 system that became completely
> unresponsive for about 30 minutes. It did respond to ping. But, my
> active ssh sessions did not respond nor was I able to get a login
> prompt from the ILOM /SP/console.
>
> After 30 minutes, the system did begin responding and I immediately
> noticed the load average reported by prstat was >2000. In a few
> minutes, it came down to its usual <1 value. "fmadm faulty -a"
> reports no hardware faults. "svcs -x" reports no problems. Nothing
> unusual in /var/adm/messages
>
> Just before it hung, in addition to its usual load, the system was
> running /usr/bin/sort and was using about 4GB (system has 16 GB RAM
> + 20 GB disk swap). I know sort puts temp files in /var/tmp which
> currently reports >40GB of free space. The /var/tmp is in the root
> filesystem.
>
> Any suggestions in finding what caused the hang? Could an out of
> control sort fill up /var/tmp and cause the system to hang for 30
> minutes and then recover? I have never had a system hang for so long
> only to recover on its own--I was about to "reset /SYS" to force it to
> restart.
>
> Thanks for your advice

The only thing that occurs to me is that a process looping at high priority would cause the system to ignore just about anything you could do short of dumping power or rebooting.

Please let us know the details when you find the problem!
From: Tim Bradshaw on 24 Feb 2010 19:06

On 2010-02-24 18:40:35 +0000, Doug said:
> Any suggestions in finding what caused the hang? Could an out of
> control sort fill up /var/tmp and cause the system to hang for 30
> minutes and then recover? I have never had a system hang for so long
> only to recover on its own--I was about to "reset /SYS" to force it to
> restart.

Filling /var/tmp should not cause this. Catastrophic memory shortage just might. A load average of 2000 implies there were at least 2000 processes in the run queue however.
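To make the load-average point concrete, here is a rough sketch of why a reading above 2000 implies a run queue that long. It assumes the load average is an exponentially damped average of the run-queue length sampled at a fixed interval; the 5-second interval, the 60-second window, and the sample counts are illustrative assumptions, not values taken from the Solaris kernel or from Doug's system.

```python
import math


def damped_load_average(run_queue_samples, interval_s=5.0, window_s=60.0, start=0.0):
    """Exponentially damped average of run-queue length samples.

    interval_s and window_s are assumed values for illustration only.
    """
    decay = math.exp(-interval_s / window_s)
    load = start
    for queue_length in run_queue_samples:
        # Each sample pulls the average a little way toward the current queue length.
        load = load * decay + queue_length * (1.0 - decay)
    return load


# Hypothetical scenario: 2000 runnable threads pile up for 5 minutes
# (60 samples at 5 s), then the backlog drains and the queue drops to 1.
stalled = [2000] * 60
recovered = [1] * 36  # 3 minutes after the system starts responding

peak = damped_load_average(stalled)
after = damped_load_average(recovered, start=peak)

print(f"load average at end of stall: {peak:.0f}")
print(f"load average 3 minutes later: {after:.0f}")
```

Under these assumptions the average approaches the sustained queue length from below and never exceeds the largest sample, so a reported value above 2000 means the queue itself was at least that long at some point. It also decays quickly once the backlog drains, which is consistent with the load dropping back within a few minutes.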
From: ITguy on 24 Feb 2010 19:35

The X4500 generally runs ZFS on all the SATA disks. Did you have a zpool scrub operation in progress? You should be able to see when the last scrub completed by reviewing the output of "zpool status". It's recommended to scrub the pools regularly, but it does tax the system IO and should be done on off-peak times.
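If you want to check the scrub state from a script rather than by eye, a minimal sketch along these lines could pull the scrub line out of captured "zpool status" output. The sample text, the pool name "tank", and the exact wording of the scrub line are assumptions for illustration; the real output format varies between ZFS releases, so treat the parsing as a starting point only.

```python
def scrub_status(zpool_status_text):
    """Map each pool name to the text of its scrub (or scan) line."""
    result = {}
    pool = None
    for line in zpool_status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            pool = stripped.split(":", 1)[1].strip()
        elif stripped.startswith(("scrub:", "scan:")) and pool is not None:
            # Split only on the first colon; the timestamp contains more.
            result[pool] = stripped.split(":", 1)[1].strip()
    return result


def scrub_in_progress(status_line):
    """True if the scrub line says a scrub is currently running."""
    return "in progress" in status_line


# Hypothetical sample, not output captured from the system in this thread.
SAMPLE = """\
  pool: tank
 state: ONLINE
 scrub: scrub completed after 12h3m with 0 errors on Sun Feb 21 14:03:11 2010
config:
"""

for pool_name, status in scrub_status(SAMPLE).items():
    state = "running" if scrub_in_progress(status) else "not running"
    print(f"{pool_name}: scrub {state} ({status})")
```

On a live system the text could come from running "zpool status" through Python's subprocess module instead of the hard-coded sample.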
From: Doug on 25 Feb 2010 09:47

Thanks for your suggestions so far. I do run zpool scrub periodically, but it was not running when the system hung. It usually takes around 12 hours to scrub around 12TB of disk data on a relatively quiescent system. The load average is between 4 and 5 when it is scrubbing.

I was running "prstat -Z" on the system when it hung. It has 5 zones. The process running the sort was from a non-global zone and the last thing printed by prstat before the hang was that it was using about 4GB of RSS. I am pretty sure it was /usr/bin/sort, which is a 32-bit binary, using that memory. I didn't see any temp files in /var/tmp nor any messages that any filesystem filled up.

When the system did start responding again after 20 minutes, the load average reported by prstat was >2000. It seems that >2000 processes would normally need service if the system hung for 20 minutes. I'm frustrated that there were no messages left behind as to what caused the hang, though.