From: Ciccio on 25 Jan 2010 08:42

Hello List,

I have a Solaris 10 x86 6/06 U2 server (running as a VM on top of ESX 3.0.1) which seems to be running out of memory. I can see this with "top" (the free mem value has been steadily decreasing):

  last pid: 27463;  load avg: 2.11, 2.07, 1.91;  up 10+17:10:49  10:21:12
  47 processes: 45 sleeping, 2 on cpu
  CPU states: 24.8% idle, 21.6% user, 53.7% kernel, 0.0% iowait, 0.0% swap
  Memory: 3840M phys mem, 258M free mem, 4103M total swap, 4103M free swap

     PID USERNAME LWP PRI NICE  SIZE   RES STATE  TIME    CPU COMMAND
   12043 root      12   0    0 3708K 1592K sleep  7:51 10.28% syslogd
    2009 pg10326    1   0    0 6536K 3856K sleep  0:19  0.45% sshd
   27463 root       1   0    0 1392K  916K cpu/0  0:00  0.07% cut
       1 root       1   0    0 2332K  944K sleep  4:12  0.04% init
   27211 bb         1   0    0 2196K  924K sleep  0:00  0.04% bbd
   21268 root       1   0    0 3124K 1788K cpu/1  0:00  0.03% top
   27243 bb         1   0    0 2092K 1144K sleep  0:00  0.01% bbrun
   11428 root       1   0    0   15M 4948K sleep  0:11  0.01% httpd
     144 root      22  59    0 5508K 2796K sleep  2:40  0.00% nscd
       7 root      14  59    0   12M   10M sleep  0:32  0.00% svc.startd
    9050 pg10326    1   0    0 6384K 3808K sleep  0:00  0.00% sshd
     317 root       1   0    0 1452K  748K sleep  0:09  0.00% utmpd
   11433 nobody     1   0    0   25M   13M sleep  5:20  0.00% httpd

and also with "mdb" (the MB value for Kernel memory keeps increasing):

  [krypton]/export/home/pg10326$ echo "::memstat" | mdb -k
  Page Summary                Pages                MB  %Tot
  ------------     ----------------  ----------------  ----
  Kernel                     880912              3441   90%
  Anon                        20738                81    2%
  Exec and libs                5035                19    1%
  Page cache                  12374                48    1%
  Free (cachelist)            55496               216    6%
  Free (freelist)              6323                24    1%
  Total                      980878              3831
  [krypton]/export/home/pg10326$

This is a pretty idle box; all that is running on it is Apache 2.0.59 (serving Subversion) and a Big Brother client. Stopping either or both of these does not fix, stop, or mitigate the issue.

What I have found is an insanely high number of TCP connections in the TIME_WAIT state (over 42k!), and I strongly suspect this is our culprit. On all our other Unix/Linux servers that number is well below 50. For reference, the tcp_time_wait_interval kernel parameter is set to 60000, which is the default value in Solaris:

  [krypton]/export/home/pg10326$ ndd -get /dev/tcp tcp_time_wait_interval
  60000
  [krypton]/export/home/pg10326$

I am not sure how to fix this. I have found a "tcpdrop" utility that clears these stale connections, but as it is a brute-force drop, I suspect it doesn't free the allocated memory.

Any idea how to get this sorted?

Ciccio
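In case it helps, this is roughly how I am counting the connections (just a sketch; it greps the TCP section of netstat, so it assumes no other output lines contain the string TIME_WAIT):

  [krypton]/export/home/pg10326$ netstat -an -P tcp | grep -c TIME_WAIT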
From: nelson on 25 Jan 2010 18:12

that's an awful lot of time_wait - maybe if you can find out what's causing that, your problem will disappear...

as an aside, are you using ZFS pools? i can't remember the percentages, but it tends to like using memory in its cache.

personally, i'd be more inclined to work out the number of closing connections; something certainly seems amiss there.
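if zfs is in the picture, something like this should show how much the arc is holding (just a sketch - it assumes the arcstats kstat exists, which it should on s10 with zfs; the value is in bytes):

  $ kstat -p zfs:0:arcstats:size

"echo ::arc | mdb -k" gives a more detailed breakdown if you need one.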
From: vinayag on 26 Jan 2010 02:33

On Jan 26, 4:12 am, nelson <nelson.bens...(a)gmail.com> wrote:
> that's an awful lot of time_wait - maybe if you can find out what's
> causing that, your problem will disappear...
>
> as an aside, are you using ZFS pools? i can't remember the
> percentages, but it tends to like using memory in its cache.
>
> personally, i'd be more inclined to work out the number of closing
> connections; something certainly seems amiss there.

I agree that the kernel is taking too much memory, given the steps you have already tried. Could you get the prstat output below and find which process is taking too much memory?

  prstat -s size -n 100
  prstat -s rss -n 100

FYI, there are a lot of known bugs in the S10 U2 release. You may need to update to the latest kernel patch level, and if you still need to investigate after that, contact Sun support.

ravinayag
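Note that if ::memstat attributes the pages to the kernel, prstat may not show the growth at all. A kernel allocator breakdown might point at the consumer instead (run as root):

  # echo "::kmastat" | mdb -k

Watch the "memory in use" column for caches that keep growing between runs. With that many TIME_WAIT entries I would guess at the tcp-related caches, but that is only a guess until you see the output.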
From: Paul Floyd on 26 Jan 2010 15:38

On Mon, 25 Jan 2010 15:12:13 -0800 (PST), nelson <nelson.bensley(a)gmail.com> wrote:
> that's an awful lot of time_wait - maybe if you can find out what's
> causing that, your problem will disappear...

Not really, I think that the value is in milliseconds. According to Stevens, the TIME_WAIT duration is 2MSL, where the MSL (maximum segment lifetime) is typically 30, 60 or 120s, so this is at the bottom of that range. I suppose that as networks have gotten faster (or latencies shorter), having a lower MSL is safer.

A bientot

Paul
--
Paul Floyd                   http://paulf.free.fr
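If you did want to shrink the interval, ndd can change it on the fly; 30000 below is only an illustrative value (one MSL of 30s), and the setting does not survive a reboot, so it would also need to go into an rc script to persist:

  # ndd -set /dev/tcp tcp_time_wait_interval 30000
  # ndd -get /dev/tcp tcp_time_wait_interval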
From: nelson on 26 Jan 2010 16:01
> Not really, I think that the value is in milliseconds. According to
> Stevens, the TIME_WAIT duration is 2MSL, where the MSL (maximum
> segment lifetime) is typically 30, 60 or 120s, so this is at the
> bottom of that range. I suppose that as networks have gotten faster
> (or latencies shorter), having a lower MSL is safer.

i meant the number of connections in time_wait, not the time. it takes time for pkts to clear the wire...
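grouping the time_wait entries by remote endpoint usually shows where they're coming from - just a sketch, assuming the solaris netstat layout (remote address in column 2, state in the last column):

  $ netstat -an -P tcp | awk '$NF == "TIME_WAIT" {print $2}' | sort | uniq -c | sort -rn | head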