From: Ciccio on
Hello List,

I have a Solaris 10_x86 6/06 u2 server (running as a VM on top of ESX
3.0.1), which seems to be running out of memory.


I can see this with "top" (the free mem value has been steadily
decreasing).

last pid: 27463;  load avg: 2.11, 2.07, 1.91;  up 10+17:10:49   10:21:12
47 processes: 45 sleeping, 2 on cpu
CPU states: 24.8% idle, 21.6% user, 53.7% kernel, 0.0% iowait, 0.0% swap
Memory: 3840M phys mem, 258M free mem, 4103M total swap, 4103M free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 12043 root      12   0    0 3708K 1592K sleep    7:51 10.28% syslogd
  2009 pg10326    1   0    0 6536K 3856K sleep    0:19  0.45% sshd
 27463 root       1   0    0 1392K  916K cpu/0    0:00  0.07% cut
     1 root       1   0    0 2332K  944K sleep    4:12  0.04% init
 27211 bb         1   0    0 2196K  924K sleep    0:00  0.04% bbd
 21268 root       1   0    0 3124K 1788K cpu/1    0:00  0.03% top
 27243 bb         1   0    0 2092K 1144K sleep    0:00  0.01% bbrun
 11428 root       1   0    0   15M 4948K sleep    0:11  0.01% httpd
   144 root      22  59    0 5508K 2796K sleep    2:40  0.00% nscd
     7 root      14  59    0   12M   10M sleep    0:32  0.00% svc.startd
  9050 pg10326    1   0    0 6384K 3808K sleep    0:00  0.00% sshd
   317 root       1   0    0 1452K  748K sleep    0:09  0.00% utmpd
 11433 nobody     1   0    0   25M   13M sleep    5:20  0.00% httpd


and also with "mdb" (the MB value for the Kernel memory keeps
increasing).


[krypton]/export/home/pg10326$ echo "::memstat" | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     880912              3441   90%
Anon                        20738                81    2%
Exec and libs                5035                19    1%
Page cache                  12374                48    1%
Free (cachelist)            55496               216    6%
Free (freelist)              6323                24    1%

Total                      980878              3831
[krypton]/export/home/pg10326$
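
To narrow down where inside the kernel that memory is going, I assume
the per-cache breakdown from the "::kmastat" dcmd is the next step
(whichever kmem cache keeps growing should stand out if this is a leak):

[krypton]/export/home/pg10326$ echo "::kmastat" | mdb -k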

This is a pretty idle box; all that is running here is Apache 2.0.59
serving Subversion, plus a Big Brother client.
(Stopping either or both of these two doesn't fix, stop or mitigate
the issue.)

What I have found is an insanely high number of TCP connections in
TIME_WAIT state (over 42k!), and I strongly suspect this is our
culprit.
On all our other Unix/Linux servers that number is well below 50.
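
For the record, I'm counting them with something along these lines:

netstat -an -P tcp | grep -c TIME_WAIT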

For reference, the tcp_time_wait_interval kernel parameter is set to
60000 (milliseconds), which is the default value in Solaris.

[krypton]/export/home/pg10326$ ndd -get /dev/tcp tcp_time_wait_interval
60000
[krypton]/export/home/pg10326$
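
I know the interval itself can be shortened at runtime; something along
the lines of the following would halve it to 30 s, but that only shrinks
the symptom and wouldn't explain 42k connections:

ndd -set /dev/tcp tcp_time_wait_interval 30000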

Not sure how to fix this issue. I have found a "tcpdrop" utility that
clears up these stale connections, but as it is a brute-force drop, I
suspect it doesn't free up the allocated kernel memory.


Any idea how to get this sorted?



Ciccio

From: nelson on
that's an awful lot of time_wait - maybe if you can find out what's
causing that your problem will disappear...

as an aside, are you using ZFS pools? i can't remember the
percentages but it tends to like using memory for its cache.
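
if you are, something like this should show how big the zfs cache (the
arc) has grown - i believe ::arc is available in the s10 mdb:

echo "::arc" | mdb -k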

personally, i'd be more inclined to work out the number of closing
connections; something certainly seems amiss there.
From: vinayag on
On Jan 26, 4:12 am, nelson <nelson.bens...(a)gmail.com> wrote:
> that's an awful lot of time_wait - maybe if you can find out what's
> causing that your problem will disappear...
>
> as an aside, are you using ZFS pools?  i can't remember the
> percentages but it tends to like using memory for its cache.
>
> personally, i'd be more inclined to work out the number of closing
> connections; something certainly seems amiss there.

I certainly agree that the kernel is taking too much memory, given the
steps you have already tried.
Could you get prstat output as below (sorted by virtual size and by
RSS respectively) and find which process is taking too much memory?

prstat -s size -n 100
prstat -s rss -n 100

FYI, there are a lot of bugs identified in the S10 U2 release. You may
need to update to the latest kernel patch; if you still need to
investigate after that, then contact Sun support.

ravinayag

From: Paul Floyd on
On Mon, 25 Jan 2010 15:12:13 -0800 (PST), nelson
<nelson.bensley(a)gmail.com> wrote:
> that's an awful lot of time_wait - maybe if you can find out what's
> causing that your problem will disappear...

Not really, I think that the value is in milliseconds. According to
Stevens, the TIME_WAIT timeout is 2*MSL, where the MSL is typically
30 s, 1 min or 2 min; 60000 ms is 60 s, i.e. 2 x 30 s, which is the
bottom of that range. I suppose that as networks have gotten faster
(or latencies shorter), having a lower MSL has become safer.

A bientot
Paul
--
Paul Floyd http://paulf.free.fr
From: nelson on
> Not really, I think that the value is in milliseconds. According to
> Stevens, the TIME_WAIT timeout is 2*MSL, where the MSL is typically
> 30 s, 1 min or 2 min; 60000 ms is 60 s, i.e. 2 x 30 s, which is the
> bottom of that range. I suppose that as networks have gotten faster
> (or latencies shorter), having a lower MSL has become safer.

i meant the number of connections in time_wait, not the time. takes
time for pkts to clear the wire...
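
if it helps, the tcp mib counters give you the churn - i'd dump them
twice a minute apart and diff the open counts (i think they're called
tcpActiveOpens / tcpPassiveOpens on solaris):

netstat -s -P tcp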