From: gb345 on
Our lab's computer cluster consists of about 35 nodes (running
Ubuntu), all attached to a single large fileserver. Every couple
of months, an access overload on the fileserver incapacitates the
whole system, and a reboot is required.

To make matters worse, we don't have a way to determine the user(s)
whose process or processes were responsible for hammering the
fileserver. So we can't even tell if we're dealing with malicious
attacks or not.

What tools/utilities exist for keeping an up-to-the-minute log of
each user's IO load on the fileserver?

Thanks in advance,

GB
From: Greg Russell on
"gb345" <gb345(a)invalid.com> wrote in message
news:hnheo3$hoh$1(a)reader1.panix.com...
....
> To make matters worse, we don't have a way to determine the user(s)
> whose process or processes were responsible for hammering the
> fileserver. So we can't even tell if we're dealing with malicious
> attacks or not.
>
> What tools/utilities exist for keeping an up-to-the-minute log of
> each user's IO load on the fileserver?

"Up-to-the-minute" might require an every-minute cron invocation of
"top -H -n1 | head > /tmp/cron.top" or some such thing.


From: J G Miller on
On Sun, 14 Mar 2010 01:39:15 +0000, gb345 wrote:

> To make matters worse, we don't have a way to determine the user(s)
> whose process or processes were responsible for hammering the
> fileserver.

It is impossible to answer your question because you provide no
information on what services this "file" server is running, how
the files are being served (e.g. Samba, NFS), or what the typical
file sizes are.

The most obvious first step would be to check syslog, messages, and
daemon.log in /var/log for events just prior to the crash.
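
For instance, if the last crash was around 03:15 on Mar 14 (the
timestamp is only an example), something like

    grep 'Mar 14 03:1' /var/log/syslog /var/log/messages /var/log/daemon.log | less

should show what the daemons were complaining about just before it
went down.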

You could also try turning on more debug / verbose options on
your network services to see what is happening.
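
For instance, if it turns out to be NFS served by the kernel nfsd,
something like this (very verbose, so turn it off again afterwards):

    rpcdebug -m nfsd -s all    # enable nfsd debug messages (they go to syslog/dmesg)
    rpcdebug -m nfsd -c all    # clear them again when done

For Samba it would be a matter of raising "log level" in smb.conf
instead.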
From: Balwinder S Dheeman on
On 03/14/2010 07:08 PM, J G Miller wrote:
> On Sun, 14 Mar 2010 01:39:15 +0000, gb345 wrote:
>
>> To make matters worse, we don't have a way to determine the user(s)
>> whose process or processes were responsible for hammering the
>> fileserver.
>
> It is impossible to answer your question because you provide no
> information on what services this "file" server is running, how
> the files are being served (e.g. Samba, NFS), or what the typical
> file sizes are.
>
> The most obvious first step would be to check syslog, messages, and
> daemon.log in /var/log for events just prior to the crash.
>
> You could also try turning on more debug / verbose options on
> your network services to see what is happening.

I think running a remote log server and capturing all logs from that
file server would be more helpful; see
http://www.linuxsecurity.com/content/view/117513/171/ for how to do it.
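
For example (a rough sketch using the old-style rsyslog directives;
"loghost" is just a placeholder for whatever machine collects the logs):

    # on the fileserver, e.g. /etc/rsyslog.d/60-remote.conf
    *.*   @@loghost:514          # @@ = TCP, a single @ would be UDP

    # on the log host, in /etc/rsyslog.conf
    $ModLoad imtcp
    $InputTCPServerRun 514

and then restart rsyslog on both machines. That way the last messages
survive even if the fileserver itself has to be rebooted.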

--
Balwinder S "bdheeman" Dheeman Registered Linux User: #229709
Anu'z Linux(a)HOME (Unix Shoppe) Machines: #168573, 170593, 259192
Chandigarh, UT, 160062, India Plan9, T2, Arch/Debian/FreeBSD/XP
Home: http://werc.homelinux.net/ Visit: http://counter.li.org/
From: gb345 on
In <1268573896_41(a)vo.lu> J G Miller <miller(a)yoyo.ORG> writes:

>On Sun, 14 Mar 2010 01:39:15 +0000, gb345 wrote:

>> To make matters worse, we don't have a way to determine the user(s)
>> whose process or processes were responsible for hammering the
>> fileserver.

>It is impossible to answer your question because you provide no
>information on what services this "file" server is running, how
>the files are being served (e.g. Samba, NFS), or what the typical
>file sizes are.

Sorry for the omission. The files are being served via NFS, but
I don't know what the typical size of a served file is. I suppose
I could do a global analysis of all the files on the server to
determine their average size, although this may not be a very
accurate estimate of the average size of a *served* file. In fact,
knowing how to measure this "average served-file size" may give me
some ideas on how to monitor the IO load on the server.
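
For the rough "global" average, something like this over the exported
tree should do (/export is just a placeholder for the real export point):

    find /export -type f -printf '%s\n' | \
        awk '{ n++; s += $1 } END { if (n) printf "%d files, avg %.0f bytes\n", n, s/n }'

For what is actually being *served* rather than merely stored, the
per-operation counters from "nfsstat -s" on the server, sampled
periodically, might be a better starting point.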

>The most obvious first step would be to check syslog, messages,
>and daemon.log in /var/log for events just prior to the crash.

>You could also try turning on more debug / verbose options on
>your network services to see what is happening.

Thanks,

GB