From: gb345 on
In <29qv67x4i7.ln2(a)news.homelinux.net> Balwinder S Dheeman <bsd.SANSPAM(a)cto.homelinux.net> writes:

>On 03/14/2010 07:08 PM, J G Miller wrote:
>> On Sun, 14 Mar 2010 01:39:15 +0000, gb345 wrote:
>>
>>> To make matters worse, we don't have a way to determine the user(s)
>>> whose process or processes were responsible for hammering the
>>> fileserver.
>>
>> It is impossible to answer your question because you provide no
>> information on what services this "file" server is running, how
>> the files are being served (via Samba, NFS, etc.), or what the
>> typical file sizes are.
>>
>> The most obvious first step would be to check /var/log/syslog,
>> /var/log/messages and /var/log/daemon.log for events just prior
>> to the crash.
>>
>> You could also try turning on more debug / verbose options on
>> your network services to see what is happening.

>I think, running a remote log server and capturing all log from that
>file server would be more helpful, see
>http://www.linuxsecurity.com/content/view/117513/171/ for how to do it.

Thanks for the link. This could be a useful technique, though I
still need to figure out how best to capture the information that
I want to log...
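For the archive, here's roughly what that forwarding setup looks like
with rsyslog (an assumption on my part -- the article may well use
syslog-ng, and "loghost" is a placeholder for the collector's hostname):

    # On the fileserver's /etc/rsyslog.conf: forward everything
    *.*   @loghost:514        # UDP; use @@loghost:514 for TCP instead

    # On the log host's /etc/rsyslog.conf: accept remote messages
    $ModLoad imudp
    $UDPServerRun 514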

GB
From: gb345 on
In <803djmFlrnU1(a)mid.individual.net> "Greg Russell" <grussell(a)invalid.com> writes:

>"gb345" <gb345(a)invalid.com> wrote in message
>news:hnheo3$hoh$1(a)reader1.panix.com...
>...
>> To make matters worse, we don't have a way to determine the user(s)
>> whose process or processes were responsible for hammering the
>> fileserver. So we can't even tell if we're dealing with malicious
>> attacks or not.
>>
>> What tools/utilities exist for keeping an up-to-the-minute log of
>> each user's IO load on the fileserver?

>"Up-to-the-minute" might require an every-minute cron invocation of
>"top -H -n1 | head > /tmp/cron.top" or some such thing.

Really? Wow. Isn't that kinda like scraping one's own website
in order to debug it? :-)

Joking aside, from your reply it sounds like I'm one of the few
people who has ever needed this level of monitoring...

I suppose I could roll my own "process table analyzer" script and
run it every minute, via cron as you suggest, but it's really
puzzling to me that such tools are not standard-issue in Linux
systems, either as free-standing utilities, or as options in other
low-level system-logging facilities. But maybe these are debug/verbose
options that J G Miller referred to in his reply. I must investigate
further.
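
As a first stab, something along these lines (an untested sketch; the
log path is a placeholder, and note it only sees local processes -- if
the clients hit the server over NFS, the work happens inside the
kernel's nfsd threads rather than in per-user processes):

    #!/bin/sh
    # log-io-load.sh -- crude per-user snapshot, fired from cron each minute:
    #   * * * * * /usr/local/sbin/log-io-load.sh
    LOG=/var/log/io-load.log
    date >> "$LOG"
    # per-user process count and summed %CPU, via GNU ps and awk
    ps -eo user,pcpu --no-headers |
        awk '{ n[$1]++; cpu[$1] += $2 }
             END { for (u in n)
                       printf "%-12s %4d procs %6.1f %%cpu\n", u, n[u], cpu[u] }' >> "$LOG"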

BTW, I'm intrigued by the -H in your top invocation. I don't recall
ever needing it. My top manpage's description:

-H : Threads toggle
Starts top with the last remembered 'H' state reversed.
When this toggle is On, all individual threads will be
displayed. Otherwise, top displays a summation of all
threads in a process.

Why would you want this toggling behavior in this case? I suppose
it could be useful to run *two* of those invocations back-to-back
every time cron fires, to get both types of reports. Is this what
you had in mind?
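
If so, one nit: when top runs without a terminal (as under cron), it
needs -b for batch mode, which the command above omits. Something like:

    # hypothetical crontab entries: per-process and per-thread snapshots
    * * * * * top -b -n1 | head -20 > /tmp/cron.top
    * * * * * top -b -H -n1 | head -20 > /tmp/cron.top.threads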

Thanks,

GB
From: The Natural Philosopher on
gb345 wrote:
> In <803djmFlrnU1(a)mid.individual.net> "Greg Russell" <grussell(a)invalid.com> writes:
>
>> "gb345" <gb345(a)invalid.com> wrote in message
>> news:hnheo3$hoh$1(a)reader1.panix.com...
>> ...
>>> To make matters worse, we don't have a way to determine the user(s)
>>> whose process or processes were responsible for hammering the
>>> fileserver. So we can't even tell if we're dealing with malicious
>>> attacks or not.
>>>
>>> What tools/utilities exist for keeping an up-to-the-minute log of
>>> each user's IO load on the fileserver?
>
>> "Up-to-the-minute" might require an every-minute cron invocation of
>> "top -H -n1 | head > /tmp/cron.top" or some such thing.
>
> Really? Wow. Isn't that kinda like scraping one's own website
> in order to debug it? :-)
>

yes. :-)

> Joking aside, from your reply it sounds like I'm one of the few
> people who has ever needed this level of monitoring...
>
We did, some years back, when I was running a small ISP-type setup.

I can't remember what we ended up with, but it was an ad hoc mixture of
scripts and code.

The basic purpose was to alert us to any problems (actually via email,
plus some bulbs that lit up in the ops room: I wrote the code that drove
a parallel port on the monitoring machine.. a gentle afternoon's work!):
THEN we would go in manually and see what was happening.

It was IIRC a cron script that looked for full disks and high process
load every minute or so.
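
Reconstructed from memory, the guts were no fancier than this (the
thresholds and the mail address are placeholders):

    #!/bin/sh
    # check-health.sh -- run from cron each minute; mails ops on trouble
    ADMIN=ops@example.com
    TMP=/tmp/alerts.$$
    # any filesystem over 90% full?
    df -P | awk 'NR > 1 && $5+0 > 90 { print "disk full:", $6, $5 }' > "$TMP"
    # 1-minute load average above 10?
    awk '$1 > 10 { print "high load:", $1 }' /proc/loadavg >> "$TMP"
    [ -s "$TMP" ] && mail -s "server alert" "$ADMIN" < "$TMP"
    rm -f "$TMP"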


> I suppose I could roll my own "process table analyzer" script and
> run it every minute, via cron as you suggest, but it's really
> puzzling to me that such tools are not standard-issue in Linux
> systems, either as free-standing utilities, or as options in other
> low-level system-logging facilities. But maybe these are debug/verbose
> options that J G Miller referred to in his reply. I must investigate
> further.

It's worth it. Another option is to use SNMP, and interrogate a daemon
set up specifically to take snapshots of useful things, like network
traffic, CPU usage and the like. I think there are some toolkits for
that..
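
For example, with net-snmp's snmpd running on the fileserver and the
UCD MIBs enabled (the community string and hostname are placeholders):

    # 1-, 5- and 15-minute load averages
    snmpget -v2c -c public fileserver \
        UCD-SNMP-MIB::laLoad.1 UCD-SNMP-MIB::laLoad.2 UCD-SNMP-MIB::laLoad.3
    # per-interface byte counters, for spotting traffic spikes
    snmpwalk -v2c -c public fileserver IF-MIB::ifInOctets

MRTG and Cacti are the usual toolkits for polling and graphing those
counters over time.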


> GB
From: Joe Beanfish on
On 03/13/10 20:39, gb345 wrote:
> Our lab's computer cluster consists of about 35 nodes (running
> Ubuntu), all attached to a single large fileserver. Every couple
> of months, an access overload on the fileserver incapacitates the
> whole system, and a reboot is required.
>
> To make matters worse, we don't have a way to determine the user(s)
> whose process or processes were responsible for hammering the
> fileserver. So we can't even tell if we're dealing with malicious
> attacks or not.
>
> What tools/utilities exist for keeping an up-to-the-minute log of
> each user's IO load on the fileserver?

You could use iptables accounting to measure network traffic to/from
specific IPs. Collect and plot the data over time to find trends and
abusers.
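
Roughly like this (run as root; the 10.0.0.x addresses are placeholders
for your cluster nodes):

    # a rule with no -j target does nothing but count matching packets
    iptables -N ACCT
    iptables -I INPUT -j ACCT
    for ip in 10.0.0.11 10.0.0.12; do   # one counting rule per node
        iptables -A ACCT -s "$ip"
    done
    # dump exact packet/byte counters, e.g. from cron, then plot over time
    iptables -L ACCT -v -n -x >> /var/log/net-acct.log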