Which disk is failing? [Debian]

From: Gregory Seidman on 22 Jul 2010 07:50

I have a RAID1 (using md) running on two USB disks. (I'm working on moving
to eSATA, but it's USB for now.) That means I don't have any insight using
SMART. Meanwhile, I've been getting occasional fail events. Unfortunately,
I don't get any information on which disk is failing.

When the system comes up, it seems to be entirely random which disk comes
up as /dev/sda and which comes up as /dev/sdb. In fact, since my root disk
is on SATA, at least one time it came up as /dev/sda and the USB drives
came up as /dev/sdb and /dev/sdc, though I think that was under a different
kernel version. When I get a failure email, it tells me that it might be
due to /dev/sda1 failing -- except when it tells me that it might be due to
/dev/sdb1 failing. When things are working, mdadm -D /dev/md0 looks like
this:

/dev/md0:
Version : 00.90
Creation Time : Wed Feb 22 20:50:29 2006
Raid Level : raid1
Array Size : 312496256 (298.02 GiB 320.00 GB)
Used Dev Size : 312496256 (298.02 GiB 320.00 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Thu Jul 22 07:30:46 2010
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

UUID : e4feee4a:6b6be6d2:013f88ab:1b80cac5
Events : 0.17961786

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8        1        1      active sync   /dev/sda1

When it fails, however, the device names disappear and it just tells me
it's clean, degraded and shows an active disk, a removed disk, and a faulty
spare without any device names.
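
[Editorial sketch, not part of the original post.] The Major and Minor columns in the table above are the kernel's device numbers. For the first sixteen sd disks (block major 8), each disk owns 16 consecutive minors, so the numbers can be turned back into a name even if the name column goes blank. Whether the failed-state output still shows these numbers is an assumption, and the result is only the kernel name for the current boot, not a physical disk. The function name sd_name is made up for this illustration.

```python
import string


def sd_name(major: int, minor: int) -> str:
    """Return the conventional sdX[N] name for a device number on block major 8.

    Only major 8 is handled (sda through sdp); other majors raise ValueError.
    """
    if major != 8:
        raise ValueError("only block major 8 is handled in this sketch")
    if not 0 <= minor <= 255:
        raise ValueError("minor must be between 0 and 255")
    # Each disk gets 16 minors: 0 is the whole disk, 1-15 are partitions.
    disk_index, partition = divmod(minor, 16)
    name = "sd" + string.ascii_lowercase[disk_index]
    return name + (str(partition) if partition else "")


# The two rows from the mdadm -D table above.
print(sd_name(8, 17))  # sdb1
print(sd_name(8, 1))   # sda1
```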

I even tried doing dd if=/dev/md0 of=/dev/null to see if I could get the
light flickering on one and not the other, but I just get I/O errors. Once
a disk fails, the RAID seems to go into a nasty state where it reads
properly through the crypto loop and LVM I have on top of it, but the
filesystems become read-only and the block devices just give errors. Worse,
the first indication (even before the mdadm email) that something is wrong
is a message to console that an ext3 journal write failed.

What I've been doing (which makes me tremendously uncomfortable since I
know a disk is failing) is to reboot and bring everything back up. This has
been working, but I know it's just a matter of time before the failing disk
becomes a failed disk. I could wait until then, since presumably I'll then
know which is which, but who knows what data corruption is possible between
now and then?

So, um, help?

--Greg

--
To UNSUBSCRIBE, email to debian-user-REQUEST(a)lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster(a)lists.debian.org
Archive: http://lists.debian.org/20100722113802.GB10802(a)anthropohedron.net

From: Michal on 22 Jul 2010 09:30

On 22/07/10 12:38, Gregory Seidman wrote:
> [...]
> So, um, help?
>
> --Greg
>
>
>
cat /proc/mdstat can help, but you need to get the serial numbers. Do this:

~# hdparm -i /dev/sda

/dev/sda:

Model=ST31000340AS, FwRev=SD15, SerialNo=9QJ1TRWK
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=unknown, BuffSize=0kB, MaxMultSect=16, MultSect=?16?
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953523055
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=no WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-4,5,6,7

* signifies the current active mode

You can see it says SerialNo=. Each HDD has its serial number printed on
it somewhere, but it's often hard to read, so get a label machine out
and clearly label each HDD with its serial number. When one dies, do a
cat /proc/mdstat to see which drive has failed. Say /dev/sda has
failed: run that command to get the serial number of /dev/sda, open the
case, rip it out, stick a new HDD in (making sure you label this one with
its serial number), boot up, rebuild, etc.
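
[Editorial sketch, not part of the original message.] The "cat /proc/mdstat to see which drive has failed" step can be scripted. In /proc/mdstat, a member that md has marked faulty carries an (F) flag after its slot number. The sample text below is illustrative only (it is not output from the poster's machine), and the function name failed_members is made up for this example.

```python
import re

# Illustrative /proc/mdstat content; NOT taken from the poster's system.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[1](F)
      312496256 blocks [2/1] [U_]

unused devices: <none>
"""


def failed_members(mdstat_text):
    """Return {array_name: [member names flagged (F)]} from mdstat text."""
    failed = {}
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:\s*(.*)$", line)
        if not header:
            continue  # not an array line
        array, rest = header.groups()
        # Members look like name[slot] optionally followed by flags such as (F).
        members = re.findall(r"(\S+?)\[\d+\]((?:\([A-Z]\))*)", rest)
        failed[array] = [dev for dev, flags in members if "(F)" in flags]
    return failed


print(failed_members(SAMPLE))  # {'md0': ['sda1']}
```

On a live system the same function could be fed open("/proc/mdstat").read() instead of SAMPLE.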

--
Archive: http://lists.debian.org/4C483F88.70705(a)sharescope.co.uk

From: Stan Hoeppner on 22 Jul 2010 09:40

Gregory Seidman put forth on 7/22/2010 6:38 AM:
> I have a RAID1 (using md) running on two USB disks. (I'm working on moving
> to eSATA, but it's USB for now.) That means I don't have any insight using
> SMART. Meanwhile, I've been getting occasional fail events. Unfortunately,
> I don't get any information on which disk is failing.

Are any USB communication errors being logged along with the md and ext3 errors?

Are you sure it's a disk drive problem and not an issue with the kernel
drivers, system BIOS, USB controller, cabling, or a combination thereof?

How long (days, weeks, months, years) did this exact setup function properly
before you started seeing these problems?

Did you recently perform any major software upgrades (kernel/drivers) shortly
before this problem surfaced?

Is this a laptop? If so which make/model?

What's the make/model of the USB disk drives?

What is the age of each piece of hardware we're discussing?

--
Stan

--
Archive: http://lists.debian.org/4C48486C.7030804(a)hardwarefreak.com

From: randall on 22 Jul 2010 09:40

On 07/22/2010 02:54 PM, Michal wrote:
> On 22/07/10 12:38, Gregory Seidman wrote:
>> [...]
>> So, um, help?
>>
>> --Greg
>>
>>
> cat /proc/mdstat can help but you need to get the serial numbers. Do
> this;
> [...]
>
You could also try smartctl -a /dev/sda to get the disks' serial numbers.

--
Archive: http://lists.debian.org/4C484756.60809(a)songshu.org

From: Gregory Seidman on 22 Jul 2010 10:30

On Thu, Jul 22, 2010 at 03:27:50PM +0200, randall wrote:
> On 07/22/2010 02:54 PM, Michal wrote:
>> On 22/07/10 12:38, Gregory Seidman wrote:
[...]
>>> So, um, help?
>>>
>>> --Greg
>>>
>> cat /proc/mdstat can help but you need to get the serial numbers. Do
>> this;
>>
>> ~# hdparm -i /dev/sda
[...]

# hdparm -i /dev/sda
HDIO_GET_IDENTITY failed: Invalid argument

/dev/sda:

# hdparm -i /dev/sdb
HDIO_GET_IDENTITY failed: Invalid argument

/dev/sdb:

> you could also try smartctl -a /dev/sda to get the disks serial numbers

# smartctl -a /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ST332062 0A Version: 3.AA
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

# smartctl -a /dev/sda -T permissive
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ST332062 0A Version: 3.AA
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
>> Terminate command early due to bad response to IEC mode page

Error Counter logging not supported
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
Device does not support Self Test logging

Neither of these tools seems to be of much use here.
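
[Editorial sketch, not part of the original message and not something suggested in the thread.] When hdparm and smartctl cannot get through a USB bridge, one other place to look is the persistent symlinks that udev keeps under /dev/disk/by-id. The sketch below assumes udev has populated that directory, and it groups the link names by the kernel device they point at. What a given USB enclosure reports there may not match the label on the drive itself, so treat the result as a hint. The function name by_id_map is made up for this example.

```python
import os


def by_id_map(by_id_dir="/dev/disk/by-id"):
    """Map kernel device names (e.g. 'sda1') to the by-id link names pointing at them.

    Returns an empty dict if the directory does not exist.
    """
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping
    for link in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, link)
        if not os.path.islink(path):
            continue  # only symlinks are of interest here
        # Resolve the link and keep just the final component, e.g. 'sdb1'.
        target = os.path.basename(os.path.realpath(path))
        mapping.setdefault(target, []).append(link)
    return mapping


if __name__ == "__main__":
    for dev, links in sorted(by_id_map().items()):
        print(dev, "->", ", ".join(links))
```

Running it once while the array is healthy and writing the result on the enclosures gives something to compare against after the sda/sdb names shuffle on the next boot.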

--Greg

--
Archive: http://lists.debian.org/20100722142710.GC10802(a)anthropohedron.net
