From: Gregory Seidman on 22 Jul 2010 07:50

I have a RAID1 (using md) running on two USB disks. (I'm working on moving
to eSATA, but it's USB for now.) That means I don't have any insight using
SMART. Meanwhile, I've been getting occasional fail events. Unfortunately,
I don't get any information on which disk is failing.

When the system comes up, it seems to be entirely random which disk comes
up as /dev/sda and which comes up as /dev/sdb. In fact, since my root disk
is on SATA, at least one time it came up as /dev/sda and the USB drives
came up as /dev/sdb and /dev/sdc, though I think that was under a different
kernel version. When I get a failure email, it tells me that it might be
due to /dev/sda1 failing -- except when it tells me that it might be due to
/dev/sdb1 failing. When things are working, mdadm -D /dev/md0 looks like
this:

/dev/md0:
        Version : 00.90
  Creation Time : Wed Feb 22 20:50:29 2006
     Raid Level : raid1
     Array Size : 312496256 (298.02 GiB 320.00 GB)
  Used Dev Size : 312496256 (298.02 GiB 320.00 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Jul 22 07:30:46 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : e4feee4a:6b6be6d2:013f88ab:1b80cac5
         Events : 0.17961786

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8        1        1      active sync   /dev/sda1

When it fails, however, the device names disappear and it just tells me
it's clean, degraded and shows an active disk, a removed disk, and a faulty
spare without any device names.

I even tried doing dd if=/dev/md0 of=/dev/null to see if I could get the
light flickering on one and not the other, but I just get I/O errors. Once
a disk fails, the RAID seems to go into a nasty state where it reads
properly through the crypto loop and LVM I have on top of it, but the
filesystems become read-only and the block devices just give errors. Worse,
the first indication (even before the mdadm email) that something is wrong
is a message to console that an ext3 journal write failed.

What I've been doing (which makes me tremendously uncomfortable since I
know a disk is failing) is to reboot and bring everything back up. This has
been working, but I know it's just a matter of time before the failing disk
becomes a failed disk. I could wait until then, since presumably I'll then
know which is which, but who knows what data corruption is possible between
now and then?

So, um, help?

--Greg

--
To UNSUBSCRIBE, email to debian-user-REQUEST(a)lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster(a)lists.debian.org
Archive: http://lists.debian.org/20100722113802.GB10802(a)anthropohedron.net

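A note on the shifting /dev/sdX names described above: on systems running udev, the symlinks under /dev/disk/by-id usually stay tied to a particular drive or enclosure even when the kernel names swap between boots. The following is a minimal Python sketch, not something posted in this thread, that lists which persistent names currently point at each kernel device. The function name stable_names is made up for illustration, and whether a given USB enclosure gets a distinct by-id entry depends on the hardware.

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each kernel device path to the persistent symlink names pointing at it.

    by_id_dir is a parameter only so the function can be tried against a
    scratch directory; the default is the usual udev location (an assumption
    about the system, not something confirmed in the thread).
    """
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if not os.path.islink(link):
            continue  # only symlinks carry a name -> device relationship
        target = os.path.realpath(link)  # e.g. /dev/sda1
        mapping.setdefault(target, []).append(name)
    return mapping


if __name__ == "__main__":
    if os.path.isdir("/dev/disk/by-id"):
        for device, names in sorted(stable_names().items()):
            print(device, "<-", ", ".join(names))
```

Running it once while the array is healthy and noting which name belongs to which physical box would give a reference to compare against when a failure mail names /dev/sda1 or /dev/sdb1.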
From: Michal on 22 Jul 2010 09:30

On 22/07/10 12:38, Gregory Seidman wrote:
> I have a RAID1 (using md) running on two USB disks. (I'm working on moving
> to eSATA, but it's USB for now.) That means I don't have any insight using
> SMART. Meanwhile, I've been getting occasional fail events. Unfortunately,
> I don't get any information on which disk is failing.
>
> [...]
>
> So, um, help?
>
> --Greg

cat /proc/mdstat can help, but you need to get the serial numbers. Do this:

~# hdparm -i /dev/sda

/dev/sda:

 Model=ST31000340AS , FwRev=SD15 , SerialNo= 9QJ1TRWK
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=0kB, MaxMultSect=16, MultSect=?16?
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953523055
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-4,5,6,7

 * signifies the current active mode

You see it says SerialNo=. Each HDD has its serial number printed on it
somewhere, but it's often hard to read, so get a label machine out and
clearly label each HDD with its serial number. When one dies, do a
cat /proc/mdstat to see which drive has failed. Say /dev/sda has failed:
run the hdparm command above to get the serial number of /dev/sda, open the
case, rip that drive out, stick a new HDD in (making sure you label this
one with its serial number too), boot up, and rebuild.

--
Archive: http://lists.debian.org/4C483F88.70705(a)sharescope.co.uk

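To make the "check /proc/mdstat when one dies" step a little more concrete, here is a rough Python sketch that picks out members flagged (F) and the [UU] / [U_] status string for each array. It is an illustration, not part of the advice above: the function name parse_mdstat is hypothetical, it only handles the common /proc/mdstat layout, and it cannot help if the kernel has already dropped the device name, as the original post describes for mdadm -D.

```python
import re

# A member entry looks like "sda1[1]" and may carry flags such as "(F)".
MEMBER = re.compile(r"(\w+)\[(\d+)\]((?:\([A-Z]\))*)")


def parse_mdstat(text):
    """Return {array: {"members": [...], "status": "UU" / "U_" / None}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if head:
            current = head.group(1)
            members = [
                {"device": dev, "slot": int(slot), "failed": "(F)" in flags}
                for dev, slot, flags in MEMBER.findall(head.group(2))
            ]
            arrays[current] = {"members": members, "status": None}
            continue
        # The line after the header ends in something like "[2/1] [U_]".
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and arrays[current]["status"] is None:
            arrays[current]["status"] = status.group(1)
    return arrays


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as handle:
            report = parse_mdstat(handle.read())
    except OSError:
        report = {}
    for name, info in report.items():
        failed = [m["device"] for m in info["members"] if m["failed"]]
        print(name, info["status"], "failed:", failed or "none")
```

An underscore in the status string marks a slot with no working member, and any device listed as failed is the one whose serial number label is worth checking.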
From: Stan Hoeppner on 22 Jul 2010 09:40

Gregory Seidman put forth on 7/22/2010 6:38 AM:
> I have a RAID1 (using md) running on two USB disks. (I'm working on moving
> to eSATA, but it's USB for now.) That means I don't have any insight using
> SMART. Meanwhile, I've been getting occasional fail events. Unfortunately,
> I don't get any information on which disk is failing.

Are any USB communication errors being logged along with the md and ext3
errors? Are you sure it's a disk drive problem and not an issue with the
kernel drivers, system BIOS, USB controller, cabling, or a combination
thereof?

How long (days, weeks, months, years) did this exact setup function
properly before you started seeing these problems? Did you recently perform
any major software upgrades (kernel/drivers) shortly before this problem
surfaced?

Is this a laptop? If so, which make/model? What's the make/model of the USB
disk drives? What is the age of each piece of hardware we're discussing?

--
Stan

--
Archive: http://lists.debian.org/4C48486C.7030804(a)hardwarefreak.com

From: randall on 22 Jul 2010 09:40

On 07/22/2010 02:54 PM, Michal wrote:
> On 22/07/10 12:38, Gregory Seidman wrote:
>> [...]
>> So, um, help?
>>
>> --Greg
>
> cat /proc/mdstat can help but you need to get the serial numbers. Do
> this;
>
> ~# hdparm -i /dev/sda
> [...]

You could also try smartctl -a /dev/sda to get the disks' serial numbers.

--
Archive: http://lists.debian.org/4C484756.60809(a)songshu.org

From: Gregory Seidman on 22 Jul 2010 10:30

On Thu, Jul 22, 2010 at 03:27:50PM +0200, randall wrote:
> On 07/22/2010 02:54 PM, Michal wrote:
>> On 22/07/10 12:38, Gregory Seidman wrote:
[...]
>>> So, um, help?
>>>
>>> --Greg
>>>
>> cat /proc/mdstat can help but you need to get the serial numbers. Do
>> this;
>>
>> ~# hdparm -i /dev/sda
[...]

# hdparm -i /dev/sda
 HDIO_GET_IDENTITY failed: Invalid argument

/dev/sda:
# hdparm -i /dev/sdb
 HDIO_GET_IDENTITY failed: Invalid argument

/dev/sdb:

> you could also try smartctl -a /dev/sda to get the disks serial numbers

# smartctl -a /dev/sda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ST332062 0A               Version: 3.AA
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

# smartctl -a /dev/sda -T permissive
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ST332062 0A               Version: 3.AA
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
>> Terminate command early due to bad response to IEC mode page

Error Counter logging not supported
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
Device does not support Self Test logging

Neither of these tools seems to be of much use here.

--Greg

--
Archive: http://lists.debian.org/20100722142710.GC10802(a)anthropohedron.net
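When hdparm -i and smartctl cannot get through the USB bridge, as in the output above, sysfs may still expose a serial attribute for the USB device that the disk hangs off. The sketch below is an assumption-laden illustration rather than anything suggested in the thread: the function name usb_serial_for is made up, it assumes /sys/block/<name>/device resolves to a directory somewhere beneath the USB device, and the serial it finds often belongs to the enclosure's bridge chip rather than the drive inside. That can still be enough to tell two enclosures apart, provided they do not report identical serials.

```python
import os


def usb_serial_for(block_name, sys_root="/sys"):
    """Walk up from a block device's sysfs node to the nearest 'serial' file.

    Returns the serial string, or None if no ancestor exposes one.
    sys_root is a parameter only so the walk can be tried on a fake tree.
    """
    device = os.path.realpath(os.path.join(sys_root, "block", block_name, "device"))
    stop = os.path.realpath(sys_root)
    while device.startswith(stop) and device != stop:
        candidate = os.path.join(device, "serial")
        if os.path.isfile(candidate):
            with open(candidate) as handle:
                return handle.read().strip()
        device = os.path.dirname(device)  # move one level towards the bus root
    return None


if __name__ == "__main__":
    for name in ("sda", "sdb"):
        print(name, usb_serial_for(name))
```

If both enclosures report distinct values, writing them on the cases gives the same effect as the labelling advice earlier in the thread, without needing ATA identify data.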