From: philo on 11 Dec 2009 10:09

news.tpi.pl wrote:
> Yes, data is backed up.
>
> But I can't replace the drive (no manufacturer will take 2 HDDs back because of
> some bugs reported by the kernel, when the drive looks 100% healthy and
> there are no errors).
>
> Any other ideas?
>

First off...I am not sure if I understood your first post correctly.
I thought the error was only on *one* of the drives. I may have misread you.
Is the error just on *one* drive...or does the error occur on both drives (but one at a time)?

If the error can occur on either drive...then *maybe* the problem is with the controller.

>
> User "philo" <philo(a)privacy.invalid> wrote in message
> news:hfp82u$ucd$1(a)news.eternal-september.org...
>> Hactar wrote:
>>> In article <hfo93c$3gf$2(a)news.eternal-september.org>,
>>> philo <philo(a)privacy.invalid> wrote:
>>>> news.tpi.pl wrote:
>>>>> I'm getting WDC HDDs dropping out of the RAID. Every 2-3 weeks, seemingly
>>>>> at random, only one drive, at a random partition (MD1, MD2).
>>>>>
>>>>> SMART is CLEAN for both drives. There are no errors for both short and
>>>>> long SMART tests, both drives.
>>>>> BADBLOCKS returns no errors for read / write safe and write destructive
>>>>> modes, both drives.
>>>> I'd go further than that and run the manufacturer's diagnostic on the
>>>> drive in question.
>>>>
>>>> If the diagnostic finds any errors, obviously you will have to replace
>>>> the drive.
>>>>
>>>> OTOH: Even if the manufacturer's diagnostic does not find any errors...
>>>> I'd err on the side of caution and replace the drive.
>>> So, no matter what the manufacturer's diagnostic says, you'd replace the
>>> drive. Why bother running it at all? I replace my drives too when they
>>> start to act up, because I figure that's the beginning of the end.
>>>
>>
>> If the drive is going to be replaced under warranty,
>> the mfg will want the diagnostic error code.
>>
>> But, no matter what, I'd replace it.
>> I have seen drives that passed the mfg's diagnostic
>> but were definitely bad. (rare though)
>>
>>>> Obviously I assume all data are backed up!
>>> As it should be.
>>>
From: philo on 11 Dec 2009 12:22

AZ Nomad wrote:
> On Fri, 11 Dec 2009 13:35:35 +0100, news.tpi.pl <pslawek> wrote:
>> Yes, data is backed up.
>
>> But I can't replace the drive (no manufacturer will take 2 HDDs back because of
>> some bugs reported by the kernel, when the drive looks 100% healthy and
>> there are no errors).
>
>> Any other ideas?
>
> Replace them one at a time. Tell WD that the drive is dead.

Hard drive manufacturers will want the drive to be tested first with their diagnostic utility, and they will want the error code.

OTOH: I once did get a drive RMA'ed that did not give an error code... yet I had carefully documented the exact problem.
From: philo on 15 Dec 2009 07:56

news.tpi.pl wrote:
>> First off...I am not sure if I understood your first post correctly.
>> I thought the error was only on *one* of the drives. I may have misread
>> you. Is the error just on *one* drive...or does the error occur on both
>> drives (but one at a time)?
>
> Random partition and drives, but the error happens more often @ SDA.
>
> Just got some other error; this time the drive wasn't disconnected from the
> array.
>
> Dec 14 17:02:32 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> Dec 14 17:02:32 kernel: ata1.00: cmd 25/00:08:cd:a7:36/00:00:57:00:00/e0 tag 0 dma 4096 in
> Dec 14 17:02:32 kernel: res 40/00:00:09:4f:c2/10:00:57:00:00/00 Emask 0x4 (timeout)
> Dec 14 17:02:32 kernel: ata1.00: status: { DRDY }
> Dec 14 17:02:37 kernel: ata1: link is slow to respond, please be patient (ready=0)
> Dec 14 17:02:42 kernel: ata1: device not ready (errno=-16), forcing hardreset
> Dec 14 17:02:42 kernel: ata1: soft resetting link
> Dec 14 17:02:42 kernel: ata1.00: configured for UDMA/133
> Dec 14 17:02:42 kernel: ata1: EH complete
> Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
> Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] Write Protect is off
> Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> Could be an Intel chipset driver bug, what do you think?
> http://lkml.indiana.edu/hypermail/linux/kernel/0808.3/2716.html
>

It *could* be a bug, but really it's going to need some investigating to narrow down.
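
One low-effort way to start that narrowing down is to tally the libata exception lines per ATA port across the whole log, to check whether the errors really cluster on one port (the poster says they happen more often on sda). What follows is only a minimal sketch in Python, assuming syslog-style lines formatted like the ones quoted in this post; the function name `count_ata_exceptions` and the regular expression are illustrative assumptions, not part of any existing tool.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log quoted in the thread, e.g.
# "Dec 14 17:02:32 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen"
# The optional ".00" device suffix is dropped so counts are grouped per port.
EXCEPTION_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception Emask")


def count_ata_exceptions(lines):
    """Return a Counter mapping an ATA port name (e.g. 'ata1') to the
    number of 'exception Emask' lines seen for that port.

    `lines` is any iterable of log lines (hypothetical helper, for illustration).
    """
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing it an open file object for wherever kernel messages are logged on the machine gives a per-port count. A strong skew toward one port would hint at that drive or its cabling, while a roughly even spread across ports would be more consistent with the controller or driver theory; neither result is proof on its own.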
From: philo on 15 Dec 2009 14:27

news.tpi.pl wrote:
>> it *could* be a bug
>>
>> but really it's going to need some investigating to narrow down
>
> OK, so how can it be done?
>

There is really only one way to know for sure, and that is by experimentation.

Of course, that would involve experimenting with different drivers, perhaps a different kernel, and different hardware.

As long as you are 100% certain all data are backed up, you can afford to experiment.

If it were my own machine, I'd probably try a different controller and not use RAID... but I can't presume to tell you what to do with your own system.