WDC HDD RAID Failure / Intel SATA controller (random, every 2-3 weeks) [Linux Hardware]


From: philo on 11 Dec 2009 10:09

news.tpi.pl wrote:
> Yes, data is backed up.
>
> But I can't replace the drive (no manufacturer will take 2 HDDs back because of
> some bugs reported by the kernel, when the drive looks 100% healthy and
> there are no errors).
>
> Any other ideas?
>
First off...I am not sure if I understood your first post correctly.
I thought the error was only on *one* of the drives. I may have mis-read
you. Is the error just on *one* drive...or does the error occur on both
drives (but one at a time)?

If the error can occur on either drive...then *maybe* the problem is
with the controller.

>
> User "philo" <philo(a)privacy.invalid> wrote in message
> news:hfp82u$ucd$1(a)news.eternal-september.org...
>> Hactar wrote:
>>> In article <hfo93c$3gf$2(a)news.eternal-september.org>,
>>> philo <philo(a)privacy.invalid> wrote:
>>>> news.tpi.pl wrote:
>>>>> I'm getting WDC HDDs dropping out of the RAID every 2-3 weeks, seemingly
>>>>> at random: only one drive at a time, on a random partition (MD1, MD2).
>>>>>
>>>>> SMART is clean for both drives. Neither the short nor the long SMART
>>>>> test reports errors on either drive.
>>>>> badblocks returns no errors in read, safe (non-destructive) write, and
>>>>> destructive write modes on either drive.
>>>> I'd go further than that and run the manufacturer's diagnostic on the
>>>> drive in question.
>>>>
>>>> If the diagnostic finds any errors, obviously you will have to replace
>>>> the drive.
>>>>
>>>> OTOH: Even if the manufacturer's diagnostic does not find any errors...
>>>> I'd err on the side of caution and replace the drive.
>>> So, no matter what the manufacturer's diagnostic says, you'd replace the
>>> drive. Why bother running it at all? I replace my drives too when they
>>> start to act up, because I figure that's the beginning of the end.
>>>
>>
>> If the drive is going to be replaced under warranty,
>> the mfg will want the diagnostic error code.
>>
>> But, no matter what I'd replace it.
>> I have seen drives that passed the mfg's diagnostic
>> but were definitely bad. (rare though)
>>
>>>> Obviously I assume all data are backed up!
>>> As it should be.
>>>
>
>

From: philo on 11 Dec 2009 12:22

AZ Nomad wrote:
> On Fri, 11 Dec 2009 13:35:35 +0100, news.tpi.pl <pslawek> wrote:
>> Yes, data is backed up.
>
>> But I can't replace the drive (no manufacturer will take 2 HDDs back because of
>> some bugs reported by the kernel, when the drive looks 100% healthy and
>> there are no errors).
>
>> Any other ideas?
>
> Replace them one at a time. Tell WD that the drive is dead.

Hard drive manufacturers will want the drive to be tested first with
their diagnostic utility and they will want the error code.

OTOH: I once did get a drive RMA'ed that did not give an error code...
yet I had carefully documented the exact problem.

From: philo on 15 Dec 2009 07:56

news.tpi.pl wrote:
>> First off...I am not sure if I understood your first post correctly.
>> I thought the error was only on *one* of the drives. I may have mis-read
>> you. Is the error just on *one* drive...or does the error occur on both
>> drives (but one at a time)?
>
> Random partitions and drives, but the error happens more often on sda.
>
> I just got another error; this time the drive wasn't disconnected from the
> array.
>
> Dec 14 17:02:32 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> Dec 14 17:02:32 kernel: ata1.00: cmd 25/00:08:cd:a7:36/00:00:57:00:00/e0 tag 0 dma 4096 in
> Dec 14 17:02:32 kernel: res 40/00:00:09:4f:c2/10:00:57:00:00/00 Emask 0x4 (timeout)
> Dec 14 17:02:32 kernel: ata1.00: status: { DRDY }
> Dec 14 17:02:37 kernel: ata1: link is slow to respond, please be patient (ready=0)
> Dec 14 17:02:42 kernel: ata1: device not ready (errno=-16), forcing hardreset
> Dec 14 17:02:42 kernel: ata1: soft resetting link
> Dec 14 17:02:42 kernel: ata1.00: configured for UDMA/133
> Dec 14 17:02:42 kernel: ata1: EH complete
> Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
> Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] Write Protect is off
> Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> Dec 14 17:02:42 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> Could it be an Intel chipset driver bug? What do you think?
> http://lkml.indiana.edu/hypermail/linux/kernel/0808.3/2716.html
>
>

It *could* be a bug,

but really it's going to need some investigating to narrow it down.
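
A first step in narrowing it down could be to tally the libata exceptions per ATA port, to see whether they cluster on one port (ata1 is the port behind sda in the log above) or are spread across both. The following is only a minimal sketch: the function name and the regular expression are illustrative assumptions, and the pattern is derived from the message format quoted in this thread, so other kernels may word these lines differently.

```python
import re
from collections import Counter

# Pattern taken from the log quoted in this thread, e.g.
#   "Dec 14 17:02:32 kernel: ata1.00: exception Emask 0x0 SAct 0x0 ..."
# Assumption: other kernel versions may format this message differently.
EXCEPTION_RE = re.compile(r"kernel: (ata\d+)\.\d+: exception Emask")


def count_ata_exceptions(lines):
    """Return a Counter mapping ATA port name (e.g. 'ata1') to exception count."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the log posted above.
sample = [
    "Dec 14 17:02:32 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Dec 14 17:02:32 kernel: ata1.00: status: { DRDY }",
    "Dec 14 17:02:42 kernel: ata1: EH complete",
]
print(count_ata_exceptions(sample))
```

Fed the full system log instead of the three sample lines, a count that is heavily skewed toward one port points at that drive, cable, or port; a roughly even spread makes the controller or driver the more likely suspect.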

From: philo on 15 Dec 2009 14:27

news.tpi.pl wrote:
>> it *could* be a bug
>>
>> but really it's going to need some investigating to narrow down
>
>
> OK, so how can it be done?
>
>

There is really only one way to know for sure

and that is by experimentation.

Of course, that would involve experimenting with different drivers,
perhaps a different kernel, and different hardware.

As long as you are 100% certain all data are backed up
you can afford to experiment.

If it were my own machine I'd probably try a different controller
and not use RAID...
but I can't presume to tell you what to do with your own system.
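
While experimenting, it may help to record which md array is degraded after each change of driver, kernel, or hardware. The sketch below parses text in the format of /proc/mdstat and reports arrays with fewer active members than expected. The function name and the sample text are hypothetical (the sample is not output from the poster's machine), and the parsing assumes the usual "[2/1] [_U]" status layout.

```python
import re

# Array header line, e.g. "md1 : active raid1 sdb1[1] sda1[0](F)"
HEADER_RE = re.compile(r"^(md\d+) : ")
# Member summary, e.g. "[2/1] [_U]" (2 members expected, 1 active)
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array name: member map} for arrays that are missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = status.group(3)
            current = None
    return degraded


# Hypothetical sample, made up for illustration only.
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0](F)
      104320 blocks [2/1] [_U]

md2 : active raid1 sdb2[1] sda2[0]
      732467520 blocks [2/2] [UU]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a live system the text would come from reading /proc/mdstat; logging the result with a timestamp after each experiment gives the kind of careful documentation that also helps with an RMA request.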
