From: Mike Tomlinson on
In article <82cni4F42iU1(a)mid.individual.net>, Arno <me(a)privacy.net>
writes

>That sounds like BS to me. A soft pencil eraser cannot remove silver
>sulfide; it is quite resilient.

It's a technique that has been used on edge connectors for many years.

--
Mike Tomlinson
From: JW on
On Mon, 12 Apr 2010 13:16:00 +0100 Mike Tomlinson <mike(a)none.invalid>
wrote in Message id: <iEygm6AA8wwLFwTM(a)none.invalid>:

>In article <82cni4F42iU1(a)mid.individual.net>, Arno <me(a)privacy.net>
>writes
>
>>That sounds like BS to me. A soft pencil eraser cannot remove silver
>>sulfide; it is quite resilient.
>
>It's a technique that has been used on edge connectors for many years.

Yup, and it works. I learned the technique when servicing Multibus I
systems, and still use it to this day.
From: Jeff Liebermann on
On Sat, 10 Apr 2010 22:33:49 +0000 (UTC), Sergey Kubushyn
<ksi(a)koi8.net> wrote:

>I'm right here in the US and I had 3 of 3 WD 1TB drives fail at the same
>time in RAID1, thus making the entire array dead.

That's the real problem with RAID using identical drives. When one
drive dies, the others are highly likely to follow. I had that
experience in about 2003 with a Compaq something Unix server running
SCSI RAID 1+0 (4 drives). One drive failed, and I replaced it with a
backup drive, which worked. The failure repeated itself a week
later when a 2nd drive failed. When I realized what was happening, I
ran a complete tape backup, replaced ALL the drives, and restored from
the backup. That was just in time, as both remaining drives were
dead when I tested them a few weeks later. I've experienced similar
failures since then, and have always recommended replacing all the
drives if possible (which is impractical for large arrays).



--
Jeff Liebermann jeffl(a)cruzio.com
150 Felker St #D http://www.LearnByDestroying.com
Santa Cruz CA 95060 http://802.11junk.com
Skype: JeffLiebermann AE6KS 831-336-2558
From: Arno on
In comp.sys.ibm.pc.hardware.storage Sergey Kubushyn <ksi(a)koi8.net> wrote:
> In sci.electronics.repair Arno <me(a)privacy.net> wrote:
>> In comp.sys.ibm.pc.hardware.storage Sergey Kubushyn <ksi(a)koi8.net> wrote:
>>> In sci.electronics.repair Franc Zabkar <fzabkar(a)iinternode.on.net> wrote:
>>>> On Thu, 8 Apr 2010 14:03:39 -0700 (PDT), whit3rd <whit3rd(a)gmail.com>
>>>> put finger to keyboard and composed:
>>>>
>>>>>On Apr 8, 12:11?am, Franc Zabkar <fzab...(a)iinternode.on.net> wrote:
>>>>
>>>>>> Is this the fallout from RoHS?
>>>>>
>>>>>Maybe not. There are other known culprits, like the drywall (gypsum
>>>>>board,
>>>>>sheetrock... whatever it's called in your region) that outgasses
>>>>>hydrogen
>>>>>sulphide. Some US construction of a few years ago is so bad with
>>>>>this
>>>>>toxic and corrosive gas emission that demolition of nearly-new
>>>>>construction
>>>>>is called for.
>>>>>
>>>>>Corrosion of nearby copper is one of the symptoms of the nasty
>>>>>product.
>>>>
>>>> It's not just Russia that has this problem. The same issue comes up
>>>> frequently at the HDD Guru forums.
>>
>>> I'm right here in the US and I had 3 of 3 WD 1TB drives fail at the same
>>> time in RAID1, thus making the entire array dead. It is not that you can
>>> simply buff that dark stuff off and you're good to go. The drive itself tries
>>> to recover from failures by rewriting service info (remapping etc.), but the
>>> connection is unreliable and it trashes the entire disk beyond repair. Then
>>> you have that infamous "click of death"... BTW, it is not just WD; others
>>> are also that bad.
>>
>> It is extremely unlikely for a slow chemical process to achieve this
>> level of synchronicity. So unlikely that it would be fair to call
>> it impossible.
>>
>> Your array died from a different cause that would affect all drives
>> simultaneously, such as a power spike.

> Yes, they did not die from contact oxidation at that very same moment. I
> can't even tell whether they all died the same month--that array might've been
> running in degraded mode with one drive dead, then after some time the second
> drive died but it was still running on the one remaining drive. And only when
> the last one crossed the Styx did the entire array go dead.

Ah, I see. I did misunderstand that. It may still be something
else, but the contacts are a possible explanation in that case.

> I don't use Windows, so my machines are never turned off unless there
> is a real need for it. And they are rarely updated once they are
> up and running, so there are no reboots. Typical uptime is more than a
> year.

So your disks worked and then refused to restart? Or are you running
a RAID1 without monitoring?

> I don't know though how I could miss a degradation alert if there was any.

Well, if it is Linux with mdadm, it only sends one email per
degradation event with the default settings.
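
If you do not want to rely on that single mail, a small cron job can
serve as a second alarm. This is just a rough sketch of mine (not part
of mdadm; it assumes Linux software RAID and the usual /proc/mdstat
layout), in Python:

#!/usr/bin/env python3
# Rough sketch: warn about degraded md arrays by scanning /proc/mdstat.
# Meant as a second alarm next to "mdadm --monitor", not a replacement.
import re
import sys

def degraded_arrays(mdstat_path="/proc/mdstat"):
    """Return names of md arrays whose member map (e.g. [U_]) shows a hole."""
    degraded = []
    current = None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            # Status lines look like: "976762496 blocks [2/1] [U_]"
            if current and re.search(r"\[[U_]*_[U_]*\]", line):
                degraded.append(current)
                current = None
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("DEGRADED arrays:", ", ".join(bad))
        sys.exit(1)
    print("all md arrays have all members present")

Hook the output up to whatever alerting you already have; the script
itself only looks for a missing-member marker in /proc/mdstat.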

> All 3 drives in the array simply failed to start after reboot. There were
> some media errors reported before reboot but all drives somehow worked. Then
> the system got rebooted and all 3 drives failed with the same "click of
> death."

> The mechanism here is not that the oxidation itself killed the drives. It never
> happens that way. It was the root cause of the failure, but the drives actually
> committed suicide, the way a body's immune system kills the body when
> overreacting to some kind of hemorrhagic fever or the like.

> The probable sequence is something like this:

> - Drives run for a long time with the majority of the files never
> accessed, so it doesn't matter whether the part of the disk where they
> are stored is bad or not

I run a long SMART selftest on all my drives (RAIDed or not) every
14 days to prevent that. Works well.
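
For reference, roughly how that can be scripted (my own sketch, not a
standard tool; it assumes smartmontools is installed and root
privileges, and discovers drives via "smartctl --scan"). A properly
configured smartd with a "-s" test schedule in smartd.conf does the
same job without any script:

#!/usr/bin/env python3
# Rough sketch: kick off a long (surface-scan) SMART self-test on every
# drive smartctl can see. Run it from cron, e.g. every 14 days.
import subprocess

def scan_devices():
    """List device names reported by 'smartctl --scan' (first word per line)."""
    out = subprocess.run(["smartctl", "--scan"],
                         capture_output=True, text=True, check=True).stdout
    return [line.split()[0] for line in out.splitlines() if line.strip()]

def start_long_test(dev):
    # "smartctl -t long <dev>" queues the extended self-test in the drive.
    subprocess.run(["smartctl", "-t", "long", dev], check=False)

if __name__ == "__main__":
    for dev in scan_devices():
        print("starting long self-test on", dev)
        start_long_test(dev)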

> - When the system is rebooted RAID array assembly is performed

> - While this assembly is being performed, a number of sectors on a
> drive are found to be defective and the drive tries to remap them

> - Such action involves rewriting service information

> - Read/write operations are unreliable because of failing head
> contacts, so the service areas become filled with garbage

> - Once the vital service information is damaged, the drive is
> essentially dead because its controller cannot read the vital data
> needed to even start the disk

> - The only hope for the controller to recover is to repeat the read
> in the hope that it might somehow succeed. This is that infamous
> "click of death" sound, the drive trying to read the info again and
> again. There is no way it can recover because that data is
> trashed.

> - Drives do NOT fail while they run; the failure happens on the next
> reboot. The damage that kills the drives on that reboot
> happened way before that reboot, though.

> That suicide can also happen when some old file that has not been accessed
> for ages is read. That attempt triggers the suicide chain.

Yes, that makes sense. However, you should do surface scans on
RAIDed disks regularly, e.g. with long SMART selftests. This will
catch weak sectors and other degradation early.
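
If you want an early warning in between selftests, something along
these lines can flag drives that have already started remapping. Again
only a rough sketch of mine; the attribute names are the common ATA
ones and treating any non-zero raw value as a warning is my own choice:

#!/usr/bin/env python3
# Rough sketch: report non-zero reallocated/pending/uncorrectable sector
# counts from "smartctl -A" for the drives given on the command line.
import subprocess
import sys

WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable")

def raw_attribute_values(dev):
    """Parse 'smartctl -A <dev>' and return {attribute_name: raw_value}."""
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    values = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; raw value is column 10.
        if len(fields) >= 10 and fields[1] in WATCHED:
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                pass  # some drives report decorated raw values; skip those
    return values

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: check_remap.py /dev/sdX [...]")  # name is just an example
        sys.exit(2)
    status = 0
    for dev in sys.argv[1:]:
        for name, raw in raw_attribute_values(dev).items():
            if raw > 0:
                print(f"{dev}: {name} = {raw}")
                status = 1
    sys.exit(status)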

Arno

--
Arno Wagner, Dr. sc. techn., Dipl. Inform., CISSP -- Email: arno(a)wagner.name
GnuPG: ID: 1E25338F FP: 0C30 5782 9D93 F785 E79C 0296 797F 6B50 1E25 338F
----
Cuddly UI's are the manifestation of wishful thinking. -- Dylan Evans
From: Arno on
In comp.sys.ibm.pc.hardware.storage Jeff Liebermann <jeffl(a)cruzio.com> wrote:
> On Sat, 10 Apr 2010 22:33:49 +0000 (UTC), Sergey Kubushyn
> <ksi(a)koi8.net> wrote:

>>I'm right here in the US and I had 3 of 3 WD 1TB drives fail at the same
>>time in RAID1, thus making the entire array dead.

> That's the real problem with RAID using identical drives. When one
> drive dies, the others are highly likely to follow. I had that
> experience in about 2003 with a Compaq something Unix server running
> SCSI RAID 1+0 (4 drives). One drive failed, and I replaced it with a
> backup drive, which worked. The failure repeated itself a week
> later when a 2nd drive failed. When I realized what was happening, I
> ran a complete tape backup, replaced ALL the drives, and restored from
> the backup. That was just in time, as both remaining drives were
> dead when I tested them a few weeks later. I've experienced similar
> failures since then, and have always recommended replacing all the
> drives if possible (which is impractical for large arrays).

For high reliability requirements it is also a good idea to use
drives of different brands, to get better-distributed times between
failures. Other people have reported the effect you saw.

A second thing that can cause this effect is when the disks are not
regularly surface scanned. I run a long SMART selftest on all disks,
including the RAIDed ones, every 14 days for this. The remaining disks
are under more stress during an array rebuild, especially if they have
weak sectors. This additional load can cause the remaining drives to
fail a lot faster, in the worst case during the rebuild itself.

Arno
--
Arno Wagner, Dr. sc. techn., Dipl. Inform., CISSP -- Email: arno(a)wagner.name
GnuPG: ID: 1E25338F FP: 0C30 5782 9D93 F785 E79C 0296 797F 6B50 1E25 338F
----
Cuddly UI's are the manifestation of wishful thinking. -- Dylan Evans