From: Alan Chandler on 19 Jun 2010 14:30

I have a server with a pair of raided (RAID1) disks, using partitions 1, 2 and 4 as /boot, root and an LVM volume respectively. The two disks are /dev/sda and /dev/sdb. They have just replaced two smaller disks where the root partition was NOT a raid device - it was just /dev/sda2 - although there was a raided boot partition in the first partition. The hardware only supports 2 SATA channels.

I wanted to revert the root partition to the same state as the one I had just taken out, so I failed and removed sdb:

mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
mdadm /dev/md2 --fail /dev/sdb4 --remove /dev/sdb4

for each of the partitions, and shut the machine down. I unplugged /dev/sdb, plugged the old disk in its place and booted up Knoppix. I asked Knoppix to recreate the md devices with

mdadm --assemble --scan

and it found 4 raid devices: the three on sda and the one from the old sda, now on sdb. So I mounted /dev/md1 and /dev/sdb2 and reverted the root partition. I shut the machine down again. I then removed the old disk and plugged back in the new /dev/sdb that I had failed and removed in the first step.

HOWEVER (the punch line): when this system booted, it was not the old reverted one but how it was before I started this cycle. In other words it looked as though the disk which I had failed and removed was being used.

If I did

mdadm --detail /dev/md1

(or any of the other devices) it showed /dev/sdb as the only device in the raid pair. To sync up again I am having to add the various /dev/sda partitions back in.

SO THE QUESTION IS: what went wrong? How does a failed device end up being used to build the operational arrays, while the other devices end up not being included?

--
Alan Chandler
http://www.chandlerfamily.org.uk
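One way to see which member md would trust at assembly time is to compare the superblock metadata written on each partition before plugging drives back in. A minimal sketch, using the device names from the post above (the grep pattern assumes the usual field names in mdadm --examine output):

  # Dump the md superblock of each member of the root array and compare
  # the array UUID, event counter and device state.
  mdadm --examine /dev/sda2 /dev/sdb2 | grep -E 'UUID|Events|State'

  # The same information seen from the array side, once it is assembled.
  mdadm --detail /dev/md1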
From: Stan Hoeppner on 19 Jun 2010 19:00

Alan Chandler put forth on 6/19/2010 1:20 PM:
>
> I have a server with a pair of raided (RAID1) disks, using partitions 1, 2
> and 4 as /boot, root and an LVM volume respectively. The two disks
> are /dev/sda and /dev/sdb. They have just replaced two smaller disks
> where the root partition was NOT a raid device - it was just /dev/sda2 -
> although there was a raided boot partition in the first partition.
> The hardware only supports 2 SATA channels.
>
> I wanted to revert the root partition to the same state as the one I had
> just taken out, so I failed and removed sdb

_why_? This doesn't make any sense.

--
Stan
From: Andrew Reid on 19 Jun 2010 22:20

On Saturday 19 June 2010 14:20:27 Alan Chandler wrote:

[ Details elided ]

> HOWEVER (the punch line): when this system booted, it was not the old
> reverted one but how it was before I started this cycle. In other words
> it looked as though the disk which I had failed and removed was being used.
>
> If I did mdadm --detail /dev/md1 (or any of the other devices) it showed
> /dev/sdb as the only device in the raid pair. To sync up again I am
> having to add the various /dev/sda partitions back in.
>
> SO THE QUESTION IS: what went wrong? How does a failed device end up
> being used to build the operational arrays, while the other devices end up
> not being included?

My understanding of how mdadm re-arranges the array (including for
failures, etc.) is that it writes metadata into the various partitions,
so I agree with you that this is weird -- I would have expected the
RAID array to come up with the sda devices as the only devices present.

There are two things I can think of, neither quite right, but maybe
they'll motivate someone else to figure it out:

(1) Device naming can be tricky when you're unplugging drives.
Maybe the devices now showing up as "sdb" actually are the original
"sda" devices. Can you check UUIDs? This explanation also requires
that you didn't actually revert the disk, you only thought you did,
but then didn't catch it because the conjectural device renaming
convinced you that the RAID was being weird.

(2) How did you revert the root partition? If you copied all the
files, then I have nothing else to add. If you did "dd" between the
partitions, however, you may have creamed the md metadata and caused
the system to think the sdb device was the "good" one. This
explanation is unsatisfactory because, even if it's right, it only
explains why that partition should be reversed, not the others,
although if you didn't revert the others, they're copies, and you
can't tell them apart anyway.

Also, what happened to /etc/mdadm/mdadm.conf on the reverted root
partition? Is it nonexistent on the one you're now booting from?
There's potential for confusion there also, although I think the
initramfs info will suffice until the next kernel update.

				-- A.
--
Andrew Reid / reidac(a)bellatlantic.net
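A quick way to check point (1) without trusting the kernel's sda/sdb naming is to map the device nodes back to physical drives and to the md member UUIDs. A rough sketch along the lines Andrew suggests, nothing in it specific to this machine beyond the partition names already quoted:

  # Map kernel names to physical drives via the udev by-id symlinks
  # (these include the drive serial numbers).
  ls -l /dev/disk/by-id/ | grep -E 'sda|sdb'

  # Filesystem and md-member UUIDs as blkid sees them.
  blkid /dev/sda2 /dev/sdb2

  # Arrays as mdadm would scan them right now, to compare against the
  # ARRAY lines in /etc/mdadm/mdadm.conf.
  mdadm --examine --scan
  cat /etc/mdadm/mdadm.conf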
From: Alan Chandler on 20 Jun 2010 03:40

On 19/06/10 23:54, Stan Hoeppner wrote:
> Alan Chandler put forth on 6/19/2010 1:20 PM:
>>
>> I have a server with a pair of raided (RAID1) disks, using partitions 1, 2
>> and 4 as /boot, root and an LVM volume respectively. The two disks
>> are /dev/sda and /dev/sdb. They have just replaced two smaller disks
>> where the root partition was NOT a raid device - it was just /dev/sda2 -
>> although there was a raided boot partition in the first partition.
>> The hardware only supports 2 SATA channels.
>>
>> I wanted to revert the root partition to the same state as the one I had
>> just taken out, so I failed and removed sdb
>
> _why_? This doesn't make any sense.
>

The new system I had just built used the Nouveau driver for my GeForce graphics chip, and that, in combination with standard settings for the Hauppauge Nova-T 500, was stuttering and then locking up when watching TV with MythTV. The symptoms were that the Nova-T stuff was failing.

The old system was built with the proprietary NVIDIA driver and maybe (I can't remember) the Nova-T stuff built from source. That had worked perfectly for the last 6 months or so with no locking up. The (supposed) quickest way to try it was to revert to that system. But the disks I had taken out were too small for the other jobs I need this box to do, so it was a question of copying the fully configured system over.

--
Alan Chandler
http://www.chandlerfamily.org.uk
From: Alan Chandler on 20 Jun 2010 03:50

On 20/06/10 02:15, Andrew Reid wrote:
> On Saturday 19 June 2010 14:20:27 Alan Chandler wrote:
>
> [ Details elided ]
>
>> HOWEVER (the punch line): when this system booted, it was not the old
>> reverted one but how it was before I started this cycle. In other words
>> it looked as though the disk which I had failed and removed was being used.
>>
>> If I did mdadm --detail /dev/md1 (or any of the other devices) it showed
>> /dev/sdb as the only device in the raid pair. To sync up again I am
>> having to add the various /dev/sda partitions back in.
>>
>> SO THE QUESTION IS: what went wrong? How does a failed device end up
>> being used to build the operational arrays, while the other devices end up
>> not being included?
>
> My understanding of how mdadm re-arranges the array (including for
> failures, etc.) is that it writes metadata into the various partitions,
> so I agree with you that this is weird -- I would have expected the
> RAID array to come up with the sda devices as the only devices present.
>
> There are two things I can think of, neither quite right, but maybe
> they'll motivate someone else to figure it out:
>
> (1) Device naming can be tricky when you're unplugging drives.
> Maybe the devices now showing up as "sdb" actually are the original
> "sda" devices. Can you check UUIDs? This explanation also requires
> that you didn't actually revert the disk, you only thought you did,
> but then didn't catch it because the conjectural device renaming
> convinced you that the RAID was being weird.

Of course that was my first thought. But I was doing this via SSH from another machine, so the terminal screen contents survived the power down. It was clear what I had done and which disks had failed etc.

>
> (2) How did you revert the root partition? If you copied all the
> files, then I have nothing else to add.

Yes, I did a file copy (using rsync -aH) ...

>
> Also, what happened to /etc/mdadm/mdadm.conf on the reverted root
> partition? Is it nonexistent on the one you're now booting from?
> There's potential for confusion there also, although I think the
> initramfs info will suffice until the next kernel update.
>

This point is a possibility as I didn't check the mdadm.conf file, but the initramfs was the same one throughout.

I got into more trouble, because in order to correct stuff (but before the failed disk had even started to be resynced - I had asked it to, but a much bigger partition was in the process, so it hadn't started) I powered down, removed both disks from the system, put an old disk back, powered up, copied some files across to a third disk, then powered down and put the two raided disks back.

When I powered up again, it switched again and said the two disks were in sync on the partitions that hadn't started. This left the file system in an unusable state. Fortunately the more important big partition that was only partially synced carried on syncing in the same configuration (although I believe it started again from scratch rather than carrying on from where it left off).

What I think was happening was that the BIOS was changing the boot order whenever I changed the disks, and I then ended up booting from an incorrectly synced partition.

--
Alan Chandler
http://www.chandlerfamily.org.uk
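For completeness, re-adding the dropped sda partitions and watching the resync looks roughly like this. A hedged sketch only: the md/partition pairing follows the fail/remove commands in the first post, and the last two steps are the usual Debian way of keeping the boot-time assembly consistent with what is on disk, not something the thread itself prescribes.

  # Put the sda members back into their arrays; md resyncs them from the
  # sdb copies that the system is currently running from.
  mdadm /dev/md0 --add /dev/sda1
  mdadm /dev/md1 --add /dev/sda2
  mdadm /dev/md2 --add /dev/sda4

  # Watch the resync progress.
  cat /proc/mdstat

  # Regenerate the ARRAY lines, review them by hand against
  # /etc/mdadm/mdadm.conf, then rebuild the initramfs so that assembly at
  # boot matches the current arrays.
  mdadm --examine --scan
  update-initramfs -u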