From: Mark Jacobs on
I understand that there will be problems with developer releases, but I
have a rather strange problem.

I'm currently running the 118 snapshot without any real problems, but
I've attempted to boot with the 120 and 126 snapshots and experienced
the same problem both times. I have six SATA drives in a raidz2
configuration that all come up online in 118, but in the two other
developer releases I've tried, the last disk in the set always comes up
offline and the pool is in a degraded state. I don't see any error
messages in the system logs.

When I back out the new OS level and reboot into the 118 boot
environment, the offline drive comes back online and a resilver
operation automatically repairs the degraded pool.

Any hints on how I can debug this problem?

Mark Jacobs

From: cindy on
On Nov 1, 1:36 pm, Mark Jacobs <jaco...(a)gate.net> wrote:
> I understand that there will be problems with developer releases, but I
> have a rather strange problem.
>
> I'm currently running the 118 snapshot without any real problems, but
> I've attempted to boot with the 120 and 126 snapshots and experienced
> the same problem both times. I have six SATA drives in a raidz2
> configuration that all come up online in 118, but in the two other
> developer releases I've tried, the last disk in the set always comes up
> offline and the pool is in a degraded state. I don't see any error
> messages in the system logs.
>
> When I back out the new OS level and reboot into the 118 boot
> environment, the offline drive comes back online and a resilver
> operation automatically repairs the degraded pool.
>
> Any hints on how I can debug this problem?
>
> Mark Jacobs

Hi Mark,

I have a few suggestions (a couple of example commands follow the list):

1. You can use the fmdump -eV command to isolate when the problem with
this drive started. If it predates the migration from b118, then
something is going on with this drive, like a transient disk problem. I
haven't seen many instances of bad drives offlining themselves, which is
puzzling. What hardware is this?

2. Did your pool configuration change between b118 and b120 or b126? If
so, then you might need to remove your zpool.cache file and re-import
the pool. I see a similar problem reported in CR 6896803 and CR 6497675.

3. Review the RAID-Z corruption problem reported in builds 120-123,
which is fixed in b124, although the symptoms are not similar:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#RAID-Z_Checksum_Errors_in_Nevada_Builds.2C_120-123
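
For example, for item 1 you could bound the error-report output to the
time of your b120/b126 boots, and for item 2 an export and re-import
should rebuild the cache file. These are only rough sketches: the date
below is a placeholder (see fmdump(1M) for the accepted time formats),
and only export the pool while nothing is using it.

  # one line per error event, starting from a given date
  fmdump -e -t 01Nov09

  # full detail for the same window
  fmdump -eV -t 01Nov09

  # rebuild /etc/zfs/zpool.cache by exporting and re-importing the pool
  zpool export mark
  zpool import mark
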
From: Mark Jacobs on
On 2009-11-02 13:17:33 -0500, cindy <cindy.swearingen(a)sun.com> said:

> On Nov 1, 1:36 pm, Mark Jacobs <jaco...(a)gate.net> wrote:
>> I understand that there will be problems with developer releases, but I
>> have a rather strange problem.
>>
>> I'm currently running the 118 snapshot without any real problems, but
>> I've attempted to boot with the 120 and 126 snapshots and experienced
>> the same problem both times. I have six SATA drives in a raidz2
>> configuration that all come up online in 118, but in the two other
>> developer releases I've tried, the last disk in the set always comes up
>> offline and the pool is in a degraded state. I don't see any error
>> messages in the system logs.
>>
>> When I back out the new OS level and reboot into the 118 boot
>> environment, the offline drive comes back online and a resilver
>> operation automatically repairs the degraded pool.
>>
>> Any hints on how I can debug this problem?
>>
>> Mark Jacobs
>
> Hi Mark,
>
> I have a few suggestions:
>
> 1. You can use the fmdump -eV command to isolate when the problem with
> this drive started. If it predates the migration from b118, then
> something is going on with this drive, like a transient disk problem. I
> haven't seen many instances of bad drives offlining themselves, which
> is puzzling. What hardware is this?
>
> 2. Did your pool configuration change between b118 and b120 or b126?
> If so, then you might need to remove your zpool.cache file and
> re-import the pool. I see a similar problem reported in CR 6896803 and
> CR 6497675.
>
> 3. Review the RAID-Z corruption problem reported in builds 120-123,
> which is fixed in b124, although the symptoms are not similar:
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#RAID-Z_Checksum_Errors_in_Nevada_Builds.2C_120-123

I ran the command you recommended and I see several occurrences of this
reported:

Nov 01 2009 14:59:02.575986933 ereport.fs.zfs.checksum
nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0x10ea09ddfa00401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x6f39e1c7b0ed3a51
                vdev = 0xec8808ee9f058d20
        (end detector)

        pool = mark
        pool_guid = 0x6f39e1c7b0ed3a51
        pool_context = 1
        pool_failmode = wait
        vdev_guid = 0xec8808ee9f058d20
        vdev_type = disk
        vdev_path = /dev/dsk/c8t5d0s0
        vdev_devid = id1,sd(a)SATA_____Hitachi_HDT72101______STF610MR1Z72NP/a
        parent_guid = 0xd00b4d053e99f0b
        parent_type = raidz
        zio_err = 0
        zio_offset = 0xd55610800
        zio_size = 0x200
        zio_objset = 0x0
        zio_object = 0x0
        zio_level = 0
        zio_blkid = 0x78
        __ttl = 0x1
        __tod = 0x4aede886 0x2254dcf5



The drive that goes offline with the later developer builds is the last
one in this set:

mark(a)opensolaris:~$ zpool status mark
  pool: mark
 state: ONLINE
 scrub: scrub completed after 1h43m with 0 errors on Sun Nov 1 16:46:27 2009
config:

        NAME        STATE     READ WRITE CKSUM
        mark        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c8t0d0  ONLINE       0     0     0
            c8t1d0  ONLINE       0     0     0
            c8t2d0  ONLINE       0     0     0
            c8t3d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0

errors: No known data errors

Do you think it's a physical drive problem that only shows up in the
later developer builds?

Mark Jacobs

From: cindy on
On Nov 2, 5:56 pm, Mark Jacobs <jaco...(a)gate.net> wrote:
> On 2009-11-02 13:17:33 -0500, cindy <cindy.swearin...(a)sun.com> said:
>
> > On Nov 1, 1:36 pm, Mark Jacobs <jaco...(a)gate.net> wrote:
> >> I understand that there will be problems with developer releases,
> >> but I have a rather strange problem.
>
> >> I'm currently running the 118 snapshot without any real problems,
> >> but I've attempted to boot with the 120 and 126 snapshots and
> >> experienced the same problem both times. I have six SATA drives in a
> >> raidz2 configuration that all come up online in 118, but in the two
> >> other developer releases I've tried, the last disk in the set always
> >> comes up offline and the pool is in a degraded state. I don't see
> >> any error messages in the system logs.
>
> >> When I back out the new OS level and reboot into the 118 boot
> >> environment, the offline drive comes back online and a resilver
> >> operation automatically repairs the degraded pool.
>
> >> Any hints on how I can debug this problem?
>
> >> Mark Jacobs
>
> > Hi Mark,
>
> > I have a few suggestions:
>
> > 1. You can use the fmdump -eV command to isolate when the problem
> > with this drive started. If it predates the migration from b118, then
> > something is going on with this drive, like a transient disk problem.
> > I haven't seen many instances of bad drives offlining themselves,
> > which is puzzling. What hardware is this?
>
> > 2. Did your pool configuration change between b118 and b120 or b126?
> > If so, then you might need to remove your zpool.cache file and
> > re-import the pool. I see a similar problem reported in CR 6896803
> > and CR 6497675.
>
> > 3. Review the RAID-Z corruption problem reported in builds 120-123,
> > which is fixed in b124, although the symptoms are not similar:
>
> >http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Gu...
>
> I ran the command you recommended and I see several occurrences of
> this reported:
>
> Nov 01 2009 14:59:02.575986933 ereport.fs.zfs.checksum
> nvlist version: 0
>         class = ereport.fs.zfs.checksum
>         ena = 0x10ea09ddfa00401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = zfs
>                 pool = 0x6f39e1c7b0ed3a51
>                 vdev = 0xec8808ee9f058d20
>         (end detector)
>
>         pool = mark
>         pool_guid = 0x6f39e1c7b0ed3a51
>         pool_context = 1
>         pool_failmode = wait
>         vdev_guid = 0xec8808ee9f058d20
>         vdev_type = disk
>         vdev_path = /dev/dsk/c8t5d0s0
>         vdev_devid = id1,sd(a)SATA_____Hitachi_HDT72101______STF610MR1Z72NP/a
>         parent_guid = 0xd00b4d053e99f0b
>         parent_type = raidz
>         zio_err = 0
>         zio_offset = 0xd55610800
>         zio_size = 0x200
>         zio_objset = 0x0
>         zio_object = 0x0
>         zio_level = 0
>         zio_blkid = 0x78
>         __ttl = 0x1
>         __tod = 0x4aede886 0x2254dcf5
>
> The drive that goes offline with the later developer builds is the
> last one in this set:
>
> mark(a)opensolaris:~$ zpool status mark
>   pool: mark
>  state: ONLINE
>  scrub: scrub completed after 1h43m with 0 errors on Sun Nov 1 16:46:27 2009
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         mark        ONLINE       0     0     0
>           raidz2    ONLINE       0     0     0
>             c8t0d0  ONLINE       0     0     0
>             c8t1d0  ONLINE       0     0     0
>             c8t2d0  ONLINE       0     0     0
>             c8t3d0  ONLINE       0     0     0
>             c8t4d0  ONLINE       0     0     0
>             c8t5d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> Do you think it's a physical drive problem that only shows up in the
> later developer builds?
>
> Mark Jacobs

Hi Mark,

I'm not familiar with any ZFS error condition that would offline a
disk. It's possible that this disk is getting more scrutiny in a later
build, but I'm not sure which feature or bug fix that integrated would
cause this. I will try to find out.

According to your fmdump output, the last disk in your pool, c8t5d0,
suffered a checksum error. What else did FM record for this disk:
similar checksum errors or something else? And if so, over what period
of time?
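
For example, something along these lines would give a quick
one-line-per-event summary you could scan, plus a rough count of the
checksum ereports (the class string is just the one from your output):

  # one line per error event; look for repeats around the b120/b126 boots
  fmdump -e

  # rough count of just the checksum ereports
  fmdump -e | grep -c ereport.fs.zfs.checksum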

Thanks,

Cindy



From: Mark Jacobs on
On 2009-11-03 13:08:51 -0500, cindy <cindy.swearingen(a)sun.com> said:

> On Nov 2, 5:56 pm, Mark Jacobs <jaco...(a)gate.net> wrote:
>> On 2009-11-02 13:17:33 -0500, cindy <cindy.swearin...(a)sun.com> said:
>>
>>> On Nov 1, 1:36 pm, Mark Jacobs <jaco...(a)gate.net> wrote:
>>>> I understand that there will be problems with developer releases,
>>>> but I have a rather strange problem.
>>
>>>> I'm currently running the 118 snapshot without any real problems,
>>>> but I've attempted to boot with the 120 and 126 snapshots and
>>>> experienced the same problem both times. I have six SATA drives in
>>>> a raidz2 configuration that all come up online in 118, but in the
>>>> two other developer releases I've tried, the last disk in the set
>>>> always comes up offline and the pool is in a degraded state. I
>>>> don't see any error messages in the system logs.
>>
>>>> When I back out the new OS level and reboot into the 118 boot
>>>> environment, the offline drive comes back online and a resilver
>>>> operation automatically repairs the degraded pool.
>>
>>>> Any hints on how I can debug this problem?
>>
>>>> Mark Jacobs
>>
>>> Hi Mark,
>>
>>> I have a few suggestions:
>>
>>> 1. You can use the fmdump -eV command to isolate when the problem
>>> with this drive started. If it predates the migration from b118,
>>> then something is going on with this drive, like a transient disk
>>> problem. I haven't seen many instances of bad drives offlining
>>> themselves, which is puzzling. What hardware is this?
>>
>>> 2. Did your pool configuration change between b118 and b120 or b126?
>>> If so, then you might need to remove your zpool.cache file and
>>> re-import the pool. I see a similar problem reported in CR 6896803
>>> and CR 6497675.
>>
>>> 3. Review the RAID-Z corruption problem reported in builds 120-123,
>>> which is fixed in b124, although the symptoms are not similar:
>>
>>> http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Gu...
>>
>> I ran the command you recommended and I see several occurrences of
>> this reported:
>>
>> Nov 01 2009 14:59:02.575986933 ereport.fs.zfs.checksum
>> nvlist version: 0
>>         class = ereport.fs.zfs.checksum
>>         ena = 0x10ea09ddfa00401
>>         detector = (embedded nvlist)
>>         nvlist version: 0
>>                 version = 0x0
>>                 scheme = zfs
>>                 pool = 0x6f39e1c7b0ed3a51
>>                 vdev = 0xec8808ee9f058d20
>>         (end detector)
>>
>>         pool = mark
>>         pool_guid = 0x6f39e1c7b0ed3a51
>>         pool_context = 1
>>         pool_failmode = wait
>>         vdev_guid = 0xec8808ee9f058d20
>>         vdev_type = disk
>>         vdev_path = /dev/dsk/c8t5d0s0
>>         vdev_devid = id1,sd(a)SATA_____Hitachi_HDT72101______STF610MR1Z72NP/a
>>         parent_guid = 0xd00b4d053e99f0b
>>         parent_type = raidz
>>         zio_err = 0
>>         zio_offset = 0xd55610800
>>         zio_size = 0x200
>>         zio_objset = 0x0
>>         zio_object = 0x0
>>         zio_level = 0
>>         zio_blkid = 0x78
>>         __ttl = 0x1
>>         __tod = 0x4aede886 0x2254dcf5
>>
>> The drive that goes offline with the later developer builds is the
>> last one in this set:
>>
>> mark(a)opensolaris:~$ zpool status mark
>>   pool: mark
>>  state: ONLINE
>>  scrub: scrub completed after 1h43m with 0 errors on Sun Nov 1 16:46:27 2009
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         mark        ONLINE       0     0     0
>>           raidz2    ONLINE       0     0     0
>>             c8t0d0  ONLINE       0     0     0
>>             c8t1d0  ONLINE       0     0     0
>>             c8t2d0  ONLINE       0     0     0
>>             c8t3d0  ONLINE       0     0     0
>>             c8t4d0  ONLINE       0     0     0
>>             c8t5d0  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> Do you think it's a physical drive problem that only shows up in the
>> later developer builds?
>>
>> Mark Jacobs
>
> Hi Mark,
>
> I'm not familiar with any ZFS error condition that would offline a
> disk. It's possible that this disk is getting more scrutiny in a later
> build, but I'm not sure which feature or bug fix that integrated would
> cause this. I will try to find out.
>
> According to your fmdump output, the last disk in your pool, c8t5d0,
> suffered a checksum error. What else did FM record for this disk:
> similar checksum errors or something else? And if so, over what period
> of time?
>
> Thanks,
>
> Cindy

The only two sets of errors that I see are the above-mentioned checksum
errors. One set occurred while I was booted with build 120 (and went
away when I booted back under 118), and the other set occurred during
the 126 boot.

Nothing else is being recorded in between, before, or after the
attempts to boot under a build later than 118.

Mark Jacobs