From: Mark Jacobs on 1 Nov 2009 15:36

I understand that there will be problems with developer releases, but I have a rather strange problem.

I'm currently running the 118 snapshot without any real problems, but I've attempted to boot with the 120 and 126 snapshots and I've experienced the same problem both times. I have six SATA drives in a raidz2 configuration that all come up online in 118, but in the two other developer releases I've tried, the last disk in the set always comes up offline and the pool is in a degraded state. I don't see any error messages in the system logs.

When I back out the new OS level and reboot to the 118 boot environment, the offline drive comes up online and a resilver operation automatically fixes the degraded pool.

Any hints on how I can debug this problem?

Mark Jacobs
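For readers following along, a rough sketch of the commands involved in checking the pool state and falling back to the earlier build on an OpenSolaris system; the boot environment name "opensolaris-118" is a placeholder, not something taken from this thread:

    # Show only unhealthy pools, then the full per-vdev state
    zpool status -x
    zpool status mark

    # List boot environments, reactivate the known-good b118 BE, and reboot
    # ("opensolaris-118" is a placeholder; use whatever beadm list shows)
    beadm list
    beadm activate opensolaris-118
    init 6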
From: cindy on 2 Nov 2009 13:17

On Nov 1, 1:36 pm, Mark Jacobs <jaco...(a)gate.net> wrote:
> I have six sata drives in a raidz2 configuration that all come up
> online in 118, but in the two other developer releases I've tried the
> last disk in the set always comes up offline and the pool is in a
> degraded state.
> [...]
> Any hints on how I can debug this problem?

Hi Mark,

I have a few suggestions:

1. You can use the fmdump -eV command to isolate when the problem with this drive started. If it predates the migration from b118, then something is going on with this drive, like a transient disk problem. I haven't seen many instances of bad drives offlining themselves, which is puzzling. What hardware is this?

2. Did your pool configuration change between b118 and b120 or b126? If so, then you might need to remove your zpool.cache file and re-import the pool. I see a similar problem reported in CR 6896803 and 6497675.

3. Review the RAID-Z corruption problem reported in builds 120-123, which is fixed in b124, although the symptoms are not similar.

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#RAID-Z_Checksum_Errors_in_Nevada_Builds.2C_120-123
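A rough sketch of what suggestions 1 and 2 above could look like on the command line, assuming the pool name "mark" from this thread; exact behavior may differ between builds:

    # 1. Dump the ZFS error telemetry; -e gives one line per event with a
    #    timestamp, -eV gives the full nvlist detail
    fmdump -e
    fmdump -eV

    # 2. Export and re-import the pool so its zpool.cache entry is rebuilt
    #    (export removes the pool's entry from /etc/zfs/zpool.cache,
    #    import writes a fresh one)
    zpool export mark
    zpool import mark
    zpool status mark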
From: Mark Jacobs on 2 Nov 2009 19:56

On 2009-11-02 13:17:33 -0500, cindy <cindy.swearingen(a)sun.com> said:
> 1. You can use the fmdump -eV command to isolate when the problem
> with this drive started.
> [...]

I ran the command you recommended and I see several occurrences of this reported:

    Nov 01 2009 14:59:02.575986933 ereport.fs.zfs.checksum
    nvlist version: 0
        class = ereport.fs.zfs.checksum
        ena = 0x10ea09ddfa00401
        detector = (embedded nvlist)
        nvlist version: 0
            version = 0x0
            scheme = zfs
            pool = 0x6f39e1c7b0ed3a51
            vdev = 0xec8808ee9f058d20
        (end detector)
        pool = mark
        pool_guid = 0x6f39e1c7b0ed3a51
        pool_context = 1
        pool_failmode = wait
        vdev_guid = 0xec8808ee9f058d20
        vdev_type = disk
        vdev_path = /dev/dsk/c8t5d0s0
        vdev_devid = id1,sd(a)SATA_____Hitachi_HDT72101______STF610MR1Z72NP/a
        parent_guid = 0xd00b4d053e99f0b
        parent_type = raidz
        zio_err = 0
        zio_offset = 0xd55610800
        zio_size = 0x200
        zio_objset = 0x0
        zio_object = 0x0
        zio_level = 0
        zio_blkid = 0x78
        __ttl = 0x1
        __tod = 0x4aede886 0x2254dcf5

The drive that goes offline with later developer builds is the last one in this set:

    mark(a)opensolaris:~$ zpool status mark
      pool: mark
     state: ONLINE
     scrub: scrub completed after 1h43m with 0 errors on Sun Nov 1 16:46:27 2009
    config:

            NAME        STATE     READ WRITE CKSUM
            mark        ONLINE       0     0     0
              raidz2    ONLINE       0     0     0
                c8t0d0  ONLINE       0     0     0
                c8t1d0  ONLINE       0     0     0
                c8t2d0  ONLINE       0     0     0
                c8t3d0  ONLINE       0     0     0
                c8t4d0  ONLINE       0     0     0
                c8t5d0  ONLINE       0     0     0

    errors: No known data errors

Do you think it's a physical drive problem that only shows up in later developer builds?

Mark Jacobs
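As a hedged aside, the ereport log can also be narrowed to the window around each failing boot to see whether the checksum events line up with the b120/b126 attempts; the time strings below are only illustrative, and the accepted formats are documented in fmdump(1M):

    # One-line summary of every error event, with timestamps
    fmdump -e

    # Full detail restricted to a time window (example times only)
    fmdump -eV -t "01Nov2009 14:00:00" -T "01Nov2009 17:00:00"

    # Quick count of checksum ereports
    fmdump -e | grep -c checksum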
From: cindy on 3 Nov 2009 13:08

On Nov 2, 5:56 pm, Mark Jacobs <jaco...(a)gate.net> wrote:
> Do you think it's a physical drive problem that only shows up in later
> developer builds?

Hi Mark,

I'm not familiar with any ZFS error condition that would offline a disk. I suppose it's possible that this disk is getting more scrutiny in a later build, but I'm not sure which integrated feature or bug fix would cause that. I will try to find out.

According to your fmdump output, the last disk in your pool, c8t5d0, suffered a checksum error. What else did FM record for this disk, similar checksum errors or something else? And if so, over what period of time?

Thanks,

Cindy
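Two other places that usually hold history for a single drive on a stock Solaris/OpenSolaris install, offered here only as a sketch alongside Cindy's question:

    # Per-device error counters kept by the driver
    # (soft/hard/transport errors, vendor, serial number)
    iostat -En c8t5d0

    # Faults FMA has actually diagnosed, as opposed to raw ereports
    fmadm faulty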
From: Mark Jacobs on 4 Nov 2009 18:17

On 2009-11-03 13:08:51 -0500, cindy <cindy.swearingen(a)sun.com> said:
> What else did FM record for this disk, similar checksum errors or
> something else? And if so, over what period of time?

The only two sets of errors that I see are the above-mentioned checksum errors. One set was during the time I booted with build 120 (and went away when I booted back under 118), and the other set was during the 126 bootup.

Nothing else is being recorded before, in between, or after the attempts to boot under a higher maintenance level than 118.

Mark Jacobs
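Mark reports that the pool heals automatically once he boots back into b118; for completeness, a sketch of the manual equivalent using the pool and device names from this thread, in case the recovery ever needs to be kicked off by hand:

    # Bring the device back online, clear the error counters, and verify
    zpool online mark c8t5d0
    zpool clear mark
    zpool scrub mark
    zpool status -v mark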