Disk problems or worse? [Debian]

From: Ralph Katz on 2 Jun 2010 21:30

Lenny install on newly acquired used Dell hangs and throws errors to
syslog. Do I have two bad disks or a more serious hardware problem?
Short of buying a new disk, how would I know? What would you recommend?
Or do I have a simple BIOS setting problem?

(My last post to debian-user was in 2008. Etch has continued to be rock
solid on two desktops. Now I felt it was time to upgrade.)

First, an old Dell GX240 was obtained and Lenny/Xfce installed: P4, 1 GB
RAM, 120 GB WDC disk.

Syslog showed all kinds of errors, and the system would hang at times:

May 24 21:53:39 spike kernel: [ 5034.952013] hda: status timeout:
status=0x80 { Busy }
May 24 21:53:39 spike kernel: [ 5034.952021] ide: failed opcode was: unknown
May 24 21:53:39 spike kernel: [ 5034.952030] hda: DMA disabled
May 24 21:53:39 spike kernel: [ 5034.952066] hda: drive not ready for
command
May 24 21:54:14 spike kernel: [ 5064.952021] ide0: reset timed-out,
status=0x80
May 24 21:54:14 spike kernel: [ 5065.393331] hda: status timeout:
status=0x80 { Busy }
May 24 21:54:14 spike kernel: [ 5065.393331] ide: failed opcode was: unknown
May 24 21:54:14 spike kernel: [ 5065.393331] hda: drive not ready for
command
May 24 21:54:14 spike kernel: [ 5065.393331] Clocksource tsc unstable
(delta = 4686898152 ns)
May 24 21:54:44 spike kernel: [ 5099.964023] ide0: reset timed-out,
status=0x80
May 24 21:54:44 spike kernel: [ 5099.964040] end_request: I/O error, dev
hda, sector 10867375
May 24 21:54:44 spike kernel: [ 5099.964104] end_request: I/O error, dev
hda, sector 13826839
May 24 21:54:44 spike kernel: [ 5099.964115] Buffer I/O error on device
dm-2, logical block 360455

[snipped 20 Kb of I/O errors]

May 24 21:54:44 spike kernel: [ 5099.967007] end_request: I/O error, dev
hda, sector 208223535
May 24 21:54:44 spike kernel: [ 5099.967024] EXT3-fs error (device
dm-5): ext3_get_inode_loc: unable to read inode block - inode=5792911,
block=23167050
May 24 21:54:44 spike kernel: [ 5099.967128] Aborting journal on device
dm-5.
May 24 21:54:44 spike kernel: [ 5099.968575] ext3_abort called.
May 24 21:54:44 spike kernel: [ 5099.968587] EXT3-fs error (device
dm-5): ext3_journal_start_sb: Detected aborted journal
May 24 21:54:44 spike kernel: [ 5099.968594] Remounting filesystem read-only

I concluded the disk was dead (although SMART tests PASSED) and replaced
it with another used 120 GB WDC. I re-installed Lenny, and soon the system
would hang again, typically at startup.

Syslog entries of note with the second disk installed:

/var/log/syslog:Jun 2 08:52:40 spike smartd[2346]: Device: /dev/hda,
SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 100 to 198
/var/log/syslog.1:Jun 1 08:13:56 spike kernel: [ 936.000023] hda:
dma_timer_expiry: dma status == 0x21
/var/log/syslog.1:Jun 1 08:28:44 spike smartd[2357]: Device: /dev/hda,
SMART Usage Attribute: 196 Reallocated_Event_Count changed from 196 to 195

May 31 09:54:09 spike kernel: [ 620.084022] hda: dma_timer_expiry: dma
status == 0x20
May 31 09:54:09 spike kernel: [ 620.084031] hda: DMA timeout retry
May 31 09:54:09 spike kernel: [ 620.084034] hda: timeout waiting for DMA
May 31 09:54:09 spike kernel: [ 624.232267] Clocksource tsc unstable
(delta = 4686697657 ns)
May 31 10:14:07 spike smartd[2331]: Device: /dev/hda, SMART Prefailure
Attribute: 5 Reallocated_Sector_Ct changed from 200 to 199
May 31 10:14:07 spike smartd[2331]: Device: /dev/hda, SMART Usage
Attribute: 196 Reallocated_Event_Count changed from 200 to 196

Meanwhile, SMART self-tests short and long passed. No errors were
reported by smartctl -a /dev/hda.
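
The smartd messages quoted above report changes in normalized attribute
values, which generally fall as a drive's condition worsens. A minimal
Python sketch for listing those changes with their direction follows,
assuming the "changed from X to Y" wording shown above and one message per
line in the real syslog. The function name and SAMPLE list are
illustrative; the sample lines are copied from this post.

```python
import re

# Matches smartd lines such as:
#   "Device: /dev/hda, SMART Usage Attribute: 196 Reallocated_Event_Count
#    changed from 200 to 196"   (one line in the real log)
SMARTD_CHANGE = re.compile(
    r"Device: ([^,\s]+), SMART (Prefailure|Usage) Attribute: "
    r"(\d+) (\S+) changed from (\d+) to (\d+)"
)

def smartd_changes(lines):
    """Return a list of (device, kind, attribute_id, name, delta) tuples."""
    changes = []
    for line in lines:
        m = SMARTD_CHANGE.search(line)
        if m:
            dev, kind, attr_id, name, old, new = m.groups()
            changes.append((dev, kind, int(attr_id), name, int(new) - int(old)))
    return changes

# Sample lines taken from the syslog entries above, unwrapped.
SAMPLE = [
    "Jun 2 08:52:40 spike smartd[2346]: Device: /dev/hda, SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 100 to 198",
    "May 31 10:14:07 spike smartd[2331]: Device: /dev/hda, SMART Prefailure Attribute: 5 Reallocated_Sector_Ct changed from 200 to 199",
    "May 31 10:14:07 spike smartd[2331]: Device: /dev/hda, SMART Usage Attribute: 196 Reallocated_Event_Count changed from 200 to 196",
]

for dev, kind, attr_id, name, delta in smartd_changes(SAMPLE):
    print(f"{dev} {attr_id:>3} {name:<24} {delta:+d} ({kind})")
```

A negative delta on attribute 5 or 196 means the normalized value dropped,
which is consistent with the drive having reallocated sectors between two
smartd polls.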

This morning I had to reboot a hung system with Alt-SysRq-b because X,
an ssh connection, VT1 and Ctrl-Alt-Del all failed.

Searching the net for "Clocksource tsc unstable" suggested disabling
ACPI in the BIOS. Hey, I'm just a desktop user, and this is beginning to
get beyond what my 7 years of experience let me understand of the magic.

Suggestions welcomed, thanks!

Ralph

--
To UNSUBSCRIBE, email to debian-user-REQUEST(a)lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster(a)lists.debian.org
Archive: http://lists.debian.org/4C070380.700(a)rcn.com

From: Mark on 2 Jun 2010 23:30

On Wed, Jun 2, 2010 at 6:21 PM, Ralph Katz <ralph.katz(a)rcn.com> wrote:

> Lenny install on newly acquired used Dell hangs and throws errors to
> syslog. Do I have two bad disks or a more serious hardware problem?
> Short of buying a new disk, how would I know? What would you recommend?
> Or do I have a simple BIOS setting problem?
>

[snip]

If you boot an Ubuntu Live CD, it will automatically let you know of any
bad hard disk sectors via a pop-up notification once the desktop
environment has loaded. I inherited a decommissioned hard drive from a
server room and used an Ubuntu Live CD to confirm it had bad sectors, which
was the reason for its decommissioning.

Once you confirm it's not the hdd, then you can troubleshoot other
possibilities.

HTH.

Mark

From: Jochen Schulz on 3 Jun 2010 02:10

Ralph Katz:
>
> Lenny install on newly acquired used Dell hangs and throws errors to
> syslog. Do I have two bad disks or a more serious hardware problem?

Another option: it might be a kernel problem. I don't remember the
specifics anymore, but on one of my systems I had similar errors. After
replacing the disk and still getting these errors, I found hints that
the kernel might be at fault. I then installed a newer kernel from
backports.org and the problems went away.

> May 24 21:54:14 spike kernel: [ 5065.393331] Clocksource tsc unstable
> (delta = 4686898152 ns)

This line is irrelevant for the hard disk problem.

> /var/log/syslog:Jun 2 08:52:40 spike smartd[2346]: Device: /dev/hda,
> SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 100 to 198
> /var/log/syslog.1:Jun 1 08:13:56 spike kernel: [ 936.000023] hda:
> dma_timer_expiry: dma status == 0x21
> /var/log/syslog.1:Jun 1 08:28:44 spike smartd[2357]: Device: /dev/hda,
> SMART Usage Attribute: 196 Reallocated_Event_Count changed from 196 to 195

That's a real hard disk error, but unless it happens regularly, you
don't need to worry. These happen sometimes and the disk is usually able
to handle it.

> Meanwhile, SMART self-tests short and long passed. No errors were
> reported by smartctl -a /dev/hda.

Well, at least the reallocation events should have been counted. It
doesn't hurt to post smartctl's output.
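
For picking the reallocation counters out of that output, a minimal Python
sketch follows. It assumes the usual ten-column attribute table printed by
"smartctl -A" (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE,
UPDATED, WHEN_FAILED, RAW_VALUE). The function name is illustrative, and
the SAMPLE_OUTPUT text is hypothetical, made up to show the layout; it is
not output from the drive in this thread.

```python
def parse_smart_attributes(text):
    """Parse the attribute table of `smartctl -A` into {id: fields}."""
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value with spaces stays whole.
        parts = line.split(None, 9)
        if len(parts) == 10 and parts[0].isdigit():
            attrs[int(parts[0])] = {
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": parts[9],
            }
    return attrs

# HYPOTHETICAL sample text, only to illustrate the column layout.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000b   198   198   051    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   195   195   000    Old_age   Always       -       5
"""

attrs = parse_smart_attributes(SAMPLE_OUTPUT)
for attr_id in (5, 196):
    a = attrs[attr_id]
    print(f"{attr_id:>3} {a['name']}: value={a['value']} "
          f"thresh={a['thresh']} raw={a['raw']}")
```

In practice the text would come from running "smartctl -A /dev/hda" as
root and feeding its output to parse_smart_attributes(). The raw values of
attributes 5 and 196 are where reallocation events would normally be
counted.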

J.
--
In an ideal world I would cure poverty and go to the gym at least three
days a week.
[Agree] [Disagree]
<http://www.slowlydownward.com/NODATA/data_enter2.html>

From: David Baron on 3 Jun 2010 09:50

I sometimes get this. The disks click-clack. Those messages.

Usually rebooting after jiggling the cables fixes it. Maybe replace them. Also
check the power supply. Working? Adequate?

--
Archive: http://lists.debian.org/201006031625.34181.d_baron(a)012.net.il

From: Daniel Barclay on 3 Jun 2010 12:00

Ralph,

Jochen Schulz wrote:
> Ralph Katz:
>> Lenny install on newly acquired used Dell hangs and throws errors to
>> syslog. Do I have two bad disks or a more serious hardware problem?
>
> Another option: it might be a kernel problem. I don't remember the
> specifics anymore, but on one of my systems I had similar errors. After
> replacing the disk and still getting these errors, I found hints that
> the kernel might be at fault. I then installed a newer kernel from
> backports.org and the problems went away.

What processor and chipset does your motherboard use?

Do you get

Does changing your IDE/ATA controllers from DMA mode to PIO
mode stop the message?

(I had similar problems, with similar log messages, on a dual-processor
AMD Athlon MP board. Apparently the AMD chipset had some bug, Linux
didn't work around that particular bug, and the kernel's IDE DMA code
(or maybe filesystem code) wasn't very robust: it didn't retry an
operation that failed because of a detected DMA timeout, and it didn't
even detect that the operation had failed and stop (panic or something)
before things (disk and filesystem state) became inconsistent.)
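
One way to try the DMA-versus-PIO test asked about above is hdparm's -d
flag, which switches the drive's using_dma setting (-d0 off, -d1 on). The
following is a minimal Python sketch that wraps that call; the helper names
and the dry_run parameter are illustrative, and it assumes hdparm is
installed and the disk is /dev/hda as in the logs. It only prints the
command unless dry_run is turned off.

```python
import subprocess

def hdparm_dma_command(device, enable):
    """Build the hdparm command that sets the drive's using_dma flag."""
    return ["hdparm", "-d1" if enable else "-d0", device]

def set_dma(device, enable, dry_run=True):
    """Print the command, or run it when dry_run is False (needs root)."""
    cmd = hdparm_dma_command(device, enable)
    if dry_run:
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd

# Dry run: show the command that would turn DMA off for the disk in the logs.
set_dma("/dev/hda", enable=False)
```

The setting does not survive a reboot, so it is a low-risk experiment: if
the timeouts stop with DMA off (at the cost of much slower disk access),
that points toward the controller, cable, or driver rather than the disk
surface.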

Daniel
--
Archive: http://lists.debian.org/4C07D027.9020203(a)fgm.com
