From: Ant on
Uh oh. I just discovered mcelog and something new and scary in its
/var/log/syslog:

Mar 6 01:19:37 foobar kernel: [15299.988025] Machine check events logged
Mar 6 01:42:07 foobar kernel: [16649.989021] Machine check events logged
Mar 6 02:05:19 foobar -- MARK --
Mar 6 02:19:37 foobar kernel: [18899.989024] Machine check events logged
Mar 6 02:37:07 foobar kernel: [19949.988027] Machine check events logged
Mar 6 03:05:19 foobar -- MARK --
Mar 6 03:24:37 foobar kernel: [22799.989023] Machine check events logged
Mar 6 03:45:19 foobar -- MARK --
Mar 6 04:05:19 foobar -- MARK --
Mar 6 04:25:19 foobar -- MARK --
Mar 6 04:45:19 foobar -- MARK --
Mar 6 05:02:07 foobar kernel: [28649.989023] Machine check events logged
Mar 6 05:25:19 foobar -- MARK --
Mar 6 05:45:19 foobar -- MARK --
Mar 6 06:05:19 foobar -- MARK --
Mar 6 06:24:37 foobar kernel: [33599.989027] Machine check events logged
Mar 6 06:33:13 foobar syslogd 1.5.0#5: restart.
Mar 6 06:45:19 foobar -- MARK --
Mar 6 07:05:19 foobar -- MARK --
Mar 6 07:25:19 foobar -- MARK --
Mar 6 07:45:19 foobar -- MARK --
Mar 6 08:05:19 foobar -- MARK --
Mar 6 08:17:07 foobar kernel: [40349.989022] Machine check events logged
Mar 6 08:24:37 foobar kernel: [40799.988036] Machine check events logged
Mar 6 08:45:19 foobar -- MARK --
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 0
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 1
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 2
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 3
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 4
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 5
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 6
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 7
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
problem!
Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
Mar 6 08:52:09 foobar mcelog: MCE 8
Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
level 1'
Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43


What does that mean? Dying CPU (had it since 12/24/2006)? Maybe that's
why memtest86+ didn't find any problems last week.
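
For what it's worth, the STATUS value in those records can be picked apart by hand. Here is a rough Python sketch, assuming the usual x86 machine-check layout for the status register (flag bits 63 down to 57 at the top, a 16-bit error code at the bottom, TLB errors encoded as 0000 0000 0001 TTLL). The function and table names are made up for illustration, and this is not how mcelog itself does it.

```python
# Sketch only: decode the architectural fields of an x86 MCi_STATUS value.
# All names are illustrative; the bit layout is an assumption taken from
# the generic machine-check architecture, not from mcelog's source.

STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was NOT corrected by hardware
    (60, "EN"),     # reporting was enabled for this error
    (59, "MISCV"),  # MISC register is valid
    (58, "ADDRV"),  # ADDR register is valid
    (57, "PCC"),    # processor context may be corrupt
]

TLB_TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
TLB_LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_mci_status(status):
    """Return (names of set flags, 16-bit error code, TLB details or None)."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    code = status & 0xFFFF
    tlb = None
    if (code & 0xFFF0) == 0x0010:  # pattern 0000 0000 0001 TTLL
        tlb = (TLB_TRANSACTION.get((code >> 2) & 0b11, "reserved"),
               TLB_LEVEL[code & 0b11])
    return flags, code, tlb


if __name__ == "__main__":
    print(decode_mci_status(0x9400000000010011))
```

For 9400000000010011 this reports VAL, EN and ADDRV set, UC and PCC clear, and a TLB code of instruction / level 1, which agrees with what mcelog printed. If the layout assumption holds, UC being clear would mean the hardware reported the error as corrected.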

On 3/5/2010 11:12 PM PT, Ant typed:

> Hello.
>
> Is /var/log/syslog the only place where Linux keeps records of kernel
> (v2.6.30 and v2.6.32) panics? dmesg and /var/log/messages don't seem
> to show anything about the crashes unless I am misreading them. I am
> trying to figure out a rare and random kernel panic issue on my old
> Debian box.
>
> I know it's not X because I exited it, logged out of bash, went back to
> the full-screen text console's login screen (I boot Debian to text
> mode, log into bash, and use the startx command to go to X), and saw a
> bunch of data (e.g., memory addresses and codes) on my screen from the
> kernel crash. However, the dump was too long and my computer was frozen
> with two blinking PS/2 keyboard lights (caps and scroll lock), so I
> couldn't scroll up or copy and paste.
>
> I poked around in my Debian and on the Web. I read that kernel panic
> errors/data can be found in /var/log/syslog (dmesg didn't show me
> anything related to kernel panics that I could find) like:
>
> # cat /var/log/syslog
> ...
> Mar 4 23:12:07 foobar smartd[2647]: Device: /dev/hda, SMART Usage
> Attribute: 194 Temperature_Celsius changed from 30 to 31
> ...
> Mar 5 15:11:31 foobar smartd[2610]: Device: /dev/hda, SMART Prefailure
> Attribute: 1 Raw_Read_Error_Rate changed from 58 to 59
> Mar 5 15:11:31 foobar smartd[2610]: Device: /dev/hda, SMART Usage
> Attribute: 195 Hardware_ECC_Recovered changed from 58 to 59
> Mar 5 15:15:01 foobar /USR/SBIN/CRON[8815]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 15:17:01 foobar /USR/SBIN/CRON[11199]: (root) CMD ( cd / &&
> run-parts --report /etc/cron.hourly)
> Mar 5 15:25:01 foobar /USR/SBIN/CRON[20721]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 15:35:01 foobar /USR/SBIN/CRON[32588]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 15:45:01 foobar /USR/SBIN/CRON[12129]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 15:55:01 foobar /USR/SBIN/CRON[23947]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
> < rebooted my crashed PC from its kernel panic >
> Mar 5 21:05:19 foobar syslogd 1.5.0#5: restart.
> ...
>
> I couldn't find anything similar before an earlier crash, like this (I
> don't think smartctl with /dev/hda is it?):
> ...
> Mar 5 05:17:01 foobar /USR/SBIN/CRON[26833]: (root) CMD ( cd / &&
> run-parts --report /etc/cron.hourly)
> Mar 5 05:25:01 foobar /USR/SBIN/CRON[29514]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 05:35:01 foobar /USR/SBIN/CRON[372]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 05:45:01 foobar /USR/SBIN/CRON[3772]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 05:55:01 foobar /USR/SBIN/CRON[7160]: (root) CMD (command -v
> debian-sa1 > /dev/null && debian-sa1 1 1)
> Mar 5 06:41:19 foobar syslogd 1.5.0#5: restart.
> ...
>
> I saw LKCD (http://lkcd.sourceforge.net/ and
> http://sourceforge.net/projects/lkcd/files/), but it seems to be
> outdated? I also couldn't find a Debian package of it, so I don't know
> if I should even try it to get more data.
>
> And yes, I already tried memtest86+ v4.00 and it found no errors
> after six hours with its default tests. I will try it again later just
> in case.
>
> Thank you in advance. :)
--
"What is it going to be like in eternity with God? Frankly, the capacity
of our brains cannot handle the wonder and greatness of heaven. It would
be like trying to describe the Internet to an ant." --Rick Warren's
book, The Purpose Driven Life
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
From: Ant on
On 3/6/2010 2:13 PM PT, Darren Salt typed:

>> Uh oh. I just discovered mcelog and something new and scary in its
>> /var/log/syslog:
> [snip]
>> Mar 6 08:24:37 foobar kernel: [40799.988036] Machine check events logged
>> Mar 6 08:45:19 foobar -- MARK --
>> Mar 6 08:52:09 foobar mcelog: HARDWARE ERROR. This is *NOT* a software
>> problem!
>> Mar 6 08:52:09 foobar mcelog: Please contact your hardware vendor
>> Mar 6 08:52:09 foobar mcelog: MCE 0
>> Mar 6 08:52:09 foobar mcelog: CPU 1 1 instruction cache
>> Mar 6 08:52:09 foobar mcelog: ADDR c11b6ff0
>> Mar 6 08:52:09 foobar mcelog: TIME 1267894329 Sat Mar 6 08:52:09 2010
>> Mar 6 08:52:09 foobar mcelog: TLB parity error in virtual array
>> Mar 6 08:52:09 foobar mcelog: TLB error 'instruction transaction,
>> level 1'
>> Mar 6 08:52:09 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
>> Mar 6 08:52:09 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
>> Mar 6 08:52:09 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
> [snip duplicate entries]
>
> Ouch.

:(


>> What does that mean? Dying CPU (had it since 12/24/2006)?
>
> 12/12/2007? ;-)

Eh?

>
> (Hint: use ISO8601 date formats or use month names. Broken-endian dates can
> all too easily cause error; fortunately, that one's unambiguous.)

I don't get it. :(


> Anyway, it does look like a fault in that CPU. I'd certainly be considering
> replacing it, though due to your earlier mention of kernel panics, I wouldn't
> rule out board problems either; are there any visible signs of hardware
> problems (leaky/bulging capacitors etc.)? Checking the PSU is probably also
> worthwhile.

Hmmm, I just swapped my PSU because the old one (FSP650-80GLC PSU (650
watts) from 5/14/2007) died in 12/2009. I recall that days before,
something smelled like it was burning, but I couldn't figure out where it
came from since I had two desktops. I guess it was the PSU that went poof!

At the same time, my EVGA GeForce 8800 GT video card had to be RMA'ed
because it didn't work anymore: even with the new PSU, the box still
wouldn't boot up at all. After getting an RMA'ed refurbished video card
back, my box was fine for a bit and then got kernel panics once in a
while. Then they seemed to slowly become more frequent. One day in
February, I ran memtest86+ v4.00 for about five hours and it found lots
of errors. My friend and I narrowed it down to a 512 MB RAM module and
were left with 2.5 GB remaining (still plenty for an old Linux
workstation!). Oh, and we didn't see anything burned, busted, etc.

It sounds like that PSU bust damaged a lot of my hardware. Argh! I
don't have the time and resources to build another one (guess I could do
a clean install with it too :P). :(


> (http://en.wikipedia.org/wiki/Translation_lookaside_buffer describes the
> affected area of the CPU.)

Hmm, I wonder if that 512 MB RAM that memtest86 flagged as having errors
wasn't actually bad?


>> Maybe that's why memtest86+ didn't find any problems last week.
>
> That doesn't seem to be relevant.

Why do you say that? I am going to run it again soon to double check.


> http://www.mjmwired.net/kernel/Documentation/networking/netconsole.txt
>
> That needs a second computer, but it will at least allow most panics to be
> captured. (Exceptions include hard hangs, where there may be no panic which
> can be reported, and problems which affect the network interface over which
> the log is being sent.)

Interesting. I wish Linux's kernel panics would log to a file like
Windows' memory dumps from blue screens, so I could use a debugger to see
what the dumps say.
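
For the second computer in a netconsole setup, something has to listen for the UDP packets. A minimal Python sketch of such a listener follows, assuming the sender is pointed at this machine on port 6666 (which, as far as I can tell, is the default target port in the netconsole documentation linked above). The function names and the log file name are made up.

```python
# Sketch only: a tiny UDP listener for the receiving side of netconsole.
# Port 6666 is an assumption (netconsole's documented default target port).
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket on the given address and return it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_lines(sock, count, logfile=None):
    """Read `count` datagrams; optionally append each one to an open file."""
    lines = []
    for _ in range(count):
        data, _addr = sock.recvfrom(4096)
        text = data.decode("utf-8", errors="replace")
        lines.append(text)
        if logfile is not None:
            logfile.write(text)
            logfile.flush()  # write through right away so nothing is lost
    return lines


def main():
    """Run forever, echoing and saving whatever the crashing box sends."""
    listener = open_listener()
    with open("netconsole.log", "a") as log:  # hypothetical file name
        while True:
            for line in receive_lines(listener, 1, log):
                print(line, end="")
```

Calling main() on the second box would then leave whatever the kernel managed to send before dying in netconsole.log.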
--
"Left right left right we're army ants. We swarm we fight. We have no
home. We roam. We race. You're lucky if we miss your place." --Douglas
Florian (The Army Ants Poem)
From: Ant on
On 3/7/2010 5:52 AM PT, Darren Salt typed:

>>> (Hint: use ISO8601 date formats or use month names. Broken-endian dates
>>> can all too easily cause error; fortunately, that one's unambiguous.)
>
>> I don't get it. :(
>
> Well... today is 7/3/2010 or 3/7/2010, according to locale; it is better
> represented as 2010-03-07.

OH! Bah, I am an American. :P
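
To make the date point concrete, here is a small Python sketch that reads the same slash-separated string both ways and prints the ISO 8601 form. The helper name to_iso is made up; it only uses the standard datetime module.

```python
# Sketch only: show why slash-separated dates are ambiguous and ISO 8601 is not.
from datetime import datetime


def to_iso(text, day_first=False):
    """Parse a D/M/Y or M/D/Y date string and return it as YYYY-MM-DD."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date().isoformat()


if __name__ == "__main__":
    print(to_iso("3/7/2010"))                  # month-first (US) reading
    print(to_iso("3/7/2010", day_first=True))  # day-first reading
    try:
        to_iso("12/24/2006", day_first=True)
    except ValueError as err:
        # There is no month 24, so this one can only be read month-first.
        print("not a valid day-first date:", err)
```

The first two calls give 2010-03-07 and 2010-07-03 for the same input, while 12/24/2006 only parses month-first, which is why that one counted as unambiguous.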


>>> Anyway, it does look like a fault in that CPU. I'd certainly be
>>> considering replacing it, though due to your earlier mention of kernel
>>> panics, I wouldn't rule out board problems either; are there any visible
>>> signs of hardware problems (leaky/bulging capacitors etc.)? Checking the
>>> PSU is probably also worthwhile.
>
>> Hmmm, I just swapped my PSU because the old one (FSP650-80GLC PSU (650
>> watts) from 5/14/2007) died in 12/2009. I recall that days before,
>> something smelled like it was burning, but I couldn't figure out where it
>> came from since I had two desktops. I guess it was the PSU that went poof!
>
> I've had that happen once here. Advice given was to replace the whole lot
> because of possible damage to components, and I can see where that's coming
> from: brief over-voltage or over-current. (Would anybody who knows more about
> your typical switched-mode PSU care to comment?)

:( It sounds common I guess. I ran memtest86+ v4.00 overnight for over
five hours. It completed two passes and was almost done with the third
one (on test 8). I guess the RAM is still OK!


>> At the same time, my EVGA GeForce 8800 GT video card had to be RMA'ed
>> because it didn't work anymore: even with the new PSU, the box still
>> wouldn't boot up at all.
>
> Dead card, due to The Way of the Exploding PSU?

I guess so, since it stopped working right after the PSU went dead and
was replaced with a new one. Or a coincidence?


>> After getting an RMA'ed refurbished video card back, my box was fine for a
>> bit and then got kernel panics once in a while. Then they seemed to slowly
>> become more frequent. One day in February, I ran memtest86+ v4.00 for
>> about five hours and it found lots of errors. My friend and I narrowed it
>> down to a 512 MB RAM module
>
> I've seen bad RAM before. On visual inspection, it looks exactly like good
> RAM.

Yeah. It's old too (four years I think)!


>> and were left with 2.5 GB remaining (still plenty for an old
>> Linux workstation!). Oh, and we didn't see anything burned, busted, etc.
>
> That's the thing. It might not *look* damaged...

Right, but you asked if there was any visible physical damage. :P


>> It sounds like that PSU bust damaged a lot of my hardware. Argh! I
>> don't have the time and resources to build another one
>
> Yet you have the time to respond here. ;-)

That's faster. Sometimes I do it from work too. :P


>> Hmm, I wonder if that 512 MB RAM that memtest86 flagged as having errors
>> wasn't actually bad?
>
> Chances are that memtest86 was right. (I can see how bad memory might cause
> incorrect TLB entries, but not parity errors.)

So parity errors come from the CPU only? I am not an expert in the
hardware area.


>>>> Maybe that's why memtest86+ didn't find any problems last week.
>>> That doesn't seem to be relevant.
>
>> Why do you say that? I am going to run it again soon to double check.
>
> It's testing the memory, and (probably) isn't making use of logical
> addressing. If it isn't, then it's not going to be making use of the TLB, so
> it's not going to cause MCEs. (Or perhaps they *were* happening, but
> memtest86+ was ignoring them.)

So how can I test this with another bootable tool like memtest86+?


>> Interesting. I wish Linux's kernel panics would log to a file like
>> Windows' memory dumps from blue screens, so I could use a debugger to see
>> what the dumps say.
>
> Logging to a file isn't an option (at this point, things are probably too far
> gone for this to be practical); but they could, perhaps, be stored in some
> non-volatile memory. (You'd need at least 16K for this, ideally 64K or more;
> and I don't think that there's enough in your typical PC RTC.)

Bummer. I am surprised Linux doesn't do this, but MS does with its
NT-based Windows.
--
"To the gods I am an ant, but to the ants, I am a god." --unknown
From: Ant on
I ran memtest86+ v4.00 overnight again for over five hours and got no
errors after two complete passes. The third pass was on test #8, and I
was sure that would pass too, so I aborted it and rebooted to use the box.

I just finished running sys_basher
(http://www.polybus.com/sys_basher_web/), which I have also run in my
Debian a few times in the past. Still no errors or crashes.

Weird stuff.
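
Since the stress tests stay clean, one more thing that could be looked at is when the kernel wrote its "Machine check events logged" lines. Below is a small Python sketch, assuming the syslog line format from the excerpt earlier in the thread. The helper names are made up, and SAMPLE is just the first few lines of that excerpt.

```python
# Sketch only: pull kernel uptimes out of "Machine check events logged" lines
# and look at the spacing between them. Helper names are illustrative.
import re

# Matches e.g. "kernel: [15299.988025] Machine check events logged"
MCE_LINE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] Machine check events logged")


def mce_uptimes(lines):
    """Return the kernel uptime in seconds for each matching log line."""
    matches = (MCE_LINE.search(line) for line in lines)
    return [float(m.group(1)) for m in matches if m]


def gaps(values):
    """Return the differences between consecutive values."""
    return [round(later - earlier, 3) for earlier, later in zip(values, values[1:])]


SAMPLE = """\
Mar 6 01:19:37 foobar kernel: [15299.988025] Machine check events logged
Mar 6 01:42:07 foobar kernel: [16649.989021] Machine check events logged
Mar 6 02:05:19 foobar -- MARK --
Mar 6 02:19:37 foobar kernel: [18899.989024] Machine check events logged
""".splitlines()

if __name__ == "__main__":
    times = mce_uptimes(SAMPLE)
    print(times)
    print(gaps(times))
```

On these sample lines the gaps come out at roughly 1350 and 2250 seconds, both multiples of 150. That might just reflect how often the kernel polls for machine check events rather than when the errors actually happened, so I wouldn't read too much into the exact times.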


On 3/6/2010 9:05 AM PT, Ant typed:

> Uh oh. I just discovered mcelog and something new and scary in its
> /var/log/syslog:
> [snip]
--
"When the ant grows wings it is about to die." --Arabic
From: Ant on
On 3/7/2010 7:08 PM PT, Darren Salt typed:

> Yes, on the grounds that it's not worth looking further if you see obvious
> damage. :-þ

No obvious damage a few weeks ago when my friend (a hardware person)
and I checked.


> [snip]
>>>> Hmm, I wonder if that 512 MB RAM that memtest86 flagged as having errors
>>>> wasn't actually bad?
>>> Chances are that memtest86 was right. (I can see how bad memory might
>>> cause incorrect TLB entries, but not parity errors.)
>
>> So parity errors come from the CPU only? I am not an expert in the
>> hardware area.
>
> If you happen to be using ECC RAM, errors can be reported from that too.
> Hopefully, they'd be correctable ones...

Hmm, I don't know if my RAM uses ECC? How can I check?
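
One way to check from within Linux is probably dmidecode, which (run as root) prints what the BIOS reports about the memory. Here is a rough Python sketch that pulls the ECC-related fields out of that output. It assumes the usual "Error Correction Type", "Total Width" and "Data Width" field names; the helper names are made up, and SAMPLE is an invented illustration, not output from this box.

```python
# Sketch only: summarise ECC-related fields from `dmidecode --type memory`.
# SAMPLE below is a made-up illustration, not real output from any machine.
import re
import subprocess


def read_dmidecode():
    """Run dmidecode for the memory tables (needs root) and return its text."""
    result = subprocess.run(["dmidecode", "--type", "memory"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def ecc_summary(dmidecode_text):
    """Return the reported correction types and a per-module width check."""
    correction = re.findall(r"Error Correction Type:\s*(.+)", dmidecode_text)
    total = [int(n) for n in re.findall(r"Total Width:\s*(\d+) bits", dmidecode_text)]
    data = [int(n) for n in re.findall(r"Data Width:\s*(\d+) bits", dmidecode_text)]
    # ECC modules carry extra check bits, so total width exceeds data width.
    wider = [t > d for t, d in zip(total, data)]
    return {"correction_types": [c.strip() for c in correction],
            "extra_check_bits": wider}


SAMPLE = """\
Physical Memory Array
    Error Correction Type: None
Memory Device
    Total Width: 64 bits
    Data Width: 64 bits
"""

if __name__ == "__main__":
    print(ecc_summary(SAMPLE))
```

On a real system, ecc_summary(read_dmidecode()) would be the call to try. A correction type of "None" and equal total and data widths would suggest plain non-ECC RAM, though the BIOS tables are not always accurate.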


>> So how can I test this with another bootable tool like memtest86+?
>
> Boot from USB or CD, drop to a text console, stress it with a kernel compile
> or something (preferably without touching disk). Wait. :-)

Hmm, I did that in my regular Debian and had no problems! I used
sys_basher, unrar'ed 10 GB of data, etc. I can't make it happen with
stress tests. Most of the kernel panics happened when the box was idle! :D
--
"The Hunam Tiger ant has been known to consume an entire meal before the
picnic guest arrive." --12th century Tang Dynasty proverb.