"TLB parity error in virtual array; TLB error 'instruction"? [Linux Hardware]

From: Ant on 30 Apr 2010 17:39

On Mar 16, 1:02 pm, ANT...(a)zimage.com wrote:
> >> Having a better look through your logs, I see this addr is
> >> very common (almost all errs are at this addr). Aren't
> >> you curious about the instruction that produced the errors?
> >> /boot/System.map should contain the addr of all kernel fns,
> >> and there should be some way to lookup modules.
>
> > I did a "cat /var/log/messages |grep ADDR" and found these addresses:
> > c104e3f0
> > c106e8c0
> > c11b6ff0 (most common)
>
> > But none of them matched an entry in /boot/System.map-2.6.32-trunk-686. Here
> > are the closest addresses around each one:
>
> > c104e2f9 T tick_handle_periodic
> > c104e360 T tick_get_broadcast_device
>
> > c1063e1b t stop_cpu
> > c1063ec6 T stop_machine_destroy
>
> > c11b6fb8 T acpi_pm_read_verified
> > c11b6ffc t acpi_pm_read
>
> Since I did a Kernel upgrade (2.6.32-3 from -2 trunk) yesterday morning,
> I noticed a new address in my /var/log/messages (only one so far):
> Mar 16 05:41:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
> Mar 16 05:41:16 foobar mcelog: Please contact your hardware vendor
> Mar 16 05:41:16 foobar mcelog: MCE 0
> Mar 16 05:41:16 foobar mcelog: CPU 1 1 instruction cache
> Mar 16 05:41:16 foobar mcelog: ADDR c104e570
> Mar 16 05:41:16 foobar mcelog: TIME 1268743276 Tue Mar 16 05:41:16 2010
> Mar 16 05:41:16 foobar mcelog: TLB parity error in virtual array
> Mar 16 05:41:16 foobar mcelog: TLB error 'instruction transaction, level 1'
> Mar 16 05:41:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 16 05:41:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 16 05:41:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
>
> # ls -all /boot/System.map-2.6.32-3-686
> -rw-r--r-- 1 root root 1259340 2010-02-25 01:00 /boot/System.map-2.6.32-3-686
>
> I am going to assume the contents changed in both the kernel and System.map. I looked up that c104e570 address. The closest addresses were:
> # cat /boot/System.map-2.6.32-3-686 |grep c104e
> c104e07d t tick_notify
> c104e374 t tick_periodic
> c104e3dd T tick_handle_periodic
> c104e444 T tick_get_broadcast_device
> c104e44a T tick_get_broadcast_mask
> c104e450 T tick_is_broadcast_device
> c104e464 T tick_set_periodic_handler
> c104e477 T tick_get_broadcast_oneshot_mask
> c104e47d T tick_broadcast_oneshot_active
> c104e48a T tick_shutdown_broadcast_oneshot
> c104e4ac T tick_check_oneshot_broadcast
> c104e4d5 T tick_resume_broadcast_oneshot
> c104e4e2 T tick_broadcast_setup_oneshot
> c104e5ae T tick_broadcast_switch_to_oneshot
> c104e5e0 t tick_do_broadcast
> c104e634 t tick_handle_oneshot_broadcast
> c104e71d t tick_do_periodic_broadcast
> c104e74a T tick_broadcast_oneshot_control
> c104e82c T tick_resume_broadcast
> c104e8a3 T tick_device_uses_broadcast
> c104e91b T tick_suspend_broadcast
> c104e943 T tick_shutdown_broadcast
> c104e989 t tick_handle_periodic_broadcast
> c104e9ce T tick_broadcast_on_off
> c104eb0e T tick_check_broadcast_device
> c104eb60 T tick_oneshot_mode_active
> c104eb96 T tick_switch_to_oneshot
> c104ec1e T tick_init_highres
> c104ec28 T tick_dev_program_event
> c104eca9 T tick_setup_oneshot
> c104ecd9 T tick_program_event
> c104ecfc T tick_resume_oneshot
> c104ed24 T tick_get_tick_sched
> c104ed33 T tick_nohz_get_sleep_length
> c104ed4c T tick_oneshot_notify
> c104ed63 t tick_init_jiffy_update
> c104edae T tick_check_oneshot_change
> c104eea1 t tick_do_update_jiffies64
> c104ef87 t tick_nohz_handler
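
For what it's worth, the STATUS value in the quoted mcelog output can be decoded by hand. The sketch below assumes the standard x86 machine-check layout for the MCi_STATUS register (flag bits 63 down to 57, and the TLB error-code pattern 0000 0000 0001 TTLL in the low 16 bits). The function and table names are made up for illustration; this is not how mcelog itself is implemented.

```python
# Sketch: decode the architectural bits of an x86 MCi_STATUS value.
# Names here (STATUS_FLAGS, decode_status) are illustrative, not from mcelog.

STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

TLB_TT = {0: "instruction", 1: "data", 2: "generic"}
TLB_LL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_status(status):
    """Return the flag bits and, for TLB errors, the transaction and level."""
    flags = {name: bool((status >> bit) & 1) for bit, name in STATUS_FLAGS.items()}
    code = status & 0xFFFF  # MCA error code lives in the low 16 bits
    result = {"flags": flags, "error_code": code, "kind": "other"}
    # TLB errors use the pattern 0000 0000 0001 TTLL.
    if (code & 0xFFF0) == 0x0010:
        result["kind"] = "TLB"
        result["transaction"] = TLB_TT.get((code >> 2) & 3, "reserved")
        result["level"] = TLB_LL[code & 3]
    return result


if __name__ == "__main__":
    decoded = decode_status(0x9400000000010011)
    print(decoded["kind"], decoded["transaction"], decoded["level"])
    print({name: value for name, value in decoded["flags"].items() if value})
```

Under those assumptions, the value 9400000000010011 has VAL, EN and ADDRV set and UC clear, which would mean the CPU reported the error and logged an address, and the hardware corrected it.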

About 1.5 months later, I compared the last two weeks of logs from
two different 2.6.32 i686 kernel packages (-3 and -4).

-3:
Apr 20 04:13:52 mcelog: ADDR c104e500
Apr 14 01:36:16 mcelog: ADDR c104e530
Apr 16 06:03:52 mcelog: ADDR c104e540
Apr 20 02:51:22 mcelog: ADDR c104e570
/boot/System.map-2.6.32-3-686 showed:
c104e4e2 T tick_broadcast_setup_oneshot
c104e5ae T tick_broadcast_switch_to_oneshot

Apr 13 23:58:46 mcelog: ADDR c104f2c0
/boot/System.map-2.6.32-4-686 showed:
c104f2bb T tick_check_idle
c104f32f T tick_nohz_restart_sched_tick

Most of the addresses in /var/log/messages were c104e570 under kernel
2.6.32-3.

-4 has four days and 21 hours of uptime after upgrading the kernel and
rebooting. So far, only two machine errors and no kernel panics:
Apr 27 09:00:20 mcelog: ADDR c1046d30
/boot/System.map-2.6.32-4-686 showed:
c1046cae T hrtimer_interrupt
c1046e08 t __hrtimer_peek_ahead_timers

Apr 30 07:17:50 mcelog: ADDR c106ee80
/boot/System.map-2.6.32-4-686 showed:
c106ee3c T rcu_irq_enter
c106ee88 T rcu_nmi_exit

Completely different now. Weird.
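
A note on reading these lookups: System.map only lists the address where each symbol starts, so an ADDR in the middle of a function will never match a line exactly. The function that contains it is the last symbol starting at or below the address. A minimal sketch of that lookup follows. It assumes the ADDR values are kernel virtual addresses comparable to System.map entries, as this thread treats them, and the helper names are hypothetical.

```python
import bisect


def parse_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (start, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def containing_symbol(symbols, addr):
    """Return (name, offset) of the last symbol starting at or below addr."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None  # addr lies below the first symbol
    start, name = symbols[i]
    return name, addr - start


# A few lines copied from the System.map-2.6.32-3-686 listing in this thread.
sample = """\
c104e4d5 T tick_resume_broadcast_oneshot
c104e4e2 T tick_broadcast_setup_oneshot
c104e5ae T tick_broadcast_switch_to_oneshot
""".splitlines()

symbols = parse_system_map(sample)
name, offset = containing_symbol(symbols, 0xC104E570)
print(f"{name}+0x{offset:x}")  # prints "tick_broadcast_setup_oneshot+0x8e"
```

To run it against a real map, pass `open("/boot/System.map-2.6.32-3-686")` to `parse_system_map` in place of the sample. Addresses inside loadable modules will not resolve this way, since modules are not listed in System.map.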

From: Ant on 30 Apr 2010 17:41

On Apr 25, 5:44 pm, "Trevor Hemsley"
<Trevor.Hems...(a)mytrousers.ntlworld.com> wrote:
> On Sun, 25 Apr 2010 21:06:30 UTC in comp.os.linux.hardware, Ant
>
> <a...(a)zimage.comANT> wrote:
> > > So how much longer before you acknowledge that your CPU is defunct and
> > > replace it?
>
> > Well, I haven't been able to reproduce it outside of my Debian box yet.
> > How do I know it is not my old Debian OS?
>
> Because you don't get machine checks from software related problems. This is
> hardware, pick (at least) one of:
>
> Processor
> Motherboard
> Power Supply

That still doesn't make sense if I can see these errors and kernel
panics in my Debian installation from 2005 but can't see them
outside of it. The uptime outside of it is probably not long enough
(the longest is 15.5 hours on an Ubuntu LiveCD).

From: Trevor Hemsley on 1 May 2010 06:14

On Fri, 30 Apr 2010 21:41:08 UTC in comp.os.linux.hardware, Ant
<antdude(a)gmail.com> wrote:

> On Apr 25, 5:44 pm, "Trevor Hemsley"
> <Trevor.Hems...(a)mytrousers.ntlworld.com> wrote:
> > On Sun, 25 Apr 2010 21:06:30 UTC in comp.os.linux.hardware, Ant
> >
> > <a...(a)zimage.comANT> wrote:
> > > > So how much longer before you acknowledge that your CPU is defunct and
> > > > replace it?
> >
> > > Well, I haven't been able to reproduce it outside of my Debian box yet.
> > > How do I know it is not my old Debian OS?
> >
> > Because you don't get machine checks from software related problems. This is
> > hardware, pick (at least) one of:
> >
> > Processor
> > Motherboard
> > Power Supply
>
> That still doesn't make sense if I can see these errors and kernel
> panics in my Debian installation from 2005 but can't see them
> outside of it. The uptime outside of it is probably not long enough
> (the longest is 15.5 hours on an Ubuntu LiveCD).

Dunno but I'd lay odds that this is a hardware problem and the longer you delay
fixing it, the more chance it has to corrupt data for you.

Maybe the liveCD you tested with had no machine check code built into its
kernel?

You can try and narrow it down all you like but I suspect that in the end you
are going to need to replace things. The cheapest thing to replace is the PSU
but it's nearly as much work to do that as it is to replace the motherboard. And
if you try to replace the cpu I am not sure you will find another one that you
can buy new so you may be restricted to getting one from Ebay. So at that point
do you just cut your losses and replace all 3 and the RAM at the same time -
since that's likely to be incompatible and also a possible source of this
problem.

I did read about your old PSU frying things and they can do nasty things before
they actually go - if they produce voltage spikes then they could be causing
damage to lots of things. It could have taken the cpu with it, or the
motherboard or the RAM. Or it might be that your motherboard is of an age to
suffer from the capacitor problem which causes them to bulge and leak and cause
all sorts of strange things to happen.

What makes you so sure it's a software problem when all the symptoms point to it
being hardware?

--
Trevor Hemsley, Brighton, UK
Trevor dot Hemsley at ntlworld dot com

From: Ant on 1 May 2010 18:10

On 5/1/2010 3:14 AM PT, Trevor Hemsley typed:

> Maybe the liveCD you tested with had no machine check code built into its
> kernel?

Isn't that what installing the mcelog package is supposed to do? I did
that in my old Debian OS.

> I did read about your old PSU frying things and they can do nasty things before
> they actually go - if they produce voltage spikes then they could be causing
> damage to lots of things. It could have taken the cpu with it, or the
> motherboard or the RAM. Or it might be that your motherboard is of an age to
> suffer from the capacitor problem which causes them to bulge and leak and cause
> all sorts of strange things to happen.

My friend and I didn't see anything odd on the motherboard, CPU, etc.

> What makes you so sure it's a software problem when all the symptoms point to it
> being hardware?

Because I couldn't reproduce these kernel panics and machine errors
outside of my Debian installation. Once I confirm I can reproduce them
outside, then I will definitely know it is a hardware issue.
--
"Look not to the windmill's turning while the ant still burrows." --unknown
/\___/\ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
/ /\ /\ \ Ant's Quality Foraged Links: http://aqfl.net
| |o o| |
\ _ / If crediting, then use Ant nickname and AQFL URL/link.
( ) If e-mailing, then axe ANT from its address if needed.
Ant is currently not listening to any songs on this computer.

From: Trevor Hemsley on 1 May 2010 18:50

On Sat, 1 May 2010 22:10:09 UTC in comp.os.linux.hardware, Ant
<ant(a)zimage.comANT> wrote:

> On 5/1/2010 3:14 AM PT, Trevor Hemsley typed:
>
> > Maybe the liveCD you tested with had no machine check code built into its
> > kernel?
>
> Isn't that what installing the mcelog package is supposed to do? I did
> that in my old Debian OS.

There is a kernel compile time option to report them and if the kernel has no
support then I do not know what mcelog would do. I suspect that you'd get an
error because the device that mcelog reads would not exist. Ah, the man page is
illuminating: it has an --ignorenodev option, which implies that mcelog will
tell you when there is no device unless that option is specified.
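
A quick way to check both halves of that on a given boot (LiveCD or installed system) is sketched below. It assumes the device mcelog reads is /dev/mcelog and that the distribution ships its kernel config as /boot/config-<release>, which is the Debian and Ubuntu convention but not guaranteed elsewhere. The function name is made up for illustration.

```python
import os
import platform


def mce_support(dev_path="/dev/mcelog", config_path=None):
    """Report whether the mcelog device exists and whether the kernel
    config (if readable) has CONFIG_X86_MCE built in."""
    if config_path is None:
        # Assumed location of the running kernel's config file.
        config_path = f"/boot/config-{platform.release()}"
    report = {"device_present": os.path.exists(dev_path), "config_mce": None}
    try:
        with open(config_path) as cfg:
            report["config_mce"] = any(
                line.strip() == "CONFIG_X86_MCE=y" for line in cfg
            )
    except OSError:
        pass  # config file missing or unreadable: leave as None (unknown)
    return report


if __name__ == "__main__":
    print(mce_support())
```

If `device_present` comes back False on the LiveCD, a clean log there says nothing about the hardware, because the kernel had no way to hand machine checks to mcelog.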

> > I did read about your old PSU frying things and they can do nasty things before
> > they actually go - if they produce voltage spikes then they could be causing
> > damage to lots of things. It could have taken the cpu with it, or the
> > motherboard or the RAM. Or it might be that your motherboard is of an age to
> > suffer from the capacitor problem which causes them to bulge and leak and cause
> > all sorts of strange things to happen.
>
> My friend and I didn't see anything odd on the motherboard, CPU, etc.

Who said the damage has to be visible? The traces inside computer chips are
measured in nanometers. I remember replacing the power supply on my nephew's AMD
Athlon 2000 machine because it was going flakey and finding that it made no
difference to the crashes that it was having - those only went away when I
replaced the cpu which had been fried by the dodgy PSU. No visible damage on
that either.

> > What makes you so sure it's a software problem when all the symptoms point to it
> > being hardware?
>
> Because I couldn't reproduce these kernel panics and machine errors
> outside of my Debian installation. Once I confirm I can reproduce them
> outside, then I will definitely know it is a hardware issue.

How many other Debian users do you see complaining of this sort of error? I've
not seen any, which means that it's not a generic problem with a Debian
installation. I did a search last night for your sort of error messages and in
the first couple of pages of google results, I found you in several places and a
guy reporting a similar thing in 2005/6 who reported that his symptoms went away
when he replaced the RAM on the machine.

I have never seen mcelog report false errors. I have seen it tell me that I had
a problem with overheating and it was correct.

--
Trevor Hemsley, Brighton, UK
Trevor dot Hemsley at ntlworld dot com
