"TLB parity error in virtual array; TLB error 'instruction"? [Linux Hardware]

Prev: gpm mouse event codes?
Next: "FATAL: Module i2c_nforce2 not found." and "FATAL: Module i2c_dev not found." in sensors-detect. [resolved]

From: ANTant on 16 Mar 2010 16:02

>> Having a better look through your logs, I see this addr is
>> very common (almost all errs are at this addr). Aren't
>> you curious about the instruction that produced the errors?
>> /boot/System.map should contain the addr of all kernel fns,
>> and there should be some way to lookup modules.
>
> I did a "cat /var/log/messages |grep ADDR" and found these addresses:
> c104e3f0
> c106e8c0
> c11b6ff0 (most common)
>
> But none of them matched to /boot/System.map-2.6.32-trunk-686. Here are
> close addresses around them for each one:
>
> c104e2f9 T tick_handle_periodic
> c104e360 T tick_get_broadcast_device
>
> c1063e1b t stop_cpu
> c1063ec6 T stop_machine_destroy
>
> c11b6fb8 T acpi_pm_read_verified
> c11b6ffc t acpi_pm_read

Since upgrading the kernel (to 2.6.32-3 from -2 trunk) yesterday morning,
I have noticed a new address in my /var/log/messages (only one so far):
Mar 16 05:41:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Mar 16 05:41:16 foobar mcelog: Please contact your hardware vendor
Mar 16 05:41:16 foobar mcelog: MCE 0
Mar 16 05:41:16 foobar mcelog: CPU 1 1 instruction cache
Mar 16 05:41:16 foobar mcelog: ADDR c104e570
Mar 16 05:41:16 foobar mcelog: TIME 1268743276 Tue Mar 16 05:41:16 2010
Mar 16 05:41:16 foobar mcelog: TLB parity error in virtual array
Mar 16 05:41:16 foobar mcelog: TLB error 'instruction transaction, level 1'
Mar 16 05:41:16 foobar mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 16 05:41:16 foobar mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 16 05:41:16 foobar mcelog: CPUID Vendor AMD Family 15 Model 43
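
For readers who want to check how the STATUS value relates to the text mcelog prints, here is a minimal Python sketch. It assumes the usual x86 machine-check layout of the status register (flag bits 63 down to 57, MCA error code in the low 16 bits, and TLB error codes of the form 0000 0000 0001 TTLL). The function and table names are made up for illustration and are not part of mcelog.

```python
# Hypothetical helper, not mcelog code: decode a machine-check STATUS value.
# Assumed layout: flag bits 63..57, MCA error code in bits 15..0.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}
TT = {0: "instruction", 1: "data", 2: "generic"}                 # transaction type
LL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}

def decode_status(status):
    """Return (set flag names, MCA error code, description or None)."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    desc = None
    if (code & 0xFFF0) == 0x0010:          # TLB error class: 0000 0000 0001 TTLL
        tt = (code >> 2) & 0x3
        ll = code & 0x3
        desc = f"TLB error, {TT.get(tt, 'reserved')} transaction, {LL[ll]}"
    return flags, code, desc

flags, code, desc = decode_status(0x9400000000010011)
print(flags, hex(code), desc)
```

With the STATUS from the log above, the UC (uncorrected) flag comes out clear and ADDRV is set, which under these assumptions means the hardware logged the event as corrected and that the ADDR field is valid.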

# ls -all /boot/System.map-2.6.32-3-686
-rw-r--r-- 1 root root 1259340 2010-02-25 01:00 /boot/System.map-2.6.32-3-686

I assume the contents of both the kernel and System.map changed with the
upgrade. I looked up the c104e570 address; the closest addresses were:
# cat /boot/System.map-2.6.32-3-686 |grep c104e
c104e07d t tick_notify
c104e374 t tick_periodic
c104e3dd T tick_handle_periodic
c104e444 T tick_get_broadcast_device
c104e44a T tick_get_broadcast_mask
c104e450 T tick_is_broadcast_device
c104e464 T tick_set_periodic_handler
c104e477 T tick_get_broadcast_oneshot_mask
c104e47d T tick_broadcast_oneshot_active
c104e48a T tick_shutdown_broadcast_oneshot
c104e4ac T tick_check_oneshot_broadcast
c104e4d5 T tick_resume_broadcast_oneshot
c104e4e2 T tick_broadcast_setup_oneshot
c104e5ae T tick_broadcast_switch_to_oneshot
c104e5e0 t tick_do_broadcast
c104e634 t tick_handle_oneshot_broadcast
c104e71d t tick_do_periodic_broadcast
c104e74a T tick_broadcast_oneshot_control
c104e82c T tick_resume_broadcast
c104e8a3 T tick_device_uses_broadcast
c104e91b T tick_suspend_broadcast
c104e943 T tick_shutdown_broadcast
c104e989 t tick_handle_periodic_broadcast
c104e9ce T tick_broadcast_on_off
c104eb0e T tick_check_broadcast_device
c104eb60 T tick_oneshot_mode_active
c104eb96 T tick_switch_to_oneshot
c104ec1e T tick_init_highres
c104ec28 T tick_dev_program_event
c104eca9 T tick_setup_oneshot
c104ecd9 T tick_program_event
c104ecfc T tick_resume_oneshot
c104ed24 T tick_get_tick_sched
c104ed33 T tick_nohz_get_sleep_length
c104ed4c T tick_oneshot_notify
c104ed63 t tick_init_jiffy_update
c104edae T tick_check_oneshot_change
c104eea1 t tick_do_update_jiffies64
c104ef87 t tick_nohz_handler
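
An address reported by mcelog will rarely match a System.map entry exactly, because the map lists only the start address of each symbol. The usual approach is to take the last symbol at or below the address and report the offset into it. A minimal Python sketch of that lookup follows; the function names are hypothetical, the sample is four lines copied from the listing above, and it assumes the ADDR value is a kernel virtual address covered by System.map (module code is not listed there).

```python
import bisect

def load_symbols(lines):
    """Parse System.map-style lines ("addr type name") into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue                      # skip blank or malformed lines
        addr, _type, name = parts
        symbols.append((int(addr, 16), name))
    symbols.sort()
    return symbols

def resolve(symbols, addr):
    """Return (name, offset) for the last symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None                       # addr lies below the first symbol
    base, name = symbols[i]
    return name, addr - base

# Four entries copied from the System.map-2.6.32-3-686 listing above.
SAMPLE = """\
c104e4d5 T tick_resume_broadcast_oneshot
c104e4e2 T tick_broadcast_setup_oneshot
c104e5ae T tick_broadcast_switch_to_oneshot
c104e5e0 t tick_do_broadcast
"""

syms = load_symbols(SAMPLE.splitlines())
name, offset = resolve(syms, 0xc104e570)
print(f"{name}+{offset:#x}")   # prints tick_broadcast_setup_oneshot+0x8e
```

To run it against the real file, pass `open("/boot/System.map-2.6.32-3-686")` to `load_symbols` instead of the sample text.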

A quick Google search
(http://www.google.com/search?q=linux+kernel+tick+broadcast) suggests
these functions are related to the APIC? Does anyone know what these
ticks do that could cause these rare and random machine errors and
kernel panics? The address seems to hang out in the broadcast area.
Again, I am not familiar with hardware. :(
--
"We are anthill men upon an anthill world." --Ray Bradbury
/\___/\
/ /\ /\ \ Phillip (Ant) @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links (AQFL): http://aqfl.net
\ _ / Please remove ANT if replying by e-mail.
( )

From: Trevor Hemsley on 16 Mar 2010 18:39

On Tue, 16 Mar 2010 20:02:49 UTC in comp.os.linux.hardware, ANTant(a)zimage.com
wrote:

> Does anyone know what these ticks do to cause
> these rare and random machine errors and kernel panics?

No, but everything about those errors looks hardware related, so I'd be looking at
replacing the cpu at the very least. That looks like the most likely component
but it's not necessarily the right one - other bits that spring to mind are
motherboard, PSU and RAM.

--
Trevor Hemsley, Brighton, UK
Trevor dot Hemsley at ntlworld dot com

From: ANTant on 16 Mar 2010 20:19

>> Does anyone know what these ticks do to cause
>> these rare and random machine errors and kernel panics?
>
> No but everything about those errors looks hardware related so I'd be looking at
> replacing the cpu at the very least. That looks like the most likely component
> but it's not necessarily the right one - other bits that spring to mind are
> motherboard, PSU and RAM.

Yeah, it is probably my CPU, since my PSU and video card died and a 512
MB RAM stick showed memory errors in memtest86+ v4.00 before these
problems appeared. After I replaced all of them, memtest86+ v4.00 passed
a few times, over several hours and a few days of testing (including its
test #9).
--
"We are anthill men upon an anthill world." --Ray Bradbury
/\___/\
/ /\ /\ \ Phillip (Ant) @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links (AQFL): http://aqfl.net
\ _ / Please remove ANT if replying by e-mail.
( )

From: Ant on 18 Mar 2010 07:47

There's something else I noticed over the last few days (this week so
far) that might be related to my Linux/Debian machine errors and kernel
panics:

Mar 14 21:11:53
Mar 16 05:41:16

/var/log/messages showed only these two machine errors for this week so
far. I usually get them daily, and several a day when it is really bad.
Also, I haven't had a kernel panic for a while, but that is probably
because I have manually rebooted a lot. I currently have almost three
days of uptime, and the panics usually come when I have about a week or
so.

The only thing different is the weather: temperatures are much higher.
My room has been about 80F lately (yeah, too warm) without the windows
open or the fan on.

Before this week, ever since the issues started, it was much cooler (mid
60-70F in my room). Remember how I said my issues usually come up during
idle times and not during stress times? I wonder if there is a
relationship with temperature. I checked weather.com's calendar of past
temperatures for my city, and they seem to match. It doesn't look like
the weather will be cold again for a while either, since spring is here.
I am going to keep watching this pattern.
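
One way to keep watching the pattern is to count machine-check events per day and compare the counts with room temperature by hand. A minimal Python sketch, assuming each event produces exactly one "mcelog: HARDWARE ERROR" line in the syslog format shown earlier; the function name is hypothetical, and the Mar 14 sample line is built from the timestamp above for illustration rather than copied from a log.

```python
from collections import Counter

def mce_events_per_day(lines):
    """Count mcelog events per (month, day), one per HARDWARE ERROR line."""
    counts = Counter()
    for line in lines:
        if "mcelog: HARDWARE ERROR" in line:
            month, day = line.split()[:2]     # syslog lines start "Mar 16 ..."
            counts[(month, int(day))] += 1
    return counts

# Illustrative sample in /var/log/messages format.
SAMPLE = [
    "Mar 14 21:11:53 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!",
    "Mar 16 05:41:16 foobar mcelog: HARDWARE ERROR. This is *NOT* a software problem!",
    "Mar 16 05:41:16 foobar mcelog: Please contact your hardware vendor",
]

for (month, day), n in sorted(mce_events_per_day(SAMPLE).items()):
    print(month, day, n)
```

Syslog timestamps carry no year, so this only makes sense within a single log rotation window.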
--
"If I want to be a robber, I rob the king's treasury. If I want to be a
hunter, I hunt the rhino. What is the use of robbing beggars and hunting
ants? So if you want to love, love God." --Swami Vivekananda
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.

From: Ant on 18 Mar 2010 08:33

Weird. I just noticed this in my dmesg and have no idea whether it is
bad or not:

[246348.660025] Clocksource tsc unstable (delta = -62500120 ns)

I checked previous logs, and none of them have it, so it might have been
a hiccup?

# cat /var/log/messages* |grep clocksource
(logs go all the way back to 2/28/2010 6:47:02 AM PST)
Mar 5 06:41:19 foobar kernel: [ 0.241186] Switching to clocksource jiffies
Mar 5 06:41:19 foobar kernel: [ 0.281777] Switching to clocksource acpi_pm
Mar 5 21:05:19 foobar kernel: [ 0.241193] Switching to clocksource jiffies
Mar 5 21:05:19 foobar kernel: [ 0.281790] Switching to clocksource acpi_pm
Mar 7 07:30:45 foobar kernel: [ 0.241186] Switching to clocksource jiffies
Mar 7 07:30:45 foobar kernel: [ 0.281778] Switching to clocksource acpi_pm
Mar 8 07:43:15 foobar kernel: [ 0.241194] Switching to clocksource jiffies
Mar 8 07:43:15 foobar kernel: [ 0.281782] Switching to clocksource acpi_pm
Mar 11 00:29:19 foobar kernel: [ 0.240922] Switching to clocksource jiffies
Mar 11 00:29:19 foobar kernel: [ 0.281516] Switching to clocksource acpi_pm
Mar 12 05:45:36 foobar kernel: [ 0.237194] Switching to clocksource jiffies
Mar 12 05:45:36 foobar kernel: [ 0.277790] Switching to clocksource acpi_pm
Mar 12 23:57:13 foobar kernel: [ 0.241187] Switching to clocksource jiffies
Mar 12 23:57:13 foobar kernel: [ 0.281779] Switching to clocksource acpi_pm
Mar 15 00:32:48 foobar kernel: [ 0.237192] Switching to clocksource jiffies
Mar 15 00:32:48 foobar kernel: [ 0.277782] Switching to clocksource acpi_pm
Mar 15 01:16:00 foobar kernel: [ 0.237290] Switching to clocksource jiffies
Mar 15 01:16:00 foobar kernel: [ 0.277886] Switching to clocksource acpi_pm
Mar 15 08:25:09 foobar kernel: [ 0.242800] Switching to clocksource jiffies
Mar 15 08:25:09 foobar kernel: [ 0.283406] Switching to clocksource acpi_pm
Mar 15 08:31:58 foobar kernel: [ 0.242802] Switching to clocksource jiffies
Mar 15 08:31:58 foobar kernel: [ 0.283405] Switching to clocksource acpi_pm
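
One caveat about the search above: grep is case-sensitive by default, so the pattern "clocksource" cannot match a line that begins "Clocksource tsc unstable" with a capital C. Earlier occurrences of that message, if there were any, would therefore not show up in this output. A minimal Python sketch of a case-insensitive search follows (equivalent in spirit to grep -i; the helper name is hypothetical, and the two sample lines are taken from this post).

```python
import re

def search_lines(pattern, lines, ignore_case=True):
    """Return the lines matching pattern, case-insensitively by default."""
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    return [line for line in lines if rx.search(line)]

SAMPLE = [
    "[246348.660025] Clocksource tsc unstable (delta = -62500120 ns)",
    "Mar 5 06:41:19 foobar kernel: [ 0.241186] Switching to clocksource jiffies",
]

print(len(search_lines("clocksource", SAMPLE, ignore_case=False)))  # prints 1
print(len(search_lines("clocksource", SAMPLE)))                     # prints 2
```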

I did a quick Google search and found
https://lists.ubuntu.com/archives/ubuntu-users/2009-February/175828.html
with these commands:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
acpi_pm
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
acpi_pm

I don't know if this is something to worry about or a new clue.
--
"An anthill increases by accumulation. / Medicine is consumed by
distribution. / That which is feared lessens by association. / This is
the thing to understand." --Siddha Nagarjuna
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: philpi(a)earthlink.netANT
( ) or ANTant(a)zimage.com
Ant is currently not listening to any songs on his home computer.
