From: Andrew Gabriel on
In article <hurjeh$r53$1(a)canopus.cc.umanitoba.ca>,
Gary Mills <mills(a)cc.umanitoba.ca> writes:
> We had a reboot recently that was a result of this hardware fault:
>
> --------------- ------------------------------------ -------------- ---------
> TIME EVENT-ID MSG-ID SEVERITY
> --------------- ------------------------------------ -------------- ---------
> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>
> Fault class : fault.cpu.intel.nb.fsb
> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
> faulty
>
> How do I determine which CPU or core is at fault? This is on an E4450

Look at the output from /usr/lib/fm/fmd/fmtopo -V
for the same FRU and see if that entry tells you which socket.
Also, you might find the system is numbering the chips 0, 2, 4, 6 in
the fmtopo output, which would make it the third socket.

I believe the fm output has recently been changed to be more helpful
in this case, but I don't know if/when that's gone back into S10.
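
As a rough sketch of that lookup, assuming the fmtopo output has been saved as text and that each topology node shows up on a line containing its hc:// path. The function name and the shortened sample paths are made up for illustration; real `fmtopo -V` output also carries property groups, which this ignores.

```python
import re

def find_nodes(fmtopo_lines, component):
    """Return the lines whose hc:// path ends with `component`, e.g. 'chip=4'."""
    pattern = re.compile(r"hc://\S*/" + re.escape(component) + r"$")
    return [line.strip() for line in fmtopo_lines if pattern.search(line.strip())]

# Hypothetical, shortened paths; not captured from a real system.
sample = [
    "hc://:server-id=electra/motherboard=0/chip=2",
    "hc://:server-id=electra/motherboard=0/chip=4",
    "hc://:server-id=electra/motherboard=0/chip=4/cpu=0",
]
matches = find_nodes(sample, "chip=4")
print(matches)
```

Only the chip node itself matches; the cpu= entries beneath it are skipped because the pattern is anchored at the end of the path.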

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]
From: Andrew Gabriel on
In article <huru7f$ing$1(a)reader1.panix.com>,
Cydrome Leader <presence(a)MUNGEpanix.com> writes:
>
> I've seen xeon processors (really cores) fail in solaris before and in
> real life there's nothing wrong at all with the CPU. For intel hardware
> just rebooting seems to be the fix. I suspect it's some sort of software
> issue.

Blimey. We go to extraordinary effort to retrieve and decode all the Intel
chip telemetry (which Intel tell me no other OS has managed to do to
anywhere near the same degree) to ensure you don't get any data corruption
when parts of chips/busses/memory/etc detect error situations, as you'd
expect from an Enterprise grade OS. Then when it happens, someone says

"I suspect it's some sort of software issue."

;-)

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]
From: Gary Mills on
In <hv0osf$s49$1(a)news.eternal-september.org> andrew(a)cucumber.demon.co.uk (Andrew Gabriel) writes:

>In article <hurjeh$r53$1(a)canopus.cc.umanitoba.ca>,
> Gary Mills <mills(a)cc.umanitoba.ca> writes:
>> We had a reboot recently that was a result of this hardware fault:
>>
>> --------------- ------------------------------------ -------------- ---------
>> TIME EVENT-ID MSG-ID SEVERITY
>> --------------- ------------------------------------ -------------- ---------
>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>>
>> Fault class : fault.cpu.intel.nb.fsb
>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>> faulty
>>
>> How do I determine which CPU or core is at fault? This is on an E4450

>Look at the output from /usr/lib/fm/fmd/fmtopo -V
>for the same FRU and see if that entry tells you which socket.
>Also, you might find the system is numbering the chips 0, 2, 4, 6 in
>the fmtopo output, which would make it the third socket.

Ah, that clears up the confusion. I wasn't sure if `chip' meant CPU
socket or CPU core. I see that individual cores are represented this
way:

motherboard=0/chip=4/cpu=0

Yes, they are numbered 0, 2, 4, 6 in fmtopo and `psrinfo -vp' output.
The board diagram is labelled 0, 1, 2, 3, making it #2 that's faulty.
The FE is going to replace CPU3. I suspect that's the same one.
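
The arithmetic can be sketched like this, assuming the chip IDs really do run 0, 2, 4, 6 (a step of 2) and the board labels run 0, 1, 2, 3 in the same order. Both helper names are hypothetical, and the step is an assumption taken from this thread, not something fmtopo guarantees on other hardware.

```python
import re

def chip_from_fru(fmri):
    """Extract the chip=N component from an hc:// FRU string."""
    match = re.search(r"/chip=(\d+)", fmri)
    if match is None:
        raise ValueError("no chip component in FMRI")
    return int(match.group(1))

def socket_index(chip_id, step=2):
    """Map a chip ID to a zero-based socket label, assuming IDs step by `step`."""
    if chip_id % step != 0:
        raise ValueError("chip ID does not fit the assumed numbering")
    return chip_id // step

# FRU string as reported in the fault above.
fru = ("hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:"
       "server-id=electra/motherboard=0/chip=4")
chip = chip_from_fru(fru)
print(chip, socket_index(chip))  # prints: 4 2
```

Zero-based label 2 is the third socket, which would be CPU3 only if the FE counts from 1.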

--
-Gary Mills- -Unix Group- -Computer and Network Services-
From: Cydrome Leader on
Andrew Gabriel <andrew(a)cucumber.demon.co.uk> wrote:
> In article <huru7f$ing$1(a)reader1.panix.com>,
> Cydrome Leader <presence(a)MUNGEpanix.com> writes:
>>
>> I've seen xeon processors (really cores) fail in solaris before and in
>> real life there's nothing wrong at all with the CPU. For intel hardware
>> just rebooting seems to be the fix. I suspect it's some sort of software
>> issue.
>
> Blimey. We go to extraordinary effort to retrieve and decode all the Intel
> chip telemetry (which Intel tell me no other OS has managed to do to
> anywhere near the same degree) to ensure you don't get any data corruption
> when parts of chips/busses/memory/etc detect error situations, as you'd
> expect from an Enterprise grade OS. Then when it happens, someone says
>
> "I suspect it's some sort of software issue."
>
> ;-)

You work for Sun?

While I agree a machine with a nonrecoverable fault should just crash, I
will point out that writing software to just crash a machine over and over
again without any meaningful error output is in fact a software issue as
well.
From: Andrew Gabriel on
In article <hv1200$hda$5(a)reader1.panix.com>,
Cydrome Leader <presence(a)MUNGEpanix.com> writes:
> Andrew Gabriel <andrew(a)cucumber.demon.co.uk> wrote:
>> In article <huru7f$ing$1(a)reader1.panix.com>,
>> Cydrome Leader <presence(a)MUNGEpanix.com> writes:
>>>
>>> I've seen xeon processors (really cores) fail in solaris before and in
>>> real life there's nothing wrong at all with the CPU. For intel hardware
>>> just rebooting seems to be the fix. I suspect it's some sort of software
>>> issue.
>>
>> Blimey. We go to extraordinary effort to retrieve and decode all the Intel
>> chip telemetry (which Intel tell me no other OS has managed to do to
>> anywhere near the same degree) to ensure you don't get any data corruption
>> when parts of chips/busses/memory/etc detect error situations, as you'd
>> expect from an Enterprise grade OS. Then when it happens, someone says
>>
>> "I suspect it's some sort of software issue."
>>
>> ;-)
>
> You work for Sun?

Yes, well Oracle now, although I don't speak for them.

> While I agree a machine with a nonrecoverable fault should just crash, I
> will point out that writing software to just crash a machine over and over
> again without any meaningful error output is in fact a software issue as
> well.

I agree. The fact that Solaris managed to record the necessary chip
failure telemetry after a hardware failure which hit the system hard
enough for it to be unable to dump and unable to recover even after
a reset is quite remarkable. I don't think [m]any other OSes would
give you the slightest clue what went wrong with the system in this
case, yet here we have the relevant faulty chip identified (hopefully,
although in some cases the chip which detects a fault isn't the one
where the fault lies ;-), and the more detailed fm record should include
details of exactly what's wrong, for those intimately familiar with its
innards.

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]