From: Gary Mills on 10 Jun 2010 16:54

We had a reboot recently that was a result of this hardware fault:

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS  Critical

Fault class : fault.cpu.intel.nb.fsb
FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
      faulty

How do I determine which CPU or core is at fault? This is on an E4450
with four four-core CPUs. `psrinfo -vp' says:

The physical processor has 4 virtual processors (0 4-6)
  x86 (chipid 0x0 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
        Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (1 7-9)
  x86 (chipid 0x2 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
        Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (2 10-12)
  x86 (chipid 0x4 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
        Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (3 13-15)
  x86 (chipid 0x6 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
        Intel(r) Xeon(r) CPU X7350 @ 2.93GHz

--
-Gary Mills- -Unix Group- -Computer and Network Services-

From: Cydrome Leader on 10 Jun 2010 19:58

Gary Mills <mills(a)cc.umanitoba.ca> wrote:
> We had a reboot recently that was a result of this hardware fault:
>
> --------------- ------------------------------------ -------------- ---------
> TIME            EVENT-ID                             MSG-ID         SEVERITY
> --------------- ------------------------------------ -------------- ---------
> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS  Critical
>
> Fault class : fault.cpu.intel.nb.fsb
> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>       faulty
>
> How do I determine which CPU or core is at fault? This is on an E4450
> with four four-core CPUs. `psrinfo -vp' says:

While you can disable cores/processors for solaris x86, it's not clear
if it really does anything. On a sparc platform, yes you can really
disable memory and processors and it's for real.

I've seen xeon processors (really cores) fail in solaris before and in
real life there's nothing wrong at all with the CPU. For intel hardware
just rebooting seems to be the fix. I suspect it's some sort of software
issue.

From: Gary Mills on 11 Jun 2010 19:25

In <huru7f$ing$1(a)reader1.panix.com> Cydrome Leader <presence(a)MUNGEpanix.com> writes:

>Gary Mills <mills(a)cc.umanitoba.ca> wrote:
>> We had a reboot recently that was a result of this hardware fault:
>>
>> --------------- ------------------------------------ -------------- ---------
>> TIME            EVENT-ID                             MSG-ID         SEVERITY
>> --------------- ------------------------------------ -------------- ---------
>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS  Critical
>>
>> Fault class : fault.cpu.intel.nb.fsb
>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>>       faulty
>>
>> How do I determine which CPU or core is at fault? This is on an E4450
>> with four four-core CPUs. `psrinfo -vp' says:

In this instance, I'd really like to know which CPU was faulty.
I can guess, but I might be wrong. (It was actually an X4450.)

>I've seen xeon processors (really cores) fail in solaris before and in
>real life there's nothing wrong at all with the CPU. For intel hardware
>just rebooting seems to be the fix. I suspect it's some sort of software
>issue.

This server needed a power-cycle before it came back to normal. A
reboot wasn't sufficient. Either something didn't get reset fully
or it was a real hardware failure.

--
-Gary Mills- -Unix Group- -Computer and Network Services-

From: Cydrome Leader on 11 Jun 2010 23:49

Gary Mills <mills(a)cc.umanitoba.ca> wrote:
> In <huru7f$ing$1(a)reader1.panix.com> Cydrome Leader <presence(a)MUNGEpanix.com> writes:
>
>>Gary Mills <mills(a)cc.umanitoba.ca> wrote:
>>> We had a reboot recently that was a result of this hardware fault:
>>>
>>> --------------- ------------------------------------ -------------- ---------
>>> TIME            EVENT-ID                             MSG-ID         SEVERITY
>>> --------------- ------------------------------------ -------------- ---------
>>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS  Critical
>>>
>>> Fault class : fault.cpu.intel.nb.fsb
>>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>>>       faulty
>>>
>>> How do I determine which CPU or core is at fault? This is on an E4450
>>> with four four-core CPUs. `psrinfo -vp' says:
>
> In this instance, I'd really like to know which CPU was faulty.
> I can guess, but I might be wrong. (It was actually an X4450.)
>
>>I've seen xeon processors (really cores) fail in solaris before and in
>>real life there's nothing wrong at all with the CPU. For intel hardware
>>just rebooting seems to be the fix. I suspect it's some sort of software
>>issue.
>
> This server needed a power-cycle before it came back to normal. A
> reboot wasn't sufficient. Either something didn't get reset fully
> or it was a real hardware failure.

If you have any core files, sun might be able to tell you which cpu it
feels faulted. Since you're running on sun hardware they should probably
be able to help with this.

If you can, running VTS for a few days might be a good idea.

From: Gary Mills on 12 Jun 2010 08:06

In <huv042$2rn$2(a)reader1.panix.com> Cydrome Leader <presence(a)MUNGEpanix.com> writes:

>Gary Mills <mills(a)cc.umanitoba.ca> wrote:
>> In <huru7f$ing$1(a)reader1.panix.com> Cydrome Leader <presence(a)MUNGEpanix.com> writes:
>>
>>>Gary Mills <mills(a)cc.umanitoba.ca> wrote:
>>>> We had a reboot recently that was a result of this hardware fault:
>>>>
>>>> --------------- ------------------------------------ -------------- ---------
>>>> TIME            EVENT-ID                             MSG-ID         SEVERITY
>>>> --------------- ------------------------------------ -------------- ---------
>>>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS  Critical
>>>>
>>>> Fault class : fault.cpu.intel.nb.fsb
>>>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>>>>       faulty
>>
>> This server needed a power-cycle before it came back to normal. A
>> reboot wasn't sufficient. Either something didn't get reset fully
>> or it was a real hardware failure.

>If you have any core files, sun might be able to tell you which cpu it
>feels faulted. Since you're running on sun hardware they should probably
>be able to help with this.

There was no core file or traceback, just a sudden reboot. Oracle/Sun
is going to replace one of the CPUs. I just wanted an independent way
to verify which one it was.

--
-Gary Mills- -Unix Group- -Computer and Network Services-
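
One possible cross-check, offered only as a rough sketch: the `psrinfo -vp' output posted at the top of the thread can be turned into a table of chip IDs and the virtual processors that belong to each. The Python below does just that with the output quoted from the first post. The function names (`expand_ids`, `chip_to_cpus`) are made up for illustration. Whether `chip=4` in the FRU string refers to the same numbering as `chipid 0x4` in psrinfo is an assumption that nothing in this thread confirms, and the mapping says nothing about which physical socket on the board holds that chip.

```python
import re

# psrinfo -vp output as quoted in the first post of this thread.
PSRINFO_SAMPLE = """\
The physical processor has 4 virtual processors (0 4-6)
  x86 (chipid 0x0 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
        Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (1 7-9)
  x86 (chipid 0x2 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
        Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (2 10-12)
  x86 (chipid 0x4 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
        Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (3 13-15)
  x86 (chipid 0x6 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
        Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
"""


def expand_ids(spec):
    """Expand a list such as '2 10-12' into [2, 10, 11, 12]."""
    ids = []
    for part in re.split(r"[,\s]+", spec.strip()):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            ids.extend(range(int(low), int(high) + 1))
        else:
            ids.append(int(part))
    return ids


def chip_to_cpus(text):
    """Map each chipid (as an int) to its virtual processor IDs."""
    mapping = {}
    pending = None  # processor IDs seen, waiting for their chipid line
    for line in text.splitlines():
        match = re.search(r"virtual processors? \(([^)]*)\)", line)
        if match:
            pending = expand_ids(match.group(1))
            continue
        match = re.search(r"chipid (0x[0-9a-fA-F]+)", line)
        if match and pending is not None:
            mapping[int(match.group(1), 16)] = pending
            pending = None
    return mapping


if __name__ == "__main__":
    # ASSUMPTION: FRU "chip=4" uses the same numbering as psrinfo's chipid.
    print(chip_to_cpus(PSRINFO_SAMPLE).get(4))
```

If that assumption holds, the chip named in the fault report would be the one carrying virtual processors 2 and 10-12. Treat that as a hint to compare against what Oracle/Sun identifies, not as a diagnosis.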