From: Michael (Xcitelogic) on
Gday TP - thank you very much for your comprehensive reply to my query. I'll
address the points below if I may with my own responses:

"TP" wrote:

> Hi,
>
> Most likely one or more of the following are causing the problem:
>
> - Faulty hardware
> - Corrupt files
> - Virus/malware/rootkit
> - Faulty device drivers
> - Bug in windows
>
> Since this has been an ongoing problem I would recommend you
> open a case with Microsoft PSS so that they can assist in dump
> analysis. It may be necessary to open another case with IBM
> as well.
>
> Another option is to completely wipe the server and reinstall
> from a known good image or from scratch. If the problem
> comes back then it is most likely faulty hardware.
>

I agree completely, and did this approximately 2 months ago: I performed a
fresh/clean install of the Operating System using the IBM Server Installer,
updated all system/hardware drivers to the latest versions, and used the same
print driver versions as are in use on my other "identical" Terminal Servers.
Unfortunately, after I put the machine back into production it experienced
the same errors as before. I should note, however, that before returning the
server to the CoLocation facility I had it running for about 2 weeks on our
test bench. During this time the system did not Blue Screen once; once
returned, however, it resumed the BSoDs. Notably, this occurred whether users
were logged in or not.


> Below are some suggestions for troubleshooting/fixing the problem:
>
> 1. Possible faulty hardware
>
> - Run all hardware diagnostics provided by hardware vendor(s)
> for devices installed in the server
> - Run memory test/stress utilities
> - Remove all RAM chips and reinstall them
> - Remove and reinstall cards, if present (for example, NIC, RAID)
> - If above does not narrow down or fix the issue, swap out parts
> one at a time and then test to see if crash occurs, starting with
> RAM chips (for RAM swap a bank at a time to narrow it down
> quicker)
>

As noted above, I ran the IBM Enhanced Diagnostics as well as a MemTest
(x86). The server does not have any additional cards apart from those
provided with it (I am not using an RSA, but I believe the second NIC was an
expansion module). I also ran the MemTest with one set of DIMMs, then the
other, and then both (an extended run with both sets installed, and the basic
test for the individual banks).


> 2. Possible file corruption/disk issues
>
> - Make certain you have a current backup
> - Run the consistency check on your RAID array(s) using your
> vendor's Array management software
> - Check the event log for the array for errors/warnings
> - Run chkdsk on all drives using the /r option. For the C: drive
> and others that are in use you will need to schedule it for next
> reboot.
> - Run sfc /scannow to check & replace problem system files
>

Regarding the RAID consistency check - I have yet to perform this (I believe
that with my chipset I can boot into the RAID BIOS and perform this test - I
will bring the server back and give this a shot).
I can confirm I have run a full chkdsk with /r, but not yet an sfc /scannow,
which I will do shortly and reply with details.

> 3. Possible Virus/Malware--malware that uses kernel-mode
> components often causes STOP errors.
>
> - Scan all drives using multiple virus and malware scanners
> - Scan using *multiple* packages specifically designed to look
> for cloaked viruses/malware. Cloaked software typically will
> not show up using regular virus scan software.
> - If needed scan/examine manually the contents of the drives
> offline so that any malware will not be running and thus unable
> to hide itself or prevent removal.
> - Examine everything that is starting up on this server using
> autoruns.exe and compare to healthy servers for clues. Use
> verify signatures and hide ms entries to limit the results.
> - Watch Mark Russinovich's video on Malware removal:
>
> http://www.microsoft.com/emea/spotlight/sessionh.aspx?videoid=359

I can confirm that the server has been both virus scanned & rootkit scanned
by several online scanners (plus a downloaded copy of Sophos Anti-Rootkit)
with no malware detected. All processes running on the server appear to be
expected/normal.
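
On the "compare to healthy servers" suggestion: since I have other
"identical" Terminal Servers, the comparison could be scripted rather than
eyeballed. Below is a minimal Python sketch, assuming the startup/driver
entries from each server have already been exported to plain lists of names
by some other means. The function name and the sample entries are made up
for illustration and are not real Autoruns output.

```python
def only_on_suspect(suspect_entries, healthy_servers):
    """Return entries found on the suspect server but on no healthy server.

    suspect_entries: iterable of entry names from the crashing server.
    healthy_servers: iterable of iterables, one per healthy server.
    Names are compared case-insensitively with surrounding whitespace removed.
    """
    def normalise(name):
        return name.strip().lower()

    healthy = {normalise(e) for server in healthy_servers for e in server}
    suspect = {normalise(e) for e in suspect_entries}
    return sorted(suspect - healthy)


if __name__ == "__main__":
    # Hypothetical entry names, purely for illustration.
    suspect = ["PrintFilter.sys", "examplebackup.sys", "NicDriver.sys"]
    healthy = [
        ["printfilter.sys", "nicdriver.sys"],
        ["PrintFilter.sys", "NicDriver.sys"],
    ]
    for name in only_on_suspect(suspect, healthy):
        print("only on suspect server:", name)
```

Anything the script reports is only a lead to investigate, not proof of a
faulty component.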

>
> 4. Possible Faulty Device Drivers
>
> - Perform analysis of dump file to see if it points to a specific driver
> - Look for errors in the event logs for clues
> - Update drivers for critical system devices (NIC, RAID, Motherboard
> devices, etc.)
> - Update firmware for motherboard, RAID, etc.
> - Uninstall non-critical software that uses kernel-mode drivers. For
> example, anti-virus, anti-malware, firewall, backup agents, etc.
>

I performed some basic dump analysis above - is this sufficient or do you
need more information? As noted, I have the latest firmware & drivers for all
hardware components in the puzzle (printers excepted - they are running the
same drivers as my other Terminal Servers in the cluster).

> 5. Bug in windows--there are hotfixes available to fix various Stop
> errors that occur, many related to TS environments. One may
> be applicable to you.
>
> - Analyze dump file for clues
> - Narrow down what happens near the time of each crash. For
> example, is a user logging off immediately before the crash? If
> yes there is a hotfix to address that. Is there heavy disk activity
> around the crash time? How about heavy network activity? Etc.
> - Use the Advanced Search feature of the MS knowledge base
> to search for Stop errors similar to yours. For example, here
> are the results for the ab error:
>
> http://support.microsoft.com/search/default.aspx?mode=a&query=%22Stop+0x000000ab%22&spid=3198
>
> Please see the following document for more information:
>
> Windows Server 2003 Troubleshooting Stop Errors
>
> http://www.microsoft.com/downloads/details.aspx?familyid=859637b4-85f1-4215-b7d0-25f32057921c
>
> Thanks.
>
> -TP
>

Regarding the hotfixes - I found several relating to the TS environment and
correlating with my stop codes and have applied all that were applicable.
Alas I still seem to be having the issue!
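
For the "narrow down what happens near the time of each crash" suggestion,
something like the following Python could help once the crash times and
event log entries have been collected. This is only a sketch: the function
names, the 5-minute window and the sample timestamps/events are all
assumptions made up for illustration, not data from this server.

```python
from collections import Counter
from datetime import datetime, timedelta


def events_before_crashes(crashes, events, window_minutes=5):
    """Map each crash time to the events logged shortly before it.

    crashes: list of datetime objects (one per BSoD).
    events:  list of (datetime, description) tuples from the event logs.
    """
    window = timedelta(minutes=window_minutes)
    return {
        crash: [(when, what) for when, what in events
                if crash - window <= when < crash]
        for crash in crashes
    }


def common_precursors(by_crash):
    """Count in how many crash windows each event description appears."""
    counts = Counter()
    for window_events in by_crash.values():
        # Use a set so an event repeated within one window counts once.
        counts.update({what for _, what in window_events})
    return counts


# Made-up sample data, for illustration only.
SAMPLE_CRASHES = [datetime(2000, 1, 1, 3, 12), datetime(2000, 1, 2, 14, 40)]
SAMPLE_EVENTS = [
    (datetime(2000, 1, 1, 2, 0), "backup job started"),
    (datetime(2000, 1, 1, 3, 10), "user logoff"),
    (datetime(2000, 1, 2, 14, 38), "user logoff"),
    (datetime(2000, 1, 2, 14, 39), "print job spooled"),
]

if __name__ == "__main__":
    by_crash = events_before_crashes(SAMPLE_CRASHES, SAMPLE_EVENTS)
    for what, count in common_precursors(by_crash).most_common():
        print(f"{what}: seen before {count} of {len(SAMPLE_CRASHES)} crashes")
```

An event that shows up before most crashes (a logoff, a print job, a backup)
would be a reasonable thing to search the knowledge base for.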

I will reply with the results of the above actions I'm taking, and then might
look at opening a case with MS PSS - I believe we get these complimentary as
MS Partners anyway.

Thank you for all of your suggestions/assistance!

> Michael (Xcitelogic) wrote:
> > Gday All
> >
> > I have a Windows Server 2003 SP2 Terminal Server which has been
> > giving me grief on and off for a year or so now. The server restarts
> > approximately once a day - without an apparent trigger for doing so.
> > The server in question is an IBM x306m with the latest
> > firmware/device drivers as per directions from IBM Technical Support.
> >
> > I seem to get fairly consistent STOP 0x00000050 codes, interspersed
> > with a STOP 0x000000ab or two, and a STOP 0x000000be and a STOP
> > 0x000000c2 for good measure. I'm being assured by the hardware
> > vendor that the problem lies at the software side - and having
> > googled for all relevant STOP codes and applying every hotfix I could
> > see, I still receive BSoDs on this Terminal Server.
> >
> > The last series of STOP codes were these:
> > Error code 000000ab, parameter1 00000001, parameter2 fffffd78,
> > parameter3 00000000, parameter4 ffffffff.
> > Error code 000000be, parameter1 f714a0b8, parameter2 bfc30121,
> > parameter3 f78cea4c, parameter4 0000000b.
> > Error code 00000050, parameter1 f7460000, parameter2 00000001,
> > parameter3 80834bde, parameter4 00000000.
> > Error code 000000ab, parameter1 00000004, parameter2 fffff338,
> > parameter3 00000000, parameter4 ffffffff.
> > Error code 000000ab, parameter1 00000013, parameter2 fffff5d8,
> > parameter3 00000000, parameter4 ffffffff.
> > Error code 000000ab, parameter1 00000001, parameter2 fffff510,
> > parameter3 00000000, parameter4 ffffffff.
> > Error code 00000050, parameter1 f7468000, parameter2 00000001,
> > parameter3 80834bde, parameter4 00000000.
> >
> > Is anyone able to point me in the right direction at resolving this
> > issue?
> >
> > Thanks in advance for any help you can provide.
>
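
As a rough aid for reading the STOP list quoted above, the entries can be
parsed and tallied with a few lines of Python. This is a sketch only: the
names `BUGCHECK_NAMES`, `parse_stops` and `tally` are my own, the log text
is copied from the original post, and the symbolic names are the ones
Microsoft's bug check code reference gives for these codes.

```python
import re
from collections import Counter

# Symbolic names for the bug check codes mentioned in this thread.
BUGCHECK_NAMES = {
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0xAB: "SESSION_HAS_VALID_POOL_ON_EXIT",
    0xBE: "ATTEMPTED_WRITE_TO_READONLY_MEMORY",
    0xC2: "BAD_POOL_CALLER",
}

# The seven entries from the original post, line wraps left as posted.
LOG = """
Error code 000000ab, parameter1 00000001, parameter2 fffffd78,
parameter3 00000000, parameter4 ffffffff.
Error code 000000be, parameter1 f714a0b8, parameter2 bfc30121,
parameter3 f78cea4c, parameter4 0000000b.
Error code 00000050, parameter1 f7460000, parameter2 00000001,
parameter3 80834bde, parameter4 00000000.
Error code 000000ab, parameter1 00000004, parameter2 fffff338,
parameter3 00000000, parameter4 ffffffff.
Error code 000000ab, parameter1 00000013, parameter2 fffff5d8,
parameter3 00000000, parameter4 ffffffff.
Error code 000000ab, parameter1 00000001, parameter2 fffff510,
parameter3 00000000, parameter4 ffffffff.
Error code 00000050, parameter1 f7468000, parameter2 00000001,
parameter3 80834bde, parameter4 00000000.
"""

HEX = r"([0-9a-f]{8})"
# \s+ between fields lets the pattern match across the wrapped lines.
PATTERN = re.compile(
    rf"Error code\s+{HEX},\s+parameter1\s+{HEX},\s+parameter2\s+{HEX},"
    rf"\s+parameter3\s+{HEX},\s+parameter4\s+{HEX}",
    re.IGNORECASE,
)


def parse_stops(text):
    """Return a list of (code, [p1, p2, p3, p4]) with all values as ints."""
    stops = []
    for match in PATTERN.finditer(text):
        values = [int(group, 16) for group in match.groups()]
        stops.append((values[0], values[1:]))
    return stops


def tally(stops):
    """Count how many times each bug check code occurs."""
    return Counter(code for code, _ in stops)


if __name__ == "__main__":
    for code, count in tally(parse_stops(LOG)).most_common():
        name = BUGCHECK_NAMES.get(code, "unknown")
        print(f"0x{code:08X} {name}: {count}")
```

In this sample 0xAB accounts for four of the seven crashes, and both 0x50
entries carry the same parameter3 (80834bde), so the same code address
appears to be involved each time. That is an observation from the list, not
a diagnosis; the dump files would still need proper analysis.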