Bugcheck 101 [Device Drivers]

From: Alberto on 23 Jan 2009 16:54

Hi, Stefan,

Sorry for the delay in answering! We haven't fully characterized the speed
of the Vp2000 yet; however, a 4-node Vp2000 with 4 GB of memory can spin a
512x512x512 16-bit CT dataset at 20 frames per second. The AquariusNet
application has a 3D viewer that's awesome to use: you see real 3D images
of real human beings, and the software allows you to cut through volumes,
make holes in them, embed other images into them, and do all kinds of
other manipulations. But all of this is too high level for me! I'm just
the driver guy.

Alberto.

"stefanbanev(a)yahoo.com" wrote:

>
> Hello Alberto,
>
> Thanks for the insightful information about the status of VolumePro 2000
> development.
>
> A> I wouldn't have bothered too much about this crash, because our
> A> Vp2000 board doesn't appear anywhere when I look into it with
> A> Windbg; it looks more like an OS bug than anything else.
>
> Definitely OS bugs are quite annoying but customers are so picky ;o)
>
> Unfortunately tech info (link below) regarding board performance is
> very limited:
> http://www.terarecon.com/downloads/products/datasheet_VP2000_120506.pdf
>
> Besides native perspective volume rendering, what performance
> enhancement may we expect vs. Volume Pro 1000? In particular, what level
> of super-sampling (IC case) for interactive rendering of, let's say, a
> 4000x512x512 volume (12 bits/pixel) with a 1024x1024 viewport?
>
> It took quite a while to develop this board, and it seems to be
> getting close at last... When do you expect the VolumePro 2000 board
> to be technically ready for deployment?
>
> Thanks for doing a great job...,
> Stefan
>
>
> On Dec 9, 6:34 am, alberto <amore...(a)ieee.org> wrote:
> > Windbg shows processor 0 doing something at IRQL 13 and all other 7
> > processors halted. The actual function or thread in Processor 0's
> > stack varies from crash to crash, but it's always some memory
> > management thread running under the system processor, for example, the
> > zero memory thread, working set balance, or similar. Everything looks
> > quite normal, and !analyze -v doesn't give much information. If you
> > Google for "Bugcheck 101", you will see a few reports by !analyze -v
> > that say that an unknown device driver generated the bugcheck, and
> > that's what I invariably get.
> >
> > I wouldn't have bothered too much about this crash, because our Vp2000
> > board doesn't appear anywhere when I look into it with Windbg; it
> > looks more like an OS bug than anything else. However, the bugcheck
> > doesn't happen unless our Vp2000 board is running. The system has a
> > Vp1000 board and a Vp2000 board, side by side: when we run the app on
> > the Vp1000 board, things go ok, but when we run on the Vp2000 board we
> > get the crash. The bugcheck 101 happens both in Vista64 and XP64, and
> > it has been reported by many people outside the development
> > organization; but it does not happen on a 32-bit system. We thought it
> > was a power issue, and we powered boards from outside; no change in
> > behavior. We thought it was a heat problem and put a big fan near the
> > machine; no change in behavior. We turned Verifier on; no change in
> > behavior. In fact, turning Verifier on unearthed a minor IRP snag in
> > the Vp1000 driver that has probably been there for the last 5 years or
> > so, but no issues with the Vp2000 driver!
> >
> > I'm trying to catch the issue upstream, that is, before we get to the
> > bugcheck. What happens is, the app is running in the system - for
> > example, rotating a 3d image - and at some point in time it freezes.
> > We see the system visibly upset with things, the mouse moves jerkily
> > and slowly, keystrokes are delayed, and after 10 or 15 seconds, bang,
> > we get the bugcheck.
> >
> > Alberto.
> >
> > On Dec 8, 11:20 pm, "Alexander Grigoriev" <al...(a)earthlink.net> wrote:
> >
> > > If you check call stack locations in disassembly window, do you see any
> > > meaningful commands there?
> >
> > > HLT command is providing a way to put an idle processor into lower power
> > > state (C1).
> >
> > > Connect to the system with a debugger and run !analyze -v
> >
> > > "alberto" <amore...(a)ieee.org> wrote in message
> >
> > >news:fe0ef786-f26f-4a08-9616-08f176b23eff(a)t11g2000yqg.googlegroups.com....
> > > Yet that's the processor whose PRCB is given by the bluescreen
> > > parameters.
> >
> > > At the time of the crash the processor isn't idle, it's actually
> > > halted. The intelppm.sys driver issues a hlt instruction. Interrupts
> > > at that point are actually enabled.
> >
> > > My current reading is that somehow the timer interrupt got lost,
> > > either because another processor is stuck at a high IRQL or because
> > > something happened at hardware level that caused the interrupt not to
> > > be generated.
> >
> > > I ran it under Verifier, nothing changed. I was somehow hopeful that
> > > memory corruption was to blame, but no cigar!
> >
> > > Alberto.
> >
> > > On Dec 5, 10:06 pm, "Alexander Grigoriev" <al...(a)earthlink.net> wrote:
> >
> > > > This is the stack of an idle processor, doing nothing. intelppm is a
> > > > processor-specific driver providing, among other things, a
> > > > power-saving idle loop.
> >
> > > > Your problem is on a different proc.
> >
> > > > "alberto" <amore...(a)ieee.org> wrote in message
> >
> > > >news:74926d8c-808e-4d7f-baca-fb2709c5f3b7(a)d32g2000yqe.googlegroups.com....
> > > > This is what the stack in the hung processor looks like:
> >
> > > > fffffa60`00b92685 : 00000000`dfe5f0a2 00000000`00000008 fffffa60`017d8180 fffff800`02055979 : intelppm!C1Halt+0x2
> > > > fffff800`0208f7f8 : fffffa60`017db580 fffffa60`017e1d40 fffffa60`0000040e fffffa60`017ffd40 : intelppm!C1Idle+0x9
> > > > fffff800`0207eb21 : fffffa60`017d8180 00000000`00061c82 00000000`00000000 00000000`00000000 : nt!PoIdle+0x148
> > > > fffff800`0224c5c0 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x21
> >
> > > > I was assuming that the problem was created by the other processor,
> > > > but, thanks! This gives me new food for thought. I'll disable
> > > > intelppm.sys and see what happens!
> >
> > > > Alberto.
> >
> > > > On Dec 5, 4:07 pm, "Scott Noone" <sno...(a)osr.com> wrote:
> >
> > > > > I've never actually hit this bugcheck, but I'll bite.
> >
> > > > > The bugcheck information should show the hung processor. Have you
> > > > > looked at the call stack on that processor to see why it's locked up?
> >
> > > > > -scott
> >
> > > > > --
> > > > > Scott Noone
> > > > > Software Engineer
> > > > > OSR Open Systems Resources, Inc.
> > > > > http://www.osronline.com
> >
> > > > > "Alberto" <more...(a)terarecon.com> wrote in message
> >
> > > > >news:38960766-493e-4629-8459-02f4e82d662f(a)v4g2000yqa.googlegroups.com...
> >
> > > > > > Hi, All,
> >
> > > > > > I bumped into a nasty, hard to debug crash. It's a Bugcheck 101. It
> > > > > > happens when we run one of our products on our Vp2000 volume rendering
> > > > > > board: after we play with images for ten or fifteen minutes, the
> > > > > > machine gets unresponsive and after a few more seconds we get the blue
> > > > > > screen. This is a 4-processor Dell 5400 with hyperthreading on and
> > > > > > running 64-bit Vista. There's a lot going on in there at the time of
> > > > > > the crash, and all 8 virtual processors are busy at that time.
> >
> > > > > > By the time the dump gets taken, the system's long gone into la-la
> > > > > > land, and there isn't much in the dump that's useful to diagnose
> > > > > > what's going on. The crash is supposed to be a processor timeout
> > > > > > waiting for a timer interrupt, and while processor 3 is the timed out
> > > > > > processor, a thread in processor 0 seems to be at IRQL 13, which is
> > > > > > the level for the Amd64 timer interrupt. If that's a sustained
> > > > > > situation, that might explain what's going on, although actually
> > > > > > tracking it requires more work.
> >
> > > > > > There isn't much on the web about this Bugcheck, except the normal
> > > > > > "make sure your hw is not overheating or this or that" or "download
> > > > > > your latest bios and video drivers". No indication of what in those
> > > > > > new versions might actually have fixed the problem!
> >
> > > > > > My user has Daemon Tools installed, and I hear that they install a
> > > > > > hard-to-get-rid-of driver called sptd.sys. People on the web say that
> > > > > > sptd.sys sometimes interacts with the rest of the system in ways that
> > > > > > end up generating a Bugcheck 101. My user uninstalled Daemon Tools but
> > > > > > the crash is still there, and I'm pretty sure that sptd.sys has not
> > > > > > been disabled.
> >
> > > > > > My question is, do any of you have any experience with this Bugcheck
> > > > > > you might be willing to share ? At this point, any information,
> > > > > > however minor, will be highly appreciated!
> >
> > > > > > Thanks,
> >
> > > > > > Alberto.
>
>

From: Alberto on 23 Jan 2009 17:07

Daniel,

Thanks for the suggestion! I got a few more entries to refer to. Still,
sorry to say, none of them helped in this particular case.

I did some more digging into this problem, and this is what I found.

My chip has a hw dma queue and a hw render queue. The chip fetches command
streams from these queues, which can be placed on host or board memory.
These commands are set up by the driver to move data and tables in and out of
the board, or to perform the rendering.

In this case, we're fetching from host memory. I preallocate a large (the
default is 4 megabytes) contiguous slab of memory at initialization time,
which becomes a buffer pool from where the driver suballocates queue command
buffers.
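
To illustrate the kind of suballocation described above, here is a minimal
user-mode sketch of a bump allocator over one contiguous slab. It is not the
actual Vp2000 driver code: the names cmd_pool, cmd_pool_init and
cmd_pool_alloc are hypothetical, and the sketch assumes the slab base is
nonzero and already aligned, and that buffers are never freed individually.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool descriptor; not taken from any real driver. */
typedef struct {
    uint64_t base_pa; /* physical address of the start of the slab */
    size_t   size;    /* slab size in bytes, e.g. 4 MB */
    size_t   next;    /* offset of the first free byte */
} cmd_pool;

static void cmd_pool_init(cmd_pool *p, uint64_t base_pa, size_t size)
{
    p->base_pa = base_pa;
    p->size = size;
    p->next = 0;
}

/*
 * Carve the next command buffer out of the slab.
 * Returns its physical address, or 0 when the pool is exhausted.
 * align must be a power of two and is applied to the offset within
 * the slab, so the slab base is assumed to be aligned already.
 */
static uint64_t cmd_pool_alloc(cmd_pool *p, size_t bytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    size_t start = (p->next + align - 1) & ~(align - 1);
    if (start > p->size || bytes > p->size - start)
        return 0;

    p->next = start + bytes;
    return p->base_pa + start;
}
```

Every address handed out this way inherits the slab's position in physical
memory, so where the slab lands decides which addresses the chip sees in its
queue entries.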

I found that the problem goes away if I force the physical address of this
memory slab to be under 4 GB. If I let the command buffer pool go beyond
4 GB, every once in a blue moon I either get a Bugcheck 101 or I get a hard
machine freeze where not even the keyboard LEDs are functional.
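
As a rough sketch of what "under 4 GB" means for such a slab, the helper
below checks that a whole physical range is reachable with 32 address bits.
The second helper shows what address a device would use if it kept only the
low 32 bits; that is purely a hypothetical failure mode, since nothing in
this thread establishes the cause. The names range_below_4gb and
truncate_to_32 are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define FOUR_GB 0x100000000ULL

/* True when the whole range [pa, pa + len) lies below the 4 GB line. */
static bool range_below_4gb(uint64_t pa, uint64_t len)
{
    return len != 0 && pa < FOUR_GB && len <= FOUR_GB - pa;
}

/*
 * Address a device would end up using if it dropped the upper
 * 32 bits of a 64-bit physical address (hypothetical scenario).
 */
static uint64_t truncate_to_32(uint64_t pa)
{
    return pa & 0xFFFFFFFFULL;
}
```

A 4 MB slab at 0xFFC00000 passes the check because it ends exactly at the
4 GB line, while the same slab at 0x123400000 fails it and would alias to
0x23400000 under truncation. In a real driver the bound would more likely be
enforced at allocation time, for instance through the
HighestAcceptableAddress argument of MmAllocateContiguousMemorySpecifyCache.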

Now, this can be an issue with my hardware's PCI Express implementation, or
it can have something to do with the way Vista 64-bit handles physical
memory. I faintly recall having read some warning somewhere on the Web, but
I cannot locate it any longer.

Meanwhile, I'll double check to make sure my chip isn't mistreating the PCI
Express bus. And thanks to all of you who pitched in!

Alberto.

===========

"daniel(a)resplendence.com" wrote:

> Now that I am thinking about it, the driver where I have seen this bugcheck
> was doing exotic experiments in an attempt to achieve a real time
> environment by "liberating" CPUs from any workload.
>
> For this purpose, I was setting affinity for all processes and threads in
> the system (KeSetSystemAffinityThread) and dequeueing DPCs which were not
> mine (with KeSetTargetProcessorDpc) from the CPU I wanted to liberate.
>
> //Daniel
>
>
>
>
>
> <daniel(a)resplendence.com> wrote in message
> news:C8BBC04B-1A83-4A12-9EC2-0D5CEE1AE617(a)microsoft.com...
> > Through the years I have had a few of these bugchecks in my software-only
> > drivers, but they were never reproducible and always occurred while
> > running under VMware.
> >
> > I hate to make such a silly suggestion, but have you considered Googling
> > for "CLOCK_WATCHDOG_TIMEOUT" and "bugcheck 0x101" rather than "bugcheck
> > 101"? These do yield some results.
> >
> > //Daniel
> >
> >
> > "Alberto" <moreira(a)terarecon.com> wrote in message
> > news:38960766-493e-4629-8459-02f4e82d662f(a)v4g2000yqa.googlegroups.com...
> >>
> >> My question is, do any of you have any experience with this Bugcheck
> >> you might be willing to share ? At this point, any information,
> >> however minor, will be highly appreciated!
> >>
> >> Thanks,
> >>
> >>
> >> Alberto.
> >
>

From: Maxim S. Shatskih on 23 Jan 2009 17:25

> I found that the problem goes away if I force the physical address of this
> memory slab to be under 4 GB. If I let the command buffer pool go beyond
> 4 GB, every once in a blue moon I either get a Bugcheck 101 or I get a hard

Are you using IoGetDmaAdapter or not?

--
Maxim S. Shatskih
Windows DDK MVP
maxim(a)storagecraft.com
http://www.storagecraft.com

From: Alberto on 26 Jan 2009 15:45

Hi, Maxim,

I'm not using IoGetDmaAdapter. I'm using
MmAllocateContiguousMemorySpecifyCache at Start time to grab a big
slab of kernel-side contiguous memory which becomes my queue command
buffer pool, and the rest is done by the chip; there's basically no
other OS interaction involved in the dma process.

The chip fetches commands and executes them. Some of these commands
tell it to dma, between a user-side buffer and board memory. I use
MmProbeAndLockPages to fix user-side buffers in memory so that I can
dma without moving data around. When the bugcheck happens, the chip
has just fetched a new dma command batch, but that batch never
executes and the dma never starts: some milliseconds elapse, an
external interrupt comes (for example, from a network card that shares
the same interrupt line as our board) and at that time the system
hangs solid.

The funky thing is, this chip's predecessor uses the same dma scheme,
and the same hw queue implementation, and it has been working fine for
years now. No problem, not even with Vista 64-bit!

Alberto.

On Jan 23, 5:25 pm, "Maxim S. Shatskih"
<ma...(a)storagecraft.com.no.spam> wrote:
> > I found that the problem goes away if I force the physical address of this
> > memory slab to be under 4 GB. If I let the command buffer pool go beyond
> > 4 GB, every once in a blue moon I either get a Bugcheck 101 or I get a hard
>
> Are you using IoGetDmaAdapter or not?
>
> --
> Maxim S. Shatskih
> Windows DDK MVP
> ma...(a)storagecraft.com
> http://www.storagecraft.com

From: Maxim S. Shatskih on 26 Jan 2009 15:55

>The funky thing is, this chip's predecessor uses the same dma scheme,
>and the same hw queue implementation, and it has been working fine for
>years now.

On the same surrounding hardware?

--
Maxim S. Shatskih
Windows DDK MVP
maxim(a)storagecraft.com
http://www.storagecraft.com
