From: Alberto on
Actually, we run them both on the same machine. We usually run this
system with a Vp1000 (the older card) and a Vp2000 (the newer) side by
side. But here's the difference, now that things are beginning to
clear up in my mind: the Vp1000 runs on the PCI bus while the Vp2000
runs on a PCI Express bus.

Here's what I found out. I don't know if this is what's happening in
my case, but I fired off a question to Dell and I'm waiting on an
answer. You can start by taking a look at

http://www.microsoft.com/whdc/system/platform/server/PAE/PAEdrv.mspx

This is about PAE, but it fits my case. Microsoft states that there
are chipsets that do not support the "Dual Address Cycle", or DAC,
which prevents the bus from accessing more than 32 bits' worth of
address space. They state that a DAC-capable adapter, such as our
Vp2000, must run on a DAC-capable bus. The DDK DMA machinery checks
for DAC capability at boot time, and if it finds a DAC-capable adapter
running on a non-DAC-capable bus, it resorts to a double-buffering
scheme to prevent bus accesses above 4GB.

Now, my 64-bit code doesn't care about 4GB boundaries, blindly
assuming that a 64-bit-OS-capable machine will have a 64-bit-capable
PCI Express bus. The problem is, maybe the T5400 bus isn't
64-bit-address capable, and that's the answer I'm waiting for from
Dell.

This would explain why the frequency of bugcheck 101 hits decreased
significantly once I forced my physical buffer allocations to stay
below the 4GB line. There may still be some dark corner of the driver
that's generating physical addresses above 4GB, or maybe some other
driver is causing the issue by not being careful enough with its bus
accesses. In either case, maybe the answer is plainly not to run
64-bit Windows on such platforms.
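
If it comes to that, here's a minimal sketch of how one could pin an
allocation below the 4GB line, assuming a physically contiguous common
buffer is acceptable; the function name and caching type are
illustrative, not our actual code:

#include <ntddk.h>

// Minimal sketch: allocate a physically contiguous buffer whose
// backing pages are guaranteed to sit below the 4GB line.
PVOID AllocateBufferBelow4Gb(SIZE_T NumberOfBytes)
{
    PHYSICAL_ADDRESS lowest, highest, boundary;

    lowest.QuadPart   = 0;            // accept anything from 0...
    highest.QuadPart  = 0xFFFFFFFF;   // ...up to, but not above, 4GB
    boundary.QuadPart = 0;            // no boundary-crossing restriction

    // Returns a kernel virtual address, or NULL on failure; free with
    // MmFreeContiguousMemory.
    return MmAllocateContiguousMemorySpecifyCache(NumberOfBytes,
                                                  lowest, highest,
                                                  boundary, MmCached);
}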

I intend to make absolutely sure I'm not allocating physical memory
above the 4GB line; that should absolve my driver. But even then, I'm
not that sure I can run on such platforms. My question then is, how do
I test for the "DACness" of a bus? Do any of you out there know?

Or am I barking up the wrong tree? Thanks for any help you can
provide!


Alberto.





On Jan 26, 3:55 pm, "Maxim S. Shatskih"
<ma...(a)storagecraft.com.no.spam> wrote:
> >The funky thing is, this chip's predecessor uses the same dma scheme,
> >and the same hw queue implementation, and it has been working fine for
> >years now.
>
> On the same surrounding hardware?
>
> --
> Maxim S. Shatskih
> Windows DDK MVP
> ma...(a)storagecraft.com
> http://www.storagecraft.com

From: Maxim S. Shatskih on
>Now, my 64-bit code doesn't care about 4GB boundaries, blindly
>assuming

That's why it is a good idea to use IoGetDmaAdapter, since in this case you can get the check for the non-conforming root complex for free.

>not that sure I can run on such platforms. My question then is, how do
>I test for the "DACness" of a bus?

Use IoGetDmaAdapter; it can possibly do this for you for free (in MS's pci.sys).
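
In code, that might look something like this minimal sketch; the Pdo
parameter and the 4MB MaximumLength are placeholders:

#include <ntddk.h>

// Minimal sketch: describe the device as a 64-bit (DAC) capable bus
// master and let the OS apply the bus constraints for you. If the
// parent bus or root complex can't do DAC, the adapter you get back
// double-buffers through map registers so that addresses above 4GB
// never reach the bus.
PDMA_ADAPTER GetChipDmaAdapter(PDEVICE_OBJECT Pdo, PULONG MapRegisterCount)
{
    DEVICE_DESCRIPTION dd;

    RtlZeroMemory(&dd, sizeof(dd));
    dd.Version           = DEVICE_DESCRIPTION_VERSION;
    dd.Master            = TRUE;    // bus-master device
    dd.ScatterGather     = TRUE;    // device consumes SG lists itself
    dd.Dma64BitAddresses = TRUE;    // claim DAC capability
    dd.InterfaceType     = PCIBus;
    dd.MaximumLength     = 4 * 1024 * 1024;   // placeholder

    // Returns NULL if no adapter can be built for this description.
    return IoGetDmaAdapter(Pdo, &dd, MapRegisterCount);
}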

--
Maxim S. Shatskih
Windows DDK MVP
maxim(a)storagecraft.com
http://www.storagecraft.com

From: Alberto on
The problem is, this is not standard DMA; I don't know how well I
could fit it within the DDK model.

The chip fetches driver-generated DMA/render/synchronization/interrupt
command streams and scatter-gather lists from queues in host memory,
asynchronously and in parallel. There are no map registers,
scatter-gather lists are dynamic, and a single DMA transaction could
include hundreds of scatter-gather list items; command streams of 4MB
or more aren't uncommon, and the data that gets moved by the
subsequent DMAs may be multiple gigabytes at each throw. One of the
queue commands - and we use it a lot - is a call/return, which
redirects the queue, on the fly, to fetch its command stream from a
command buffer in host memory. So, an engine would (1) fetch from the
queue, (2) fetch from the buffer, and (3) start a DMA between yet
another buffer and board memory, and while that goes on, the other
queue is fetching and running a render command stream from yet another
host buffer. When the DMA completes, the queue automatically writes a
number of state registers to slots in the device extension, so that
the ISR knows which are the current DMA and render transactions.

You can see that at any one time the chip can be transacting with
several host buffers in parallel. The chip can also handle several
DMA, render, and internal message-passing transactions before it
interrupts; it can internally track and stack multiple DMA and render
operations and completions, including their mutual synchronization.
The driver keeps the queues busy by continuously and asynchronously
enqueuing multiple transactions in what looks a lot like batch mode.
Plus, the chip's bugs force us to play games with the chip command
streams, including the scatter-gather lists, in ways that might be
hard to duplicate within the DDK model!
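
Just to give a flavor of it, here's a purely hypothetical sketch,
nothing like the real chip's command format, of what one of those
queue entries conceptually carries:

#include <ntddk.h>

// Purely hypothetical sketch of a queue entry, for flavor only; the
// real Vp2000 command format is not reproduced here.
typedef enum _QCMD_OP {
    QCMD_DMA,         // move data between a host buffer and board memory
    QCMD_RENDER,      // kick off a render command stream
    QCMD_CALL,        // redirect fetching to another host command buffer
    QCMD_RETURN,      // resume fetching after the matching QCMD_CALL
    QCMD_WRITEBACK    // dump state registers into device-extension slots
} QCMD_OP;

typedef struct _QCMD {
    QCMD_OP Op;
    UINT64  BusAddress;   // host physical address the engine fetches from;
                          // on these machines it must stay below 4GB
    UINT32  Length;       // bytes to fetch or move
} QCMD;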

Alberto.



On Jan 27, 3:19 pm, "Maxim S. Shatskih"
<ma...(a)storagecraft.com.no.spam> wrote:
> >Now, my 64-bit code doesn't care about 4GB boundaries, blindly
> >assuming
>
> That's why it is a good idea to use IoGetDmaAdapter, since in this case you can get the check for the non-conforming root complex for free.
>
> >not that sure I can run on such platforms. My question then is, how do
> >I test for the "DACness" of a bus?
>
> Use IoGetDmaAdapter; it can possibly do this for you for free (in MS's pci.sys).
>
> --
> Maxim S. Shatskih
> Windows DDK MVP
> ma...(a)storagecraft.com
> http://www.storagecraft.com


From: Alberto on
I want to thank all of you who contributed to this thread. For the
sake of closure, I will report my final findings on this bugcheck 101,
if nothing else because it seems to be quite hard to find any
documentation on this bugcheck anywhere, the Internet included!

I saw the problem happen on Dell T5400 and Dell 490 machines running
Vista 64 or XP 64. It does not happen on other Dell machines we use,
for example the 670 or the 2900. It doesn't happen on our HP systems
either. The machine must have more than 4GB of memory for the problem
to show up. The problem does not happen on 32-bit Windows, although I
did not try running with PAE enabled; by the looks of it, chances are
that I might bump into the same problem on PAE-enabled machines.

We traced the problem to an intermittently faulty PCI Express 64-bit
bus access when the address is higher than 4GB. We solved it by
forcing all chip/bus traffic to use physical addresses below 4GB. Once
we completed this implementation, the problem went away, and it cannot
be duplicated no matter how much traffic we throw at the bus.

At this point I don't know if the problem is the bus implementation,
the bridge, the BIOS, the OS bus driver, or our own hardware board.
There's a faint chance that this might be an issue with our driver,
but at this point in time I doubt it very much. I am ceasing to
investigate this problem because it's far too late in the game to fix
a possible chip issue, and anything else is beyond our control. Hence,
the path of least resistance was to do everything from the lower 4GB.
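
For anyone fighting the same thing, here's a rough sketch of the kind
of debug check that can enforce that policy, assuming the traffic is
described by MDLs; the names are illustrative, not our actual code:

#include <ntddk.h>

// Rough sketch: verify that every physical page backing an MDL sits
// below the 4GB line before its addresses are handed to the chip.
BOOLEAN MdlIsBelow4Gb(PMDL Mdl)
{
    PPFN_NUMBER pfns = MmGetMdlPfnArray(Mdl);
    ULONG pages = ADDRESS_AND_SIZE_TO_SPAN_PAGES(
                      MmGetMdlVirtualAddress(Mdl),
                      MmGetMdlByteCount(Mdl));
    ULONG i;

    // A page frame number at or above (4GB >> PAGE_SHIFT) corresponds
    // to a physical address at or above the 4GB line.
    for (i = 0; i < pages; i++) {
        if ((ULONG64)pfns[i] >= (0x100000000ULL >> PAGE_SHIFT)) {
            return FALSE;
        }
    }
    return TRUE;
}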

Again, thanks to all who contributed to this thread!


Alberto.


On Jan 27, 5:12 pm, Alberto <more...(a)terarecon.com> wrote:
> The problem is, this is not standard DMA; I don't know how well I
> could fit it within the DDK model.
> [snip]