From: Jonathan Morrison on
I'll double-check this - but IIRC the 8.x compilers implement volatile with
acquire/release semantics - which would make the code solid, as the volatile
write in WRITE_XXX would force all previous operations to complete before it
could publish its result. However (again IIRC), the previous compiler
didn't do this - so it does seem that there could be a problem in that
case. Let me ask around in compiler land and see if I can get a definitive
answer. Thanks.
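
To make the scenario concrete, here is roughly the pattern in question -
a sketch only, with made-up names, not the actual WDK macro definitions:

    /* Sketch - hypothetical names. The register macro boils down to a
     * volatile store: */
    #define MY_WRITE_REGISTER_ULONG(Reg, Val) \
        (*(volatile ULONG *)(Reg) = (ULONG)(Val))

    VOID StartTransfer(PMY_DEVICE_EXTENSION DevExt)
    {
        DevExt->CommonBuffer->Length = 512;        /* ordinary store */
        MY_WRITE_REGISTER_ULONG(DevExt->GoReg, 1); /* volatile store */
        /* With volatile release semantics (the 8.x behavior described
         * above), the Length store must become visible before the
         * GoReg write. With plain ISO volatile, the compiler may still
         * move the non-volatile Length store past the volatile write. */
    }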

--
This posting is provided "AS IS" with no warranties, and confers no rights.
Use of any included script samples is subject to the terms specified at
http://www.microsoft.com/info/cpyright.htm
<BubbaGump> wrote in message
news:0fnlj25kc87dcsot6sjphvjak1rr3lvekr(a)4ax.com...
> Based on the non-cacheable ordering JM pointed out, I think the last
> statement I made below is false. The write of the logical address
> would probably be to a device register, and the register access macros
> like WRITE_REGISTER_ULONG might only use the volatile keyword or a
> compiler barrier (under the assumption that the memory-mapped I/O they
> will touch will be uncacheable and not require a CPU barrier). The
> volatile keyword doesn't serve as a barrier since non-volatile accesses
> can still be reordered around it, and the compiler barrier might need
> to be both a compiler and a CPU barrier since the buffer to be DMA'd
> might be cached.
>
> My point is I don't think the register macros like
> WRITE_REGISTER_ULONG will compensate in all cases for the barrier that
> appears to be missing from KeFlushIoBuffers.
>
> On Sat, 21 Oct 2006 18:15:17 -0400, BubbaGump <> wrote:
>
>>I'm thinking of a transfer of a buffer out to a device using common
>>buffer DMA:
>>
>> 1) driver writes to the common buffer
>> 2) driver calls KeFlushIoBuffers
>> (device has logical address of buffer from previous operation)
>> 3) driver writes "Go" bit of a device register
>> 4) device reads from the common buffer
>>
>>I realize that if at least the logical address must be passed again
>>before each operation, then its passing will already require a memory
>>barrier between (2) and (3), which would substitute for the one
>>apparently missing from KeFlushIoBuffers.
>


From: already5chosen on

BubbaGump wrote:
> I noticed in my DDK header files (3790.1830) that KeFlushIoBuffers()
> is defined to do nothing, absolutely nothing. I know x86 caches are
> already coherent with respect to DMA, but isn't there still the
> possibility of out-of-order loads and stores? Is this a bug? Are
> drivers supposed to do a KeMemoryBarrier() explicitly?
>
> Even if the x86 still does program ordered stores, and Microsoft will
> update KeFlushIoBuffers() at some point in the future when that
> ordering is no longer true, the compiler's memory accesses might not
> be ordered so at least a compiler memory barrier is needed for DMA out
> to a device.
>
> What about after a DMA? In order to account for speculative loads
> during DMA in from a device, does the call to FlushAdapterBuffers()
> have a memory barrier or should the driver also do a KeMemoryBarrier()
> explicitly here?

The IA32 (and AMD64, for that matter) instruction set architecture
guarantees so-called processor consistency (PC) for WB and UC memory
regions. PC means that stores by a particular processor are observed in
program order by all bus agents present in the system.
Intel and AMD assure that processor consistency will be maintained
on all future x86-compatible CPUs.
So as long as an (IA32/AMD64) driver doesn't use WC memory regions, it
needs no memory barrier in KeFlushIoBuffers().
If your driver does use WC memory regions, you have to call
KeMemoryBarrier() explicitly in your code.
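
For example (a sketch only; the device-extension names are made up):

    /* With a WC-mapped common buffer, the stores below may still sit
     * in the CPU's write-combining buffers, so fence before ringing
     * the doorbell: */
    devExt->CommonBuffer->Command = CMD_TRANSMIT;  /* WC store */
    devExt->CommonBuffer->Length  = length;        /* WC store */
    KeMemoryBarrier();                             /* drain WC buffers */
    WRITE_REGISTER_ULONG(devExt->DoorbellReg, 1);  /* notify the device */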

On the receiving end of a DMA operation the situation is different -
x86-compatible CPUs can issue load instructions out of order. So I'd
guess that the FlushAdapterBuffers() routine issues a memory fence (or
a load fence on processors that support it).

From: BubbaGump on
On 22 Oct 2006 02:31:02 -0700, already5chosen(a)yahoo.com wrote:

>IA32 (and AMD64 for that matter) instruction set architecture
>guarantees so called processor consistency (PC) for WB and UC memory
>regions. PC means that stores by particular processor are observed in
>program order by all bus agents present in the system.
>Intel and AMD assure that processor consistency would be maintained
>over all future x86 compatible CPUs.

I see the first part, about current processors, stated in the IA-32
spec, but where is the second part, about future processors, stated? I
don't necessarily think it's a problem, but I see the opposite claim
about the future:

"It is recommended that software written to run on Pentium 4,
Intel Xeon, and P6 family processors assume the processor-ordering
model or a weaker memory-ordering model. The Pentium 4, Intel Xeon,
and P6 family processors do not implement a strong memory-ordering
model, except when using the UC memory type. Despite the fact that
Pentium 4, Intel Xeon, and P6 family processors support processor
ordering, Intel does not guarantee that future processors will support
this model."


>So as long as an (IA32/AMD64) driver doesn't use WC memory regions, it
>needs no memory barrier in KeFlushIoBuffers().

I agree no CPU barrier would be needed, but what about a compiler
barrier?
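
To be concrete about the distinction I mean (a sketch; MSVC intrinsic):

    _ReadWriteBarrier();  /* compiler barrier only: stops the compiler
                             from reordering accesses across this point,
                             but emits no instruction */
    KeMemoryBarrier();    /* both a compiler barrier and a CPU barrier */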

From: Mark Roddy on
On Fri, 20 Oct 2006 20:21:24 -0400, BubbaGump <> wrote:

>I noticed in my DDK header files (3790.1830) that KeFlushIoBuffers()
>is defined to do nothing, absolutely nothing. I know x86 caches are
>already coherent with respect to DMA, but isn't there still the
>possibility of out-of-order loads and stores? Is this a bug? Are
>drivers supposed to do a KeMemoryBarrier() explicitly?
>

No, it isn't a bug. This is DMA, not processor load/store operations.
You need a memory barrier only for shared memory regions that are not
otherwise protected by a lock or an interlocked operation, not for DMA
buffers.
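
For example (a sketch with made-up names), in a sequence like this the
spin lock already supplies all the barriers:

    KIRQL oldIrql;

    /* The spin-lock acquire and release are full barriers, so neither
     * the compiler nor the CPU moves the buffer stores past them: */
    KeAcquireSpinLock(&devExt->DmaLock, &oldIrql);
    devExt->CommonBuffer->Data[0] = value;       /* fill common buffer */
    KeFlushIoBuffers(devExt->Mdl, FALSE, TRUE);  /* no-op on x86 */
    WRITE_REGISTER_ULONG(devExt->GoReg, 1);      /* kick off the DMA */
    KeReleaseSpinLock(&devExt->DmaLock, oldIrql);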

>Even if the x86 still does program ordered stores, and Microsoft will
>update KeFlushIoBuffers() at some point in the future when that
>ordering is no longer true,

KeFlushIoBuffers is intended to accommodate platforms where some
operation is required to guarantee memory coherency after a DMA
operation.

> the compiler's memory accesses might not
>be ordered so at least a compiler memory barrier is needed for DMA out
>to a device.
>

Why? Is the DMA initiated without any locking operations? Is your
concern here compiler optimizations requiring memory barriers (all of
which are implicitly solved through a lock or interlocked operation) or
CPU read/write reordering?

>What about after a DMA? In order to account for speculative loads
>during DMA in from a device, does the call to FlushAdapterBuffers()
>have a memory barrier or should the driver also do a KeMemoryBarrier()
>explicitly here?

I'm confused. Why is the compiler doing speculative loads from the
buffer that was the DMA target before the DMA completed? I don't
believe that this is a real-world example of a compiler optimization
requiring a memory barrier, nor is there a hardware coherency problem,
as cache coherency rules will guarantee the contents of the buffer
once the DMA is complete. At a minimum, the thread that is reading the
contents of the DMA target buffer must wait on some lock for the DMA
to complete, and that lock is a barrier.
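
A sketch of what I mean (made-up names): the interrupt DPC signals
completion, the consumer blocks until then, and the signal/wait pair is
itself the barrier:

    VOID MyDpcForIsr(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        PMY_DEVICE_EXTENSION devExt = (PMY_DEVICE_EXTENSION)Context;
        KeSetEvent(&devExt->DmaDoneEvent, IO_NO_INCREMENT, FALSE);
    }

    VOID ConsumeDmaBuffer(PMY_DEVICE_EXTENSION devExt)
    {
        KeWaitForSingleObject(&devExt->DmaDoneEvent, Executive,
                              KernelMode, FALSE, NULL);
        ProcessData(devExt->CommonBuffer);  /* coherent by now, no
                                               explicit barrier needed */
    }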

All of the examples I have seen regarding memory barriers are
concerned with unlocked shared memory access that local compiler
optimizations or CPU read/write reordering can render incoherent. All
of these problems disappear when the shared region is protected by any
of the standard locks.



=====================
Mark Roddy DDK MVP
Windows Vista/2003/XP/2000 Consulting
Device and Filesystem Drivers
Hollis Technology Solutions 603-321-1032
www.hollistech.com
From: Mark Roddy on
On Sat, 21 Oct 2006 18:15:17 -0400, BubbaGump <> wrote:

>I'm thinking of a transfer of a buffer out to a device using common
>buffer DMA:
>
> 1) driver writes to the common buffer
> 2) driver calls KeFlushIoBuffers
> (device has logical address of buffer from previous operation)
> 3) driver writes "Go" bit of a device register
> 4) device reads from the common buffer
>


"On x86-based, x64-based and Itanium-based hardware, reordering might
take place when a write operation for one location precedes a read
operation for a different location. Processor reordering might move
the read operation ahead of the write operation on the same CPU, thus
effectively reversing their order in code. These architectures do not
reorder read operations followed by read operations or write
operations followed by write operations"
http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/MP_issues.doc#_Toc119927283

Go re-read the write-posting buffer descriptions in the IA32 specs.
Write-write reordering would break all kinds of stuff. Your example is
not valid. If the processor in step (3) initiated the DMA via a read
operation (unlikely, but possible) and did so without using the HAL
READ_REGISTER functions or otherwise introducing a barrier, you might
have a valid case.
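
A sketch of that unlikely case (made-up register names):

    /* A device that starts its DMA when the CPU *reads* a register. A
     * bare volatile load could be hoisted ahead of the preceding
     * buffer store by the CPU; the HAL accessor enforces ordering: */
    devExt->CommonBuffer->Length = length;       /* buffer store */
    (VOID)READ_REGISTER_ULONG(devExt->KickReg);  /* ordered read that
                                                    starts the DMA */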

>I realize that if at least the logical address must be passed again
>before each operation, then its passing will already require a memory
>barrier between (2) and (3), which would substitute for the one
>apparently missing from KeFlushIoBuffers.
>
>I know it's an odd case, but I don't think it breaks any rules except
>for what KeFlushIoBuffers might not do.
>
>
>
>
>On Sat, 21 Oct 2006 12:38:46 -0700, "Jonathan Morrison"
><jonathanm(a)mindspring.com> wrote:
>
>>Can you show an example of a case that you think needs the barrier please. I
>>am trying to come up with the case in my head and having a hard time coming
>>up with one. Thanks.


=====================
Mark Roddy DDK MVP
Windows Vista/2003/XP/2000 Consulting
Device and Filesystem Drivers
Hollis Technology Solutions 603-321-1032
www.hollistech.com