From: Chris Thomasson on
"Ben Voigt [C++ MVP]" <rbv(a)nospam.nospam> wrote in message
news:uEfnFpsJIHA.4044(a)TK2MSFTNGP02.phx.gbl...
>
> "Pops" <dude(a)nospamnospamnospam.com> wrote in message
> news:%23tndYNsJIHA.4712(a)TK2MSFTNGP04.phx.gbl...
>> Ben (C++ MVP) recommended using the Interlocked Singly Linked Lists API:
>>
>> http://msdn2.microsoft.com/en-us/library/ms686962.aspx
>>
>> as a more efficient synchronization method for a recent thread discussion
>> in regards to a FIFO queue design.
>>
>> If Ben is reading or if anyone else is familiar with this API, I have a
>> few questions:

I have a question as well. Do you know whether Microsoft uses a pointer
compression algorithm to implement their SList API on 64-bit systems? I
think I remember Neill Clift, who works on the Windows kernel, mentioning
something about that, but I can't seem to find the message on Google right
now.

From: Chris Thomasson on
"Chris Thomasson" <cristom(a)comcast.net> wrote in message
news:Q9adnRAzYrses6banZ2dnUVZ_oWdnZ2d(a)comcast.com...
>
> "Pops" <dude(a)nospamnospamnospam.com> wrote in message
> news:%23tndYNsJIHA.4712(a)TK2MSFTNGP04.phx.gbl...
>> Ben (C++ MVP) recommended using the Interlocked Singly Linked Lists API:
>>
>> http://msdn2.microsoft.com/en-us/library/ms686962.aspx
>>
>> as a more efficient synchronization method for a recent thread discussion
>> in regards to a FIFO queue design.
>>
>> If Ben is reading or if anyone else is familiar with this API, I have a
>> few questions:
>>
>> First, based on what I see, this is for LIFO designs (i.e., stacks)? Not
>> FIFO. Am I missing something? I only see Push/Pop functions, no
>> head/tail functions.
>>
>> Second, can you tell how it works with IPC? Can it work for IPC?
>
> FWIW, if you want to create a blocking lock-less queue (e.g., consumer
> waits while queue is empty), well here is an algorithm I invented a while
> back:
>
> http://groups.google.de/group/comp.programming.threads/msg/632b6bdc2a137459
>
[...]

I should point out that the DWCAS used in that algorithm follows an IBM-style
implementation: it automatically updates the comparand on failure. Here is my
implementation for x86:



int
np_ac_i686_atomic_dwcas_fence(void* dest, void* comparand, void* exchange);


.globl np_ac_i686_atomic_dwcas_fence
np_ac_i686_atomic_dwcas_fence:
  pushl %esi
  pushl %ebx
  movl 16(%esp), %esi        # esi = comparand
  movl (%esi), %eax          # edx:eax = expected value
  movl 4(%esi), %edx
  movl 20(%esp), %esi        # esi = exchange
  movl (%esi), %ebx          # ecx:ebx = new value
  movl 4(%esi), %ecx
  movl 12(%esp), %esi        # esi = destination
  lock cmpxchg8b (%esi)
  jne np_ac_i686_atomic_dwcas_fence_fail
  xorl %eax, %eax            # success: return 0
  popl %ebx
  popl %esi
  ret

np_ac_i686_atomic_dwcas_fence_fail:
  movl 16(%esp), %esi        # failure: store the value observed in the
  movl %eax, (%esi)          # destination back into the comparand...
  movl %edx, 4(%esi)
  movl $1, %eax              # ...and return 1
  popl %ebx
  popl %esi
  ret
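
FWIW, here is a quick usage sketch (just an illustration; the argument order
of destination, comparand and exchange, and the return-0-on-success
convention, are read straight off the listing above). Since the comparand
gets refreshed automatically on failure, the retry loop never has to re-read
the destination by hand:

____________________________________________________________

int
np_ac_i686_atomic_dwcas_fence(void* dest, void* comparand, void* exchange);

/* atomically increment a 64-bit counter on x86-32 */
void
counter_inc(unsigned long long* c) {
  unsigned long long cmp = *c;
  unsigned long long xchg;
  do {
    /* on failure, cmp has already been reloaded with the current
       value of *c, so we only recompute the exchange value */
    xchg = cmp + 1;
  } while (np_ac_i686_atomic_dwcas_fence(c, &cmp, &xchg));
}
____________________________________________________________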


From: Pops on
Ben Voigt [C++ MVP] wrote:

> We were discussing a multiple writer, single consumer queue. For this the
> writers each use the InterlockedPushEntrySList function, while the consumer
> uses the InterlockedFlushSList function to retrieve the entire queue as an
> atomic operation. The consumer is then responsible for iterating through
> the list of items retrieved in FIFO order.

Ben, just a few more questions.

Sorry if I seem dense about this, but I would like to get this
clarification.

When I see you bring up the multiple writer, single consumer queue
premise above, I am now getting the idea that you introduced this
Interlocked Singly Linked List API to me for this particular scenario
only - one consumer.

In other words, is this API thread safe? The docs say "nonblocking
algorithm to provide atomic synchronization", which I presume means it is
using some internal lock-free shared memory method.

But is it protected from multiple threads? I don't see how it can guard
a flush once the initial pointer is read without some lock.

TIA

--
HLS


From: Chris Thomasson on
[comp.programming.threads included because the discussion is on lock-free
algorithms, and there's no place like cpt wrt that subject... ;^]


"Pops" <dude(a)nospamnospamnospam.com> wrote in message
news:%23uMo9huJIHA.6108(a)TK2MSFTNGP03.phx.gbl...
> Ben Voigt [C++ MVP] wrote:
>
>> We were discussing a multiple writer, single consumer queue. For this
>> the writers each use the InterlockedPushEntrySList function, while the
>> consumer uses the InterlockedFlushSList function to retrieve the entire
>> queue as an atomic operation. The consumer is then responsible for
>> iterating through the list of items retrieved in FIFO order.

You can use multiple consumers with that particular technique:

http://groups.google.com/group/comp.arch/msg/6ac06e17e5138820

http://groups.google.ca/group/comp.programming.threads/msg/a359bfb41c68b98b




> Ben, just a few more questions.
>
> Sorry if I seem dense about this, but I would like to get this
> clarification.
>
> When I see you bring up the multiple writer, single consumer queue
> premise above, I am now getting the idea that you introduced this
> Interlocked Singly Linked List API to me for this particular scenario
> only - one consumer.
>
> In other words, is this API thread safe? The docs say "nonblocking
> algorithm to provide atomic synchronization", which I presume means it is
> using some internal lock-free shared memory method.
>
> But is it protected from multiple threads?

The API is 100% atomically thread-safe. You can hit the following functions
with any number of threads concurrently:

______________________________________
InterlockedFlushSList
InterlockedPopEntrySList
InterlockedPushEntrySList
QueryDepthSList
______________________________________
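
FWIW, here is a rough sketch of the multiple-writer, single-consumer pattern
Ben described (the work_item type, the payload field and the allocation
scheme are just placeholders for illustration). Any number of writers push
concurrently; the single consumer flushes the whole list in one atomic shot
and then reverses it, since the flush hands the list back in LIFO order:

____________________________________________________________

#include <windows.h>
#include <malloc.h>

typedef struct work_item_s {
  SLIST_ENTRY entry;  /* must be first so the cast below is valid */
  int payload;
} work_item;

static SLIST_HEADER g_stack;  /* call InitializeSListHead(&g_stack) once */

/* any number of threads can call this concurrently */
void
producer_push(int payload) {
  /* SLIST_ENTRY must be aligned to MEMORY_ALLOCATION_ALIGNMENT */
  work_item* w = (work_item*)_aligned_malloc(
    sizeof(*w), MEMORY_ALLOCATION_ALIGNMENT);
  w->payload = payload;
  InterlockedPushEntrySList(&g_stack, &w->entry);
}

/* the consumer grabs everything at once... */
void
consumer_drain(void) {
  PSLIST_ENTRY lifo = InterlockedFlushSList(&g_stack);
  PSLIST_ENTRY fifo = NULL;

  /* ...and reverses the detached list to recover FIFO order */
  while (lifo) {
    PSLIST_ENTRY nx = lifo->Next;
    lifo->Next = fifo;
    fifo = lifo;
    lifo = nx;
  }
  while (fifo) {
    work_item* w = (work_item*)fifo;
    fifo = fifo->Next;
    /* ... process w->payload ... */
    _aligned_free(w);
  }
}
____________________________________________________________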

BTW, this algorithm has been around for ages, as you probably know. In fact,
the IBM Principles of Operation gives a more detailed description of why it
is atomically thread-safe:

http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation)


Did that help clear some things up for you?




> I don't see how it can guard a flush once the initial pointer is read
> without some lock.

The flush is implemented with a double-width compare-and-swap that can
atomically mutate the entire SLIST_HEADER data structure:

http://groups.google.co.uk/group/comp.programming.threads/browse_frm/thread/b5bedbe04e939f55
(I would advise to read entire thread...)
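
Roughly, the flush boils down to something like this (just a pseudo-code
sketch over a two-word anchor; the real SLIST_HEADER also carries
depth/sequence bits, and DWCAS stands in for the double-width
compare-and-swap):

____________________________________________________________

typedef struct node_s node;
struct node_s { node* nx; };

typedef struct anchor_s {
  node* head;
  __int32 aba;  /* monotonic counter */
} anchor;

node*
anchor_flush_dwcas(
  anchor* const _this
) {
  anchor cmp, xchg;
  do {
    /* snapshot the whole anchor, then try to swing it to empty in a
       single atomic shot; if anything slips in between the snapshot
       and the DWCAS, the DWCAS fails and we simply retry, so no lock
       is needed */
    cmp = *_this;
    if (! cmp.head) { return 0; }
    xchg.head = 0;
    xchg.aba = cmp.aba;
  } while (! DWCAS(_this, &cmp, &xchg));
  return cmp.head;  /* the whole detached list, in LIFO order */
}
____________________________________________________________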


On x86-32 the instruction is CMPXCHG8B. On some x86-64s the instruction is
CMPXCHG16B. On the Itanium, there is a weird instruction called CMP8XCHG16B.
Double-width CAS is tricky to rely on because you cannot count on the
instruction being available on most 64-bit systems. For instance, I do a lot
of work on the SPARC, which has the CASX instruction; in 32-bit mode CASX
works on a 64-bit datum (so it can serve as a DWCAS), however, on SPARC64 the
CASX instruction still only works on a 64-bit datum. This can be a royal pain
in the neck if you're into inventing new lock-free algorithms:

http://groups.google.com/group/comp.arch/browse_frm/thread/71f8e0094e353e5
(I would advise to read entire thread...)



There are some nifty tricks you can do to get around this nightmare. I
outline some of them here:

http://groups.google.com/group/comp.arch/msg/a3ebfe80363a4399
(examples toward end of message...)
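
One of the simpler flavors, just as a sketch of the idea rather than the
exact schemes in that message: keep the nodes in a contiguous pool and store
32-bit indices instead of raw pointers, so an { index, aba } pair still fits
in a plain 64-bit CAS even on a 64-bit system. Something like:

____________________________________________________________

#include <windows.h>

#define POOL_SIZE 4096
#define NIL 0xFFFFFFFFUL

typedef struct node_s {
  ULONG nx;  /* pool index of the next node; NIL terminates the list */
  /* payload... */
} node;

static node g_pool[POOL_SIZE];

typedef union anchor_u {
  struct {
    ULONG head;  /* pool index; NIL when empty */
    ULONG aba;   /* monotonic counter */
  } s;
  LONGLONG whole;  /* what the single-width 64-bit CAS operates on */
} anchor;

void
anchor_push(anchor* const _this, ULONG idx) {
  anchor cmp, xchg;
  do {
    cmp.whole = _this->whole;
    g_pool[idx].nx = cmp.s.head;
    xchg.s.head = idx;
    xchg.s.aba = cmp.s.aba;  /* producers do not need the bump */
  } while (InterlockedCompareExchange64(
             &_this->whole, xchg.whole, cmp.whole) != cmp.whole);
}
____________________________________________________________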


Any thoughts on this?

From: Chris Thomasson on
"Chris Thomasson" <cristom(a)comcast.net> wrote in message
news:uqednbJACNZ0sKbanZ2dnUVZ_ualnZ2d(a)comcast.com...
> "Pops" <dude(a)nospamnospamnospam.com> wrote in message
> news:%23tndYNsJIHA.4712(a)TK2MSFTNGP04.phx.gbl...
[...]
>> Second, can you tell how it works with IPC? Can it work for IPC?
>
> Yes. You can create a node cache in shared memory and allocate/deallocate
> your SLIST_ENTRY from there. Something like:
>
>
> (quick and dirty pseudo-code sketch)
> ____________________________________________________________
>
[...]
> ____________________________________________________________
>
>
>
>
> Also, for what it's worth, here is how Microsoft implements the SList API
> under 32-bit systems:
>
>
>
> typedef struct slist_anchor_s slist_anchor;
> typedef struct slist_node_s slist_node;
>
> struct slist_node_s {
> slist_node* nx;
> };
>
> struct slist_anchor_s {
> slist_node* head;
> /* monotonic counter to avoid the infamous aba problem */
> __int32 aba;
> };
>
>
> void
> slist_anchor_init(
> slist_anchor* const _this
> ) {
> _this->head = 0;
> _this->aba = 0;
> }
>
>
> void
> slist_anchor_push(
> slist_anchor* const _this,
> slist_node* node
> ) {
> slist_node* cmp;
> do {
> cmp = _this->head;
> node->nx = cmp;
> } while (! CAS(&_this->anchor, cmp, node));
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

should read as:

} while (! CAS(&_this->head, cmp, node));

> }



I forgot to show the flush. Here is how to do that:


slist_node*
slist_anchor_flush(
  slist_anchor* const _this
) {
  return SWAP(&_this->head, 0);
}
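
For reference, SWAP is just an unconditional atomic exchange, and the CAS in
the push is a plain single-width compare-and-swap. On Win32, one possible way
to spell them (just an illustration) would be:

____________________________________________________________

#include <windows.h>

/* unconditional atomic exchange; returns the previous value */
#define SWAP(pptr, newval) \
  InterlockedExchangePointer((PVOID volatile*)(pptr), (PVOID)(newval))

/* boolean single-width compare-and-swap */
#define CAS(pptr, cmpval, newval) \
  (InterlockedCompareExchangePointer((PVOID volatile*)(pptr), \
                                     (PVOID)(newval), \
                                     (PVOID)(cmpval)) == (PVOID)(cmpval))
____________________________________________________________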


The reason I don't need to bump the monotonic aba counter for the flush is
because it's immune to the problem in the first place; here is why:



LIFO producers are inherently not affected by aba. On the other hand, LIFO
consumers are affected, but only if there is a "compare" type method being
used for synchronization. Using an atomic swap to set the stack anchor
to zero can get around aba wrt LIFO lock-free stacks. Was that similar to
your initial solution?





> slist_node*
> slist_anchor_pop(
> slist_anchor* const _this
> ) {
> slist_anchor cmp, xchg;
> do {
> cmp = *_this;
> if (! cmp.head) { return 0; }
>
> /* hazard! refer to PDR algorithm for more details... */
> xchg.head = cmp.head->nx;
>
> /* increment ABA */
> xchg.aba = cmp.aba + 1;
>
> } while (! DWCAS(_this, &cmp, &xchg));
>
> return cmp.head;
> }
>
>
>
>
> Any thoughts?



BTW, would anybody like to see an actual implementation of this in shared
memory? If so, I will stick it on my to-do list.

;^)