From: Ben Voigt [C++ MVP] on

"Chris Thomasson" <cristom(a)comcast.net> wrote in message
news:9c6dnab76-611abanZ2dnUVZ_rmjnZ2d(a)comcast.com...
> [comp.programming.threads included because the discussion is on lock-free
> algorithms, and there's no place like cpt wrt that subject... ;^]
>
>
> "Pops" <dude(a)nospamnospamnospam.com> wrote in message
> news:%23uMo9huJIHA.6108(a)TK2MSFTNGP03.phx.gbl...
>> Ben Voigt [C++ MVP] wrote:
>>
>>> We were discussing a multiple writer, single consumer queue. For this
>>> the writers each use the InterlockedPushEntrySList function, while the
>>> consumer uses the InterlockedFlushSList function to retrieve the entire
>>> queue as an atomic operation. The consumer is then responsible for
>>> iterating through the list of items retrieved in FIFO order.
>
> You can use multiple consumers with that particular technique:

Yes, but with multiple consumers you

(a) won't get work evenly distributed
(b) ordering becomes rather meaningless, and Pops specified FIFO

[snip]

>> I don't see how it can guard a flush once the initial pointer is read
>> without some lock.
>
> The flush is implemented with a double-width compare-and-swap function that
> can atomically mutate the entire SLIST_HEADER data-structure:

AFAICT, you need only pointer-sized CAS in order to make the singly-linked
list work; at least, I wrote my own version in C++ for template-driven
type-safety and used only pointer-sized functions (each element can only be
in the list once, maybe that helps).
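
For reference, here is a minimal sketch of the pattern under discussion:
producers push with InterlockedPushEntrySList, the single consumer detaches
the whole list with InterlockedFlushSList and reverses it to recover FIFO
order. The names (work_item, queue_push, queue_drain) are mine, not from
anyone's actual code:
_________________________________________
#include <windows.h>
#include <malloc.h>  /* _aligned_malloc / _aligned_free */

/* hypothetical work item; the SLIST_ENTRY must be the first member and the
** node must satisfy MEMORY_ALLOCATION_ALIGNMENT */
typedef struct work_item_ {
    SLIST_ENTRY entry;
    int payload;
} work_item;

static SLIST_HEADER g_queue;

void queue_init(void) {
    InitializeSListHead(&g_queue);
}

/* producers (any number of threads) */
void queue_push(int payload) {
    work_item* w = (work_item*)_aligned_malloc(
        sizeof(work_item), MEMORY_ALLOCATION_ALIGNMENT);
    if (! w) return;
    w->payload = payload;
    InterlockedPushEntrySList(&g_queue, &w->entry);
}

/* single consumer: detach everything atomically, reverse to FIFO, process */
void queue_drain(void) {
    SLIST_ENTRY* e = InterlockedFlushSList(&g_queue);
    SLIST_ENTRY* fifo = 0;
    while (e) {                    /* flushed list comes back LIFO */
        SLIST_ENTRY* next = e->Next;
        e->Next = fifo;
        fifo = e;
        e = next;
    }
    while (fifo) {
        work_item* w = (work_item*)fifo;
        fifo = fifo->Next;
        /* ... process w->payload ... */
        _aligned_free(w);
    }
}
_________________________________________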


From: Chris Thomasson on
"Ben Voigt [C++ MVP]" <rbv(a)nospam.nospam> wrote in message
news:u5uxrPwJIHA.4808(a)TK2MSFTNGP05.phx.gbl...
>
> "Chris Thomasson" <cristom(a)comcast.net> wrote in message
> news:9c6dnab76-611abanZ2dnUVZ_rmjnZ2d(a)comcast.com...
>> [comp.programming.threads included because the discussion is on lock-free
>> algorithms, and there's no place like cpt wrt that subject... ;^]
>>
>>
>> "Pops" <dude(a)nospamnospamnospam.com> wrote in message
>> news:%23uMo9huJIHA.6108(a)TK2MSFTNGP03.phx.gbl...
>>> Ben Voigt [C++ MVP] wrote:
>>>
>>>> We were discussing a multiple writer, single consumer queue. For this
>>>> the writers each use the InterlockedPushEntrySList function, while the
>>>> consumer uses the InterlockedFlushSList function to retrieve the entire
>>>> queue as an atomic operation. The consumer is then responsible for
>>>> iterating through the list of items retrieved in FIFO order.
>>
>> You can use multiple consumers with that particular technique:
>
> Yes, but with multiple consumers you
>
> (a) won't get work evenly distributed

You could use a work-stealing scheme to provide automatic distribution.



> (b) ordering becomes rather meaningless, and Pops specified FIFO

Yes. Good point.




> [snip]
>
>>> I don't see how it can guard a flush once the initial pointer is read
>>> without some lock.
>>
>> The flush is implemented with a double-width compare-and-swap function that
>> can atomically mutate the entire SLIST_HEADER data-structure:
>
> AFAICT, you need only pointer-sized CAS in order to make the singly-linked
> list work, at least I wrote my own version in C++ for template-driven
> type-safety and used only pointer-sized functions (each element can only
> be in the list once, maybe that helps).

Okay. Well, you can't immediately reuse or free nodes. So, if you can
guarantee that a node will never be popped off a list, and reused right away
when there could be other threads concurrently popping, then you're going to
need to take the ABA problem into account. Read here:

http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation/Second
Paragraph)

This explains the ABA race-condition in detail. After reading that, you will
understand why it's important to have some sort of solution to the ABA
problem. DWCAS is just one way to do it.

You cannot use pointer-sized CAS to remove a node from a lock-free LIFO
unless you have an external method to get around the ABA problem in general.
ABA is what happens when a thread T1 reads A, and then thread T2 uses CAS to
change A to B, then back to A. Thread T1 then performs its CAS, and it will
still succeed. This is fine for some algorithms. However, it's not okay for
others. For instance, if you don't have an external solution for ABA, and you
use pointer-sized CAS to try to pop an item off a lock-free linked-list,
well, that's a massive race-condition waiting to happen.
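
To make the interleaving concrete, here is a rough sketch of that broken pop
(the node layout and function name are only illustrative):
_________________________________________
#include <windows.h>

typedef struct node_ {
    struct node_* nx;  /* next node; payload omitted */
} node;

/* naive pop using only pointer-sized CAS; broken under the ABA scenario */
node* naive_pop(node* volatile* head) {
    node* a;
    node* nx;
    do {
        a = *head;            /* T1 reads head == A                       */
        if (! a) return 0;
        nx = a->nx;           /* T1 reads A->nx == B                      */

        /* suppose T1 stalls right here while another thread pops A, pops
        ** and frees B, then pushes A back: head is A again, but A->nx is
        ** no longer B, and B is freed memory...                          */

    } while (InterlockedCompareExchangePointer(
                 (void* volatile*)head, nx, a) != a);

    /* ...the CAS above still succeeds (A compares equal to A), silently
    ** installing a dangling pointer to the freed B as the new head.      */
    return a;
}
_________________________________________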


It sounds like you already solve the ABA problem by enforcing the rule that a
node can only be in the list once. However, you cannot deallocate nodes
either, unless you use a lifetime management scheme to get around the hazard
in the pop function:

_________________________________________
slist_node*
slist_anchor_pop(
  slist_anchor* const _this
) {
  slist_anchor cmp, xchg;
  do {
    cmp = *_this;
    if (! cmp.head) { return 0; }

    /* hazard! refer to PDR algorithm for more details... */
    xchg.head = cmp.head->nx;

    /* increment ABA */
    xchg.aba = cmp.aba + 1;

  } while (! DWCAS(_this, &cmp, &xchg));

  return cmp.head;
}
_________________________________________
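
For clarity, the layout that pop assumes looks roughly like this (the field
names mirror the snippet; on 32-bit x86 the anchor fits a CMPXCHG8B-style
DWCAS, while on x86-64 it needs CMPXCHG16B):
_________________________________________
#include <stddef.h>

typedef struct slist_node_ {
    struct slist_node_* nx;   /* next node in the LIFO                    */
} slist_node;

typedef struct slist_anchor_ {
    slist_node* head;         /* current head of the LIFO                 */
    size_t      aba;          /* version counter, bumped on every pop     */
} slist_anchor;               /* both words are swapped as a single unit  */

/* DWCAS(dest, cmp, xchg): atomically, if *dest equals *cmp, store *xchg
** into *dest and return nonzero; otherwise return zero.                  */
_________________________________________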


You need to be able to get around the hazard. Here is a paper that describes
most of the concerns:

http://www.research.ibm.com/people/m/michael/podc-2002.pdf
(read all...)
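
As a rough illustration of the kind of scheme that paper describes, here is a
sketch of a hazard-pointer-protected pop. This is my sketch, not code from
the paper; it assumes one hazard slot per thread, that retired nodes are only
freed once no slot references them, and that nodes are never pushed back
before they are reclaimed:
_________________________________________
#include <windows.h>

typedef struct node_ {
    struct node_* nx;  /* next node; payload omitted */
} node;

/* one published "I am using this node" slot per thread (illustrative) */
#define MAX_THREADS 64
static node* volatile g_hazard[MAX_THREADS];

node* protected_pop(node* volatile* head, int tid) {
    node* h;
    for (;;) {
        h = *head;
        if (! h) return 0;

        g_hazard[tid] = h;      /* publish the hazard                     */
        MemoryBarrier();        /* make it visible before re-checking     */

        if (*head != h) {       /* re-validate; h may already be retired  */
            continue;
        }

        /* safe to dereference: h cannot be freed while it is hazarded */
        node* nx = h->nx;

        if (InterlockedCompareExchangePointer(
                (void* volatile*)head, nx, h) == h) {
            break;
        }
    }
    g_hazard[tid] = 0;

    /* caller must retire h and only free it once no g_hazard slot
    ** still points at it (see the paper for the reclamation side). */
    return h;
}
_________________________________________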


You can hit the ABA scenario if you pop node A; free node A; allocate a new
node that happens to have the same address as the old node A; and push it
back into the list. This would change the head of the list from A, to
something else, and right back to A.

From: Chris Thomasson on
[...]
>> AFAICT, you need only pointer-sized CAS in order to make the
>> singly-linked list work, at least I wrote my own version in C++ for
>> template-driven type-safety and used only pointer-sized functions (each
>> element can only be in the list once, maybe that helps).
>
> Okay. Well, you can't immediately reuse or free nodes.





> So, if you can
^^^^^^^^^^^^^^^^^^^^^^^^

Should read as:

So, if you cannot


> guarantee that a node will never be popped off a list, and reused right
> away when there could be other threads concurrently popping, then you're
> going to need to take the ABA problem into account. Read here:
[...]


___________________________________________________________________
Okay. Well, you can't immediately reuse or free nodes. So, if you _cannot_
guarantee that a node will never be popped off a list, and reused right away
when there could be other threads concurrently popping, then you're going to
need to take the ABA problem into account. Read here:

http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Free-Pool Manipulation/Second
Paragraph)

This explains the ABA race-condition in detail. After reading that, you will
understand why it's important to have some sort of solution to the ABA
problem. DWCAS is just one way to do it.
___________________________________________________________________

Sorry for any confusion!

From: Ben Voigt [C++ MVP] on
> This explains the ABA race-condition in detail. After reading that, you will
> understand why it's important to have some sort of solution to the ABA
> problem. DWCAS is just one way to do it.

After thinking about it for a bit, I do not think ABA is actually a problem.

The only way that InterlockedCompareExchange will succeed is if the old head
pointer is what I have stored in the next-node pointer of the new head;
therefore I have correctly inserted a new node regardless of what
manipulations took place in the meantime.

Please take a look at my queue implementation and let me know if you see any
problem (I realize it isn't truly lock-free until I replace the allocator):

#pragma once

#include "Queuefwd.h"

#pragma managed(push, off)

/**
** \brief thread-safe queue, single element append, all elements retrieved
**
** High-performance lockless queue for multi-threading applications. The
** order of the elements is preserved, but elements cannot be individually
** removed. Intended application is for multiple threads to post work items
** with a single reader.
**
** <center>
** \dotfile squeue-concept.dot "API Concept"
**
** \dotfile squeue-filled.dot "Typical State"
** </center>
**/
template<typename T>
class SQueue climodifier(sealed)
{
    /**
    ** \brief internal linked-list implementation of queue for max
    ** performance, follows ordinary singly-linked-list pattern
    **/
    struct SQueueElement climodifier(sealed)
    {
        /**
        ** \brief pointer to next list element, ala singly-linked-list pattern
        **
        ** NULL for the last element in the list
        **/
        SQueueElement* pNext;

        /**
        ** \brief data storage pigeonhole, ala singly-linked-list pattern
        **/
        const T data;

        SQueueElement( const T& dataInit )
            : data(dataInit)
        {}

    private:
        SQueueElement( const SQueueElement& );
        SQueueElement& operator=(const SQueueElement&);
    };

    /**
    ** \brief pointer to first list element, ala singly-linked-list pattern
    **
    ** NULL if the list is empty.
    **
    ** Since the elements are stored in reverse order to leverage
    ** the compare-exchange primitive, this first list position holds
    ** the last (most recent) queue element.
    **
    ** \invariant m_pFirst holds nullptr or the address of an SQueueElement.
    **/
    void* volatile m_pFirst;

    //static lenarray<T>* TransferElements(SQueueElement* pStart, size_t startIndex);
public:
    //! \brief Constructs a new empty queue
    SQueue( void ) { m_pFirst = nullptr; }

    //! \brief Adds an element to the end of the queue
    void Add(const T& data);
    //! \brief Retrieves all elements, removing them from the queue
    lenarray<T>* Get(void);
};

/**
** Appends the data parameter to the queue. A copy
** of data is made using T::operator=(const T&)
**
** \param[in] data Value to be added as a new tail element
**/
template<typename T>
void SQueue<T>::Add( const T& data )
{
    SQueueElement* pNewHead = new SQueueElement(data);
    void* pCurrent = m_pFirst;
    void* pOldHead;
    do {
        pOldHead = pCurrent;
        pNewHead->pNext = static_cast<SQueueElement*>(pOldHead);
    } while ((pCurrent = ::InterlockedCompareExchangePointer(
                  &m_pFirst, pNewHead, pOldHead)) != pOldHead);
}

/**
** Fetches all queued items. A copy
** of each element is made using T::operator=(const T&)
**
** \returns a lenarray containing the count of items removed,
** along with the items themselves in order of queueing.
**
** The complexity of this routine is due mainly to the fact
** that the queue is stored with items in reverse order for
** the sake of performance. It's not a bad tradeoff, because
** the extra code is minimal and it lets us avoid locking.
**
** \dotfile squeue-get.dot "Fetching the elements"
**/
template<typename T>
lenarray<T>* SQueue<T>::Get(void)
{
    SQueueElement* pGotTail = static_cast<SQueueElement*>(
        InterlockedExchangePointer(&m_pFirst, (void*)nullptr));

    if (pGotTail == nullptr)
        return nullptr;

    size_t length = 0;
    SQueueElement* pElem = pGotTail;
    do {
        length++;
        pElem = pElem->pNext;
    } while (pElem != nullptr);

    lenarray<T>* pArray = new lenarray<T>(length);
    T* pData = *pArray + length;

    pElem = pGotTail;
    do {
        *(--pData) = pElem->data;
        SQueueElement* pNext = pElem->pNext;
        delete pElem;
        pElem = pNext;
    } while (--length > 0);

    return pArray;
}
#pragma managed(pop)


From: Chris Thomasson on
"Ben Voigt [C++ MVP]" <rbv(a)nospam.nospam> wrote in message
news:O2Xn5DxJIHA.3400(a)TK2MSFTNGP03.phx.gbl...
>> This explains the ABA race-condition in detail. After reading that, you
>> will
>> understand why its important to have some sort of solution to the ABA.
>> DWCAS
>> is just one way to do it.
>
> After thinking about it for a bit, I do not think ABA is actually a
> problem.

It is. There are numerous papers that deal with the subject. If you have a
solution for it, that's great. I look forward to looking at the code you
attached.


>
> The only way that InterlockedCompareExchange will succeed is if the old
> head pointer is what I have stored in the next-node pointer of the new
> head, therefore I have correctly inserted a new node regardless of what
> manipulations took place in the meantime.

I suggest that you read the paper by Maged I linked to in the previous post.
The ABA problem is real. At least search the text in the PDF for the term
"ABA", and read the text contained in all the occurrences therein.



> Please take a look at my queue implementation and let me know if you see
> any problem (I realize it isn't truely lock-free until I replace the
> allocator):

Okay. I am going to look at it. I am still tinkering around with my
rwmutex_win class. I want to extend the API to allow for try read/write
functions (e.g., tryrdlock/wrlock).