From: Chris Thomasson on
> "Chris Thomasson" <cristom(a)comcast.net> wrote in message
> news:f7OdndviL--1OqHanZ2dnUVZ_qmlnZ2d(a)comcast.com...
>> One could theoretically open themselves up to live-lock conditions if
>> they
>> did not take this into consideration when implementing certain lock-free
>> algorithms on PPC. For instance, if the anchor of a lock-free LIFO was
>> not
>> the size of a cache-line, and was not aligned on a cache-line boundary
>> you
>> could get "false-sharing" on the anchor. I don't have much experience
>> with
>> PPC, but it seems that false-sharing could break reservations on the
>> anchor
>> and adversely affect forward progress, and might cause live-lock like
>> conditions...

[top-post corrected]


"m" <m(a)b.c> wrote in message news:OMNiHf8JIHA.5116(a)TK2MSFTNGP03.phx.gbl...
> Lock-free & wait-free are independent idioms which, in certain algorithms,
> can overlap. Except that lock-free doesn't imply conditional looping;
> this is an attribute of many algorithms that are lock-free, not a defining
> property of the idiom,

Fair enough.




> your definitions are reasonable; so what am I missing?

I don't think you're missing anything.




> The fact that on certain architectures, false-sharing can adversely affect
> performance is a consideration in implementation and algorithm choice.
> Specifically, don't choose an algorithm that is hard to implement or
> performs badly on your target architecture ;)

100% Agreed.
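The cache-line alignment concern quoted at the top of this post can be
sketched in portable C++ (a minimal, hypothetical illustration, not code
from the thread; the 64-byte line size is an assumption, and the pop shown
here ignores the ABA problem entirely since only the alignment point is
being demonstrated):

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size; query the CPU in real code

struct Node {
    Node* next;
    int   value;
};

// The anchor occupies a full cache line and is aligned to one, so stores
// to unrelated neighboring data cannot false-share the line (and, on
// LL/SC machines such as PPC, cannot keep breaking the reservation).
struct alignas(kCacheLine) LifoAnchor {
    std::atomic<Node*> head{nullptr};
    char pad[kCacheLine - sizeof(std::atomic<Node*>)];  // pad out the line
};

inline void lifo_push(LifoAnchor& a, Node* n) {
    n->next = a.head.load(std::memory_order_relaxed);
    // CAS loop: retry until the head we observed is still the head.
    while (!a.head.compare_exchange_weak(n->next, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed)) {
    }
}

inline Node* lifo_pop(LifoAnchor& a) {
    Node* h = a.head.load(std::memory_order_acquire);
    // NOT ABA-safe as written; a real pop needs a version counter or SMR.
    while (h && !a.head.compare_exchange_weak(h, h->next,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
    }
    return h;
}
```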

From: Chris Thomasson on
"Ben Voigt [C++ MVP]" <rbv(a)nospam.nospam> wrote in message
news:uwEcTw5JIHA.3820(a)TK2MSFTNGP03.phx.gbl...
>
> "Chris Thomasson" <cristom(a)comcast.net> wrote in message
> news:I5-dnZi55r10FabanZ2dnUVZ_judnZ2d(a)comcast.com...
>> "Ben Voigt [C++ MVP]" <rbv(a)nospam.nospam> wrote in message
>> news:O2Xn5DxJIHA.3400(a)TK2MSFTNGP03.phx.gbl...
[...]
>>>
>>> Please take a look at my queue implementation and let me know if you see
>>> any problem (I realize it isn't truly lock-free until I replace the
>>> allocator):
>>>
>> [...]
>>
>> Good news! ;^)
[...]

> Thanks for the review. I was pretty sure it was correct but it's very
> nice to have concurrence from an expert.

You're welcome. I think your SQueue class is sound wrt the overall concerns
of the ABA problem in general. I also think that the code correctly
addresses the concerns of deallocating nodes from lock-free
data-structures. You have a performance enhancement in the
SQueue<T>::Get() function because it's 100% wait-free, as long as the
hardware that Windows happens to be running on supports an atomic swap
instruction; Intel does. SPARC used to support the SWAP instruction; they
deprecated it and delegated its functionality to the CAS instruction on
32-bit systems, and CASX on 64-bit systems. It's kind of funny IMHO simply
because CASX operates on a 64-bit datum in 64-bit mode, despite its literal
name; the way I read it, CASX implies a DWCAS-like operation, which would
mean a 128-bit operation in 64-bit mode. That seems odd to me, to say the
least...
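The swap-vs-CAS point above can be sketched with portable C++11 atomics;
here std::atomic::exchange plays the role of a SWAP/XCHG-style instruction
(a hedged illustration of the wait-free vs. lock-free difference, not the
actual SQueue code):

```cpp
#include <atomic>

struct Node { Node* next; int value; };

// Wait-free grab of an entire LIFO: one unconditional atomic exchange
// detaches the whole list, so it completes in a single step regardless
// of contention -- this is what an atomic swap instruction buys you.
inline Node* flush_all(std::atomic<Node*>& head) {
    return head.exchange(nullptr, std::memory_order_acquire);
}

// Lock-free push: a CAS retry loop. Some thread always makes progress,
// but an individual thread may, in principle, retry indefinitely.
inline void push(std::atomic<Node*>& head, Node* n) {
    n->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}
```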





> As far as I'm concerned, the only (tenable) way to write multi-threaded
> code is to have a couple of lock-free thread-safe components (like my
> squeue) that fit entirely on a single page, and elsewhere avoid accessing
> the same variable from more than one thread. In any non-trivial project,
> trying to protect shared data structures any other way is just too complex
> to reason about, with too many opportunities for deadlocks and/or race
> conditions, and even if those are solved, then contention to a degree that
> removes the benefits of multithreading.

Well, deadlocks are caused by the misuse of locking. IMVHO, if a programmer
cannot get lock-based algorithms correct, then they should be nowhere near
a lock-free one, to say the least. The first of the two links provided
below points out some of the basic caveats wrt replacing a lock-based
synchronization point with its non-blocking equivalent. For instance, if
you stick a huge engine in a car with crappy tires, well, then Murphy's Law
dictates that you're going to have a blow-out at high speed, on a windy
road, with a cliff on one side. Therefore, you must upgrade the tires, and
probably some other components as well... Solving that in an efficient
manner can sometimes turn out to be a fairly long and tedious endeavor;
indeed!




>> Humm... Well, FWIW, if your into low-overhead queuing, I suggest you skim
>> through the following message and follow links:
>>
>> http://groups.google.com/group/comp.arch/msg/a7501814f19731cd
>>
>
> Thanks, I'll take a look.
[...]

I think that you just might find the brief overview of the message-passing
techniques described therein fairly interesting, and perhaps a bit useful,
in that you may be able to fit it into some of your _existing_
synchronization schemes with fairly minimal effort. You can read the
following for some more details on that particular subject:

http://groups.google.com/group/comp.programming.threads/msg/9a5500d831dd2ec7

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/ef3f26bbe1c52e0b


Any thoughts on this?

From: Tim Roberts on
"Chris Thomasson" <cristom(a)comcast.net> wrote:
>
>I have a question as well. Do you know if Microsoft uses a pointer
>compression algorithm to implement their SList API on 64-bit systems? I
>think I remember Neill Clift, who works on the Windows Kernel, mention
>something about that, but I can't seem to find the message on Google right
>now.

I, personally, would be very surprised if they did. The benefit is so
small, and memory is essentially free.

Further, in my experience, most kernel structures use doubly-linked
lists.
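For what it's worth, a purely hypothetical sketch of what such pointer
compression could look like: if all nodes live in one region and are
16-byte aligned, a single 64-bit word can hold a shifted node offset plus
an ABA sequence counter, keeping the anchor updatable with one plain CAS.
This is NOT Microsoft's actual SList layout; every field width here is
invented for illustration.

```cpp
#include <cstdint>

// Hypothetical packed anchor: [63:20] sequence counter, [19:0] offset/16.
// A 20-bit field of 16-byte-aligned offsets addresses a 16 MB node region.
struct PackedAnchor {
    static constexpr std::uint64_t kOffsetBits = 20;
    static constexpr std::uint64_t kOffsetMask = (1ull << kOffsetBits) - 1;

    std::uint64_t bits = 0;

    // Combine a byte offset (must be 16-byte aligned) with a sequence count.
    static std::uint64_t pack(std::uint64_t offset, std::uint64_t seq) {
        return (seq << kOffsetBits) | ((offset >> 4) & kOffsetMask);
    }
    std::uint64_t offset() const { return (bits & kOffsetMask) << 4; }
    std::uint64_t seq() const    { return bits >> kOffsetBits; }
};
```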
--
Tim Roberts, timr(a)probo.com, DDK MVP
Providenza & Boekelheide, Inc.
From: Joe Seigh on
m wrote:
> Yes I am extending the discussion to include all kinds of data structures
> besides queues; and yes there are various ways to protect them. No I don't
> really want to discuss them in detail, I just wanted to put something out
> there to see if I could tweak someone's interest ;)
>
You made a generalization and applied it to the special case.
>
>
> Thread pre-emption is important on all pre-emptive multi-tasking operating
> systems regardless of how many CPUs are present in the system. The
> importance of this consideration lies in the fact that between any two
> instructions executed by thread A, thread B may have executed a large number
> of instructions (either on this CPU or on another one) or none at all. The
> reason this is, implicitly, important is that, even though one could use a
> lock-free algorithm for nearly anything, they are specifically designed for
> HPC scenarios with multiple producers & multiple consumers (or more
> generally, multiple modifiers).


Yes, preemption causes more variation in a thread's rate of progress, but so what?
In multi-threading you don't make any assumptions about the rate of forward
progress, preemption or not. And it's certainly news to me that lock-free
is mainly important to HPC and not so much for other things.
>
>
>
> Lock-free is not a forward progress guarantee. Lock-free, a misnomer, is an
> idiom in which software locks are replaced with hardware locks because they
> are, presumably, more efficient. This efficiency is achieved, if by
> nothing else, by being smaller (one instruction vs. hundreds) and by
> avoiding interaction with the OS scheduler. Lock-free algorithms must
> provide a forward progress guarantee, otherwise they would cease to function
> under contention, but that guarantee, in and of itself, does not make an
> algorithm lock-free and hence does not define the idiom.
>

The most commonly used definition for lock-free is by Herlihy. He also did
definitions for wait-free and obstruction-free.
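Herlihy's distinction can be sketched with portable C++ atomics (a minimal
illustration added here, not code from the thread): wait-free means every
thread finishes in a bounded number of its own steps; lock-free only
guarantees that some thread makes progress.

```cpp
#include <atomic>

// Wait-free increment: fetch_add completes in a bounded number of steps
// for EVERY caller, regardless of what other threads are doing.
inline int wait_free_inc(std::atomic<int>& c) {
    return c.fetch_add(1, std::memory_order_relaxed);
}

// Lock-free increment: the CAS loop guarantees SOME thread always
// succeeds (system-wide progress), but any single thread can, in
// principle, be starved into retrying under contention.
inline int lock_free_inc(std::atomic<int>& c) {
    int old = c.load(std::memory_order_relaxed);
    while (!c.compare_exchange_weak(old, old + 1,
                                    std::memory_order_relaxed)) {
    }
    return old;
}
```

Both return the previous value; the difference is purely the progress
guarantee, which is why the guarantee alone does not define the idiom.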


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
From: Pops on
Ben Voigt [C++ MVP] wrote:

> We were discussing a multiple writer, single consumer queue. For this the
> writers each use the InterlockedPushEntrySList function, while the consumer
> uses the InterlockedFlushSList function to retrieve the entire queue as an
> atomic operation. The consumer is then responsible for iterating through
> the list of items retrieved in FIFO order.

Hi Ben,

I've been spending the last day or so working on this Interlocked Singly
Linked List (ISLL) and profiled two applications that are basically the
same, with one key difference:

PROG-A: MFC CList FIFO queue w/ my RW and Chris's RW code,
PROG-B: ISLL with no RW code. Only LIFO analyzed here.

Each allows me to play with different parameters:

#define USE_CRITSECT_H // Use my RW or Chris's RW

BOOL PRINT_LINES = FALSE; // Avoid console overhead
DWORD MAX_WRITER_THREADS = 1; // # of writer threads
DWORD MAX_READER_THREADS = 1; // # of reader threads
DWORD AUTO_STOP_COUNT = 1000; // # of writer events
DWORD WRITER_FREQ_MS = 1; // Frequency of writes
DWORD READER_FREQ_MS = 1; // Frequency of reads
DWORD READER_EVENT_THRESHOLD= 0; // # before reader is signaled
DWORD READER_PROCESS_DELAY = 0; // Simulate processing time

Each writer and reader thread has a one (1) quantum frequency, using a
while( WFSO(ShutdownEvent,1) ) thread-break test. So essentially there is
a 15ms frequency of pumping 1 item into the list, and a 15ms reader
frequency. The wait can be changed with the XXXX_FREQ_MS parameters.

AUTO_STOP_COUNT allows the test to end. The main thread watches this
count before signaling the threads to shut down.

The READER_EVENT_THRESHOLD, if set, will allow the queue to build before
the writer signals the reader event.

For PROG-A with the RW locks, the USE_CRITSECT_H define allows me to test
my RW classes or Chris's fine RW class. :-)

The testing is on an Intel Pentium D 2.66 GHz, 1 GB RAM XP Pro box.

The program's main thread creates the threads. When AUTO_STOP_COUNT is
reached, it signals the shutdown event.

Another console utility is used to set the performance counters to watch
for and log. In this profile, I am only looking at the context-switch rate.

The results:

# PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1023
* Total Empty Gets : 362

PROG-A Reader: ~117 cs/sec
PROG-A Writer: ~88 cs/sec

# PROG-A: USING CHRIS'S READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1023
* Total Empty Gets : 259

PROG-A Reader: ~77 cs/sec
PROG-A Writer: ~71 cs/sec

So Chris's code offers less context switching. Don't get excited, Chris.
:-) Note, the reader in the first test is wasting time with empty read
events. This comes into play in additional tests.

# PROG-B: USING ISLL API WITH NO RW LOCKS
* Successful Shutdown: 16000 ms
* Total Processed : 1023
* Total Empty Gets : 279

PROG-B Reader: ~67 cs/sec
PROG-B Writer: ~65 cs/sec

Even slightly better with the ISLL, with about the same waste in reader
events. Note: this is using a PUSH and POP LIFO.

For the next tests, I optimized the synchronization using event-driven
thresholds, READER_EVENT_THRESHOLD = 10; the pump will signal the reader
when 10 items have been added to the list.

# PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1021
* Total Empty Gets : 0

PROG-A Reader: ~6 cs/sec
PROG-A Writer: ~70 cs/sec

Drastic improvement here by collecting a bucket of items before
processing. Notice there is no waste in reader events.

# PROG-A: USING CHRIS'S READER/WRITER CLASSES
* Successful Shutdown: 16000 ms
* Total Processed : 1021
* Total Empty Gets : 0

PROG-A Reader: ~8 cs/sec
PROG-A Writer: ~66 cs/sec

Essentially the same performance.

# PROG-B: USING ISLL API WITH NO RW LOCKS
* Successful Shutdown: 16000 ms
* Total Processed : 1021
* Total Empty Gets : 0

PROG-B Reader: ~7 cs/sec
PROG-B Writer: ~67 cs/sec

And for ISLL, essentially the same performance when using a synchronized,
event-driven bucket brigade.
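The threshold idea above can be sketched in portable C++ (a hypothetical
class invented for illustration; the original test used Win32 events, not
a condition variable): the writer only signals the reader once a full
bucket has accumulated, which is what cut the reader's context-switch rate.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

class ThresholdQueue {
public:
    explicit ThresholdQueue(std::size_t threshold) : threshold_(threshold) {}

    void put(int v) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push_back(v);
        if (q_.size() >= threshold_)     // signal only on a full bucket
            cv_.notify_one();
    }

    // Block until the bucket is full, then drain everything at once.
    std::deque<int> take_bucket() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return q_.size() >= threshold_; });
        std::deque<int> out;
        out.swap(q_);
        return out;
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lk(m_);
        return q_.size();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<int> q_;
    std::size_t threshold_;
};
```

The reader wakes roughly once per bucket instead of once per item, trading
latency for far fewer context switches.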

I performed many other tests and real-world simulations, like adding
delays for processing time and lowering the frequency of the pump, but
overall I think it is pretty clear to me that there is no real big
difference once you take the key optimization and synchronization factors
into account.

Regarding ISLL, I did see a slight improvement using the flush when I
began to work on a LIFO-to-FIFO transformation. But seeing that the
regular MFC CList-based FIFO performed equally well, to me the added
complexity, the uncertainty of lock-free methods, and the OS dependency
make it impractical at the moment.

Other considerations, such as memory fragmentation, need to be reviewed.

Anyway, I thought I would share these results.

Thanks

--
HLS