From: Alexander Grigoriev on
Due to the lack of a double-pointer-sized InterlockedCompareExchange on some
64-bit platforms, it's necessary to pack more than a pointer into SLIST_HEADER:

#if defined(_WIN64)
typedef union _SLIST_HEADER {
    ULONGLONG Alignment;
    struct {
        ULONGLONG Depth : 16;
        ULONGLONG Sequence : 8;
        ULONGLONG Next : 40;
    };
} SLIST_HEADER, *PSLIST_HEADER;
#endif

See ntddk.h

"Tim Roberts" <timr(a)probo.com> wrote in message
news:a58qj3p6jmkts8vtis9ljenk4kg47daiej(a)4ax.com...
> "Chris Thomasson" <cristom(a)comcast.net> wrote:
>>
>>I have a question as well. Do you know if Microsoft uses a pointer
>>compression algorithm to implement their SList API on 64-bit systems? I
>>think I remember Neill Clift, who works on the Windows Kernel, mention
>>something about that, but I can't seem to find the message on Google right
>>now.
>
> I, personally, would be very surprised if they did. The benefit is so
> small, and memory is essentially free.
>
> Further, in my experience, most kernel structures use the doubly-linked
> lists.
> --
> Tim Roberts, timr(a)probo.com, DDK MVP
> Providenza & Boekelheide, Inc.


From: Ben Voigt [C++ MVP] on

"Pops" <dude(a)nospamnospamnospam.com> wrote in message
news:%23wNxgRFKIHA.4948(a)TK2MSFTNGP02.phx.gbl...
> Ben Voigt [C++ MVP] wrote:
>
>> We were discussing a multiple writer, single consumer queue. For this
>> the writers each use the InterlockedPushEntrySList function, while the
>> consumer uses the InterlockedFlushSList function to retrieve the entire
>> queue as an atomic operation. The consumer is then responsible for
>> iterating through the list of items retrieved in FIFO order.
>
> Hi Ben,
>
> I've been spending the last day or so working on this Interlocked Singly
> Linked List (ISLL) and did a profile of two essentially identical
> applications, with the key difference:
>
> PROG-A: MFC CList FIFO queue w/ my RW and Chris's RW code,
> PROG-B: ISLL with no RW code. Only LIFO is analyzed here.
>
> Each allows me to play with different parameters:
>
> #define USE_CRITSECT_H // Use my RW or Chris's RW
>
> BOOL PRINT_LINES = FALSE; // Avoid console overhead
> DWORD MAX_WRITER_THREADS = 1; // # of writer threads
> DWORD MAX_READER_THREADS = 1; // # of reader threads
> DWORD AUTO_STOP_COUNT = 1000; // # of writer events
> DWORD WRITER_FREQ_MS = 1; // Frequency of writes
> DWORD READER_FREQ_MS = 1; // Frequency of reads
> DWORD READER_EVENT_THRESHOLD= 0; // # before reader is signaled
> DWORD READER_PROCESS_DELAY = 0; // Simulate processing time
>
> Each writer and reader thread has a one (1) quantum frequency, using a
> while( WFSO(ShutdownEvent,1) ) thread-break test. So essentially there is
> a 15ms frequency of pumping 1 item into the list, and a 15ms reader
> frequency. The waits can be changed with the XXXX_FREQ_MS parameters.
>
> AUTO_STOP_COUNT allows the test to end. The main thread watches this
> count before it signals the threads to shutdown.
>
> The READER_EVENT_THRESHOLD, if set, allows the queue to build before
> the writer signals the reader event.
>
> For PROG-A with the RW locks, the USE_CRITSECT_H define allows me to test
> my RW classes or Chris's fine RW class. :-)
>
> The testing is on an Intel Pentium D 2.66 GHz, 1 GB RAM XP Pro box.
>
> The program's main thread creates the threads. When AUTO_STOP_COUNT is
> reached, it signals the shutdown event.
>
> Another console utility is used to set the performance counters to watch
> and log. In this profile, I am only looking at the context switch
> rate.
>
> The results:
>
> # PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
> * Successful Shutdown: 16000 ms
> * Total Processed : 1023
> * Total Empty Gets : 362
>
> PROG-A Reader: ~117 cs/sec
> PROG-A Writer: ~88 cs/sec
>
> # PROG-A: USING CHRIS'S READER/WRITER CLASSES
> * Successful Shutdown: 16000 ms
> * Total Processed : 1023
> * Total Empty Gets : 259
>
> PROG-A Reader: ~77 cs/sec
> PROG-A Writer: ~71 cs/sec
>
> So Chris's code offers less context switching. Don't get excited, Chris.
> :-) Note that the reader in the first test is wasting time with empty read
> events. This comes into play in additional tests.
>
> # PROG-B: USING ISLL API WITH NO RW LOCKS
> * Successful Shutdown: 16000 ms
> * Total Processed : 1023
> * Total Empty Gets : 279
>
> PROG-B Reader: ~67 cs/sec
> PROG-B Writer: ~65 cs/sec
>
> The ISLL is even slightly better, with about the same waste in reader
> events. Note: this is using a PUSH and POP LIFO.
>
> For the next tests, I optimized the synchronization using event-driven
> thresholds, READER_EVENT_THRESHOLD = 10; the pump will signal the
> reader when 10 items are added to the list.
>
> # PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
> * Successful Shutdown: 16000 ms
> * Total Processed : 1021
> * Total Empty Gets : 0
>
> PROG-A Reader: ~6 cs/sec
> PROG-A Writer: ~70 cs/sec
>
> Drastic improvement here by collecting a bucket of items before
> processing. Notice no waste in reader events.
>
> # PROG-A: USING CHRIS'S READER/WRITER CLASSES
> * Successful Shutdown: 16000 ms
> * Total Processed : 1021
> * Total Empty Gets : 0
>
> PROG-A Reader: ~8 cs/sec
> PROG-A Writer: ~66 cs/sec
>
> Essentially the same performance.
>
> # PROG-B: USING ISLL API WITH NO RW LOCKS
> * Successful Shutdown: 16000 ms
> * Total Processed : 1021
> * Total Empty Gets : 0
>
> PROG-B Reader: ~7 cs/sec
> PROG-B Writer: ~67 cs/sec
>
> And for ISLL, essentially the same performance when using a synchronized,
> event-driven bucket brigade.
>
> I performed many other tests and real-world simulations, like adding
> delays for processing time and lowering the frequency of the pump, but
> overall I think it is pretty clear to me that there is no real big
> difference once you take the key optimization and synchronization
> factors into account.

Did you try a multiple writer scenario, or reduced waits where you could
have some contention?

In particular, don't use Sleep to simulate processing time; use a loop over a
volatile variable so the compiler doesn't eliminate it.

>
> In regards to ISLL, I did see a very slight improvement using the
> flush when I began to work on a LIFO-to-FIFO transformation. But seeing
> that the regular MFC CList-based FIFO performed equally well, to me the
> added complexity of lock-free methods, the uncertainty about them, and
> the OS dependency make them impractical at the moment.
>
> Other considerations, such as memory fragmentation, need to be reviewed.
>
> Anyway, I thought I would share these results.
>
> Thanks
>
> --
> HLS

From: Chris Thomasson on
"Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message
news:_3f%i.7212$RR1.5517(a)trnddc02...
>m wrote:
>> Yes I am extending the discussion to include all kinds of data structures
>> besides queues; and yes there are various ways to protect them. No I don't
>> really want to discuss them in detail, I just wanted to put something out
>> there to see if I could tweak someone's interest ;)
>>
> You made a generalization and applied it to the special case.
>>
>>
>> Thread pre-emption is important on all pre-emptive multi-tasking
>> operating systems regardless of how many CPUs are present in the system.
>> The importance of this consideration lies in the fact that between any
>> two instructions executed by thread A, thread B may have executed a large
>> number of instructions (either on this CPU or on another one) or none at
>> all. The reason this is implicitly important is that, even though one
>> could use a lock-free algorithm for nearly anything, such algorithms are
>> specifically designed for HPC scenarios with multiple producers & multiple
>> consumers (or more generally, multiple modifiers).
>
>
> Yes, preemption causes more variation in a thread's rate of progress, but
> so what? In multi-threading you don't make any assumptions about the rate
> of forward progress, preemption or not.

Yeah, especially since any predicate based on an assumption about the
non-deterministic rate of a thread's forward progress is basically a
race condition in and of itself. I think it's similar to all the reasons
against implementing RCU with timers; reader threads might just decide to
stay in the active read epoch longer than the timer allows for... Not good.




> And it's certainly news to me that lock-free
> is mainly important to HPC and not so much for other things.

IMHO, I find the reader pattern to be the most useful overall technique.

[...]

From: Vladimir Petter [MSFT] on
You can find more info on this here http://www.alex-ionescu.com/?p=50
Vladimir Petter. [microsoft]

"Alexander Grigoriev" <alegr(a)earthlink.net> wrote in message
news:#UdcMqFKIHA.5860(a)TK2MSFTNGP04.phx.gbl...
> Per lack of double-pointer-sized InterlockedCompareExchange on some 64 bit
> platforms, it's necessary to pack more than a pointer into SLIST_HEADER:
>
> #if defined(_WIN64)
> typedef union _SLIST_HEADER {
> ULONGLONG Alignment;
> struct {
> ULONGLONG Depth : 16;
> ULONGLONG Sequence : 8;
> ULONGLONG Next : 40;
> };
> } SLIST_HEADER, *PSLIST_HEADER;
>
> See ntddk.h
>
> "Tim Roberts" <timr(a)probo.com> wrote in message
> news:a58qj3p6jmkts8vtis9ljenk4kg47daiej(a)4ax.com...
>> "Chris Thomasson" <cristom(a)comcast.net> wrote:
>>>
>>>I have a question as well. Do you know if Microsoft uses a pointer
>>>compression algorithm to implement their SList API on 64-bit systems? I
>>>think I remember Neill Clift, who works on the Windows Kernel, mention
>>>something about that, but I can't seem to find the message on Google
>>>right
>>>now.
>>
>> I, personally, would be very surprised if they did. The benefit is so
>> small, and memory is essentially free.
>>
>> Further, in my experience, most kernel structures use the doubly-linked
>> lists.
>> --
>> Tim Roberts, timr(a)probo.com, DDK MVP
>> Providenza & Boekelheide, Inc.
>
>
From: Chris Thomasson on
>> "Tim Roberts" <timr(a)probo.com> wrote in message
>> news:a58qj3p6jmkts8vtis9ljenk4kg47daiej(a)4ax.com...
>>> "Chris Thomasson" <cristom(a)comcast.net> wrote:
>>>>
>>>>I have a question as well. Do you know if Microsoft uses a pointer
>>>>compression algorithm to implement their SList API on 64-bit systems? I
>>>>think I remember Neill Clift, who works on the Windows Kernel, mention
>>>>something about that, but I can't seem to find the message on Google
>>>>right
>>>>now.
>>>
>>> I, personally, would be very surprised if they did. The benefit is so
>>> small, and memory is essentially free.
>>>
>>> Further, in my experience, most kernel structures use the doubly-linked
>>> lists.
>>> --
>>> Tim Roberts, timr(a)probo.com, DDK MVP
>>> Providenza & Boekelheide, Inc.


> "Alexander Grigoriev" <alegr(a)earthlink.net> wrote in message
> news:#UdcMqFKIHA.5860(a)TK2MSFTNGP04.phx.gbl...
>> Per lack of double-pointer-sized InterlockedCompareExchange on some 64
>> bit platforms, it's necessary to pack more than a pointer into
>> SLIST_HEADER:
>>
>> #if defined(_WIN64)
>> typedef union _SLIST_HEADER {
>> ULONGLONG Alignment;
>> struct {
>> ULONGLONG Depth : 16;
>> ULONGLONG Sequence : 8;
>> ULONGLONG Next : 40;
>> };
>> } SLIST_HEADER, *PSLIST_HEADER;
>>
>> See ntddk.h
>>



"Vladimir Petter [MSFT]" <vladp72(a)hotmail.com> wrote in message
news:u$CatAOKIHA.5400(a)TK2MSFTNGP04.phx.gbl...
> You can find more info on this here http://www.alex-ionescu.com/?p=50
> Vladimir Petter. [microsoft]


Thanks for that. Humm... So, Windows 64-bit running on x86-64 systems that
lack the CMPXCHG16B instruction uses only 9 bits for an ABA counter; I think
that may be a bit dangerous. That counter will be rolling over quite
frequently under load... Humm:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/133f764f1bc0c9f4


Any thoughts?