From: Alexander Grigoriev on 16 Nov 2007 09:11

Due to the lack of a double-pointer-sized InterlockedCompareExchange on some
64-bit platforms, it's necessary to pack more than a pointer into
SLIST_HEADER:

#if defined(_WIN64)
typedef union _SLIST_HEADER {
    ULONGLONG Alignment;
    struct {
        ULONGLONG Depth : 16;
        ULONGLONG Sequence : 9;
        ULONGLONG Next : 39;
    };
} SLIST_HEADER, *PSLIST_HEADER;

See ntddk.h

"Tim Roberts" <timr(a)probo.com> wrote in message
news:a58qj3p6jmkts8vtis9ljenk4kg47daiej(a)4ax.com...
> "Chris Thomasson" <cristom(a)comcast.net> wrote:
>>
>> I have a question as well. Do you know if Microsoft uses a pointer
>> compression algorithm to implement their SList API on 64-bit systems? I
>> think I remember Neill Clift, who works on the Windows Kernel, mention
>> something about that, but I can't seem to find the message on Google
>> right now.
>
> I, personally, would be very surprised if they did. The benefit is so
> small, and memory is essentially free.
>
> Further, in my experience, most kernel structures use the doubly-linked
> lists.
> --
> Tim Roberts, timr(a)probo.com, DDK MVP
> Providenza & Boekelheide, Inc.
From: Ben Voigt [C++ MVP] on 16 Nov 2007 23:20

"Pops" <dude(a)nospamnospamnospam.com> wrote in message
news:%23wNxgRFKIHA.4948(a)TK2MSFTNGP02.phx.gbl...
> Ben Voigt [C++ MVP] wrote:
>
>> We were discussing a multiple writer, single consumer queue. For this
>> the writers each use the InterlockedPushEntrySList function, while the
>> consumer uses the InterlockedFlushSList function to retrieve the entire
>> queue as an atomic operation. The consumer is then responsible for
>> iterating through the list of items retrieved in FIFO order.
>
> Hi Ben,
>
> I've been spending the last day or so working on this Interlocked Singly
> Linked List (ISLL) and profiled two essentially identical applications
> with one key difference:
>
> PROG-A: MFC CList FIFO queue w/ my RW and Chris's RW code,
> PROG-B: ISLL with no RW code. Only LIFO analyzed here.
>
> Each allows me to play with different parameters:
>
> #define USE_CRITSECT_H              // Use my RW or Chris's RW
>
> BOOL  PRINT_LINES = FALSE;          // Avoid console overhead
> DWORD MAX_WRITER_THREADS = 1;       // # of writer threads
> DWORD MAX_READER_THREADS = 1;       // # of reader threads
> DWORD AUTO_STOP_COUNT = 1000;       // # of writer events
> DWORD WRITER_FREQ_MS = 1;           // Frequency of writes
> DWORD READER_FREQ_MS = 1;           // Frequency of reads
> DWORD READER_EVENT_THRESHOLD = 0;   // # before reader is signaled
> DWORD READER_PROCESS_DELAY = 0;     // Simulate processing time
>
> Each writer and reader thread has a one (1) quantum frequency using a
> while( WFSO(ShutdownEvent, 1) ) thread break test. So essentially a 15 ms
> frequency of pumping 1 item into the list, and a 15 ms reader frequency.
> The wait can be changed with the XXXX_FREQ_MS parameters.
>
> AUTO_STOP_COUNT allows the test to end. The main thread watches this
> count before it signals the threads to shut down.
>
> READER_EVENT_THRESHOLD, if set, will allow the queue to build before
> the writer signals the reader event.
> For PROG-A with the RW locks, the USE_CRITSECT_H define allows me to test
> my RW classes or Chris's fine RW class. :-)
>
> The testing is on an Intel Pentium D 2.66 GHz, 1 GB RAM XP/Pro box.
>
> The program's main thread creates the threads. When AUTO_STOP_COUNT is
> reached, it signals the shutdown event.
>
> Another console utility is used to set the performance counters to watch
> for and log. In this profile, I am only looking at the context switch
> rate.
>
> The results:
>
> # PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
> * Successful Shutdown: 16000 ms
> * Total Processed    : 1023
> * Total Empty Gets   : 362
>
> PROG-A Reader: ~117 cs/sec
> PROG-A Writer:  ~88 cs/sec
>
> # PROG-A: USING CHRIS'S READER/WRITER CLASSES
> * Successful Shutdown: 16000 ms
> * Total Processed    : 1023
> * Total Empty Gets   : 259
>
> PROG-A Reader: ~77 cs/sec
> PROG-A Writer: ~71 cs/sec
>
> So Chris's code causes less context switching. Don't get excited, Chris.
> :-) Note, the reader in the first test is wasting time with empty read
> events. This comes into play in the additional tests.
>
> # PROG-B: USING ISLL API WITH NO RW LOCKS
> * Successful Shutdown: 16000 ms
> * Total Processed    : 1023
> * Total Empty Gets   : 279
>
> PROG-B Reader: ~67 cs/sec
> PROG-B Writer: ~65 cs/sec
>
> Even slightly better with the ISLL, with about the same waste in reader
> events. Note: This is using a PUSH and POP LIFO.
>
> For the next tests, I optimized the synchronization using event-driven
> thresholds, READER_EVENT_THRESHOLD = 10; the pump will signal the
> reader when 10 items are added to the list.
>
> # PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES
> * Successful Shutdown: 16000 ms
> * Total Processed    : 1021
> * Total Empty Gets   : 0
>
> PROG-A Reader:  ~6 cs/sec
> PROG-A Writer: ~70 cs/sec
>
> Drastic improvement here by collecting a bucket of items before
> processing. Notice no waste in reader events.
> # PROG-A: USING CHRIS'S READER/WRITER CLASSES
> * Successful Shutdown: 16000 ms
> * Total Processed    : 1021
> * Total Empty Gets   : 0
>
> PROG-A Reader:  ~8 cs/sec
> PROG-A Writer: ~66 cs/sec
>
> Essentially the same performance.
>
> # PROG-B: USING ISLL API WITH NO RW LOCKS
> * Successful Shutdown: 16000 ms
> * Total Processed    : 1021
> * Total Empty Gets   : 0
>
> PROG-B Reader:  ~7 cs/sec
> PROG-B Writer: ~67 cs/sec
>
> And for ISLL, essentially the same performance, when using a synchronized
> event-driven bucket brigade.
>
> I performed many other tests and real-world simulations, like adding
> delays for processing time and lowering the frequency of the pump, but
> overall I think it is pretty clear to me that there is no real big
> difference once you take the key optimization and synchronization
> factors into account.

Did you try a multiple writer scenario, or reduced waits where you could
have some contention? Especially, don't use Sleep to simulate processing
time; use a loop with a volatile variable to make sure the compiler
doesn't eliminate it.

> In regards to ISLL, I did see a very slight improvement using the
> flush when I began to work on a LIFO-to-FIFO transformation. But seeing
> that the regular MFC CList-based FIFO performed equally well, to me the
> added complexity, the unsureness of lock-free methods, and the OS
> dependency make it impractical at the moment.
>
> Other considerations, such as memory fragmentation, need to be reviewed.
>
> Anyway, I thought I would share these results.
>
> Thanks
>
> --
> HLS
From: Chris Thomasson on 17 Nov 2007 00:28

"Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message
news:_3f%i.7212$RR1.5517(a)trnddc02...
> m wrote:
>> Yes I am extending the discussion to include all kinds of data
>> structures besides queues; and yes there are various ways to protect
>> them. No I don't really want to discuss them in detail, I just wanted to
>> put something out there to see if I could tweak someone's interest ;)
>
> You made a generalization and applied it to the special case.
>
>> Thread pre-emption is important on all pre-emptive multi-tasking
>> operating systems regardless of how many CPUs are present in the system.
>> The importance of this consideration lies in the fact that between any
>> two instructions executed by thread A, thread B may have executed a
>> large number of instructions (either on this CPU or on another one) or
>> none at all. The reason this is, implicitly, important is that, even
>> though one could use a lock-free algorithm for nearly anything, they are
>> specifically designed for HPC scenarios with multiple producers &
>> multiple consumers (or more generally, multiple modifiers).
>
> Yes, preemption causes more variation in a thread's rate of progress, but
> so what? In multi-threading you don't make any assumptions about the rate
> of forward progress, preemption or not.

Yeah, especially since any predicate that is based on an assumption about
the non-deterministic rate of a thread's forward progress is basically a
race condition in and of itself. I think it's similar to all the reasons
against implementing RCU with timers; reader threads just might decide to
stay in the active read epoch longer than the timer allows for... Not good.

> And it's certainly news to me that lock-free
> is mainly important to HPC and not so much for other things.

IMHO, I find the reader pattern to be the most useful overall technique.

[...]
From: Vladimir Petter [MSFT] on 17 Nov 2007 01:07

You can find more info on this here: http://www.alex-ionescu.com/?p=50

Vladimir Petter. [microsoft]

"Alexander Grigoriev" <alegr(a)earthlink.net> wrote in message
news:#UdcMqFKIHA.5860(a)TK2MSFTNGP04.phx.gbl...
> Due to the lack of a double-pointer-sized InterlockedCompareExchange on
> some 64-bit platforms, it's necessary to pack more than a pointer into
> SLIST_HEADER:
>
> #if defined(_WIN64)
> typedef union _SLIST_HEADER {
>     ULONGLONG Alignment;
>     struct {
>         ULONGLONG Depth : 16;
>         ULONGLONG Sequence : 9;
>         ULONGLONG Next : 39;
>     };
> } SLIST_HEADER, *PSLIST_HEADER;
>
> See ntddk.h
>
> [...]
From: Chris Thomasson on 17 Nov 2007 12:26
[...]

"Vladimir Petter [MSFT]" <vladp72(a)hotmail.com> wrote in message
news:u$CatAOKIHA.5400(a)TK2MSFTNGP04.phx.gbl...
> You can find more info on this here: http://www.alex-ionescu.com/?p=50
>
> Vladimir Petter. [microsoft]

Thanks for that. Humm... So, 64-bit Windows running on x86-64 systems that
lack the CMPXCHG16B instruction uses only 9 bits for the ABA counter; I
think that may be a bit dangerous. That counter will be rolling over quite
frequently under load... Humm:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/133f764f1bc0c9f4

Any thoughts?