From: Chris Thomasson on 15 Nov 2007 16:00

> "Chris Thomasson" <cristom(a)comcast.net> wrote in message
> news:f7OdndviL--1OqHanZ2dnUVZ_qmlnZ2d(a)comcast.com...
>> One could theoretically open themselves up to live-lock conditions if
>> they did not take this into consideration when implementing certain
>> lock-free algorithms on PPC. For instance, if the anchor of a lock-free
>> LIFO was not the size of a cache-line, and was not aligned on a
>> cache-line boundary, you could get "false-sharing" on the anchor. I
>> don't have much experience with PPC, but it seems that false-sharing
>> could break reservations on the anchor and adversely affect forward
>> progress, and might cause live-lock-like conditions...

[top-post corrected]

"m" <m(a)b.c> wrote in message news:OMNiHf8JIHA.5116(a)TK2MSFTNGP03.phx.gbl...
> Lock-free & wait-free are independent idioms which, in certain
> algorithms, can overlap. Except that lock-free doesn't imply conditional
> looping; this is an attribute of many algorithms that are lock-free, not
> a defining property of the idiom.

Fair enough.

> Your definitions are reasonable; so what am I missing?

I don't think you're missing anything.

> The fact that on certain architectures, false-sharing can adversely
> affect performance is a consideration in implementation and algorithm
> choice. Specifically, don't choose an algorithm that is hard to
> implement or performs badly on your target architecture ;)

100% agreed.
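As a concrete sketch of the padding trick being described: keep the LIFO
anchor alone on its own cache line so stores to unrelated data cannot
false-share with it (and, on LL/SC machines such as PPC, cannot keep
breaking the reservation). The 64-byte line size, the type names, and the
VC++ alignment extension are my assumptions for illustration, not anyone's
actual code:

    // 64 bytes is an assumed line size; the real value varies by CPU.
    #define CACHE_LINE_SIZE 64

    struct Node {
        Node* next;
        // ... payload ...
    };

    // The anchor occupies a full cache line (explicit padding), and the
    // declspec aligns every instance on a line boundary, so no unrelated
    // variable can share the line with the anchor.
    struct __declspec(align(CACHE_LINE_SIZE)) LifoAnchor {
        Node* volatile head;
        char pad[CACHE_LINE_SIZE - sizeof(Node*)];
    };

    static LifoAnchor g_anchor = { 0 };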
From: Chris Thomasson on 15 Nov 2007 23:17

"Ben Voigt [C++ MVP]" <rbv(a)nospam.nospam> wrote in message
news:uwEcTw5JIHA.3820(a)TK2MSFTNGP03.phx.gbl...
>
> "Chris Thomasson" <cristom(a)comcast.net> wrote in message
> news:I5-dnZi55r10FabanZ2dnUVZ_judnZ2d(a)comcast.com...
>> "Ben Voigt [C++ MVP]" <rbv(a)nospam.nospam> wrote in message
>> news:O2Xn5DxJIHA.3400(a)TK2MSFTNGP03.phx.gbl...
[...]
>>>
>>> Please take a look at my queue implementation and let me know if you
>>> see any problem (I realize it isn't truly lock-free until I replace
>>> the allocator):
>>>
>> [...]
>>
>> Good news! ;^)
[...]
> Thanks for the review. I was pretty sure it was correct but it's very
> nice to have concurrence from an expert.

You're welcome. I think your SQueue class is sound wrt the overall
concerns of the ABA problem in general. I also think that the code
correctly addresses the concerns of deallocating nodes from lock-free
data-structures.

You have a performance enhancement in the SQueue<T>::Get() function
because it's 100% wait-free, as long as the hardware that Windows happens
to be running on supports an atomic swap instruction; Intel does. SPARC
used to support the SWAP instruction; they deprecated it and delegated its
functionality to the CAS instruction on 32-bit systems and CASX on 64-bit
systems. It's kind of funny, IMHO, simply because CASX operates on a
64-bit datum in 64-bit mode despite its literal name; the way I read
things, the name CASX implies a DWCAS-like operation, which would mean a
128-bit operation in 64-bit mode. That seems odd to me, to say the
least...

> As far as I'm concerned, the only (tenable) way to write multi-threaded
> code is to have a couple of lock-free thread-safe components (like my
> squeue) that fit entirely on a single page, and elsewhere avoid
> accessing the same variable from more than one thread. In any
> non-trivial project, trying to protect shared data structures any other
> way is just too complex to reason about, with too many opportunities
> for deadlocks and/or race conditions, and even if those are solved,
> then contention to a degree that removes the benefits of multithreading.

Well, deadlocks are caused by the misuse of locking. IMVHO, if a
programmer cannot get lock-based algorithms correct, then they should be
nowhere near a lock-free one, to say the least. The first of the two
links provided below points out some of the basic caveats wrt replacing a
lock-based synchronization point with its non-blocking equivalent. For
instance, if you stick a huge engine in a car with crappy tires, well,
then Murphy's Law dictates that you're going to have a blow-out at high
speeds, on a windy road, with a cliff on one side. Therefore, you must
upgrade the tires, and probably some other components as well... Solving
that in an efficient manner can sometimes turn out to be a fairly long
and tedious endeavor; indeed!

>> Humm... Well, FWIW, if you're into low-overhead queuing, I suggest you
>> skim through the following message and follow links:
>>
>> http://groups.google.com/group/comp.arch/msg/a7501814f19731cd
>>
>
> Thanks, I'll take a look.
[...]

I think that you just might find the brief overview of the
message-passing techniques described therein fairly interesting, and
perhaps a "bit useful"; as in, you may be able to fit it into some of
your _existing_ synchronization schemes with "fairly minimal" effort(s).
You can read the following for some more details on that particular
subject:

http://groups.google.com/group/comp.programming.threads/msg/9a5500d831dd2ec7

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/ef3f26bbe1c52e0b

Any thoughts on this?
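To make the earlier wait-free Get() point concrete: a single atomic
exchange detaches the entire list, with no retry loop, so the consumer
completes in a bounded number of steps. The sketch below uses the Win32
interlocked API; the names are illustrative assumptions and this is not
Ben's actual SQueue code:

    #include <windows.h>

    struct Node {
        Node* next;
        int value;
    };

    static Node* volatile g_head = 0;   // LIFO anchor

    // Producer side: a conventional lock-free push (CAS retry loop).
    void Push(Node* n) {
        Node* old;
        do {
            old = g_head;
            n->next = old;
        } while (InterlockedCompareExchangePointer(
                     (PVOID volatile*)&g_head, n, old) != old);
    }

    // Consumer side: one atomic swap grabs everything. No loop and no
    // retry, which is the wait-free property under discussion.
    Node* GetAll() {
        return (Node*)InterlockedExchangePointer(
                          (PVOID volatile*)&g_head, 0);
    }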
From: Tim Roberts on 15 Nov 2007 23:48

"Chris Thomasson" <cristom(a)comcast.net> wrote:
>
> I have a question as well. Do you know if Microsoft uses a pointer
> compression algorithm to implement their SList API on 64-bit systems?
> I think I remember Neill Clift, who works on the Windows kernel,
> mentioning something about that, but I can't seem to find the message
> on Google right now.

I, personally, would be very surprised if they did. The benefit is so
small, and memory is essentially free. Further, in my experience, most
kernel structures use doubly-linked lists.
--
Tim Roberts, timr(a)probo.com, DDK MVP
Providenza & Boekelheide, Inc.
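For readers wondering what such a scheme would even look like: the sketch
below is emphatically not Microsoft's SList layout, just an illustration
of the general pointer-compression idea Chris asks about, under the
assumptions that nodes are 16-byte aligned and live below the 48-bit
user-address boundary. That frees enough bits to pack an ABA sequence
counter into the same 64-bit word, so a plain 64-bit CAS can cover both:

    typedef unsigned __int64 U64;  // MSVC-style 64-bit integer

    // 48-bit address, 16-byte aligned: the pointer compresses to 44
    // bits, leaving 20 bits for a sequence counter in one 64-bit word.
    static U64 Pack(void* p, unsigned seq) {
        U64 addr = (U64)p;
        return (addr >> 4) | ((U64)(seq & 0xFFFFF) << 44);
    }

    static void* UnpackPtr(U64 w) {
        return (void*)((w & (((U64)1 << 44) - 1)) << 4);
    }

    static unsigned UnpackSeq(U64 w) {
        return (unsigned)(w >> 44);
    }

The payoff would be avoiding the need for a double-width (128-bit) CAS on
64-bit hardware; whether that is worth it is exactly the question Tim
answers above.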
From: Joe Seigh on 16 Nov 2007 06:22

m wrote:
> Yes, I am extending the discussion to include all kinds of data
> structures besides queues; and yes, there are various ways to protect
> them. No, I don't really want to discuss them in detail; I just wanted
> to put something out there to see if I could tweak someone's interest ;)
> You made a generalization and applied it to the special case.
>
> Thread pre-emption is important on all pre-emptive multi-tasking
> operating systems regardless of how many CPUs are present in the
> system. The importance of this consideration lies in the fact that
> between any two instructions executed by thread A, thread B may have
> executed a large number of instructions (either on this CPU or on
> another one) or none at all. The reason this is, implicitly, important
> is that, even though one could use a lock-free algorithm for nearly
> anything, they are specifically designed for HPC scenarios with
> multiple producers & multiple consumers (or, more generally, multiple
> modifiers).

Yes, preemption causes more variation in a thread's rate of progress, but
so what? In multi-threading you don't make any assumptions about the rate
of forward progress, preemption or not. And it's certainly news to me
that lock-free is mainly important to HPC and not so much for other
things.

> Lock-free is not a forward progress guarantee. Lock-free, a misnomer,
> is an idiom in which software locks are replaced with hardware locks
> because they are, presumably, more efficient. This efficiency is
> achieved, if by nothing else, by being smaller (one instruction vs.
> hundreds) and by avoiding interaction with the OS scheduler. Lock-free
> algorithms must provide a forward progress guarantee, otherwise they
> would cease to function under contention, but that guarantee, in and
> of itself, does not make an algorithm lock-free and hence does not
> define the idiom.

The most commonly used definition for lock-free is by Herlihy. He also
did definitions for wait-free and obstruction-free.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
From: Pops on 16 Nov 2007 08:25
Ben Voigt [C++ MVP] wrote:
> We were discussing a multiple writer, single consumer queue. For this
> the writers each use the InterlockedPushEntrySList function, while the
> consumer uses the InterlockedFlushSList function to retrieve the entire
> queue as an atomic operation. The consumer is then responsible for
> iterating through the list of items retrieved in FIFO order.

Hi Ben,

I've been spending the last day or so working on this Interlocked Singly
Linked List (ISLL) and did a profile of two basically identical
applications with one key difference:

PROG-A: MFC CList FIFO queue w/ my RW and Chris's RW code
PROG-B: ISLL with no RW code

Only the LIFO is analyzed here. Each allows me to play with different
parameters:

#define USE_CRITSECT_H                  // Use my RW or Chris's RW
BOOL  PRINT_LINES            = FALSE;   // Avoid console overhead
DWORD MAX_WRITER_THREADS     = 1;       // # of writer threads
DWORD MAX_READER_THREADS     = 1;       // # of reader threads
DWORD AUTO_STOP_COUNT        = 1000;    // # of writer events
DWORD WRITER_FREQ_MS         = 1;       // Frequency of writes
DWORD READER_FREQ_MS         = 1;       // Frequency of reads
DWORD READER_EVENT_THRESHOLD = 0;       // # before reader is signaled
DWORD READER_PROCESS_DELAY   = 0;       // Simulate processing time

Each writer and reader thread has a one (1) quantum frequency, using a
while( WFSO(ShutdownEvent,1) ) thread-break test. So essentially a 15 ms
frequency of pumping 1 item into the list, and a 15 ms reader frequency.
The wait can be changed with the XXXX_FREQ_MS parameters.

AUTO_STOP_COUNT allows the test to end. The main thread watches this
count before it signals the threads to shut down. READER_EVENT_THRESHOLD,
if set, allows the queue to build before the writer signals the reader
event. For PROG-A with the RW locks, the USE_CRITSECT_H define allows me
to test my RW classes or Chris's fine RW class. :-)

The testing is on an Intel Pentium D 2.66 GHz, 1 GB RAM XP/PRO box. The
program's main thread creates the threads. When AUTO_STOP_COUNT is
reached, it signals the shutdown event. Another console utility is used
to set the performance counters to watch and log. In this profile, I am
only looking at the context switch rate.

The results:

# PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES

* Successful Shutdown: 16000 ms
* Total Processed    : 1023
* Total Empty Gets   : 362

PROG-A Reader: ~117 cs/sec
PROG-A Writer:  ~88 cs/sec

# PROG-A: USING CHRIS'S READER/WRITER CLASSES

* Successful Shutdown: 16000 ms
* Total Processed    : 1023
* Total Empty Gets   : 259

PROG-A Reader: ~77 cs/sec
PROG-A Writer: ~71 cs/sec

So Chris's code offers less context switching. Don't get excited, Chris.
:-) Note that the reader in the first test is wasting time with empty
read events. This comes into play in additional tests.

# PROG-B: USING ISLL API WITH NO RW LOCKS

* Successful Shutdown: 16000 ms
* Total Processed    : 1023
* Total Empty Gets   : 279

PROG-B Reader: ~67 cs/sec
PROG-B Writer: ~65 cs/sec

Even slightly better with the ISLL, with about the same waste in reader
events. Note: this is using a PUSH and POP LIFO.

For the next tests, I optimized the synchronization using event-driven
thresholds, READER_EVENT_THRESHOLD = 10; the pump will signal the reader
when 10 items are added to the list.

# PROG-A: USING SSI CRITSECT.H READER/WRITER CLASSES

* Successful Shutdown: 16000 ms
* Total Processed    : 1021
* Total Empty Gets   : 0

PROG-A Reader:  ~6 cs/sec
PROG-A Writer: ~70 cs/sec

Drastic improvement here by collecting a bucket of items before
processing. Notice there is no waste in reader events.
# PROG-A: USING CHRIS'S READER/WRITER CLASSES

* Successful Shutdown: 16000 ms
* Total Processed    : 1021
* Total Empty Gets   : 0

PROG-A Reader:  ~8 cs/sec
PROG-A Writer: ~66 cs/sec

Essentially the same performance.

# PROG-B: USING ISLL API WITH NO RW LOCKS

* Successful Shutdown: 16000 ms
* Total Processed    : 1021
* Total Empty Gets   : 0

PROG-B Reader:  ~7 cs/sec
PROG-B Writer: ~67 cs/sec

And for the ISLL, essentially the same performance when using a
synchronized, event-driven bucket brigade.

I performed many other tests and real-world simulations, like adding
delays for processing time and lowering the frequency of the pump, but
overall it is pretty clear to me that there is no real big difference
once you take into account key optimization and synchronization factors.

In regards to the ISLL, I did see a very slight improvement using the
flush when I began to work on a LIFO-to-FIFO transformation. But seeing
that the regular MFC CList based FIFO performed equally well, to me, the
added complexity, unsureness, and OS dependency of lock-free methods make
them impractical at the moment. Other considerations, such as memory
fragmentation, need to be reviewed.

Anyway, I thought I would share these results.

Thanks
--
HLS
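For reference, here is a minimal sketch of the flush-based LIFO-to-FIFO
transformation described above, using the documented Win32 SList
functions. The Item layout and function names are illustrative
assumptions, not code from either test program:

    #include <windows.h>
    #include <malloc.h>   // _aligned_malloc / _aligned_free

    struct Item {
        SLIST_ENTRY entry;   // must be the first member for the casts
        int payload;
    };

    static SLIST_HEADER g_queue;  // InitializeSListHead(&g_queue) once
                                  // at startup, before any producer runs

    // Writers: any number of threads may push concurrently.
    void Produce(int value) {
        // SList entries must meet MEMORY_ALLOCATION_ALIGNMENT.
        Item* it = (Item*)_aligned_malloc(sizeof(Item),
                                          MEMORY_ALLOCATION_ALIGNMENT);
        it->payload = value;
        InterlockedPushEntrySList(&g_queue, &it->entry);
    }

    // Single consumer: detach the whole LIFO in one atomic call, then
    // reverse the detached chain to recover FIFO (arrival) order.
    void ConsumeAll() {
        PSLIST_ENTRY e = InterlockedFlushSList(&g_queue);
        PSLIST_ENTRY fifo = 0;
        while (e) {                    // reverse the chain in place
            PSLIST_ENTRY next = e->Next;
            e->Next = fifo;
            fifo = e;
            e = next;
        }
        while (fifo) {
            Item* it = (Item*)fifo;    // valid: entry is first member
            fifo = fifo->Next;
            // ... process it->payload in arrival order ...
            _aligned_free(it);
        }
    }

The reversal touches only memory the consumer already owns (the detached
chain), so it adds no synchronization cost beyond the single flush.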