From: Andy Venikov on 29 Mar 2010 20:03

Herb Sutter wrote:
> Please remember this: Standard ISO C/C++ volatile is useless for
> multithreaded programming. No argument otherwise holds water; at best
> the code may appear to work on some compilers/platforms, including all
> attempted counterexamples I've seen on this thread.

You have enormous clout with C++ professionals, including myself, so before permanently agreeing to such an all-encompassing statement allow me to step back a little and see what is at the core of this argument. Maybe we're arguing the same point. Or maybe I'm missing something big, in which case I'll be doubly glad to have been shown my wrong assumptions.

I understand that volatile was never supposed to be of any help for multithreaded programming. I don't expect it to issue any memory fences, nor to make any guarantees whatsoever about anything thread-related... Yet on all the compilers I know of (gcc, mingw, MSVC, LLVM, Intel) it produces just the code I need for my multithreaded programs. And I really don't see how it couldn't, given a common-sense understanding of what it should do in single-threaded programs. And I'm pretty sure that this is not going to change in the foreseeable future. So my use of volatile may not be standard-portable, but it sure is real-life portable.

Here's the point of view I'm coming from. Imagine that someone needs to implement a library that provides certain multithreading (multiprogramming) tools like atomic access, synchronization primitives and some lock-free algorithms that will be used by other developers, so that they wouldn't have to worry about things like volatile. (Now that boost.atomic is almost out, I'll happily use it. But Helge Bahmann (the author of the library) didn't have such a luxury, so to make his higher-level APIs work he had to resort internally to low-level tools like volatile where appropriate.)

So, with the above said, here's a concrete example of how I'd use volatile without access to a ready-made library. Let's take Maged Michael's lock-free queue ("Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms", Maged Michael & Michael Scott, 1996). It uses a technique similar to DCL to verify the validity of a read. Look at its dequeue() method. I'll reproduce the pseudo code here:

dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
 D1:  loop                          # Keep trying until Dequeue is done
 D2:    head = Q->Head              # Read Head
 D3:    tail = Q->Tail              # Read Tail
 D4:    next = head->next           # Read Head.ptr->next
 D5:    if head == Q->Head          # Are head, tail, and next consistent?
 D6:      if head.ptr == tail.ptr   # Is queue empty or Tail falling behind?
 D7:        if next.ptr == NULL     # Is queue empty?
 D8:          return FALSE          # Queue is empty, couldn't dequeue
 D9:        endif
            # Tail is falling behind. Try to advance it
 D10:       CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
 D11:     else                      # No need to deal with Tail
            # Read value before CAS, otherwise another dequeue might free the next node
 D12:       *pvalue = next.ptr->value
            # Try to swing Head to the next node
 D13:       if CAS(&Q->Head, head, <next.ptr, head.count+1>)
 D14:         break                 # Dequeue is done. Exit loop
 D15:       endif
 D16:     endif
 D17:   endif
 D18: endloop
 D19: free(head.ptr)                # It is safe now to free the old dummy node
 D20: return TRUE                   # Queue was not empty, dequeue succeeded

Look at line D5: it needs to check whether Q->Head is still the same as what we read from it before.
Otherwise two possibilities for breaking correctness arise: 1) the element pointed to by Q->Head could be re-inserted back into the queue with NULL in its "next" field, and dequeue would then return "empty" when in reality the queue was never empty at any given moment; or 2) the first element could be removed after we've read Q->Head but before we've read next, so there could be garbage in head->next by the time we read it and we'd access garbage on line D12.

This piece of pseudo code could be naively translated to the following C++ code:

while (true)
{
    Node * localHead = head_;
    Node * localTail = tail_;
    Node * localNext = localHead->next;
    if (localHead == head_)
    {
        ...
    }
}

But it wouldn't work, for obvious reasons. One needs to insert memory fences in the right places. Memory fences are highly platform-specific, so one would define macros for them that would expand to different instructions on different platforms. Here's the code with memory fences inserted:

while (true)
{
    Node * localHead = head_;
    Node * localTail = tail_;

    DataDependencyBarrier(); // All the systems that I know of will do
                             // this sort of barrier automatically, so
                             // this macro will expand to nothing
    Node * localNext = localHead->next;

    LoadLoadBarrier();       // on x86 this will expand to nothing
    if (localHead == head_)
    {
        ...
    }
}

This is much better, but it still has problems: first, on x86 the LoadLoadBarrier() will expand to nothing and there will be no indication to the compiler not to re-order different loads; and second (and I think this is the crux of my argument), an optimizing compiler may dispose of the "if" statement even in the face of memory barriers. No matter how many or what type of memory barriers you insert, the compiler is allowed to omit the if statement. The ONLY way to force the compiler (any compiler, for that matter) to generate it is to declare head_ as volatile. Here's the final code:

struct Node
{
    <unspecified> data;
    Node volatile * next;
};

Node volatile * volatile head_;
Node volatile * volatile tail_;

dequeue()
{
    while (true)
    {
        Node volatile * localHead = head_;
        Node volatile * localTail = tail_;

        DataDependencyBarrier();
        Node volatile * localNext = localHead->next;

        if (localHead == head_)
        {
            ...
        }
        .....
    }
}

Now this code will produce the intended, correct object code on all the compilers I've listed above and on at least these CPUs: x86, Itanium, MIPS, PowerPC (assuming that all the memory barriers have been defined for all the platforms). And without any modifications to the above code. How's that for portability?

I think my fault in my previous posts was to push more heavily on volatile's ability to tell the compiler not to reorder the instructions it generates (which is still useful) rather than to emphasize the fact that I want volatile to tell the compiler not to optimize away certain instructions. The reordering problem could be circumvented by using inline asm statements (and then again, on x86, LoadLoadBarrier would expand to nothing, so we'd be forced to use a bogus inline asm statement - I'd rather choose to use volatile), but I don't see how the optimizing-away problem could be circumvented without the use of volatile.

Now, after writing all this, I realize that I could have used a simpler example - a simple Peterson's algorithm for two threads wouldn't work without the use of volatile: the "turn" variable is assigned the same value as the one it's compared against later, so the compiler will omit the "turn == x" part of the if statement.
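[For reference, a minimal sketch of that Peterson's-algorithm case, with a hypothetical barrier macro mentioned only in a comment; the point is simply that without volatile the compiler, having just stored to turn, may fold the "turn == other" test away:]

// A sketch of Peterson's algorithm for two threads (ids 0 and 1).
// 'volatile' keeps the compiler from caching flag[] and turn in registers
// or folding the "turn == other" test to true; on real hardware a
// StoreLoad fence would additionally be needed where indicated.
struct PetersonLock
{
    volatile bool flag[2];   // flag[i]: thread i wants to enter
    volatile int  turn;      // whose turn it is to wait

    PetersonLock() : turn(0) { flag[0] = false; flag[1] = false; }

    void lock(int id)
    {
        const int other = 1 - id;
        flag[id] = true;
        turn = other;                 // give way to the other thread
        // StoreLoadBarrier();        // hypothetical macro; needed even on
                                      // x86 for full correctness
        while (flag[other] && turn == other)
        {
            // spin: without volatile the compiler may assume turn == other
            // still holds (it just stored it) and drop the test entirely
        }
    }

    void unlock(int id)
    {
        flag[id] = false;
    }
};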
<snip>

I hope this clears matters up - I'm sorry if I wasn't clear before.

Andy.
From: Andy Venikov on 29 Mar 2010 20:04

James Kanze wrote:
> On Mar 26, 12:05 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
>> James Kanze wrote:
>>> If that's the case, then the fence instruction is seriously
>>> broken. The whole purpose of a fence instruction is to
>>> guarantee that another CPU (with another thread) can see the
>>> changes. (Of course, the other thread also needs a fence.)
>
>> Hmm, the way I understand fences is that they introduce
>> ordering and not necessarily guarantee visibility. For
>> example:
>
>> 1. Store to location 1
>> 2. StoreStore fence
>> 3. Store to location 2
>
>> will guarantee only that if the store to location 2 is visible to
>> some thread, then the store to location 1 is guaranteed to be
>> visible to the same thread as well.
>
> A StoreStore fence guarantees that all stores issued before the
> fence are visible in main memory, and that none issued after the
> fence are visible (at the time the StoreStore fence is
> executed).
>
> Of course, for another thread to be guaranteed to see the results
> of any store, it has to use a load fence, to ensure that the
> values it sees are those after the load fence, and not some
> value that it happened to pick up earlier.

What I meant was that a memory fence doesn't mean that the effects of a write will be immediately flushed to main memory, or that the effects of a read will be immediately read from main memory. Generally, a memory fence is merely a checkpoint telling the processor not to reorder instructions around the fence. I don't remember which processor docs I read (I believe it was Itanium), but here's, for example, what they said about a store fence: a store barrier makes sure that all the stores appearing before the fence are placed in the write queue before any of the stores that follow the fence. In no way are you guaranteed that any of the stores are in main memory after the fence instruction has executed. For that you'd have to use a flush instruction.

Andy.
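[A minimal sketch of the ordering-only guarantee described above. The barrier macros are hypothetical; the definitions shown are one plausible gcc/x86 spelling, where both orderings come free from the hardware and only a compiler barrier is emitted. Nothing here forces either store into main memory at any particular time - it only constrains the order in which they can become visible:]

// Hypothetical barrier macros (gcc/x86 spelling used as an example; on x86
// StoreStore and LoadLoad ordering are provided by the hardware, so only a
// compiler barrier is needed).
#define StoreStoreBarrier() asm volatile("" ::: "memory")
#define LoadLoadBarrier()   asm volatile("" ::: "memory")

int payload = 0;                 // "location 1"
volatile bool ready = false;     // "location 2"

void producer()
{
    payload = 42;                // store to location 1
    StoreStoreBarrier();         // order the two stores...
    ready = true;                // ...store to location 2
}

void consumer()
{
    while (!ready) { /* spin */ }
    LoadLoadBarrier();           // matching barrier on the read side
    // If the store to ready was visible, the store to payload is too.
    int v = payload;
    (void)v;
}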
From: Andy Venikov on 29 Mar 2010 20:04

James Kanze wrote:
> On Mar 28, 10:05 pm, George Neuner <gneun...(a)comcast.net> wrote:
<snip>
>> All the fence does is create a checkpoint in the instruction
>> sequence at which relevant load or store instructions
>> dispatched prior to dispatch of the fence instruction will
>> have completed execution.
>
> That's not true for the two architectures whose documentation
> I've studied, Intel and Sparc. To quote the Intel documentation
> of MFENCE:
>
>     Performs a serializing operation on all load and store
>     instructions that were issued prior the MFENCE
>     instruction. This serializing operation guarantees that
>     every load and store instruction that precedes in
>     program order the MFENCE instruction is globally visible
>     before any load or store instruction that follows the
>     MFENCE instruction is globally visible.
>
> Note the "globally visible".

If you read the whole sentence, what you have is: <1> is globally visible before <2> is globally visible. That doesn't sound to me like a statement that <1> is globally visible by itself. I don't think the above says that instructions preceding MFENCE are guaranteed to be visible after the MFENCE instruction completes. It does guarantee, however, that the earlier instructions are visible before ANY of the following instructions are visible.

<snip>

Andy.
From: George Neuner on 29 Mar 2010 20:05

On Mon, 29 Mar 2010 16:53:44 CST, James Kanze <james.kanze(a)gmail.com> wrote:

>On Mar 28, 10:05 pm, George Neuner <gneun...(a)comcast.net> wrote:
>> On Thu, 25 Mar 2010 17:31:25 CST, James Kanze <james.ka...(a)gmail.com>
>> wrote:
>
>> >On Mar 25, 7:10 pm, George Neuner <gneun...(a)comcast.net> wrote:
>
>> >> As you noted, 'volatile' does not guarantee that an OoO CPU will
>> >> execute the stores in program order ...
>
>> >Arguably, the original intent was that it should. But it
>> >doesn't, and of course, the ordering guarantee only applies to
>> >variables actually declared volatile.
>
>> "volatile" is quite old ... I'm pretty sure the "intent" was defined
>> before there were OoO CPUs (in de facto use if not in a standard
>> document). Regardless, "volatile" only constrains the behavior of the
>> *compiler*.
>
>More or less. Volatile requires the compiler to issue code
>that conforms to what the documentation says it does. It
>requires all accesses to take place after the preceding sequence
>point, and the results of those accesses to be stable before the
>following sequence point. But it leaves it up to the
>implementation to define what is meant by "access", and most
>take a very, very liberal view of it.

Agreed, with the caveat that some ISAs do not give the compiler the tools to achieve this.

>> The purpose of the fence is to sequence memory accesses.
>
>For a much more rigorous definition of "access" than that used
>by the C++ standard.

Not exactly. I agree that the processor's memory model guarantees stronger ordering than that of the C++ standard (or almost any language standard, for that matter), but you are attributing semantics to "fence" that aren't there.

>> All the fence does is create a checkpoint in the instruction
>> sequence at which relevant load or store instructions
>> dispatched prior to dispatch of the fence instruction will
>> have completed execution.
>
>That's not true for the two architectures whose documentation
>I've studied, Intel and Sparc.

Then you'd better go back and study 8)

>To quote the Intel documentation of MFENCE:
>
>    Performs a serializing operation on all load and store
>    instructions that were issued prior the MFENCE
>    instruction. This serializing operation guarantees that
>    every load and store instruction that precedes in
>    program order the MFENCE instruction is globally visible
>    before any load or store instruction that follows the
>    MFENCE instruction is globally visible.

Now look at LFENCE, SFENCE and CLFLUSH and think about why they are provided separately. Also look at PREFETCH and see what it says about fences.

Intel provides MFENCE as a heavyweight combination of LFENCE, SFENCE and CLFLUSH. MFENCE does propagate to memory *because* it flushes the cache. However the primitive, SFENCE, ensures propagation of writes only to L2 cache.

Sparc has no single instruction that both fences and flushes the cache. MEMBAR ensures propagation only to L2 cache. A separate FLUSH instruction is necessary to ensure propagation to memory. Sparc also does not have separate load and store fences, but it offers two variants of MEMBAR which provide differing consistency guarantees.

>Note the "globally visible". Both Intel and Sparc guarantee
>strong ordering within a single core (i.e. a single thread);
>mfence or membar (Sparc) are only necessary if the memory will
>also be "accessed" from a separate unit: a thread running on a
>different core, or memory mapped IO.
Again, you're attributing semantics that aren't there.

For a store to be "globally visible" means that the value must be visible from outside the core. This requires the value to be in *some* externally visible memory - not *main* memory in particular. For both x86 and Sparc, this means L2 cache - the first level that can be snooped off-chip.

For a load, "globally visible" means that the value is present at all levels of the memory hierarchy and cannot be seen differently by an external observer. This simply follows from the normal operation of the read pipeline - the value is written into all levels of cache (more or less) at the same time it is loaded into the core register.

Note also that some CPUs can prefetch data in ways that bypass externally visible levels of cache. Sparc and x86 (at least since Pentium III) do not permit this.

>> However, in a memory hierarchy with caching, a store
>> instruction does not guarantee a write to memory but only that
>> one or more write cycles is executed on the core's memory
>> connection bus.
>
>On Intel and Sparc architectures, a store instruction doesn't
>even guarantee that. All it guarantees is that the necessary
>information is somehow passed to the write pipeline. What
>happens after that is anybody's guess.

No. On both of those architectures a store instruction will eventually cause the value to be written out of the core (except maybe if a hardware exception occurs). Additionally, the source register may be renamed or the stored value may be forwarded within the core to rendezvous with a subsequent read of the same location already in the pipeline ... but these internal flow optimizations don't affect the externally visible operation of the store instruction.

>> For another thread (or core or CPU) to perceive a change, a
>> value must be propagated into shared memory. For all
>> multi-core processors I am aware of, the first shared level of
>> memory is cache - not main memory. Cores on the same die
>> snoop each other's primary caches and share higher level
>> caches. Cores on separate dies in the same package share
>> cache at the secondary or tertiary level.
>
>And on more advanced architectures, there are cores which don't
>share any cache. All of which is irrelevant, since simply
>issuing a store instruction doesn't even guarantee a write to
>the highest level cache, and a membar or a fence instruction
>guarantees access all the way down to the main, shared memory.

Sorry, but no. Even the architectures we've discussed here, x86 and Sparc, do not satisfy your statement. There might be architectures I'm unaware of which can elide an off-core write entirely by rendezvous forwarding and register renaming, but you haven't named one. I would consider eliding the store to be a dangerous interpretation of memory semantics, and I suspect I would not be alone.

I'm not familiar with any cached architecture for which fencing alone guarantees that a store writes all the way to main memory - I know some that don't even have/need fencing because their on-chip caches are write-through.

>> The upshot is this:
>> - "volatile" is required for any CPU.
>
>I'm afraid that doesn't follow from anything you've said.
>Particularly because the volatile is largely a no-op on most
>current compilers---it inhibits compiler optimizations, but the
>generated code does nothing to prevent the reordering that
>occurs at the hardware level.

"volatile" is required because the compiler must not reorder or optimize away the loads or stores.

>> - fences are required for an OoO CPU.
>
>By OoO, I presume you mean "out of order". That's not the only
>source of the problems.

OoO is not the *only* source of the problem. The compiler has little control over hardware reordering ... fences are blunt instruments that impact all loads and stores ... not just those to language-level "volatiles".

>> - cache control is required for a write-back cache between
>> CPU and main memory.
>
>The cache is largely irrelevant on Sparc or Intel. The
>processor architectures are designed in a way to make it
>irrelevant. All of the problems would be there even in the
>absence of caching. They're determined by the implementation of
>the write and read pipelines.

That's a naive point of view. For a cached processor, the operation of the cache and its impact on real programs is *never* "irrelevant".

George
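[For concreteness, one plausible gcc-on-x86 spelling of the barrier macros discussed in this thread, illustrating the point that a hardware fence constrains *all* memory accesses, volatile or not, while a compiler-only barrier emits no instruction at all. The definitions are an assumption about that one toolchain, not anyone's actual header:]

// gcc on x86 only - a sketch, not a portable header.

// Compiler-only barrier: prevents the compiler from moving any load or store
// across it, but emits no instruction. On x86 (a strongly ordered machine)
// this is all that LoadLoad and StoreStore ordering need, which is why the
// LoadLoadBarrier() above "expands to nothing".
#define CompilerBarrier()       asm volatile("" ::: "memory")
#define LoadLoadBarrier()       CompilerBarrier()
#define StoreStoreBarrier()     CompilerBarrier()

// Full hardware fence: additionally orders stores against later loads
// (StoreLoad), which x86 does not give you for free. Note that it orders
// *all* memory accesses issued by this core, not just volatile ones.
#define StoreLoadBarrier()      asm volatile("mfence" ::: "memory")

// Data-dependency barrier: a no-op on every mainstream CPU except the
// original Alpha, hence a macro that expands to nothing.
#define DataDependencyBarrier() do { } while (0)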
From: Herb Sutter on 29 Mar 2010 20:28
On Mon, 29 Mar 2010 16:55:44 CST, "Leigh Johnston" <leigh(a)i42.co.uk> wrote:

>"James Kanze" <james.kanze(a)gmail.com> wrote in message
>news:36f7e40e-4584-430d-980e-5f7478728d16(a)z3g2000yqz.googlegroups.com...
>>> Performance is often cited as another reason to not use
>>> volatile, however the use of volatile can actually help with
>>> multi-threading performance as you can perform a safe
>>> lock-free check before performing a more expensive lock.
>>
>> Again, I'd like to see how. This sounds like the double-checked
>> locking idiom, and that's been proven not to work.
>
>IMO for an OoO CPU the double-checked locking pattern can be made to work
>with volatile if fences are also used or the lock also acts as a fence (as
>is the case with VC++/x86). This is also the counter-example you are
>looking for; it should work on some implementations. FWIW VC++ is clever
>enough to make the volatile redundant for this example; however, adding
>volatile makes no difference to the generated code (read: no performance
>penalty).

Are you sure? On x86 a VC++ volatile write is supposed to be emitted as xchg, whereas an ordinary write is usually emitted as mov. If the DCL control variable write is emitted as mov on x86 then DCL won't work correctly (well, it'll appear to work...).

Herb

---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)
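[A minimal sketch of the double-checked-locking shape under discussion, with hypothetical names and a stubbed-out mutex so it stands alone. The question raised above is whether the publishing store to g_instance is emitted as a plain mov or as an ordered write (xchg): with a plain, unordered store, another thread may observe a non-null pointer before the Widget's construction is visible:]

// Hypothetical names throughout; the mutex is a no-op stand-in so the sketch
// is self-contained - substitute any real mutex.
struct Mutex { void lock() {} void unlock() {} };
struct ScopedLock
{
    explicit ScopedLock(Mutex& m) : m_(m) { m_.lock(); }
    ~ScopedLock() { m_.unlock(); }
    Mutex& m_;
};

struct Widget { int value; Widget() : value(42) {} };

Widget* volatile g_instance = 0;   // control variable, read outside the lock
Mutex g_mutex;

Widget* getInstance()
{
    if (g_instance == 0) {             // first, unlocked check
        ScopedLock guard(g_mutex);
        if (g_instance == 0) {         // second, locked check
            Widget* p = new Widget;    // construct first...
            // ...then publish. Whether this store is emitted as a plain mov
            // or as an ordered write (xchg / mov + fence) is exactly the
            // VC++ volatile question above; with a plain store the Widget's
            // fields may become visible to another core after the pointer.
            g_instance = p;
        }
    }
    return g_instance;
}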