From: Leigh Johnston on 30 Mar 2010 12:14

"James Kanze" <james.kanze(a)gmail.com> wrote in message
news:da63ca83-4d6e-416a-9825-c24deed3e49f(a)10g2000yqq.googlegroups.com...
<snip>
> Double checked locking can be made to work if you introduce
> inline assembler or use some other technique to insert a fence
> or a membar instruction in the appropriate places. But of
> course, then, the volatile becomes superficial.

It is only superfluous if the compiler guarantees that a load/store for a
non-volatile variable is still emitted in the presence of a fence, which
sounds like a dubious guarantee to me. Which compilers stop performing
optimizations in the presence of a fence, and how does the compiler know
which variable accesses may not be optimized away in the presence of a
fence?

>> This is also the counter-example you are looking for, it
>> should work on some implementations.
>
> It's certainly not an example of a sensible use of volatile,
> since without the membar/fence, the algorithm doesn't work (at
> least on most modern processors, which are multicore). And with
> the membar/fence, the volatile is superfluous, and not needed.

Read what I said above.

>> FWIW VC++ is clever enough to make the volatile redundant for
>> this example however adding volatile makes no difference to
>> the generated code (read: no performance penalty) and I like
>> making such things explicit similar to how one uses const
>> (doesn't affect the generated output but documents the
>> programmer's intentions).
>
> The use of a fence or membar (or some system specific "atomic"
> access) would make the intent explicit. The use of volatile
> suggests something completely different (memory mapped IO, or
> some such).

Obviously we disagree on this point, hence the argument we are having.
<snip>
>> The only volatile in my entire codebase is for the "status" of
>> my "threadable" base class and I don't always acquire a lock
>> before checking this status and I don't fully trust that the
>> optimizer won't cache it for all cases that might crop up as I
>> develop code.
>
> I'd have to see the exact code to be sure, but I'd guess that
> without an mfence somewhere in there, the code won't work on a
> multicore machine (which is just about everything today), and
> with the mfence, the volatile isn't necessary.

The code does work on a multi-core machine, and I am confident it will
continue to work as I write new code precisely because I am using
volatile and am therefore guaranteed that a load will be emitted rather
than optimized away.

> Also, at least under Solaris, if there is no contention, the
> execution time of pthread_mutex_lock is practically the same as
> that of membar. Although I've never actually measured it, I
> suspect that the same is true if you use CriticalSection (and
> not Mutex) under Windows.

Critical sections are expensive compared to a simple load that is
guaranteed by using volatile. It is not always necessary to use a fence,
as all a fence does is guarantee ordering, so it all depends on the
use-case.

>> BTW I try and avoid singletons too so I haven't found the need
>> to use the double checked locking pattern AFAICR.
>
> Double checked locking is a pattern which can be applied to many
> things, not just to singletons.

I never said otherwise; singletons are an obvious example use, though,
and this thread was originally about singletons.

/Leigh

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
From: Herb Sutter on 30 Mar 2010 12:14

On Mon, 29 Mar 2010 20:22:54 CST, "Leigh Johnston" <leigh(a)i42.co.uk>
wrote:
>"Herb Sutter" <herb.sutter(a)gmail.com> wrote in message
>news:j462r55lr984u1vg0jttl47upeutfcardg(a)4ax.com...
><snip>
>>>On Mar 29, 7:45 am, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
>>>> Performance is often cited as another reason to not use
>>>> volatile however
>>
>> No "however" needed, that cited reason is correct. Volatile disables
>> optimizations that would be legal for an atomic<>. A quick example is
>> combining/eliding writes (e.g., v = 1; v = 2; can't be transformed to
>> v = 2;, but a = 1; a = 2; can be transformed to a = 2;). Another is
>> combining/eliding reads.
>
>If atomic reads can be elided won't that be problematic for using
>atomics with the double checked locking pattern (so we are back to
>using volatile atomics)?

No, because it's only adjacent atomic reads (i.e., separated only by
ordinary memory operations) that can be combined. In DCL there's a lock
acquire operation in between, and the atomic reads can't be reordered
across that to make them adjacent, so they can't be combined.

Herb

---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)
From: Herb Sutter on 30 Mar 2010 12:15

On Tue, 30 Mar 2010 05:03:11 CST, Andy Venikov <swojchelowek(a)gmail.com>
wrote:
>Herb Sutter wrote:
>> Please remember this: Standard ISO C/C++ volatile is useless for
>> multithreaded programming. No argument otherwise holds water; at best
>> the code may appear to work on some compilers/platforms, including all
>> attempted counterexamples I've seen on this thread.
>
>You have enormous clout with C++ professionals, including myself, so
>before permanently agreeing to such an all-encompassing statement allow
>me to maybe step back a little and see what it is that's at the core of
>this argument. Maybe we're arguing the same point. Or maybe I'm missing
>something big in which case I'll be doubly glad to have been shown my
>wrong assumptions.

Short answer: Note I deliberately said "Standard" above -- the above
statement is completely true for portable usage. You may get away with it
on some platforms today, but it's nonportable, and even the getting-away
won't last. Slightly longer answer follows:

>I understand that volatile never was supposed to be of any help for
>multithreaded programming. I don't expect it to issue any memory fences
>nor make any guarantees whatsoever about anything thread-related...

Yes, and that's why it can't reliably be used for inter-thread
communication == synchronization.

>Yet, on all the compilers I know of (gcc, mingw, MSVC, LLVM, Intel) it
>produces just the code I need for my multithreaded programs. And I
>really don't see how it wouldn't, given common-sense understanding of
>what it should do in single-threaded programs. And I'm pretty sure that
>it's not going to change in a foreseeable future.
>
>So my use of volatile may not be standard-portable, but it sure is
>real-life portable.

It's like relying on undefined behavior. UB may happen to do what you
expected, most of the time, on your current compiler and platform.
That doesn't mean it's correct or portable, and it will become less and
less real-life portable on multi-core systems.

Because there was no better hook, volatile was strengthened (in
non-standard ways) on various systems. For example, on MS VC++ prior to
VC++ 2005 (I think), volatile had no ordering semantics at all, but
people thought it was used for inter-thread communication because the
Windows InterlockedXxx APIs happened to take a volatile variable. But
that was just using volatile as a type system tag to help you not
accidentally pass a plain variable, and a little bit to leverage the lack
of optimizations on volatile -- the real reason it worked was that you
were calling the InterlockedXxx APIs, because *those* are correctly
synchronized for lock-free coding.

Even now in VC++ 2005 and later, when volatile was strengthened so that
reads and writes are (almost) SC, to get fully SC lock-free code in all
cases you still have to use the InterlockedXxx APIs rather than direct
reads and writes of the volatile variable. The strengthened volatile
semantics mean that, on that compiler and when targeting x86/x64, direct
reads and writes are enough to make most examples like DCL work, but they
aren't enough to make examples like Dekker's work -- for Dekker's to work
correctly you still have to use the InterlockedXxx APIs.

>Here's the point of view I'm coming from.
>Imagine that someone needs to implement a library that provides certain
>multithreading (multiprogramming) tools like atomic access,
>synchronization primitives and some lock-free algorithms that will be
>used by other developers so that they wouldn't have to worry about
>things like volatile. (Now that boost.atomic is almost out, I'll happily
>use it.

Important note: Using std::atomic<> is exactly the correct answer! The
only caveat is that it's not yet widely available, but this year we're
getting over the hump of wide availability thanks to Boost and others.
>But Helge Bahmann (the author of the library) didn't have such a

Isn't it Anthony Williams who's doing Boost's atomic<> implementation?
Hmm.

>luxury, so to make his higher-level APIs work he had to internally
>resort to low-level tools like volatiles where appropriate.)

Of course, sure. The implementation of std::atomic<> on any given
platform needs to use platform-specific tools, including things like
explicit fences/membars (e.g., mf+st.rel on IA64), ordered APIs (e.g.,
InterlockedIncrement on Windows), and/or other nonstandard and
nonportable goo (e.g., platform-specific variants of volatile). The
implementation of any standard feature typically will internally use
nonstandard system-specific features. That's the standard feature's
purpose: to shield users from those details and make this particular
system do the right particular thing.

[...]

>Look at line D5: it needs to check if Q->Head is still the same as what
>we read from it before. Otherwise two possibilities for breaking the
>correctness arise: 1) it would be possible for the element pointed to by

[...]

>This piece of pseudo code could be naively translated to the following
>C++ code:
>
>while (true)
>{
>    Node * localHead = head_;
>    Node * localTail = tail_;
>    Node * localNext = localHead->next;
>    if (localHead == head_)
>    {
>        ...
>    }
>}
>
>But it wouldn't work, for obvious reasons.
>One needs to insert MemoryFences in the right places.

[...]

Fences are evil. Nearly nobody can use them consistently correctly,
including people who have years of experience with them. Those people
(write once, and from then on) use the Linux atomics package or C++0x
std::atomic.

Every mutable shared object should be protected by a mutex (99.9% case)
or be atomic (0.1% case). If you're going to write lock-free code, it's
really, really, really important to just make the shared variables be
C++0x std::atomic<> (or equivalently Java or .NET volatile, which isn't
the same thing as ISO C and C++ volatile).
If you do, you won't have to reason about where the fences need to go.
Reasoning about where the fences need to go is such a futile and
error-prone job that most lock-free papers don't even try to say where
to put them and just assume SC execution.

>Here's the final code:

I apologize for not having time to read your transformations of Maged's
code closely, but in all of the following, why is the volatile on the
Node, not on the pointer? Even if volatile did all the magic you want it
to do (like Java/.NET volatile), that's broken because it's in the wrong
place, isn't it? Of course, the usual manifestation of the problem is
that the code will compile, run, and appear to work...

>struct Node
>{
>    <unspecified> data;
>    Node volatile * next;
>};
>
>Node volatile * volatile head_;
>Node volatile * volatile tail_;
>
>dequeue()
>{
>    while (true)
>    {
>        Node volatile * localHead = head_;
>        Node volatile * localTail = tail_;
>        DataDependencyBarrier();
>        Node volatile * localNext = localHead->next;
>
>        if (localHead == head_)
>        {
>            ...
>        }
>        ...
>    }
>}
>
>Now this code will produce the intended correct object code on all the
>compilers I've listed above and on at least these CPUs: x86, itanium,
>mips, PowerPC (assuming that all the MemoryBarriers have been defined
>for all the platforms). And without any modifications to the above code.
>How's that for portability?

Without even reading the code logic and looking for races, I doubt it.
For a detailed analysis of multiple lock-free implementations of a
similar queue example, including an exceedingly rare race that even
under sustained heavy stress on a 24-core system only manifested once
every tens of millions of insertions, see:

  Measuring Parallel Performance: Optimizing a Concurrent Queue
  http://www.drdobbs.com/high-performance-computing/212201163

>Now, after writing all this, I realize that I could've used a simpler
>example - a simple Peterson's algorithm for two threads wouldn't work
>without a use of a volatile: the "turn" variable is assigned the same
>value as it's being compared to later, so the compiler will omit the
>"if turn == x" part in the if statement.

Actually, Dekker's/Peterson's is broken even with VC++ 2008
heavily-strengthened volatile. (Sorry.) To make it correct you have to
store to the flag variable using InterlockedExchange() or similar, not
using a simple write to the flag variable.

Herb

---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)
From: Herb Sutter on 30 Mar 2010 12:16

On Tue, 30 Mar 2010 09:47:55 CST, "Leigh Johnston" <leigh(a)i42.co.uk>
wrote:
>>>IMO for an OoO CPU the double checked locking pattern can be made to
>>>work with volatile if fences are also used or the lock also acts as a
>>>fence (as is the case with VC++/x86). This is also the counter-example
>>>you are looking for, it should work on some implementations. FWIW VC++
>>>is clever enough to make the volatile redundant for this example
>>>however adding volatile makes no difference to the generated code
>>>(read: no performance penalty)
>>
>> Are you sure? On x86 a VC++ volatile write is supposed to be emitted
>> as xchg, whereas an ordinary write is usually emitted as mov. If the
>> DCL control variable write is emitted as mov on x86 then DCL won't
>> work correctly (well, it'll appear to work...).
>
>Yes, on x86 VC++ (VC9) emits a MOV for a volatile write; however,
>entering the critical section in the DCL should act as a fence, so it
>should work. I asked this question (about VC++ volatile not emitting
>fences) in microsoft.public.vc.language but didn't get a satisfactory
>reply.

Ah yes, my thinko. I was momentarily rusty on the outcome of internal
design discussions a year or three ago. Let me try again:

Yes, we (VC++ since VC++ 2005, I think) do emit a plain MOV for a
volatile write. DCL works with a plain MOV on x86/x64 because x86/x64's
strong memory model, combined with VC++ 2005+'s restrictions on
reorderings around volatile reads/writes, is just enough to do the trick.

However, that's not enough to make Dekker's work. To make Dekker's work
correctly today on our platform, you have to perform the write to the
flag variable using an InterlockedXxx() call of some sort, not as a
direct write to a volatile flag variable.
If we wanted native volatile to be suitable for general lock-free
coding, including Dekker's using only direct reads and writes of the
volatile variable (i.e., without InterlockedXxx API calls), we would
have to emit writes as XCHG, not MOV (and some other things). We don't
do that today; we discussed doing it, and the current decision is that
we do not intend to do it -- it would penalize any existing uses of
volatile, and the right tool for lock-free coding is atomic<>. (Which we
also need to ship and isn't in VC 2010, sorry, but we're working on it
next.)

FWIW, the issue is that VC++ volatile is load/acquire and store/release,
but that still allows one reordering -- store-load reordering, which
kills Dekker's. Emitting the store as XCHG would prevent store-load
reordering, hence would fix Dekker's. But the right way to fix Dekker's
on our platform, again, is: for now, use InterlockedXxx APIs for
lock-free code; and when it's available, use atomic<> (Boost's or
eventually ours) for lock-free code.

Herb

---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)
From: Herb Sutter on 30 Mar 2010 12:14
On Tue, 30 Mar 2010 16:37:59 CST, James Kanze <james.kanze(a)gmail.com>
wrote:
>(I keep seeing mention here of instruction reordering. In the
>end, instruction reordering is irrelevant. It's only one thing
>that may lead to reads and writes being reordered.

Yes, but: Any reordering at any level can be treated as an instruction
reordering -- actually, as a source code reordering. That's why all
language-level memory model discussions only bother to talk about
source-level reorderings: any CPU or cache transformation ends up having
the same effect as some corresponding source-level reordering.

Herb

---
Herb Sutter (herbsutter.wordpress.com) (www.gotw.ca)

Convener, SC22/WG21 (C++) (www.gotw.ca/iso)
Architect, Visual C++ (www.gotw.ca/microsoft)