From: Joshua Maurice on 30 Mar 2010 17:56

On Mar 30, 8:14 pm, Herb Sutter <herb.sut...(a)gmail.com> wrote:
> On Tue, 30 Mar 2010 16:37:59 CST, James Kanze <james.ka...(a)gmail.com>
> wrote:
>
> > (I keep seeing mention here of instruction reordering. In the
> > end, instruction reordering is irrelevant. It's only one thing
> > that may lead to reads and writes being reordered.
>
> Yes, but: Any reordering at any level can be treated as an instruction
> reordering -- actually, as a source code reordering. That's why all
> language-level MM discussions only bother to talk about source-level
> reorderings, because any CPU or cache transformations end up having
> the same effect as some corresponding source-level reordering.

Not quite, no. On "weaker guarantee" processors, let's take the
following example:

/* Start of pseudo-code example. Forgive me any typos; this is off
   the top of my head and I haven't really used lambda functions. */
int main() {
    int a = 0;
    int b = 0;
    int c[4];
    int d[4];
    start_thread([&]() -> void { c[0] = a; d[0] = b; });
    start_thread([&]() -> void { c[1] = a; d[1] = b; });
    start_thread([&]() -> void { c[2] = a; d[2] = b; });
    start_thread([&]() -> void { c[3] = a; d[3] = b; });
    a = 1;
    b = 2;
    // (Joining the four threads before printing is implied.)
    cout << c[0] << " " << d[0] << '\n'
         << c[1] << " " << d[1] << '\n'
         << c[2] << " " << d[2] << '\n'
         << c[3] << " " << d[3] << endl;
}
// End of pseudo-code example.

On some modern processors, most (in)famously the DEC Alpha with its
awesome split cache, this program in the real world (or something very
much like it) can print:

0 0
0 2
1 0
1 2

Specifically, this is a single execution of the program. In this single
execution, the writes "a = 1; b = 2;" are seen to happen in two
different orders: the exact same store instructions become visible to
other cores in different orders. There is no (sane) source-code-level
reordering that can achieve this.
I tried to emphasize this elsewhere in the thread: you cannot think
about threading in terms of "possible interleavings of instructions".
That model does not portably work. Absent synchronization, on some
processors there is no global order of instructions.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
From: James Kanze on 31 Mar 2010 04:40

On 31 Mar, 04:14, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> "James Kanze" <james.ka...(a)gmail.com> wrote in message
> news:da63ca83-4d6e-416a-9825-c24deed3e49f(a)10g2000yqq.googlegroups.com...

> <snip>

> > Double checked locking can be made to work if you introduce
> > inline assembler or use some other technique to insert a
> > fence or a membar instruction in the appropriate places.
> > But of course, then, the volatile becomes superfluous.

> It is only superfluous if there is a compiler guarantee that a
> load/store for a non-volatile variable is emitted in the
> presence of a fence, which sounds like a dubious guarantee to
> me. Which compilers stop performing optimizations in the
> presence of a fence, and/or how does the compiler know which
> variable accesses can be optimized in the presence of a
> fence?

All of the compilers I know either treat inline assembler, or an
external call to a function written in assembler, as a worst case
with regards to optimizing, and do not move code across it, or
they provide a means of specifying to the compiler which
variables, etc. are affected by the assembler.

> >> This is also the counter-example you are looking for; it
> >> should work on some implementations.

> > It's certainly not an example of a sensible use of volatile,
> > since without the membar/fence, the algorithm doesn't work
> > (at least on most modern processors, which are multicore).
> > And with the membar/fence, the volatile is superfluous, and
> > not needed.

> Read what I said above.

I have. But it doesn't hold water.

> >> FWIW VC++ is clever enough to make the volatile redundant
> >> for this example, however adding volatile makes no
> >> difference to the generated code (read: no performance
> >> penalty), and I like making such things explicit, similar to
> >> how one uses const (doesn't affect the generated output but
> >> documents the programmer's intentions).
> > The use of a fence or membar (or some system-specific
> > "atomic" access) would make the intent explicit. The use of
> > volatile suggests something completely different (memory-mapped
> > IO, or some such).

> Obviously we disagree on this point, hence the reason for the
> existence of this argument we are having.

Yes. Theoretically, I suppose, you could find a compiler which
documented that it would move code across a fence or a membar
instruction. In practice, either the compiler treats assembler as
a black box and supposes that it might do anything, or it
analyses the assembler and takes it into account when optimizing.
In the first case, the compiler must synchronize its view of
memory, because it must suppose that the assembler reads and
writes arbitrary values from memory. In the second (which is
fairly rare), it recognizes the fence and adjusts its
optimization accordingly.

Your argument is basically that the compiler writers are either
completely incompetent, or that they are intentionally out to
make your life difficult. In either case, there are a lot more
things they could do to make your life difficult. I wouldn't use
such a compiler, because it would be, in effect, unusable.

> <snip>

> >> The only volatile in my entire codebase is for the "status"
> >> of my "threadable" base class, and I don't always acquire a
> >> lock before checking this status, and I don't fully trust
> >> that the optimizer won't cache it for all cases that might
> >> crop up as I develop code.

> > I'd have to see the exact code to be sure, but I'd guess that
> > without an mfence somewhere in there, the code won't work on
> > a multicore machine (which is just about everything today),
> > and with the mfence, the volatile isn't necessary.
> The code does work on a multi-core machine, and I am confident
> it will continue to work when I write new code, precisely
> because I am using volatile and am therefore guaranteed that a
> load will be emitted, not optimized away.

If you have the fence in the proper place, you're guaranteed
that it will work, even without volatile. If you don't, you're
not guaranteed anything.

> > Also, at least under Solaris, if there is no contention, the
> > execution time of pthread_mutex_lock is practically the same
> > as that of membar. Although I've never actually measured it,
> > I suspect that the same is true if you use CriticalSection
> > (and not Mutex) under Windows.

> Critical sections are expensive when compared to a simple load
> that is guaranteed by using volatile. It is not always
> necessary to use a fence, as all a fence is doing is
> guaranteeing order, so it all depends on the use-case.

I'm not sure I follow. Basically, the fence guarantees that the
hardware can't do specific optimizations -- the same
optimizations that the software can't do in the case of
volatile. If you think you need volatile, then you certainly
need a fence. (And if you have the fence, you no longer need
the volatile.)

--
James Kanze
From: Anthony Williams on 31 Mar 2010 07:17

Herb Sutter <herb.sutter(a)gmail.com> writes:
> > But Helge Bahmann (the author of the library) didn't have such a
>
> Isn't it Anthony Williams who's doing Boost's atomic<> implementation?
> Hmm.

No. Helge's implementation covers more platforms than I have access
to or know how to write atomics for.

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976
From: Leigh Johnston on 31 Mar 2010 07:21

"James Kanze" <james.kanze(a)gmail.com> wrote in message
news:bbd4bca1-2c16-489b-b814-98db0aafb492(a)z4g2000yqa.googlegroups.com...
> On 31 Mar, 04:14, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
>> "James Kanze" <james.ka...(a)gmail.com> wrote in message
>> news:da63ca83-4d6e-416a-9825-c24deed3e49f(a)10g2000yqq.googlegroups.com...
>
>> <snip>
>
>> > Double checked locking can be made to work if you introduce
>> > inline assembler or use some other technique to insert a
>> > fence or a membar instruction in the appropriate places.
>> > But of course, then, the volatile becomes superfluous.
>
>> It is only superfluous if there is a compiler guarantee that a
>> load/store for a non-volatile variable is emitted in the
>> presence of a fence, which sounds like a dubious guarantee to
>> me. Which compilers stop performing optimizations in the
>> presence of a fence, and/or how does the compiler know which
>> variable accesses can be optimized in the presence of a
>> fence?
>
> All of the compilers I know either treat inline assembler, or an
> external call to a function written in assembler, as a worst case
> with regards to optimizing, and do not move code across it, or
> they provide a means of specifying to the compiler which
> variables, etc. are affected by the assembler.

Yes, I realized that after posting, but as this newsgroup is
moderated, posting an immediate retraction reply is not possible. :)

{ An immediate retraction may be possible. Just write to the
moderators (see the link in the banner at the end of this article)
including the article's tracking number. If not yet approved, the
article is then rejected per request. -mod }

<snip>

>> The code does work on a multi-core machine, and I am confident
>> it will continue to work when I write new code, precisely
>> because I am using volatile and am therefore guaranteed that a
>> load will be emitted, not optimized away.
> If you have the fence in the proper place, you're guaranteed
> that it will work, even without volatile. If you don't, you're
> not guaranteed anything.

It is guaranteed to work on the platform for which I am
implementing, and I find it hard to believe that it wouldn't work
on other platforms/compilers which have similar semantics for
volatile (which you already agreed was a fair assumption).

>> > Also, at least under Solaris, if there is no contention, the
>> > execution time of pthread_mutex_lock is practically the same
>> > as that of membar. Although I've never actually measured it,
>> > I suspect that the same is true if you use CriticalSection
>> > (and not Mutex) under Windows.
>
>> Critical sections are expensive when compared to a simple load
>> that is guaranteed by using volatile. It is not always
>> necessary to use a fence, as all a fence is doing is
>> guaranteeing order, so it all depends on the use-case.
>
> I'm not sure I follow. Basically, the fence guarantees that the
> hardware can't do specific optimizations -- the same
> optimizations that the software can't do in the case of
> volatile. If you think you need volatile, then you certainly
> need a fence. (And if you have the fence, you no longer need
> the volatile.)

My point is that it is possible to write a piece of multi-threaded
code which does not use a fence or a mutex/critical section and
just reads a single shared variable in isolation (ordering not
important, and the read atomic on the platform in question), and
for this *particular* case volatile can be useful. I find it hard
to believe that there are no cases at all where this applies.

/Leigh

--
From: Andy Venikov on 31 Mar 2010 07:35
James Kanze wrote:
<snip>
> I'm not sure I follow. Basically, the fence guarantees that the
> hardware can't do specific optimizations. The same
> optimizations that the software can't do in the case of
> volatile. If you think you need volatile, then you certainly
> need a fence. (And if you have the fence, you no longer need
> the volatile.)

Ah, finally I think I see where you are coming from. You think
that if you have the fence, you no longer need volatile.

I think you assume too much about how a fence is really
implemented. Since the standard says nothing about fences, you
have to rely on a library that provides them, and if you don't
have such a library, you'll have to implement one yourself. A
reasonable way to implement a barrier would be to use macros
that, depending on the platform, expand to inline assembly
containing the right instruction. In this case the inline asm
will make sure that the compiler won't reorder the emitted
instructions, but it won't make sure that the optimizer doesn't
throw away some needed instructions.

For example, following my post where I described Maged Michael's
algorithm, here's how the relevant excerpt would look without
volatiles:

// x86-related defines:
#define LoadLoadBarrier() asm volatile ("mfence")

// Common code:
struct Node {
    Node * pNext;
};
Node * head_;

void f() {
    Node * pLocalHead = head_;
    Node * pLocalNext = pLocalHead->pNext;

    LoadLoadBarrier();

    if (pLocalHead == head_) {
        printf("pNext = %p\n", pLocalNext);
    }
}

Just to make you happy, I defined LoadLoadBarrier as a full mfence
instruction, even though on x86 there is no need for a barrier
here, even on a multicore/multiprocessor.
And here's the object code gcc 4.3.2 on Linux/x86-64 generated:

0000000000400630 <_Z1fv>:
  400630: 0f ae f0              mfence
  400633: 48 8b 05 fe 09 20 00  mov 0x2009fe(%rip),%rax  # 601038 <head_>
  40063a: bf 5c 07 40 00        mov $0x40075c,%edi
  40063f: 48 8b 30              mov (%rax),%rsi
  400642: 31 c0                 xor %eax,%eax
  400644: e9 bf fe ff ff        jmpq 400508 <printf(a)plt>
  400649: 0f 1f 80 00 00 00 00  nopl 0x0(%rax)

As you can see, it uselessly put the mfence right at the beginning
of f(), and threw away the second read of head_ and the whole if
statement altogether.

Naively, you could put a "memory" clobber in the inline assembly
clobber list, like this:

#define LoadLoadBarrier() asm volatile ("mfence" : : : "memory")

This will work, but it is a huge overkill, because after this the
compiler will need to re-read all variables, even unrelated ones.
And when f() gets inlined, you get a huge performance hit.

Volatile saves the day nicely and beautifully, albeit not
"standards"-portably. But as I said elsewhere, this will work on
most compilers and hardware. Of course I'd need to test it on the
compiler/hardware combination that the client is going to run it
on, but such is the peril of trying to provide a portable
interface with a non-portable implementation. So far I haven't
found a single combination that wouldn't correctly compile the
code with volatiles.

And of course I'll gladly embrace C++0x atomic<>... when it
becomes available. Right now, though, I'm slowly migrating to
boost::atomic (which again, internally, HAS TO use and IS using
volatiles).

Thanks,
Andy.