From: James Kanze on 30 Mar 2010 07:38

On Mar 30, 12:04 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
> James Kanze wrote:
>> On Mar 28, 10:05 pm, George Neuner <gneun...(a)comcast.net> wrote:
> <snip>
>>> All the fence does is create a checkpoint in the
>>> instruction sequence at which relevant load or store
>>> instructions dispatched prior to dispatch of the fence
>>> instruction will have completed execution.

>> That's not true for the two architectures whose
>> documentation I've studied, Intel and Sparc. To quote the
>> Intel documentation of MFENCE:

>>     Performs a serializing operation on all load and store
>>     instructions that were issued prior the MFENCE
>>     instruction. This serializing operation guarantees that
>>     every load and store instruction that precedes in
>>     program order the MFENCE instruction is globally visible
>>     before any load or store instruction that follows the
>>     MFENCE instruction is globally visible.

>> Note the "globally visible".

> If you read the whole sentence, you have: <1> is globally
> visible before <2> is globally visible. That doesn't sound to
> me as saying that <1> is globally visible.

It guarantees that if <1> is not globally visible, then neither
will <2> be. I don't know what more you might want.

> I don't think that the above says that instructions that
> precede MFENCE are guaranteed to be visible after the MFENCE
> instruction completes.

I don't think that "instruction completes" has any real meaning
in a modern processor. Arguably, a store instruction doesn't
complete until the data is stable in main memory. But that
doesn't stop the processor from going on and doing other things.
In practice, the ordering is all that is relevant anyway.

> It does guarantee, however, that the earlier instructions are
> visible before ANY of the following instructions are visible.

Which is all that is required.

--
James Kanze

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
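For concreteness, here is a minimal sketch (not from the thread) of the
store/load ordering case that the quoted MFENCE text addresses, assuming
x86 and the _mm_mfence() intrinsic from <emmintrin.h>; the variable names
are invented and the thread creation is omitted:

    #include <emmintrin.h>          // _mm_mfence (SSE2)

    volatile int x = 0, y = 0;      // shared flags
    int r1, r2;

    void thread1()                  // runs on one core
    {
        x = 1;
        _mm_mfence();               // store to x globally visible before the load of y
        r1 = y;
    }

    void thread2()                  // runs on another core
    {
        y = 1;
        _mm_mfence();               // store to y globally visible before the load of x
        r2 = x;
    }

Without the fences, x86 may let each load become visible before the
preceding store, so r1 == 0 && r2 == 0 is a possible outcome; with the
fences, at least one of the two threads is guaranteed to read 1.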
From: James Kanze on 30 Mar 2010 07:36

On Mar 29, 11:55 pm, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> "James Kanze" <james.ka...(a)gmail.com> wrote in message
> news:36f7e40e-4584-430d-980e-5f7478728d16(a)z3g2000yqz.googlegroups.com...
> <snip>
>>> Performance is often cited as another reason to not use
>>> volatile however the use of volatile can actually help with
>>> multi-threading performance as you can perform a safe
>>> lock-free check before performing a more expensive lock.

>> Again, I'd like to see how. This sounds like the
>> double-checked locking idiom, and that's been proven not to
>> work.

> IMO for an OoO CPU the double checked locking pattern can be
> made to work with volatile if fences are also used or the lock
> also acts as a fence (as is the case with VC++/x86).

Double checked locking can be made to work if you introduce
inline assembler or use some other technique to insert a fence
or a membar instruction in the appropriate places. But of
course, then, the volatile becomes superfluous.

> This is also the counter-example you are looking for, it
> should work on some implementations.

It's certainly not an example of a sensible use of volatile,
since without the membar/fence, the algorithm doesn't work (at
least on most modern processors, which are multicore). And with
the membar/fence, the volatile is superfluous, and not needed.

> FWIW VC++ is clever enough to make the volatile redundant for
> this example however adding volatile makes no difference to
> the generated code (read: no performance penalty) and I like
> making such things explicit similar to how one uses const
> (doesn't affect the generated output but documents the
> programmer's intentions).

The use of a fence or membar (or some system specific "atomic"
access) would make the intent explicit. The use of volatile
suggests something completely different (memory mapped IO, or
some such).

> Which is better: use volatile if there is no noticeable
> performance penalty or constantly check your compiler's
> generated assembler to check the optimizer is not breaking
> things?

The reason there is no performance penalty is because volatile
doesn't do the job. And you don't have to check the generated
assembler for anything (unless you suspect a compiler error);
you check the guarantees given by the compiler. OK: I'll admit
that finding such specifications is very, very difficult. But
they should exist, and they'll give you guarantees with regard
to future releases as well. And there are some guarantees that
are expressed indirectly: if a compiler claims Posix
conformance, and supports multithreading, then you get the
guarantees from the Posix standard; the issue is a bit less
clear under Windows, but if a compiler claims to support
multithreading, then it should conform to the Windows
conventions about this.

> The only volatile in my entire codebase is for the "status" of
> my "threadable" base class and I don't always acquire a lock
> before checking this status and I don't fully trust that the
> optimizer won't cache it for all cases that might crop up as I
> develop code.

I'd have to see the exact code to be sure, but I'd guess that
without an mfence somewhere in there, the code won't work on a
multicore machine (which is just about everything today), and
with the mfence, the volatile isn't necessary.

Also, at least under Solaris, if there is no contention, the
execution time of pthread_mutex_lock is practically the same as
that of membar. Although I've never actually measured it, I
suspect that the same is true if you use CriticalSection (and
not Mutex) under Windows.

> BTW I try and avoid singletons too so I haven't found the need
> to use the double checked locking pattern AFAICR.

Double checked locking is a pattern which can be applied to many
things, not just to singletons.

--
James Kanze

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
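As a point of reference, a rough sketch of what "double checked locking
made to work with fences" might look like, assuming gcc on x86 and POSIX
threads. The barrier macros are hypothetical placeholders for whatever the
target actually needs; on x86 a pure compiler barrier happens to suffice in
these two positions, while weaker architectures would need real fence
instructions there. Note that no volatile is involved:

    #include <pthread.h>

    // Hypothetical barrier macros -- placeholders, not a real library.
    // The empty asm with a "memory" clobber stops the compiler from moving
    // memory accesses across it; on x86 the hardware already orders
    // store-store and load-load, so nothing more is needed here.
    #define ReleaseBarrier() __asm__ __volatile__("" ::: "memory")
    #define AcquireBarrier() __asm__ __volatile__("" ::: "memory")

    class Singleton
    {
    public:
        static Singleton* instance()
        {
            Singleton* tmp = pInstance;          // unlocked check
            if (tmp == 0) {
                pthread_mutex_lock(&mutex);
                tmp = pInstance;                 // locked re-check
                if (tmp == 0) {
                    tmp = new Singleton;
                    ReleaseBarrier();            // construction visible before publication
                    pInstance = tmp;
                }
                pthread_mutex_unlock(&mutex);
            } else {
                AcquireBarrier();                // reads of *tmp must not precede the pointer load
            }
            return tmp;
        }

    private:
        Singleton() {}
        static Singleton*      pInstance;
        static pthread_mutex_t mutex;
    };

    Singleton*      Singleton::pInstance = 0;
    pthread_mutex_t Singleton::mutex     = PTHREAD_MUTEX_INITIALIZER;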
From: James Kanze on 30 Mar 2010 07:39

On Mar 30, 12:05 pm, George Neuner <gneun...(a)comcast.net> wrote:
> On Mon, 29 Mar 2010 16:53:44 CST, James Kanze <james.ka...(a)gmail.com>
> wrote:
>> On Mar 28, 10:05 pm, George Neuner <gneun...(a)comcast.net> wrote:

    [...]

>>> The purpose of the fence is to sequence memory accesses.

>> For a much more rigorous definition of "access" than that used
>> by the C++ standard.

> Not exactly. I agree that the processor's memory model
> guarantees stronger ordering than that of the C++ standard (or
> almost any language standard, for that matter), but you are
> attributing semantics to "fence" that aren't there.

I'm not attributing anything. I'm just quoting the
documentation.

>>> All the fence does is create a checkpoint in the instruction
>>> sequence at which relevant load or store instructions
>>> dispatched prior to dispatch of the fence instruction will
>>> have completed execution.

>> That's not true for the two architectures whose documentation
>> I've studied, Intel and Sparc.

> Then you'd better go back and study 8)

I've quoted Intel. Regretfully, the Sparc site was down when I
tried to access it, but I've studied their documentation fairly
intensely, and it basically guarantees the same thing.

>> To quote the Intel documentation of MFENCE:

>>     Performs a serializing operation on all load and store
>>     instructions that were issued prior the MFENCE
>>     instruction. This serializing operation guarantees that
>>     every load and store instruction that precedes in
>>     program order the MFENCE instruction is globally visible
>>     before any load or store instruction that follows the
>>     MFENCE instruction is globally visible.

> Now look at LFENCE, SFENCE and CLFLUSH and think about why they
> are provided separately. Also look at PREFETCH and see what
> it says about fences.

There are many types of fences. Obviously, in any given case,
you should use the one which provides the guarantees you need.

> Intel provides MFENCE as a heavyweight combination of LFENCE,
> SFENCE and CLFLUSH. MFENCE does propagate to memory *because*
> it flushes the cache. However the primitive, SFENCE, ensures
> propagation of writes only to L2 cache.

So what use is it, then?

> Sparc has no single instruction that both fences and flushes
> the cache. MEMBAR ensures propagation only to L2 cache. A
> separate FLUSH instruction is necessary to ensure propagation
> to memory.

That's not what it says in the Sparc Architecture Specification.
(Levels of cache are never mentioned; the architecture allows an
implementation with any number of levels.)

> Sparc also does not have separate load and store fences, but
> it offers two variants of MEMBAR which provide differing
> consistency guarantees.

There is only one Membar instruction, with a 4-bit mask to
control the barriers: LOADLOAD, LOADSTORE, STORELOAD and
STORESTORE. (There are some other bits to control other
functionality, but they are irrelevant with regards to memory
synchronization in a multithreaded environment.)

>> Note the "globally visible". Both Intel and Sparc guarantee
>> strong ordering within a single core (i.e. a single thread);
>> mfence or membar (Sparc) are only necessary if the memory will
>> also be "accessed" from a separate unit: a thread running on a
>> different core, or memory mapped IO.

> Again, you're attributing semantics that aren't there.

I just quoted the documentation. What part of "globally visible"
don't you understand?

> For a store to be "globally visible" means that the value must
> be visible from outside the core. This requires the value be
> in *some* externally visible memory - not *main* memory in
> particular. For both x86 and Sparc, this means L2 cache - the
> first level that can be snooped off-chip.

That's an original definition of "global".

> For a load "globally visible" means that the value is present
> at all levels of the memory hierarchy and cannot be seen
> differently by an external observer. This simply follows from
> the normal operation of the read pipeline - the value is
> written into all levels of cache (more or less) at the same
> time it is loaded into the core register.

> Note also that some CPUs can prefetch data in ways that bypass
> externally visible levels of cache. Sparc and x86 (at least
> since Pentium III) do not permit this.

Sparc certainly does allow it (at least according to the Sparc
Architecture Specification), and I believe some of the newer
Intel do as well.

>>> However, in a memory hierarchy with caching, a store
>>> instruction does not guarantee a write to memory but only that
>>> one or more write cycles is executed on the core's memory
>>> connection bus.

>> On Intel and Sparc architectures, a store instruction doesn't
>> even guarantee that. All it guarantees is that the necessary
>> information is somehow passed to the write pipeline. What
>> happens after that is anybody's guess.

> No. On both of those architectures a store instruction will
> eventually cause the value to be written out of the core
> (except maybe if a hardware exception occurs).

Not on a Sparc. At least not according to the Sparc Architecture
Specification. Practically speaking, I doubt that this is
guaranteed for any modern architecture, given the performance
implications.

> Additionally the source register may be renamed or the stored
> value may be forwarded within the core to rendezvous with a
> subsequent read of the same location already in the pipeline
> ... but these internal flow optimizations don't affect the
> externally visible operation of the store instruction.

As long as there is only a single store instruction to a given
location, that store will eventually percolate out to the main
memory. If there are several, it's quite possible that some of
them will never appear outside the processor.

>>> For another thread (or core or CPU) to perceive a change, a
>>> value must be propagated into shared memory. For all
>>> multi-core processors I am aware of, the first shared level of
>>> memory is cache - not main memory. Cores on the same die
>>> snoop each other's primary caches and share higher level
>>> caches. Cores on separate dies in the same package share
>>> cache at the secondary or tertiary level.

>> And on more advanced architectures, there are cores which
>> don't share any cache. All of which is irrelevant, since
>> simply issuing a store instruction doesn't even guarantee a
>> write to the highest level cache, and a membar or a fence
>> instruction guarantees access all the way down to the main,
>> shared memory.

> Sorry, but no. Even the architectures we've discussed here, x86 and
> Sparc, do not satisfy your statement.

I quoted the specification from Intel for the x86. The Sparc
site was down, and my copy of the Sparc Architecture
Specification is on a machine in France, so I'm sorry, I can't
quote it here. But I do know what it says. And a membar
instruction does guarantee strong ordering.

> There might be architectures I'm unaware of which can elide an
> off-core write entirely by rendezvous forwarding and register
> renaming, but you haven't named one. I would consider eliding
> the store to be a dangerous interpretation of memory semantics
> and I suspect I would not be alone.

Dangerous or not, no processor can afford to neglect this
important optimization opportunity. And it causes no problems in
single threaded programs, nor in multithreaded programs which
use proper synchronization methods.

> I'm not familiar with any cached architecture for which
> fencing alone guarantees that a store writes all the way to
> main memory - I know some that don't even have/need fencing
> because their on-chip caches are write-through.

I just pointed one out. By quoting the manufacturer's
specifications for the mfence instruction. If I were on my Unix
machine in Paris, I could equally quote similar text for the
Sparc.

>>> The upshot is this:
>>> - "volatile" is required for any CPU.

>> I'm afraid that doesn't follow from anything you've said.
>> Particularly because the volatile is largely a no-op on most
>> current compilers---it inhibits compiler optimizations, but the
>> generated code does nothing to prevent the reordering that
>> occurs at the hardware level.

> "volatile" is required because the compiler must not reorder
> or optimize away the loads or stores.

Which loads and stores? The presence of a fence (or inline
assembler, or specific system or library calls) guarantees that
the compiler cannot reorder around it. And whether the compiler
reorders or suppresses elsewhere is irrelevant, since the
hardware can do it regardless of the code the compiler
generates.

>>> - fences are required for an OoO CPU.

>> By OoO, I presume you mean "out of order". That's not the only
>> source of the problems.

> OoO is not the *only* source of the problem. The compiler has
> little control over hardware reordering ... fences are blunt
> instruments that impact all loads or stores ... not just those
> to language level "volatiles".

Agreed. Volatile has different semantics (at least that was the
intent). See Herb Sutter's comments elsewhere in this thread.

>>> - cache control is required for a write-back cache between
>>> CPU and main memory.

>> The cache is largely irrelevant on Sparc or Intel. The
>> processor architectures are designed in a way to make it
>> irrelevant. All of the problems would be there even in the
>> absence of caching. They're determined by the implementation of
>> the write and read pipelines.

> That's a naive point of view. For a cached processor, the
> operation of the cache and its impact on real programs is
> *never* "irrelevant".

I was speaking uniquely in the context of threading. The
operation of the cache is very relevant with regards to
performance, for example.

--
James Kanze

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
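To put the four mask bits in concrete terms, here is a hypothetical
illustration of how they are commonly wrapped, assuming Sparc V9 assembler
syntax and gcc-style inline assembly; the macro names are invented, and the
x86 branch reflects the fact that only the store-followed-by-load case
needs an actual fence instruction there:

    // Invented names, for illustration only.  The mask bits can also be
    // combined in one instruction, e.g. "membar #LoadLoad | #StoreStore".
    #if defined(__sparc__)
    #  define BARRIER_LOADLOAD()   __asm__ __volatile__("membar #LoadLoad"   ::: "memory")
    #  define BARRIER_LOADSTORE()  __asm__ __volatile__("membar #LoadStore"  ::: "memory")
    #  define BARRIER_STORELOAD()  __asm__ __volatile__("membar #StoreLoad"  ::: "memory")
    #  define BARRIER_STORESTORE() __asm__ __volatile__("membar #StoreStore" ::: "memory")
    #elif defined(__i386__) || defined(__x86_64__)
       // x86 already orders everything except store-followed-by-load, so
       // three of the four reduce to a pure compiler barrier.
    #  define BARRIER_LOADLOAD()   __asm__ __volatile__("" ::: "memory")
    #  define BARRIER_LOADSTORE()  __asm__ __volatile__("" ::: "memory")
    #  define BARRIER_STORELOAD()  __asm__ __volatile__("mfence" ::: "memory")
    #  define BARRIER_STORESTORE() __asm__ __volatile__("" ::: "memory")
    #endif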
From: George Neuner on 30 Mar 2010 07:36

On Tue, 30 Mar 2010 05:04:48 CST, Andy Venikov
<swojchelowek(a)gmail.com> wrote:

> James Kanze wrote:
>
>> To quote the Intel documentation of MFENCE:
>>
>>     Performs a serializing operation on all load and store
>>     instructions that were issued prior the MFENCE
>>     instruction. This serializing operation guarantees that
>>     every load and store instruction that precedes in
>>     program order the MFENCE instruction is globally visible
>>     before any load or store instruction that follows the
>>     MFENCE instruction is globally visible.
>>
>> Note the "globally visible".
>
> If you read the whole sentence, you have:
> <1> is globally visible before <2> is globally visible. That doesn't
> sound to me as saying that <1> is globally visible.
>
> I don't think that the above says that instructions that precede MFENCE
> are guaranteed to be visible after the MFENCE instruction completes. It
> does guarantee, however, that the earlier instructions are visible
> before ANY of the following instructions are visible.

It reads a bit funny, but it is correct.

MFENCE divides the total set of loads and stores in process into
the group dispatched before the fence instruction and the group
dispatched after. At completion of the fence instruction, it is
guaranteed that all of the before group will have been completed
AND none of the after group will be completed.

See LFENCE and SFENCE for more details. MFENCE combines their
effects and adds a cache flush as well.

George

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
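For reference, the usual intrinsic spellings of the three fences being
discussed, assuming x86 with SSE/SSE2 and the standard intrinsic headers;
the wrapper names are invented for illustration:

    #include <xmmintrin.h>   // _mm_sfence (SSE)
    #include <emmintrin.h>   // _mm_lfence, _mm_mfence (SSE2)

    inline void load_fence()  { _mm_lfence(); }  // orders the loads issued before it
    inline void store_fence() { _mm_sfence(); }  // orders the stores issued before it
    inline void full_fence()  { _mm_mfence(); }  // orders both loads and stores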
From: James Kanze on 30 Mar 2010 07:42
On Mar 30, 12:03 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
> Herb Sutter wrote:

    [...]

> So, with the above said, here's a concrete example of how I'd
> use volatile without an access to a ready-made library. Let's
> take Maged Michael's lock-free queue ("Simple, Fast and
> Practical Non-blocking and blocking queue algorithms", Maged
> Michael & Michael Scott; 1996). It uses a technique similar to
> DCL to verify the validity of a read. Look into its dequeue()
> method.

> I'll provide the pseudo code here:

> dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean
> D1:  loop                         # Keep trying until Dequeue is done
> D2:    head = Q->Head             # Read Head
> D3:    tail = Q->Tail             # Read Tail
> D4:    next = head->next          # Read Head.ptr->next
> D5:    if head == Q->Head         # Are head, tail, and next consistent?
> D6:      if head.ptr == tail.ptr  # Is queue empty or Tail falling behind?
> D7:        if next.ptr == NULL    # Is queue empty?
> D8:          return FALSE         # Queue is empty, couldn't dequeue
> D9:        endif
>            # Tail is falling behind. Try to advance it
> D10:       CAS(&Q->Tail, tail, <next.ptr, tail.count+1>)
> D11:     else                     # No need to deal with Tail
>            # Read value before CAS, otherwise another dequeue might free the next node
> D12:       *pvalue = next.ptr->value
>            # Try to swing Head to the next node
> D13:       if CAS(&Q->Head, head, <next.ptr, head.count+1>)
> D14:         break                # Dequeue is done. Exit loop
> D15:       endif
> D16:     endif
> D17:   endif
> D18: endloop
> D19: free(head.ptr)               # It is safe now to free the old dummy node
> D20: return TRUE                  # Queue was not empty, dequeue succeeded

> Look at line D5: it needs to check if Q->Head is still the
> same as what we read from it before. Otherwise two
> possibilities for breaking the correctness arise: 1) it would
> be possible for the element pointed to by Q->Head to be
> re-inserted back into the queue with NULL in the "next" even
> though the queue was never empty at any given moment; or 2)
> the first element was removed after we've read Q->Head and
> before we've read next, thus there could be garbage in
> head->next by the time we read it and we'd try to access
> garbage on line D12.

> This piece of pseudo code could be naively translated to the
> following C++ code:

> while (true)
> {
>     Node * localHead = head_;
>     Node * localTail = tail_;
>     Node * localNext = localHead->next;
>     if (localHead == head_)
>     {
>         ...
>     }

> But it wouldn't work for the obvious reasons. One needs to
> insert MemoryFences in the right places. Memory fences are
> something that is highly platform-specific, so one would
> define macros for them that would expand to different
> instructions on different platforms. Here's the code with
> memory fences inserted:

> while (true)
> {
>     Node * localHead = head_;
>     Node * localTail = tail_;
>     DataDependencyBarrier(); // All the systems that I know of will do
>                              // this sort of barrier automatically, so
>                              // this macro will expand to nothing
>     Node * localNext = localHead->next;
>     LoadLoadBarrier();       // on x86 this will expand to nothing
>     if (localHead == head_)
>     {
>         ...
>     }
> }

> This is much better, but it still has problems: first, on x86,
> the LoadLoadBarrier() will expand to nothing and there will be
> no indication to the compiler not to re-order different loads;
> and second (and I think it's the crux of my argument) an
> optimizing compiler will dispose of the "if" statement even in
> the face of memory barriers.

First, I very much doubt that the LoadLoad barrier can expand to
nothing if the code is to work. It certainly cannot on a Sparc,
and I rather doubt that it can on an Intel; I'd have to see the
guarantees in writing to believe otherwise. And second, if a
compiler moves code across a barrier, it is broken, and there's
not much you can do about it.

> No matter how many or what type of memory barriers you insert,
> the compiler will be allowed to omit the if statement.

Really? According to who? Obviously, the ISO standard says
nothing about this case; the presence of the barrier introduces
undefined behavior, and the compiler might do anything, as far
as ISO C++ is concerned. But you're obviously depending on other
standards as well. Standards which specify what the barrier
guarantees, for example. And these standards apply to the
compiler as well.

> The ONLY way to force the compiler (any compiler for that
> matter) to generate it is to declare head_ as volatile.

No. The way to force the compiler to generate the necessary
accesses is to use the same implementation specific guarantees
you count on for the barrier to work.

> Here's the final code:

> struct Node
> {
>     <unspecified> data;
>     Node volatile * pNext;
> };

> Node volatile * volatile head_;
> Node volatile * volatile tail_;

> dequeue()
> {
>     while (true)
>     {
>         Node volatile * localHead = head_;
>         Node volatile * localTail = tail_;
>         DataDependencyBarrier();
>         Node volatile * localNext = localHead->next;
>         if (localHead == head_)
>         {
>             ...
>         }
>         ....
>     }
> }

> Now this code will produce the intended correct object code on
> all the compilers I've listed above and on at least these
> CPUs: x86, itanium, mips, PowerPC (assuming that all the
> MemoryBarriers have been defined for all the platforms). And
> without any modifications to the above code. How's that for
> portability?

Yes, but it's just as portable without the volatile. If the
barriers are defined correctly, the compiler will not move code
over them.

--
James Kanze

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
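For completeness, one way the barrier macros in Andy's code might be
defined so that they constrain the compiler as well as the hardware,
assuming gcc on x86; these definitions are illustrative guesses, not the
ones Andy actually used:

    // Hypothetical definitions for x86/gcc.  Neither needs to expand to a
    // fence instruction (x86 does not reorder loads with other loads, and
    // dependent loads are ordered on mainstream CPUs other than Alpha), but
    // the "memory" clobber makes each one a compiler barrier: the compiler
    // may not hoist, sink, or cache memory accesses across it, so the second
    // read of head_ in the "if" cannot be optimized away -- with or without
    // the volatile.
    #define DataDependencyBarrier() __asm__ __volatile__("" ::: "memory")
    #define LoadLoadBarrier()       __asm__ __volatile__("" ::: "memory")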