From: Andy Venikov on 25 Mar 2010 20:05

James Kanze wrote:
> On Mar 25, 7:10 pm, George Neuner <gneun...(a)comcast.net> wrote:
>> On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov
>
>    [...]
>> As you noted, 'volatile' does not guarantee that an OoO CPU will
>> execute the stores in program order ...
>
> Arguably, the original intent was that it should. But it
> doesn't, and of course, the ordering guarantee only applies to
> variables actually declared volatile.
>
>> for that you need to add a write fence between them. However,
>> neither 'volatile' nor write fence guarantees that any written
>> value will be flushed all the way to memory - depending on
>> other factors - cache snooping by another CPU/core, cache
>> write back policies and/or delays, the span to the next use of
>> the variable, etc. - the value may only reach to some level of
>> cache before the variable is referenced again. The value may
>> never reach memory at all.
>
> If that's the case, then the fence instruction is seriously
> broken. The whole purpose of a fence instruction is to
> guarantee that another CPU (with another thread) can see the
> changes. (Of course, the other thread also needs a fence.)

Hmm, the way I understand fences is that they introduce ordering
and don't necessarily guarantee visibility. For example:

1. Store to location 1
2. StoreStore fence
3. Store to location 2

will guarantee only that if the store to location 2 is visible to
some thread, then the store to location 1 is guaranteed to be
visible to the same thread as well. But it doesn't necessarily
guarantee that the stores will ever be visible to some other
thread. Yes, on certain CPUs fences are implemented as "flushes",
but they don't need to be.

Thanks,
    Andy.
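Andy's three-step example can be written out concretely. The sketch
below uses the draft C++0x <atomic> interface, which postdates most
of this thread and is used here purely as an illustration of the
conditional guarantee he describes; it is not code from the
discussion.

    #include <atomic>
    #include <cassert>

    std::atomic<int> loc1(0), loc2(0);

    void writer()
    {
        loc1.store(1, std::memory_order_relaxed);            // 1. store to location 1
        std::atomic_thread_fence(std::memory_order_release); // 2. StoreStore fence
        loc2.store(2, std::memory_order_relaxed);            // 3. store to location 2
    }

    void reader()
    {
        // If the store to location 2 is visible here...
        if (loc2.load(std::memory_order_acquire) == 2) {
            // ...then the store to location 1 is guaranteed to
            // be visible as well.
            assert(loc1.load(std::memory_order_relaxed) == 1);
        }
        // Nothing, however, guarantees that the load of loc2 ever
        // observes the new value in the first place: the fence
        // orders the stores; it does not force them to be seen.
    }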
From: Joshua Maurice on 27 Mar 2010 06:13

On Mar 26, 4:05 am, Andy Venikov <swojchelo...(a)gmail.com> wrote:
> James Kanze wrote:
> > If that's the case, then the fence instruction is seriously
> > broken. The whole purpose of a fence instruction is to
> > guarantee that another CPU (with another thread) can see the
> > changes. (Of course, the other thread also needs a fence.)
>
> Hmm, the way I understand fences is that they introduce ordering
> and don't necessarily guarantee visibility. For example:
>
> 1. Store to location 1
> 2. StoreStore fence
> 3. Store to location 2
>
> will guarantee only that if the store to location 2 is visible to
> some thread, then the store to location 1 is guaranteed to be
> visible to the same thread as well. But it doesn't necessarily
> guarantee that the stores will ever be visible to some other
> thread. Yes, on certain CPUs fences are implemented as "flushes",
> but they don't need to be.

Well, yes. Volatile does not change that, though. Most of my
understanding comes from
    http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt
and The JSR-133 Cookbook for Compiler Writers
    http://g.oswego.edu/dl/jmm/cookbook.html
(Note that the discussion of volatile in the above link is for Java
volatile 1.5+, not C and C++ volatile.)

I'm not the most versed on this, so please correct me if I'm wrong.
As an example:

    main thread:
        a = 0
        b = 0
        start thread 2
        a = 1
        write barrier
        b = 2

    thread 2:
        print b
        read barrier
        print a

Without the read and write memory barriers, this can print any of
the 4 possible combinations:
    0 0,  2 0,  0 1,  2 1
With the barriers, one combination (2 0) is removed, leaving:
    0 0,  0 1,  2 1

As I understand "read" and "write" barriers (which are a subset of
"store/store, store/load, load/store, load/load"), the semantics
are: "If a read before the read barrier sees a write after the
write barrier, then all reads after the read barrier will see all
writes before the write barrier."

Yes, the semantics are conditional. They do not guarantee that a
write will ever become visible. However, volatile will not change
that. If thread 2 prints b == 2, then thread 2 will print a == 1,
volatile or no volatile. If thread 2 prints b == 0, then thread 2
can print a == 0 or a == 1, volatile or no volatile.

For some lock-free algorithms, these guarantees are very useful,
such as making double-checked locking correct.
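To make the a/b example above concrete, here is a sketch in draft
C++0x terms - an assumption on my part rather than anything in the
original posts, since the thread predates the standard:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> a(0), b(0);

    void thread2()
    {
        int rb = b.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire); // read barrier
        int ra = a.load(std::memory_order_relaxed);
        // With the barriers, "2 0" cannot be printed;
        // "0 0", "0 1", and "2 1" all remain possible.
        std::printf("%d %d\n", rb, ra);
    }

    int main()
    {
        std::thread t(thread2);                              // start thread 2
        a.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release); // write barrier
        b.store(2, std::memory_order_relaxed);
        t.join();
    }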
Ex (the double-checked locking pattern):

    singleton_t* get_singleton()
    {
        // all static storage is zero initialized before runtime
        static singleton_t * p;
        if (0 != p) // check 1
        {
            READ_BARRIER();
            return p;
        }
        Lock lock;
        if (0 != p) // check 2
            return p;
        singleton_t * tmp = new singleton_t;
        WRITE_BARRIER();
        p = tmp;
        return p;
    }

If a thread reads p != 0 at check 1, which is before the read
barrier, then it sees the write after the write barrier, "p = tmp",
and it is thus guaranteed that all subsequent reads after the read
barrier (in the caller code) will see all writes before the write
barrier (from the singleton_t constructor). This conditional
visibility is exactly what we need in this case - what DCLP really
wants. If the read at check 1 gives us 0, then we do have to use a
mutex to force visibility, but most of the time it will read p as
nonzero at check 1, and the barriers will guarantee correct
semantics.

Also, from what I remember, the read barrier is quite cheap on most
systems, possibly free on the x86 (?). (See the JSR-133 Cookbook
linked above.) I don't quite grasp the nuances enough yet to say
anything more concrete than this at this time.

Again, I'm coding this up from memory, so please correct any
mistakes.
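For comparison, the same pattern with the barrier macros replaced
by the draft C++0x atomics: an acquire load stands in for check 1
plus READ_BARRIER, and a release store for WRITE_BARRIER plus the
publication of p. As above, this is a sketch under the assumption
that the draft semantics apply, not code from the thread:

    #include <atomic>
    #include <mutex>

    struct singleton_t { /* ... */ };

    singleton_t* get_singleton()
    {
        static std::atomic<singleton_t*> p(0); // zero-initialized static
        static std::mutex m;

        singleton_t* tmp = p.load(std::memory_order_acquire); // check 1 + read barrier
        if (tmp == 0) {
            std::lock_guard<std::mutex> lock(m);
            tmp = p.load(std::memory_order_relaxed); // check 2, under the lock
            if (tmp == 0) {
                tmp = new singleton_t;
                p.store(tmp, std::memory_order_release); // write barrier + publish
            }
        }
        return tmp;
    }

The atomic load also removes the data race that a plain read of a
raw pointer at check 1 would constitute under the draft memory
model.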
From: George Neuner on 28 Mar 2010 06:05

On Thu, 25 Mar 2010 17:31:25 CST, James Kanze
<james.kanze(a)gmail.com> wrote:

>On Mar 25, 7:10 pm, George Neuner <gneun...(a)comcast.net> wrote:
>> On Thu, 25 Mar 2010 00:20:43 CST, Andy Venikov
>
>    [...]
>> As you noted, 'volatile' does not guarantee that an OoO CPU will
>> execute the stores in program order ...
>
>Arguably, the original intent was that it should. But it
>doesn't, and of course, the ordering guarantee only applies to
>variables actually declared volatile.

"volatile" is quite old ... I'm pretty sure the "intent" was
defined before there were OoO CPUs (in de facto use if not in a
standard document). Regardless, "volatile" only constrains the
behavior of the *compiler*.

>> for that you need to add a write fence between them. However,
>> neither 'volatile' nor write fence guarantees that any written
>> value will be flushed all the way to memory - depending on
>> other factors - cache snooping by another CPU/core, cache
>> write back policies and/or delays, the span to the next use of
>> the variable, etc. - the value may only reach to some level of
>> cache before the variable is referenced again. The value may
>> never reach memory at all.
>
>If that's the case, then the fence instruction is seriously
>broken. The whole purpose of a fence instruction is to
>guarantee that another CPU (with another thread) can see the
>changes.

The purpose of the fence is to sequence memory accesses. All the
fence does is create a checkpoint in the instruction sequence at
which relevant load or store instructions dispatched prior to
dispatch of the fence instruction will have completed execution.
There may be separate load and store fence instructions, and/or
they may be combined in a so-called "full fence" instruction.

However, in a memory hierarchy with caching, a store instruction
does not guarantee a write to memory but only that one or more
write cycles is executed on the core's memory connection bus.
Where that write goes is up to the cache/memory controller and the
policies of the particular cache levels involved. For example,
many CPUs have write-thru primary caches while higher levels are
write-back with delay (an arrangement that allows snooping of
either the primary or secondary cache with identical results).

For another thread (or core or CPU) to perceive a change, a value
must be propagated into shared memory. For all multi-core
processors I am aware of, the first shared level of memory is
cache - not main memory. Cores on the same die snoop each other's
primary caches and share higher level caches. Cores on separate
dies in the same package share cache at the secondary or tertiary
level. The same holds true for all separate-CPU shared memory
multiprocessors I am aware of ... they are connected so that they
can snoop each other's caches at some level, or an additional
level of shared cache is placed between the CPUs and memory, or
both.

>>(Of course, the other thread also needs a fence.)

Not necessarily.

>> OoO execution and cache behavior are the reasons 'volatile'
>> doesn't work as intended for many systems even in
>> single-threaded use with memory-mapped peripherals.
>
>The reason volatile doesn't work with memory-mapped peripherals
>is because the compilers don't issue the necessary fence or
>membar instruction, even if a variable is volatile.

It still wouldn't matter if they did.
Let's take a simple case of one thread and two memory-mapped
registers:

    volatile unsigned *regA = 0x...;
    volatile unsigned *regB = 0x...;
    unsigned oldval, retval;

    *regA = SOME_OP;
    *regA = SOME_OP;

    oldval = *regB;
    do {
        retval = *regB;
    } while ( retval == oldval );

Let's suppose that writing a value twice to regA initiates some
operation that returns a value in regB.

Will the above code work? No. The processor will execute both
writes, but the cache will combine them so the device will see only
a single write. The cache needs to be flushed between writes to
regA.

Ok, let's assume there is a flush API and add some flushes:

    *regA = SOME_OP;
    FLUSH *regA;
    *regA = SOME_OP;
    FLUSH *regA;

    oldval = *regB;
    do {
        retval = *regB;
    } while ( retval == oldval );

Does this now work? Maybe. It will work if the flush operation
includes a fence; otherwise you can't know whether the write has
occurred before the cache line is flushed.

Ok, let's assume there is a fence API and add fences:

    *regA = SOME_OP;
    SFENCE;
    FLUSH *regA;
    *regA = SOME_OP;
    SFENCE;
    FLUSH *regA;

    oldval = *regB;
    do {
        retval = *regB;
    } while ( retval == oldval );

Does this now work? Yes. Now I am guaranteed that the first value
will be written all the way to memory (and to my device) before the
second value is written.

Now the question is whether a cache flush includes a fence
operation (or vice versa)? The answer is "it depends". On many
architectures, the ISA has no cache control instructions - the
cache controller is mapped to reserved memory addresses or I/O
ports. Some cache controllers permit only programming the
replacement policy and do not allow programs to manipulate the
entries. Some controllers flush everything rather than allowing
individual lines to be flushed. It depends.

If there is a language-level API for cache control or for fencing,
it may or may not include the other operation, depending on the
whim of the developer.

The upshot is this:
- "volatile" is required for any CPU.
- fences are required for an OoO CPU.
- cache control is required for a write-back cache between CPU and
  main memory.

>James Kanze

George
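As one concrete (and heavily platform-specific) reading of George's
final fragment: on an x86 with SSE2, the SFENCE and FLUSH
placeholders could map to the _mm_sfence and _mm_clflush
intrinsics. This is a sketch only - the register addresses and
SOME_OP remain placeholders, as in the original, the function name
is invented, and on real hardware device registers would normally
be mapped uncached, making the flushes unnecessary:

    #include <emmintrin.h> // _mm_clflush / _mm_sfence (SSE/SSE2)

    // Mapped elsewhere; the actual addresses are elided, as in
    // the original post.
    extern volatile unsigned *regA;
    extern volatile unsigned *regB;

    #define SOME_OP 1u // placeholder value, as in the original

    void run_operation(void)
    {
        unsigned oldval, retval;

        *regA = SOME_OP;
        _mm_sfence();                    // SFENCE: order the store...
        _mm_clflush((const void *)regA); // FLUSH: ...then push the line out

        *regA = SOME_OP;
        _mm_sfence();
        _mm_clflush((const void *)regA);

        oldval = *regB;
        do {
            retval = *regB;
        } while (retval == oldval);
    }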
From: James Kanze on 28 Mar 2010 06:25

On Mar 26, 12:33 am, Herb Sutter <herb.sut...(a)gmail.com> wrote:
> Please remember this: Standard ISO C/C++ volatile is useless
> for multithreaded programming. No argument otherwise holds
> water; at best the code may appear to work on some
> compilers/platforms, including all attempted counterexamples
> I've seen on this thread.

I agree with you in principle, but do be careful as to how you
formulate this. Standard ISO C/C++ is useless for multithreaded
programming, at least today. With or without volatile. And in
Standard ISO C/C++, volatile is useless for just about anything;
it was always intended to be mainly a hook for
implementation-defined behavior, i.e. to allow things like
memory-mapped IO while not imposing an excessive loss of
optimization possibilities everywhere.

In theory, an implementation could define volatile in a way that
would make it useful in multithreading---I think Microsoft once
proposed doing so in the standard. In my opinion, this sort of
violates the original intention behind volatile, which was that
volatile is applied to a single object, and doesn't affect other
objects in the code. But it's certainly something you could argue
both ways.

    [...]
> No. The reason that you can't use volatiles for synchronization
> is that they aren't synchronized (QED).

:-). And the reason they're not synchronized is that
synchronization involves more than one variable, and it was never
the intent of volatile to involve more than one variable.

(On a lot of modern processors, however, it would be impossible to
fully implement the original intent of volatile without
synchronization. The only instruction available on a Sparc, for
example, to ensure that a store instruction actually results in a
write to an external device is a membar. And that synchronizes
*all* accesses of the given type.)

    [...]
> (and it was a mistake to try to add those
> guarantees to volatile in VC++).

Just curious: is that Microsoft talking, or Herb Sutter (or both)?

--
James Kanze
From: James Kanze on 28 Mar 2010 06:23
On Mar 26, 12:05 pm, Andy Venikov <swojchelo...(a)gmail.com> wrote:
> James Kanze wrote:
>> If that's the case, then the fence instruction is seriously
>> broken. The whole purpose of a fence instruction is to
>> guarantee that another CPU (with another thread) can see the
>> changes. (Of course, the other thread also needs a fence.)

> Hmm, the way I understand fences is that they introduce
> ordering and don't necessarily guarantee visibility. For
> example:

> 1. Store to location 1
> 2. StoreStore fence
> 3. Store to location 2

> will guarantee only that if the store to location 2 is visible
> to some thread, then the store to location 1 is guaranteed to
> be visible to the same thread as well.

A StoreStore fence guarantees that all stores issued before the
fence are visible in main memory, and that none issued after the
fence are visible (at the time the StoreStore fence is executed).
Of course, for another thread to be guaranteed to see the results
of any store, it has to use a load fence, to ensure that the
values it sees are those after the load fence, and not some value
that it happened to pick up earlier.

> But it doesn't necessarily guarantee that the stores will ever
> be visible to some other thread. Yes, on certain CPUs fences
> are implemented as "flushes", but they don't need to be.

If you redefine fence to mean something different than it
normally means, then who knows. The normal definition requires
all writes to have propagated to main memory (supposing it is a
store fence) before the instruction proceeds. This is why they
can be so slow. (And all of the processors I know guarantee
coherence within a single core; you never need a fence if you're
single-threaded.)

--
James Kanze