From: Chris Thomasson on 26 Oct 2006 00:07

"Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
news:453eeacf$0$1353$834e42db(a)reader.greatnowhere.com...
> Chris Thomasson wrote:
>> "Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
>> news:453ec796$0$1355$834e42db(a)reader.greatnowhere.com...
>> > Chris Thomasson wrote:
>> >> "Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
>> >> news:453e3ee4$0$1351$834e42db(a)reader.greatnowhere.com...
>> >> > Del Cecchi wrote:
>> >> > 2) It states that the x86 allows "Loads Reordered After Stores".
>> >> > He does not define this non-standard terminology, but if it means
>> >> > what it sounds like then he is claiming that the x86 allows
>> >> > a later store to bypass outstanding earlier loads.
>> >> > That is wrong.
>> >>
>> >> On current x86, loads and stores are as follows:
>> >>
>> >> void* load(void **p) {
>> >>     void *v = *p;
>> >>     membar #LoadStore | #LoadLoad;
>> >>     return v;
>> >> }
[...]
> According to your load definition, there is a membar between the
> two loads which *appears* to prevent the load &q from bypassing
> the load &p. If that is in fact what your terminology means
> then it directly contradicts the manual which states that
> "reads can be performed in any order". (7.2.2, item #1).

Okay... Well, I was under the impression that atomic loads on current
x86 have load-acquire membar semantics... I know that you need an
explicit membar to handle StoreLoad dependencies... However, I thought
that LoadStore dependencies were honored by atomic loads on current
x86...
From: Eric P. on 26 Oct 2006 09:13

Alexander Terekhov wrote:
>
> "Eric P." wrote:
> [...]
> > If by "remote write atomicity" you mean atomic global visibility
> > (all processors agree that each memory location has a single same
> > value), we discussed that here and it was determined (based on
> > 'knowledgeable sources') that x86 does have atomic global visibility.
>
> Really? IIRC, Glew went on record*** claiming that it is not true.
>
> See also
>
> http://www.decadentplace.org.uk/pipermail/cpp-threads/2006-September/001141.html
>
> ***) "WB memory is processor consistent, type II."
>
> With "type II" he meant "Extension to Dubois' Abstraction", I gather.
>
> regards,
> alexander.

(I don't know what "type II" and "Extension to Dubois' Abstraction"
mean. I can't find a reference to either in Gharachorloo.)

Hmmm.... I thought it was resolved.

Joe Seigh said on the x86 memory model:
http://groups.google.ca/group/comp.arch/msg/6af78be87ca29f31?hl=en&

"It turns out the x86 memory model is defined, it's just not defined
in the IA-32 manuals, which is where you would expect it to be defined.
It's defined in the Itanium manuals and is equivalent to the Sparc TSO
memory model."

At a moral level, with all due respect to Gharachorloo, if a cache
protocol allows processors other than the most recent writer to see
different values for the same memory location, then I don't think
anyone would consider that anything but broken, no matter what kind of
consistency label was attached.

I remember you commented that if it were PC, then your
AtomicCmpXchg(&x, 42, val) trick would be required to guarantee reading
the most recent value. Obviously that would be a silly thing to require
programmers to do, so I really can't see anyone designing a cache that
requires it.

I also just came across this doc when searching for 'global visibility':

Fast and Generalized Polynomial Time Memory Consistency Verification
Amitabha Roy, Stephan Zeisset, Charles J. Fleckenstein, John C. Huang
Intel Corporation
http://arxiv.org/pdf/cs.AR/0605039.pdf

It makes multiple references to TSO and the following statements:

"The algorithm we have developed is currently implemented in Intel's
in-house random test generator and is used by both the IA-32 and
Itanium verification teams."

"A load is considered performed (or executed) if no subsequent store
to that location (on any processor) can change the load return value.
A store is considered performed (or executed) if any subsequent
load to that location (on any processor) returns its value."

"Axiom 2 (Value Coherence)
The value returned by a read is from either the most recent store
in program order or the most recent store in global order."

These are TSO rules, not PC rules.
It seems to me that they would only develop a test program for
TSO on IA-32 if it actually worked that way.

Eric
From: Alexander Terekhov on 26 Oct 2006 12:50

"Eric P." wrote:
[...]
> (I don't know what "type II" and "Extension to Dubois' Abstraction"
> mean. I can't find a reference to either in Gharachorloo.)

-------
A load by Pi is considered performed at a point in time when the
issuing of a store to the same address by any P cannot affect the value
returned by the load.

A store by Pi is considered performed with respect to Pk (i and k
different) before a point in time when issuing a load to the same
address by Pk returns the value defined by this store or a subsequent
store to the same address that has been performed with respect to Pk.

A store by Pi eventually performs with respect to Pi. If a load by Pi
performs before the last store (in program order) to the same address
by Pi performs with respect to Pi, then the load returns the value
defined by that store. Otherwise, the load returns the value defined by
the last store to the same address (by any P) that performed with
respect to Pi (before the load performs).

A store is performed when it is performed with respect to all
processors.

Conditions for Processor Consistency:

Before a LOAD is allowed to perform with respect to any other
processor, all previous LOAD accesses must be performed.

Before a STORE is allowed to perform with respect to any other
processor, all previous accesses (LOADs and STOREs) must be performed.
------

> Hmmm.... I thought it was resolved.
>
> Joe Seigh said on the x86 memory model:
> http://groups.google.ca/group/comp.arch/msg/6af78be87ca29f31?hl=en&
>
> "It turns out the x86 memory model is defined, it's just not defined
> in the IA-32 manuals, which is where you would expect it to be
> defined. It's defined in the Itanium manuals and is equivalent to the
> Sparc TSO memory model."

The Itanium x86 mapping being a bit stronger ordered than x86 native
won't break anything. It will just make it slow. ;-)

[... snip moral level ...]
> I also just came across this doc when searching for 'global
> visibility':
>
> Fast and Generalized Polynomial Time Memory Consistency Verification
> Amitabha Roy, Stephan Zeisset, Charles J. Fleckenstein, John C. Huang
> Intel Corporation
> http://arxiv.org/pdf/cs.AR/0605039.pdf
>
> It makes multiple references to TSO and the following statements:
>
> "The algorithm we have developed is currently implemented in Intel's
> in-house random test generator and is used by both the IA-32 and
> Itanium verification teams."
>
> "A load is considered performed (or executed) if no subsequent store
> to that location (on any processor) can change the load return value.
> A store is considered performed (or executed) if any subsequent
> load to that location (on any processor) returns its value."
>
> "Axiom 2 (Value Coherence)
> The value returned by a read is from either the most recent store
> in program order or the most recent store in global order."
>
> These are TSO rules, not PC rules.
> It seems to me that they would only develop a test program for
> TSO on IA-32 if it actually worked that way.

The actual hardware implementation may well do TSO. So what?

"The algorithm assumes store atomicity, which is necessary for Axiom 3.
However it supports slightly relaxed consistency models which allow a
load to observe a local store which precedes it in program order,
before it is globally observed. Thus we cover all coherence protocols
that support the notion of relaxed write atomicity which can be defined
as: No store is visible to any other processor before the execution
point of the store. Based on our discussion with Intel microarchitects
we determined that all IA-32 and current generations of Itanium
microprocessors support this due to identifiable and atomic global
observation points for any store. This is mostly due to the shared bus
and single chipset."

Not very promising.

regards,
alexander.
From: Chris Thomasson on 27 Oct 2006 19:09

"Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
news:4540b737$0$1355$834e42db(a)reader.greatnowhere.com...
> Chris Thomasson wrote:
>> "Eric P." <eric_pattison(a)sympaticoREMOVE.ca> wrote in message
>> news:453f7657$0$1353$834e42db(a)reader.greatnowhere.com...
>> > If you are concerned about read bypassing side effects then
>> > add an LFENCE or MFENCE.
[...]
> Also, Andy Glew had some comments on load & store ordering:
> http://groups.google.com/group/comp.arch/msg/96ec4a9fb75389a2

Right... Basically, something like this:
http://groups.google.com/group/comp.programming.threads/msg/68ba70e66d6b6ee9
From: Brian Hurt on 20 Nov 2006 20:55
nmm1(a)cus.cam.ac.uk (Nick Maclaren) writes:

> That experience debunked the claims of the
> functional programming brigade that such methodology gave automatic
> parallelisation.

Automatic parallelization, no. You're looking for a silver bullet that
probably doesn't exist. On the other hand, functional programming makes
writing parallel code much easier to do.

The biggest problem with parallelized code is the race condition, which
arises from mutable data. Every piece of mutable data is a race
condition waiting to happen. Mutable data needs to be kept to an
absolute minimum, and then handled in such a way that it's correct in
the presence of threads.

I've come to the conclusion that functional programming is necessary,
just not sufficient. There are two languages I know of in which it may
be possible to write non-trivial parallel programs correctly and
maintainably: Concurrent Haskell with STM, and Erlang. Both are, at
their core, purely functional languages.

Brian