From: Alexander Terekhov on 1 Sep 2005 07:25

Err..

Alexander Terekhov wrote:
>
> Ricardo Bugalho wrote:
> >
> > On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:
> >
> > > I didn't bother to look at IA64 manual - anybody care to comment on this ?
> > > but I suspect that IA64 is RCpc and the manual is exactly correct after
> > > all.
> >
> > It's RCpc indeed.
>
> Not quite. Release stores to *WB* memory are constrained to ensure
> "remote write atomicity". Classic RCpc is weaker in this respect
> (and that's what makes RC != TSO). You better not rely on this
                         ^
                         |
PC, not RC. -------------+
> property because emulating it on CELLs (for example) will make your
> ports run really slow. ;-)

regards,
alexander.
From: Joe Seigh on 1 Sep 2005 07:37

Ricardo Bugalho wrote:
> On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:
>
>>I didn't bother to look at IA64 manual - anybody care to comment on this ?
>>but I suspect that IA64 is RCpc and the manual is exactly correct after
>>all.
>
> It's RCpc indeed.

So what does "manual is exactly correct" in this case mean?

Are IA-32 loads equivalent to IA64 ld.acq and they are not equivalent
to IA64 ld? I.e. the latter can't emulate an IA-32 load in all cases.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
From: Alexander Terekhov on 1 Sep 2005 07:49

Joe Seigh wrote:
[...]
> Are IA-32 loads equivalent to IA64 ld.acq and they are not equivalent
> to IA64 ld?

The ordering constraints are equivalent for IA32 loads and IA64
acquire loads. But IA64 release stores to WB memory are more
constrained than PC stores, and IA32-under-IA64 effectively runs
in TSO for WB memory, not PC.

regards,
alexander.
From: Eric P. on 1 Sep 2005 12:46

Ricardo Bugalho wrote:
>
> On Wed, 31 Aug 2005 18:02:34 -0400, Eric P. wrote:
>
> >
> > I think the underlying question you asked about the x86 is:
> >
> > Does the Intel Processor Consistency model require processors to wait
> > for all other processors to acknowledge receipt of their invalidates
> > before any are allowed to use the new value?
> >
>
> It does not.
> The most straightforward example is buffered store forwarding: when a CPU
> writes a value into memory, it can read it again directly from the store
> buffer, even before it tries to make it visible to other processors.

I meant with regard to other processors, not to itself.

Within a processor, yes, the docs explicitly state that data from
buffered writes can be forwarded to waiting reads.

As I understand it, while such local forwarding can have consequences
for consistency models, presumably because it allows subsequent
instructions to complete earlier than they otherwise would have,
it should not have an effect on remote data update ordering.

In short, store to load forwarding, in and of itself, would not allow
a new value of Y to arrive at P3 before the new value of X.

For this to occur seems to me to require both of:
(a) the cache protocol to distribute updates in a non-atomic manner
    by allowing a new value to be available before all acks are received.
(b) the bus topology and protocol to somehow allow a message to get
    from P1 to P2 then P2 to P3, passing the one from P1 to P3,
    possibly due to an error and retransmit.

Eric
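The two conditions Eric lists can be made concrete with a toy message-passing sketch. It is not a model of any real cache protocol: the latency table `delay`, the function `simulate`, and the idea that each processor holds its own copy of X and Y updated by point-to-point messages are all assumptions made for illustration. A store becomes usable at each destination as soon as its message arrives, without waiting for acks (condition (a)), and the P1-to-P3 message can be slower than the P1-to-P2 plus P2-to-P3 path (condition (b)).

```python
import heapq

def simulate(delay):
    """delay[(src, dst)] is a hypothetical one-way latency for update messages.

    Returns the (Y, X) pair that P3 observes at the moment the new Y arrives.
    """
    # Each processor's private copy of the two shared variables.
    mem = {p: {"X": 0, "Y": 0} for p in ("P1", "P2", "P3")}
    events = []  # heap of (arrival_time, sequence_number, dst, var, value)
    seq = 0

    def store(src, var, value, now):
        nonlocal seq
        mem[src][var] = value
        # Send the update to every other processor; nobody waits for acks.
        for dst in mem:
            if dst != src:
                heapq.heappush(events, (now + delay[(src, dst)], seq, dst, var, value))
                seq += 1

    store("P1", "X", 1, now=0)
    p3_saw = None
    while events:
        t, _, dst, var, value = heapq.heappop(events)
        mem[dst][var] = value
        if dst == "P2" and var == "X":
            # P2 sees the new X and then stores into Y.
            store("P2", "Y", 1, now=t)
        if dst == "P3" and var == "Y":
            # P3 loads Y, then X, in program order.
            p3_saw = (mem["P3"]["Y"], mem["P3"]["X"])
    return p3_saw

slow_p1_to_p3 = {("P1", "P2"): 1, ("P1", "P3"): 10,
                 ("P2", "P1"): 1, ("P2", "P3"): 1}
uniform = {("P1", "P2"): 1, ("P1", "P3"): 1,
           ("P2", "P1"): 1, ("P2", "P3"): 1}

print(simulate(slow_p1_to_p3))
print(simulate(uniform))
```

With the slow P1-to-P3 link, P3 sees the new Y alongside the old X; with uniform latencies it sees both new values. Store-to-load forwarding plays no part in either run, which is Eric's point.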
From: Andy Glew on 2 Sep 2005 14:51
Bottom quoting: asbestos donned!

I think that Joe Seigh has incorrectly assumed that processor consistency implies (a) a global ordering of all loads, and (b) causal ordering.

This is not true. At least, I am fairly certain that there is a causal ordering memory model that is intermediate in semantics between processor consistency and sequential consistency. (Google finds lots of papers; I specifically recall Mosberger's survey.) And I do not believe that I have ever seen a proof that processor consistency implies a global ordering of all loads; I don't think such a proof exists; I would be interested to see it if it does; and I strongly suspect that there is a proof that orderings consistent with processor consistency may violate causal ordering. Indeed, Joe may have provided one.

(I do confess that I have occasionally wanted to move from processor consistency to causal consistency, mainly because causal consistency sounds like it should be easier to make proofs for; but I am not sure if causal consistency is any easier to implement than sequential consistency. Since sequential consistency is easy enough to implement, I suspect that if we tighten up the memory model we will go all the way.)

Nearly all statements in processor consistency are local.

For processors Pi, i = ...

Each Pi has a set of instructions Pi.Ij, some of which are loads, some of which are stores. Notationally Pi.Lj and Pi.Sj, where the index sets for Lj and Sj are not necessarily contiguous.

Each Pi also sees external stores in some order Pi.Xk.

The sequence of external stores seen by Pi, Pi.Xk, can be formed out of an interleaving of the set of stores from all other processors Pm.Sj, m!=i. The only real constraint is that in this interleaving all of the stores from a particular processor Pm.Sj appear in the order in which they occurred on that processor; stores from a given processor are not reordered in the sequence.
The sequence of external stores Pi.Xk is not necessarily equal to Pj.Xk, for different processors i and j. I.e. although stores from any single processor are performed in order at any other processor, other processors do not necessarily see stores from different processors interleaved in the same order. I.e. there is no single global store order.

Instruction execution at a single Pi proceeds as if one instruction at a time were executed, with some interleaving of the external stores Pi.Xk. I.e. from the point of view of the local processor, its loads Pi.Lj are performed in order, and in order with the local stores Pi.Sj.

More specifically, there can be constructed an ordering Pi.Mx which is an interleaving of Pi.Ij (and hence Pi.Lj and Pi.Sj) and Pi.Xk, and local processor execution is consistent with such an ordering Pi.Mx.

Note: we say "there can be constructed an ordering". But, so far as I know, there is no easy way to construct such an ordering for any particular processor. We know that one could be constructed, but we don't know what it is. And certainly there is not an easy way to construct this in an online manner.

And, again: there need not be a global ordering of stores from all processors. Nor need there be a global ordering of loads.

A formal model must make a few more statements about the limited forms of causality that are maintained in a processor consistent system. (E.g. two party causality; three party causality is not maintained, to the best of my knowledge.)

And, to be perfectly honest, I forget what statements need to be made to differentiate between the two sub-types of processor consistency: Gharachorloo type I and type II, where in the latter you can forward from a store buffer (an implementation consideration).

---

As Mitch says, the above can be briefly stated: WB memory is processor consistent, type II.

Describing the interaction of other memory types is more complicated.
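The constraint on Pi.Xk described above (stores from any one source processor appear in that processor's program order, while different observers may see different interleavings) can be sketched as a small check. The representation is an assumption made for illustration: a view is a list of `(processor, store_index)` pairs, and `preserves_program_order` is a hypothetical helper name, not part of any formal model cited in this thread.

```python
def preserves_program_order(view, store_counts):
    """Check one observer's sequence of external stores.

    view:         list of (proc, store_index) pairs in the order the
                  observer saw them.
    store_counts: dict mapping each source processor to the number of
                  stores it issued.
    Returns True if every source processor's stores appear exactly once
    and in that processor's own program order.
    """
    next_expected = {proc: 0 for proc in store_counts}
    for proc, idx in view:
        if proc not in next_expected or idx != next_expected[proc]:
            return False
        next_expected[proc] += 1
    return all(next_expected[p] == store_counts[p] for p in store_counts)

store_counts = {"P1": 2, "P2": 2}

# Two observers that disagree about how P1's and P2's stores interleave.
view_at_p3 = [("P1", 0), ("P2", 0), ("P1", 1), ("P2", 1)]
view_at_p4 = [("P2", 0), ("P2", 1), ("P1", 0), ("P1", 1)]

# A view that reorders P1's own stores.
reordered = [("P1", 1), ("P1", 0), ("P2", 0), ("P2", 1)]

print(preserves_program_order(view_at_p3, store_counts))
print(preserves_program_order(view_at_p4, store_counts))
print(preserves_program_order(reordered, store_counts))
```

Both `view_at_p3` and `view_at_p4` pass even though they differ, which is the "no single global store order" property; only the view that reorders a single processor's stores is rejected.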
---

I do not know or care very much what the Itanium processor manual says about x86 memory ordering. I wouldn't be surprised if they got it wrong; or, as in the examples Joe provided, describe a mapping which has explanatory value, but not definitional value.

---

Joe Seigh <jseigh_01(a)xemaps.com> writes:

> MitchAlsup(a)aol.com wrote:
> > I didn't find it in the Intel book I have (Pentium Pro)
> > But chapter 7 in Volume 2 of AMD x86-64 Architecture Programmer's
> > Manual (System Programming) describes AMD's side of the situation,
> > starting on page 191 of the Purple Volume.
> > The problem is when you consider the number of memory modes {UC, CD,
> > WC, WP, WT and WB} that no simplistic statement can fully address what
> > the programmer can assume about memory and its ordering properties.
> > WriteBack (cacheable) memory is, however, Processor Consistent.
> >
>
> The argument being presented in c.p.t. is that processor consistency
> implies loads are in order, perhaps instigated by something Andy Glew
> said about this here
> http://groups.google.com/group/comp.arch/msg/96ec4a9fb75389a2
>
> AFAICT, this is not true for 3 or more processors. E.g.
>
> processor 1 stores into X
> processor 2 sees the store by 1 into X and stores into Y
>
> So the store into Y occurred after causal reasoning.
>
> processor 3 loads from Y
> processor 3 loads from X
>
> If loads were in order you could infer that if processor 3
> sees the new value of Y then it will see the new value of X.
> But the rules for processor consistency *clearly* state that
> you will not necessarily see stores by different processors in
> order.
>
> While there are still ordering constraints on the loads they
> don't have to be strictly in order as Andy incorrectly infers.
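Joe's three-processor example can be checked by brute force against the per-processor interleaving description given earlier in this post. The sketch below is a toy, not a formal model of x86 or IA64: the helper names `merges` and `p3_outcomes`, the initial values of 0, and the encoding of operations as tuples are all assumptions. It keeps P3's loads strictly in program order (Y, then X) and only varies the order in which P3 sees the two external stores.

```python
def merges(a, b):
    """Yield every interleaving of lists a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in merges(a[1:], b):
        yield [a[0]] + rest
    for rest in merges(a, b[1:]):
        yield [b[0]] + rest

STORE_X = ("store", "X", 1)   # issued by processor 1
STORE_Y = ("store", "Y", 1)   # issued by processor 2 after it saw X == 1
P3_LOADS = [("load", "Y"), ("load", "X")]

def p3_outcomes(views):
    """Return the set of (Y, X) values P3 can load, given its allowed store views."""
    outcomes = set()
    for view in views:
        # Interleave P3's in-order loads with its view of the external stores.
        for order in merges(P3_LOADS, view):
            mem = {"X": 0, "Y": 0}
            seen = {}
            for op in order:
                if op[0] == "store":
                    mem[op[1]] = op[2]
                else:
                    seen[op[1]] = mem[op[1]]
            outcomes.add((seen["Y"], seen["X"]))
    return outcomes

# Per-processor views: any interleaving of P1's and P2's stores is allowed.
per_processor = p3_outcomes(list(merges([STORE_X], [STORE_Y])))

# Hypothetical causal restriction: P3 must see X's store before Y's.
causal_only = p3_outcomes([[STORE_X, STORE_Y]])

print(sorted(per_processor))
print(sorted(causal_only))
```

Under the per-processor views the outcome (new Y, old X) is reachable even though P3's loads never pass each other; it disappears only when P3's view is forced to respect the causal order of the two stores. In this toy, in-order loads and "sees new Y implies sees new X" are separate properties.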