From: MitchAlsup on 28 Dec 2009 19:54

On Dec 26, 2:39 pm, EricP <ThatWouldBeTell...(a)thevillage.com> wrote:
> EricP wrote:
>
> > Having the ability to perform a LoadLocked/StoreConditional on
> > up to 4 separate memory locations would eliminate much of the
> > need to escalate to the heavyweight OS synchronization ops.
>
> This appears to require a method of establishing a global
> first-come-first-served ordering for the cpus that is
> independent of the physical memory locations involved.

Correct.

> In the unlikely event of cache line ownership contention then
> the first cpu to begin its multiple update sequence wins
> and the other rolls back.

But more importantly, the entity that determines who wins also establishes an order over the current participants, avoiding contention on the subsequent accesses. Thus one can achieve something on the order of O(log n) memory references worst case instead of O(n**2). You cannot get to this point unless the synchronization 'event' returns an integer order number instead of a simple win/retry.

> The trick is for it to be a low cost mechanism (ideally the cost of
> a single cache miss to establish the order) that works within
> the existing cpu hardware, bus and coherency protocol.

In practice it requires a two-way transfer through the fabric, but it does not require a DRAM access delay, so the latency is better than a DRAM access. The entity looks and smells remarkably like a TLB and can process a stream of requests as fast as the fabric can deliver them (i.e. no back pressure--at least none required). And the TLB does not have to be "that big" either.

> For that I'm thinking that maybe a global device located
> at some special physical memory location would establish
> the global order at the start of a multiple update sequence.

Yep--programmed up by a standard header, making it look like a device sitting anywhere in fabric-addressable space.
> Then using Invalidate bus ops to a set of other special
> physical memory locations could communicate that ordering
> to other cpus and they can associate that with the bus id.

Nope, dead wrong here. You return the order as an integer response to a message that contains all of the participating addresses; this part of the process does not use any side-band signaling. After a CPU has been granted exclusive access to those cache lines, it is then enabled to NAK requests from other CPUs (or devices), so that it, the blessed CPU, makes forward progress while the unblessed are delayed.

> So in this example the overhead cost would be 3 bus ops:
> Read the global device, an Invalidate to indicate my order to
> the peers, and an Invalidate at the end of the sequence.

In my model, there is a message carrying up to eight 64-bit physical addresses to the Observer entity. If there are no current grants on any of the requested cache lines, the transaction as a whole is granted and a 64-bit response is returned to the requesting CPU. Most would call this one fabric transaction, just like a Read-Cache-Line is one fabric operation. The CPUs then contend for the actual cache lines in the standard manner (with the exception of the NAK above).

Mitch
From: EricP on 29 Dec 2009 13:38
EricP wrote:
> It requires 2 bus message features: a broadcast of the order
> number to all peers at the start of an MU attempt,
> and the ability to NAK a ReadToOwn cache line request with
> a special error code that triggers an Abort in the requester.
>
> <snip>
>
> - Each cpu now has a bit vector, indexed by bus id #,
> that tells that processor whether it should respond to
> an individual ReadToOwn by sending the line and aborting myself,
> or sending a NAK which will trigger an abort in the peer.

This could also be done without a NAK, though it is not very elegant: it could do a grab-back. If a line owner receives a ReadToOwn, it consults the bus-id bit vector. If the requester is lower in the order, this cpu replies as normal and aborts its own MU sequence. If the requester is higher in the order, this cpu replies with the value but immediately requests the line back. That triggers the same logic sequence in the requester (because we all agree on the order numbers), who will reply and abort its MU sequence.

Eric