From: EricP on 26 Dec 2009 19:55

Bart Swinnen wrote:
> EricP wrote:
>> I believe Mitch is referring to potential new hardware functionality
>> like AMD's Advanced Synchronization Facility proposal.
>> I can't seem to find any info on it on the AMD website as the proposal
>> seems to have degenerated into just a registered trademark notice.
>
> Hurray for Google?
> http://developer.amd.com/CPU/ASF/Pages/default.aspx

The problem with the current definition of ASF is that it has _no_ guarantee of forward progress. Two contenders can wind up ping-ponging back and forth in a livelock, and software must provide intervention that resolves this.

I was suggesting in my other message that, in order to guarantee that _one_ contender makes forward progress, one need only establish a global order for aborts in the face of contention. That does not mean fairness, nor does it eliminate livelock, as an individual can still wind up being continuously victimized. To guarantee fairness there must be some sort of queue to fall back upon.

Eric
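The global-abort-order idea can be sketched in a few lines. This is purely illustrative (the names and the `Txn` class are mine, not ASF's): each contender draws an order number once from a central source; on a conflict the earlier number survives and the later one aborts, so at least one contender always commits.

```python
# Sketch: a global abort order guarantees that ONE contender makes
# forward progress on a conflict. Illustrative names, not ASF's API.
import itertools

_global_order = itertools.count()  # stands in for a central order source

class Txn:
    def __init__(self, name):
        self.name = name
        self.order = next(_global_order)  # drawn once, kept across retries

def resolve_conflict(a, b):
    """Earlier global order wins; the other must abort and retry."""
    return (a, b) if a.order < b.order else (b, a)

t1, t2 = Txn("cpu0"), Txn("cpu1")
winner, loser = resolve_conflict(t1, t2)
# Exactly one of the two contenders is guaranteed to proceed: cpu0 here,
# since it drew the earlier order number.
```

Note that this only guarantees some contender commits; as the post says, it is not fairness, since a contender holding a late number can keep losing.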
From: EricP on 29 Dec 2009 12:45
MitchAlsup wrote:
> On Dec 26, 2:39 pm, EricP <ThatWouldBeTell...(a)thevillage.com> wrote:
>> EricP wrote:
>
> You return the order as an integer response to a message that contains
> all of the participating addresses. This part of the process does not
> use any side-band signaling. After a CPU has been granted exclusive
> access to those cache lines, it, then, is enabled to NAK requests from
> other CPUs (or devices) so that it, the blessed CPU, makes forward
> progress while the unblessed are delayed.
>
>> So in this example the overhead cost would be 3 bus ops to
>> Read the global device, an Invalidate to indicate my order to
>> the peers, and an Invalidate at the end of the sequence.
>
> In my model, there is a message carrying up to eight 64-bit physical
> addresses to the Observer entity; if there are no current grants to any
> of the requested cache lines, the transaction as a whole is granted and
> a 64-bit response is given to the sending CPU. Most would call this one
> fabric transaction, just like a Read-Cache-Line is one fabric operation.
>
> The CPUs contend for the actual cache lines in the standard manner
> (with the exception of the NAK above).

In your method, the MU (Multiple Update) participants collect their physical addresses together into a batch and send them to an arbiter. I am exploring a slightly different approach which dynamically self-organizes amongst the processors. It is based on the idea that only the order number is important for dispute resolution; the physical addresses are just triggers. If there is a collision then the lower (earlier) order number wins and the higher number aborts. Once the order is established, the lower order will always eventually win. The loser retries using its original order number, so it will eventually win too. The addresses don't matter for dispute resolution.
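The retry-with-original-order-number property can be sketched as a small simulation (names are mine, purely illustrative): every contender keeps the number it first drew, so after each round the loser is one step closer to holding the lowest outstanding number, and everyone eventually completes in global order.

```python
# Sketch: losers retry with their ORIGINAL order numbers, so each
# transaction eventually becomes the lowest-numbered contender and wins.
# Illustrative only; assumes all contenders conflict every round.
import itertools

counter = itertools.count()  # the global up-counter device

pending = [("cpu%d" % i, next(counter)) for i in range(4)]
finish_log = []

while pending:
    # All pending transactions collide this round; the lowest (earliest)
    # order number wins and commits...
    winner = min(pending, key=lambda t: t[1])
    finish_log.append(winner[0])
    # ...while the rest abort and retry, KEEPING their order numbers.
    pending = [t for t in pending if t is not winner]

# Completion follows the global draw order: cpu0, cpu1, cpu2, cpu3.
```

This is why keeping the original number matters: drawing a fresh number on each retry would reintroduce the possibility of being victimized forever.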
It requires 2 bus message features: a broadcast of the order number to all peers at the start of an MU attempt, and the ability to NAK a ReadToOwn cache line request with a special error code that triggers an Abort in the requester.

- A processor starting an MU attempt reads a 64-bit unsigned integer from a global up-counter device. It then broadcasts that value to all processors so all agree on that processor's order. The broadcast is necessary because there can be multiple buses, and therefore multiple pathways through the interconnect; only a central global counter can establish a single order.

- Each processor has a flag bit vector indexed by bus id #. Each peer processor compares the broadcast order number to its own and sets a flag for the bus id of the broadcaster. If a peer is not in an MU sequence then the flag = 0. If a peer is in an MU sequence then it has its own order number; it compares the broadcast number to its own, and the flag = 1 if the broadcast is higher, otherwise = 0.

- Each cpu now has a bit vector, indexed by bus id #, that tells that processor whether it should respond to an individual ReadToOwn by sending the line and aborting itself, or by sending a NAK, which will trigger an abort in the peer.

- Each MU cpu executes a mu_load instruction. If successful, it records the physical address in a special MU collision detect register. If any peer later executes a mu_load to that same address then there is a collision, and we consult the bit vector to decide how to respond.

- As an optimization, when a cpu wins a contention and sends a NAK, it also remembers the bus id of the cpu it NAK'ed. When it finishes its sequence, or if it aborts, it sends a TryAgain message to the loser, which spins on a special mu_wait instruction.

- The MU instruction sequence is then designed to support the above handshake sequence.

Just thinking out loud...

Eric
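The flag-vector rule above can be sketched as follows. This is my own toy model of the handshake, not real hardware: flag = 1 for a peer means "NAK that peer's ReadToOwn", flag = 0 means "send the line and abort myself".

```python
# Sketch of the per-peer flag computation and the ReadToOwn collision
# response. All class and method names are hypothetical.

class Cpu:
    def __init__(self, bus_id, n_cpus):
        self.bus_id = bus_id
        self.order = None            # None => not in an MU sequence
        self.flags = [0] * n_cpus    # indexed by peer bus id

    def on_broadcast(self, sender_bus_id, sender_order):
        """Peer rule from the post: if I'm not in an MU sequence the
        flag stays 0; otherwise flag = 1 iff the broadcast order is
        higher (later) than mine, i.e. I would win the dispute."""
        if self.order is not None and sender_order > self.order:
            self.flags[sender_bus_id] = 1
        else:
            self.flags[sender_bus_id] = 0

    def on_read_to_own_collision(self, requester_bus_id):
        """Collision on a line held in the MU collision detect register."""
        return "NAK" if self.flags[requester_bus_id] else "send-and-abort"

cpus = [Cpu(i, 3) for i in range(3)]
cpus[0].order, cpus[1].order = 10, 20    # cpu2 is idle, not in an MU
for c in cpus:
    if c.bus_id != 1:
        c.on_broadcast(1, 20)            # cpu1 broadcasts its order
# cpu0 (earlier order 10) will NAK cpu1's ReadToOwn; idle cpu2 will
# simply surrender the line.
```

One design note this makes visible: the dispute is resolved entirely from the pre-distributed order numbers, so no extra round trips are needed at collision time, which matches the claim that the addresses are just triggers.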