From: Cameron Jack on 24 Nov 2009 07:05

On Nov 18, 9:16 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
> Andy "Krazy" Glew wrote:
> > Mayan Moudgill wrote:
> > >> Matt wrote:
> > >>> Have you seen the "eager execution" patent application (app. no.
> > >>> 20090172370)? I mentioned it in my blog a while back (
> > >>> http://citavia.blog.de/2009/07/07/more-details-on-bulldozers-multi-th....
> > >> Had a look; it seems, at a first reading, to be one more of the
> > >> "wouldn't it be great to follow both paths of a branch, and once we
> > >> figure out which way the branch went, we'll copy registers from one
> > >> path over to the other, and, oh, BTW, it's not clear that it wouldn't
> > >> have been less work to just re-execute the second path" ideas.
> > When forking, you copied the renamer state (not really that many bits).
>
> You'll also have to duplicate the call-return prediction structure.
> You'll have to tweak the address matching logic in the store queue.
> You'll have to change the branch resolution clean-up. IMO, this is the
> difficult bit.
> - you have to free up some renamed registers (or, if you're using a
>   different implementation, partially purge the writeback queue).
> - you have to free up entries in the store queue.
>
> On a non-eager machine, the store queue is logically a FIFO; when a
> branch mispredict resolves you (logically) move the valid pointer to the
> last correctly predicted store. In the case of eager execution, the
> store queue becomes a sparse structure, which in turn means that there
> are going to be issues with allocation, load/store matching, and
> identifying age for commit.
>
> Even freeing of mispredicted registers could be a little more expensive.
> - You may need more bits to distinguish between true/false path
>   predictions.
> - If you're going to be eager on more than one branch, then mispredict
>   resolution can potentially kill trees of rename sequences, not just
>   straight-line paths.
> - Detecting a free register may become more of a challenge.

Below is a repost of a comment I made at RWT a few weeks back. I had no
replies and am keen to hear what people think of my ideas - whether they
are derivative, someone else's, good or rubbish:

I've been wondering about dynamic predication ever since the first static
implementation in Itanium. Could you use the existing instruction pointer
stack with an additional branch ID field (like a thread ID) and a
predicate value, and additionally tag each instruction with these as it
passes through the pipeline? If the processor already uses a Future
Register File, as the Opteron does, then it should be an easy matter to
reset all renamed registers with the matching branch ID and a logically
false predicate, while completing all those with the matching BID and a
logically true one. You'll need more rename registers and everything will
get bulkier. You'd also want to limit its usage to simple cases, such as
where the BHT has seen the branch a lot of times but the outcome in the
BTB is less than, say, 30% certain.

Added: My reasoning for going down the dynamic predication route rather
than DMT as such is that there is less of a problem with context sharing -
especially with VMs and the extra page table structures now on the CPU.
Also, since you mention register access becoming a problem, I've also been
wondering about using a register "block" use/renaming/retiring scheme
similar to the Power5's instruction "blocks" (where up to 5 instructions
in a certain mix are built, executed and retired together).
Obviously you'd need a lot more registers, but wiring could be made a lot simpler - especially renaming could be extremely wide.
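To make the tagging idea concrete, here is a minimal C++ sketch of how
renamed results might carry a branch ID and a predicate value, and how
branch resolution might squash one side and promote the other. The field
names and the linear walk are purely illustrative assumptions, not taken
from the Opteron or any other real design; actual hardware would do the
matching associatively.

#include <cstdint>
#include <vector>

// Illustrative rename-table entry: each in-flight result is tagged with
// the ID of the eager branch it depends on and which side of that branch
// produced it. bid == 0 is used here as "not under any eager branch".
struct RenameEntry {
    uint16_t phys_reg;    // physical register holding the result
    uint8_t  bid;         // branch ID, analogous to a thread ID
    bool     pred_value;  // which arm of branch 'bid' this result is on
    bool     valid;
};

// When branch 'bid' resolves with outcome 'taken_path': results on the
// winning arm become ordinary unpredicated results, results on the losing
// arm are squashed and their physical registers returned to the free list.
void resolve_branch(std::vector<RenameEntry>& rat, uint8_t bid, bool taken_path) {
    for (auto& e : rat) {
        if (!e.valid || e.bid != bid) continue;
        if (e.pred_value == taken_path)
            e.bid = 0;           // promote: no longer conditional
        else
            e.valid = false;     // squash: free e.phys_reg
    }
}

A real design would also have to recycle branch IDs once a branch
resolves, and being eager on nested branches would push this single-ID tag
toward the predicate mask sets mentioned in the next post.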
From: "Andy "Krazy" Glew" on 24 Nov 2009 10:19 Cameron Jack wrote: > I've been wondering about dynamic predication ever since the first > static implementation in Itanium. Go for it! Some of us have done a lot of work on dynamic predication. (My resume includes an OOO Itanium, plus I have been working on VLIW and predication longer than OOO.) But since such work inside companies will never see the light of day, do not let that hold you back, since you are not so constrained by NDAs and trade secrets. I worked on the trace cache (back in my MS, before Intel) mainly so that I could easily see convergence points, and predicate the sides of the branches. That was naive: you don't need to see the convergence point when you start predicating. Perhaps because I have worked on this, I am not so afraid of eager execution as is Mayan. When renamed registers have multiple copies with predicate mask sets attached to them, doing the same thing for the store buffer is not such a step. My problem with eager execution is not about how to build it, but about how to make it worthwhile. Sure, you can track branch probabilities, and then go the other path... But branch predictors are so accurate. Go for it, Jack. Maybe I should brush off my dynamic predication work from my MS circa 1990, and/or forget anything I did on Intel and try to come up with new, better, ways. Trouble is, it is just so painful to have to forget almost 2 decades' worth of ideas: I almost prefer not to work on such topics.
From: Mayan Moudgill on 24 Nov 2009 11:55

Andy "Krazy" Glew wrote:
> Perhaps because I have worked on this, I am not so afraid of eager
> execution as is Mayan. When renamed registers have multiple copies with
> predicate mask sets attached to them, doing the same thing for the store
> buffer is not such a step.

I don't know if that is correct (BTW: I've worked on this on paper, not at
the VHDL level, though).

The assumption is that all in-flight instructions have an issue order
(tag); this is the sequence number in which that instance of the
instruction would have been executed.

Let's look at the store buffer. Its entries are:
A. invalid
B. store written out to L1
C. committed but not yet written out to L1
D. uncommitted

Case B is useful for bypassing access to the L1 in some cases. We shall
ignore this case, treating these entries as "free", but it's worth keeping
in mind.

When a load comes in, it must search the store buffer. Possible answers
from each entry are:
E. missed
F. hit, but the store is from the future (i.e. was issued later than the load)
G. partial hit from an unwritten store
H. complete hit

Case F can be treated as a miss, but it's important to be able to
distinguish between future matches and past matches. Order is also
important when reconciling between multiple G and H hits; for each byte,
you have to pick the issued-closest-to-the-load store that wrote to the
byte, and if there is no such store, you have to go to the L1 cache. We
call the case where there is a (partial) hit in the store queue a
(partial) bypass, and the issue order(s) of the bypassing store(s) the
bypass issue order.

Additionally, if loads can overtake stores, then there must be a
load-bypass queue in which the load's issue order, address and size must
be stored, along with the bypassed store issue order(s). [This can be
combined with the load-waiting-for-an-L1-miss queue.]

When a new store comes in, an invalid (or free) entry is found, and the
store's address, size, data and issue order are written to the entry.
[A couple of things: the entry can be allocated earlier, and the
computation of the data and the address can be split.] Additionally, if
loads can overtake stores, it must search the load queue and check:
- does it overlap with a load?
- does the load come later in issue order?
- would it have mattered (i.e. for each overlapping byte, did the byte
  get bypassed from a yet-later store)?
If so, corrective action must be taken [generally equivalent to a branch
mispredict].

The store retire logic generally will retire committed stores in issue
order. It may be possible to retire committed stores in other orders, but
things get squirrely w.r.t. memory ordering and overlapping writes.
[BTW: I am ignoring a large number of details w.r.t. memory ordering and
cache coherence protocols.]

When a branch mispredict is determined, any store queue entries that are
from the mispredicted path are rolled back; i.e. if an entry's issue order
is after the issue order of the mispredicted branch, the entry is marked
invalid.

One possible simplification is to allocate store queue entries in issue
order. This allows depth in the store queue to be a proxy for issue order;
the closer an entry is to the head of the queue, the earlier it was
issued. Writing to L1 is always from the head of the queue. Mispredict
rollback moves the tail of the queue to the last known mispredicted entry.
The next free entry to be allocated is at the tail of the queue.
[This can also be done by allocating entries at execution time and
executing stores in order.]

Eager execution complicates the concept of issue order. It is now
issue-order-on-a-particular-path. The behaviors required of the store
queue remain the same. However, allocating store queue entries in issue
order is no longer practical; it would lead to invalid entries between the
head and the tail of the FIFO, and a sparse store queue. This, in turn,
would make the store queue a potential bottleneck (either because of the
number of effective entries, if you kept the store queue size the same, or
because of the cycle time, if you increased the store queue size). So the
store queue structure you would need to use would have to have
out-of-order entries, additional logic to check for order and path, etc.,
etc.

I will offer an interesting data point: on both the Power5 and, AFAIK,
Core i7/Nehalem, the store queue is NOT shared between the two threads;
they have separate queues. The modifications to support eager execution
are similar to, but more complicated than, those required for two threads.
I would suggest that, given store-queue partitioning for two threads was
picked over a shared store queue, the architects found that modifying the
store queue for eager execution would indeed have a cycle time impact.
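For anyone who wants to poke at the matching rules above, here is a
minimal C++ model of the per-byte bypass search (cases E-H). It is a
software sketch under my own simplifying assumptions - a flat vector of
entries, a linear scan per byte, no path tags - not a hardware proposal.
Making it eager would mean adding a path-compatibility check to the inner
test, which is exactly the extra matching logic Mayan is worried about.

#include <cstdint>
#include <optional>
#include <vector>

struct StoreEntry {
    bool     valid;
    uint64_t addr;
    uint32_t size;
    uint64_t issue;             // issue-order tag; smaller means older
    std::vector<uint8_t> data;  // 'size' bytes of store data
};

// For each byte of the load, pick the youngest store that is still older
// than the load: "older than the load" filters out case F, "youngest of
// those" is the issued-closest-to-the-load rule. Bytes with no match must
// come from the L1 cache (a partial bypass).
std::vector<std::optional<uint8_t>>
search_store_queue(const std::vector<StoreEntry>& sq,
                   uint64_t load_addr, uint32_t load_size, uint64_t load_issue)
{
    std::vector<std::optional<uint8_t>> bytes(load_size);
    std::vector<uint64_t> best(load_size, 0);
    for (uint32_t b = 0; b < load_size; ++b) {
        const uint64_t a = load_addr + b;
        for (const auto& st : sq) {
            if (!st.valid || st.issue >= load_issue) continue;    // case E/F
            if (a < st.addr || a >= st.addr + st.size) continue;  // no overlap
            if (!bytes[b] || st.issue > best[b]) {
                bytes[b] = st.data[a - st.addr];
                best[b]  = st.issue;
            }
        }
    }
    return bytes;  // empty slots: fetch those bytes from the L1
}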
From: Brett Davis on 3 Dec 2009 02:01

In article <4B01BAD2.6080002(a)patten-glew.net>,
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:

> Andy "Krazy" Glew wrote:
> > Brett Davis wrote:
> >> For SpMT I looked at Alpha style paired integer pipelines ...
> >
> > You haven't said enough about the physical layout to talk about those
> > clustering effects.
>
> The physical layout matters a lot, and hence has its own terminology.
>
> 8-wide on a bit-interleaved datapath is pushing the envelope.
>
> 8-wide as two 4-wide bit-interleaved datapaths is not so bad, although
> you will pay a lot of area for the wire turns to get to side-by-side
> datapaths. So I might call this an 8-wide cluster composed of two
> adjacent 4-wide bit-interleaved clusters. With whatever bypassing you
> say.
>
> As I mentioned earlier, there is a sweet trick you can play with
> datapaths that are paired and opposite, reflections. You bit-interleave
> the input and output wires within the, say, 4-wide cluster, and then
> you bit-interleave the outputs (but not the inputs). Trouble with this
> trick is that it takes one of the ends of the datapath away - where do
> you put the scheduler? The data cache?
> So, I call this "opposed" clustering.

I am afraid I need a Transistor 101 refresher:

The speed of light in copper, or aluminum, in a modern fab process, the
resulting mm covered at 3 GHz, and that translated into the number of
wires you can cross at a right angle, distance-wise.

The time a transistor takes to switch, and the resulting number of
transistors you can jump through at 3 GHz in a CPU stage.

The number of active inputs you can drive with one transistor's output.
The number of inactive inputs you can drive with one transistor's output.
(Does a hard limit of only two active listeners out of eight help?)

I do not really like the Alpha 4+4 and its two-cycle delay to cross;
actually, I hate it. I can live with 0 cycles to the two closest ALUs,
with a one-cycle delay for every second crossed. With 64-bit registers
that means crossing 127 wires. You have like 15 wire layers, so there is
a desire to stack the 2-in/1-out wires to reduce the distance covered by
the bypass wires if you do not interleave. (Ignoring the big economic
incentive to use big previous-gen patterning for the upper-level wires.)

How many layers are 45 nm on a 45 nm CPU process, versus 65 nm layers,
and 90 nm? Most die slices I have seen tend to march up to low res fast.

I am willing to skin this old cat in new ways to make an 8-way design
work. RRAM is going to force many of the upper layers to high res anyway,
so I will be catching a wave that enables high-density wire stacking
where it is needed. Lots of breakthroughs and patents to be had in this
area...

Brett
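A back-of-envelope answer to Brett's first question, as a tiny C++
calculation. The dielectric constant is my own illustrative assumption,
not data for any particular process, and the result is only an ideal
transmission-line bound; real minimum-pitch wires are RC-limited and
reach far less per cycle without repeaters.

#include <cmath>
#include <cstdio>

int main() {
    const double c  = 3.0e8;  // m/s, speed of light in vacuum
    const double er = 3.9;    // relative permittivity of SiO2 (assumed)
    const double f  = 3.0e9;  // 3 GHz clock

    const double v     = c / std::sqrt(er);  // signal velocity, ideal line
    const double reach = v / f;              // distance covered per cycle

    std::printf("ideal reach per 3 GHz cycle: %.1f mm\n", reach * 1e3); // ~51 mm
    return 0;
}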
From: Del Cecchi on 3 Dec 2009 16:03
"Brett Davis" <ggtgp(a)yahoo.com> wrote in message news:ggtgp-191C1C.01014603122009(a)netnews.asp.att.net... > In article <4B01BAD2.6080002(a)patten-glew.net>, > "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote: > >> Andy "Krazy" Glew wrote: >> > Brett Davis wrote: >> >> For SpMT I looked at Alpha style paired integer pipelines ... >> > >> > You haven't said enough about the physical layout to talk about >> > those >> > clustering effects. >> >> The physical layout matters a lot, and hence has its own >> terminology. >> >> 8-wide on a bit interleaved datapath is pushing the envelopew. >> >> >> 8-wide as 2 4 wide bit interleaved datapaths is not so bad, >> although you >> will pay a lot of area for the wire turns to get to side by side >> datapaths. S I might call this an 8-wide cluster composed of two >> adjacent 4-wide bit-interleaved clusters. With whatever bypassing >> you >> say. >> >> As I mentioned earlier, there is a sweet trick you can play with >> datapaths that are paired and opposite, reflections. You bit >> interleave >> the input and output wires within the, say, 4-wide cluster, and >> then you >> bit interleave the outputs (but not the inputs). Trouble with this >> trick is that it takes one of the ends of the datapath away - where >> do >> you put the scheduler? The data cache? >> So, I call this "opposed" clustering. > > I am afraid I need a Transistor 101 refresher: > > The speed of light in copper, or aluminum, in a modern fab process, > and the resulting mm covered at 3GHz, and translated into the number > of wires you can cross at a right angle, distance wise. > > The time a transistor takes to switch, and the resulting number of > transistors you can jump through in 3GHz for a CPU stage. > > The number of active inputs you can drive with one transistors > output. > The number of inactive inputs you can drive with one transistors > output. > (Does a hard limit of only two active listeners out of eight help.) > > > I do not really like the Alpha 4+4 and two cycle delay to cross, > actually I hate it. I can live with 0 cycles to two closet ALUs, > with > a one cycle delay for every second crossed. With 64 bit registers > that > means crossing 127 wires. You have like 15 wire layers, so there is > a > desire to stack the 2in1out wires to reduce the distance covered by > the bypass wires if you do not interleave. (Ignoring the big > economic > incentive to use big previous gen patterning for the upper level > wires.) > > How many layers are 45nm on a 45nm CPU process, verses 65nm layers, > and 90nm? Most die slices I have seen tend to march up to low res > fast. > > I am willing to skin this old cat in new ways to make a 8 way design > work. RRAM is going to force many of the upper layers to high res > anyway, so I will be catching a wave that enables high density wire > stacking where it is needed. Lots of breakthroughs and patents to be > had in this area... > > Brett In CMOS, the transistors don't really have a switching time. Rather there is delay associated with charging or discharging the gate (transistor gate, not logic gate) between the on voltage and the off voltage (usually VDD and Gnd respectively). Since both gate capacitance and current capability of a transistor are proportional to width, this leads to the concept of gate limited delay, or the minimum delay of a gate with no interconnect driving an identical gate, and driven by an identical gate. To the gate delay one must add the effects of the interconnect and fan out. 
So far as I know, most of the layers in a 45 nm process are 45 nm,
although typically the last few metal layers are thicker and coarser for
lower-resistance power distribution. Unless things have changed recently,
I don't think anybody is doing 15 levels of interconnect.

del