From: Andy Glew "newsgroup at on 4 Aug 2010 19:32 Attempting to describe some microarchitectures in text-only email to a friend inspires me to post this. It would be much better to have lots of drawings illustrating precisely what I am talking about; but this textual notation is fairly compact, can be understood by some people, and in some ways, because it underspecifies certain configurations, allows us to talk without getting bogged down in multiple different ways of interconnecting the blocks. I.e. sometimes drawing is TOO specific. = Basic Pipelines = Let's start out with in-order pipelines InO and out-of-order pipelines OoO A typical in-order pipeline might look like InO = BP -> I$ -> D -> RD -> X -> L1$ -> WB with pipestage names * BP = branch prediction * I$ = instruction cache * RD = register read * D = decode * X = execute * L1$ = L1 data cache * WB = writeback A prototypical out-of-order pipeline might look like OoO = BP -> I$ -> D -> RR -> S -> X -> L1$ -> IW with additional pipestages * RR = register rename * S = schedule * IW = instruction window - ROB, etc. Where are the PRF reads? On a P6-style read-before-schedule & capture machine: OoO = BP -> I$ -> D -> RR -> RD -> S -> X -> L1$ -> IW On a HaRRM (1987) or Wmt-style read-after-schedule machine: OoO = BP -> I$ -> D -> RR -> S -> X -> RD -> L1$ -> IW I am not going to go into details about this, or the ROB/PRF, because that's not my topic of discussion here. Actually, the execution units may be superscalar, or they may share the same single port. I could contrive ways of drawing this, but again it's not my real point. Similarly, the L1$ is often accessed in parallel with the first few cycles of X. I might depict this as X || L1$ or just X L1$ rather than X -> L1$ In fact, eliding the arrows overall makes things less cluttered A typical in-order pipeline might look like InO = BP I$ D RD X L1$ WB OoO = BP I$ D RR S X RD L1$ IW and it allows us to throw other stuff on, such as the L2$ InO = BP I$ D RD X L1$ WB L2$ OoO = BP I$ D RR S X RD L1$ IW L2$ we can stretch the notation to indicate store queues, etc; it is nice to use the DEC/AMD term LSU, even though Intel tends to split them up = SMT = SMT, symmetric multithreading, like Intel Hyperthreading, runs multiple threads on the same out-of-order pipeline. SMT2(OoO) = SMT2(BP I$ D RR S X RD L1$ IW) L2$ Not indicated: the fact that I$ is shared, but renamer is replicated, etc. We might also have SMT in-order machines, like Intel's first generation Atom: SMT4(InO) = SMT4(BP I$ D RD X L1$ WB) L2$ By comparison, we can emphasize the single threaded nature of the orignal pipelines: ST(OoO) = ST(BP I$ D RR S X RD L1$ IW) L2$ ST(InO) = ST(BP I$ D RD X L1$ WB) L2$ = Multicluster = The original multicluster microarchitectures that I am familiar created wider superscalar machines by replicating the execution units. They were multicluster in that they did not have uniform bypassing: some might bypass within the cluster fast, but impose an extra cycle bypassing to other clusters; some might not permit bypassing at all, except perhaps through special inter-cluster instructions. We can attempt to draw this in ascii art as MC(OoO) = BP I$ D RR S X RD L1$ IW L2$ X MC(InO) = BP I$ D RD X L1$ WB L2$ X but I will instead draw it as MC(OoO) = BP I$ D RR S MC2(X) RD L1$ IW L2$ MC(InO) = BP I$ D RD MC2(X) L1$ WB L2$ The question always arises as to how much is replicated in the clusters: MC(X) MC(X RD) MC(S X) MC(S X RD) MC(S X RD L1$) This last will be revisited in MCMT. 
Execution units can be shared between clusters, like the FPU in AMD
Bulldozer. It's a pain to depict this in ascii art, so I won't bother
here.

Should the renamer be inside the clusters? Depends. It's easier to
maintain single thread semantics if the renamer is outside, and
similarly the LSU:

    MC(OoO) = BP I$ D RR MC2(S X RD L1$) LSU IW L2$

But since the S X LSU loop is as critical as the S X L1$ loop, it is
natural to put it inside. But then coordinating the two clusters
becomes a pain. You might, e.g., copy all stores to both clusters'
LSU. I don't know how to draw that in ascii art.

    MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU) IW L2$

You start down the slippery slope of complexity if you have an L1 LSU
inside the cluster for speed, and an L2 LSU outside the cluster to
coordinate things. Note that the LSU2 need not have any data.

    MC(OoO) = BP I$ D RR MC2(S X RD L1$ LSU1) LSU2 IW L2$

We'll assume that IW is outside the cluster for single thread
semantics.
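It is easier to pin down in code than in ascii art. Here is a minimal
Python sketch of the LSU1/LSU2 split, assuming two clusters, with the
shared LSU2 tracking addresses only; all names are invented for
illustration, not taken from any real design:

class ClusterLSU1:
    def __init__(self):
        self.stores = []                 # (addr, data), program order

    def store(self, addr, data):
        self.stores.append((addr, data))

    def forward(self, addr):
        # store-to-load forwarding: youngest matching store wins
        for a, d in reversed(self.stores):
            if a == addr:
                return d
        return None

class SharedLSU2:
    def __init__(self):
        self.addrs = []                  # (cluster, addr), no data

    def note_store(self, cluster, addr):
        self.addrs.append((cluster, addr))

    def conflict(self, cluster, addr):
        # a load must check for stores from the *other* cluster
        return any(c != cluster and a == addr for c, a in self.addrs)

lsu1 = [ClusterLSU1(), ClusterLSU1()]
lsu2 = SharedLSU2()

def do_store(cluster, addr, data):
    lsu1[cluster].store(addr, data)      # data stays in the cluster
    lsu2.note_store(cluster, addr)       # only the address is shared

def do_load(cluster, addr, memory):
    d = lsu1[cluster].forward(addr)
    if d is not None:
        return d
    if lsu2.conflict(cluster, addr):
        return "replay"                  # cross-cluster hit: recover
    return memory.get(addr, 0)

mem = {0x100: 7}
do_store(0, 0x200, 5)
print(do_load(0, 0x200, mem))   # 5, forwarded within cluster 0
print(do_load(1, 0x200, mem))   # 'replay', cluster 1 sees only the address
print(do_load(1, 0x100, mem))   # 7, no conflict, read memory

The choice this models: store data never crosses clusters; only
addresses are broadcast, and a cross-cluster address match forces a
replay rather than a forward.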
= MCMT =

Once you have multiple clusters, you can run multiple threads on them.
One thread per cluster is natural. It permits the nice simplification
of not having to do bypasses between clusters.

    MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$) IW L2$

With separate threads, an LSU inside the cluster is natural

    MCMT2(OoO) = BP I$ D RR MCMT2(S X RD L1$ LSU) IW L2$

but the stuff talked about above may apply.

For separate threads, it might be natural to put IW inside the
cluster. But IW, instruction window management such as the ROB, is not
in the critical loop. If you put IW outside, you might be able to
share non-uniformly, e.g. giving simple threads 1/4 of the IW, and
complex threads 3/4. I call one way of doing this a [[segmented
sequential instruction window]].

= MCMT and SMT =

You could make each cluster itself SMT:

    MCMT(SMT(OoO)) = BP I$ D RR MCMT2(SMT(S X RD L1$ LSU)) IW L2$

although now you start driving the ifetch pretty hard.

= Speeding Up Single Threads =

SMT and MCMT are ways of supporting multiple threads. It would be nice
if they could speed up single threads.

MC is used for single threads. However, it is a challenge to do
scheduling for truly decoupled MC microarchitectures. One additional
cycle of bypassing can be afforded, but it gets harder the more
decoupled the clusters are.

My old, old name for using multiple threads to speed up a single
thread was IMT, implicit multithreading. SpMT, speculative
multithreading, is the form of IMT I am most familiar with. There are
other forms of IMT, more fine grained than SpMT, which execute
contiguous sections of the instruction stream such as loop bodies and
regions separated by CALLs and RETurns.

IMT and SpMT need to run on a multithreaded substrate. Above we have
described two multithreaded substrates,

    SMT
    MCMT

with the obvious addition of

    MP

any of which can be InO or OoO. MCMT can be built with SMT inside the
clusters. And MP, multiprocessor, separate cores, can be built out of
single threaded cores, multithreaded cores, etc.:

    MP = MP(ST)
         MP(SMT)
         MP(MCMT)
         MP(MCMT(SMT))

You can therefore run, or at least think about running, IMT or SpMT on
any of these multithreaded substrates:

    SpMT(SMT)
    SpMT(MCMT)
    SpMT(MCMT(SMT))
    SpMT(MP)
        SpMT(MP(ST))
        SpMT(MP(SMT))
        SpMT(MP(MCMT))
        SpMT(MP(MCMT(SMT)))

Overall, SpMT(MP(ST)) is hard. It is hard to overcome the inter-core
latencies when forking threads.

SpMT(SMT) is doable - indeed, it is Haitham Akkary's DMT. However, SMT
is often limited to only 2 threads, which really does not give you
much opportunity to speed things up with SpMT. And the speculative
threads tend to interfere with the non-speculative thread. First, do
no harm.

::SpMT(SMT2) - low potential

I motivated MCMT as a way of building a multithreaded
microarchitecture for SpMT, one that might be able to support more
threads. Alternately, I motivated MCMT by itself, without SpMT, as
something intermediate between SMT and MP.

SpMT(MCMT(ST)) is okay, but there is still probably overhead to go
between clusters, felt on forking. By the way, SpMT(MCMT(ST)) probably
has lots of other stuff added to preserve sequential semantics. I
won't discuss that here, although I will mention that I am fond of
log-based execution.

SpMT(MCMT(SMT2)) is a potential sweet spot. I think the normal
operating mode would be with one thread per cluster, but the SMT2
allows a "runt thread" to be used to fork off the non-speculative
thread, or a less speculative thread. The runt thread could cycle
steal while transferring state from cluster to cluster, and thus
minimize slowing down the non-speculative thread.

SpMT(MP) is hard, as mentioned above. But SpMT(MP(SMT2)) is another
sweet spot - with the runt thread trick again played to help forking.

So, we have two attractive configurations:

    SpMT(MCMT(SMT2))
    SpMT(MP(SMT2))

with the inner SMT2 used to speed up forking. SpMT(MCMT(SMT2)) is
probably smaller, but SpMT(MP(SMT2)) makes better use of the many,
many cores we have nowadays.

Combining both in SpMT(MP(MCMT(SMT2))) is imaginable, but it is more
complex than doing just SpMT(MP(SMT2)). Not worth doing unless there
is some big win. Or unless you want to fill a product line hole with
MCMT(SMT2) - e.g. something larger than a single core ST or SMT2(ST),
but smaller than MP2.
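A minimal sketch of the runt-thread forking trick, using a Python
thread and an Event as loose stand-ins for the runt context and
per-register ready bits; everything here is invented for illustration,
a model of the idea rather than of any real design:

import threading

def runt_copy(src_regs, dst_rf, ready):
    # the runt thread cycle-steals, trickling architectural state
    # across so the non-speculative thread is never stalled
    for reg, val in src_regs.items():
        dst_rf[reg] = val
    ready.set()

def fork_speculative(main_regs, resume_pc):
    dst_rf = {}
    ready = threading.Event()
    threading.Thread(target=runt_copy,
                     args=(dict(main_regs), dst_rf, ready)).start()
    # the speculative thread can start fetching at resume_pc at once;
    # an instruction that needs a not-yet-copied register waits on
    # 'ready' (a real design would track per-register valid bits)
    return {"pc": resume_pc, "rf": dst_rf, "ready": ready}

spec = fork_speculative({"r1": 42, "sp": 0x8000}, resume_pc=0x1234)
spec["ready"].wait()               # consumer side
print(spec["rf"]["r1"])            # 42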
= Multilevel Schedulers, etc. =

TBD: S2 S1 schedulers, etc.

TBD: batch scheduling across clusters versus fine grain de-interleaved
scheduling.

= See =

Wiki at
http://semipublic.comp-arch.net/wiki/Alphabet_Soup:_a_Collection_of_Microarchitectures#Basic_Pipelines

From: Paul A. Clayton on 4 Aug 2010 21:57

On Aug 4, 7:32 pm, Andy Glew <"newsgroup at comp-arch.net"> wrote:
[snip]
> = Basic Pipelines =
>
> Let's start out with in-order pipelines
>     InO
> and out-of-order pipelines
>     OoO
>
> A typical in-order pipeline might look like
>     InO = BP -> I$ -> D -> RD -> X -> L1$ -> WB
> with pipestage names
> * BP = branch prediction
> * I$ = instruction cache
> * D = decode
> * RD = register read
> * X = execute
> * L1$ = L1 data cache
> * WB = writeback
>
> A prototypical out-of-order pipeline might look like
>     OoO = BP -> I$ -> D -> RR -> S -> X -> L1$ -> IW
> with additional pipestages
> * RR = register rename
> * S = schedule
> * IW = instruction window - ROB, etc.
>
> Where are the PRF reads?
>
> On a P6-style read-before-schedule & capture machine:
>     OoO = BP -> I$ -> D -> RR -> RD -> S -> X -> L1$ -> IW
>
> On a HaRRM (1987) or Wmt-style read-after-schedule machine:
>     OoO = BP -> I$ -> D -> RR -> S -> X -> RD -> L1$ -> IW
>
> I am not going to go into details about this, or the ROB/PRF, because
> that's not my topic of discussion here.

One of the pipeline variants that I like is the use of a future file
for some values (particularly the stack pointer, global pointer,
thread-local storage pointer--for early address generation; but also
some LSbits of a counter [with an upper-bits-not-zero bit] and perhaps
condition registers for early branch resolution). This allows moving
RD before RR (scheduling and execution could also be
simplified/accelerated--FIFO scheduling and only shortish immediate
offset addressing [perhaps with base register stored in moderately
redundant format]). (It could even be possible in some cases to
resolve a branch rather than predict it.) Specializing those three
pointers also makes TLB specialization easier, reducing storage
redundancy and tag comparisons.
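A minimal sketch of this kind of front-end future file, assuming only
three tracked pointers and only simple immediate updates; the names
(sp/gp/tp, decode_update, early_agen) are invented for illustration:

SPECIAL = ("sp", "gp", "tp")    # stack, global, thread-local pointers

class FrontEndFutureFile:
    def __init__(self):
        self.value = {r: 0 for r in SPECIAL}
        self.valid = {r: True for r in SPECIAL}

    def decode_update(self, op, dst, src, imm):
        # track only simple immediate adjustments (sp += imm, etc.);
        # anything else invalidates the entry until writeback repairs it
        if dst in self.value:
            if op == "addi" and src == dst and self.valid[dst]:
                self.value[dst] += imm
            else:
                self.valid[dst] = False

    def early_agen(self, base, offset):
        # early address generation at decode, before rename, is
        # possible only when the base is a tracked, valid pointer
        if base in self.value and self.valid[base]:
            return self.value[base] + offset
        return None             # fall back to the normal AGEN path

    def writeback_repair(self, reg, value):
        # the real pipeline eventually writes the architected value back
        if reg in self.value:
            self.value[reg] = value
            self.valid[reg] = True

ff = FrontEndFutureFile()
ff.decode_update("addi", "sp", "sp", -32)   # push a frame
print(ff.early_agen("sp", 8))               # -24, known at decode
ff.decode_update("mov", "sp", "r5", 0)      # unpredictable update
print(ff.early_agen("sp", 8))               # None, must wait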
Paul A. Clayton
just a technophile

From: jacko on 4 Aug 2010 23:24

Storage redundancy elimination? Sounds good. Does the local stack
pointer(s) (a slight Forth language effect) have to be a pointer, or
would a compression codec with a finite logic area, and depth
independent memory (perfect recall, but may occasionally stall (rate
dependent on area)) be more appropriate?
From: Andy Glew <"newsgroup at comp-arch.net"> on 5 Aug 2010 09:20

On 8/4/2010 6:57 PM, Paul A. Clayton wrote:
> On Aug 4, 7:32 pm, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> [snip]
>> Where are the PRF reads?
>>
>> On a P6-style read-before-schedule & capture machine:
>>     OoO = BP -> I$ -> D -> RR -> RD -> S -> X -> L1$ -> IW
>>
>> On a HaRRM (1987) or Wmt-style read-after-schedule machine:
>>     OoO = BP -> I$ -> D -> RR -> S -> X -> RD -> L1$ -> IW
>>
>> I am not going to go into details about this, or the ROB/PRF, because
>> that's not my topic of discussion here.
>
> One of the pipeline variants that I like is the use of a
> future file for some values (particularly the stack
> pointer, global pointer, thread-local storage pointer--
> for early address generation; but also some LSbits of a
> counter [with an upper-bits-not-zero bit] and perhaps
> condition registers for early branch resolution). This
> allows moving RD before RR (scheduling and execution
> could also be simplified/accelerated--FIFO scheduling
> and only shortish immediate offset addressing [perhaps
> with base register stored in moderately redundant
> format]). (It could even be possible in some cases to
> resolve a branch rather than predict it.) Specializing
> those three pointers also makes TLB specialization easier,
> reducing storage redundancy and tag comparisons.

Yep.

Talking about the different OOO pipes in detail was not the main
purpose of yesterday's exercise (which I did at a very late afternoon
lunch break, after the email the night before); it was mainly to talk
about MC and MCMT and SpMT. But you can use the same notation to talk
about varieties of OOO.

The importance of notation is when it promotes conversation by eliding
unnecessary details. The weakness is when it elides necessary details.

Viz, my post of yesterday hopefully makes it plain that you can have

    MC( in-order )
    MC( OoO P6 ROB style read before schedule )
    MC( OoO Wmt / HaRRM style read after schedule )
    MC( OoO future file AMD integer style )

I was about to say that I do not like future files, because most of
the data is invalid. But I see that you specified only certain values
likely to be known up front.

I think that it is unlikely that FF RD is moved before RR: they are
logically similar, indexing a data structure with the logical register
number, and getting out stuff like

    physical register number
    value (if known)
    estimated completion time
    ready/not ready
    producer
        where the value is coming from
        (which instruction, pipe, cluster, RF, PRF, ...)
    rewritten/optimized instructions

Hmmm, I can feel the need for another blurb, on renamer tricks.
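A minimal sketch of such an entry as a plain record, with invented
field names matching the list above:

from dataclasses import dataclass
from typing import Optional

@dataclass
class RenamerEntry:
    preg: int                        # physical register number
    value: Optional[int] = None      # value, if already known
    ready: bool = False              # has the producer completed?
    est_done: Optional[int] = None   # estimated completion cycle
    producer: Optional[str] = None   # which instruction/pipe/cluster/RF
    rewrite: Optional[str] = None    # rewritten/optimized instruction

# the renamer is then a table indexed by logical register number:
rat = {f"r{i}": RenamerEntry(preg=i) for i in range(16)}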
From: Paul A. Clayton on 5 Aug 2010 11:18
On Aug 5, 9:20 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
[snip]
> I think that it is unlikely that FF RD is moved before RR: they are
> logically similar, indexing a data structure with the logical register
> number, and getting out stuff like
>     physical register number
>     value (if known)
>     estimated completion time
>     ready/not ready
>     producer
>         where the value is coming from
>         (which instruction, pipe, cluster, RF, PRF, ...)
>     rewritten/optimized instructions

Yeah. I guess I was just thinking in terms of the early RISC in-order
pipelines in which register read was part of decode. Once the logical
RegID is known, a RF or RAT could be read. A tiny future file (e.g.,
16 bits of three registers) integrated with an adder (or two) could be
very low latency.

Interestingly, most recent x86 processors do have a limited set of
front-end registers--the segment registers, with dedicated adders to
create a single immediate--though a segment register update stalls the
pipeline rather than allowing forward progress on independent
operations.

Paul A. Clayton
just a technophile
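A minimal sketch of that segment-register folding, assuming the
decoder folds the statically known segment base into the displacement;
names and base values are invented for illustration:

SEG_BASE = {"ds": 0x00000000, "fs": 0x7FF00000}    # front-end copies

def decode_mem_operand(seg, base_reg, disp):
    # fold the statically known segment base into the displacement,
    # so the back end sees a plain (base_reg + immediate) operand
    return (base_reg, SEG_BASE[seg] + disp)

def write_segment(seg, new_base):
    # per the note above, a real x86 front end stalls here rather
    # than renaming the segment register like an ordinary register
    SEG_BASE[seg] = new_base

print(decode_mem_operand("fs", "rbx", 0x40))   # ('rbx', 0x7FF00040)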