From: Paul A. Clayton on 27 May 2010 18:10

Another obvious (possibly half-way decent) idea: Use the duplicated register file of a clustered processor design like the Alpha 21264 to hold distinct contexts.

Such a static partitioning might not usually be advisable under two simultaneous threads, but at four (reasonably active) threads, static partitioning might be a net gain in many cases. To allow a slight increase in support for bursty ILP, the inter-cluster forwarding could write to register caches rather than to the other register file, and these register values could be used for issuing instructions from the other cluster. The extra write ports in each cluster could then be used to support two-result operations if desired.

(A two-issue-per-cluster processor might share a multiplier/divider [possibly replicating enough of a multiplier to support independent 16-bit by 64-bit multiplications??]. At three issues per cluster, distinct multipliers might make sense.)

(Static partitioning of two threads might make sense when ILP is relatively low with little benefit from using the full issue width for a single thread, when extra registers could be used to support deeper speculation, or under other circumstances.)

(Obviously, one could also use such register-duplicating clustering to support SIMD-like operations.)

Paul A. Clayton
just a technophile
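A minimal C sketch of the static partitioning described above, assuming two clusters, each with an 80-entry register file split evenly between two of four hardware threads; the sizes, thread counts, and the map_reg name are illustrative assumptions, not figures from the 21264 or any shipped design.

    /* Illustrative model of statically partitioning two cluster register
     * files among four hardware threads (sizes and names are assumptions,
     * not taken from the 21264 or any real design). */
    #include <stdio.h>

    #define NUM_CLUSTERS        2
    #define REGS_PER_CLUSTER   80   /* physical registers per cluster file */
    #define THREADS_PER_CLUSTER 2   /* four threads total, two per cluster */

    /* Map (thread, physical register within its partition) to a concrete
     * cluster and an index into that cluster's register file. */
    static void map_reg(int thread, int preg, int *cluster, int *index)
    {
        int slot = thread % THREADS_PER_CLUSTER;     /* which half of the file */
        *cluster = thread / THREADS_PER_CLUSTER;     /* which cluster          */
        *index   = slot * (REGS_PER_CLUSTER / THREADS_PER_CLUSTER) + preg;
    }

    int main(void)
    {
        for (int t = 0; t < NUM_CLUSTERS * THREADS_PER_CLUSTER; t++) {
            int c, i;
            map_reg(t, 5, &c, &i);
            printf("thread %d, preg 5 -> cluster %d, entry %d\n", t, c, i);
        }
        return 0;
    }

The point of a fixed mapping like this is that each thread's rename logic only ever touches its own half of one cluster's file, so the partitioning by itself needs no extra ports or inter-cluster forwarding.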
From: Andy 'Krazy' Glew on 28 May 2010 02:13

On 5/27/2010 3:10 PM, Paul A. Clayton wrote:
> Another obvious (possibly half-way decent) idea: Use the duplicated
> register file of a clustered processor design like the Alpha 21264 to
> hold distinct contexts.

Looks like you have found another way of arriving at, another evolutionary path to:

a) AMD's MCMT (Multicluster Multithreading), as in Bulldozer

b) my MultiStar.

I arrived at it from a different path: (a) thinking that most multicluster uarchs for single threads were not very successful, (b) using multicluster for separate threads, and (c) then trying to go back and use the MCMT to speed up a single thread.

I.e. you:

    MCST (multicluster singlethread) -> MCMT

me:

    MCMT -> MCST?

I wonder what things work out differently when you think this way?

I never liked the inter-cluster bypass of the 21264. Complete bypass networks are expensive; incomplete ones are a glass jaw. But, heck, even un-clustered machines now have incomplete bypass networks.

> Such a static partitioning might not usually be advisable under two
> simultaneous threads, but at four (reasonably active) threads,
> static partitioning might be a net gain in many cases. To allow a
> slight increase in support for bursty ILP, the inter-cluster forwarding
> could write to register caches rather than to the other register file,
> and these register values could be used for issuing instructions from
> the other cluster. The extra write ports in each cluster could then
> be used to support two-result operations if desired.
>
> (A two-issue-per-cluster processor might share a multiplier/divider
> [possibly replicating enough of a multiplier to support independent
> 16-bit by 64-bit multiplications??]. At three issues per cluster,
> distinct multipliers might make sense.)
>
> (Static partitioning of two threads might make sense when ILP is
> relatively low with little benefit from using the full issue width
> for a single thread, when extra registers could be used to support
> deeper speculation, or under other circumstances.)
>
> (Obviously, one could also use such register-duplicating clustering to
> support SIMD-like operations.)
>
> Paul A. Clayton
> just a technophile
From: Paul A. Clayton on 29 May 2010 22:19

On May 28, 2:13 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
[snip]
> I.e. you:
>
>     MCST (multicluster singlethread) -> MCMT
>
> me:
>
>     MCMT -> MCST?
>
> I wonder what things work out differently when you think this way?

Well, one of my habits of thinking seems to be to exploit existing features for alternate uses (e.g., huge-page TLB entries holding PDEs). (This is probably part of the reason I find SMT appealing: an existing [or extreme] ILP core gives a choice of single-thread performance or moderately good multithreaded throughput.)

> I never liked the inter-cluster bypass of the 21264. Complete bypass
> networks are expensive; incomplete ones are a glass jaw. But, heck,
> even un-clustered machines now have incomplete bypass networks.

I kind of dislike complete bypass because it seems wasteful. (I would irrationally dislike it even if it were cheap.) Other than squaring, when is a result used by both inputs of a functional unit? (Intelligent forwarding would seem desirable, but it could add excessive delay [aside from area/power costs].)

BTW, could a staggered ALU be used to ease the delay problem of scheduling/forwarding? If one 'cluster' of ALUs was staggered a half-cycle relative to the other, with the less significant bits forwarded as soon as available, could one see some benefit? (I like the Pentium 4 staggered ALU concept. I do wonder if it might be useful for a low-power design--i.e., addition takes two cycles to fully complete [less logic activity] but has single-cycle forwarding. [I suspect the ideas in the Pentium 4 are now tainted by the relative failure of the Pentium 4.])

Paul A. Clayton
just a technophile
From: Andy 'Krazy' Glew on 30 May 2010 11:18

On 5/29/2010 7:19 PM, Paul A. Clayton wrote:
> BTW, could a staggered ALU be used to ease the delay
> problem of scheduling/forwarding? If one 'cluster' of
> ALUs was staggered a half-cycle relative to the other,
> with the less significant bits forwarded as soon as available,
> could one see some benefit? (I like the Pentium 4
> staggered ALU concept. I do wonder if it might be useful
> for a low-power design--i.e., addition takes two cycles
> to fully complete [less logic activity] but has single-cycle
> forwarding. [I suspect the ideas in the Pentium 4 are
> now tainted by the relative failure of the Pentium 4.])

I don't think that the Pentium 4 had what you think of as a staggered ALU.

When I think of a staggered ALU, I think of two ALUs, with the second ALU receiving inputs from the first, and possibly from the generic register file. I.e. something that allows you to execute A+B->C; C+D->E in one clock cycle.

The Pentium 4 actually just ran the ALUs - and the associated support logic, like the scheduler - at 2X the published frequency of the core. I.e. if the core was publicly 2.5 GHz, the "fireball" was actually running at 5 GHz.

The original Pentium 4 ALUs were staggered in that they computed the low 16 bits in one of these fast clock cycles, and the high 16 bits in the next - allowing back-to-back dependent adds. But that is not the widespread definition of a "staggered" ALU.
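A minimal C sketch of the half-width add scheme described above: the low 16 bits and the mid carry are produced in one fast cycle, and the high 16 bits in the next, so a dependent operation can consume the low half a fast cycle early. The struct and function names are illustrative assumptions; no claim is made about the actual Pentium 4 circuit.

    /* Toy model of a 32-bit add split into two 16-bit half-adds: low half
     * in one fast cycle, high half (consuming the saved carry) in the next.
     * Purely illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    struct half_add { uint16_t lo; uint16_t hi; int carry_mid; };

    /* Fast cycle 1: low 16 bits and the carry out of bit 15. */
    static struct half_add add_low(uint32_t a, uint32_t b)
    {
        struct half_add r;
        uint32_t s = (a & 0xFFFFu) + (b & 0xFFFFu);
        r.lo = (uint16_t)s;
        r.carry_mid = (s >> 16) & 1;
        r.hi = 0;                         /* not yet computed */
        return r;
    }

    /* Fast cycle 2: high 16 bits, using the saved mid carry. */
    static void add_high(struct half_add *r, uint32_t a, uint32_t b)
    {
        r->hi = (uint16_t)((a >> 16) + (b >> 16) + r->carry_mid);
    }

    int main(void)
    {
        uint32_t a = 0x0001FFFFu, b = 0x00000001u;
        struct half_add r = add_low(a, b);   /* a dependent add could
                                                already consume r.lo here */
        add_high(&r, a, b);
        printf("staggered: %08X  reference: %08X\n",
               (unsigned)(((uint32_t)r.hi << 16) | r.lo), (unsigned)(a + b));
        return 0;
    }

In this sketch, the reason back-to-back dependent adds can proceed at the fast-clock rate is that the consumer only needs the producer's low half, which add_low makes available a half (slow) cycle early.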
From: Paul A. Clayton on 31 May 2010 22:02

On May 30, 11:18 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
[snip]
> The original Pentium 4 ALUs were staggered in that they computed the
> low 16 bits in one of these fast clock cycles, and the high 16 bits in
> the next - allowing back-to-back dependent adds. But that is not the
> widespread definition of a "staggered" ALU.

I took the term from "Using Internal Redundant Representations and Limited Bypass to Support Pipelined Adders and Register Files" (Mary D. Brown, Yale N. Patt; HPCA-8, 2002):

"An example of this concept, called staggered adds, was implemented in the Intel Pentium 4 [10]. When staggering a 32-bit add over two cycles, the carry-out of the 16th bit and the lower half of the result are produced in the first cycle, and the upper half of the result is produced in the second cycle."

So what is the proper term for this kind of pipelined addition? (ISTR reading somewhere that the AMD K5 used the early availability of the less significant bits of a sum to shorten load latency, so early use of partial results is not an extremely new idea.)

Paul A. Clayton
just a technophile
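On the load-latency recollection: the following is only a sketch of the general trick of consuming early low-order sum bits, under assumed cache parameters (64-byte lines, 128 sets, i.e. an 8 KiB direct-mapped array); it is not a description of the K5 or any real implementation. The idea is that if the set-index bits of an address fall entirely within the low half of the address add, the cache-set lookup can begin before the upper half of the sum is available.

    /* Sketch of using the early low half of an address add to start the
     * cache-set lookup before the full sum is known.  Parameters are
     * assumptions for illustration only. */
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS   6    /* 64-byte lines                               */
    #define INDEX_BITS  7    /* 128 sets: index bits [12:6], inside the low
                                16 bits of the address                      */

    static unsigned early_set_index(uint32_t base, uint32_t offset)
    {
        /* Only the low 16 bits of the sum are needed for the set index,
           so this can be computed from the first (low) half-add alone. */
        uint16_t low_sum = (uint16_t)(base + offset);
        return (low_sum >> LINE_BITS) & ((1u << INDEX_BITS) - 1);
    }

    int main(void)
    {
        uint32_t base = 0x12340FC0u, offset = 0x80u;
        printf("set %u selected from low bits only; full address %08X\n",
               early_set_index(base, offset), (unsigned)(base + offset));
        return 0;
    }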