From: nedbrek on 23 Apr 2010 19:14

Hello all,

"Robert Myers" <rbmyersusa(a)gmail.com> wrote in message
news:0QlAn.90340$kj3.47473(a)newsfe08.iad...
> Anton Ertl wrote:
>> Robert Myers <rbmyersusa(a)gmail.com> writes:
>>> Sifting through the comments, and especially yours, I wonder if a
>>> candidate pair of smoking guns
>>
>> Smoking guns for what?
>>
> What didn't work the way it was supposed to (maybe the RSE) or what
> feature cost way too much with too little payback (too many architectural
> registers). Or maybe it's what many have implied: what do you expect from
> a design by committee?--in which case there are no smoking guns (the
> lethal shot that killed the world's most amazing processor).

The main "smoking gun" is that the instruction set doesn't matter. Installed
base matters a lot, and performance matters. Also, delivering product on
time and in quantity...

Of course, too many registers didn't help. I mentioned this elsewhere, but
I can add:
7 register bits * 4 operands + 6 predicate bits = 34 bit instruction (worst case)
=> no 32 bit instructions
=> bundling

Bundling in itself isn't too bad, you need somewhere to stash dependency
info.

But, Itanium tried to record independence - turns out, determining
dependence is much more important (see Smith's dependency chain processing
research).

Also, some dork tried to "fix" the (non-existent) "dispatch problem", and
ended up messing things up even worse. This led to:
extra decode info in the bundle template
=> not enough templates
=> lots of 41 bit NOPs
=> poor icache and front-end utilization

Ned
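
The bit-budget arithmetic in Ned's post can be checked with a few lines of Python. This is only a sketch of the counting argument: the register counts (128 general registers, 64 predicates) and the bundle layout (three 41-bit slots plus a 5-bit template) are the published IA-64 figures, while the constant names and the `nop_fraction` helper are made up here for illustration.

```python
# Sketch of the encoding arithmetic; names are illustrative, not from any manual.
GENERAL_REGISTERS = 128    # needs a 7-bit register specifier
PREDICATE_REGISTERS = 64   # needs a 6-bit qualifying-predicate field
REGISTER_OPERANDS = 4      # worst case, e.g. three sources plus one destination

reg_bits = (GENERAL_REGISTERS - 1).bit_length()     # 7
pred_bits = (PREDICATE_REGISTERS - 1).bit_length()  # 6

# Bits spent on operand and predicate specifiers before any opcode bits.
operand_bits = reg_bits * REGISTER_OPERANDS + pred_bits  # 34, so 32 bits cannot work

SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOTS_PER_BUNDLE = 3
bundle_bits = SLOTS_PER_BUNDLE * SLOT_BITS + TEMPLATE_BITS  # 128


def nop_fraction(useful_slots):
    """Fraction of a bundle's bits occupied by NOP slots."""
    return (SLOTS_PER_BUNDLE - useful_slots) * SLOT_BITS / bundle_bits


if __name__ == "__main__":
    print("specifier bits per instruction:", operand_bits)
    print("bundle bits:", bundle_bits)
    for useful in range(SLOTS_PER_BUNDLE + 1):
        print(useful, "useful slot(s):", nop_fraction(useful))
```

A bundle carrying a single useful instruction spends 82 of its 128 bits on NOPs, which is the icache and front-end cost the post is pointing at.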
From: MitchAlsup on 23 Apr 2010 18:51

On Apr 23, 2:06 am, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net>
wrote:
> While I have worked on, and advocated, handling reg-reg move
> instructions efficiently, this introduces a whole new level
> of complexity.
>
> Specifically, MOVE elimination, changing
>
> lreg2 := MOVE lreg1
> lreg3 := ADD lreg2 + 1

    Ireg2 := MOV Ireg1
    Ireg2 := OP Ireg2,<const or reg or mem>

This was a hardware optimization in K9, easily detected during trace
building. In the <const or reg> form, the two x86 instructions became a
single operation in the trace cache. The memory form became a two-op form
in the trace cache. Recognizing Andy's version (Ireg3) leads to the
ability to do the Move elimination found later in this post.

This and branch fusing made up most of the idiom recognition that was
done in that microarchitecture.

A side effect of the idiom recognizer was that:

(Move elimination)
    temp := MOVE Ireg1
    Ireg1 := MOVE Ireg2
    Ireg2 := MOVE temp

Would only cause 2 operations in the trace cache. All you had to track
down was that another operation destroyed 'temp' by the end of the trace
boundary.

It was after this kind of realization that I became convinced that
3-register architectural instruction formats waste bits that might be
better expended on other encoding stuff. 3-register microarchitectural
operation formats remain de rigueur. I also recognized that I will never
(97% level) get a chance to use those bits in more profitable endeavors.

Mitch
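
The MOV-plus-OP idiom described in this post can be illustrated with a small peephole pass over a list of operations. This is a toy sketch, not a description of the actual K9 trace builder: the `Op` tuple, the register names, and `fuse_mov_op` are all invented for the example, and it handles only the case where the MOV destination is immediately overwritten by the following operation (so no liveness tracking is needed).

```python
from typing import NamedTuple, Optional


class Op(NamedTuple):
    """One two-source operation; src2 is None for a plain MOV (hypothetical format)."""
    dst: str
    opcode: str
    src1: str
    src2: Optional[str] = None


def fuse_mov_op(trace):
    """Rewrite `d := MOV s; d := OP d, x` as the single 3-operand `d := OP s, x`."""
    fused = []
    i = 0
    while i < len(trace):
        cur = trace[i]
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        if (nxt is not None
                and cur.opcode == "MOV"
                and nxt.opcode != "MOV"
                and nxt.dst == cur.dst      # the MOV result is overwritten, so it is dead
                and nxt.src1 == cur.dst):
            # After the MOV, d holds the value of s, so any read of d becomes a read of s.
            src2 = cur.src1 if nxt.src2 == cur.dst else nxt.src2
            fused.append(Op(nxt.dst, nxt.opcode, cur.src1, src2))
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


if __name__ == "__main__":
    trace = [
        Op("r2", "MOV", "r1"),
        Op("r2", "ADD", "r2", "1"),
        Op("r4", "SUB", "r2", "r3"),
    ]
    for op in fuse_mov_op(trace):
        print(op)
```

Andy's variant, where the second operation writes a different register, would additionally require knowing that the MOV destination is dead by the end of the trace, which this sketch does not attempt.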
From: Brett Davis on 21 Apr 2010 22:50

In article <hqmoq8$5mt$1(a)news.eternal-september.org>,
"nedbrek" <nedbrek(a)yahoo.com> wrote:

> "Brett Davis" <ggtgp(a)yahoo.com> wrote in message
> > The "acheck" and "use" stuff makes me go: What!?! Are you serious!?!
>
> Hehe, the ALAT was a disaster.

ALAT:
http://en.wikipedia.org/wiki/Advanced_Load_Address_Table

So this was a special sidecar cache that held 32 long words?
From a hardware design point of view, dealing with all the special cases
would make that a disaster. (Grep the Itanic manual to see what I mean.)

I assume it had some sort of redeeming benefit, like a load-to-use delay
of fewer cycles?
In fairy dreamland it could load into the register by itself,
and turn the check instruction into a NOP, saving a cycle. ;)

Versus just using a cache prefetch and an ordinary load?

Brett
From: Robert Myers on 24 Apr 2010 13:42

nedbrek wrote:

> Bundling in itself isn't too bad, you need somewhere to stash dependency
> info.
>
> But, Itanium tried to record independence - turns out, determining
> dependence is much more important (see Smith's dependency chain processing
> research).

The paper I found

An Instruction Set and Microarchitecture for Instruction Level
Distributed Processing

Ho-Seop Kim and James E. Smith
Department of Electrical and Computer Engineering
University of Wisconsin-Madison

advertises the ability to run at a high clock rate and also proposes
binary translation. This paper was, of course, before the Pentium 4
clock rate debacle, before Transmeta folded, and before power
consumption became an obsession.

That is not to say that the idea may not still have merit. On the face of
it, keeping dependent chains together has the obvious advantage of
increasing locality, so that computation can be efficiently parceled out
over threads in a core, over separate cores on a chip, or even
conceivably over multiple sockets.

Robert.
From: Niels Jørgen Kruse on 24 Apr 2010 14:15

Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:

> There is at least one paper around that claims that moving execute
> down one stage to make load-use latency shorter (at the cost of higher
> ALU-Load latencies and higher ALU-Branch latencies) is a win even in
> load/store architectures. And at least one of the MIPS
> implementations (R8000? R10000?) actually had that arrangement.

IBM in-order PPC designs do this. The Cell PPE is skewed 3 cycles and
POWER6 is skewed 2 cycles for simple integer instructions.

> For IA-64, there is the additional problem that it does not have a
> reg+const addressing mode, so I guess it will see more ALU-load
> dependencies than most other architectures; this can change the
> balance.

Good point.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
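
The trade-off discussed here (shorter load-use latency paid for with longer ALU-to-load latency) can be made concrete with a toy stall counter. Everything in it is an assumption made for illustration: a single-issue in-order pipeline with full forwarding, a `skew` parameter giving how many stages the ALU sits behind address generation, and a one-cycle cache by default. It does not model any real MIPS, PowerPC or IA-64 implementation, and it ignores the ALU-to-branch penalty.

```python
def stall_cycles(program, skew, cache_latency=1):
    """Count stall cycles for a list of (kind, dst, srcs) with kind in {"load", "alu"}.

    For a load, srcs are the address registers and are needed at issue time.
    For an ALU op, operands are needed `skew` cycles after issue.
    """
    ready = {}       # register -> first cycle in which its value can be consumed
    next_slot = 0    # earliest cycle the next instruction could issue without stalling
    stalls = 0
    for kind, dst, srcs in program:
        offset = skew if kind == "alu" else 0
        earliest = max((ready.get(s, 0) - offset for s in srcs), default=0)
        issue = max(next_slot, earliest)
        stalls += issue - next_slot
        # An ALU result appears `skew` cycles after issue, a load result after the cache access.
        latency = skew if kind == "alu" else cache_latency
        ready[dst] = issue + latency + 1
        next_slot = issue + 1
    return stalls


if __name__ == "__main__":
    load_use = [("load", "r1", ["r2"]), ("alu", "r3", ["r1"])]
    # Address arithmetic feeding a load, as needed without a reg+const addressing mode.
    alu_load = [("alu", "r1", ["r2"]), ("load", "r3", ["r1"])]
    for skew in (0, 1):
        print("skew", skew,
              "load-use stalls:", stall_cycles(load_use, skew),
              "alu-load stalls:", stall_cycles(alu_load, skew))
```

In this model a skew of one removes the load-use bubble and adds one to every ALU-to-load dependency, so which arrangement wins depends on how often code computes an address in a separate instruction right before the load.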