From: nmm1 on 20 May 2010 09:21

In article <ht1k70$lvu$1(a)news.eternal-september.org>,
ned <nedbrek(a)yahoo.com> wrote:
>>>
>>>That is to say, I would have put my money on the hardware guys to find
>>>parallelism in the instruction stream while software guys were still
>>>dithering about language aesthetics. I thought this way all the way
>>>up to the 90nm step for the P4.
>>
>> As you know, I didn't. The performance/clock factor (which is what
>> the architecture delivers) hasn't improved much.
>
>Near the end of my uarch career, I came to realize that much of "the
>game" is keeping perf/clock from collapsing while ramping clock. At
>least, that is about the only thing that has been successful.

Precisely. But we haven't seen any increase in clock rate in nearly a
decade now - isn't it time to accept that a rethink is needed?


Regards,
Nick Maclaren.
From: Andy 'Krazy' Glew on 20 May 2010 09:44

On 5/20/2010 5:50 AM, Morten Reistad wrote:
> Yes, the compatibility argument is important. But noone can do magic.
> We are now at the end, or pretty close, of the rope regarding
> single processor performance on von-neumann computers. We used
> pipelining, oo execution and lots of other tricks to push this
> envelope a hundredfold or more. Now we are up against handling the
> logic expression in the code.

"Up against handling the logic expressions in the code"?

Hardly. Unless you mean "up against handling the logic expressions in
the code, whose execution is delayed because of memory".

Run this experiment on your favorite simulator. Note that many
simulators are not capable of such limit studies.

Set the latency of all arithmetic operations - integer, FP, logical -
to 0 cycles. Allow an infinite number of them to execute per cycle.
But keep all of the rest of the system the same - instruction window,
# cache misses outstanding per cycle, etc.

Your speedup is usually not that much. Usually not 2X. (Except for
multimedia, streaming codes.) You are usually limited by memory.

Do the same thing for memory operations, especially cache misses - 0
latency, infinite bandwidth - and you get a much better result.
Especially once you decide how to handle branch mispredictions: in
such an idealized model, should a branch misprediction cost 0 cycles
(in which case you see great speedups) or N cycles, where N is the
pipeline depth (in which case you see good, but not great, speedups)?
You see great speedups again if you idealize the pipeline in the same
limit study - and if everything ends up taking zero cycles, run that
limit study anyway, because it will probably show you that there are
artifacts in your simulator.

Do similar limit studies - make certain cache hit and miss latencies
zero. So long as there is a non-zero cache miss latency, you will see
that the instruction window is a bottleneck. Not just static size
(size of RS, ROB), but also dynamic size (distance between branch
mispredictions). Making the window infinite in size does not help
unless you either reduce branch mispredictions, or have multiple
sequencers.

I suspect that there is, or should be, a limit study that can be run
to examine the speed of light wrt prefetchers into a memory hierarchy
of fixed physical parameters. I.e. you cannot make all memory close
to all processing elements - you are necessarily limited, literally,
by the speed of light. Unfortunately, I do not know how to do it
right now - I think I have figured it out more than once, but that
would be in notebooks that I no longer have access to.

I suspect that it can or should be possible to show that prefetching
into a single L1/L2/L3/... cache hierarchy with a single arbitrarily
large instruction window is suboptimal. I suspect that it is
necessary to have a multi-headed cache hierarchy - multiple L1s per
L2, multiple L2s per L3, etc. - in order to get the best performance.

This applies equally well to explicit parallel programming as to the
implicit parallelism of OOO dataflow execution, ILP and MLP. It
applies just as well to prefetchers.
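A minimal sketch of the flavour of such a limit study, using a toy
dataflow critical-path model over a synthetic trace. All latencies,
the miss rate, and the trace itself are made-up assumptions, and a
real study would keep the instruction window, outstanding misses, and
branch mispredictions finite, as described above:

# Toy limit-study sketch: compare "free arithmetic" vs. "free memory"
# on a synthetic dependence DAG.  Purely illustrative assumptions.
import random

random.seed(1)

ALU_LAT, HIT_LAT, MISS_LAT = 1, 3, 200   # assumed baseline latencies (cycles)
MISS_RATE = 0.05                         # assumed fraction of loads that miss

def make_trace(n=20000):
    """Synthetic trace: each op depends on one or two earlier ops."""
    trace = []
    for i in range(n):
        kind = "load" if random.random() < 0.3 else "alu"
        deps = [random.randrange(i)] if i else []
        if i > 1 and random.random() < 0.5:
            deps.append(random.randrange(i))
        trace.append((kind, deps, random.random() < MISS_RATE))
    return trace

def cycles(trace, alu_lat, hit_lat, miss_lat):
    """Critical-path length, assuming infinite issue width and window."""
    finish = []
    for kind, deps, missed in trace:
        lat = alu_lat if kind == "alu" else (miss_lat if missed else hit_lat)
        finish.append(max((finish[d] for d in deps), default=0) + lat)
    return max(finish)

trace = make_trace()
base     = cycles(trace, ALU_LAT, HIT_LAT, MISS_LAT)
free_alu = cycles(trace, 0,       HIT_LAT, MISS_LAT)   # arithmetic idealized
free_mem = cycles(trace, ALU_LAT, 0,       0)          # memory idealized
print("speedup from free arithmetic: %.2fx" % (base / free_alu))
print("speedup from free memory:     %.2fx" % (base / free_mem))

In this crude model the free-memory case gives the larger speedup,
because the 200-cycle misses dominate the critical path; whether that
holds on a real workload is exactly what the limit study is meant to
measure.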
From: Robert Myers on 20 May 2010 12:25

Andy 'Krazy' Glew wrote:
> On 5/19/2010 11:23 PM, Andy 'Krazy' Glew wrote:
>
>> I must admit that I am puzzled as to why this happened. I thought that
>> P6 showed (and is still showing) that OOO microarchitecture can be
>> successful. I would have expected Intel to bet on the proven winners, by
>> doing it over again. Didn't happen.
>
> One hypothesis, based on observation:
>
> Many non-x86 processor companies failed at about this time:
> DEC, IBM downsized, RISC.
>
> Many refugees from these companies spread throughout the rest of the
> industry, including Intel and AMD, carrying their attitudes that of
> course OOO could not be pushed further.
>
> At the beginning of Willamette I remember Dave Sager coming back from an
> invitation only meeting - Copper Mountain? - of computer architects who
> all agreed that OOO could not be pushed further. Nobody asked my
> opinion. And, I daresay, that nobody at that conference had actually
> built a successful OOO processor; quite possibly, the only OOO
> experience at that conference was with the PPC 670.

The big selling point of IA-64 was that it didn't have the N^2
complexity of out-of-order. Instead, true to the law of conservation
of complexity, it had lots of complexity elsewhere, and much of it
turned out to be of little benefit.

Patterson had been saying for a while (in print) that cores were
already too big and too complex. Would an insiders-only meeting have
had anything to add?

The economics of the business say you need a one-size-fits-all design,
or maybe two, as previously suggested: one for very low power and
another for everything else. Most desktop users already have more
power than they need or even can use, and many of the remaining volume
consumers benefit more from extra cores than they would from
aggressive microarchitecture.

Who does that leave? A boutique processor for Wall Street quants who
are forever trying to outguess each other by a couple of milliseconds,
maybe.

Robert.
From: David Kanter on 20 May 2010 22:02

> Hmm, I think I am just realizing that we need different metrics, with
> different acronyms. I want to express the number of outstanding
> operations. IPC is not a measure of ILP. OOO window size is extreme.
> A lower number is the number of instructions simultaneously in some
> stage of execution; more precisely, simultaneously at the same stage
> of execution.
>
> "SIX"?: simultaneous instructions in execution? "SIF"?: ... in
> flight? "SMF"?: simultaneous memory operations in flight?

What do you mean by 'same stage of execution'?

Anyway, I think the concept you are trying to get at is what I'd call
a 'cross section'. Essentially if you think of the CPU as a physical
pipeline (or the memory hierarchy as a pipeline), you want the cross
sectional area. So perhaps the right terms are 'memory cross section'
and 'instruction cross section'?

DK
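(A hedged aside on quantifying such a cross section: by Little's Law,
occupancy = throughput x residence time, so a memory system sustaining,
say, 2 misses per cycle at an average 150-cycle miss latency -
illustrative numbers only - implies a memory cross section of roughly
300 outstanding misses.)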
From: Andy 'Krazy' Glew on 21 May 2010 11:19
On 5/20/2010 7:02 PM, David Kanter wrote:
>> Hmm, I think I am just realizing that we need different metrics, with
>> different acronyms. I want to express the number of outstanding
>> operations. IPC is not a measure of ILP. OOO window size is extreme.
>> A lower number is the number of instructions simultaneously in some
>> stage of execution; more precisely, simultaneously at the same stage
>> of execution.
>>
>> "SIX"?: simultaneous instructions in execution? "SIF"?: ... in
>> flight? "SMF"?: simultaneous memory operations in flight?
>
> What do you mean by 'same stage of execution'?
>
> Anyway, I think the concept you are trying to get at is what I'd call
> a 'cross section'. Essentially if you think of the CPU as a physical
> pipeline (or the memory hierarchy as a pipeline), you want the cross
> sectional area. So perhaps the right terms are 'memory cross section'
> and 'instruction cross section'?
>
> DK

Exactly: a cross section.

I was trying to use "same stage of execution" to filter out pipeline
effects. E.g. a machine with a 42-deep pipeline, that is capable of
only one load per cycle, with a latency of 1 cycle from load data to a
dependent load address, etc., might be said to have 42 loads in flight
at all times. I.e. a cross section of 42. However, most of that
parallelism would be due to instruction fetch effects, not the actual
execution parallelism.
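A back-of-the-envelope sketch of that distinction, treating the cross
section as occupancy via Little's Law (occupancy = throughput x
residence time). The split into pipeline occupancy versus execution
occupancy below is one assumed framing, with the numbers from the
example above plugged in:

# Cross section as occupancy (Little's Law): throughput x residence time.
# Numbers taken from the 42-deep pipeline example; the pipeline/execution
# split is an assumed framing, not an established metric.

PIPE_DEPTH      = 42   # stages from fetch to retirement
LOADS_PER_CYCLE = 1    # sustained load throughput
LOAD_TO_LOAD    = 1    # cycles from load data to a dependent load address

# Counting every load anywhere in the pipeline ("in flight"):
pipeline_cross_section = LOADS_PER_CYCLE * PIPE_DEPTH     # 42 loads

# Counting only loads overlapping in actual execution (a dependent
# chain can re-issue every LOAD_TO_LOAD cycles):
execution_cross_section = LOADS_PER_CYCLE * LOAD_TO_LOAD  # 1 load

print(pipeline_cross_section, execution_cross_section)    # 42 vs. 1

Most of the factor of 42 is fetch-pipeline occupancy, which is exactly
the pipeline effect the "same stage of execution" wording was trying
to filter out.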