From: Daniel A. Jimenez on 15 Oct 2009 20:09

In article <2b90$4ad7b7c8$45c49ea8$21677(a)TEKSAVVY.COM>,
EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>Daniel A. Jimenez wrote:
>> ...
>> Trace cache is another more-or-less recent microarchitectural innovation
>> that allowed Pentium 4 to get away with decoding one x86 instruction
>> per cycle and still have peak IPC greater than 1.
>
>Actually trace cache goes back to the VAX HPS, circa 1985.
>They called the decoded instruction cache a "node cache".
>As far as I know, VAX HPS was never built though.

Was it a trace cache, i.e., were decoded instructions stored in order of
execution rather than in the order of the program text?

>> Cracking instructions into micro-ops, scheduling the micro-ops, then fusing
>> the micro-ops back together in a different way later in the pipeline allows
>> an effectively larger instruction window and more efficient pipeline.
>> That's a relatively recent innovation, too.
>
>Except for the fused micro-ops, this was also VAX HPS.

The fused micro-ops are the innovation. They allow an effectively larger
instruction window, so more ILP and more performance. One can argue that
micro-ops are equivalent to microcode, which predates minis.
--
Daniel Jimenez djimenez(a)cs.utexas.edu
"I've so much music in my head" -- Maurice Ravel, shortly before his death.
" " -- John Cage
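To make the distinction in that question concrete, here is a minimal sketch
in C. Everything in it is invented for illustration (the structure names,
the field widths, the six-uop line size); it describes neither HPS, nor the
Pentium 4, nor any shipping design. An ordinary decoded-instruction cache
is tagged by program-text address and holds micro-ops in program order; a
trace cache line holds micro-ops in predicted execution order and is tagged
by both its entry PC and the branch directions baked into the trace:

    #include <stdio.h>
    #include <stdint.h>

    #define UOPS_PER_LINE 6

    typedef struct { uint32_t encoding; } uop_t;  /* a decoded micro-op */

    /* Decoded-instruction ("node") cache line: micro-ops for the
       instructions starting at 'pc', in program-text order.          */
    typedef struct {
        uint64_t pc;                /* tag: address in the program text */
        uop_t    uops[UOPS_PER_LINE];
        int      valid;
    } dcache_line_t;

    /* Trace cache line: micro-ops in *predicted execution order*,
       possibly spanning taken branches.  The tag is the entry PC plus
       the branch directions inside the trace, so the same static code
       can appear in several traces.                                   */
    typedef struct {
        uint64_t entry_pc;          /* single entry point of the trace  */
        uint8_t  branch_dirs;       /* bit i = direction of i-th branch */
        uint8_t  num_branches;
        uop_t    uops[UOPS_PER_LINE];
        int      valid;
    } tcache_line_t;

    /* Hit test: a trace matches only if fetch enters at its head *and*
       the predictor agrees with every direction baked into it.        */
    static int tcache_hit(const tcache_line_t *line,
                          uint64_t fetch_pc, uint8_t predicted_dirs)
    {
        uint8_t mask = (uint8_t)((1u << line->num_branches) - 1u);
        return line->valid &&
               line->entry_pc == fetch_pc &&
               (predicted_dirs & mask) == (line->branch_dirs & mask);
    }

    int main(void)
    {
        tcache_line_t t = { .entry_pc = 0x1000, .branch_dirs = 0x1,
                            .num_branches = 1, .valid = 1 };
        printf("predict taken:     %s\n",
               tcache_hit(&t, 0x1000, 0x1) ? "hit" : "miss");
        printf("predict not taken: %s\n",
               tcache_hit(&t, 0x1000, 0x0) ? "hit" : "miss");
        return 0;
    }

Note the consequence of the tag: a trace is entered only at its head, and
the same static instructions can live in several traces, one per
branch-direction pattern.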
From: EricP on 15 Oct 2009 21:07

Daniel A. Jimenez wrote:
> In article <2b90$4ad7b7c8$45c49ea8$21677(a)TEKSAVVY.COM>,
> EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
>> Daniel A. Jimenez wrote:
>>> ...
>>> Trace cache is another more-or-less recent microarchitectural innovation
>>> that allowed Pentium 4 to get away with decoding one x86 instruction
>>> per cycle and still have peak IPC greater than 1.
>> Actually trace cache goes back to the VAX HPS, circa 1985.
>> They called the decoded instruction cache a "node cache".
>> As far as I know, VAX HPS was never built though.
>
> Was it a trace cache, i.e., were decoded instructions stored in order of
> execution rather than in the order of the program text?

They were discussing a number of design issues and evaluating different
possible ways of implementing it, but yes, I think it is the same idea.
They state that the nodes are entered into the node cache in decode order,
and must be merged with certain non-microcode values such as immediate
literal instruction constants. It must also handle VAX-isms such as the
procedure call instructions CALLG/CALLS, which pick up their register save
mask from the routine entry point; that mask must be saved into the cache
as well.

>>> Cracking instructions into micro-ops, scheduling the micro-ops, then fusing
>>> the micro-ops back together in a different way later in the pipeline allows
>>> an effectively larger instruction window and more efficient pipeline.
>>> That's a relatively recent innovation, too.
>> Except for the fused micro-ops, this was also VAX HPS.
>
> The fused micro-ops are the innovation. They allow an effectively larger
> instruction window, so more ILP and more performance. One can argue that
> micro-ops are equivalent to microcode, which predates minis.

But for the VAX, instruction decode was very much a bottleneck, because
the instruction set was originally designed for a sequential LL-style
parse. The node cache bypasses that bottleneck and, in theory, allows
parallel scheduling of micro-ops to their OoO function units.

Eric
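As a rough illustration of the crack-then-fuse idea being discussed (the
instruction form and micro-op names here are invented, and real fusion
rules are far more constrained than this): a memory-destination add cracks
into load/add/store micro-ops for scheduling, while an adjacent compare
and conditional branch fuse into a single entry, so one window slot tracks
two program instructions:

    #include <stdio.h>

    typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE, UOP_CMP,
                   UOP_JCC, UOP_CMP_JCC_FUSED } uop_op_t;

    typedef struct { uop_op_t op; } uop_t;

    /* Crack "add [mem], reg" into three simple micro-ops. */
    static int crack_add_mem(uop_t *out)
    {
        out[0].op = UOP_LOAD;    /* tmp   <- [mem]     */
        out[1].op = UOP_ADD;     /* tmp   <- tmp + reg */
        out[2].op = UOP_STORE;   /* [mem] <- tmp       */
        return 3;
    }

    /* Fuse each adjacent cmp + jcc pair into one entry: the pair now
       occupies a single slot in the scheduler, so the same window
       capacity covers more program instructions.                     */
    static int fuse(uop_t *buf, int n)
    {
        int w = 0;
        for (int r = 0; r < n; r++) {
            if (r + 1 < n &&
                buf[r].op == UOP_CMP && buf[r + 1].op == UOP_JCC) {
                buf[w++].op = UOP_CMP_JCC_FUSED;
                r++;                      /* consumed two micro-ops */
            } else {
                buf[w++] = buf[r];
            }
        }
        return w;                         /* new (smaller) count */
    }

    int main(void)
    {
        uop_t buf[8];
        int n = crack_add_mem(buf);
        buf[n++].op = UOP_CMP;
        buf[n++].op = UOP_JCC;
        printf("before fusion: %d uops, after: %d uops\n", n, fuse(buf, n));
        return 0;
    }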
From: "Andy "Krazy" Glew" on 16 Oct 2009 01:22

Jean wrote:
> In the last couple of decades the exponential increase in computer
> performance was because of advancements in both computer
> architecture and fabrication technology.
> What will be the case in the future? Can I comment that the next major
> leap in computer performance will come not from breakthroughs in
> computer architecture but rather from new underlying technology?

I am not so sure.

There are significant improvements to be had in single thread performance
by going to really large instruction windows. Multilevel instruction
windows. The key is how to do this in a smart and power efficient manner.
I have a patent pending on such techniques, on inventions I made outside
Intel. (I owe y'all an article on this.)

I doubt that this will deliver performance improvements linear in the
number of transistors. However, all of the evidence that I have seen
indicates that it will deliver performance proportional to the square
root of the number of transistors.

By the way, some people call this - performance proportional to the
square root of the number of transistors - Pollack's Law. Fred Pollack,
my old boss, presented it at some big conferences. Myself, I told Fred
about this "law", which I first encountered in Tjaden and Flynn's paper
that said performance is proportional to the square root of the number
of branches looked past. After encountering such square root laws in
several places, I conjectured the generalization, which seems to be
confirmed by many metrics.

To differentiate myself, let me conjecture further that in a space of
dimension d, performance is proportional to the (d-1)/d power of the
number of devices. E.g. in 3D, I conjecture that performance is
proportional to the 2/3 power of the number of devices.

I suspect that there are significant improvements in parallel processing
to be made, most likely in the "how to make it easier" vein. I'm in the
many, many processors camp.

I believe that there is significant potential to apply parallelism to
improve single thread performance. Speculative multithreading, SpMT. I
wish Haitham Akkary luck as he carries the torch for this research (DMT).
I wish I could do the same.

Along the lines of technology, as indicated above I suspect that 3D
integration could bring many benefits. But heat dissipation is such a
big problem that I doubt that it is reasonable to hope for this in the
next 10 years. I.e. I doubt that we will have cubes of logic intermixed
with memory 1 cm on a side. However, incremental progress will be made:
2-4 layers of transistors within 10 years.

Although smaller, faster, more power efficient devices are always a
possibility, I think that the human brain points out the capabilities
of relatively slow computation, albeit with complex elements and high
connectivity. I suppose this counts as technology, although not
necessarily on the traditional axis of evolution.
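Taking the exponents above at face value, the scaling is easy to tabulate.
The snippet below is illustrative arithmetic only, not a model of any
machine: doubling the device count buys about 1.41x under the square-root
rule, and about 1.59x under the conjectured 2/3-power rule for 3D:

    #include <stdio.h>
    #include <math.h>

    /* Pollack-style scaling: perf ~ N^((d-1)/d) for dimension d.
       d = 2 gives the familiar square root; d = 3 the 2/3 power. */
    static double rel_perf(double n_ratio, int d)
    {
        return pow(n_ratio, (double)(d - 1) / (double)d);
    }

    int main(void)
    {
        double ratios[] = { 2.0, 4.0, 10.0 };
        for (int i = 0; i < 3; i++)
            printf("%5.1fx devices -> %.2fx perf (2D), %.2fx perf (3D)\n",
                   ratios[i], rel_perf(ratios[i], 2), rel_perf(ratios[i], 3));
        return 0;
    }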
From: "Andy "Krazy" Glew" on 16 Oct 2009 01:34

EricP wrote:
> Daniel A. Jimenez wrote:
>> ...
>> Trace cache is another more-or-less recent microarchitectural innovation
>> that allowed Pentium 4 to get away with decoding one x86 instruction
>> per cycle and still have peak IPC greater than 1.
>
> Actually trace cache goes back to the VAX HPS, circa 1985.
> They called the decoded instruction cache a "node cache".
> As far as I know, VAX HPS was never built though.

Sorry, no. The HPS (and HPSm) node cache was not a trace cache. It did
not have a single entry point for a trace of instructions.

I invented the trace cache, or at least the term "trace cache", while
taking the first class Wen-mei Hwu (the H in HPS) taught after receiving
his Ph.D. and coming to UIUC in 1986 or 1987. I invented it to solve the
problems that a decoded instruction cache had with variable length
instructions (and also to support forms of guarded execution, what would
now be called control independence or hammocks or hardware if-conversion).

Wen-mei was my MS advisor. I am sure that he would have informed me if
the trace cache was just the node cache rehashed (and given me a bash,
and thrown it in the trash, and not given me any cash).

Alex Peleg and Uri Weiser may have preceded me in inventing the trace
cache, and certainly patented it first. (I never patented anything at
UIUC, or prior to Intel.) But so far as I know, I invented the term
"trace cache", and popularized it at Intel in 1991, before Peleg and
Weiser.
From: nmm1 on 16 Oct 2009 03:54
In article <hb86g3$fo6$1(a)apu.cs.utexas.edu>,
Daniel A. Jimenez <djimenez(a)cs.utexas.edu> wrote:
>
>Sorry, can't let that one go. There have been tremendous improvements in
>branch prediction accuracy from the late eighties to today. Without
>highly accurate branch prediction, the pipeline is filled with too many
>wrong path instructions, so it's not worth going to deeper pipelines.
>Without deeper pipelines we don't get higher clock rates. So without
>highly accurate branch predictors, clock rates and performance would be
>much worse than they are today. If we hadn't hit the power wall in the
>early 2000s, we would still be improving performance through better
>branch prediction and deeper pipelines.

Oh, really? I don't see it. The big difference is that modern processes
remove the size limits that made earlier branch predictors relatively
ineffective. And that's not architecture.

>History-based memory schedulers are another recent innovation that
>promises to improve performance significantly.

Don't bet on it. A hell of a lot of the papers on the underlying
requirements are based on a gross abuse of statistics. They make the
cardinal (and capital) error of confusing decisions based on perfect
knowledge (i.e. foresight) with admissible decision rules (which can use
only history).

The point here is that, as with branch prediction, a hell of a lot of the
important problem codes are precisely those that are least amenable to
such optimisations. The ONLY solution is to get them rewritten in a more
civilised paradigm.


Regards,
Nick Maclaren.
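The arithmetic behind the quoted pipeline argument is easy to reproduce.
The numbers in this sketch are illustrative assumptions (one branch in
five instructions, a refill penalty equal to the pipeline depth, an
otherwise ideal 4-wide machine), not measurements from any processor:

    #include <stdio.h>

    /* Effective IPC for an ideal machine of a given width, charging
       each branch misprediction a refill penalty of ~pipeline depth:
       CPI = 1/width + branch_freq * miss_rate * depth               */
    static double effective_ipc(double width, double branch_freq,
                                double miss_rate, double depth)
    {
        double cpi = 1.0 / width + branch_freq * miss_rate * depth;
        return 1.0 / cpi;
    }

    int main(void)
    {
        const double width = 4.0, branch_freq = 0.2;
        const double depths[] = { 10, 20, 30 };
        const double misses[] = { 0.10, 0.05, 0.01 };  /* 90%..99% */
        for (int d = 0; d < 3; d++)
            for (int m = 0; m < 3; m++)
                printf("depth %2.0f, miss %4.1f%% -> IPC %.2f\n",
                       depths[d], misses[m] * 100.0,
                       effective_ipc(width, branch_freq,
                                     misses[m], depths[d]));
        return 0;
    }

Under these assumptions a 30-stage pipe at 90% accuracy limps along at
about 1.2 IPC, while at 99% it reaches about 3.2, which is the sense in
which deeper pipelines presuppose better predictors.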