From: Andy 'Krazy' Glew on 2 Jun 2010 00:17

On 6/1/2010 12:31 PM, Robert Myers wrote:
> On Jun 1, 2:48 pm, n...(a)cam.ac.uk wrote:
>> In article <4C041839.90...(a)patten-glew.net>,
>> Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>>
>>> So, first, it is well-known that there is some potential in simply
>>> increasing instruction window size. The problem is how costly is the
>>> hardware, esp. power wise, to do so.
>>
>> Yes.
>>
>>> E.g. the kilo-instruction papers report 3.5x for FP and 1.5X for
>>> integer, going from instruction windows of 128 to 4K instructions.
>>
>>> Now, that is not a very good payoff. Definitely sublinear. ....
>>
>> Bluntly, it's dire. There are lots of other ways to optimise the
>> 'floating-point' workloads to that scale, and a 50% improvement for
>> a 32-fold increase in window size (for integer) is horrible.
>
> But what are the alternatives? Computers deal with ever-larger
> datasets, with ever-increasing numbers of processors, and the data
> gradually slide off the latency-tolerance horizon of any given
> processor. Faster memory? Chaa, right. Violate apparently basic
> laws of physics? Maybe with some as yet undiscovered twist on the EPR
> paradox.
>
> Personally, I think speed-up is the wrong focus. Sooner or later, for
> reasons that seem pretty fundamental to me, we will be back to
> bragging, as Intel once did, about instructions in flight.

When did Intel ever brag about that?

I missed it. I would have clipped and framed such an ad.
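To put a rough number on how sublinear the window-size payoff quoted above
is - a minimal sketch, assuming (purely for illustration) that speedup
follows a power law in window size, fit to the two data points quoted above:

# Fit speedup ~ W^k to the quoted numbers (128 -> 4K window; 3.5x FP,
# 1.5x integer). The power-law form is an assumption, used only to get
# a feel for the curve.
from math import log

w_ratio = 4096 / 128  # 32x larger instruction window
for name, speedup in [("FP", 3.5), ("integer", 1.5)]:
    k = log(speedup) / log(w_ratio)
    print(f"{name}: {speedup}x speedup over a {w_ratio:.0f}x window -> exponent ~ {k:.2f}")
# Prints exponents of roughly 0.36 (FP) and 0.12 (integer); under this
# fit, doubling the window buys only about 8% more integer performance.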
From: Robert Myers on 2 Jun 2010 00:36

On Jun 2, 12:17 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>
> > Personally, I think speed-up is the wrong focus. Sooner or later, for
> > reasons that seem pretty fundamental to me, we will be back to
> > bragging, as Intel once did, about instructions in flight.
>
> When did Intel ever brag about that?
>
> I missed it. I would have clipped and framed such an ad.

I'm not sure if it was an ad or a glitzy web page on Intel's site. It
may have been both. It was when the P4 had just been announced and its
performance was less than impressive. Since Intel couldn't sell
benefits (like performance), it had to sell features, like the trace
cache and a huge number of instructions in flight.

You see how my education in computer microarchitecture has progressed?
I remember it because that's where I learned the phrase "instructions
in flight."

Robert.
From: Robert Myers on 3 Jun 2010 14:58

On Jun 2, 12:15 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>
> You are correct in observing that my batch scheduling across
> processor/cluster boundaries is a retreat from such analysis. But, I
> emphasize: "batch" is just a technique for creating a larger contiguous
> instruction window. Its strength lies in its simplicity. It does not
> have great upside, and it isn't supposed to.
>

A few random thoughts provoked by this exchange, for whatever they are
worth:

First, we may be experiencing a version of the RISC "revolution" redux.
The logic behind that non-revolution seemed so iron-clad and at the same
time transparent that it was hard to imagine how something with as many
warts as x86 would survive. We all know how that came out. It may seem
that all the ways forward with run-time hardware are too complicated and
too power-hungry, but we've seen fairly amazing progress with very
low-cost solutions to branch prediction and memory disambiguation.
Rather than favoring the path forward as some revolutionary new
architecture, the odds favor picking off the easiest pieces of a really
smart run-time analyzer one by one. If we do have a language revolution,
it will probably happen the same way.

A second, unrelated thought is that I think I know the answer to my own
question about how you finessed the theoretical N^2 complexity of
out-of-order operation. Inter-instruction dependencies have a
probability distribution, and if that probability distribution is peaked
near zero (like, say, a Gaussian) with a very small number of
dependencies outside some instruction strip W, then the complexity can
be arranged to be more like (N/W)*W^2 = N*W, because the dependency
matrix can be dealt with in blocks with some tolerable number of
outliers. If N = M*W, where M is the number of batches, the complexity
comes out to be N^2/M, with some presumably small cost of dealing with
dependencies outside the statistically-likely dependency window. That
just formalizes what to me was the implicit assumption that dependencies
are localized (a small counting sketch follows below). You may have
always understood what I just observed, either explicitly or implicitly.

A third thought, which you may not now care for, is that it's hard to
imagine the best solution coming out of a shop that doesn't control both
the compiler and the microarchitecture, because the answer to "should
the parallelism come from the software end or the hardware end?" is
probably "both."

Robert.
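To make the blocked-dependency counting in the second thought above
concrete - a minimal sketch, with N and W as purely illustrative values and
with out-of-window dependencies assumed rare enough to be handled
separately:

# Count pairwise dependency checks for a naive O(N^2) scheme versus a
# windowed scheme that only compares instructions within a window of W.

def naive_checks(n):
    # every instruction compared against every earlier instruction: ~N^2/2
    return n * (n - 1) // 2

def windowed_checks(n, w):
    # each instruction compared only against the previous w instructions: ~N*W
    return sum(min(i, w) for i in range(n))

N, W = 4096, 128
M = N // W  # number of batches
print("naive   :", naive_checks(N))
print("windowed:", windowed_checks(N, W))
print("ratio   : ~%.0fx fewer checks (about M/2, consistent with N^2/M)" %
      (naive_checks(N) / windowed_checks(N, W)))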
From: Andy 'Krazy' Glew on 4 Jun 2010 00:15

On 6/3/2010 11:58 AM, Robert Myers wrote:
> On Jun 2, 12:15 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>
>> You are correct in observing that my batch scheduling across
>> processor/cluster boundaries is a retreat from such analysis. But, I
>> emphasize: "batch" is just a technique for creating a larger contiguous
>> instruction window. Its strength lies in its simplicity. It does not
>> have great upside, and it isn't supposed to.
>>
>
> A few random thoughts provoked by this exchange, for whatever they are
> worth:
>
> A third thought, which you may not now care for, is that it's hard to
> imagine the best solution coming out of a shop that doesn't control
> both the compiler and the microarchitecture, because the answer to
> "should the parallelism come from the software end or the hardware
> end?" is probably "both."

I agree - or, rather, I strongly agreed back in 1991, and I overall
agree now - although experience tends to suggest that it ain't
necessarily so.

Now, as for "should the parallelism come from the software end, or the
hardware end?" You have created a false dichotomy: it is really a
triangle with three corners:

1. explicit programmer-expressed parallelism

2. compiler-expressed parallelism
   2.1. compiler parallelization of code that is non-parallel from the
        programmer
   2.2. compiler support for parallelism of code that the programmer has
        expressed as parallel

3. hardware-supported parallelization
   3.1. of the explicit parallelism of 1. and 2.2
   3.2. of code that the programmer and compiler treat as non-parallel

Of the above, I am most heartily in favor of explicit parallelism all
along the way: 1. + 2.2. + 3.1.

I have a lot of experience in 3.2, hardware parallelization (and in
3.1, hardware support for explicit parallelism).

I'm all in favor of 2.1, compiler parallelization of programmer-expressed
non-parallel code. But I am most suspicious of it. E.g. Andrew Wolfe
(author of "Optimizing Supercompilers for Supercomputers") taught that
compilers had never really gotten all that good at vector parallelism.
Rather, humans started learning to write code in the idioms that
compilers could vectorize.

---

Nevertheless, I still like to work with compiler teams. History:

1991: at the start of P6, Intel's compilers were not that good - but
Intel made a great effort to improve them. However, much of that effort
was devoted to optimizing for P5 (in-order machines need a lot of
optimization). Compiler folk were somewhat frustrated by P6 running
unoptimized code almost as fast as optimized code [*], although they
were happy to learn that many supposedly P6-specific optimizations
improved P6 even more.

Overall, working with Intel's compiler team in the early days was fun
and productive. But it did show me the benefits of loose coupling
between teams.

Towards 1995, the compiler team started getting sucked into the black
hole of Itanium. I don't think we ever really saw real dedication to
optimizing for P6-style OOO.

I wasn't there, but I have heard that the compiler was instrumental in
band-aiding the worst of the stupidities of Willamette. That was also
true for P6: the compiler really helped make up for the shortsighted
decisions wrt partial registers and memory. Overall, a pattern: tight
interaction between compilers and hardware really helps to make up for
hardware shortcomings. However, truly aggressive compiler optimization
can often be done in a more loosely coupled fashion.
2006-9: I can't leave off without thanking the compiler guy who worked
with me on my last project at Intel: Bob Kushlis, the one-man compiler
SWAT team. This wasn't a parallelism project, but it was nevertheless a
wonderful example of compiler/hardware collaboration.

---

Getting back to parallelism: I'm most hopeful about programmer-expressed
parallelism. I think that one of the most important things for compilers
will be to map large amounts of programmer-expressed parallelism in an
ideal machine - PRAM? CSP? - onto whatever machine you have.
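As a minimal sketch of that mapping - done here by a runtime rather than a
compiler, with a made-up work function, using only the Python standard
library: the programmer states only that the work items are independent
(unbounded, "ideal machine" parallelism), and the runtime decides how many
workers the actual machine gets.

import os
from concurrent.futures import ProcessPoolExecutor

def work(i):
    return i * i  # stand-in for an independent unit of work

if __name__ == "__main__":
    items = range(10_000)                # as much parallelism as the programmer can express
    workers = os.cpu_count() or 1        # "whatever machine you have"
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(work, items))
    print(sum(results))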
From: Mike Hore on 4 Jun 2010 01:02
Andy 'Krazy' Glew wrote:

> ...
> Nevertheless, I still like to work with compiler teams. History:
>
> 1991: at the start of P6, Intel's compilers were not that good - but
> Intel made a great effort to improve them. However, much of that effort
> was devoted to optimizing for P5 (in-order machines need a lot of
> optimization). Compiler folk were somewhat frustrated by P6 running
> unoptimized code almost as fast as optimized code [*], although they
> were happy to learn that many supposedly P6-specific optimizations
> improved P6 even more.
>
> Overall, working with Intel's compiler team in the early days was fun
> and productive. But it did show me the benefits of loose coupling
> between teams.
>
> Towards 1995, the compiler team started getting sucked into the black
> hole of Itanium. I don't think we ever really saw real dedication to
> optimizing for P6-style OOO.
>
> I wasn't there, but I have heard that the compiler was instrumental in
> band-aiding the worst of the stupidities of Willamette. That was also
> true for P6: the compiler really helped make up for the shortsighted
> decisions wrt partial registers and memory. Overall, a pattern: tight
> interaction between compilers and hardware really helps to make up for
> hardware shortcomings. However, truly aggressive compiler optimization
> can often be done in a more loosely coupled fashion.

Thanks for that fascinating stuff, Andy. I'm wondering, where was
Microsoft while this was going on? Did they use Intel's compiler at all,
or did they "do it their way?"

Cheers, Mike.

---------------------------------------------------------------
Mike Hore          mike_horeREM(a)OVE.invalid.aapt.net.au
---------------------------------------------------------------