From: Noob on 22 Oct 2009 06:53

Andrew Reilly wrote:

> Didn't SGI open-source their own in-house itanium compiler,
> (open64 or something like that)?

Correct.

http://en.wikipedia.org/wiki/Open64
http://www.open64.net/about-open64.html

"""
Formerly known as Pro64, Open64 was initially created by SGI and
licensed under the GNU Public License (GPL v2). It was derived from
SGI's MIPSPro compiler.
"""
From: Noob on 22 Oct 2009 07:18

Robert Myers wrote:

> Open source depends on gcc, perhaps the cruftiest bit of code
> on the planet.

What kind of unsubstantiated BS is this?
From: Mayan Moudgill on 22 Oct 2009 07:33

Andy "Krazy" Glew wrote:
> Mayan Moudgill wrote:
> > Branch prediction:
>
> (1) branch predictors *have* gotten a lot better, and will continue to
> get better for quite a few more years.

There are at least 3 branch predictors you have to worry about:
direction predictors, return stack predictors and next-fetch-address
predictors (a toy example of the first is at the end of this post). If
you combine the effect of all 3, then, depending on your code mix, your
accuracies can be considerably lower than most published work reports.
Database code, certain kinds of highly OO code, and code that made a
lot of OS calls were among the prominent culprits.

BTW: since the work we were doing was in simulation, we looked at
impractical structures: large tables, high associativity, next-cycle
NFA fixup for computed branches, perfect & early table update, etc. So
our accuracies in some of the models we looked at were considerably
higher than what was then practical, and may still be higher than what
is now practical.

> > Cache misses:

I'm more worried about I$ misses. Even with 100% prediction accuracy,
you might still miss in the I$. At which point, what do you do? Somehow
you have to figure out some code that is:
- in the I$
- whose input registers are available (or predictable!)
- which has a reasonable chance of actually being executed.

One approach was to go back and execute some other path, such as the
other side of a weakly taken/not-taken branch. We didn't look at that,
partly because prior work suggested that it wasn't much of an
improvement, and partly because it would have been difficult to get the
renaming structures (particularly freeing the registers on
retirement/branch resolution) done easily.

Loop-ful code doesn't run into these problems. But then, loop-ful code
doesn't need fancy predictors to get very good results, either. For
non-loop-heavy code, I seem to remember that the number of instructions
between cache misses was small-ish (assuming caches in the 32K-128K
range). My memory is hazy, but IIRC the one-sigma was about 40? [Note
that this is independent of prediction and everything else - this is
just the number of instructions on the taken path between misses.]

> (3) Recall that I am a fan of skip-ahead, speculative multithreading
> architectures such as Haitham Akkary's DMT. If you can't predict a
> branch, skip ahead to the next loop iteration or function return, and
> execute code that you know will be executed with high probability.

Possibly. I am a little skeptical of results based on SPEC95, but it
seems worth looking into. But (and I may be overly cynical here) I
suspect that in a real implementation, it will end up giving the usual
+/-5% performance delta, with a 50:50 chance that it is -5% rather than
+5%. Note that the original late-90s paper required broadside copying
of some fairly large arrays.

Slightly off-topic: did IBM or anyone else make real traces available
to researchers? I know they were talking about it, but did they follow
through?
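P.S. For concreteness, the simplest of the three predictor types
above - a toy "bimodal" direction predictor, i.e. a table of 2-bit
saturating counters indexed by low PC bits. The table size and
indexing below are arbitrary illustrative choices, not anyone's
shipped design:

/*
 * Toy direction predictor: 2-bit saturating counters indexed by
 * low PC bits. Sizes and hashing are illustrative only.
 */
#include <stdint.h>

#define PRED_ENTRIES 4096           /* 4K two-bit counters */

static uint8_t ctr[PRED_ENTRIES];   /* each counter in 0..3 */

static unsigned pred_index(uint32_t pc)
{
    return (pc >> 2) & (PRED_ENTRIES - 1);  /* drop alignment bits */
}

static int predict_taken(uint32_t pc)
{
    return ctr[pred_index(pc)] >= 2;        /* 2,3 predict taken */
}

static void predictor_update(uint32_t pc, int taken)
{
    uint8_t *c = &ctr[pred_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }      /* saturate at 3 */
    else       { if (*c > 0) (*c)--; }      /* saturate at 0 */
}

The published accuracy numbers I was grousing about come from schemes
considerably fancier than this (global history, hybrid choosers), but
this is the baseline they all get measured against.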
From: Mayan Moudgill on 22 Oct 2009 07:37

Terje Mathisen wrote:
> For loops you unroll enough to cover the expected latency from L1 (or
> L2 for fp), using the huge register arrays to save all the
> intermediate results.

Huh? Terje, I agree with your other points, but *SURELY* the compiler
would optimize loops by unrolling (and applying other techniques,
including register tiling and software pipelining) correctly? Are you
telling me that they didn't get even *THAT* implemented correctly?
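For reference, the shape of the transformation Terje is describing - a
minimal sketch, assuming an unroll factor of 4 is enough to cover the
L1 load-use latency and that n is a multiple of 4:

/*
 * Unroll by 4 with independent accumulators, so four loads are in
 * flight and the FP adds don't serialize on a single register.
 * Unroll factor and latency assumptions are illustrative only.
 */
double sum4(const double *a, long n)   /* assumes n % 4 == 0 */
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;

    for (i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

A real compiler also emits a cleanup loop for the leftover iterations,
and on machines with enough registers will software-pipeline the body
as well.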
From: eternal september on 22 Oct 2009 07:50
"Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote in message news:4ADF1711.6060107(a)patten-glew.net... > eternal september wrote: >> "Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote in message >> news:4ADEA866.5090000(a)patten-glew.net... > > I'll reread the HSW papers and get back to comp.arch. Looking forward to it! > Note: I'm not just an OOO bigot. I also have neat ideas about parallel > systems, MP, MIMD, Coherent Threading. But I am probably the most > aggressive OOO computer architect on Earth. I don't know, I am pretty agressive. We will have to have an OOO arm wrestling contest! I have argued in favor of OOO against Itanium architects... > This is why I get frustrated when people say "OOO CPU design has run into > a wall, and can't improve: look at Intel and AMD's latest designs". I know > how to improve things - but I was not allowed to work on it at Intel for > the last 5 years. Now I can - in my copious free time. You're right that power is not the issue. Power has set things back a few generations (the Atom scheduler is as complex as first gen OOO schedulers - although there is no renamer; ARM has released an OOO machine). The low power guys can learn from P4's mistakes - but, there was a lot of good stuff there that was thrown out for Core 2. The problems I see: 1) We are running out of generations of Si - not many generations left (from an economics point of view), and the generations give us less performance than they used to (because of wire scaling and optimizing for leakage) 2) Management is risk averse This is the big one. With a new core costing hundreds of millions, only small improvements on existing cores can be justified. Thanks! Ned |