From: "Andy "Krazy" Glew" on 22 Oct 2009 00:26 Bill Todd wrote: >[...Itanium...] > Intel wasn't run by complete idiots, just by insufficiently skeptical > (and/or 'easily impressed') but otherwise reasonably bright people. All > they had to believe was that the expected performance domination would > materialize (which was HP's area of expertise, and HP was at that time a > reputable source) - and a hell of a lot of fairly bright people > *outside* Intel bought into this right into the start of this decade, > not just the middle of the last one. Some of those people are still around. Some of them don't understand the value of x86. Some of them now have swung too far, and think x86 forever. Most of them just don't understand OOO execution. How you can judiciously add hardware to solve problems. Some of them think that the advent of Larrabee and Atom show that OOO is a dead end. They think that we are resetting to simple P5-era in-order machines. More, reversing evolution: retreating from Pentium 4 "fireball", backing out of OOO. Some think that we will never go back to OOO. Me, I think that we are resetting. I think of it as a sawtooth wave: backing out a bit, but probably advancing to dynamic techniques in a few years. Heck: Willamette / Pentium 4 was brought to you by peopled who thought OOO was a bad idea. The original concept was anti-OOO. They were forced to implement OOO, badly, because the anti-OOO approach did not fly. Some of the people who brought you Itanium, who drank the VLIW koolaid, are the people who are bringing you Larrabee. 'Nuff said. Some of them think that the problem with VLIW was the instruction set. Some of them think that binary translation will solve all those problems. Me, I think that binary translation is a great idea - but only if the target microarchitecture makes sense in the absence of binary translation, if we lived in an open source world and x86 compatibility was not an issue. To many binary translation projects end up missing the point: you have got to have a good target microarchitecture.
From: Terje Mathisen on 22 Oct 2009 02:06

Robert Myers wrote:
> When I talked about rewriting code, I meant just that, not merely
> recompiling it. I wasn't all that interested in the standard task:
> how do you feed bad code to an Itanium compiler and get acceptable
> performance, because I was pretty sure that the answer was: you
> don't. :-)
>
> I was more interested in the question: how do you write code so that a
> compiler can understand enough about it to emit code that could really
> exploit the architectural features of Itanium? I always assumed that

That didn't seem too hard to figure out:

You write your code so that it has short if/then/else blocks, preferably of approximately the same size: this makes it easy for the compiler to handle both paths simultaneously, with if-generated predicates to save the proper results.

For loops you unroll enough to cover the expected latency from L1 (or L2 for fp), using the huge register arrays to save all the intermediate results.

You inline _very_ aggressively, since call/return is relatively expensive, and you avoid all interrupt handling if at all possible. The best solution here is probably to dedicate one cpu/core to this.

You also make sure that tasks run a _long_ time between context switches, since the overhead of saving/restoring the huge register files is pretty significant.

I.e. this is/was a cpu which was very good at going fast in a straight line, with the added capability of being able to do two-way splits for short periods to absorb little branches.

> someone at Intel understood all that and briefed it to management and
> management said, "No problem. We'll have the only game in town, so
> people will conform their code to our hardware."
>
> If you accept that proposition, then all you need to do is to get
> enough code to run well to convince everyone else that it's either
> make their code do well on the architecture or die. I'm pretty sure
> that Intel tried to convince developers that that was the future they
> should prepare for.

Of course. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
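To make that recipe concrete, here is a small C sketch of the coding style Terje describes - short, balanced if/else arms that a compiler can if-convert into predicated code, plus a loop unrolled with independent accumulators so several loads are in flight at once. The function names and the unroll factor of 4 are illustrative assumptions, not anything from the thread:

#include <stddef.h>

/* Short, balanced arms: the compiler can compute a predicate from
   (x < 0) and execute both assignments under complementary predicates,
   so no branch need ever be taken. */
int clamp_sign(int x)
{
    int r;
    if (x < 0)
        r = -1;   /* one short arm...           */
    else
        r = 1;    /* ...matched by an equal arm */
    return r;
}

/* Unrolled-by-4 sum with four independent accumulators: the loads in
   one iteration do not depend on each other, so their latencies
   overlap.  The right unroll factor depends on the load latency you
   are trying to cover. */
long sum4(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder loop */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

On most machines the first function compiles to a conditional move; on Itanium both arms would execute under complementary predicates, which is exactly the "two-way splits for short periods to absorb little branches" Terje mentions. The second pattern is what IA-64's rotating registers and software pipelining were meant to automate.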
From: Terje Mathisen on 22 Oct 2009 02:26

Andy "Krazy" Glew wrote:
> That's the whole point: you want to get as many cache misses outstanding
> as possible. MLP. Memory level parallelism.
>
> If you are serialized on the cache misses, e.g. in a linear linked list
>
> a) skip ahead to a piece of code that isn't. E.g. if you are pointer
> chasing in an inner loop, skip ahead to the next iteration of an outer
> loop. Or, to a next function.

Andy, you really owe it to yourself to take a hard look at h264 and CABAC:

In approximately the same timeframe as DES was replaced with AES (which had a stated requirement of being easy to make fast/efficient on a PentiumPro cpu), the MPEG working groups decided that a "Context Adaptive Binary Arithmetic Coder" was the best choice for a video codec.

CABAC requires 3 or 4 branches for every single _bit_ decoded, and the last of these branches depends on the value of that decoded bit. Until you've made that branch you don't even know which context to apply when decoding the next bit!

(I have figured out workarounds (either branchless code or making them predictable) for most of those inline branches in the bit decoder, but that last context branch is unavoidable.)

The only possible skip ahead is a really big one: you have to locate the next key frame and start another core/thread, but this approach is of extremely limited value if you are in a realtime situation, e.g. video conferencing.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
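For readers who have not looked at CABAC, here is a toy C range decoder that shows the dependence chain Terje is describing. The shift-by-5 probability update and the tiny context table are made-up simplifications - real H.264 CABAC uses table-driven state machines - but the shape of the problem is the same: the context for the next bit is indexed by the bit just decoded.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t p_zero;   /* P(bit == 0) scaled to 1..65535; start at 0x8000 */
} Context;

typedef struct {
    uint32_t range, code;
    const uint8_t *in;
    size_t pos;
} Decoder;

void dec_init(Decoder *d, const uint8_t *buf)
{
    d->in = buf;
    d->pos = 0;
    d->range = 0xFFFFFFFFu;
    d->code = 0;
    for (int i = 0; i < 4; i++)          /* prime the code register */
        d->code = (d->code << 8) | d->in[d->pos++];
}

int decode_bit(Decoder *d, Context *ctx)
{
    /* First data-dependent branch: compare against the split point. */
    uint32_t split = (d->range >> 16) * ctx->p_zero;
    int bit;

    if (d->code < split) {
        bit = 0;
        d->range = split;
        ctx->p_zero += (0xFFFF - ctx->p_zero) >> 5;  /* adapt toward 0 */
    } else {
        bit = 1;
        d->code  -= split;
        d->range -= split;
        ctx->p_zero -= ctx->p_zero >> 5;             /* adapt toward 1 */
    }

    /* Renormalization: yet more branches per decoded bit. */
    while (d->range < (1u << 24)) {
        d->range <<= 8;
        d->code   = (d->code << 8) | d->in[d->pos++];
    }
    return bit;
}

/* The unavoidable dependence: the context for bit n+1 is selected by
   the value of bit n, so its decode cannot start - not even the load
   of its context - until bit n has fully resolved. */
int decode_two_bits(Decoder *d, Context ctx_tab[3])
{
    int b0 = decode_bit(d, &ctx_tab[0]);
    int b1 = decode_bit(d, &ctx_tab[1 + b0]);   /* chosen by b0 */
    return (b0 << 1) | b1;
}

Every bit in that last function is serialized through the previously decoded bit, which is why the only skip-ahead left is the coarse one Terje names: jump to the next key frame and decode it on another core.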
From: Robert Myers on 22 Oct 2009 03:38

On Oct 22, 2:06 am, Terje Mathisen <Terje.Mathi...(a)tmsw.no> wrote:
> I.e. this is/was a cpu which was very good at going fast in a straight
> line, with the added capability of being able to do two-way splits for
> short periods to absorb little branches.

I'm _way_ out on a limb here, Terje, but I think you can design a GPGPU/stream-processor to do the same much more effectively. If that's all Itanium could do, then we've all been snookered. That is to say, Itanium was a ridiculously power-hungry GPGPU.

I don't really know if that's a fair characterization, but that's what your formula seems to reduce to.

Robert.
From: nmm1 on 22 Oct 2009 04:08
In article <1b3a5ckrqn.fsf(a)snowball.wb.pfeifferfamily.net>,
Joe Pfeiffer <pfeiffer(a)cs.nmsu.edu> wrote:
>Robert Myers <rbmyersusa(a)gmail.com> writes:
>> On Oct 21, 8:16 pm, Bill Todd <billt...(a)metrocast.net> wrote:
>>>
>>> > I think that Intel seriously expected that the entire universe of
>>> > software would be rewritten to suit its ISA.
>>>
>>> > As crazy as that sounds, it's the only way I can make sense of Intel's
>>> > idea that Itanium would replace x86 as a desktop chip.

No. Intel were suckered by the HP people who said that compiler technology could handle that.

>>> Did you forget that the original plan (implemented in Merced and I'm
>>> pretty sure McKinley as well) was to include x86 hardware on the chip to
>>> run existing code natively?

It wasn't in the original plan. It was in the first post-panic redesign.

>> I never took that capability seriously. Was I supposed to? I always
>> thought it was a marketing gimmick.
>
>We were sure supposed to take it seriously -- didn't Merced actually
>have a i386 core on it when delivered?

I can't remember - they changed plans several times, and I can't remember which they delivered for the Merced.

The original plan was that ISA translation technology was advancing fast enough that they could convert x86 code to IA64 code and beat the best x86s by a factor of three. Like Alpha, only more so.

When they discovered that it didn't work (for reasons some of us had predicted), they panicked and proposed to add a complete x86 core 'until the software was improved'. That went through a couple of revisions.

Regards,
Nick Maclaren.