From: Anton Ertl on 21 Apr 2010 02:29

"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>Register windows: questionable value.

Yes, if they are not present, we can use inlining to reduce the call overhead. Except when we cannot, as for calls to dynamically-linked libraries or polymorphic calls (fortunately, none of that really happens for the SPEC CPU benchmarks, and that's all we care about, no?).

But wait, doesn't inlining increase code size? Given that IA-64 detractors claim that code size is a problem already, would leaving out the register stack be a good idea? Maybe, if it had allowed the implementations to reach higher clock speeds. But would that be the case?

All of the features of IA-64 seem to have their value, individually, and often also in combination. I guess (based on little evidence) it's the accumulation of these features that is the problem. My guess is that the implementors had their hands (or rather heads) full with the architectural features, so there was no time to invent or adapt the tricks that led to fast-clocked IA-32 implementations at the same time, at least not in the first two iterations of the architecture; and afterwards Intel seems to have given up on it; they do little more than shrink McKinley to new processes.

Meanwhile, IBM showed with Power6 in 2007 that in-order processors can be clocked higher. But that was 17 years after the Power architecture was introduced in 1990, and up to the Power4 in 2001 all of the Power implementations had been on the slow-clocked side.

So maybe with enough attempts and enough effort at each attempt Intel/HP could produce an IA-64 implementation that's fast (whether by being in-order with a very fast clock or out-of-order with just a fast clock). But I guess it would not increase revenue from the architecture much (performance did not help Alpha), so the current course of Intel and HP appears economically sensible.

- anton
--
M. Anton Ertl                     Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Terje Mathisen "terje.mathisen at tmsw.no" on 23 Apr 2010 14:12

Anton Ertl wrote:
> Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>> Anton Ertl wrote:
>>> 2) The people from the Daisy project at IBM came up with a software
>>> scheme that makes something like ALAT unnecessary (but may need more
>>> load units instead): For checking, just load from the same address
>>> again, and check if the result is the same. I guess that hardware
>>> memory disambiguation could use the same approach, but maybe the
>>> ALAT-like memory disambiguator is cheaper than the additional cache
>>> ports and load units (then it would also be a win for software memory
>>> disambiguation).
>>
>> This only works for a single level of load, otherwise you end up with
>> the ABA problem.
>
> What do you mean with "level of load"?
>
> And what do you mean with ABA problem? What I understand as ABA
> problem is not a problem here: If the speculative load loads the right
> value, that value and any computation based on that will be correct
> even if the content of the memory location changes several times between
> the speculative load and the checking load.

I'm thinking of a multi-level structure where the critical value is a pointer:

First you load it and get A, then load an item in the block A points at. Then another process comes along and does the following: load A, process what it points at, and free that block. (At this point A = NULL.)

Next, the same or yet another process allocates a new block and gets to reuse the area A used to point to, but this time it is filled with another set of data, OK?

Finally you are rescheduled, finish the processing you started, and do a compare against the original value of A to make sure it has all been safe, before committing your updates.

I.e., a single final compare isn't sufficient if the meaning can change: you have to verify every single item you have loaded that depended upon that speculatively loaded item.
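Terje's scenario can be sketched as follows. This is an illustrative Python simulation, not Itanium code: the one-slot "allocator" below deterministically reuses the same storage, which a real free/alloc pair would only do by chance.

```python
# Multi-level ABA hazard: a single final compare of the pointer cannot
# tell that the block was freed and the storage reused in between.

class Block:
    def __init__(self, payload):
        self.payload = payload

_pool = Block(0)            # single slot: free + alloc hands back the same object

def alloc_block(payload):
    _pool.payload = payload
    return _pool

def free_block(block):
    pass                    # storage stays around, ready to be reused

A = alloc_block(42)         # shared pointer
speculative = A             # speculative load of the pointer
seen = speculative.payload  # dependent load through it: sees 42

free_block(A)               # another process frees the block...
A = alloc_block(99)         # ...and the storage is reused with new data

# The single final compare passes, wrongly validating the speculation:
assert speculative is A
# ...even though the value loaded through the pointer is now stale:
assert seen == 42 and A.payload == 99
```

The pointer compare alone validates the chain, which is exactly the failure mode: every dependent load would have to be re-checked as well.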
The ALAT is similar to LL/SC in that it will detect all modifications, including a rewrite of the same value.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
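The distinction Terje draws can be sketched like this: a Daisy-style value re-check passes when the same value is stored again, while an ALAT/LL-SC-style monitor, which tracks stores to the address rather than values, still flags it. The store counter below is only a stand-in for the hardware monitor.

```python
# Value re-check vs. address monitoring on a rewrite of the same value.

class Cell:
    def __init__(self, value):
        self.value = value
        self.stores = 0     # stand-in for the hardware address monitor

    def store(self, value):
        self.value = value
        self.stores += 1

cell = Cell(7)
snap_value = cell.value     # speculative load
snap_stores = cell.stores   # monitor armed

cell.store(7)               # intervening store of the *same* value

value_check = (cell.value == snap_value)      # Daisy re-load: still passes
monitor_check = (cell.stores == snap_stores)  # ALAT/LL-SC-like: fails

assert value_check and not monitor_check
```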
From: Robert Myers on 23 Apr 2010 14:39

Anton Ertl wrote:
> Robert Myers <rbmyersusa(a)gmail.com> writes:
>> On Apr 22, 7:30 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
>>> The irony in Itanium was that the compiler would only use software
>>> pipelining in floating point code (i.e. short code segments). I think
>>> the memcpy in libc used it too. That accounted for the only times I
>>> saw it in integer code.
>
> Does it have rotation for integer registers?
>
>> Sifting through the comments, and especially yours, I wonder if a
>> candidate pair of smoking guns
>
> Smoking guns for what?

What didn't work the way it was supposed to (maybe the RSE) or what feature cost way too much with too little payback (too many architectural registers). Or maybe it's what many have implied: what do you expect from a design by committee?--in which case there are no smoking guns (the lethal shot that killed the world's most amazing processor).

>> is that the visible register set was
>> too large and/or that the register stack engine never worked the way
>> it was supposed to (perhaps--and I sure don't know--because of
>> problems with Microsoft software).
>
> Problems with Microsoft software should be irrelevant on non-Microsoft
> platforms.

Yes, but if problems with Microsoft forced a change of plans, the resulting loss in performance would have appeared on all platforms.

> IIRC I read about the hardware for transparent register stack engine
> operation not working, requiring a fallback to exception-driven
> software spilling and refilling. That would not be a big problem on
> most workloads. AFAIK SPARC and AMD29k have always used
> exception-driven software spilling and refilling.

And what does that say about Itanium, which had a completely different set of priorities?
The fact that register spills could be handled asynchronously meant that you could use registers with reckless abandon--unless the RSE never worked the way it should have, in which case you couldn't. Then you had the cost of all those architectural registers without a commensurate payback.

>> If the RSE didn't really work the way it was supposed to, then there
>> would have been a fairly big downside to aggressive use of a large
>> number of registers in any given procedure, thus limiting software
>> pipelining to short loops.
>
> Not really, because software pipelining is beneficial mainly for inner
> loops with many iterations; if you have that, then any register
> spilling and refilling overhead is amortized over many executed
> instructions. Of course, all of this depends on the compiler being
> able to predict which loops have many iterations. But this is no
> problem for SPEC CPU, which uses profile feedback; and of course, SPEC
> CPU performance is what's relevant.

In other words, if Itanium hadn't attempted to embrace a design philosophy that is still apparently unwelcome to you, there shouldn't have been a problem. Are you being serious, or are you just jerking my chain?

Same for your snarky comments about profile-directed optimization. Okay, you don't like it. We got that.

Robert.
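The inner-loop case Anton's quoted point relies on can be sketched as follows. This is only the scheduling idea: overlap the load for iteration i+1 with the compute for iteration i, so load latency is hidden once the loop is primed, and any one-time register spill/refill at loop entry is amortized over all n iterations. It is not Itanium's rotating-register machinery; the function name and two-stage split are illustrative.

```python
# Two-stage software pipeline for a[i] = b[i] * c: the load for the next
# iteration is issued "alongside" the multiply for the current one.

def pipelined_scale(b, c):
    n = len(b)
    a = [0] * n
    if n == 0:
        return a
    loaded = b[0]                              # prologue: prime the pipeline
    for i in range(n):
        nxt = b[i + 1] if i + 1 < n else None  # "stage 1": load ahead
        a[i] = loaded * c                      # "stage 2": compute
        loaded = nxt                           # rotate values between stages
    return a

assert pipelined_scale([1, 2, 3], 10) == [10, 20, 30]
```

With few iterations, the prologue (and any spill/refill) is pure overhead, which is why the technique pays off mainly on long-running inner loops.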
From: Robert Myers on 23 Apr 2010 15:09

Robert Myers wrote:
> Anton Ertl wrote:
>> Robert Myers <rbmyersusa(a)gmail.com> writes:
>>> On Apr 22, 7:30 am, "nedbrek" <nedb...(a)yahoo.com> wrote:
>>>> The irony in Itanium was that the compiler would only use software
>>>> pipelining in floating point code (i.e. short code segments). I think
>>>> the memcpy in libc used it too. That accounted for the only times I
>>>> saw it in integer code.
>>
>> Does it have rotation for integer registers?
>>
>>> Sifting through the comments, and especially yours, I wonder if a
>>> candidate pair of smoking guns
>>
>> Smoking guns for what?
>
> What didn't work the way it was supposed to (maybe the RSE) or what
> feature cost way too much with too little payback (too many
> architectural registers). Or maybe it's what many have implied: what do
> you expect from a design by committee?--in which case there are no
> smoking guns (the lethal shot that killed the world's most amazing
> processor).

It occurred to me, after I sent that off, that maybe Itanium was *expected* to have a two-cycle L1 load delay. If registers are free, they essentially become another layer of cache. Load (or preload) everything once. Who cares if it takes two cycles?

Of course, things didn't work out that way.

Robert.
From: Rick Jones on 23 Apr 2010 15:06
Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
> A more frequent situation in practice is probably when the compiler
> does not know what will happen at run-time; but that's irrelevant,
> because it does not happen with SPEC CPU, because that uses profile
> feedback.

SPECcpu2006 explicitly disallows PBO in base and only allows it in peak. That was a change from SPECcpu2000, which allowed PBO in both.

rick jones
--
The computing industry isn't as much a game of "Follow The Leader" as it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
- Rick Jones
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...