From: Robert Myers on 23 Apr 2010 15:29

On Apr 23, 3:06 pm, Rick Jones <rick.jon...(a)hp.com> wrote:
> Anton Ertl <an...(a)mips.complang.tuwien.ac.at> wrote:
> > A more frequent situation in practice is probably when the compiler
> > does not know what will happen at run-time; but that's irrelevant,
> > because it does not happen with SPEC CPU, because that uses profile
> > feedback.
>
> SPECcpu2006 explicitly disallows PBO in base and only allows it in
> peak.  That was a change from SPECcpu2000, which allowed PBO in both.

That just forces you to design the compiler around the benchmarks.  A
realistic set of rules for profile-based optimization would ask: how
much predictability does this code have in practice, and how well do
the compiler and processor exploit it?  Hard to answer such a question
in the world of what Nick calls benchmarketing.

Robert.
From: Terje Mathisen <terje.mathisen at tmsw.no> on 23 Apr 2010 01:44

Anton Ertl wrote:
> 2) The people from the Daisy project at IBM came up with a software
> scheme that makes something like ALAT unnecessary (but may need more
> load units instead): For checking, just load from the same address
> again, and check if the result is the same.  I guess that hardware
> memory disambiguation could use the same approach, but maybe the
> ALAT-like memory disambiguator is cheaper than the additional cache
> ports and load units (then it would also be a win for software memory
> disambiguation).

This only works for a single level of load, otherwise you end up with
the ABA problem.  I.e. you'll need to do the check on every single
subsequent/dependent load as well.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: "Andy "Krazy" Glew" on 23 Apr 2010 03:06

On 4/21/2010 8:21 AM, Stefan Monnier wrote:
>> One of the nice things about register rotation is that it almost
>> removes the need for the compiler to make a decision: optimize this
>> loop by unrolling it and software pipelining it, or not?  Register
>> rotation makes that optimization decision much simpler.
>
> But does this warrant support in the architecture?  My understanding
> is that this can only be applied to loops where software pipelining
> can be used, and these tend to be fairly short anyway, right?  so
> unrolling them a little and adding some startup/cleanup shouldn't be
> too costly (as long as you have enough registers).  If register
> pressure is a problem, you can't unroll enough and you need to add
> register moves (which basically perform the rotation by hand).
> Wouldn't it be preferable (and just as easy/easier) to handle
> register-move instructions efficiently?

While I have worked on, and advocated, handling reg-reg move
instructions efficiently, this introduces a whole new level of
complexity.

Specifically, MOVE elimination, changing

  lreg2 := MOVE lreg1
  lreg3 := ADD lreg2 + 1

into something like

  preg2 := MOVE preg1   // eliminated, or ...
  preg3 := ADD preg1 + 1

requires that you do some form of reference counting or garbage
collection for registers - to track that both lreg1 and lreg2 map to
preg1.  While doable, it's a chunk of complexity.

Observe that the nice thing about register rotation is that it is a
permutation.  No reference counts.  And it operates on a lot of
registers all at the same time.  No arbitrary limits of "at most 2
MOVes may be handled in a cycle."
From: "Andy "Krazy" Glew" on 23 Apr 2010 03:08

On 4/22/2010 8:16 AM, Robert Myers wrote:
> Having so many visible registers had to have increased the complexity
> of so many things, one of which, the ALAT, you mentioned in another
> post.

Hardware complexity, hell: The ALAT made single threaded code
non-deterministic.  You could get different bugs depending on the load
average on the machine.  That's stupid.
From: nedbrek on 23 Apr 2010 18:25
Hello all,

"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message
news:2010Apr23.153819(a)mips.complang.tuwien.ac.at...
> "nedbrek" <nedbrek(a)yahoo.com> writes:
>>2) chk.a is too expensive.  You suffer a branch mispredict penalty,
>>plus you probably miss in the L1I (recovery code is rarely used,
>>therefore rarely cached).
>
> If the recovery code is rarely used, why is the branch mispredicted?
> And why do you suffer a miss in the I-cache?  In the usual case the
> branch prediction will be correct (no mispredict penalty), and the
> recovery code will not be performed (no I-cache miss).

The code is going to look like:

  ld  r1 = a
  add    = r1
  sub    = r1
  ....
  chk.a r1, fixup
  ....

<a long way away>
fixup:
  ld  r1 = a
  add    = r1
  sub    = r1
  jmp back

The chk is the branch (if r1 has been invalidated, jump to a section
which will redo the dependent instructions).

If the load is always invalid, the branch can be predicted correctly -
but then you always do the work twice -> lower performance.  If the
load is infrequently invalidated, you probably can't predict it ->
branch mispredict and I-cache miss.

>>3) Using whole program analysis, compilers got a lot better at static
>>alias detection.
>
> Yes, SPEC CPU is everything that counts.  The applications that use
> dynamic linking and that have to build in finite time, and are not
> written in the subset of C that's supported by the whole-program
> analyser (which is not used anyway because the rebuilds take too
> long), i.e., most of the real-world applications, they are
> irrelevant.

Sure, this is Itanium.  Spec was all we had for a long time (most of
the architecture committee used 9-queens), and we were glad to have
it.

Ned