From: Anton Ertl on 23 Apr 2010 04:57

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>Anton Ertl wrote:
>> 2) The people from the Daisy project at IBM came up with a software
>> scheme that makes something like ALAT unnecessary (but may need more
>> load units instead): For checking, just load from the same address
>> again, and check if the result is the same.  I guess that hardware
>> memory disambiguation could use the same approach, but maybe the
>> ALAT-like memory disambiguator is cheaper than the additional cache
>> ports and load units (then it would also be a win for software memory
>> disambiguation).
>
>This only works for a single level of load, otherwise you end up with
>the ABA problem.

What do you mean by "level of load"?  And what do you mean by the ABA
problem?  What I understand as the ABA problem is not a problem here:
if the speculative load loads the right value, that value and any
computation based on it will be correct even if the content of the
memory location changes several times between the speculative load and
the checking load.

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at  Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
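
A minimal C sketch of the check-by-reload scheme discussed above (my own
illustration; the function names and the shape of the code are assumptions,
not Daisy's actual scheme).  The load is hoisted above a possibly aliasing
store, the dependent work is done speculatively, and the check is simply a
second load compared against the first:

    #include <stdint.h>

    /* Original order: the store may alias the load, so without
       speculation the load has to stay after the store. */
    int64_t original(int64_t *p, int64_t *q)
    {
        *q = 42;           /* possibly aliasing store */
        return *p + 1;     /* load we would like to hoist */
    }

    /* Check-by-reload speculation: load early, compute, then reload
       and compare.  If the value changed, redo the computation. */
    int64_t speculated(int64_t *p, int64_t *q)
    {
        int64_t v = *p;    /* speculative (hoisted) load */
        int64_t r = v + 1; /* work dependent on the speculative value */

        *q = 42;           /* the store it was hoisted above */

        if (*p != v)       /* checking load */
            r = *p + 1;    /* recovery: recompute with the real value */

        return r;
    }

This also shows the point made above: if *p changes and changes back between
the two loads (the classic ABA pattern), the check passes and the result is
still correct, because it matches the value the location holds at the time of
the checking load.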
From: Anton Ertl on 23 Apr 2010 09:38

"nedbrek" <nedbrek(a)yahoo.com> writes:
>2) chk.a is too expensive.  You suffer a branch mispredict penalty, plus you
>probably miss in the L1I (recovery code is rarely used, therefore rarely
>cached).

If the recovery code is rarely used, why is the branch mispredicted?
And why do you suffer a miss in the I-cache?  In the usual case the
branch prediction will be correct (no mispredict penalty), and the
recovery code will not be executed (no I-cache miss).

>3) Using whole program analysis, compilers got a lot better at static alias
>detection.

Yes, SPEC CPU is everything that counts.  The applications that use
dynamic linking, that have to build in finite time, and that are not
written in the subset of C that's supported by the whole-program
analyser (which is not used anyway because the rebuilds take too
long), i.e., most real-world applications, are irrelevant.

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at  Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
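
For reference, a rough C rendering of the code shape being discussed (my own
sketch of the ld.a/chk.a pattern, not real compiler output; the pointer
comparison merely stands in for the hardware ALAT check, and __builtin_expect
is a GCC extension used here to mark the check as almost never taken).  In
the usual case the check falls through, so the branch predicts correctly and
the out-of-line recovery code is never fetched:

    #include <stdint.h>

    /* Stand-in for whatever code consumes the loaded value. */
    static int64_t consume(int64_t v) { return v * 2; }

    int64_t advanced_load(int64_t *p, int64_t *q)
    {
        int64_t v = *p;                     /* ld.a: load hoisted above the store */
        *q = 0;                             /* possibly aliasing store */

        if (__builtin_expect(p == q, 0)) {  /* chk.a: almost never taken */
            v = *p;                         /* recovery: reload (compilers place this out of line) */
        }
        return consume(v);
    }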
From: nmm1 on 23 Apr 2010 09:52

In article <2010Apr23.153819(a)mips.complang.tuwien.ac.at>,
Anton Ertl <anton(a)mips.complang.tuwien.ac.at> wrote:
>"nedbrek" <nedbrek(a)yahoo.com> writes:
>
>>3) Using whole program analysis, compilers got a lot better at static alias
>>detection.
>
>Yes, SPEC CPU is everything that counts.  The applications that use
>dynamic linking, that have to build in finite time, and that are not
>written in the subset of C that's supported by the whole-program
>analyser (which is not used anyway because the rebuilds take too
>long), i.e., most real-world applications, are irrelevant.

Or where they make heavy use of a library that is not distributed as
source!

Regards,
Nick Maclaren.
From: Terje Mathisen "terje.mathisen at on 21 Apr 2010 02:26

Robert Myers wrote:
> Brett Davis wrote:
>
>> But ultimately is not register windowing just a horrid, complex, slow
>> way to get more register bits, in a fixed-width instruction set?
>
> Not in the case of Itanium, which has tons of registers.
>
> The purpose, as I understand it, is to permit more seamless operation
> across procedure calls.

I think both were supposed to be important:

Using rotating regs, along with predicated/masked execution, made it
very natural to write almost naive loops, with zero (visible)
unrolling, that still managed to match both L2 load delays and fp
latencies, and got rid of all the normal startup/cleanup code paths.
This did save quite a bit of instruction space.

On the other hand, rotation did indeed make it very cheap to do a
limited number of relatively shallow (not too many parameters)
function calls, avoiding much of the need for inlining, which can also
be responsible for code bloat.

On the gripping hand, the async register save/restore engine was
supposed to make the limited depth of the register stack completely
transparent to programmers, and this is the feature Nick has lambasted
the most, at least 100+ times over the last decade. :-(

Full disclosure: When I read the original asm manual and started
looking into ways to wrap my code around the architecture, I really
liked it!  With the targeted speeds, it seemed obvious that it could
indeed deliver very high performance for my handcoded asm.

What neither I nor Intel/HP seemed to understand at the time was that
the chip would end up ~5 years late, while still running at the
originally targeted speed: Too little, too late. :-(

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
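
A small C illustration of the loop point above (my own sketch, not Terje's
code, and the two-iteration load latency is an assumption for the example).
The first loop is what you would like to write; without rotating registers
and predication, hiding the latency means hand-pipelining it into something
like the second version, with exactly the startup/cleanup code that rotation
makes unnecessary:

    /* The loop you want to write: one load, one multiply per iteration. */
    void scale_naive(double *dst, const double *src, double k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

    /* Hand software-pipelined version: the load for iteration i+2 is
       issued while the multiply for iteration i completes, hiding a
       (pretend) two-iteration load latency.  Note the prologue, the
       epilogue, and the n < 2 special case. */
    void scale_pipelined(double *dst, const double *src, double k, int n)
    {
        if (n < 2) {                      /* cannot fill the pipeline */
            for (int i = 0; i < n; i++)
                dst[i] = src[i] * k;
            return;
        }
        double a = src[0];                /* prologue: fill the stages */
        double b = src[1];
        int i;
        for (i = 0; i + 2 < n; i++) {     /* kernel: steady state */
            double next = src[i + 2];     /* load runs ahead of its use */
            dst[i] = a * k;
            a = b;
            b = next;
        }
        dst[i] = a * k;                   /* epilogue: drain the pipeline */
        dst[i + 1] = b * k;
    }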
From: Anton Ertl on 23 Apr 2010 11:36
"nedbrek" <nedbrek(a)yahoo.com> writes: >Hello all, > >"Anton Ertl" <anton(a)mips.complang.tuwien.ac.at> wrote in message >> 1) OOO CPUs that try to reorder loads above stores tend to have >> something like an ALAT themselves (I found some slides on the Core >> microarchitecture that call it "memory disambiguation"); actually, >> it's usually even more involved, because it contains an alias >> predictor in addition to the checker. > >Sure, they exposed a hardware structure to software. Of course, software >has to handle all cases in general, where hardware only has to handle the >case at hand. That means software is going to be conservative (not using it >all the time, and adding extra flushes). Yes, in IA-64 the compiler does the prediction of how often a given store aliases with a given load, so the hardware does not need a separate predictor for that. And yes, if a load aliases with one of the later stores several times in a row, and then does not alias several times in a row, that's a situation where the dynamic hardware solution will be better than the static compiler solution; but how often does that happen? A more frequent situation in practice is probably when the compiler does not know what will happen at run-time; but that's irrelevant, because it does not happen with SPEC CPU, because that uses profile feedback. The conservative approach for using the ALAT seems to me to use it if it offers a latency advantage when there is no alias. In contrast, you imply that it is used less often if the compiler does not know enough; why? What extra flushes do you mean? - anton -- M. Anton Ertl Some things have to be seen to be believed anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen http://www.complang.tuwien.ac.at/anton/home.html |