From: Andy 'Krazy' Glew on 18 Jun 2010 01:17

On 6/17/2010 7:34 AM, MitchAlsup wrote:
> On Jun 17, 12:33 am, Brett Davis<gg...(a)yahoo.com> wrote:
>> What am I missing.
> Secondarily, once you microarchitect a fully out-of-order processor,
> it really does not matter what the native instruction set is. Re-read
> that previous sentence until you understand. The native instruction
> set no longer matters once the microarchitecture has gone fully OoO!

Well, there are small effects, e.g. in code size. x86 is no paragon of virtue, compactness, or regular instruction bytes, but you can imagine that a "RISCy" 2-register instruction set might have instructions that look like:

    reg1 += reg2

and might fit most instructions in 16 bits, with two 5-bit register fields leaving 6 bits for opcodes. Perhaps with hardware to recognize the instruction sequence

    reg1 := reg0; reg1 += reg2

and emit the 3-input operation

    reg1 := reg0 + reg2

Given that reg1 += reg2 is much more common than the general form reg1 := reg0 + reg2, there may be a net savings. I'm not aware of any x86 processor that does this, but this technique is well known even from before I joined Intel in 1991. At various times it has been called "move elimination".

---

We also have not talked about the possibility of a load-op pipeline, yes, even on an out-of-order CPU.

(Anecdote: Tomasulo took me aside at a conference, and asked me why P6 did not have a load-op pipeline. I knew that we had studied it; and it has been studied and restudied every 2nd or 3rd processor generation. It usually has insufficient advantage. I conjecture(d) that the P5 had encouraged non-load-op instructions. Tomasulo pointed out that x86 had many reg,reg operations, and that these would waste the load part of a load-op pipeline.
My overall feeling, however, is that on a load-op pipeline you have to handle the possibility of the load missing, so you either have to have a buffer between load and op, or you have to provide a late read of the register operand of the op. In either case, you have to handle the possibility of the load-op being decoupled. Or you could just replay the entire load-op on a cache miss. In any case, there is wastage - it's the usual centralized versus decentralized buffer issue.)

===

But Brett was asking "Why RISC a:=b+c?", not "Why a+=b or a+=load(mem)?" And Mitch has provided the answer. x86 has complicated decoding. Market volumes amortize it, but it is still a cost. All other things being equal, I would rather build a RISC, perhaps a 16-bit a+=b RISC as described above. But all other things are not equal. Out-of-order is a big leveller. Although, if you wanted to have lots of simple cores, you might want to give up x86.
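The mov+op recognition Glew describes can be sketched as a peephole pass over a decoded instruction stream. This is a toy model for illustration only: the tuple encoding and the "add3" opcode are invented, not any real uop format.

```python
# Toy peephole pass: recognize "reg1 := reg0; reg1 += reg2" and emit
# the 3-input operation "reg1 := reg0 + reg2" as a single fused op.
# Instructions are (opcode, dest, src) tuples; all names are made up.

def fuse_mov_add(instrs):
    out = []
    i = 0
    while i < len(instrs):
        if (i + 1 < len(instrs)
                and instrs[i][0] == "mov"               # reg1 := reg0
                and instrs[i + 1][0] == "add"           # reg1 += reg2
                and instrs[i][1] == instrs[i + 1][1]):  # same dest reg
            _, d, s0 = instrs[i]
            _, _, s2 = instrs[i + 1]
            out.append(("add3", d, s0, s2))             # reg1 := reg0 + reg2
            i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out
```

Here `fuse_mov_add([("mov", "r1", "r0"), ("add", "r1", "r2")])` yields `[("add3", "r1", "r0", "r2")]`: the common 2-register encoding keeps its density, and the fused case recovers the 3-operand form.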
From: nedbrek on 18 Jun 2010 07:51

Hello all,

"Stephen Sprunk" <stephen(a)sprunk.org> wrote in message news:hvev00$f6m$1(a)news.eternal-september.org...
> On 17 Jun 2010 10:24, Anton Ertl wrote:
>> Stephen Sprunk <stephen(a)sprunk.org> writes:
>>> On 17 Jun 2010 00:33, Brett Davis wrote:
>
> The point I was trying to make is that x86 has no 3-operand add
> instruction like the one he used in his example, nor does RISC allow a
> memory address as the destination of an add instruction as he did in his
> example. I corrected both to show a fairer comparison.

x86 does have a 3-operand add:

    lea r1 = &[r2 + r3]

from (the general form):

    lea r1 = &[r2 << {0,1,2,3} + r3 + imm]

I don't have any proof that a compiler will actually emit it... :)

Ned
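Modeled directly, the general form Ned quotes computes (r2 << shift) + r3 + imm; a small arithmetic sketch, with no claim about real encodings:

```python
def lea(r2, r3, shift=0, imm=0):
    """Model of lea's general addressing form:
    r1 = (r2 << shift) + r3 + imm, with the scale limited to
    shifts of 0..3 as in the real addressing mode.
    With shift=0 and imm=0 it degenerates to a 3-operand add."""
    if shift not in (0, 1, 2, 3):
        raise ValueError("lea scale must be a shift of 0..3")
    return (r2 << shift) + r3 + imm
```

As for compilers emitting it: modern gcc and clang do in fact use lea for plain integer addition at -O2 when the result must land in a third register, precisely because it is the only 3-operand add x86 offers.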
From: nedbrek on 18 Jun 2010 08:13

Hello all,

"Brett Davis" <ggtgp(a)yahoo.com> wrote in message news:ggtgp-1B263C.00331217062010(a)news.isp.giganews.com...
> RISC load-store versus x86 Add from memory.
> t = a->x + a->y;

This is a really bad choice (as others have shown)...

I don't know about RISC vs. CISC, but if you want to compare "complexity in the compiler" vs. "smarts in the hardware" - use Itanium:

x86:

    div AX /= BX (like 2 or 3 bytes)

Itanium (from the application writers guide, section 13.3.3.1) (min latency, 13 instructions, 7 PIGs):

    frcpa.s0 f8,p6 = f6,f7 ;;
    (p6) fma.s1 f9 = f6,f8,f0
    (p6) fnma.s1 f10 = f7,f8,f1 ;;
    (p6) fma.s1 f9 = f10,f9,f9
    (p6) fma.s1 // lots of regs in here
    (p6) fma.s1
    (p6) fma.s1
    (p6) fma.s1
    (p6) fma.s1
    (p6) fma.s1
    (p6) fma.s1
    (p6) fnma.d1.s1
    (p6) fma.d.s0

My boss and I walked through this sequence one day. You need all these multiplies to eliminate the approximation error in frcpa. Sadly, I've forgotten the exact details (we also walked through the timing).

We compared the 1 GHz McKinley to a 2 GHz Willamette (which were both shipping, and in similar process). The latencies were (roughly) equal (McKinley can chew through a lot of FP instructions!). Of course, the Itanium code is a lot bigger, and uses a whole lot more power (charging up all those register port reads and writes, and all the predication and bypass). I guess it didn't matter, because McKinley had solved the delta-power problem that plagued P4 - they burned max power all the time!

All this because the Itanium architects refused to have long latency instructions sully their beautiful architecture.

Ned
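The fma chain Ned walks through is Newton-Raphson refinement of the frcpa seed: frcpa returns a reciprocal good to only a handful of bits, and each fnma/fma pair roughly doubles the number of correct bits until a/b is fully accurate. A sketch of the idea - the seed construction here is an invented stand-in for an ~8-bit-accurate frcpa, and the real sequence also manages rounding modes, which this ignores:

```python
import struct

def crude_recip(b):
    """Mimic frcpa: return 1/b accurate to only ~8 mantissa bits,
    by truncating a full-precision reciprocal (illustrative hack)."""
    bits = struct.unpack(">Q", struct.pack(">d", 1.0 / b))[0]
    bits &= ~((1 << 44) - 1)   # keep the top 8 of the 52 mantissa bits
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def nr_divide(a, b, rounds=3):
    """Approximate a / b the Itanium way: refine the reciprocal seed
    with y += y * (1 - b*y), then multiply by a.  Each round maps to
    an fnma (compute the residual) plus an fma (apply the correction),
    which is where all the multiplies in Ned's listing come from."""
    y = crude_recip(b)
    for _ in range(rounds):
        e = 1.0 - b * y     # fnma: residual error of the reciprocal
        y = y + y * e       # fma: quadratic-convergence correction
    return a * y            # final fma forms the quotient
```

Convergence is quadratic, so three rounds take ~8 correct bits to well past double precision, which is why the sequence needs its full ladder of fma/fnma instructions before the final rounding steps.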
From: Anton Ertl on 18 Jun 2010 08:08

Stephen Sprunk <stephen(a)sprunk.org> writes:
>On 17 Jun 2010 10:24, Anton Ertl wrote:
>> Stephen Sprunk <stephen(a)sprunk.org> writes:
>>> On 17 Jun 2010 00:33, Brett Davis wrote:
>>>> RISC load-store versus x86 Add from memory.
>>>> t = a->x + a->y;
>>>>
>>>> RISC
>>>> load x,a[0]
>>>> load y,a[1]
>>>> add t = x,y
>>>
>>> load r1, a[0]
>>> load r2, a[1]
>>> add r3, r1, r2
>>> store t, r3
>>>
>>>> x86
>>>> load x,a[0]
>>>> add t = x,a[1]
>>>
>>> load r1, a[0]
>>> add r1, a[1]
>>> store t, r1
>>
>> If t is a local variable, decent C compilers will usually allocate it
>> into a register, and no store is needed.
>
>True, but if you're going to talk about compiler optimizations,

No, I am talking about register allocation. Here's an example of what the compiler of a student of mine in this year's compiler course produces for the following program (in an Algol-family programming language):

    struct x y end;
    method f(a)
      var t:=a.x-a.y; /* no + in this programming language:-) */
      return t;
    end;

Here's the output:

    f:
        mov 0(%rsi), %rdx
        sub 8(%rsi), %rdx
        mov %rdx, %rax
        ret

It's easy to see that a resides in %rsi and t resides in %rdx. Does this compiler optimize? No. E.g., it did not perform the copy propagation or register coalescing that would have allowed it to optimize the last (return) mov away.

>then
>odds are the code is unlikely to resemble what you wrote in a HLL in the
>first place except for the most trivial of programs.

Have you ever looked at the output of an optimizing compiler? Most of the time the translation is pretty straightforward.

>The point I was trying to make is that x86 has no 3-operand add
>instruction like the one he used in his example,

True, but it's not needed. That could be written just as well as:

    load t=a[0]
    add t+=a[1]

>nor does RISC allow a
>memory address as the destination of an add instruction as he did in his
>example.

He didn't. The destination is in a register.
>> What you may be thinking of is that the microarchitectures of current
>> high-performance CISC and RISC CPUs are relatively similar, and quite
>> different from the microarchitectures of CISC and RISC CPUs when RISCs
>> were introduced.
>
>Alternately, one can look at a modern x86 chip as a core that runs a
>model-specific RISC ISA hidden behind a decoder that translates x86 CISC
>instructions into that ISA.

Let's see.

Intel: The original P6 uops have 118 bits (they may have grown since then; the P6 is the basis of the Core i line) according to Microprocessor Report 9(2). A bit longer than a RISC instruction.

AMD: The K7/K8/K10 microarchitecture contains macro-ops (including read-modify-write instructions) that are later split into micro-ops (which still include a micro-instruction that does a read and a write to the same address). Quite unriscy features.

>That may offend purists, but IMHO it's
>accurate enough for those of us who don't actually design CPUs.

Yes, given that the interface is the ISA, you can invent any fairy tale you like about what's going on behind that interface, and hardly anybody will care (apart from the few of us who try to make sense of the performance counters, and even we prefer to count architectural events like committed instructions over microarchitectural events such as started uops).

- anton
--
M. Anton Ertl                      Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at  Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Stephen Sprunk on 18 Jun 2010 09:17
On 18 Jun 2010 07:08, Anton Ertl wrote:
> Stephen Sprunk <stephen(a)sprunk.org> writes:
>> On 17 Jun 2010 10:24, Anton Ertl wrote:
>>> Stephen Sprunk <stephen(a)sprunk.org> writes:
>>>> On 17 Jun 2010 00:33, Brett Davis wrote:
>>>>> RISC load-store versus x86 Add from memory.
>>>>> t = a->x + a->y;
>>>>>
>>>>> RISC
>>>>> load x,a[0]
>>>>> load y,a[1]
>>>>> add t = x,y
>>>>
>>>> load r1, a[0]
>>>> load r2, a[1]
>>>> add r3, r1, r2
>>>> store t, r3
>>>>
>>>>> x86
>>>>> load x,a[0]
>>>>> add t = x,a[1]
>>>>
>>>> load r1, a[0]
>>>> add r1, a[1]
>>>> store t, r1
>>>
>>> If t is a local variable, decent C compilers will usually allocate it
>>> into a register, and no store is needed.
>>
>> True, but if you're going to talk about compiler optimizations,
>
> No, I am talking about register allocation.

That is a (very basic) optimization.

Have you looked at what GCC does at -O0, i.e. with all optimization disabled? It translates each statement into self-contained assembly which loads, operates on, and then stores the relevant variables--even if two successive statements operate on the same variables. For instance:

    x=a+b;
    y=a+b;

gets translated into something like this:

    load r1, a
    load r2, b
    add r1, r2
    store x, r1
    load r1, a
    load r2, b
    add r1, r2
    store y, r1

> It's easy to see that a resides in %rsi and t resides in %rdx. Does
> this compiler optimize? No. E.g., it did not perform the copy
> propagation or register coalescing that would have allowed to optimize
> the last (return) mov away.

Not performing optimization A isn't proof that you're not performing optimization B.

>> then odds are the code is unlikely to resemble what you wrote in a
>> HLL in the first place except for the most trivial of programs.
>
> Have you ever looked at the output of an optimizing compiler? Most of
> the time the translation is pretty straightforward.

I have plenty of times.
Perhaps my assembly skills aren't as good as yours, but for the most part, I see little resemblance between the C code and the assembly (for non-trivial functions) when I crank up GCC to maximum optimization. Loop unrolling, strength reduction, dead code elimination, common sub-expression elimination, load hoisting, inlining, etc. can all cause significant changes.

>> The point I was trying to make is that x86 has no 3-operand add
>> instruction like the one he used in his example,
>
> True, but it's not needed. That could be written just as well as:
>
>     load t=a[0]
>     add t+=a[1]

If you're going to load a[0] and a[1], you need to store t. It's called symmetry.

>> nor does RISC allow a memory address as the destination of an add
>> instruction as he did in his example.
>
> He didn't. The destination is in a register.

If you're going to claim that t lives in a register, then you might as well claim that a[0] and a[1] do as well and eliminate those loads. However, the OP didn't do that, and in any case that just means the loads (and stores) are probably somewhere else in the program, so eliminating them from the snippet does not paint a true picture of what's going on.

>>> What you may be thinking of is that the microarchitectures of current
>>> high-performance CISC and RISC CPUs are relatively similar, and quite
>>> different from the microarchitectures of CISC and RISC CPU when RISCs
>>> were introduced.
>>
>> Alternately, one can look at a modern x86 chip as a core that runs a
>> model-specific RISC ISA hidden behind a decoder that translates x86 CISC
>> instructions into that ISA.
>
> Let's see.
>
> Intel: The original P6 uops have 118 bits (they may have grown since
> then; the P6 is the basis of the Core i line) according to
> Microprocessor report 9(2). A bit longer than a RISC instruction.

And how big are instructions in a traditional RISC core after decoding? Is that even relevant, since the point of RISC is reduced _complexity_ rather than _size_?
(RISC programs are usually bigger than CISC ones, both in total and average instruction size, and modern RISCs have larger instruction sets as well.)

> AMD: The K7/K8/K10 microarchitecture contains macro-ops (including
> read-modify-write instructions), that are later split into micro-ops
> (which still include a micro-instruction that does a read and a write
> to the same address). Quite unriscy features.

How is that possible, since AMD's own docs say that a plain write requires a store-address uop and a store-data uop?

S
--
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking
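Strength reduction, one of the transformations Stephen lists above, shows concretely why optimized output stops resembling the source: the compiler replaces a per-iteration multiply with a running add. A before/after sketch, written in Python purely for illustration:

```python
def offsets_as_written(n, elem_size):
    # Source-level view: compute i * elem_size on every iteration,
    # as in indexing array element i of size elem_size
    return [i * elem_size for i in range(n)]

def offsets_strength_reduced(n, elem_size):
    # What the compiler effectively emits: the multiply becomes an
    # induction variable that is bumped by elem_size each trip
    out, offset = [], 0
    for _ in range(n):
        out.append(offset)
        offset += elem_size
    return out
```

Both produce the same offsets, but the second corresponds to a loop body with no multiply in it at all - and hence to assembly that no longer looks like the C.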