From: Owen Shepherd on 11 Aug 2010 13:42

Terje Mathisen wrote:
> Kai Harrekilde-Petersen wrote:
>> Owen Shepherd <owen.shepherd(a)e43.eu> writes:
>>> A couple of simple examples from Thumb2:
>>> 1. Registers r0-r7 are preferred to r8-r12*, because most instructions
>>> only use 3 bits to encode each register (Thumb2 added a bunch of
>>> longer opcodes to make the upper registers more accessible, but
>>> they're 32-bit instructions)
>
> x86 prefers AL/AX/EAX for many instructions, since they have special,
> shorter encodings.
>
> It also prefers, in 64-bit mode, the 8 old registers vs the 8 new, since
> those new regs require an extra prefix byte.

x86 has a lot of preferences, yes, but they're not enforced. Preferring A is
probably pushing non-orthogonality too far from the compiler's perspective.

As for preferring the first 8 registers: This is untrue. It prefers the
first 8 registers for sub-64-bit operations, yes, however

>>> 2. ARM has an array of modes for the STM/LDM modes: increment before,
>>> increment after, decrement before, decrement after. Thumb only has
>>> STM decrement before (STMDB) and LDM increment after (LDMIA). This
>>> is not coincidentally the way the stack operates
>
> x86 has a real stack...

How is x86's stack any different from ARM's? In fact, ARM's is more flexible,
because you can push down as many registers as you want in a single
instruction.

It may surprise you, but on x86 it's faster to do
    sub $n, %rsp
    mov %rax, 0(%rsp)
    mov %rbx, 8(%rsp)
and so on, than to do it with pushes, if you're spilling quite a few
registers.

Plus, ARM gets the use of its instructions for all registers.

>>>
>>> Owen
>>>
>>> * Remember that r13=SP, r14=LR, r15=PC, so they're somewhat less useful
>>> from many perspectives
>>
>> Basically, they've traded orthogonal-ness for code density. In
>> small/cost-sensitive embedded designs, where the code footprint
>> determines the significant part of the IC area and thereby cost, this
>> could be just the right solution.
>
> I sort of accept all that, what I don't get is the fact that for more or
> less my entire IT career, I've been told that all the special x86
> instructions with fixed and/or implied register operands made it very
> hard/impossible to generate really good compiled code, and that this
> problem was solved by having more registers and an orthogonal instruction
> set, i.e. RISC. :-)
>
> (Personally I've never really understood what was so hard about x86,
> except for register pressure, mapping algorithms onto the
> register/instruction set have felt quite natural.)
>
> Terje
>

Register Allocation is a *hard* problem. When the architecture fixes
registers, it gets harder. ARM's 8 registers may be limited compared to the
usual 12, but it's not like it ever forces you to use a given register. This
helps quite a bit.

- Owen
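The point about 3-bit register fields can be made concrete with a small sketch. It assumes the 16-bit Thumb three-register form of `ADDS Rd, Rn, Rm` (a 7-bit opcode followed by three 3-bit register fields); the function name `encode_thumb_adds` is hypothetical and only there for illustration.

```python
def encode_thumb_adds(rd: int, rn: int, rm: int) -> int:
    """Encode the 16-bit Thumb `ADDS Rd, Rn, Rm` (three-register form).

    Illustrative helper, not part of any real assembler. Each register
    gets a 3-bit field, so only r0-r7 can be named.
    """
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 7:
            raise ValueError(f"register {reg} does not fit in a 3-bit field")
    # Layout: 0001100 | Rm (3 bits) | Rn (3 bits) | Rd (3 bits)
    return (0b0001100 << 9) | (rm << 6) | (rn << 3) | rd


if __name__ == "__main__":
    print(hex(encode_thumb_adds(0, 1, 2)))  # adds r0, r1, r2
```

Asking this helper for r8 or above raises an error, which mirrors why a compiler targeting the 16-bit encodings tends to keep hot values in r0-r7: reaching the upper registers means switching to a different, often longer, encoding.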
From: Morten Reistad on 11 Aug 2010 14:04

In article <lro9j7-38b.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Kai Harrekilde-Petersen wrote:
>> Owen Shepherd<owen.shepherd(a)e43.eu> writes:

>>> * Remember that r13=SP, r14=LR, r15=PC, so they're somewhat less useful
>>> from many perspectives
>>
>> Basically, they've traded orthogonal-ness for code density. In
>> small/cost-sensitive embedded designs, where the code footprint
>> determines the significant part of the IC area and thereby cost, this
>> could be just the right solution.
>
>I sort of accept all that, what I don't get is the fact that for more or
>less my entire IT career, I've been told that all the special x86
>instructions with fixed and/or implied register operands made it very
>hard/impossible to generate really good compiled code, and that this
>problem was solved by having more registers and an orthogonal instruction
>set, i.e. RISC. :-)
>
>(Personally I've never really understood what was so hard about x86,
>except for register pressure, mapping algorithms onto the
>register/instruction set have felt quite natural.)

The x86 is a little weird, but not overly so. The various instructions
have lots of implied associations, but nothing totally exotic. For
exotic, try the VAX, or the Prime 50-series.

There is an effect of code compression in the x86 because of all the
implicit associations, but we pay for it with register transfers. With
the cache-memory transfers being the limiting factor such compression
actually has merit.

It makes writing the compiler somewhat more involved, and the linear
equations for optimising code get a few more terms; with the
possibility of local optima that get in the way of optimisation.

The current state of the art seems to be to make "meta-operations" in
the compiler, map these to the x86 api, and then the hardware designers
decode the x86 code, make meta-operations and execute these.

-- mrr
From: Terje Mathisen on 11 Aug 2010 15:02

Owen Shepherd wrote:
>> It also prefers, in 64-bit mode, the 8 old registers vs the 8 new, since
>> those new regs require an extra prefix byte.
>
> x86 has a lot of preferences, yes, but they're not enforced. Preferring A is
> probably pushing non-orthogonality too far from the compiler's perspective.
>
> As for preferring the first 8 registers: This is untrue. It prefers the
> first 8 registers for sub-64-bit operations, yes, however

So basically it does prefer the first 8 regs, right? :-)

>
>>>> 2. ARM has an array of modes for the STM/LDM modes: increment before,
>>>> increment after, decrement before, decrement after. Thumb only has
>>>> STM decrement before (STMDB) and LDM increment after (LDMIA). This
>>>> is not coincidentally the way the stack operates
>>
>> x86 has a real stack...
>
> How is x86's stack any different from ARM's? In fact, ARM's is more flexible,
> because you can push down as many registers as you want in a single
> instruction

That's a load/store multi, it is mostly a win for program size, more
seldom an actual cpu speedup.

>
> It may surprise you, but on x86 it's faster to do
>     sub $n, %rsp
>     mov %rax, 0(%rsp)
>     mov %rbx, 8(%rsp)
> and so on, than to do it with pushes, if you're spilling quite a few
> registers.

On many x86 models that is true, but not all afaik.

Terje
-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
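The disagreement over whether x86-64 "prefers" the first 8 registers comes down to the REX prefix byte, and a short sketch shows both sides. It assumes the register-to-register `mov` form with opcode 0x89 and a register-direct ModRM byte; the helper name `encode_mov_reg_reg` and the use of plain register numbers (0 = rax/eax, 3 = rbx/ebx, 8 = r8, 9 = r9) are choices made for this illustration only.

```python
def encode_mov_reg_reg(dst: int, src: int, wide: bool) -> bytes:
    """Encode x86-64 `mov dst, src` between two general registers.

    Illustrative helper. dst and src are register numbers 0..15;
    wide=True selects 64-bit operand size, wide=False selects 32-bit.
    """
    if not (0 <= dst <= 15 and 0 <= src <= 15):
        raise ValueError("register number out of range")
    # REX = 0100WRXB: W = 64-bit operand, R extends the reg field (src),
    # B extends the r/m field (dst). X is unused without a SIB byte.
    rex = 0x40 | (int(wide) << 3) | ((src >> 3) << 2) | (dst >> 3)
    # ModRM with mod=11 (register direct), reg=src, r/m=dst.
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)
    # The prefix is only emitted when one of its bits is actually needed.
    prefix = bytes([rex]) if rex != 0x40 else b""
    return prefix + bytes([0x89, modrm])


if __name__ == "__main__":
    for dst, src, wide in [(0, 3, False), (8, 9, False), (0, 3, True), (8, 9, True)]:
        code = encode_mov_reg_reg(dst, src, wide)
        print(dst, src, wide, code.hex(), len(code))
```

Under these assumptions the 32-bit move grows from 2 to 3 bytes when it touches r8-r15, while the 64-bit move is 3 bytes either way because the prefix is already needed for the operand size. That is consistent with both posters: the new registers cost a byte for sub-64-bit operations and nothing extra for 64-bit ones.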
From: Owen Shepherd on 11 Aug 2010 15:39

>> How is x86' stack any different from ARM's? In fact, ARM's is more
>> flexible, because you can push down as many registers as you want in a
>> single instruction
>
> That's a load/store multi, it is mostly a win for program size, more
> seldom an actual cpu speedup.

It depends. Load/Store multiple generally are slightly faster, if only
because it gives the CPU more opportunities to perform 64-bit (or bigger)
accesses.
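As a rough illustration of "as many registers as you want in a single instruction", the sketch below builds the 16-bit Thumb `PUSH` encoding, where the registers to store are given as a bitmask. It assumes the form with an 8-bit list for r0-r7 plus one extra bit for LR; the name `encode_thumb_push` is hypothetical. It says nothing about speed, only about how one instruction names several registers.

```python
def encode_thumb_push(low_regs, save_lr: bool = False) -> int:
    """Encode 16-bit Thumb `PUSH {reglist}`.

    Illustrative helper. low_regs is an iterable of register numbers
    in 0..7; save_lr adds LR to the list via a dedicated bit.
    """
    mask = 0
    for reg in low_regs:
        if not 0 <= reg <= 7:
            raise ValueError(f"register {reg} is not encodable in this form")
        mask |= 1 << reg  # one bit per low register
    if mask == 0 and not save_lr:
        raise ValueError("register list must not be empty")
    # Layout: 1011 0 10 M | register_list (8 bits), M = push LR as well
    return 0xB400 | (int(save_lr) << 8) | mask


if __name__ == "__main__":
    print(hex(encode_thumb_push([4], save_lr=True)))  # push {r4, lr}
    print(hex(encode_thumb_push(range(8))))           # push {r0-r7}
```

Because the list is a bitmask, saving eight registers costs the same two bytes as saving one, which is the code-size win mentioned in the thread. Whether it is also faster depends on the implementation, as the posts above note.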
From: Nick Maclaren on 11 Aug 2010 15:58

In article <s95bj7-ib7.ln1(a)laptop.reistad.name>,
Morten Reistad <first(a)last.name> wrote:
>In article <lro9j7-38b.ln1(a)ntp.tmsw.no>,
>Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>Kai Harrekilde-Petersen wrote:
>>> Owen Shepherd<owen.shepherd(a)e43.eu> writes:
>
>>>> * Remember that r13=SP, r14=LR, r15=PC, so they're somewhat less useful
>>>> from many perspectives
>>>
>>> Basically, they've traded orthogonal-ness for code density. In
>>> small/cost-sensitive embedded designs, where the code footprint
>>> determines the significant part of the IC area and thereby cost, this
>>> could be just the right solution.
>>
>>I sort of accept all that, what I don't get is the fact that for more or
>>less my entire IT career, I've been told that all the special x86
>>instructions with fixed and/or implied register operands made it very
>>hard/impossible to generate really good compiled code, and that this
>>problem was solved by having more registers and an orthogonal instruction
>>set, i.e. RISC. :-)

I am afraid that you were taught by religious dogmatists :-(

>>(Personally I've never really understood what was so hard about x86,
>>except for register pressure, mapping algorithms onto the
>>register/instruction set have felt quite natural.)
>
>The x86 is a little weird, but not overly so. The various instructions
>have lots of implied associations, but nothing totally exotic. For
>exotic, try the VAX, or the Prime 50-series.

Yes. The same remarks were made by the same dogmatists about the
System/370 series, and they were even less justified. There were
some weird instructions, but they were used only by people who
wrote assembler procedures and run-time systems. The basic
instruction set was very simple, and that is all that almost all
compilers used.

Regards,
Nick Maclaren.