From: MitchAlsup on 18 Jun 2010 18:41

On Jun 18, 2:28 pm, r...(a)clozure.com (R. Matthew Emerson) wrote:
> "nedbrek" <nedb...(a)yahoo.com> writes:
> > But Mitch has it right. Architecture does not matter.

I think it might be better to say: instruction sets don't matter; the rest of what we call architecture does matter, now and again.

> You guys keep saying this, and maybe for large majority of people it is
> even true.
>
> But I still say that ISA makes a difference. As an example, our Common
> Lisp implementation targeted only PowerPC for a long time.

Here we have the classic mismatch of architecture and application. I might note that those machines that had the instruction set infrastructure to support <the various> LISPs did not end up surviving into the present (save, <ahem>, SPARC). These architectures were also pretty good at Prolog, and at emulating other instruction sets.

Me, I write LISP in C. That is, for those applications (and there are some) that are best written in the LISP style (without a self-interpreting nature), I write them in C. The 88K assembler code scheduler is one in particular.

Mitch

{Note: I am in no way deriding your product or its needs.}

From: nmm1 on 19 Jun 2010 02:46

In article <0922168e-6d6f-4480-85ec-fa5996c336a7(a)z10g2000yqb.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:
>On Jun 18, 2:28 pm, r...(a)clozure.com (R. Matthew Emerson) wrote:
>> "nedbrek" <nedb...(a)yahoo.com> writes:
>> > But Mitch has it right. Architecture does not matter.
>
>I think it might be better to say, Instruction sets don't matter, the
>rest of what we call architecture does matter, now and again.

Er, no. Sorry. I agree that the days when the instruction set made a big difference to performance are long gone - but that's only 20 years gone, not 40.

However, the same does NOT apply to RAS and usability. Any defects cause trouble to compilers and debuggers, and one result is higher software costs, and lower RAS and usability.

Also, most weird properties, dogmas etc. have a tendency to show through. A classic example here is the way that integer overflow used to be (and occasionally still is) trapped - but overflow in multiplication rarely was. Why? Well, the basic instructions rarely did ....

Regards,
Nick Maclaren.

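[Editorial sketch, not part of the original post.] One way to read the point about multiplication: detecting signed overflow on an add needs only the operand and result signs, while detecting it on a multiply needs the high half of the double-width product, which a multiply that returns only the low word does not provide. The 32-bit model below is a minimal illustration under that assumption; the function names are hypothetical and it does not describe any particular machine.

```python
# Toy 32-bit two's-complement model (illustrative only; names are hypothetical).
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def wrap32(value):
    """Reduce an integer to its signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def add_overflows(a, b):
    """Signed add overflows iff both operands share a sign that the
    wrapped sum does not have. Only sign bits are inspected."""
    s = wrap32(a + b)
    return (a >= 0) == (b >= 0) and (s >= 0) != (a >= 0)


def mul_overflows(a, b):
    """Signed multiply overflows iff the high 32 bits of the full product
    are not the sign extension of the low 32 bits. This needs the high
    word, which a low-word-only multiply never produces."""
    product = a * b                      # Python ints are exact
    low = product & 0xFFFFFFFF           # what a low-word multiply returns
    high = product >> 32                 # the part such a multiply discards
    expected_high = -1 if low & 0x80000000 else 0
    return high != expected_high
```

For example, `add_overflows(INT32_MAX, 1)` and `mul_overflows(65536, 65536)` both report overflow, but only the second check had to look at bits outside the 32-bit result.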
From: Brett Davis on 19 Jun 2010 03:06

In article <2010Jun17.172422(a)mips.complang.tuwien.ac.at>,
anton(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Stephen Sprunk <stephen(a)sprunk.org> writes:
> >On 17 Jun 2010 00:33, Brett Davis wrote:
> >> RISC load-store versus x86 Add from memory.
> >> t = a->x + a->y;
> >>
> >> RISC
> >load r1, a[0]
> >load r2, a[1]
> >add r3, r1, r2
> >
> >> x86
> >load r1, a[0]
> >add r1, a[1]
> >
> >> RISC shows its superiority by being 50% more instructions and 50% slower...
>
> It's just as easy to find an example where IA-32 and AMD64 have 100%
> more instructions:
>
> x = y+z;
>
> where x, y, and z are locals that live in registers, and y and z are
> alive after this statement. On RISC:
>
> add x<-y+z;
>
> On IA-32/AMD64:
>
> mov x<-y
> add x<-x+z

"Move elimination" has been mentioned in this thread, and I confirmed that Intel is merging the load micro-op into the add micro-op.

From "Intel® 64 and IA-32 Architectures Optimization Reference Manual" page 2-9, section 2.1.2.6:
http://www.intel.com/Assets/PDF/manual/248966.pdf
http://www.intel.com/products/processor/manuals/

(AMD may have done this first; each AMD integer unit has an address unit.)

"
2.1.2.6 Micro-fusion

Micro-fusion fuses multiple µops from the same instruction into a single complex µop. The complex µop is dispatched in the out-of-order execution core. Micro-fusion provides the following performance advantages:

• Improves instruction bandwidth delivered from decode to retirement.
• Reduces power consumption as the complex µop represents more work in a smaller format (in terms of bit density), reducing overall "bit-toggling" in the machine for a given amount of work and virtually increasing the amount of storage in the out-of-order execution engine.

Many instructions provide register flavors and memory flavors. The flavor involving a memory operand will decodes into a longer flow of µops than the register version. Micro-fusion enables software to use memory to register operations to express the actual program behavior without worrying about a loss of decode bandwidth.
"

See also "2.1.2.4 Instruction Decode"

You can do the same with an OoO RISC chip, but it's harder, I believe you would need an extra write port. I do not know of a RISC chip that does fusion with reads, I do know that PowerPC does do some Micro-fusion on other opcodes.

We are back to my original question, is Add from Memory RISCier than RISC for a hugely OoO design?

(The real win is less than 50%, far less, you have to be starved for issue slots.) The power savings is real, and important.

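[Editorial sketch, not part of the original post.] The instruction counts quoted above, and the effect micro-fusion has on them, can be tallied with a toy model. It assumes that an ALU instruction with a memory operand costs two µops when unfused and one tracked µop when micro-fused, and that every other instruction costs one. The `Insn` class, the flag, and the function names are hypothetical; real decoders are considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    text: str
    alu_with_memory_operand: bool = False  # e.g. "add r1, a[1]"


# The two sequences quoted in the thread for: t = a->x + a->y
RISC = [Insn("load r1, a[0]"), Insn("load r2, a[1]"), Insn("add r3, r1, r2")]
X86 = [Insn("load r1, a[0]"),
       Insn("add r1, a[1]", alu_with_memory_operand=True)]


def tracked_uops(seq, micro_fusion):
    """Count µops the out-of-order core tracks (toy assumption:
    a load-op is 2 µops unfused, 1 complex µop when micro-fused)."""
    total = 0
    for insn in seq:
        if insn.alu_with_memory_operand and not micro_fusion:
            total += 2  # separate load µop and add µop
        else:
            total += 1
    return total


def percent_more(longer, shorter):
    """How many percent more entries 'longer' has than 'shorter'."""
    return 100 * (longer - shorter) // shorter
```

Under these assumptions the x86 sequence occupies three tracked µops without micro-fusion, the same as the three RISC instructions, and two with it, so the 50% figure is about instruction count and tracked slots rather than a measured speed difference.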
From: jacko on 19 Jun 2010 06:51

On Jun 19, 8:06 am, Brett Davis <gg...(a)yahoo.com> wrote:
> In article <2010Jun17.172...(a)mips.complang.tuwien.ac.at>,
> an...(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> We are back to my original question, is Add from Memory RISCier than RISC
> for a hugely OoO design?
>
> (The real win is less than 50%, far less, you have to be starved for issue slots.)
> The power savings is real, and important.

Yes. I wonder whether, in my NiBZ design, adding extra cycles to the instruction would significantly reduce area/power. The shadow registers take up some space, and the 3 in 1 decoder takes up time, lowering the to-memory speed and reducing Fmax. Maybe a single extra cycle could address both of these, freeing up space for something else of use.

Cheers
Jacko.

From: nedbrek on 19 Jun 2010 08:23

Hello all,

"Brett Davis" <ggtgp(a)yahoo.com> wrote in message
news:ggtgp-6935B6.02064019062010(a)news.isp.giganews.com...
> > 2.1.2.6 Micro-fusion
>
> You can do the same with an OoO RISC chip, but it's harder, I believe you
> would need an extra write port. I do not know of a RISC chip that does
> fusion with reads, I do know that PowerPC does do some Micro-fusion on
> other opcodes.
>
> We are back to my original question, is Add from Memory RISCier than RISC
> for a hugely OoO design?
>
> (The real win is less than 50%, far less, you have to be starved for issue
> slots.) The power savings is real, and important.

I believe the sequence you are describing is:
add r1 += [r2]

The advantage CISC has is that the uop sequence looks like:
ld tmp = [r2]
add r1 += tmp

Since tmp is not an architected register, it does not have to be preserved for an interrupt, or seen past the use in add (it is known dead). Thus, it can exist strictly in the bypass network (it is not allocated a rename register, it is not visible to later instructions [does not participate in renaming], and has no architected effects at retirement).

The RISC sequence will always be (ld r3 = [r2]; add r1 += r3). r3 is live out, and must be architecturally visible. You can smash ops together, giving you
r3,r1 = load-op [r2] + r1

You can't say just "need an extra write port" unless you have a simple 5 stage pipeline. In a modern machine, this means extra decode bits (in the scheduler and ROB), extra RAT ports, and extra complexity come retirement time (do you allow every instruction to update two entries in the retirement register table?)

Ned
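
[Editorial sketch, not part of the original post.] The difference Ned describes comes down to how many architected destinations a fused operation carries. The toy model below only counts them; the dictionary layout, the `ARCHITECTED` set, and the function names are hypothetical and stand in for the rename and retirement machinery, which is not modelled.

```python
# Registers the ISA exposes; "tmp" is deliberately absent because it is
# an internal µop temporary, not an architected register.
ARCHITECTED = {"r1", "r2", "r3"}

# CISC load-op: ld tmp = [r2]; add r1 += tmp
cisc_load_op = {"name": "add r1 += [r2]", "dests": ["tmp", "r1"]}

# Fused RISC pair: ld r3 = [r2]; add r1 += r3
risc_fused = {"name": "r3,r1 = load-op [r2] + r1", "dests": ["r3", "r1"]}


def architected_dests(op):
    """Destinations that need a rename entry and must become visible
    at retirement. Non-architected temporaries are skipped."""
    return [d for d in op["dests"] if d in ARCHITECTED]


def retirement_updates(op):
    """Number of retirement register table entries the op must update."""
    return len(architected_dests(op))
```

In this model the CISC load-op updates one entry (`r1`) while the fused RISC pair updates two (`r3` and `r1`), which is the extra cost in RAT ports and retirement logic that the post points to.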