Subject: Fujitsu SPARC VIII-fx HPC-ACE adds instruction prefixes: variable-length instructions for RISC!!!
From: Robert Myers on 24 Dec 2009 22:50

On Dec 24, 4:34 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net> wrote:
> I'm enjoying reading Hot Chips presentations.
>
> I'm - happy? chuffed? not surprised? interested? - to see one of the
> last bastions of RISC fall down. Fujitsu has added an instruction
> prefix. Albeit a 32-bit instruction prefix, not an 8-bit prefix like
> the AMD x86-64 REX byte. But same idea. New register for the prefix state.
>
> Also specifies extended opcodes for new instructions.
>
> Pardon the mess, but I'll just cut and paste the text from the slide:
>
> Large register sets 2/2
>
> Instruction format for 256 FP registers
>
> 8-bit x 4 (3 read + 1 write) register number fields are necessary for the FMA
> (Floating-point Multiply and Add) instruction.
>
> But SPARC-V9 instruction length is limited (32 bits, fixed).
>
> Defined a new prefix instruction (SXAR) to specify the upper 3 bits of the
> register numbers of the following two instructions:
>
>   SXAR
>   inst1
>   inst2
>
>   Lower 5 bits x 4 (in each instruction); upper 3 bits x 4 (in SXAR)
>
> SXAR (Set XAR) instruction
>
> XAR: Extended Arithmetic Register
> Set by the SXAR instruction.
> Valid bit is cleared once the corresponding subsequent instruction gets
> executed.
>
> Operand fields of SXAR (a 32-bit word, bits 31..16 and 15..0): fv, fsimd,
> furd, furs1, furs2, furs3 for the first instruction; sv, ssimd, surd,
> surs1, surs2, surs3 for the second -- e.g. "furs1" = First Upper Register
> Source-1 bits.
>
> SXAR1: set XAR for the subsequent one instruction.
>
> SXAR2: set XAR for the subsequent two instructions.

Does anyone still care about SPARC? If they do, that would be the real news.

Robert.
From: MitchAlsup on 25 Dec 2009 19:19

Having built several of each, RISC and CISC, this is totally UNsurprising. Overall, it only adds about 16 gates of total pipeline delay to have byte-level instruction lengths. Doing it at the word level cannot add even this many gates. That is, if a RISC design has 70 gates of fall-through delay, an x86 will have but 85-86. The fact that the typical x86 has 256 gates of fall-through delay cannot be blamed on the instruction set! (But I digress.)

In addition, last year I did some consultation with a company that was considering adding a "payload" instruction to the instruction set. The payload instruction carried a number of bits that other instructions (already defined) could consume. You might use such a feature to carry some more addressing bits, some register-specifying bits, or some instruction-set-expanding bits. The payload instruction did not care how the bits were consumed. (But again, I digress.)

What we are arriving at is a point where we (the microarchitects and implementers) have exploited all that is exploitable from the architects of the past (6600, 360-91, 360-85, 360-67) in the context of general purpose. If one looks at the distance in time between the 360 ISA introduction and the first RISC ISA introductions, we have about 20 years. Now, it has been another 20 years and this keg seems tapped out. In order to accrete that last modicum of performance for that last application someone cares about, half a zillion instructions are thrown in. This is a sign that things are not well in architecture-ville. But, of course, the problem is not even in the instruction set, and has not been since the 1-million-transistor level. That is, as long as the instructions that get created exist within the kinds of data-flow the microarchitecture already supports, adding instructions is, for all intents and purposes, (almost) free.
It certainly takes more die area to manage the data-flow than to manage the data computations, so to a first order, adding instructions is free (at the large end of processor microarchitecture). In addition, to a large extent, nobody cares about the instruction set since compilers got "reasonably good". As long as the programmer does not have to see the instruction set, why should the customer care?

This new foray into MAC-ville with 3-operand instructions (sometimes with a 4th destination) simply causes the microarchitect to provide adequate register ports and adequate reservation-station tracking. As long as this does not break the camel's back, it's OK -- not great, but not worse than OK either. Just plan for it and get on with life.

So, where are the instructions designed to allow the n-way multiprocessor to do synchronizations 10X faster than current? (OK, how about 2X with guaranteed forward progress for at least one thread.) This is really the kind of breakthrough that the large machines need. (Where n is greater than 64.) Even the scientific number crunchers would benefit from better synchronizations.

So, where are the new technologies to allow greater bandwidths to greater memory with lesser latency? Say, 1 TB main memories with (say) 100 ns total latency average case (OK, maybe 150 ns total latency with up to 64 nodes accessing the cabinet filled with DRAMs).

Seems to me that too many clever people are doing the processors (squeezing the last blood from the stone), and too few are doing the microarchitecture of the rest of the system (adding blood to the stone).

Merry Christmas,
Mitch
From: Mayan Moudgill on 26 Dec 2009 11:46

MitchAlsup wrote:
>
> So, where are the instructions designed to allow the n-way
> multiprocessor to do synchronizations 10X faster than current? (OK, how
> about 2X with guaranteed forward progress for at least one thread.)

Synchronization is just one part of the communication between two CPUs; it's generally followed by a transfer of some amount of data. In many cases, the data transfer completely dominates this overall communication, so the cost of the synchronization is in the noise.

Further, synchronization is done at the level of "processes", not hardware. If a process happens to be swapped out or not yet ready to synchronize, the wait time for the last process to get to the synchronization point will dominate the overall cost.

The overall performance impact on the program of improving the hardware support for synchronization is, IMO, generally going to be insignificant. Can you show studies to the contrary?

> This is really the kind of breakthrough that the large machines need.
> (Where n is greater than 64.) Even the scientific number crunchers
> would benefit from better synchronizations.

Are there any studies, particularly on non-micro-benchmark codes, that would quantify this improvement?
From: nmm1 on 26 Dec 2009 12:11

In article <NMmdnREyrJlroKvWnZ2dnUVZ_qOdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>MitchAlsup wrote:
>
>> So, where are the instructions designed to allow the n-way
>> multiprocessor to do synchronizations 10X faster than current? (OK, how
>> about 2X with guaranteed forward progress for at least one thread.)
>
>Synchronization is just one part of the communication between two CPUs;
>it's generally followed by a transfer of some amount of data. In many
>cases, the data transfer completely dominates this overall
>communication, so the cost of the synchronization is in the noise.

For message-passing codes, perhaps. For the shared-memory parallel paradigms that are currently trendy, not at all.

>Further, synchronization is done at the level of "processes", not
>hardware. If a process happens to be swapped out or not yet ready to
>synchronize, the wait time for the last process to get to the
>synchronization point will dominate the overall cost.

If you are working with very coarse-grained parallelism, then I agree hardware instructions are irrelevant.

>The overall performance impact on the program of improving the hardware
>support for synchronization is, IMO, generally going to be
>insignificant.

Don't bet on it. What it does is to make it feasible to parallelise the sort of program where the parallelism cannot be made coarse-grained, or where there is potentially much more gain from fine-grained parallelism.

>Can you show studies to the contrary?

I could, once. I no longer have easy access to the relevant classes of system.

>> This is really the kind of breakthrough that the large machines need.
>> (Where n is greater than 64.) Even the scientific number crunchers
>> would benefit from better synchronizations.
>
>Are there any studies, particularly on non-micro-benchmark codes, that
>would quantify this improvement?

Yes.
How many have been published in places you can find them, or even written up suitably for publication, I don't know. I know that mine weren't.

Note that the situation involves more than just the synchronisation operations, because a lot of it is about scheduling. If you are trying to parallelise code with a 10-microsecond grain, having to do ANY interaction with the system scheduler runs the risk of a major problem. That is one of the main reasons that almost all HPC codes rely on gang scheduling, with all threads running all the time.

Regards,
Nick Maclaren.
From: Mayan Moudgill on 26 Dec 2009 12:42
nmm1(a)cam.ac.uk wrote:
> In article <NMmdnREyrJlroKvWnZ2dnUVZ_qOdnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>MitchAlsup wrote:
>>
>>>So, where are the instructions designed to allow the n-way
>>>multiprocessor to do synchronizations 10X faster than current? (OK, how
>>>about 2X with guaranteed forward progress for at least one thread.)
>>
>>Synchronization is just one part of the communication between two CPUs;
>>it's generally followed by a transfer of some amount of data. In many
>>cases, the data transfer completely dominates this overall
>>communication, so the cost of the synchronization is in the noise.
>
> For message-passing codes, perhaps. For the shared-memory parallel
> paradigms that are currently trendy, not at all.

So core 1 writes some data, cores 1 & 2 synchronize, and core 2 reads the data. What actually happens post-synchronization? Well, cache lines get copied from the dcache of CPU 1 to the dcache of CPU 2. This takes time, and the time will be proportional to the amount of shared data. The cost can actually be higher than in the case of an explicit message-passing system.

The synchronization, by contrast, can involve the transfer of exactly one cache line [e.g. if you're doing an atomic increment]. More heavyweight synchronization operations (such as a lock with suspend-on-the-lock if already locked) *can* be more expensive -- but the cost is due to all the additional function in the operation. It's not clear that tweaking the underlying hardware primitives is going to do much for this.