From: Iain McClatchie on 13 Sep 2005 05:10

Mash> There is a great deal of pushback in introducing features that
Mash> might add gate delays in awkward places, of which two are:
Mash> a) Something only computable on the *output* of an ALU operation
Mash> b) The result of a load operation

Mash> In many implementations, such paths may be among the critical
Mash> paths. Sometimes, the need to get a trap indication from an
Mash> ALU, FP ALU, Load/store unit to the instruction fetch unit
Mash> may create a long wire that causes serious angst, or yelling
Mash> in design meetings.

Hmm... a feature that hangs some logic on the output of the ALU or
load pipe, and causes a pipe flush and IF retarget if the logic
detects some condition.

I don't think this is a problem, Mash. We're already doing this for
integer overflow and various floating-point exceptions. Suppose for a
moment that the additional complexity of the feature added a pipe
stage to this recurrence... in an OoO core, who cares? GPR writeback
is unaffected; you just have more logic writing to the tag bits in
the reorder buffer. It's not as if we're going to see one or more
exceptions per 1000 instructions... right?

Now, what would be very unpopular with the CPU guys would be
instructions that monkey around with the dataflow inside the ALU. I
skimmed the description of the SPARC tagged adds, but they sounded
like just the kind of thing I'd want to kick out of the hardware,
because getting data through the ALU really is the common case.

Heck, I'd like to get rid of sign extension on loads. In an earlier
proposal, I wanted to bolt an ALU (including a shifter) onto the end
of the load pipe, so that the op after the load could be scheduled
with the load in one go. The trouble is that raw pointer chasing is
just too popular, and you don't want the load pipe latency dinking
back and forth between two values.

Side note: earlier in this thread people seemed to be having trouble
with the difference between jumps/branches and exceptions. On OoO
CPUs, there is one relevant distinction: predicted versus
non-predicted control flow. For instance, it might be totally
reasonable for the processor to predict TLB faults on certain load
instructions, and avoid the double pipe flush by predicting the
exception. So... exceptions to get out of loops don't change the
problem the core faces. Now, a separate issue is how that control
flow is encoded. It is definitely the case that instruction fetch
engines are having a great deal of difficulty with all these
branches. Once predicted, verifying the predictions is actually not
too bad, which is why trace caches are so enticing.
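[For concreteness, the trap condition of the SPARC tagged adds
mentioned above can be modeled roughly as below. This is a sketch,
not exact TADDccTV semantics, and the function name and types are
illustrative. The point, which Mashey makes in his reply, is that the
tag part of the check reads only the low two bits of each *input*, so
it need not hang off the ALU output.]

    #include <stdint.h>

    /* Sketch of a SPARC-style tagged-add trap condition.  Values carry
     * a 2-bit type tag in their low bits; 00 means "small integer".
     * This part of the check uses only four input bits, so it can run
     * in parallel with the add itself.  (A real TADDccTV also traps on
     * ordinary signed overflow, which does need the adder output.) */
    int tagged_add_tag_trap(int32_t a, int32_t b)
    {
        return ((a | b) & 3) != 0;   /* nonzero tag in either operand */
    }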
From: JJ on 13 Sep 2005 16:48

John Mashey wrote:
> David Hopwood wrote:
> > andrewspencers(a)yahoo.com wrote:
> > > Terje Mathisen wrote:
> >
> > A slightly different situation is where you have code that in practice
> > always handles integers that fit in a single word, but that can't be
> > statically guaranteed to do so, and the language specification says that
> > bignum arithmetic must be supported -- the obvious example being Smalltalk.
> > There were some attempts to support this in hardware (e.g. "Smalltalk on
> > a RISC"; also something on SPARC that I can't remember the details of),
> > but it turned out to be easier and faster for implementations of Smalltalk
> > and similar languages to use other tricks that don't require hardware support.
> snipping
> Anyway, it's pretty clear that relevant mechanisms were being discussed
> ~20 years ago, but nobody seems to have figured out features that
> actually make implementation sense. I'd be delighted to see a
> well-informed proposal that had sensible hardware/software
> implementations and really helped LISP/Smalltalk/ADA and hopefully
> other languages...

I suspect that in current single-threaded processor designs, clocked
to the max, with the current cache model, such a proposal would be
hard to come by and justify, especially when the memory wall forces
such extreme locality of reference and so many wait states.

A processor designed solely around communicating sequential processes
running on multiple MTAs can hide memory latency fairly well (well
known). By sharing a high-issue-rate RLDRAM, with say 200M-400M
interleaved load/stores per second, driven by a nice hash box to
destroy all locality of reference from numerous PE requests and to
reduce bank collisions to random chance, object support comes
naturally. The hashing takes 32b object/MMU IDs and hashes them with
a 32b linear index down to the particular PA size. Object IDs are
generated by new[] using a PRNG; MMU IDs are enumerated at boot time
over Links. A 32-MByte RLDRAM can appear to store up to 1M
single-line objects, more typically <<100K objects of all types and
sizes. By trading space for rehashes, performance can be kept good.

Message object IDs are passed around through channels synchronized by
! and ?. Besides occam support, ADA, Lisp, and Smalltalk support
comes to mind all the time: object support in hardware down to a very
fine grain (32-byte pages or lines), with full protection of all
object lines. It makes lists, sparse arrays, and hash tables a snap;
they all fit right on top of each other, all Mashed up, as long as
memory is less than, say, 70% full.

The MMU model can be tested out in a compiler using it for its own
object store, but that test is only single-threaded. For more
performance, the scheme can be replicated at lower (<1 ns) and higher
(50 ns) levels, for raw flat-memory throughput or volume. At the
sub-ns level, it allows, say, 16-way-interleaved, N-cycle concurrent
SRAM banks to appear to match the performance of the MMU issue box,
even with relatively slow SRAMs (or maybe even 5 ns DRAM). At the
other end, the SDRAM controller has little throughput, but its
latency is only a few times that of RLDRAM.

You takes your choice: the wait states of a few huge processors, or
numerous hardware threads on many simple processors; I'll take many
threads anytime. In this scheme it's the MMU that's really
interesting; the PEs are just little grunt boxes generating enough
memory requests to keep the MMU near 100% busy. Even the PE ISA
doesn't matter much; a 486 or RISC ISA would work as well as anything
else, given the extra par support.
Anyway, I will describe it at cpa2005 for anyone interested.
johnjakson at usa dot ...
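[For the curious, the hashed object-line lookup JJ describes might
look roughly like this in C. Everything here, the mixing function,
field widths, and names, is an illustrative guess rather than JJ's
actual design, and collision handling (rehash/probing) is omitted.]

    #include <stdint.h>

    /* Illustrative sketch of a hashed object store: a 32-bit object ID
     * and a 32-bit line index within the object are mixed together,
     * then folded down to the physical line count of the part, so that
     * consecutive lines of one object scatter across banks and
     * locality of reference is deliberately destroyed. */
    static uint32_t mix32(uint32_t x)    /* any strong 32-bit mixer works */
    {
        x ^= x >> 16;  x *= 0x7feb352dU;
        x ^= x >> 15;  x *= 0x846ca68bU;
        x ^= x >> 16;
        return x;
    }

    /* pa_lines: total physical 32-byte lines, assumed a power of two. */
    uint32_t object_line_to_pa(uint32_t object_id, uint32_t line_index,
                               uint32_t pa_lines)
    {
        uint32_t h = mix32(object_id ^ mix32(line_index));
        return h & (pa_lines - 1);   /* physical line number */
    }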
From: John Mashey on 13 Sep 2005 20:43

Iain McClatchie wrote:
> Mash> There is a great deal of pushback in introducing features that
> Mash> might add gate delays in awkward places, of which two are:
> Mash> a) Something only computable on the *output* of an ALU operation
> Mash> b) The result of a load operation
>
> Mash> In many implementations, such paths may be among the critical
> Mash> paths. Sometimes, the need to get a trap indication from an
> Mash> ALU, FP ALU, Load/store unit to the instruction fetch unit
> Mash> may create a long wire that causes serious angst, or yelling
> Mash> in design meetings.
>
> Hmm... a feature that hangs some logic on the output of the ALU or
> load pipe, and causes a pipe flush and IF retarget if the logic
> detects some condition.
>
> I don't think this is a problem, Mash. We're already doing this
> for integer overflow and various floating-point exceptions. Suppose
> for a moment that the additional complexity of the feature added a
> pipe stage to this recurrence... in an OoO core, who cares? GPR
> writeback is unaffected, you just have more logic writing to the tag
> bits in the reorder buffer.

Of course (i.e., it might not matter in an OoO core), but you may
have missed the careful weasel-words "In many implementations". After
all, of the horde of distinct pipeline implementations that have ever
existed, only a tiny fraction are OoO...

For what it's worth, there was some argument about this (overflow in
the R2000) in 1985, because it was literally the *only* integer
exception that needed to be detected after the ALU stage, and in time
to inhibit register writeback, and somebody was worried about a
possible extra delay for a while.

> Now what would be very unpopular with the CPU guys would be
> instructions that monkey around with the dataflow inside the ALU.
> I skimmed the description of the Sparc tagged adds, but they
> sounded like just the kind of thing I'd want to kick out of the
> hardware, because getting data through the ALU really is the
> common case.

Again, I don't think the SPARC tagged ops are so bad, because they
just look at two bits each of the two inputs, so one can detect the
trap early.

> Heck, I'd like to get rid of sign extension on loads. In an earlier
> proposal, I wanted to bolt an ALU (including shifter) onto the end of
> the load pipe, so that the op after the load could be scheduled with
> the load in one go. The trouble is that raw pointer chasing is just
> too popular, and you don't want the load pipe latency dinking back
> and forth between two values.

You hardware guys are all alike [in hating sign-extension on loads]
:-). We seriously looked at various schemes found elsewhere, i.e.,
where one loads zero-extended partial-word data and then uses an
explicit EXT to sign-extend. We had enough data to prefer having both
zero-extend and sign-extend as operations, and if push had really
come to shove, I would have lived with an explicit EXT, although
having done 68K compiler work, and dealt with some of the funny
optimization hassles (i.e., can one get correct results without the
EXT, sometimes?), I certainly preferred to have the signed-load
opcodes as first choice. My second choice would have been 2-cycle
load-signeds. Third choice was the explicit EXT.
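[The two schemes Mashey weighs can be shown side by side in C; a
sketch assuming a 32-bit machine, with the xor/subtract idiom
standing in for an explicit EXT instruction.]

    #include <stdint.h>

    /* First choice: a signed-load opcode does both steps at once. */
    int32_t load_sb(const int8_t *p)  { return *p; }  /* LB: load + sign-extend */

    /* Third choice: zero-extending load plus an explicit EXT op.
     * A compiler can sometimes prove the EXT unnecessary (e.g. when
     * the upper bits are masked off anyway), which is the source of
     * the "funny optimization hassles" above. */
    int32_t load_zb_then_ext(const uint8_t *p)
    {
        uint32_t z = *p;                        /* LBU: zero-extend */
        return (int32_t)((z ^ 0x80U) - 0x80U);  /* EXT: sign-extend bit 7 */
    }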
From: Seongbae Park on 13 Sep 2005 22:51

John Mashey <old_systems_guy(a)yahoo.com> wrote:
....
> You hardware guys are all alike [in hating sign-extension on loads]
> :-).

I haven't met a hardware guy who likes that, either.

> We seriously looked at various schemes found elsewhere, i.e., where one
> loads zero-extended partial-word data, and then uses an explicit EXT to
> sign-extend. We had enough data to prefer having both zero-extend and
> sign-extend as operations, and if push had really come to shove, I
> would have lived with an explicit EXT, although having done 68K
> compiler work, and dealt with some of the funny optimization hassles
> (i.e., can one get correct results without the EXT, sometimes?)
> I certainly preferred to have the signed-load opcodes as first choice.
> My second choice would have been 2-cycle load-signeds.

Well, if the sign-extend version takes more cycles than the
zero-extend one - I suppose your second choice meant such a case - it
creates the same funny optimization hassle, and that optimization
brings occasional bug reports that cry wolf over a zero-extend load
that correctly replaced a sign-extend load ("It's a signed char in my
code. Why is the compiler using a zero-extend load? The compiler must
be buggy!"). And since ISAs usually don't define exact cycle counts,
nor do they require two operations to take the same number of cycles
or the same issue/execution/etc. resources, implementations of ISAs
that have both versions tend to take an extra cycle for the
sign-extend load.

> Third choice was the explicit EXT.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
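[An illustrative example of the bug reports Park describes; the code
is made up for this note, not from the post. The source says signed
char, yet a zero-extend load is perfectly correct here, because only
the low eight bits can affect the comparison.]

    /* The source declares a signed char, but a compiler may emit the
     * cheaper zero-extending load: comparing against '\n' (0x0A)
     * gives the same answer whether the byte was sign- or
     * zero-extended, since both extensions agree on the low 8 bits
     * and 0x0A is small and positive. */
    int is_newline(const signed char *p)
    {
        signed char c = *p;    /* a "sign-extend load" in the source... */
        return c == '\n';      /* ...but LBU is a valid implementation */
    }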
From: Nick Maclaren on 14 Sep 2005 03:32
In article <1126658582.506368.173210(a)g43g2000cwa.googlegroups.com>,
John Mashey <old_systems_guy(a)yahoo.com> wrote:
>
>For what it's worth, there was some argument about this (overflow in
>R2000) in 1985, because it was literally the *only* integer exception
>that needed to be detected after the ALU stage, and in time to inhibit
>register writeback, and somebody was worried about a possible extra
>delay for a while.

Why on earth was that? I.e., why should it need to inhibit register
writeback? MIPS is two's complement, and the only real advantage of
that is that it enables writeback and overflow flagging to be done in
either order. If the architecture specified that writeback did not
occur when overflow occurred, then the designers weren't thinking
about that aspect. It isn't as if it wasn't an ancient problem, after
all.

Regards,
Nick Maclaren.
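[Nick's point made concrete in C: in two's complement the wrapped sum
is well defined, so a pipeline may write the result back first and
derive the overflow flag afterwards, purely from the sign bits of the
operands and the already-written result. A sketch; the function name
is made up.]

    #include <stdint.h>

    /* A signed add overflowed iff both operands have the sign opposite
     * to that of the (wrapped) result.  Nothing here needs to run
     * before register writeback. */
    int add_overflowed(int32_t a, int32_t b, int32_t result)
    {
        return ((a ^ result) & (b ^ result)) < 0;
    }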