From: glen herrmannsfeldt on 18 Sep 2005 13:02

Terje Mathisen wrote:
> glen herrmannsfeldt wrote: (snip)
>> In the case of a conditional branch based on the overflow bit,
>> INTO trap instruction, or overflow trap without any added
>> instructions, nothing after that can be committed until the status
>> bit is available. I would assume that a branch on overflow would be
>> predicted as not taken.
>> Other than the cost of the actual instruction bits, which are pretty
>> variable over different architectures, would you expect any
>> difference at runtime for the three cases?
>
> The only important difference is that INTO is effectively a
> predicted-not-taken call, making it very easy to return to the
> correct point after fixup.

I was only considering the cost of the instruction, especially in the
case that it isn't taken.

> To replace this with an inline branch instruction would require the
> capability to save the next (i.e. return) address.

If you have only one in a loop you know where it is...

> If your architecture can do this, then by all means use an inline
> branch-and-link instead of the INTO style check!

But how much cost is there in not being able to retire the instruction
until the overflow status is known? Assuming the usual out-of-order,
but in-order retire model.

> The only real difference in this case would be that the branch opcode
> probably takes 4 bytes, vs the single byte of INTO. However, as long
> as you have fixed 16 or 32-bit instruction sizes anyway, it really
> doesn't matter.

Imprecise interrupts were well hated in the days when they were
around, but consider the case where you really don't need to know
exactly where the interrupt is. Maybe a special ADDIIO, ADD with
Imprecise Interrupt on Overflow, instruction. You could use them in a
loop, and would know which iteration of the loop it happened in (put a
barrier instruction near the end of the loop), but not necessarily
exactly where.

Then again, maybe all you need is a sticky overflow bit. You could do
some set of calculations and test once at the end for any overflow,
and clear the sticky overflow bit at that time.

I don't see conditional interrupts as any better than conditional
branches as far as pipelined out-of-order processors are concerned.

-- glen
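[The sticky-overflow idea glen describes can be emulated in software
today. A minimal C sketch, assuming a compiler with GCC/Clang's
__builtin_add_overflow; the function name is invented for
illustration:]

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sum an array, accumulating a software "sticky overflow bit"
     * instead of testing (or trapping) after every addition.  The
     * flag is examined exactly once, after the whole loop. */
    bool sum_with_sticky_overflow(const int32_t *a, size_t n,
                                  int32_t *out)
    {
        int32_t sum = 0;
        bool sticky = false;    /* the software sticky overflow bit */

        for (size_t i = 0; i < n; i++) {
            /* __builtin_add_overflow returns true on signed overflow;
             * OR-ing into the flag makes it sticky. */
            sticky |= __builtin_add_overflow(sum, a[i], &sum);
        }

        *out = sum;
        return !sticky;         /* caller tests once, at the end */
    }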
From: Terje Mathisen on 18 Sep 2005 15:08

glen herrmannsfeldt wrote:
> Terje Mathisen wrote:
>> To replace this with an inline branch instruction would require the
>> capability to save the next (i.e. return) address.
>
> If you have only one in a loop you know where it is...

This would be a _very_ special case, since it would require a loop
with only a single int variable, or at least you must be able to prove
that only this int can ever overflow. In that case you can of course
skip the overflow testing.

>> If your architecture can do this, then by all means use an inline
>> branch-and-link instead of the INTO style check!
>
> But how much cost is there in not being able to retire the
> instruction until the overflow status is known? Assuming the usual
> out-of-order, but in-order retire model.

Not much: processing can easily continue for at least one or two
cycles, forwarding the preliminary result to the next user, and as
long as the vast majority of these ints won't actually overflow, the
delayed retirement does not correspond to any actual branch misses.

> Then again, maybe all you need is a sticky overflow bit. You could
> do some set of calculations and test once at the end for any
> overflow, and clear the sticky overflow bit at that time.

Nice idea. If overflows are really rare, then it would be a win to
redo the entire loop to save testing on every operation.

> I don't see conditional interrupts as any better than conditional
> branches as far as pipelined out-of-order processors are concerned.

Rather the opposite, since branch handling is optimized with serious
hw resources anyway.

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
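[Terje's redo-the-entire-loop point, sketched in C under the same
__builtin assumption; scale_all and its wide fallback are invented
names:]

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Slow path: redo the work in wider arithmetic, where this
     * particular operation cannot overflow. */
    static void scale_all_wide(const int32_t *src, int64_t *dst,
                               size_t n, int32_t k)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = (int64_t)src[i] * k;
    }

    /* Fast path: no per-element branch, just a sticky OR.  If any
     * element overflowed, throw the results away and redo the whole
     * loop with the wide version. */
    void scale_all(const int32_t *src, int64_t *dst,
                   size_t n, int32_t k)
    {
        bool sticky = false;
        for (size_t i = 0; i < n; i++) {
            int32_t r;
            sticky |= __builtin_mul_overflow(src[i], k, &r);
            dst[i] = r;
        }
        if (sticky)             /* rare: pay for the redo only then */
            scale_all_wide(src, dst, n, k);
    }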
From: Andy Freeman on 19 Sep 2005 10:39

Terje Mathisen wrote:
> Eliot Miranda wrote:
>> highest dynamic frequency for integer math is likely to be for
>> iteration.
>
> Right, it is hard to imagine how it could be otherwise.

Most checks on iteration vars can be pulled out of the loop. In most
of the others, the check can be done every <large constant>
iterations.
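[One way to read "pulled out of the loop": if the trip count and
stride are known, a single widened test before the loop proves the
induction variable can never overflow. A sketch, assuming a positive
stride; the function name is invented:]

    #include <stdint.h>

    /* Replace the per-iteration "does i + stride overflow?" test
     * with one hoisted test: the loop takes n steps of +stride from
     * start, so the final value is start + n*stride, computed here
     * in 64-bit arithmetic where it cannot itself overflow. */
    int iterate_checked(int32_t start, int32_t stride, int32_t n)
    {
        if ((int64_t)start + (int64_t)n * stride > INT32_MAX)
            return -1;      /* would overflow: handled once, here */

        for (int32_t i = start; n-- > 0; i += stride) {
            /* loop body: i is provably in range, no per-step test */
        }
        return 0;
    }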
From: Terje Mathisen on 19 Sep 2005 12:14

Andy Freeman wrote:
> Terje Mathisen wrote:
>> Eliot Miranda wrote:
>>> highest dynamic frequency for integer math is likely to be for
>>> iteration.
>>
>> Right, it is hard to imagine how it could be otherwise.
>
> Most checks on iteration vars can be pulled out of the loop. In most
> of the others, it can be done every <large constant> iterations.

I.e. all of the JIT optimizations possible for a Java implementation
to catch overflows can also be used to convert those variables to
bigints. By reserving the all-zero tag for integers you can even use
regular ADD/SUB operations, and MUL/DIV with just a shift fixup.

Is this how current implementations do it on x86?

Terje

--
- <Terje.Mathisen(a)hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
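[The all-zero tag Terje describes is the classic fixnum trick from
Smalltalk/Lisp runtimes. A sketch with a one-bit tag, where tag 0
means a small integer stored shifted left; the names are invented,
and the overflow checks that would route to the bigint path are
omitted:]

    #include <stdint.h>

    typedef intptr_t oop;   /* tagged machine word */

    #define TAG_BITS  1     /* LSB: 0 = fixnum, 1 = heap pointer */
    #define MK_FIX(n) ((oop)(n) << TAG_BITS)
    #define UNTAG(x)  ((x) >> TAG_BITS)

    /* Because the tag bits are zero, tagged ADD/SUB are the plain
     * machine ADD/SUB: (a<<1) + (b<<1) == (a+b)<<1, tag preserved. */
    static inline oop fix_add(oop a, oop b) { return a + b; }
    static inline oop fix_sub(oop a, oop b) { return a - b; }

    /* MUL needs the shift fixup Terje mentions: (a<<1)*(b<<1) is
     * (a*b)<<2, one tag's worth too many, so untag one operand. */
    static inline oop fix_mul(oop a, oop b) { return UNTAG(a) * b; }

    /* DIV: untag both, divide, retag. */
    static inline oop fix_div(oop a, oop b)
    {
        return MK_FIX(UNTAG(a) / UNTAG(b));
    }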
From: John Mashey on 19 Sep 2005 15:06
Terje Mathisen wrote:
> glen herrmannsfeldt wrote:
> This would be a _very_ special case, since it would require a loop
> with only a single int variable, or at least you must be able to
> prove that only this int can ever overflow. In that case you can of
> course skip the overflow testing.
>
> >> If your architecture can do this, then by all means use an inline
> >> branch-and-link instead of the INTO style check!
> >
> > But how much cost is there in not being able to retire the
> > instruction until the overflow status is known? Assuming the usual
> > out-of-order, but in-order retire model.
>
> Not much: processing can easily continue for at least one or two
> cycles, forwarding the preliminary result to the next user, and as
> long as the vast majority of these ints won't actually overflow, the
> delayed retirement does not correspond to any actual branch misses.
>
> > Then again, maybe all you need is a sticky overflow bit. You could
> > do some set of calculations and test once at the end for any
> > overflow, and clear the sticky overflow bit at that time.
>
> Nice idea. If overflows are really rare, then it would be a win to
> redo the entire loop to save testing on every operation.

People have sometimes used sticky overflow bits. To me, the hierarchy
is:

a) Precise exceptions: you know exactly where the exception occurred.
b) Sticky bits: you don't know exactly where, but you get some bound.
c) No sticky bits: explicit tests are required everywhere, and hence,
in many cases, the tests are omitted for speed and code size.

As a software guy, I am, of course, biased in favor of a). As a
software/hardware guy, I also realize that sometimes one may have to
compromise, but one should never be *looking* for reasons to
compromise on a) - at most, one might reluctantly admit that
introducing some imprecision is the lesser evil, grit one's teeth and
bear it.

At MIPS, I (and the rest of the OS group) were among the loudest
voices in demanding precise exceptions everywhere, most of us having
had to deal with weird cases and hardware bugs and related software
bugs too many times in past lives. In the long run, this did turn out
to help chip verification as well.

To be more specific, we always wished for the following. When a
user-level instruction X caused an exception, and a trap to the OS:

a) The Exception Program Counter (EPC) was set to point at the
instruction.
b) The CAUSE register was set to indicate the reason.
c) All instructions before X had been completed (or equivalent).
d) X itself had had no user-visible side effects, i.e., there were no
partial stores, no register writebacks, no auto-increments, no shadow
registers to be disentangled; X had no effect except to cause the
trap.

We didn't quite get that, as:

a) and b): if X was in a branch delay slot, the EPC pointed at X-4,
and the BD bit was set in the CAUSE register. Although this was
sometimes a pain, and proved to be a source of designer complaint in
some later implementations, software people viewed it as tolerable,
even though it sometimes meant messy debugger code and trickier
emulation code (as in floating point emulation done on systems
lacking FPUs).

c) This wasn't true from the hardware viewpoint, but the CPU provided
this illusion to the programmer, so that was viewed as OK.
Specifically, consider the sequence:

Y: (long-running FP operation, or integer multiply/divide)
....
X: (instruction causing the trap)

At the time of the trap, Y might still be running in an independent
functional unit.
However, this was all carefully defined so that either the instruction
was defined so that it could not ever cause an exception - in MIPS,
the divide doesn't raise a divide-by-zero; instead, if the compiler
sees variable1/variable2, and can't be sure variable2 is non-zero, it
generates code like:

DIV rs,rt
BEQZ rt,divide-by-zero
....

- or, in the floating point cases, the insistence on precise
exceptions ended up encouraging the innovative "check the exponent
fields quickly and stall the CPU if there is any chance of an
exception" patent of Tom Riordan's that I've mentioned before.

In any of these cases, a reference to a register that was to be
written by one of these asynchronous units simply caused a stall.
Interrupt code would save away the regular integer registers first, by
which time any lingering MUL/DIV or FP ops would likely have
completed.

Of course, a "fast-user-trap" feature would have to modify some of the
above, and I still wish we'd had time to think hard enough about that
early enough [2Q85].

Anyway, the message of all this is: start with the goal of keeping the
exception model simple and precise. Only back off reluctantly. I've
posted many a time that dealing with exceptions and their weird
interactions has long been a source of problems, as even experienced
humans miss things. Although there are some things about MIPS that I'd
do differently if I could do it over, THIS wasn't one of them.
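[In C terms, the DIV expansion Mashey shows amounts to the sketch
below; the trap routine is a hypothetical stand-in for the break the
branch reaches. Note that on the real hardware the DIV issues first
and runs in its own unit while the BEQZ tests rt, whereas C must
order the test before the divide:]

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for the divide-by-zero trap target. */
    static void divide_by_zero_trap(void) { abort(); }

    /* What the compiler's expansion of variable1/variable2 amounts
     * to: an explicit test, since the MIPS DIV never traps itself. */
    static int32_t checked_div(int32_t rs, int32_t rt)
    {
        if (rt == 0)            /* the BEQZ rt,divide-by-zero test */
            divide_by_zero_trap();
        return rs / rt;         /* the DIV */
    }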