From: Terje Mathisen "terje.mathisen at on 25 Jun 2010 03:44

Andy 'Krazy' Glew wrote:
> Overflow ::
>
> a >= 0 && b >= 0 ::
>
> a+b >= a ==> no overflow
> a+b < a ==> overflow
>
> a < 0 && b >= 0 ::
> a >= 0 && b < 0 ::
> no overflow
> (although you can have underflow,
> negative overflow, which is handled
> similarly)

No, you cannot get that: Opposite signs are totally safe.

> a < 0 && b < 0
> no overflow
> (although you can have underflow,
> negative overflow, which is handled
> similarly)

Right.

The rule isn't too bad then:

sum = a+b;
possible_overflow = (a ^ b) > 0;                 // I.e. same sign
overflow = possible_overflow && ((sum ^ a) < 0); // Sign of sum has changed!

Alternatively:

sum = a+b;
no_overflow = ((a ^ b) < 0) || ((sum ^ a) > 0);

or:

sum = a+b;
no_overflow = ((a ^ b) | (~sum ^ a)) < 0;

Checking...

a  b  sum  no_overflow
+  +   +   1
+  +   -   0
+  -   +   1
+  -   -   1
-  +   +   1
-  +   -   1
-  -   +   0
-  -   -   1

I.e. that seems to work.

Latency-wise it is better to invert a (or b) instead of the sum:

sum = a+b;
no_overflow = ((a ^ b) | (sum ^ ~a)) < 0;

The next stage is to convert this flag into a properly saturated result:

sum = a+b;
no_overflow = ((a ^ b) | (sum ^ ~a));  // True if negative
mask = no_overflow >> (INT_BITS-1);    // Smear the sign bit
sum = (sum & mask) | ((SAT_MAX ^ (a >> (INT_BITS-1))) & ~mask);

I.e. the result is either the original result, if the mask is all 1 bits,
or a saturated value: Either SAT_MAX if a (and b!) was positive, or
SAT_MIN which is SAT_MAX with all bits inverted.

The problem is the total latency:

; mov eax,[a]
; mov ebx,[b]
  lea esi,[eax+ebx]    // sum
  xor ebx,eax          // b ^= a
  mov edx,eax
  xor eax,-1           // ~a
  xor eax,esi          // ~a ^ sum
  sar edx,31           // a >> 31
  or  eax,ebx          // (a ^ b) | (~a ^ sum)
  sar eax,31           // mask value, -1 or 0
  and esi,eax
  xor eax,-1
  and eax,edx
  xor eax,0x7fffffff
  or  eax,esi

I.e. at least 8 cycles.

OTOH, this is just two cycles more than the branchless CMOVcc version, and
the code above is not too dissimilar from what a compiler would generate.

Terje

PS. It is interesting that I only had to use a single MOV in the code above
to compensate for the fact that x86 is using two-operand instead of
three-operand instructions, and that MOV did not add any extra latency at
all.

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
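For reference, the same branchless saturating add written out as a compilable
C function, assuming 32-bit two's-complement integers; the name sat_add32 and
the unsigned casts are additions here (the post leaves the wrapping addition
implicit), so this is a sketch rather than exactly what the post compiles to:

#include <stdint.h>

/* A sketch of the branchless saturating add above, for 32-bit
   two's-complement integers. The unsigned arithmetic is added here so
   that the wrapping '+' is well defined in C. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;                         /* may wrap */

    /* Sign bit set iff no overflow: (a ^ b) | (sum ^ ~a) */
    uint32_t no_overflow = (ua ^ ub) | (sum ^ ~ua);
    uint32_t mask = 0u - (no_overflow >> 31);       /* all ones iff no overflow */

    /* Saturated value: INT32_MAX if a (and b) >= 0, else INT32_MIN. */
    uint32_t sat = 0x7fffffffu ^ (0u - (ua >> 31));

    return (int32_t)((sum & mask) | (sat & ~mask));
}

With this, sat_add32(INT32_MAX, 1) yields INT32_MAX and
sat_add32(INT32_MIN, -1) yields INT32_MIN; everything else falls through to
the plain sum.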
From: Anton Ertl on 25 Jun 2010 11:14

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
> sum = a+b;
> possible_overflow = (a ^ b) > 0;  // I.e. same sign

That must be (a^b)>=0, or you get false for a=b.

> sum = a+b;
> no_overflow = ((a ^ b) | (~sum ^ a)) < 0;

Faster on some machines, more elegant, and probably easier to explain:

no_overflow = ((a^sum)&(b^sum))>=0

But of course, all of this is outside the standardized subset of C.

- anton
--
M. Anton Ertl                     Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at  Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
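Anton's test can be kept inside defined behaviour by doing the addition in
unsigned arithmetic; a minimal sketch, assuming two's complement, with an
illustrative function name:

#include <stdint.h>

/* A sketch of the ((a^sum)&(b^sum)) >= 0 test, with the addition done in
   unsigned arithmetic so the wrap-around itself is well defined
   (signed overflow is undefined behaviour in standard C). */
static int add_overflows32(int32_t a, int32_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;   /* wraps, no UB */
    /* Overflow iff sum's sign differs from the sign of both a and b. */
    return (int32_t)(((uint32_t)a ^ sum) & ((uint32_t)b ^ sum)) < 0;
}

On GCC or Clang, __builtin_add_overflow(a, b, &sum) does the same job, though
as a compiler extension rather than standard C.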
From: Jeremy Linton on 25 Jun 2010 13:54

On 6/22/2010 5:06 AM, Terje Mathisen wrote:
> Watcom allowed you to define pretty much any operation yourself, in the
> form of inline macro operations where you specified to the compiler
> where the inputs needed to be, where the result would end up and exactly
> which registers/parts of memory would be modified.
>
> This basically meant that you could extend the built-in range of basic
> operations while still getting pretty well-optimized code.

How is that different from GCC's extended inline assembly
(input/output/clobber lists)? The Linux kernel uses macros all over the
place to inline system register moves, or similar behaviors on assorted
architectures. Looking at the resultant code, gcc seems to do a very good
job of register scheduling even through blocks of inline assembly.
From: Terje Mathisen "terje.mathisen at on 25 Jun 2010 15:51

Jeremy Linton wrote:
> On 6/22/2010 5:06 AM, Terje Mathisen wrote:
>
>> Watcom allowed you to define pretty much any operation yourself, in the
>> form of inline macro operations where you specified to the compiler
>> where the inputs needed to be, where the result would end up and exactly
>> which registers/parts of memory would be modified.
>>
>> This basically meant that you could extend the built-in range of basic
>> operations while still getting pretty well-optimized code.
>
> How is that different from GCC's extended inline assembly
> (input/output/clobber lists)? The Linux kernel uses macros all over

It isn't. Watcom had (imho) a somewhat more elegant syntax to describe this
stuff, but the end result is pretty close to identical.

> the place to inline system register moves, or similar behaviors on
> assorted architectures. Looking at the resultant code, gcc seems to do a
> very good job of register scheduling even through blocks of inline
> assembly.

:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
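As a concrete illustration of the input/output/clobber style Jeremy
describes, here is a minimal sketch that ties back to the overflow topic of
the thread; the function name and the exact constraint choices are
illustrative, not taken from either post:

/* A sketch of GCC extended inline asm, in the spirit of the kernel macros
   mentioned above: add two 32-bit ints on x86 and report signed overflow
   via the OF flag. */
static inline int add_would_overflow(int a, int b, int *sum)
{
    unsigned char of;
    __asm__("addl %3, %0\n\t"
            "seto %1"
            : "=r"(*sum), "=q"(of)   /* outputs: sum register, overflow byte */
            : "0"(a), "r"(b)         /* inputs: a tied to output 0, b in a reg */
            : "cc");                 /* the addition clobbers the flags */
    return of;
}

Called as add_would_overflow(a, b, &s), the compiler allocates registers
around the block and schedules the surrounding code exactly as Jeremy
observed.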
From: MitchAlsup on 25 Jun 2010 19:18
On Jun 24, 4:04 pm, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> In your dreams.

Reminds me of the old idiom: "Never bet against your branch predictor".

Mitch