From: Andy 'Krazy' Glew on 24 Jun 2010 09:03 On 6/23/2010 11:18 PM, EricP wrote: > Terje Mathisen wrote: >> Thomas Womack wrote: >>> In >>> article<a6d4ec20-9052-4003-a3c7-486885d791a4(a)q12g2000yqj.googlegroups.com>, >>> >>> MitchAlsup<MitchAlsup(a)aol.com> wrote: >>>> # define sat_add(a,b) (((tmp =3D (a)+(b)), (tmp> SAT_MAX ? SAT_MAX: >>>> (tmp< SAT_MIN ? SAT_MIN : tmp))) >>> >>> And what type is 'tmp'? >> >> Any signed type with at least one more bit of precision than a or b? >> :-) >> >> Terje >> > > Ok, a signed (2's complement) overflow detect is: > sum = a + b; > overflow = ((a^sum)&(b^sum)) < 0; > > so.... > > #define sat_add(a,b) ((sum=(a)+(b)), (((a)^sum)&((b)^sum)) < 0 ? \ > (sum < 0 ? SAT_MAX : SAT_MIN) : sum) It will be fun to collect a list of idioms for overflow detection in not-really-portable C assuming 2's complement signed and/or normal binary unsigned.
From: Andy 'Krazy' Glew on 24 Jun 2010 09:37 On 6/23/2010 11:18 PM, EricP wrote: > Ok, a signed (2's complement) overflow detect is: > sum = a + b; > overflow = ((a^sum)&(b^sum)) < 0; > > so.... > > #define sat_add(a,b) ((sum=(a)+(b)), (((a)^sum)&((b)^sum)) < 0 ? \ > (sum < 0 ? SAT_MAX : SAT_MIN) : sum) I'm going to start collecting these at https://semipublic.comp-arch.net/wiki/overflow_detection_idioms that page is writeable only by me, but anyone can write to its shadow page http://wiki.public.comp-arch.net/wiki/Overflow_detection_idioms if you want to add. Although I suggest just posting to comp.arch; I'll collect.
From: Tim McCaffrey on 24 Jun 2010 10:20 In article <cb699368-3591-4f59-a35e-df5878d5bce8(a)z31g2000vbk.googlegroups.com>, MitchAlsup(a)aol.com says... > >On Jun 23, 10:42=A0am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> >wrote: >> On 6/22/2010 11:12 AM, Tim McCaffrey wrote: >> >> > 2) Easy to decode: =A0reduces gate count, which reduces power consumpti= >on, and >> > potentially removes a pipeline stage (maybe). =A0AFAICT, every x86 has = >a >> > limitation of only being able to decode/issue one instruction if it has= >n't >> > been executed before. =A0It appears all x86 implementations use the I-c= >ache to >> > mark instruction boundaries for parallel decoding on the following pass= >es. ><snip> >> AMD has long had this limit. > >No, not quite. When the Athlon/Opteron processors fetch an instruction >that has no marker bits, it decodes 4 bytes per cycle. There can be >0,1,2,3, or 4 instructions, and the decode pipeline is capable of >doing 0,1,2,3 from there. A majority of the time, the choice is from >the set {0,1} due to boundary spanning. > So, ISA does matter here. If the ISA was easier to decode then more instructions could be decoded on the first pass. - Tim
From: MitchAlsup on 24 Jun 2010 15:49 On Jun 24, 9:20 am, timcaff...(a)aol.com (Tim McCaffrey) wrote: > In article > <cb699368-3591-4f59-a35e-df5878d5b...(a)z31g2000vbk.googlegroups.com>, > MitchAl...(a)aol.com says... > > > > > > > > >On Jun 23, 10:42=A0am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> > >wrote: > >> On 6/22/2010 11:12 AM, Tim McCaffrey wrote: > > >> > 2) Easy to decode: =A0reduces gate count, which reduces power consumpti= > >on, and > >> > potentially removes a pipeline stage (maybe). =A0AFAICT, every x86 has = > >a > >> > limitation of only being able to decode/issue one instruction if it has= > >n't > >> > been executed before. =A0It appears all x86 implementations use the I-c= > >ache to > >> > mark instruction boundaries for parallel decoding on the following pass= > >es. > ><snip> > >> AMD has long had this limit. > > >No, not quite. When the Athlon/Opteron processors fetch an instruction > >that has no marker bits, it decodes 4 bytes per cycle. There can be > >0,1,2,3, or 4 instructions, and the decode pipeline is capable of > >doing 0,1,2,3 from there. A majority of the time, the choice is from > >the set {0,1} due to boundary spanning. > > So, ISA does matter here. If the ISA was easier to decode then more > instructions could be decoded on the first pass. The loss in performace of doing it this way, as oppposed to figuring out how to build a 3-wide unmarked decoder is down in the noise (close to but slightly under 1% for Opteron.) But the point I want to make here, is tha the word 'decode' is being used improperly. A better word would be 'parse' as in the instructions have been 'parsed' out of the byte stream. The instructions have not been decoded--which in the RICS universe meant registers have been read and reservation station entries written. This is still a couple cycles down the pipe for x86. These are simply pipe stages with a cost just above the noise floor (under 2%), the actual hard part of x86 instruction processing is determining where the first instruction-byte of the NEXT parse cycle starts. The rest of it is "not-so-bad". Yes, there is a lot of cruft in x86. However, there are more efficient automotive engines (sterling) and more powerful automotive engines (turbine) but nothing today results in the utility per unit cost of the internal compustion engine. And so it is with x86 (for better or for worse,.) ---------------------------- But back on topic: Can anyone show me a piece of code that depends on integer overflow to create a buffer overrun? (Note: not integer underflow) Mitch
From: Terje Mathisen "terje.mathisen at on 24 Jun 2010 17:04
EricP wrote: > Terje Mathisen wrote: >> Thomas Womack wrote: >>> In >>> article<a6d4ec20-9052-4003-a3c7-486885d791a4(a)q12g2000yqj.googlegroups.com>, >>> >>> MitchAlsup<MitchAlsup(a)aol.com> wrote: >>>> # define sat_add(a,b) (((tmp =3D (a)+(b)), (tmp> SAT_MAX ? SAT_MAX: >>>> (tmp< SAT_MIN ? SAT_MIN : tmp))) >>> >>> And what type is 'tmp'? >> >> Any signed type with at least one more bit of precision than a or b? >> :-) >> >> Terje >> > > Ok, a signed (2's complement) overflow detect is: > sum = a + b; > overflow = ((a^sum)&(b^sum)) < 0; > > so.... > > #define sat_add(a,b) ((sum=(a)+(b)), (((a)^sum)&((b)^sum)) < 0 ? \ > (sum < 0 ? SAT_MAX : SAT_MIN) : sum) > > Good compiler would use CMOVs so no branches. In your dreams. :-( I believe this is the logic you want to define, right? mov eax,[a] add eax,[b] jno done ; If not Overflow then we're OK mov eax,SAT_MIN ; Assume negative underflow jns done ; SAT_MIN unless the sign flag got set mov eax,SAT_MAX ; SAT_MAX otherwise done: That macro above will, for most codes run in about 2 cycles since the first branch will be correctly predicted as taken. Using an INTO handler avoids the JNO, making the macro shorter, but it also increases the cost of handling overflows very significantly. :-( The problem is when you have an array which is about to become ill-conditioned and you get unpredictable overflows on about half of all iterations: In that case we're talking 15-25 cycles or an order of magnitude slower, due to mispredicted branches. Using CMOVs instead could help: mov eax,[a] mov ebx,SAT_MIN add eax,[b] mov edx,SAT_MAX cmovs ebx,edx ;; Select the proper kind of saturation cmovo eax,ebx ;; Replace the result with saturated value This code runs in a totally predictable 6 (!) cycles, which means that it is only when you have at least 30+% mispredicted branches that this will be faster than the first macro. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching" |