From: Rod Pemberton on 17 Oct 2007 03:03

"Wolfgang Kern" <nowhere(a)never.at> wrote in message
news:ff358o$o9$1(a)newsreader2.utanet.at...

> You may wonder how often CMOVcc and SETcc occur in programs targeted
> to +486 CPUs.
> As these two can save on many branch instructions, I started to rewrite
> all my older code and gained ~20% speed on average without increasing
> its size.

I'm curious as to how you do this with SETcc. My understanding is that
SETcc:

1) is non-pairable
2) only operates on 8-bit operands
3) requires an 'xor reg,reg' or 'sub reg,reg' on the full register prior
   to the instruction to prevent a partial register stall
4) requires a 'movzx' or 'movsx' to the full register after the
   instruction to prevent a partial register stall

From what I can tell, for unsigned results it's faster to use full-register
combinations of sbb, cmp, and xor due to:

1) pairing
2) partial pairing
3) the slow movzx/movsx otherwise required

That's what I got from the Intel optimization manuals, anyway...
What'd I miss?

Rod Pemberton
From: Wolfgang Kern on 17 Oct 2007 07:45

Rod Pemberton wrote:

>> You may wonder how often CMOVcc and SETcc occur in programs targeted
>> to +486 CPUs.
>> As these two can save on many branch instructions, I started to rewrite
>> all my older code and gained ~20% speed on average without increasing
>> its size.

> I'm curious as to how you do this with SETcc. My understanding is that
> SETcc:
> 1) is non-pairable
> 2) only operates on 8-bit operands
> 3) requires an 'xor reg,reg' or 'sub reg,reg' on the full register prior
>    to the instruction to prevent a partial register stall
> 4) requires a 'movzx' or 'movsx' to the full register after the
>    instruction to prevent a partial register stall

The idea is not (though that's also possible) to work out values with
SETcc; I mainly use it as temporary storage for condition status, instead
of PUSHF/POPF and/or in addition to LAHF/SAHF.

I usually try to work in registers, and we've got eight GP byte regs, but
when registers become rare I may even use:

________(this isn't actual code)________
PUSH +0                      ;creates four 'local' bytes
...
SETcc [esp+x]                ;x can be 0..3 yet
...
CMP byte[esp+x],0
SETcc ..
CMOVnz ..
;also working:
TEST dword[esp],0x01010001   ;to check several states at once
CMOVz ..
;sometimes helpful for table offsets:
MOV ecx,0x00480080
OR eax,eax
SETz CH                      ;adjust ecx to 480180h or 480080h
CMOVs eax,[ecx]
...
LEA esp,[esp+4]              ;instead of ADD ESP,4, to keep flags alive
________________

> From what I can tell, for unsigned results it's faster to use
> full-register combinations of sbb, cmp, and xor due to:
> 1) pairing
> 2) partial pairing
> 3) the slow movzx/movsx otherwise required

Yes, except for 3): MOVSX/MOVZX and shifts aren't slow at all on AMDs.

> That's what I got from the Intel optimization manuals, anyway...
> What'd I miss?

The penalties for partial register stalls and unaligned byte access are
easily gained back by saving on branch instructions and code size.
Sure, avoiding SETcc may improve speed, but at the cost of code size and
the need for more registers/locals, which in turn will increase timing.

__
wolfgang