From: Terje Mathisen <terje.mathisen at tmsw.no> on 7 Aug 2010 09:25

Andy Glew wrote:
> On 8/5/2010 7:20 AM, Skybuck Flying wrote:
>> Suppose RLE compression is used, tried it myself.
>>
>> That means a branch for each color.
>
> No it doesn't.
>
> Just think about it. On comp.arch back a few years ago, this would have
> been a newbie question.

:-)

> Think branchless code. Heck, I could code it up branchlessly in C.
>
> Not sure if branchless is a win over a machine with branch prediction,
> where correctly predicted branches are free. But if you say you have
> branch mispredictions, try the branchless version.

One RLE approach is to always have an initial count:

If it is 0-127 it indicates 3 to 130 repeats of the next byte.

If it is 128-255 (or -128 to -1) the next -128+129=1 to -1+129=128 bytes
are immediate values.

Using this approach the absolute worst case (i.e. incompressible) byte
stream will cause an expansion by 129/128 in order to incorporate the
length bytes. As soon as you get values repeated 3 times or more the
compression ratio increases, up to a best possible factor of 2/130 for a
long block of identical bytes.

Decoding such a byte stream branchlessly can be done, but not very
easily, and it might not be very efficient either:

  movsx ebx, byte ptr [esi] ; Count
  mov al,[esi+1]        ; Next byte, possible STOSB data
  inc esi
  lea ecx,[ebx+129]     ; MOVSB count
  lea edx,[ebx+3]       ; STOSB count
  sar ebx,31            ; -1 if MOVSB, 0 if STOSB
  and ecx,ebx           ; Either MOVSB count or 0
  xor ebx,-1            ; NOT EBX, invert flag mask
  rep movsb             ; Copy 0 or (Count+129) bytes
  and edx,ebx
  mov ecx,edx
  rep stosb             ; Store 0 or (Count+3) copies
  ;; If the operation was a STOSB, then we have to skip the byte we
  ;; loaded into AL, in order to point at the next Count byte:
  sub esi,ebx           ; Increment if STOSB!

It looks like this piece of code can run in a minimum of 5 cycles, plus
whatever extra latency is caused by the MOVSB and STOSB opcodes.
It is quite possible that a naive version with a branch on every Count
byte will be faster, simply because the branch pattern will be very
predictable: unless you get very long identical blocks or unique (i.e.
copy) blocks, you will simply alternate between copy and store blocks of
data, and the branch predictor will work close to perfectly:

next:
  movsx ecx, byte ptr [esi]
  add esi,1
  add ecx,129     ; Assume COPY/MOVSB block
  jnc StoreBlock
  rep movsb
  cmp esi,ebx     ; End of input buffer
  jb next
  jmp done
StoreBlock:
  lodsb
  sub ecx, 129-3  ; Repeat count
  rep stosb
  cmp esi,ebx     ; End of input buffer
  jb next
done:

I.e. 4-5 cycles when correctly predicted, plus the time taken by either
REP MOVSB or REP STOSB.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Owen Shepherd on 7 Aug 2010 15:35

Terje wrote:
> It looks like this piece of code can run in a minimum of 5 cycles, plus
> whatever extra latency is caused by the MOVSB and STOSB opcodes.

I can't say for other processors, but AMD quote minimum figures of 5+n
cycles for rep movsb/stosb on the K8 and K10 (Intel just don't quote
figures, which is somewhat less helpful...). Considering that jumps tend
to be "direct path" (or equivalent) instructions, and so can be decoded
in parallel with more useful operations, I would expect the branching
version to be better.

If you wanted to optimize it, a thought of mine is that you would be
best off unrolling the repeat-length check into the end of each stride,
making the branch predictions more consistent (this would mean that
extended runs of a single block type should have a harder time upsetting
prediction for the following normal ones).
From: Skybuck Flying on 8 Aug 2010 01:21

"Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message
news:ne30j7-q872.ln1(a)ntp.tmsw.no...
> Andy Glew wrote:
>> On 8/5/2010 7:20 AM, Skybuck Flying wrote:
>>> Suppose RLE compression is used, tried it myself.
>>>
>>> That means a branch for each color.
>>
>> No it doesn't.
>>
>> Just think about it. On comp.arch back a few years ago, this would have
>> been a newbie question.
>
> :-)
>
>> Think branchless code. Heck, I could code it up branchlessly in C.
>>
>> Not sure if branchless is a win over a machine with branch prediction,
>> where correctly predicted branches are free. But if you say you have
>> branch mispredictions, try the branchless version.
>
> One RLE approach is to always have an initial count:
>
> If it is 0-127 it indicates 3 to 130 repeats of the next byte.
>
> If it is 128-255 (or -128 to -1) the next -128+129=1 to -1+129=128 bytes
> are immediate values.

Seems quite shitty to me... I sure as hell won't try it out... but you're
welcome to try it out yourself ;) :)

Bye,
Skybuck.
From: Terje Mathisen <terje.mathisen at tmsw.no> on 8 Aug 2010 07:22

Skybuck Flying wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"> wrote in message
>> One RLE approach is to always have an initial count:
>>
>> If it is 0-127 it indicates 3 to 130 repeats of the next byte.
>>
>> If it is 128-255 (or -128 to -1) the next -128+129=1 to -1+129=128 bytes
>> are immediate values.
>
> Seems quite shitty to me... I sure as hell won't try it out... but you're
> welcome to try it out yourself ;) :)

Wow!

I've finally been noticed by the Great Buck in the Sky.

This must be my lucky day. :-)

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Rudy Velthuis on 8 Aug 2010 09:29

Terje Mathisen wrote:
> I've finally been noticed by the Great Buck in the Sky.

"biddi-biddi-biddi" <g>

-- 
Rudy Velthuis    http://rvelthuis.de

"The bureaucracy is expanding to meet the needs of an expanding
bureaucracy." -- Unknown