From: Terje Mathisen <terje.mathisen at tmsw.no> on 7 Aug 2010 09:25

Andy Glew wrote:
> On 8/5/2010 7:20 AM, Skybuck Flying wrote:
>> Suppose RLE compression is used, tried it myself.
>>
>> That means a branch for each color.
>
> No it doesn't.
>
> Just think about it. On comp.arch back a few years ago, this would have
> been a newbie question.

:-)

> Think branchless code. Heck, I could code it up branchlessly in C.
>
> Not sure if branchless is a win over a machine with branch prediction,
> where correctly predicted branches are free. But if you say you have
> branch mispredictions, try the branchless version.

One RLE approach is to always have an initial count:

If it is 0-127 it indicates 3 to 130 repeats of the next byte.

If it is 128-255 (or -128 to -1) the next -128+129=1 to -1+129=128 bytes
are immediate values.

Using this approach the absolute worst case (i.e. incompressible) byte
stream will cause an expansion by 129/128 in order to incorporate the
length bytes. As soon as you get values repeated 3 times or more the
compression ratio increases, up to a best possible factor of 2/130 for a
long block of identical bytes.

Decoding such a byte stream branchlessly can be done, but not very
easily, and it might not be very efficient either:

  movsx ebx, byte ptr [esi] ; Count
  mov al,[esi+1]        ; Next byte, possible STOSB data
  inc esi
  lea ecx,[ebx+129]     ; MOVSB count
  lea edx,[ebx+3]       ; STOSB count
  sar ebx,31            ; -1 if MOVSB, 0 if STOSB
  and ecx,ebx           ; Either MOVSB count or 0
  xor ebx,-1            ; NOT EBX, invert flag mask
  rep movsb             ; Copy 0 or (Count+129) bytes
  and edx,ebx
  mov ecx,edx
  rep stosb             ; Store 0 or (Count+3) copies
  ;; If the operation was a STOSB, then we have to skip the byte we
  ;; loaded into AL, in order to point at the next Count byte:
  sub esi,ebx           ; Increment if STOSB!

It looks like this piece of code can run in a minimum of 5 cycles, plus
whatever extra latency is caused by the MOVSB and STOSB opcodes.
It is quite possible that a naive version with a branch on every Count
byte will be faster, simply because the branch pattern will be very
predictable: unless you get very long identical blocks or unique (i.e.
copy) blocks, you will simply alternate between copy and store blocks of
data, and the branch predictor will work close to perfectly:

next:
  movsx ecx, byte ptr [esi]
  add esi,1
  add ecx,129     ; Assume COPY/MOVSB block
  jnc StoreBlock
  rep movsb
  cmp esi,ebx     ; End of input buffer
  jb next
  jmp done
StoreBlock:
  lodsb
  sub ecx, 129-3  ; Repeat count
  rep stosb
  cmp esi,ebx     ; End of input buffer
  jb next
done:

I.e. 4-5 cycles when correctly predicted, plus the time taken by either
REP MOVSB or REP STOSB.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Owen Shepherd on 7 Aug 2010 15:35

Terje wrote:
> It looks like this piece of code can run in a minimum of 5 cycles, plus
> whatever extra latency is caused by the MOVSB and STOSB opcodes.

I can't say for other processors, but AMD quote minimum figures of 5+n
cycles for rep movsb/stosb on the K8 and K10 (Intel just don't quote
figures, which is somewhat less helpful...). Considering that jumps tend
to be "direct path" (or equivalent) instructions, and so can be decoded
in parallel with more useful operations, I would expect the branching
version to be better.

If you wanted to optimize it, a thought of mine is that you would be
best off unrolling the repeat-length check into the end of each stride,
making the branch predictions more consistent (this would mean that
extended runs of a single block type should have a harder time upsetting
prediction for the following normal ones).
From: Skybuck Flying on 8 Aug 2010 01:21

"Terje Mathisen" <"terje.mathisen at tmsw.no"> wrote in message
news:ne30j7-q872.ln1(a)ntp.tmsw.no...
> Andy Glew wrote:
>> On 8/5/2010 7:20 AM, Skybuck Flying wrote:
>>> Suppose RLE compression is used, tried it myself.
>>>
>>> That means a branch for each color.
>>
>> No it doesn't.
>>
>> Just think about it. On comp.arch back a few years ago, this would have
>> been a newbie question.
>
> :-)
>
>> Think branchless code. Heck, I could code it up branchlessly in C.
>>
>> Not sure if branchless is a win over a machine with branch prediction,
>> where correctly predicted branches are free. But if you say you have
>> branch mispredictions, try the branchless version.
>
> One RLE approach is to always have an initial count:
>
> If it is 0-127 it indicates 3 to 130 repeats of the next byte.
>
> If it is 128-255 (or -128 to -1) the next -128+129=1 to -1+129=128 bytes
> are immediate values.

Seems quite shitty to me... I sure as hell won't try it out... but you're
welcome to try it out yourself ;) :)

Bye,
Skybuck.
From: Terje Mathisen <terje.mathisen at tmsw.no> on 8 Aug 2010 07:22

Skybuck Flying wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"> wrote in message
>> One RLE approach is to always have an initial count:
>>
>> If it is 0-127 it indicates 3 to 130 repeats of the next byte.
>>
>> If it is 128-255 (or -128 to -1) the next -128+129=1 to -1+129=128 bytes
>> are immediate values.
>
> Seems quite shitty to me... I sure as hell won't try it out... but you're
> welcome to try it out yourself ;) :)

Wow!

I've finally been noticed by the Great Buck in the Sky.

This must be my lucky day. :-)

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Rudy Velthuis on 8 Aug 2010 09:29

Terje Mathisen wrote:
> I've finally been noticed by the Great Buck in the Sky.

"biddi-biddi-biddi" <g>

-- 
Rudy Velthuis    http://rvelthuis.de

"The bureaucracy is expanding to meet the needs of an expanding
bureaucracy." -- Unknown