From: nmm1 on 14 Apr 2010 03:08

In article <ggtgp-FCC64C.01085514042010(a)news.isp.giganews.com>,
Brett Davis <ggtgp(a)yahoo.com> wrote:
>
>> > The most important number is cache line size, which you missed.
>> > If your image is 1,024 lines tall, that will completely thrash
>> > the cache, resulting in 3 bytes copied per cache line load/spill.
>> >
>> > If you do the following you can get a 2x speedup, it looks like
>> > more code, but will generate less, and the results will be
>> > pipelined correctly.
>> > Extra bonus points to those that understand why. Half the posters here?
>> >
>> >      {
>> >        unsigned char *C = B+(H*j+H-i-1)*3;
>> >        temp0 = A[0];
>> >        temp1 = A[1];
>> >        temp2 = A[2];
>> >        C[0] = temp0;
>> >        C[1] = temp1;
>> >        C[2] = temp2;
>> >        A += 3;
>> >      }
>> >
>> > Do not use *C++ = *A++;
>> >
>>
>> Programming hotshots have done so much damage.
>
>Damage?
>That is clean code that is easy to read and understand.
>
>> And they brag about it.
>
>Only one in a hundred programmers knows an optimization like that; for
>half of comp.arch to be that good says good things about comp.arch.

Well, yes and no. It is much cleaner to write the code in the form the
algorithm uses, and have the compiler optimise it, but you can't do
that if you insist on writing in C. Fortran rules in that respect,
though even Fortran is nowhere near as optimisable as it could be.

35 years ago, it was standard practice to unroll loops by hand, whether
in Algol, Fortran or assembler, but it never was more than a necessity,
because the compilers' optimisation was generally poor (though there
were exceptions). The reason that it is needed in C is that C is
almost unoptimisable, by design.

Regards,
Nick Maclaren.
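One concrete reason C resists this kind of optimisation is pointer aliasing: unless told otherwise, the compiler must assume the source and destination may overlap, which forces it to interleave each load with the following store. A minimal sketch (function names are illustrative, not from the thread) of how the C99 `restrict` qualifier hands that guarantee back to the compiler:

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src may alias,
   so each store can invalidate the next load: loads and stores get
   serialised in source order. */
void copy_plain(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* With restrict, the pointers are promised not to overlap, so the
   compiler is free to batch loads ahead of stores, unroll, or
   vectorise -- the transformations Brett did by hand above. */
void copy_restrict(unsigned char *restrict dst,
                   const unsigned char *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both functions are functionally identical; the difference is only in what the optimiser is allowed to prove.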
From: Terje Mathisen "terje.mathisen at on 14 Apr 2010 04:32

Brett Davis wrote:
> In article <na2e97-a342.ln1(a)ntp.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>> Noob wrote:
>>>
>>> I'm starting to think that the array is allocated in a non-cached region.
>>
>> Ouch!
>
> That is worse than the case where you only use 3 bytes for every cache
> line fill/spill.
>
>> That sounds like a frame buffer...
>>
>> Can you direct these temporary image buffers to be in cacheable ram
>> instead?
>>
>> Terje
>
> If not, you want to load four pixels at a time using three long reads,
> and shift out the bytes to write. Three slow reads instead of 12.

That was in my suggested algorithm: reading 4x4 blocks of pixels from
the source buffer, rewrapping them (24->32->24 bits) inside registers,
and writing back to the target using full cache line stores if
possible.

However, if he can move stuff out of non-cacheable framebuffer space,
that will help a _lot_ more. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
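The "three long reads for four pixels" rewrapping Brett and Terje describe can be sketched roughly as below. This is a little-endian sketch with a hypothetical helper name; on the real SH-4 target the byte order, alignment handling, and store side (repacking into 32-bit writes) would all need to match the hardware:

```c
#include <stdint.h>
#include <string.h>

/* Fetch four 3-byte RGB pixels with three 32-bit loads instead of
   twelve byte loads, then peel the twelve bytes back out of the
   registers by shifting.  Little-endian byte order is assumed. */
static void unpack4(const uint8_t *src, uint8_t px[4][3])
{
    uint32_t w0, w1, w2;
    memcpy(&w0, src + 0, 4);   /* holds bytes b0 b1 b2 | b3  */
    memcpy(&w1, src + 4, 4);   /* holds bytes b4 b5 | b6 b7  */
    memcpy(&w2, src + 8, 4);   /* holds bytes b8 | b9 b10 b11 */

    px[0][0] = w0;       px[0][1] = w0 >> 8;  px[0][2] = w0 >> 16;
    px[1][0] = w0 >> 24; px[1][1] = w1;       px[1][2] = w1 >> 8;
    px[2][0] = w1 >> 16; px[2][1] = w1 >> 24; px[2][2] = w2;
    px[3][0] = w2 >> 8;  px[3][1] = w2 >> 16; px[3][2] = w2 >> 24;
}
```

The `memcpy` loads compile to single 32-bit reads on targets that allow them, while staying legal C even when `src` is not 4-byte aligned.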
From: Noob on 14 Apr 2010 04:56

Hello Brett,

Brett Davis wrote:
> I thought I was generous giving away top secrets that most everyone
> else hoards. I do have quite a bit more that will get another 50%,
> but he lacks the background to understand such [...]

I do appreciate the advice you've given. For the record, I've worked
several years on LNO-related topics, thus I find your characterization
rather misdirected ;-)

Regards.
From: EricP on 14 Apr 2010 12:29

Brett Davis wrote:
>>> The most important number is cache line size, which you missed.
>>> If your image is 1,024 lines tall, that will completely thrash
>>> the cache, resulting in 3 bytes copied per cache line load/spill.
>>>
>>> If you do the following you can get a 2x speedup, it looks like
>>> more code, but will generate less, and the results will be
>>> pipelined correctly.
>>> Extra bonus points to those that understand why. Half the posters here?
>>>
>>>      {
>>>        unsigned char *C = B+(H*j+H-i-1)*3;
>>>        temp0 = A[0];
>>>        temp1 = A[1];
>>>        temp2 = A[2];
>>>        C[0] = temp0;
>>>        C[1] = temp1;
>>>        C[2] = temp2;
>>>        A += 3;
>>>      }
>>>
>>> Do not use *C++ = *A++;
>>>
>> Programming hotshots have done so much damage.
>
> Damage?
> That is clean code that is easy to read and understand.
>
>> And they brag about it.
>
> Only one in a hundred programmers knows an optimization like that; for
> half of comp.arch to be that good says good things about comp.arch.

Ok, well, I'm going to take the risk of seeming a total doofus on a
global scale, but I don't see what you are getting at.

A[x] and C[y] are each referenced only once, so there is no reason for
the compiler to enregister their values, and all other variables are
locals or pass-by-value, so there is no reason for aliasing to occur.

The SH-4 has byte memory access instructions, so this is just the
difference between LD ST LD ST LD ST and LD LD LD ST ST ST. The latter
requires 2 extra temp regs, which on x86 causes a var spill into the
stack, but is probably OK on SH-4.

So I don't see the 2x speedup here.

Eric
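For reference, the two instruction orderings Eric names correspond to these two C formulations. They are functionally identical; the disputed 2x claim is Brett's and concerns only how the loads and stores schedule in the pipeline, which a source-level sketch cannot demonstrate (function names are illustrative):

```c
/* LD ST LD ST LD ST: each store depends on the load just before it,
   so on an in-order machine the pairs serialise. */
void copy3_interleaved(unsigned char *c, const unsigned char *a)
{
    *c++ = *a++;
    *c++ = *a++;
    *c++ = *a++;
}

/* LD LD LD ST ST ST: all three loads issue before any store, so the
   load latencies can overlap, at the cost of two extra temporaries. */
void copy3_grouped(unsigned char *c, const unsigned char *a)
{
    unsigned char t0 = a[0], t1 = a[1], t2 = a[2];
    c[0] = t0;
    c[1] = t1;
    c[2] = t2;
}
```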
From: Casper H.S. Dik on 14 Apr 2010 13:57
EricP <ThatWouldBeTelling(a)thevillage.com> writes:

>A[x] and C[y] are referenced only once so there is no reason
>for the compiler to enregister their values, and all other
>variables are locals or pass-by-value, and therefore
>no reason for aliasing to occur.

A proper compiler will generate optimal code for:

	for (i = 0; i < len; i++)
	    C[i] = A[i];

or similar code, and will generate about the same code for:

	pC = C;
	pA = A;
	while (pC < &C[len])
	    *pC++ = *pA++;

If you can generate faster code by unrolling loops in C then you
should get a better compiler.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
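Casper's two fragments, written out as complete functions (with types and the loop bound comparing pointers rather than pointed-to values), can be checked for equivalence directly:

```c
#include <stddef.h>

/* Indexed form: the natural expression of the copy. */
void copy_indexed(unsigned char *C, const unsigned char *A, size_t len)
{
    for (size_t i = 0; i < len; i++)
        C[i] = A[i];
}

/* Pointer form: the same loop with running pointers; a decent
   compiler is expected to emit about the same code for both. */
void copy_pointer(unsigned char *C, const unsigned char *A, size_t len)
{
    unsigned char *pC = C;
    const unsigned char *pA = A;
    while (pC < &C[len])
        *pC++ = *pA++;
}
```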