From: Brett Davis on
In article
<17e7e045-9599-4000-a41c-8885ccfc95d0(a)g30g2000yqc.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:

> On Apr 11, 10:24�pm, Brett Davis <gg...(a)yahoo.com> wrote:
> > If you do the following you can get a 2x speedup, it looks like
> > more code, but will generate less, and the results will be
> > pipelined correctly.
> > Extra bonus points to those that understand why.
>
> You have explicitly gotten rid of the aliasing issues so the compiler
> can avoid having to assume aliasing conflicts between C[0] and
> A[1],...
>
> Mitch

Excellent start, there are at least two or three other reasons, all
of which effect performance, unlike the aliasing issue that will
only rear its ugly head when an actual alias violation occurs.

Brett
From: Terje Mathisen "terje.mathisen at on
EricP wrote:
> These two could be combined so it would rotate a 4*4 block of 3 byte
> elements in a vertical stripe and writing 4 sequential rows at once.
> I was thinking it would require too many registers to rotate
> the block but I think it could be done with just 6 if you
> spill each pending row value as soon as it is ready:
> 4 to hold the pending write value for each row,
> one for the just loaded value, one for shifting, masking, merging.
> Other regs for holding dstPtr, dstTmpPtr, srcPtr, srcTmpPtr, height, etc,
> and it looks like it would pretty much fit in 16 registers,
> and only load each cache line once.
>
> That's probably as good as your going to get.

This is effectively what I'm suggesting, i.e. do the rotate while
copying, trying to minimize the number of load/store operations.

Working in 4x4 blocks is a good compromise.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: EricP on
Brett Davis wrote:
>>>>> If you do the following you can get a 2x speedup, it looks like
>>>>> more code, but will generate less, and the results will be
>>>>> pipelined correctly.
>>>>> Extra bonus points to those that understand why. Half the posters here?
>>>>>
>>>>> {
>>>>> unsigned char *C = B+(H*j+H-i-1)*3;
>>>>> temp0 = A[0];
>>>>> temp1 = A[1];
>>>>> temp2 = A[2];
>>>>> C[0] = temp0;
>>>>> C[1] = temp1;
>>>>> C[2] = temp2;
>>>>> A += 3;
>>>>> }
>>>>>
>>>>> Do not use *C++ = *A++;
>>> Only one in a hundred programers know an optimizaton like that, for
>>> half of comp.arch to be that good says good things about comp.arch.
>
> Benchmark it, try and stay in cache as a 66 MHz embedded CPU does
> not have memory 600 cycles away.
> Only an Intel class chip has enough OoO stuff to have a chance
> of coming close to the same speed as my code.
>
> Brett

Oh, you are avoiding read after write data
dependency pipeline stalls on in-order cpus.

The buffer overlap aliasing considerations in C prevents the
compiler from automatically rearranging the original LD ST
order to be more pipeline friendly for a particular cpu,
but Fortran would allow it.

Yeah ok that could be about 2x speed for that kind of cpu.

Eric

From: robin on
"Noob" <root(a)127.0.0.1> wrote in message news:hphm36$puk$1(a)speranza.aioe.org...

| I need to rotate a picture clockwise 90 degrees.
|
| Conceptually, my picture is represented by a matrix
| (W columns, H rows) of pixels, with each pixel
| represented by 3 octets.
|
| pixel original_picture[W][H];
| pixel rotated_picture[H][W];
|
| At the implementation level, I'm dealing with 1D integer arrays.
|
| int src[W*H*3];
| int dest[W*H*3];
|
| Conceptually, the rotation operation comes down to
|
| rotated_picture[j][H-1-i] = original_picture[i][j]
|
| My actual code is
|
| for (i = 0; i < H; ++i)
| for (j = 0; j < W; ++j)
| memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

This is not going to be fast because:-
1. you are calling a subroutine 1,913,616 times to move
9 or 48 bits each time.
2. You are multiplying by 3 twice for each move.
3. You are using integer arrays.

1. Best to use ordinary assignment.
2. Eliminate *3 by changing to adds.
3. Use byte arrays (but I'm not clear about what you say.
You say that you have octets. Now 3 octets will be 9 bits,
not 3 integers equal to 48 bits.)

This is how it would look in PL/I:-

declare (src(m,n), dest(n,m)) char (3);
do i = 1 to m;
dest(*, m-i+1) = src(i,*);
end;

If the color information really is octets,
then the declaration is changed to bit (9) aligned.


From: glen herrmannsfeldt on
In comp.lang.pl1 robin <robin_v(a)bigpond.com> wrote:
> "Noob" <root(a)127.0.0.1> wrote in message news:hphm36$puk$1(a)speranza.aioe.org...
> | I need to rotate a picture clockwise 90 degrees.

> | Conceptually, my picture is represented by a matrix
> | (W columns, H rows) of pixels, with each pixel
> | represented by 3 octets.

> | pixel original_picture[W][H];
> | pixel rotated_picture[H][W];

> | At the implementation level, I'm dealing with 1D integer arrays.

> | int src[W*H*3];
> | int dest[W*H*3];

> | Conceptually, the rotation operation comes down to

> | rotated_picture[j][H-1-i] = original_picture[i][j]

> | My actual code is

> | for (i = 0; i < H; ++i)
> | for (j = 0; j < W; ++j)
> | memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

This is basically a matrix transpose (with the difference that
one goes down instead of up). There are cache friendly matrix
transpose algorithms that should also speed this one up.

> This is not going to be fast because:-
> 1. you are calling a subroutine 1,913,616 times to move
> 9 or 48 bits each time.

Many C compilers implement memcpy() inline. I suppose you
believe that PL/I does a subroutine call for the MOD function?

> 2. You are multiplying by 3 twice for each move.

Most can also figure that one out.

> 3. You are using integer arrays.

> 1. Best to use ordinary assignment.
> 2. Eliminate *3 by changing to adds.
> 3. Use byte arrays (but I'm not clear about what you say.
> You say that you have octets. Now 3 octets will be 9 bits,
> not 3 integers equal to 48 bits.)

-- glen
First  |  Prev  |  Next  |  Last
Pages: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Prev: #include "cpuid.os"
Next: aspect ratio algorithm needed.