From: Noob on
[ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
[ Followup-To: set to comp.programming ]

Hello,

I imagine the following problem has been efficiently solved
over a million times in the past.

I need to rotate a picture clockwise 90 degrees.

Conceptually, my picture is represented by a matrix
(W columns, H rows) of pixels, with each pixel
represented by 3 octets.

pixel original_picture[W][H];
pixel rotated_picture[H][W];

At the implementation level, I'm dealing with 1D integer arrays.

int src[W*H*3];
int dest[W*H*3];

Conceptually, the rotation operation comes down to

rotated_picture[j][H-1-i] = original_picture[i][j]

My actual code is

for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

Consider as an example,
W = 1644
H = 1164
size = 5.74 MB

On my target platform, the rotation takes 1.65 seconds.

I'm trying to get this down under half a second.

I'm using the GNU tool chain. For some weird reason, we
compile everything -O0. The first thing I'll try is crank
gcc's optimization level.

I'm hoping gcc can perform some strength reduction, as the
index calculation seems to be taking a non-negligible fraction
of the total run-time.

Changing the loop to

for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
{
memcpy(dest+(H*j+H-i-1)*3, src, 3);
src += 3;
}

brings a 5% improvement.

I thought changing the memcpy call to 3 explicit assignments
might help, but it actually slowed things down.

Perhaps I need to perform some tiling magic... I don't think
gcc performs automatic tiling?

Comments and insight would be greatly appreciated.

Regards.
From: Noob on
[ Adding comp.arch.embedded back to the mix, since most of the replies come from that group ]

Paul Carpenter wrote:

> Noob wrote:
>
>> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
>> [ Followup-To: set to comp.programming ]
>>
>> I imagine the following problem has been efficiently solved
>> over a million times in the past.
>>
>> I need to rotate a picture clockwise 90 degrees.
>>
>> Conceptually, my picture is represented by a matrix
>> (W columns, H rows) of pixels, with each pixel
>> represented by 3 octets.
>>
>> pixel original_picture[W][H];
>> pixel rotated_picture[H][W];
>>
>> At the implementation level, I'm dealing with 1D integer arrays.
>>
>> int src[W*H*3];
>> int dest[W*H*3];

I made a mistake. I meant to write
uint8_t src[W*H*3];
uint8_t dest[W*H*3];

> So you have an array of ints (undetermined width) and a lot of them
> as a block of memory.
>
> Why int for octets?

You're right. I meant uint8_t not int.

>> Conceptually, the rotation operation comes down to
>>
>> rotated_picture[j][H-1-i] = original_picture[i][j]
>
> At this point, I went where does H come from?

I'm not sure I understand what you mean.

I was trying to describe the algorithm at a slightly higher level.

pixel original_picture[W][H];
pixel rotated_picture[H][W];
for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
rotated_picture[j][H-1-i] = original_picture[i][j]

There is perhaps a better (clearer) way to express the algorithm?

> Do mean dimensions where maximum i is always greater than maximum j
> therefore you need to make a square o/p array, so the zeroth column
> is still the same height as original and consists of the bottom row
> of the input image.
>
>> My actual code is
>>
>> for (i = 0; i< H; ++i)
>> for (j = 0; j< W; ++j)
>> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
>
> very messy

In what sense?

>> Consider as an example,
>> W = 1644
>> H = 1164
>> size = 5.74 MB
>>
>> On my target platform, the rotation takes 1.65 seconds.
>>
>> I'm trying to get this down under half a second.
>>
>> I'm using the GNU tool chain. For some weird reason, we
>> compile everything -O0. The first thing I'll try is crank
>> gcc's optimization level.
>
> You nee to optimise your methods.

Methods? Do you mean the functions?

>> I'm hoping gcc can perform some strength reduction, as the
>> index calculation seems to be taking a non-negligible fraction
>> of the total run-time.
>
> Don't rely on compiler to sort out excessive algorithm overhead.
>
>> Changing the loop to
>>
>> for (i = 0; i< H; ++i)
>> for (j = 0; j< W; ++j)
>> {
>> memcpy(dest+(H*j+H-i-1)*3, src, 3);
>> src += 3;
>> }
>>
>> brings a 5% improvement.
>
> So you have seen that removing some complicated expression
> from every iteration [e.g. W*i+j)*3 ] and replacing with
> a simple addition speeds up operation.

Strength reduction is typically the kind of task where a compiler
should excel.

http://en.wikipedia.org/wiki/Strength_reduction

> This should give you a clue.
>
>> I thought changing the memcpy call to 3 explicit assignments
>> might help, but it actually slowed things down.
>
> Because the array indecii are calculated three times (for input
> and output), adding back lots of complicated expressions.

for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
{
unsigned char *C = B+(H*j+H-i-1)*3;
C[0] = A[0];
C[1] = A[1];
C[2] = A[2];
A += 3;
}

was slower than

for (i = 0; i < H; ++i)
for (j = 0; j < W; ++j)
{
memcpy(B+(H*j+H-i-1)*3, A, 3);
A += 3;
}

on my platform.

>> Perhaps I need to perform some tiling magic... I don't think
>> gcc performs automatic tiling?
>
> You need to organise your data and use pointers, the rotation
> is using fixed strides through arrays, which is simple pointer
> addition (on most platforms)
>
> You could still use memcpy but for three int/byte at a time is
> wasteful. the following is an OUTLINE of doing three copies.
>
> *op = *ip++;
> op += H;
> *op = *ip++;
> op += H;
> *op = *ip++;
>
> You will need a few variables for pointers and tracking through inner
> loop, to make sure things get reset correctly.
>
> Only calculate your strides once and reuse them.
>
> Simple do calculations minimum times.

Thanks for the tips.

Regards.
From: Andrew Smallshaw on
On 2010-04-07, Noob <root(a)127.0.0.1> wrote:
>
> My actual code is
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

I haven't looked at it in detail but that immediately caught my
eye. Surely:

memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3 * sizeof(int));

--
Andrew Smallshaw
andrews(a)sdf.lonestar.org
From: Paul Carpenter on
In article <hphv27$bnh$1(a)speranza.aioe.org>, root(a)127.0.0.1 says...
> [ Adding comp.arch.embedded back to the mix, since most of the replies come from that group ]
>
> Paul Carpenter wrote:
>
> > Noob wrote:
> >
> >> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
> >> [ Followup-To: set to comp.programming ]
> >>
> >> I imagine the following problem has been efficiently solved
> >> over a million times in the past.
> >>
> >> I need to rotate a picture clockwise 90 degrees.
> >>
> >> Conceptually, my picture is represented by a matrix
> >> (W columns, H rows) of pixels, with each pixel
> >> represented by 3 octets.
> >>
> >> pixel original_picture[W][H];
> >> pixel rotated_picture[H][W];
> >>
> >> At the implementation level, I'm dealing with 1D integer arrays.
> >>
> >> int src[W*H*3];
> >> int dest[W*H*3];
>
> I made a mistake. I meant to write
> uint8_t src[W*H*3];
> uint8_t dest[W*H*3];

Makes more snse

.....
> >> Conceptually, the rotation operation comes down to
> >>
> >> rotated_picture[j][H-1-i] = original_picture[i][j]
> >
> > At this point, I went where does H come from?
>
> I'm not sure I understand what you mean.
>
> I was trying to describe the algorithm at a slightly higher level.
>
> pixel original_picture[W][H];
> pixel rotated_picture[H][W];
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> rotated_picture[j][H-1-i] = original_picture[i][j]
>
> There is perhaps a better (clearer) way to express the algorithm?
>
> > Do mean dimensions where maximum i is always greater than maximum j
> > therefore you need to make a square o/p array, so the zeroth column
> > is still the same height as original and consists of the bottom row
> > of the input image.
> >
> >> My actual code is
> >>
> >> for (i = 0; i< H; ++i)
> >> for (j = 0; j< W; ++j)
> >> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
> >
> > very messy
>
> In what sense?

Too many calculations done every iteration to create an offset which
should be the same difference between each stage on every iteration of
the loop.

> >> Consider as an example,
> >> W = 1644
> >> H = 1164
> >> size = 5.74 MB
> >>
> >> On my target platform, the rotation takes 1.65 seconds.
> >>
> >> I'm trying to get this down under half a second.
> >>
> >> I'm using the GNU tool chain. For some weird reason, we
> >> compile everything -O0. The first thing I'll try is crank
> >> gcc's optimization level.
> >
> > You nee to optimise your methods.
>
> Methods? Do you mean the functions?

No the method/algorithm used, adding functions on every iteration of
the loop adds overhead for something like this.

I think it would have trouble changing the loop style see later.

> >> I'm hoping gcc can perform some strength reduction, as the
> >> index calculation seems to be taking a non-negligible fraction
> >> of the total run-time.
> >
> > Don't rely on compiler to sort out excessive algorithm overhead.
> >
> >> Changing the loop to
> >>
> >> for (i = 0; i< H; ++i)
> >> for (j = 0; j< W; ++j)
> >> {
> >> memcpy(dest+(H*j+H-i-1)*3, src, 3);
> >> src += 3;
> >> }
> >>
> >> brings a 5% improvement.
> >
> > So you have seen that removing some complicated expression
> > from every iteration [e.g. W*i+j)*3 ] and replacing with
> > a simple addition speeds up operation.
>
> Strength reduction is typically the kind of task where a compiler
> should excel.

Won't stand much chance of changing something written in
an usuual way. There is not much it could reduce there usefully.

> > This should give you a clue.
> >
> >> I thought changing the memcpy call to 3 explicit assignments
> >> might help, but it actually slowed things down.
> >
> > Because the array indecii are calculated three times (for input
> > and output), adding back lots of complicated expressions.
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> {
> unsigned char *C = B+(H*j+H-i-1)*3;

Still a complex calculation EVERY loop iteration, for any particular
sizes being used 'H * j' is equivalent to adding H at the end of each
loop iteration, whilst every time this inner loop is started
'H - i - 1' is the same for the duration of the inner loop.

Only do actual parts of calculations that really change inside a loop
use variables to store partial results that don't change.

Also I assume B is the fixed start point of the array, whilst C is the
running pointer.

You can rewrite '(H*j+H-i-1)*3' as

3*(H*j) + 3*(H-i-1)

The first half is equivalent to adding 3 * H every loop iteration
and the second half is fixed for duration of inner loop.

Consider this which needs testing is off the top of my head

// B = base of o/p array
unsigned long Hstride, Hoffset;
unsigned char *C;

C = B;
Hstride = (3 * H) - 3; // -3 to offset the three ++ in inner loop
Hoffset = 0; // Hoffset coding could be improved
for (i = 0; i < H; ++i, C += Hoffset )
{
Hoffset = (H-i-1)*3;
for (j = 0; j < W; ++j, C += Hstride )
{
*C++ = *A++;
*C++ = *A++;
*C++ = *A++;
}
}

> C[0] = A[0];
> C[1] = A[1];
> C[2] = A[2];
> A += 3;
> }
>
> was slower than
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> {
> memcpy(B+(H*j+H-i-1)*3, A, 3);
> A += 3;
> }
>
> on my platform.

Everytime you use array[ index ] the complier creates an arithmetic
pointer manipulation to calculate the actual pointer to the desired
location.

--
Paul Carpenter | paul(a)pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/> PC Services
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
<http://www.badweb.org.uk/> For those web sites you hate
From: Joseph Power on
On Wed, 07 Apr 2010 12:17:12 +0200 in comp.arch.embedded, Noob <root(a)127.0.0.1>
wrote:

>My actual code is
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
>

I suspect some compilers may do this 'under the hood', but you might want to
consider adding a temporary variable to cut down on repeated multiplies:

for (i = 0; i < H; ++i)
{
W_times_i = W * i;
for (j = 0; j < W; ++j)
{
memcpy(dest+(H*j+H-1-i)*3, src+(W_times_i+j)*3, 3);
}
}

This should eliminate W - 1 multiplies.

Given that there is a certain amount of overhead involved in calling memcpy
(especially for such a small number of bytes), you might want to simply replace
the call with three explicit assignments:

for (i = 0; i < H; ++i)
{
W_times_i = W * i;
for (j = 0; j < W; ++j)
{
dest_indx = (H*j+H-1-i)*3;
src_indx = (W_times_i+j)*3;
dest[dest_indx++] = src[src_indx++];
dest[dest_indx++] = src[src_indx++];
dest[dest_indx] = src[src_indx];
}
}


You might also try i++ and j++ in your for loops - if your processor's hardware
supports post-increment better than pre-increment, your compiler may be aware of
that fact.

hope that helps

Joe Power