Faster image rotation [General Programming]

Prev: #include "cpuid.os"
Next: aspect ratio algorithm needed.

From: Andrew Smallshaw on 7 Apr 2010 10:40

On 2010-04-07, Noob <root(a)127.0.0.1> wrote:
>
> My actual code is
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

I haven't looked at it in detail but that immediately caught my
eye. Surely:

memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3 * sizeof(int));

--
Andrew Smallshaw
andrews(a)sdf.lonestar.org

From: Willem on 7 Apr 2010 10:58

Noob wrote:
) Hello,
)
) I imagine the following problem has been efficiently solved
) over a million times in the past.
)
) I need to rotate a picture clockwise 90 degrees.
)
) Conceptually, my picture is represented by a matrix
) (W columns, H rows) of pixels, with each pixel
) represented by 3 octets.
)
) pixel original_picture[W][H];
) pixel rotated_picture[H][W];
)
) At the implementation level, I'm dealing with 1D integer arrays.
)
) int src[W*H*3];
) int dest[W*H*3];

Why integer arrays ? You can usually put 3 octets into one integer.
Most graphics programs I know do this: they map one pixel to one integer.

) Conceptually, the rotation operation comes down to
)
) rotated_picture[j][H-1-i] = original_picture[i][j]
)
) My actual code is
)
) for (i = 0; i < H; ++i)
) for (j = 0; j < W; ++j)
) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

<snip>

) for (i = 0; i < H; ++i)
) for (j = 0; j < W; ++j)
) {
) memcpy(dest+(H*j+H-i-1)*3, src, 3);
) src += 3;
) }

It could very well be that writing in order is much more optimal than
reading in order. Have you tried switching dest and src in there ?

Also, you can reduce the strength even further like so:

for (i = 0; i < H; ++i) {
dp = dest + (H-i-1)*3;
for (j = 0; j < W; ++j) {
memcpy(dp, src, 3);
src += 3;
dp += H*3;
}
}

) I thought changing the memcpy call to 3 explicit assignments
) might help, but it actually slowed things down.

That's perhaps because that memcpy is copying 3 bytes,
and the assignment is copying 3 integers ?

) Perhaps I need to perform some tiling magic... I don't think
) gcc performs automatic tiling?

I'm not sure what tiling is in this context.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

From: Noob on 7 Apr 2010 11:41

Willem wrote:

> Noob wrote:
>
> ) I imagine the following problem has been efficiently solved
> ) over a million times in the past.
> )
> ) I need to rotate a picture clockwise 90 degrees.
> )
> ) Conceptually, my picture is represented by a matrix
> ) (W columns, H rows) of pixels, with each pixel
> ) represented by 3 octets.
> )
> ) pixel original_picture[W][H];
> ) pixel rotated_picture[H][W];
> )
> ) At the implementation level, I'm dealing with 1D integer arrays.
> )
> ) int src[W*H*3];
> ) int dest[W*H*3];
>
> Why integer arrays ? You can usually put 3 octets into one integer.
> Most graphics programs I know do this: they map one pixel to one integer.

My mistake. I meant to write

uint8_t src[W * H * 3];
uint8_t dest[W * H * 3];

(i.e. 3 consecutive octets represent one pixel.)

NB: I have no control over the format.

> ) Conceptually, the rotation operation comes down to
> )
> ) rotated_picture[j][H-1-i] = original_picture[i][j]
> )
> ) My actual code is
> )
> ) for (i = 0; i< H; ++i)
> ) for (j = 0; j< W; ++j)
> ) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
>
> <snip>
>
> ) for (i = 0; i< H; ++i)
> ) for (j = 0; j< W; ++j)
> ) {
> ) memcpy(dest+(H*j+H-i-1)*3, src, 3);
> ) src += 3;
> ) }
>
> It could very well be that writing in order is much more optimal than
> reading in order. Have you tried switching dest and src in there ?

Thanks for the suggestion, I will try.

> Also, you can reduce the strength even further like so:
>
> for (i = 0; i< H; ++i) {
> dp = dest + (H-i-1)*3;
> for (j = 0; j< W; ++j) {
> memcpy(dp, src, 3);
> src += 3;
> dp += H*3;
> }
> }

I've tested this. No cigar ;-)

> ) I thought changing the memcpy call to 3 explicit assignments
> ) might help, but it actually slowed things down.
>
> That's perhaps because that memcpy is copying 3 bytes,
> and the assignment is copying 3 integers ?

See my correction above.

> ) Perhaps I need to perform some tiling magic... I don't think
> ) gcc performs automatic tiling?
>
> I'm not sure what tiling is in this context.

http://en.wikipedia.org/wiki/Loop_tiling

From: Paul Carpenter on 7 Apr 2010 11:35

In article <hphv27$bnh$1(a)speranza.aioe.org>, root(a)127.0.0.1 says...
> [ Adding comp.arch.embedded back to the mix, since most of the replies come from that group ]
>
> Paul Carpenter wrote:
>
> > Noob wrote:
> >
> >> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
> >> [ Followup-To: set to comp.programming ]
> >>
> >> I imagine the following problem has been efficiently solved
> >> over a million times in the past.
> >>
> >> I need to rotate a picture clockwise 90 degrees.
> >>
> >> Conceptually, my picture is represented by a matrix
> >> (W columns, H rows) of pixels, with each pixel
> >> represented by 3 octets.
> >>
> >> pixel original_picture[W][H];
> >> pixel rotated_picture[H][W];
> >>
> >> At the implementation level, I'm dealing with 1D integer arrays.
> >>
> >> int src[W*H*3];
> >> int dest[W*H*3];
>
> I made a mistake. I meant to write
> uint8_t src[W*H*3];
> uint8_t dest[W*H*3];

Makes more snse

.....
> >> Conceptually, the rotation operation comes down to
> >>
> >> rotated_picture[j][H-1-i] = original_picture[i][j]
> >
> > At this point, I went where does H come from?
>
> I'm not sure I understand what you mean.
>
> I was trying to describe the algorithm at a slightly higher level.
>
> pixel original_picture[W][H];
> pixel rotated_picture[H][W];
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> rotated_picture[j][H-1-i] = original_picture[i][j]
>
> There is perhaps a better (clearer) way to express the algorithm?
>
> > Do mean dimensions where maximum i is always greater than maximum j
> > therefore you need to make a square o/p array, so the zeroth column
> > is still the same height as original and consists of the bottom row
> > of the input image.
> >
> >> My actual code is
> >>
> >> for (i = 0; i< H; ++i)
> >> for (j = 0; j< W; ++j)
> >> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
> >
> > very messy
>
> In what sense?

Too many calculations done every iteration to create an offset which
should be the same difference between each stage on every iteration of
the loop.

> >> Consider as an example,
> >> W = 1644
> >> H = 1164
> >> size = 5.74 MB
> >>
> >> On my target platform, the rotation takes 1.65 seconds.
> >>
> >> I'm trying to get this down under half a second.
> >>
> >> I'm using the GNU tool chain. For some weird reason, we
> >> compile everything -O0. The first thing I'll try is crank
> >> gcc's optimization level.
> >
> > You nee to optimise your methods.
>
> Methods? Do you mean the functions?

No the method/algorithm used, adding functions on every iteration of
the loop adds overhead for something like this.

I think it would have trouble changing the loop style see later.

> >> I'm hoping gcc can perform some strength reduction, as the
> >> index calculation seems to be taking a non-negligible fraction
> >> of the total run-time.
> >
> > Don't rely on compiler to sort out excessive algorithm overhead.
> >
> >> Changing the loop to
> >>
> >> for (i = 0; i< H; ++i)
> >> for (j = 0; j< W; ++j)
> >> {
> >> memcpy(dest+(H*j+H-i-1)*3, src, 3);
> >> src += 3;
> >> }
> >>
> >> brings a 5% improvement.
> >
> > So you have seen that removing some complicated expression
> > from every iteration [e.g. W*i+j)*3 ] and replacing with
> > a simple addition speeds up operation.
>
> Strength reduction is typically the kind of task where a compiler
> should excel.

Won't stand much chance of changing something written in
an usuual way. There is not much it could reduce there usefully.

> > This should give you a clue.
> >
> >> I thought changing the memcpy call to 3 explicit assignments
> >> might help, but it actually slowed things down.
> >
> > Because the array indecii are calculated three times (for input
> > and output), adding back lots of complicated expressions.
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> {
> unsigned char *C = B+(H*j+H-i-1)*3;

Still a complex calculation EVERY loop iteration, for any particular
sizes being used 'H * j' is equivalent to adding H at the end of each
loop iteration, whilst every time this inner loop is started
'H - i - 1' is the same for the duration of the inner loop.

Only do actual parts of calculations that really change inside a loop
use variables to store partial results that don't change.

Also I assume B is the fixed start point of the array, whilst C is the
running pointer.

You can rewrite '(H*j+H-i-1)*3' as

3*(H*j) + 3*(H-i-1)

The first half is equivalent to adding 3 * H every loop iteration
and the second half is fixed for duration of inner loop.

Consider this which needs testing is off the top of my head

// B = base of o/p array
unsigned long Hstride, Hoffset;
unsigned char *C;

C = B;
Hstride = (3 * H) - 3; // -3 to offset the three ++ in inner loop
Hoffset = 0; // Hoffset coding could be improved
for (i = 0; i < H; ++i, C += Hoffset )
{
Hoffset = (H-i-1)*3;
for (j = 0; j < W; ++j, C += Hstride )
{
*C++ = *A++;
*C++ = *A++;
*C++ = *A++;
}
}

> C[0] = A[0];
> C[1] = A[1];
> C[2] = A[2];
> A += 3;
> }
>
> was slower than
>
> for (i = 0; i < H; ++i)
> for (j = 0; j < W; ++j)
> {
> memcpy(B+(H*j+H-i-1)*3, A, 3);
> A += 3;
> }
>
> on my platform.

Everytime you use array[ index ] the complier creates an arithmetic
pointer manipulation to calculate the actual pointer to the desired
location.

--
Paul Carpenter | paul(a)pcserviceselectronics.co.uk
<http://www.pcserviceselectronics.co.uk/> PC Services
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
<http://www.badweb.org.uk/> For those web sites you hate

From: Terje Mathisen "terje.mathisen at on 7 Apr 2010 11:59

Follow-up restored to include a group I read)
Noob wrote:
> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
> [ Followup-To: set to comp.programming ]
>
> Hello,
>
> I imagine the following problem has been efficiently solved
> over a million times in the past.
>
> I need to rotate a picture clockwise 90 degrees.
>
> Conceptually, my picture is represented by a matrix
> (W columns, H rows) of pixels, with each pixel
> represented by 3 octets.
>
> pixel original_picture[W][H];
> pixel rotated_picture[H][W];
>

This is indeed a well-known problem, the key to make it fast is to
realize that it is very similar to a transpose operation!

I.e. first copy everything to the target array (in a single block move),
then transpose it, finally you reverse each of the rows.

The difference between clock and counter-clock-wise rotation is in doing
the row reverse either before or after the transpose. Try both since I
don't remeber which is which!

The key here is that the transpose operation is quite cache-friendly,
much better than your current naive code.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11 12
Prev: #include "cpuid.os"
Next: aspect ratio algorithm needed.