From: Noob on 7 Apr 2010 06:17 [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] [ Followup-To: set to comp.programming ] Hello, I imagine the following problem has been efficiently solved over a million times in the past. I need to rotate a picture clockwise 90 degrees. Conceptually, my picture is represented by a matrix (W columns, H rows) of pixels, with each pixel represented by 3 octets. pixel original_picture[W][H]; pixel rotated_picture[H][W]; At the implementation level, I'm dealing with 1D integer arrays. int src[W*H*3]; int dest[W*H*3]; Conceptually, the rotation operation comes down to rotated_picture[j][H-1-i] = original_picture[i][j] My actual code is for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); Consider as an example, W = 1644 H = 1164 size = 5.74 MB On my target platform, the rotation takes 1.65 seconds. I'm trying to get this down under half a second. I'm using the GNU tool chain. For some weird reason, we compile everything -O0. The first thing I'll try is crank gcc's optimization level. I'm hoping gcc can perform some strength reduction, as the index calculation seems to be taking a non-negligible fraction of the total run-time. Changing the loop to for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-i-1)*3, src, 3); src += 3; } brings a 5% improvement. I thought changing the memcpy call to 3 explicit assignments might help, but it actually slowed things down. Perhaps I need to perform some tiling magic... I don't think gcc performs automatic tiling? Comments and insight would be greatly appreciated. Regards.
From: Meindert Sprang on 7 Apr 2010 06:39 "Noob" <root(a)127.0.0.1> wrote in message news:hphm36$puk$1(a)speranza.aioe.org... > My actual code is > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); Rewrite this in assembler and you'll be amazed. Calling memcpy to copy only 3 bytes is killing you. Meindert
From: James Dow Allen on 7 Apr 2010 07:16 On Apr 7, 5:17 pm, Noob <r...(a)127.0.0.1> wrote: > I need to rotate a picture clockwise 90 degrees. As Meindert mentions, bypassing 'memcpy' might help, though some compilers might be doing that for you already. As you imply, calculating addresses (and with '+' not '*') in a more efficient way may help, though again some compilers might be doing it for you. Some other suggestions that might help: 1. Break large square into smaller squares for better cache utilization. Your exact problem is discussed at http://en.wikipedia.org/wiki/Cache-oblivious_algorithm (Perhaps that wiki article links to source code.) 2. Often 4-byte pixels (allowing uint32 as datatype) are used instead of 3-byte pixels. May seem very unlikely, but using 33% *more* storage just *might* lead to speedup because of faster pixel ops. When I saw the subject line, I though you might have been rotating by arbitrary angle, not 90-degrees. Once you get fast 90-degree rotate. the elegant 3-shear algorithm can build from that to get fast accurate rotation by any angle. James Dow Allen
From: Paul Carpenter on 7 Apr 2010 07:53 In article <hphm36$puk$1(a)speranza.aioe.org>, root(a)127.0.0.1 says... > [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] > [ Followup-To: set to comp.programming ] > > Hello, > > I imagine the following problem has been efficiently solved > over a million times in the past. > > I need to rotate a picture clockwise 90 degrees. > > Conceptually, my picture is represented by a matrix > (W columns, H rows) of pixels, with each pixel > represented by 3 octets. > > pixel original_picture[W][H]; > pixel rotated_picture[H][W]; > > At the implementation level, I'm dealing with 1D integer arrays. > > int src[W*H*3]; > int dest[W*H*3]; So you have an array of ints (undetermined width) and a lot of them as a block of memory. Why int for octets? > Conceptually, the rotation operation comes down to > > rotated_picture[j][H-1-i] = original_picture[i][j] At this point I went where does H come from? Do mean dimensions where maximum i is always greater than maximum j therefore you need to make a square o/p array, so the zeroth column is still the same height as original and consists of the bottom row of the input image. > My actual code is > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); very messy > Consider as an example, > W = 1644 > H = 1164 > size = 5.74 MB > > On my target platform, the rotation takes 1.65 seconds. > > I'm trying to get this down under half a second. > > I'm using the GNU tool chain. For some weird reason, we > compile everything -O0. The first thing I'll try is crank > gcc's optimization level. You nee to optimise your methods. > I'm hoping gcc can perform some strength reduction, as the > index calculation seems to be taking a non-negligible fraction > of the total run-time. Don't rely on compiler to sort out excessive algorith overhead. > Changing the loop to > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > { > memcpy(dest+(H*j+H-i-1)*3, src, 3); > src += 3; > } > > brings a 5% improvement. So you have seen that removing some complicated expression from every iteration [e.g. W*i+j)*3 ] and replacing with a simple addition speeds up operation. This should give you a clue. > I thought changing the memcpy call to 3 explicit assignments > might help, but it actually slowed things down. Because the array indecii are calculated three times (for input and output), adding back lots of complicated expressions. > Perhaps I need to perform some tiling magic... I don't think > gcc performs automatic tiling? You need to organise your data and use pointers, the rotation is using fixed strides through arrays, which is simple pointer addition (on most platforms) You could still use memcpy but for three int/byte at a time is wasteful. the following is an OUTLINE of doing three copies. *op = *ip++; op += H; *op = *ip++; op += H; *op = *ip++; You will need a few variables for pointers and tracking through inner loop, to make sure things get reset correctly. Only calculate your strides once and reuse them. Simple do calculations minimum times. > Comments and insight would be greatly appreciated. > > Regards. > -- Paul Carpenter | paul(a)pcserviceselectronics.co.uk <http://www.pcserviceselectronics.co.uk/> PC Services <http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font <http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny <http://www.badweb.org.uk/> For those web sites you hate
From: Noob on 7 Apr 2010 08:50
[ Adding comp.arch.embedded back to the mix, since most of the replies come from that group ] Paul Carpenter wrote: > Noob wrote: > >> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] >> [ Followup-To: set to comp.programming ] >> >> I imagine the following problem has been efficiently solved >> over a million times in the past. >> >> I need to rotate a picture clockwise 90 degrees. >> >> Conceptually, my picture is represented by a matrix >> (W columns, H rows) of pixels, with each pixel >> represented by 3 octets. >> >> pixel original_picture[W][H]; >> pixel rotated_picture[H][W]; >> >> At the implementation level, I'm dealing with 1D integer arrays. >> >> int src[W*H*3]; >> int dest[W*H*3]; I made a mistake. I meant to write uint8_t src[W*H*3]; uint8_t dest[W*H*3]; > So you have an array of ints (undetermined width) and a lot of them > as a block of memory. > > Why int for octets? You're right. I meant uint8_t not int. >> Conceptually, the rotation operation comes down to >> >> rotated_picture[j][H-1-i] = original_picture[i][j] > > At this point, I went where does H come from? I'm not sure I understand what you mean. I was trying to describe the algorithm at a slightly higher level. pixel original_picture[W][H]; pixel rotated_picture[H][W]; for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) rotated_picture[j][H-1-i] = original_picture[i][j] There is perhaps a better (clearer) way to express the algorithm? > Do mean dimensions where maximum i is always greater than maximum j > therefore you need to make a square o/p array, so the zeroth column > is still the same height as original and consists of the bottom row > of the input image. > >> My actual code is >> >> for (i = 0; i< H; ++i) >> for (j = 0; j< W; ++j) >> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); > > very messy In what sense? >> Consider as an example, >> W = 1644 >> H = 1164 >> size = 5.74 MB >> >> On my target platform, the rotation takes 1.65 seconds. >> >> I'm trying to get this down under half a second. >> >> I'm using the GNU tool chain. For some weird reason, we >> compile everything -O0. The first thing I'll try is crank >> gcc's optimization level. > > You nee to optimise your methods. Methods? Do you mean the functions? >> I'm hoping gcc can perform some strength reduction, as the >> index calculation seems to be taking a non-negligible fraction >> of the total run-time. > > Don't rely on compiler to sort out excessive algorithm overhead. > >> Changing the loop to >> >> for (i = 0; i< H; ++i) >> for (j = 0; j< W; ++j) >> { >> memcpy(dest+(H*j+H-i-1)*3, src, 3); >> src += 3; >> } >> >> brings a 5% improvement. > > So you have seen that removing some complicated expression > from every iteration [e.g. W*i+j)*3 ] and replacing with > a simple addition speeds up operation. Strength reduction is typically the kind of task where a compiler should excel. http://en.wikipedia.org/wiki/Strength_reduction > This should give you a clue. > >> I thought changing the memcpy call to 3 explicit assignments >> might help, but it actually slowed things down. > > Because the array indecii are calculated three times (for input > and output), adding back lots of complicated expressions. for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { unsigned char *C = B+(H*j+H-i-1)*3; C[0] = A[0]; C[1] = A[1]; C[2] = A[2]; A += 3; } was slower than for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(B+(H*j+H-i-1)*3, A, 3); A += 3; } on my platform. >> Perhaps I need to perform some tiling magic... I don't think >> gcc performs automatic tiling? > > You need to organise your data and use pointers, the rotation > is using fixed strides through arrays, which is simple pointer > addition (on most platforms) > > You could still use memcpy but for three int/byte at a time is > wasteful. the following is an OUTLINE of doing three copies. > > *op = *ip++; > op += H; > *op = *ip++; > op += H; > *op = *ip++; > > You will need a few variables for pointers and tracking through inner > loop, to make sure things get reset correctly. > > Only calculate your strides once and reuse them. > > Simple do calculations minimum times. Thanks for the tips. Regards. |