From: Noob on 7 Apr 2010 06:17 [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] [ Followup-To: set to comp.programming ] Hello, I imagine the following problem has been efficiently solved over a million times in the past. I need to rotate a picture clockwise 90 degrees. Conceptually, my picture is represented by a matrix (W columns, H rows) of pixels, with each pixel represented by 3 octets. pixel original_picture[W][H]; pixel rotated_picture[H][W]; At the implementation level, I'm dealing with 1D integer arrays. int src[W*H*3]; int dest[W*H*3]; Conceptually, the rotation operation comes down to rotated_picture[j][H-1-i] = original_picture[i][j] My actual code is for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); Consider as an example, W = 1644 H = 1164 size = 5.74 MB On my target platform, the rotation takes 1.65 seconds. I'm trying to get this down under half a second. I'm using the GNU tool chain. For some weird reason, we compile everything -O0. The first thing I'll try is crank gcc's optimization level. I'm hoping gcc can perform some strength reduction, as the index calculation seems to be taking a non-negligible fraction of the total run-time. Changing the loop to for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-i-1)*3, src, 3); src += 3; } brings a 5% improvement. I thought changing the memcpy call to 3 explicit assignments might help, but it actually slowed things down. Perhaps I need to perform some tiling magic... I don't think gcc performs automatic tiling? Comments and insight would be greatly appreciated. Regards.
From: Noob on 7 Apr 2010 08:50 [ Adding comp.arch.embedded back to the mix, since most of the replies come from that group ] Paul Carpenter wrote: > Noob wrote: > >> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] >> [ Followup-To: set to comp.programming ] >> >> I imagine the following problem has been efficiently solved >> over a million times in the past. >> >> I need to rotate a picture clockwise 90 degrees. >> >> Conceptually, my picture is represented by a matrix >> (W columns, H rows) of pixels, with each pixel >> represented by 3 octets. >> >> pixel original_picture[W][H]; >> pixel rotated_picture[H][W]; >> >> At the implementation level, I'm dealing with 1D integer arrays. >> >> int src[W*H*3]; >> int dest[W*H*3]; I made a mistake. I meant to write uint8_t src[W*H*3]; uint8_t dest[W*H*3]; > So you have an array of ints (undetermined width) and a lot of them > as a block of memory. > > Why int for octets? You're right. I meant uint8_t not int. >> Conceptually, the rotation operation comes down to >> >> rotated_picture[j][H-1-i] = original_picture[i][j] > > At this point, I went where does H come from? I'm not sure I understand what you mean. I was trying to describe the algorithm at a slightly higher level. pixel original_picture[W][H]; pixel rotated_picture[H][W]; for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) rotated_picture[j][H-1-i] = original_picture[i][j] There is perhaps a better (clearer) way to express the algorithm? > Do mean dimensions where maximum i is always greater than maximum j > therefore you need to make a square o/p array, so the zeroth column > is still the same height as original and consists of the bottom row > of the input image. > >> My actual code is >> >> for (i = 0; i< H; ++i) >> for (j = 0; j< W; ++j) >> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); > > very messy In what sense? >> Consider as an example, >> W = 1644 >> H = 1164 >> size = 5.74 MB >> >> On my target platform, the rotation takes 1.65 seconds. >> >> I'm trying to get this down under half a second. >> >> I'm using the GNU tool chain. For some weird reason, we >> compile everything -O0. The first thing I'll try is crank >> gcc's optimization level. > > You nee to optimise your methods. Methods? Do you mean the functions? >> I'm hoping gcc can perform some strength reduction, as the >> index calculation seems to be taking a non-negligible fraction >> of the total run-time. > > Don't rely on compiler to sort out excessive algorithm overhead. > >> Changing the loop to >> >> for (i = 0; i< H; ++i) >> for (j = 0; j< W; ++j) >> { >> memcpy(dest+(H*j+H-i-1)*3, src, 3); >> src += 3; >> } >> >> brings a 5% improvement. > > So you have seen that removing some complicated expression > from every iteration [e.g. W*i+j)*3 ] and replacing with > a simple addition speeds up operation. Strength reduction is typically the kind of task where a compiler should excel. http://en.wikipedia.org/wiki/Strength_reduction > This should give you a clue. > >> I thought changing the memcpy call to 3 explicit assignments >> might help, but it actually slowed things down. > > Because the array indecii are calculated three times (for input > and output), adding back lots of complicated expressions. for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { unsigned char *C = B+(H*j+H-i-1)*3; C[0] = A[0]; C[1] = A[1]; C[2] = A[2]; A += 3; } was slower than for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(B+(H*j+H-i-1)*3, A, 3); A += 3; } on my platform. >> Perhaps I need to perform some tiling magic... I don't think >> gcc performs automatic tiling? > > You need to organise your data and use pointers, the rotation > is using fixed strides through arrays, which is simple pointer > addition (on most platforms) > > You could still use memcpy but for three int/byte at a time is > wasteful. the following is an OUTLINE of doing three copies. > > *op = *ip++; > op += H; > *op = *ip++; > op += H; > *op = *ip++; > > You will need a few variables for pointers and tracking through inner > loop, to make sure things get reset correctly. > > Only calculate your strides once and reuse them. > > Simple do calculations minimum times. Thanks for the tips. Regards.
From: Andrew Smallshaw on 7 Apr 2010 10:40 On 2010-04-07, Noob <root(a)127.0.0.1> wrote: > > My actual code is > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); I haven't looked at it in detail but that immediately caught my eye. Surely: memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3 * sizeof(int)); -- Andrew Smallshaw andrews(a)sdf.lonestar.org
From: Paul Carpenter on 7 Apr 2010 11:35 In article <hphv27$bnh$1(a)speranza.aioe.org>, root(a)127.0.0.1 says... > [ Adding comp.arch.embedded back to the mix, since most of the replies come from that group ] > > Paul Carpenter wrote: > > > Noob wrote: > > > >> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] > >> [ Followup-To: set to comp.programming ] > >> > >> I imagine the following problem has been efficiently solved > >> over a million times in the past. > >> > >> I need to rotate a picture clockwise 90 degrees. > >> > >> Conceptually, my picture is represented by a matrix > >> (W columns, H rows) of pixels, with each pixel > >> represented by 3 octets. > >> > >> pixel original_picture[W][H]; > >> pixel rotated_picture[H][W]; > >> > >> At the implementation level, I'm dealing with 1D integer arrays. > >> > >> int src[W*H*3]; > >> int dest[W*H*3]; > > I made a mistake. I meant to write > uint8_t src[W*H*3]; > uint8_t dest[W*H*3]; Makes more snse ..... > >> Conceptually, the rotation operation comes down to > >> > >> rotated_picture[j][H-1-i] = original_picture[i][j] > > > > At this point, I went where does H come from? > > I'm not sure I understand what you mean. > > I was trying to describe the algorithm at a slightly higher level. > > pixel original_picture[W][H]; > pixel rotated_picture[H][W]; > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > rotated_picture[j][H-1-i] = original_picture[i][j] > > There is perhaps a better (clearer) way to express the algorithm? > > > Do mean dimensions where maximum i is always greater than maximum j > > therefore you need to make a square o/p array, so the zeroth column > > is still the same height as original and consists of the bottom row > > of the input image. > > > >> My actual code is > >> > >> for (i = 0; i< H; ++i) > >> for (j = 0; j< W; ++j) > >> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); > > > > very messy > > In what sense? Too many calculations done every iteration to create an offset which should be the same difference between each stage on every iteration of the loop. > >> Consider as an example, > >> W = 1644 > >> H = 1164 > >> size = 5.74 MB > >> > >> On my target platform, the rotation takes 1.65 seconds. > >> > >> I'm trying to get this down under half a second. > >> > >> I'm using the GNU tool chain. For some weird reason, we > >> compile everything -O0. The first thing I'll try is crank > >> gcc's optimization level. > > > > You nee to optimise your methods. > > Methods? Do you mean the functions? No the method/algorithm used, adding functions on every iteration of the loop adds overhead for something like this. I think it would have trouble changing the loop style see later. > >> I'm hoping gcc can perform some strength reduction, as the > >> index calculation seems to be taking a non-negligible fraction > >> of the total run-time. > > > > Don't rely on compiler to sort out excessive algorithm overhead. > > > >> Changing the loop to > >> > >> for (i = 0; i< H; ++i) > >> for (j = 0; j< W; ++j) > >> { > >> memcpy(dest+(H*j+H-i-1)*3, src, 3); > >> src += 3; > >> } > >> > >> brings a 5% improvement. > > > > So you have seen that removing some complicated expression > > from every iteration [e.g. W*i+j)*3 ] and replacing with > > a simple addition speeds up operation. > > Strength reduction is typically the kind of task where a compiler > should excel. Won't stand much chance of changing something written in an usuual way. There is not much it could reduce there usefully. > > This should give you a clue. > > > >> I thought changing the memcpy call to 3 explicit assignments > >> might help, but it actually slowed things down. > > > > Because the array indecii are calculated three times (for input > > and output), adding back lots of complicated expressions. > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > { > unsigned char *C = B+(H*j+H-i-1)*3; Still a complex calculation EVERY loop iteration, for any particular sizes being used 'H * j' is equivalent to adding H at the end of each loop iteration, whilst every time this inner loop is started 'H - i - 1' is the same for the duration of the inner loop. Only do actual parts of calculations that really change inside a loop use variables to store partial results that don't change. Also I assume B is the fixed start point of the array, whilst C is the running pointer. You can rewrite '(H*j+H-i-1)*3' as 3*(H*j) + 3*(H-i-1) The first half is equivalent to adding 3 * H every loop iteration and the second half is fixed for duration of inner loop. Consider this which needs testing is off the top of my head // B = base of o/p array unsigned long Hstride, Hoffset; unsigned char *C; C = B; Hstride = (3 * H) - 3; // -3 to offset the three ++ in inner loop Hoffset = 0; // Hoffset coding could be improved for (i = 0; i < H; ++i, C += Hoffset ) { Hoffset = (H-i-1)*3; for (j = 0; j < W; ++j, C += Hstride ) { *C++ = *A++; *C++ = *A++; *C++ = *A++; } } > C[0] = A[0]; > C[1] = A[1]; > C[2] = A[2]; > A += 3; > } > > was slower than > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > { > memcpy(B+(H*j+H-i-1)*3, A, 3); > A += 3; > } > > on my platform. Everytime you use array[ index ] the complier creates an arithmetic pointer manipulation to calculate the actual pointer to the desired location. -- Paul Carpenter | paul(a)pcserviceselectronics.co.uk <http://www.pcserviceselectronics.co.uk/> PC Services <http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font <http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny <http://www.badweb.org.uk/> For those web sites you hate
From: Joseph Power on 7 Apr 2010 12:07 On Wed, 07 Apr 2010 12:17:12 +0200 in comp.arch.embedded, Noob <root(a)127.0.0.1> wrote: >My actual code is > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); > I suspect some compilers may do this 'under the hood', but you might want to consider adding a temporary variable to cut down on repeated multiplies: for (i = 0; i < H; ++i) { W_times_i = W * i; for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-1-i)*3, src+(W_times_i+j)*3, 3); } } This should eliminate W - 1 multiplies. Given that there is a certain amount of overhead involved in calling memcpy (especially for such a small number of bytes), you might want to simply replace the call with three explicit assignments: for (i = 0; i < H; ++i) { W_times_i = W * i; for (j = 0; j < W; ++j) { dest_indx = (H*j+H-1-i)*3; src_indx = (W_times_i+j)*3; dest[dest_indx++] = src[src_indx++]; dest[dest_indx++] = src[src_indx++]; dest[dest_indx] = src[src_indx]; } } You might also try i++ and j++ in your for loops - if your processor's hardware supports post-increment better than pre-increment, your compiler may be aware of that fact. hope that helps Joe Power
|
Next
|
Last
Pages: 1 2 3 4 5 6 Prev: Including compile timestamp in c? Next: fpga and the particular case of xilinx |