From: Andrew Smallshaw on 7 Apr 2010 10:40 On 2010-04-07, Noob <root(a)127.0.0.1> wrote: > > My actual code is > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); I haven't looked at it in detail but that immediately caught my eye. Surely: memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3 * sizeof(int)); -- Andrew Smallshaw andrews(a)sdf.lonestar.org
From: Willem on 7 Apr 2010 10:58 Noob wrote: ) Hello, ) ) I imagine the following problem has been efficiently solved ) over a million times in the past. ) ) I need to rotate a picture clockwise 90 degrees. ) ) Conceptually, my picture is represented by a matrix ) (W columns, H rows) of pixels, with each pixel ) represented by 3 octets. ) ) pixel original_picture[W][H]; ) pixel rotated_picture[H][W]; ) ) At the implementation level, I'm dealing with 1D integer arrays. ) ) int src[W*H*3]; ) int dest[W*H*3]; Why integer arrays ? You can usually put 3 octets into one integer. Most graphics programs I know do this: they map one pixel to one integer. ) Conceptually, the rotation operation comes down to ) ) rotated_picture[j][H-1-i] = original_picture[i][j] ) ) My actual code is ) ) for (i = 0; i < H; ++i) ) for (j = 0; j < W; ++j) ) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); <snip> ) for (i = 0; i < H; ++i) ) for (j = 0; j < W; ++j) ) { ) memcpy(dest+(H*j+H-i-1)*3, src, 3); ) src += 3; ) } It could very well be that writing in order is much more optimal than reading in order. Have you tried switching dest and src in there ? Also, you can reduce the strength even further like so: for (i = 0; i < H; ++i) { dp = dest + (H-i-1)*3; for (j = 0; j < W; ++j) { memcpy(dp, src, 3); src += 3; dp += H*3; } } ) I thought changing the memcpy call to 3 explicit assignments ) might help, but it actually slowed things down. That's perhaps because that memcpy is copying 3 bytes, and the assignment is copying 3 integers ? ) Perhaps I need to perform some tiling magic... I don't think ) gcc performs automatic tiling? I'm not sure what tiling is in this context. SaSW, Willem -- Disclaimer: I am in no way responsible for any of the statements made in the above text. For all I know I might be drugged or something.. No I'm not paranoid. You all think I'm paranoid, don't you ! #EOT
From: Noob on 7 Apr 2010 11:41 Willem wrote: > Noob wrote: > > ) I imagine the following problem has been efficiently solved > ) over a million times in the past. > ) > ) I need to rotate a picture clockwise 90 degrees. > ) > ) Conceptually, my picture is represented by a matrix > ) (W columns, H rows) of pixels, with each pixel > ) represented by 3 octets. > ) > ) pixel original_picture[W][H]; > ) pixel rotated_picture[H][W]; > ) > ) At the implementation level, I'm dealing with 1D integer arrays. > ) > ) int src[W*H*3]; > ) int dest[W*H*3]; > > Why integer arrays ? You can usually put 3 octets into one integer. > Most graphics programs I know do this: they map one pixel to one integer. My mistake. I meant to write uint8_t src[W * H * 3]; uint8_t dest[W * H * 3]; (i.e. 3 consecutive octets represent one pixel.) NB: I have no control over the format. > ) Conceptually, the rotation operation comes down to > ) > ) rotated_picture[j][H-1-i] = original_picture[i][j] > ) > ) My actual code is > ) > ) for (i = 0; i< H; ++i) > ) for (j = 0; j< W; ++j) > ) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); > > <snip> > > ) for (i = 0; i< H; ++i) > ) for (j = 0; j< W; ++j) > ) { > ) memcpy(dest+(H*j+H-i-1)*3, src, 3); > ) src += 3; > ) } > > It could very well be that writing in order is much more optimal than > reading in order. Have you tried switching dest and src in there ? Thanks for the suggestion, I will try. > Also, you can reduce the strength even further like so: > > for (i = 0; i< H; ++i) { > dp = dest + (H-i-1)*3; > for (j = 0; j< W; ++j) { > memcpy(dp, src, 3); > src += 3; > dp += H*3; > } > } I've tested this. No cigar ;-) > ) I thought changing the memcpy call to 3 explicit assignments > ) might help, but it actually slowed things down. > > That's perhaps because that memcpy is copying 3 bytes, > and the assignment is copying 3 integers ? See my correction above. > ) Perhaps I need to perform some tiling magic... I don't think > ) gcc performs automatic tiling? > > I'm not sure what tiling is in this context. http://en.wikipedia.org/wiki/Loop_tiling
From: Paul Carpenter on 7 Apr 2010 11:35 In article <hphv27$bnh$1(a)speranza.aioe.org>, root(a)127.0.0.1 says... > [ Adding comp.arch.embedded back to the mix, since most of the replies come from that group ] > > Paul Carpenter wrote: > > > Noob wrote: > > > >> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] > >> [ Followup-To: set to comp.programming ] > >> > >> I imagine the following problem has been efficiently solved > >> over a million times in the past. > >> > >> I need to rotate a picture clockwise 90 degrees. > >> > >> Conceptually, my picture is represented by a matrix > >> (W columns, H rows) of pixels, with each pixel > >> represented by 3 octets. > >> > >> pixel original_picture[W][H]; > >> pixel rotated_picture[H][W]; > >> > >> At the implementation level, I'm dealing with 1D integer arrays. > >> > >> int src[W*H*3]; > >> int dest[W*H*3]; > > I made a mistake. I meant to write > uint8_t src[W*H*3]; > uint8_t dest[W*H*3]; Makes more snse ..... > >> Conceptually, the rotation operation comes down to > >> > >> rotated_picture[j][H-1-i] = original_picture[i][j] > > > > At this point, I went where does H come from? > > I'm not sure I understand what you mean. > > I was trying to describe the algorithm at a slightly higher level. > > pixel original_picture[W][H]; > pixel rotated_picture[H][W]; > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > rotated_picture[j][H-1-i] = original_picture[i][j] > > There is perhaps a better (clearer) way to express the algorithm? > > > Do mean dimensions where maximum i is always greater than maximum j > > therefore you need to make a square o/p array, so the zeroth column > > is still the same height as original and consists of the bottom row > > of the input image. > > > >> My actual code is > >> > >> for (i = 0; i< H; ++i) > >> for (j = 0; j< W; ++j) > >> memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); > > > > very messy > > In what sense? Too many calculations done every iteration to create an offset which should be the same difference between each stage on every iteration of the loop. > >> Consider as an example, > >> W = 1644 > >> H = 1164 > >> size = 5.74 MB > >> > >> On my target platform, the rotation takes 1.65 seconds. > >> > >> I'm trying to get this down under half a second. > >> > >> I'm using the GNU tool chain. For some weird reason, we > >> compile everything -O0. The first thing I'll try is crank > >> gcc's optimization level. > > > > You nee to optimise your methods. > > Methods? Do you mean the functions? No the method/algorithm used, adding functions on every iteration of the loop adds overhead for something like this. I think it would have trouble changing the loop style see later. > >> I'm hoping gcc can perform some strength reduction, as the > >> index calculation seems to be taking a non-negligible fraction > >> of the total run-time. > > > > Don't rely on compiler to sort out excessive algorithm overhead. > > > >> Changing the loop to > >> > >> for (i = 0; i< H; ++i) > >> for (j = 0; j< W; ++j) > >> { > >> memcpy(dest+(H*j+H-i-1)*3, src, 3); > >> src += 3; > >> } > >> > >> brings a 5% improvement. > > > > So you have seen that removing some complicated expression > > from every iteration [e.g. W*i+j)*3 ] and replacing with > > a simple addition speeds up operation. > > Strength reduction is typically the kind of task where a compiler > should excel. Won't stand much chance of changing something written in an usuual way. There is not much it could reduce there usefully. > > This should give you a clue. > > > >> I thought changing the memcpy call to 3 explicit assignments > >> might help, but it actually slowed things down. > > > > Because the array indecii are calculated three times (for input > > and output), adding back lots of complicated expressions. > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > { > unsigned char *C = B+(H*j+H-i-1)*3; Still a complex calculation EVERY loop iteration, for any particular sizes being used 'H * j' is equivalent to adding H at the end of each loop iteration, whilst every time this inner loop is started 'H - i - 1' is the same for the duration of the inner loop. Only do actual parts of calculations that really change inside a loop use variables to store partial results that don't change. Also I assume B is the fixed start point of the array, whilst C is the running pointer. You can rewrite '(H*j+H-i-1)*3' as 3*(H*j) + 3*(H-i-1) The first half is equivalent to adding 3 * H every loop iteration and the second half is fixed for duration of inner loop. Consider this which needs testing is off the top of my head // B = base of o/p array unsigned long Hstride, Hoffset; unsigned char *C; C = B; Hstride = (3 * H) - 3; // -3 to offset the three ++ in inner loop Hoffset = 0; // Hoffset coding could be improved for (i = 0; i < H; ++i, C += Hoffset ) { Hoffset = (H-i-1)*3; for (j = 0; j < W; ++j, C += Hstride ) { *C++ = *A++; *C++ = *A++; *C++ = *A++; } } > C[0] = A[0]; > C[1] = A[1]; > C[2] = A[2]; > A += 3; > } > > was slower than > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > { > memcpy(B+(H*j+H-i-1)*3, A, 3); > A += 3; > } > > on my platform. Everytime you use array[ index ] the complier creates an arithmetic pointer manipulation to calculate the actual pointer to the desired location. -- Paul Carpenter | paul(a)pcserviceselectronics.co.uk <http://www.pcserviceselectronics.co.uk/> PC Services <http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font <http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny <http://www.badweb.org.uk/> For those web sites you hate
From: Terje Mathisen "terje.mathisen at on 7 Apr 2010 11:59
Follow-up restored to include a group I read) Noob wrote: > [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] > [ Followup-To: set to comp.programming ] > > Hello, > > I imagine the following problem has been efficiently solved > over a million times in the past. > > I need to rotate a picture clockwise 90 degrees. > > Conceptually, my picture is represented by a matrix > (W columns, H rows) of pixels, with each pixel > represented by 3 octets. > > pixel original_picture[W][H]; > pixel rotated_picture[H][W]; > This is indeed a well-known problem, the key to make it fast is to realize that it is very similar to a transpose operation! I.e. first copy everything to the target array (in a single block move), then transpose it, finally you reverse each of the rows. The difference between clock and counter-clock-wise rotation is in doing the row reverse either before or after the transpose. Try both since I don't remeber which is which! The key here is that the transpose operation is quite cache-friendly, much better than your current naive code. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching" |