From: Ulf Samuelsson on 7 Apr 2010 20:56 Noob skrev: > [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] > [ Followup-To: set to comp.programming ] > > Hello, > > I imagine the following problem has been efficiently solved > over a million times in the past. > > I need to rotate a picture clockwise 90 degrees. Modern Hardware like the AT91SAM9M10 will do this in a H/W accelerator. Best Regards Ulf Samuelsson > > Conceptually, my picture is represented by a matrix > (W columns, H rows) of pixels, with each pixel > represented by 3 octets. > > pixel original_picture[W][H]; > pixel rotated_picture[H][W]; > > At the implementation level, I'm dealing with 1D integer arrays. > > int src[W*H*3]; > int dest[W*H*3]; > > Conceptually, the rotation operation comes down to > > rotated_picture[j][H-1-i] = original_picture[i][j] > > My actual code is > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); > > Consider as an example, > W = 1644 > H = 1164 > size = 5.74 MB > > On my target platform, the rotation takes 1.65 seconds. > > I'm trying to get this down under half a second. > > I'm using the GNU tool chain. For some weird reason, we > compile everything -O0. The first thing I'll try is crank > gcc's optimization level. > > I'm hoping gcc can perform some strength reduction, as the > index calculation seems to be taking a non-negligible fraction > of the total run-time. > > Changing the loop to > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > { > memcpy(dest+(H*j+H-i-1)*3, src, 3); > src += 3; > } > > brings a 5% improvement. > > I thought changing the memcpy call to 3 explicit assignments > might help, but it actually slowed things down. > > Perhaps I need to perform some tiling magic... I don't think > gcc performs automatic tiling? > > Comments and insight would be greatly appreciated. > > Regards.
From: Skybuck Flying on 8 Apr 2010 01:42 original .------------------->X | | | | | | \|/ Y rotated 90 degrees clockwise y<--------------------. | | | | | | \|/ x new_x <---------------. | | \|/ new_y new_x = (height-1)-y; new_y = x; Source := 0; Dest := (Height-1); Count := 0; for Index := 0 to PixelCount-1 do begin Source := Source + 1; Dest := Dest-1; Count := Count + 1; if Count >= Height then begin Count := 0; Dest := Dest + Height; end; Dest^ := Source^; end; Untested, but that's what I would conceptually try. This should get rid of all the (slower?) multiplications, in return for one single branch, and some fast adds and subs. Plus a memory copy per pixel. (What target you on ? ;)) Bye, Skybuck.
From: Noob on 8 Apr 2010 04:56 Ulf Samuelsson wrote: > Noob skrev: > >> I imagine the following problem has been efficiently solved >> over a million times in the past. >> >> I need to rotate a picture clockwise 90 degrees. > > Modern Hardware like the [snip] will do this in a H/W accelerator. I was specifically asking how to do it in software. I find this spamvertisement very inappropriate. Does your employer condone this activity?
From: Peter Dickerson on 8 Apr 2010 05:30 "Noob" <root(a)127.0.0.1> wrote in message news:hpk5o7$vtq$1(a)speranza.aioe.org... > Ulf Samuelsson wrote: > >> Noob skrev: >> >>> I imagine the following problem has been efficiently solved >>> over a million times in the past. >>> >>> I need to rotate a picture clockwise 90 degrees. >> >> Modern Hardware like the [snip] will do this in a H/W accelerator. > > I was specifically asking how to do it in software. > > I find this spamvertisement very inappropriate. > > Does your employer condone this activity? Well, the fact that some devices feel the need to do this in hardware is relevent. It means that either the processor does have the grunt to do the job well itself or that there is significant performance to be had from methods that are not easy possible in software (e.g. dedicated parallel hardware). If the second then it hints that some cleverness in software will be needed to do well. Anyway, there is nothing about USENET that requires replies to you questions to be for you only. If the reply is related, and the can reasonably expect there to be readers for which the reply is relevent, then I don't have a big problem. Some of this is about motivation. If the motivation of the response was to shoehorn there product or services into the discussion rather than to be genuinely helpful then that I surely don't like. Now, Ulf has a history in this group and I feel that history has been strongly positive though somewhat focused on Atmel products (since that's what he knows best). Peter
From: Noob on 8 Apr 2010 05:43
D Yuniskis wrote: > Noob wrote: > >> I need to rotate a picture clockwise 90 degrees. > > Are you sure you will always want to rotate by some > multiple of 90 deg? > > Are you sure you will *always* want to rotate CW by > exactly 90 deg? I'm displaying JPEG photographs. Some users want to rotate the pictures (landscape to portrait). AFAICS, all I need is 90� either way. >> Conceptually, my picture is represented by a matrix >> (W columns, H rows) of pixels, with each pixel >> represented by 3 octets. >> >> pixel original_picture[W][H]; >> pixel rotated_picture[H][W]; >> >> At the implementation level, I'm dealing with 1D integer arrays. >> >> int src[W*H*3]; >> int dest[W*H*3]; > > Your original data definition was cleaner. Probably. And I also added a bug, because I meant uint8_t src[W*H*3]; uint8_t dest[W*H*3]; > Why are you taking the memcpy function invocation hit here? > You're just moving "3 bytes (octets)". Right. > Create two pointers: > > pixel *input, *output; > > Initialize the input pointer to "some corner" (top left is > convenient). Initialize the output pointer to the corner > that is 90 degrees CW from this point. > > Move the input pointer across a row (or down a column -- depending > on which corner you chose as a starting point) and move the output > pointer down a column (or across a row, etc.). > > Then, just copy from *input to *output. I've done this, and it is still too slow for my taste :-( It brings the run-time from 1.6 seconds down to 1.1 seconds (for my 1644 x 1164 example). > Of course, using a three byte type makes this cumbersome. > I assume you don't want to pack those three bytes into a > long int (time/space efficiency concerns). So, you'll have > to break it down to a set of three "byte operations". I didn't see how to ask libjpeg to output 32-bit pixel values instead of 24-bit RGB values. > It might help if you told us which processor you are targeting. The data sheet states SH-4 32-bit super-scalar RISC CPU o 266 MHz, 2-way set associative 16-Kbyte ICache, 32-Kbyte DCache, MMU o 5-stage pipeline, delayed branch support o floating point unit, matrix operation support o debug port, interrupt controller Regards. |