From: Noob on 8 Apr 2010 06:12 Peter Dickerson wrote: > Noob wrote: > >> Ulf Samuelsson wrote: >> >>> Noob wrote: >>> >>>> I imagine the following problem has been efficiently solved >>>> over a million times in the past. >>>> >>>> I need to rotate a picture clockwise 90 degrees. >>> >>> Modern Hardware like the [snip] will do this in a H/W accelerator. >> >> I was specifically asking how to do it in software. >> >> I find this spamvertisement very inappropriate. >> >> Does your employer condone this activity? > > Well, the fact that some devices feel the need to do this in hardware is > relevant. It means that either the processor does have the grunt to do the > job well itself or that there is significant performance to be had from > methods that are not easy possible in software (e.g. dedicated parallel > hardware). If the second then it hints that some cleverness in software will > be needed to do well. As a matter of fact, I was told that the blitter on this platform (STx7109) *does* support image rotation in hardware, but that the capability had not been exposed by the API. Strange, wouldn't you say? > Anyway, there is nothing about USENET that requires replies to you questions > to be for you only. I do understand that. > If the reply is related, and the can reasonably expect > there to be readers for which the reply is relevant, then I don't have a big > problem. Some of this is about motivation. If the motivation of the response > was to shoehorn there product or services into the discussion rather than to > be genuinely helpful then that I surely don't like. Now, Ulf has a history > in this group and I feel that history has been strongly positive though > somewhat focused on Atmel products (since that's what he knows best). I agree that it might be appropriate to mention that some platforms provide hardware support for some operations. What I found rather spammy was the explicit mention of one of his employer's platform. I will concede that I may have over-reacted ;-) Regards.
From: Noob on 8 Apr 2010 08:38 Terje Mathisen wrote: > Noob wrote: > >> I imagine the following problem has been efficiently solved >> over a million times in the past. >> >> I need to rotate a picture clockwise 90 degrees. >> >> Conceptually, my picture is represented by a matrix >> (W columns, H rows) of pixels, with each pixel >> represented by 3 octets. >> >> pixel original_picture[W][H]; >> pixel rotated_picture[H][W]; > > This is indeed a well-known problem, the key to make it fast is to > realize that it is very similar to a transpose operation! > > I.e. first copy everything to the target array (in a single block move), > then transpose it, finally you reverse each of the rows. > > The difference between clock and counter-clock-wise rotation is in doing > the row reverse either before or after the transpose. Try both since I > don't remember which is which! > > The key here is that the transpose operation is quite cache-friendly, > much better than your current naive code. Hello Terje, Your sig is the reason I thought it would be a good idea to include comp.arch in the list :-) Let me take a 5x3 image as an example. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Transpose 0 5 10 1 6 11 2 7 12 3 8 13 4 9 14 Reverse the rows 10 5 0 11 6 1 12 7 2 13 8 3 14 9 4 This is, indeed, equivalent to a 90� clockwise rotation. I don't see why the transpose operation would be more cache-friendly than directly doing the rotation. TRANSPOSE = copy ROW(i) to COL(i) 90�CW ROT = copy ROW(i) to COL(H-1-i) Is it because the column access is incremented rather than decremented? I must be missing something... Could you provide some insight, or some links? Regards.
From: Terje Mathisen "terje.mathisen at on 8 Apr 2010 13:45 Noob wrote: > Let me take a 5x3 image as an example. > > 0 1 2 3 4 > 5 6 7 8 9 > 10 11 12 13 14 > > Transpose > > 0 5 10 > 1 6 11 > 2 7 12 > 3 8 13 > 4 9 14 > > Reverse the rows > > 10 5 0 > 11 6 1 > 12 7 2 > 13 8 3 > 14 9 4 > > This is, indeed, equivalent to a 90� clockwise rotation. > > I don't see why the transpose operation would be more > cache-friendly than directly doing the rotation. > > TRANSPOSE = copy ROW(i) to COL(i) > 90�CW ROT = copy ROW(i) to COL(H-1-i) > > Is it because the column access is incremented rather than > decremented? I must be missing something... > > Could you provide some insight, or some links? The basic (fast) transpose operation works on squares that have power of two sides, this is what allows it to work recursively, and thereby to fit into lower cache levels as soon as possible. The innermost loop of the transpose works with N registers each capable of holding N elements, so you can transpose a 4x4 block of 32-bit values using 4 SSE registers, or an 8x8 block of 16-bit values (as long as you're in 64-bit mode so you have at least 16 regs available). If you actually work with very small blocks, like your 5x3 example, then it really doesn't matter, you can just generate unrolled code to do it directly while copying, in which case the fastest code will be like Andy's suggestion where you gather in data to be able to write full cache lines to the target buffer. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
From: bartc on 8 Apr 2010 21:02 "Noob" <root(a)127.0.0.1> wrote in message news:hphm36$puk$1(a)speranza.aioe.org... >[ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] > [ Followup-To: set to comp.programming ] > > Hello, > > I imagine the following problem has been efficiently solved > over a million times in the past. > > I need to rotate a picture clockwise 90 degrees. > > Conceptually, my picture is represented by a matrix > (W columns, H rows) of pixels, with each pixel > represented by 3 octets. > > pixel original_picture[W][H]; > pixel rotated_picture[H][W]; > > At the implementation level, I'm dealing with 1D integer arrays. > > int src[W*H*3]; > int dest[W*H*3]; > memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); Copying an odd size like 3 bytes might be slowing things down, CPUs prefer sizes such as 16 or 32 bits. Have you tried doing the operation using 32-bit pixels, just for comparison purposes? (And in that case, you can use int (or int32_t), for pixels.) That might show you how fast it could be (remembering 33% more memory accesses are needed). Otherwise, I would go with some inline assembler, perhaps with some manual loop unrolling. -- Bartc
From: Nils on 9 Apr 2010 04:12
Noob wrote: > My actual code is > > for (i = 0; i < H; ++i) > for (j = 0; j < W; ++j) > memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); > > Consider as an example, > W = 1644 > H = 1164 > size = 5.74 MB If your target system has any kind of cache you may get a big performance boost if you do the rotation in 8x8 or 16x16 blocks. The reason is, that every iteration of the for-j loop will hit a different cache line. There aren't that many cache lines, so it is very likely that you see a cache miss for every write. If you do the first 8 or 16 columns first it's much more likely that the cache lines that you have accessed are still in the cache and your memory-accesses go faster. Also: What all others said. Get rid of memcpy and do it on your own. Cheers, Nils |