From: Skybuck Flying on 8 Apr 2010 01:42 original .------------------->X | | | | | | \|/ Y rotated 90 degrees clockwise y<--------------------. | | | | | | \|/ x new_x <---------------. | | \|/ new_y new_x = (height-1)-y; new_y = x; Source := 0; Dest := (Height-1); Count := 0; for Index := 0 to PixelCount-1 do begin Source := Source + 1; Dest := Dest-1; Count := Count + 1; if Count >= Height then begin Count := 0; Dest := Dest + Height; end; Dest^ := Source^; end; Untested, but that's what I would conceptually try. This should get rid of all the (slower?) multiplications, in return for one single branch, and some fast adds and subs. Plus a memory copy per pixel. (What target you on ? ;)) Bye, Skybuck.
From: nmm1 on 8 Apr 2010 02:57 In article <4ec6f608-1d16-4c97-b47a-7c1d71879142(a)o30g2000yqb.googlegroups.com>, Robert Myers <rbmyersusa(a)gmail.com> wrote: >> >> >> I need to rotate a picture clockwise 90 degrees. >> >> For extra marks, produce a much more cache-friendly version, using >> blocks rather than rows .... >> >> You can even do that in place, and make it efficient. =A0But I suspect >> that you (Terje) know all of this! >> >What would happen if you changed the default storage order for arrays >of dimension two and greater, so that they were always stored in >blocks with some cache or node-friendly order? You could make it >nearly transparent in a language like c++ by defining appropriate >objects. The access time would go up, and some other operations would run a lot more slowly. That technique works, but isn't any use as a general solution. It can help for some applications. TANSTAAFL. Regards, Nick Maclaren.
From: "Andy "Krazy" Glew" on 8 Apr 2010 03:12 On 4/7/2010 8:59 AM, Terje Mathisen wrote: > Follow-up restored to include a group I read) > Noob wrote: >> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] >> [ Followup-To: set to comp.programming ] >> >> Hello, >> >> I imagine the following problem has been efficiently solved >> over a million times in the past. >> >> I need to rotate a picture clockwise 90 degrees. >> >> Conceptually, my picture is represented by a matrix >> (W columns, H rows) of pixels, with each pixel >> represented by 3 octets. >> >> pixel original_picture[W][H]; >> pixel rotated_picture[H][W]; >> > > This is indeed a well-known problem, the key to make it fast is to > realize that it is very similar to a transpose operation! > > I.e. first copy everything to the target array (in a single block move), > then transpose it, finally you reverse each of the rows. > > The difference between clock and counter-clock-wise rotation is in doing > the row reverse either before or after the transpose. Try both since I > don't remeber which is which! > > The key here is that the transpose operation is quite cache-friendly, > much better than your current naive code. I usually defer to you wrt optimization, Terje, but unless you have an especially fast block copy instruction that the user cannot roll himself (not unlike my P6-era fast strings), I doubt that it will be faster to copy and then transpose in place. It is probably better to subdivide the original picture into blocks, and perform the rotation reading a block of the original_picture, and writing the block of the rotated_picture that it maps to. The only real question is the size of the blocks: cache sized? Maybe not: if you have N write combining buffers, there may be a benefit in writing blocks whose vertical height is equal to the number of WC buffers. However, when I have tuned transposes in the past I have usually found it better to write out rows of the destination, reading columns of the origin. This way you don't really depend on the number of WC buffers: you completely fill one WC buffer, and then move to the next. With the secondary issue of what width you read the columns in - e.g. if transposing byte sized elements, you might read 32 or 64 at a time, and then do a register transpose to assemble the blocks to write out. Nowadays, you have to look at 128 bit SSE and 256 bit AVX (or 512 bit LRB registers) as options. (Hmm, on a machine with scatter/gather, you might be tempted to read in a full cache line of the source, and scatter write to the destination. But then you *really* want to make sure that you fit into the WC buffers. Scatter read and single write probably better. But I suspect that an in register transpose is better overall. How efficiently can LRB transpose a set of vector registers? E.g. with 32 bits per element, a full cache line 64B register is 8 elements wide - so how efficiently can you transpose a matrix stored in 8 vector registers? Interestingly, most GPUs allow indirect addressing on their register files. I think that I have seen index registers - i.e. access register[register[i]] - as well as strided access TO THE GPU RF. In such a system, transposing or rotating is just (1) read the data in, (2) write the data out. No shuffling about. Although there is cross lane movement - but then the ALU lanes are not necessarily the same as the memory lanes) In any case, you wat to avoid the unnecessary read-for-ownerships that you might naively get when writing into memory one element of the time. SSE streaming stores.
From: Noob on 8 Apr 2010 04:56 Ulf Samuelsson wrote: > Noob skrev: > >> I imagine the following problem has been efficiently solved >> over a million times in the past. >> >> I need to rotate a picture clockwise 90 degrees. > > Modern Hardware like the [snip] will do this in a H/W accelerator. I was specifically asking how to do it in software. I find this spamvertisement very inappropriate. Does your employer condone this activity?
From: Peter Dickerson on 8 Apr 2010 05:30
"Noob" <root(a)127.0.0.1> wrote in message news:hpk5o7$vtq$1(a)speranza.aioe.org... > Ulf Samuelsson wrote: > >> Noob skrev: >> >>> I imagine the following problem has been efficiently solved >>> over a million times in the past. >>> >>> I need to rotate a picture clockwise 90 degrees. >> >> Modern Hardware like the [snip] will do this in a H/W accelerator. > > I was specifically asking how to do it in software. > > I find this spamvertisement very inappropriate. > > Does your employer condone this activity? Well, the fact that some devices feel the need to do this in hardware is relevent. It means that either the processor does have the grunt to do the job well itself or that there is significant performance to be had from methods that are not easy possible in software (e.g. dedicated parallel hardware). If the second then it hints that some cleverness in software will be needed to do well. Anyway, there is nothing about USENET that requires replies to you questions to be for you only. If the reply is related, and the can reasonably expect there to be readers for which the reply is relevent, then I don't have a big problem. Some of this is about motivation. If the motivation of the response was to shoehorn there product or services into the discussion rather than to be genuinely helpful then that I surely don't like. Now, Ulf has a history in this group and I feel that history has been strongly positive though somewhat focused on Atmel products (since that's what he knows best). Peter |