From: Skybuck Flying on 8 Apr 2010 01:42

original

.------------------->X
|
|
|
|
|
|
\|/
Y

rotated 90 degrees clockwise

y<--------------------.
                      |
                      |
                      |
                      |
                      |
                      |
                     \|/
                      x

new_x <---------------.
                      |
                      |
                     \|/
                   new_y

new_x = (height-1)-y;
new_y = x;

Source := 0;
Dest := (Height-1);
Count := 0;
for Index := 0 to PixelCount-1 do
begin
  Source := Source + 1;
  Dest := Dest - 1;
  Count := Count + 1;
  if Count >= Height then
  begin
    Count := 0;
    Dest := Dest + Height;
  end;
  Dest^ := Source^;
end;

Untested, but that's what I would conceptually try.

This should get rid of all the (slower?) multiplications, in return for
one single branch, and some fast adds and subs. Plus a memory copy per
pixel.

(What target are you on? ;))

Bye,
Skybuck.
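A minimal C sketch of the same strength-reduction idea (not from the thread; the function and variable names are mine, and the per-row correction term differs from the untested snippet above): walk the source linearly, step the destination index by Height per pixel, and fall back by PixelCount+1 once per source row, so the inner loop has no multiplications.

#include <stddef.h>
#include <stdint.h>

typedef uint32_t pixel_t;             /* placeholder; the thread's pixels are 3 octets */

/* Rotate a width x height image 90 degrees clockwise into dst (height x width).
 * Destination index of source pixel (x, y) is x*height + (height-1-y); the loop
 * maintains it incrementally, so only adds, subs and one branch per pixel remain. */
static void rotate90_cw(const pixel_t *src, pixel_t *dst,
                        size_t width, size_t height)
{
    const size_t pixel_count = width * height;
    size_t count = 0;                 /* position within the current source row */
    pixel_t *d = dst + (height - 1);  /* destination of source pixel (0, 0) */

    for (size_t i = 0; i < pixel_count; i++) {
        *d = *src++;
        d += height;                  /* next pixel of this source row: one column right */
        if (++count == width) {       /* end of a source row */
            count = 0;
            d -= pixel_count + 1;     /* back to the top, one destination column left */
        }
    }
}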
From: nmm1 on 8 Apr 2010 02:57

In article <4ec6f608-1d16-4c97-b47a-7c1d71879142(a)o30g2000yqb.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>>> I need to rotate a picture clockwise 90 degrees.
>>
>> For extra marks, produce a much more cache-friendly version, using
>> blocks rather than rows ....
>>
>> You can even do that in place, and make it efficient. But I suspect
>> that you (Terje) know all of this!
>>
>What would happen if you changed the default storage order for arrays
>of dimension two and greater, so that they were always stored in
>blocks with some cache or node-friendly order? You could make it
>nearly transparent in a language like c++ by defining appropriate
>objects.

The access time would go up, and some other operations would run a lot
more slowly. That technique works, but isn't any use as a general
solution. It can help for some applications. TANSTAAFL.

Regards,
Nick Maclaren.
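For what it's worth, here is a hedged C sketch (mine, not from the thread) of what a blocked default storage order could look like for a 2-D array: elements live in B x B tiles, each tile contiguous in memory. It also shows where the cost Nick mentions comes from: every plain element access pays for the tile arithmetic.

#include <stddef.h>

/* Hypothetical blocked ("tiled") storage order for a 2-D array: the array is a
 * grid of B x B tiles, each tile stored contiguously.  Width is assumed to be a
 * multiple of B to keep the sketch short; B is a made-up tuning knob. */
#define B 8

static size_t blocked_index(size_t x, size_t y, size_t width)
{
    size_t tile_x = x / B, in_x = x % B;
    size_t tile_y = y / B, in_y = y % B;
    size_t tiles_per_row = width / B;
    /* Tiles in row-major order, elements row-major within each tile. */
    return (tile_y * tiles_per_row + tile_x) * (B * B) + in_y * B + in_x;
}

A rotation or transpose then moves whole tiles between whole tiles, but ordinary row or column walks pay the extra divisions and modulos (shifts and masks when B is a power of two) on every access, which is the trade-off being pointed at.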
From: "Andy "Krazy" Glew" on 8 Apr 2010 03:12

On 4/7/2010 8:59 AM, Terje Mathisen wrote:
> (Follow-up restored to include a group I read)
> Noob wrote:
>> [ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
>> [ Followup-To: set to comp.programming ]
>>
>> Hello,
>>
>> I imagine the following problem has been efficiently solved
>> over a million times in the past.
>>
>> I need to rotate a picture clockwise 90 degrees.
>>
>> Conceptually, my picture is represented by a matrix
>> (W columns, H rows) of pixels, with each pixel
>> represented by 3 octets.
>>
>> pixel original_picture[W][H];
>> pixel rotated_picture[H][W];
>>
>
> This is indeed a well-known problem, the key to make it fast is to
> realize that it is very similar to a transpose operation!
>
> I.e. first copy everything to the target array (in a single block move),
> then transpose it, finally you reverse each of the rows.
>
> The difference between clock and counter-clock-wise rotation is in doing
> the row reverse either before or after the transpose. Try both since I
> don't remember which is which!
>
> The key here is that the transpose operation is quite cache-friendly,
> much better than your current naive code.

I usually defer to you wrt optimization, Terje, but unless you have an
especially fast block copy instruction that the user cannot roll himself
(not unlike my P6-era fast strings), I doubt that it will be faster to
copy and then transpose in place.

It is probably better to subdivide the original picture into blocks, and
perform the rotation reading a block of the original_picture, and writing
the block of the rotated_picture that it maps to.

The only real question is the size of the blocks: cache sized? Maybe not:
if you have N write combining buffers, there may be a benefit in writing
blocks whose vertical height is equal to the number of WC buffers.

However, when I have tuned transposes in the past I have usually found it
better to write out rows of the destination, reading columns of the
origin. This way you don't really depend on the number of WC buffers: you
completely fill one WC buffer, and then move to the next. With the
secondary issue of what width you read the columns in - e.g. if
transposing byte sized elements, you might read 32 or 64 at a time, and
then do a register transpose to assemble the blocks to write out.

Nowadays, you have to look at 128 bit SSE and 256 bit AVX (or 512 bit LRB
registers) as options.

(Hmm, on a machine with scatter/gather, you might be tempted to read in a
full cache line of the source, and scatter write to the destination. But
then you *really* want to make sure that you fit into the WC buffers.
Scatter read and single write probably better. But I suspect that an
in-register transpose is better overall.

How efficiently can LRB transpose a set of vector registers? E.g. with 32
bits per element, a full cache line 64B register is 8 elements wide - so
how efficiently can you transpose a matrix stored in 8 vector registers?

Interestingly, most GPUs allow indirect addressing on their register
files. I think that I have seen index registers - i.e. access
register[register[i]] - as well as strided access TO THE GPU RF. In such
a system, transposing or rotating is just (1) read the data in, (2) write
the data out. No shuffling about. Although there is cross-lane movement -
but then the ALU lanes are not necessarily the same as the memory lanes.)

In any case, you want to avoid the unnecessary read-for-ownerships that
you might naively get when writing into memory one element at a time.
SSE streaming stores help with this.
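A hedged C sketch (mine, not Andy's code) of the tiled variant he describes: read an 8x8 block of the source and write the rotated block of the destination, so that the stores within each destination row segment are consecutive.

#include <stddef.h>
#include <stdint.h>

typedef uint32_t pixel_t;   /* placeholder; the thread's pixels are 3 octets */

#define TILE 8              /* made-up tuning knob; see the WC-buffer discussion above */

/* src is w x h row-major; dst is the h x w image rotated 90 degrees clockwise,
 * i.e. dst[x][h-1-y] = src[y][x].  Each destination tile is written row by row,
 * so consecutive stores fill one cache line / write-combining buffer at a time,
 * while the strided reads stay inside a small source tile. */
static void rotate90_cw_tiled(const pixel_t *src, pixel_t *dst,
                              size_t w, size_t h)
{
    for (size_t y0 = 0; y0 < h; y0 += TILE) {
        for (size_t x0 = 0; x0 < w; x0 += TILE) {
            size_t y1 = (y0 + TILE < h) ? y0 + TILE : h;
            size_t x1 = (x0 + TILE < w) ? x0 + TILE : w;
            for (size_t x = x0; x < x1; x++) {            /* one destination row segment */
                pixel_t *out = dst + x * h + (h - y1);    /* = x*h + (h-1-(y1-1)) */
                for (size_t y = y1; y-- > y0; )           /* source column, bottom-up */
                    *out++ = src[y * w + x];
            }
        }
    }
}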
From: Noob on 8 Apr 2010 08:38

Terje Mathisen wrote:
> Noob wrote:
>
>> I imagine the following problem has been efficiently solved
>> over a million times in the past.
>>
>> I need to rotate a picture clockwise 90 degrees.
>>
>> Conceptually, my picture is represented by a matrix
>> (W columns, H rows) of pixels, with each pixel
>> represented by 3 octets.
>>
>> pixel original_picture[W][H];
>> pixel rotated_picture[H][W];
>
> This is indeed a well-known problem, the key to make it fast is to
> realize that it is very similar to a transpose operation!
>
> I.e. first copy everything to the target array (in a single block move),
> then transpose it, finally you reverse each of the rows.
>
> The difference between clock and counter-clock-wise rotation is in doing
> the row reverse either before or after the transpose. Try both since I
> don't remember which is which!
>
> The key here is that the transpose operation is quite cache-friendly,
> much better than your current naive code.

Hello Terje,

Your sig is the reason I thought it would be a good idea to include
comp.arch in the list :-)

Let me take a 5x3 image as an example.

 0  1  2  3  4
 5  6  7  8  9
10 11 12 13 14

Transpose

0  5 10
1  6 11
2  7 12
3  8 13
4  9 14

Reverse the rows

10 5 0
11 6 1
12 7 2
13 8 3
14 9 4

This is, indeed, equivalent to a 90° clockwise rotation.

I don't see why the transpose operation would be more
cache-friendly than directly doing the rotation.

TRANSPOSE  = copy ROW(i) to COL(i)
90° CW ROT = copy ROW(i) to COL(H-1-i)

Is it because the column access is incremented rather than
decremented? I must be missing something...

Could you provide some insight, or some links?

Regards.
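Spelling out the two copies being compared as naive C loops (my sketch, not from the thread) makes the point of the question concrete: written element by element, both read a source row sequentially and scatter it down a destination column, so their access patterns are essentially the same.

#include <stddef.h>
#include <stdint.h>

typedef uint32_t pixel_t;   /* placeholder; the real pixels are 3 octets */

/* TRANSPOSE: copy ROW(y) to COL(y); src is W x H row-major, dst is H wide. */
static void transpose_naive(const pixel_t *src, pixel_t *dst, size_t W, size_t H)
{
    for (size_t y = 0; y < H; y++)
        for (size_t x = 0; x < W; x++)
            dst[x * H + y] = src[y * W + x];
}

/* 90-degree CW rotation: copy ROW(y) to COL(H-1-y). */
static void rotate_naive(const pixel_t *src, pixel_t *dst, size_t W, size_t H)
{
    for (size_t y = 0; y < H; y++)
        for (size_t x = 0; x < W; x++)
            dst[x * H + (H - 1 - y)] = src[y * W + x];
}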
From: Terje Mathisen <terje.mathisen at tmsw.no> on 8 Apr 2010 13:45
Noob wrote:
> Let me take a 5x3 image as an example.
>
>  0  1  2  3  4
>  5  6  7  8  9
> 10 11 12 13 14
>
> Transpose
>
> 0  5 10
> 1  6 11
> 2  7 12
> 3  8 13
> 4  9 14
>
> Reverse the rows
>
> 10 5 0
> 11 6 1
> 12 7 2
> 13 8 3
> 14 9 4
>
> This is, indeed, equivalent to a 90° clockwise rotation.
>
> I don't see why the transpose operation would be more
> cache-friendly than directly doing the rotation.
>
> TRANSPOSE  = copy ROW(i) to COL(i)
> 90° CW ROT = copy ROW(i) to COL(H-1-i)
>
> Is it because the column access is incremented rather than
> decremented? I must be missing something...
>
> Could you provide some insight, or some links?

The basic (fast) transpose operation works on squares that have power of
two sides, this is what allows it to work recursively, and thereby to fit
into lower cache levels as soon as possible.

The innermost loop of the transpose works with N registers each capable
of holding N elements, so you can transpose a 4x4 block of 32-bit values
using 4 SSE registers, or an 8x8 block of 16-bit values (as long as
you're in 64-bit mode so you have at least 16 regs available).

If you actually work with very small blocks, like your 5x3 example, then
it really doesn't matter, you can just generate unrolled code to do it
directly while copying, in which case the fastest code will be like
Andy's suggestion where you gather in data to be able to write full cache
lines to the target buffer.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
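For readers who want to see the innermost step Terje mentions, here is a hedged sketch of a 4x4 transpose of 32-bit elements held in four SSE registers, using SSE2 integer unpacks (the function name, strides and unaligned accesses are my own choices, not his code):

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Transpose one 4x4 tile of 32-bit elements.  src/dst point at the top-left of
 * the tile; strides are row lengths in elements. */
static void transpose4x4_u32(const uint32_t *src, size_t src_stride,
                             uint32_t *dst, size_t dst_stride)
{
    __m128i r0 = _mm_loadu_si128((const __m128i *)(src + 0 * src_stride));
    __m128i r1 = _mm_loadu_si128((const __m128i *)(src + 1 * src_stride));
    __m128i r2 = _mm_loadu_si128((const __m128i *)(src + 2 * src_stride));
    __m128i r3 = _mm_loadu_si128((const __m128i *)(src + 3 * src_stride));

    /* Interleave 32-bit lanes, then 64-bit lanes: the classic two-step transpose. */
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);   /* a0 b0 a1 b1 */
    __m128i t1 = _mm_unpackhi_epi32(r0, r1);   /* a2 b2 a3 b3 */
    __m128i t2 = _mm_unpacklo_epi32(r2, r3);   /* c0 d0 c1 d1 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);   /* c2 d2 c3 d3 */

    __m128i o0 = _mm_unpacklo_epi64(t0, t2);   /* a0 b0 c0 d0 */
    __m128i o1 = _mm_unpackhi_epi64(t0, t2);   /* a1 b1 c1 d1 */
    __m128i o2 = _mm_unpacklo_epi64(t1, t3);   /* a2 b2 c2 d2 */
    __m128i o3 = _mm_unpackhi_epi64(t1, t3);   /* a3 b3 c3 d3 */

    _mm_storeu_si128((__m128i *)(dst + 0 * dst_stride), o0);
    _mm_storeu_si128((__m128i *)(dst + 1 * dst_stride), o1);
    _mm_storeu_si128((__m128i *)(dst + 2 * dst_stride), o2);
    _mm_storeu_si128((__m128i *)(dst + 3 * dst_stride), o3);
}

For the clockwise rotation you would combine this per-tile transpose with the row reversal described above, for instance by storing each transposed tile at the mirrored column position.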