From: bartc on 15 Apr 2010 08:16

"robin" <robin51(a)dodo.com.au> wrote in message
news:4bc66d38$0$78579$c30e37c6(a)exi-reader.telstra.net...
> "Noob" <root(a)127.0.0.1> wrote in message
> news:hphm36$puk$1(a)speranza.aioe.org...
> | I need to rotate a picture clockwise 90 degrees.
> |
> | Conceptually, my picture is represented by a matrix
> | (W columns, H rows) of pixels, with each pixel
> | represented by 3 octets.
> |
> | pixel original_picture[W][H];
> | pixel rotated_picture[H][W];
> |
> | At the implementation level, I'm dealing with 1D integer arrays.
> |
> | Consider as an example,
> | W = 1644
> | H = 1164
> | size = 5.74 MB
>
> Seems to be an arithmetic error here.
> I get 15 Mb for 8-bit pixels, and 45Mb for 3 x 8-bit RGB pixels.

1644 x 1164 x 3 bytes is 5740848 bytes.

-- 
bartc
From: EricP on 15 Apr 2010 15:36

Terje Mathisen wrote:
> EricP wrote:
>> Brett Davis wrote:
>>> Only one in a hundred programmers knows an optimization like that;
>>> for half of comp.arch to be that good says good things about
>>> comp.arch.
>>
>> Ok, well, I'm going to take the risk of seeming a total doofus
>> on a global scale, but I don't see what you are getting at.
>>
>> A[x] and C[y] are each referenced only once, so there is no reason
>> for the compiler to enregister their values, and all other
>> variables are locals or pass-by-value, so there is
>> no reason for aliasing to occur.
>>
>> The SH-4 has byte memory access instructions, so this is just the
>> difference between LD ST LD ST LD ST and LD LD LD ST ST ST.
>> The latter requires 2 extra temp regs, which on x86 causes
>> a var spill onto the stack, but is probably ok on SH-4.
>>
>> So I don't see the 2x speedup here.
>
> The key is that without alias knowledge, it is perfectly possible for
> the two arrays to intersect, in which case it is illegal for the
> compiler to unroll the loads and stores.

Ok, so how would it do better without considering buffer overlap and
unrolling the loop? It still seems to me that there would be the same
number of loads and stores. To really benefit it would have to change
byte loads and stores to dword, as you say below.

Note that in the example MS code I posted, the compiler generated the
following code for the 3-byte memcpy:

; Line 9
	mov	bx, WORD PTR [eax]
	mov	WORD PTR [ecx], bx
	mov	bl, BYTE PTR [eax+2]
	mov	BYTE PTR [ecx+2], bl

which would give different results if the source and dest overlapped,
but is legal because memcpy is defined as not supporting overlapping
operands.

> However, the huge advantage here is to load 12 bytes = 4 pixels into 3
> registers, then do the same for the 3 following lines so that you have
> loaded a full 4x4 pixel block with 12 load instructions instead of 36.
> Next you do shifts & shuffles to rotate the 4x4 block, still ending up
> with 12 registers, before storing them back with 12 writes.
>
> This saves 2/3 of all the load/store operations.
>
> Terje

I wonder about this. Yes, it saves lots of loads and stores, but
unless the SH-4 has byte extract and insert ops (and it doesn't look
like it does) you are going to spend ops doing moves, shifts, ands,
ors. Coupled with the fact that the transpose approach touches each
source cache line once and each dest line twice (once for the
source-to-dest copy-transpose, then again for the row reverse), I
think this would lose to just a straight copy-while-rotating.

I was thinking of just doing the rotate during the copy from source
to dest, but being cache-aware so that each line is loaded just once.

It should start in the lower left corner of srcBuf and copy a vertical
stripe, column by column, from bottom to top, so the output to dstBuf
is serial and contiguous. Having the stores be contiguous might help
if there is store queue merging.

The width of the stripe is 32 elements (96 bytes). It should move 32
(or fewer) of the 3-byte elements, then go back and do the next column
in the stripe.

dst[0,0]  = src[h-1,0]
dst[0,1]  = src[h-2,0]
....
dst[0,31] = src[h-32,0]
dst[1,0]  = src[h-1,1]
dst[1,1]  = src[h-2,1]
....
dst[31,31] = src[h-32,31]

This gives a data cache working set of 33 lines, the writes of dst are
contiguous, and each line is loaded from main memory just once.

Eric
From: EricP on 15 Apr 2010 19:45

EricP wrote:
> Terje Mathisen wrote:
>> However, the huge advantage here is to load 12 bytes = 4 pixels into 3
>> registers, then do the same for the 3 following lines so that you have
>> loaded a full 4x4 pixel block with 12 load instructions instead of 36.
>>
>> Next you do shifts & shuffles to rotate the 4x4 block, still ending up
>> with 12 registers, before storing them back with 12 writes.
>>
>> This saves 2/3 of all the load/store operations.
>>
>> Terje
>
> I wonder about this. Yes, it saves lots of loads and stores, but
> unless the SH-4 has byte extract and insert ops (and it doesn't look
> like it does) you are going to spend ops doing moves, shifts, ands,
> ors. Coupled with the fact that the transpose approach touches each
> source cache line once and each dest line twice (once for the
> source-to-dest copy-transpose, then again for the row reverse), I
> think this would lose to just a straight copy-while-rotating.
>
> I was thinking of just doing the rotate during the copy from source
> to dest, but being cache-aware so that each line is loaded just once.
>
> It should start in the lower left corner of srcBuf and copy a
> vertical stripe, column by column, from bottom to top, so the output
> to dstBuf is serial and contiguous. Having the stores be contiguous
> might help if there is store queue merging.
>
> The width of the stripe is 32 elements (96 bytes). It should move 32
> (or fewer) of the 3-byte elements, then go back and do the next
> column in the stripe.
>
> dst[0,0]  = src[h-1,0]
> dst[0,1]  = src[h-2,0]
> ....
> dst[0,31] = src[h-32,0]
> dst[1,0]  = src[h-1,1]
> dst[1,1]  = src[h-2,1]
> ....
> dst[31,31] = src[h-32,31]
>
> This gives a data cache working set of 33 lines, the writes of dst
> are contiguous, and each line is loaded from main memory just once.
>
> Eric

These two could be combined: rotate a 4x4 block of 3-byte elements
within a vertical stripe, writing 4 sequential rows at once.
I was thinking it would require too many registers to rotate the
block, but it could be done with just 6 if you spill each pending row
value as soon as it is ready: 4 to hold the pending write value for
each row, one for the just-loaded value, and one for shifting,
masking, and merging. The other regs hold dstPtr, dstTmpPtr, srcPtr,
srcTmpPtr, height, etc., so it looks like it would pretty much fit in
16 registers, and only load each cache line once.

That's probably as good as you're going to get.

Eric
From: Brett Davis on 16 Apr 2010 01:23

In article <0cgh97-unc2.ln1(a)ntp.tmsw.no>,
 Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> EricP wrote:
>> So I don't see the 2x speedup here.
>
> The key is that without alias knowledge, it is perfectly possible for
> the two arrays to intersect, in which case it is illegal for the
> compiler to unroll the loads and stores.
>
> However, the huge advantage here is to load 12 bytes = 4 pixels into 3
> registers, then do the same for the 3 following lines so that you have
> loaded a full 4x4 pixel block with 12 load instructions instead of 36.
>
> Next you do shifts & shuffles to rotate the 4x4 block, still ending up
> with 12 registers, before storing them back with 12 writes.
>
> This saves 2/3 of all the load/store operations.
>
> Terje

That is an excellent idea, but you will note that I only suggested
long accesses for the reads, and writing bytes. That is because I
have made that optimization before and ended up with longer, slower
code, much to my dismay I might add. (~5% slower.)

The reason is that almost all CPUs have a write combiner to reduce
memory traffic. Four bytes written in sequence end up as one long
write to the cache or memory. So all those extra shifts and adds just
waste cycles. Plus many of the SH chips have slow shift instructions,
making things even worse.

Brett
From: Brett Davis on 16 Apr 2010 01:49
In article <a5mxn.262830$Dv7.69033(a)newsfe17.iad>,
 EricP <ThatWouldBeTelling(a)thevillage.com> wrote:
> Brett Davis wrote:
> >>> If you do the following you can get a 2x speedup; it looks like
> >>> more code, but will generate less, and the results will be
> >>> pipelined correctly.
> >>> Extra bonus points to those that understand why. Half the posters
> >>> here?
> >>>
> >>> {
> >>>   unsigned char *C = B+(H*j+H-i-1)*3;
> >>>   temp0 = A[0];
> >>>   temp1 = A[1];
> >>>   temp2 = A[2];
> >>>   C[0] = temp0;
> >>>   C[1] = temp1;
> >>>   C[2] = temp2;
> >>>   A += 3;
> >>> }
> >>>
> >>> Do not use *C++ = *A++;
> >
> > Only one in a hundred programmers knows an optimization like that;
> > for half of comp.arch to be that good says good things about
> > comp.arch.
>
> Ok, well, I'm going to take the risk of seeming a total doofus
> on a global scale, but I don't see what you are getting at.
>
> A[x] and C[y] are each referenced only once, so there is no reason
> for the compiler to enregister their values, and all other
> variables are locals or pass-by-value, so there is
> no reason for aliasing to occur.
>
> So I don't see the 2x speedup here.
>
> Eric

Benchmark it, and try to stay in cache, as a 66 MHz embedded CPU does
not have memory 600 cycles away. Only an Intel-class chip has enough
OoO machinery to have a chance of coming close to the same speed as
my code.

Brett