From: Brett Davis on 16 Apr 2010 01:56 In article <17e7e045-9599-4000-a41c-8885ccfc95d0(a)g30g2000yqc.googlegroups.com>, MitchAlsup <MitchAlsup(a)aol.com> wrote: > On Apr 11, 10:24�pm, Brett Davis <gg...(a)yahoo.com> wrote: > > If you do the following you can get a 2x speedup, it looks like > > more code, but will generate less, and the results will be > > pipelined correctly. > > Extra bonus points to those that understand why. > > You have explicitly gotten rid of the aliasing issues so the compiler > can avoid having to assume aliasing conflicts between C[0] and > A[1],... > > Mitch Excellent start, there are at least two or three other reasons, all of which effect performance, unlike the aliasing issue that will only rear its ugly head when an actual alias violation occurs. Brett
From: Terje Mathisen "terje.mathisen at on 16 Apr 2010 04:23 EricP wrote: > These two could be combined so it would rotate a 4*4 block of 3 byte > elements in a vertical stripe and writing 4 sequential rows at once. > I was thinking it would require too many registers to rotate > the block but I think it could be done with just 6 if you > spill each pending row value as soon as it is ready: > 4 to hold the pending write value for each row, > one for the just loaded value, one for shifting, masking, merging. > Other regs for holding dstPtr, dstTmpPtr, srcPtr, srcTmpPtr, height, etc, > and it looks like it would pretty much fit in 16 registers, > and only load each cache line once. > > That's probably as good as your going to get. This is effectively what I'm suggesting, i.e. do the rotate while copying, trying to minimize the number of load/store operations. Working in 4x4 blocks is a good compromise. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
From: EricP on 16 Apr 2010 19:36 Brett Davis wrote: >>>>> If you do the following you can get a 2x speedup, it looks like >>>>> more code, but will generate less, and the results will be >>>>> pipelined correctly. >>>>> Extra bonus points to those that understand why. Half the posters here? >>>>> >>>>> { >>>>> unsigned char *C = B+(H*j+H-i-1)*3; >>>>> temp0 = A[0]; >>>>> temp1 = A[1]; >>>>> temp2 = A[2]; >>>>> C[0] = temp0; >>>>> C[1] = temp1; >>>>> C[2] = temp2; >>>>> A += 3; >>>>> } >>>>> >>>>> Do not use *C++ = *A++; >>> Only one in a hundred programers know an optimizaton like that, for >>> half of comp.arch to be that good says good things about comp.arch. > > Benchmark it, try and stay in cache as a 66 MHz embedded CPU does > not have memory 600 cycles away. > Only an Intel class chip has enough OoO stuff to have a chance > of coming close to the same speed as my code. > > Brett Oh, you are avoiding read after write data dependency pipeline stalls on in-order cpus. The buffer overlap aliasing considerations in C prevents the compiler from automatically rearranging the original LD ST order to be more pipeline friendly for a particular cpu, but Fortran would allow it. Yeah ok that could be about 2x speed for that kind of cpu. Eric
From: robin on 17 Apr 2010 04:44 "Noob" <root(a)127.0.0.1> wrote in message news:hphm36$puk$1(a)speranza.aioe.org... | I need to rotate a picture clockwise 90 degrees. | | Conceptually, my picture is represented by a matrix | (W columns, H rows) of pixels, with each pixel | represented by 3 octets. | | pixel original_picture[W][H]; | pixel rotated_picture[H][W]; | | At the implementation level, I'm dealing with 1D integer arrays. | | int src[W*H*3]; | int dest[W*H*3]; | | Conceptually, the rotation operation comes down to | | rotated_picture[j][H-1-i] = original_picture[i][j] | | My actual code is | | for (i = 0; i < H; ++i) | for (j = 0; j < W; ++j) | memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); This is not going to be fast because:- 1. you are calling a subroutine 1,913,616 times to move 9 or 48 bits each time. 2. You are multiplying by 3 twice for each move. 3. You are using integer arrays. 1. Best to use ordinary assignment. 2. Eliminate *3 by changing to adds. 3. Use byte arrays (but I'm not clear about what you say. You say that you have octets. Now 3 octets will be 9 bits, not 3 integers equal to 48 bits.) This is how it would look in PL/I:- declare (src(m,n), dest(n,m)) char (3); do i = 1 to m; dest(*, m-i+1) = src(i,*); end; If the color information really is octets, then the declaration is changed to bit (9) aligned.
From: glen herrmannsfeldt on 17 Apr 2010 05:50
In comp.lang.pl1 robin <robin_v(a)bigpond.com> wrote: > "Noob" <root(a)127.0.0.1> wrote in message news:hphm36$puk$1(a)speranza.aioe.org... > | I need to rotate a picture clockwise 90 degrees. > | Conceptually, my picture is represented by a matrix > | (W columns, H rows) of pixels, with each pixel > | represented by 3 octets. > | pixel original_picture[W][H]; > | pixel rotated_picture[H][W]; > | At the implementation level, I'm dealing with 1D integer arrays. > | int src[W*H*3]; > | int dest[W*H*3]; > | Conceptually, the rotation operation comes down to > | rotated_picture[j][H-1-i] = original_picture[i][j] > | My actual code is > | for (i = 0; i < H; ++i) > | for (j = 0; j < W; ++j) > | memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3); This is basically a matrix transpose (with the difference that one goes down instead of up). There are cache friendly matrix transpose algorithms that should also speed this one up. > This is not going to be fast because:- > 1. you are calling a subroutine 1,913,616 times to move > 9 or 48 bits each time. Many C compilers implement memcpy() inline. I suppose you believe that PL/I does a subroutine call for the MOD function? > 2. You are multiplying by 3 twice for each move. Most can also figure that one out. > 3. You are using integer arrays. > 1. Best to use ordinary assignment. > 2. Eliminate *3 by changing to adds. > 3. Use byte arrays (but I'm not clear about what you say. > You say that you have octets. Now 3 octets will be 9 bits, > not 3 integers equal to 48 bits.) -- glen |