From: nmm1 on 11 Apr 2010 05:55

In article <4BBF9F62.6020409(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
>> IBM did memory mapping across CPUs between the POWER3 and POWER4,
>> and got it BADLY wrong. They backed off for the POWER5. And IBM
>> aren't exactly tyros at that game. It's very easy to get wrong.
>> The history of large computers is littered with worthy attempts at
>> using memory banking in clever ways, most of which have failed to
>> a greater or lesser degree.
>
>So, tell us what they did? How?

Until and unless IBM tell all, we peasants will merely have to guess. All we know is that they did what you describe below, but that wasn't the whole story and wasn't the reason for the major problems.

>The standard "usually turns out to be a bad idea" thing here is to
>interleave cache lines across CPUs. (Did the Power3 have a
>CPU-attached memory controller?) Not usually a good idea, unless local
>and remote memory are very close in latency and bandwidth.

That is, I believe, what the POWER4 did. And the details of the memory controller have never been published, nor has why the change was so catastrophic.

Incidentally, it worked very well on the vector systems, and was very common on them. My guess is that it would work very well with modern Fortran codes (i.e. ones using arrays as first-class objects). Heaven help C/C++, but what else is new?

>Not quite so bad, but still often surprisingly painful, is to interleave
>4K pages across CPUs/MCs. The OS can work around this by playing with
>virtual address mappings, but it catches them by surprise.

I would regard that as a bug in the OS design. The contiguous approach can be almost as painful for many shared-memory languages/programs, as almost everything gets allocated into a single bank. And I regard that as a bug in the OS design, too ....

Regards,
Nick Maclaren.
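As a rough illustration of the three placement policies discussed in this post (interleaving cache lines across memory controllers, interleaving 4K pages, and contiguous allocation), the following C sketch maps a physical address to a "home" controller. It does not describe what any POWER system actually did; as the post says, those details were never published. The function names, the 128-byte line size, and the per-node capacity parameter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for illustration only. */
enum { LINE_BYTES = 128, PAGE_BYTES = 4096 };

/* Consecutive cache lines rotate round-robin across controllers. */
unsigned home_line_interleaved(uint64_t paddr, unsigned nodes)
{
    assert(nodes > 0);
    return (unsigned)((paddr / LINE_BYTES) % nodes);
}

/* Consecutive 4K pages rotate round-robin across controllers. */
unsigned home_page_interleaved(uint64_t paddr, unsigned nodes)
{
    assert(nodes > 0);
    return (unsigned)((paddr / PAGE_BYTES) % nodes);
}

/* Each controller owns one contiguous slice of the address space. */
unsigned home_contiguous(uint64_t paddr, uint64_t bytes_per_node, unsigned nodes)
{
    assert(bytes_per_node > 0 && paddr / bytes_per_node < nodes);
    return (unsigned)(paddr / bytes_per_node);
}
```

Under the line-interleaved map, a unit-stride sweep visits every controller in turn, which fits the array-oriented codes mentioned above. Under the contiguous map, every address below `bytes_per_node` lands on controller 0, which is the single-bank pile-up the post complains about.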
From: Brett Davis on 11 Apr 2010 23:24

Reposted from comp.arch.embedded

>> I need to rotate a picture clockwise 90 degrees.
>
> The data sheet states
>
> SH-4 32-bit super-scalar RISC CPU
> o 266 MHz, 2-way set associative 16-Kbyte ICache, 32-Kbyte DCache, MMU
> o 5-stage pipeline, delayed branch support
> o floating point unit, matrix operation support
> o debug port, interrupt controller

The most important number is cache line size, which you missed.
If your image is 1,024 lines tall, that will completely thrash
the cache, resulting in 3 bytes copied per cache line load/spill.

If you copy 16x16 tiles you can get a 10x speedup.

CopyTile16(source, dest, x, y, width, height)...

You can also try 8x8 and 4x4, smaller loops can be faster due
to all the args fitting in memory, and the loop getting unrolled,
but the difference will be ~25% which is hardly worth the time to code.

> for (i = 0; i < H; ++i)
>   for (j = 0; j < W; ++j)
>   {
>     unsigned char *C = B+(H*j+H-i-1)*3;
>     C[0] = A[0];
>     C[1] = A[1];
>     C[2] = A[2];
>     A += 3;
>   }

If you do the following you can get a 2x speedup, it looks like
more code, but will generate less, and the results will be
pipelined correctly.
Extra bonus points to those that understand why. Half the posters here?

  {
    unsigned char *C = B+(H*j+H-i-1)*3;
    temp0 = A[0];
    temp1 = A[1];
    temp2 = A[2];
    C[0] = temp0;
    C[1] = temp1;
    C[2] = temp2;
    A += 3;
  }

Do not use *C++ = *A++;

Brett
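A minimal sketch of the tiling idea, combined with the load-then-store pixel copy from the post. `CopyTile16` is only named above, not defined, so the function below (`rotate90_tiled`), its signature, and the `TILE` constant are hypothetical. The `restrict` qualifiers are an added C99 assumption that the two buffers never overlap. No speedup figure is implied; measure on the target.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 16 };  /* assumed tile edge; tune against the cache line size */

/* Rotate an image of H rows by W columns (3 bytes per pixel) 90 degrees
   clockwise. dst receives W rows of H pixels and must not overlap src. */
void rotate90_tiled(const unsigned char *restrict src,
                    unsigned char *restrict dst,
                    size_t W, size_t H)
{
    assert(src != NULL && dst != NULL);
    for (size_t ti = 0; ti < H; ti += TILE) {
        size_t i_end = ti + TILE < H ? ti + TILE : H;
        for (size_t tj = 0; tj < W; tj += TILE) {
            size_t j_end = tj + TILE < W ? tj + TILE : W;
            /* Within one tile, both source and destination stay in a
               small working set of cache lines. */
            for (size_t i = ti; i < i_end; ++i) {
                const unsigned char *a = src + (i * W + tj) * 3;
                for (size_t j = tj; j < j_end; ++j) {
                    /* Same destination index as the original loop. */
                    unsigned char *c = dst + (H * j + H - i - 1) * 3;
                    /* Issue all loads before any store. */
                    unsigned char t0 = a[0], t1 = a[1], t2 = a[2];
                    c[0] = t0;
                    c[1] = t1;
                    c[2] = t2;
                    a += 3;
                }
            }
        }
    }
}
```

A call such as `rotate90_tiled(A, B, W, H)` is meant as a drop-in for the original double loop, with `A` and `B` as in the quoted code.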
From: Casey Hawthorne on 12 Apr 2010 01:50

If the picture is conceptually represented by a matrix, then partitioning the matrix is the way to go.

You want to be aware of row-major or column-major order, so as to maximize locality of reference, and also of cache line size, pipelining, and multiple cores.

This may have been said before.
--
Regards,
Casey
From: MitchAlsup on 12 Apr 2010 12:26

On Apr 11, 10:24 pm, Brett Davis <gg...(a)yahoo.com> wrote:
> If you do the following you can get a 2x speedup, it looks like
> more code, but will generate less, and the results will be
> pipelined correctly.
> Extra bonus points to those that understand why.

You have explicitly gotten rid of the aliasing issues so the compiler can avoid having to assume aliasing conflicts between C[0] and A[1],...

Mitch
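The aliasing point can be made concrete with a small C sketch. The two helper names below are hypothetical. When the destination overlaps the source, the interleaved form and the loads-first form produce different results, which is why a compiler that cannot rule out overlap has to keep the interleaved loads and stores in program order.

```c
#include <assert.h>

/* Interleaved form: the store to c[0] might modify a[1], so the load
   of a[1] cannot be hoisted above it unless overlap is ruled out. */
void copy_px_interleaved(unsigned char *c, const unsigned char *a)
{
    assert(c && a);
    c[0] = a[0];
    c[1] = a[1];
    c[2] = a[2];
}

/* Loads-first form: all three loads are written before any store, so
   the source order already is the schedule the compiler wants. */
void copy_px_loads_first(unsigned char *c, const unsigned char *a)
{
    assert(c && a);
    unsigned char t0 = a[0], t1 = a[1], t2 = a[2];
    c[0] = t0;
    c[1] = t1;
    c[2] = t2;
}
```

With a four-byte buffer `{1, 2, 3, 4}` and the destination one byte past the source, the interleaved form yields `{1, 1, 1, 1}` while the loads-first form yields `{1, 1, 2, 3}`. Declaring the pointers `restrict` (C99) is another way to give the compiler the same no-overlap promise.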
From: Robert Myers on 12 Apr 2010 20:08

On Apr 11, 11:24 pm, Brett Davis <gg...(a)yahoo.com> wrote:
> Reposted from comp.arch.embedded
>
> >> I need to rotate a picture clockwise 90 degrees.
>
> > The data sheet states
> >
> > SH-4 32-bit super-scalar RISC CPU
> > o 266 MHz, 2-way set associative 16-Kbyte ICache, 32-Kbyte DCache, MMU
> > o 5-stage pipeline, delayed branch support
> > o floating point unit, matrix operation support
> > o debug port, interrupt controller
>
> The most important number is cache line size, which you missed.
> If your image is 1,024 lines tall, that will completely thrash
> the cache, resulting in 3 bytes copied per cache line load/spill.
>
> If you copy 16x16 tiles you can get a 10x speedup.
>
> CopyTile16(source, dest, x, y, width, height)...
>
> You can also try 8x8 and 4x4, smaller loops can be faster due
> to all the args fitting in memory, and the loop getting unrolled,
> but the difference will be ~25% which is hardly worth the time to code.
>
> > for (i = 0; i < H; ++i)
> >   for (j = 0; j < W; ++j)
> >   {
> >     unsigned char *C = B+(H*j+H-i-1)*3;
> >     C[0] = A[0];
> >     C[1] = A[1];
> >     C[2] = A[2];
> >     A += 3;
> >   }
>
> If you do the following you can get a 2x speedup, it looks like
> more code, but will generate less, and the results will be
> pipelined correctly.
> Extra bonus points to those that understand why. Half the posters here?
>
>   {
>     unsigned char *C = B+(H*j+H-i-1)*3;
>     temp0 = A[0];
>     temp1 = A[1];
>     temp2 = A[2];
>     C[0] = temp0;
>     C[1] = temp1;
>     C[2] = temp2;
>     A += 3;
>   }
>
> Do not use *C++ = *A++;
>

Programming hotshots have done so much damage. And they brag about it.

I watched some doing one-upsmanship while the earth was still being created, and I decided I wanted nothing to do with it.

I think I showed good judgment (rare for me).

Robert.