From: Terje Mathisen "terje.mathisen at on 9 Apr 2010 16:21 Noob wrote: > My target is a 266-MHz SH-4. > > Assuming the vendor's implementation of memcpy is properly optimized, > I think it is safe to assume that it provides a lower bound for the > run-time of the rotation operation. Correct? > > As an example, copying a 5.7-MB picture (1644 x 1164) takes 320 ms > (i.e. 18 MB/s) Both dimensions are divisible by 4, but not by 8, so 4x4 basic blocks is a natural starting point. > > Right now, my naive (strength-reduced, no tiling) C implementation > takes 1100 ms. I think there is still room for improvement. > > One thing you may have overlooked in the advice you've given me is > that the pixel values are packed in 24-bits (RGB, no alpha). Aha! This is crucial, it probably means that you should work with 4x4 pixel blocks, each using 3x4x4 = 48 bytes. 48 bytes would fit in 12 32-bit registers, I don't know the details of the SH-4, but I believe it has at least 16 integer registers, right? Starting from this point I would first try a simple setup where I read in 4x4 pixels in 3x4 registers, then shuffle/shift/combine these into a rotated/transposed version, which I then store to 4 small temporary buffers. Each time these 4 buffers reach 64 bytes each (or 128, whatever the cache line size is), I'd copy them to the target array: This has the big benefit of always writing complete cache lines, while keeping the working buffers small enough that they will stay put in L1 cache. During the load part you'll read 12 bytes from each of 4 different addresses, this probably isn't quite optimal, but you'll at least get some reasonable benefit from burst dram transfers. > > Thus I can't (blindly) deal with 32-bit words. There's some > bit-twiddling to do to merge different parts of pixels, and > then I have to worry about endian-ness. Good times. > >> If you actually work with very small blocks, like your 5x3 example, then >> it really doesn't matter, you can just generate unrolled code to do it >> directly while copying, in which case the fastest code will be like >> Andy's suggestion where you gather in data to be able to write full >> cache lines to the target buffer. > > I was surprised that I got slightly better performance when I iterated > over consecutive bytes of the SOURCE buffer, than when I iterated over > consecutive bytes of the DEST buffer. > > My current code handles slightly over 5 MB/s. It's fast enough for pictures > smaller than, say, 2 MB. But some pictures weigh 6-7 MB. Having the user > wait 1.5 seconds might be acceptable, but barely. Back in the 486-33MHz days I used to say that 10 MB/s was my initial speed target, your SH-4 is probably quite a bit faster than this? OTOH your library block copy speed of 18 MB/s seems quite low. :-( Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
From: "Andy "Krazy" Glew" on 9 Apr 2010 17:42 On 4/9/2010 8:43 AM, nmm1(a)cam.ac.uk wrote: > IBM did memory mapping across CPUS between the POWER3 and POWER4, > and got it BADLY wrong. They backed off for the POWER5. And IBM > aren't exactly tyros at that game. It's very easy to get wrong. > The history of large computers is littered with worthy attempts at > using memory banking in clever ways, most of which have failed to > a greater or lesser degree. So, tell us what they did? The standard usually turns out to be a bad idea thing here is to interleave cache lines across CPUs. (Did the Power3 have a CPU-attached memory controller.) Not usually a good idea, unless local and remote memory are very close in latency and bandwidth. Not quite so bad, but still often surpringly painful, is to interleave 4K pages across CPUs/MCs. The OS can work around this by playing with virtual address mappings, but it catches them by surprise. The best is usually to have several large contiguous chunks per CPU. Let the OS do NUMA placement. Where you can specify what physical address they appear at. Several, because there are occasionally assumptions about physical address ranges. (E.g. the OS assumes CPU0 has exclusive access to physical address [64K,128K), CPU1 to [128K,+64K), etc.) The remappability is just to cope with such assumptions. Which sometimes crop up wrt I/O, especially when you transition across a boundary like 32 to 64 bit. But I still like having the option of being to interleave at finer granularity - cache line, or page - across a subset of memory controllers, for a subset of memory. Because for certain apps it's a win. If the bandwidth of a single channel cannot saturate a CPU.
From: MitchAlsup on 10 Apr 2010 20:13

On Apr 9, 4:42 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net> wrote:
> The standard thing here, which usually turns out to be a bad idea, is
> to interleave cache lines across CPUs.
> Not quite so bad, but still often surprisingly painful, is to
> interleave 4K pages across CPUs/MCs.
> The best is usually to have several large contiguous chunks per CPU.

It turns out that interleaving should be performed at the level of the
DRAM page for maximal effect, with memory controllers that are smart
enough. This is far larger than the CPU page, but if the DRAMs do not
all have the same page size, then the largest page that fits in all
DRAMs is generally an excellent granule.

A DRAM controller is sufficiently smart when it can hold dozens to
hundreds of cache lines waiting to be written while still servicing
demand reads, with some protocol as to when writes actually happen and
some means to access the write buffer on behalf of demand reads.

This strategy commits the fewest DIMMs to the typical large-scale
strip-mine access patterns. {Leaving the rest of the DRAM banks to do
more random activities.}

Mitch
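
[A minimal sketch of the "sufficiently smart" controller behaviour described above: pending line writes are parked in a buffer, demand reads check that buffer first, and writes are drained later under some policy. The structure and sizes are assumptions for illustration, not any real controller's design.]

#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64
#define WBUF_LINES 128      /* "dozens to hundreds" of pending lines */

struct wbuf_entry {
    uint64_t      addr;     /* line-aligned physical address */
    int           valid;
    unsigned char data[LINE_BYTES];
};

static struct wbuf_entry wbuf[WBUF_LINES];

/* Demand read: return buffered write data if the line is still parked
 * here; otherwise the read goes to the DRAM banks. */
static int wbuf_lookup(uint64_t line_addr, unsigned char *out)
{
    int i;
    for (i = 0; i < WBUF_LINES; ++i)
        if (wbuf[i].valid && wbuf[i].addr == line_addr) {
            memcpy(out, wbuf[i].data, LINE_BYTES);
            return 1;
        }
    return 0;
}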
From: MitchAlsup on 10 Apr 2010 20:17

On Apr 7, 5:17 am, Noob <r...(a)127.0.0.1> wrote:
> I need to rotate a picture clockwise 90 degrees.
>
> Conceptually, my picture is represented by a matrix
> (W columns, H rows) of pixels, with each pixel
> represented by 3 octets.
>
> pixel original_picture[W][H];
> pixel rotated_picture[H][W];
>
> At the implementation level, I'm dealing with 1D integer arrays.
>
> int src[W*H*3];
> int dest[W*H*3];

I am left wondering why a level of indirection with no movement might
not be better still.

  for( i=0; i<imax; i+=istep)
    for( j=0; j<jmax; j+=jstep)
      memcopy(blah);

With suitable imax, jmax and suitable istep, jstep the loop works for
either rotated or nonrotated configurations.

Mitch
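
[One way to read "a level of indirection with no movement": never build the rotated copy at all, and have the consumer fetch pixels through an accessor that applies the rotation in the address calculation. The helper name is hypothetical, and this only pays off if the display path can be pointed at such an accessor.]

/* rotated[r][c] == original[H-1-c][r], 3 bytes per pixel */
static inline const unsigned char *
rotated_pixel(const unsigned char *src, int W, int H, int r, int c)
{
    return src + ((H - 1 - c) * W + r) * 3;
}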
From: Casey Hawthorne on 10 Apr 2010 22:52
Must the picture be rotated clockwise 90 degrees? That is, must a
static image rotated clockwise 90 degrees be constructed, or can a
rotation transformation be used to do what you want?

On Wed, 07 Apr 2010 12:17:12 +0200, Noob <root(a)127.0.0.1> wrote:

>[ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
>[ Followup-To: set to comp.programming ]
>
>Hello,
>
>I imagine the following problem has been efficiently solved
>over a million times in the past.
>
>I need to rotate a picture clockwise 90 degrees.
>
>Conceptually, my picture is represented by a matrix
>(W columns, H rows) of pixels, with each pixel
>represented by 3 octets.
>
>pixel original_picture[W][H];
>pixel rotated_picture[H][W];
>
>At the implementation level, I'm dealing with 1D integer arrays.
>
>int src[W*H*3];
>int dest[W*H*3];
>
>Conceptually, the rotation operation comes down to
>
>  rotated_picture[j][H-1-i] = original_picture[i][j]
>
>My actual code is
>
>  for (i = 0; i < H; ++i)
>    for (j = 0; j < W; ++j)
>      memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
>
>Consider as an example,
>W = 1644
>H = 1164
>size = 5.74 MB
>
>On my target platform, the rotation takes 1.65 seconds.
>
>I'm trying to get this down under half a second.
>
>I'm using the GNU tool chain. For some weird reason, we
>compile everything -O0. The first thing I'll try is to crank
>gcc's optimization level.
>
>I'm hoping gcc can perform some strength reduction, as the
>index calculation seems to be taking a non-negligible fraction
>of the total run-time.
>
>Changing the loop to
>
>  for (i = 0; i < H; ++i)
>    for (j = 0; j < W; ++j)
>    {
>      memcpy(dest+(H*j+H-i-1)*3, src, 3);
>      src += 3;
>    }
>
>brings a 5% improvement.
>
>I thought changing the memcpy call to 3 explicit assignments
>might help, but it actually slowed things down.
>
>Perhaps I need to perform some tiling magic... I don't think
>gcc performs automatic tiling?
>
>Comments and insight would be greatly appreciated.
>
>Regards.

--
Regards,
Casey
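
[As a sketch of the "tiling magic" mentioned in the quoted post: block the two loops so each tile's source and destination bytes stay within the data cache. T = 12 happens to divide both 1644 and 1164; other sizes would need a cleanup loop, and the function name is illustrative.]

#include <string.h>

#define T 12   /* tile size in pixels; tune for the SH-4's data cache */

void rotate_cw_tiled(unsigned char *dest, const unsigned char *src, int W, int H)
{
    int ii, jj, i, j;
    for (ii = 0; ii < H; ii += T)
        for (jj = 0; jj < W; jj += T)
            for (i = ii; i < ii + T; ++i)
                for (j = jj; j < jj + T; ++j)
                    memcpy(dest + (H * j + H - 1 - i) * 3,
                           src  + (W * i + j) * 3, 3);
}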