From: Terje Mathisen "terje.mathisen at on 9 Apr 2010 16:21 Noob wrote: > My target is a 266-MHz SH-4. > > Assuming the vendor's implementation of memcpy is properly optimized, > I think it is safe to assume that it provides a lower bound for the > run-time of the rotation operation. Correct? > > As an example, copying a 5.7-MB picture (1644 x 1164) takes 320 ms > (i.e. 18 MB/s) Both dimensions are divisible by 4, but not by 8, so 4x4 basic blocks is a natural starting point. > > Right now, my naive (strength-reduced, no tiling) C implementation > takes 1100 ms. I think there is still room for improvement. > > One thing you may have overlooked in the advice you've given me is > that the pixel values are packed in 24-bits (RGB, no alpha). Aha! This is crucial, it probably means that you should work with 4x4 pixel blocks, each using 3x4x4 = 48 bytes. 48 bytes would fit in 12 32-bit registers, I don't know the details of the SH-4, but I believe it has at least 16 integer registers, right? Starting from this point I would first try a simple setup where I read in 4x4 pixels in 3x4 registers, then shuffle/shift/combine these into a rotated/transposed version, which I then store to 4 small temporary buffers. Each time these 4 buffers reach 64 bytes each (or 128, whatever the cache line size is), I'd copy them to the target array: This has the big benefit of always writing complete cache lines, while keeping the working buffers small enough that they will stay put in L1 cache. During the load part you'll read 12 bytes from each of 4 different addresses, this probably isn't quite optimal, but you'll at least get some reasonable benefit from burst dram transfers. > > Thus I can't (blindly) deal with 32-bit words. There's some > bit-twiddling to do to merge different parts of pixels, and > then I have to worry about endian-ness. Good times. > >> If you actually work with very small blocks, like your 5x3 example, then >> it really doesn't matter, you can just generate unrolled code to do it >> directly while copying, in which case the fastest code will be like >> Andy's suggestion where you gather in data to be able to write full >> cache lines to the target buffer. > > I was surprised that I got slightly better performance when I iterated > over consecutive bytes of the SOURCE buffer, than when I iterated over > consecutive bytes of the DEST buffer. > > My current code handles slightly over 5 MB/s. It's fast enough for pictures > smaller than, say, 2 MB. But some pictures weigh 6-7 MB. Having the user > wait 1.5 seconds might be acceptable, but barely. Back in the 486-33MHz days I used to say that 10 MB/s was my initial speed target, your SH-4 is probably quite a bit faster than this? OTOH your library block copy speed of 18 MB/s seems quite low. :-( Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
From: "Andy "Krazy" Glew" on 9 Apr 2010 17:42 On 4/9/2010 8:43 AM, nmm1(a)cam.ac.uk wrote: > IBM did memory mapping across CPUS between the POWER3 and POWER4, > and got it BADLY wrong. They backed off for the POWER5. And IBM > aren't exactly tyros at that game. It's very easy to get wrong. > The history of large computers is littered with worthy attempts at > using memory banking in clever ways, most of which have failed to > a greater or lesser degree. So, tell us what they did? The standard usually turns out to be a bad idea thing here is to interleave cache lines across CPUs. (Did the Power3 have a CPU-attached memory controller.) Not usually a good idea, unless local and remote memory are very close in latency and bandwidth. Not quite so bad, but still often surpringly painful, is to interleave 4K pages across CPUs/MCs. The OS can work around this by playing with virtual address mappings, but it catches them by surprise. The best is usually to have several large contiguous chunks per CPU. Let the OS do NUMA placement. Where you can specify what physical address they appear at. Several, because there are occasionally assumptions about physical address ranges. (E.g. the OS assumes CPU0 has exclusive access to physical address [64K,128K), CPU1 to [128K,+64K), etc.) The remappability is just to cope with such assumptions. Which sometimes crop up wrt I/O, especially when you transition across a boundary like 32 to 64 bit. But I still like having the option of being to interleave at finer granularity - cache line, or page - across a subset of memory controllers, for a subset of memory. Because for certain apps it's a win. If the bandwidth of a single channel cannot saturate a CPU.
From: MitchAlsup on 10 Apr 2010 20:13

On Apr 9, 4:42 pm, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net> wrote:
> The standard thing here, which usually turns out to be a bad idea, is
> to interleave cache lines across CPUs.
> Not quite so bad, but still often surprisingly painful, is to
> interleave 4K pages across CPUs/MCs.
> The best is usually to have several large contiguous chunks per CPU.

It turns out that interleaving should be performed at the level of the
DRAM page for maximal effect, with memory controllers that are smart
enough. This is far larger than the CPU page, but if the DRAMs do not
all have the same page size, then the largest page that fits in all
DRAMs is generally an excellent granule.

A DRAM controller is sufficiently smart when it can hold dozens to
hundreds of cache lines waiting to be written while still servicing
demand reads, with some protocol as to when writes actually happen and
some means to access the write buffer on behalf of demand reads.

This strategy commits the fewest DIMMs to the typical large-scale
strip-mine access patterns. {Leaving the rest of the DRAM banks to do
more random activities.}

Mitch
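
[A minimal sketch of the "sufficiently smart" controller behaviour described above: pending line writes are parked in a buffer, demand reads check that buffer first, and writes are drained later under some policy. The structure and sizes are assumptions for illustration, not any real controller's design.]

#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64
#define WBUF_LINES 128      /* "dozens to hundreds" of pending lines */

struct wbuf_entry {
    uint64_t      addr;     /* line-aligned physical address */
    int           valid;
    unsigned char data[LINE_BYTES];
};

static struct wbuf_entry wbuf[WBUF_LINES];

/* Demand read: return buffered write data if the line is still parked
 * here; otherwise the read goes to the DRAM banks. */
static int wbuf_lookup(uint64_t line_addr, unsigned char *out)
{
    int i;
    for (i = 0; i < WBUF_LINES; ++i)
        if (wbuf[i].valid && wbuf[i].addr == line_addr) {
            memcpy(out, wbuf[i].data, LINE_BYTES);
            return 1;
        }
    return 0;
}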
From: MitchAlsup on 10 Apr 2010 20:17

On Apr 7, 5:17 am, Noob <r...(a)127.0.0.1> wrote:
> I need to rotate a picture clockwise 90 degrees.
>
> Conceptually, my picture is represented by a matrix
> (W columns, H rows) of pixels, with each pixel
> represented by 3 octets.
>
> pixel original_picture[W][H];
> pixel rotated_picture[H][W];
>
> At the implementation level, I'm dealing with 1D integer arrays.
>
> int src[W*H*3];
> int dest[W*H*3];

I am left wondering why a level of indirection with no movement might
not be better still.

  for( i=0; i<imax; i+=istep)
    for( j=0; j<jmax; j+=jstep)
      memcopy(blah);

With suitable imax, jmax and suitable istep, jstep the loop works for
either rotated or nonrotated configurations.

Mitch
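
[One way to read "a level of indirection with no movement": never build the rotated copy at all, and have the consumer fetch pixels through an accessor that applies the rotation in the address calculation. The helper name is hypothetical, and this only pays off if the display path can be pointed at such an accessor.]

/* rotated[r][c] == original[H-1-c][r], 3 bytes per pixel */
static inline const unsigned char *
rotated_pixel(const unsigned char *src, int W, int H, int r, int c)
{
    return src + ((H - 1 - c) * W + r) * 3;
}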
From: Casey Hawthorne on 10 Apr 2010 22:52
Must the picture be rotated clockwise 90 degrees? That is, must a
static image rotated clockwise 90 degrees be constructed, or can a
rotation transformation be used to do what you want?

On Wed, 07 Apr 2010 12:17:12 +0200, Noob <root(a)127.0.0.1> wrote:

>[ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ]
>[ Followup-To: set to comp.programming ]
>
>Hello,
>
>I imagine the following problem has been efficiently solved
>over a million times in the past.
>
>I need to rotate a picture clockwise 90 degrees.
>
>Conceptually, my picture is represented by a matrix
>(W columns, H rows) of pixels, with each pixel
>represented by 3 octets.
>
>pixel original_picture[W][H];
>pixel rotated_picture[H][W];
>
>At the implementation level, I'm dealing with 1D integer arrays.
>
>int src[W*H*3];
>int dest[W*H*3];
>
>Conceptually, the rotation operation comes down to
>
>  rotated_picture[j][H-1-i] = original_picture[i][j]
>
>My actual code is
>
>  for (i = 0; i < H; ++i)
>    for (j = 0; j < W; ++j)
>      memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
>
>Consider as an example,
>W = 1644
>H = 1164
>size = 5.74 MB
>
>On my target platform, the rotation takes 1.65 seconds.
>
>I'm trying to get this down under half a second.
>
>I'm using the GNU tool chain. For some weird reason, we
>compile everything -O0. The first thing I'll try is to crank
>gcc's optimization level.
>
>I'm hoping gcc can perform some strength reduction, as the
>index calculation seems to be taking a non-negligible fraction
>of the total run-time.
>
>Changing the loop to
>
>  for (i = 0; i < H; ++i)
>    for (j = 0; j < W; ++j)
>    {
>      memcpy(dest+(H*j+H-i-1)*3, src, 3);
>      src += 3;
>    }
>
>brings a 5% improvement.
>
>I thought changing the memcpy call to 3 explicit assignments
>might help, but it actually slowed things down.
>
>Perhaps I need to perform some tiling magic... I don't think
>gcc performs automatic tiling?
>
>Comments and insight would be greatly appreciated.
>
>Regards.

--
Regards,
Casey
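
[As a sketch of the "tiling magic" mentioned in the quoted post: block the two loops so each tile's source and destination bytes stay within the data cache. T = 12 happens to divide both 1644 and 1164; other sizes would need a cleanup loop, and the function name is illustrative.]

#include <string.h>

#define T 12   /* tile size in pixels; tune for the SH-4's data cache */

void rotate_cw_tiled(unsigned char *dest, const unsigned char *src, int W, int H)
{
    int ii, jj, i, j;
    for (ii = 0; ii < H; ii += T)
        for (jj = 0; jj < W; jj += T)
            for (i = ii; i < ii + T; ++i)
                for (j = jj; j < jj + T; ++j)
                    memcpy(dest + (H * j + H - 1 - i) * 3,
                           src  + (W * i + j) * 3, 3);
}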