From: bartc on

"John W Kennedy" <jwkenne(a)attglobal.net> wrote in message
news:4bca7f2a$0$22520$607ed4bc(a)cv.net...
> On 2010-04-17 04:44:49 -0400, robin said:
>> Now 3 octets will be 9 bits,
>
> Robin, will you /please/ stop blithering on about things you don't
> understand?! Buy a Data Processing dictionary, for God's sake!
>
> The rest of this is addressed to the original poster: I don't understand
> why you're using int variables for octets; they should be char. For the
> rest, I'd do the following:
>
> typedef struct __Pixel {
> unsigned char red, green, blue;
> } Pixel;
>
> Pixel src[W][H];
> Pixel dest[H][W];
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j) {
>         dest[j][i].red = src[i][j].red;
>         dest[j][i].green = src[i][j].green;
>         dest[j][i].blue = src[i][j].blue;
>     }
....
> memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));

And perhaps dest[j][i] = src[i][j];

But in practice W,H might only be known at runtime, making the code rather
different. Depending on the exact format of the image data, there might also
be padding bytes (nothing to do with C's struct padding), for example at the
end of each row.
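Something along these lines, perhaps. This is only a sketch, with hypothetical names: `transpose` and the `src_stride`/`dst_stride` parameters are illustrative, and it assumes packed 3-byte RGB pixels with the row stride (in bytes) possibly larger than W*sizeof(Pixel) to account for padding:

```c
#include <stddef.h>
#include <string.h>

typedef struct { unsigned char red, green, blue; } Pixel;

/* Transpose a W x H image into an H x W image when the dimensions are
   only known at runtime. src_stride and dst_stride are the row sizes
   in bytes; they may exceed the packed row width when the image format
   pads each row. */
static void transpose(const unsigned char *src, size_t src_stride,
                      unsigned char *dst, size_t dst_stride,
                      int W, int H)
{
    for (int i = 0; i < W; ++i)
        for (int j = 0; j < H; ++j)
            memcpy(dst + (size_t)i * dst_stride + (size_t)j * sizeof(Pixel),
                   src + (size_t)j * src_stride + (size_t)i * sizeof(Pixel),
                   sizeof(Pixel));
}
```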

--
Bartc


From: Noob on
John W Kennedy wrote:

> robin said:
>
>> Now 3 octets will be 9 bits, [...]

http://en.wikipedia.org/wiki/Octet_(computing)

> The rest of this is addressed to the original poster: I don't understand
> why you're using int variables for octets; they should be char.

You're right. It was a typo.

> For the rest, I'd do the following:
>
> typedef struct __Pixel {
> unsigned char red, green, blue;
> } Pixel;
>
> Pixel src[W][H];
> Pixel dest[H][W];
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j) {
>         dest[j][i].red = src[i][j].red;
>         dest[j][i].green = src[i][j].green;
>         dest[j][i].blue = src[i][j].blue;
>     }

Are you sure the above is a description of a rotation? :-)
Clockwise or counter-clockwise?

> I'd also try:
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j)
>         memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));
>
> and see which one is faster; it will depend on the individual compiler.
> (Make sure you test at the same optimization level you plan to use.)

The target system is
CPU: 266 MHz, dual issue, 5-stage integer pipeline, SH-4
RAM: Two 64-MB, 200-MHz, DDR1 SDRAM modules (on separate memory buses)

After much testing, it dawned on me that the system's memory
allocator returns non-cached memory. (I found no way to request
large contiguous buffers in cached memory.) All cache-specific
optimizations thus became irrelevant.

On this system, a load from non-cached memory has a latency of
~45 cycles, thus the only optimization that made sense was to
load 32-bit words instead of octets. I configured libjpeg to
output 32-bit pixels instead of 24-bit pixels.

Then I got away with trivial code:

void rotate_right(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*j + H-1-i] = A[W*i + j]; /* B[j][H-1-i] = A[i][j] */
}

void rotate_left(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*(W-1-j) + i] = A[W*i + j]; /* B[W-1-j][i] = A[i][j] */
}
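For anyone who wants to convince themselves of the index arithmetic, here is a quick host-side check (using the standard uint32_t in place of the uint32 typedef above; const qualifiers added):

```c
#include <stdint.h>

/* Same indexing as rotate_right/rotate_left above: A is W wide and
   H tall; the clockwise result B is H wide and W tall. */
static void rotate_right(const uint32_t *A, uint32_t *B, int W, int H)
{
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            B[H*j + H-1-i] = A[W*i + j];
}

static void rotate_left(const uint32_t *A, uint32_t *B, int W, int H)
{
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j)
            B[H*(W-1-j) + i] = A[W*i + j];
}
```

Rotating a 3-wide, 2-tall image clockwise and then counter-clockwise (with W and H swapped for the second call) should reproduce the original.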

gcc-4.2.4 -O2 was smart enough to strength-reduce the index
computation for both arrays.

00000000 <_rotate_right>:
0: 86 2f mov.l r8,@-r15
2: 15 47 cmp/pl r7
4: 96 2f mov.l r9,@-r15
6: 63 68 mov r6,r8
8: 15 8f bf.s 36 <_rotate_right+0x36>
a: 73 61 mov r7,r1
c: 08 47 shll2 r7
e: 83 69 mov r8,r9
10: fc 77 add #-4,r7
12: 13 66 mov r1,r6
14: 7c 35 add r7,r5
16: 08 49 shll2 r9
18: 04 77 add #4,r7
1a: 15 48 cmp/pl r8
1c: 07 8b bf 2e <_rotate_right+0x2e>
1e: 43 60 mov r4,r0
20: 53 63 mov r5,r3
22: 83 62 mov r8,r2
24: 06 61 mov.l @r0+,r1
26: 10 42 dt r2
28: 12 23 mov.l r1,@r3
2a: fb 8f bf.s 24 <_rotate_right+0x24>
2c: 7c 33 add r7,r3
2e: 10 46 dt r6
30: fc 75 add #-4,r5
32: f2 8f bf.s 1a <_rotate_right+0x1a>
34: 9c 34 add r9,r4
36: f6 69 mov.l @r15+,r9
38: 0b 00 rts
3a: f6 68 mov.l @r15+,r8
3c: 09 00 nop
3e: 09 00 nop

The loop kernel is

24: 06 61 mov.l @r0+,r1
26: 10 42 dt r2
28: 12 23 mov.l r1,@r3
2a: fb 8f bf.s 24 <_rotate_right+0x24>
2c: 7c 33 add r7,r3

Thanks to all for your suggestions (especially Terje).

Regards.
From: Chris Gray on
Noob <root(a)127.0.0.1> writes:

> The loop kernel is

> 24: 06 61 mov.l @r0+,r1
> 26: 10 42 dt r2
> 28: 12 23 mov.l r1,@r3
> 2a: fb 8f bf.s 24 <_rotate_right+0x24>
> 2c: 7c 33 add r7,r3

[Not referring to this specific code, but just following up.]

Why can't modern CPUs optimize the heck out of the relatively simple
code that a compiler might produce for a block copy? They have all of
the information they need - the addresses, the length, the alignments,
the position relative to page boundaries, cache lines, write buffers, etc.

Compilers often look at large chunks of code to figure out what they
are doing (e.g. Sun's "heroic optimizations" of a few years ago). CPUs
have transistors to burn now, why can't they look for patterns that
can be executed faster? Detect block copies, and turn them into
streaming fetches and stores, limited only by memory speeds. Don't
cache the data, don't purge any existing nonconflicting write buffers,
etc. Is the latency of detecting the situation too large?

Lots of code does a lot of copying - there could be a real benefit.
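For what it's worth, the pattern in question is about as simple as code gets. A loop like the sketch below is what a compiler might emit for a block copy; compilers commonly recognize the idiom and call memcpy instead, and on x86 a memcpy built on rep movsb is one case where the hardware does take over, since recent cores special-case that instruction in microcode:

```c
#include <stddef.h>

/* The canonical byte-copy loop a compiler might produce for a block
   copy. Everything the hardware would need to accelerate it --
   addresses, length, alignment -- is visible at this point. */
void block_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        dst[k] = src[k];
}
```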

--
Experience should guide us, not rule us.

Chris Gray