Faster image rotation [General Programming]

Prev: #include "cpuid.os"
Next: aspect ratio algorithm needed.

From: Brett Davis on 18 Apr 2010 21:59

In article <4BCB4EA2.4020706(a)patten-glew.net>,
"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> wrote:

> On 4/18/2010 4:57 AM, Niels Jørgen Kruse wrote:
> > Andy "Krazy" Glew<ag-news(a)patten-glew.net> wrote:
> > Cortex A9 is not shipping in any product yet (I believe). Lots of
> > preannouncements though. The Apple A4 CPU is currently believed to be a
> > tweaked Cortex A8, perhaps related to the tweaked A8 that Intrinsity did
> > for Samsung before being acquired by Apple.
>
> One conspiracy-theorist type seems to think that it might actually be the
> PA Semi OOO PowerPC, running an ARM emulator.

A PowerPC running an emulator was within the realm of possibility.
I bet Apple looked at it.

The problem is that PowerPC offers little over what ARM provides.
(Besides 64 bit address mode, and a nice vector processor,
both of which would cost battery power.)
Thumb mode offers smaller code, a benefit for a handheld.

Apple knows that Thumb is a kludge; I hope Apple is looking at designing
their own CPU instruction set, engineering a competitive advantage.
Hopefully they design something nice like my CLIW.

If the ARM chip has a nice MMU then Apple can stay 32 bits for a decade,
otherwise the roadmap will be looking dire for ARM in ~2 years.
Fixing the MMU would be the easy solution, it buys time.

Brett

From: David Brown on 19 Apr 2010 05:50

On 07/04/2010 12:17, Noob wrote:
<snip>
> I'm using the GNU tool chain. For some weird reason, we
> compile everything -O0. The first thing I'll try is crank
> gcc's optimization level.
>
> I'm hoping gcc can perform some strength reduction, as the
> index calculation seems to be taking a non-negligible fraction
> of the total run-time.

(I haven't tried this - it's theory only.)

What version of gcc are you using? Later versions are better at doing
this sort of optimisation automatically - with version 4.4 onwards you
should be writing something like this:

    for (i = 0; i < H; i++) {
        for (j = 0; j < W; j++) {
            memcpy(&rotated_picture[i][j], &original_picture[j][i],
                   sizeof(pixel));
        }
    }

The compiler will (should!) handle strength reduction, inlining of
memcpy, invariant movement, etc., automatically. And with gcc 4.4
onwards and the right compiler options, it will re-arrange the loop to
fit the caches better:

<http://gcc.gnu.org/gcc-4.4/changes.html>

mvh.,

David

From: Noob on 20 Apr 2010 10:32

John W Kennedy wrote:

> robin said:
>
>> Now 3 octets will be 9 bits, [...]

http://en.wikipedia.org/wiki/Octet_(computing)

> The rest of this is addressed to the original poster: I don't understand
> why you're using int variables for octets; they should be char.

You're right. It was a typo.

> For the rest, I'd do the following:
>
> typedef struct __Pixel {
>     unsigned char red, green, blue;
> } Pixel;
>
> Pixel src[W][H];
> Pixel dest[H][W];
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j) {
>         dest[j][i].red = src[i][j].red;
>         dest[j][i].green = src[i][j].green;
>         dest[j][i].blue = src[i][j].blue;
>     }

Are you sure the above is a description of a rotation? :-)
Clockwise or counter-clockwise?
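
To make the question concrete, here is a small sketch (the names
transpose and rotate_cw are made up for illustration, and I assume
row-major buffers of 32-bit pixels). The quoted assignment
dest[j][i] = src[i][j] is a transpose, i.e. a mirror along the main
diagonal; a clockwise quarter turn also has to reverse the column index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only.
   src has h rows of w pixels; dst has w rows of h pixels (row-major). */

/* Transpose: dst[j][i] = src[i][j]. Mirrors along the diagonal. */
void transpose(const uint32_t *src, uint32_t *dst, int w, int h)
{
    assert(w > 0 && h > 0);
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < w; ++j)
            dst[h * j + i] = src[w * i + j];
}

/* Clockwise quarter turn: dst[j][h-1-i] = src[i][j]. */
void rotate_cw(const uint32_t *src, uint32_t *dst, int w, int h)
{
    assert(w > 0 && h > 0);
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < w; ++j)
            dst[h * j + (h - 1 - i)] = src[w * i + j];
}
```

For a 2x3 source with rows {1,2,3} and {4,5,6}, the transpose has rows
{1,4}, {2,5}, {3,6}, while the clockwise rotation has rows {4,1}, {5,2},
{6,3}.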

> I'd also try:
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j)
>         memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));
>
> and see which one is faster; it will depend on the individual compiler.
> (Make sure you test at the same optimization level you plan to use.)

The target system is
CPU: 266 MHz, dual issue, 5-stage integer pipeline, SH-4
RAM: Two 64-MB, 200-MHz, DDR1 SDRAM modules (on separate memory buses)

After much testing, it dawned on me that the system's memory
allocator returns non-cached memory. (I found no way to request
large contiguous buffers in cached memory.) All cache-specific
optimizations thus became irrelevant.

On this system, a load from non-cached memory has a latency of
~45 cycles, thus the only optimization that made sense was to
load 32-bit words instead of octets. I configured libjpeg to
output 32-bit pixels instead of 24-bit pixels.
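
For readers wondering what "32-bit pixels" means here, a rough sketch
follows. It is not what I actually did (libjpeg produced the 32-bit
output directly); it only shows the layout idea. The helper name
expand_rgb24 and the 0x00RRGGBB channel order are assumptions made for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand n packed R,G,B octet triples into n
   32-bit words laid out as 0x00RRGGBB (channel order is an assumption).
   Afterwards each pixel can be moved with one 32-bit load and one
   32-bit store instead of three octet accesses. */
void expand_rgb24(const uint8_t *rgb, uint32_t *out, size_t n)
{
    assert(rgb != NULL && out != NULL);
    for (size_t k = 0; k < n; ++k) {
        uint32_t r = rgb[3 * k + 0];
        uint32_t g = rgb[3 * k + 1];
        uint32_t b = rgb[3 * k + 2];
        out[k] = (r << 16) | (g << 8) | b;
    }
}
```

The cost is one wasted octet per pixel, which seemed a fair trade when
every non-cached load costs ~45 cycles regardless of its width.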

Then I got away with trivial code:

void rotate_right(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*j + H-1-i] = A[W*i + j]; /* B[j][H-1-i] = A[i][j] */
}

void rotate_left(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*(W-1-j) + i] = A[W*i + j]; /* B[W-1-j][i] = A[i][j] */
}

gcc-4.2.4 -O2 was smart enough to strength-reduce the index
computation for both arrays.

00000000 <_rotate_right>:
0: 86 2f mov.l r8,@-r15
2: 15 47 cmp/pl r7
4: 96 2f mov.l r9,@-r15
6: 63 68 mov r6,r8
8: 15 8f bf.s 36 <_rotate_right+0x36>
a: 73 61 mov r7,r1
c: 08 47 shll2 r7
e: 83 69 mov r8,r9
10: fc 77 add #-4,r7
12: 13 66 mov r1,r6
14: 7c 35 add r7,r5
16: 08 49 shll2 r9
18: 04 77 add #4,r7
1a: 15 48 cmp/pl r8
1c: 07 8b bf 2e <_rotate_right+0x2e>
1e: 43 60 mov r4,r0
20: 53 63 mov r5,r3
22: 83 62 mov r8,r2
24: 06 61 mov.l @r0+,r1
26: 10 42 dt r2
28: 12 23 mov.l r1,@r3
2a: fb 8f bf.s 24 <_rotate_right+0x24>
2c: 7c 33 add r7,r3
2e: 10 46 dt r6
30: fc 75 add #-4,r5
32: f2 8f bf.s 1a <_rotate_right+0x1a>
34: 9c 34 add r9,r4
36: f6 69 mov.l @r15+,r9
38: 0b 00 rts
3a: f6 68 mov.l @r15+,r8
3c: 09 00 nop
3e: 09 00 nop

The loop kernel is

24: 06 61 mov.l @r0+,r1
26: 10 42 dt r2
28: 12 23 mov.l r1,@r3
2a: fb 8f bf.s 24 <_rotate_right+0x24>
2c: 7c 33 add r7,r3
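
Written back as C, the strength-reduced loop looks roughly like the
sketch below. This is my hand reading of the listing, not compiler
output; the name rotate_right_sr is made up, and I use uint32_t and a
size_t index where the compiler uses raw pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hand-written approximation of what gcc did to rotate_right:
   no multiplications remain inside the loops. */
void rotate_right_sr(const uint32_t *A, uint32_t *B, int W, int H)
{
    assert(W > 0 && H > 0);
    const uint32_t *src = A;              /* walks A sequentially */
    for (int i = 0; i < H; ++i) {
        size_t d = (size_t)(H - 1 - i);   /* index of B[0][H-1-i] */
        for (int j = 0; j < W; ++j) {
            B[d] = *src++;                /* mov.l @r0+,r1 ; mov.l r1,@r3 */
            d += (size_t)H;               /* add r7,r3: next row of B */
        }
    }
}
```

The source is read in address order while the destination is written
with a stride of H words, which matches the post-increment load and the
add in the kernel.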

Thanks to all for your suggestions (especially Terje).

Regards.

From: MitchAlsup on 19 Apr 2010 12:17

On Apr 18, 4:32 pm, n...(a)cam.ac.uk wrote:
> Yup. In my view, interrupts are doubleplus ungood - message passing
> is good.

CDC was the only company to get this one right. The OS ran mostly* in
the peripheral processors, leaving the great big number cruncher to
(ahem) crunch numbers.

(*) The interrupt processing and I/O was run in the PPs and most of the
OS scheduling was run in the PPs.

I remember a time back in 1979, I was logged into a CDC 7600 in
California doing text editing. There were a dozen other members of the
SW team doing similarly. There was a long silent pause where no
character echoing was taking place. A few moments later (about 30
seconds) the processing returned to normal. However, we found out that
we were now logged into a CDC 7600 in Chicago. The Ca machine had
crashed, and the OS had picked up all the nonfaulting tasks, shipped
them up to another machine half way across the country and restarted
the processes.

Why can't we do this today? We could 30 years ago!

Mitch

From: Thomas Womack on 19 Apr 2010 12:44

In article <fb35005d-6ee0-42f4-818a-1b7120f6ca3e(a)11g2000yqr.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:

>I remember a time back in 1979, I was logged into a CDC 7600 in
>California doing text editing. There were a dozen other members of the
>SW team doing similarly. There was a long silent pause where no
>character echoing was taking place. A few moments later (about 30
>seconds) the processing returned to normal. However, we found out that
>we were now logged into a CDC 7600 in Chicago. The Ca machine had
>crashed, and the OS had picked up all the nonfaulting tasks, shipped
>them up to another machine half way across the country and restarted
>the processes.
>
>Why can't we do this today? We could 30 years ago!

I think, in a setup with the (nowadays clearly unaffordably high)
level of computing staff that would be attached to an organisation
with two CDC 7600s in 1979, that is entirely possible today: you log
into a front-end machine which connects to a back-end machine which is
kept as a migratable VM.

Nobody bothers doing it for text editing because it's crazily
uneconomical.

Tom
