From: bartc on 18 Apr 2010 06:42

"John W Kennedy" <jwkenne(a)attglobal.net> wrote in message
news:4bca7f2a$0$22520$607ed4bc(a)cv.net...

> On 2010-04-17 04:44:49 -0400, robin said:
>> Now 3 octets will be 9 bits,
>
> Robin, will you /please/ stop blithering on about things you don't
> understand?! Buy a Data Processing dictionary, for God's sake!
>
> The rest of this is addressed to the original poster: I don't understand
> why you're using int variables for octets; they should be char. For the
> rest, I'd do the following:
>
> typedef struct __Pixel {
>     unsigned char red, green, blue;
> } Pixel;
>
> Pixel src[W][H];
> Pixel dest[H][W];
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j) {
>         dest[j][i].red = src[i][j].red;
>         dest[j][i].green = src[i][j].green;
>         dest[j][i].blue = src[i][j].blue;
>     }
[...]
> memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));

And perhaps

dest[j][i] = src[i][j];

But in practice W and H might only be known at runtime, making the code
rather different.

Depending on the exact format of the image data, there might also be
padding bytes (nothing to do with C's struct padding), for example at
the end of each row.

--
Bartc
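To make the runtime-dimensions point concrete, here is a minimal sketch (the function name, the 3-byte packed-pixel layout, and the stride parameters are assumptions for illustration, not code from the thread) of a clockwise rotation where W, H, and a per-row byte stride are only known at run time, so padded rows are handled:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: rotate a packed 24-bit RGB image 90 degrees
 * clockwise. The row strides are passed separately from the pixel
 * counts, so any trailing padding bytes per row are simply skipped. */
static void rotate_cw(const unsigned char *src, unsigned char *dst,
                      int w, int h,
                      size_t src_stride,  /* bytes per source row, >= w*3 */
                      size_t dst_stride)  /* bytes per dest row,   >= h*3 */
{
    for (int i = 0; i < h; ++i) {
        const unsigned char *srow = src + (size_t)i * src_stride;
        for (int j = 0; j < w; ++j) {
            /* source pixel (row i, col j) -> dest pixel (row j, col h-1-i) */
            unsigned char *d = dst + (size_t)j * dst_stride
                                   + (size_t)(h - 1 - i) * 3;
            memcpy(d, srow + (size_t)j * 3, 3);
        }
    }
}
```

The destination image is h pixels wide and w pixels tall, so its stride is independent of the source's; that is the "rather different" code shape that fixed-size `Pixel src[W][H]` arrays hide.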
From: Noob on 20 Apr 2010 10:32

John W Kennedy wrote:
> robin said:
>
>> Now 3 octets will be 9 bits, [...]

http://en.wikipedia.org/wiki/Octet_(computing)

> The rest of this is addressed to the original poster: I don't understand
> why you're using int variables for octets; they should be char.

You're right. It was a typo.

> For the rest, I'd do the following:
>
> typedef struct __Pixel {
>     unsigned char red, green, blue;
> } Pixel;
>
> Pixel src[W][H];
> Pixel dest[H][W];
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j) {
>         dest[j][i].red = src[i][j].red;
>         dest[j][i].green = src[i][j].green;
>         dest[j][i].blue = src[i][j].blue;
>     }

Are you sure the above is a description of a rotation? :-)
Clockwise or counter-clockwise?

> I'd also try:
>
> for (int i = 0; i < W; ++i)
>     for (int j = 0; j < H; ++j)
>         memcpy(&dest[j][i], &src[i][j], sizeof (Pixel));
>
> and see which one is faster; it will depend on the individual compiler.
> (Make sure you test at the same optimization level you plan to use.)

The target system is

CPU: 266-MHz, dual-issue, 5-stage integer pipeline, SH-4
RAM: two 64-MB, 200-MHz, DDR1 SDRAM modules (on separate memory buses)

After much testing, it dawned on me that the system's memory allocator
returns non-cached memory. (I found no way to request large contiguous
buffers in cached memory.) All cache-specific optimizations thus became
irrelevant.

On this system, a load from non-cached memory has a latency of ~45 cycles,
so the only optimization that made sense was to load 32-bit words instead
of octets. I configured libjpeg to output 32-bit pixels instead of 24-bit
pixels.
Then I got away with trivial code:

void rotate_right(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*j + H-1-i] = A[W*i + j];   /* B[j][H-1-i] = A[i][j] */
}

void rotate_left(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*(W-1-j) + i] = A[W*i + j]; /* B[W-1-j][i] = A[i][j] */
}

gcc-4.2.4 -O2 was smart enough to strength-reduce the index computation
for both arrays.

00000000 <_rotate_right>:
   0:   86 2f   mov.l   r8,@-r15
   2:   15 47   cmp/pl  r7
   4:   96 2f   mov.l   r9,@-r15
   6:   63 68   mov     r6,r8
   8:   15 8f   bf.s    36 <_rotate_right+0x36>
   a:   73 61   mov     r7,r1
   c:   08 47   shll2   r7
   e:   83 69   mov     r8,r9
  10:   fc 77   add     #-4,r7
  12:   13 66   mov     r1,r6
  14:   7c 35   add     r7,r5
  16:   08 49   shll2   r9
  18:   04 77   add     #4,r7
  1a:   15 48   cmp/pl  r8
  1c:   07 8b   bf      2e <_rotate_right+0x2e>
  1e:   43 60   mov     r4,r0
  20:   53 63   mov     r5,r3
  22:   83 62   mov     r8,r2
  24:   06 61   mov.l   @r0+,r1
  26:   10 42   dt      r2
  28:   12 23   mov.l   r1,@r3
  2a:   fb 8f   bf.s    24 <_rotate_right+0x24>
  2c:   7c 33   add     r7,r3
  2e:   10 46   dt      r6
  30:   fc 75   add     #-4,r5
  32:   f2 8f   bf.s    1a <_rotate_right+0x1a>
  34:   9c 34   add     r9,r4
  36:   f6 69   mov.l   @r15+,r9
  38:   0b 00   rts
  3a:   f6 68   mov.l   @r15+,r8
  3c:   09 00   nop
  3e:   09 00   nop

The loop kernel is

  24:   06 61   mov.l   @r0+,r1
  26:   10 42   dt      r2
  28:   12 23   mov.l   r1,@r3
  2a:   fb 8f   bf.s    24 <_rotate_right+0x24>
  2c:   7c 33   add     r7,r3

Thanks to all for your suggestions (especially Terje).

Regards.
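One quick sanity check for the two routines above: rotating right and then left must reproduce the original image. A minimal self-contained harness (the `uint32` typedef is an assumption, matching the post's apparent intent; the routines are repeated verbatim so the snippet compiles on its own):

```c
#include <stdint.h>

typedef uint32_t uint32;  /* assumed: the post's uint32 is a 32-bit word */

/* A is W x H; B receives the H x W clockwise rotation. */
void rotate_right(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*j + H-1-i] = A[W*i + j];   /* B[j][H-1-i] = A[i][j] */
}

/* A is W x H; B receives the H x W counter-clockwise rotation. */
void rotate_left(uint32 *A, uint32 *B, int W, int H)
{
    int i, j;
    for (i = 0; i < H; ++i)
        for (j = 0; j < W; ++j)
            B[H*(W-1-j) + i] = A[W*i + j]; /* B[W-1-j][i] = A[i][j] */
}
```

Note the dimension swap on the second call: after `rotate_right(A, B, W, H)` the rotated image is H wide and W tall, so the inverse call is `rotate_left(B, C, H, W)`.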
From: Chris Gray on 20 Apr 2010 19:28

Noob <root(a)127.0.0.1> writes:

> The loop kernel is
>
>   24:   06 61   mov.l   @r0+,r1
>   26:   10 42   dt      r2
>   28:   12 23   mov.l   r1,@r3
>   2a:   fb 8f   bf.s    24 <_rotate_right+0x24>
>   2c:   7c 33   add     r7,r3

[Not referring to this specific code, just following up.]

Why can't modern CPUs optimize the heck out of the relatively simple
code that a compiler might produce for a block copy? They have all of
the information they need: the addresses, the length, the alignments,
the position relative to page boundaries, cache lines, write buffers,
etc.

Compilers often look at large chunks of code to figure out what they
are doing (e.g. Sun's "heroic optimizations" of a few years ago). CPUs
have transistors to burn now; why can't they look for patterns that can
be executed faster? Detect block copies, and turn them into streaming
fetches and stores, limited only by memory speeds. Don't cache the
data, don't purge any existing non-conflicting write buffers, etc.

Is the latency of detecting the situation too large?

Lots of code does a lot of copying - there could be a real benefit.

--
Experience should guide us, not rule us.
Chris Gray
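The "streaming stores, don't cache the data" behaviour can already be requested explicitly from software on some architectures, even without the hardware pattern-detection proposed above. A sketch using x86 SSE2 non-temporal store intrinsics (a software analogue only; this is not the automatic detection the post asks about, and it assumes an SSE2-capable target):

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_load_si128 */
#include <stddef.h>

/* Copy n bytes with non-temporal (streaming) stores: the written data
 * bypasses the cache hierarchy instead of evicting useful lines.
 * Constraints of this sketch: n must be a multiple of 16, and both
 * pointers must be 16-byte aligned. */
static void stream_copy(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();  /* order the streaming stores before later accesses */
}
```

This is profitable for large copies whose destination will not be read again soon; for small or soon-reused buffers, plain cached stores (or just memcpy) usually win.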