From: Noob on
Hello Terje :-)

Terje Mathisen wrote:

> Noob wrote:
>
>> My target is a 266-MHz SH-4.
>>
>> Assuming the vendor's implementation of memcpy is properly optimized,
>> I think it is safe to assume that it provides a lower bound for the
>> run-time of the rotation operation. Correct?

There seems to be something fishy with the vendor's memcpy: it handles only
~18 MB/s, while the straightforward foo() below manages ~21 MB/s.

void foo(unsigned int *dest, unsigned int *src, int wordcount)
{
  int i;
  for (i = 0; i < wordcount; ++i) *dest++ = *src++;
}

The loop kernel translates to

.L137:
mov.l @r4+,r1
dt r2
mov.l r1,@r5
bf/s .L137
add #4,r5

Unrolling the loop twice makes the code 50% slower (huh?!?)

.L137:
mov.l @r4,r1
dt r3
mov.l @(4,r4),r2
add #8,r4
mov.l r1,@r5
mov.l r2,@(4,r5)
bf/s .L137
add #8,r5

I'm starting to think that the array is allocated in a non-cached region.
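
One way to test that theory: copy a slice small enough for both source and
destination to sit in the 32-KB operand cache, and see whether repeated
passes speed up. A rough sketch (the 8-KB slice size and the iteration
count are arbitrary):

#include <stdio.h>
#include <string.h>
#include <time.h>

/* If the region is cacheable, passes after the first should run far
   above the ~20 MB/s bus rate; if the rate stays flat, the buffer is
   probably mapped uncached. */
void cache_probe(unsigned char *buf)
{
    int i;
    clock_t t0, t1;

    t0 = clock();
    for (i = 0; i < 1000; ++i)
        memcpy(buf + 8192, buf, 8192);
    t1 = clock();
    printf("%.1f MB/s\n", (1000.0 * 8192 / 1048576.0)
                          / ((double)(t1 - t0) / CLOCKS_PER_SEC));
}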

>> As an example, copying a 5.7-MB picture (1644 x 1164) takes 320 ms
>> (i.e. 18 MB/s)
>
> Both dimensions are divisible by 4, but not by 8, so 4x4 basic blocks is
> a natural starting point.

The dimensions are, indeed, divisible by 4, because I do the following ;-)

cinfo.output_width &= ~3;
cinfo.output_height &= ~3;

>> Right now, my naive (strength-reduced, no tiling) C implementation
>> takes 1100 ms. I think there is still room for improvement.
>>
>> One thing you may have overlooked in the advice you've given me is
>> that the pixel values are packed in 24-bits (RGB, no alpha).
>
> Aha!
>
> This is crucial, it probably means that you should work with 4x4 pixel
> blocks, each using 3x4x4 = 48 bytes.

Someone showed me how to get libjpeg to output 4-byte pixels, and
this alone gave me a 3x speedup.
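
Presumably something like this (JCS_EXT_RGBX is a libjpeg-turbo extension;
with the classic IJG library the equivalent is rebuilding with
RGB_PIXELSIZE defined as 4 in jmorecfg.h):

#include <jpeglib.h>

/* Ask the decompressor for 4-byte pixels directly, so every pixel is
   a single aligned 32-bit load/store later on. */
void decode_rgbx(struct jpeg_decompress_struct *cinfo)
{
    jpeg_read_header(cinfo, TRUE);
    cinfo->out_color_space = JCS_EXT_RGBX;  /* R, G, B, padding */
    jpeg_start_decompress(cinfo);
    /* jpeg_read_scanlines() now delivers 4 bytes per pixel */
}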

> 48 bytes would fit in 12 32-bit registers, I don't know the details of
> the SH-4, but I believe it has at least 16 integer registers, right?

Right.

General register file:
o Sixteen 32-bit general registers (and eight 32-bit shadow registers)
o Seven 32-bit control registers
o Four 32-bit system registers

> Starting from this point I would first try a simple setup where I read
> in 4x4 pixels in 3x4 registers, then shuffle/shift/combine these into a
> rotated/transposed version, which I then store to 4 small temporary
> buffers.
>
> Each time these 4 buffers reach 64 bytes each (or 128, whatever the
> cache line size is)

The cache line size is 32 bytes.

I can play every cache trick in the book, but it won't help if I'm
working with non-cached memory.
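
Still, for the record, here is a minimal sketch of the 4x4-tile scheme
Terje describes, assuming 4-byte pixels, dimensions already masked to
multiples of 4, and a 90-degree clockwise rotation:

typedef unsigned int pixel;  /* 4-byte RGBX */

/* Rotate src (w x h) 90 degrees clockwise into dst (h x w).  Each
   4x4 tile is gathered into registers/L1 and then scattered rotated,
   so the stores hit 16 consecutive bytes per destination row instead
   of one pixel per cache line. */
void rotate90_4x4(pixel *dst, const pixel *src, int w, int h)
{
    int x, y, i, j;
    pixel t[4][4];

    for (y = 0; y < h; y += 4)
        for (x = 0; x < w; x += 4) {
            for (i = 0; i < 4; ++i)      /* gather one tile */
                for (j = 0; j < 4; ++j)
                    t[i][j] = src[(y + i) * w + (x + j)];
            for (j = 0; j < 4; ++j)      /* scatter it rotated */
                for (i = 0; i < 4; ++i)
                    dst[(x + j) * h + (h - 1 - (y + i))] = t[i][j];
        }
}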

> Back in the 486-33MHz days I used to say that 10 MB/s was my initial
> speed target, your SH-4 is probably quite a bit faster than this?

AFAICT, my 266-MHz SH-4 should be 5-10x faster than a 33-MHz 486 in
pure integer code.

I found a better description of the CPU:

o 32-bit internal data bus
o General register file:
  - sixteen 32-bit general registers (and eight 32-bit shadow registers)
  - seven 32-bit control registers
  - four 32-bit system registers
o RISC-type instruction set
  - fixed 16-bit instruction length for improved code efficiency
  - load-store architecture
  - delayed branch instructions
  - conditional execution
o Superscalar architecture: parallel execution of two instructions
o Instruction execution time: maximum 2 instructions/cycle
o On-chip multiplier
o Five-stage pipeline (100, 200, 400 and 500 series)
o Seven-stage pipeline (300 series)
o Branch target cache for bubble-free branching along predicted path (300 series)

and the cache sub-system:

o Instruction cache (IC) features:
  - 16 Kbyte, 2-way set associative
  - 512 entries, 32-byte block length
  - compatibility mode (8 Kbyte direct mapped)
  - index mode
o Operand cache (OC) features:
  - 32 Kbyte, 2-way set associative
  - 1024 entries, 32-byte block length
  - compatibility mode (16 Kbyte direct mapped)
  - index mode
  - RAM mode (16 Kbyte cache + 16 Kbyte RAM)
o Single-stage copy-back buffer, single-stage write-through buffer
o Cache memory contents can be accessed directly by address mapping (usable as on-chip memory)
o Store queue (32 bytes x 2 entries).

cf. also
http://documentation.renesas.com/eng/products/mpumcu/rej09b0318_sh_4sm.pdf
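
The store queues at the end of that list look made for this job: fill one
32-byte queue with eight longword writes, then a PREF to the queue address
bursts it out without allocating a cache line.  A very rough sketch,
assuming privileged mode with the MMU off and QACR0/QACR1 already
programmed with the top bits of the destination (check the manual above
before trusting any of this):

#define SQ_BASE 0xE0000000u  /* store queue area, per the SH7750 manual */

static void sq_store32(unsigned long dst, const unsigned int *src)
{
    /* bits [25:5] of the SQ write address select the destination */
    volatile unsigned int *sq =
        (volatile unsigned int *)(SQ_BASE | (dst & 0x03FFFFE0u));
    int i;

    for (i = 0; i < 8; ++i)  /* fill one 32-byte queue */
        sq[i] = src[i];
    __asm__ __volatile__("pref @%0" : : "r"(sq) : "memory");  /* flush */
}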

Regards.
From: Terje Mathisen on
Noob wrote:
>
> I'm starting to think that the array is allocated in a non-cached region.

Ouch!

That sounds like a frame buffer...

Can you direct these temporary image buffers to be in cacheable RAM instead?
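
If the buffer turns out to live in an uncached mapping, the SH-4's fixed
address regions may make this easy: physical memory is mirrored into P1
(0x80000000, cached) and P2 (0xA0000000, uncached), so a cached alias of a
P2 pointer is just a matter of clearing bit 29.  A sketch, assuming
privileged mode, the standard fixed mappings, and that nothing else (a DMA
engine, say) depends on the uncached view:

static void *cached_alias(void *uncached_p2)
{
    return (void *)((unsigned long)uncached_p2 & ~0x20000000ul);
}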

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: "Andy "Krazy" Glew" on
On 4/10/2010 5:13 PM, MitchAlsup wrote:
> On Apr 9, 4:42 pm, "Andy \"Krazy\" Glew"<ag-n...(a)patten-glew.net>
> wrote:
>> The standard thing here, which usually turns out to be a bad idea, is to interleave cache lines across CPUs.
>> Not quite so bad, but still often surprisingly painful, is to interleave 4K pages across CPUs/MCs.
>> The best is usually to have several large contiguous chunks per CPU.
>
> It turns out that interleaving should be performed at the level of the
> DRAM page for maximal effect with memory controllers that are smart
> enough. This is far larger than the CPU page, but if the DRAMs do not
> all have the same page size, then the largest page that fits in all
> DRAMs is generally an excellent granule.
>
> A DRAM controller is sufficiently smart when it can hold dozens to
> hundreds of cache lines waiting to be written while still servicing
> demand reads with some protocol as to when writes actually happen and
> some means to access the write buffer on behalf of demand reads.
>
> This strategy commits the fewest number of DIMMs to the typical large
> scale strip mine active access patterns. {Leaving the rest of the DRAM
> banks to do more random activities.}
>
> Mitch

On many memory systems the DRAM page is approximately the same size as the virtual memory page - 4KB or thereabouts. So
in this situation our advice coincides.

When the DRAM page is smaller than the virtual memory page, it is probably a bad idea to interleave between CPUs at the
DRAM page granularity: system software cannot control the resulting placement.

By the way, interleaving across channels between DRAM pages is often a win - if you are limited by channel bandwidth.
But this, unfortunately, makes the effective DRAM page even larger, exacerbating the following effect. It can be an
effective way of increasing a 2KB DRAM page to a 4KB page size that the OS can deal with.

When the DRAM page is larger than the e.g. 4KB virtual memory page, if you are only using that page size, on a long
running system all the address bits beyond bit 12 are effectively randomized. You might hope for page clustering or
large pages in the OS to help. Which they do. However, in my experience, large pages have usually only been available
for benchmark specials, and not for truly general purpose computing. Still, if you have an HPC system, use them if you
got them.
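
To make the granularity argument concrete, here is a toy model of which
channel an address lands on under page-granular interleaving (the page
size and channel count are made-up parameters, not from any real
controller):

static unsigned channel_of(unsigned long addr,
                           unsigned long dram_page_bytes,
                           unsigned channels)
{
    return (unsigned)((addr / dram_page_bytes) % channels);
}

/* With dram_page_bytes = 2048 and channels = 2, consecutive 2KB chunks
   alternate channels, so 4KB of addresses pass before the same channel
   and DRAM page repeat -- the "effective page" growth noted above. */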
From: Brett Davis on
In article <na2e97-a342.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

> Noob wrote:
> >
> > I'm starting to think that the array is allocated in a non-cached region.
>
> Ouch!

That is worse than the case where you only use 3 bytes for every cache line fill/spill.

> That sounds like a frame buffer...
>
> Can you direct these temporary image buffers to be in cacheable ram instead?
>
> Terje

If not, you want to load four pixels at a time using three long reads, and shift
out the bytes to write: three slow reads instead of 12.
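
Something along these lines, perhaps -- four packed 24-bit pixels occupy
exactly three 32-bit words, so read the words and unpack with shifts
(big-endian byte order assumed here, so swap the shift counts for a
little-endian configuration; src must be 4-byte aligned):

static void load4_rgb(const unsigned int *src, unsigned char px[4][3])
{
    unsigned int w0 = src[0], w1 = src[1], w2 = src[2];

    px[0][0] = w0 >> 24; px[0][1] = w0 >> 16; px[0][2] = w0 >> 8;
    px[1][0] = w0;       px[1][1] = w1 >> 24; px[1][2] = w1 >> 16;
    px[2][0] = w1 >> 8;  px[2][1] = w1;       px[2][2] = w2 >> 24;
    px[3][0] = w2 >> 16; px[3][1] = w2 >> 8;  px[3][2] = w2;
}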

Brett
From: Brett Davis on
> > The most important number is cache line size, which you missed.
> > If your image is 1,024 lines tall, that will completely thrash
> > the cache, resulting in 3 bytes copied per cache line load/spill.
> >
> > If you do the following you can get a 2x speedup, it looks like
> > more code, but will generate less, and the results will be
> > pipelined correctly.
> > Extra bonus points to those that understand why. Half the posters here?
> >
> >      {
> >        unsigned char *C = B + (H*j + H - i - 1)*3;
> >        unsigned char temp0, temp1, temp2;
> >        temp0 = A[0];
> >        temp1 = A[1];
> >        temp2 = A[2];
> >        C[0] = temp0;
> >        C[1] = temp1;
> >        C[2] = temp2;
> >        A += 3;
> >      }
> >
> > Do not use *C++ = *A++;
> >
>
> Programming hotshots have done so much damage.

Damage?
That is clean code that is easy to read and understand.

> And they brag about it.

Only one in a hundred programmers knows an optimization like that; for
half of comp.arch to be that good says good things about comp.arch.
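
For anyone still wondering about the "why": with *C++ = *A++ the compiler
has to assume C and A may alias, so each store must complete before the
next load can issue; batching the three loads first breaks that dependency
chain.  C99's restrict makes the same promise explicitly -- a sketch, not
something from the original post:

void copy3(unsigned char *restrict C, const unsigned char *restrict A)
{
    C[0] = A[0];  /* all three loads can now be scheduled */
    C[1] = A[1];  /* ahead of the stores */
    C[2] = A[2];
}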

> I watched some doing one-upsmanship while the earth was still being
> created, and I decided I wanted nothing to do with it. I think I
> showed good judgment (rare for me).

I thought I was generous giving away top secrets that most everyone
else hoards. I do have quite a bit more that will get another 50%,
but he lacks the background to understand such, and I lack the skills
to teach the complex stuff over a tty.

> Robert.

If you remember the story about the programmer told to add a cheat
to the blackjack program so the customer would always win...

I am not him, my code is clear and readable.

Brett ;)