Benchmarking a toy example on SH-4 [Embedded]


From: Noob on 15 Apr 2010 08:03

Noob wrote:

> SUMMARY
> loadloop takes
> 7.0 cycles/iteration when LOADING a cached word
> 31.6 cycles/iteration when LOADING a non-cached word
>
> TODO: look at store performance [...]

.global _storeloop
.align 5
_storeloop:
mov.l @r5,r0 /*** READ TIMESTAMP COUNTER ***/
nop
.L3:
dt r4
mov.l r0,@r6
mov.l r0,@r6
mov.l r0,@r6
mov.l r0,@r6
mov.l r0,@r6
mov.l r0,@r6
bf .L3 /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
mov.l @r5,r1 /*** READ TIMESTAMP COUNTER ***/
rts
sub r1,r0 /*** DELAY SLOT ***/
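
For readers who do not speak SH-4 assembly, here is a rough C model of what the routine does. It is only a sketch: the function and parameter names are made up, and the mapping n = r4, counter = r5, dst = r6 is read off the listing, assuming the usual SH convention of passing the first arguments in r4 to r7 and returning the result in r0. It reproduces the control flow, not the timing.

```c
#include <stdint.h>

/* Portable C model of _storeloop; NOT the original code.
 * Assumed mapping (hypothetical names): n = r4, counter = r5, dst = r6.
 * n must be at least 1: like the dt/bf pair, the loop decrements first
 * and tests afterwards. */
uint32_t storeloop_model(uint32_t n, const volatile uint32_t *counter,
                         volatile uint32_t *dst)
{
    uint32_t start = *counter;      /* mov.l @r5,r0 */

    do {
        n--;                        /* dt r4 */
        *dst = start;               /* mov.l r0,@r6 (six times) */
        *dst = start;
        *dst = start;
        *dst = start;
        *dst = start;
        *dst = start;
    } while (n != 0);               /* bf .L3 */

    uint32_t end = *counter;        /* mov.l @r5,r1 */
    return start - end;             /* sub r1,r0 computes r0 - r1 */
}
```

Note that the result is start minus end, which only gives a positive elapsed count if the counter at @r5 counts down. That is an inference from the listing, not something stated above.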

SUMMARY
storeloop takes
7.0 cycles/iteration when STORING a cached word
37.4 cycles/iteration when STORING a non-cached word

I didn't expect non-cached stores to be 20% slower than
non-cached loads, while cached stores and cached loads
run at the same speed. What might explain that?

Regards.

From: Niels Jørgen Kruse on 15 Apr 2010 08:44

Noob <root(a)127.0.0.1> wrote:

> I didn't expect non-cached stores to be 20% slower than
> non-cached loads, while cached stores and cached loads
> run at the same speed. What might explain that?

The reads benefit from DRAM page hits.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

From: Noob on 15 Apr 2010 09:07

Noob wrote:

> loadloop takes
> 7.0 cycles/iteration when LOADING a cached word
> 31.6 cycles/iteration when LOADING a non-cached word
>
> storeloop takes
> 7.0 cycles/iteration when STORING a cached word
> 37.4 cycles/iteration when STORING a non-cached word

Next, a pointer chasing loop.

_chaseptr:
mov.l @r5,r0 /*** READ TIMESTAMP COUNTER ***/
nop
.L4:
dt r4
mov.l @r6,r6
mov.l @r6,r6
mov.l @r6,r6
mov.l @r6,r6
mov.l @r6,r6
mov.l @r6,r6
bf .L4 /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
mov.l @r5,r1 /*** READ TIMESTAMP COUNTER ***/
rts
sub r1,r0 /*** DELAY SLOT ***/

chaseptr takes
12.0 cycles/iteration when working with cached memory.
31.6 cycles/iteration when working with non-cached memory.

6 loads per iteration; 2-cycle latency on a cache hit, thus
12 cycles per iteration. (dt and bf basically come "for free".)
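
The arithmetic behind the cached numbers can be written down as a tiny model. This is a sketch under my own assumptions, with hypothetical function names: independent memory operations issue one per cycle and the taken branch costs one extra cycle, while dependent loads are serialized on the load-to-use latency, which hides the branch stall.

```c
/* Simplified cycle model; an assumption, not taken from an SH-4 manual. */

/* Independent operations (loadloop, storeloop): one issue slot each,
 * plus the stall on the taken branch. */
unsigned independent_cycles(unsigned ops, unsigned issue_cycles,
                            unsigned branch_stall)
{
    return ops * issue_cycles + branch_stall;
}

/* Dependent loads (chaseptr): each load waits for the previous result,
 * so the chain costs ops * latency and dt/bf fit in the gaps. */
unsigned dependent_cycles(unsigned ops, unsigned latency)
{
    return ops * latency;
}
```

With 6 operations, a 1-cycle issue and a 1-cycle branch stall, the first function gives the 7 cycles measured for the cached loops; with 6 loads and a 2-cycle latency, the second gives the 12 cycles measured for chaseptr.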

chaseptr is (marginally) faster than loadloop (31.56 vs 31.58) when
working with non-cached memory, which is slightly counter-intuitive.
(The difference might be insignificant, but it is systematic.)

Regards.

From: Noob on 15 Apr 2010 10:57

Noob wrote:

> loadloop takes
> 7.0 cycles/iteration when LOADING a cached word
> 31.6 cycles/iteration when LOADING a non-cached word
>
> storeloop takes
> 7.0 cycles/iteration when STORING a cached word
> 37.4 cycles/iteration when STORING a non-cached word
>
> chaseptr takes
> 12.0 cycles/iteration when working with cached memory
> 31.6 cycles/iteration when working with non-cached memory

Epic fail. I got the numbers for non-cached loads wrong
by an order of magnitude.

loadloop takes
7 cycles/iteration when LOADING a cached word
316 cycles/iteration when LOADING a non-cached word
^^^

storeloop takes
7.0 cycles/iteration when STORING a cached word
37.4 cycles/iteration when STORING a non-cached word

chaseptr takes
12 cycles/iteration when working with cached memory
316 cycles/iteration when working with non-cached memory
^^^

I'm now trying to understand why reading from non-cached
memory is so much slower than writing.

Is the CPU optimizing some (most) of my writes away because
I keep writing to the same address?

In my test, cached read bandwidth is 906 MB/s,
while non-cached read bandwidth is 20 MB/s.

20 MB/s seems very low for DDR1 SDRAM, wouldn't you agree?
Perhaps DRAM is not optimized for my artificial access pattern?
(Always hitting the same word.)
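
As a sanity check, the two bandwidth figures can be tied back to the cycle counts. The sketch below assumes 6 word loads of 4 bytes per iteration (as in the loops above) and 1 MB = 10^6 bytes; the function names are hypothetical. I have not stated the CPU clock, so the code backs it out of the cached figure instead of assuming one.

```c
#define LOADS_PER_ITERATION 6   /* six mov.l per loop iteration */
#define BYTES_PER_LOAD      4   /* mov.l moves a 32-bit word */

/* Bandwidth in MB/s (1 MB = 10^6 bytes, an assumption) for a given
 * clock and measured cycles per iteration. */
double bandwidth_mb_per_s(double clock_hz, double cycles_per_iteration)
{
    double bytes = (double)LOADS_PER_ITERATION * BYTES_PER_LOAD;
    return bytes * clock_hz / cycles_per_iteration / 1e6;
}

/* Clock frequency implied by a bandwidth figure and a cycle count. */
double implied_clock_hz(double mb_per_s, double cycles_per_iteration)
{
    double bytes = (double)LOADS_PER_ITERATION * BYTES_PER_LOAD;
    return mb_per_s * 1e6 * cycles_per_iteration / bytes;
}
```

Plugging in 906 MB/s at 7 cycles/iteration implies a clock of roughly 264 MHz, and feeding that clock back in with 316 cycles/iteration gives about 20 MB/s, so the two figures are at least consistent with each other.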

Regards.

From: Noob on 15 Apr 2010 11:48

Noob wrote:

> loadloop takes
> 7 cycles/iteration when LOADING a cached word
> 316 cycles/iteration when LOADING a non-cached word
>
> storeloop takes
> 7.0 cycles/iteration when STORING a cached word
> 37.4 cycles/iteration when STORING a non-cached word
>
> chaseptr takes
> 12 cycles/iteration when working with cached memory
> 316 cycles/iteration when working with non-cached memory
>
> I'm now trying to understand why reading from non-cached
> memory is so much slower than writing.
>
> Is the CPU optimizing some (most) of my writes away because
> I keep writing to the same address?

There is no timing difference between writing to contiguous words
and writing to the same word over and over again.

arrstore
7.0 cycles/iteration when STORING to cached memory
37.4 cycles/iteration when STORING to non-cached memory

_arrstore:
mov.l @r5,r0 /*** READ TIMESTAMP COUNTER ***/
nop
.L5:
dt r4
mov.l r0,@( 0,r6)
mov.l r0,@( 4,r6)
mov.l r0,@( 8,r6)
mov.l r0,@(12,r6)
mov.l r0,@(16,r6)
mov.l r0,@(20,r6)
bf .L5 /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
mov.l @r5,r1 /*** READ TIMESTAMP COUNTER ***/
rts
sub r1,r0 /*** DELAY SLOT ***/

I am perplexed.
