Benchmarking a toy example on SH-4 [Embedded]


From: Noob on 14 Apr 2010 12:07

Hello everyone,

I have a 266 MHz, dual issue, 5-stage integer pipeline, SH-4 CPU.

I've written a small piece of assembly to make sure I understand
what is going on in the trivial case.

The code:

r4 = loop iteration count
r5 = address of the time-stamp counter (1037109 Hz)

.text
.little
.global _noploop
.align 5
_noploop:
mov.l @r5,r0 /*** READ TIMESTAMP COUNTER ***/
nop
.L1:
dt r4
nop
nop
nop
nop
nop
nop
bf .L1 /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
mov.l @r5,r1 /*** READ TIMESTAMP COUNTER ***/
rts
sub r1,r0 /*** DELAY SLOT ***/

1) the loop kernel consists of 8 instructions
2) nop can execute in parallel with nop, with dt, and with bf
3) the dependency dt -> bf does not induce a pipeline stall
4) bf taken induces a one-cycle pipeline stall

Therefore, an iteration of the loop runs in 5 cycles.
Is this correct, so far?

I called noploop with an iteration count of 10^9.
It runs in 19623860 ticks = 18.92 seconds

5e9 cycles in 18.92 seconds = 264.2 MHz
(close enough to 266.67 MHz)
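
The arithmetic above can be reproduced in a few lines of Python. This is only a sketch: the variable names are illustrative, and the 5 cycles per iteration figure is the assumption being tested, not a measured value.

```python
# Figures taken from the post; CYCLES_PER_ITER is the assumed pipeline model.
TSC_HZ = 1_037_109          # time-stamp counter frequency
TICKS = 19_623_860          # measured ticks for the whole run
ITERATIONS = 10**9
CYCLES_PER_ITER = 5         # 8 instructions dual-issued + 1 stall (assumption)
NOMINAL_MHZ = 266.67

elapsed_s = TICKS / TSC_HZ
freq_mhz = ITERATIONS * CYCLES_PER_ITER / elapsed_s / 1e6
shortfall_pct = (NOMINAL_MHZ - freq_mhz) / NOMINAL_MHZ * 100

print(f"{elapsed_s:.2f} s, {freq_mhz:.1f} MHz, {shortfall_pct:.2f}% below nominal")
```

If the model were off by one cycle per iteration, the implied clock would be roughly 211 or 317 MHz, so landing this close to 266.67 MHz supports the 5-cycle count.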

Can I safely conclude that this CPU does, indeed, run at
the advertised frequency?

The system supports DDR1 SDRAM.

Given that the CPU is running very close to peak performance
in my toy example, can I conclude that the instruction cache
is active? Or is it possible to reach this performance level
running straight from RAM?

AFAIU, our system comes with DDR-200. I would have expected
DDR-266, wouldn't that make more sense?

Thanks for reading this far :-)

Regards.

From: Terje Mathisen on 14 Apr 2010 15:11

Noob wrote:
> Hello everyone,
>
> I have a 266 MHz, dual issue, 5-stage integer pipeline, SH-4 CPU.
>
> I've written a small piece of assembly to make sure I understand
> what is going on in the trivial case.
>
> The code:
>
> r4 = loop iteration count
> r5 = address of the time-stamp counter (1037109 Hz)
>
> .text
> .little
> .global _noploop
> .align 5
> _noploop:
> mov.l @r5,r0 /*** READ TIMESTAMP COUNTER ***/
> nop
> .L1:
> dt r4
> nop
> nop
> nop
> nop
> nop
> nop
> bf .L1 /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
> mov.l @r5,r1 /*** READ TIMESTAMP COUNTER ***/
> rts
> sub r1,r0 /*** DELAY SLOT ***/
>
> 1) the loop kernel consists of 8 instructions
> 2) nop can execute in parallel with nop, with dt, and with bf
> 3) the dependency dt -> bf does not induce a pipeline stall
> 4) bf taken induces a one-cycle pipeline stall
>
> Therefore, an iteration of the loop runs in 5 cycles.
> Is this correct, so far?

Seems very reasonable, quite similar to the original Pentium pipeline.
>
> I called noploop with an iteration count of 10^9.
> It runs in 19623860 ticks = 18.92 seconds
>
> 5e9 cycles in 18.92 seconds = 264.2 MHz
> (close enough to 266.67 MHz)
>
> Can I safely conclude that this CPU does, indeed, run at
> the advertised frequency?

Or at least very close to it, your crystal might be slightly off spec.
>
> The system supports DDR1 SDRAM.
>
> Given that the CPU is running very close to peak performance
> in my toy example, can I conclude that the instruction cache
> is active? Or is it possible to reach this performance level
> running straight from RAM?

It depends: Does that cpu have any kind of prefetch buffer where small
loops can run out of the buffer, like some mainframes used to have?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From: Noob on 15 Apr 2010 05:20

Terje Mathisen wrote:

> Noob wrote:
>
>> I have a 266 MHz, dual issue, 5-stage integer pipeline, SH-4 CPU.
>>
>> I've written a small piece of assembly to make sure I understand
>> what is going on in the trivial case.
>> [snip]
>
> Seems very reasonable, quite similar to the original Pentium pipeline.

Yes, the pairing rules did bring back Pentium memories.

NB: one cannot pair two arithmetic instructions.
I find this a severe limitation.

>> I called noploop with an iteration count of 10^9.
>> It runs in 19623860 ticks = 18.92 seconds
>>
>> 5e9 cycles in 18.92 seconds = 264.2 MHz
>> (close enough to 266.67 MHz)
>>
>> Can I safely conclude that this CPU does, indeed, run at
>> the advertised frequency?
>
> Or at least very close to it, your crystal might be slightly off spec.

One percent seems like a large offset, wouldn't you say? (However, the
exact frequency of the platform's TSC might not be very important.)

>> The system supports DDR1 SDRAM.
>>
>> Given that the CPU is running very close to peak performance
>> in my toy example, can I conclude that the instruction cache
>> is active? Or is it possible to reach this performance level
>> running straight from RAM?
>
> It depends: Does that cpu have any kind of prefetch buffer where small
> loops can run out of the buffer, like some mainframes used to have?

How is a prefetch buffer different from an Icache?
They sound conceptually similar.

The documentation explicitly mentions software prefetching for data,
and seems to allude to hardware prefetching for instructions, e.g.

"If code is located in the final bytes of a memory area, as defined above,
instruction prefetching may initiate a bus access for an address outside
the memory area."

Regards.

From: Noob on 15 Apr 2010 06:04

Noob wrote:

> I have a 266 MHz, dual issue, 5-stage integer pipeline, SH-4 CPU.
>
> I've written a small piece of assembly to make sure I understand
> what is going on in the trivial case.
>
> The code:
>
> r4 = loop iteration count
> r5 = address of the time-stamp counter (1037109 Hz)
>
> .text
> .little
> .global _noploop
> .align 5
> _noploop:
> mov.l @r5,r0 /*** READ TIMESTAMP COUNTER ***/
> nop
> .L1:
> dt r4
> nop
> nop
> nop
> nop
> nop
> nop
> bf .L1 /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
> mov.l @r5,r1 /*** READ TIMESTAMP COUNTER ***/
> rts
> sub r1,r0 /*** DELAY SLOT ***/
>
> 1) the loop kernel consists of 8 instructions
> 2) nop can execute in parallel with nop, with dt, and with bf
> 3) the dependency dt -> bf does not induce a pipeline stall
> 4) bf taken induces a one-cycle pipeline stall
>
> Therefore, an iteration of the loop runs in 5 cycles.
> Is this correct, so far?
>
> I called noploop with an iteration count of 10^9.
> It runs in 19623860 ticks = 18.92 seconds

I then set out to prove that the memory manager returns non-cached memory.

I wrote a trivial load loop.

r4 = loop iteration count
r5 = address of the time-stamp counter (1037109 Hz)
r6 = address of one word

_loadloop:
mov.l @r5,r0 /*** READ TIMESTAMP COUNTER ***/
nop
.L2:
dt r4
mov.l @r6,r1
mov.l @r6,r1
mov.l @r6,r1
mov.l @r6,r1
mov.l @r6,r1
mov.l @r6,r1
bf .L2 /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
mov.l @r5,r1 /*** READ TIMESTAMP COUNTER ***/
rts
sub r1,r0 /*** DELAY SLOT ***/

A load can be paired with dt and with bf, but not with another load.
Thus, when r6 points to cached memory, I expect 7 cycles per iteration.
If I allocate the word on the stack, or via malloc, all is well.
1e9 iterations in 7e9 cycles => OK
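
The 5-cycle and 7-cycle counts can be cross-checked with a toy in-order pairing model. This is a sketch under stated assumptions: `can_pair` and `loop_cycles` are hypothetical helpers, the pairing rule encodes only what this thread states (nop pairs with anything used here, two instructions of the same kind otherwise do not pair), and load latency and the real SH-4 grouping rules are ignored.

```python
def can_pair(a, b):
    # Simplified rule from the thread: a nop pairs with anything used here;
    # otherwise two instructions of the same kind (e.g. two loads) cannot
    # issue together. This is NOT the full SH-4 pairing table.
    return a == "nop" or b == "nop" or a != b

def loop_cycles(kernel, taken_branch_stall=1):
    """Greedy in-order dual issue: pair adjacent instructions when allowed."""
    cycles, i = 0, 0
    while i < len(kernel):
        if i + 1 < len(kernel) and can_pair(kernel[i], kernel[i + 1]):
            i += 2          # both instructions issue in the same cycle
        else:
            i += 1          # instruction issues alone
        cycles += 1
    return cycles + taken_branch_stall

nop_kernel = ["dt"] + ["nop"] * 6 + ["bf"]
load_kernel = ["dt"] + ["load"] * 6 + ["bf"]

print(loop_cycles(nop_kernel), loop_cycles(load_kernel))
```

Under these assumptions the model gives 5 cycles for the nop loop and 7 for the load loop, matching the counts expected for cached memory.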

If I allocate the word via the "AVMEM memory manager", not so well.
1e9 iterations in 31.6e9 cycles.

I think it is safe to conclude that the latter memory is not cached,
right?
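
To put a number on the gap, here is a rough sketch of the per-load cost. It assumes the dt/bf pairing and the branch stall cost the same in both runs (not verified in this thread) and uses the 264.2 MHz clock estimated earlier; the variable names are illustrative.

```python
LOADS_PER_ITER = 6
CACHED_CYCLES = 7.0      # cycles per iteration, stack/malloc memory
AVMEM_CYCLES = 31.6      # cycles per iteration, AVMEM memory
CPU_HZ = 264.2e6         # clock estimated from the nop loop

slowdown = AVMEM_CYCLES / CACHED_CYCLES
extra_cycles_per_load = (AVMEM_CYCLES - CACHED_CYCLES) / LOADS_PER_ITER
extra_ns_per_load = extra_cycles_per_load / CPU_HZ * 1e9

print(f"{slowdown:.2f}x slower, "
      f"{extra_cycles_per_load:.1f} extra cycles "
      f"(~{extra_ns_per_load:.1f} ns) per load")
```

A slowdown of roughly 4.5x, or about 4 extra cycles on every load of the same address, is hard to explain if the word were sitting in the data cache.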

TODO: look at store performance, then write my own memcpy.

Regards.

From: Terje Mathisen on 15 Apr 2010 07:46

Noob wrote:
> Terje Mathisen wrote:
>
>> Noob wrote:
>>
>>> I have a 266 MHz, dual issue, 5-stage integer pipeline, SH-4 CPU.
>>>
>>> I've written a small piece of assembly to make sure I understand
>>> what is going on in the trivial case.
>>> [snip]
>>
>> Seems very reasonable, quite similar to the original Pentium pipeline.
>
> Yes, the pairing rules did bring back Pentium memories.
>
> NB: one cannot pair two arithmetic instructions.
> I find this a severe limitation.
>
>>> I called noploop with an iteration count of 10^9.
>>> It runs in 19623860 ticks = 18.92 seconds
>>>
>>> 5e9 cycles in 18.92 seconds = 264.2 MHz
>>> (close enough to 266.67 MHz)
>>>
>>> Can I safely conclude that this CPU does, indeed, run at
>>> the advertised frequency?
>>
>> Or at least very close to it, your crystal might be slightly off spec.
>
> One percent seems like a large offset, wouldn't you say? (However, the
> exact frequency of the platform's TSC might not be very important.)

I've never seen a crystal which is spot on. For example, my current laptop
has a 2.2 GHz CPU ("Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz"), while the
speed I measure by comparing RDTSC with OS clock time, with ntpd running
to tune the system time, is 2.195 GHz.
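
For comparison, both deviations can be expressed in parts per million. This is a sketch: `deviation_ppm` is a hypothetical helper, and the SH-4 figure inherits any error in the nominal 1037109 Hz time-stamp counter rate, so it is not a pure crystal measurement.

```python
def deviation_ppm(measured_hz, nominal_hz):
    """Relative deviation of a measured clock from its nominal value, in ppm."""
    return (measured_hz - nominal_hz) / nominal_hz * 1e6

laptop_ppm = deviation_ppm(2.195e9, 2.2e9)     # Core 2 Duo figures above
sh4_ppm = deviation_ppm(264.2e6, 266.67e6)     # SH-4 figures from the first post

print(f"laptop: {laptop_ppm:.0f} ppm, SH-4: {sh4_ppm:.0f} ppm")
```

The laptop is off by roughly 0.2% and the SH-4 measurement by roughly 0.9%, so the SH-4 gap is about four times larger.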
>
>> It depends: Does that cpu have any kind of prefetch buffer where small
>> loops can run out of the buffer, like some mainframes used to have?
>
> How is a prefetch buffer different from an Icache?
> They sound conceptually similar.

They are, but a prefetch buffer, like the one on the 8088->486 CPUs, did not
snoop any bus activity, so self-modifying code would not be picked up for
instructions already prefetched.

This makes such a buffer simpler than a real cache.
>
> The documentation explicitly mentions software prefetching for data,
> and seems to allude to hardware prefetching for instructions, e.g.
>
> "If code is located in the final bytes of a memory area, as defined above,
> instruction prefetching may initiate a bus access for an address outside
> the memory area."

This is just the normal pipeline behaviour: it always tries to read the
next few instruction bytes, even if the last instruction was a branch and
memory ends just past that branch.

Anyway, with your AVMEM uncached memory regions you would get a huge
speedup by moving as much of the processing as possible into normal RAM
and moving only the final results into frame buffer space.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
