From: "Andy "Krazy" Glew" on 15 Apr 2010 23:51 On 4/15/2010 7:57 AM, Noob wrote: > loadloop takes > 7 cycles/iteration when LOADING a cached word > 316 cycles/iteration when LOADING a non-cached word > ^^^ > > storeloop takes > 7.0 cycles/iteration when STORING a cached word > 37.4 cycles/iteration when STORING a non-cached word > > chaseptr takes > 12 cycles/iteration when working with cached memory > 316 cycles/iteration when working with non-cached memory > ^^^ > > I'm now trying to understand why reading from non-cached > memory is so much slower than writing. > > Is the CPU optimizing some (most) of my writes away because > I keep writing to the same address? That's possible. But not so likely, since most systems keep uncached accesses separate, do not combine them, because the uncached memory accesses may be to memory mapped I/O devices that have side effects. (I have designed systems that have two different types of uncached memory, a UC-MMIO type that permits no optimizations, and a UC-Ordinary type that ermits optimizations. But I am not aware of anyone shipping such a system. HPC guys often ask for it.) More likely, every time you do an uncached read it looks something like this Processor sends out address. Wait many cycles while address percolates through processor, across bus, to DRAM Wait a few cycles while DRAM responds. Wait many cycles while data percolates back Wait a few cycles while processor handles data Start next load Whereas with stores, it is Processor sends out address and data. Store is buffered or pipelined Followup store follows close behind. Oh, and the long latency of the uncached load may be long enough that the DRAM controller closes the active page, whereas the back to back stores probably score page hits. You may be able to design microbenchmarks to distinguish store pipelining from store buffering. E.g. if you have a store buffer of 8-10 entries, you might see what you observe. > In my test, cache read bandwidth is 906 MB/s, > while non-cached read bandwidth is 20 MB/s. > > 20 MB/s seems very low for DDR1 SDRAM, wouldn't you agree? > Perhaps DRAM is not optimized for my artificial access pattern? > (Always hitting the same word.) DRAM is *NOT* optimized for uncached accesses. With modern DRAM the only way you can approach peak bandwidth is to use burst accesses - typically cache line fills, but also possibly reads of 512b/64B vectors, etc., load-multiple-register instructions. I.e. you must either het a burst transfer implicitly, via a cache line or prefetch, or explicitly, by instructions that load more than 4 bytes at a time.
From: Brett Davis on 16 Apr 2010 02:22 In article <hq7cgo$2md$1(a)speranza.aioe.org>, Noob <root(a)127.0.0.1> wrote: > > I'm now trying to understand why reading from non-cached > > memory is so much slower than writing. The CPU halts and waits for the read; the write is handed to the memory controller, and the CPU goes on its merry way - until the memory controller fills its buffer and stalls, forcing the CPU to halt and wait. > > Is the CPU optimizing some (most) of my writes away because > > I keep writing to the same address? > > There is no difference between writing to contiguous words, > and writing to the same word, over and over again. > > I am perplexed. Uncached memory generally means memory-mapped serial port registers, etc. You do not want your CPU optimizing away those writes. There are generally several types of uncached memory, with different rules and different performance: from hard volatile for hardware registers, to write-only display lists whose memory writes you do want the CPU to optimize. You didn't tell us which type(s) you were using or were given to use. ;) Brett
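For concreteness, the "hard volatile" case Brett describes looks something like this in C. The register address is invented purely for illustration; a real driver would take it from the board's memory map.

  #include <stdint.h>

  /* Made-up MMIO address, for illustration only */
  #define UART_TX (*(volatile uint8_t *)0xB8000000u)

  void uart_puts(const char *s)
  {
      /* Each store transmits one byte - a side effect. The volatile
       * qualifier stops the compiler from merging repeated stores to
       * this one address; mapping the register uncached stops the
       * hardware from doing the same. */
      while (*s)
          UART_TX = (uint8_t)*s++;
  }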
From: Terje Mathisen "terje.mathisen at tmsw.no" on 16 Apr 2010 04:43 Noob wrote: > In my test, cache read bandwidth is 906 MB/s, > while non-cached read bandwidth is 20 MB/s. > > 20 MB/s seems very low for DDR1 SDRAM, wouldn't you agree? > Perhaps DRAM is not optimized for my artificial access pattern? > (Always hitting the same word.) What you're seeing is simply that a) DRAM is really slow these days when used as a random-access memory; it is really a paging device for getting blocks of data into and out of cache. b) As Andy wrote, write buffers can hide significant parts of the overhead, while an uncached load on a non-OoO core has to wait until everything arrives. The conclusion is simply that frame buffers like yours should never be read from, only written to. If you need to do _any_ kind of processing at all, it will be faster to double buffer, i.e. keep the working frame buffer in normal cacheable RAM, and only copy the finished screen image to the hardware frame buffer when everything is done. Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
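A sketch of the double buffering Terje describes, with invented names (shadow, hw_fb, the resolution): all rendering reads and read-modify-writes hit cacheable RAM, and the uncached frame buffer sees only one sequential, write-buffer-friendly copy per frame.

  #include <string.h>
  #include <stdint.h>

  #define FB_PIXELS (640 * 480)   /* assumed resolution, 16 bpp */

  static uint16_t shadow[FB_PIXELS];          /* working buffer in cacheable RAM */
  extern volatile uint16_t hw_fb[FB_PIXELS];  /* hypothetical uncached frame buffer */

  void present_frame(void)
  {
      /* ... draw into shadow[]: reads, blends, read-modify-writes all
       * run at cache speed ... */

      /* One big sequential copy per frame; these stores can be
       * buffered/pipelined, so the uncached penalty is paid only on
       * the write side. (The cast drops volatile for memcpy; fine
       * here since nothing else touches hw_fb concurrently.) */
      memcpy((void *)hw_fb, shadow, sizeof shadow);
  }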
From: Noob on 16 Apr 2010 11:41 Andy "Krazy" Glew wrote: > That's possible, but not so likely: most systems keep uncached > accesses separate and do not combine them, because uncached > accesses may be to memory-mapped I/O devices that have side effects. > > (I have designed systems that have two different types of uncached > memory, a UC-MMIO type that permits no optimizations, and a UC-Ordinary > type that permits optimizations. But I am not aware of anyone shipping > such a system. HPC guys often ask for it.) (Going off on a tangent) I thought there were many "types" of memory accesses? For example, the AMD64 architecture defines the following memory types with different properties: Uncacheable (UC) Cache Disable (CD) Write-Combining (WC) Write-Protect (WP) Writethrough (WT) Writeback (WB) > More likely, every time you do an uncached read it looks something like > this: > > Processor sends out address. > Wait many cycles while the address percolates through the processor, > across the bus, to DRAM. > Wait a few cycles while the DRAM responds. > Wait many cycles while the data percolates back. > Wait a few cycles while the processor handles the data. > Start the next load. > > Whereas with stores, it is: > Processor sends out address and data. > Store is buffered or pipelined. > Follow-up store follows close behind. OK. > You may be able to design microbenchmarks to distinguish store > pipelining from store buffering. E.g. a store buffer of 8-10 entries > might explain what you observe. Lemme see what the documentation says. "The ST40 cores all have some store buffering at their STBus interface to allow the CPU to continue executing whilst stores are written out in parallel. The degree of buffering varies between core families. In addition, the ST40 bus interface and/or the STBus interconnect may introduce re-ordering of stores or merging of multiple stores to the same quadword into one (so-called write-combining)." >> In my test, cache read bandwidth is 906 MB/s, >> while non-cached read bandwidth is 20 MB/s. >> >> 20 MB/s seems very low for DDR1 SDRAM, wouldn't you agree? >> Perhaps DRAM is not optimized for my artificial access pattern? >> (Always hitting the same word.) > > DRAM is *NOT* optimized for uncached accesses. > > With modern DRAM, the only way you can approach peak bandwidth is to use > burst accesses - typically cache line fills, but also possibly reads of > 512b/64B vectors, load-multiple-register instructions, etc. > > I.e. you must get a burst transfer either implicitly, via a cache line > fill or prefetch, or explicitly, via instructions that load more than > 4 bytes at a time. Apparently, this platform provides store queues. "The SQs are a pair of software-controlled write buffers. Software can load each buffer with 32 bytes of data, and then initiate a burst write of the buffer to memory. The CPU can continue to store data into one buffer whilst the other is being written out to memory, allowing efficient back-to-back operation for large data transfers." I'd like to have a way to burst reads, rather than writes, since that is the bottleneck in my situation. What is the "load" equivalent of a store queue? :-) Regards.
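Many cores have no true load-side analogue of the SQs; the usual substitutes are a DMA engine, or simply making each uncached load as wide as the bus interface allows, so the fixed per-transaction latency is amortized over more bytes. A rough sketch of the wide-load idea follows; whether the compiler and the ST40 bus interface actually turn each uint64_t access into a single 64-bit bus transaction is an assumption to verify against the bus interface manual.

  #include <stdint.h>
  #include <stddef.h>

  /* Hedged sketch: if every uncached load pays a full ~300-cycle round
   * trip, wider loads amortize it. Assumes 'src' and 'dst' are 8-byte
   * aligned, and that a uint64_t access becomes one bus read - verify
   * against the ST40 docs. A DMA engine, where available, is the real
   * load-side equivalent of the store queues. */
  void copy_from_uncached(uint32_t *dst, const volatile uint32_t *src, size_t words)
  {
      const volatile uint64_t *s = (const volatile uint64_t *)src;
      uint64_t *d = (uint64_t *)dst;
      size_t i;

      for (i = 0; i < words / 2; i++)
          d[i] = s[i];                       /* one wide load, one wide store */
      if (words & 1)
          dst[words - 1] = src[words - 1];   /* odd trailing word */
  }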
From: Noob on 16 Apr 2010 12:05 Brett Davis wrote: > Uncached memory generally means memory-mapped serial port > registers, etc. You do not want your CPU optimizing away > those writes. > There are generally several types of uncached memory, with > different rules and different performance: from hard volatile > for hardware registers, to write-only display lists whose > memory writes you do want the CPU to optimize. OK. > You didn't tell us which type(s) you were using or were given to use. ;) The documentation states: "Explicit control of re-ordering and combining for writes to the STBus: None. (The ST40 bus interface and the STBus are required to preserve all critical write-order properties.)"