Benchmarking a toy example on SH-4 [Embedded]

Prev: Opinions wanted on career-limiting moves (<g>)
Next: Using AVR-GCC toolchain on Mac OS X

From: Tim McCaffrey on 16 Apr 2010 14:34

In article <hqa0e2$i90$1(a)speranza.aioe.org>, root(a)127.0.0.1 says...
>
>
>Apparently, this platform provides store queues.
>
>"The SQs are a pair of software-controlled write buffers. Software
> can load each buffer with 32 bytes of data, and then initiate a
> burst write of the buffer to memory. The CPU can continue to store
> data into one buffer whilst the other is being written out to memory,
>allowing efficient back-to-back operation for large data transfers."
>
>I'd like to have a way to burst reads, rather than writes, since
>that is the bottle-neck in my situation.
>
>What is the "load" equivalent of a store queue? :-)

Cache line reads.

Seriously, if you have a DMA or data mover device you can sometimes
offload the copy on to that, and it can move the data faster because it
has been optimized for such things (well, if it was done right).

Have you tried 8 back-to-back loads? With the Pentium III, the fastest
way to copy uncached memory was to do the copy in a burst from uncached
to cached, and then cached to uncached (the P4 fixed this with more
intelligent/aggressive prefetching).
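
A rough sketch of what "8 back-to-back loads" could look like in C. The
function name copy_burst8 is made up for illustration, and the idea that
eight 32-bit loads line up with one 32-byte burst is an assumption taken
from the store queue size quoted above. The compiler may still interleave
the stores with the loads, so check the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n_words 32-bit words from uncached src to dst.
   src is volatile so the compiler keeps all eight loads, in order. */
static void copy_burst8(uint32_t *dst, const volatile uint32_t *src,
                        size_t n_words)
{
    assert(n_words % 8 == 0);   /* sketch only handles whole groups of 8 */
    for (size_t i = 0; i < n_words; i += 8) {
        /* Issue the eight loads first ... */
        uint32_t a = src[i + 0], b = src[i + 1];
        uint32_t c = src[i + 2], d = src[i + 3];
        uint32_t e = src[i + 4], f = src[i + 5];
        uint32_t g = src[i + 6], h = src[i + 7];
        /* ... then the eight stores to the (cached) destination. */
        dst[i + 0] = a; dst[i + 1] = b;
        dst[i + 2] = c; dst[i + 3] = d;
        dst[i + 4] = e; dst[i + 5] = f;
        dst[i + 6] = g; dst[i + 7] = h;
    }
}
```

Whether this beats a plain loop depends on whether the bus interface
actually turns the grouped loads into a burst, which only a measurement
on the target will tell.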

- Tim

From: Brett Davis on 17 Apr 2010 19:49

In article <hqa1rd$kf9$1(a)speranza.aioe.org>, Noob <root(a)127.0.0.1>
wrote:
>
> > You didn't tell us which type(s) you were using or given to use. ;)
>
> The documentation states:
>
> "Explicit control of re-ordering and combining for writes
> to the STBus: None.
> (The ST40 bus interface and the STBus are required
> to preserve all critical write-order properties.)"

Out of context that could mean anything.
How about giving me a URL and page number, like page 68, section 4.3.5:
http://documentation.renesas.com/eng/products/mpumcu/rej09b0318_sh_4sm.pdf

Looks like it holds two sequential words, no mention of what it
will do if you write two sequential bytes. I bet it epic fails.
Test word writes versus byte writes, see if byte writes are slower
per write.
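
A minimal sketch of such a test, assuming a hosted C library with
clock(). All function names here are made up for illustration. On the
real target you would point the buffers at the uncached region and
probably read a hardware timer instead, since clock() may be too coarse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* volatile keeps the compiler from merging or dropping the stores. */
static void fill_bytes(volatile uint8_t *p, size_t n_bytes, uint8_t v)
{
    for (size_t i = 0; i < n_bytes; i++)
        p[i] = v;
}

static void fill_words(volatile uint32_t *p, size_t n_words, uint32_t v)
{
    for (size_t i = 0; i < n_words; i++)
        p[i] = v;
}

/* Average seconds per individual byte write over reps passes. */
static double secs_per_byte_write(volatile uint8_t *p, size_t n_bytes,
                                  int reps)
{
    assert(n_bytes > 0 && reps > 0);
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        fill_bytes(p, n_bytes, (uint8_t)r);
    return (double)(clock() - t0) / CLOCKS_PER_SEC
           / ((double)n_bytes * reps);
}

/* Average seconds per individual 32-bit word write over reps passes. */
static double secs_per_word_write(volatile uint32_t *p, size_t n_words,
                                  int reps)
{
    assert(n_words > 0 && reps > 0);
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        fill_words(p, n_words, (uint32_t)r);
    return (double)(clock() - t0) / CLOCKS_PER_SEC
           / ((double)n_words * reps);
}
```

If the per-write cost of bytes comes out no better than the per-write
cost of words, the buffer is probably not combining byte writes.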

To Renesas: I made the mistake of clicking on a specific CPU,
and you cannot find a software manual there, instead you get hundreds
of hardware tech sheets. Grrr...

Most of these CPUs have scratchpad memory and as many as 6 DMA channels.

To Renesas: You should at least drop a hint of the scratchpad RAM
and DMA channels in the software manual...
Drop it in section 1.1 as optional features.

If you set up DMA transfers to scratchpad for your data you can get
a 50x speed improvement, making most of these other optimization
suggestions look like a silly waste of time.

A CPU has its hands tied behind its back when dealing with uncached
RAM; the DMA controller is your friend and saviour.
It also works well to speed up ordinary batch reads from DRAM.

Brett

From: "Andy "Krazy" Glew" on 16 Apr 2010 23:06

On 4/16/2010 8:41 AM, Noob wrote:
> Andy "Krazy" Glew wrote:
>
>> That's possible. But not so likely, since most systems keep uncached
>> accesses separate, do not combine them, because the uncached memory
>> accesses may be to memory mapped I/O devices that have side effects.
>>
>> (I have designed systems that have two different types of uncached
>> memory, a UC-MMIO type that permits no optimizations, and a UC-Ordinary
>> type that permits optimizations. But I am not aware of anyone shipping
>> such a system. HPC guys often ask for it.)
>
> (Going off on a tangent)
>
> I thought there were many "types" of memory accesses?
>
> For example, the AMD64 architecture defines the following "memory types"
> with different properties.
>
> Uncacheable (UC)
> Cache Disable (CD)
> Write-Combining (WC)
> Write-Protect (WP)
> Writethrough (WT)
> Writeback (WB)

Amusingly, I defined those types for Intel P6.

UC WP WT WB already existed, but outside the CPU. I invented the MTRRs (not one of my favorite things) to hold the
memory types internal.

I invented the WC memory type. Along with a number of memory types that got cut. Included UC-MMIO and UC-MEMORY. Also,
RC, FB, ....

Hmm... I have not seen the CD memory type before. Looks like they added one when I wasn't looking. Maybe it is
UC-MEMORY, and the old UC is UC-MMIO? I can only hope so.

(I actually wanted the memory type to be a bitmask, with features like
speculative loads allowed
burst
cache in L1, L2, ....
writeback/writethrough
etc.
Validation hated that idea.)
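
To make the bitmask idea concrete, here is a small sketch in C. The flag
names, the bit assignments, and the way the named types are composed
from them are all invented for illustration; this is not the encoding of
any shipping CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-feature attribute bits. */
enum mem_attr {
    MA_SPEC_LOAD = 1u << 0,  /* speculative loads allowed */
    MA_BURST     = 1u << 1,  /* burst / combining allowed */
    MA_CACHE_L1  = 1u << 2,  /* may be cached in L1 */
    MA_CACHE_L2  = 1u << 3,  /* may be cached in L2 */
    MA_WRITEBACK = 1u << 4   /* set = writeback, clear = writethrough */
};

typedef uint32_t mem_type_t;

/* Assumed compositions, for illustration only. */
#define MT_UC_MMIO   ((mem_type_t)0)
#define MT_UC_MEMORY ((mem_type_t)(MA_SPEC_LOAD | MA_BURST))
#define MT_WT        ((mem_type_t)(MA_SPEC_LOAD | MA_BURST | \
                                   MA_CACHE_L1 | MA_CACHE_L2))
#define MT_WB        ((mem_type_t)(MT_WT | MA_WRITEBACK))

/* True if type t permits every attribute in attrs. */
static int mem_type_allows(mem_type_t t, mem_type_t attrs)
{
    return (t & attrs) == attrs;
}

/* One validation rule: writeback makes no sense without a cache level. */
static int mem_type_valid(mem_type_t t)
{
    if ((t & MA_WRITEBACK) && !(t & (MA_CACHE_L1 | MA_CACHE_L2)))
        return 0;
    return 1;
}
```

Five independent bits give 32 combinations, and mem_type_valid() hints
at why validation would object: each rule that excludes a combination
has to be specified and tested.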

> "The ST40 cores all have some store buffering at their STBus interface
> to allow the CPU to
> continue executing whilst stores are written out in parallel. The degree
> of buffering varies
> between core families. In addition, the ST40 bus interface and/or the
> STBus interconnect
> may introduce re-ordering of stores or merging of multiple stores to the
> same quadword into
> one (so-called write-combining)."

They don't say anything about write combining stores to the same address,
but I think they are. Test with sequential stores vs random stores.

If sequential stores are slower, then ...

>
> Apparently, this platform provides store queues.
>
> "The SQs are a pair of software-controlled write buffers. Software can
> load each buffer with
> 32 bytes of data, and then initiate a burst write of the buffer to
> memory. The CPU can
> continue to store data into one buffer whilst the other is being written
> out to memory,
> allowing efficient back-to-back operation for large data transfers."
>
> I'd like to have a way to burst reads, rather than writes, since
> that is the bottle-neck in my situation.
>
> What is the "load" equivalent of a store queue? :-)

a) a load into a single cache line sized register - like a LRB 512b / 64B register

b) a load into multiple registers - there is often a "load multiple register" command

c) sometimes a special PREFETCH instruction that loads into a buffer, that you then read out of 32 or 64b at a time

d) Intel just added a godawful SSE streaming load, that does much of the above.

e) sometimes you have a DMA engine that can do a burst read from UC memory, and write to cacheable memory.

I prefer explicit software - a) or b)

Tell us if SH has any of the above.
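
For c), the closest thing portable C offers is a compiler prefetch hint.
A minimal sketch, assuming GCC or Clang (__builtin_prefetch is their
builtin) and assuming a 32-byte line; the function name is made up. Note
that this prefetches into the cache, not into a separate buffer, so it
can only help for cacheable memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_WORDS 8u   /* assumed: 32-byte line / 4-byte words */

/* Sum n_words words, hinting the next line while working on this one. */
static uint32_t sum_with_prefetch(const uint32_t *p, size_t n_words)
{
    assert(p != NULL || n_words == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n_words; i++) {
        /* At the start of each line, ask for the following line
           (read access, low temporal locality). */
        if (i % LINE_WORDS == 0 && i + LINE_WORDS < n_words)
            __builtin_prefetch(&p[i + LINE_WORDS], 0, 0);
        sum += p[i];
    }
    return sum;
}
```

The hint is advisory; the compiler or the hardware may ignore it.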

(I think I should add that to the comp-arch.net FAQ)

From: "Andy "Krazy" Glew" on 17 Apr 2010 23:58

On 4/16/2010 8:06 PM, Andy "Krazy" Glew wrote:
> On 4/16/2010 8:41 AM, Noob wrote:
>> Andy "Krazy" Glew wrote:
>>
>>> That's possible. But not so likely, since most systems keep uncached
>>> accesses separate, do not combine them, because the uncached memory
>>> accesses may be to memory mapped I/O devices that have side effects.
>>>
>>> (I have designed systems that have two different types of uncached
>>> memory, a UC-MMIO type that permits no optimizations, and a UC-Ordinary
>>> type that permits optimizations. But I am not aware of anyone shipping
>>> such a system. HPC guys often ask for it.)
>>
>> (Going off on a tangent)
>>
>> I thought there were many "types" of memory accesses?
>>
>> For example, the AMD64 architecture defines the following "memory types"
>> with different properties.
>>
>> Uncacheable (UC)
>> Cache Disable (CD)
>> Write-Combining (WC)
>> Write-Protect (WP)
>> Writethrough (WT)
>> Writeback (WB)
>
> Amusingly, I defined those types for Intel P6.
>
> UC WP WT WB already existed, but outside the CPU. I invented the MTRRs
> (not one of my favorite things) to hold the memory types internal.
>
> I invented the WC memory type. Along with a number of memory types that
> got cut. Included UC-MMIO and UC-MEMORY. Also, RC, FB, ....
>
> Hmm... I have not seen the CD memory type before. Looks like they added
> one when I wasn't looking. Maybe it is UC-MEMORY, and the old UC is
> UC-MMIO? I can only hope so.

I found the CD "memory type" in the AMD manual, http://support.amd.com/us/Processor_TechDocs/24593.pdf, excerpted at
bottom of this post.

I quote "memory type", because it is not really a memory type that can be stored in the MTRRs or PAT. It arises from
the CR0.CD=1 control register bit setting.

AMD says that CD memory may have been cached due to earlier cacheable access, or due to virtual address aliasing.

If you think about it, however, that may also arise with regular UC memory. *May*. Should not, if the OS has managed
the caches correctly; but may, because bugs happen.

So basically CD is UC, that snoops the cache. From which I assume that UC does not snoop the cache on at least some AMD
systems.

Does UC snoop the cache on Intel systems? Validation people would definitely prefer that it did not. Snooping would
waste power and potentially hurt performance. But it might maintain correctness. The downsides might be mitigated by
directories.

My druthers would be to have a separate "snoop" bit. You might create an uncached-but-snoopable memory type, to be used
for cache management - sometimes a data structure would be accessed cacheably, sometimes not. I would rather have
something like a per-memory-access instruction prefix that said "don't cache the next load". Failing that, however, you
can set up aliases for virtual memory, and arrange things so that you simply need to OR (ADD) in a particular base
address to get the uncached version of an address.

My druthers arise from the fact that I am a computer architect who is also a performance programmer. If I am a
performance programmer, I want to be able to control cacheability.

However, most computer architects are more sympathetic to validation concerns than they are to performance programmers.
Validation wants to eliminate cases, even if it makes the system less regular. (I call this introduction of
irregularity in the usually forlorn hope of reducing validation complexity "Pulling a Nabeel" after a validator who
practiced it. IMHO it is better to learn about proper experimental design, e.g. Latin Squares, as a way of reducing
validation complexity by significantly higher degrees.)
Such not-very-sophisticated validation-driven computer architecture tends to want to say "The OS shall not allow
aliasing of memory types, whether temporal or spatial." I.e. a given physical memory address should not be in the cache
as a result of an earlier cacheable access when it is now globally uncacheable (temporal aliasing). Similarly for
virtual address aliasing.

I think this is shortsighted.

a) because performance programmers really do want to be able to practice aliasing

b) because bugs in OSes happen - aliasing happens.

It is especially shortsighted if validation, or, worse, the cache designer takes advantage of this decreed but not
enforced prohibition of aliasing, and does something damaging. Like, causing data to become incoherent in weird ways.
Not so bad if they do something like causing a machine check if aliasing is detected, e.g. if a UC access hits in a
cache. Mainly because you will quickly learn how common such issues are. But silently corrupting the system - not so good.

Glew's morals:

a) Aliasing of memory types happens, both temporal and spatial. Live with it. Better yet, take advantage of it.

b) Orthogonality is good. Consider a separate snoop bit.

But all this is water under the bridge.

--

Unfortunately, AMD's CD memory type does not seem to be a UC-MEMORY vs. UC-MMIO type.

Actually, WC is in many ways a UC-MEMORY type. Although it goes further than UC-MEMORY, allowing stores to be out of order.

--

Intel's UC- memory type is, again, not really a different semantic memory type. It is just an encoding trick, allowing
WC to optionally override UC in the MTRRs and PAT.

---

7.4 Memory Types
The AMD64 architecture defines the following memory types:
o Uncacheable (UC): Reads from, and writes to, UC memory are not cacheable. Reads from UC
memory cannot be speculative. Write-combining to UC memory is not allowed. Reads from or
writes to UC memory cause the write buffers to be written to memory and be invalidated prior to
the access to UC memory.
The UC memory type is useful for memory-mapped I/O devices where strict ordering of reads and
writes is important.
o Cache Disable (CD): The CD memory type is a form of uncacheable memory type that occurs
when caches are disabled (CR0.CD=1). With CD memory, it is possible for the address to be
cached due to an earlier cacheable access, or due to two virtual-addresses aliasing to a single
physical address.
For the L1 data cache and the L2 cache, reads from, and writes to, CD memory that hit the cache
cause the cache line to be invalidated before accessing main memory. If the cache line is in the
modified state, the line is written to main memory and then invalidated.
For the L1 instruction cache, reads from CD memory that hit the cache read the cached instructions
rather than access main memory. Reads that miss the cache access main memory and do not cause
cache-line replacement.

From: Noob on 22 Apr 2010 05:16

Andy "Krazy" Glew wrote:

> I would rather have something like a per-memory-access instruction prefix
> that said "don't cache the next load". Failing that, however, you can
> set up aliases for virtual memory, and arrange things so that you simply
> need to OR (ADD) in a particular base address to get the uncached
> version of an address.

AFAIU, my system works along the lines of your latter description.

There's a 29-bit physical address space, and the "top" 3 bits in a
virtual address define an "address region" (P0-P4).

if b31 == 0
then region P0
else
region P(b30*2+b29+1)
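
The same decode written out in C, as a quick sanity check of the
formula. The function name sh4_region is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Return the region number (0..4 for P0..P4) of a 32-bit virtual address. */
static unsigned sh4_region(uint32_t va)
{
    unsigned b31 = (va >> 31) & 1u;
    unsigned b30 = (va >> 30) & 1u;
    unsigned b29 = (va >> 29) & 1u;

    if (b31 == 0)
        return 0;                 /* P0: the whole lower 2 GB */
    return b30 * 2u + b29 + 1u;   /* P1..P4: 512 MB each */
}
```

So P0 covers 0x0000 0000 to 0x7FFF FFFF, and P1 through P4 start at
0x8000 0000, 0xA000 0000, 0xC000 0000 and 0xE000 0000 respectively.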

<quote>
Mask to 29-bit

The physical address is given by taking the virtual address and
replacing bits [31:29] by 3 zero bits. This gives a physical address
in the range 0x0000 0000 to 0x1FFF FFFF. Only physical addresses in
the range 0x0000 0000 to 0x1BFF FFFF may be safely accessed through a
virtual address that is handled by masking to 29 bits. If masking
gives a physical address in the range 0x1C00 0000 to 0x1FFF FFFF,
the behavior of the ST40 is undefined; the sole exception is for
accesses to the operand cache RAM mode area 0x7C00 0000 to 0x7FFF
FFFF when CCR.ORA=1.

The physical address range 0x1C00 0000 to 0x1FFF FFFF must only be
accessed either:
o through a P4 virtual address or
o for the range 0x1D00 0000 to 0x1FFF FFFF, by setting MMUCR.AT=1
and using an address translation in the UTLB
</quote>
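
A small sketch of the masking rule and of the "OR in a base address"
trick. The function names are made up. The 0xA000 0000 base assumes that
P2 is the untranslated, uncached window, which is how I understand the
SH-4 layout but is not stated in the quote above, so check the manual
for your part. The CCR.ORA=1 exception is ignored here.

```c
#include <assert.h>
#include <stdint.h>

#define PHYS_MASK         0x1FFFFFFFu  /* keep bits [28:0] */
#define SAFE_MASKED_LIMIT 0x1BFFFFFFu  /* top of the safely masked range */
#define P2_BASE           0xA0000000u  /* assumed: uncached window */

/* Replace bits [31:29] of the virtual address by zeros. */
static uint32_t mask_to_29bit(uint32_t va)
{
    return va & PHYS_MASK;
}

/* True if the masked address stays inside 0x0000 0000..0x1BFF FFFF. */
static int masked_access_is_safe(uint32_t va)
{
    return mask_to_29bit(va) <= SAFE_MASKED_LIMIT;
}

/* Uncached alias of the same physical address (assumes P2 is uncached). */
static uint32_t uncached_alias(uint32_t va)
{
    assert(masked_access_is_safe(va));
    return mask_to_29bit(va) | P2_BASE;
}
```

For example, a cached pointer 0x8C00 1000 and its alias 0xAC00 1000 both
mask to physical 0x0C00 1000.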

Regards.
