From: Noob on 15 Apr 2010 08:03

Noob wrote:

> SUMMARY
> loadloop takes
>  7.0 cycles/iteration when LOADING a cached word
> 31.6 cycles/iteration when LOADING a non-cached word
>
> TODO: look at store performance

[...]

        .global _storeloop
        .align 5
_storeloop:
        mov.l @r5,r0    /*** READ TIMESTAMP COUNTER ***/
        nop
.L3:
        dt r4
        mov.l r0,@r6
        mov.l r0,@r6
        mov.l r0,@r6
        mov.l r0,@r6
        mov.l r0,@r6
        mov.l r0,@r6
        bf .L3          /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
        mov.l @r5,r1    /*** READ TIMESTAMP COUNTER ***/
        rts
        sub r1,r0       /*** DELAY SLOT ***/

SUMMARY
storeloop takes
 7.0 cycles/iteration when STORING a cached word
37.4 cycles/iteration when STORING a non-cached word

I didn't expect non-cached stores to be 20% slower than
non-cached loads, while cached stores and cached loads
run at the same speed. What might explain that?

Regards.
From: Niels Jørgen Kruse on 15 Apr 2010 08:44

Noob <root(a)127.0.0.1> wrote:

> I didn't expect non-cached stores to be 20% slower than
> non-cached loads, while cached stores and cached loads
> run at the same speed. What might explain that?

The reads benefit from DRAM page hits.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
From: Noob on 15 Apr 2010 09:07

Noob wrote:

> loadloop takes
>  7.0 cycles/iteration when LOADING a cached word
> 31.6 cycles/iteration when LOADING a non-cached word
>
> storeloop takes
>  7.0 cycles/iteration when STORING a cached word
> 37.4 cycles/iteration when STORING a non-cached word

Next, a pointer chasing loop.

_chaseptr:
        mov.l @r5,r0    /*** READ TIMESTAMP COUNTER ***/
        nop
.L4:
        dt r4
        mov.l @r6,r6
        mov.l @r6,r6
        mov.l @r6,r6
        mov.l @r6,r6
        mov.l @r6,r6
        mov.l @r6,r6
        bf .L4          /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
        mov.l @r5,r1    /*** READ TIMESTAMP COUNTER ***/
        rts
        sub r1,r0       /*** DELAY SLOT ***/

chaseptr takes
12.0 cycles/iteration when working with cached memory
31.6 cycles/iteration when working with non-cached memory

6 loads per iteration; 2-cycle latency on a cache hit,
thus 12 cycles per iteration.
(dt and bf basically come "for free".)

chaseptr is (marginally) faster than loadloop (31.56 vs 31.58)
when working with non-cached memory, which is slightly
counter-intuitive. (The difference might be insignificant,
but it is systematic.)

Regards.
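The cached figures in this post can be reproduced with a very simple cycle model. The sketch below is only an illustration: the function names are made up, and the one-cycle issue cost per independent access is an assumption inferred from the 7.0-cycle figure, not something stated in the thread.

```python
def dependent_chain_cycles(loads_per_iteration, load_latency_cycles):
    """Cycles per iteration when every load needs the result of the previous one.

    This is the chaseptr case: each mov.l @r6,r6 must wait for r6,
    so the full load latency is paid for every load.
    """
    return loads_per_iteration * load_latency_cycles


def independent_loop_cycles(accesses_per_iteration,
                            issue_cycles=1,          # assumption: one access issued per cycle
                            taken_branch_stall=1):   # from the comment on the bf instruction
    """Cycles per iteration when the accesses do not depend on each other."""
    return accesses_per_iteration * issue_cycles + taken_branch_stall


chase_cached = dependent_chain_cycles(6, 2)   # 6 loads, 2-cycle cache-hit latency
store_cached = independent_loop_cycles(6)     # 6 stores plus the taken-branch stall

print(chase_cached, store_cached)
```

Under these assumptions the model gives 12 cycles for chaseptr and 7 cycles for the loops whose accesses are independent, which matches the cached measurements. It says nothing about the non-cached case, where memory latency dominates.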
From: Noob on 15 Apr 2010 10:57

Noob wrote:

> loadloop takes
>  7.0 cycles/iteration when LOADING a cached word
> 31.6 cycles/iteration when LOADING a non-cached word
>
> storeloop takes
>  7.0 cycles/iteration when STORING a cached word
> 37.4 cycles/iteration when STORING a non-cached word
>
> chaseptr takes
> 12.0 cycles/iteration when working with cached memory
> 31.6 cycles/iteration when working with non-cached memory

Epic fail. I got the numbers for non-cached loads wrong
by an order of magnitude.

loadloop takes
  7 cycles/iteration when LOADING a cached word
316 cycles/iteration when LOADING a non-cached word
^^^

storeloop takes
 7.0 cycles/iteration when STORING a cached word
37.4 cycles/iteration when STORING a non-cached word

chaseptr takes
 12 cycles/iteration when working with cached memory
316 cycles/iteration when working with non-cached memory
^^^

I'm now trying to understand why reading from non-cached
memory is so much slower than writing.

Is the CPU optimizing some (most) of my writes away because
I keep writing to the same address?

In my test, cache read bandwidth is 906 MB/s, while
non-cached read bandwidth is 20 MB/s.

20 MB/s seems very low for DDR1 SDRAM, wouldn't you agree?

Perhaps DRAM is not optimized for my artificial access pattern?
(Always hitting the same word.)

Regards.
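The two bandwidth figures are consistent with the cycle counts, which can be checked without knowing the CPU clock. The sketch below is a rough cross-check only; it assumes loadloop performs six 4-byte loads per iteration like the other loops shown, and that MB means 10^6 bytes. The function and constant names are hypothetical.

```python
BYTES_PER_ACCESS = 4          # mov.l transfers one 32-bit word
ACCESSES_PER_ITERATION = 6    # assumption: loadloop is unrolled 6 times like the others


def scaled_bandwidth(ref_bandwidth, ref_cycles, cycles):
    """Bandwidth at `cycles` per iteration, scaled from a reference measurement.

    Valid only if the clock and the bytes moved per iteration are unchanged.
    """
    return ref_bandwidth * ref_cycles / cycles


def implied_clock_mhz(bandwidth_mb_s, cycles_per_iteration):
    """Clock frequency (MHz) implied by a bandwidth and a cycles/iteration figure."""
    bytes_per_iteration = BYTES_PER_ACCESS * ACCESSES_PER_ITERATION
    return bandwidth_mb_s * cycles_per_iteration / bytes_per_iteration


noncached_read = scaled_bandwidth(906, 7, 316)     # from 7 vs 316 cycles/iteration
noncached_write = scaled_bandwidth(906, 7, 37.4)   # from 7 vs 37.4 cycles/iteration
clock_mhz = implied_clock_mhz(906, 7)

print(f"non-cached read:  {noncached_read:.1f} MB/s")
print(f"non-cached write: {noncached_write:.1f} MB/s")
print(f"implied clock:    {clock_mhz:.2f} MHz")
print(f"cycles per non-cached load:  {316 / ACCESSES_PER_ITERATION:.1f}")
print(f"cycles per non-cached store: {37.4 / ACCESSES_PER_ITERATION:.1f}")
```

Scaling 906 MB/s by 7/316 gives about 20 MB/s, matching the reported non-cached read bandwidth. The write figure and the implied clock are derived values under the assumptions above, not measurements from the thread.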
From: Noob on 15 Apr 2010 11:48

Noob wrote:

> loadloop takes
>   7 cycles/iteration when LOADING a cached word
> 316 cycles/iteration when LOADING a non-cached word
>
> storeloop takes
>  7.0 cycles/iteration when STORING a cached word
> 37.4 cycles/iteration when STORING a non-cached word
>
> chaseptr takes
>  12 cycles/iteration when working with cached memory
> 316 cycles/iteration when working with non-cached memory
>
> I'm now trying to understand why reading from non-cached
> memory is so much slower than writing.
>
> Is the CPU optimizing some (most) of my writes away because
> I keep writing to the same address?

There is no difference between writing to contiguous words
and writing to the same word over and over again.

arrstore
 7.0 cycles/iteration when STORING to cached memory
37.4 cycles/iteration when STORING to non-cached memory

_arrstore:
        mov.l @r5,r0    /*** READ TIMESTAMP COUNTER ***/
        nop
.L5:
        dt r4
        mov.l r0,@( 0,r6)
        mov.l r0,@( 4,r6)
        mov.l r0,@( 8,r6)
        mov.l r0,@(12,r6)
        mov.l r0,@(16,r6)
        mov.l r0,@(20,r6)
        bf .L5          /*** ONE-CYCLE STALL WHEN BRANCH IS TAKEN ***/
        mov.l @r5,r1    /*** READ TIMESTAMP COUNTER ***/
        rts
        sub r1,r0       /*** DELAY SLOT ***/

I am perplexed.