From: Arne Vajhøj on
On 17-01-2010 18:20, John B. Matthews wrote:
> In article<4b539077$0$275$14726298(a)news.sunsite.dk>,
> Arne Vajhøj<arne(a)vajhoej.dk> wrote:
>
>> If you want the slides then you can find them at:
>
> http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-5496&yr=2009&track=javase
>
> Thanks! My favorite talking points (pp 68, 69):
>
> * Dominant operations
> 1985: page faults
> Locality is critical
> 1995: instructions executed
> Multiplies are expensive, loads are cheap
> Locality not so important
> 2005: cache misses
> Multiplies are cheap, loads are expensive!
> Locality is critical again!
>
> * We need to update our mental performance models as the hardware evolves
>
> * Unless you profile (deeply) you just don't know

I think he starts by saying that the slides have been slightly
modified compared to the JavaOne version.

But they look very similar to me.

Arne


From: Roedy Green on
On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
<nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who
said :

>* We need to update our mental performance models as the hardware evolves

I did not realise how important locality had become. A cache miss
going to RAM costs 200 to 300 clock cycles! This penalty dominates
everything else.
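
You can see the effect with a toy program like the one below: it walks
the same 64 MB array twice, once in memory order and once with a large
stride, so the only difference between the two loops is cache
behaviour. (A rough sketch only; the array size is my own choice, and
for trustworthy numbers you would want a real benchmark harness rather
than raw System.nanoTime.)

    public class LocalityDemo {
        static final int N = 4096;          // 4096 x 4096 ints = 64 MB
        static final int[][] a = new int[N][N];  // run with e.g. -Xmx256m

        static long rowMajor() {            // sequential, prefetch-friendly
            long sum = 0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    sum += a[i][j];
            return sum;
        }

        static long colMajor() {            // strided: roughly a miss per load
            long sum = 0;
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    sum += a[i][j];
            return sum;
        }

        public static void main(String[] args) {
            for (int w = 0; w < 3; w++) { rowMajor(); colMajor(); } // warm up
            long t0 = System.nanoTime(); long s1 = rowMajor();
            long t1 = System.nanoTime(); long s2 = colMajor();
            long t2 = System.nanoTime();
            System.out.println("row-major " + (t1 - t0) / 1000000
                + " ms, col-major " + (t2 - t1) / 1000000
                + " ms (sums " + s1 + "/" + s2 + ")");
        }
    }

Both loops execute exactly the same number of adds; only the order of
the loads differs.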

This suggests that interpretive code with a tight core might run
faster than "highly optimised" machine code since you could arrange
that the core of it was entirely in cache.

It also suggests FORTH-style coding with tiny methods and extreme
reusability would give you a speed boost because more of your code
could fit in cache. We are no longer trying to reduce the number of
instructions executed. We are trying to fit the entire program into
cache. Techniques like loop unraveling could be counterproductive
since they increase the size of the code.

Hyperthreading is a defence. If you have many hardware threads
running in the same CPU, when one thread blocks to fetch from RAM, the
other threads can keep going and keep multiple adders, instruction
decoders etc chugging.


--
Roedy Green Canadian Mind Products
http://mindprod.com
I decry the current tendency to seek patents on algorithms. There are better ways to earn a living than to prevent other people from making use of one's contributions to computer science.
~ Donald Ervin Knuth (born: 1938-01-10 age: 72)
From: Arne Vajhøj on
On 17-01-2010 22:10, Roedy Green wrote:
> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
> <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who
> said :
>> * We need to update our mental performance models as the hardware evolves
>
> I did not realise how important locality had become. A cache miss
> going to RAM costs 200 to 300 clock cycles! This penalty dominates
> everything else.

That is what he says.

Note though that for some problems the number of cache misses is
dictated by the data size: a working set bigger than the cache will
miss no matter how the code is laid out.

> This suggests that interpretive code with a tight core might run
> faster than "highly optimised" machine code since you could arrange
> that the core of it was entirely in cache.

Why?

The data fetched would still be the same.

And CPU-intensive inner loops seem more likely to fit into the
I-cache than the relevant part of the interpreter does.

> It also suggests FORTH-style coding with tiny methods and extreme
> reusability would give you a speed boost because more of your code
> could fit in cache. We are no longer trying to reduce the number of
> instructions executed. We are trying to fit the entire program into
> cache. Techniques like loop unraveling could be counterproductive
> since they increase the size of the code.

loop unraveling == loop unrolling?

With L1 caches in the 32-64 KB range (and L2 far larger), it takes a
lot of unrolled loops to fill up the I-cache.
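
For anyone who has not met the term: unrolling trades branches for
code size by doing several iterations' worth of work per pass. A
minimal sketch (it assumes the array length is a multiple of 4; a
real version needs a cleanup loop for the remainder):

    public class UnrollDemo {
        static int sumPlain(int[] a) {      // one loop branch per element
            int sum = 0;
            for (int i = 0; i < a.length; i++)
                sum += a[i];
            return sum;
        }

        static int sumUnrolled(int[] a) {   // one branch per 4 elements,
            int sum = 0;                    // ~4x the loop body in code size
            for (int i = 0; i < a.length; i += 4)
                sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
            return sum;
        }

        public static void main(String[] args) {
            int[] a = new int[1 << 20];
            java.util.Arrays.fill(a, 1);
            System.out.println(sumPlain(a) + " " + sumUnrolled(a));
        }
    }

And of course HotSpot already does this sort of thing on its own
where it judges it profitable.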

Arne


From: Peter Duniho on
John B. Matthews wrote:
> In article <4b539077$0$275$14726298(a)news.sunsite.dk>,
> Arne Vajhøj <arne(a)vajhoej.dk> wrote:
>
>> If you want the slides then you can find them at:
>
> http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-5496&yr=2009&track=javase
>
> Thanks! My favorite talking points (pp 68, 69):
>
> * Dominant operations
> 1985: page faults
> Locality is critical
> 1995: instructions executed
> Multiplies are expensive, loads are cheap
> Locality not so important

My recollection is that even in 1995, caching was starting to become
an issue. Page faulting was definitely important on bigger computers,
but desktop PCs didn't have to deal with that until well after 1985.

Somewhere around that time (but maybe post-1995…my recollection is
hazy), another big issue for desktop CPUs was branch prediction. That
is, having the code take the predicted path as often as possible, to
avoid having the pipeline flushed.
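
The textbook demonstration of that is still easy to reproduce: sum
only the elements above some threshold, first on random data and then
on the same data sorted. The work is identical; only the
predictability of the branch changes. (Sketch only; the sizes are
arbitrary, the JIT may compile the branch to a conditional move and
flatten the difference, and serious measurement wants a real harness.)

    import java.util.Arrays;
    import java.util.Random;

    public class BranchDemo {
        static long time(int[] data) {
            long sum = 0, t0 = System.nanoTime();
            for (int pass = 0; pass < 10; pass++)
                for (int x : data)
                    if (x >= 128)            // the interesting branch
                        sum += x;
            long ms = (System.nanoTime() - t0) / 1000000;
            if (sum == 42) System.out.println(); // keep sum live
            return ms;
        }

        public static void main(String[] args) {
            int[] data = new int[1 << 23];
            Random r = new Random(42);
            for (int i = 0; i < data.length; i++)
                data[i] = r.nextInt(256);
            time(data);                      // warm up the JIT
            long random = time(data);        // taken/not-taken at random
            Arrays.sort(data);
            long sorted = time(data);        // branch almost always predicted
            System.out.println("random " + random
                + " ms, sorted " + sorted + " ms");
        }
    }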

> 2005: cache misses
> Multiplies are cheap, loads are expensive!
> Locality is critical again!
>
> * We need to update our mental performance models as the hardware evolves
>
> * Unless you profile (deeply) you just don't know

Even profiling often will not uncover major architecture-dependent
issues, because so many are dependent on the exact hardware
configuration. Even in the '90s, you could see the same code running
on a 386, a 486, and a Pentium all on the same day. And of course
today we've got even more varieties of CPU architecture available at
retail, never mind that, unlike in the '90s, a 7- or 8-year-old CPU
can still run modern software just fine.

Profiling is definitely important for performance-critical code. It can
uncover lots of important architecture-independent problems. But it has
limited value in generalizing solutions for architecture-specific
issues. Only if you can restrict your installation to the same hardware
you used for profiling can you address those kinds of problems.

IMHO, the real take-away is that even with the older high-level
languages, and especially with systems like Java, hardware architecture
is actually irrelevant to most programmers. Only in very rare cases
will it matter, and for managed code it's much more important for one's
"mental performance model" to be aligned with the _virtual_ machine than
the actual one.

Pete
From: Peter Duniho on
Roedy Green wrote:
> [...]
> Hyperthreading is a defence. If you have many hardware threads
> running in the same CPU, when one thread blocks to fetch from RAM, the
> other threads can keep going and keep multiple adders, instruction
> decoders etc chugging.

Actually, hyperthreading (and even, on some architectures, multiple
cores) can make things worse.

I've read claims that Intel has improved things with the Nehalem
architecture. But the shared-cache design of early hyperthreaded
processors could easily cause naïve multi-threading implementations
to perform _much_ worse than a single-threaded implementation. That's
because multiple threads sharing the same entry point often ended up
with stack layouts identical to each other, which in turn caused
aliasing in the cache.

The two threads running simultaneously on the same CPU, sharing a
cache, would spend most of their time alternately trashing each
other's cached stack data and waiting for their own stack data to be
brought back into the cache from system RAM.
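
A small experiment in the same spirit is "false sharing": two threads
hammering on counters that happen to sit on the same cache line fight
over that line exactly the way the stack data does above, and padding
the counters apart makes the fight go away. (A rough sketch; field
layout is ultimately up to the JVM, so the padding here is a
best-effort assumption, and a proper harness is the right tool for
real numbers.)

    public class FalseSharingDemo {
        static class Shared {                // a and b likely share a line
            volatile long a, b;
        }
        static class Padded {
            volatile long a;
            long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding, never read
            volatile long b;
        }
        static final int ITERS = 50000000;

        static long race(Runnable one, Runnable two)
                throws InterruptedException {
            Thread t1 = new Thread(one), t2 = new Thread(two);
            long t0 = System.nanoTime();
            t1.start(); t2.start(); t1.join(); t2.join();
            return (System.nanoTime() - t0) / 1000000;
        }

        public static void main(String[] args) throws InterruptedException {
            final Shared s = new Shared();
            final Padded p = new Padded();
            long shared = race(
                new Runnable() { public void run() {
                    for (int i = 0; i < ITERS; i++) s.a++; } },
                new Runnable() { public void run() {
                    for (int i = 0; i < ITERS; i++) s.b++; } });
            long padded = race(
                new Runnable() { public void run() {
                    for (int i = 0; i < ITERS; i++) p.a++; } },
                new Runnable() { public void run() {
                    for (int i = 0; i < ITERS; i++) p.b++; } });
            System.out.println("same line " + shared
                + " ms, padded " + padded + " ms");
        }
    }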

Hyperthreading is far from a panacea, and I would not call it even a
defense. Specifically _because_ of how caching is so critical to
performance today, hyperthreading can cause huge performance problems on
certain CPUs, and even when it's used properly doesn't produce nearly as
big a benefit as actual multiple CPU cores would.

Pete