From: Arne Vajhøj on 17 Jan 2010 18:39

On 17-01-2010 18:20, John B. Matthews wrote:
> In article <4b539077$0$275$14726298(a)news.sunsite.dk>,
> Arne Vajhøj <arne(a)vajhoej.dk> wrote:
>
>> If you want the slides then you can find them at:
>
> http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-5496&yr=2009&track=javase
>
> Thanks! My favorite talking points (pp 68, 69):
>
> * Dominant operations
>     1985: page faults
>       Locality is critical
>     1995: instructions executed
>       Multiplies are expensive, loads are cheap
>       Locality not so important
>     2005: cache misses
>       Multiplies are cheap, loads are expensive!
>       Locality is critical again!
>
> * We need to update our mental performance models as the hardware evolves
>
> * Unless you profile (deeply) you just don't know

I think he starts by saying that the slides have been slightly modified
compared to the JavaOne version. But they look very similar to me.

Arne
From: Roedy Green on 17 Jan 2010 22:10

On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
<nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who
said :

>* We need to update our mental performance models as the hardware evolves

I did not realise how important locality had become. A cache miss
going to RAM costs 200 to 300 clock cycles! This penalty dominates
everything else.

This suggests that interpretive code with a tight core might run
faster than "highly optimised" machine code since you could arrange
that the core of it was entirely in cache.

It also suggests FORTH-style coding with tiny methods and extreme
reusability would give you a speed boost because more of your code could
fit in cache. We are no longer trying to reduce the number of
instructions executed. We are trying to fit the entire program into
cache. Techniques like loop unraveling could be counterproductive
since they increase the size of the code.

Hyperthreading is a defence. If you have many hardware threads
running in the same CPU, when one thread blocks to fetch from RAM, the
other threads can keep going and keep multiple adders, instruction
decoders etc. chugging.
--
Roedy Green Canadian Mind Products
http://mindprod.com

I decry the current tendency to seek patents on algorithms. There are
better ways to earn a living than to prevent other people from making
use of one's contributions to computer science.
~ Donald Ervin Knuth (born: 1938-01-10 age: 72)
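[The "loop unraveling" (unrolling) trade-off Roedy mentions is easiest to see side by side. Below is a minimal Java sketch; the class and method names and the 4x unroll factor are made up for illustration. It is not from the talk, only a sketch of the technique: the unrolled version executes fewer loop-control instructions but occupies roughly four times as much loop-body code, which is exactly the I-cache pressure being discussed.]

public class UnrollSketch {

    // Straightforward loop: small code footprint, one compare/branch per element.
    static long sumPlain(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Manually unrolled by 4: less loop overhead executed, but about four
    // times as much loop-body code competing for instruction cache.
    static long sumUnrolled(int[] a) {
        long sum = 0;
        int i = 0;
        int limit = a.length - 3;
        for (; i < limit; i += 4) {
            sum += a[i];
            sum += a[i + 1];
            sum += a[i + 2];
            sum += a[i + 3];
        }
        for (; i < a.length; i++) {   // leftover elements
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1000000];
        java.util.Arrays.fill(data, 1);
        System.out.println(sumPlain(data) + " " + sumUnrolled(data));
    }
}

[Whether the unrolled version wins or loses depends on how much other code is competing for the I-cache, which is the point of the thread.]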
From: Arne Vajhøj on 17 Jan 2010 22:42

On 17-01-2010 22:10, Roedy Green wrote:
> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
> <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who
> said :
>> * We need to update our mental performance models as the hardware evolves
>
> I did not realise how important locality had become. A cache miss
> going to RAM costs 200 to 300 clock cycles! This penalty dominates
> everything else.

That is what he says.

Note though that for some problems the number of cache misses is
determined by the data sizes.

> This suggests that interpretive code with a tight core might run
> faster than "highly optimised" machine code since you could arrange
> that the core of it was entirely in cache.

Why? The data fetched would still be the same.

And CPU-intensive code like inner loops seems more likely to fit into
the I-cache than the relevant part of the interpreter.

> It also suggests FORTH-style coding with tiny methods and extreme
> reusability would give you a speed boost because more of your code could
> fit in cache. We are no longer trying to reduce the number of
> instructions executed. We are trying to fit the entire program into
> cache. Techniques like loop unraveling could be counterproductive
> since they increase the size of the code.

loop unraveling == loop unrolling?

With L1 cache in the 128-256KB size, it requires a lot of unrolled
loops to fill up the I-cache.

Arne
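[Arne's objection is about the data side of the cache: no matter how compact the executing code is, the data still has to come through the cache, and how it is traversed dominates. A minimal Java sketch of that point, with a made-up class name and array size; on typical hardware the column-order walk touches a new cache line on nearly every access and runs several times slower, even though the arithmetic is identical.]

public class LocalitySketch {

    static final int N = 4096;
    static final int[][] grid = new int[N][N];   // ~64 MB of data

    // Row-major walk: consecutive accesses fall on the same cache line.
    static long sumByRows() {
        long sum = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += grid[row][col];
        return sum;
    }

    // Column-major walk: each access jumps to a different row array,
    // so most accesses miss the cache even though the work is the same.
    static long sumByColumns() {
        long sum = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += grid[row][col];
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        sumByRows();
        long t1 = System.nanoTime();
        sumByColumns();
        long t2 = System.nanoTime();
        System.out.printf("rows: %d ms, columns: %d ms%n",
                (t1 - t0) / 1000000, (t2 - t1) / 1000000);
    }
}

[An interpreter would change neither traversal, which is why a "tight core" does not help with data-driven cache misses.]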
From: Peter Duniho on 17 Jan 2010 23:56

John B. Matthews wrote:
> In article <4b539077$0$275$14726298(a)news.sunsite.dk>,
> Arne Vajhøj <arne(a)vajhoej.dk> wrote:
>
>> If you want the slides then you can find them at:
>
> http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-5496&yr=2009&track=javase
>
> Thanks! My favorite talking points (pp 68, 69):
>
> * Dominant operations
>     1985: page faults
>       Locality is critical
>     1995: instructions executed
>       Multiplies are expensive, loads are cheap
>       Locality not so important

My recollection is that even in 1995, caching was starting to become an
issue. Page faulting was definitely important on bigger computers, but
desktop PCs didn't start having to deal with that until later than 1985.

Somewhere around that time (but maybe post-1995…my recollection is
hazy), another big issue for desktop CPUs was branch prediction. That
is, having the code take the predicted path as often as possible, to
avoid having the pipeline flushed.

>     2005: cache misses
>       Multiplies are cheap, loads are expensive!
>       Locality is critical again!
>
> * We need to update our mental performance models as the hardware evolves
>
> * Unless you profile (deeply) you just don't know

Even profiling often will not uncover major architecture-dependent
issues, because so many are dependent on the exact hardware
configuration. Even in the 90's, you could see the same code running on
a 386, 486, and Pentium all on the same day. And of course today we've
got even more varieties of CPU architecture available at retail, never
mind the fact that, unlike in the 90's, a 7- or 8-year-old CPU can still
run modern software just fine.

Profiling is definitely important for performance-critical code. It can
uncover lots of important architecture-independent problems. But it has
limited value in generalizing solutions for architecture-specific
issues. Only if you can restrict your installation to the same hardware
you used for profiling can you address those kinds of problems.

IMHO, the real take-away is that even with the older high-level
languages, and especially with systems like Java, hardware architecture
is actually irrelevant to most programmers. Only in very rare cases will
it matter, and for managed code it's much more important for one's
"mental performance model" to be aligned with the _virtual_ machine than
the actual one.

Pete
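[The branch-prediction effect Pete mentions can be seen even from Java by running the same data-dependent branch over unsorted and sorted input. A rough sketch follows; the class name, array size, and the 128 threshold are arbitrary, timing with System.nanoTime() is crude (a harness such as JMH would be better), and on some JITs the branch may be compiled to branch-free code, in which case the gap disappears, which rather reinforces his point about architecture- and JVM-specific behaviour.]

import java.util.Arrays;
import java.util.Random;

public class BranchSketch {

    // On unsorted data this branch goes either way essentially at random;
    // on sorted data it becomes perfectly predictable.
    static long countLarge(int[] data) {
        long count = 0;
        for (int v : data) {
            if (v >= 128) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] unsorted = new int[10000000];
        for (int i = 0; i < unsorted.length; i++) {
            unsorted[i] = rnd.nextInt(256);
        }
        int[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        // Crude JIT warm-up so both timed calls run compiled code.
        for (int warm = 0; warm < 5; warm++) {
            countLarge(unsorted);
            countLarge(sorted);
        }

        long t0 = System.nanoTime();
        long c1 = countLarge(unsorted);
        long t1 = System.nanoTime();
        long c2 = countLarge(sorted);
        long t2 = System.nanoTime();
        System.out.printf("unsorted: %d ms, sorted: %d ms (counts %d/%d)%n",
                (t1 - t0) / 1000000, (t2 - t1) / 1000000, c1, c2);
    }
}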
From: Peter Duniho on 17 Jan 2010 23:56
Roedy Green wrote:
> [...]
> Hyperthreading is a defence. If you have many hardware threads
> running in the same CPU, when one thread blocks to fetch from RAM, the
> other threads can keep going and keep multiple adders, instruction
> decoders etc. chugging.

Actually, hyperthreading and even, in some architectures, multi-core
CPUs can make things worse.

I've read claims that Intel has improved things with the Nehalem
architecture. But the shared-cache design of early hyperthreaded
processors could easily cause naïve multi-threading implementations to
perform _much_ worse than a single-threaded implementation.

That's because having multiple threads all with the same entry point
caused those threads to often operate with a stack layout identical to
each other, which in turn caused aliasing in the cache. The two threads
running simultaneously on the same CPU, sharing a cache, would spend
most of their time alternately trashing the other thread's cached stack
data and waiting for their own stack data to be brought back into the
cache from system RAM after the other thread trashed it.

Hyperthreading is far from a panacea, and I would not call it even a
defense. Specifically _because_ of how caching is so critical to
performance today, hyperthreading can cause huge performance problems on
certain CPUs, and even when it's used properly it doesn't produce nearly
as big a benefit as actual multiple CPU cores would.

Pete
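[The stack-aliasing scenario Pete describes is hard to reproduce deliberately from Java, but the underlying effect, two threads repeatedly invalidating each other's copy of the same cache line, can be sketched with heap data instead. The following is an illustration of cache-line contention ("false sharing"), a related problem rather than the exact one he describes; the class names, iteration count, and the manual padding trick are invented for the example, the JVM is free to reorder fields so the padding is only illustrative, and newer JDKs provide an @Contended annotation for this purpose.]

public class FalseSharingSketch {

    static class SharedCounters {
        volatile long a;                     // a and b very likely share a cache line
        volatile long b;
    }

    static class PaddedCounters {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;     // crude padding to push b onto another line
        volatile long b;
    }

    static final long ITERATIONS = 100000000L;

    static long run(Runnable first, Runnable second) throws InterruptedException {
        Thread t1 = new Thread(first);
        Thread t2 = new Thread(second);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1000000;
    }

    public static void main(String[] args) throws InterruptedException {
        final SharedCounters s = new SharedCounters();
        final PaddedCounters p = new PaddedCounters();

        // Two threads hammering fields on the same cache line...
        long shared = run(
                () -> { for (long i = 0; i < ITERATIONS; i++) s.a++; },
                () -> { for (long i = 0; i < ITERATIONS; i++) s.b++; });

        // ...versus the same work on fields (hopefully) on separate lines.
        long padded = run(
                () -> { for (long i = 0; i < ITERATIONS; i++) p.a++; },
                () -> { for (long i = 0; i < ITERATIONS; i++) p.b++; });

        System.out.println("same cache line: " + shared + " ms, padded: " + padded + " ms");
    }
}

[On hardware where the two threads share a cache, the "same cache line" run can be several times slower, which is the shape of the problem Pete is pointing at.]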