From: Tom Anderson on 18 Jan 2010 08:39

On Sun, 17 Jan 2010, Arne Vajhøj wrote:

> On 17-01-2010 22:10, Roedy Green wrote:
>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews" <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who said :
>>> * We need to update our mental performance models as the hardware evolves
>>
>> I did not realise how important locality had become. A cache miss going to RAM costs 200 to 300 clock cycles! This penalty dominates everything else. This suggests that interpretive code with a tight core might run faster than "highly optimised" machine code since you could arrange that the core of it was entirely in cache.
>
> Why?
>
> The data fetched would still be the same.

Not if the bytecode was more compact than the native code.

> And the CPU intensive loop like inner loops seems more likely to fit into I cache than the relevant part of the interpreter.

If you have a single inner loop, then yes, the machine code will fit in the cache, and there's no performance advantage to bytecode. But if you have a large code footprint - something like an app server, say - then it's quite possible that more of the code will fit in the cache with bytecode than with native code.

tom

--
This is the best kind of weird. It can make a corpse laugh back to death. -- feedmepaper
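As a rough illustration of the density argument above (a sketch; the class and method names are invented, and only the bytecode count is exact):

    // Compile with: javac Density.java
    // Dump the bytecode with: javap -c Density
    public class Density {
        // The method body compiles to four one-byte JVM instructions:
        // iload_0, iload_1, iadd, ireturn -- 4 bytes of bytecode in total.
        // The native code a JIT or C compiler emits for the same method is
        // typically several times larger once stack-frame setup, register
        // moves and the return sequence are included.
        static int add(int a, int b) {
            return a + b;
        }
    }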
From: Donkey Hottie on 18 Jan 2010 08:57

On 18.1.2010 15:39, Tom Anderson wrote:

> On Sun, 17 Jan 2010, Arne Vajhøj wrote:
>
>> On 17-01-2010 22:10, Roedy Green wrote:
>>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews" <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who said :
>>>> * We need to update our mental performance models as the hardware evolves
>>>
>>> I did not realise how important locality had become. A cache miss going to RAM costs 200 to 300 clock cycles! This penalty dominates everything else. This suggests that interpretive code with a tight core might run faster than "highly optimised" machine code since you could arrange that the core of it was entirely in cache.
>>
>> Why?
>>
>> The data fetched would still be the same.
>
> Not if the bytecode was more compact than the native code.
>
>> And the CPU intensive loop like inner loops seems more likely to fit into I cache than the relevant part of the interpreter.
>
> If you have a single inner loop, then yes, the machine code will fit in the cache, and there's no performance advantage to bytecode. But if you have a large code footprint - something like an app server, say - then it's quite possible that more of the code will fit in the cache with bytecode than with native code.

I thought the bytecode is nowadays always converted to native code by the JIT. Am I wrong?

--
You will have a long and unpleasant discussion with your supervisor.
From: John B. Matthews on 18 Jan 2010 10:10

In article <ju5e27-jdk.ln1(a)wellington.fredriksson.dy.fi>, Donkey Hottie <donkey(a)fred.pp.fi> wrote:

> On 18.1.2010 15:39, Tom Anderson wrote:
> > On Sun, 17 Jan 2010, Arne Vajhøj wrote:
> >
> >> On 17-01-2010 22:10, Roedy Green wrote:
> >>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews" <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who said :
> >>>> * We need to update our mental performance models as the hardware evolves
> >>>
> >>> I did not realise how important locality had become. A cache miss going to RAM costs 200 to 300 clock cycles! This penalty dominates everything else. This suggests that interpretive code with a tight core might run faster than "highly optimised" machine code since you could arrange that the core of it was entirely in cache.
> >>
> >> Why?
> >>
> >> The data fetched would still be the same.
> >
> > Not if the bytecode was more compact than the native code.
> >
> >> And the CPU intensive loop like inner loops seems more likely to fit into I cache than the relevant part of the interpreter.
> >
> > If you have a single inner loop, then yes, the machine code will fit in the cache, and there's no performance advantage to bytecode. But if you have a large code footprint - something like an app server, say - then it's quite possible that more of the code will fit in the cache with bytecode than with native code.
>
> I thought the bytecode is nowadays always converted to native code by the JIT. Am I wrong?

Some, but not all: "The Java Hotspot[VM] does not include a plug-in JIT compiler but instead compiles and inline[s] methods that appear [to be] the most used in the application."

<http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html>

[Sorry about the heavy-handed editing.]

--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
From: Thomas Pornin on 18 Jan 2010 17:10

According to Arne Vajhøj <arne(a)vajhoej.dk>:
> With L1 cache in the 128-256KB size, then it requires a lot of unrolled loops to fill up the I cache.

L1 cache is more in the 32-64KB range. Basically 32 KB for a Core2 Intel, 64 KB for the AMD equivalent. That's for code; you have the same amount in data.

32 KB is easily filled up, especially with the help of the JIT compiler. The JIT compiler has a tendency to produce fat code. This comes from the dynamic games that the compiler plays:

-- Some method calls are statically resolved (i.e. compiled into a direct function call), because the compiler has determined that there is currently only one possible target method. But some future class loading may change that, and require the JIT compiler to patch the binary code on the fly.

-- When an exception is thrown, the JVM must be able to produce a stack trace, including computing back the source code line at each level, which means that the bytecode to binary conversion is somewhat reversible.

These games are expensive, not in clock cycles but in RAM: the JIT compiler must use more bytes than what a C compiler would do on the equivalent C source code.

I once had an implementation of the RIPEMD-160 hash function. My C code happily processes about 170 MB/s; the internal loop is fully unrolled, and uses about 7 KB of L1 cache. The equivalent Java code, fully unrolled, was ranking at... 1.1 MB/s. 150 times slower! It turned out that what the C compiler was expanding to 7 KB, the JIT compiler was spreading to more than the 32 KB of L1 cache, implying a heavy rate of cache misses. Partial rerolling, at the expense of some extra RAM accesses, brought speed up to about 60 MB/s. That's slower than C code, but not awfully slower; a speed factor of 3 between C and Java is typical in such code (this is low-level CPU bound work, where the checks on array accesses tend to be a major cost).

--Thomas Pornin
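A sketch of the rolled-vs-unrolled trade-off described above (this is not the RIPEMD-160 code from the post; the per-element mixing step and all names are made up for illustration):

    public class UnrollDemo {

        // Rolled loop: tiny compiled-code footprint, one branch per element.
        static int mixRolled(int[] data) {
            int h = 0;
            for (int i = 0; i < data.length; i++) {
                h = Integer.rotateLeft(h ^ data[i], 7) * 0x9E3779B1;
            }
            return h;
        }

        // Partially unrolled by 4: fewer loop branches, but roughly four
        // times the body size. Unrolling much further multiplies the code
        // size again and can push the JIT-compiled method past the L1
        // instruction cache, which is the effect described in the post.
        static int mixUnrolled4(int[] data) {
            int h = 0;
            int i = 0;
            for (int limit = data.length - 3; i < limit; i += 4) {
                h = Integer.rotateLeft(h ^ data[i], 7) * 0x9E3779B1;
                h = Integer.rotateLeft(h ^ data[i + 1], 7) * 0x9E3779B1;
                h = Integer.rotateLeft(h ^ data[i + 2], 7) * 0x9E3779B1;
                h = Integer.rotateLeft(h ^ data[i + 3], 7) * 0x9E3779B1;
            }
            for (; i < data.length; i++) { // handle the leftover tail
                h = Integer.rotateLeft(h ^ data[i], 7) * 0x9E3779B1;
            }
            return h;
        }
    }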
From: Lew on 18 Jan 2010 19:54
Donkey Hottie wrote:
>> I thought the bytecode is nowadays always converted to native code by the JIT. Am I wrong?

Yes.

John B. Matthews wrote:
> Some, but not all: "The Java Hotspot[VM] does not include a plug-in JIT compiler but instead compiles and inline[s] methods that appear [to be] the most used in the application."
>
> <http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html>

Hotspot runs everything as interpreted bytecode at first (JNI excluded from consideration here). Based on actual runtime heuristics, it might convert some parts to native code and run the compiled version. As execution progresses, Hotspot may revert compiled parts back to interpreted bytecode, depending on runtime situations.

--
Lew
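A minimal way to watch this behaviour (a sketch; the class and method names are invented, and the exact log format varies by JVM version): run the program below with the HotSpot flag -XX:+PrintCompilation. The hot loop shows up in the compilation log, code executed only once normally stays interpreted, and if a compiled version is later discarded the log reports it as "made not entrant".

    // Run with: java -XX:+PrintCompilation HotLoop
    public class HotLoop {
        // Executed tens of millions of times: HotSpot will JIT-compile
        // this method (or the loop that inlines it) to native code.
        static long accumulate(long acc, int i) {
            return acc + (long) i * i;
        }

        // Executed once: almost certainly stays as interpreted bytecode.
        static void printResult(long result) {
            System.out.println("sum of squares = " + result);
        }

        public static void main(String[] args) {
            long acc = 0;
            for (int i = 0; i < 50000000; i++) {
                acc = accumulate(acc, i);
            }
            printResult(acc);
        }
    }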