From: Tom Anderson on 18 Jan 2010 08:39

On Sun, 17 Jan 2010, Arne Vajhøj wrote:

> On 17-01-2010 22:10, Roedy Green wrote:
>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews" <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who said :
>>> * We need to update our mental performance models as the hardware evolves
>>
>> I did not realise how important locality had become. A cache miss going to RAM costs 200 to 300 clock cycles! This penalty dominates everything else. This suggests that interpretive code with a tight core might run faster than "highly optimised" machine code since you could arrange that the core of it was entirely in cache.
>
> Why?
>
> The data fetched would still be the same.

Not if the bytecode was more compact than the native code.

> And the CPU intensive loop like inner loops seems more likely to fit into I cache than the relevant part of the interpreter.

If you have a single inner loop, then yes, the machine code will fit in the cache, and there's no performance advantage to bytecode. But if you have a large code footprint - something like an app server, say - then it's quite possible that more of the code will fit in the cache with bytecode than with native code.

tom

--
This is the best kind of weird. It can make a corpse laugh back to death. -- feedmepaper
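As a rough illustration of the density argument above (a sketch; the class and method names are invented, and only the bytecode count is exact):

    // Compile with: javac Density.java
    // Dump the bytecode with: javap -c Density
    public class Density {
        // The method body compiles to four one-byte JVM instructions:
        // iload_0, iload_1, iadd, ireturn -- 4 bytes of bytecode in total.
        // The native code a JIT or C compiler emits for the same method is
        // typically several times larger once stack-frame setup, register
        // moves and the return sequence are included.
        static int add(int a, int b) {
            return a + b;
        }
    }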
From: Donkey Hottie on 18 Jan 2010 08:57

On 18.1.2010 15:39, Tom Anderson wrote:

> On Sun, 17 Jan 2010, Arne Vajhøj wrote:
>
>> On 17-01-2010 22:10, Roedy Green wrote:
>>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews" <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who said :
>>>> * We need to update our mental performance models as the hardware evolves
>>>
>>> I did not realise how important locality had become. A cache miss going to RAM costs 200 to 300 clock cycles! This penalty dominates everything else. This suggests that interpretive code with a tight core might run faster than "highly optimised" machine code since you could arrange that the core of it was entirely in cache.
>>
>> Why?
>>
>> The data fetched would still be the same.
>
> Not if the bytecode was more compact than the native code.
>
>> And the CPU intensive loop like inner loops seems more likely to fit into I cache than the relevant part of the interpreter.
>
> If you have a single inner loop, then yes, the machine code will fit in the cache, and there's no performance advantage to bytecode. But if you have a large code footprint - something like an app server, say - then it's quite possible that more of the code will fit in the cache with bytecode than with native code.

I thought the bytecode is nowadays always converted to native code by the JIT. Am I wrong?

--
You will have a long and unpleasant discussion with your supervisor.
From: John B. Matthews on 18 Jan 2010 10:10

In article <ju5e27-jdk.ln1(a)wellington.fredriksson.dy.fi>, Donkey Hottie <donkey(a)fred.pp.fi> wrote:

> On 18.1.2010 15:39, Tom Anderson wrote:
> > On Sun, 17 Jan 2010, Arne Vajhøj wrote:
> >
> >> On 17-01-2010 22:10, Roedy Green wrote:
> >>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews" <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who said :
> >>>> * We need to update our mental performance models as the hardware evolves
> >>>
> >>> I did not realise how important locality had become. A cache miss going to RAM costs 200 to 300 clock cycles! This penalty dominates everything else. This suggests that interpretive code with a tight core might run faster than "highly optimised" machine code since you could arrange that the core of it was entirely in cache.
> >>
> >> Why?
> >>
> >> The data fetched would still be the same.
> >
> > Not if the bytecode was more compact than the native code.
> >
> >> And the CPU intensive loop like inner loops seems more likely to fit into I cache than the relevant part of the interpreter.
> >
> > If you have a single inner loop, then yes, the machine code will fit in the cache, and there's no performance advantage to bytecode. But if you have a large code footprint - something like an app server, say - then it's quite possible that more of the code will fit in the cache with bytecode than with native code.
>
> I thought the bytecode is nowadays always converted to native code by the JIT. Am I wrong?

Some, but not all: "The Java Hotspot[VM] does not include a plug-in JIT compiler but instead compiles and inline[s] methods that appear [to be] the most used in the application."

<http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html>

[Sorry about the heavy-handed editing.]

--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
From: Thomas Pornin on 18 Jan 2010 17:10

According to Arne Vajhøj <arne(a)vajhoej.dk>:
> With L1 cache in the 128-256KB size, then it requires a lot of unrolled loops to fill up the I cache.

L1 cache is more in the 32-64KB range. Basically 32 KB for a Core2 Intel, 64 KB for the AMD equivalent. That's for code; you have the same amount in data.

32 KB is easily filled up, especially with the help of the JIT compiler. The JIT compiler has a tendency to produce fat code. This comes from the dynamic games that the compiler plays:

-- Some method calls are statically resolved (i.e. compiled into a direct function call), because the compiler has determined that there is currently only one possible target method. But some future class loading may change that, and require the JIT compiler to patch the binary code on the fly.

-- When an exception is thrown, the JVM must be able to produce a stack trace, including computing back the source code line at each level, which means that the bytecode to binary conversion is somewhat reversible.

These games are expensive, not in clock cycles but in RAM: the JIT compiler must use more bytes than what a C compiler would do on the equivalent C source code.

I once had an implementation of the RIPEMD-160 hash function. My C code happily processes about 170 MB/s; the internal loop is fully unrolled, and uses about 7 KB of L1 cache. The equivalent Java code, fully unrolled, was ranking at... 1.1 MB/s. 150 times slower! It turned out that what the C compiler was expanding to 7 KB, the JIT compiler was spreading to more than the 32 KB of L1 cache, implying a heavy rate of cache misses. Partial rerolling, at the expense of some extra RAM accesses, brought speed up to about 60 MB/s. That's slower than C code, but not awfully slower; a speed factor of 3 between C and Java is typical in such code (this is low-level CPU bound work, where the checks on array accesses tend to be a major cost).

--Thomas Pornin
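A sketch of the rolled-vs-unrolled trade-off described above (this is not the RIPEMD-160 code from the post; the per-element mixing step and all names are made up for illustration):

    public class UnrollDemo {

        // Rolled loop: tiny compiled-code footprint, one branch per element.
        static int mixRolled(int[] data) {
            int h = 0;
            for (int i = 0; i < data.length; i++) {
                h = Integer.rotateLeft(h ^ data[i], 7) * 0x9E3779B1;
            }
            return h;
        }

        // Partially unrolled by 4: fewer loop branches, but roughly four
        // times the body size. Unrolling much further multiplies the code
        // size again and can push the JIT-compiled method past the L1
        // instruction cache, which is the effect described in the post.
        static int mixUnrolled4(int[] data) {
            int h = 0;
            int i = 0;
            for (int limit = data.length - 3; i < limit; i += 4) {
                h = Integer.rotateLeft(h ^ data[i], 7) * 0x9E3779B1;
                h = Integer.rotateLeft(h ^ data[i + 1], 7) * 0x9E3779B1;
                h = Integer.rotateLeft(h ^ data[i + 2], 7) * 0x9E3779B1;
                h = Integer.rotateLeft(h ^ data[i + 3], 7) * 0x9E3779B1;
            }
            for (; i < data.length; i++) { // handle the leftover tail
                h = Integer.rotateLeft(h ^ data[i], 7) * 0x9E3779B1;
            }
            return h;
        }
    }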
From: Lew on 18 Jan 2010 19:54
Donkey Hottie wrote:
>> I thought the bytecode is nowadays always converted to native code by the JIT. Am I wrong?

Yes.

John B. Matthews wrote:
> Some, but not all: "The Java Hotspot[VM] does not include a plug-in JIT compiler but instead compiles and inline[s] methods that appear [to be] the most used in the application."
>
> <http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html>

Hotspot runs everything as interpreted bytecode at first (JNI excluded from consideration here). Based on actual runtime heuristics, it might convert some parts to native code and run the compiled version. As execution progresses, Hotspot may revert compiled parts back to interpreted bytecode, depending on runtime situations.

--
Lew
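A minimal way to watch this behaviour (a sketch; the class and method names are invented, and the exact log format varies by JVM version): run the program below with the HotSpot flag -XX:+PrintCompilation. The hot loop shows up in the compilation log, code executed only once normally stays interpreted, and if a compiled version is later discarded the log reports it as "made not entrant".

    // Run with: java -XX:+PrintCompilation HotLoop
    public class HotLoop {
        // Executed tens of millions of times: HotSpot will JIT-compile
        // this method (or the loop that inlines it) to native code.
        static long accumulate(long acc, int i) {
            return acc + (long) i * i;
        }

        // Executed once: almost certainly stays as interpreted bytecode.
        static void printResult(long result) {
            System.out.println("sum of squares = " + result);
        }

        public static void main(String[] args) {
            long acc = 0;
            for (int i = 0; i < 50000000; i++) {
                acc = accumulate(acc, i);
            }
            printResult(acc);
        }
    }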