From: Thomas Pornin on 19 Jan 2010 10:34

According to Patricia Shanahan <pats(a)acm.org>:
> On the other hand, the code has few opportunities for intra-thread
> parallelism, because just about every instruction is dependent on the
> stack and modifies the stack. A register-based machine with many
> registers may be able to get more done while it is waiting for a slow
> load.

The stack in Java bytecode is pure notation, since the stack depth at
any point can be known statically. The first step of a JIT compiler is
to recompute those depths, and thus transform the stack into so many
local variables. This is not an expensive step, either.

Also, modern CPUs are known to do this kind of work. An x86 CPU will
internally cache the top stack elements in hidden registers. So it is
certainly doable from the hardware point of view.

	--Thomas Pornin
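[A minimal sketch of the renaming step described above. The toy opcode set, the class name StackRenamer, and the fixed stack effects are illustrative assumptions, not the real JVM instruction set; branches are ignored, whereas a real JIT does this per basic block and merges depths at control-flow joins.]

    import java.util.Arrays;

    // Toy illustration: because the operand-stack depth before every
    // instruction is a pure function of the instruction sequence, the
    // slot at depth d can be renamed to "virtual local d" and the stack
    // disappears from the generated code.
    public class StackRenamer {

        enum Op {
            LOAD(+1),   // push a value
            ADD(-1),    // pop two, push one
            STORE(-1);  // pop one
            final int net;
            Op(int net) { this.net = net; }
        }

        // Single linear pass computing the stack depth before each
        // instruction - this is why the step is cheap for a JIT.
        static int[] depthsBefore(Op[] code) {
            int[] depths = new int[code.length];
            int depth = 0;
            for (int i = 0; i < code.length; i++) {
                depths[i] = depth;
                depth += code[i].net;
                if (depth < 0) {
                    throw new IllegalStateException("stack underflow at " + i);
                }
            }
            return depths;
        }

        public static void main(String[] args) {
            // Roughly "c = a + b": load a, load b, add, store c
            Op[] code = { Op.LOAD, Op.LOAD, Op.ADD, Op.STORE };
            System.out.println(Arrays.toString(depthsBefore(code)));
            // Prints [0, 1, 2, 1]: the ADD reads virtual locals 0 and 1
            // and leaves its result in virtual local 0.
        }
    }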
From: Patricia Shanahan on 19 Jan 2010 12:07

Thomas Pornin wrote:
> According to Patricia Shanahan <pats(a)acm.org>:
>> On the other hand, the code has few opportunities for intra-thread
>> parallelism, because just about every instruction is dependent on the
>> stack and modifies the stack. A register-based machine with many
>> registers may be able to get more done while it is waiting for a slow
>> load.
>
> The stack in Java bytecode is pure notation, since stack depth at any
> point can be known statically. The first step of a JIT compiler is to
> recompute those depths, and thus transform the stack into so many local
> variables. This is not an expensive step, either.

Remember I was writing in the context of directly interpreting the
bytecode to retain its compactness, rather than using a JIT compiler.

The effect of transforming the code from top-of-stack-relative
references to equivalent local variable references is to sacrifice the
extreme terseness of stack references for a major gain in opportunities
for parallelism and out-of-order execution. I think that is a very good
trade-off - the cost of a bigger instruction cache footprint and higher
instruction fetch bandwidth is small compared to the benefits.

Patricia
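[A purely illustrative sketch of the parallelism point. The instruction numbering, the t0..t6 names, and the class CriticalPath are mine, not from the thread: it treats (a + b) + (c + d) first in stack form, where each instruction is chained to its predecessor through the top of stack, then in renamed form, where only the true value dependencies remain, and reports the longest dependency chain in each.]

    import java.util.List;

    public class CriticalPath {

        // Each instruction lists the indices of the instructions whose
        // results it actually consumes; the return value is the length of
        // the longest dependency chain (the critical path).
        static int criticalPath(List<int[]> deps) {
            int[] depth = new int[deps.size()];
            int longest = 0;
            for (int i = 0; i < deps.size(); i++) {
                int d = 1;
                for (int p : deps.get(i)) {
                    d = Math.max(d, depth[p] + 1);
                }
                depth[i] = d;
                longest = Math.max(longest, d);
            }
            return longest;
        }

        public static void main(String[] args) {
            // Stack form: iload a; iload b; iadd; iload c; iload d; iadd; iadd
            // - notationally, every instruction depends on the one before it.
            List<int[]> stackForm = List.of(
                    new int[] {}, new int[] {0}, new int[] {1}, new int[] {2},
                    new int[] {3}, new int[] {4}, new int[] {5});
            // Renamed form: t0=a; t1=b; t2=t0+t1; t3=c; t4=d; t5=t3+t4; t6=t2+t5
            // - the two inner additions touch disjoint values.
            List<int[]> renamedForm = List.of(
                    new int[] {}, new int[] {}, new int[] {0, 1},
                    new int[] {}, new int[] {}, new int[] {3, 4},
                    new int[] {2, 5});
            System.out.println("stack form chain length:   " + criticalPath(stackForm));   // 7
            System.out.println("renamed form chain length: " + criticalPath(renamedForm)); // 3
        }
    }

The drop from 7 to 3 is the scheduling freedom an out-of-order core can use while it is waiting on a slow load, which is the gain Patricia is weighing against the larger instruction footprint.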
From: Tom Anderson on 21 Jan 2010 17:23

On Mon, 18 Jan 2010, Arne Vajhøj wrote:

> On 18-01-2010 08:39, Tom Anderson wrote:
>> On Sun, 17 Jan 2010, Arne Vajhøj wrote:
>>
>>> On 17-01-2010 22:10, Roedy Green wrote:
>>>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
>>>> <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who
>>>> said :
>>>>> * We need to update our mental performance models as the hardware
>>>>> evolves
>>>>
>>>> I did not realise how important locality had become. A cache miss
>>>> going to RAM costs 200 to 300 clock cycles! This penalty dominates
>>>> everything else. This suggests that interpretive code with a tight
>>>> core might run faster than "highly optimised" machine code since you
>>>> could arrange that the core of it was entirely in cache.
>>>
>>> Why?
>>>
>>> The data fetched would still be the same.
>>
>> Not if the bytecode was more compact than the native code.
>
> When I wrote data I actually meant data.

Doh! Sorry, Arne, i completely failed to understand there. You're quite
right, of course. And i would imagine that in most applications, reads
of data far outweigh reads of code (once you account for the caches). I
would be very interested to see numbers for that across different kinds
of program, though.

>>> And the CPU intensive loop like inner loops seems more likely to fit
>>> into I cache than the relevant part of the interpreter.
>>
>> If you have a single inner loop, then yes, the machine code will fit in
>> the cache, and there's no performance advantage to bytecode. But if you
>> have a large code footprint - something like an app server, say - then
>> it's quite possible that more of the code will fit in the cache with
>> bytecode than with native code.
>
> It is possible.
>
> Well - it is almost certain that it will be the case for some apps.
>
> But in most cases I would expect most of the time being spend
> on executing relative small pieces of code. 80-20 or 90-10 rule.

Right. And if your cache can hold 20% of the bytecode but not 20% of the
machine code, it's a win.

tom

--
You have now found yourself trapped in an incomprehensible maze.
From: Martin Gregorie on 21 Jan 2010 19:15

On Thu, 21 Jan 2010 22:23:22 +0000, Tom Anderson wrote:

> On Mon, 18 Jan 2010, Arne Vajhøj wrote:
>> When I wrote data I actually meant data.
>
> Doh! Sorry, Arne, i completely failed to understand there. You're quite
> right, of course. And i would imagine that in most applications, reads
> of data far outweigh reads of code (once you account for the caches). I
> would be very interested to see numbers for that across different kinds
> of program, though.
>
It depends what you mean by 'read'.

If you look at the instruction flow into the CPU, i.e. out of any caches
and into the CPU proper, the instruction flow is considerably larger
than the data flow in almost any architecture.

For the optimised code:

    int a, b, c;
    if (a > 0) c = a + b; else c = -1;

Assembler examples are:

ICL 1900 (24 bit f/l instructions, 24 bit words)

    # c = a + b
       LDA 7 A      # Read A
       LDN 6 0      # load literal zero
       BLE 7 L1     # Jump if A <= 0
       ADD 7 B      # Read and add B
       BRN L2
    L1 LDN 7 -1     # Set result to -1
    L2 STO 7 C      # Save the result

Instructions: 7 read
Data: 2 words read, 1 word written
Ratio I:D 7:2

68020 (32 bit MPU and addresses, v/l instructions, 16 bit words)

       MOVE.W A,D1   # Read A            5 bytes [1]
       BMI.S  L1     # Jump if negative  2 bytes
       BEQ.S  L1     # Jump if zero      2 bytes
       ADD.W  B,D1   # Read and add B    5 bytes
       BRA.S  L2     #                   2 bytes
    L1 MOVE.W #-1,D1 # Set result to -1  5 bytes
    L2 MOVE.W D1,C   # Save the result   5 bytes

Instructions: 26 bytes read
Data: 4 bytes read, 2 bytes written
Ratio I:D 26:6

[1] I won't swear to these MOVE and ADD instruction lengths (my handbook
doesn't give them and my 68020 isn't running at present), but even if
I'm wrong and they're only 3 bytes, the ratio is still 18:6.

You don't have to throw in much in the way of overflow checking, address
arithmetic, etc. to increase the Instruction:Data ratio quite
considerably.

Both my examples are of processors with a decently sized register set,
but I don't think entirely stack-oriented machines would do much better.
The ICL 2900 had the most sophisticated architecture I've seen (entirely
stack-based, descriptors for all but primitive data types,
software-controlled register length) and averaged 3 instructions per
COBOL sentence vs. the 6+ per sentence of the 1900, but its instruction
flow through the OCP (Order Code Processor) was higher than its data
flow and the hardware was optimised to reflect that fact.

If anybody knows of hardware where the data flow is larger than the
instruction flow and can provide an equivalent example I'd be fascinated
to see it.

--
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
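[For comparison with the two register-machine listings above, here is what the same fragment typically looks like as Java bytecode. The class name Ratio and method f are placeholders; the instruction sizes come from the JVM specification, but the exact offsets and instruction selection depend on the compiler, so treat the listing as a sketch rather than verbatim javac output.]

    public class Ratio {
        // Same logic as the assembler examples above.
        static int f(int a, int b) {
            int c;
            if (a > 0) {
                c = a + b;
            } else {
                c = -1;
            }
            return c;
        }
        /* Typical javap -c output for f():
             0: iload_0        // 1 byte   read a
             1: ifle    11     // 3 bytes  jump if a <= 0
             4: iload_0        // 1 byte   read a again
             5: iload_1        // 1 byte   read b
             6: iadd           // 1 byte
             7: istore_2       // 1 byte   c = a + b
             8: goto    13     // 3 bytes
            11: iconst_m1      // 1 byte   -1
            12: istore_2       // 1 byte   c = -1
            13: iload_2        // 1 byte
            14: ireturn        // 1 byte   return c

           Roughly 13 bytes of instructions for the conditional and store
           (15 including the return), versus 26 bytes on the 68020, which
           is the compactness being argued about earlier in the thread.
           The values touched (a, b, c) are the same either way, though in
           bytecode they live in local variable slots rather than at
           absolute addresses.
        */
    }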
From: Arne Vajhøj on 21 Jan 2010 22:06
On 21-01-2010 17:23, Tom Anderson wrote:
> On Mon, 18 Jan 2010, Arne Vajhøj wrote:
>> On 18-01-2010 08:39, Tom Anderson wrote:
>>> On Sun, 17 Jan 2010, Arne Vajhøj wrote:
>>>
>>>> On 17-01-2010 22:10, Roedy Green wrote:
>>>>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
>>>>> <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who
>>>>> said :
>>>>>> * We need to update our mental performance models as the hardware
>>>>>> evolves
>>>>>
>>>>> I did not realise how important locality had become. A cache miss
>>>>> going to RAM costs 200 to 300 clock cycles! This penalty dominates
>>>>> everything else. This suggests that interpretive code with a tight
>>>>> core might run faster than "highly optimised" machine code since you
>>>>> could arrange that the core of it was entirely in cache.
>>>>
>>>> Why?
>>>>
>>>> The data fetched would still be the same.
>>>
>>> Not if the bytecode was more compact than the native code.
>>
>> When I wrote data I actually meant data.
>
> Doh! Sorry, Arne, i completely failed to understand there. You're quite
> right, of course. And i would imagine that in most applications, reads
> of data far outweigh reads of code (once you account for the caches). I
> would be very interested to see numbers for that across different kinds
> of program, though.
>
>>>> And the CPU intensive loop like inner loops seems more likely to fit
>>>> into I cache than the relevant part of the interpreter.
>>>
>>> If you have a single inner loop, then yes, the machine code will fit in
>>> the cache, and there's no performance advantage to bytecode. But if you
>>> have a large code footprint - something like an app server, say - then
>>> it's quite possible that more of the code will fit in the cache with
>>> bytecode than with native code.
>>
>> It is possible.
>>
>> Well - it is almost certain that it will be the case for some apps.
>>
>> But in most cases I would expect most of the time being spend
>> on executing relative small pieces of code. 80-20 or 90-10 rule.
>
> Right. And if your cache can hold 20% of the bytecode but not 20% of the
> machine code, it's a win.

Yep.

Arne