From: Thomas Pornin on 19 Jan 2010 10:34

According to Patricia Shanahan <pats(a)acm.org>:
> On the other hand, the code has few opportunities for intra-thread
> parallelism, because just about every instruction is dependent on the
> stack and modifies the stack. A register-based machine with many
> registers may be able to get more done while it is waiting for a slow
> load.

The stack in Java bytecode is pure notation, since the stack depth at
any point can be known statically. The first step of a JIT compiler is
to recompute those depths, and thus transform the stack into so many
local variables. This is not an expensive step, either.

Also, modern CPUs are known to do this kind of work. An x86 CPU will
internally cache the top stack elements in hidden registers. So it is
certainly doable from the hardware point of view.

	--Thomas Pornin
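[A minimal sketch of the renaming step described above. The toy opcode set, the class name StackRenamer, and the fixed stack effects are illustrative assumptions, not the real JVM instruction set; branches are ignored, whereas a real JIT does this per basic block and merges depths at control-flow joins.]

    import java.util.Arrays;

    // Toy illustration: because the operand-stack depth before every
    // instruction is a pure function of the instruction sequence, the
    // slot at depth d can be renamed to "virtual local d" and the stack
    // disappears from the generated code.
    public class StackRenamer {

        enum Op {
            LOAD(+1),   // push a value
            ADD(-1),    // pop two, push one
            STORE(-1);  // pop one
            final int net;
            Op(int net) { this.net = net; }
        }

        // Single linear pass computing the stack depth before each
        // instruction - this is why the step is cheap for a JIT.
        static int[] depthsBefore(Op[] code) {
            int[] depths = new int[code.length];
            int depth = 0;
            for (int i = 0; i < code.length; i++) {
                depths[i] = depth;
                depth += code[i].net;
                if (depth < 0) {
                    throw new IllegalStateException("stack underflow at " + i);
                }
            }
            return depths;
        }

        public static void main(String[] args) {
            // Roughly "c = a + b": load a, load b, add, store c
            Op[] code = { Op.LOAD, Op.LOAD, Op.ADD, Op.STORE };
            System.out.println(Arrays.toString(depthsBefore(code)));
            // Prints [0, 1, 2, 1]: the ADD reads virtual locals 0 and 1
            // and leaves its result in virtual local 0.
        }
    }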
From: Patricia Shanahan on 19 Jan 2010 12:07

Thomas Pornin wrote:
> According to Patricia Shanahan <pats(a)acm.org>:
>> On the other hand, the code has few opportunities for intra-thread
>> parallelism, because just about every instruction is dependent on the
>> stack and modifies the stack. A register-based machine with many
>> registers may be able to get more done while it is waiting for a slow
>> load.
>
> The stack in Java bytecode is pure notation, since stack depth at any
> point can be known statically. The first step of a JIT compiler is to
> recompute those depths, and thus transform the stack into so many local
> variables. This is not an expensive step, either.

Remember I was writing in the context of directly interpreting the
bytecode to retain its compactness, rather than using a JIT compiler.

The effect of transforming the code from top-of-stack-relative
references to equivalent local variable references is to sacrifice the
extreme terseness of stack references for a major gain in opportunities
for parallelism and out-of-order execution. I think that is a very good
trade-off - the cost of a bigger instruction cache footprint and higher
instruction fetch bandwidth is small compared to the benefits.

Patricia
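[A purely illustrative sketch of the parallelism point. The instruction numbering, the t0..t6 names, and the class CriticalPath are mine, not from the thread: it treats (a + b) + (c + d) first in stack form, where each instruction is chained to its predecessor through the top of stack, then in renamed form, where only the true value dependencies remain, and reports the longest dependency chain in each.]

    import java.util.List;

    public class CriticalPath {

        // Each instruction lists the indices of the instructions whose
        // results it actually consumes; the return value is the length of
        // the longest dependency chain (the critical path).
        static int criticalPath(List<int[]> deps) {
            int[] depth = new int[deps.size()];
            int longest = 0;
            for (int i = 0; i < deps.size(); i++) {
                int d = 1;
                for (int p : deps.get(i)) {
                    d = Math.max(d, depth[p] + 1);
                }
                depth[i] = d;
                longest = Math.max(longest, d);
            }
            return longest;
        }

        public static void main(String[] args) {
            // Stack form: iload a; iload b; iadd; iload c; iload d; iadd; iadd
            // - notationally, every instruction depends on the one before it.
            List<int[]> stackForm = List.of(
                    new int[] {}, new int[] {0}, new int[] {1}, new int[] {2},
                    new int[] {3}, new int[] {4}, new int[] {5});
            // Renamed form: t0=a; t1=b; t2=t0+t1; t3=c; t4=d; t5=t3+t4; t6=t2+t5
            // - the two inner additions touch disjoint values.
            List<int[]> renamedForm = List.of(
                    new int[] {}, new int[] {}, new int[] {0, 1},
                    new int[] {}, new int[] {}, new int[] {3, 4},
                    new int[] {2, 5});
            System.out.println("stack form chain length:   " + criticalPath(stackForm));   // 7
            System.out.println("renamed form chain length: " + criticalPath(renamedForm)); // 3
        }
    }

The drop from 7 to 3 is the scheduling freedom an out-of-order core can use while it is waiting on a slow load, which is the gain Patricia is weighing against the larger instruction footprint.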
From: Tom Anderson on 21 Jan 2010 17:23

On Mon, 18 Jan 2010, Arne Vajhøj wrote:

> On 18-01-2010 08:39, Tom Anderson wrote:
>> On Sun, 17 Jan 2010, Arne Vajhøj wrote:
>>
>>> On 17-01-2010 22:10, Roedy Green wrote:
>>>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
>>>> <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who
>>>> said :
>>>>> * We need to update our mental performance models as the hardware
>>>>> evolves
>>>>
>>>> I did not realise how important locality had become. A cache miss
>>>> going to RAM costs 200 to 300 clock cycles! This penalty dominates
>>>> everything else. This suggests that interpretive code with a tight
>>>> core might run faster than "highly optimised" machine code since you
>>>> could arrange that the core of it was entirely in cache.
>>>
>>> Why?
>>>
>>> The data fetched would still be the same.
>>
>> Not if the bytecode was more compact than the native code.
>
> When I wrote data I actually meant data.

Doh! Sorry, Arne, i completely failed to understand there. You're quite
right, of course. And i would imagine that in most applications, reads
of data far outweigh reads of code (once you account for the caches). I
would be very interested to see numbers for that across different kinds
of program, though.

>>> And the CPU intensive loop like inner loops seems more likely to fit
>>> into I cache than the relevant part of the interpreter.
>>
>> If you have a single inner loop, then yes, the machine code will fit in
>> the cache, and there's no performance advantage to bytecode. But if you
>> have a large code footprint - something like an app server, say - then
>> it's quite possible that more of the code will fit in the cache with
>> bytecode than with native code.
>
> It is possible.
>
> Well - it is almost certain that it will be the case for some apps.
>
> But in most cases I would expect most of the time being spend
> on executing relative small pieces of code. 80-20 or 90-10 rule.

Right. And if your cache can hold 20% of the bytecode but not 20% of the
machine code, it's a win.

tom

--
You have now found yourself trapped in an incomprehensible maze.
From: Martin Gregorie on 21 Jan 2010 19:15

On Thu, 21 Jan 2010 22:23:22 +0000, Tom Anderson wrote:

> On Mon, 18 Jan 2010, Arne Vajhøj wrote:
>> When I wrote data I actually meant data.
>
> Doh! Sorry, Arne, i completely failed to understand there. You're quite
> right, of course. And i would imagine that in most applications, reads
> of data far outweigh reads of code (once you account for the caches). I
> would be very interested to see numbers for that across different kinds
> of program, though.
>
It depends what you mean by 'read'.

If you look at the instruction flow into the CPU, i.e. out of any caches
and into the CPU proper, the instruction flow is considerably larger
than the data flow in almost any architecture.

For the optimised code:

    int a, b, c;
    if (a > 0) c = a + b; else c = -1;

Assembler examples are:

ICL 1900 (24 bit f/l instructions, 24 bit words)

    # c = a + b
       LDA 7 A      # Read A
       LDN 6 0      # load literal zero
       BLE 7 L1     # Jump if A <= 0
       ADD 7 B      # Read and add B
       BRN L2
    L1 LDN 7 -1     # Set result to -1
    L2 STO 7 C      # Save the result

Instructions: 7 read
Data: 2 words read, 1 word written
Ratio I:D 7:2

68020 (32 bit MPU and addresses, v/l instructions, 16 bit words)

       MOVE.W A,D1   # Read A            5 bytes [1]
       BMI.S  L1     # Jump if negative  2 bytes
       BEQ.S  L1     # Jump if zero      2 bytes
       ADD.W  B,D1   # Read and add B    5 bytes
       BRA.S  L2     #                   2 bytes
    L1 MOVE.W #-1,D1 # Set result to -1  5 bytes
    L2 MOVE.W D1,C   # Save the result   5 bytes

Instructions: 26 bytes read
Data: 4 bytes read, 2 bytes written
Ratio I:D 26:6

[1] I won't swear to these MOVE and ADD instruction lengths (my handbook
doesn't give them and my 68020 isn't running at present), but even if
I'm wrong and they're only 3 bytes, the ratio is still 18:6.

You don't have to throw in much in the way of overflow checking, address
arithmetic, etc. to increase the Instruction:Data ratio quite
considerably.

Both my examples are of processors with a decently sized register set,
but I don't think entirely stack-oriented machines would do much better.
The ICL 2900 had the most sophisticated architecture I've seen (entirely
stack-based, descriptors for all but primitive data types,
software-controlled register length) and averaged 3 instructions per
COBOL sentence vs. the 6+ per sentence of the 1900, but its instruction
flow through the OCP (Order Code Processor) was higher than its data
flow and the hardware was optimised to reflect that fact.

If anybody knows of hardware where the data flow is larger than the
instruction flow and can provide an equivalent example I'd be fascinated
to see it.

--
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
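[For comparison with the two register-machine listings above, here is what the same fragment typically looks like as Java bytecode. The class name Ratio and method f are placeholders; the instruction sizes come from the JVM specification, but the exact offsets and instruction selection depend on the compiler, so treat the listing as a sketch rather than verbatim javac output.]

    public class Ratio {
        // Same logic as the assembler examples above.
        static int f(int a, int b) {
            int c;
            if (a > 0) {
                c = a + b;
            } else {
                c = -1;
            }
            return c;
        }
        /* Typical javap -c output for f():
             0: iload_0        // 1 byte   read a
             1: ifle    11     // 3 bytes  jump if a <= 0
             4: iload_0        // 1 byte   read a again
             5: iload_1        // 1 byte   read b
             6: iadd           // 1 byte
             7: istore_2       // 1 byte   c = a + b
             8: goto    13     // 3 bytes
            11: iconst_m1      // 1 byte   -1
            12: istore_2       // 1 byte   c = -1
            13: iload_2        // 1 byte
            14: ireturn        // 1 byte   return c

           Roughly 13 bytes of instructions for the conditional and store
           (15 including the return), versus 26 bytes on the 68020, which
           is the compactness being argued about earlier in the thread.
           The values touched (a, b, c) are the same either way, though in
           bytecode they live in local variable slots rather than at
           absolute addresses.
        */
    }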
From: Arne Vajhøj on 21 Jan 2010 22:06
On 21-01-2010 17:23, Tom Anderson wrote:
> On Mon, 18 Jan 2010, Arne Vajhøj wrote:
>> On 18-01-2010 08:39, Tom Anderson wrote:
>>> On Sun, 17 Jan 2010, Arne Vajhøj wrote:
>>>
>>>> On 17-01-2010 22:10, Roedy Green wrote:
>>>>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
>>>>> <nospam(a)nospam.invalid> wrote, quoted or indirectly quoted someone who
>>>>> said :
>>>>>> * We need to update our mental performance models as the hardware
>>>>>> evolves
>>>>>
>>>>> I did not realise how important locality had become. A cache miss
>>>>> going to RAM costs 200 to 300 clock cycles! This penalty dominates
>>>>> everything else. This suggests that interpretive code with a tight
>>>>> core might run faster than "highly optimised" machine code since you
>>>>> could arrange that the core of it was entirely in cache.
>>>>
>>>> Why?
>>>>
>>>> The data fetched would still be the same.
>>>
>>> Not if the bytecode was more compact than the native code.
>>
>> When I wrote data I actually meant data.
>
> Doh! Sorry, Arne, i completely failed to understand there. You're quite
> right, of course. And i would imagine that in most applications, reads
> of data far outweigh reads of code (once you account for the caches). I
> would be very interested to see numbers for that across different kinds
> of program, though.
>
>>>> And the CPU intensive loop like inner loops seems more likely to fit
>>>> into I cache than the relevant part of the interpreter.
>>>
>>> If you have a single inner loop, then yes, the machine code will fit in
>>> the cache, and there's no performance advantage to bytecode. But if you
>>> have a large code footprint - something like an app server, say - then
>>> it's quite possible that more of the code will fit in the cache with
>>> bytecode than with native code.
>>
>> It is possible.
>>
>> Well - it is almost certain that it will be the case for some apps.
>>
>> But in most cases I would expect most of the time being spend
>> on executing relative small pieces of code. 80-20 or 90-10 rule.
>
> Right. And if your cache can hold 20% of the bytecode but not 20% of the
> machine code, it's a win.

Yep.

Arne