From: BGB / cr88192 on 1 Nov 2009 01:46

this is an observation I have made in the past, and I think I will state it here as if it were some kind of rule:

when a profiler says that the largest amount of time used in an interpreter is the main opcode dispatch switch (rather than some function called by it, or some logic contained within said switch, ... especially if this switch does little more than call functions...), then they are rapidly approaching the performance limits they can hope to get from said interpreter.

which is kind of lame in this case, as this interpreter is, technically, not all that fast...

but, then again, there may still be a little tweaking left, as said switch has not gained 1st place (it is 2nd), and holds approx 12% of the total running time. 1st place, at 15%, is the hash-table lookup for fetching instructions (and, sadly, that is not really optimizable unless I discover some way to 'optimize' basic arithmetic expressions, which FWIW is about as productive as trying to optimize a switch...).

this means 27% of the time goes to:
hash EIP (currently: "((EIP*65521)>>16)&65535");
fetch the opcode-struct from the hash-table (the hash-fail rate is apparently very low);
switch on the opcode info (branches to functions containing further dispatch and logic code).
(well, all this, plus a few arithmetic ops, such as adding the opcode size to EIP, ...)

previously I had an opcode decoder which was not super fast (approx 60% of runtime if sequential decoding is used), but the hash seems to have largely eliminated it (since most of the time, pre-decoded opcodes are used from the hash).

so, alas, it is an x86 interpreter performing at approx 386 (or maybe 486)-like speeds on my computer (~6 MIPS...). could be faster, except that MSVC on x64 has no real idea what it is doing WRT optimizations. then again, it is around 12x faster than when I started trying to optimize it (~0.5 MIPS...).

note: I have plenty of past experience with both ASM and JIT, but at the moment am leaning against this, and in favor of using pure C-based interpretation for now.

dunno if anyone knows some major way around this. IOW: is there anything that can be done when the main switch creeps into a high position, and the code starts becoming very fussy about the fine details WRT performance? past experience generally says no (that this is a general limit for C-based interpreters...).

any comments?...
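roughly, the hot path described above looks something like the following (a simplified sketch with made-up names, not the actual code):

enum { OP_INVALID, OP_MOV, OP_ADD /* ... several hundred more ... */ };

typedef struct X86Opcode {
    unsigned int eip;   /* tag: which EIP this predecoded entry belongs to */
    int op;             /* decoded opcode number, used by the switch */
    int size;           /* instruction length, added to EIP after the fetch */
} X86Opcode;

typedef struct X86Context {
    unsigned int eip;
    X86Opcode *ophash[65536];   /* predecode hash table keyed on EIP */
} X86Context;

/* hypothetical handlers and slow-path decoder, not the real ones */
X86Opcode *Decode_Opcode(X86Context *ctx);
void Op_Mov(X86Context *ctx, X86Opcode *op);
void Op_Add(X86Context *ctx, X86Opcode *op);
void Op_Invalid(X86Context *ctx, X86Opcode *op);

void Run_Step(X86Context *ctx)
{
    /* 1. hash EIP (the 15% item) */
    unsigned int h = ((ctx->eip * 65521u) >> 16) & 65535u;

    /* 2. fetch the predecoded opcode-struct; fall back to the decoder on a miss */
    X86Opcode *op = ctx->ophash[h];
    if (!op || op->eip != ctx->eip)
        op = Decode_Opcode(ctx);

    ctx->eip += op->size;

    /* 3. the main dispatch switch (the 12% item), which mostly just calls functions */
    switch (op->op) {
    case OP_MOV: Op_Mov(ctx, op); break;
    case OP_ADD: Op_Add(ctx, op); break;
    default:     Op_Invalid(ctx, op); break;
    }
}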
From: Pascal J. Bourguignon on 1 Nov 2009 07:00

"BGB / cr88192" <cr88192(a)hotmail.com> writes:

> note: I have plenty of past experience with both ASM and JIT, but at the
> moment am leaning against this, and in favor of using pure C-based
> interpretation for now.
>
> dunno if anyone knows some major way around this.

Modify the C compiler so that it generates optimized code for your specific case (ie. hide your specific assembly in the C compiler).

--
__Pascal Bourguignon__
From: BGB / cr88192 on 1 Nov 2009 10:19

"Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message news:877huaihe4.fsf(a)galatea.local...
> "BGB / cr88192" <cr88192(a)hotmail.com> writes:
>> note: I have plenty of past experience with both ASM and JIT, but at the
>> moment am leaning against this, and in favor of using pure C-based
>> interpretation for now.
>>
>> dunno if anyone knows some major way around this.
>
> Modify the C compiler so that it generates optimized code for your
> specific case (ie. hide your specific assembly in the C compiler).

I am using MSVC on x64 for this... I can't even do stupid inline assembly, as MS removed this feature for x64...

(either way, for now I have tried to make this interpreter "generic", rather than trying to rely on details of the specific host arch...).

my main complaint, though, is mostly that the compiler is stupid and does not know about even trivial optimizations...

(its "optimization" is "all for show": it apparently just turns off a few of the more absurd features, but still does not go and actually "optimize" the code; more so, trying to use optimization and debug options at the same time will generally break the compiler, forcing one to choose one or the other...).

....
mov rcx, [rsp+280h]
mov rax, [rcx]
mov rcx, [rsp+280h]
mov rdx, [rcx]
add rax, rdx
....

and, worse, is when it starts using memory and spewing out long chains of loads and stores:

mov [rsp+28h], rax
mov [rsp+20h], rcx
mov rcx, [rsp+20h]
mov rax, [rsp+28h]
....

hence, there is around a 2x-3x speed difference between 32-bit GCC and 64-bit MSVC, with GCC winning...

it is also worth noting that, from what I have seen from the profiler (which tends to auto-disassemble things), this sort of stuff apparently plagues the OS as well...

say, what about the case where one wants:

and eax, ecx
jz foo

MSVC?... nope, does not know of this either:

and eax, ecx
cmp eax, 0
je foo

oh yes, and the initial setup for a switch is a long and ugly chunk of code (around 10-12 instructions), except in the cases where it decides to compile the switch as the equivalent of a long if/else chain (not even binary search?... apparently it is linear dispatch).

yep...

now, I could use my own compiler, but alas, then I would have to debug its output... or even GCC's Win64 support, but the last time I tried to use it, it produced code so bad (buggy and ill-formed) as to scare me away. granted, it has been a fairly long time now (~1 year already?...), so maybe they have improved it some, I don't know...

so, alas, maybe it is "good enough" that I am getting 386-like speeds, as this could easily mean P1-like speeds if built with a less terrible compiler...

but, from a theoretical perspective, is the main dispatch switch the effective limit for optimizing an interpreter at the plain C level?

(note: given MSVC is in use in this case, certain GCC'isms do not apply...).

>
> --
> __Pascal Bourguignon__
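for reference, the kind of C that should collapse into the "and"/"jz" pair above is just a masked test like the following (hypothetical names, not code from the interpreter); a decent compiler branches directly on the flags set by the AND/TEST, while the MSVC x64 output above inserts a redundant "cmp eax, 0" first:

/* hypothetical flag test, for illustration only */
int flags_clear(unsigned int eflags, unsigned int mask)
{
    if ((eflags & mask) == 0)   /* ideally: and/test + jz, no separate cmp */
        return 1;
    return 0;
}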
From: Robert Redelmeier on 1 Nov 2009 10:26

In alt.lang.asm BGB / cr88192 <cr88192(a)hotmail.com> wrote in part:
> fetch opcode-struct from hash-table
> (the hash-fail rate is apparently very low);

There's your improvement opportunity -- substitute a simpler hash with larger failure rate and/or larger jump table.

-- Robert
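one way to read this suggestion, as a rough sketch with hypothetical names (not code from either poster): index a larger predecode table directly by the low bits of EIP, with no multiply or shift, accept the higher collision rate, and let a tag check send misses back to the slow decoder:

#define OPHASH_BITS 20
#define OPHASH_SIZE (1u << OPHASH_BITS)

typedef struct Predecoded {
    unsigned int eip;   /* tag; size == 0 marks an empty slot */
    int op, size;
} Predecoded;

static Predecoded ophash[OPHASH_SIZE];

void decode_at(unsigned int eip, Predecoded *out);   /* hypothetical slow path */

Predecoded *lookup(unsigned int eip)
{
    /* cheaper "hash": just mask the low bits of EIP */
    Predecoded *p = &ophash[eip & (OPHASH_SIZE - 1)];
    if (p->eip != eip || p->size == 0) {   /* more misses, but a cheaper hit path */
        decode_at(eip, p);                 /* refill this slot */
        p->eip = eip;
    }
    return p;
}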
From: BGB / cr88192 on 1 Nov 2009 11:39
"Robert Redelmeier" <redelm(a)ev1.net.invalid> wrote in message news:hck9ao$j6$1(a)aioe.org... > In alt.lang.asm BGB / cr88192 <cr88192(a)hotmail.com> wrote in part: >> fetch opcode-struct from hash-table >> (the hash-fail rate is apparently very low); > > There's your improvement opportunity -- substitute a simpler > hash with larger failure rate and/or larger jump table. > only real way to do this I think is to use a mersenne prime and eliminate the shift, but the issue then is that only likely the low 16 bits or so of EIP would be considered in the hash. IME, (((value*PRIME)>>SHIFT)&MASK) is generally both fast and effective, though slightly slower than ((value*MERSENNE)&MASK). according to the profiler, another major source of time use is: "rip=ctx->sreg_base[0]+ctx->eip;" but, alas, this much is needed to allow segmentation to actually be usable (even if much of the time CS.base==0 anyways...). but, alas, it seems there is an approx 170x speed difference between the interpreter and native. apparently there are too many complicated factors involved, and I am not sure how to effectively convert this to 386-relative MHz... direct relative clock-rate scaling relative to current processor would place it at around 20 MHz. calculating based on available "marketing" info (relative to my current processor) it would be ~25 MHz. going by some other info (instructions-per-clock on a 386, ...), it would be closer to around 33 MHz (although I don't know "which" instructions they were measuring...). .... sadly, I don't have a 386 or 486 around to compare against... > > -- Robert |