From: BGB / cr88192 on 1 Nov 2009 01:46

this is an observation I have made in the past, and I think I will state it here as if it were some kind of rule:

when a profiler says that the largest amount of time used in an interpreter is the main opcode dispatch switch (rather than some function called by it, or some logic contained within said switch, ... especially if this switch does little more than call functions...), then they are rapidly approaching the performance limits they can hope to get from said interpreter.

which is kind of lame in this case, as this interpreter is, technically, not all that fast...

but, then again, there may still be a little tweaking left, as said switch has not gained 1st place (it is 2nd), and holds approx 12% of the total running time. 1st place, at 15%, is the hash-table lookup for fetching instructions (and, sadly, that is not really optimizable unless I discover some way to 'optimize' basic arithmetic expressions, which FWIW is about as productive as trying to optimize a switch...).

this means 27% of the time goes to:
hash EIP (currently: "((EIP*65521)>>16)&65535");
fetch the opcode-struct from the hash-table (the hash-fail rate is apparently very low);
switch on the opcode info (branches to functions containing further dispatch and logic code).
(well, all this, plus a few arithmetic ops, such as adding the opcode size to EIP, ...)

previously I had an opcode decoder which was not super fast (approx 60% of runtime if sequential decoding is used), but the hash seems to have largely eliminated it (since most of the time, pre-decoded opcodes are used from the hash).

so, alas, it is an x86 interpreter performing at approx 386 (or maybe 486)-like speeds on my computer (~6 MIPS...). could be faster, except that MSVC on x64 has no real idea what it is doing WRT optimizations. then again, it is around 12x faster than when I started trying to optimize it (~0.5 MIPS...).

note: I have plenty of past experience with both ASM and JIT, but at the moment am leaning against this, and in favor of using pure C-based interpretation for now.

dunno if anyone knows some major way around this. IOW: is there anything that can be done when the main switch creeps into a high position, and the code starts becoming very fussy about the fine details WRT performance? past experience generally says no (that this is a general limit for C-based interpreters...).

any comments?...
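roughly, the hot path described above looks something like the following (a simplified sketch with made-up names, not the actual code):

enum { OP_INVALID, OP_MOV, OP_ADD /* ... several hundred more ... */ };

typedef struct X86Opcode {
    unsigned int eip;   /* tag: which EIP this predecoded entry belongs to */
    int op;             /* decoded opcode number, used by the switch */
    int size;           /* instruction length, added to EIP after the fetch */
} X86Opcode;

typedef struct X86Context {
    unsigned int eip;
    X86Opcode *ophash[65536];   /* predecode hash table keyed on EIP */
} X86Context;

/* hypothetical handlers and slow-path decoder, not the real ones */
X86Opcode *Decode_Opcode(X86Context *ctx);
void Op_Mov(X86Context *ctx, X86Opcode *op);
void Op_Add(X86Context *ctx, X86Opcode *op);
void Op_Invalid(X86Context *ctx, X86Opcode *op);

void Run_Step(X86Context *ctx)
{
    /* 1. hash EIP (the 15% item) */
    unsigned int h = ((ctx->eip * 65521u) >> 16) & 65535u;

    /* 2. fetch the predecoded opcode-struct; fall back to the decoder on a miss */
    X86Opcode *op = ctx->ophash[h];
    if (!op || op->eip != ctx->eip)
        op = Decode_Opcode(ctx);

    ctx->eip += op->size;

    /* 3. the main dispatch switch (the 12% item), which mostly just calls functions */
    switch (op->op) {
    case OP_MOV: Op_Mov(ctx, op); break;
    case OP_ADD: Op_Add(ctx, op); break;
    default:     Op_Invalid(ctx, op); break;
    }
}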
From: Pascal J. Bourguignon on 1 Nov 2009 07:00

"BGB / cr88192" <cr88192(a)hotmail.com> writes:

> note: I have plenty of past experience with both ASM and JIT, but at the
> moment am leaning against this, and in favor of using pure C-based
> interpretation for now.
>
> dunno if anyone knows some major way around this.

Modify the C compiler so that it generates optimized code for your specific case (ie. hide your specific assembly in the C compiler).

--
__Pascal Bourguignon__
From: BGB / cr88192 on 1 Nov 2009 10:19

"Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message news:877huaihe4.fsf(a)galatea.local...
> "BGB / cr88192" <cr88192(a)hotmail.com> writes:
>> note: I have plenty of past experience with both ASM and JIT, but at the
>> moment am leaning against this, and in favor of using pure C-based
>> interpretation for now.
>>
>> dunno if anyone knows some major way around this.
>
> Modify the C compiler so that it generates optimized code for your
> specific case (ie. hide your specific assembly in the C compiler).

I am using MSVC on x64 for this... I can't even do stupid inline assembly, as MS removed this feature for x64...

(either way, for now I have tried to make this interpreter "generic", rather than trying to rely on details of the specific host arch...).

my main complaint, though, is mostly that the compiler is stupid and does not know about even trivial optimizations...

(its "optimization" is "all for show": it apparently just turns off a few of the more absurd features, but still does not go and actually "optimize" the code; more so, trying to use optimization and debug options at the same time will generally break the compiler, forcing one to choose one or the other...).

....
mov rcx, [rsp+280h]
mov rax, [rcx]
mov rcx, [rsp+280h]
mov rdx, [rcx]
add rax, rdx
....

and, worse, is when it starts using memory and spewing out long chains of loads and stores:

mov [rsp+28h], rax
mov [rsp+20h], rcx
mov rcx, [rsp+20h]
mov rax, [rsp+28h]
....

hence, there is around a 2x-3x speed difference between 32-bit GCC and 64-bit MSVC, with GCC winning...

it is also worth noting that, from what I have seen from the profiler (which tends to auto-disassemble things), this sort of stuff apparently plagues the OS as well...

say, what about the case where one wants:

and eax, ecx
jz foo

MSVC?... nope, does not know of this either:

and eax, ecx
cmp eax, 0
je foo

oh yes, and the initial setup for a switch is a long and ugly chunk of code (around 10-12 instructions), except in the cases where it decides to compile the switch as the equivalent of a long if/else chain (not even binary search?... apparently it is linear dispatch).

yep...

now, I could use my own compiler, but alas, then I would have to debug its output... or even GCC's Win64 support, but the last time I tried to use it, it produced code so bad (buggy and ill-formed) as to scare me away. granted, it has been a fairly long time now (~1 year already?...), so maybe they have improved it some, I don't know...

so, alas, maybe it is "good enough" that I am getting 386-like speeds, as this could easily mean P1-like speeds if built with a less terrible compiler...

but, from a theoretical perspective, is the main dispatch switch the effective limit for optimizing an interpreter at the plain C level?

(note: given MSVC is in use in this case, certain GCC'isms do not apply...).

>
> --
> __Pascal Bourguignon__
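for reference, the kind of C that should collapse into the "and"/"jz" pair above is just a masked test like the following (hypothetical names, not code from the interpreter); a decent compiler branches directly on the flags set by the AND/TEST, while the MSVC x64 output above inserts a redundant "cmp eax, 0" first:

/* hypothetical flag test, for illustration only */
int flags_clear(unsigned int eflags, unsigned int mask)
{
    if ((eflags & mask) == 0)   /* ideally: and/test + jz, no separate cmp */
        return 1;
    return 0;
}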
From: Robert Redelmeier on 1 Nov 2009 10:26

In alt.lang.asm BGB / cr88192 <cr88192(a)hotmail.com> wrote in part:
> fetch opcode-struct from hash-table
> (the hash-fail rate is apparently very low);

There's your improvement opportunity -- substitute a simpler hash with larger failure rate and/or larger jump table.

-- Robert
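one way to read this suggestion, as a rough sketch with hypothetical names (not code from either poster): index a larger predecode table directly by the low bits of EIP, with no multiply or shift, accept the higher collision rate, and let a tag check send misses back to the slow decoder:

#define OPHASH_BITS 20
#define OPHASH_SIZE (1u << OPHASH_BITS)

typedef struct Predecoded {
    unsigned int eip;   /* tag; size == 0 marks an empty slot */
    int op, size;
} Predecoded;

static Predecoded ophash[OPHASH_SIZE];

void decode_at(unsigned int eip, Predecoded *out);   /* hypothetical slow path */

Predecoded *lookup(unsigned int eip)
{
    /* cheaper "hash": just mask the low bits of EIP */
    Predecoded *p = &ophash[eip & (OPHASH_SIZE - 1)];
    if (p->eip != eip || p->size == 0) {   /* more misses, but a cheaper hit path */
        decode_at(eip, p);                 /* refill this slot */
        p->eip = eip;
    }
    return p;
}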
From: BGB / cr88192 on 1 Nov 2009 11:39
"Robert Redelmeier" <redelm(a)ev1.net.invalid> wrote in message news:hck9ao$j6$1(a)aioe.org... > In alt.lang.asm BGB / cr88192 <cr88192(a)hotmail.com> wrote in part: >> fetch opcode-struct from hash-table >> (the hash-fail rate is apparently very low); > > There's your improvement opportunity -- substitute a simpler > hash with larger failure rate and/or larger jump table. > only real way to do this I think is to use a mersenne prime and eliminate the shift, but the issue then is that only likely the low 16 bits or so of EIP would be considered in the hash. IME, (((value*PRIME)>>SHIFT)&MASK) is generally both fast and effective, though slightly slower than ((value*MERSENNE)&MASK). according to the profiler, another major source of time use is: "rip=ctx->sreg_base[0]+ctx->eip;" but, alas, this much is needed to allow segmentation to actually be usable (even if much of the time CS.base==0 anyways...). but, alas, it seems there is an approx 170x speed difference between the interpreter and native. apparently there are too many complicated factors involved, and I am not sure how to effectively convert this to 386-relative MHz... direct relative clock-rate scaling relative to current processor would place it at around 20 MHz. calculating based on available "marketing" info (relative to my current processor) it would be ~25 MHz. going by some other info (instructions-per-clock on a 386, ...), it would be closer to around 33 MHz (although I don't know "which" instructions they were measuring...). .... sadly, I don't have a 386 or 486 around to compare against... > > -- Robert |