the 'switch' limit... [ASM]

Prev: Is there a debugger or a tool that will find the stack corruption
Next: LPC2478 arm7tdmi 56mhz 8kb cache 32mb ram, jpeg decoding performance

From: Pascal J. Bourguignon on 1 Nov 2009 12:03

"BGB / cr88192" <cr88192(a)hotmail.com> writes:

> "Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message
> news:877huaihe4.fsf(a)galatea.local...
>> "BGB / cr88192" <cr88192(a)hotmail.com> writes:
>>> note: I have plenty of past experience with both ASM and JIT, but at the
>>> moment am leaning against this, and in favor of using pure C-based
>>> interpretation for now.
>>>
>>> dunno if anyone knows some major way around this.
>>
>> Modify the C compiler so that it generates optimized code for your
>> specific case (ie. hide your specific assembly in the C compiler).
>>
>
> I am using MSVC on x64 for this...

So unless you're rich enough to buy the sources of MSVC, forget it.

> I can't even do stupid inline assembly, as MS removed this feature for
> x64...

May I suggest you to switch to an open source compiler (such as gcc)?
Or just write your own compiler?

> now, I could use my own compiler, but alas then I would have to debug its
> output...

Hence the suggestion for a debugged compiler such as gcc.

> or even GCC's Win64 support, but last time I tried to use it, it produced
> code so bad (buggy and ill formed) as to scare me away. granted, it has been
> a fairly long time now (~ 1 year already?...), so maybe they have improved
> it some, I don't know...

This doesn't matter. What matters is that you have the sources, so
you can improve its optimizer.

(But also, read the interview of Fran Allen in "Coders at Work" (Peter
Seibel, Apress), you might prefer another language altogether to be
able to really optimize its compiler).

--
__Pascal Bourguignon__

From: Richard Harter on 1 Nov 2009 13:26

On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
<cr88192(a)hotmail.com> wrote:

>this is an observation I have made in the past, and I think I will state
>here as if it were some kind of rule:
>when a profiler says that the largest amount of time used in an interpreter
>is the main opcode dispatch switch (rather than some function called by it,
>or some logic contained within said switch, ... especially if this switch
>does nothing more than call functions...), then they are rapidly approaching
>the performance limits they can hope to gain from said interpreter.
>
>which is kind of lame in this case, as this interpreter is, technically, not
>all that fast...
>
>but, then again, there may still be a little tweaking left, as said switch
>has not gained 1st place (it is 2nd), and holds approx 12% of the total
>running time.
>
>1st place, at 15%, is the hash-table for fetching instructions (and, sadly,
>is not really optimizable unless I discover some way to 'optimize' basic
>arithmetic expressions, which FWIW is about as productive as trying to
>optimize a switch...).
>
>this means: 27% of time:
>hash EIP (currently: "((EIP*65521)>>16)&65535");
>fetch opcode-struct from hash-table (the hash-fail rate is apparently very
>low);
>switch on opcode info (branches to functions containing further dispatch and
>logic code).

Offhand the above sounds ugly. I don't know many op codes you
are using but I expect that it most it is a few hundred. The
hash function you use is relatively expensive. Using a hash
table at all is relatively expensive. A trie might be cheaper or
possibly a perfect hash. Then there is the question of why you
are using a switch at all. If the op-code lookup yields an
index, why aren't you using it as an index into a table that
contains the pointer to the action function and the op-code
struct?

Mind you, there might be very good reasons for doing what you are
doing. I don't have access to your code. Still, it sounds as
though your code is cumbersome.

>(well, all this, as well as a few arithmetic ops, such as adding the opcode
>size to EIP, ...).
>
>previously I had an opcode decoder which was not super fast (approx 60% of
>runtime if sequential decoding is used), but the hash seems to have largely
>eliminated it (since most of the time, pre-decoded opcodes are used from the
>hash).
>
>
>so, alas, it is an x86 interpreter performing at approx 386 (or maybe
>486)-like speeds on my computer (~6 MIPS...).
>could be faster, except that MSVC on x64 has no real idea what it is doing
>WRT optimizations.
>
>then again, it is around 12x faster than when I started trying to optimize
>it (~0.5 MIPS...).
>
>
>note: I have plenty of past experience with both ASM and JIT, but at the
>moment am leaning against this, and in favor of using pure C-based
>interpretation for now.
>
>
>dunno if anyone knows some major way around this.
>
>IOW: if there is anything that can be done when the main switch creeps into
>a high position, and code starts becomming very fussy about the fine details
>WRT performance issues...
>
>past experience is generally not... (that this is a general limit to C-based
>interpreters...).
>
>
>any comments?...
>
>

Richard Harter, cri(a)tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Kafka wasn't an author;
Kafka was a prophet!

From: BGB / cr88192 on 1 Nov 2009 14:16

"Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message
news:87y6mqgot3.fsf(a)galatea.local...
> "BGB / cr88192" <cr88192(a)hotmail.com> writes:
>
>> "Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message
>> news:877huaihe4.fsf(a)galatea.local...
>>> "BGB / cr88192" <cr88192(a)hotmail.com> writes:
>>>> note: I have plenty of past experience with both ASM and JIT, but at
>>>> the
>>>> moment am leaning against this, and in favor of using pure C-based
>>>> interpretation for now.
>>>>
>>>> dunno if anyone knows some major way around this.
>>>
>>> Modify the C compiler so that it generates optimized code for your
>>> specific case (ie. hide your specific assembly in the C compiler).
>>>
>>
>> I am using MSVC on x64 for this...
>
> So unless you're rich enough to buy the sources of MSVC, forget it.
>

yep.

>> I can't even do stupid inline assembly, as MS removed this feature for
>> x64...
>
> May I suggest you to switch to an open source compiler (such as gcc)?
> Or just write your own compiler?
>

GCC on Win64 was very unreliable last I checked (occasionally generating bad
code, ...), but it may have improved since then.

my compiler works, just I similarly don't trust it that well to produce good
code, and using it for compiling much of anything "non-trivial" is likely to
leave bugs all over the place which would be rather difficult to track down
(I leave it mostly for compiling smaller code fragments at runtime, since a
failure here is at least usually obvious and nearly not so difficult to
track down and fix...).

granted, running head-first into the land of bug fixing would be a good way
to improve its reliability, but I guess I am not yet willing to go down this
path (AKA: having lots of code built with lots of little hidden bugs...).

>> now, I could use my own compiler, but alas then I would have to debug its
>> output...
>
> Hence the suggestion for a debugged compiler such as gcc.
>

it may have improved since last I looked, but at the time GCC's Win64 output
was buggy.

I had compiled a big mass of code, and it was crashing. I eventually tracked
it down to GCC having been messing up with handling the Win64 calling
convention, and so I ended up using MSVC instead, figuring at least its
Win64 support demonstratably works reliably...

>
>> or even GCC's Win64 support, but last time I tried to use it, it produced
>> code so bad (buggy and ill formed) as to scare me away. granted, it has
>> been
>> a fairly long time now (~ 1 year already?...), so maybe they have
>> improved
>> it some, I don't know...
>
> This doesn't matter. What matters is that you have the sources, so
> you can improve its optimizer.
>

with GCC, one probably will not have to, as presumably at least GCC has an
idea how to produce efficient code (it seems to do so well enough on Linux).

so, the biggie issue is mostly a matter of bugs, and I am not really willing
to go digging through GCC's (IMO, teh huge) codebase on a bug hunt...

granted, they have hopefully improved the bugginess at least, which is the
main issue.

> (But also, read the interview of Fran Allen in "Coders at Work" (Peter
> Seibel, Apress), you might prefer another language altogether to be
> able to really optimize its compiler).
>

I have a fairly large existing C codebase, and I will probably not switch to
another language noting that most others have poor C integration.

it is, likewise:
it is not that there was no reason that I chose x86 as the main ISA for this
interpreter...

I chose x86 for similar reasons to why I stick with C...

even if, granted, x86 is not the most ideal ISA for an interpreter, but
hell, at least it allows me to use GCC for building a lot of code that runs
within my VM, which is itself enough of a reason IMO...

>
> --
> __Pascal Bourguignon__

From: BGB / cr88192 on 1 Nov 2009 14:59

"Richard Harter" <cri(a)tiac.net> wrote in message
news:4aedcfb0.958856406(a)text.giganews.com...
> On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
> <cr88192(a)hotmail.com> wrote:
>
>>this is an observation I have made in the past, and I think I will state
>>here as if it were some kind of rule:
>>when a profiler says that the largest amount of time used in an
>>interpreter
>>is the main opcode dispatch switch (rather than some function called by
>>it,
>>or some logic contained within said switch, ... especially if this switch
>>does nothing more than call functions...), then they are rapidly
>>approaching
>>the performance limits they can hope to gain from said interpreter.
>>
>>which is kind of lame in this case, as this interpreter is, technically,
>>not
>>all that fast...
>>
>>but, then again, there may still be a little tweaking left, as said switch
>>has not gained 1st place (it is 2nd), and holds approx 12% of the total
>>running time.
>>
>>1st place, at 15%, is the hash-table for fetching instructions (and,
>>sadly,
>>is not really optimizable unless I discover some way to 'optimize' basic
>>arithmetic expressions, which FWIW is about as productive as trying to
>>optimize a switch...).
>>
>>this means: 27% of time:
>>hash EIP (currently: "((EIP*65521)>>16)&65535");
>>fetch opcode-struct from hash-table (the hash-fail rate is apparently very
>>low);
>>switch on opcode info (branches to functions containing further dispatch
>>and
>>logic code).
>
> Offhand the above sounds ugly. I don't know many op codes you
> are using but I expect that it most it is a few hundred.

errm...

the x86 ISA has a bit more than this, closer to around 1,800 (core integer
ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a
bit over 2000.

note that this is actual opcodes, not nmonics, where x86 likes using a
number of different opcodes with the same nmonic, although in my list there
are still around 800 nmonics...

as for there being only a few hundred opcodes, in nearly any other bytecode
this would probably be the case, but is not so much the case with x86...

the interpreter does not at present handle the full set though...

but, the hash is not used for opcode lookup/decoding, rather it is used for
grabbing already decoded instructions from a cache (which is based on memory
address).

(this is because instruction decoding is fairly expensive, and so is not
done inline, and when instructions are actually decoded, they are converted
from an in-memory opcode, to a struct-based representation, which is what is
fetched here by the hash lookup).

so, it serves a role similar to that of the L1 cache.

> The
> hash function you use is relatively expensive. Using a hash
> table at all is relatively expensive. A trie might be cheaper or
> possibly a perfect hash. Then there is the question of why you
> are using a switch at all. If the op-code lookup yields an
> index, why aren't you using it as an index into a table that
> contains the pointer to the action function and the op-code
> struct?
>

the hash table fetches the (already decoded) opcode struct, and the switch
is used to have the interpreter actually so something.

Step:
hash EIP address (actually: EIP+CS.Base);
try fetch decoded opcode:
ExecOpcode.
fail:
set up hash entry;
decode opcode at EIP into hash;
ExecOpcode.

ExecOpcode:
switch(op->type) //not exact, but close enough
case Basic: ExecOpcodeBasic(); break;
case Reg: ExecOpcodeReg(); break;
case RM: ExecOpcodeRM(); break;
case RegRM: ExecOpcodeRegRM(); break;
case RMReg: ExecOpcodeRMReg(); break;
case Imm: ExecOpcodeImm(); break;
case RegImm: ExecOpcodeRegImm(); break;
case RMImm: ExecOpcodeRMImm(); break;
...

....
ExecOpcodeRegRM:
resolve RM, ...
switch(op->opnum)
case ADC: ...
case ADD: ...
case AND: ...
case BSF: ...
case BSR: ...
...
....

in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a
fairly heavy ISA...

> Mind you, there might be very good reasons for doing what you are
> doing. I don't have access to your code. Still, it sounds as
> though your code is cumbersome.
>

kind of, but it could be a whole lot worse...

how would you approach something like x86?...

hopefully not by suggesting a huge switch to dispatch for every possible
opcode byte/prefix/... and then re-enter for following bytes. many opcodes
are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE
opcodes), and then there is the Mod/RM field, which itself is an issue to
decode.

(if you don't believe me, go and take a good long hard look at the Intel
docs, and imagine how you would go about writing an interpreter for all this
stuff...).

I started initially trying this, but soon found it unworkable.

I then switched to a more "generic" strategy (essentially grabbing my
pre-existing disassembler and using this as a template for an opcode
decoder, which itself was mostly based on a pattern-matching strategy).

it took a bit of tweaking to get the speed up (following that of reworking
it into a piece of usable interpreter machinery), but it was still fairly
slow vs the interpreter, hence I added the hash (actually, originally I
considered a more involved strategy, but this was much simpler for now, and
can still pull off more-or-less the same effect...).

>
>>(well, all this, as well as a few arithmetic ops, such as adding the
>>opcode
>>size to EIP, ...).
>>
>>previously I had an opcode decoder which was not super fast (approx 60% of
>>runtime if sequential decoding is used), but the hash seems to have
>>largely
>>eliminated it (since most of the time, pre-decoded opcodes are used from
>>the
>>hash).
>>
>>
>>so, alas, it is an x86 interpreter performing at approx 386 (or maybe
>>486)-like speeds on my computer (~6 MIPS...).
>>could be faster, except that MSVC on x64 has no real idea what it is doing
>>WRT optimizations.
>>
>>then again, it is around 12x faster than when I started trying to optimize
>>it (~0.5 MIPS...).
>>
>>
>>note: I have plenty of past experience with both ASM and JIT, but at the
>>moment am leaning against this, and in favor of using pure C-based
>>interpretation for now.
>>
>>
>>dunno if anyone knows some major way around this.
>>
>>IOW: if there is anything that can be done when the main switch creeps
>>into
>>a high position, and code starts becomming very fussy about the fine
>>details
>>WRT performance issues...
>>
>>past experience is generally not... (that this is a general limit to
>>C-based
>>interpreters...).
>>
>>
>>any comments?...
>>
>>
>
> Richard Harter, cri(a)tiac.net
> http://home.tiac.net/~cri, http://www.varinoma.com
> Kafka wasn't an author;
> Kafka was a prophet!

From: Richard Harter on 1 Nov 2009 16:50

On Sun, 1 Nov 2009 12:59:06 -0700, "BGB / cr88192"
<cr88192(a)hotmail.com> wrote:

>
>"Richard Harter" <cri(a)tiac.net> wrote in message
>news:4aedcfb0.958856406(a)text.giganews.com...
>> On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
>> <cr88192(a)hotmail.com> wrote:
[snip]

>>
>> Offhand the above sounds ugly. I don't know many op codes you
>> are using but I expect that it most it is a few hundred.
>
>errm...
>
>the x86 ISA has a bit more than this, closer to around 1,800 (core integer
>ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a
>bit over 2000.
>
>note that this is actual opcodes, not nmonics, where x86 likes using a
>number of different opcodes with the same nmonic, although in my list there
>are still around 800 nmonics...
>
>as for there being only a few hundred opcodes, in nearly any other bytecode
>this would probably be the case, but is not so much the case with x86...

That's still not all that bad, but it is less than pleasant.
>
>
>the interpreter does not at present handle the full set though...
>
>
>but, the hash is not used for opcode lookup/decoding, rather it is used for
>grabbing already decoded instructions from a cache (which is based on memory
>address).
>
>(this is because instruction decoding is fairly expensive, and so is not
>done inline, and when instructions are actually decoded, they are converted
>from an in-memory opcode, to a struct-based representation, which is what is
>fetched here by the hash lookup).
>
>so, it serves a role similar to that of the L1 cache.

I'm a little confused here. Are we just talking about opcodes
(decoded or not) or are we talking about full instructions. If
we are only talking about opcodes then everything can be
precomputed except an opcode->index mapping.

>
>
>> The
>> hash function you use is relatively expensive. Using a hash
>> table at all is relatively expensive. A trie might be cheaper or
>> possibly a perfect hash. Then there is the question of why you
>> are using a switch at all. If the op-code lookup yields an
>> index, why aren't you using it as an index into a table that
>> contains the pointer to the action function and the op-code
>> struct?
>>
>
>the hash table fetches the (already decoded) opcode struct, and the switch
>is used to have the interpreter actually so something.
>
>Step:
> hash EIP address (actually: EIP+CS.Base);
> try fetch decoded opcode:
> ExecOpcode.
> fail:
> set up hash entry;
> decode opcode at EIP into hash;
> ExecOpcode.
>
>ExecOpcode:
> switch(op->type) //not exact, but close enough
> case Basic: ExecOpcodeBasic(); break;
> case Reg: ExecOpcodeReg(); break;
> case RM: ExecOpcodeRM(); break;
> case RegRM: ExecOpcodeRegRM(); break;
> case RMReg: ExecOpcodeRMReg(); break;
> case Imm: ExecOpcodeImm(); break;
> case RegImm: ExecOpcodeRegImm(); break;
> case RMImm: ExecOpcodeRMImm(); break;
> ...
>
>...
>ExecOpcodeRegRM:
> resolve RM, ...
> switch(op->opnum)
> case ADC: ...
> case ADD: ...
> case AND: ...
> case BSF: ...
> case BSR: ...
> ...

Something isn't quite right here. In the switch ExecOpcodeRegRM
is a function; in the code immediately after it is a label.

One problem with your switches is that they are big. Now a good
compiler can and should optimize a compact set of cases into a
jump table. However all bets are off if there are holes or if
the cases are sequential. If doesn't produce a jump table it
probably defaults to sequential if/then/else chains which are
deadly for large switches. Thus in your ExecOpCode switch you
really want a table of function pointers indexed by op->type so
that the switch becomes something like

optypes[op->type]();

The op types you use for the table have to be approximately
sequential integers. If Intel doesn't give you the right kind of
numbers to use as an index, create an index field in your struct.
You can even have holes in the indexing - just fill the holes
with pointers to an error function.

All of this is fairly standard. There probably is some good
reason why you aren't doing this, but it isn't obvious.

>
>in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a
>fairly heavy ISA...
>
>
>> Mind you, there might be very good reasons for doing what you are
>> doing. I don't have access to your code. Still, it sounds as
>> though your code is cumbersome.
>>
>
>kind of, but it could be a whole lot worse...
>
>
>how would you approach something like x86?...
>
>hopefully not by suggesting a huge switch to dispatch for every possible
>opcode byte/prefix/... and then re-enter for following bytes. many opcodes
>are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE
>opcodes), and then there is the Mod/RM field, which itself is an issue to
>decode.

Shudder. If in fact there are only a couple of thousand op codes
then 2/3 is plausible for organizational reasons. However 4/5
sounds a little fishy. Are there really only a couple of
thousand when you take everything into account.

But no, if there only a few thousand opcodes in 2/3 bytes a
specialized trie should do quite nicely. I like to do
precomputed magic tables myself, but it may turn out that a hash
function will do fine. You aren't by any chance using a prime
number size table are you? The modulo operation could be the
bulk of your time in the hash table.

>
>(if you don't believe me, go and take a good long hard look at the Intel
>docs, and imagine how you would go about writing an interpreter for all this
>stuff...).

Thanks but no thanks. Looking at the Intel docs is very low on
my list of priorities.
>
>
>I started initially trying this, but soon found it unworkable.
>
>I then switched to a more "generic" strategy (essentially grabbing my
>pre-existing disassembler and using this as a template for an opcode
>decoder, which itself was mostly based on a pattern-matching strategy).
>
>it took a bit of tweaking to get the speed up (following that of reworking
>it into a piece of usable interpreter machinery), but it was still fairly
>slow vs the interpreter, hence I added the hash (actually, originally I
>considered a more involved strategy, but this was much simpler for now, and
>can still pull off more-or-less the same effect...).
>
>
>>
>>>(well, all this, as well as a few arithmetic ops, such as adding the
>>>opcode
>>>size to EIP, ...).
>>>
>>>previously I had an opcode decoder which was not super fast (approx 60% of
>>>runtime if sequential decoding is used), but the hash seems to have
>>>largely
>>>eliminated it (since most of the time, pre-decoded opcodes are used from
>>>the
>>>hash).
>>>
>>>
>>>so, alas, it is an x86 interpreter performing at approx 386 (or maybe
>>>486)-like speeds on my computer (~6 MIPS...).
>>>could be faster, except that MSVC on x64 has no real idea what it is doing
>>>WRT optimizations.
>>>
>>>then again, it is around 12x faster than when I started trying to optimize
>>>it (~0.5 MIPS...).
>>>
>>>
>>>note: I have plenty of past experience with both ASM and JIT, but at the
>>>moment am leaning against this, and in favor of using pure C-based
>>>interpretation for now.
>>>
>>>
>>>dunno if anyone knows some major way around this.
>>>
>>>IOW: if there is anything that can be done when the main switch creeps
>>>into
>>>a high position, and code starts becomming very fussy about the fine
>>>details
>>>WRT performance issues...
>>>
>>>past experience is generally not... (that this is a general limit to
>>>C-based
>>>interpreters...).
>>>
>>>
>>>any comments?...
>>>
>>>
>>
>> Richard Harter, cri(a)tiac.net
>> http://home.tiac.net/~cri, http://www.varinoma.com
>> Kafka wasn't an author;
>> Kafka was a prophet!
>
>

Richard Harter, cri(a)tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Kafka wasn't an author;
Kafka was a prophet!

First | Prev | Next | Last
Pages: 1 2 3 4 5
Prev: Is there a debugger or a tool that will find the stack corruption
Next: LPC2478 arm7tdmi 56mhz 8kb cache 32mb ram, jpeg decoding performance