From: Pascal J. Bourguignon on 1 Nov 2009 17:08
"BGB / cr88192" <cr88192(a)hotmail.com> writes:
> dunno if anyone knows some major way around this.

Another way to do it would be to generate portable assembly with LLVM.

--
__Pascal Bourguignon__
From: BGB / cr88192 on 1 Nov 2009 18:14
"Richard Harter" <cri(a)tiac.net> wrote in message news:4aedf3bd.968084843(a)text.giganews.com...
> On Sun, 1 Nov 2009 12:59:06 -0700, "BGB / cr88192" <cr88192(a)hotmail.com> wrote:
>> "Richard Harter" <cri(a)tiac.net> wrote in message news:4aedcfb0.958856406(a)text.giganews.com...
>>> On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192" <cr88192(a)hotmail.com> wrote:
> [snip]
>>> Offhand the above sounds ugly. I don't know how many op codes you are using, but I expect that at most it is a few hundred.
>> errm...
>> the x86 ISA has a bit more than this, closer to around 1,800 (core integer ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a bit over 2,000.
>> note that this is actual opcodes, not mnemonics, where x86 likes using a number of different opcodes with the same mnemonic, although in my list there are still around 800 mnemonics...
>> as for there being only a few hundred opcodes, in nearly any other bytecode this would probably be the case, but it is not so much the case with x86...
> That's still not all that bad, but it is less than pleasant.

yes, ok. some of my past bytecodes ended up with around 500 opcodes.

I think JBC has somewhere around 170-200 opcodes (I forget the exact number); however, this is too few to do "that" much more with it. some newer JVM versions added a few opcodes, with a good deal more functionality being essentially multiplexed in via existing mechanisms (such as method calls, ...).

(actually, it could be loosely compared to building a network filesystem interface on top of HTTP via WebDAV, where pre-existing mechanisms are "extended" to handle new operations, ...).

the alternative strategy is, of course, adding piles of new opcodes...

>> the interpreter does not at present handle the full set though...
>> but, the hash is not used for opcode lookup/decoding; rather, it is used for grabbing already-decoded instructions from a cache (which is keyed by memory address).
>> (this is because instruction decoding is fairly expensive, and so is not done inline; when instructions are actually decoded, they are converted from an in-memory opcode to a struct-based representation, which is what is fetched here by the hash lookup).
>> so, it serves a role similar to that of the L1 cache.
> I'm a little confused here. Are we just talking about opcodes (decoded or not) or are we talking about full instructions? If we are only talking about opcodes then everything can be precomputed except an opcode->index mapping.

ok, I was using the term 'opcode' to refer to the entire instruction (including arguments, ...).

so, in this case the 'opcode' structure contains:
opnum, an internal opcode number, which identifies the mnemonic;
opidx, an internal opcode index, which identifies the binary representation of said instruction;
width, which identifies the bit-width of integer ops;
reg, a register;
op_sz, cached in-memory size of the opcode/instruction (used mostly to step to the next EIP);
rm_base, memory base register, or another data register;
rm_idx, index register;
rm_sc, register scale;
rm_disp, an offset added to memory ops;
rm_fl, which holds flags;
imm, for holding an immediate;
reg2/reg3, for AVX (because we really need 4-register opcodes...);
...

the opcode decoder recognizes the pattern and fills in all the fields in the struct as appropriate. in this way, the main interpreter just grabs the fields it needs from the structure, without having to fiddle with the serialized form.

these structures are keyed by memory address, which is where the hash table comes in.

>>> The hash function you use is relatively expensive. Using a hash table at all is relatively expensive.
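[As a hedged sketch only: the decoded-"opcode" struct described above might look roughly like this in C. Field names follow the post's description, but the types, widths, and helper are guesses, not the poster's actual code.]

```c
#include <stdint.h>

/* Hypothetical sketch of the decoded-instruction struct described above;
   field names follow the post, but types and widths are guesses. */
typedef struct X86Op {
    uint16_t opnum;      /* internal opcode number: identifies the mnemonic */
    uint16_t opidx;      /* internal opcode index: the binary encoding used */
    uint8_t  width;      /* bit-width of integer ops (8/16/32/64) */
    uint8_t  reg;        /* primary register operand */
    uint8_t  op_sz;      /* cached in-memory instruction size */
    uint8_t  rm_base;    /* memory base register, or another data register */
    uint8_t  rm_idx;     /* index register */
    uint8_t  rm_sc;      /* index scale (1/2/4/8) */
    uint32_t rm_fl;      /* flags */
    int32_t  rm_disp;    /* displacement added to memory operands */
    int64_t  imm;        /* immediate operand */
    uint8_t  reg2, reg3; /* extra registers for AVX 4-operand forms */
} X86Op;

/* op_sz is what lets the main loop step EIP without re-reading the bytes */
uint32_t x86op_next_eip(uint32_t eip, const X86Op *op) {
    return eip + op->op_sz;
}
```

The point of caching op_sz (and every other field) is that the interpreter's hot loop never re-touches the variable-length byte encoding after the first decode.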
>>> A trie might be cheaper, or possibly a perfect hash. Then there is the question of why you are using a switch at all. If the op-code lookup yields an index, why aren't you using it as an index into a table that contains the pointer to the action function and the op-code struct?
>> the hash table fetches the (already decoded) opcode struct, and the switch is used to have the interpreter actually do something.
>> Step:
>>   hash EIP address (actually: EIP+CS.Base);
>>   try fetch decoded opcode:
>>     ExecOpcode.
>>   fail:
>>     set up hash entry;
>>     decode opcode at EIP into hash;
>>     ExecOpcode.
>> ExecOpcode:
>>   switch(op->type) //not exact, but close enough
>>     case Basic: ExecOpcodeBasic(); break;
>>     case Reg: ExecOpcodeReg(); break;
>>     case RM: ExecOpcodeRM(); break;
>>     case RegRM: ExecOpcodeRegRM(); break;
>>     case RMReg: ExecOpcodeRMReg(); break;
>>     case Imm: ExecOpcodeImm(); break;
>>     case RegImm: ExecOpcodeRegImm(); break;
>>     case RMImm: ExecOpcodeRMImm(); break;
>>     ...
>> ...
>> ExecOpcodeRegRM:
>>   resolve RM, ...
>>   switch(op->opnum)
>>     case ADC: ...
>>     case ADD: ...
>>     case AND: ...
>>     case BSF: ...
>>     case BSR: ...
>>     ...
> Something isn't quite right here. In the switch ExecOpcodeRegRM is a function; in the code immediately after it is a label.

ok, my notation isn't very good. it is a function, but the label-like notation shows what it contains. ok, imagine it were something like:
ExecOpcodeRegRM():
  ...

> One problem with your switches is that they are big. Now a good compiler can and should optimize a compact set of cases into a jump table. However all bets are off if there are holes or if the cases are not sequential. If it doesn't produce a jump table it probably defaults to sequential if/then/else chains, which are deadly for large switches.
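[The Step flow sketched above, decode-on-miss into an address-keyed cache, could look something like the following in C. All names are hypothetical, and decode_at/exec_opcode are stubs standing in for the real decoder and the big dispatch switch; the hash constant is illustrative, not the poster's.]

```c
#include <stdint.h>

/* Sketch of the Step loop: hash EIP+CS.Base, fetch the already-decoded
   opcode struct from the cache, decode on miss, then dispatch. */
typedef struct Op { uint32_t addr; int valid; /* ...decoded fields... */ } Op;

static Op op_cache[65536];  /* power-of-2 table, keyed by linear address */
int decode_calls;           /* instrumentation for this sketch only */

static uint16_t hash_addr(uint32_t a) {
    return (uint16_t)((a * 2654435761u) >> 12);  /* multiply, shift, truncate */
}

static void decode_at(uint32_t a, Op *op) { (void)a; (void)op; decode_calls++; }
static void exec_opcode(Op *op)           { (void)op; /* the big switch */ }

void step(uint32_t eip, uint32_t cs_base) {
    uint32_t a  = cs_base + eip;          /* EIP + CS.Base, as in the post */
    Op      *op = &op_cache[hash_addr(a)];
    if (!op->valid || op->addr != a) {    /* miss: decode into this slot */
        decode_at(a, op);
        op->addr  = a;
        op->valid = 1;
    }
    exec_opcode(op);                      /* hit: fields are ready to use */
}
```

Executing the same address twice only pays for decoding once, which is the whole argument for keying the cache by address rather than re-decoding inline.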
> Thus in your ExecOpCode switch you really want a table of function pointers indexed by op->type, so that the switch becomes something like
>   optypes[op->type]();
> The op types you use for the table have to be approximately sequential integers. If Intel doesn't give you the right kind of numbers to use as an index, create an index field in your struct. You can even have holes in the indexing - just fill the holes with pointers to an error function.
> All of this is fairly standard. There probably is some good reason why you aren't doing this, but it isn't obvious.

actually, I have noted already that the compiler produces jump tables internally for all these, so this much is good (I know, I have seen the produced ASM for these cases). the main cost is that it ends up taking around 10-12 instructions to handle the jump-table logic, which is mostly the fault of MSVC lameness...

it is apparently mostly just producing if/else style sequences for very small switches, such as, apparently, the switch for segment-overrides. I could use a table for this, but in this case it is not in a place which causes notable impact.

>> in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a fairly heavy ISA...
>>> Mind you, there might be very good reasons for doing what you are doing. I don't have access to your code. Still, it sounds as though your code is cumbersome.
>> kind of, but it could be a whole lot worse...
>> how would you approach something like x86?...
>> hopefully not by suggesting a huge switch to dispatch for every possible opcode byte/prefix/... and then re-enter for following bytes. many opcodes are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE opcodes), and then there is the Mod/RM field, which itself is an issue to decode.
> Shudder. If in fact there are only a couple of thousand op codes then 2/3 is plausible for organizational reasons.
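[The table-of-function-pointers dispatch suggested above can be sketched as follows. The handler names and type enum are hypothetical; handlers return a tag here purely so the sketch is checkable, and a hole/error slot shows the "fill holes with an error function" idea.]

```c
/* Sketch of table-driven dispatch: an array of function pointers indexed
   by op->type replaces the outer switch; out-of-range types hit an error
   handler. All names here are hypothetical. */
enum { OP_BASIC, OP_REG, OP_RM, OP_REGRM, OP_TYPE_MAX };

typedef struct Op { int type; } Op;
typedef int (*OpHandler)(Op *op);

static int exec_basic(Op *op) { (void)op; return OP_BASIC; }
static int exec_reg(Op *op)   { (void)op; return OP_REG; }
static int exec_rm(Op *op)    { (void)op; return OP_RM; }
static int exec_regrm(Op *op) { (void)op; return OP_REGRM; }
static int exec_bad(Op *op)   { (void)op; return -1; }  /* error slot */

static const OpHandler optypes[OP_TYPE_MAX] = {
    exec_basic, exec_reg, exec_rm, exec_regrm,
};

int exec_opcode(Op *op) {
    /* one bounds check plus one indirect call, instead of the 10-12
       instruction jump-table preamble complained about above */
    if (op->type < 0 || op->type >= OP_TYPE_MAX)
        return exec_bad(op);
    return optypes[op->type](op);
}
```

Whether this beats a compiler-generated jump table is exactly the question in the thread: both end in one indirect jump, so the win (if any) is in skipping the range-check boilerplate the compiler emits.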
> However 4/5 sounds a little fishy. Are there really only a couple of thousand when you take everything into account?

this is mostly because of how Intel organizes their opcodes. a lot of the SSE opcodes use 0x66, 0xF3, 0xF2, ... as part of the opcode.

here are a few, which may encode as 3 or 4 bytes for the opcode (X is the optional REX prefix):
addpd     66X0F58/r  xr,xrm
addps     X0F58/r    xr,xrm
addsd     F2X0F58/r  xr,xrm
addss     F3X0F58/r  xr,xrm
addsubpd  66X0FD0/r  xr,xrm
addsubps  F2X0FD0/r  xr,xrm

here are a few with 4 or 5 byte opcodes:
dppd       66X0F3A41/r,ib  xr,xrm,i8
dpps       66X0F3A40/r,ib  xr,xrm,i8
extractps  66X0F3A17/r,ib  xr,xrm,i8
insertps   66X0F3A21/r,ib  xr,xrm,i8
mpsadbw    66X0F3A42/r,ib  xr,xrm,i8
....
aesenc           66X0F38DC/r     xr,xrm
aesenclast       66X0F38DD/r     xr,xrm
aesdec           66X0F38DE/r     xr,xrm
aesdeclast       66X0F38DF/r     xr,xrm
aesimc           66X0F38DB/r     xr,xrm
aeskeygenassist  66X0F3ADF/r,ib  xr,xrm,i8
....

for AVX, Intel switched to an essentially partly-new opcode encoding scheme. the AVX opcodes in the listing thus far seem to be using mostly fixed 4-byte opcodes:
vblendpd   Ijr0D/r,ib  xr,xrv,xrm,i8
           Ijv0D/r,ib  yr,yrv,yrm,i8
vblendps   Ijr0C/r,ib  xr,xrv,xrm,i8
           Ijv0C/r,ib  yr,yrv,yrm,i8
vblendvpd  Ijr4B/r,is  xr,xrv,xrm,xri
           Ijv4B/r,is  yr,yrv,yrm,yri
vblendvps  Ijr4A/r,is  xr,xrv,xrm,xri
           Ijv4A/r,is  yr,yrv,yrm,yri
where Ixx basically encodes a 3-byte VEX prefix, so essentially a 4-byte opcode.

luckily at least, most of the basic opcodes use only 1 or 2 bytes for the opcode...

this is followed by however many more bytes are needed to encode Mod/RM or immediates... so, for example:
mov [eax+4*ecx+0xF00BA5], 0xBAF00
can take 10 bytes in addition to the opcode (this is the "/r" in the listings above). with a 5-byte opcode, this could take 15 bytes for an instruction. luckily, this should not happen in practice (usually it takes special construction to create an instruction exceeding Intel's 16-byte limit).

> But no, if there are only a few thousand opcodes in 2/3 bytes a specialized trie should do quite nicely.
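[The way 66/F2/F3 act as part of the opcode can be illustrated with the four ADD variants above that share the 0F 58 escape bytes. A hedged sketch: the mnemonic IDs and helper name are made up, and the optional REX prefix (the X in the listings) is deliberately not handled.]

```c
#include <stdint.h>

/* Sketch: the mandatory-prefix byte (none/66/F2/F3) selects among the
   four forms sharing the 0F 58 opcode bytes, per the listing above.
   Mnemonic IDs are hypothetical; REX handling is omitted for brevity. */
enum { M_ADDPS, M_ADDPD, M_ADDSD, M_ADDSS, M_UNKNOWN };

int classify_0f58(const uint8_t *p) {
    uint8_t pfx = 0;
    if (*p == 0x66 || *p == 0xF2 || *p == 0xF3)
        pfx = *p++;                  /* consume the mandatory prefix */
    if (p[0] != 0x0F || p[1] != 0x58)
        return M_UNKNOWN;
    switch (pfx) {
    case 0x00: return M_ADDPS;       /*    0F 58 */
    case 0x66: return M_ADDPD;       /* 66 0F 58 */
    case 0xF2: return M_ADDSD;       /* F2 0F 58 */
    default:   return M_ADDSS;       /* F3 0F 58 */
    }
}
```

This is why a naive byte-at-a-time dispatch is painful: the prefix byte alone does not tell you the instruction until you have seen the escape bytes after it.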
> I like to do precomputed magic tables myself, but it may turn out that a hash function will do fine. You aren't by any chance using a prime number size table are you? The modulo operation could be the bulk of your time in the hash table.

luckily the main interpreter does not need to see the horrors of the opcode/instruction encoding scheme, as it is all much more "sterile" than this (all of this is left to the "opcode decoder", which is, luckily, not currently a significant part of the running time).

as for the hash table, it is a power-of-2 size, and hence I use a mask... actually, I have since replaced the mask with a cast to 'u16', since this is slightly faster in this case.

I don't use modulo in hash calculations, as this tends to kill performance. I usually use a multiply, a shift, and a mask. multiply+mask also works, but it requires a Mersenne prime to work well, and has its own set of "interesting" properties, which as I see it did not match the current use (I chose instead to multiply by a non-Mersenne prime and use a shift).

>> (if you don't believe me, go and take a good long hard look at the Intel docs, and imagine how you would go about writing an interpreter for all this stuff...).
> Thanks but no thanks. Looking at the Intel docs is very low on my list of priorities.

I deal with them a lot... much of what I have done in the past few years has involved these docs...
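[A small illustration of the multiply/shift/truncate hash discussed above; the prime and shift count are made up, not the poster's values. For a 65536-entry power-of-2 table, a cast to a 16-bit type and an explicit `& 0xFFFF` mask compute the same index, and no division or modulo is involved anywhere.]

```c
#include <stdint.h>

/* Sketch of the hash scheme above: multiply by a (non-Mersenne) prime,
   shift, then truncate to 16 bits. Constant and shift are illustrative. */
uint32_t hash_mask(uint32_t a) {
    return ((a * 2654435761u) >> 12) & 0xFFFF;   /* explicit mask */
}

uint32_t hash_cast(uint32_t a) {
    return (uint16_t)((a * 2654435761u) >> 12);  /* u16 cast, same result */
}
```

The cast version can be marginally cheaper because the compiler may fold the truncation into a 16-bit register move instead of emitting an AND, which matches the poster's observation.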
From: BGB / cr88192 on 2 Nov 2009 01:24
"Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message news:87my35hp9a.fsf(a)galatea.local...
> "BGB / cr88192" <cr88192(a)hotmail.com> writes:
>> dunno if anyone knows some major way around this.
> Another way to do it would be to generate portable assembly with LLVM.

or, if I were going to go this route, I could use my pre-existing codegen backend as well... (basically, this would be because the interpreter was written to be used along with my pre-existing compiler/VM framework, quite possibly accepting code generated by my custom compiler, as well as possibly libs compiled with GCC or some other compiler for 32-bit x86...).

either way, it would make the general "cost" much higher than an interpreter, although it would allow for much better performance. elsewhere in the thread I think I stated that I was sticking with pure C for now; there are a few reasons for this...

either way, I am not certain that a general-purpose codegen/backend of either style is ideal for JITing x86 machine code, vs, say, a JIT based more around the "model" already in use within x86. granted, a generic JIT could be better for anything<->anything translation, vs x86 to some-other-machine-code (probably x86 or x86-64).

but, for now, the interpreter may have to do as-is... I can maybe micro-optimize it a little further, but I guess at this point I can no longer expect drastic speed-ups (the only other major way to speed it up being by having it do less, as in, the possibility of using system calls for many operations).

another possibility is the use of "magic" sequences (comparable to the prologues and epilogues in Win64), which would consist of special patterns which the interpreter would recognize and handle specially.
I have already done this with a few sequences (revolving around 'rep movsb' and friends), although it seems that the current C library I am using is using generic loop code instead, which of course would not stumble onto these special optimizations (not that it would be difficult to change if needed).

another place is with more complex operations, which is where system calls come into play.

specialized pattern recognition and optimization is also a possibility (similar to a compiler/optimizer), but would likely be too expensive for real-time use, and is not ideally matched to the design I ended up using (where opcodes are executed individually rather than organized into 'traces').

so, for now, it will have to work... I guess a 170x slowdown is not THAT bad, FWIW...

> --
> __Pascal Bourguignon__
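[The 'rep movsb' special case mentioned above could, in a flat-memory sketch, be handled as one bulk copy instead of ECX one-byte steps. This is a hedged sketch, not the poster's code: the Cpu struct and flat guest address space are simplifications, the direction flag is assumed clear (DF=0), and the source and destination are assumed not to overlap (overlapping forward byte copies have replicating semantics that memcpy does not reproduce).]

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the "magic sequence" idea: on recognizing F3 A4 (rep movsb),
   perform one bulk copy from guest memory instead of ECX single-byte
   interpreter iterations. Assumes DF=0 and non-overlapping regions. */
typedef struct Cpu {
    uint32_t ecx, esi, edi;
    uint8_t *mem;                /* flat guest address space (sketch only) */
} Cpu;

void exec_rep_movsb(Cpu *c) {
    memcpy(c->mem + c->edi, c->mem + c->esi, c->ecx);
    c->esi += c->ecx;            /* leave registers exactly as the byte */
    c->edi += c->ecx;            /* loop would after ECX iterations */
    c->ecx  = 0;                 /* rep exhausts the count to zero */
}
```

The win is that a guest-side memcpy of N bytes costs the interpreter one dispatch instead of N, which is why the poster cares whether the guest C library emits `rep movsb` or a generic loop.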
From: tm on 2 Nov 2009 05:26
On 1 Nov., 20:16, "BGB / cr88192" <cr88...(a)hotmail.com> wrote:
> "Pascal J. Bourguignon" <p...(a)informatimago.com> wrote in message news:87y6mqgot3.fsf(a)galatea.local...
>> "BGB / cr88192" <cr88...(a)hotmail.com> writes:
>>> "Pascal J. Bourguignon" <p...(a)informatimago.com> wrote in message news:877huaihe4.fsf(a)galatea.local...
>>>> "BGB / cr88192" <cr88...(a)hotmail.com> writes:
>>>>> note: I have plenty of past experience with both ASM and JIT, but at the moment am leaning against this, and in favor of using pure C-based interpretation for now.
>>>>> dunno if anyone knows some major way around this.
>>>> Modify the C compiler so that it generates optimized code for your specific case (ie. hide your specific assembly in the C compiler).
>>> I am using MSVC on x64 for this...
>> So unless you're rich enough to buy the sources of MSVC, forget it.
> yep.
>>> I can't even do stupid inline assembly, as MS removed this feature for x64...
>> May I suggest you to switch to an open source compiler (such as gcc)? Or just write your own compiler?
> GCC on Win64 was very unreliable last I checked (occasionally generating bad code, ...), but it may have improved since then.

Did you tell them about the errors you found?

> my compiler works, just I similarly don't trust it that well to produce good code, and using it for compiling much of anything "non-trivial" is likely to leave bugs all over the place which would be rather difficult to track down (I leave it mostly for compiling smaller code fragments at runtime, since a failure here is at least usually obvious and not so difficult to track down and fix...).

When you think that GCC and MSVC are not okay, you should definitely use your own C compiler. When there are bugs in it, create a test suite of C programs. That way you can reduce the number of errors.

You seem to write programs in many areas.
But to get these programs finished with good quality I can only suggest: eat your own dogfood.

Greetings Thomas Mertes

Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements and operators, abstract data types, templates without special syntax, OO with interfaces and multiple dispatch, statically typed, interpreted or compiled, portable, runs under linux/unix/windows.
From: BGB / cr88192 on 2 Nov 2009 10:53
"tm" <thomas.mertes(a)gmx.at> wrote in message news:6bc33382-90b5-49aa-a486-fb2992ae8496(a)a21g2000yqc.googlegroups.com...
> On 1 Nov., 20:16, "BGB / cr88192" <cr88...(a)hotmail.com> wrote:
>> "Pascal J. Bourguignon" <p...(a)informatimago.com> wrote in message news:87y6mqgot3.fsf(a)galatea.local...
>>> "BGB / cr88192" <cr88...(a)hotmail.com> writes:
>>>> "Pascal J. Bourguignon" <p...(a)informatimago.com> wrote in message news:877huaihe4.fsf(a)galatea.local...
>>>>> "BGB / cr88192" <cr88...(a)hotmail.com> writes:
>>>>>> note: I have plenty of past experience with both ASM and JIT, but at the moment am leaning against this, and in favor of using pure C-based interpretation for now.
>>>>>> dunno if anyone knows some major way around this.
>>>>> Modify the C compiler so that it generates optimized code for your specific case (ie. hide your specific assembly in the C compiler).
>>>> I am using MSVC on x64 for this...
>>> So unless you're rich enough to buy the sources of MSVC, forget it.
>> yep.
>>>> I can't even do stupid inline assembly, as MS removed this feature for x64...
>>> May I suggest you to switch to an open source compiler (such as gcc)? Or just write your own compiler?
>> GCC on Win64 was very unreliable last I checked (occasionally generating bad code, ...), but it may have improved since then.
> Did you tell them about the errors you found?

I was not sure who to tell about what, nor was I sure of the nature or the cause of the bug (not really knowing GCC's internals...). only that it was a bug with passing arguments (args got messed up in transit, from what I remember due to incorrect register handling).
>> my compiler works, just I similarly don't trust it that well to produce good code, and using it for compiling much of anything "non-trivial" is likely to leave bugs all over the place which would be rather difficult to track down (I leave it mostly for compiling smaller code fragments at runtime, since a failure here is at least usually obvious and not so difficult to track down and fix...).
> When you think that GCC and MSVC are not okay you should definitely use your own C compiler. When there are bugs in it create a test suite of C programs. That way you can reduce the number of errors.

yeah, proper testing should be good... I guess I have not done a whole lot of this.

as for MSVC, usually it is good enough (after all, it is good enough for MS). it is just not as ideal for areas of code which notably affect performance. my compiler can often produce more efficient code (except in the case of switch, errm... because I never got around to properly implementing jump-tables, and so there is an "O(log2 n)" switch overhead...). so, nevermind switch, I do address many optimizations which MSVC apparently does not, which at the time I had found unimpressive (given MS has lots of resources, ...).

> You seem to write programs in many areas. But to get these programs finished with good quality I can only suggest: eat your own dogfood.

I originally wrote it mostly for JIT / at-runtime compilation, and this is mostly what I use it for (generating scripts at runtime, and compiling and running these). it was only recently that I had (ever) used it for a statically compiled piece of code, but I forget for what reason I made it usable as a static compiler.
actually, now I remember: it was the idea of using the Win64 calling convention on Linux, just before I ended up figuring that I could probably get more for less effort by instead using virtualized x86 as a cross-platform IL (in between was a failed attempt at designing a bytecode, which ended up as a hybrid of x86 and EFI bytecode...).

I guess if I were to use it as a static compiler, for building project sources, I would have to fix up a few of its known deficiencies:
its lack of a properly working 'switch';
its lack of statically-initialized structs;
....

and, I guess, maybe writing proper test-cases. I have a few small pieces of code which I had been using for tests, but they are far from comprehensive...

> Greetings Thomas Mertes
>
> Seed7 Homepage: http://seed7.sourceforge.net
> Seed7 - The extensible programming language: User defined statements and operators, abstract data types, templates without special syntax, OO with interfaces and multiple dispatch, statically typed, interpreted or compiled, portable, runs under linux/unix/windows.