From: Pascal J. Bourguignon on 1 Nov 2009 17:08
"BGB / cr88192" <cr88192(a)hotmail.com> writes:
> dunno if anyone knows some major way around this.

Another way to do it would be to generate portable assembly with LLVM.

--
__Pascal Bourguignon__
From: BGB / cr88192 on 1 Nov 2009 18:14
"Richard Harter" <cri(a)tiac.net> wrote in message news:4aedf3bd.968084843(a)text.giganews.com...
> On Sun, 1 Nov 2009 12:59:06 -0700, "BGB / cr88192" <cr88192(a)hotmail.com> wrote:
>> "Richard Harter" <cri(a)tiac.net> wrote in message news:4aedcfb0.958856406(a)text.giganews.com...
>>> On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192" <cr88192(a)hotmail.com> wrote:
> [snip]
>>> Offhand the above sounds ugly. I don't know how many op codes you are using, but I expect that at most it is a few hundred.
>> errm...
>> the x86 ISA has a bit more than this, closer to around 1,800 (core integer ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a bit over 2,000.
>> note that this is actual opcodes, not mnemonics, where x86 likes using a number of different opcodes with the same mnemonic, although in my list there are still around 800 mnemonics...
>> as for there being only a few hundred opcodes, in nearly any other bytecode this would probably be the case, but it is not so much the case with x86...
> That's still not all that bad, but it is less than pleasant.

yes, ok. some of my past bytecodes ended up with around 500 opcodes.

I think JBC has somewhere around 170-200 opcodes (I forget the exact number); however, this is too few to do "that" much more with it. some newer JVM versions added a few opcodes, with a good deal more functionality being essentially multiplexed in via existing mechanisms (such as method calls, ...).

(actually, it could be loosely compared to building a network filesystem interface on top of HTTP via WebDAV, where pre-existing mechanisms are "extended" to handle new operations, ...).

the alternative strategy is, of course, adding piles of new opcodes...

>> the interpreter does not at present handle the full set though...
>> but, the hash is not used for opcode lookup/decoding; rather, it is used for grabbing already-decoded instructions from a cache (which is keyed by memory address).
>> (this is because instruction decoding is fairly expensive, and so is not done inline; when instructions are actually decoded, they are converted from an in-memory opcode to a struct-based representation, which is what is fetched here by the hash lookup).
>> so, it serves a role similar to that of the L1 cache.
> I'm a little confused here. Are we just talking about opcodes (decoded or not) or are we talking about full instructions? If we are only talking about opcodes then everything can be precomputed except an opcode->index mapping.

ok, I was using the term 'opcode' to refer to the entire instruction (including arguments, ...).

so, in this case the 'opcode' structure contains:
opnum, an internal opcode number, which identifies the mnemonic;
opidx, an internal opcode index, which identifies the binary representation of said instruction;
width, which identifies the bit-width of integer ops;
reg, a register;
op_sz, cached in-memory size of the opcode/instruction (used mostly to step to the next EIP);
rm_base, memory base register, or another data register;
rm_idx, index register;
rm_sc, register scale;
rm_disp, an offset added to memory ops;
rm_fl, which holds flags;
imm, for holding an immediate;
reg2/reg3, for AVX (because we really need 4-register opcodes...);
...

the opcode decoder recognizes the pattern and fills in all the fields in the struct as appropriate. in this way, the main interpreter just grabs the fields it needs from the structure, without having to fiddle with the serialized form.

these structures are keyed by memory address, which is where the hash table comes in.

>>> The hash function you use is relatively expensive. Using a hash table at all is relatively expensive.
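[As a hedged sketch only: the decoded-"opcode" struct described above might look roughly like this in C. Field names follow the post's description, but the types, widths, and helper are guesses, not the poster's actual code.]

```c
#include <stdint.h>

/* Hypothetical sketch of the decoded-instruction struct described above;
   field names follow the post, but types and widths are guesses. */
typedef struct X86Op {
    uint16_t opnum;      /* internal opcode number: identifies the mnemonic */
    uint16_t opidx;      /* internal opcode index: the binary encoding used */
    uint8_t  width;      /* bit-width of integer ops (8/16/32/64) */
    uint8_t  reg;        /* primary register operand */
    uint8_t  op_sz;      /* cached in-memory instruction size */
    uint8_t  rm_base;    /* memory base register, or another data register */
    uint8_t  rm_idx;     /* index register */
    uint8_t  rm_sc;      /* index scale (1/2/4/8) */
    uint32_t rm_fl;      /* flags */
    int32_t  rm_disp;    /* displacement added to memory operands */
    int64_t  imm;        /* immediate operand */
    uint8_t  reg2, reg3; /* extra registers for AVX 4-operand forms */
} X86Op;

/* op_sz is what lets the main loop step EIP without re-reading the bytes */
uint32_t x86op_next_eip(uint32_t eip, const X86Op *op) {
    return eip + op->op_sz;
}
```

The point of caching op_sz (and every other field) is that the interpreter's hot loop never re-touches the variable-length byte encoding after the first decode.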
>>> A trie might be cheaper, or possibly a perfect hash. Then there is the question of why you are using a switch at all. If the op-code lookup yields an index, why aren't you using it as an index into a table that contains the pointer to the action function and the op-code struct?
>> the hash table fetches the (already decoded) opcode struct, and the switch is used to have the interpreter actually do something.
>> Step:
>>   hash EIP address (actually: EIP+CS.Base);
>>   try fetch decoded opcode:
>>     ExecOpcode.
>>   fail:
>>     set up hash entry;
>>     decode opcode at EIP into hash;
>>     ExecOpcode.
>> ExecOpcode:
>>   switch(op->type) //not exact, but close enough
>>     case Basic: ExecOpcodeBasic(); break;
>>     case Reg: ExecOpcodeReg(); break;
>>     case RM: ExecOpcodeRM(); break;
>>     case RegRM: ExecOpcodeRegRM(); break;
>>     case RMReg: ExecOpcodeRMReg(); break;
>>     case Imm: ExecOpcodeImm(); break;
>>     case RegImm: ExecOpcodeRegImm(); break;
>>     case RMImm: ExecOpcodeRMImm(); break;
>>     ...
>> ...
>> ExecOpcodeRegRM:
>>   resolve RM, ...
>>   switch(op->opnum)
>>     case ADC: ...
>>     case ADD: ...
>>     case AND: ...
>>     case BSF: ...
>>     case BSR: ...
>>     ...
> Something isn't quite right here. In the switch ExecOpcodeRegRM is a function; in the code immediately after it is a label.

ok, my notation isn't very good. it is a function, but the label-like notation shows what it contains. ok, imagine it were something like:
ExecOpcodeRegRM():
  ...

> One problem with your switches is that they are big. Now a good compiler can and should optimize a compact set of cases into a jump table. However all bets are off if there are holes or if the cases are not sequential. If it doesn't produce a jump table it probably defaults to sequential if/then/else chains, which are deadly for large switches.
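[The Step flow sketched above, decode-on-miss into an address-keyed cache, could look something like the following in C. All names are hypothetical, and decode_at/exec_opcode are stubs standing in for the real decoder and the big dispatch switch; the hash constant is illustrative, not the poster's.]

```c
#include <stdint.h>

/* Sketch of the Step loop: hash EIP+CS.Base, fetch the already-decoded
   opcode struct from the cache, decode on miss, then dispatch. */
typedef struct Op { uint32_t addr; int valid; /* ...decoded fields... */ } Op;

static Op op_cache[65536];  /* power-of-2 table, keyed by linear address */
int decode_calls;           /* instrumentation for this sketch only */

static uint16_t hash_addr(uint32_t a) {
    return (uint16_t)((a * 2654435761u) >> 12);  /* multiply, shift, truncate */
}

static void decode_at(uint32_t a, Op *op) { (void)a; (void)op; decode_calls++; }
static void exec_opcode(Op *op)           { (void)op; /* the big switch */ }

void step(uint32_t eip, uint32_t cs_base) {
    uint32_t a  = cs_base + eip;          /* EIP + CS.Base, as in the post */
    Op      *op = &op_cache[hash_addr(a)];
    if (!op->valid || op->addr != a) {    /* miss: decode into this slot */
        decode_at(a, op);
        op->addr  = a;
        op->valid = 1;
    }
    exec_opcode(op);                      /* hit: fields are ready to use */
}
```

Executing the same address twice only pays for decoding once, which is the whole argument for keying the cache by address rather than re-decoding inline.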
> Thus in your ExecOpCode switch you really want a table of function pointers indexed by op->type, so that the switch becomes something like
>   optypes[op->type]();
> The op types you use for the table have to be approximately sequential integers. If Intel doesn't give you the right kind of numbers to use as an index, create an index field in your struct. You can even have holes in the indexing - just fill the holes with pointers to an error function.
> All of this is fairly standard. There probably is some good reason why you aren't doing this, but it isn't obvious.

actually, I have noted already that the compiler produces jump tables internally for all these, so this much is good (I know, I have seen the produced ASM for these cases). the main cost is that it ends up taking around 10-12 instructions to handle the jump-table logic, which is mostly the fault of MSVC lameness...

it is apparently mostly just producing if/else style sequences for very small switches, such as, apparently, the switch for segment-overrides. I could use a table for this, but in this case it is not in a place which causes notable impact.

>> in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a fairly heavy ISA...
>>> Mind you, there might be very good reasons for doing what you are doing. I don't have access to your code. Still, it sounds as though your code is cumbersome.
>> kind of, but it could be a whole lot worse...
>> how would you approach something like x86?...
>> hopefully not by suggesting a huge switch to dispatch for every possible opcode byte/prefix/... and then re-enter for following bytes. many opcodes are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE opcodes), and then there is the Mod/RM field, which itself is an issue to decode.
> Shudder. If in fact there are only a couple of thousand op codes then 2/3 is plausible for organizational reasons.
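[The table-of-function-pointers dispatch suggested above can be sketched as follows. The handler names and type enum are hypothetical; handlers return a tag here purely so the sketch is checkable, and a hole/error slot shows the "fill holes with an error function" idea.]

```c
/* Sketch of table-driven dispatch: an array of function pointers indexed
   by op->type replaces the outer switch; out-of-range types hit an error
   handler. All names here are hypothetical. */
enum { OP_BASIC, OP_REG, OP_RM, OP_REGRM, OP_TYPE_MAX };

typedef struct Op { int type; } Op;
typedef int (*OpHandler)(Op *op);

static int exec_basic(Op *op) { (void)op; return OP_BASIC; }
static int exec_reg(Op *op)   { (void)op; return OP_REG; }
static int exec_rm(Op *op)    { (void)op; return OP_RM; }
static int exec_regrm(Op *op) { (void)op; return OP_REGRM; }
static int exec_bad(Op *op)   { (void)op; return -1; }  /* error slot */

static const OpHandler optypes[OP_TYPE_MAX] = {
    exec_basic, exec_reg, exec_rm, exec_regrm,
};

int exec_opcode(Op *op) {
    /* one bounds check plus one indirect call, instead of the 10-12
       instruction jump-table preamble complained about above */
    if (op->type < 0 || op->type >= OP_TYPE_MAX)
        return exec_bad(op);
    return optypes[op->type](op);
}
```

Whether this beats a compiler-generated jump table is exactly the question in the thread: both end in one indirect jump, so the win (if any) is in skipping the range-check boilerplate the compiler emits.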
> However 4/5 sounds a little fishy. Are there really only a couple of thousand when you take everything into account?

this is mostly because of how Intel organizes their opcodes. a lot of the SSE opcodes use 0x66, 0xF3, 0xF2, ... as part of the opcode.

here are a few, which may encode as 3 or 4 bytes for the opcode (X is the optional REX prefix):
addpd     66X0F58/r  xr,xrm
addps     X0F58/r    xr,xrm
addsd     F2X0F58/r  xr,xrm
addss     F3X0F58/r  xr,xrm
addsubpd  66X0FD0/r  xr,xrm
addsubps  F2X0FD0/r  xr,xrm

here are a few with 4 or 5 byte opcodes:
dppd       66X0F3A41/r,ib  xr,xrm,i8
dpps       66X0F3A40/r,ib  xr,xrm,i8
extractps  66X0F3A17/r,ib  xr,xrm,i8
insertps   66X0F3A21/r,ib  xr,xrm,i8
mpsadbw    66X0F3A42/r,ib  xr,xrm,i8
....
aesenc           66X0F38DC/r     xr,xrm
aesenclast       66X0F38DD/r     xr,xrm
aesdec           66X0F38DE/r     xr,xrm
aesdeclast       66X0F38DF/r     xr,xrm
aesimc           66X0F38DB/r     xr,xrm
aeskeygenassist  66X0F3ADF/r,ib  xr,xrm,i8
....

for AVX, Intel switched to an essentially partly-new opcode encoding scheme. the AVX opcodes in the listing thus far seem to be using mostly fixed 4-byte opcodes:
vblendpd   Ijr0D/r,ib  xr,xrv,xrm,i8
           Ijv0D/r,ib  yr,yrv,yrm,i8
vblendps   Ijr0C/r,ib  xr,xrv,xrm,i8
           Ijv0C/r,ib  yr,yrv,yrm,i8
vblendvpd  Ijr4B/r,is  xr,xrv,xrm,xri
           Ijv4B/r,is  yr,yrv,yrm,yri
vblendvps  Ijr4A/r,is  xr,xrv,xrm,xri
           Ijv4A/r,is  yr,yrv,yrm,yri
where Ixx basically encodes a 3-byte VEX prefix, so essentially a 4-byte opcode.

luckily at least, most of the basic opcodes use only 1 or 2 bytes for the opcode...

this is followed by however many more bytes are needed to encode Mod/RM or immediates... so, for example:
mov [eax+4*ecx+0xF00BA5], 0xBAF00
can take 10 bytes in addition to the opcode (this is the "/r" in the listings above). with a 5-byte opcode, this could take 15 bytes for an instruction. luckily, this should not happen in practice (usually it takes special construction to create an instruction exceeding Intel's 16-byte limit).

> But no, if there are only a few thousand opcodes in 2/3 bytes a specialized trie should do quite nicely.
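[The way 66/F2/F3 act as part of the opcode can be illustrated with the four ADD variants above that share the 0F 58 escape bytes. A hedged sketch: the mnemonic IDs and helper name are made up, and the optional REX prefix (the X in the listings) is deliberately not handled.]

```c
#include <stdint.h>

/* Sketch: the mandatory-prefix byte (none/66/F2/F3) selects among the
   four forms sharing the 0F 58 opcode bytes, per the listing above.
   Mnemonic IDs are hypothetical; REX handling is omitted for brevity. */
enum { M_ADDPS, M_ADDPD, M_ADDSD, M_ADDSS, M_UNKNOWN };

int classify_0f58(const uint8_t *p) {
    uint8_t pfx = 0;
    if (*p == 0x66 || *p == 0xF2 || *p == 0xF3)
        pfx = *p++;                  /* consume the mandatory prefix */
    if (p[0] != 0x0F || p[1] != 0x58)
        return M_UNKNOWN;
    switch (pfx) {
    case 0x00: return M_ADDPS;       /*    0F 58 */
    case 0x66: return M_ADDPD;       /* 66 0F 58 */
    case 0xF2: return M_ADDSD;       /* F2 0F 58 */
    default:   return M_ADDSS;       /* F3 0F 58 */
    }
}
```

This is why a naive byte-at-a-time dispatch is painful: the prefix byte alone does not tell you the instruction until you have seen the escape bytes after it.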
> I like to do precomputed magic tables myself, but it may turn out that a hash function will do fine. You aren't by any chance using a prime number size table are you? The modulo operation could be the bulk of your time in the hash table.

luckily the main interpreter does not need to see the horrors of the opcode/instruction encoding scheme, as it is all much more "sterile" than this (all of this is left to the "opcode decoder", which is, luckily, not currently a significant part of the running time).

as for the hash table, it is a power-of-2 size, and hence I use a mask... actually, I have since replaced the mask with a cast to 'u16', since this is slightly faster in this case.

I don't use modulo in hash calculations, as this tends to kill performance. I usually use a multiply, a shift, and a mask. multiply+mask also works, but it requires a Mersenne prime to work well, and has its own set of "interesting" properties, which as I see it did not match the current use (I chose instead to multiply by a non-Mersenne prime and use a shift).

>> (if you don't believe me, go and take a good long hard look at the Intel docs, and imagine how you would go about writing an interpreter for all this stuff...).
> Thanks but no thanks. Looking at the Intel docs is very low on my list of priorities.

I deal with them a lot... much of what I have done in the past few years has involved these docs...
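[A small illustration of the multiply/shift/truncate hash discussed above; the prime and shift count are made up, not the poster's values. For a 65536-entry power-of-2 table, a cast to a 16-bit type and an explicit `& 0xFFFF` mask compute the same index, and no division or modulo is involved anywhere.]

```c
#include <stdint.h>

/* Sketch of the hash scheme above: multiply by a (non-Mersenne) prime,
   shift, then truncate to 16 bits. Constant and shift are illustrative. */
uint32_t hash_mask(uint32_t a) {
    return ((a * 2654435761u) >> 12) & 0xFFFF;   /* explicit mask */
}

uint32_t hash_cast(uint32_t a) {
    return (uint16_t)((a * 2654435761u) >> 12);  /* u16 cast, same result */
}
```

The cast version can be marginally cheaper because the compiler may fold the truncation into a 16-bit register move instead of emitting an AND, which matches the poster's observation.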
From: BGB / cr88192 on 2 Nov 2009 01:24
"Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message news:87my35hp9a.fsf(a)galatea.local...
> "BGB / cr88192" <cr88192(a)hotmail.com> writes:
>> dunno if anyone knows some major way around this.
> Another way to do it would be to generate portable assembly with LLVM.

or, if I were going to go this route, I could use my pre-existing codegen backend as well... (basically, this would be because the interpreter was written to be used along with my pre-existing compiler/VM framework, quite possibly accepting code generated by my custom compiler, as well as possibly libs compiled with GCC or some other compiler for 32-bit x86...).

either way, it would make the general "cost" much higher than an interpreter, although it would allow for much better performance. elsewhere in the thread I think I stated that I was sticking with pure C for now; there are a few reasons for this...

either way, I am not certain that a general-purpose codegen/backend of either style is ideal for JITing x86 machine code, vs, say, a JIT based more around the "model" already in use within x86. granted, a generic JIT could be better for anything<->anything translation, vs x86 to some-other-machine-code (probably x86 or x86-64).

but, for now, the interpreter may have to do as-is... I can maybe micro-optimize it a little further, but I guess at this point I can no longer expect drastic speed-ups (the only other major way to speed it up being by having it do less, as in, the possibility of using system calls for many operations).

another possibility is the use of "magic" sequences (comparable to the prologues and epilogues in Win64), which would consist of special patterns which the interpreter would recognize and handle specially.
I have already done this with a few sequences (revolving around 'rep movsb' and friends), although it seems that the current C library I am using is using generic loop code instead, which of course would not stumble onto these special optimizations (not that it would be difficult to change if needed).

another place is with more complex operations, which is where system calls come into play.

specialized pattern recognition and optimization is also a possibility (similar to a compiler/optimizer), but would likely be too expensive for real-time use, and is not ideally matched to the design I ended up using (where opcodes are executed individually rather than organized into 'traces').

so, for now, it will have to work... I guess a 170x slowdown is not THAT bad, FWIW...

> --
> __Pascal Bourguignon__
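[The 'rep movsb' special case mentioned above could, in a flat-memory sketch, be handled as one bulk copy instead of ECX one-byte steps. This is a hedged sketch, not the poster's code: the Cpu struct and flat guest address space are simplifications, the direction flag is assumed clear (DF=0), and the source and destination are assumed not to overlap (overlapping forward byte copies have replicating semantics that memcpy does not reproduce).]

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the "magic sequence" idea: on recognizing F3 A4 (rep movsb),
   perform one bulk copy from guest memory instead of ECX single-byte
   interpreter iterations. Assumes DF=0 and non-overlapping regions. */
typedef struct Cpu {
    uint32_t ecx, esi, edi;
    uint8_t *mem;                /* flat guest address space (sketch only) */
} Cpu;

void exec_rep_movsb(Cpu *c) {
    memcpy(c->mem + c->edi, c->mem + c->esi, c->ecx);
    c->esi += c->ecx;            /* leave registers exactly as the byte */
    c->edi += c->ecx;            /* loop would after ECX iterations */
    c->ecx  = 0;                 /* rep exhausts the count to zero */
}
```

The win is that a guest-side memcpy of N bytes costs the interpreter one dispatch instead of N, which is why the poster cares whether the guest C library emits `rep movsb` or a generic loop.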
From: tm on 2 Nov 2009 05:26
On 1 Nov., 20:16, "BGB / cr88192" <cr88...(a)hotmail.com> wrote:
> "Pascal J. Bourguignon" <p...(a)informatimago.com> wrote in message news:87y6mqgot3.fsf(a)galatea.local...
>> "BGB / cr88192" <cr88...(a)hotmail.com> writes:
>>> "Pascal J. Bourguignon" <p...(a)informatimago.com> wrote in message news:877huaihe4.fsf(a)galatea.local...
>>>> "BGB / cr88192" <cr88...(a)hotmail.com> writes:
>>>>> note: I have plenty of past experience with both ASM and JIT, but at the moment am leaning against this, and in favor of using pure C-based interpretation for now.
>>>>> dunno if anyone knows some major way around this.
>>>> Modify the C compiler so that it generates optimized code for your specific case (ie. hide your specific assembly in the C compiler).
>>> I am using MSVC on x64 for this...
>> So unless you're rich enough to buy the sources of MSVC, forget it.
> yep.
>>> I can't even do stupid inline assembly, as MS removed this feature for x64...
>> May I suggest you to switch to an open source compiler (such as gcc)? Or just write your own compiler?
> GCC on Win64 was very unreliable last I checked (occasionally generating bad code, ...), but it may have improved since then.

Did you tell them about the errors you found?

> my compiler works, just I similarly don't trust it that well to produce good code, and using it for compiling much of anything "non-trivial" is likely to leave bugs all over the place which would be rather difficult to track down (I leave it mostly for compiling smaller code fragments at runtime, since a failure here is at least usually obvious and not so difficult to track down and fix...).

When you think that GCC and MSVC are not okay, you should definitely use your own C compiler. When there are bugs in it, create a test suite of C programs. That way you can reduce the number of errors.

You seem to write programs in many areas.
But to get these programs finished with good quality I can only suggest: eat your own dogfood.

Greetings Thomas Mertes

Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements and operators, abstract data types, templates without special syntax, OO with interfaces and multiple dispatch, statically typed, interpreted or compiled, portable, runs under linux/unix/windows.
From: BGB / cr88192 on 2 Nov 2009 10:53
"tm" <thomas.mertes(a)gmx.at> wrote in message news:6bc33382-90b5-49aa-a486-fb2992ae8496(a)a21g2000yqc.googlegroups.com...
> On 1 Nov., 20:16, "BGB / cr88192" <cr88...(a)hotmail.com> wrote:
>> "Pascal J. Bourguignon" <p...(a)informatimago.com> wrote in message news:87y6mqgot3.fsf(a)galatea.local...
>>> "BGB / cr88192" <cr88...(a)hotmail.com> writes:
>>>> "Pascal J. Bourguignon" <p...(a)informatimago.com> wrote in message news:877huaihe4.fsf(a)galatea.local...
>>>>> "BGB / cr88192" <cr88...(a)hotmail.com> writes:
>>>>>> note: I have plenty of past experience with both ASM and JIT, but at the moment am leaning against this, and in favor of using pure C-based interpretation for now.
>>>>>> dunno if anyone knows some major way around this.
>>>>> Modify the C compiler so that it generates optimized code for your specific case (ie. hide your specific assembly in the C compiler).
>>>> I am using MSVC on x64 for this...
>>> So unless you're rich enough to buy the sources of MSVC, forget it.
>> yep.
>>>> I can't even do stupid inline assembly, as MS removed this feature for x64...
>>> May I suggest you to switch to an open source compiler (such as gcc)? Or just write your own compiler?
>> GCC on Win64 was very unreliable last I checked (occasionally generating bad code, ...), but it may have improved since then.
> Did you tell them about the errors you found?

I was not sure who to tell about what, nor was I sure of the nature or the cause of the bug (not really knowing GCC's internals...). only that it was a bug with passing arguments (args got messed up in transit, from what I remember due to incorrect register handling).
>> my compiler works, just I similarly don't trust it that well to produce good code, and using it for compiling much of anything "non-trivial" is likely to leave bugs all over the place which would be rather difficult to track down (I leave it mostly for compiling smaller code fragments at runtime, since a failure here is at least usually obvious and not so difficult to track down and fix...).
> When you think that GCC and MSVC are not okay you should definitely use your own C compiler. When there are bugs in it create a test suite of C programs. That way you can reduce the number of errors.

yeah, proper testing should be good... I guess I have not done a whole lot of this.

as for MSVC, usually it is good enough (after all, it is good enough for MS). it is just not as ideal for areas of code which notably affect performance. my compiler can often produce more efficient code (except in the case of switch, errm... because I never got around to properly implementing jump-tables, and so there is an "O(log2 n)" switch overhead...). so, nevermind switch, I do address many optimizations which MSVC apparently does not, which at the time I had found unimpressive (given MS has lots of resources, ...).

> You seem to write programs in many areas. But to get these programs finished with good quality I can only suggest: eat your own dogfood.

I originally wrote it mostly for JIT / at-runtime compilation, and this is mostly what I use it for (generating scripts at runtime, and compiling and running these). it was only recently that I had (ever) used it for a statically compiled piece of code, but I forget for what reason I made it usable as a static compiler.
actually, now I remember: it was the idea of using the Win64 calling convention on Linux, just before I ended up figuring that I could probably get more for less effort by instead using virtualized x86 as a cross-platform IL (in between was a failed attempt at designing a bytecode, which ended up as a hybrid of x86 and EFI bytecode...).

I guess if I were to use it as a static compiler, for building project sources, I would have to fix up a few of its known deficiencies:
its lack of a properly working 'switch';
its lack of statically-initialized structs;
....

and, I guess, maybe writing proper test-cases. I have a few small pieces of code which I had been using for tests, but they are far from comprehensive...

> Greetings Thomas Mertes
>
> Seed7 Homepage: http://seed7.sourceforge.net
> Seed7 - The extensible programming language: User defined statements and operators, abstract data types, templates without special syntax, OO with interfaces and multiple dispatch, statically typed, interpreted or compiled, portable, runs under linux/unix/windows.