Prev: Is there a debugger or a tool that will find the stack corruption
Next: LPC2478 arm7tdmi 56mhz 8kb cache 32mb ram, jpeg decoding performance
From: Pascal J. Bourguignon on 1 Nov 2009 12:03 "BGB / cr88192" <cr88192(a)hotmail.com> writes: > "Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message > news:877huaihe4.fsf(a)galatea.local... >> "BGB / cr88192" <cr88192(a)hotmail.com> writes: >>> note: I have plenty of past experience with both ASM and JIT, but at the >>> moment am leaning against this, and in favor of using pure C-based >>> interpretation for now. >>> >>> dunno if anyone knows some major way around this. >> >> Modify the C compiler so that it generates optimized code for your >> specific case (ie. hide your specific assembly in the C compiler). >> > > I am using MSVC on x64 for this... So unless you're rich enough to buy the sources of MSVC, forget it. > I can't even do stupid inline assembly, as MS removed this feature for > x64... May I suggest you to switch to an open source compiler (such as gcc)? Or just write your own compiler? > now, I could use my own compiler, but alas then I would have to debug its > output... Hence the suggestion for a debugged compiler such as gcc. > or even GCC's Win64 support, but last time I tried to use it, it produced > code so bad (buggy and ill formed) as to scare me away. granted, it has been > a fairly long time now (~ 1 year already?...), so maybe they have improved > it some, I don't know... This doesn't matter. What matters is that you have the sources, so you can improve its optimizer. (But also, read the interview of Fran Allen in "Coders at Work" (Peter Seibel, Apress), you might prefer another language altogether to be able to really optimize its compiler). -- __Pascal Bourguignon__
From: Richard Harter on 1 Nov 2009 13:26 On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192" <cr88192(a)hotmail.com> wrote: >this is an observation I have made in the past, and I think I will state >here as if it were some kind of rule: >when a profiler says that the largest amount of time used in an interpreter >is the main opcode dispatch switch (rather than some function called by it, >or some logic contained within said switch, ... especially if this switch >does nothing more than call functions...), then they are rapidly approaching >the performance limits they can hope to gain from said interpreter. > >which is kind of lame in this case, as this interpreter is, technically, not >all that fast... > >but, then again, there may still be a little tweaking left, as said switch >has not gained 1st place (it is 2nd), and holds approx 12% of the total >running time. > >1st place, at 15%, is the hash-table for fetching instructions (and, sadly, >is not really optimizable unless I discover some way to 'optimize' basic >arithmetic expressions, which FWIW is about as productive as trying to >optimize a switch...). > >this means: 27% of time: >hash EIP (currently: "((EIP*65521)>>16)&65535"); >fetch opcode-struct from hash-table (the hash-fail rate is apparently very >low); >switch on opcode info (branches to functions containing further dispatch and >logic code). Offhand the above sounds ugly. I don't know many op codes you are using but I expect that it most it is a few hundred. The hash function you use is relatively expensive. Using a hash table at all is relatively expensive. A trie might be cheaper or possibly a perfect hash. Then there is the question of why you are using a switch at all. If the op-code lookup yields an index, why aren't you using it as an index into a table that contains the pointer to the action function and the op-code struct? Mind you, there might be very good reasons for doing what you are doing. I don't have access to your code. Still, it sounds as though your code is cumbersome. >(well, all this, as well as a few arithmetic ops, such as adding the opcode >size to EIP, ...). > >previously I had an opcode decoder which was not super fast (approx 60% of >runtime if sequential decoding is used), but the hash seems to have largely >eliminated it (since most of the time, pre-decoded opcodes are used from the >hash). > > >so, alas, it is an x86 interpreter performing at approx 386 (or maybe >486)-like speeds on my computer (~6 MIPS...). >could be faster, except that MSVC on x64 has no real idea what it is doing >WRT optimizations. > >then again, it is around 12x faster than when I started trying to optimize >it (~0.5 MIPS...). > > >note: I have plenty of past experience with both ASM and JIT, but at the >moment am leaning against this, and in favor of using pure C-based >interpretation for now. > > >dunno if anyone knows some major way around this. > >IOW: if there is anything that can be done when the main switch creeps into >a high position, and code starts becomming very fussy about the fine details >WRT performance issues... > >past experience is generally not... (that this is a general limit to C-based >interpreters...). > > >any comments?... > > Richard Harter, cri(a)tiac.net http://home.tiac.net/~cri, http://www.varinoma.com Kafka wasn't an author; Kafka was a prophet!
From: BGB / cr88192 on 1 Nov 2009 14:16 "Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message news:87y6mqgot3.fsf(a)galatea.local... > "BGB / cr88192" <cr88192(a)hotmail.com> writes: > >> "Pascal J. Bourguignon" <pjb(a)informatimago.com> wrote in message >> news:877huaihe4.fsf(a)galatea.local... >>> "BGB / cr88192" <cr88192(a)hotmail.com> writes: >>>> note: I have plenty of past experience with both ASM and JIT, but at >>>> the >>>> moment am leaning against this, and in favor of using pure C-based >>>> interpretation for now. >>>> >>>> dunno if anyone knows some major way around this. >>> >>> Modify the C compiler so that it generates optimized code for your >>> specific case (ie. hide your specific assembly in the C compiler). >>> >> >> I am using MSVC on x64 for this... > > So unless you're rich enough to buy the sources of MSVC, forget it. > yep. >> I can't even do stupid inline assembly, as MS removed this feature for >> x64... > > May I suggest you to switch to an open source compiler (such as gcc)? > Or just write your own compiler? > GCC on Win64 was very unreliable last I checked (occasionally generating bad code, ...), but it may have improved since then. my compiler works, just I similarly don't trust it that well to produce good code, and using it for compiling much of anything "non-trivial" is likely to leave bugs all over the place which would be rather difficult to track down (I leave it mostly for compiling smaller code fragments at runtime, since a failure here is at least usually obvious and nearly not so difficult to track down and fix...). granted, running head-first into the land of bug fixing would be a good way to improve its reliability, but I guess I am not yet willing to go down this path (AKA: having lots of code built with lots of little hidden bugs...). >> now, I could use my own compiler, but alas then I would have to debug its >> output... > > Hence the suggestion for a debugged compiler such as gcc. > it may have improved since last I looked, but at the time GCC's Win64 output was buggy. I had compiled a big mass of code, and it was crashing. I eventually tracked it down to GCC having been messing up with handling the Win64 calling convention, and so I ended up using MSVC instead, figuring at least its Win64 support demonstratably works reliably... > >> or even GCC's Win64 support, but last time I tried to use it, it produced >> code so bad (buggy and ill formed) as to scare me away. granted, it has >> been >> a fairly long time now (~ 1 year already?...), so maybe they have >> improved >> it some, I don't know... > > This doesn't matter. What matters is that you have the sources, so > you can improve its optimizer. > with GCC, one probably will not have to, as presumably at least GCC has an idea how to produce efficient code (it seems to do so well enough on Linux). so, the biggie issue is mostly a matter of bugs, and I am not really willing to go digging through GCC's (IMO, teh huge) codebase on a bug hunt... granted, they have hopefully improved the bugginess at least, which is the main issue. > (But also, read the interview of Fran Allen in "Coders at Work" (Peter > Seibel, Apress), you might prefer another language altogether to be > able to really optimize its compiler). > I have a fairly large existing C codebase, and I will probably not switch to another language noting that most others have poor C integration. it is, likewise: it is not that there was no reason that I chose x86 as the main ISA for this interpreter... I chose x86 for similar reasons to why I stick with C... even if, granted, x86 is not the most ideal ISA for an interpreter, but hell, at least it allows me to use GCC for building a lot of code that runs within my VM, which is itself enough of a reason IMO... > > -- > __Pascal Bourguignon__
From: BGB / cr88192 on 1 Nov 2009 14:59 "Richard Harter" <cri(a)tiac.net> wrote in message news:4aedcfb0.958856406(a)text.giganews.com... > On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192" > <cr88192(a)hotmail.com> wrote: > >>this is an observation I have made in the past, and I think I will state >>here as if it were some kind of rule: >>when a profiler says that the largest amount of time used in an >>interpreter >>is the main opcode dispatch switch (rather than some function called by >>it, >>or some logic contained within said switch, ... especially if this switch >>does nothing more than call functions...), then they are rapidly >>approaching >>the performance limits they can hope to gain from said interpreter. >> >>which is kind of lame in this case, as this interpreter is, technically, >>not >>all that fast... >> >>but, then again, there may still be a little tweaking left, as said switch >>has not gained 1st place (it is 2nd), and holds approx 12% of the total >>running time. >> >>1st place, at 15%, is the hash-table for fetching instructions (and, >>sadly, >>is not really optimizable unless I discover some way to 'optimize' basic >>arithmetic expressions, which FWIW is about as productive as trying to >>optimize a switch...). >> >>this means: 27% of time: >>hash EIP (currently: "((EIP*65521)>>16)&65535"); >>fetch opcode-struct from hash-table (the hash-fail rate is apparently very >>low); >>switch on opcode info (branches to functions containing further dispatch >>and >>logic code). > > Offhand the above sounds ugly. I don't know many op codes you > are using but I expect that it most it is a few hundred. errm... the x86 ISA has a bit more than this, closer to around 1,800 (core integer ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a bit over 2000. note that this is actual opcodes, not nmonics, where x86 likes using a number of different opcodes with the same nmonic, although in my list there are still around 800 nmonics... as for there being only a few hundred opcodes, in nearly any other bytecode this would probably be the case, but is not so much the case with x86... the interpreter does not at present handle the full set though... but, the hash is not used for opcode lookup/decoding, rather it is used for grabbing already decoded instructions from a cache (which is based on memory address). (this is because instruction decoding is fairly expensive, and so is not done inline, and when instructions are actually decoded, they are converted from an in-memory opcode, to a struct-based representation, which is what is fetched here by the hash lookup). so, it serves a role similar to that of the L1 cache. > The > hash function you use is relatively expensive. Using a hash > table at all is relatively expensive. A trie might be cheaper or > possibly a perfect hash. Then there is the question of why you > are using a switch at all. If the op-code lookup yields an > index, why aren't you using it as an index into a table that > contains the pointer to the action function and the op-code > struct? > the hash table fetches the (already decoded) opcode struct, and the switch is used to have the interpreter actually so something. Step: hash EIP address (actually: EIP+CS.Base); try fetch decoded opcode: ExecOpcode. fail: set up hash entry; decode opcode at EIP into hash; ExecOpcode. ExecOpcode: switch(op->type) //not exact, but close enough case Basic: ExecOpcodeBasic(); break; case Reg: ExecOpcodeReg(); break; case RM: ExecOpcodeRM(); break; case RegRM: ExecOpcodeRegRM(); break; case RMReg: ExecOpcodeRMReg(); break; case Imm: ExecOpcodeImm(); break; case RegImm: ExecOpcodeRegImm(); break; case RMImm: ExecOpcodeRMImm(); break; ... .... ExecOpcodeRegRM: resolve RM, ... switch(op->opnum) case ADC: ... case ADD: ... case AND: ... case BSF: ... case BSR: ... ... .... in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a fairly heavy ISA... > Mind you, there might be very good reasons for doing what you are > doing. I don't have access to your code. Still, it sounds as > though your code is cumbersome. > kind of, but it could be a whole lot worse... how would you approach something like x86?... hopefully not by suggesting a huge switch to dispatch for every possible opcode byte/prefix/... and then re-enter for following bytes. many opcodes are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE opcodes), and then there is the Mod/RM field, which itself is an issue to decode. (if you don't believe me, go and take a good long hard look at the Intel docs, and imagine how you would go about writing an interpreter for all this stuff...). I started initially trying this, but soon found it unworkable. I then switched to a more "generic" strategy (essentially grabbing my pre-existing disassembler and using this as a template for an opcode decoder, which itself was mostly based on a pattern-matching strategy). it took a bit of tweaking to get the speed up (following that of reworking it into a piece of usable interpreter machinery), but it was still fairly slow vs the interpreter, hence I added the hash (actually, originally I considered a more involved strategy, but this was much simpler for now, and can still pull off more-or-less the same effect...). > >>(well, all this, as well as a few arithmetic ops, such as adding the >>opcode >>size to EIP, ...). >> >>previously I had an opcode decoder which was not super fast (approx 60% of >>runtime if sequential decoding is used), but the hash seems to have >>largely >>eliminated it (since most of the time, pre-decoded opcodes are used from >>the >>hash). >> >> >>so, alas, it is an x86 interpreter performing at approx 386 (or maybe >>486)-like speeds on my computer (~6 MIPS...). >>could be faster, except that MSVC on x64 has no real idea what it is doing >>WRT optimizations. >> >>then again, it is around 12x faster than when I started trying to optimize >>it (~0.5 MIPS...). >> >> >>note: I have plenty of past experience with both ASM and JIT, but at the >>moment am leaning against this, and in favor of using pure C-based >>interpretation for now. >> >> >>dunno if anyone knows some major way around this. >> >>IOW: if there is anything that can be done when the main switch creeps >>into >>a high position, and code starts becomming very fussy about the fine >>details >>WRT performance issues... >> >>past experience is generally not... (that this is a general limit to >>C-based >>interpreters...). >> >> >>any comments?... >> >> > > Richard Harter, cri(a)tiac.net > http://home.tiac.net/~cri, http://www.varinoma.com > Kafka wasn't an author; > Kafka was a prophet!
From: Richard Harter on 1 Nov 2009 16:50
On Sun, 1 Nov 2009 12:59:06 -0700, "BGB / cr88192" <cr88192(a)hotmail.com> wrote: > >"Richard Harter" <cri(a)tiac.net> wrote in message >news:4aedcfb0.958856406(a)text.giganews.com... >> On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192" >> <cr88192(a)hotmail.com> wrote: [snip] >> >> Offhand the above sounds ugly. I don't know many op codes you >> are using but I expect that it most it is a few hundred. > >errm... > >the x86 ISA has a bit more than this, closer to around 1,800 (core integer >ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a >bit over 2000. > >note that this is actual opcodes, not nmonics, where x86 likes using a >number of different opcodes with the same nmonic, although in my list there >are still around 800 nmonics... > >as for there being only a few hundred opcodes, in nearly any other bytecode >this would probably be the case, but is not so much the case with x86... That's still not all that bad, but it is less than pleasant. > > >the interpreter does not at present handle the full set though... > > >but, the hash is not used for opcode lookup/decoding, rather it is used for >grabbing already decoded instructions from a cache (which is based on memory >address). > >(this is because instruction decoding is fairly expensive, and so is not >done inline, and when instructions are actually decoded, they are converted >from an in-memory opcode, to a struct-based representation, which is what is >fetched here by the hash lookup). > >so, it serves a role similar to that of the L1 cache. I'm a little confused here. Are we just talking about opcodes (decoded or not) or are we talking about full instructions. If we are only talking about opcodes then everything can be precomputed except an opcode->index mapping. > > >> The >> hash function you use is relatively expensive. Using a hash >> table at all is relatively expensive. A trie might be cheaper or >> possibly a perfect hash. Then there is the question of why you >> are using a switch at all. If the op-code lookup yields an >> index, why aren't you using it as an index into a table that >> contains the pointer to the action function and the op-code >> struct? >> > >the hash table fetches the (already decoded) opcode struct, and the switch >is used to have the interpreter actually so something. > >Step: > hash EIP address (actually: EIP+CS.Base); > try fetch decoded opcode: > ExecOpcode. > fail: > set up hash entry; > decode opcode at EIP into hash; > ExecOpcode. > >ExecOpcode: > switch(op->type) //not exact, but close enough > case Basic: ExecOpcodeBasic(); break; > case Reg: ExecOpcodeReg(); break; > case RM: ExecOpcodeRM(); break; > case RegRM: ExecOpcodeRegRM(); break; > case RMReg: ExecOpcodeRMReg(); break; > case Imm: ExecOpcodeImm(); break; > case RegImm: ExecOpcodeRegImm(); break; > case RMImm: ExecOpcodeRMImm(); break; > ... > >... >ExecOpcodeRegRM: > resolve RM, ... > switch(op->opnum) > case ADC: ... > case ADD: ... > case AND: ... > case BSF: ... > case BSR: ... > ... Something isn't quite right here. In the switch ExecOpcodeRegRM is a function; in the code immediately after it is a label. One problem with your switches is that they are big. Now a good compiler can and should optimize a compact set of cases into a jump table. However all bets are off if there are holes or if the cases are sequential. If doesn't produce a jump table it probably defaults to sequential if/then/else chains which are deadly for large switches. Thus in your ExecOpCode switch you really want a table of function pointers indexed by op->type so that the switch becomes something like optypes[op->type](); The op types you use for the table have to be approximately sequential integers. If Intel doesn't give you the right kind of numbers to use as an index, create an index field in your struct. You can even have holes in the indexing - just fill the holes with pointers to an error function. All of this is fairly standard. There probably is some good reason why you aren't doing this, but it isn't obvious. > >in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a >fairly heavy ISA... > > >> Mind you, there might be very good reasons for doing what you are >> doing. I don't have access to your code. Still, it sounds as >> though your code is cumbersome. >> > >kind of, but it could be a whole lot worse... > > >how would you approach something like x86?... > >hopefully not by suggesting a huge switch to dispatch for every possible >opcode byte/prefix/... and then re-enter for following bytes. many opcodes >are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE >opcodes), and then there is the Mod/RM field, which itself is an issue to >decode. Shudder. If in fact there are only a couple of thousand op codes then 2/3 is plausible for organizational reasons. However 4/5 sounds a little fishy. Are there really only a couple of thousand when you take everything into account. But no, if there only a few thousand opcodes in 2/3 bytes a specialized trie should do quite nicely. I like to do precomputed magic tables myself, but it may turn out that a hash function will do fine. You aren't by any chance using a prime number size table are you? The modulo operation could be the bulk of your time in the hash table. > >(if you don't believe me, go and take a good long hard look at the Intel >docs, and imagine how you would go about writing an interpreter for all this >stuff...). Thanks but no thanks. Looking at the Intel docs is very low on my list of priorities. > > >I started initially trying this, but soon found it unworkable. > >I then switched to a more "generic" strategy (essentially grabbing my >pre-existing disassembler and using this as a template for an opcode >decoder, which itself was mostly based on a pattern-matching strategy). > >it took a bit of tweaking to get the speed up (following that of reworking >it into a piece of usable interpreter machinery), but it was still fairly >slow vs the interpreter, hence I added the hash (actually, originally I >considered a more involved strategy, but this was much simpler for now, and >can still pull off more-or-less the same effect...). > > >> >>>(well, all this, as well as a few arithmetic ops, such as adding the >>>opcode >>>size to EIP, ...). >>> >>>previously I had an opcode decoder which was not super fast (approx 60% of >>>runtime if sequential decoding is used), but the hash seems to have >>>largely >>>eliminated it (since most of the time, pre-decoded opcodes are used from >>>the >>>hash). >>> >>> >>>so, alas, it is an x86 interpreter performing at approx 386 (or maybe >>>486)-like speeds on my computer (~6 MIPS...). >>>could be faster, except that MSVC on x64 has no real idea what it is doing >>>WRT optimizations. >>> >>>then again, it is around 12x faster than when I started trying to optimize >>>it (~0.5 MIPS...). >>> >>> >>>note: I have plenty of past experience with both ASM and JIT, but at the >>>moment am leaning against this, and in favor of using pure C-based >>>interpretation for now. >>> >>> >>>dunno if anyone knows some major way around this. >>> >>>IOW: if there is anything that can be done when the main switch creeps >>>into >>>a high position, and code starts becomming very fussy about the fine >>>details >>>WRT performance issues... >>> >>>past experience is generally not... (that this is a general limit to >>>C-based >>>interpreters...). >>> >>> >>>any comments?... >>> >>> >> >> Richard Harter, cri(a)tiac.net >> http://home.tiac.net/~cri, http://www.varinoma.com >> Kafka wasn't an author; >> Kafka was a prophet! > > Richard Harter, cri(a)tiac.net http://home.tiac.net/~cri, http://www.varinoma.com Kafka wasn't an author; Kafka was a prophet! |