Two Click disassembly/reassembly [ASM]

From: randyhyde@earthlink.net on 27 Jan 2006 14:38

Charles A. Crayne wrote:
> On 26 Jan 2006 16:32:51 -0800
> "randyhyde(a)earthlink.net" <randyhyde(a)earthlink.net> wrote:
>
> :Now suppose that instead of a nice array of table entries, I have a
> :two-dimensional table of these objects. Or how about a linked list?
> :Or maybe parallel arrays? Or any other data structure I can dream up.
>
> Just for the record, how many instances of such structures do you expect
> to find in Betov's source code?

Is this tool *simply* for Betov's personal use? If so, it will be *far*
more practical to take the time wasted on writing the tool and invest
that time into hand translating all of his applications to the new CPU.
Or better yet, spend 25% of the time learning C/C++ and another 25% of
the time rewriting the apps in C/C++ in a truly portable manner.

And as for the structs appearing in Rene's code, RosAsm doesn't support
structs, which is one more reason the conversion to a different CPU is
going to be more difficult. When you've got statements like "mov eax,
D$someFakeStruct+someOffset", it's a bit more work to divine the meaning
of this statement and adjust the offsets of the structure fields so
that they work properly on a CPU that doesn't support the same
alignments as the x86.
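
A minimal sketch of the offset problem, in Python rather than assembly
so that it runs anywhere. The helper name field_offset and the record
layout (a one-byte tag followed by a 32-bit value) are made up for
illustration; the host's native alignment rules stand in for a target
CPU that is stricter than the x86.

```python
import struct

def field_offset(prefix_format: str, field_code: str, aligned: bool) -> int:
    """Byte offset of a field that follows the fields named in prefix_format.

    Uses struct-module format codes: "B" is one byte, "H" is 16 bits,
    "I" is 32 bits.
    """
    # "=" means standard sizes with no padding (the byte-packed layout that
    # hand-computed x86 offsets can get away with); "@" means the host's
    # native alignment rules, standing in here for a stricter target CPU.
    mode = "@" if aligned else "="
    # A zero repeat count adds no field; in "@" mode it only pads up to the
    # alignment boundary of field_code, and in "=" mode it adds nothing.
    return struct.calcsize(f"{mode}{prefix_format}0{field_code}")

# Hypothetical record: a one-byte tag followed by a 32-bit value.
packed_offset = field_offset("B", "I", aligned=False)
aligned_offset = field_offset("B", "I", aligned=True)
print(packed_offset, aligned_offset)
```

On a typical host where a 32-bit integer is 4-byte aligned, the two
offsets differ (1 versus 4). A translator that only sees
"someFakeStruct+1" in the source has no declaration telling it that the
constant is a field offset that needs to be recomputed.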

>
> :Do you honestly think you can come up with a generic algorithm that
> :will figure this stuff out and correctly map the offsets onto the new
> :architecture?
>
> Probably not a generic algorithm, although even that might be possible,
> but I could probably come up with some pretty good heuristics.

The problem with "pretty good heuristics" is that we're discussing
*one* issue here. A nasty one, but *one* issue. This isn't a "tip of
the iceberg" issue, it's more like a "grain of sand on the beach"
issue. That is, once you've come up with a heuristic to handle this one
problem, you're faced with a *tremendous* number of other issues you've
got to deal with.

Even if you limit yourself to the simplistic code that Rene claims to
write, this is *not* a trivial process.

> However,
> that is neither here nor there, as the intent of the proposed tool is
> merely to speed up the dull, routine, time consuming transliterations,
> thus allowing a human programmer the freedom to see through all of the
> obfuscations which you are trying to introduce into Betov's code.

You're making the assumption that Rene is writing a tool to translate
*his* particular assembly code to other CPUs (which mainly consists of
RosAsm, at this point). I certainly didn't get that impression. Why
waste an *incredible* amount of time writing a tool that translates the
RosAsm system source code to some other CPU when you can manually
convert that code in less time? No, the only thing that makes sense
is to write a *generic* tool that *all* RosAsm users can use. A tool he
can add to the RosAsm bullet list so he can brag about "two clicks CPU
code conversion". And, sadly, a tool that will be just as broken as his
disassembler. The fact that Rene sticks to a simplified subset of the
x86 instruction set that might be easier to translate to a different
CPU doesn't mean that *all* RosAsm users do so. Look at Wannabee's code
posted around here, for example. In many respects it is *far* more
sophisticated than Rene's code (hey, at least he is *attempting* OOP in
RosAsm, even if it's not quite there yet). I can promise you that any
translator Rene might be capable of writing will break on OOP.

>
> :I stick by my assertion that translating the code from
> :scratch will probably be easier than modifying the produced code for
> :all but trivial applications.
>
> Do you consider Betov's code to consist of trivial applications?
> [Hint: This is a trick question.]

I do not particularly consider *Rene's* code to be the benchmark here.
I'm thinking more in terms of the RosAsm user base. Though RosAsm
doesn't have a *lot* of different users, the user base *is* big enough
that you cannot consider the conversion of Rene's code to be sufficient
for a general RosAsm tool.

Let's face it. We could define a small subset of the x86 instruction
set that is easy to translate, and impose all kinds of restrictions
(like you must access data in a certain way and all memory references
have to be through labels, and the like). Then it would be possible to
translate this restricted x86 code to another processor. But by doing
this, you've given up the whole reason for using assembly language in
the first place -- the power of the native CPU's language. If you're
going to place such restrictions on the user, then just use C and
you'll get *much* better results.

>
> :Do you really
> :want to review 250,000 lines of code
>
> I'd rather review 250,000 lines of code, than write them from scratch.

And once you get into it, you'll decide that it's better to write the
code from scratch. The code is *not* going to be pretty (just as bad
as reading disassembled code). And the semantic problems are going to
be subtle. As Alex pointed out, what do you do with ADC when there is
no carry flag on the target processor? Sure, you can emit (a lot of)
code to simulate the carry flag, but do you *really* want to read
through all this code? And without an optimizer to clean up afterwards
(which I can assure you that Rene is not capable of writing), you're
going to have to *completely* emulate all the flags and other semantics
of each instruction. E.g., what are you going to do when you encounter
a JPO or JPE instruction? True, Rene might not use these instructions
in *his* code, but if he distributes this tool as a general purpose
tool, he has to allow for the fact that someone else might want to use
these instructions.
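
To give a feel for what "emulating the flags" means, here is a rough
Python sketch of the bookkeeping a translator would have to emit when
the target has no carry or parity flag. The function names adc32 and
parity_flag are hypothetical; the semantics follow the x86 definitions
(ADC adds the incoming carry, PF is set when the low byte of the result
has an even number of 1 bits).

```python
MASK32 = 0xFFFFFFFF

def adc32(a: int, b: int, carry_in: int):
    """Emulate a 32-bit ADC: return (result, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + (carry_in & 1)
    # The carry is whatever did not fit into 32 bits.
    return total & MASK32, total >> 32

def parity_flag(result: int) -> int:
    """Emulate x86 PF: 1 if the low 8 bits hold an even number of 1 bits."""
    ones = bin(result & 0xFF).count("1")
    return 1 - (ones & 1)

def add64(lo_a: int, hi_a: int, lo_b: int, hi_b: int):
    """A 64-bit add written the way ADD/ADC pairs do it on a 32-bit CPU."""
    lo, carry = adc32(lo_a, lo_b, 0)       # ADD: carry-in is zero
    hi, carry = adc32(hi_a, hi_b, carry)   # ADC: consumes the carry
    return lo, hi, carry
```

JPE would then branch when parity_flag(result) is 1 and JPO when it is
0. Every arithmetic instruction whose flags *might* be read later needs
this kind of shadow computation unless an optimizer can prove otherwise.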

Fortunately, Win32 is protected mode, so you don't have to worry about
things like "IN" and "OUT" emulation, but how on earth are you going to
translate something like "mov eax, dword ptr fs:[0]" to a new
architecture (code that is going to appear in any application that
supports structured exception handling in Windows)? Again, we're
talking grains of sand here. The list goes on and on and on and on and
....

>
> :with an average of 50 bugs (or whatever) per 1,000 instructions
>
> Is this the error rate you typically see from your own product? Or is this
> just part of the 79.25% of all statistics which are made up on the spot?

Made up on the spot. But having written a data flow analyzer for the
6502 (a *much* simpler CPU than the x86, mind you), and having looked
into translating HLA source code to PPC assembly or C a few years back,
I am obviously a bit more aware of the problems than either you or Rene
with respect to this conversion process. The reason you don't see such
a tool for HLA today is because I determined that the result was
impractical. To do it right would result in unacceptably bad
performance, after a *ton* of work. To do a semi-automated tool as you
suggest would result in too many semantic miscues, injecting bugs into
the final result.

BTW, it is interesting to note that a semi-automated tool is about as
useless as an automatic disassembler. A semi-automated tool forces you
to maintain *two* versions of the software after a successful
conversion (including the manual process). That's why HLLs are so
popular -- properly written, you only need to maintain *one* source
file, not "n" source files (one for each target CPU). If you can't do
an automated conversion so you only have to maintain one source file,
the tool won't find much use. This is why, for example, I briefly
considered the HLA->C conversion after deciding that the HLA->PPC
conversion just wouldn't work.

>
> :If inform or your scripting
> :language was anywhere near as semantically complex as 80x86 assembly
> :language, you'd have a point. But it's not.
>
> A bold statement from one who has seen only snippets of my scripting
> language, and probably not much more of the Inform language.

I know nothing of your language, but I have looked at Inform (back when
I was working on the AGE project). But as someone who has *taught*
compiler courses, and as the author of an assembler that does a source
to source conversion as part of the assembly process, let's just say
that I happen to know a *little* bit about this process. Translating
from one language to another is a practical thing to do if the target
language can efficiently represent all the semantics of the source
language. For example, the original C++ compiler emitted C code. And
many VHLL languages emit C (or some other HLL). Consider Flex and
Bison, for example. These translations work because the target
language is semantically capable of (efficiently) representing the
machine abstraction of the source language. I suspect that this is
true for your scripting language vs. Inform (Inform is very capable,
IIRC). I could be wrong, but it's a good guess.

BTW, a semi-automated tool for your purposes is not a bad idea. After
all, I seriously doubt that there is anywhere *near* the amount of work
needed for the convertor as would be needed for the alternate CPU
translator. Further, I don't expect that you're continuing to develop
or maintain your existing scripts and need the ability to maintain only
one file (i.e., once the conversion is done, you can use Inform and
stop using your tool). It's (probably) not like there is an existing
user base of your tool that would insist on automated conversions for
new products. You do the conversion once (even an 80% solution) and
then move on to Inform. Big difference in the usage of these tools.

Of course, the other issue that makes Rene's proposed tool less than
useful is the fact that porting the machine code is only *part* of the
problem. There's also the problem of the OS interface. Of course, he is
discussing only a WinCE port (as best I can tell) and WinCE is
*similar* to Win32, but there is little chance that his code will port
to WinCE. Just as there are semantic differences between x86
instructions and ARM (or whatever) instructions, there are semantic
differences between the API functions in Win32 and WinCE. Some calls
don't exist, some calls are new, and some calls behave differently.
Yes, there is a subset of calls you can make to write portable code,
but do you really think that Rene's existing code (or the existing
RosAsm code base) has stuck to this subset?

Again, if you want code that runs portably on Win32 and on various
WinCE platforms, the only reasonable solution is to get Visual Studio
with the processor pack and the various SDKs and pay *careful*
attention to the OS calls you make.

Trying to make assembly portable just isn't going to fly. If it was,
I'd be well into that project by now (I've had a lot of Macintosh users
ask for a Mac version of HLA over the years; fortunately, given recent
events, I'm glad I ignored those requests and quickly discovered that
x86->PPC wasn't practical).

Cheers,
Randy Hyde

From: randyhyde@earthlink.net on 27 Jan 2006 14:44

Dragontamer wrote:
> Charles A. Crayne wrote:
> > On 26 Jan 2006 16:32:51 -0800
> > "randyhyde(a)earthlink.net" <randyhyde(a)earthlink.net> wrote:
> >
> > :Now suppose that instead of a nice array of table entries, I have a
> > :two-dimensional table of these objects. Or how about a linked list?
> > :Or maybe parallel arrays? Or any other data structure I can dream up.
> >
> > Just for the record, how many instances of such structures do you expect
> > to find in Betov's source code?
>
> Question:
>
> Why hasn't big-endian vs little-endian been brought up yet?

As I've mentioned in other posts, the list of reasons why a conversion
won't work is *nearly* endless. Each little problem you bring up is
like a grain of sand on a beach of problems.

As for Big vs. Little Endian, this isn't an issue with the translator
that Rene is hinting about. He's proposing a translator for WinCE and
WinCE only runs on processors that are little endian. Of course,
endian issues would be a bigger problem on other OSes (this was one of
the main reasons I gave up on an x86->PPC translator for the Mac
several years ago).
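
For readers who haven't hit the endian problem, a small Python sketch
of why it bites translated code on a big-endian target. The helper
names are hypothetical; the point is only that code which stores a
dword and then reads the byte at the same address gets a different byte
depending on the byte order.

```python
def store_dword(value: int, byteorder: str) -> bytes:
    """Lay out a 32-bit value in memory using the given byte order."""
    return (value & 0xFFFFFFFF).to_bytes(4, byteorder)

def load_dword(memory: bytes, byteorder: str) -> int:
    """Read the first four bytes of memory as a 32-bit value."""
    return int.from_bytes(memory[:4], byteorder)

def byte_at_lowest_address(memory: bytes) -> int:
    """What a byte-sized load from the dword's own address returns."""
    return memory[0]

little = store_dword(0x12345678, "little")
big = store_dword(0x12345678, "big")
print(hex(byte_at_lowest_address(little)), hex(byte_at_lowest_address(big)))
```

On the little-endian layout the byte at the lowest address is the low
byte (0x78); on the big-endian layout it is the high byte (0x12). x86
source that relies on the first behavior has to be found and rewritten,
not just transliterated.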

>
> Especially with the onslaught of networked programs today, and even
> programming languages designed *for* the internet?

Especially given the fact that optimizers for portable HLLs are pretty
good today (not as good as an expert assembly programmer, but *much*
better than the code you'll get out of Rene's proposed translator).

>
> Even the *data* itself may have to be converted for the code to execute
> correctly.

Of course. The current example is pointers to code. And given that
RosAsm doesn't have structures and other data structure hints to aid in
the conversion process, this makes the whole thing even more difficult.

>
> Other trivial examples:
> Any self-modifying code, from encrypting/decrypting memory
> for security reasons to compressed code would not translate
> so easily from platform to platform.

We're assuming that we're working with source code here, so this won't
be a problem. Encryption and compression are generally applied to the
binary code after compilation.

Cheers,
Randy Hyde

From: randyhyde@earthlink.net on 27 Jan 2006 14:51

Charles A. Crayne wrote:
> On Fri, 27 Jan 2006 00:15:02 +0000 (UTC)
> Alex McDonald <alex_mcd(a)btopenworld.com> wrote:
>
> :someCodePtr dd $ ; "pointer to self" address
> :someCode ... ; code to execute
> :
> : mov eax, someCodePtr ; fetch the code address
> : add eax, # 4 ; point at someCode
> : jmp eax ; and call it
> :
> :What's the problem with it?
>
> It can be replaced by 'jmp someCodePtr+4'
>
> -- Chuck

No, "jmp someCodePtr+4" would transfer control to the address held in
the dword immediately following someCodePtr. The above code transfers
control to the code address at the location specified by the *sum* of
the dword at someCodePtr and four.

And while you might think that this is bad coding practice, or that
Rene doesn't write code like this, it's quite easy to write a macro
that takes advantage of this scheme to produce (maintainable) code that
doesn't use a jump table (thus sparing you an extra memory access). If
your "case sequences" are the same length, the trick above can be quite
useful (generally, the offset will be greater than four, but the
concept is the same).
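
The distinction can be modeled with a toy memory image in Python. This
is only a sketch under the reading given above (that the one-line form
jumps through the dword stored at someCodePtr+4); the addresses and the
"code bytes" are made up, and case_target is a hypothetical helper
showing the equal-length case-sequence idea.

```python
MASK32 = 0xFFFFFFFF

SOME_CODE_PTR = 0x1000   # hypothetical address of the "someCodePtr dd $" dword
SOME_CODE = 0x1004       # someCode starts right after that dword

# Toy memory image, dword-addressed: address -> 32-bit contents.
memory = {
    SOME_CODE_PTR: SOME_CODE_PTR,  # "dd $" stores the dword's own address
    SOME_CODE: 0xDEAD0000,         # made-up code bytes, read as a dword
}

def computed_jump_target(mem, ptr_addr, delta):
    eax = mem[ptr_addr]            # fetch the dword stored at ptr_addr
    eax = (eax + delta) & MASK32   # add eax, delta
    return eax                     # jmp eax

def memory_indirect_jump_target(mem, ptr_addr, delta):
    return mem[ptr_addr + delta]   # jump through the dword at ptr_addr+delta

def case_target(first_case_addr, case_length, index):
    # With equal-length case bodies, the target is plain arithmetic:
    # no table of addresses has to be read from memory.
    return (first_case_addr + case_length * index) & MASK32
```

In this model the computed jump lands on someCode (0x1004), while the
memory-indirect form lands wherever the first four code bytes happen to
point (0xDEAD0000 here).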

This is the kind of code that really demonstrates why assembly language
is so cool: you can do stuff like this. It falls under Rene's "strategy
optimization" moniker -- we don't need no stinking jump table, so don't
put it in the code.

Cheers,
Randy Hyde

From: randyhyde@earthlink.net on 27 Jan 2006 15:00

Alex McDonald wrote:
>
> You really have lost me here. What on earth does this have to do with
> this part of the thread? What does MASM bashing, along with some
> remarks about my surname that I normally get from three year old
> children, have to do with the code posted?

Very simple. He realizes that he has lost the argument, so he's
changing the subject to deflect attention away from his own mistakes.
Nothing like throwing in a few insults to get you to move away from the
fact that his proposal is a non-starter and that he should have
researched this a little better beforehand.

Cheers,
Randy Hyde

From: Betov on 27 Jan 2006 15:48

"randyhyde(a)earthlink.net" <randyhyde(a)earthlink.net> wrote in
news:1138392037.641543.43420(a)g14g2000cwa.googlegroups.com:

>
> Alex McDonald wrote:
>>
>> You really have lost me here. What on earth does this have to do with
>> this part of the thread? What does MASM bashing, along with some
>> remarks about my surname that I normally get from three year old
>> children, have to do with the code posted?
>
>
> Very simple. He realizes that he has lost the argument, so he's
> changing the subject to deflect attention away from his own mistakes.
> Nothing like throwing in a few insults to get you to move away from the
> fact that his proposal is a non-starter and that he should have
> researched this a little better before hand.

Ah!... We don't have a MASM victim here:

That one would not even be able to understand, even
when explained.

:)

Betov.

< http://rosasm.org >
