Two Click disassembly/reassembly [ASM]

From: Betov on 26 Jan 2006 05:08

"Charles A. Crayne" <ccrayne(a)crayne.org> wrote in
news:20060125222022.7c8bcbe7(a)heimdall.crayne.org:

> Perhaps, or perhaps he is talking about converting x86 assembly
> language code to source code for other processors.

Yes.

> In either case, his approach is probably the one which requires the
> least human interaction to accomplish the above goal.

This is what I suppose.

> However, since
> the difficulty of the task depends upon the similarity, or lack
> thereof, between the source and target architectures, and since there
> has not yet been any agreement on what the target architecture might
> be, it is easy, albeit unproductive, to postulate theoretical
> difficulties which may not be a significant consideration in real
> world implementations.

... and which are no problem, with the RosAsm Encoder
Architecture, for this example, as the References are
fixed last.

> If, for example, the address size of the target architecture is not
> four bytes, then a jump table invocation such as, 'jmp
> [sometable+4*eax]' requires that both the code statement, and the
> elements of the table be altered.

Yes, of course. If the Addresses are not four Bytes,
the port of a Table of Labels would fail. But, on
one hand, this is found way less in Assembly Sources
than in C / C++ Disassemblies... then, on the other
hand, replacing a couple of "4"s, say, by a couple of
"8"s (I think I have read somewhere that one of
the "Alien Processors" works that way... or whatever)
... would be _way_ less work than writing all of the
port entirely by hand.
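
For the simple case discussed here, the mechanical rewrite might look like the sketch below. It is only an illustration in Python, not RosAsm's encoder: the NASM-like `dd`/`dq` syntax, the helper name `widen_jump_table`, and the labels `case0` to `case2` are assumptions made for the example, and it only handles a table referenced by name with an explicit "4*" scale.

```python
import re

OLD_SCALE = "4"   # address size, in bytes, on the x86 source
NEW_SCALE = "8"   # assumed address size on the hypothetical target


def widen_jump_table(lines, table):
    """Rewrite the scale factor and the data directive for one named table."""
    name = re.escape(table)
    # Matches "[sometable+4*" so that only the scale factor is replaced.
    scaled = re.compile(r"(\[\s*" + name + r"\s*\+\s*)" + OLD_SCALE + r"(\s*\*)")
    table_def = re.compile(r"\s*" + name + r"\s+dd\b")
    out = []
    for line in lines:
        if table_def.match(line):
            # The table elements themselves must grow from dwords to qwords.
            line = re.sub(r"\bdd\b", "dq", line, count=1)
        else:
            line = scaled.sub(lambda m: m.group(1) + NEW_SCALE + m.group(2), line)
        out.append(line)
    return out


source = [
    "jmp [sometable+4*eax]",
    "sometable dd case0, case1, case2",
    "add ebx, 4",   # an unrelated 4: not a table scale, so it is left alone
]
converted = widen_jump_table(source, "sometable")
```

After the call, `converted` holds the `8*` form of the jump and a `dq` table, and the unrelated `add ebx, 4` is untouched.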

> Some of these special cases can be handled automatically by the tool,
> and others will have to be cleaned up by a human. However, I have yet
> to see any arguments which reasonably suggest that the proposed tool
> would not be a useful one.

Yes. At least that would be a great help in porting
automatically everything that can be ported automatically,
and, quite frankly, when we take a look at what is really
found inside most executables, making a bet on a direct
re-run does not seem to me completely stupid.

:)

Betov.

< http://rosasm.org >

From: randyhyde@earthlink.net on 26 Jan 2006 15:09

Charles A. Crayne wrote:
> On 25 Jan 2006 14:37:39 -0800
> "randyhyde(a)earthlink.net" <randyhyde(a)earthlink.net> wrote:
>
> :However, Rene is talking about
> :converting x86 assembly language code to machine code for other
> :processors.
>
> Perhaps, or perhaps he is talking about converting x86 assembly language
> code to source code for other processors.

Same problem. Doesn't matter if it's a source to source or source to
binary conversion. The problem is still *very* difficult. This was
researched quite a bit in the 1980s when people like Intel wanted to
migrate a lot of x86 code to other processors (e.g., the IA64). You can
see the results of that research today. Forgive me for believing that
if this were easy enough for someone like Rene to handle, the top minds
in the industry would have solved the problem a decade ago.

>
> In either case, his approach is probably the one which requires the
> least human interaction to accomplish the above goal. However, since the
> difficulty of the task depends upon the similarity, or lack thereof,
> between the source and target architectures, and since there has not yet
> been any agreement on what the target architecture might be, it is easy,
> albeit unproductive, to postulate theoretical difficulties which may not be
> a significant consideration in real world implementations.

Well, he *has* mentioned ARM and PocketPC (which includes MIPS and SH8,
among others). Doesn't matter, though. The example I gave
demonstrates the problems you're going to have if there are *any*
differences in opcode size, among other things.

>
> :mov eax, someCodePtr
> :add eax, 4
> :jmp eax
>
> Leaving aside the fact that this pseudo-example is bad coding practice,

Perhaps. But if you don't think things like this through, you wind up
doing things like spending two or three years writing a disassembler
that breaks whenever you insert an innocuous NOP into the disassembled
result.

Bottom line is that if you start coding before you *carefully* think
the problem through, you wind up wasting a good part of your life
writing code that will *never* work for anything other than carefully
crafted demos. Rene has proven this with his disassembler, and he's about
to make the same mistake with code conversion.

> and may never occur in the source which Betov proposes to migrate,

Yeah, yeah.
He proposed writing a disassembler that converted library code to
RosAsm, and punted on handling problematic issues. Of course, it broke
when fed a few simple routines from the HLA standard library. No
offense, but I'm not expecting anything more from his "code conversion"
encoder.

> it
> does illustrate a more general issue, which needs to be considered.
> For obvious reasons, labels in the x86 source are highly unlikely to
> resolve to the same addresses as the corresponding labels do in the target
> source. Therefore, if one is going to even approximate line by line
> translation, ALL target addresses must be symbols, so that they can be
> resolved by the target assembler.
>
> If, for example, the address size of the target architecture is not four
> bytes, then a jump table invocation such as, 'jmp [sometable+4*eax]'
> requires that both the code statement, and the elements of the table be
> altered.

What happens when they didn't use a scaled index addressing mode?
Perhaps they've used induction across a loop and they automatically add
4 to EAX on each iteration of that loop. How is the converter going to
figure this out? (Hint: this is an undecidable problem; we're back to
the halting problem again).
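
To illustrate the difference (a sketch only; the regular expression, the function name `has_visible_scale`, and the sample loop are assumptions, not anyone's actual converter), a purely textual check finds the stride in the scaled-index form but sees nothing special in the loop form, where the 4 is just an operand of an ordinary `add`:

```python
import re

# An explicit scale of 4 inside a memory operand: [base+4*reg] or [base+reg*4].
SCALED_INDEX = re.compile(r"\[[^\]]*\b(?:4\*\w+|\w+\*4)\b[^\]]*\]")


def has_visible_scale(line):
    """True when the 4-byte stride appears textually in a memory operand."""
    return bool(SCALED_INDEX.search(line.replace(" ", "")))


scaled_form = ["jmp [sometable+4*eax]"]

# Same table walk, but the stride is computed by the program itself.
loop_form = [
    "xor eax, eax",
    "again:",
    "call [sometable+eax]",
    "add eax, 4",        # looks like any other addition of a constant
    "cmp eax, 16",
    "jb again",
]

visible_in_scaled = [line for line in scaled_form if has_visible_scale(line)]
visible_in_loop = [line for line in loop_form if has_visible_scale(line)]
```

Here `visible_in_scaled` contains the `jmp` line while `visible_in_loop` is empty, so a converter working at this level has nothing to tell it that the `add eax, 4` depends on the address size.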

>
> Some of these special cases can be handled automatically by the tool, and
> others will have to be cleaned up by a human.

Which makes such a tool almost worthless, considering the size of modern
applications (even those written in assembly). Except for trivial demo
apps, it's less work to do the conversion by hand when it's all said
and done.

> However, I have yet to see
> any arguments which reasonably suggest that the proposed tool would not be
> a useful one.

Well, you and Rene can feel free to spend all your free time working on
a tool. But given the usefulness of such a tool (plus the fact that
folks like Intel and Motorola have sunk a *lot* of money into
researching this problem), I have to agree with Alex when he says "if
it was that easy, it would have been done already." Think about it a
moment. Do you think that Rene is the *first* person to come up with
this idea? Heck, I was contemplating an HLA->C converter back in 2001,
but ultimately gave up on the idea because the result would have
produced code that was *way* too slow.

To give you a bit of a clue, it's not that this type of conversion is
impossible. It's just that the resulting code is *soooo* big and
*soooo* slow it's not practical. The solutions I've seen pull the JIT
trick of keeping the original object code around and doing emulation on
things that it couldn't compile properly. Things like labels are
handled by lookup tables at run time (i.e., when you jump to an
indirect address, you look up the address in a lookup table to get the
target address on the new architecture). All this adds up to an
incredibly slow result. Considering that the target CPUs Rene has
mentioned are all *much* slower than a contemporary x86, this just
isn't a practical thing to do.
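
A rough sketch of the run-time lookup described above, in Python for brevity. The addresses, the table contents, and the function names are all made up for illustration; a real translator would build the table while converting the code.

```python
from typing import Optional

# Hypothetical addresses: where each x86 label lived in the original image,
# and where the translated code for that label landed on the new target.
ADDRESS_MAP = {
    0x00401000: 0x00010000,
    0x00401004: 0x00010010,
    0x00401020: 0x00010058,
}


def resolve_indirect_jump(x86_address: int) -> Optional[int]:
    """Map an x86 code address to its translated address at run time.

    Returns None when the address has no translated entry point.
    """
    return ADDRESS_MAP.get(x86_address)


def run_indirect_jump(x86_address: int) -> str:
    """Describe what the translated program would do for this jump."""
    target = resolve_indirect_jump(x86_address)
    if target is None:
        # No entry: fall back to emulating the original object code.
        return "emulate original code at 0x%08X" % x86_address
    return "jump to translated code at 0x%08X" % target
```

Every indirect jump or call in the translated program pays for this lookup, and any computed address that was not registered as an entry point (for example `run_indirect_jump(0x00401002)`) drops into the emulation path.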

There is no doubt that Rene can "macro-ize" x86 instructions on other
architectures. But this just *won't* produce working software except
for some trivial demo apps. However, if you think it can be done, feel
free to join the RosAsm team and help him tilt at a few windmills.
Cheers,
Randy Hyde

From: randyhyde@earthlink.net on 26 Jan 2006 15:19

Betov wrote:
> "Charles A. Crayne" <ccrayne(a)crayne.org> wrote in
> news:20060125222022.7c8bcbe7(a)heimdall.crayne.org:
>
>
> > However, since
> > the difficulty of the task depends upon the similarity, or lack
> > thereof, between the source and target architectures, and since there
> > has not yet been any agreement on what the target architecture might
> > be, it is easy, albeit unproductive, to postulate theoretical
> > difficulties which may not be a significant consideration in real
> > world implementations.
>
> ... and which are no problem, with the RosAsm Encoder
> Architecture, for this example, as the References are
> fixed last.

You really don't understand the scope of the problem, do you?

>
>
> > If, for example, the address size of the target architecture is not
> > four bytes, then a jump table invocation such as, 'jmp
> > [sometable+4*eax]' requires that both the code statement, and the
> > elements of the table be altered.
>
> Yes, of course. If the Addresses are not four Bytes,
> the port of a Table of Labels would fail. But, on
> one hand, this is found way less in Assembly Sources
> than in C / C++ Disassemblies...

Maybe the way *you* write code, stuff like this doesn't appear very
often. I can assure you that *real* assembly language programmers use
stuff like this all the time. And we're not just talking about jump
tables here, but tables of *any* data.

And have you even considered the fact that most processors don't allow
access to unaligned memory locations? Or that many target processors
don't support byte-addressable memory?

And as Alex has pointed out, have you considered the fact that most RISC
processors don't have the same notion of "condition codes" as the x86?
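
To give an idea of what this means on a target with no flags register (MIPS, for instance), here is a sketch of the work a converted 32-bit ADD would have to do if later code tests the flags. It is written in Python purely as an illustration; the function name is an assumption, and only four of the six flags that ADD affects are modeled (PF and AF are left out).

```python
MASK32 = 0xFFFFFFFF


def add32_with_flags(a, b):
    """Return (result, flags) for a 32-bit ADD with the flags computed by hand."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "CF": full > MASK32,          # unsigned carry out of bit 31
        "ZF": result == 0,            # result is zero
        "SF": bool(result >> 31),     # copy of the result's sign bit
        # Signed overflow: both operands share a sign that the result lacks.
        "OF": bool(((~(a ^ b) & (a ^ result)) >> 31) & 1),
    }
    return result, flags


wrapped, wrapped_flags = add32_with_flags(0xFFFFFFFF, 1)
overflowed, overflowed_flags = add32_with_flags(0x7FFFFFFF, 1)
```

A single x86 instruction turns into the addition plus several extra operations, unless the converter can prove that nothing reads the flags before they are overwritten.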

> then, on the other
> hand, replacing a couple of "4"s, say, by a couple of
> "8"s (I think I have read somewhere that one of
> the "Alien Processors" works that way... or whatever)
> ... would be _way_ less work than writing all of the
> port entirely by hand.

What happens when the "*4" component is computed by the program rather
than part of the addressing mode? How will your "encoder" figure this
out?

>
>
> > Some of these special cases can be handled automatically by the tool,
> > and others will have to be cleaned up by a human. However, I have yet
> > to see any arguments which reasonably suggest that the proposed tool
> > would not be a useful one.
>
> Yes. At least that would be a great help in porting
> automatically everything that can be ported automatically,
> and, quite frankly, when we take a look at what is really
> found inside most executables, making a bet on a direct
> re-run does not seem to me completely stupid.

Only because you've never studied enough Computer Science to realize
the magnitude of the problem you're attempting. It's like your
disassembler. You get some crazy idea that you know how to do something
so much better than everyone who came before you (and you probably
don't realize that this problem has been attempted *many* times in the
past, by people *much* smarter than you), and you jump in without
realizing the futility of what you're trying. Oh well, waste lots of
time on it. I'm sure you'll come up with yet another great demo like
your automatic disassembler that works for some spoon-fed apps, but
breaks on anything real-world-ish. Too bad your assembler users don't
get to benefit from the real work you could have done on your
*assembler* while you are wasting time writing yet another demo
program.

Computer Science is a formal science for exactly this reason-- so
people could determine which things are impossible or impractical
before they waste a good chunk of their time on the problem. What you
are attempting to do is impractical. People have proven that already.
But go ahead and waste your time on it. It is your time, after all.
Cheers,
Randy Hyde

From: Charles A. Crayne on 26 Jan 2006 18:58

On 26 Jan 2006 12:09:26 -0800
"randyhyde(a)earthlink.net" <randyhyde(a)earthlink.net> wrote:

:What happens when they didn't use a scaled index addressing mode?
:Perhaps they've used induction across a loop and they automatically add
:4 to EAX on each iteration of that loop.

And just how is this straw man any more formidable than his predecessor,
which I swept away with ease? However, since you obviously didn't
understand, let me go through it again, in more detail.

The tool flags any statement which calls, or jumps to, an address which is
not an unmodified label. The programmer can either add a new label to
the x86 source and change the call/jump target address accordingly [as in
your previous example], or change the target address calculation where
it occurs in the loop [as in your example above] and flag the call/jmp
statement as having been approved by the programmer.
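
A minimal sketch of such a flagging pass, in Python. The "approved" comment marker, the function name, and the sample source are assumptions made for the illustration; it only looks at `jmp` and `call`, and treats a target as safe when it is exactly one of the labels defined in the source.

```python
import re

BRANCH = re.compile(r"^\s*(jmp|call)\s+(.+?)\s*(?:;(.*))?$", re.IGNORECASE)
LABEL_DEF = re.compile(r"^\s*([A-Za-z_]\w*):")
APPROVED = "approved"   # hypothetical marker the programmer adds in a comment


def flag_suspect_branches(lines):
    """Return (line number, text) for jumps/calls to anything but a plain label."""
    labels = set()
    for line in lines:
        match = LABEL_DEF.match(line)
        if match:
            labels.add(match.group(1))
    flagged = []
    for number, line in enumerate(lines, start=1):
        match = BRANCH.match(line)
        if not match:
            continue
        target = match.group(2)
        comment = match.group(3) or ""
        if target in labels or APPROVED in comment.lower():
            continue   # unmodified label, or already signed off by a human
        flagged.append((number, line.strip()))
    return flagged


source = [
    "start:",
    "    mov eax, someCodePtr",
    "    add eax, 4",
    "    jmp eax",
    "    call done",
    "    jmp [sometable+4*eax] ; approved: table rewritten by hand",
    "done:",
    "    ret",
]
suspects = flag_suspect_branches(source)
```

With this input only the `jmp eax` on line 4 is reported; the call to the plain label `done` passes, and the table jump is skipped because the programmer has marked it as approved.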

:Except for trivial demo
:apps, it's less work to do the conversion by hand when it's all said
:and done.

It is difficult to believe that you typed this with a straight face, as it
should be obvious to the casual observer that the larger the body of code
to be converted, the more valuable the tool.

:Well, you and Rene can feel free to spend all your free time working on
:a tool.

Unlike Betov, I do not have a large body of RosAsm source code which I wish
to convert to another architecture, and therefore will not be working on
that tool. However I do have a very similar situation, for which I am
currently writing a conversion tool.

As you may know, some years ago, I wrote a text adventure game engine, for
which my wife wrote a number of game scripts. More recently, I ported the
engine from DOS to Linux, but in an attempt to reach users on other
hardware platforms, I have decided to port the scripts to the Inform
compiler.

Most of the time I have spent on this project has been devoted to learning
the Inform way of doing things, which I still have not completely
mastered. However, the 16 hours, or so, which I have spent writing the
tool have already saved me at least 10 times that investment, as compared
to hand conversion.

-- Chuck

From: Alex McDonald on 26 Jan 2006 19:15

Charles A. Crayne wrote:

>
> :mov eax, someCodePtr
> :add eax, 4
> :jmp eax
>
> Leaving aside the fact that this pseudo-example is bad coding practice,

The syntax may be slightly ambiguous, so permit me to clean it up;

someCodePtr dd $ ; "pointer to self" address
someCode ... ; code to execute

mov eax, someCodePtr ; fetch the code address
add eax, # 4 ; point at someCode
jmp eax ; and call it

What's the problem with it?

--
Regards
Alex McDonald
