From: wolfgang kern on

Yes Beth,

but I never said I'm able to create a 'perfect disassembled' source
(in means of a theoretical ideal), only Randy can't see what and how
I try, so he repeats his 'perfect' claims and annoys everybody with
the endless stating of the halting theory.

I think to have explained my solution in enough detail,
and of course it is impossible to say if a code-path ends endless
if it's fed with indeterminant data.
But a static analyser don't need to care it, so halting doesn't apply.

Even unresolved gaps (rare to happen anyway) may result in db-lines
(and optional commented ASM) then, I consider the disassembling to be
at least correct and complete (just not perfect) as it will produce the
same code if compiled again.

__
wolfgang


From: Herbert Kleebauer on
Beth wrote:
> Herbert Kleebauer wrote:

> No, you're not getting it...

> Can you understand that logic?

No, your "logic" isn't very understandable at the moment. I'm
not a mathematician and never took a closer look at the halting
theorem. So maybe I'm wrong, but I understand the theorem (in
simple words, not in a formal definition) as:

We have an algorithm A which is feed with an input I (note,
the input is given, no randomness, no interrupts) and
we want an other algorithm HALT(A,I) which tells us, whether
A(I) will halt or not. Now the theorem says, we can't find
a HALT() which does this for any (A,I). On the other side,
it is trivial to find a HALT() which works for any algorithm A
which can be implemented on a finite system. So the final
content of the theorem is:

There is at least one algorithm/input pair for which it is
impossible to decide whether it halts or not, but this is an
algorithm which only can be implemented on a system with
an infinite amount of resources.

So, if there is any connection between the halting problem
and disassembling code, then the halting theorem says us:
If you don't try to disassemble a program which ONLY runs
on a system with infinite resources, then there is absolutely
no problem as long as the halting theorem is concerned. So I
really don't understand about what Randy is talking here.


Before we can decide if "perfect disassembly" is possible,
we first have to define what "disassembly" and "perfect
disassembly" is. If disassembly is the process to convert
a binary file x.exe into a text file x.asm, so that you
get an identical x.exe when you assemble x.asm, then it
is trivial to write a disassembler. Just convert each
byte into hex and prefix it with a "dcb" and us a flat
assembler which directly generates the binary output (where
do you see any halting problem here?).

Now, what is perfect disassembly? This surely is a very private
definition (for me a simple hex dump is much more perfect than
source in Intel syntax). If you prefer instruction memnonics
instead of hex bytes, you can replace a byte sequence by it's
equivalent instruction, but only if this is unambiguous.
For example, you can't replace the sequence "88 c2" by
a "mov al,dl" because the assembler maybe generates the
equivalent "8a d0" instead and if the program calculates
the checksum, then it will not run anymore, even if you don't
modify a single line before reassembling the program.
So either you have to use an assembler which allows you
to specify which code to generate or you have to disassemble
it to "dcb 0x88, 0xc2 ; mov al,dl".


If you think a perfect disassembler should generate a source
file, which you can edit (add or remove instructions), reassemble
it and then it still works, then this isn't possible in general.
But this has nothing to do with the halting problem. For example,
if the program calculates the checksum, it will not run after
modification if you not recalculate the new checksum.

Or look at the following code:

00000000: 31 c0 init: eor.l r0,r0
:
:
00000066: e9 00000066 br.l abc
var1=@-4
0000006b: 01 c2 add.l r0,r1
:
:
000000d1: 45 abc: inc.l r4
:
:
main:
00000136: a1 00000067 move.l var1,r0


The init code (which is only executed once at the start of
the program) is later in the main program used for variable
storage. The variable var1 is (because of the branch offset)
initialized with 0x66. If you disassemble the program and
insert additional instruction before the abc: label, then the
branch offset will change and therefore the initial value for
var1, which maybe will crash the program. Again this has nothing
to do with the ability to decide whether a memory location is
used as instruction or as data. To know that it is used both
ways, as instruction and data (as in the above example) isn't
better than not to know whether it is used as instruction or
data. You can find many more examples (e.g. self modifying code)
which makes it impossible generate a modifiable disassembling,
but all this has nothing to do with the halting problem.

For me, the purpose of disassembling is, to convert the binary
in a human readable form. And the perfect disassembler is the
one, which generates a format which makes it most easy FOR ME
to understand the code. And when you did understand the code
(which isn't easy without any documentation and useful label
names), then most of the work is already done. Rewriting
(and documenting) the code is much less work, so it really
isn't so important that you get a disassembled code which
you can directly modify and reassemble.
From: randyhyde@earthlink.net on

Herbert Kleebauer wrote:
> Beth wrote:
> > Herbert Kleebauer wrote:
>
> > No, you're not getting it...
>
> > Can you understand that logic?
>
> No, your "logic" isn't very understandable at the moment. I'm
> not a mathematician and never took a closer look at the halting
> theorem. So maybe I'm wrong, but I understand the theorem (in
> simple words, not in a formal definition) as:
>
> We have an algorithm A which is feed with an input I (note,
> the input is given, no randomness, no interrupts) and
> we want an other algorithm HALT(A,I) which tells us, whether
> A(I) will halt or not. Now the theorem says, we can't find
> a HALT() which does this for any (A,I).

Actually, it says for *all* (A, I).
And, btw, input is *exactly* that. If the input is supplied to the
program, it can't "randomly" change. Interrupts are just another form
of input to the system, btw (the input would be the time and location
where the interrupt event occurs, and what happens when the input
occurs).


> On the other side,
> it is trivial to find a HALT() which works for any algorithm A
> which can be implemented on a finite system. So the final
> content of the theorem is:

Technically, it is trivial to determine whether any algorithm A halts.
By definition, all "algorithms" halt. An "algorithm" is a sequence of
steps that executes and guarantees to halt, producing an answer. A
"procedure" is a sequence of steps that *may* produce an answer, but is
not guaranteed to halt. Therefore, the halting problem basically is the
process of differentiating algorithms from procedures.

Note, btw, that the set of all algorithms are a subset of the
procedures. That is, there is no requirement that a procedure run
forever. It may certainly halt and produce a result, though there is no
guarantee that it will if it is not also an algorithm.


>
> There is at least one algorithm/input pair for which it is
> impossible to decide whether it halts or not,

Well, substitute "procedure" for "algorithm" and this is correct.

> but this is an
> algorithm which only can be implemented on a system with
> an infinite amount of resources.

Wrong.
The amount of resources has nothing to do with it.
All algorithms (that is, that are guaranteed to halt and produce an
answer) use a finite number of resources. The proof of this is quite
simple. Accessing a "resource" takes at least one computational step
(that is, a finite amount of time). If a program attempted to access an
infinite number of resources it would never halt.

The infiniteness of a Turing machine isn't so a given program can
access an infinite number of resources. All programs that run on a
Turing machine are finite. The reason a Turing machine is infinite is
so that *any* program can run on it. For if a Turing machine were
finite, then there would be *some* programs (indeed, an infinite number
of them) that could not execute on a Turing machine.

I think this is a fundamental misconception that many people around
here have. The fact that any given program is finite does *not* imply
that we can determine whether it halts. Both the implementation of the
halting program and its input must be finite, by definition. Therefore,
we *can* implement the halting program (and its input) on a finite
machine (e.g., like the x86). But the proof that the halting program
is undecidable tells us that there is no guarantee that the halting
program will halt and tell us whether its input program will halt.

BTW, it's worth emphasizing that it's not impossible to write a program
that analyzes input programs to determine whether they halt. The
problem is guaranteeing that the halting program always halts and
produces the answer. That is, it's possible to write the halting
program if you accept the fact that it may not halt for certain input
programs (i.e., it's a procedure, not an algorithm).



>
> So, if there is any connection between the halting problem
> and disassembling code, then the halting theorem says us:
> If you don't try to disassemble a program which ONLY runs
> on a system with infinite resources, then there is absolutely
> no problem as long as the halting theorem is concerned. So I
> really don't understand about what Randy is talking here.

Your understanding is incorrect.
The halting program (or a disassembler) will use a finite number of
resources. The input program uses a finite number of resources. Any
program that uses an infinite number of resources will never halt (even
if such resources were available).

The disassembler problem (specifically, differentiating code and data)
is not guaranteed to stop and proceduce an answer, even though the
code/data differentiation program and its input require only a finite
number of resources.

I think that Chuck Crayne has confused a lot of people by attempting to
claim that these theorems only apply to infinite machines. The
misconception he has created is unfortunate. I should have caught on to
how this knowledge was being misused earlier.


>
>
> Before we can decide if "perfect disassembly" is possible,
> we first have to define what "disassembly" and "perfect
> disassembly" is. If disassembly is the process to convert
> a binary file x.exe into a text file x.asm, so that you
> get an identical x.exe when you assemble x.asm, then it
> is trivial to write a disassembler. Just convert each
> byte into hex and prefix it with a "dcb" and us a flat
> assembler which directly generates the binary output (where
> do you see any halting problem here?).

I've described my concept of "perfect disassembly" elsewhere.
See that post.

As for your example above, you're not differentiating code and data, so
"perfect disassembly" using this definition is perfectly possible. Of
course, the result is hardly useful. And if you attempt to assemble the
code to run at a different address than it was originally assembled
for, it will not run properly.


>
> Now, what is perfect disassembly? This surely is a very private
> definition (for me a simple hex dump is much more perfect than
> source in Intel syntax). If you prefer instruction memnonics
> instead of hex bytes, you can replace a byte sequence by it's
> equivalent instruction, but only if this is unambiguous.
> For example, you can't replace the sequence "88 c2" by
> a "mov al,dl" because the assembler maybe generates the
> equivalent "8a d0" instead and if the program calculates
> the checksum, then it will not run anymore, even if you don't
> modify a single line before reassembling the program.
> So either you have to use an assembler which allows you
> to specify which code to generate or you have to disassemble
> it to "dcb 0x88, 0xc2 ; mov al,dl".


more importantly, if you have "mov eax, i" where i is some direct
address in memory, and you reassemble the code with i falling at a
different address, the instruction must be properly assembled at the
new address.

But more basically, perfect disassembly implies that every instruction
that was an instruction in the original code is an instruction in the
disassembled code; and every data item that was a data item in the
original code is a data item in the disassembled source. Such a
disassembler cannot be written that automatically does this, and does
this for all inputs (halting, of course).

>
>
> If you think a perfect disassembler should generate a source
> file, which you can edit (add or remove instructions), reassemble
> it and then it still works, then this isn't possible in general.
> But this has nothing to do with the halting problem.

Technically, it has to do with the data/code differentiation problem.
But that problem is reducible to the halting problem, so we can argue
it *does* have something to do with the halting problem.

>
> For me, the purpose of disassembling is, to convert the binary
> in a human readable form. And the perfect disassembler is the
> one, which generates a format which makes it most easy FOR ME
> to understand the code.

If you redefine disassembly to relax the requirement that it properly
differentiate code and data (and other undecidable problems like
differentiating types of data), then it may very well be possible to
write a perfect disassembler, by your definition.
But if you want the disassembler to correctly differentiate code and
data for all input programs, halt, and produce a usable source file,
this is not possible.


Cheers,
Randy Hyde

From: Guga on
"The disassembler problem (specifically, differentiating code and data)
is not guaranteed to stop and proceduce an answer, even though the
code/data differentiation program and its input require only a finite
number of resources. "
(...)But if you want the disassembler to correctly differentiate code
and data for all input programs, halt, and produce a usable source
file, this is not possible. "

Ok...but then, how the processor identifies what is code and what is
data during loading the file ? If as you say, only humans can interpret
what is data and what is code (using the examples of the halting
problem), how the processor never halts when finding those specific
situations ?

If the program works perfectly and yet it contains the "undeciable"
code/data diferenciatinos, how it never hangs ?

I mean, if the processor itself have no problems identifying this, why
an disassembler (Assuming it to reproduce the same steps as what the
processor should do) is not possible ?

The processor only distinguishs when the app is running ? If tis that
it, then we can assume that when a program don´t run, the internal
data can be anything...i mean, it don´t matter at all, because the
processor didn´t actually "saw" the internal contents (dat) yet.

And, if the distinguish can be retrieved only when the app is running,
shouldn´t a debugger mixed with an disassembler could solve this kind
of problem ?


Best Regards,

Guga

From: Guga on
One thing only

Maybe i´m interpreting it wrong, but....On the loop example (Not
necessarily the differenciacion of code/data), the main program _if_
access that endles loop will "stop", i mean....not stop properly said,
but it will keep endless onm that loop. Like this:

mov eax
push Data001
L0:
jmp L0<
xor eax eax
mov eax 01


On the above example, when the program access for the 1st time the L0
locatino, it will enter onan endless loop only, and won´t run any
longer outside that loop.

(Not sure if this is what you consider "halt")...I mean, if "halt"
means hang, break, crash....then the program is not halting. It is
merelly going on a loop.

Then, since the rest of the data is unacessed, the processor "see" it
as a bunch of DB values, nothing more then it.

So, the proper disassemblement for that can display what is happening
like:

mov eax
push Data001
L0:
jmp L0<
DB 033 0C0 ; xor eax eax
DB 0B8 01 0 0 0 ; mov eax 01


So, on this specific situation, the program "ends" on the Loop when the
loop is accessed, and the processor don´t even identifie the
"unacessed". No matter if on that circunstancies it is only "Data"

So, even an disassembler that identifies that endless loop, and
properly react as the processor will react (Not sure about the term...i
hope you understand what i mean) identifying it as DB the unacessed
data, then this is the "perfect" disassemblement..(Or better saying ..
the proper solution for that unacessed data)

However, you can force the disassembler to identify those endless loops
as false program endings, and bypass that loop to access the "unacessed
data" a few bytes after the loop. But this is only if you want to
actually "see" what was the unacessed data is all about, because the
program itself won´t use them. So disassembling the unacessed data (On
this circunstancies) is only "cosmetic", because it will never be
executed.

Of course, that all the above i mean the cases where the DB data is
really not accessed on any part of the program. If they are accessed
indirectly, then the data must be interpreteded for a proper
disassemblement.

So, we have 02 situations right ? One is an endless loop that may
represent an program end, if the "unaccessed" data is really not being
used, and the other is the Halting problem, that states for dificult
(imposibility) of interpreting a specific chunck of bytes as code or
data.

For the halting problem, i´m not comment much further, because i´m
still reaearching on it, trying to figure it out all of what is being
said so far. But it seems to me that, if the processor when accessing
some chinck of bytes _can_ distinguish between code and data properly,
i don´t see any reason why a disassembler or decompiler can´t, except
the limitations of the technology or techniques envolved on the
disassembler itself.

Otherwise, if a specifc chunck of bytes can be interpreted as code or
data at the same time, the procesor will also idsentify this dubial
situation, and the program will probably crash. What don´t seems to
happen very often.

On VB files for instace, we have a totally weird way of accessing teh
Data. It have a small chunck of real code bytes, and a LOT of Data
mixed in the Code sections (as well as in the data sectino). Since this
is only Data (In fact, structures inside teh code section), the
processor knows that they are Data and not code. The only problem i see
is figure it out, how it "see" the difference properly and in 100% pof
the cases (100% because this is how an VB app is made internally, and
all of them runs on that way without crashing due to those problems).

Are the above assumptions correct ?

If not, why then ?


Best Regards,

Guga