From: wolfgang kern on 3 Oct 2005 06:36 Yes Beth, but I never said I'm able to create a 'perfect disassembled' source (in means of a theoretical ideal), only Randy can't see what and how I try, so he repeats his 'perfect' claims and annoys everybody with the endless stating of the halting theory. I think to have explained my solution in enough detail, and of course it is impossible to say if a code-path ends endless if it's fed with indeterminant data. But a static analyser don't need to care it, so halting doesn't apply. Even unresolved gaps (rare to happen anyway) may result in db-lines (and optional commented ASM) then, I consider the disassembling to be at least correct and complete (just not perfect) as it will produce the same code if compiled again. __ wolfgang
From: Herbert Kleebauer on 3 Oct 2005 07:37 Beth wrote: > Herbert Kleebauer wrote: > No, you're not getting it... > Can you understand that logic? No, your "logic" isn't very understandable at the moment. I'm not a mathematician and never took a closer look at the halting theorem. So maybe I'm wrong, but I understand the theorem (in simple words, not in a formal definition) as: We have an algorithm A which is feed with an input I (note, the input is given, no randomness, no interrupts) and we want an other algorithm HALT(A,I) which tells us, whether A(I) will halt or not. Now the theorem says, we can't find a HALT() which does this for any (A,I). On the other side, it is trivial to find a HALT() which works for any algorithm A which can be implemented on a finite system. So the final content of the theorem is: There is at least one algorithm/input pair for which it is impossible to decide whether it halts or not, but this is an algorithm which only can be implemented on a system with an infinite amount of resources. So, if there is any connection between the halting problem and disassembling code, then the halting theorem says us: If you don't try to disassemble a program which ONLY runs on a system with infinite resources, then there is absolutely no problem as long as the halting theorem is concerned. So I really don't understand about what Randy is talking here. Before we can decide if "perfect disassembly" is possible, we first have to define what "disassembly" and "perfect disassembly" is. If disassembly is the process to convert a binary file x.exe into a text file x.asm, so that you get an identical x.exe when you assemble x.asm, then it is trivial to write a disassembler. Just convert each byte into hex and prefix it with a "dcb" and us a flat assembler which directly generates the binary output (where do you see any halting problem here?). Now, what is perfect disassembly? This surely is a very private definition (for me a simple hex dump is much more perfect than source in Intel syntax). If you prefer instruction memnonics instead of hex bytes, you can replace a byte sequence by it's equivalent instruction, but only if this is unambiguous. For example, you can't replace the sequence "88 c2" by a "mov al,dl" because the assembler maybe generates the equivalent "8a d0" instead and if the program calculates the checksum, then it will not run anymore, even if you don't modify a single line before reassembling the program. So either you have to use an assembler which allows you to specify which code to generate or you have to disassemble it to "dcb 0x88, 0xc2 ; mov al,dl". If you think a perfect disassembler should generate a source file, which you can edit (add or remove instructions), reassemble it and then it still works, then this isn't possible in general. But this has nothing to do with the halting problem. For example, if the program calculates the checksum, it will not run after modification if you not recalculate the new checksum. Or look at the following code: 00000000: 31 c0 init: eor.l r0,r0 : : 00000066: e9 00000066 br.l abc var1=@-4 0000006b: 01 c2 add.l r0,r1 : : 000000d1: 45 abc: inc.l r4 : : main: 00000136: a1 00000067 move.l var1,r0 The init code (which is only executed once at the start of the program) is later in the main program used for variable storage. The variable var1 is (because of the branch offset) initialized with 0x66. If you disassemble the program and insert additional instruction before the abc: label, then the branch offset will change and therefore the initial value for var1, which maybe will crash the program. Again this has nothing to do with the ability to decide whether a memory location is used as instruction or as data. To know that it is used both ways, as instruction and data (as in the above example) isn't better than not to know whether it is used as instruction or data. You can find many more examples (e.g. self modifying code) which makes it impossible generate a modifiable disassembling, but all this has nothing to do with the halting problem. For me, the purpose of disassembling is, to convert the binary in a human readable form. And the perfect disassembler is the one, which generates a format which makes it most easy FOR ME to understand the code. And when you did understand the code (which isn't easy without any documentation and useful label names), then most of the work is already done. Rewriting (and documenting) the code is much less work, so it really isn't so important that you get a disassembled code which you can directly modify and reassemble.
From: randyhyde@earthlink.net on 3 Oct 2005 11:48 Herbert Kleebauer wrote: > Beth wrote: > > Herbert Kleebauer wrote: > > > No, you're not getting it... > > > Can you understand that logic? > > No, your "logic" isn't very understandable at the moment. I'm > not a mathematician and never took a closer look at the halting > theorem. So maybe I'm wrong, but I understand the theorem (in > simple words, not in a formal definition) as: > > We have an algorithm A which is feed with an input I (note, > the input is given, no randomness, no interrupts) and > we want an other algorithm HALT(A,I) which tells us, whether > A(I) will halt or not. Now the theorem says, we can't find > a HALT() which does this for any (A,I). Actually, it says for *all* (A, I). And, btw, input is *exactly* that. If the input is supplied to the program, it can't "randomly" change. Interrupts are just another form of input to the system, btw (the input would be the time and location where the interrupt event occurs, and what happens when the input occurs). > On the other side, > it is trivial to find a HALT() which works for any algorithm A > which can be implemented on a finite system. So the final > content of the theorem is: Technically, it is trivial to determine whether any algorithm A halts. By definition, all "algorithms" halt. An "algorithm" is a sequence of steps that executes and guarantees to halt, producing an answer. A "procedure" is a sequence of steps that *may* produce an answer, but is not guaranteed to halt. Therefore, the halting problem basically is the process of differentiating algorithms from procedures. Note, btw, that the set of all algorithms are a subset of the procedures. That is, there is no requirement that a procedure run forever. It may certainly halt and produce a result, though there is no guarantee that it will if it is not also an algorithm. > > There is at least one algorithm/input pair for which it is > impossible to decide whether it halts or not, Well, substitute "procedure" for "algorithm" and this is correct. > but this is an > algorithm which only can be implemented on a system with > an infinite amount of resources. Wrong. The amount of resources has nothing to do with it. All algorithms (that is, that are guaranteed to halt and produce an answer) use a finite number of resources. The proof of this is quite simple. Accessing a "resource" takes at least one computational step (that is, a finite amount of time). If a program attempted to access an infinite number of resources it would never halt. The infiniteness of a Turing machine isn't so a given program can access an infinite number of resources. All programs that run on a Turing machine are finite. The reason a Turing machine is infinite is so that *any* program can run on it. For if a Turing machine were finite, then there would be *some* programs (indeed, an infinite number of them) that could not execute on a Turing machine. I think this is a fundamental misconception that many people around here have. The fact that any given program is finite does *not* imply that we can determine whether it halts. Both the implementation of the halting program and its input must be finite, by definition. Therefore, we *can* implement the halting program (and its input) on a finite machine (e.g., like the x86). But the proof that the halting program is undecidable tells us that there is no guarantee that the halting program will halt and tell us whether its input program will halt. BTW, it's worth emphasizing that it's not impossible to write a program that analyzes input programs to determine whether they halt. The problem is guaranteeing that the halting program always halts and produces the answer. That is, it's possible to write the halting program if you accept the fact that it may not halt for certain input programs (i.e., it's a procedure, not an algorithm). > > So, if there is any connection between the halting problem > and disassembling code, then the halting theorem says us: > If you don't try to disassemble a program which ONLY runs > on a system with infinite resources, then there is absolutely > no problem as long as the halting theorem is concerned. So I > really don't understand about what Randy is talking here. Your understanding is incorrect. The halting program (or a disassembler) will use a finite number of resources. The input program uses a finite number of resources. Any program that uses an infinite number of resources will never halt (even if such resources were available). The disassembler problem (specifically, differentiating code and data) is not guaranteed to stop and proceduce an answer, even though the code/data differentiation program and its input require only a finite number of resources. I think that Chuck Crayne has confused a lot of people by attempting to claim that these theorems only apply to infinite machines. The misconception he has created is unfortunate. I should have caught on to how this knowledge was being misused earlier. > > > Before we can decide if "perfect disassembly" is possible, > we first have to define what "disassembly" and "perfect > disassembly" is. If disassembly is the process to convert > a binary file x.exe into a text file x.asm, so that you > get an identical x.exe when you assemble x.asm, then it > is trivial to write a disassembler. Just convert each > byte into hex and prefix it with a "dcb" and us a flat > assembler which directly generates the binary output (where > do you see any halting problem here?). I've described my concept of "perfect disassembly" elsewhere. See that post. As for your example above, you're not differentiating code and data, so "perfect disassembly" using this definition is perfectly possible. Of course, the result is hardly useful. And if you attempt to assemble the code to run at a different address than it was originally assembled for, it will not run properly. > > Now, what is perfect disassembly? This surely is a very private > definition (for me a simple hex dump is much more perfect than > source in Intel syntax). If you prefer instruction memnonics > instead of hex bytes, you can replace a byte sequence by it's > equivalent instruction, but only if this is unambiguous. > For example, you can't replace the sequence "88 c2" by > a "mov al,dl" because the assembler maybe generates the > equivalent "8a d0" instead and if the program calculates > the checksum, then it will not run anymore, even if you don't > modify a single line before reassembling the program. > So either you have to use an assembler which allows you > to specify which code to generate or you have to disassemble > it to "dcb 0x88, 0xc2 ; mov al,dl". more importantly, if you have "mov eax, i" where i is some direct address in memory, and you reassemble the code with i falling at a different address, the instruction must be properly assembled at the new address. But more basically, perfect disassembly implies that every instruction that was an instruction in the original code is an instruction in the disassembled code; and every data item that was a data item in the original code is a data item in the disassembled source. Such a disassembler cannot be written that automatically does this, and does this for all inputs (halting, of course). > > > If you think a perfect disassembler should generate a source > file, which you can edit (add or remove instructions), reassemble > it and then it still works, then this isn't possible in general. > But this has nothing to do with the halting problem. Technically, it has to do with the data/code differentiation problem. But that problem is reducible to the halting problem, so we can argue it *does* have something to do with the halting problem. > > For me, the purpose of disassembling is, to convert the binary > in a human readable form. And the perfect disassembler is the > one, which generates a format which makes it most easy FOR ME > to understand the code. If you redefine disassembly to relax the requirement that it properly differentiate code and data (and other undecidable problems like differentiating types of data), then it may very well be possible to write a perfect disassembler, by your definition. But if you want the disassembler to correctly differentiate code and data for all input programs, halt, and produce a usable source file, this is not possible. Cheers, Randy Hyde
From: Guga on 3 Oct 2005 12:19 "The disassembler problem (specifically, differentiating code and data) is not guaranteed to stop and proceduce an answer, even though the code/data differentiation program and its input require only a finite number of resources. " (...)But if you want the disassembler to correctly differentiate code and data for all input programs, halt, and produce a usable source file, this is not possible. " Ok...but then, how the processor identifies what is code and what is data during loading the file ? If as you say, only humans can interpret what is data and what is code (using the examples of the halting problem), how the processor never halts when finding those specific situations ? If the program works perfectly and yet it contains the "undeciable" code/data diferenciatinos, how it never hangs ? I mean, if the processor itself have no problems identifying this, why an disassembler (Assuming it to reproduce the same steps as what the processor should do) is not possible ? The processor only distinguishs when the app is running ? If tis that it, then we can assume that when a program don´t run, the internal data can be anything...i mean, it don´t matter at all, because the processor didn´t actually "saw" the internal contents (dat) yet. And, if the distinguish can be retrieved only when the app is running, shouldn´t a debugger mixed with an disassembler could solve this kind of problem ? Best Regards, Guga
From: Guga on 3 Oct 2005 14:52
One thing only Maybe i´m interpreting it wrong, but....On the loop example (Not necessarily the differenciacion of code/data), the main program _if_ access that endles loop will "stop", i mean....not stop properly said, but it will keep endless onm that loop. Like this: mov eax push Data001 L0: jmp L0< xor eax eax mov eax 01 On the above example, when the program access for the 1st time the L0 locatino, it will enter onan endless loop only, and won´t run any longer outside that loop. (Not sure if this is what you consider "halt")...I mean, if "halt" means hang, break, crash....then the program is not halting. It is merelly going on a loop. Then, since the rest of the data is unacessed, the processor "see" it as a bunch of DB values, nothing more then it. So, the proper disassemblement for that can display what is happening like: mov eax push Data001 L0: jmp L0< DB 033 0C0 ; xor eax eax DB 0B8 01 0 0 0 ; mov eax 01 So, on this specific situation, the program "ends" on the Loop when the loop is accessed, and the processor don´t even identifie the "unacessed". No matter if on that circunstancies it is only "Data" So, even an disassembler that identifies that endless loop, and properly react as the processor will react (Not sure about the term...i hope you understand what i mean) identifying it as DB the unacessed data, then this is the "perfect" disassemblement..(Or better saying .. the proper solution for that unacessed data) However, you can force the disassembler to identify those endless loops as false program endings, and bypass that loop to access the "unacessed data" a few bytes after the loop. But this is only if you want to actually "see" what was the unacessed data is all about, because the program itself won´t use them. So disassembling the unacessed data (On this circunstancies) is only "cosmetic", because it will never be executed. Of course, that all the above i mean the cases where the DB data is really not accessed on any part of the program. If they are accessed indirectly, then the data must be interpreteded for a proper disassemblement. So, we have 02 situations right ? One is an endless loop that may represent an program end, if the "unaccessed" data is really not being used, and the other is the Halting problem, that states for dificult (imposibility) of interpreting a specific chunck of bytes as code or data. For the halting problem, i´m not comment much further, because i´m still reaearching on it, trying to figure it out all of what is being said so far. But it seems to me that, if the processor when accessing some chinck of bytes _can_ distinguish between code and data properly, i don´t see any reason why a disassembler or decompiler can´t, except the limitations of the technology or techniques envolved on the disassembler itself. Otherwise, if a specifc chunck of bytes can be interpreted as code or data at the same time, the procesor will also idsentify this dubial situation, and the program will probably crash. What don´t seems to happen very often. On VB files for instace, we have a totally weird way of accessing teh Data. It have a small chunck of real code bytes, and a LOT of Data mixed in the Code sections (as well as in the data sectino). Since this is only Data (In fact, structures inside teh code section), the processor knows that they are Data and not code. The only problem i see is figure it out, how it "see" the difference properly and in 100% pof the cases (100% because this is how an VB app is made internally, and all of them runs on that way without crashing due to those problems). Are the above assumptions correct ? If not, why then ? Best Regards, Guga |