From: BGB / cr88192 on 5 Jul 2010 15:43

"Joshua Cranmer" <Pidgeot18(a)verizon.invalid> wrote in message
news:i0r71e$jlc$1(a)news-int.gatech.edu...
> On 07/04/2010 07:12 PM, Tom Anderson wrote:
>> On Sun, 4 Jul 2010, Joshua Cranmer wrote:
>>> In any case, interest in decompiling has significantly waned over the
>>> past decade or so. A project or two on sourceforge claim to support
>>> Java 5 decompilation, but I haven't tested it in depth.
>>
>> I wonder if the driver of the fall of decompilation is the rise of open
>> source, and perhaps also open standards. If your landscape consists of,
>> say, the JDK, JBoss, Spring, and Hibernate, then there are easier and
>> more reliable ways to get hold of source code than decompilation.
>
> I think a better explanation is that it was never really a widespread
> avenue of research to begin with. Academically, it consists of disassembly
> [1], control structure identification, and typing and variable analysis.
> The middle part is pretty much a solved problem, and I'm reasonably sure
> that the type/variable analysis is also pretty well solved. Disassembly
> has, by and large, remained generally difficult for native code, but great
> strides have been made in the last 20 years or so.
>

native disassembly is not *that* difficult, as it is mostly a matter of
having:
a relatively complete knowledge of the ISA (usually comes along with writing
an assembler... or at least in my case since my assembler is table-driven and
some of the source-code portions of the assembler/disassembler are
automatically generated by tools from said tables).

also via the use of jump-targets (rather than trying to naively disassemble
the entire image from start to end).

the main issue in the past may well have been memory, since a lot of
intermediate state is needed for all this.

now, granted, SMC could foul this up, but given SMC is both rare and
problematic in modern systems, this is not too much of an issue.
so, if given an EXE, one can start tracing from the entry point, and from a
DLL, from any of its exports (anything unreachable is then assumed to be
either data or garbage).

admittedly, most of my uses have been a bit narrower, mostly debugging, and
also writing an x86 interpreter (partial simplified emulator...).

> Since Java bytecode doesn't mash data and code together in the same space,
> and given how much of the structure information is left in the bytecode,
> it induced a massive spurt in decompilers because it was easy to
> decompile. I'm guessing this spurt was more of a proof-of-concept than a
> full-blown branching out. Since fully automated disassembly is the most
> unsolved portion of decompiling, Java is academically uninteresting to
> decompile; furthermore, you don't need to go the full decompiler route to
> showcase improvements in disassembler. On top of all of this, one of the
> major problem classes for reverse engineering in general is dealing with
> malware, which mostly exists in native code and not bytecode languages.
> You can see that there are a handful of decompilers, defunct or otherwise,
> for other bytecodes (I know of two or three for both Python and .NET); the
> only two languages which have a large number of decompilers are Java
> (because it was easier) and C (because it was harder).
>
> In short, academically, Java decompilers are effectively solved, but
> maintaining an up-to-date decompiler for Java (or any other bytecode
> language) is not something many people wish to do. This has probably been
> true since before Java was created: the lack of modern decompilers is
> probably more attributable to an abnormal interest generated by Java being
> the first major bytecode language in existence.
>

yep, probably about right...

> For an open source project to survive, it needs a critical threshold of
> developers.
> The Java decompiler market is already crowded with several
> "good enough" solutions, C decompilers are effectively beyond the state of
> the art [2], and the interest for other markets is generally insufficient
> to sustain even a small operation. Perhaps a tool which could become the
> "gcc" of decompilers (able to go from many source architectures to many
> destination languages) might achieve this threshold. But unless a tool
> achieves substantially better results, it is probably not going to be
> successful as a project.
>
> [1] I'm glossing over a lot of stuff here which is actually quite
> difficult for native code, but many of the problems don't exist in Java.

large complicated ISA and awkwardness of recursive jump-tracing?...

actually, if one writes an emulator, it is probably a bit easier to verify
results, as one can assume that anywhere EIP reaches is known-valid.
however, this is not as useful for static disassembly, since it is
essentially limited to control paths which are actually reached (and
essentially requires partial OS emulation, since the emulator needs good
visibility of the internal goings-on, and a full emulator running an actual
OS would obscure too much...).

> [2] In the sense of fully-automated decompilation. x86 disassembly is a
> royal pain in the butt; while there exist tools that can do this well
> (IDA!), I'm not aware of anything that could be used in open-source
> software [3].
>
> [3] On reflection, I suppose LLVM is utilizing its x86 assembly
> architecture for disassembly (for debugging purposes).

yeah.
I do similar as well, but for my own uses don't use LLVM (partly as I
disagree with its architecture, am developing primarily in C, and have a
different set of goals).
yeah, probably seems like I am wasting time, but:
LLVM is mostly aiming for being a high-performance codegen and code
analysis;
my main goal is mostly for making high-level features available from C (such
as reflection and eval, as well as ability to load scripts, and cleanly
integrate between C and high-level scripting languages, ...), which in all
deal with a somewhat different set of problem domains...

personally, I suspect I am at this point mostly outside the domain of
existing VM's (the most technically similar thus-far is .NET).

most VM's are essentially top-down monolithic structures, which seek to
impose their workings on the world (building a layer on top of the world,
and in so doing, having a clearly separate "inside" and "outside" world).

I instead attempt to use a bottom-up structure, and the line between what is
inside and outside is less well-defined.

oh yeah, and my current script-language (BGBScript, mostly aiming to be an
ES5/JS/AS variant) also serves an essentially "secondary" role, but can
semi-transparently interface with C land (transparently making use of C
functions and data), and a partial (mostly untested) interface also exists
for going the other way.

C -> BS interfacing is a bit problematic, some of my early tests having to
make use of C functions/data to serve as "templates" for the interfaced-with
BS code/data (for example: "give me a function pointer to X, taking the
function signature from the C function Y", reflection facilities then used
to figure out Y's signature, and then appropriate glue-thunks are
generated).

static-interfacing presents additional issues (and I don't generally bother
with variable-argument functions, ...).

Java also presents its share of interfacing issues...

or such...
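A rough sketch of the jump-target tracing described earlier in this post (start from an entry point, follow branch targets, and treat anything never reached as data or garbage). This is only an illustration under stated assumptions: the toy ISA, its one-byte absolute targets, and the names `TOY_ISA`, `trace`, and `data_offsets` are all invented here and are not taken from any poster's actual code. A real x86 tracer would additionally need a full instruction decoder and some strategy for indirect jumps.

```python
from collections import deque

# Hypothetical toy ISA (an assumption for this sketch, not x86):
# opcode -> (mnemonic, length in bytes, control-flow kind)
TOY_ISA = {
    0x00: ("nop", 1, "fall"),     # continues at the next instruction
    0x01: ("jmp", 2, "jump"),     # continues only at the target
    0x02: ("jz", 2, "branch"),    # continues at the target and the next instruction
    0x03: ("call", 2, "branch"),  # treated like a branch: callee plus return site
    0x04: ("ret", 1, "stop"),     # ends this path
}


def trace(image, entry_points):
    """Return {offset: (mnemonic, length)} for instructions reachable from entry_points."""
    code = {}
    work = deque(entry_points)
    while work:
        pc = work.popleft()
        while 0 <= pc < len(image) and pc not in code:
            entry = TOY_ISA.get(image[pc])
            if entry is None:
                break  # unknown opcode: stop following this path
            name, length, kind = entry
            if pc + length > len(image):
                break  # instruction would run off the end of the image
            code[pc] = (name, length)
            if kind in ("jump", "branch"):
                work.append(image[pc + 1])  # one-byte absolute target operand
            if kind in ("jump", "stop"):
                break
            pc += length
    return code


def data_offsets(image, code):
    """Offsets not covered by any traced instruction, assumed to be data or garbage."""
    covered = set()
    for start, (_, length) in code.items():
        covered.update(range(start, start + length))
    return [i for i in range(len(image)) if i not in covered]


# jz 5 ; nop ; jmp 7 ; ret ; <data byte> ; call 5 ; ret ; <data byte>
image = bytes([0x02, 0x05, 0x00, 0x01, 0x07, 0x04, 0x01, 0x03, 0x05, 0x04, 0xFF])
code = trace(image, [0])
print(sorted(code))               # offsets decoded as instructions
print(data_offsets(image, code))  # offsets left over as data
```

For a DLL-style image, the same `trace` call would simply be seeded with every export offset instead of a single entry point.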
From: BGB / cr88192 on 5 Jul 2010 15:53

"Peter Duniho" <NpOeStPeAdM(a)NnOwSlPiAnMk.com> wrote in message
news:Ua-dnaBDH6Wsr6_RnZ2dnUVZ_oWdnZ2d(a)posted.palinacquisition...
> Joshua Cranmer wrote:
>> On 07/05/2010 02:14 PM, BGB / cr88192 wrote:
>>> luckily, most code does not fall under DMCA, meaning it is only civil
>>> (in which case the owning company is limited to lawsuits...).
>>
>> Actually, I believe there is a provision in the DMCA making attempting to
>> circumvent copyright a criminal offense.
>
> Circumventing encryption, not circumventing copyright.
>
> Until there's case law to show what direction the courts are going, it
> would not be possible to say for sure whether Java byte code is considered
> encryption, but a common-sense understanding certainly would not consider
> it so.

yes, this is mainly why most code doesn't fall under DMCA:
most code does not use encryption...

but, if one does deal with code which does encryption, then a whole new mess
of issues pop up...

>> Under the 1976 copyright act, pretty much everything in the U.S. is
>> copyrighted (even the words I'm writing right now!), which includes
>> source code. So attempting to recover copyrighted source code that you do
>> not have the rights to would be a criminal offense. Furthermore,
>> distributing a program with the intent of circumventing copyright is also
>> criminal, so writing a decompiler can certainly fall under this category.
>
> The .NET community has been very well-served by Red Gate's Reflector
> utility, which does a wonderful job of disassembling .NET programs. No
> one has tried to sue or prosecute them, nor do I think it likely anyone
> would.
>
> I doubt that a Java disassembler would be at any greater risk for same.
>

disassembly is common and fairly non-problematic, mostly due to its fair
number of non-infringing uses...

now, the matter is decompilers, which themselves present a few more problems
on the legal front...
unlike disassemblers, decompilers are not widely used for sake of debugging
(where typically symbolic debugging info is used instead...).

>> Furthermore and even more unfortunately, the Supreme Court appears to
>> have backed off of its Betamax decision (which held that a device with
>> substantial noninfringing uses was still legal) when it heard the
>> Grokster case. Which lands the issue of whether or not a decompiler is
>> legal or not into a muddy gray cesspool in the U.S.
>
> I suppose that depends on the bias with which one views that legal
> decision. However, my interpretation of it was that the court very much
> held the Betamax decision in high regard, and found that Grokster did not
> in fact have a substantial non-infringing use.
>
> Pete
From: Peter Duniho on 5 Jul 2010 16:18

BGB / cr88192 wrote:
> [...]
>> The .NET community has been very well-served by Red Gate's Reflector
>> utility, which does a wonderful job of disassembling .NET programs. No
>> one has tried to sue or prosecute them, nor do I think it likely anyone
>> would.
>>
>> I doubt that a Java disassembler would be at any greater risk for same.
>
> disassembly is common and fairly non-problematic, mostly due to its fair
> number of non-infringing uses...
>
> now, the matter is decompilers, which themselves present a few more problems
> on the legal front...

I don't think so. I was imprecise, as .NET Reflector is both a disassembler
(inasmuch as Java or .NET byte code are "assembly" languages) and a
decompiler (Reflector will reconstruct to the best of its impressive
abilities any managed language version of the MSIL it's analyzing). It has
had no real legal challenges to its existence or use.

The fact that the tool displays the low-level byte code as some
reconstructed higher-level language versus simply a textual representation
of the byte code itself is essentially irrelevant. In neither case is
copyright being violated. Simply rearranging, reformatting, redisplaying,
etc. some copyrighted material that you already have legal access to does
not in and of itself violate the copyright.

Pete
From: Joshua Cranmer on 5 Jul 2010 16:35

On 07/05/2010 03:43 PM, BGB / cr88192 wrote:
> native disassembly is not *that* difficult, as it is mostly a matter of
> having:

In the context of disassembly as a prerequisite for decompiling, it can be
difficult. I will agree that disassembling a small fragment is no challenge,
but the issue is mostly program-wide decompiling and disassembling. I am
including tasks like determining function boundaries and call frames in
disassembly, and these are not exactly easy, especially if you compile with
-OMG.

> now, granted, SMC could foul this up, but given SMC is both rare and
> problematic in modern systems, this is not too much of an issue.

Self-modifying code probably makes up the vast majority of "interesting"
cases for disassembly: malware.

>> [1] I'm glossing over a lot of stuff here which is actually quite
>> difficult for native code, but many of the problems don't exist in Java.
>
> large complicated ISA and awkwardness of recursive jump-tracing?...

Not having to worry about the pain of code and data sharing the same space
(separating code from data is equivalent to the halting problem) is a major
factor. Determining function arguments (in light of things like fastcall or
-fomit-frame-pointer) and even function boundaries is another annoying
issue. It also helps that Java bytecode is typically unoptimized, so you get
very sane CFGs. I suppose Java bytecode is roughly comparable to having a
binary compiled with -g with full debug symbols and no optimization
whatsoever, with the header files probably also included.
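To illustrate why code and data sharing one space hurts, here is a minimal sketch of a naive start-to-end sweep. Everything in it is an assumption made up for illustration: the three-opcode toy encoding and the `linear_sweep` helper do not correspond to any real ISA or tool. A single data byte that happens to look like an opcode is decoded as an instruction and swallows the real instruction that follows it.

```python
# Hypothetical toy encoding (an assumption for this sketch, not a real ISA):
# opcode -> (mnemonic, length in bytes)
TOY_ISA = {0x00: ("nop", 1), 0x01: ("jmp", 2), 0x04: ("ret", 1)}


def linear_sweep(image):
    """Decode from offset 0 to the end, treating every byte as potential code."""
    listing = []
    pc = 0
    while pc < len(image):
        name, length = TOY_ISA.get(image[pc], ("??", 1))
        listing.append((pc, name))
        pc += length
    return listing


# Intended layout: jmp 3 ; <one data byte, 0x01> ; nop ; ret
image = bytes([0x01, 0x03, 0x01, 0x00, 0x04])
listing = linear_sweep(image)
for offset, name in listing:
    print(offset, name)
```

The sweep reports a second `jmp` at offset 2 (the data byte) and never lists the `nop` at offset 3, because that byte is consumed as the bogus jump's operand. With variable-length encodings such as x86, this kind of desynchronization is much harder to spot.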
> yeah, probably seems like I am wasting time, but:
> LLVM is mostly aiming for being a high-performance codegen and code
> analysis;
> my main goal is mostly for making high-level features available from C (such
> as reflection and eval, as well as ability to load scripts, and cleanly
> integrate between C and high-level scripting languages, ...), which in all
> deal with a somewhat different set of problem domains...

Reflection and C++ don't mix very well. I could go on for hours about this,
but by then we'd have long since gone well off-topic.

> Java also presents its share of interfacing issues...

At least there exists a single Java ABI. C++ on the other hand...

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
From: Mike Schilling on 5 Jul 2010 18:00

"Peter Duniho" <NpOeStPeAdM(a)NnOwSlPiAnMk.com> wrote in message
news:Ua-dnaBDH6Wsr6_RnZ2dnUVZ_oWdnZ2d(a)posted.palinacquisition...
> The .NET community has been very well-served by Red Gate's Reflector
> utility, which does a wonderful job of disassembling .NET programs. No
> one has tried to sue or prosecute them, nor do I think it likely anyone
> would.

I've always suspected that Microsoft pays for Reflector and provides helpful
hints to its developers. That's a lot cheaper than documenting all the
hidden behavior you currently learn about only from using Reflector on the
system assemblies.