From: BGB / cr88192 on

"Joshua Cranmer" <Pidgeot18(a)verizon.invalid> wrote in message
news:i0r71e$jlc$1(a)news-int.gatech.edu...
> On 07/04/2010 07:12 PM, Tom Anderson wrote:
>> On Sun, 4 Jul 2010, Joshua Cranmer wrote:
>>> In any case, interest in decompiling has significantly waned over the
>>> past decade or so. A project or two on sourceforge claim to support
>>> Java 5 decompilation, but I haven't tested it in depth.
>>
>> I wonder if the driver of the fall of decompilation is the rise of open
>> source, and perhaps also open standards. If your landscape consists of,
>> say, the JDK, JBoss, Spring, and Hibernate, then there are easier and
>> more reliable ways to get hold of source code than decompilation.
>
> I think a better explanation is that it was never really a widespread
> avenue of research to begin with. Academically, it consists of disassembly
> [1], control structure identification, and typing and variable analysis.
> The middle part is pretty much a solved problem, and I'm reasonably sure
> that the type/variable analysis is also pretty well solved. Disassembly
> has, by and large, remained generally difficult for native code, but great
> strides have been made in the last 20 years or so.
>

native disassembly is not *that* difficult, as it is mostly a matter of
having:
a relatively complete knowledge of the ISA (this usually comes along with
writing an assembler... or at least it did in my case, since my assembler is
table-driven and some of the source-code portions of the
assembler/disassembler are automatically generated by tools from said
tables).

it also helps to follow jump-targets (rather than naively trying to
disassemble the entire image from start to end).
the main issue in the past may well have been memory, since a lot of
intermediate state is needed for all this.

now, granted, SMC could foul this up, but given SMC is both rare and
problematic in modern systems, this is not too much of an issue.

so, if given an EXE, one can start tracing from the entry point, and from a
DLL, from any of its exports (anything unreachable is then assumed to be
either data or garbage).
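
A minimal sketch of the tracing idea described above, in Python, on a
made-up toy ISA. The opcode table, the IMAGE bytes and the trace() helper
are all assumptions invented for illustration; this is not x86 and not the
table-driven disassembler mentioned in this post.

```python
# Hypothetical toy ISA, invented for illustration only (not x86).
# opcode -> (mnemonic, length in bytes, kind)
OPCODES = {
    0x00: ("nop", 1, "fallthrough"),
    0x01: ("jmp", 2, "jump"),     # unconditional: operand is an absolute target
    0x02: ("jz", 2, "branch"),    # conditional: target, or fall through
    0x03: ("ret", 1, "stop"),
    0x04: ("call", 2, "branch"),  # callee, then execution resumes after the call
}

def trace(image, entry_points):
    """Return {address: text} for every instruction reachable from the entries."""
    listing = {}
    worklist = list(entry_points)
    while worklist:
        pc = worklist.pop()
        # Stop at already-decoded addresses and at anything outside the image.
        while pc not in listing and 0 <= pc < len(image):
            op = OPCODES.get(image[pc])
            if op is None:
                break                     # unknown opcode: do not treat as code
            name, length, kind = op
            if pc + length > len(image):
                break                     # truncated instruction
            if length == 2:
                target = image[pc + 1]
                listing[pc] = f"{name} {target:#04x}"
                worklist.append(target)   # follow the jump target later
            else:
                listing[pc] = name
            if kind in ("jump", "stop"):
                break                     # nothing falls through from here
            pc += length
    return listing

IMAGE = bytes([
    0x02, 0x05,   # 0x00: jz 0x05
    0x01, 0x06,   # 0x02: jmp 0x06
    0xFF,         # 0x04: never reached, so assumed data or garbage
    0x00,         # 0x05: nop
    0x03,         # 0x06: ret
])

if __name__ == "__main__":
    for addr, text in sorted(trace(IMAGE, [0]).items()):
        print(f"{addr:#04x}: {text}")
```

For an EXE the entry list would hold the single entry point; for a DLL it
would hold one address per export. Indirect jumps are not handled here,
which is one reason real images are harder than this sketch suggests.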

admittedly, most of my uses have been a bit narrower, mostly debugging, and
also writing an x86 interpreter (partial simplified emulator...).


> Since Java bytecode doesn't mash data and code together in the same space,
> and given how much of the structure information is left in the bytecode,
> it induced a massive spurt in decompilers because it was easy to
> decompile. I'm guessing this spurt was more of a proof-of-concept than a
> full-blown branching out. Since fully automated disassembly is the most
> unsolved portion of decompiling, Java is academically uninteresting to
> decompile; furthermore, you don't need to go the full decompiler route to
> showcase improvements in disassembler. On top of all of this, one of the
> major problem classes for reverse engineering in general is dealing with
> malware, which mostly exists in native code and not bytecode languages.
> You can see that there are a handful of decompilers, defunct or otherwise,
> for other bytecodes (I know of two or three for both Python and .NET); the
> only two languages which have a large number of decompilers are Java
> (because it was easier) and C (because it was harder).
>
> In short, academically, Java decompilers are effectively solved, but
> maintaining an up-to-date decompiler for Java (or any other bytecode
> language) is not something many people wish to do. This has probably been
> true since before Java was created: the lack of modern decompilers is
> probably more attributable to an abnormal interest generated by Java being
> the first major bytecode language in existence.
>

yep, probably about right...


> For an open source project to survive, it needs a critical threshold of
> developers. The Java decompiler market is already crowded with several
> "good enough" solutions, C decompilers are effectively beyond the state of
> the art [2], and the interest for other markets is generally insufficient
> to sustain even a small operation. Perhaps a tool which could become the
> "gcc" of decompilers (able to go from many source architectures to many
> destination languages) might achieve this threshold. But unless a tool
> achieves substantially better results, it is probably not going to be
> successful as a project.
>
> [1] I'm glossing over a lot of stuff here which is actually quite
> difficult for native code, but many of the problems don't exist in Java.

large complicated ISA and awkwardness of recursive jump-tracing?...

actually, if one writes an emulator, it is probably a bit easier to verify
results, as one can assume that anywhere EIP reaches is known-valid.
however, this is not as useful for static disassembly, since it is
essentially limited to control paths which are actually reached (it also
essentially requires partial OS emulation, since the emulator needs good
visibility of the internal goings-on, and a full emulator running an actual
OS would obscure too much...).
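
A minimal sketch of the "anywhere the program counter reaches is
known-valid" idea, in Python. The four-opcode toy machine, the single zero
flag and the run_and_record() helper are assumptions made up for this
example; a real x86 emulator tracking EIP would be far more involved.

```python
# Hypothetical toy ISA, invented for illustration only (not x86).
NOP, JMP, JZ, RET = 0x00, 0x01, 0x02, 0x03

def run_and_record(image, entry, zero_flag, max_steps=1000):
    """Emulate from `entry`; return the set of addresses the PC reached."""
    reached = set()
    pc = entry
    for _ in range(max_steps):
        if not 0 <= pc < len(image):
            break
        reached.add(pc)                   # executed, so known-valid code
        op = image[pc]
        if op == NOP:
            pc += 1
        elif op == JMP and pc + 1 < len(image):
            pc = image[pc + 1]
        elif op == JZ and pc + 1 < len(image):
            pc = image[pc + 1] if zero_flag else pc + 2
        else:
            break                         # RET, unknown opcode, or truncation
    return reached

IMAGE = bytes([
    JZ, 0x04,   # 0x00: jz 0x04
    NOP,        # 0x02: nop  (only reached when the flag is clear)
    RET,        # 0x03: ret
    RET,        # 0x04: ret  (only reached when the flag is set)
])

if __name__ == "__main__":
    print(sorted(run_and_record(IMAGE, 0, zero_flag=True)))
    print(sorted(run_and_record(IMAGE, 0, zero_flag=False)))
```

Each run only covers the path that its inputs happen to drive, which is the
limitation noted above: the union of several runs grows toward the full
picture, but nothing guarantees it ever gets there.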


> [2] In the sense of fully-automated decompilation. x86 disassembly is a
> royal pain in the butt; while there exist tools that can do this well
> (IDA!), I'm not aware of anything that could be used in open-source
> software [3].
>
> [3] On reflection, I suppose LLVM is utilizing its x86 assembly
> architecture for disassembly (for debugging purposes).

yeah.

I do similar as well, but for my own uses don't use LLVM (partly as I
disagree with its architecture, am developing primarily in C, and have a
different set of goals).

yeah, probably seems like I am wasting time, but:
LLVM is mostly aiming for being a high-performance codegen and code
analysis;
my main goal is mostly for making high-level features available from C (such
as reflection and eval, as well as ability to load scripts, and cleanly
integrate between C and high-level scripting languages, ...), which in all
deal with a somewhat different set of problem domains...

personally, I suspect I am at this point mostly outside the domain of
existing VM's (the most technically similar thus-far is .NET).

most VM's are essentially top-down monolithic structures, which seek to
impose their workings on the world (building a layer on top of the world,
and in so doing, having a clearly separate "inside" and "outside" world). I
instead attempt to use a bottom-up structure, and the line between what is
inside and outside is less well-defined.

oh yeah, and my current script-language (BGBScript, mostly aiming to be an
ES5/JS/AS variant) also serves an essentially "secondary" role, but can
semi-transparently interface with C land (transparently making use of C
functions and data), and a partial (mostly untested) interface also exists
for going the other way. C -> BS interfacing is a bit problematic, some of
my early tests having to make use of C functions/data to serve as
"templates" for the interfaced-with BS code/data (for example: "give me a
function pointer to X, taking the function signature from the C function Y",
reflection facilities then used to figure out Y's signature, and then
appropriate glue-thunks are generated). static-interfacing presents
additional issues (and I don't generally bother with variable-argument
functions, ...).
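
A loose Python stand-in for the "take the signature from Y, generate a
glue-thunk for X" step. Everything here is an assumption for illustration:
template_y plays the role of the C function Y, script_add plays the
script-side function X, and make_thunk() uses Python's inspect module where
the real system would use its own reflection facilities and generated
machine-code thunks.

```python
import inspect

def template_y(a: int, b: int) -> int:
    """Stands in for the C function Y; only its signature is used."""
    raise NotImplementedError

def script_add(a, b):
    """Stands in for the untyped script-side function X."""
    return a + b

def make_thunk(func, template):
    """Wrap `func` so that it is called through `template`'s signature."""
    sig = inspect.signature(template)

    def thunk(*args, **kwargs):
        # Raises TypeError if the call does not match the template signature.
        bound = sig.bind(*args, **kwargs)
        # Coerce each argument to the type the template declares for it.
        converted = [
            sig.parameters[name].annotation(value)
            for name, value in bound.arguments.items()
        ]
        # Coerce the result to the template's declared return type.
        return sig.return_annotation(func(*converted))

    thunk.__signature__ = sig             # expose the borrowed signature
    return thunk

if __name__ == "__main__":
    add_thunk = make_thunk(script_add, template_y)
    print(add_thunk(2.0, 3.7))
```

The sketch assumes every parameter of the template carries a callable type
annotation and, like the approach above, ignores variable-argument
functions.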

Java also presents its share of interfacing issues...


or such...


From: BGB / cr88192 on

"Peter Duniho" <NpOeStPeAdM(a)NnOwSlPiAnMk.com> wrote in message
news:Ua-dnaBDH6Wsr6_RnZ2dnUVZ_oWdnZ2d(a)posted.palinacquisition...
> Joshua Cranmer wrote:
>> On 07/05/2010 02:14 PM, BGB / cr88192 wrote:
>>> luckily, most code does not fall under DMCA, meaning it is only civil
>>> (in
>>> which case the owning company is limited to lawsuits...).
>>
>> Actually, I believe there is a provision in the DMCA making attempting to
>> circumvent copyright a criminal offense.
>
> Circumventing encryption, not circumventing copyright.
>
> Until there's case law to show what direction the courts are going, it
> would not be possible to say for sure whether Java byte code is considered
> encryption, but a common-sense understanding certainly would not consider
> it so.

yes, this is mainly why most code doesn't fall under DMCA:
most code does not use encryption...

but, if one does deal with code which does encryption, then a whole new mess
of issues pop up...


>> Under the 1976 copyright act, pretty much everything in the U.S. is
>> copyrighted (even the words I'm writing right now!), which includes
>> source code. So attempting to recover copyrighted source code that you do
>> not have the rights to would be a criminal offense. Furthermore,
>> distributing a program with the intent of circumventing copyright is also
>> criminal, so writing a decompiler can certainly fall under this category.
>
> The .NET community has been very well-served by Red Gate's Reflector
> utility, which does a wonderful job of disassembling .NET programs. No
> one has tried to sue or prosecute them, nor do I think it likely anyone
> would.
>
> I doubt that a Java disassembler would be at any greater risk for same.
>

disassembly is common and fairly non-problematic, mostly due to its fair
number of non-infringing uses...

now, the matter is decompilers, which themselves present a few more problems
on the legal front...
unlike disassemblers, decompilers are not widely used for sake of debugging
(where typically symbolic debugging info is used instead...).


>> Furthermore and even more unfortunately, the Supreme Court appears to
>> have backed off of its Betamax decision (which held that a device with
>> substantial noninfringing uses was still legal) when it heard the
>> Grokster case. Which lands the issue of whether or not a decompiler is
>> legal or not into a muddy gray cesspool in the U.S.
>
> I suppose that depends on the bias with which one views that legal
> decision. However, my interpretation of it was that the court very much
> held the Betamax decision in high regard, and found that Grokster did not
> in fact have a substantial non-infringing use.
>
> Pete


From: Peter Duniho on

BGB / cr88192 wrote:
> [...]
>> The .NET community has been very well-served by Red Gate's Reflector
>> utility, which does a wonderful job of disassembling .NET programs. No
>> one has tried to sue or prosecute them, nor do I think it likely anyone
>> would.
>>
>> I doubt that a Java disassembler would be at any greater risk for same.
>
> disassembly is common and fairly non-problematic, mostly due to its fair
> number of non-infringing uses...
>
> now, the matter is decompilers, which themselves present a few more problems
> on the legal front...

I don't think so.

I was imprecise, as .NET Reflector is both a disassembler (inasmuch as
Java or .NET byte code are "assembly" languages) and a decompiler
(Reflector will reconstruct to the best of its impressive abilities any
managed language version of the MSIL it's analyzing). It has had no
real legal challenges to its existence or use.

The fact that the tool displays the low-level byte code as some
reconstructed higher-level language versus simply a textual
representation of the byte code itself is essentially irrelevant. In
neither case is copyright being violated. Simply rearranging,
reformatting, redisplaying, etc. some copyrighted material that you
already have legal access to does not in and of itself violate the
copyright.

Pete


From: Joshua Cranmer on

On 07/05/2010 03:43 PM, BGB / cr88192 wrote:
> native disassembly is not *that* difficult, as it is mostly a matter of
> having:

In the context of disassembly as a prerequisite for decompiling, it can
be difficult. I will agree that disassembling a small fragment is no
challenge, but the issue is mostly program-wide decompiling and
disassembling. Tasks like determining function boundaries and call
frames I am including in disassembly, and this is not exactly an easy
task, especially if you compile with -OMG.

> now, granted, SMC could foul this up, but given SMC is both rare and
> problematic in modern systems, this is not too much of an issue.

Self-modifying code probably makes up the vast majority of "interesting"
cases for disassembly: malware.

>> [1] I'm glossing over a lot of stuff here which is actually quite
>> difficult for native code, but many of the problems don't exist in Java.
>
> large complicated ISA and awkwardness of recursive jump-tracing?...

Not having to worry about the pain of code and data sharing the same
space (separating code from data is equivalent to the halting problem)
is a major factor. Determining function arguments (in light of things
like fastcall or -fomit-frame-pointer) and even function boundaries is
another annoying issue. It also helps that Java bytecode is typically
unoptimized, so you get very sane CFGs.
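
A small Python sketch of why code and data sharing one space hurts a naive
start-to-end disassembly. The three-opcode toy ISA, the IMAGE bytes and the
linear_sweep() helper are assumptions invented for this example, not any
real instruction set or tool.

```python
# Hypothetical toy ISA, invented for illustration only: opcode -> (name, length)
OPCODES = {0x00: ("nop", 1), 0x01: ("jmp", 2), 0x03: ("ret", 1)}

def linear_sweep(image):
    """Decode from byte 0 to the end, treating every byte as potential code."""
    out = []
    pc = 0
    while pc < len(image):
        # Unknown bytes are emitted as one-byte "db" pseudo-instructions.
        name, length = OPCODES.get(image[pc], ("db", 1))
        operand = image[pc + 1:pc + length]
        out.append((pc, name, operand.hex()))
        pc += length
    return out

IMAGE = bytes([
    0x01, 0x03,   # 0x00: jmp 0x03  (jumps over the data byte)
    0x01,         # 0x02: data byte that happens to equal the jmp opcode
    0x00,         # 0x03: nop       (real code)
    0x03,         # 0x04: ret       (real code)
])

if __name__ == "__main__":
    for addr, name, operand in linear_sweep(IMAGE):
        print(f"{addr:#04x}: {name} {operand}".rstrip())
```

The sweep decodes the data byte at 0x02 as a jmp and swallows the real nop
at 0x03 as its operand, so the listing never shows an instruction at 0x03.
Java class files keep bytecode and constants apart, so this particular
failure does not arise there.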

I suppose Java bytecode is roughly comparable to having a binary
compiled with -g with full debug symbols and no optimization whatsoever,
with the header files probably also included.

> yeah, probably seems like I am wasting time, but:
> LLVM is mostly aiming for being a high-performance codegen and code
> analysis;
> my main goal is mostly for making high-level features available from C (such
> as reflection and eval, as well as ability to load scripts, and cleanly
> integrate between C and high-level scripting languages, ...), which in all
> deal with a somewhat different set of problem domains...

Reflection and C++ don't mix very well. I could go on for hours about
this, but by then we'd have long since gone well off-topic.

> Java also presents its share of interfacing issues...

At least there exists a single Java ABI. C++ on the other hand...

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth


From: Mike Schilling on

"Peter Duniho" <NpOeStPeAdM(a)NnOwSlPiAnMk.com> wrote in message
news:Ua-dnaBDH6Wsr6_RnZ2dnUVZ_oWdnZ2d(a)posted.palinacquisition...
> The .NET community has been very well-served by Red Gate's Reflector
> utility, which does a wonderful job of disassembling .NET programs. No
> one has tried to sue or prosecute them, nor do I think it likely anyone
> would.

I've always suspected that Microsoft pays for Reflector and provides helpful
hints to its developers. That's a lot cheaper than documenting all the
hidden behavior you currently learn about only from using Reflector on the
system assemblies.