status: ASTs as IR... [General Programming]

From: BGB / cr88192 on 9 Aug 2010 18:45

well, here is basically a status update:
I ended up developing a "Generalized AST" based system, where the idea is
that a lot of language-specific stuff is stripped out of the ASTs, and the
new ASTs are intended to represent a more general and language-neutral form.

the ASTs are in turn represented as XML, and are sent between compiler
components in the form of binary XML (a binary XML encoding vaguely similar
to WBXML is in use).

unlike WBXML, it uses MRU lists for tag and attribute names, as well as
literal strings. strings may also be sent inline, where they may be encoded
with an LZ-Markov scheme (vaguely similar to LZ77, but matches are handled
differently). (this variant uses a 16kB dictionary). like WBXML (and unlike
EXI), it is byte-based (no entropy coding or similar is used).
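
as a rough illustration of the MRU idea (a sketch only, and not the actual SBXE encoding: the token tuples, the list size of 64, and all names here are assumptions made for the example), an encoder and decoder can keep identical most-recently-used lists, so a repeated tag or attribute name is sent as a small index rather than as a string:

```python
class MRUList:
    """Most-recently-used string table, updated in lockstep by both sides."""

    def __init__(self, size=64):
        self.size = size
        self.items = []

    def touch(self, s):
        # Move (or add) s to the front, then drop anything past the size limit.
        if s in self.items:
            self.items.remove(s)
        self.items.insert(0, s)
        del self.items[self.size:]


def encode(names, size=64):
    """Turn a sequence of names into ("lit", str) or ("idx", int) tokens."""
    mru = MRUList(size)
    out = []
    for name in names:
        if name in mru.items:
            out.append(("idx", mru.items.index(name)))
        else:
            out.append(("lit", name))
        mru.touch(name)
    return out


def decode(tokens, size=64):
    """Rebuild the names by replaying the same MRU updates as the encoder."""
    mru = MRUList(size)
    names = []
    for kind, value in tokens:
        name = mru.items[value] if kind == "idx" else value
        names.append(name)
        mru.touch(name)
    return names
```

a real byte-based format would pack these tokens into bytes; the point here is only that both sides can rebuild the same list without it ever being transmitted.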

the code for this binary XML format had been sitting around since 2005, but
until recently was mostly unused. textual XML is also possible, but a binary
form takes less space in serialized form (~ 20-25% the size of textual XML
for ASTs), and also decodes much more quickly.

I ended up calling this the GAST-IR.

here is an incomplete/older AST spec:
http://cr88192.dyndns.org/SilvVMSpec/2010-05-23_AST.html
(GAST is mostly similar, but differs WRT a few of the tags; also, this spec
has some holes, ...).

as well as a (mostly accurate) spec for the binary XML:
http://cr88192.dyndns.org/SilvVMSpec/2005-08-31_SBXE.txt

also, more recently the codegen upper-end was modified to use GAST as input,
in place of my prior RPNIL format. this could allow (eventually) no longer
needing to use a stack-machine within the codegen.
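
to illustrate the difference between a tree-shaped IR and a stack-machine (RPN) form, here is a minimal sketch that flattens a small XML expression tree into postfix order. the tag and attribute names (binary, int, ref, op, value, name) are made up for the example, and are not taken from the GAST or RPNIL specs:

```python
import xml.etree.ElementTree as ET


def to_rpn(node):
    """Flatten an expression tree into a postfix (RPN) token list."""
    if node.tag == "int":
        return [node.get("value")]
    if node.tag == "ref":
        return [node.get("name")]
    if node.tag == "binary":
        lhs, rhs = list(node)
        # Operands first, operator last: this is the stack-machine order.
        return to_rpn(lhs) + to_rpn(rhs) + [node.get("op")]
    raise ValueError("unhandled node: " + node.tag)


expr = ET.fromstring(
    '<binary op="+"><ref name="a"/>'
    '<binary op="*"><int value="2"/><ref name="b"/></binary></binary>'
)
rpn = to_rpn(expr)   # tokens for a + (2 * b)
```

a codegen fed the tree directly can look at a whole subexpression at once, whereas one fed the postfix list has to recover that structure by simulating the stack.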

the codegen currently generates native x86 or x86-64 code, and currently
only handles C's functionality (the goal is to implement the functionality
needed to make Java and C# also work).

parsing both Java and C# code works at this point, but a little more work is
needed in the compiler before these languages are likely to work (mostly
related to things like OO facilities and exception handling).

an interface between GAST and my BGBScript interpreter was also implemented.
the BGBScript interpreter is dynamically typed, where BS is a language
influenced somewhat by JavaScript and ActionScript, and a goal is to try to
get it conformant with the ES5 standard eventually (most current deviations
are minor things like float precision being less than the required full
double precision, or incomplete regex or date-object handling, ...).

this could potentially partially fake Java or C# (by kludging their syntax
on top of JS-like semantics), but this would be tacky, low performance, and
not be able to faithfully represent semantics (also, C would not likely work
at all in this interpreter).

neither GAST-related interface has really been tested as of yet though.

it is possible that I could consider plans for a statically typed
interpreter.
it would retain the goal of having a relatively clean and transparent
interface with native C, as I have since noted that one doesn't have to
generate native code to interface with native code...

the interpreter would likely aim to support C, Java, and C#, much like the
main codegen, but could possibly also handle BGBScript (note: the
interpreter would likely support dynamic types, as well as statically-typed
operation).

this interpreter would likely also be targeted via GAST IR, and is likely
to use a similar design to the BGBScript interpreter (where the code for
compiling to the bytecode, and for interpreting it, are tightly integrated).
IME, this seems to work out much better than trying to have any sort of
standardized bytecode.

a longer-range goal could be considering the possibility of using LLVM as a
target, but it is likely to require a fair amount of effort to produce code
in LLVM's IR.

I could put source up on my server if anyone wants to look, but at the
moment this is doubtful.

I don't know, any thoughts or comments?...

From: Ira Baxter on 13 Aug 2010 12:01

"BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
news:i3q0ee$ork$1(a)news.albasani.net...
> well, here is basically a status update:
> I ended up developing a "Generalized AST" based system, where the idea is
> that a lot of language-specific stuff is stripped out of the ASTs, and the
> new ASTs are intended to represent a more general and language-neutral
> form.
>
>
> the ASTs are in turn represented as XML, and are sent between compiler
> components in the form of binary XML (a binary XML encoding vaguely
> similar to WBXML is in use).
>

Are you familiar with OMG's Abstract Syntax Tree Model?
http://www.omg.org/spec/ASTM/Current/

The exchange format is XMI. I haven't looked, but surely the OMG knows about
"binary XML" and it wouldn't surprise me if ASTM models can now be exchanged
as binary XMI.

-- IDB

From: BGB / cr88192 on 13 Aug 2010 17:17

"Ira Baxter" <idbaxter(a)semdesigns.com> wrote in message
news:6MydnTyeUp3A8fjRnZ2dnUVZ_t6dnZ2d(a)giganews.com...
>
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:<i3q0ee$ork$1(a)news.albasani.net>...
>> well, here is basically a status update:
>> I ended up developing a "Generalized AST" based system, where the idea is
>> that a lot of language-specific stuff is stripped out of the ASTs, and
>> the new ASTs are intended to represent a more general and
>> language-neutral form.
>>
>>
>> the ASTs are in turn represented as XML, and are sent between compiler
>> components in the form of binary XML (a binary XML encoding vaguely
>> similar to WBXML is in use).
>
>
> Are you familiar with OMG's Abstract Syntax Tree Model?
> http://www.omg.org/spec/ASTM/Current/
>
> The exchange format is XMI. I haven't looked, but surely the OMG knows about
> "binary XML" and it wouldn't surprise me if ASTM models can now be exchanged
> as binary XMI.
>

quick skim:
spec has too many words, and so would take a bit more than the amount of
skimming I have done to make all that much sense of it.

however, they appear to be addressing a VERY different set of problems.

my stuff is much more low-level, and is concerned mostly with structural
aspects of programming languages, and getting them compiled, and how they
are represented (directly) in XML.

abstract semantics and metamodeling is not what I am dealing with here...

the closest I do is a bunch of ad-hoc kludging as needed to interface BS and
C (mostly figuring out how to coerce one piece of data into another piece of
data).

C and Java are interfaced mostly at the ABI level, rather than at the
semantic level (there is little concern for how ADTs cross the border, as
the concern is more for moving concrete-data types).

the spec makes no mention that I could see of binary XML or binary XML
serializations (or, for that matter, even much real mention of XML).

rather, it seems to describe things in terms of a (very much not XML)
abstract model (BNF-like, with it being rather unclear what the actual
representation is of these BNF-like forms, or for that matter, if whatever
representation is used internally uses the same naming).

my concern is more like:
these syntax constructs map onto these XML expressions;
XML is mapped into binary data blobs in this or that way;
....

so, an example of the latter would be like:
XML is converted into textual XML, and run through deflate or gzip;
XML is converted into a binary glob via WBXML or EXI;
XML is converted into a binary glob via some custom format;
....

WBXML+gzip is also possible...
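
the first of these options (textual XML run through deflate or gzip) can be sketched with nothing but the Python standard library. the node and attribute names below are placeholders, and the function name serialize_variants is just something picked for the example:

```python
import gzip
import zlib
import xml.etree.ElementTree as ET


def serialize_variants(root):
    """Serialize a tree as textual XML, plus deflate and gzip versions of it."""
    text = ET.tostring(root, encoding="utf-8")
    return {
        "text": text,
        "deflate": zlib.compress(text, 9),   # zlib-wrapped deflate stream
        "gzip": gzip.compress(text),         # same data in a gzip container
    }


# A repetitive dummy tree, standing in for a serialized AST.
root = ET.Element("block")
for i in range(50):
    stmt = ET.SubElement(root, "assign", {"name": "x%d" % i})
    ET.SubElement(stmt, "int", {"value": str(i)})

variants = serialize_variants(root)
sizes = {kind: len(data) for kind, data in variants.items()}
```

comparing the entries in sizes gives a quick feel for how much of the textual overhead a general-purpose compressor removes, which is the baseline a custom binary format would need to beat.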

so, the goal is to make it obvious what is the representation for the data,
....

for example, within my compilers, the representation and interface used is
along similar lines to DOM.
most internal transforms are done on the XML.
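
as an example of what a transform over a DOM-style tree can look like, here is a minimal constant-folding sketch using Python's xml.dom.minidom. this is not the compiler's actual code (which is in C), and the tag and attribute names (binary, int, op, value) are assumptions made for the illustration:

```python
from xml.dom import minidom

# Operators this sketch knows how to fold.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def child_elements(node):
    """Element children only (skips whitespace text nodes and comments)."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def fold(node):
    """Fold constant subtrees bottom-up; returns the node or its replacement."""
    for child in child_elements(node):
        new = fold(child)
        if new is not child:
            node.replaceChild(new, child)
    if node.tagName == "binary":
        kids = child_elements(node)
        op = node.getAttribute("op")
        if len(kids) == 2 and op in OPS and all(k.tagName == "int" for k in kids):
            a, b = (int(k.getAttribute("value")) for k in kids)
            lit = node.ownerDocument.createElement("int")
            lit.setAttribute("value", str(OPS[op](a, b)))
            return lit
    return node


doc = minidom.parseString(
    '<binary op="+"><int value="2"/>'
    '<binary op="*"><int value="3"/><int value="4"/></binary></binary>'
)
folded = fold(doc.documentElement)   # an <int> node holding 2 + 3 * 4
```

the caller is responsible for swapping the returned node in for the old root if the root itself was folded.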

currently, I don't support XSLT or XPath, but I had recently imagined the
possibility of designing something "similar" to XSLT for representing
compiler-related internal transforms, although this leaves a whole lot of
things (like doing register allocation, ...) which would not likely map so
nicely to XML, and so a fair amount of C code would likely still be needed
for this, partly undermining the whole system.

similarly, directly using an XSLT-like system within a compiler would
probably be slow-as-hell...
(and, just as easily, I could try to figure out designs for C-level API's to
try to reduce the amount of jerking-off currently needed for matching
patterns and manipulating XML nodes...).
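
one possible shape for such a pattern-matching helper is sketched below (in Python rather than C, for brevity). everything here is hypothetical: the pattern notation, the "$name" capture convention, and the tag names are invented for the example and do not describe an existing API:

```python
import xml.etree.ElementTree as ET


def match(pattern, node, binds=None):
    """Match node against pattern; return a dict of captures, or None.

    A pattern is either a string ("$name" captures any node, anything else
    must equal the tag), or a (tag, attrs, children) tuple matched exactly.
    Captures from a failed partial match are not rolled back in this sketch.
    """
    binds = {} if binds is None else binds
    if isinstance(pattern, str):
        if pattern.startswith("$"):
            binds[pattern[1:]] = node
            return binds
        return binds if node.tag == pattern else None
    tag, attrs, kids = pattern
    if node.tag != tag:
        return None
    for key, want in attrs.items():
        if node.get(key) != want:
            return None
    children = list(node)
    if len(children) != len(kids):
        return None
    for sub_pattern, child in zip(kids, children):
        if match(sub_pattern, child, binds) is None:
            return None
    return binds


# Pattern for "anything + 0", capturing the left operand as "lhs".
ADD_ZERO = ("binary", {"op": "+"}, ["$lhs", ("int", {"value": "0"}, [])])

node = ET.fromstring('<binary op="+"><ref name="x"/><int value="0"/></binary>')
hit = match(ADD_ZERO, node)
```

note that a successful match with no captures returns an empty dict, so callers should test the result against None rather than for truthiness.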

so, alas, I don't know...
