From: BGB / cr88192 on 9 Aug 2010 18:45

well, here is basically a status update:

I ended up developing a "Generalized AST" based system, where the idea is that a lot of language-specific stuff is stripped out of the ASTs, and the new ASTs are intended to represent a more general and language-neutral form.

the ASTs are in turn represented as XML, and are sent between compiler components in the form of binary XML (a binary XML encoding vaguely similar to WBXML is in use).

unlike WBXML, it uses MRU lists for tag and attribute names, as well as literal strings. strings may also be sent inline, where they may be encoded with an LZ-Markov scheme (vaguely similar to LZ77, but matches are handled differently; this variant uses a 16kB dictionary).

like WBXML (and unlike EXI), it is byte-based (no entropy coding or similar is used).

the code for this binary XML format had been sitting around since 2005, but until recently was mostly unused.

textual XML is also possible, but the binary form takes less space when serialized (~ 20-25% the size of textual XML for ASTs), and also decodes much more quickly.

I ended up calling this the GAST-IR.

here is an incomplete/older AST spec:
http://cr88192.dyndns.org/SilvVMSpec/2010-05-23_AST.html
(GAST is mostly similar, but differs mostly WRT a few of the tags; also, this spec has some holes, ...).

as well as a (mostly accurate) spec for the binary XML:
http://cr88192.dyndns.org/SilvVMSpec/2005-08-31_SBXE.txt

also, more recently the codegen upper-end was modified to use GAST as input, in place of my prior RPNIL format. this could allow (eventually) no longer needing to use a stack-machine within the codegen. the codegen currently generates native x86 or x86-64 code, and currently only handles C's functionality (the goal is to implement the functionality needed to make Java and C# also work).
parsing both Java and C# code works at this point, but a little more work is needed in the compiler before these languages are likely to work (mostly related to things like OO facilities and exception handling).

an interface between GAST and my BGBScript interpreter was also implemented. the BGBScript interpreter is dynamically typed, where BS is a language influenced somewhat by JavaScript and ActionScript, and a goal is to try to get it conformant with the ES5 standard (eventually; most deviations are minor things like float precision being less than the required full double precision, or incomplete regex or date-object handling, ...).

this could potentially partially fake Java or C# (by kludging their syntax on top of JS-like semantics), but this would be tacky, low performance, and not be able to faithfully represent semantics (also, C would not likely work at all in this interpreter).

neither GAST-related interface has really been tested as of yet though.

it is possible that I could consider plans for a statically typed interpreter. it would retain the goal of having a relatively clean and transparent interface with native C, as I have since noted that one doesn't have to generate native code to interface with native code...

the interpreter would likely aim to support C, Java, and C#, much like the main codegen, but could possibly also handle BGBScript (note: the interpreter would likely support dynamic types, as well as statically-typed operation).

this interpreter would likely also be targeted via GAST IR, and is likely to use a similar design to the BGBScript interpreter (where the code for compiling to the bytecode, and for interpreting it, are tightly integrated). IME, this seems to work out much better than trying to have any sort of standardized bytecode.

a longer-range goal could be considering the possibility of using LLVM as a target, but it is likely to require a fair amount of effort to produce code in LLVM's IR.
I could put source up on my server if anyone wants to look, but at the moment this is doubtful. I don't know, any thoughts or comments?...
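
As an illustration of the MRU-list idea mentioned in the post above, here is a minimal Python sketch of move-to-front coding for tag and attribute names. It is not taken from the SBXE spec: the token shapes (`"lit"`, `"ref"`), the `MRUList` helper, and the default capacity of 64 are all assumptions made for the example. A name seen for the first time is sent literally; a repeated name is sent as a small index into the list, which is why repeated AST tags get cheap.

```python
class MRUList:
    """Fixed-size most-recently-used list of strings (hypothetical helper)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.items = []

    def touch(self, index):
        # Move an existing entry to the front and return it.
        item = self.items.pop(index)
        self.items.insert(0, item)
        return item

    def add(self, item):
        # Insert a new entry at the front, dropping the oldest if full.
        self.items.insert(0, item)
        del self.items[self.capacity:]


def encode_names(names, capacity=64):
    """Turn a sequence of names into ("lit", name) / ("ref", index) tokens."""
    mru = MRUList(capacity)
    tokens = []
    for name in names:
        if name in mru.items:
            index = mru.items.index(name)
            mru.touch(index)
            tokens.append(("ref", index))
        else:
            mru.add(name)
            tokens.append(("lit", name))
    return tokens


def decode_names(tokens, capacity=64):
    """Rebuild the names by replaying the same MRU updates as the encoder."""
    mru = MRUList(capacity)
    names = []
    for kind, value in tokens:
        if kind == "ref":
            names.append(mru.touch(value))
        else:
            mru.add(value)
            names.append(value)
    return names


if __name__ == "__main__":
    tags = ["if", "binary", "int", "int", "binary", "if"]
    tokens = encode_names(tags)
    print(tokens)
    print(decode_names(tokens) == tags)
```

The decoder never needs the list to be transmitted: it stays in sync because it applies exactly the same list updates as the encoder. In a byte-based format, a `ref` would presumably fit in a single byte, but that detail is a guess here.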
From: Ira Baxter on 13 Aug 2010 12:01

"BGB / cr88192" <cr88192(a)hotmail.com> wrote in message news:i3q0ee$ork$1(a)news.albasani.net...
> well, here is basically a status update:
> I ended up developing a "Generalized AST" based system, where the idea is
> that a lot of language-specific stuff is stripped out of the ASTs, and the
> new ASTs are intended to represent a more general and language-neutral
> form.
>
>
> the ASTs are in turn represented as XML, and are sent between compiler
> components in the form of binary XML (a binary XML encoding vaguely
> similar to WBXML is in use).

Are you familiar with OMG's Abstract Syntax Tree Model?
http://www.omg.org/spec/ASTM/Current/

The exchange format is XMI. I haven't looked, but surely the OMG knows about "binary XML" and it wouldn't surprise me if ASTM models can now be exchanged as binary XMI.

--
IDB
From: BGB / cr88192 on 13 Aug 2010 17:17

"Ira Baxter" <idbaxter(a)semdesigns.com> wrote in message news:6MydnTyeUp3A8fjRnZ2dnUVZ_t6dnZ2d(a)giganews.com...
>
> "BGB / cr88192" <cr88192(a)hotmail.com> wrote in message
> news:<i3q0ee$ork$1(a)news.albasani.net>...
>> well, here is basically a status update:
>> I ended up developing a "Generalized AST" based system, where the idea is
>> that a lot of language-specific stuff is stripped out of the ASTs, and
>> the new ASTs are intended to represent a more general and
>> language-neutral form.
>>
>>
>> the ASTs are in turn represented as XML, and are sent between compiler
>> components in the form of binary XML (a binary XML encoding vaguely
>> similar to WBXML is in use).
>
>
> Are you familiar with OMG's Abstract Syntax Tree Model?
> http://www.omg.org/spec/ASTM/Current/
>
> The exchange format is XMI. I haven't looked, but surely the OMG knows
> about "binary XML" and it wouldn't surprise me if ASTM models can now be
> exchanged as binary XMI.
>

quick skim: the spec has too many words, and so it would take a bit more than the amount of skimming I have done to make all that much sense of it.

however, they appear to be addressing a VERY different set of problems.

my stuff is much more low-level, and is concerned mostly with structural aspects of programming languages, and getting them compiled, and how they are represented (directly) in XML.

abstract semantics and metamodeling are not what I am dealing with here...

the closest I do is a bunch of ad-hoc kludging as needed to interface BS and C (mostly figuring out how to coerce one piece of data into another piece of data). C and Java are interfaced mostly at the ABI level, rather than at the semantic level (there is little concern for how ADTs cross the border, as the concern is more for moving concrete-data types).

the spec makes no mention that I could see of binary XML or binary XML serializations (or, for that matter, even much real mention of XML).
rather, it seems to describe things in terms of a (very much not XML) abstract model (BNF-like, with it being rather unclear what the actual representation is of these BNF-like forms, or for that matter, if whatever representation is used internally uses the same naming).

my concern is more like:
these syntax constructs map onto these XML expressions;
XML is mapped into binary data blobs in this or that way;
....

so, an example of the latter would be like:
XML is converted into textual XML, and run through deflate or gzip;
XML is converted into a binary blob via WBXML or EXI;
XML is converted into a binary blob via some custom format;
....

WBXML+gzip is also possible...

so, the goal is to make it obvious what the representation for the data is, ....

for example, within my compilers, the representation and interface used is along similar lines to DOM. most internal transforms are done on the XML.

currently, I don't support XSLT or XPath, but I had recently imagined the possibility of designing something "similar" to XSLT for representing compiler-related internal transforms, although this leaves a whole lot of things (like doing register allocation, ...) which would not likely map so nicely to XML, and so a fair amount of C code would likely still be needed for this, partly undermining the whole system.

similarly, directly using an XSLT-like system within a compiler would probably be slow-as-hell...

(and, just as easily, I could try to figure out designs for C-level APIs to try to reduce the amount of jerking-off currently needed for matching patterns and manipulating XML nodes...).

so, alas, I don't know...
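
To make the "internal transforms are done on the XML" point concrete, here is a rough Python sketch of a DOM-style transform: constant folding over an XML-encoded expression tree, with a small `matches` helper of the kind that could cut down on pattern-matching boilerplate. The tag and attribute names (`binary`, `int`, `op`, `value`) are hypothetical and are not claimed to match the actual GAST tags, and the post's own compiler is written in C rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag and attribute names; the real GAST tags may differ.
SOURCE = (
    '<return><binary op="+"><int value="2"/>'
    '<binary op="*"><int value="3"/><int value="4"/></binary>'
    '</binary></return>'
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def matches(node, tag, **attrs):
    """Return True if node has the given tag and all given attribute values."""
    return node.tag == tag and all(node.get(k) == v for k, v in attrs.items())


def is_int_literal(node):
    """Return True for an <int> node that carries a value attribute."""
    return matches(node, "int") and node.get("value") is not None


def fold(node):
    """Return a new tree with constant integer expressions folded.

    Children are folded first (bottom-up), so nested constant
    expressions collapse in a single pass. Text and tail content
    are ignored, since this toy AST only uses tags and attributes.
    """
    children = [fold(child) for child in node]
    if (matches(node, "binary") and node.get("op") in OPS
            and len(children) == 2
            and all(is_int_literal(c) for c in children)):
        left, right = (int(c.get("value")) for c in children)
        value = OPS[node.get("op")](left, right)
        return ET.Element("int", {"value": str(value)})
    result = ET.Element(node.tag, dict(node.attrib))
    result.extend(children)
    return result


folded = fold(ET.fromstring(SOURCE))

if __name__ == "__main__":
    print(ET.tostring(folded, encoding="unicode"))
```

Here `2 + 3 * 4` collapses to a single `int` node with value 14 under the `return` node. A transform like this stays readable in a tree-rewriting style, whereas something like register allocation depends on state outside the tree, which is the part the post expects would still need ordinary C code.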