From: Michael Tharp on 2 Apr 2010 15:53

Most Esteemed Hackers:

Due to popular demand on #postgresql (by which I mean David Fetter), I have been spending a little time making the internal SQL parser available to clients via a C-language SQL function. The function itself is extremely simple: just a wrapper around a call to raw_parser followed by nodeToString. Most of the "hard stuff" has been in parsing the output of nodeToString on the client side.

So, I have a few questions to help gauge interest in related patches:

Is there interest in a patch to extend nodes/outfuncs.c with support for serializing more node types? Coverage has been pretty good so far but various utility statements and their related nodes are missing, e.g. AlterTableStmt and GrantStmt. I expect that this will be the least contentious suggestion.

The nodeToString format as it stands is somewhat ambiguous with respect to the type of a node member's value if one does not have access to readfuncs.c. For example, a T_BitString called foo is serialized as ':foo b1010' while a char * containing 'b1010' is also serialized as ':foo b1010'. This may just mean that _outToken needs to escape the leading 'b'. A similar problem exists for booleans ('true' as a string vs. as a boolean).

Additionally, values may span more than one token for certain types e.g. Datum (":constvalue 4 [ 16 0 0 0 ]"). Plan trees have a few types that don't have a corresponding read function and output an array of space-separated integers. PlanInvalItem even seems to use a format containing parentheses, which the tokenizer splits as if it were a list. While most of these only occur in plan nodes and thus don't affect my use case (Datum being the exception), it would be ideal if they could be parsed more straightforwardly.

These last two problems perhaps can be worked around by escaping more things in _outToken, but maybe it would be smarter to make the fields self-descriptive in terms of type.
For example, each field name could be prefixed with a short string describing the field's type, which in most cases would be a single character, e.g. 's:schemaname' for a char*, 'b:true' for a bool, 'n:...' for any node (including Value nodes), or longer strings for less commonly used types like the integer arrays in plan nodes (although these would probably be better as a real integer list). These prefixes could be used to unambiguously parse individual tokens and also to determine how many or what kind of token to expect for multi-token values such as Datum, which would otherwise require guessing.

Does this seem reasonable? Is there another format that might make more sense?

As far as I can tell, the current parser in nodes/read.c ignores the field names entirely, so this can be done without changing postgres' own parsing code at all and without affecting backwards compatibility of any stored trees.

Does anyone else out there use nodeToString() output in their own tools, and if so, does this make your life easier or harder?

Lastly, I'll leave a link to my WIP implementation in case anyone is interested: http://bitbucket.org/gxti/parse_sql/src/

Currently I'm working on adding support for cooked parse trees and figuring out what, if anything, I need to do to support multibyte encodings. My personal use is for parsing DDL so the input is decidedly not hostile but I'd still like to make this a generally useful module.

Thanks in advance for any comments, tips, or flames sent my way.

-- m. tharp

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
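
As a rough illustration of the type-prefix idea in the message above, here is a minimal client-side sketch of how a prefixed field label could remove the 'b1010' and 'true' ambiguities. It is not taken from the parse_sql module or from PostgreSQL. The ':<prefix>:<name>' label layout and the function name parse_typed_field are assumptions made for this example. The 's' and 'b' prefixes follow the message, while 'i' (integer) and 'B' (bit string) are invented here.

```python
# Hypothetical sketch of decoding type-prefixed nodeToString fields.
# Prefixes 's' (char *) and 'b' (bool) follow the proposal in the message;
# 'i' (integer) and 'B' (bit string) are made up for this illustration.

def parse_typed_field(label, token):
    """Decode one ':<prefix>:<name> <token>' pair into (name, value)."""
    # ':s:foo' -> prefix 's', name 'foo'
    prefix, name = label.lstrip(":").split(":", 1)
    if prefix == "s":
        # Plain string: the token is taken literally, even if it looks
        # like a bit string or a boolean.
        return name, token
    if prefix == "b":
        return name, token == "true"
    if prefix == "i":
        return name, int(token)
    if prefix == "B":
        # Bit string written as 'b1010': drop the leading 'b', parse base 2.
        return name, int(token[1:], 2)
    raise ValueError("unknown type prefix: %r" % prefix)


# The same token decodes differently depending on the declared type.
as_string = parse_typed_field(":s:foo", "b1010")
as_bits = parse_typed_field(":B:foo", "b1010")
flag = parse_typed_field(":b:flag", "true")
```

With the prefix present, the reader no longer needs readfuncs.c to decide whether 'b1010' is a string or a bit string. A multi-token value such as Datum would need its own prefix plus a rule for how many tokens follow, which this sketch leaves out.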