From: Michael Tharp on
Most Esteemed Hackers:

Due to popular demand on #postgresql (by which I mean David Fetter), I
have been spending a little time making the internal SQL parser
available to clients via a C-language SQL function. The function itself
is extremely simple: just a wrapper around a call to raw_parser followed
by nodeToString. Most of the "hard stuff" has been in parsing the output
of nodeToString on the client side. So, I have a few questions to help
gauge interest in related patches:

Is there interest in a patch to extend nodes/outfuncs.c with support for
serializing more node types? Coverage has been pretty good so far but
various utility statements and their related nodes are missing, e.g.
AlterTableStmt and GrantStmt. I expect that this will be the least
contentious suggestion.

The nodeToString format as it stands is somewhat ambiguous with respect
to the type of a node member's value if one does not have access to
readfuncs.c. For example, a T_BitString called foo is serialized as
':foo b1010' while a char * containing 'b1010' is also serialized as
':foo b1010'. This may just mean that _outToken needs to escape the
leading 'b'. A similar problem exists for booleans ('true' as a string
vs. as a boolean).

Additionally, values may span more than one token for certain types e.g.
Datum (":constvalue 4 [ 16 0 0 0 ]"). Plan trees have a few types that
don't have a corresponding read function and output an array of
space-separated integers. PlanInvalItem even seems to use a format
containing parentheses, which the tokenizer splits as if it were a list.
While most of these only occur in plan nodes and thus don't affect my
use case (Datum being the exception), it would be ideal if they could be
parsed more straightforwardly.

These last two problems perhaps can be worked around by escaping more
things in _outToken, but maybe it would be smarter to make the fields
self-descriptive in terms of type. For example, the field names could be
prefixed with a short string describing its type, which in most cases
would be a single character, e.g. 's:schemaname' for a char*, 'b:true'
for a bool, 'n:...' for any node (including Value nodes), or longer
strings for less commonly used types like the integer arrays in plan
nodes (although these would probably be better as a real integer list).
These could be used to unambiguously parse individual tokens and also to
determine how many or what kind of token to expect for multi-token
values such as Datum which would otherwise require guessing. Does this
seem reasonable? Is there another format that might make more sense?

As far as I can tell, the current parser in nodes/read.c ignores the
field names entirely, so this can be done without changing postgres' own
parsing code at all and without affecting backwards compatibility of any
stored trees. Does anyone else out there use nodeToString() output in
their own tools, and if so, does this make your life easier or harder?

Lastly, I'll leave a link to my WIP implementation in case anyone is
interested:
http://bitbucket.org/gxti/parse_sql/src/
Currently I'm working on adding support for cooked parse trees and
figuring out what, if anything, I need to do to support multibyte
encodings. My personal use is for parsing DDL so the input is decidedly
not hostile but I'd still like to make this a generally useful module.

Thanks in advance for any comments, tips, or flames sent my way.

-- m. tharp

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers