From: Pavel Stehule on 21 Apr 2010 14:01

2010/4/21 Jehan-Guillaume (ioguix) de Rorthais <ioguix(a)free.fr>:
> On 04/04/2010 18:10, David Fetter wrote:
>> On Sat, Apr 03, 2010 at 03:17:30PM +0200, Markus Schiltknecht wrote:
>>> Hi,
>>>
>>> Michael Tharp wrote:
>>>> I have been spending a little time making the internal SQL parser
>>>> available to clients via a C-language SQL function.
>>>
>>> This sounds very much like one of the Cluster Features:
>>> http://wiki.postgresql.org/wiki/ClusterFeatures#API_into_the_Parser_.2F_Parser_as_an_independent_module
>>>
>>> Is this what you (or David) have in mind?
>>
>> I'm not a fan of statement-based replication of any description. The
>> use cases I have in mind involve things like known-correct syntax
>> highlighting in text editors.
>
> The point here is not to expose the internal data structure, but to
> deliver a tokenized version of a given SQL script.
>
> There are actually many different use cases for external projects:
>  - syntax highlighting
>  - rewriting queries with proper indentation
>  - replication
>  - properly splitting queries out of a script
>  - determining the type of a query (SELECT? UPDATE/DELETE? DDL?)
>  - checking the validity of a query before sending it
>  - ...
>
> In addition to PgPool's needs, I can see 3 or 4 direct use cases for
> pgAdmin and phpPgAdmin.
>
> So it seems to me that having the parser code in a shared library would
> be very useful for external C projects, which could link to it. However,
> it would be useless for non-C projects which can't use it directly but
> are connected to a PostgreSQL backend anyway (phpPgAdmin, for instance).
>
> What about a new SQL command like TOKENIZE? It would act much like
> EXPLAIN, but return a tokenized version of the given SQL script. As with
> EXPLAIN, it could speak XML, YAML, JSON, you name it...
> Each token could have:
>  - a type ('identifier', 'string', 'sql command', 'sql keyword',
>    'variable'...)
>  - the start position in the string
>  - the value
>  - the line number
>  - ...
>
> A simple example of a tokenizer is the PHP one:
>  http://fr.php.net/token_get_all
>
> And here is a basic example which returns pseudo-rows:
>
> => TOKENIZE $script$
>   SELECT 1;
>   UPDATE test SET "a"=2;
>  $script$;

You don't need a special command for this task; a function is enough. A
new SQL command would be useless:

http://www.pgsql.cz/index.php/Oracle_functionality_%28en%29#PLVlex

It could be done very simply with the recent changes in the parser.

Regards
Pavel Stehule

>      type    | pos |  value   | line
> -------------+-----+----------+------
>  SQL_COMMAND |   1 | 'SELECT' |    1
>  CONSTANT    |   8 | '1'      |    1
>  DELIMITER   |   9 | ';'      |    1
>  SQL_COMMAND |  11 | 'UPDATE' |    2
>  IDENTIFIER  |  18 | 'test'   |    2
>  SQL_KEYWORD |  23 | 'SET'    |    2
>  IDENTIFIER  |  27 | '"a"'    |    2
>  OPERATOR    |  30 | '='      |    2
>  CONSTANT    |  31 | '2'      |    2
>
>> Cheers,
>> David.
>
> As a phpPgAdmin dev, I have been thinking about this subject for a long
> time. I would be interested in trying to create such a patch after
> discussing it, if you think it is doable.
>
> --
> JGuillaume (ioguix) de Rorthais
> http://www.dalibo.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
From: Robert Haas on 24 Apr 2010 20:41

On Sat, Apr 24, 2010 at 8:07 PM, Bruce Momjian <bruce(a)momjian.us> wrote:
> Jehan-Guillaume (ioguix) de Rorthais wrote:
>> A simple example of a tokenizer is the PHP one:
>> http://fr.php.net/token_get_all
>>
>> And here is a basic example which returns pseudo-rows:
>>
>> => TOKENIZE $script$
>>   SELECT 1;
>>   UPDATE test SET "a"=2;
>> $script$;
>>
>>     type     | pos |  value   | line
>> -------------+-----+----------+------
>>  SQL_COMMAND |   1 | 'SELECT' |    1
>>  CONSTANT    |   8 | '1'      |    1
>>  DELIMITER   |   9 | ';'      |    1
>>  SQL_COMMAND |  11 | 'UPDATE' |    2
>>  IDENTIFIER  |  18 | 'test'   |    2
>>  SQL_KEYWORD |  23 | 'SET'    |    2
>>  IDENTIFIER  |  27 | '"a"'    |    2
>>  OPERATOR    |  30 | '='      |    2
>>  CONSTANT    |  31 | '2'      |    2
>
> Sounds useful to me, though as a function, as suggested in a later
> email.

If tool-builders think this is useful, I have no problem with making it
available. It should be suitably disclaimed: "We reserve the right to rip
out the entire flex/yacc-based lexer and parser at any time and replace
them with a hand-coded system written in Prolog that emits tokenization
information only in ASN.1-encoded pig latin. If massive changes in the way
this function works - or its complete disappearance - are going to make
you grumpy, don't call it."

But having said that, assuming there is a real use case for this, I think
it's better to let people get at it than to force them to roll their own.
Frankly, if we do rip out the whole thing, people are going to have to
adjust their stuff anyway, regardless of whether they're using some API we
provide or something they've cooked up from scratch. And in practice, most
changes on our end are likely to be incremental - though, again, we're not
guaranteeing that in any way.

...Robert
From: Robert Haas on 24 Apr 2010 20:49

On Fri, Apr 2, 2010 at 3:53 PM, Michael Tharp <gxti(a)partiallystapled.com> wrote:
> Most Esteemed Hackers:
>
> Due to popular demand on #postgresql (by which I mean David Fetter), I
> have been spending a little time making the internal SQL parser
> available to clients via a C-language SQL function. The function itself
> is extremely simple: just a wrapper around a call to raw_parser followed
> by nodeToString.

Seems reasonable.

> Most of the "hard stuff" has been in parsing the output of nodeToString
> on the client side. So, I have a few questions to help gauge interest in
> related patches:
>
> Is there interest in a patch to extend nodes/outfuncs.c with support for
> serializing more node types? Coverage has been pretty good so far, but
> various utility statements and their related nodes are missing, e.g.
> AlterTableStmt and GrantStmt. I expect that this will be the least
> contentious suggestion.

This wouldn't bother me provided the code footprint is small. I would be
against adding a lot of complexity for this.

> The nodeToString format as it stands is somewhat ambiguous with respect
> to the type of a node member's value if one does not have access to
> readfuncs.c. For example, a T_BitString called foo is serialized as
> ':foo b1010', while a char * containing 'b1010' is also serialized as
> ':foo b1010'. This may just mean that _outToken needs to escape the
> leading 'b'. A similar problem exists for booleans ('true' as a string
> vs. as a boolean).

I am not inclined to change this. Turning the format into something
self-describing seems to me to be significant work and a significant
compatibility break for a very small amount of benefit.

> Additionally, values may span more than one token for certain types,
> e.g. Datum (":constvalue 4 [ 16 0 0 0 ]"). Plan trees have a few types
> that don't have a corresponding read function and output an array of
> space-separated integers. PlanInvalItem even seems to use a format
> containing parentheses, which the tokenizer splits as if it were a list.
> While most of these only occur in plan nodes and thus don't affect my
> use case (Datum being the exception), it would be ideal if they could be
> parsed more straightforwardly.

I'm not inclined to change this, either.

> These last two problems could perhaps be worked around by escaping more
> things in _outToken, but maybe it would be smarter to make the fields
> self-descriptive in terms of type. For example, the field names could be
> prefixed with a short string describing the type, which in most cases
> would be a single character, e.g. 's:schemaname' for a char*, 'b:true'
> for a bool, or 'n:...' for any node (including Value nodes), with longer
> strings for less commonly used types like the integer arrays in plan
> nodes (although those would probably be better as a real integer list).
> These prefixes could be used to parse individual tokens unambiguously,
> and also to determine how many tokens, and of what kind, to expect for
> multi-token values such as Datum, which would otherwise require
> guessing. Does this seem reasonable? Is there another format that might
> make more sense?

This seems ugly to me, and I don't see the utility of it.

> As far as I can tell, the current parser in nodes/read.c ignores the
> field names entirely, so this could be done without changing postgres'
> own parsing code at all and without affecting backward compatibility of
> any stored trees. Does anyone else out there use nodeToString() output
> in their own tools, and if so, does this make your life easier or
> harder?
>
> Lastly, I'll leave a link to my WIP implementation in case anyone is
> interested:
> http://bitbucket.org/gxti/parse_sql/src/
> Currently I'm working on adding support for cooked parse trees and
> figuring out what, if anything, I need to do to support multibyte
> encodings. My personal use is for parsing DDL, so the input is decidedly
> not hostile, but I'd still like to make this a generally useful module.
>
> Thanks in advance for any comments, tips, or flames sent my way.

Thanks for having a thick skin. :-)

I'm having a hard time imagining what you could use this for without
encoding a lot of information about the meaning of particular constructs,
in which case the self-describing stuff is not needed. As you point out
downthread, if all you want to do is compare, it's not needed either.

...Robert
From: Tom Lane on 24 Apr 2010 21:02

Robert Haas <robertmhaas(a)gmail.com> writes:
> On Sat, Apr 24, 2010 at 8:07 PM, Bruce Momjian <bruce(a)momjian.us> wrote:
>> Sounds useful to me, though as a function, as suggested in a later
>> email.
>
> If tool-builders think this is useful, I have no problem with making
> it available. It should be suitably disclaimed: "We reserve the right
> to rip out the entire flex/yacc-based lexer and parser at any time and
> replace them with a hand-coded system written in Prolog that emits
> tokenization information only in ASN.1-encoded pig latin. If massive
> changes in the way this function works - or its complete disappearance
> - are going to make you grumpy, don't call it."

I'm a bit concerned about the vagueness of the goals here. We started
with a request to dump out node trees, i.e., the post-parsing
representation; but the example use case of syntax highlighting would
find that representation quite useless. (Example: foo::bar and
CAST(foo AS bar) yield the same parse tree.) A syntax highlighter might
get some use out of the lexer-output token stream, but I'm afraid from
the proposed output that people may be expecting more semantic
information than the lexer can provide. The lexer doesn't, for example,
have any clue that some keywords are commands and others aren't, nor any
very clear understanding of the semantic difference between the tokens
'=' and ';'.

Also, if all you want is the lexer, it's not that hard to steal psql's
version and adapt it to your purposes. The lexer doesn't change very
fast, and it's not that big either.

Anyway, it certainly wouldn't be hard for an add-on module to provide an
SRF that calls the lexer (or parser) and returns some sort of tabular
representation of the results. I'm just not sure how useful it would be
in the real world.

			regards, tom lane
From: Michael Tharp on 24 Apr 2010 21:08

On 04/24/2010 08:49 PM, Robert Haas wrote:
>> The nodeToString format as it stands is somewhat ambiguous with respect
>> to the type of a node member's value if one does not have access to
>> readfuncs.c. For example, a T_BitString called foo is serialized as
>> ':foo b1010', while a char * containing 'b1010' is also serialized as
>> ':foo b1010'. This may just mean that _outToken needs to escape the
>> leading 'b'. A similar problem exists for booleans ('true' as a string
>> vs. as a boolean).
>
> I am not inclined to change this. Turning the format into something
> self-describing seems to me to be significant work and a significant
> compatibility break for a very small amount of benefit.

The funny thing is, it doesn't seem to be a compatibility break, because
the code in readfuncs.c that parses the node strings ignores the field
names entirely: it assumes they appear in a particular order. It also
isn't much work to change the output, because the code is - with the
exception of a few weirdos - all at the top of outfuncs.c, and the
weirdos are dispersed within that same file.

However, I'm no longer convinced that using a serialized node tree is the
way to go for my use case, nor am I particularly sure it even matches my
use case at all anymore, as I keep simplifying the goals as time goes on.
I won't be able to make any compelling arguments until I figure out what
I need :-)

Thanks for the feedback.

-- m. tharp