From: Dimitri Fontaine on
Tom Lane <tgl(a)sss.pgh.pa.us> writes:
> FWIW, the core xml code seems to have been pretty stable since we gave
> up on trying to redirect libxml's memory allocations to palloc.
> So what you basically need to do to xpath.c is something like this:
> http://archives.postgresql.org/pgsql-committers/2009-05/msg00229.php

Which translates to this git URL, for lazy browsers such as me:

http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=f8059d7f8ae8bfac840fbda3c8efbc0f7c09b123
--
dim

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
I believe I have fixed all the reported crashes in contrib/xml2.
However there is still this issue pointed out by Robert:

> CREATE TABLE xpath_test (id integer NOT NULL, t xml);
> INSERT INTO xpath_test VALUES (1, '<rowlist><row a="1"/><row a="2" b="oops"/></rowlist>');
> SELECT * FROM xpath_table('id', 't', 'xpath_test',
> '/rowlist/row/@a|/rowlist/row/@b', 'true') as t(id int4, a text, b text);

> which yields an answer that is, at least, extremely surprising, if not
> flat-out wrong:

> id | a | b
> ----+---+------
> 1 | 1 | oops
> 1 | 2 |
> (2 rows)

the point being that it seems like "oops" should be associated with "2"
not "1". The reason for that behavior is that xpath_table runs through
the XPATH_NODESET results generated by the various XPaths and dumps the
k'th one of each into the k'th output row generated for the current
input row. If there is any way to synchronize which node in each array
goes with each node in each other array, it's not apparent to me, but
I don't know libxml's API at all. Perhaps there is some other call we
should be using to evaluate all the XPaths in parallel?

(The code is also unbelievably inefficient, recompiling each XPath
expression once per output row (!); but it doesn't seem worth fixing
that right away given that we might have to throw away the logic
entirely in order to fix this bug.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andrew Dunstan on


Tom Lane wrote:
> I believe I have fixed all the reported crashes in contrib/xml2.
>

Yay! Well done! That at least removes any possibly urgency about
removing the module.

> However there is still this issue pointed out by Robert:
>
>
>> CREATE TABLE xpath_test (id integer NOT NULL, t xml);
>> INSERT INTO xpath_test VALUES (1, '<rowlist><row a="1"/><row a="2" b="oops"/></rowlist>');
>> SELECT * FROM xpath_table('id', 't', 'xpath_test',
>> '/rowlist/row/@a|/rowlist/row/@b', 'true') as t(id int4, a text, b text);
>>
>
>
>> which yields an answer that is, at least, extremely surprising, if not
>> flat-out wrong:
>>
>
>
>> id | a | b
>> ----+---+------
>> 1 | 1 | oops
>> 1 | 2 |
>> (2 rows)
>>
>
> the point being that it seems like "oops" should be associated with "2"
> not "1". The reason for that behavior is that xpath_table runs through
> the XPATH_NODESET results generated by the various XPaths and dumps the
> k'th one of each into the k'th output row generated for the current
> input row. If there is any way to synchronize which node in each array
> goes with each node in each other array, it's not apparent to me, but
> I don't know libxml's API at all. Perhaps there is some other call we
> should be using to evaluate all the XPaths in parallel?
>
> (The code is also unbelievably inefficient, recompiling each XPath
> expression once per output row (!); but it doesn't seem worth fixing
> that right away given that we might have to throw away the logic
> entirely in order to fix this bug.)
>
>
>

Damn that's ugly.


ISTM the missing piece is really in our API. We need to be able to
specify a nodeset to iterate over, and then for each node take the first
value produced by each xpath expression. So the example above would look
something like:

SELECT * FROM xpath_table('id', 't', 'xpath_test',
'/rowlist/row', '@a|@b', 'true') as t(id int4, a text, b text);


Maybe we could approximate that with the current API by factoring out
the common root of the xpath expressions, but that's likely to be
extremely fragile and error prone, and we've already got bad experience
of trying to be too cute with xpath expressions.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
Andrew Dunstan <andrew(a)dunslane.net> writes:
> Tom Lane wrote:
>> ... The reason for that behavior is that xpath_table runs through
>> the XPATH_NODESET results generated by the various XPaths and dumps the
>> k'th one of each into the k'th output row generated for the current
>> input row.

> Damn that's ugly.

Yup :-(

> ISTM the missing piece is really in our API. We need to be able to
> specify a nodeset to iterate over, and then for each node take the first
> value produced by each xpath expression. So the example above would look
> something like:

> SELECT * FROM xpath_table('id', 't', 'xpath_test',
> '/rowlist/row', '@a|@b', 'true') as t(id int4, a text, b text);

Hm. It seems like that still leaves you open to the possibility of
out-of-sync results. If you consider the current behavior as what
you'd get with an empty root nodeset spec, then restricting it to
produce only the first output row doesn't help at all -- it would still
associate "1" with "oops". In general if the nodeset spec doesn't
select a unique subnode then you're at risk of bogus answers.
Maybe that could be defined as user error but it sure seems like it
would be error-prone to use.

> Maybe we could approximate that with the current API by factoring out
> the common root of the xpath expressions, but that's likely to be
> extremely fragile and error prone, and we've already got bad experience
> of trying to be too cute with xpath expressions.

Agreed, we do not want to be doing textual manipulations of XPaths,
which is what burnt us before. But does libxml2 offer any more
abstract path representation we could work on?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andrew Dunstan on


Tom Lane wrote:
> Andrew Dunstan <andrew(a)dunslane.net> writes:
>
>> Tom Lane wrote:
>>
>>> ... The reason for that behavior is that xpath_table runs through
>>> the XPATH_NODESET results generated by the various XPaths and dumps the
>>> k'th one of each into the k'th output row generated for the current
>>> input row.
>>>
>
>
>> ISTM the missing piece is really in our API. We need to be able to
>> specify a nodeset to iterate over, and then for each node take the first
>> value produced by each xpath expression. So the example above would look
>> something like:
>>
>
>
>> SELECT * FROM xpath_table('id', 't', 'xpath_test',
>> '/rowlist/row', '@a|@b', 'true') as t(id int4, a text, b text);
>>
>
> Hm. It seems like that still leaves you open to the possibility of
> out-of-sync results. If you consider the current behavior as what
> you'd get with an empty root nodeset spec, then restricting it to
> produce only the first output row doesn't help at all -- it would still
> associate "1" with "oops". In general if the nodeset spec doesn't
> select a unique subnode then you're at risk of bogus answers.
> Maybe that could be defined as user error but it sure seems like it
> would be error-prone to use.
>
>

Well, I think that's going to be hard or impossible to avoid in the
general case. My suggestion was intended to give the user a much better
chance of avoiding it, however.

Arbitrary XML (or JSON or YAML or any artbitrarilly tree structured data
markup) doesn't map well to a rectangular structure, and this is always
likely to cause problems like this IMO.

I guess in the end the user could preprocess the XML with XSLT to remove
the irregularities before passing to to xpath_table.

We certainly need to put some warnings in the docs about it.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers