From: David Fetter on
On Mon, Jun 28, 2010 at 08:08:53AM -0700, Mike Berrow wrote:
> We need to make extensive use of the 'xml_is_well_formed' function provided
> by the XML2 module.
>
> Yet the documentation says that the xml2 module will be deprecated since
> "XML syntax checking and XPath queries"
> is covered by the XML-related functionality based on the SQL/XML standard in
> the core server from PostgreSQL 8.3 onwards.
>
> However, the core function XMLPARSE does not provide equivalent
> functionality since when it detects an invalid XML document,
> it throws an error rather than returning a truth value (which is what we
> need and currently have with the 'xml_is_well_formed' function).
>
> For example:
>
> select xml_is_well_formed('<br></br2>');
> xml_is_well_formed
> --------------------
> f
> (1 row)
>
> select XMLPARSE( DOCUMENT '<br></br2>' );
> ERROR: invalid XML document
> DETAIL: Entity: line 1: parser error : expected '>'
> <br></br2>
> ^
> Entity: line 1: parser error : Extra content at the end of the document
> <br></br2>
> ^
>
> Is there some way to use the new, core XML functionality to simply
> return a truth value in the way that we need?

Here's a PL/pgSQL wrapper for it. You could create a similar wrapper
for other commands.

CREATE OR REPLACE FUNCTION xml_is_well_formed(in_putative_xml TEXT)
RETURNS BOOLEAN
LANGUAGE plpgsql
STRICT /* Keep this line if you want RETURNS NULL ON NULL INPUT behavior. */
AS $$
BEGIN
    /* XMLPARSE raises an error on malformed input; only success or failure matters here. */
    PERFORM XMLPARSE(DOCUMENT in_putative_xml);
    RETURN true;
EXCEPTION
    /* SQLSTATE 2200M: the input is not a well-formed XML document. */
    WHEN invalid_xml_document THEN
        RETURN false;
END;
$$;
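
As a quick check, and assuming the wrapper above has been created in the
current database, the cases from the original question can be exercised
like this. This is only a sketch; the well-formed and NULL inputs are
illustrative additions, not part of the original report.

```sql
-- Malformed input from the original question: the wrapper traps
-- invalid_xml_document and returns false instead of raising an error.
SELECT xml_is_well_formed('<br></br2>');

-- Well-formed input: XMLPARSE succeeds, so the wrapper returns true.
SELECT xml_is_well_formed('<br></br>');

-- NULL input: if the function is declared STRICT, the body is skipped
-- and the result is NULL.
SELECT xml_is_well_formed(NULL);
```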

While tracking this down, I didn't see a way to get SQLSTATE or the
corresponding condition name via psql. Is this an oversight? A bug,
perhaps?
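
One possible way to see the SQLSTATE, offered as a sketch rather than a
definitive answer: psql has a VERBOSITY variable, and setting it to
verbose makes error reports include the five-character SQLSTATE code,
which can then be looked up in the error codes appendix of the manual to
find the condition name. Inside PL/pgSQL, the SQLSTATE and SQLERRM
variables are available in an exception handler. The DO block below
assumes PostgreSQL 9.0 or later.

```sql
-- In psql: include the SQLSTATE code in subsequent error reports.
\set VERBOSITY verbose
SELECT XMLPARSE(DOCUMENT '<br></br2>');

-- In PL/pgSQL: trap any error and report its SQLSTATE and message.
DO $$
BEGIN
    PERFORM XMLPARSE(DOCUMENT '<br></br2>');
EXCEPTION
    WHEN OTHERS THEN
        RAISE NOTICE 'SQLSTATE=%, message=%', SQLSTATE, SQLERRM;
END;
$$;
```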

Cheers,
David.
--
David Fetter <david(a)fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter(a)gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics


--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Mike Fowler on
Quoting Mike Fowler <mike(a)mlfowler.com>:

> Should the IS DOCUMENT predicate support this? At the moment you get
> the following:
>
> template1=# SELECT
> '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>'
> IS DOCUMENT;
> ?column?
> ----------
> t
> (1 row)
>
> template1=# SELECT
> '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns'
> IS DOCUMENT;
> ERROR: invalid XML content
> LINE 1: SELECT '<towns><town>Bidford-on-Avon</town><town>Cwmbran</to...
> ^
> DETAIL: Entity: line 1: parser error : expected '>'
> owns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns
>
> ^
> Entity: line 1: parser error : chunk is not well balanced
> owns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns
>
> ^
> I would've hoped the second would've returned 'f' rather than failing.
> I've had a glance at the SQL/XML standard and I don't see anything in
> the detail of the predicate (8.2) that would specifically prohibit us
> from changing this behavior, unless the common rule 'Parsing a string
> as an XML value' (10.16) must always be in force. I'm no standard
> expert, but IMHO this would be an acceptable change to improve
> usability. What do others think?

Right, I've answered my own question whilst sitting in the open source
coding session at CHAR(10). Yes, IS DOCUMENT should return false for a
document that is not well formed, and indeed it is coded to do so. However,
the conversion to the xml type, which happens before the underlying
xml_is_document function is even called, fails and raises an error. I'll
work on a patch to resolve this behavior so that IS DOCUMENT will
give you the missing 'xml_is_well_formed' functionality.
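
The ordering described above can be seen with a shortened version of the
example (the shorter town list is only for illustration). XMLPARSE with
CONTENT is used here as a stand-in for the implicit text-to-xml
conversion, on the assumption that xmloption is at its default of CONTENT.

```sql
-- Step 1: converting the text to the xml type parses the string.
-- Malformed input fails here with "invalid XML content".
SELECT XMLPARSE(CONTENT '<towns><town>Bristol</town></towns');

-- Step 2: the predicate only ever sees values that survived step 1,
-- so for the same input this statement raises the same error and
-- IS DOCUMENT never gets the chance to return false.
SELECT XMLPARSE(CONTENT '<towns><town>Bristol</town></towns') IS DOCUMENT;
```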

Regards,

--
Mike Fowler
Registered Linux user: 379787



From: Mike Fowler on
Quoting Robert Haas <robertmhaas(a)gmail.com>:

>
> I think the point of "IS DOCUMENT" is to distinguish a document:
>
> <foo>some stuff<bar/><baz/></foo>
>
> from a document fragment:
>
> <bar/><baz/>
>
> A document is allowed only one top-level tag.
>
> It'd be nice, I think, to have a function that tells you whether
> something is legal XML without throwing an error if it isn't, but I
> suspect that should be a separate function, rather than trying to jam
> it into "IS DOCUMENT".
>
> http://developer.postgresql.org/pgdocs/postgres/functions-xml.html#AEN15187
>
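
The distinction drawn in the quoted text can be tried directly with the
two values it gives. In this sketch both strings are parsed as CONTENT
first, so that only the predicate decides the result.

```sql
-- One top-level element: a document, so the predicate is expected to be true.
SELECT XMLPARSE(CONTENT '<foo>some stuff<bar/><baz/></foo>') IS DOCUMENT;

-- Two top-level elements: well-formed content but not a document, so the
-- predicate is expected to be false rather than raising an error.
SELECT XMLPARSE(CONTENT '<bar/><baz/>') IS DOCUMENT;
```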

I've submitted a patch to the bug report I filed yesterday that
implements this. The way I read the standard (and I'm only reading a
draft and I'm no expert) I don't see that it mandates that IS DOCUMENT
returns false when IS CONTENT would return true. So if IS CONTENT were
to be implemented, to determine that you have something that is
malformed you could say:

val IS NOT DOCUMENT AND val IS NOT CONTENT

I think having the direct predicate support would be useful for
columns of text where you know that some, though possibly not all,
text values are valid XML.
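
For the mixed text column case, a boolean check can be used as a filter.
The sketch below assumes the PL/pgSQL xml_is_well_formed wrapper posted
earlier in this thread is installed; the table raw_notes and its rows are
hypothetical and exist only for illustration.

```sql
-- Hypothetical table holding text that may or may not be XML.
CREATE TEMP TABLE raw_notes (id integer, body text);

INSERT INTO raw_notes VALUES
    (1, '<a>ok</a>'),        -- well-formed document
    (2, '<a>broken</b>'),    -- mismatched closing tag
    (3, 'plain text');       -- not XML at all

-- Keep only rows whose body parses as an XML document. The malformed
-- rows are skipped instead of aborting the whole query, so only row 1
-- is expected to qualify.
SELECT id, body
FROM raw_notes
WHERE xml_is_well_formed(body);
```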

--
Mike Fowler
Registered Linux user: 379787



From: Peter Eisentraut on
On Fri, 2010-07-02 at 14:07 +0100, Mike Fowler wrote:
> So if IS CONTENT were
> to be implemented, to determine that you have something that is
> malformed

But that's not what IS CONTENT does. "Content" still needs to be
well-formed.



From: Mike Fowler on
Quoting Peter Eisentraut <peter_e(a)gmx.net>:

> On Fri, 2010-07-02 at 14:07 +0100, Mike Fowler wrote:
>> So if IS CONTENT were
>> to be implemented, to determine that you have something that is
>> malformed
>
> But that's not what IS CONTENT does. "Content" still needs to be
> well-formed.
>

What I was hoping to achieve was to determine that something was neither a
document nor content. However, as you pointed out on the bugs
thread, the value must already be XML. My mistake was not checking that I had
followed the definitions all the way back to the root. What I will do
instead is implement the xml_is_well_formed function and get a patch
out in the next day or two.

Thank you Robert and Peter for tolerating my stumbles through the standard.

Regards,

--
Mike Fowler
Registered Linux user: 379787

