intolerant HTML parser [Python]

From: Lawrence D'Oliveiro on 8 Feb 2010 05:19

In message <4b6fd672$0$6734$9b4e6d93(a)newsspool2.arcor-online.net>, Stefan
Behnel wrote:

> Jim, 06.02.2010 20:09:
>
>> I generate some HTML and I want to include in my unit tests a check
>> for syntax. So I am looking for a program that will complain at any
>> syntax irregularities.
>
> First thing to note here is that you should consider switching to an HTML
> generation tool that does this automatically.

I think that's what he's writing.

From: Stefan Behnel on 8 Feb 2010 05:36

Lawrence D'Oliveiro, 08.02.2010 11:19:
> In message <4b6fd672$0$6734$9b4e6d93(a)newsspool2.arcor-online.net>, Stefan
> Behnel wrote:
>
>> Jim, 06.02.2010 20:09:
>>
>>> I generate some HTML and I want to include in my unit tests a check
>>> for syntax. So I am looking for a program that will complain at any
>>> syntax irregularities.
>> First thing to note here is that you should consider switching to an HTML
>> generation tool that does this automatically.
>
> I think that's what he's writing.

I don't read it that way. There's a huge difference between

- generating HTML manually and validating (some of) it in a unit test

and

- generating HTML using a tool that guarantees correct HTML output

the advantage of the second approach being that others have already done
all the debugging for you.
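
To illustrate the second approach, here is a minimal sketch that builds a page as a tree and lets the serializer take care of escaping and tag closing. It uses the standard library's xml.etree.ElementTree as a stand-in for whatever generation tool one might actually pick; build_page and its arguments are hypothetical names that do not come from this thread.

```python
import xml.etree.ElementTree as ET


def build_page(title, paragraphs):
    """Build a tiny page as an element tree and serialize it.

    Hypothetical helper for illustration only. The serializer, not the
    caller, is responsible for escaping text and closing every tag.
    """
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(html, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(html, encoding="unicode")


page = build_page("Fish & Chips", ["1 < 2", "plain text"])
print(page)
```

Because the markup never exists as hand-assembled strings, a unit test for this kind of generator only has to check content, not syntax.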

Stefan

From: Phlip on 8 Feb 2010 12:12

Stefan Behnel wrote:

> I don't read it that way. There's a huge difference between
>
> - generating HTML manually and validating (some of) it in a unit test
>
> and
>
> - generating HTML using a tool that guarantees correct HTML output
>
> the advantage of the second approach being that others have already done
> all the debugging for you.

Anyone TDDing around HTML or XML should use or fork my assert_xml()
(from django-test-extensions).

The current version trivially detects a leading <html> tag and uses
etree.HTML(xml); else it goes with the stricter etree.XML(xml). The
former will not complain about the provided sample HTML.

Sadly, the industry has such a legacy of HTML written in Notepad that
well-formed (X)HTML will never be well-formed XML. My own action item
here is to apply Stefan's parser_options suggestion to make
etree.HTML() stricter.

However, a generator is free to produce arbitrarily restricted XML
that avoids the problems with XHTML. It could, for example, push any
Javascript that even dreams of using & instead of &amp; out into .js
files.
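
A quick way to see the ampersand problem is to feed the markup to a strict XML parser. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml's etree.XML(), on the assumption that both reject an unescaped ampersand; is_well_formed is a hypothetical helper, not part of assert_xml().

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the stdlib's strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Inline script with bare ampersands: not well-formed XML.
raw = "<script>if (a && b) run();</script>"
# The same script with the ampersands escaped.
escaped = "<script>if (a &amp;&amp; b) run();</script>"
# The script pushed out into a .js file, leaving nothing to escape.
external = '<script src="app.js"></script>'

for sample in (raw, escaped, external):
    print(is_well_formed(sample), sample)
```

Moving the script into an external file sidesteps the escaping question entirely, which is the restriction a generator can impose on itself.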

So an assert_xml() hot-wired to process only XML - with the true HTML
doctype - is still useful to TDD generated code, because its XPath
reference will detect that you get the nodes you expect.

--
Phlip
http://c2.com/cgi/wiki?ZeekLand

From: Phlip on 8 Feb 2010 13:16

and the tweak is:

parser = etree.HTMLParser(recover=False)
return etree.HTML(xml, parser)

That reduces tolerance. The entire assert_xml() is:

def _xml_to_tree(self, xml):
    from lxml import etree
    self._xml = xml

    try:
        if '<html' in xml[:200]:  # NOTE the condition COULD suck more!
            parser = etree.HTMLParser(recover=False)
            return etree.HTML(xml, parser)
        else:
            return etree.XML(xml)

    except ValueError:  # TODO don't rely on exceptions for normal control flow
        tree = xml
        self._xml = str(tree)  # CONSIDER does this reconstitute the nested XML ?
        return tree

def assert_xml(self, xml, xpath, **kw):
    'Check that a given extent of XML or HTML contains a given XPath, and return its first node'

    tree = self._xml_to_tree(xml)
    nodes = tree.xpath(xpath)
    self.assertTrue(len(nodes) > 0, xpath + ' not found in ' + self._xml)
    node = nodes[0]
    if kw.get('verbose', False):
        self.reveal_xml(node)  # "here have ye been? What have ye seen?"--Morgoth
    return node

def reveal_xml(self, node):
    'Spews an XML node as source, for diagnosis'

    from lxml import etree
    print(etree.tostring(node, pretty_print=True))  # CONSIDER does pretty_print work? why not?

def deny_xml(self, xml, xpath):
    'Check that a given extent of XML or HTML does not contain a given XPath'

    tree = self._xml_to_tree(xml)
    nodes = tree.xpath(xpath)
    self.assertEqual(0, len(nodes), xpath + ' should not appear in ' + self._xml)
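
For anyone who wants to try the idea without lxml installed, here is a rough sketch of how such assertions might sit in a test case. It is an assumption-laden stand-in, not the django-test-extensions code: XmlAssertions, SAMPLE and MenuTest are hypothetical names, and the standard library's xml.etree.ElementTree supports only a subset of XPath (and findall() never matches the root element itself).

```python
import unittest
import xml.etree.ElementTree as ET


class XmlAssertions(object):
    """Hypothetical mixin: stdlib-only cousins of assert_xml() and deny_xml()."""

    def assert_xml(self, xml, path):
        tree = ET.fromstring(xml)  # strict: raises ParseError on malformed input
        nodes = tree.findall(path)
        self.assertTrue(len(nodes) > 0, path + ' not found in ' + xml)
        return nodes[0]

    def deny_xml(self, xml, path):
        tree = ET.fromstring(xml)
        nodes = tree.findall(path)
        self.assertEqual(0, len(nodes), path + ' should not appear in ' + xml)


# Stand-in for the output of the generator under test.
SAMPLE = ('<html><body><ul id="menu">'
          '<li><a href="/home">Home</a></li>'
          '</ul></body></html>')


class MenuTest(XmlAssertions, unittest.TestCase):

    def test_home_link_present(self):
        link = self.assert_xml(SAMPLE, ".//ul[@id='menu']/li/a[@href='/home']")
        self.assertEqual('Home', link.text)

    def test_no_tables(self):
        self.deny_xml(SAMPLE, './/table')
```

Saved as a module, this can be run with python -m unittest. A syntax error in SAMPLE shows up as a ParseError in the test, and a missing node shows up as an assertion failure naming the path.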

From: Lawrence D'Oliveiro on 8 Feb 2010 16:39

In message <4b6fe93d$0$6724$9b4e6d93(a)newsspool2.arcor-online.net>, Stefan
Behnel wrote:

> - generating HTML using a tool that guarantees correct HTML output

Where do you think these tools come from? They don't write themselves, you
know.
