From: Jussi Piitulainen on 4 Mar 2010 05:46 Dear group, I am observing weird semi-erratic behaviour that involves Python 3 and lxml, is extremely sensitive to changes in the input data, and only occurs when I name a partial result. I would like some help with this, please. (Python 3.1.1; GNU/Linux; how do I find lxml version?) The test script regress/Tribug is at the end of this message, with a snippet to show the form of regress/tridata.py where the XML is. What I observe is this. Parsing an XML document (wrapped in BytesIO) with lxml.etree.parse and then extracting certain elements with xpath sometimes fails so that I get three times the correct number of elements. On the first run of the script, it fails in one way, and on each subsequent run in another way: subsequent runs are repeatable. Second, the bug only occurs when I give a name to the result from lxml.etree.parse! This is seen below in the lines labeled "name1" or "name2" that sometimes exhibit the bug, and lines labeled "nest1" or "nest2" that never do. That is, this fails in some complex way: result = etree.parse(BytesIO(body)) n = len(result.xpath(title)) This fails to fail: n = len(etree.parse(BytesIO(body)).xpath(title)) I have failed to observe the error interactively. I believe the erroneus result lists are of the form [x x x y y y z z z] when they should be [x y z] but I do not know if the x's are identical or copies. I will know more later, of course, when I have written more complex tests, unless somebody can lead me to a more profitable way of debugging this. Two versions of the test runs follow, before and after a trivial change to the test data. Since the numbers are repeated n's of the above snippets, they should all be the same: 5 observed 1000 times. A first run after removing regress/tridata.pyc: [1202] $ regress/Tribug name1: size 5 observed 969 times name1: size 15 observed 31 times name2: size 5 observed 1000 times nest1: size 5 observed 1000 times nest2: size 5 observed 1000 times All subsequent runs, with regress/tridata.pyc recreated: [1203] $ regress/Tribug name1: size 5 observed 1000 times name2: size 5 observed 978 times name2: size 15 observed 22 times nest1: size 5 observed 1000 times nest2: size 5 observed 1000 times Adding an empty comment <!-- --> to the XML document; a first run: [1207] $ regress/Tribug name1: size 5 observed 992 times name1: size 15 observed 8 times name2: size 5 observed 1000 times nest1: size 5 observed 1000 times nest2: size 5 observed 1000 times And subsequent runs: [1208] $ regress/Tribug name1: size 5 observed 991 times name1: size 15 observed 9 times name2: size 5 observed 998 times name2: size 15 observed 2 times nest1: size 5 observed 1000 times nest2: size 5 observed 1000 times ---start of regress/Tribug--- #! /bin/env python3 # -*- mode: Python; -*- from io import BytesIO from lxml import etree from tridata import body, title def naming(): sizes = dict() for k in range(0,1000): result = etree.parse(BytesIO(body)) n = len(result.xpath(title)) sizes[n] = 1 + sizes.get(n, 0) return sizes def nesting(): sizes = dict() for k in range(0,1000): n = len(etree.parse(BytesIO(body)).xpath(title)) sizes[n] = 1 + sizes.get(n, 0) return sizes def report(label, sizes): for size, count in sizes.items(): print('{}: size {} observed {} times' .format(label, size, count)) report('name1', naming()) report('name2', naming()) report('nest1', nesting()) report('nest2', nesting()) ---end of regress/Tribug--- The file regress/tridata.py contains only the two constants. I omit most of the XML. It would be several screenfuls. ---start of regress/tridata.py--- body = b'''<OAI-PMH xmlns="http://www.opena... .... </OAI-PMH> ''' title = '//*[name()="record"]//*[name()="dc:title"]' ---end of regress/tridata.py---
From: Stefan Behnel on 4 Mar 2010 14:10 Jussi Piitulainen, 04.03.2010 11:46: > I am observing weird semi-erratic behaviour that involves Python 3 and > lxml, is extremely sensitive to changes in the input data, and only > occurs when I name a partial result. I would like some help with this, > please. (Python 3.1.1; GNU/Linux; how do I find lxml version?) Here's how to find the version: http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do I'll give your test code a try when I get to it. However, note that the best place to discuss this is the lxml mailing list: http://codespeak.net/mailman/listinfo/lxml-dev Stefan
From: Jussi Piitulainen on 4 Mar 2010 16:40 Stefan Behnel writes: > Jussi Piitulainen, 04.03.2010 11:46: > > I am observing weird semi-erratic behaviour that involves Python 3 > > and lxml, is extremely sensitive to changes in the input data, and > > only occurs when I name a partial result. I would like some help > > with this, please. (Python 3.1.1; GNU/Linux; how do I find lxml > > version?) > > Here's how to find the version: > > http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do Ok, thank you. Here's the results: >>> print(et.LXML_VERSION, et.LIBXML_VERSION, .... et.LIBXML_COMPILED_VERSION, et.LIBXSLT_VERSION, .... et.LIBXSLT_COMPILED_VERSION) (2, 2, 4, 0) (2, 6, 26) (2, 6, 26) (1, 1, 17) (1, 1, 17) > I'll give your test code a try when I get to it. However, note that > the best place to discuss this is the lxml mailing list: > > http://codespeak.net/mailman/listinfo/lxml-dev Thank you. Two things, however. First, I snipped out most of the XML document in that post, so it won't be runnable as is. As I think I said, my attempts to edit it down to size made the bug hide itself. Second, it's very sensitive to any changes in that XML. Oh, and a third thing. I'm not at all sure yet that the bug is in lxml. It seems to me that Python itself does impossible things - I hope I'm just being blind to something obvious, really. But if you like to try it out, I'll post the full test data as a followup to this. It's just bogus test data.
From: Jussi Piitulainen on 4 Mar 2010 16:47 This is the full data file on which my regress/Tribug exhibits the behaviour that I find incomprehensible, described in the first post in this thread. The comment in the beginning of the file below was written before I commented out some records in the data, so the actual numbers now are not ten expected, thirty sometimes observed, but the wrong number is always the correct number tripled (5 and 15, I think). ---regress/tridata.py follows--- # Exercise lxml.etree.parse(body).xpath(title) # which I think should always return a list of # ten elements but sometimes returns thirty, # with each of the ten in triplicate. And this # seems impossible to me. Yet I see it happening. body = b'''<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"> <responseDate>2010-03-02T09:38:47Z</responseDate> <request verb="ListRecords" from="2004-01-01T00:00:00Z" until="2004-12-31T23:59:59Z" metadataPrefix="oai_dc">http://localhost/pmh/que</request> <ListRecords> <record> <header><!-- x --><!-- --> <identifier>jrc32003R0055-pl.xml/2/0</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>Rozporządzenie</dc:title> </oai_dc:dc> </metadata> </record> <!-- <record> <header> <identifier>jrc32003R0055-pl.xml/2/1</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>Komisji</dc:title> </oai_dc:dc> </metadata> </record> <record> <header> <identifier>jrc32003R0055-pl.xml/2/2</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>(WE)</dc:title> </oai_dc:dc> </metadata> </record> <record> <header> <identifier>jrc32003R0055-pl.xml/2/3</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>nr</dc:title> </oai_dc:dc> </metadata> </record> <record> <header> <identifier>jrc32003R0055-pl.xml/2/4</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>55/2003</dc:title> </oai_dc:dc> </metadata> </record> <record> <header> <identifier>jrc32003R0055-pl.xml/3/0</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>z</dc:title> </oai_dc:dc> </metadata> </record> --> <record> <header> <identifier>jrc32003R0055-pl.xml/3/1</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>dnia</dc:title> </oai_dc:dc> </metadata> </record> <record> <header> <identifier>jrc32003R0055-pl.xml/3/2</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>13</dc:title> </oai_dc:dc> </metadata> </record> <record> <header> <identifier>jrc32003R0055-pl.xml/3/3</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>stycznia</dc:title> </oai_dc:dc> </metadata> </record> <record> <header> <identifier>jrc32003R0055-pl.xml/3/4</identifier> <datestamp>2004-08-15T19:45:00Z</datestamp> <setSpec>pl</setSpec> </header> <metadata> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:title>2003</dc:title> </oai_dc:dc> </metadata> </record> </ListRecords> </OAI-PMH> ''' title = '//*[name()="record"]//*[name()="dc:title"]'
From: Stefan Behnel on 5 Mar 2010 03:03 Jussi Piitulainen, 04.03.2010 22:40: > Stefan Behnel writes: >> Jussi Piitulainen, 04.03.2010 11:46: >>> I am observing weird semi-erratic behaviour that involves Python 3 >>> and lxml, is extremely sensitive to changes in the input data, and >>> only occurs when I name a partial result. I would like some help >>> with this, please. (Python 3.1.1; GNU/Linux; how do I find lxml >>> version?) >> >> Here's how to find the version: >> >> http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do > > Ok, thank you. Here's the results: > > >>> print(et.LXML_VERSION, et.LIBXML_VERSION, > ... et.LIBXML_COMPILED_VERSION, et.LIBXSLT_VERSION, > ... et.LIBXSLT_COMPILED_VERSION) > (2, 2, 4, 0) (2, 6, 26) (2, 6, 26) (1, 1, 17) (1, 1, 17) I can't reproduce this with the latest lxml trunk (and Py3.2 trunk) and libxml2 2.7.6, even after running your test script for quite a while. I'd try to upgrade the libxml2 version. Stefan
|
Next
|
Last
Pages: 1 2 Prev: Don't work __getattr__ with __add__ Next: WANTED: A good name for the pair (args, kwargs) |