From: Stefan Behnel on 28 Jul 2010 06:48

jia li, 28.07.2010 12:10:
> I have an XML file with hundreds of <error> elements.
>
> What's strange is only one of these elements could not be parsed correctly:
> <error>
> <checker>REVERSE_INULL</checker>
> <function>Dispose_ParameterList</function>
> <unmangled_function>Dispose_ParameterList</unmangled_function>
> <status>UNINSPECTED</status>
> <num>146</num>
> <home>1/146MMSLib_LinkedList.c</home>
> </error>
>
> I printed the data in "characters(self, data)" and after parsing. The result
> is one "\r\n" is inserted after "1/" and "146MMSLib_LinkedList.c" for the
> latter.
>
> But if I make my XML file only this element left, it could parse correctly.

First of all: don't use SAX. Use ElementTree's iterparse() function. That will shrink your code down to a simple loop in a few lines.

Then, the problem is likely that you are getting separate events for text nodes. The "\r\n" most likely only occurs due to your print statement; I doubt that it's really in the data returned from SAX. Again: using ElementTree instead of SAX will avoid this kind of problem.

Stefan
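
To illustrate the "separate events for text nodes" point: a SAX parser is allowed to deliver the text of one element in several characters() calls, so a handler has to collect the pieces and join them when the element ends. The following is only a minimal sketch, assuming Python 3 and the standard-library xml.sax module; the ErrorHandler class, the SAMPLE data and the <errors> wrapper element are made up for illustration and are not the original poster's code.

```python
import xml.sax


class ErrorHandler(xml.sax.ContentHandler):
    """Collects the complete text of every <home> element (hypothetical handler)."""

    def __init__(self):
        super().__init__()
        self._chunks = None  # None means "not currently inside <home>"
        self.homes = []

    def startElement(self, name, attrs):
        if name == "home":
            self._chunks = []

    def characters(self, data):
        # May be called more than once for a single text node,
        # so only buffer here and join later.
        if self._chunks is not None:
            self._chunks.append(data)

    def endElement(self, name):
        if name == "home":
            self.homes.append("".join(self._chunks))
            self._chunks = None


# Assumed sample input; the <errors> root element is an invention for the demo.
SAMPLE = (
    b"<errors><error><num>146</num>"
    b"<home>1/146MMSLib_LinkedList.c</home></error></errors>"
)

handler = ErrorHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.homes)
```

Printing each chunk from inside characters() would put every piece on its own line, which is one way a spurious line break could appear in the output without being in the data.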
From: Aahz on 9 Aug 2010 12:52

In article <mailman.1250.1280314148.1673.python-list(a)python.org>,
Stefan Behnel <stefan_ml(a)behnel.de> wrote:
>
>First of all: don't use SAX. Use ElementTree's iterparse() function. That
>will shrink your code down to a simple loop in a few lines.

Unless I'm missing something, that only helps if the final tree fits into memory. What do you suggest other than SAX if your XML file may be hundreds of megabytes?

--
Aahz (aahz(a)pythoncraft.com) <*> http://www.pythoncraft.com/

"...if I were on life-support, I'd rather have it run by a Gameboy than a Windows box." --Cliff Wells
From: Stefan Behnel on 9 Aug 2010 13:31

Aahz, 09.08.2010 18:52:
> In article <mailman.1250.1280314148.1673.python-list(a)python.org>,
> Stefan Behnel wrote:
>>
>> First of all: don't use SAX. Use ElementTree's iterparse() function. That
>> will shrink your code down to a simple loop in a few lines.
>
> Unless I'm missing something, that only helps if the final tree fits into
> memory. What do you suggest other than SAX if your XML file may be
> hundreds of megabytes?

Well, what about using ElementTree's iterparse() function in that case? That's what it's good at, and its cElementTree version is extremely fast.

Stefan
From: Aahz on 9 Aug 2010 19:20

In article <mailman.1860.1281375095.1673.python-list(a)python.org>,
Stefan Behnel <stefan_ml(a)behnel.de> wrote:
>Aahz, 09.08.2010 18:52:
>> In article <mailman.1250.1280314148.1673.python-list(a)python.org>,
>> Stefan Behnel wrote:
>>>
>>> First of all: don't use SAX. Use ElementTree's iterparse() function. That
>>> will shrink your code down to a simple loop in a few lines.
>>
>> Unless I'm missing something, that only helps if the final tree fits into
>> memory. What do you suggest other than SAX if your XML file may be
>> hundreds of megabytes?
>
>Well, what about using ElementTree's iterparse() function in that case?
>That's what it's good at, and its cElementTree version is extremely fast.

The docs say, "Parses an XML section into an element tree incrementally". Sure sounds like it retains the entire parsed tree in RAM. Not good. Again, how do you parse an XML file larger than your available memory using something other than SAX?

--
Aahz (aahz(a)pythoncraft.com) <*> http://www.pythoncraft.com/

"...if I were on life-support, I'd rather have it run by a Gameboy than a Windows box." --Cliff Wells
From: Christian Heimes on 9 Aug 2010 19:39

On 10.08.2010 01:20, Aahz wrote:
> The docs say, "Parses an XML section into an element tree incrementally".
> Sure sounds like it retains the entire parsed tree in RAM. Not good.
> Again, how do you parse an XML file larger than your available memory
> using something other than SAX?

The document at http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ explains one way to do it. The iterparse approach is ingenious, but it doesn't work for every XML format. Let's say you have a 10 GB XML file with one million <part/> tags. An iterparser doesn't load the entire document. Instead it iterates over the file and yields (for example) one million small element trees, one for each <part/> tag and its children. You can get the nice API of ElementTree with the memory efficiency of a SAX parser if you follow "Listing 4" in that article.

Christian
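
The idea described above can be sketched with the standard library alone. This is not the article's "Listing 4", just a minimal illustration under some assumptions: Python 3, xml.etree.ElementTree, and a document whose <part> elements are direct children of the root. The iter_parts function and the SAMPLE data are hypothetical names made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def iter_parts(source, tag="part"):
    """Yield each completed <tag> element, then release it (hypothetical helper)."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Once the caller is done with this element, drop everything
            # collected under the root so the tree does not keep growing.
            # Assumes the <tag> elements are direct children of the root.
            root.clear()


# Assumed sample input standing in for a very large file.
SAMPLE = (
    b"<parts>"
    b"<part id='1'><name>bolt</name></part>"
    b"<part id='2'><name>nut</name></part>"
    b"<part id='3'><name>washer</name></part>"
    b"</parts>"
)

names = [part.findtext("name") for part in iter_parts(io.BytesIO(SAMPLE))]
print(names)
```

Each element has to be fully processed before the generator is advanced, because it is cleared on the next step. For formats where the interesting elements are nested more deeply, clearing the root like this is not appropriate and a more careful cleanup is needed.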