From: Stefan Behnel on 10 Aug 2010 02:35

Christian Heimes, 10.08.2010 01:39:
> On 10.08.2010 01:20, Aahz wrote:
>> The docs say, "Parses an XML section into an element tree incrementally".
>> Sure sounds like it retains the entire parsed tree in RAM. Not good.
>> Again, how do you parse an XML file larger than your available memory
>> using something other than SAX?
>
> The document at
> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ explains it
> one way.
>
> The iterparser approach is ingenious but it doesn't work for every XML
> format. Let's say you have a 10 GB XML file with one million <part/>
> tags. An iterparser doesn't load the entire document. Instead it
> iterates over the file and yields (for example) one million ElementTrees,
> one for each <part/> tag and its children. You can get the nice API of
> ElementTree with the memory efficiency of a SAX parser if you obey
> "Listing 4".

In the very common case that you are interested in all children of the
root element, it's even enough to intercept on the specific tag name
(lxml.etree has an option for that, but an 'if' block will do just fine
in ET) and simply ".clear()" the child element at the end of the loop body.

That results in very fast and simple code, but it will leave the tags in
the tree while only removing their content and attributes. This usually
works well enough for several tens of thousands of elements, especially
when using cElementTree. As usual, a bit of benchmarking will uncover the
right way to do it in your case.

That's also a huge advantage over SAX: iterparse code is much easier to
tune into a streamlined loop body when you need it.

Stefan
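
A minimal sketch of the iterparse-plus-.clear() pattern described above;
the file name "parts.xml" and the tag name "part" are illustrative
assumptions, not taken from the thread:

    import xml.etree.cElementTree as etree

    def iter_parts(path):
        # iterparse() yields (event, element) pairs; an "end" event fires
        # once an element and all of its children have been fully parsed.
        for event, elem in etree.iterparse(path, events=("end",)):
            if elem.tag == "part":   # assumed tag name; adjust as needed
                yield elem           # full ElementTree API available here
                elem.clear()         # drop children/attributes to free memory

    for part in iter_parts("parts.xml"):
        print(part.findtext("name"))

With lxml.etree the same loop can pass a tag filter to iterparse()
instead of using the 'if' check, which appears to be the option Stefan
refers to.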