From: kj on 4 Aug 2010 13:13

I want to write code that parses a file that is far bigger than the amount of memory I can count on. Therefore, I want to stay as far away as possible from anything that produces a memory-resident DOM tree.

The top-level structure of this XML is very simple: it's just a very long list of "records". All the complexity of the data is at the level of the individual records, but these records are tiny in size (relative to the size of the entire file).

So the ideal would be a "parser-iterator", which parses just enough of the file to "yield" (in the generator sense) the next record, thereby returning control to the caller; the caller can process the record, delete it from memory, and return control to the parser-iterator; once the parser-iterator regains control, it repeats this sequence starting where it left off.

The problem, as I see it, is that SAX-type parsers like expat want to do everything with callbacks, which is not readily compatible with the generator paradigm I just described.

Is there a way to get an xml.parsers.expat parser (or any other SAX-type parser) to stop at a particular point to yield a value?

The only approach I can think of is to have the appropriate parser callbacks throw an exception wherever a yield would have been. The exception-handling code would have the actual yield statement, followed by code that restarts the parser where it left off. Additional logic would be necessary to implement the piecemeal reading of the input file into memory.

But I'm not very conversant with SAX parsers, and even less with generators, so all this may be unnecessary, or way off.

If you have any other tricks or suggestions for turning a SAX parser into a generator, please let me know.

~K
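For the expat question specifically, one possible alternative to the exception idea is to feed the parser one chunk at a time and let the callbacks collect finished records in a list, which a generator then drains after each `Parse()` call. The following is only a rough sketch in modern Python 3 under stated assumptions: the function name `iter_expat_records`, the element name `record`, and the chunk size are all made up for illustration, records are assumed not to nest inside each other, and each record is reduced to its attributes plus concatenated text.

```python
import io
import xml.parsers.expat


def iter_expat_records(fileobj, tag="record", chunk_size=64 * 1024):
    """Yield (attrs, text) for each <tag> element, reading fileobj in chunks.

    Hypothetical helper: assumes <tag> elements do not nest.
    """
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True
    done = []       # records completed during the latest Parse() call
    current = None  # (attrs, text_parts) while inside a record

    def start(name, attrs):
        nonlocal current
        if name == tag and current is None:
            current = (attrs, [])

    def chars(data):
        if current is not None:
            current[1].append(data)

    def end(name):
        nonlocal current
        if name == tag and current is not None:
            done.append((current[0], "".join(current[1])))
            current = None

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    while True:
        chunk = fileobj.read(chunk_size)
        # An empty chunk means end of input; tell expat this is the final call.
        parser.Parse(chunk, not chunk)
        yield from done
        done.clear()
        if not chunk:
            break


# Tiny in-memory demonstration with a deliberately small chunk size.
sample = b'<records><record id="1">a</record><record id="2">b<x>c</x></record></records>'
records = list(iter_expat_records(io.BytesIO(sample), chunk_size=7))
```

Memory use is bounded by the chunk size plus the records completed within one chunk, so no exception juggling or parser restart is needed.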
From: Peter Otten on 4 Aug 2010 13:22

kj wrote:

> I want to write code that parses a file that is far bigger than
> the amount of memory I can count on. Therefore, I want to stay as
> far away as possible from anything that produces a memory-resident
> DOM tree.
>
> The top-level structure of this xml is very simple: it's just a
> very long list of "records". All the complexity of the data is at
> the level of the individual records, but these records are tiny in
> size (relative to the size of the entire file).
>
> So the ideal would be a "parser-iterator", which parses just enough
> of the file to "yield" (in the generator sense) the next record,
> thereby returning control to the caller; the caller can process
> the record, delete it from memory, and return control to the
> parser-iterator; once parser-iterator regains control, it repeats
> this sequence starting where it left off.

How about

http://effbot.org/zone/element-iterparse.htm#incremental-parsing

Peter
From: kj on 4 Aug 2010 17:49

In <i3c7lc$e6v$03$1(a)news.t-online.com> Peter Otten <__peter__(a)web.de> writes:

>How about
>http://effbot.org/zone/element-iterparse.htm#incremental-parsing

Exactly! Thanks!

~K
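For readers who want to see roughly what the iterparse approach linked above looks like as a generator, here is a minimal sketch using the standard library's `xml.etree.ElementTree.iterparse`. The function name `iter_records` and the element name `record` are assumptions for illustration (the thread never shows the actual file layout), and records are assumed to be direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="record"):
    """Yield each <tag> element that is a direct child of the root.

    Hypothetical helper: `source` is a filename or a binary file object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    depth = 0                # nesting depth below the root
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:
            # A direct child of the root has just been fully parsed.
            if elem.tag == tag:
                yield elem
            # Drop finished children so the tree does not grow with the file.
            root.clear()


# Tiny in-memory demonstration; a real call would pass the big file's path.
sample = (
    b"<records>"
    b"<record id='1'><name>a</name></record>"
    b"<record id='2'><name>b</name></record>"
    b"</records>"
)
pairs = [(r.get("id"), r.findtext("name")) for r in iter_records(io.BytesIO(sample))]
```

Each yielded element is only valid until the next iteration, because `root.clear()` discards it once the caller asks for the following record; anything the caller needs to keep should be copied out first.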