XML parsing: SAX/expat & yield [Python]

From: kj on 4 Aug 2010 13:13

I want to write code that parses a file that is far bigger than
the amount of memory I can count on. Therefore, I want to stay as
far away as possible from anything that produces a memory-resident
DOM tree.

The top-level structure of this xml is very simple: it's just a
very long list of "records". All the complexity of the data is at
the level of the individual records, but these records are tiny in
size (relative to the size of the entire file).

So the ideal would be a "parser-iterator", which parses just enough
of the file to "yield" (in the generator sense) the next record,
thereby returning control to the caller; the caller can process
the record, delete it from memory, and return control to the
parser-iterator; once parser-iterator regains control, it repeats
this sequence starting where it left off.

The problem, as I see it, is that SAX-type parsers like expat want
to do everything with callbacks, which is not readily compatible
with the generator paradigm I just described.

Is there a way to get an xml.parsers.expat parser (or any other
SAX-type parser) to stop at a particular point to yield a value?

The only approach I can think of is to have the appropriate parser
callbacks throw an exception wherever a yield would have been.
The exception-handling code would have the actual yield statement,
followed by code that restarts the parser where it left off.
Additional logic would be necessary to implement the piecemeal
reading of the input file into memory.

But I'm not very conversant with SAX parsers, and even less with
generators, so all this may be unnecessary, or way off.

If you have any other tricks/suggestions for turning a SAX parser into
a generator, please let me know.

~K
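
For reference, the exception trick is not strictly needed to get generator behaviour out of expat: the callbacks can simply collect finished records in a list, while a generator feeds the parser one chunk at a time and yields whatever was completed. The following is a minimal sketch under stated assumptions, not a definitive implementation. It assumes Python 3, that each record is an element named "record" (a made-up tag name), and that a record's attributes plus its concatenated text are all the caller needs. The function name iter_records_expat and its parameters are hypothetical.

```python
import io
import xml.parsers.expat


def iter_records_expat(fileobj, record_tag="record", chunk_size=64 * 1024):
    """Yield (attributes, text) for each record element, one chunk at a time.

    record_tag and chunk_size are illustrative assumptions, not part of expat.
    """
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver character data in larger pieces
    finished = []   # records completed while parsing the current chunk
    current = None  # [attributes, text pieces] of the record being read

    def start(name, attrs):
        nonlocal current
        if name == record_tag:
            current = [attrs, []]

    def chars(data):
        # Text of nested elements is simply appended to the record's text.
        if current is not None:
            current[1].append(data)

    def end(name):
        nonlocal current
        if name == record_tag and current is not None:
            finished.append((current[0], "".join(current[1])))
            current = None

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    while True:
        chunk = fileobj.read(chunk_size)
        parser.Parse(chunk, not chunk)  # an empty chunk marks end of input
        yield from finished             # hand completed records to the caller
        finished.clear()
        if not chunk:
            break


# Usage with a small in-memory document; a real file would be opened in
# binary mode and passed in the same way.
sample = b"<root><record id='1'>a</record><record id='2'>b</record></root>"
for attrs, text in iter_records_expat(io.BytesIO(sample), chunk_size=16):
    print(attrs, text)
```

Memory use is bounded by the chunk size plus the records completed within one chunk, since the list of finished records is emptied after every chunk.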

From: Peter Otten on 4 Aug 2010 13:22

kj wrote:

> I want to write code that parses a file that is far bigger than
> the amount of memory I can count on. Therefore, I want to stay as
> far away as possible from anything that produces a memory-resident
> DOM tree.
>
> The top-level structure of this xml is very simple: it's just a
> very long list of "records". All the complexity of the data is at
> the level of the individual records, but these records are tiny in
> size (relative to the size of the entire file).
>
> So the ideal would be a "parser-iterator", which parses just enough
> of the file to "yield" (in the generator sense) the next record,
> thereby returning control to the caller; the caller can process
> the record, delete it from memory, and return control to the
> parser-iterator; once parser-iterator regains control, it repeats
> this sequence starting where it left off.

How about

http://effbot.org/zone/element-iterparse.htm#incremental-parsing

Peter
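
The incremental-parsing pattern behind that link can be sketched roughly as follows, using xml.etree.ElementTree.iterparse from the standard library. This is a minimal sketch, not a definitive implementation: the tag name "record", the sample data, and the function name iter_records_iterparse are assumptions made for illustration. Each record is yielded as soon as its end tag has been parsed, and the root element is cleared afterwards so that processed records do not accumulate in memory.

```python
import io
import xml.etree.ElementTree as ET


def iter_records_iterparse(source, record_tag="record"):
    """Yield each record element as soon as it has been fully parsed.

    record_tag is an illustrative assumption about the document layout.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            # Control returns here once the caller asks for the next record;
            # detach everything parsed so far from the root.
            root.clear()


# Usage with a small in-memory document; a filename or a file opened in
# binary mode can be passed in the same way.
sample = (
    b"<root>"
    b"<record id='1'><name>a</name></record>"
    b"<record id='2'><name>b</name></record>"
    b"</root>"
)
for record in iter_records_iterparse(io.BytesIO(sample)):
    print(record.get("id"), record.findtext("name"))
```

Unlike a bare SAX handler, each yielded record is a small element tree of its own, so the per-record complexity can be handled with the usual find/findtext calls while the file as a whole is never held in memory.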

From: kj on 4 Aug 2010 17:49

In <i3c7lc$e6v$03$1(a)news.t-online.com> Peter Otten <__peter__(a)web.de> writes:

>How about

>http://effbot.org/zone/element-iterparse.htm#incremental-parsing

Exactly!

Thanks!

~K
