From: Brian Candler on 30 Apr 2010 12:26

I plan to parse a huge XML document (too big to fit into RAM) using a stream parser. I can divide the stream into logical chunks which can be processed individually. If a particular chunk fails, I want to append it to an output XML file, which will contain all the failed chunks and can be patched up and retried.

To do this, I want to be able to regenerate the XML of the failed chunk, preferably identical to how it was originally seen. The options I can think of are:

1. A stream parser which gives me the raw XML alongside each parsed item; I can concatenate the raw XML into a string.

2. A stream parser which gives me the byte position of the current node, so I can seek back within the file to fetch the XML again.

3. A stream parser which gives me events identifying the different parts of the XML, together with an inverse process to which I can replay the events and get the XML back again.

Playing with REXML StreamListener, I can get a series of method calls like tag_start(...) and tag_end(...), and I can collect these into an array; is there existing code which would let me replay that array and recreate the XML? Any other options I should be looking at?

Thanks,

Brian.
--
Posted via http://www.ruby-forum.com/.
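[A minimal sketch of option 3, recording REXML StreamListener events and replaying them back into XML. The class and method names EventRecorder/replay_events are made up for illustration; comments, CDATA, processing instructions and entity escaping are ignored, so the round trip is only approximate.]

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Records REXML stream events into a replayable array.
class EventRecorder
  include REXML::StreamListener
  attr_reader :events

  def initialize
    @events = []
  end

  def tag_start(name, attrs)
    @events << [:tag_start, name, attrs]
  end

  def tag_end(name)
    @events << [:tag_end, name]
  end

  def text(data)
    @events << [:text, data]
  end
end

# Inverse process: turn the recorded events back into XML text.
# Entities, comments and CDATA are not handled in this sketch.
def replay_events(events)
  events.map do |type, *args|
    case type
    when :tag_start
      name, attrs = args
      "<#{name}#{attrs.map { |k, v| %( #{k}="#{v}") }.join}>"
    when :tag_end
      "</#{args.first}>"
    when :text
      args.first
    end
  end.join
end

recorder = EventRecorder.new
REXML::Document.parse_stream('<chunk id="1"><item>ok</item></chunk>', recorder)
replay_events(recorder.events)  # => '<chunk id="1"><item>ok</item></chunk>'
```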
From: Caleb Clausen on 30 Apr 2010 13:57

On 4/30/10, Brian Candler <b.candler(a)pobox.com> wrote:
> I plan to parse a huge XML document (too big to fit into RAM) using a
> stream parser. I can divide the stream into logical chunks which can be
> processed individually. If a particular chunk fails, I want to append it
> to an output XML file, which will contain all the failed chunks, and can
> be patched up and retried.
> [...]
> Playing with REXML StreamListener, I can get a series of method calls
> like start_tag(...) and end_tag(...), and I can collect these into an
> array; is there existing code which would let me squirt that array and
> recreate the XML? Any other options I should be looking at?

In my experience, REXML is far too wimpy to deal with data on this scale. (Among other things, it was too slow.) I suggest using the "stream parser" (a misnomer; this is really a lexer) in libxml instead. I don't know for sure whether it can reconstruct the original text the way you want, but that should be possible. I think the class you'd want is LibXML::XML::SaxParser. See http://libxml.rubyforge.org/.
From: John W Higgins on 30 Apr 2010 14:30

Morning,

On Fri, Apr 30, 2010 at 9:26 AM, Brian Candler <b.candler(a)pobox.com> wrote:
> I plan to parse a huge XML document (too big to fit into RAM) using a
> stream parser. I can divide the stream into logical chunks which can be
> processed individually. If a particular chunk fails, I want to append it
> to an output XML file, which will contain all the failed chunks, and can
> be patched up and retried.

If you aren't completely against Perl, XML-Twig [1] has a tool called xml_split [2] which works rather well at splitting XML files. You might wish to split your file into smaller files before even beginning the processing; then, if a file fails to process, you already have it in hand. When finished, you could smash the failed files back together using xml_merge [3] from the same Perl package.

If there is some Ruby variant of this, I couldn't locate it, but that never means much :)

John

[1] - http://search.cpan.org/~mirod/XML-Twig-3.34/
[2] - http://search.cpan.org/~mirod/XML-Twig-3.34/tools/xml_split/xml_split
[3] - http://search.cpan.org/~mirod/XML-Twig-3.34/tools/xml_merge/xml_merge
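[Typical xml_split/xml_merge usage might look like the following, assuming XML::Twig's tools are installed; the filenames are illustrative, and each tool's --help lists the exact options.]

```shell
# Split big.xml into one file per child of the root element
# (produces big-00.xml, the master index, plus big-01.xml, big-02.xml, ...):
xml_split -l 1 big.xml

# Later, after fixing any chunks that failed to process, reassemble
# the document from the master file:
xml_merge big-00.xml > big-fixed.xml
```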
From: Robert Dober on 30 Apr 2010 14:42

On Fri, Apr 30, 2010 at 6:26 PM, Brian Candler <b.candler(a)pobox.com> wrote:

Would you care to use JRuby? That would give you access to top XML stream parsers, IIRC ;)

Just as an example: org.apache.xerces.parsers.SAXParser seems very suited to your purpose. Although it is a little bit of work to construct your XML fragments, it should be rather easy.

HTH
R.
--
The best way to predict the future is to invent it. -- Alan Kay
From: Brian Candler on 30 Apr 2010 16:32
> Would you care to use JRuby?

I don't mind which stream parser, but Java is out :-)

Since this is a bit of disposable code, I've decided to cheat. I pretty-print the XML, then I can read it line-at-a-time using gets into a buffer, identify a range of lines which forms a chunk, and then parse the buffer. On error I write out the buffer again.

Thanks for all your suggestions.
--
Posted via http://www.ruby-forum.com/.
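[Brian's line-oriented cheat might be sketched roughly as follows. The <record> element name and the process_chunks helper are hypothetical, and it assumes pretty-printing has put each chunk's start and end tags on their own lines.]

```ruby
require 'rexml/document'

# Line-oriented chunking over pretty-printed XML. Each logical chunk is
# assumed to be a <record>...</record> element whose start and end tags
# sit on their own lines (the element name is hypothetical).
def process_chunks(io, failed_io)
  buffer = nil
  io.each_line do |line|
    buffer = +'' if line =~ /<record[\s>]/   # start of a chunk
    buffer << line if buffer                 # accumulate raw lines
    next unless buffer && line =~ %r{</record\s*>}
    begin
      yield REXML::Document.new(buffer)      # hand the parsed chunk to the caller
    rescue REXML::ParseException
      failed_io.write(buffer)                # keep the raw XML for later retry
    end
    buffer = nil
  end
end
```

A chunk that fails to parse is written verbatim to failed_io, matching the "write out the buffer again" behaviour described above, while good chunks are yielded as parsed documents.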