From: Jerome David Sallinger on 10 Aug 2010 09:53

Hello.

I am working with some XML logs coming from a network simulator.
My aim is to strip out the transient information concerning any given
variable.

For example here is some example data:

string = <<EOF
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206" sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1' channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207" sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1' channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207" sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1' channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207" sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1' channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207" sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1' channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207" sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1' channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">24</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208" sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1' channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">21</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208" sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1' channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
EOF

For example I may want to strip out all the "CQI" and timing values to
get:

23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470

Question: These files can be very large and keeping the computer
resource overhead low is important. I've looked at other threads on this
forum to decide which method of extracting data against timestamps would
be the quickest but the information has been conflicting.

I understand that stream parsing is faster than DOM. I also understand
that libxml is faster than REXML, but libxml streaming uses DOM. So is it
safe to assume that REXML streaming is faster than libxml streaming?

I also need to consider which approach would be easier to implement.
--
Posted via http://www.ruby-forum.com/.

From: brabuhr on 10 Aug 2010 10:52

On Tue, Aug 10, 2010 at 9:53 AM, Jerome David Sallinger
<imran.nazir(a)yahoo.co.uk> wrote:
> Hello.
>
> I am working with some XML logs coming from a network simulator.
> My aim is to strip out the transient information concerning any given
> variable.
>
> For example here is some example data:
>[...]
>
> For example I may want to strip out all the "CQI" and timing values to
> get:
>
> 23, 00:00:40.450
> 22, 00:00:40.460
> 22, 00:00:40.460
> 23, 00:00:40.460
> 22, 00:00:40.460
> 24, 00:00:40.460
> 21, 00:00:40.470
> 22, 00:00:40.470
>
> Question: These files can be very large and keeping the computer
> resource overhead low is important. I've looked at other threads on this
> forum to decide which method of extracting data against timestamps would
> be the quickest but the information has been conflicting.

I make no claim about what might be best :) but Nokogiri seems to be
the leading Ruby XML library at the moment.

I quickly adapted an old REXML pull parser to work with your sample data:

require 'rexml/parsers/pullparser'

# Yields one hash per <primitive>: its attributes, plus each
# <parameter> name mapped to that parameter's text content.
def parse(stream)
  raise "BlockRequired" unless block_given?

  parser = REXML::Parsers::PullParser.new(stream)
  row = {}
  col = nil
  while parser.has_next?
    event = parser.pull
    case event.event_type
    when :start_element
      case event[0]
      when 'primitive'
        # start a new row from the element's attribute hash
        row = event[1]; col = nil
      when 'parameter'
        col = event[1]["name"]
      end
      row[col] ||= "" if col
    when :end_element
      col = nil
      case event[0]
      when 'primitive'
        yield(row)
      else
        # ignore
      end
    when :text
      # only collect text while inside a <parameter>
      row[col] << event[0].chomp if col
    else
      # ignore
    end
  end
end

parse(string){|row|
  #p row
  puts "#{row["CQI"]}, #{row["time"]}"
}

> ruby x.rb
23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470

The original program I lifted that from was processing XML files up to
several gigabytes; particularly on the largest files we saw much better
performance running under JRuby than under MRI (1.8.5 or so).
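
For comparison, the same extraction can also be sketched with REXML's push-style stream listener, which likewise avoids building a DOM tree. This is only a sketch under assumptions: ParameterListener and extract_rows are hypothetical names that do not appear in the thread, and nothing here has been measured against the pull parser, libxml, or Nokogiri.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener (not from the thread): collects [value, time]
# pairs for one named <parameter> inside each <primitive>.
class ParameterListener
  include REXML::StreamListener

  attr_reader :rows

  def initialize(wanted)
    @wanted = wanted   # parameter name to extract, e.g. "CQI"
    @rows   = []
    @time   = nil      # time attribute of the enclosing <primitive>
    @value  = nil      # non-nil only while inside the wanted <parameter>
  end

  def tag_start(name, attrs)
    attrs = attrs.to_h # tolerate either a Hash or an array of pairs
    case name
    when 'primitive'
      @time = attrs['time']
    when 'parameter'
      @value = String.new if attrs['name'] == @wanted
    end
  end

  def text(data)
    # text may arrive in several pieces, so accumulate it
    @value << data if @value
  end

  def tag_end(name)
    return unless name == 'parameter' && @value
    @rows << [@value.strip, @time]
    @value = nil
  end
end

# source may be a String or an IO object
def extract_rows(source, wanted)
  listener = ParameterListener.new(wanted)
  REXML::Document.parse_stream(source, listener)
  listener.rows
end
```

With the sample data, extract_rows(string, "CQI").each { |cqi, time| puts "#{cqi}, #{time}" } should print the same kind of "CQI, time" lines. For very large logs it would probably be better to pass an open File instead of a String and to print each row in tag_end rather than collecting them all in an array.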