From: Jerome David Sallinger on
Hello.

I am working with some XML logs coming from a network simulator.
My aim is to strip out the transient information concerning any given
variable.

Here is some example data:

string = <<EOF
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">24</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">21</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
EOF

For example, I may want to pull out all the "CQI" values and their
timestamps to get:

23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470

Question: These files can be very large, and keeping the computer
resource overhead low is important. I've looked at other threads on this
forum to decide which method of extracting data against timestamps would
be the quickest, but the information has been conflicting.

I understand that stream parsing is faster than DOM. I also understand
that libxml is faster than REXML, but libxml streaming uses DOM. So is
it safe to assume that REXML streaming is faster than libxml streaming?

I also need to consider which approach would be easier to implement.

From: brabuhr on
On Tue, Aug 10, 2010 at 9:53 AM, Jerome David Sallinger
<imran.nazir(a)yahoo.co.uk> wrote:
> Hello.
>
> I am working with some XML logs coming from a network simulator.
> My aim is to strip out the transient information concerning any given
> variable.
>
> For example here is some example data:
>[...]
>
> For example I may want to strip out all the "CQI" and timing values to
> get:
>
> 23, 00:00:40.450
> 22, 00:00:40.460
> 22, 00:00:40.460
> 23, 00:00:40.460
> 22, 00:00:40.460
> 24, 00:00:40.460
> 21, 00:00:40.470
> 22, 00:00:40.470
>
> Question: These files can be very large and keeping the computer
> resource overhead is important. I've looked at other threads on this
> forum to decide which method of extracting data against timestamps would
> be the quickest but the information has been conflicting.

I make no claim about what might be best :) but Nokogiri seems to be
the leading Ruby XML library at the moment. I quickly adapted an old
REXML pull parser to work with your sample data:

require 'rexml/parsers/pullparser'

# Yields one hash per <primitive>: its attributes plus one entry per
# <parameter>, keyed by the parameter's name attribute.
def parse(stream)
  raise "BlockRequired" unless block_given?

  parser = REXML::Parsers::PullParser.new(stream)

  row = {}
  col = nil

  while parser.has_next?
    event = parser.pull

    case event.event_type
    when :start_element
      case event[0]
      when 'primitive'
        # event[1] is the attribute hash (time, CFN, ...)
        row = event[1]; col = nil
      when 'parameter'
        col = event[1]["name"]
      end

      row[col] ||= "" if col

    when :end_element
      col = nil

      case event[0]
      when 'primitive'
        yield(row)
      else
        # ignore
      end

    when :text
      # accumulate text only while inside a <parameter>
      row[col] << event[0].chomp if col

    else
      # ignore
    end
  end
end

parse(string) { |row|
  #p row
  puts "#{row["CQI"]}, #{row["time"]}"
}

> ruby x.rb
23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470
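
For comparison with the pull parser, here is a rough sketch of the same
extraction using REXML's stream (SAX-style) listener API. The class name
CqiListener and the trimmed-down sample data are made up for
illustration, and I have not benchmarked this against the pull version:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects [cqi, time] pairs as the parser
# reports events, without building a document tree.
class CqiListener
  include REXML::StreamListener

  attr_reader :rows

  def initialize
    @rows = []       # collected [cqi, time] pairs
    @time = nil      # time attribute of the current <primitive>
    @cqi = nil       # text buffer for the current CQI parameter
    @in_cqi = false  # true while inside <parameter name="CQI">
  end

  def tag_start(name, attrs)
    case name
    when 'primitive'
      @time = attrs['time']
    when 'parameter'
      @in_cqi = (attrs['name'] == 'CQI')
      @cqi = String.new if @in_cqi
    end
  end

  def text(data)
    @cqi << data if @in_cqi
  end

  def tag_end(name)
    return unless name == 'parameter' && @in_cqi
    @rows << [@cqi.strip, @time]
    @in_cqi = false
  end
end

# Shortened sample (assumption: same shape as the data in the question)
xml = <<EOF
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206">
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207">
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
EOF

listener = CqiListener.new
REXML::Document.parse_stream(xml, listener)
listener.rows.each { |cqi, time| puts "#{cqi}, #{time}" }
```

The listener only keeps the state it needs (current time, current CQI
text), which is the main reason to prefer either streaming style over
DOM for big logs.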

The original program I lifted that from was processing XML files up to
several gigabytes; particularly on the largest files we saw much
better performance running under JRuby than under MRI (1.8.5 or so).
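
For files that size you would not want to read the whole log into a
string first. A minimal sketch, assuming the pull parser is handed an IO
object instead (as far as I know REXML accepts anything IO-like there).
The helper name each_cqi is made up, and StringIO only stands in for a
real File.open on your log:

```ruby
require 'rexml/parsers/pullparser'
require 'stringio'

# Hypothetical helper: yields (cqi, time) for every CQI parameter
# read from +io+, keeping only the current values in memory.
def each_cqi(io)
  parser = REXML::Parsers::PullParser.new(io)
  time = nil  # time attribute of the current <primitive>
  cqi = nil   # text buffer, non-nil only inside a CQI <parameter>

  while parser.has_next?
    event = parser.pull

    case event.event_type
    when :start_element
      time = event[1]['time'] if event[0] == 'primitive'
      cqi = String.new if event[0] == 'parameter' && event[1]['name'] == 'CQI'
    when :text
      cqi << event[0] if cqi
    when :end_element
      if event[0] == 'parameter' && cqi
        yield cqi.strip, time
        cqi = nil
      end
    end
  end
end

# StringIO stands in for something like File.open(path) (assumption)
io = StringIO.new(<<EOF)
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206">
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207">
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
EOF

results = []
each_cqi(io) { |cqi, time| results << "#{cqi}, #{time}" }
puts results
```

Whether this actually keeps memory flat on multi-gigabyte input depends
on the REXML version and the Ruby implementation, so measure it on a
real log before relying on it.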