From: Bob05 Dr on 29 Jun 2010 19:54

Hello,

I have some text files that I would like to extract text from, then join
them on one single line and save them to a text file.

Here is an example of the text I want to take out:

<Title>Protein complexes in Saccharomyces cerevisiae (GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<ProtocolName>None</ProtocolName>

Here is how I would like the text to save as:

Protein complexes in Saccharomyces cerevisiae (GPM06600002310)GPM06600002310 None

So far I have this:

require 'rexml/document'
include REXML
file = File.new("1.xml")
doc = Document.new(file)
puts doc
aFile = File.new("1.txt", "w")
aFile.write(doc)
aFile.close

I was wondering, how can you split out text and join them on one line?
--
Posted via http://www.ruby-forum.com/.

From: Jesús Gabriel y Galán on 30 Jun 2010 02:20
On Wed, Jun 30, 2010 at 1:54 AM, Bob05 Dr <knightplayer(a)gmail.com> wrote:
> Hello,
>
> I have some text files that I would like to extract text from, then join
> them on one single line and save them to a text file.
>
> Here is an example of the text I want to take out:
>
> <Title>Protein complexes in Saccharomyces cerevisiae
> (GPM06600002310)</Title>
> <ShortLabel>GPM06600002310</ShortLabel>
> <ProtocolName>None</ProtocolName>
>
> Here is how I would like the text to save as:
>
> Protein complexes in Saccharomyces cerevisiae
> (GPM06600002310)GPM06600002310 None
>
>
> So far I have this:
>
> require 'rexml/document'
> include REXML
> file = File.new("1.xml")
> doc = Document.new(file)
> puts doc
> aFile = File.new("1.txt", "w")
> aFile.write(doc)
> aFile.close
>
> I was wondering, how can you split out text and join them on one line?

First of all, your document doesn't parse well, because it has two
root nodes. After solving that, what you need is to get to each element
and extract its text children nodes. Take a look at:

http://www.germane-software.com/software/rexml/docs/tutorial.html

And the methods:

elements
[]
text

of the API. Experiment a little in IRB:

irb(main):001:0> s = <<EOF
irb(main):002:0" <Title>Protein complexes in Saccharomyces cerevisiae
irb(main):003:0" (GPM06600002310)</Title>
irb(main):004:0" <ShortLabel>GPM06600002310</ShortLabel>
irb(main):005:0" <ProtocolName>None</ProtocolName>
irb(main):006:0" EOF
=> "<Title>Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)</Title>\n<ShortLabel>GPM06600002310</ShortLabel>\n<ProtocolName>None</ProtocolName>\n"
irb(main):007:0>
irb(main):008:0*
irb(main):009:0* require 'rexml/document'
=> true
irb(main):010:0> include REXML
=> Object
irb(main):011:0> doc = Document.new s
REXML::ParseException: #<RuntimeError: attempted adding second root element to document>

ooooops, two root elements.
I'll add a fake one surrounding everything:

irb(main):012:0> s = <<EOF
irb(main):013:0" <ROOT>
irb(main):014:0" <Title>Protein complexes in Saccharomyces cerevisiae
irb(main):015:0" (GPM06600002310)</Title>
irb(main):016:0" <ShortLabel>GPM06600002310</ShortLabel>
irb(main):017:0" <ProtocolName>None</ProtocolName>
irb(main):018:0" </ROOT>
irb(main):019:0" EOF
=> "<ROOT>\n<Title>Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)</Title>\n<ShortLabel>GPM06600002310</ShortLabel>\n<ProtocolName>None</ProtocolName>\n</ROOT>\n"
irb(main):020:0> doc = Document.new s
=> <UNDEFINED> ... </>
irb(main):025:0> doc.elements
=> #<REXML::Elements:0xb72907e0 @element=<UNDEFINED> ... </>>
irb(main):026:0> doc.elements.each {|el| p el}
<ROOT> ... </>
=> [<ROOT> ... </>]
irb(main):027:0> doc.to_a
=> [<ROOT> ... </>, "\n"]
irb(main):028:0> doc.elements.to_a
=> [<ROOT> ... </>]
irb(main):032:0> doc.elements["/Title"]
=> nil
irb(main):033:0> doc.elements["Title"]
=> nil
irb(main):034:0> root = doc.root
=> <ROOT> ... </>
irb(main):035:0> root.elements["Title"]
=> <Title> ... </>
irb(main):036:0> root.elements["Title"].to_s
=> "<Title>Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)</Title>"

Look, it seems that with that I can get the text of the Title element.
Let's see if there's a better way:

irb(main):039:0> root.elements["Title"].methods.sort
=> ["<<", "==", "===", "=~", "[]", "[]=", "__id__", "__send__", "add",
"add_attribute", "add_attributes", "add_element", "add_namespace",
"add_text", "all?", "any?", "attribute", "attributes", "bytes", "cdatas",
"children", "class", "clone", "collect", "comments", "context",
"context=", "count", "cycle", "dclone", "deep_clone", "delete",
"delete_at", "delete_attribute", "delete_element", "delete_if",
"delete_namespace", "detect", "display", "document", "drop",
"drop_while", "dup", "each", "each_child", "each_cons", "each_element",
"each_element_with_attribute", "each_element_with_text", "each_index",
"each_recursive", "each_slice", "each_with_index", "elements", "entries",
"enum_cons", "enum_for", "enum_slice", "enum_with_index", "eql?",
"equal?", "expanded_name", "extend", "find", "find_all",
"find_first_recursive", "find_index", "first", "freeze", "frozen?",
"fully_expanded_name", "get_elements", "get_text", "grep", "group_by",
"has_attributes?", "has_elements?", "has_name?", "has_text?", "hash",
"id", "ignore_whitespace_nodes", "include?", "indent", "index",
"index_in_parent", "inject", "insert_after", "insert_before", "inspect",
"instance_eval", "instance_exec", "instance_of?",
"instance_variable_defined?", "instance_variable_get",
"instance_variable_set", "instance_variables", "instructions", "is_a?",
"kind_of?", "length", "local_name", "map", "max", "max_by", "member?",
"method", "methods", "min", "min_by", "minmax", "minmax_by", "name",
"name=", "namespace", "namespaces", "next_element", "next_sibling",
"next_sibling=", "next_sibling_node", "nil?", "node_type", "none?",
"object_id", "one?", "parent", "parent=", "parent?", "partition",
"prefix", "prefix=", "prefixes", "previous_element", "previous_sibling",
"previous_sibling=", "previous_sibling_node", "private_methods",
"protected_methods", "public_methods", "push", "raw", "reduce", "reject",
"remove", "replace_child", "replace_with", "respond_to?", "reverse_each",
"root", "root_node", "select", "send", "singleton_methods", "size",
"sort", "sort_by", "taint", "tainted?", "take", "take_while", "tap",
"text", "text=", "texts", "to_a", "to_enum", "to_s", "to_set", "type",
"unshift", "untaint", "whitespace", "write", "xpath", "zip"]

There's a text method in there, would that do what I expect?

irb(main):040:0> root.elements["Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"

Bingo!

Is there a way to access it directly from the doc, instead of having a
root variable?

irb(main):042:0> doc.elements["ROOT/Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"

Now you can do the same for the other elements. I also recommend you
learn XPath and CSS selectors if you are going to be parsing markup,
and also look at other parsers like Nokogiri. This example was pretty
simple, but these things can get nasty.

Jesus.