From: Jonathan Bale on 13 Jul 2010 19:38

I am helping my boss with some scripting for a web analysis research project. He handles vocabulary and analysis, while I am using Ruby to parse WARC files and the actual HTML.

Anyway, I'm still fairly new to Ruby. I did the WARC parsing, but I was wondering what I should use for the HTML parser. (Didn't want to re-invent that wheel.) Some considerations:

* Mainly we just need to pull the content text out of the HTML
* A few tags might have special weight or significance (h1, etc.)
* Unfortunately, nearly all the HTML is broken, because all our test data was provided by software that truncates the data after a certain length.

--
Posted via http://www.ruby-forum.com/.
From: Marc Weber on 13 Jul 2010 20:22

Excerpts from Jonathan Bale's message of Wed Jul 14 01:38:31 +0200 2010:
> I helping my boss with some scripting for a web analysis research
> project. He handles vocabulary and analysis, while I am using ruby parse
> WARC files and the actual HTML.
>
> Anyway, I'm still fairly new to Ruby. I did the WARC parsing, but I was
> wondering what I should use for the HTML parser. (Didn't want to
> re-invent that wheel.) Some considerations:

Google for Nokogiri. That's one solution.

Marc Weber