From: Stuart Clarke on 1 Jul 2010 05:47

Hey all,

Could anyone advise me on a fast way to search a single, but very large
file (1 GB) quickly for a string of text? Also, is there a library to
identify the file offset this string was found within the file?

Thanks

--
Posted via http://www.ruby-forum.com/.

From: Michael Fellinger on 1 Jul 2010 06:40

On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke <stuart.clarke1986(a)gmail.com> wrote:
> Hey all,
>
> Could anyone advise me on a fast way to search a single, but very large
> file (1 GB) quickly for a string of text? Also, is there a library to
> identify the file offset this string was found within the file?

You can use IO#grep like this:

    File.open('qimo-2.0-desktop.iso', 'r:BINARY') { |io|
      io.grep(/apiKey/) { |m| p io.pos => m } }

The pos is the position the match ended, so just subtract the string length.
The above example was a 700 MB file; it took around 40 s the first
time and 2 s subsequently, so disk I/O is the limiting factor in terms of
speed (as usual).
Oh, and also don't use binary encoding if you are dealing with another
one ;)

--
Michael Fellinger
CTO, The Rubyists, LLC

From: Robert Klemme on 1 Jul 2010 07:03

2010/7/1 Michael Fellinger <m.fellinger(a)gmail.com>:
> On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
> <stuart.clarke1986(a)gmail.com> wrote:
>> Hey all,
>>
>> Could anyone advise me on a fast way to search a single, but very large
>> file (1 GB) quickly for a string of text? Also, is there a library to
>> identify the file offset this string was found within the file?
>
> You can use IO#grep like this:
> File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
> io.grep(/apiKey/){|m| p io.pos => m } }
>
> The pos is the position the match ended, so just subtract the string length.
> The above example was a 700 MB file; it took around 40 s the first
> time and 2 s subsequently, so disk I/O is the limiting factor in terms of
> speed (as usual).

If you only need to know whether the string occurs in the file, you can do

    found = File.foreach("foo").any? {|line| /apiKey/ =~ line}

This will stop searching as soon as the sequence is found. "fgrep -l foo"
is likely faster.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From: Stuart Clarke on 1 Jul 2010 07:58

Thanks.

This seems to be pretty much the best logic for me; however, it takes a
good 20 minutes to scan a 2 GB file. Any ideas?

Thanks

Michael Fellinger wrote:
> On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
> <stuart.clarke1986(a)gmail.com> wrote:
>> Hey all,
>>
>> Could anyone advise me on a fast way to search a single, but very large
>> file (1 GB) quickly for a string of text? Also, is there a library to
>> identify the file offset this string was found within the file?
>
> You can use IO#grep like this:
> File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
> io.grep(/apiKey/){|m| p io.pos => m } }
>
> The pos is the position the match ended, so just subtract the string
> length.
> The above example was a 700 MB file; it took around 40 s the first
> time and 2 s subsequently, so disk I/O is the limiting factor in terms of
> speed (as usual).
> Oh, and also don't use binary encoding if you are dealing with another
> one ;)
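
One idea for the slow scan, offered as an unbenchmarked sketch rather than a measured fix: when the target is a fixed string rather than a regular expression, the file can be read in large fixed-size chunks instead of line by line, carrying the last few bytes of each chunk over so that a match straddling a chunk boundary is not missed. The helper name `chunked_offsets` and the 1 MiB default chunk size are assumptions made for illustration; neither comes from the thread.

```ruby
# Sketch: byte offsets of every occurrence of a fixed string, reading the
# file in fixed-size chunks. `chunked_offsets` is a hypothetical helper.
def chunked_offsets(path, needle, chunk_size = 1 << 20)
  needle = needle.b # compare raw bytes, ignoring the string's encoding
  raise ArgumentError, 'needle must not be empty' if needle.empty?

  overlap = needle.bytesize - 1 # bytes to carry over between chunks
  offsets = []
  File.open(path, 'rb') do |io|
    tail = ''.b # end of the previous buffer, kept for boundary matches
    base = 0    # file offset of the first byte of the current buffer
    while (chunk = io.read(chunk_size))
      buffer = tail + chunk
      i = 0
      # In a binary string, String#index returns a byte index.
      while (i = buffer.index(needle, i))
        offsets << base + i
        i += 1 # step by one byte so overlapping matches are also reported
      end
      keep = [overlap, buffer.bytesize].min
      tail = keep.zero? ? ''.b : buffer[-keep, keep]
      base += buffer.bytesize - keep
    end
  end
  offsets
end
```

Because the carried-over tail is always shorter than the search string, a match can never lie entirely inside it, so no offset is reported twice. Calling `chunked_offsets('qimo-2.0-desktop.iso', 'apiKey')` would return an array of byte offsets; for a very large number of matches, yielding each offset to a block instead of collecting them would use less memory.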

From: Joel VanderWerf on 1 Jul 2010 13:03

Michael Fellinger wrote:
> On Thu, Jul 1, 2010 at 6:47 PM, Stuart Clarke
> <stuart.clarke1986(a)gmail.com> wrote:
>> Hey all,
>>
>> Could anyone advise me on a fast way to search a single, but very large
>> file (1 GB) quickly for a string of text? Also, is there a library to
>> identify the file offset this string was found within the file?
>
> You can use IO#grep like this:
> File.open('qimo-2.0-desktop.iso', 'r:BINARY'){|io|
> io.grep(/apiKey/){|m| p io.pos => m } }
>
> The pos is the position the match ended

Actually, pos will be the position of the end of the line on which the
match was found, because #grep works line by line.
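
To get the offset of the match itself rather than the end of its line, one option is to keep a running byte count of the lines already read and add the match's position within the current line. The following is a minimal sketch, assuming the file is opened in binary mode and the pattern is an ASCII-only Regexp; `match_offsets` is a hypothetical helper name, not part of Ruby's library.

```ruby
# Sketch: byte offset of every regexp match, scanning line by line.
# `match_offsets` is a hypothetical helper, not a core or stdlib method.
def match_offsets(path, pattern)
  offsets = []
  File.open(path, 'rb') do |io|
    line_start = 0 # byte offset of the current line within the file
    io.each_line do |line|
      pos = 0
      # In binary mode each character is one byte, so MatchData#begin
      # is a byte offset within the line.
      while (m = pattern.match(line, pos))
        offsets << line_start + m.begin(0)
        # Always advance, even after a zero-width match.
        pos = m.end(0) > m.begin(0) ? m.end(0) : m.begin(0) + 1
      end
      line_start += line.bytesize
    end
  end
  offsets
end
```

Like #grep, this works line by line, so it cannot find a match that spans a newline, and a file with no newlines at all would be read as a single huge line.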