From: Hakan on 28 Apr 2010 08:42

We need to scan a very big input file to see how many times each date occurs in it. This means that we want to check the number of times successive strings of the form "20020701", "20020702" and so on appear in it, from a given start date to a given end date. The syntax is the European date format.

What is the most efficient way to do it? I have tried with 1) a system call to grep and 2) a RandomAccessFile, reading each character and moving the file pointer ahead, but neither runs quickly enough. Another option might be pattern matching, but then we would probably still have the problem of searching through most of the file. The records come in order, such that no record of July 1 comes after a record of July 2.

It would be great if anyone has ideas for this or has done it before.

Thanks in advance!

Regards,

Håkan Lane
From: Nigel Wade on 28 Apr 2010 09:34

On Wed, 28 Apr 2010 14:42:35 +0200, Hakan wrote:

> We need to scan a very big input file to see how many times each date
> occurs in it. This means that we want to check the number of times
> successive strings of the form "20020701", "20020702" and so on appear
> in it, from a given start date to a given end date. The syntax is the
> European date format.
>
> What is the most efficient way to do it? I have tried with 1) a system
> call to grep and 2) a RandomAccessFile, reading each character and
> moving the file pointer ahead, but neither runs quickly enough. Another
> option might be pattern matching, but then we would probably still have
> the problem of searching through most of the file. The records come in
> order, such that no record of July 1 comes after a record of July 2.
>
> It would be great if anyone has ideas for this or has done it before.
>
> Thanks in advance!
>
> Regards,
>
> Håkan Lane

I doubt you'll beat grep. It's optimized for one single task: pattern matching. If grep won't run fast enough for your requirements, then you may need to rethink the solution.

One point may be worth checking if you are running in a Linux environment (I don't know whether this affects other environments). If you are scanning for plain ASCII and your locale uses UTF-8/Unicode, you can greatly increase the speed of grep by switching to plain byte-wise ASCII matching. Speed increases of up to 100x are not uncommon simply by setting "LANG=C" in the environment before running grep. E.g. a simple scan of a log file:

$ time grep -i password /var/log/nxserver.log >/dev/null

real    0m9.650s
user    0m9.529s
sys     0m0.087s

$ LANG=C
$ time grep -i password /var/log/nxserver.log >/dev/null

real    0m0.151s
user    0m0.074s
sys     0m0.078s

--
Nigel Wade
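Since option 1) in the original post was a system call to grep from Java, the same LANG=C trick can be applied to the child process environment. A minimal sketch, assuming a hypothetical file name big-input.txt; note that grep -c counts matching lines rather than total occurrences, which is the same thing only if each record is one line:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class GrepCount {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; -c prints the number of matching lines.
        ProcessBuilder pb = new ProcessBuilder("grep", "-c", "20020701", "big-input.txt");
        pb.environment().put("LANG", "C");  // byte-wise matching instead of a UTF-8 locale
        pb.redirectErrorStream(true);
        Process p = pb.start();
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) {
            System.out.println(line);       // the count reported by grep
        }
        p.waitFor();
        r.close();
    }
}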
From: Tom Anderson on 28 Apr 2010 12:36

On Wed, 28 Apr 2010, Hakan wrote:

> We need to scan a very big input file

Exactly how big?

> to see how many times each date occurs in it. This means that we want
> to check the number of times successive strings of the form "20020701",
> "20020702" and so on appear in it, from a given start date to a given
> end date. The syntax is the European date format.

What do you mean by 'successive'? Could you give us a sample of the input file?

> What is the most efficient way to do it? I have tried with 1) a system
> call to grep

Could you tell us the exact grep command you run?

> and 2) a RandomAccessFile, reading each character and moving the file
> pointer ahead,

I'm not sure how much buffering that does. You might be better off with a FileInputStream wrapped in a BufferedInputStream of generous size (or, in fact, wrapped in an InputStreamReader and some buffering somewhere), or with a memory-mapped file obtained from an NIO FileChannel. Or you might not.

> but neither runs quickly enough. Another option might be pattern
> matching, but then we would probably still have the problem of
> searching through most of the file.

As I understand your requirement, you'll have to scan the *entire* file. What do you mean by "the problem of searching through most of the file"?

tom

--
Basically, at any given time, most people in the world are wasting time.
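For the memory-mapped variant Tom mentions, a minimal sketch, assuming ASCII data, a hypothetical file name big-input.txt, and a file small enough (under 2 GB) to fit in a single mapping; larger files would need a series of windowed mappings:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedScan {
    public static void main(String[] args) throws Exception {
        byte[] target = "20020701".getBytes("US-ASCII");  // date to count
        RandomAccessFile raf = new RandomAccessFile("big-input.txt", "r");
        FileChannel ch = raf.getChannel();
        // Map the whole file read-only; the OS pages it in as we touch it.
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        long count = 0;
        for (int i = 0; i <= buf.limit() - target.length; i++) {
            int j = 0;
            while (j < target.length && buf.get(i + j) == target[j]) j++;
            if (j == target.length) count++;
        }
        raf.close();
        System.out.println(count);
    }
}

One pass like this per date would be wasteful; in practice you would tally every 8-digit run into a map in a single pass.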
From: markspace on 28 Apr 2010 13:21

Hakan wrote:

> It would be great if anyone has ideas for this or has done it before.

A database with an index is something that has done this before and solved the problem. Import the file into a DB, then just query it.

If the dates are in order, it might be possible to do a binary search, provided a record boundary can be located starting from an arbitrary point in the file's data stream. However, a DB with an index is still likely to be faster.
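A sketch of that binary search, assuming each record is one line that begins with its 8-digit date and the file is sorted by date (as the OP says). lowerBound returns the byte offset of the first line whose date is >= the given one, so the July 1 records occupy exactly the bytes between lowerBound(f, "20020701") and lowerBound(f, "20020702"), and only that region needs a sequential scan to count them:

import java.io.IOException;
import java.io.RandomAccessFile;

public class DateBinarySearch {

    // Offset of the first line whose leading 8 characters are >= date.
    static long lowerBound(RandomAccessFile f, String date) throws IOException {
        long lo = 0, hi = f.length();       // invariant: lo and hi sit on line starts
        while (lo < hi) {
            long lineStart = startOfLine(f, (lo + hi) / 2);
            f.seek(lineStart);
            String line = f.readLine();
            if (line == null || line.length() < 8
                    || line.substring(0, 8).compareTo(date) >= 0) {
                hi = lineStart;
            } else {
                lo = f.getFilePointer();    // start of the following line
            }
        }
        return lo;
    }

    // Scan backwards from pos to the start of the line containing it.
    static long startOfLine(RandomAccessFile f, long pos) throws IOException {
        while (pos > 0) {
            f.seek(pos - 1);
            if (f.read() == '\n') break;
            pos--;
        }
        return pos;
    }
}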
From: Robert Kochem on 28 Apr 2010 13:54
Hakan wrote:

> What is the most efficient way to do it? I have tried with 1) a system
> call to grep and 2) a RandomAccessFile, reading each character and
> moving the file pointer ahead,

A RandomAccessFile read one character at a time is a very bad idea, because it results in thousands of very small read requests, which slow the HDD down. IMHO it is much better to use a FileInputStream in combination with a BufferedInputStream whose buffer is larger than 10 MB. That results in large read operations, which perform better even if most of the data is never used/skipped. Additionally, this sequential access allows the OS to use its read-ahead optimization.

Robert
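A rough sketch of that sequential approach, assuming ASCII data and a hypothetical file name. It does its own large block reads (equivalent in effect to a very large BufferedInputStream) and carries the last few bytes of each block over to the next, so a date string straddling two reads is still counted:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedScan {
    public static void main(String[] args) throws IOException {
        byte[] target = "20020701".getBytes("US-ASCII");
        // 16 MB blocks, plus room for an overlap of target.length - 1 bytes.
        byte[] buf = new byte[16 * 1024 * 1024 + target.length - 1];
        InputStream in = new FileInputStream("big-input.txt");
        long count = 0;
        int carry = 0;  // bytes kept from the previous block
        int n;
        while ((n = in.read(buf, carry, buf.length - carry)) != -1) {
            int valid = carry + n;
            for (int i = 0; i + target.length <= valid; i++) {
                int j = 0;
                while (j < target.length && buf[i + j] == target[j]) j++;
                if (j == target.length) count++;
            }
            // Keep the tail so a match spanning the block boundary is found;
            // it is shorter than the target, so nothing is counted twice.
            carry = Math.min(target.length - 1, valid);
            System.arraycopy(buf, valid - carry, buf, 0, carry);
        }
        in.close();
        System.out.println(count);
    }
}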