From: Hakan

We need to scan a very big input file to count how many times each date
occurs in it. That is, we want to count occurrences of successive strings
of the form "20020701", "20020702" and so on, from a given start date to
a given end date. The dates are written as yyyyMMdd strings.

What is the most efficient way to do this? I have tried 1) a system
call to grep and 2) a RandomAccessFile, reading each character and moving
the file pointer ahead, but neither runs quickly enough. Another option
might be pattern matching, but then we would probably still have the
problem of searching through most of the file. The records come in order,
so that no record of July 1 comes after one of July 2. It would be great
if anyone has ideas for this or has done it before.

Thanks in advance!!

Regards,

Håkan Lane


--
Newsoffice.de - the online software for reading and writing on Usenet
The signature can be adjusted however you like ;-)
From: Nigel Wade
On Wed, 28 Apr 2010 14:42:35 +0200, Hakan wrote:

> We need to scan a very big input file to count how many times each date
> occurs in it. That is, we want to count occurrences of successive strings
> of the form "20020701", "20020702" and so on, from a given start date to
> a given end date. The dates are written as yyyyMMdd strings.
>
> What is the most efficient way to do this? I have tried 1) a system
> call to grep and 2) a RandomAccessFile, reading each character and moving
> the file pointer ahead, but neither runs quickly enough. Another option
> might be pattern matching, but then we would probably still have the
> problem of searching through most of the file. The records come in order,
> so that no record of July 1 comes after one of July 2. It would be great
> if anyone has ideas for this or has done it before.
>
> Thanks in advance!!
>
> Regards,
>
> Håkan Lane

I doubt you'll beat grep. It's optimized for a single task: pattern
matching. If grep won't run fast enough for your requirements, you may
need to rethink the solution.

One point is worth checking if you are running in a Linux environment
(I don't know whether this applies to other environments). If you are
scanning for plain ASCII and your environment uses a UTF-8/Unicode locale,
you can greatly increase the speed of grep by switching to plain ASCII
matching. Speed-ups of 100x are not uncommon simply from setting "LANG=C"
in the environment before running grep. E.g. a simple scan of a log file:

$ time grep -i password /var/log/nxserver.log >/dev/null
real 0m9.650s
user 0m9.529s
sys 0m0.087s

$ LANG=C
$ time grep -i password /var/log/nxserver.log >/dev/null
real 0m0.151s
user 0m0.074s
sys 0m0.078s

--
Nigel Wade
From: Tom Anderson
On Wed, 28 Apr 2010, Hakan wrote:

> We need to scan a very big input file

Exactly how big?

> to count how many times each date occurs in it. That is, we want to count
> occurrences of successive strings of the form "20020701", "20020702" and
> so on, from a given start date to a given end date. The dates are written
> as yyyyMMdd strings.

What do you mean by 'successive'? Could you give us a sample of the input
file?

> What is the most efficient way to do this? I have tried 1) a system call to
> grep

Could you tell us the exact grep command you run?

> and 2) a RandomAccessFile, reading each character and moving the file
> pointer ahead,

I'm not sure how much buffering that does. You might be better off with a
FileInputStream wrapped in a BufferedInputStream of generous size (or in
fact, wrapped in an InputStreamReader and some buffering somewhere), or
with a memory-mapped file obtained from a NIO FileChannel. Or you might
not.
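
For what it's worth, here's a rough sketch of the memory-mapped variant,
counting one date (untested, and it assumes ASCII input and a file that
fits in a single mapping; one MappedByteBuffer tops out at
Integer.MAX_VALUE bytes, so a bigger file would have to be mapped and
scanned in chunks):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedDateScan {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(args[0], "r");
        FileChannel ch = raf.getChannel();
        // Map the whole file read-only; the OS pages it in as we scan.
        MappedByteBuffer buf =
            ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        byte[] target = "20020701".getBytes("US-ASCII");
        long count = 0;
        // Brute-force byte comparison at every offset.
        for (int i = 0, n = buf.limit() - target.length; i <= n; i++) {
            int j = 0;
            while (j < target.length && buf.get(i + j) == target[j]) j++;
            if (j == target.length) count++;
        }
        System.out.println("20020701: " + count);
        ch.close();
        raf.close();
    }
}

A real version would presumably collect counts for every date in a single
pass rather than rescanning the file once per date.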

> but neither runs quickly enough. Another option might be pattern
> matching, but then we would probably still have the problem of
> searching through most of the file.

As I understand your requirement, you'll have to scan the *entire* file.
What do you mean by "the problem of searching through most of the file"?

tom

--
Basically, at any given time, most people in the world are wasting time.
From: markspace
Hakan wrote:

> It would be great if anyone has ideas for this or has done it before.

A database with an index has solved exactly this problem before. Import
the file into a DB, then just query it.

If the dates are in order, it might be possible to do a binary search,
provided a record boundary can be located starting from an arbitrary
point in the file's data stream. However, a DB with an index is still
likely to be faster.

From: Robert Kochem
Hakan wrote:

> What is the most efficient way to do this? I have tried 1) a system
> call to grep and 2) a RandomAccessFile, reading each character and moving
> the file pointer ahead,

A RandomAccessFile reading single characters is a very bad idea, because
it results in thousands of very small read requests, which slows down
the HDD.

IMHO it is much better to use a FileInputStream in combination with a
BufferedInputStream with a buffer larger than 10MB. That results in
large read operations, which perform better even if most of the data is
never used/skipped. Additionally, this sequential access allows the OS
to use read-ahead optimization.
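
For example, a minimal sketch (assuming ASCII input, one record per
line, and the yyyyMMdd date at the start of each line):

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.TreeMap;

public class StreamingDateCount {
    public static void main(String[] args) throws IOException {
        // The BufferedInputStream with a 16MB buffer does the large
        // sequential reads; the BufferedReader on top is only there
        // to provide readLine().
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new BufferedInputStream(
                        new FileInputStream(args[0]), 16 * 1024 * 1024),
                "US-ASCII"));
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() < 8) continue;
            String date = line.substring(0, 8);  // leading yyyyMMdd
            Integer n = counts.get(date);
            counts.put(date, n == null ? 1 : n + 1);
        }
        in.close();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            System.out.println(e.getKey() + ": " + e.getValue());
    }
}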

Robert