From: Martin Gregorie on 28 Apr 2010 15:57

On Wed, 28 Apr 2010 14:42:35 +0200, Hakan wrote:

> We need to scan a very big input file to see how many times each date
> occurs in it. This means that we want to check the number of times
> successive strings of the form "20020701", "20020702" and so on are in
> it from a given start to end date. The syntax is European format.

Sounds like a job for awk (gawk) or Perl to me, especially if it's more
or less a one-off or infrequent task.

> What is the most efficient way to do it? I have tried with 1) a system
> call to grep

Does this imply that grep is run multiple times? If so, awk or Perl
would be *much* faster. Both have major advantages over repeatedly
calling grep:

- you can use multiple patterns, e.g. one to limit processing to the
  required date range / section of the file delimited by a date range

- you can use another pattern to recognise the dates being counted, if
  that makes the code better structured

- both awk and Perl support associative arrays, which is exactly what
  you need for listing the unique dates in the file and counting the
  occurrences of each date.

Unless I missed something, a program that does the required scanning
and lists sorted results could be written in under 20 lines as an awk
script, and in not many more with Perl (a Java equivalent is sketched
after this post).

--
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
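The associative-array idea carries straight over to Java, where a map
plays the same role. A minimal sketch, assuming the dates appear as
free-standing 8-digit tokens and taking the file name from the command
line (the class name, the regex, and the I/O choices are illustrative
assumptions, not from the thread):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Count how many times each yyyyMMdd string occurs in a file.
    public class DateCounter {
        public static void main(String[] args) throws Exception {
            Pattern date = Pattern.compile("\\b20\\d{6}\\b");
            // TreeMap keeps the keys sorted, so the report needs no
            // separate sort step.
            Map<String, Integer> counts = new TreeMap<String, Integer>();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = date.matcher(line);
                while (m.find()) {
                    Integer n = counts.get(m.group());
                    counts.put(m.group(), n == null ? 1 : n + 1);
                }
            }
            in.close();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getKey() + " " + e.getValue());
            }
        }
    }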
From: ~Glynne on 28 Apr 2010 16:30

On Apr 28, 6:42 am, Hakan <H...(a)softhome.net> wrote:

> We need to scan a very big input file to see how many times each date
> occurs in it. This means that we want to check the number of times
> successive strings of the form "20020701", "20020702" and so on are in
> it from a given start to end date. The syntax is European format.
>
> What is the most efficient way to do it? I have tried with 1) a system
> call to grep and 2) a RandomAccessFile reading each character and
> moving the file pointer ahead, but none of them runs quickly enough.
> Another option might be to use pattern matching, but then we would
> still probably have the problem of searching through most of the file.
> The records come in order, such that no records of July 1 come after
> July 2. It would be great if anyone has ideas for this or has done it
> before.
>
> Thanks in advance!!
>
> Regards,
>
> Håkan Lane
>
> --
> Newsoffice.de - The online software for reading and writing on Usenet
> The signature can be customised as you like ;-)

Implement the following pseudo-code in the language of your choice
(a Java version is sketched after this post):

    last_date = fencepost
    counter = 0
    while readline
        extract date field
        if ( date != last_date )
            print last_date, counter
            last_date = date
            counter = 1
        else
            increment counter
    end while
    print last_date, counter

As for fast IO in Java, I would recommend

    BufferedReader stdin = new BufferedReader(
            new InputStreamReader(System.in, "ISO-8859-1"));

    while( (line = stdin.readLine()) != null )
    {
        ....
    }

~Glynne
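A minimal Java rendering of that pseudo-code, using the BufferedReader
setup above. It exploits the fact that Hakan's records arrive sorted by
date, so a single pass with no map is enough; the assumption that the
date is the first 8 characters of each record is illustrative, since
the thread never specifies the record layout:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Run-length count of sorted, date-stamped records read from stdin.
    public class DateRunCounter {
        public static void main(String[] args) throws Exception {
            BufferedReader stdin = new BufferedReader(
                    new InputStreamReader(System.in, "ISO-8859-1"));
            String lastDate = null;   // null plays the "fencepost" role
            int counter = 0;
            String line;
            while ((line = stdin.readLine()) != null) {
                String date = line.substring(0, 8); // assumed layout
                if (!date.equals(lastDate)) {
                    if (lastDate != null) {         // skip the sentinel
                        System.out.println(lastDate + " " + counter);
                    }
                    lastDate = date;
                    counter = 1;
                } else {
                    counter++;
                }
            }
            if (lastDate != null) {
                System.out.println(lastDate + " " + counter);
            }
        }
    }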
From: Arne Vajhøj on 28 Apr 2010 22:00

On 28-04-2010 13:54, Robert Kochem wrote:

> IMHO it is much better to use a FileInputStream in combination with a
> BufferedInputStream with a buffer larger than 10MB.

BufferedInputStream/FileInputStream with a large buffer, or
FileInputStream with a buffer in the application code, or
FileChannel.map should all give good performance (a sketch of the
FileChannel.map option follows below).

Arne
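Of those three, FileChannel.map is the least obvious to set up for a
very big file, because a single mapping cannot exceed Integer.MAX_VALUE
bytes. A minimal sketch that maps the file in windows; the 256 MB
window size is an arbitrary illustrative choice:

    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Scan a large file through a series of memory-mapped windows.
    // Note that a date string can straddle a window boundary; real
    // code would overlap the windows or carry state across them.
    public class MappedScan {
        private static final long WINDOW = 256L * 1024 * 1024;

        public static void main(String[] args) throws Exception {
            FileChannel ch = new FileInputStream(args[0]).getChannel();
            long size = ch.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer buf =
                        ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    // ... match date patterns against b here ...
                }
            }
            ch.close();
        }
    }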
From: Roedy Green on 29 Apr 2010 12:48
On Wed, 28 Apr 2010 14:42:35 +0200, Hakan <H.L(a)softhome.net> wrote,
quoted or indirectly quoted someone who said:

> We need to scan a very big input file to see how many times each date
> occurs in it. This means that we want to check the number of times
> successive strings of the form "20020701", "20020702" and so on are in
> it from a given start to end date. The syntax is European format.

You need to provide more detail, e.g. the hex patterns you are looking
for. Places you can gain speed (a sketch of points 1-3 follows this
post):

1. read bytes rather than chars, to save the conversion of bytes to
   chars, i.e. FileInputStream.

2. don't use a generic regex. Use some hard-coded char handling.
   Process a buffer full at a time, with some overlap between buffers
   or special handling to deal with the buffer boundaries.

3. whacking huge buffers. Optimal size to be found by experiment.

4. possible use of nio, though your file is presumably too big for a
   memory-mapped file.

--
Roedy Green Canadian Mind Products
http://mindprod.com
It's amazing how much structure natural languages have when you
consider who speaks them and how they evolved.
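A sketch of points 1-3 combined: raw bytes through a big buffer, a
hard-coded byte comparison instead of a regex, and a 7-byte overlap
between reads so a date straddling a buffer boundary is not missed. It
counts one date given on the command line; the 8 MB buffer size is an
illustrative guess, to be tuned by experiment as suggested:

    import java.io.FileInputStream;

    // Count occurrences of one 8-byte date string in a file, reading
    // raw bytes with no char conversion and no regex.
    public class ByteScan {
        public static void main(String[] args) throws Exception {
            byte[] target = args[1].getBytes("ISO-8859-1"); // e.g. "20020701"
            byte[] buf = new byte[8 * 1024 * 1024];
            FileInputStream in = new FileInputStream(args[0]);
            long count = 0;
            int carry = 0;
            int n;
            while ((n = in.read(buf, carry, buf.length - carry)) > 0) {
                int limit = carry + n;
                for (int i = 0; i + target.length <= limit; i++) {
                    int j = 0;
                    while (j < target.length && buf[i + j] == target[j]) {
                        j++;
                    }
                    if (j == target.length) {
                        count++;
                    }
                }
                // keep the tail so a match spanning the boundary is
                // completed on the next pass; a match already counted
                // cannot recur inside the carried bytes
                carry = Math.min(target.length - 1, limit);
                System.arraycopy(buf, limit - carry, buf, 0, carry);
            }
            in.close();
            System.out.println(args[1] + " " + count);
        }
    }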