Reading from very large file [Java Programming]

From: Stanimir Stamenkov on 9 May 2010 07:09

Sun, 09 May 2010 08:57:59 +0200, /Hakan/:

> Sorry about the mistake, but the file is actually 13 GB. I can read to a
> character array buffering about 30 million characters before the heap
> space is overflowed. This is still only a part of the file.

How does the Grep.java example perform for you?

http://java.sun.com/javase/6/docs/technotes/guides/io/example/index.html

I guess you could modify the grep() method to extract what you need
and save it to another file.

--
Stanimir

From: Tom Anderson on 9 May 2010 07:16

On Sun, 9 May 2010, Hakan wrote:

> Sorry about the mistake, but the file is actually 13 GB. I can read to a
> character array buffering about 30 million characters before the heap space
> is overflowed. This is still only a part of the file.
>
> The sscce site is down and not accessible when I tried. What I have been
> doing so far is something like this in rough code:

Rough code is really not that useful - you're having a problem because
something in your code is wrong, which means that something in your
*understanding* of the code is wrong. Telling us about your understanding
of the code is therefore not very useful. Why can't you copy and paste
your actual code?

> static int nchars=27000000;
> int startpos=0;
> File readfile="../x.txt";
> FileReader frd=new File;
> String searchs="20020701";
> char[] arr=new char[nchars];
>
> while (more dates to search for)
> {
> frd=new FileReader(readfile); /*reopen file
> frd.skip(startpos); /*move to file pointer where final place of last date was found

I suspect the above line is the problem.

A FileReader works in characters, not bytes. Characters may be a variable
number of bytes (in some encodings, and so in general), and thus skipping
a given number of characters doesn't correspond to skipping any fixed
number of bytes. Thus, FileReader.skip can't be implemented efficiently on
top of the low-level seek() system call. Instead, it has to read through
the contents of the file, counting characters until it's skipped the right
number. So, every time you make this call, you're re-reading all of the
file you've read so far.
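
To illustrate the character-versus-byte point, here is a minimal sketch. The class name, method name and sample strings are made up for illustration, and it assumes UTF-8 (and Java 7 or later for StandardCharsets). It shows that the same number of characters can occupy a different number of bytes, which is why a character offset can't be mapped to a byte offset without decoding everything before it.

```java
import java.nio.charset.StandardCharsets;

public class CharVsByteOffset {
    // Number of bytes the first nChars characters of s occupy when encoded as UTF-8.
    static int utf8BytesForPrefix(String s, int nChars) {
        return s.substring(0, nChars).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "20020701";       // pure ASCII: one byte per character
        String mixed = "caf\u00e9 2002"; // U+00E9 takes two bytes in UTF-8

        System.out.println(utf8BytesForPrefix(ascii, 4)); // prints 4
        System.out.println(utf8BytesForPrefix(mixed, 4)); // prints 5
    }
}
```

Skipping four characters means skipping four bytes in the first string but five in the second, so the reader has to decode its way forward rather than seek.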

> frd.read(arr,0,nchars); /*10
> find number of date occurrences in arr with pattern matching
> update searchs (first time to "20020702" and so on
> startpos=startpos+(last place of pattern match)
> output result for this date
> }
>
> This in all tends to use one to two minutes per run of the loop. What I
> would like to do is to a) either preprocess the file such that I get an
> input file where only numbers are present or b) change the read call at
> label 10 so that it only reads numbers instead of all next characters.

No, you don't want to do either of those things. You want to avoid the
real problem, which is re-reading the file every trip round the loop.

You're massively overcomplicating this problem. All you need to do is set
up the FileReader - once, and with suitable buffering - then read
characters from it, looking for strings which look like dates. You can do
this in exactly one pass of the file, and less than 30 lines of code.
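
A rough sketch of that single-pass idea follows. It is not the DateScanner program mentioned in this post (its source is not shown here); the class name, the method name and the assumption that a "date" is a run of exactly eight digits such as 20020701 are all hypothetical. It opens the reader once, wraps it in a buffer, and reports each date with its character offset.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class OnePassDateScan {
    static final int DATE_LENGTH = 8; // assumption: dates look like 20020701

    // Reads the input exactly once. For every run of exactly eight digits,
    // appends "digits offset\n" to out. Returns the number of hits.
    static long scan(Reader in, Appendable out) throws IOException {
        StringBuilder digits = new StringBuilder();
        long pos = 0;  // characters consumed so far
        long hits = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (c >= '0' && c <= '9') {
                digits.append((char) c);
            } else {
                if (digits.length() == DATE_LENGTH) {
                    out.append(digits).append(' ')
                       .append(Long.toString(pos - DATE_LENGTH)).append('\n');
                    hits++;
                }
                digits.setLength(0);
            }
            pos++;
        }
        // A date may end exactly at end of file.
        if (digits.length() == DATE_LENGTH) {
            out.append(digits).append(' ')
               .append(Long.toString(pos - DATE_LENGTH)).append('\n');
            hits++;
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // FileReader uses the platform default encoding; the 1 Mi-char
        // buffer size is an arbitrary choice for this sketch.
        try (Reader in = new BufferedReader(new FileReader(args[0]), 1 << 20)) {
            long hits = scan(in, System.out);
            System.err.println(hits + " dates found");
        }
    }
}
```

Because the reader is never reopened and skip() is never called, each character of the file is read once regardless of how many dates are found.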

I know that because in ten minutes, I just wrote a program that does it.
Download the class file from here:

http://urchin.earth.li/~twic/tmp/DateScanner.class

And run it like:

java DateScanner name-of-file.txt

It doesn't do the full sequential processing of dates that you want to do,
but it does report every date it finds, and its position. Now run it like
this:

java -Dquiet=true DateScanner name-of-file.txt

To suppress output. How long does it take to process your file?

tom

--
All roads lead unto death row; who knows what's after?

From: Lew on 9 May 2010 11:35

On 05/09/2010 02:57 AM,
Lew wrote:
>> <http://sscce.org/>
>> <http://sscce.org/>
>> <http://sscce.org/>

Hakan wrote:
> Sorry about the mistake, but the file is actually 13 GB. I can read to a

Well, now, that's a horse of a different color.

> character array buffering about 30 million characters before the heap

Allocate a larger -Xmx.

> space is overflowed. This is still only a part of the file.

What about buffering?

> The sscce site is down and not accessible when I tried. What I have been

It's up now as I check it. Look again.

> doing so far is something like this in rough code:
>
> static int nchars=27000000;
> int startpos=0;
> File readfile="../x.txt";

Relative paths from inside Java code can be tricky.
Class.getResourceAsStream() (around which you'd wrap a Reader, of course) can
help with that.

> FileReader frd=new File;

This line will not compile.

Show ACTUAL code. How many people have told you this in this thread?

You keep showing disrespect for the people trying to help you, and that could
hurt your chances of getting good help.

> String searchs="20020701";
> char[] arr=new char[nchars];
>
> while (more dates to search for)
> {
> frd=new FileReader(readfile); /*reopen file
> frd.skip(startpos); /*move to file pointer where final place of last
> date was found
> frd.read(arr,0,nchars); /*10
> find number of date occurrences in arr with pattern matching
> update searchs (first time to "20020702" and so on
> startpos=startpos+(last place of pattern match)
> output result for this date
> }

None of this will compile. What is this garbage?

Have you ever heard of indentation?

Use up to four spaces indent per logical level for Usenet posts.

> This in all tends to use one to two minutes per run of the loop. What I

This all tends not to run at all, since it won't compile in the first place.

> would like to do is to a) either preprocess the file such that I get an
> input file where only numbers are present or b) change the read call at
> label 10 so that it only reads numbers instead of all next characters.
> Thank you so much for your help!!

Once again, since you seem to have ignored or missed it the first couple of
times people have told you, read in the file sequentially with a relatively
large buffer (say, 8 MiB? 16?).
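
For what it's worth, a sequential read with a large buffer might look something like the sketch below. The class and method names are invented for illustration, and it assumes that a date is a run of exactly eight digits and that per-date counts are what is wanted. Digit runs are carried across chunk boundaries so that a date split between two reads is still counted.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Map;
import java.util.TreeMap;

public class ChunkedDateCount {
    static final int DATE_LENGTH = 8; // assumption: dates look like 20020701

    // Reads the input sequentially in chunks of bufferChars characters and
    // counts how often each run of exactly eight digits occurs.
    static Map<String, Integer> count(Reader in, int bufferChars) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        char[] buf = new char[bufferChars];
        char[] run = new char[DATE_LENGTH]; // first eight digits of the current run
        int runLength = 0;                  // capped at DATE_LENGTH + 1
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                if (c >= '0' && c <= '9') {
                    if (runLength < DATE_LENGTH) run[runLength] = c;
                    if (runLength <= DATE_LENGTH) runLength++;
                } else {
                    if (runLength == DATE_LENGTH) counts.merge(new String(run), 1, Integer::sum);
                    runLength = 0;
                }
            }
        }
        // A date may end exactly at end of file.
        if (runLength == DATE_LENGTH) counts.merge(new String(run), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // 4 Mi chars is roughly 8 MiB of heap; FileReader uses the platform
        // default encoding.
        try (Reader in = new FileReader(args[0])) {
            for (Map.Entry<String, Integer> e : count(in, 4 * 1024 * 1024).entrySet()) {
                System.out.println(e.getKey() + " " + e.getValue());
            }
        }
    }
}
```

Memory use is bounded by the buffer plus one map entry per distinct date, so the heap size does not need to grow with the file.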

Why do you disregard substantial portions of the advice several people have
given you, then come back with the exact same question again and again?

--
Lew

From: Eric Sosman on 9 May 2010 12:02

On 5/9/2010 11:35 AM, Lew wrote:
> Hakan wrote:
>> Sorry about the mistake, but the file is actually 13 GB. I can read to a
>
> Well, now, that's a horse of a different color.
>
>> character array buffering about 30 million characters before the heap
>
> Allocate a larger -Xmx.

... and buy at least 32GB of RAM. (In other words: Not an
economical approach; cheaper -- and probably faster -- approaches
exist and have been used for decades.)

--
Eric Sosman
esosman(a)ieee-dot-org.invalid

From: Arne Vajhøj on 9 May 2010 12:46

On 09-05-2010 02:57, Hakan wrote:
> Sorry about the mistake, but the file is actually 13 GB.

That will take some time to process.

> I can read to a
> character array buffering about 30 million characters before the heap
> space is overflowed.

You can increase heap space using -Xmx1g or other size, but buffers
larger than 30 Mchar will not improve performance significantly
(assuming sequential processing).

> The sscce site is down and not accessible when I tried. What I have been
> doing so far is something like this in rough code:
>
> static int nchars=27000000;
> int startpos=0;
> File readfile="../x.txt";
> FileReader frd=new File;
> String searchs="20020701";
> char[] arr=new char[nchars];
>
> while (more dates to search for)
> {
> frd=new FileReader(readfile); /*reopen file
> frd.skip(startpos); /*move to file pointer where final place of last
> date was found
> frd.read(arr,0,nchars); /*10
> find number of date occurrences in arr with pattern matching
> update searchs (first time to "20020702" and so on
> startpos=startpos+(last place of pattern match)
> output result for this date
> }
>
> This in all tends to use one to two minutes per run of the loop. What I
> would like to do is to a) either preprocess the file such that I get an
> input file where only numbers are present or b) change the read call at
> label 10 so that it only reads numbers instead of all next characters.

The above code is not precise enough for us to see what
could be the bottleneck.

As someone else stated, the skip trick looks very suspicious.

Arne
