Reading from very large file [Java Programming]

Prev: Business Calendar
Next: Computing sales taxes

From: Lew on 9 May 2010 13:46

Hakan wrote:
>>> character array buffering about 30 million characters before the heap

Lew wrote:
>> Allocate a larger -Xmx.

Eric Sosman wrote:
> ... and buy at least 32GB of RAM. (In other words: Not an
> economical approach; cheaper -- and probably faster -- approaches
> exist and have been used for decades.)

You're missing an antecedent. *What* is not an economical approach?

And why does he need to buy 32GB of RAM?

Most machines come with at least 128 MB, and 1 GB is neither unduly expensive
nor very uncommon. There's little reason in most desktop or server
environments to max out at 30M characters.

--
Lew

From: Eric Sosman on 9 May 2010 15:27

On 5/9/2010 1:46 PM, Lew wrote:
> Hakan wrote:
>>>> character array buffering about 30 million characters before the heap
>
> Lew wrote:
>>> Allocate a larger -Xmx.
>
> Eric Sosman wrote:
>> ... and buy at least 32GB of RAM. (In other words: Not an
>> economical approach; cheaper -- and probably faster -- approaches
>> exist and have been used for decades.)
>
> You're missing an antecedent. *What* is not an economical approach?
>
> And why does he need to buy 32GB of RAM?

I seem to have mixed a couple unrelated sub-threads together,
coming up with the impression that somebody wanted the O.P. to read
the entire 13GB file into *one* buffer, all at once, which would
have taken ~26GB of 16-bit chars all by itself. But hunting back
through the thread I can't find the suggestion spelled out this way,
so I guess my imagination has been running away with me. *If* that
(wasteful) approach were taken, he'd need a lot of RAM. And a 64-bit
JVM, of course.

> Most machines come with at least 128 MB, and 1 GB is neither unduly
> expensive nor very uncommon. There's little reason in most desktop or
> server environments to max out at 30M characters.

Kinda makes you wish the O.P. would let us know what he was
doing, doesn't it? (Sigh.)

--
Eric Sosman
esosman(a)ieee-dot-org.invalid
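
The memory figures in this exchange can be checked with a little arithmetic. The following is a minimal sketch, not anything posted in the thread: the class name `HeapEstimate` and the method `charArrayBytes` are hypothetical, and the 13 GB case assumes one byte per character in the file, so the character count equals the byte count. It requires Java 8 or later for `Character.BYTES`.

```java
class HeapEstimate {
    /** Bytes of payload for a char[] of the given length (2 bytes per char, headers ignored). */
    static long charArrayBytes(long chars) {
        return chars * Character.BYTES;
    }

    public static void main(String[] args) {
        // Maximum heap the running JVM will attempt to use (governed by -Xmx).
        long maxHeap = Runtime.getRuntime().maxMemory();
        // The O.P.'s buffer of about 30 million characters.
        long buffer = charArrayBytes(30_000_000L);
        // The whole 13 GB file held as chars (assumes a single-byte encoding on disk).
        long wholeFile = charArrayBytes(13L << 30);

        System.out.printf("JVM max heap:           %,d bytes%n", maxHeap);
        System.out.printf("30M-char buffer:        %,d bytes%n", buffer);
        System.out.printf("13 GB file as char[]:   %,d bytes%n", wholeFile);
    }
}
```

A 30-million-character buffer needs about 60 MB, so it fits once the heap limit is raised modestly. The whole file as chars needs about 26 GiB, and in any case a single Java array cannot hold more than `Integer.MAX_VALUE` elements, so "one buffer" would not be possible as a single `char[]`.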

From: bugbear on 10 May 2010 04:58

Hakan wrote:
> Lew wrote:
>
>
>
>> First, 13.7 MB isn't so terribly large. Second, markspace specifically
>> asked for hard numbers and pointed out that adjectives like "extremely
>> big" are not terribly meaningful, yet you ignored that advice and the
>> request and simply provided another vague adjective, "immense",
>> without any indication of what your target performance is. Third, he
>> asked for an SSCCE, which you also ignored completely.
>
>> Given all that, you make it impossible to help you, but let me try
>> anyway. I'm just a great guy that way.
>
>> But you're still going to have to provide an SSCCE. Read
>> <http://sscce.org/>
>> to learn about that.
>
>> You mentioned that "reading each character with a RandomAccessFile is
>> too slow". OK, then don't do that! Stream the data in, using a large
>> block size for the read, for example, using
>> <http://java.sun.com/javase/6/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader, int)>
>> to establish the stream.
>
>> At that point your search for digits is nearly all memory-bound. On
>> most modern systems you should be able to fit the entire 13.7 MB in
>> memory at once,
>> eliminating I/O as a limiting factor.
>
>> Now you just need an efficient algorithm. Perhaps a state machine
>> that scans your 13.7 MB in-memory buffer and spits out sequences of
>> digits to a handler, somewhat the way XML SAX parsers handle searches
>> for tags, would be useful.
>
>> Now for the best piece of advice when asking for help from Usenet:
>
>> <http://sscce.org/>
>> <http://sscce.org/>
>> <http://sscce.org/>
>
> Sorry about the mistake, but the file is actually 13 GB. I can read to a
> character array buffering about 30 million characters before the heap
> space is overflowed. This is still only a part of the file.
>
> The sscce site is down and not accessible when I tried. What I have been
> doing so far is something like this in rough code:
>
> static int nchars = 27000000;
> int startpos = 0;
> File readfile = new File("../x.txt");
> FileReader frd;
> String searchs = "20020701";
> char[] arr = new char[nchars];
>
> while (more dates to search for)
> {
>     frd = new FileReader(readfile); // reopen file
>     frd.skip(startpos); // move to the position where the last match for the previous date was found
>     frd.read(arr, 0, nchars); // label 10
>     // find number of date occurrences in arr with pattern matching
>     // update searchs (first time to "20020702" and so on)
>     startpos = startpos + (last place of pattern match);
>     // output result for this date
> }
>
> This in all tends to use one to two minutes per run of the loop. What I
> would like to do is to a) either preprocess the file such that I get an
> input file where only numbers are present or b) change the read call at
> label 10 so that it only reads numbers instead of all next characters.
> Thank you so much for your help!!

13 GB in two minutes is a throughput of roughly 110 MB/s,
which doesn't seem ludicrously slow.

What's your target?

BugBear
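
Lew's suggestion quoted above (stream the data through a large buffer and let a small state machine hand digit sequences to a handler) could look roughly like the following. This is a sketch under stated assumptions, not code from the thread: the names `DigitRunScanner`, `DigitRunHandler`, `scan`, and `countDates` are hypothetical, it assumes each date appears as a standalone run of exactly eight digits, and it uses Java 8 lambdas, which postdate the thread. It reads the file once and counts every date in that single pass, instead of reopening the file for each date.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Map;
import java.util.TreeMap;

class DigitRunScanner {
    /** Hypothetical callback: receives each maximal run of digits found in the input. */
    interface DigitRunHandler {
        // The CharSequence is reused between calls; copy it if you need to keep it.
        void onDigits(CharSequence digits);
    }

    /** Two-state scanner: either inside a digit run or outside one. */
    static void scan(Reader in, DigitRunHandler handler) throws IOException {
        char[] buf = new char[1 << 16];
        StringBuilder run = new StringBuilder();
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                if (c >= '0' && c <= '9') {
                    run.append(c);            // stay in (or enter) the digit state
                } else if (run.length() > 0) {
                    handler.onDigits(run);    // a run just ended
                    run.setLength(0);
                }
            }
        }
        if (run.length() > 0) {
            handler.onDigits(run);            // run that ends at end of input
        }
    }

    /** Counts every 8-digit run (assumed to be a yyyyMMdd date) in one pass. */
    static Map<String, Integer> countDates(Reader in) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        scan(in, digits -> {
            if (digits.length() == 8) {
                counts.merge(digits.toString(), 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the input path, e.g. the O.P.'s "../x.txt"; 1 MB buffer is an arbitrary choice.
        try (Reader in = new BufferedReader(new FileReader(args[0]), 1 << 20)) {
            countDates(in).forEach((date, count) -> System.out.println(date + " " + count));
        }
    }
}
```

Because the digit run is kept in a `StringBuilder` across reads, a date that straddles two buffer fills is still reported whole. Memory use stays small regardless of file size, apart from the map of counts.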

From: Alan Malloy on 10 May 2010 05:57

bugbear wrote:
> Hakan wrote:
>> Lew wrote:
>>
>>
>>
>>> First, 13.7 MB isn't so terribly large. Second, markspace
>>> specifically asked for hard numbers and pointed out that adjectives
>>> like "extremely big" are not terribly meaningful, yet you ignored
>>> that advice and the request and simply provided another vague
>>> adjective, "immense", without any indication of what your target
>>> performance is. Third, he asked for an SSCCE, which you also ignored
>>> completely.
>>
>>> Given all that, you make it impossible to help you, but let me try
>>> anyway. I'm just a great guy that way.
>>
>>> But you're still going to have to provide an SSCCE. Read
>>> <http://sscce.org/>
>>> to learn about that.
>>
>>> You mentioned that "reading each character with a RandomAccessFile is
>>> too slow". OK, then don't do that! Stream the data in, using a
>>> large block size for the read, for example, using
>>> <http://java.sun.com/javase/6/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader, int)>
>>> to establish the stream.
>>
>>> At that point your search for digits is nearly all memory-bound. On
>>> most modern systems you should be able to fit the entire 13.7 MB in
>>> memory at once, eliminating I/O as a limiting factor.
>>
>>> Now you just need an efficient algorithm. Perhaps a state machine
>>> that scans your 13.7 MB in-memory buffer and spits out sequences of
>>> digits to a handler, somewhat the way XML SAX parsers handle searches
>>> for tags, would be useful.
>>
>>> Now for the best piece of advice when asking for help from Usenet:
>>
>>> <http://sscce.org/>
>>> <http://sscce.org/>
>>> <http://sscce.org/>
>>
>> Sorry about the mistake, but the file is actually 13 GB. I can read to
>> a character array buffering about 30 million characters before the
>> heap space is overflowed. This is still only a part of the file.
>>
>> The sscce site is down and not accessible when I tried. What I have
>> been doing so far is something like this in rough code:
>>
>> static int nchars = 27000000;
>> int startpos = 0;
>> File readfile = new File("../x.txt");
>> FileReader frd;
>> String searchs = "20020701";
>> char[] arr = new char[nchars];
>>
>> while (more dates to search for)
>> {
>>     frd = new FileReader(readfile); // reopen file
>>     frd.skip(startpos); // move to the position where the last match for the previous date was found
>>     frd.read(arr, 0, nchars); // label 10
>>     // find number of date occurrences in arr with pattern matching
>>     // update searchs (first time to "20020702" and so on)
>>     startpos = startpos + (last place of pattern match);
>>     // output result for this date
>> }
>>
>> This in all tends to use one to two minutes per run of the loop. What
>> I would like to do is to a) either preprocess the file such that I get
>> an input file where only numbers are present or b) change the read
>> call at label 10 so that it only reads numbers instead of all next
>> characters. Thank you so much for your help!!
>
>
> 13 GB in two minutes is a throughput of roughly 110 MB/s,
> which doesn't seem ludicrously slow.
>
> What's your target?
>
> BugBear

He's complaining that he's spending two minutes in each iteration of his
loop; the loop reads only 27MB per iteration.

--
Cheers,
Alan (San Jose, California, USA)
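
The gap between the two readings of the O.P.'s numbers is easy to quantify. This is a minimal sketch with a hypothetical class `ThroughputCheck`; it takes "13 GB" as 13e9 bytes and "two minutes" as 120 seconds, both of which are simplifying assumptions.

```java
class ThroughputCheck {
    /** Plain rate: amount processed divided by elapsed seconds. */
    static double perSecond(double amount, double seconds) {
        return amount / seconds;
    }

    public static void main(String[] args) {
        // bugbear's reading: the whole 13 GB file in one two-minute pass.
        double wholeFile = perSecond(13e9, 120);
        // Alan's reading: one 27,000,000-char buffer per two-minute iteration.
        double perIteration = perSecond(27_000_000, 120);

        System.out.printf("Whole file in 2 min: %.0f MB/s%n", wholeFile / 1e6);
        System.out.printf("27M chars in 2 min:  %.0f chars/s%n", perIteration);
    }
}
```

On the second reading the loop delivers only a few hundred thousand new characters per second. A likely contributor is the `skip(startpos)` call after each reopen: a `Reader` generally has to read and decode everything it skips, so each iteration probably re-reads the file from the start up to `startpos` before reading the new 27 million characters.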
