From: Eric Sosman on
On 5/8/2010 2:55 PM, Hakan wrote:
> markspace wrote:
>
>> Hakan wrote:
>>>
>>> I'd like to read only numbers from an extremely big file containing
>>> both characters and digits. It turns out that a) reading each
>>> character with a RandomAccessFile is too slow
>
>> I think a tightly scoped SSCCE is needed here. "Extremely big" and
>> "too slow" are such vague and relative terms that there's not really
>> much we can do if we don't know what sort of performance target we're
>> trying to hit.
>
>> SSCCE with the access times you are seeing, plus your desired
>> performance improvement, would be the best.
>
> The text file has a size in the range of 13.7 MB. No matter what access
> times I have on an individual read, it will take immense amounts of time
> unless I find the smartest way to preprocess it and filter out all
> non-digits. Thanks.

"Immense amounts of time?" Let's assume a really awful situation,
where the file is maximally fragmented into 8KB chunks scattered non-
contiguously all around the disk. That is, assume the system can read
only 8KB at a time and then must make a long, slow seek to somewhere
else for the next 8KB: no read-ahead, no caching, just raw I/O. You'll
need about 1750 operations to read the 13.7MB file. If each one takes
a nice, leisurely 20 ms, it'll take you 35 seconds to pull the whole
business in. "Immense amounts of time?" You've already spent that
much and more in asking how to speed it up!

Follow markspace's advice: Show some code, and show some actual
measurements. Without those, nobody is likely to take your fears
seriously.
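
A bare-bones timer like this (an untested sketch; it takes the file's
path as its first argument) would turn "immense" into an actual number:

<code>
import java.io.FileInputStream;
import java.io.IOException;

/** Times one sequential pass over a file, 8 KB at a time. */
public class ReadTimer {

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        FileInputStream in = new FileInputStream(args[0]);
        try {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            System.out.println(total + " bytes");
        } finally {
            in.close();
        }
        System.out.println((System.nanoTime() - start) / 1000000d + " ms");
    }
}
</code>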

--
Eric Sosman
esosman(a)ieee-dot-org.invalid
From: Arne Vajhøj on
On 08-05-2010 14:11, Hakan wrote:
> I'd like to read only numbers from an extremely big file containing both
> characters and digits. It turns out that a) reading each character with
> a RandomAccessFile is too slow and b) a StreamTokenizer did not work, as
> it has irregular delimiters for some reason. What is the best way? I've
> been looking at overriding read in a subclass of FilterReader, but I am
> not sure if it is the best way, how to do it and if it will be fast
> enough. Thanks in advance.

To get good performance you need to read huge chunks
of data from disk to Java memory.
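
For example (just a sketch; the one-megabyte buffer is arbitrary, and the
file path comes in as the first argument):

<code>
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class ChunkedRead {

    public static void main(String[] args) throws IOException {
        Reader in = new FileReader(args[0]);
        try {
            char[] buf = new char[1024 * 1024]; // one big chunk per read
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                // scan buf[0..n) for digits here
            }
        } finally {
            in.close();
        }
    }
}
</code>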

To get the program working you need to code correctly
according to the format and the requirements. Which we
have no information about here.

Arne

From: markspace on
Hakan wrote:

>
> The text file has a size in the range of 13.7 MB.


Can you write a short program to generate a file that "looks like" your
data so we can see what you are dealing with? Make it an SSCCE.
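
Something like this would do (the mix of short words and eight-digit,
date-like numbers is only a guess at what your data looks like):

<code>
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

/** Writes sample data: random words mixed with date-like numbers. */
public class SampleData {

    public static void main(String[] args) throws IOException {
        Random rnd = new Random(42); // fixed seed, reproducible output
        PrintWriter out = new PrintWriter(new FileWriter("sample.txt"));
        for (int i = 0; i < 100000; i++) {
            if (rnd.nextInt(4) == 0) {
                out.print("2002070" + rnd.nextInt(10)); // date-like token
            } else {
                for (int j = 0; j < 5; j++) {
                    out.print((char) ('a' + rnd.nextInt(26)));
                }
            }
            out.print(i % 10 == 9 ? "\n" : " ");
        }
        out.close();
    }
}
</code>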


> No matter what access
> times I have on an individual read, it will take immense amounts of time
> unless I find the smartest way to preprocess it and filter out all
> non-digits. Thanks.


There is no one "smartest way" to read data. Please tell us what times
you are seeing now, how you are reading the file (as an SSCCE), and what
sort of speed up you are expecting to see.

From: John B. Matthews on
In article <1273342288.02(a)user.newsoffice.de>, Hakan <H.L(a)softhome.net>
wrote:

> b) a StreamTokenizer did not work, as it has irregular delimiters for
> some reason.

As a longtime fan of StreamTokenizer, I'm puzzled by this. The default
delimiter is white space, and whitespaceChars() allows considerable
flexibility.

As for performance, others have suggested a suitable buffer, and I chose
1 MiB. I get correct results for an 87 KiB test file containing 2^14
numbers and a 2.4 MiB dictionary containing no numbers. A 30 MiB file
takes less than two seconds to process.

<console>
$ make run < test.txt
java -cp build/classes cli.TokenizerTest
Count: 16384
83.659 ms
$ make run < /usr/share/dict/words
java -cp build/classes cli.TokenizerTest
Count: 0
230.727 ms
$ make run < classes.jar
java -cp build/classes cli.TokenizerTest
Count: 131906
1689.561 ms
</console>

<code>
package cli;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StreamTokenizer;

/** @author John B. Matthews */
public class TokenizerTest {

    public static void main(String[] args) {
        long start = System.nanoTime();
        tokenize();
        System.out.println(
            (System.nanoTime() - start) / 1000000d + " ms");
    }

    private static void tokenize() {
        StreamTokenizer tokens = new StreamTokenizer(
            new BufferedReader(
                new InputStreamReader(System.in), 1024 * 1024));
        try {
            int count = 0;
            int token = tokens.nextToken();
            while (token != StreamTokenizer.TT_EOF) {
                if (token == StreamTokenizer.TT_NUMBER) {
                    count++;
                }
                token = tokens.nextToken();
            }
            System.out.println("Count: " + count);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
</code>

--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
From: Hakan on
Lew wrote:



> First, 13.7 MB isn't so terribly large. Second, markspace specifically asked
> for hard numbers and pointed out that adjectives like "extremely big" are not
> terribly meaningful, yet you ignored that advice and the request and simply
> provided another vague adjective, "immense", without any indication of what
> your target performance is. Third, he asked for an SSCCE, which you also
> ignored completely.

> Given all that, you make it impossible to help you, but let me try anyway.
> I'm just a great guy that way.

> But you're still going to have to provide an SSCCE. Read
> <http://sscce.org/>
> to learn about that.

> You mentioned that "reading each character with a RandomAccessFile is too
> slow". OK, then don't do that! Stream the data in, using a large block size
> for the read, for example, using
> <http://java.sun.com/javase/6/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader, int)>
> to establish the stream.

> At that point your search for digits is nearly all memory-bound. On most
> modern systems you should be able to fit the entire 13.7 MB in memory at
> once, eliminating I/O as a limiting factor.

> Now you just need an efficient algorithm. Perhaps a state machine that scans
> your 13.7 MB in-memory buffer and spits out sequences of digits to a handler,
> somewhat the way XML SAX parsers handle searches for tags, would be useful.
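>
> Something along these lines, say (a rough sketch; a digit run that spans
> two buffer reads would need to be carried over, which is omitted here):
>
> <code>
> /** Scans buf[0..len) and hands each run of digits to a handler. */
> static void scanDigits(char[] buf, int len) {
>     int start = -1; // start of the current digit run, or -1 if none
>     for (int i = 0; i < len; i++) {
>         boolean digit = buf[i] >= '0' && buf[i] <= '9';
>         if (digit && start < 0) {
>             start = i;                          // a run begins
>         } else if (!digit && start >= 0) {
>             handle(new String(buf, start, i - start));
>             start = -1;                         // the run ends
>         }
>     }
>     if (start >= 0) {                           // run at end of buffer
>         handle(new String(buf, start, len - start));
>     }
> }
>
> static void handle(String digits) {
>     System.out.println(digits); // or count it, match a date, etc.
> }
> </code>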

> Now for the best piece of advice when asking for help from Usenet:

> <http://sscce.org/>
> <http://sscce.org/>
> <http://sscce.org/>

Sorry about the mistake, but the file is actually 13 GB. I can read into
a character array buffering about 30 million characters before the heap
space overflows. This is still only a part of the file.

The SSCCE site was down and not accessible when I tried. What I have
been doing so far is something like this, in rough code:

static int nchars = 27000000;
int startpos = 0;
File readfile = new File("../x.txt");
FileReader frd;
String searchs = "20020701";
char[] arr = new char[nchars];

while (/* more dates to search for */) {
    frd = new FileReader(readfile);  // reopen the file
    frd.skip(startpos);              // move to where the last date was found
    frd.read(arr, 0, nchars);        // (10)
    // find the number of occurrences of the date in arr by pattern matching
    // update searchs (first to "20020702", and so on)
    // startpos = startpos + (position of the last pattern match)
    // output the result for this date
}

All in all, this tends to take one to two minutes per iteration of the
loop. What I would like to do is either a) preprocess the file so that I
get an input file where only numbers are present, or b) change the read
call at label (10) so that it reads only numbers instead of all the
following characters. For a), I have sketched below roughly what I have
in mind. Thank you so much for your help!!
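
Untested, and the file names are placeholders:

<code>
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

/** One pass over the big file: keep digit runs, collapse the rest. */
public class DigitFilter {

    public static void main(String[] args) throws IOException {
        BufferedReader in =
            new BufferedReader(new FileReader("../x.txt"), 1024 * 1024);
        BufferedWriter out =
            new BufferedWriter(new FileWriter("digits.txt"), 1024 * 1024);
        boolean inRun = false; // are we inside a run of digits?
        int c;
        while ((c = in.read()) != -1) {
            if (c >= '0' && c <= '9') {
                out.write(c);
                inRun = true;
            } else if (inRun) {
                out.write(' '); // one separator per run of non-digits
                inRun = false;
            }
        }
        in.close();
        out.close();
    }
}
</code>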




--
Newsoffice.de - The online software for reading and writing on Usenet
The signature can be customized as you like ;-)