From: Hakan on

I'd like to read only numbers from an extremely big file containing
both characters and digits. It turns out that a) reading each character
with a RandomAccessFile is too slow and b) a StreamTokenizer did not
work, as the file has irregular delimiters for some reason. What is the best
way? I've been looking at overriding read in a subclass of FilterReader,
but I am not sure if it is the best way, how to do it and if it will be
fast enough. Thanks in advance.



--
Newsoffice.de - The online software for reading and writing on Usenet
The signature can be customized as you like ;-)
From: markspace on
Hakan wrote:
>
> I'd like to read only numbers from an extremely big file containing both
> characters and digits. It turns out that a) reading each character with
> a RandomAccessFile is too slow

I think a tightly scoped SSCCE is needed here. "Extremely big" and "too
slow" are such vague and relative terms that there's not really much we
can do if we don't know what sort of performance target we're trying to hit.

SSCCE with the access times you are seeing, plus your desired
performance improvement, would be the best.
From: Hakan on
markspace wrote:

> Hakan wrote:
>>
>> I'd like to read only numbers from an extremely big file containing both
>> characters and digits. It turns out that a) reading each character with
>> a RandomAccessFile is too slow

> I think a tightly scoped SSCCE is needed here. "Extremely big" and "too
> slow" are such vague and relative terms that there's not really much we
> can do if we don't know what sort of performance target we're trying to hit.

> SSCCE with the access times you are seeing, plus your desired
> performance improvement, would be the best.

The text file has a size in the range of 13.7 MB. No matter what access
times I have on an individual read, it will take immense amounts of time
unless I find the smartest way to preprocess it and filter out all
non-digits. Thanks.

From: Lew on
Hakan wrote:
>>> I'd like to read only numbers from an extremely big file containing
>>> both characters and digits. It turns out that a) reading each
>>> character with a RandomAccessFile is too slow

markspace wrote:
>> I think a tightly scoped SSCCE is needed here. "Extremely big" and
>> "too slow" are such vague and relative terms that there's not really
>> much we can do if we don't know what sort of performance target we're
>> trying to hit.
>
>> SSCCE with the access times you are seeing, plus your desired
>> performance improvement, would be the best.

Hakan wrote:
> The text file has a size in the range of 13.7 MB. No matter what access
> times I have on an individual read, it will take immense amounts of time
> unless I find the smartest way to preprocess it and filter out all
> non-digits. Thanks.

First, 13.7 MB isn't so terribly large. Second, markspace specifically asked
for hard numbers and pointed out that adjectives like "extremely big" are not
terribly meaningful, yet you ignored that advice and the request and simply
provided another vague adjective, "immense", without any indication of what
your target performance is. Third, he asked for an SSCCE, which you also
ignored completely.

Given all that, you make it impossible to help you, but let me try anyway.
I'm just a great guy that way.

But you're still going to have to provide an SSCCE. Read
<http://sscce.org/>
to learn about that.

You mentioned that "reading each character with a RandomAccessFile is too
slow". OK, then don't do that! Stream the data in, using a large block size
for the read, for example, using
<http://java.sun.com/javase/6/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader, int)>
to establish the stream.
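
A minimal sketch of that suggestion, for illustration only. The class
name, the method name and the 64 K buffer size are assumptions, not
something from this thread; the only API relied on is the
BufferedReader(Reader, int) constructor linked above. It merely counts
digit characters, to show that per-character reads are cheap once they
are served from a large in-memory buffer.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper, not from the thread: counts ASCII digits in a stream.
final class BufferedDigitCount {
    // Assumed block size of 64 K chars; the right value should be measured.
    static final int BUFFER_CHARS = 64 * 1024;

    static long countDigits(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source, BUFFER_CHARS);
        try {
            long digits = 0;
            int c;
            // read() is answered from the buffer; the underlying source is
            // only consulted when the buffer has to be refilled.
            while ((c = in.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    digits++;
                }
            }
            return digits;
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // FileReader decodes with the platform default charset (assumption).
        System.out.println(countDigits(new FileReader(args[0])));
    }
}
```

Timing this against the RandomAccessFile version on the real file would
give the hard numbers asked for earlier in the thread.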

At that point your search for digits is nearly all memory-bound. On most
modern systems you should be able to fit the entire 13.7 MB in memory at once,
eliminating I/O as a limiting factor.

Now you just need an efficient algorithm. Perhaps a state machine that scans
your 13.7 MB in-memory buffer and spits out sequences of digits to a handler,
somewhat the way XML SAX parsers handle searches for tags, would be useful.
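
One possible shape for such a state machine, as a sketch under
assumptions: the interface NumberHandler and the class DigitScanner are
made-up names, only the ASCII digits '0' to '9' count as digits, and
signs or decimal points are ignored. The scanner has two states, inside
a digit run or outside one, and calls the handler each time a run ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback, loosely modelled on SAX-style event handlers.
interface NumberHandler {
    void number(char[] buf, int start, int length);
}

// Hypothetical scanner: reports every maximal run of ASCII digits.
final class DigitScanner {
    static void scan(char[] buf, int length, NumberHandler handler) {
        int start = -1; // -1 means "outside a number", else index where the run began
        for (int i = 0; i < length; i++) {
            boolean digit = buf[i] >= '0' && buf[i] <= '9';
            if (digit && start < 0) {
                start = i;                              // outside -> inside
            } else if (!digit && start >= 0) {
                handler.number(buf, start, i - start);  // inside -> outside
                start = -1;
            }
        }
        if (start >= 0) {
            // The buffer ended while still inside a number.
            handler.number(buf, start, length - start);
        }
    }

    public static void main(String[] args) {
        char[] data = "abc123 x4,56yz789".toCharArray();
        final List<String> found = new ArrayList<String>();
        scan(data, data.length, new NumberHandler() {
            public void number(char[] buf, int start, int length) {
                found.add(new String(buf, start, length));
            }
        });
        System.out.println(found); // prints [123, 4, 56, 789]
    }
}
```

The handler receives offsets into the buffer rather than a new String,
so a caller that only needs to parse or count the numbers can avoid
allocating one object per match.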

Now for the best piece of advice when asking for help from Usenet:

<http://sscce.org/>
<http://sscce.org/>
<http://sscce.org/>

--
Lew
From: Robert Klemme on
On 08.05.2010 20:55, Hakan wrote:
> markspace wrote:
>
>> Hakan wrote:
>>>
>>> I'd like to read only numbers from an extremely big file containing
>>> both characters and digits. It turns out that a) reading each
>>> character with a RandomAccessFile is too slow
>
>> I think a tightly scoped SSCCE is needed here. "Extremely big" and
>> "too slow" are such vague and relative terms that there's not really
>> much we can do if we don't know what sort of performance target we're
>> trying to hit.
>
>> SSCCE with the access times you are seeing, plus your desired
>> performance improvement, would be the best.
>
> The text file has a size in the range of 13.7 MB. No matter what access
> times I have on an individual read, it will take immense amounts of time
> unless I find the smartest way to preprocess it and filter out all
> non-digits. Thanks.

I have no idea what you want to do with those characters, but what's
wrong with reading the file from beginning to end with a fixed buffer
size and inspecting the buffer? You won't get much more efficient than
that unless you have information about the file's format that can be
exploited.
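
A sketch of that fixed-buffer loop, with hypothetical names
(ChunkedNumberReader, readNumbers) and the assumption that a "number" is
a run of ASCII digits. The one subtlety is a number that straddles two
buffer fills, so the digits collected so far are carried over from one
read to the next.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads in fixed-size chunks and collects digit runs.
final class ChunkedNumberReader {
    static List<String> readNumbers(Reader in, int bufferSize) throws IOException {
        List<String> numbers = new ArrayList<String>();
        char[] buf = new char[bufferSize];           // bufferSize must be > 0
        StringBuilder current = new StringBuilder(); // digits of the number in progress
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                if (c >= '0' && c <= '9') {
                    current.append(c);
                } else if (current.length() > 0) {
                    numbers.add(current.toString());
                    current.setLength(0);
                }
            }
            // 'current' survives into the next iteration, so a number split
            // across a buffer boundary is still reported in one piece.
        }
        if (current.length() > 0) {
            numbers.add(current.toString());
        }
        return numbers;
    }

    public static void main(String[] args) throws IOException {
        // A deliberately tiny buffer forces numbers to span several reads.
        System.out.println(readNumbers(new StringReader("id 12345, qty 67"), 4));
        // prints [12345, 67]
    }
}
```

Collecting the results in a list is only for the demonstration; for the
real file each number could be processed as soon as it is complete.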

Btw, I don't even think that reading the whole file into memory and
processing it there is completely ruled out yet. 28MB (which you need
for character data, since a Java char takes two bytes) is not much on
modern operating systems. Granted, you then should run your VM with
more than the default memory sizes but that's not really a big deal.
But you should do that only if you really have the need to jump back
and forth in the file.
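
For reference, a sketch of the read-everything variant and the memory
arithmetic behind it. The class and method names are hypothetical, and
it assumes a single-byte encoding, so a 13.7 MB file decodes to roughly
13.7 million chars, which at two bytes per char is about 27.4 MB of
character data.

```java
import java.io.CharArrayWriter;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper: loads the complete character stream into one array.
final class WholeFileInMemory {
    static char[] readAll(Reader in) throws IOException {
        CharArrayWriter out = new CharArrayWriter();
        char[] buf = new char[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        // toCharArray() returns a copy, so peak usage is temporarily higher
        // than the size of the final array.
        return out.toCharArray();
    }

    public static void main(String[] args) throws IOException {
        Reader in = new FileReader(args[0]); // platform default charset (assumption)
        try {
            char[] text = readAll(in);
            // Each char occupies two bytes, hence the factor of 2.
            System.out.println(text.length + " chars, about "
                    + (2L * text.length) + " bytes of character data");
        } finally {
            in.close();
        }
    }
}
```

If the default heap turns out to be too small, it can be raised with the
standard -Xmx option, for example java -Xmx128m WholeFileInMemory
file.txt, where the 128 MB figure is only an illustrative guess.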

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/