From: Hakan on 8 May 2010 14:11

I'd like to read only numbers from an extremely big file containing both characters and digits. It turns out that a) reading each character with a RandomAccessFile is too slow and b) a StreamTokenizer did not work, as it has irregular delimiters for some reason.

What is the best way? I've been looking at overriding read in a subclass of FilterReader, but I am not sure if it is the best way, how to do it and if it will be fast enough.

Thanks in advance.
From: markspace on 8 May 2010 14:28

Hakan wrote:
> I'd like to read only numbers from an extremely big file containing both
> characters and digits. It turns out that a) reading each character with
> a RandomAccessFile is too slow

I think a tightly scoped SSCCE is needed here. "Extremely big" and "too slow" are such vague and relative terms that there's not really much we can do if we don't know what sort of performance target we're trying to hit.

SSCCE with the access times you are seeing, plus your desired performance improvement, would be the best.
From: Hakan on 8 May 2010 14:55

markspace wrote:
> Hakan wrote:
>>
>> I'd like to read only numbers from an extremely big file containing both
>> characters and digits. It turns out that a) reading each character with
>> a RandomAccessFile is too slow
>
> I think a tightly scoped SSCCE is needed here. "Extremely big" and "too
> slow" are such vague and relative terms that there's not really much we
> can do if we don't know what sort of performance target we're trying to hit.
>
> SSCCE with the access times you are seeing, plus your desired
> performance improvement, would be the best.

The text file has a size in the range of 13.7 MB. No matter what access times I have on an individual read, it will take immense amounts of time unless I find the smartest way to preprocess it and filter out all non-digits.

Thanks.
From: Lew on 8 May 2010 15:15

Hakan wrote:
>>> I'd like to read only numbers from an extremely big file containing
>>> both characters and digits. It turns out that a) reading each
>>> character with a RandomAccessFile is too slow

markspace wrote:
>> I think a tightly scoped SSCCE is needed here. "Extremely big" and
>> "too slow" are such vague and relative terms that there's not really
>> much we can do if we don't know what sort of performance target we're
>> trying to hit.
>
>> SSCCE with the access times you are seeing, plus your desired
>> performance improvement, would be the best.

Hakan wrote:
> The text file has a size in the range of 13.7 MB. No matter what access
> times I have on an individual read, it will take immense amounts of time
> unless I find the smartest way to preprocess it and filter out all
> non-digits. Thanks.

First, 13.7 MB isn't so terribly large.

Second, markspace specifically asked for hard numbers and pointed out that adjectives like "extremely big" are not terribly meaningful, yet you ignored that advice and the request and simply provided another vague adjective, "immense", without any indication of what your target performance is.

Third, he asked for an SSCCE, which you also ignored completely.

Given all that, you make it impossible to help you, but let me try anyway. I'm just a great guy that way. But you're still going to have to provide an SSCCE. Read
<http://sscce.org/>
to learn about that.

You mentioned that "reading each character with a RandomAccessFile is too slow". OK, then don't do that! Stream the data in, using a large block size for the read, for example, using
<http://java.sun.com/javase/6/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader, int)>
to establish the stream.

At that point your search for digits is nearly all memory-bound. On most modern systems you should be able to fit the entire 13.7 MB in memory at once, eliminating I/O as a limiting factor. Now you just need an efficient algorithm.

Perhaps a state machine that scans your 13.7 MB in-memory buffer and spits out sequences of digits to a handler, somewhat the way XML SAX parsers handle searches for tags, would be useful.

Now for the best piece of advice when asking for help from Usenet:
<http://sscce.org/>
<http://sscce.org/>
<http://sscce.org/>

--
Lew
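For illustration, the state-machine idea described above could look roughly like the following Java sketch. The names `DigitScanner`, `NumberHandler` and `scanToList` are made up for this example, and the 64K-character buffer is an arbitrary "large" size, not a measured optimum. The sketch assumes that a "number" is simply a run of ASCII digits 0-9; signs, decimal points and exponents would need extra states.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class DigitScanner {

    /** Hypothetical callback, loosely in the spirit of a SAX handler. */
    interface NumberHandler {
        void number(String digits);
    }

    /** Two-state machine: either inside a run of digits or outside one. */
    static void scan(Reader source, NumberHandler handler) throws IOException {
        // Large buffer so the underlying source is read in big blocks.
        BufferedReader in = new BufferedReader(source, 1 << 16);
        StringBuilder run = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c >= '0' && c <= '9') {
                run.append((char) c);            // state: inside a number
            } else if (run.length() > 0) {
                handler.number(run.toString());  // leaving a number: emit it
                run.setLength(0);
            }
        }
        if (run.length() > 0) {
            handler.number(run.toString());      // number that ends at end of input
        }
    }

    /** Convenience wrapper for small in-memory inputs. */
    static List<String> scanToList(String text) {
        List<String> out = new ArrayList<>();
        try {
            scan(new StringReader(text), out::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);   // not expected for a StringReader
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scanToList("abc123def45 6")); // prints [123, 45, 6]
    }
}
```

For a real file, `scan` would presumably be handed something like a `FileReader` or an `InputStreamReader` with an explicit charset instead of the `StringReader` used here.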
From: Robert Klemme on 8 May 2010 15:15

On 08.05.2010 20:55, Hakan wrote:
> markspace wrote:
>
>> Hakan wrote:
>>>
>>> I'd like to read only numbers from an extremely big file containing
>>> both characters and digits. It turns out that a) reading each
>>> character with a RandomAccessFile is too slow
>
>> I think a tightly scoped SSCCE is needed here. "Extremely big" and
>> "too slow" are such vague and relative terms that there's not really
>> much we can do if we don't know what sort of performance target we're
>> trying to hit.
>
>> SSCCE with the access times you are seeing, plus your desired
>> performance improvement, would be the best.
>
> The text file has a size in the range of 13.7 MB. No matter what access
> times I have on an individual read, it will take immense amounts of time
> unless I find the smartest way to preprocess it and filter out all
> non-digits. Thanks.

I have no idea what you want to do with those characters, but what's wrong with reading the file from beginning to end with a fixed buffer size and inspecting the buffer? You won't get much more efficient than that unless you have information about the file's format that can be exploited.

Btw, I don't even think that reading the whole file into memory and processing it there is completely ruled out yet. 28MB (which you need for character data) is not much on modern operating systems. Granted, you then should run your VM with more than the default memory sizes, but that's not really a big deal. But you should do that only if you really have the need to jump back and forth in the file.

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
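A rough Java sketch of the fixed-buffer approach suggested above follows. The class name `ChunkedDigitCount` and its methods are hypothetical, and counting digit runs is only a stand-in for whatever is actually meant to happen with the numbers. The detail worth noting is that the "inside a number" flag has to survive from one chunk to the next, because a number may straddle a buffer boundary.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ChunkedDigitCount {

    /** Counts runs of ASCII digits, reading the source in fixed-size chunks. */
    static long countNumbers(Reader in, int bufferSize) throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        char[] buf = new char[bufferSize];
        boolean inNumber = false;   // carried across chunk boundaries
        long count = 0;
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            for (int i = 0; i < n; i++) {
                boolean digit = buf[i] >= '0' && buf[i] <= '9';
                if (digit && !inNumber) {
                    count++;        // first digit of a new run
                }
                inNumber = digit;
            }
        }
        return count;
    }

    /** Convenience wrapper for small in-memory inputs. */
    static long countNumbers(String text, int bufferSize) {
        try {
            return countNumbers(new StringReader(text), bufferSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);  // not expected for a StringReader
        }
    }

    public static void main(String[] args) {
        // A deliberately tiny buffer, so that "12" and "345" straddle chunk boundaries.
        System.out.println(countNumbers("a12b345 6", 2)); // prints 3
    }
}
```

Memory use here is bounded by the buffer size rather than the file size, so the default VM heap should be enough. The 28MB figure for the in-memory alternative is consistent with Java storing each `char` in two bytes, assuming the 13.7 MB file uses a single-byte encoding.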