From: Lew on 9 May 2010 13:46

Hakan wrote:
>>> character array buffering about 30 million characters before the heap

Lew wrote:
>> Allocate a larger -Xmx.

Eric Sosman wrote:
> ... and buy at least 32GB of RAM. (In other words: Not an
> economical approach; cheaper -- and probably faster -- approaches
> exist and have been used for decades.)

You're missing an antecedent. *What* is not an economical approach?

And why does he need to buy 32GB of RAM? Most machines come with at
least 128 MB, and 1 GB is neither unduly expensive nor very uncommon.
There's little reason in most desktop or server environments to max
out at 30M characters.

--
Lew
From: Eric Sosman on 9 May 2010 15:27

On 5/9/2010 1:46 PM, Lew wrote:
> Hakan wrote:
>>>> character array buffering about 30 million characters before the heap
>
> Lew wrote:
>>> Allocate a larger -Xmx.
>
> Eric Sosman wrote:
>> ... and buy at least 32GB of RAM. (In other words: Not an
>> economical approach; cheaper -- and probably faster -- approaches
>> exist and have been used for decades.)
>
> You're missing an antecedent. *What* is not an economical approach?
>
> And why does he need to buy 32GB of RAM?

I seem to have mixed a couple of unrelated sub-threads together,
coming away with the impression that somebody wanted the O.P. to read
the entire 13GB file into *one* buffer, all at once, which would have
taken ~26GB of 16-bit chars all by itself. But hunting back through
the thread I can't find the suggestion spelled out that way, so I
guess my imagination has been running away with me. *If* that
(wasteful) approach were taken, he'd need a lot of RAM. And a 64-bit
JVM, of course.

> Most machines come with at least 128 MB, and 1 GB is neither unduly
> expensive nor very uncommon. There's little reason in most desktop or
> server environments to max out at 30M characters.

Kinda makes you wish the O.P. would let us know what he was doing,
doesn't it? (Sigh.)

--
Eric Sosman
esosman(a)ieee-dot-org.invalid
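For scale: 13G characters at two bytes per Java char is roughly 26 GB
for a single in-memory copy, before any other allocation -- hence the
64-bit JVM and the machine in the 32 GB class. (A single Java array
is indexed by int and tops out at about 2.1 billion elements anyway,
so "one buffer" would really have to be several 2G-char arrays.) A
hypothetical invocation, with a made-up class name and flag values
chosen purely for illustration:

    # 64-bit data model plus a ~30 GB heap ceiling: only viable on a
    # machine with ~32 GB of RAM
    java -d64 -Xmx30g BigFileScanner

    # The economical alternative: a modest heap and a streaming read
    java -Xmx512m BigFileScanner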
From: bugbear on 10 May 2010 04:58

Hakan wrote:
> Lew wrote:
>
>> First, 13.7 MB isn't so terribly large. Second, markspace specifically
>> asked for hard numbers and pointed out that adjectives like "extremely
>> big" are not terribly meaningful, yet you ignored that advice and the
>> request and simply provided another vague adjective, "immense",
>> without any indication of what your target performance is. Third, he
>> asked for an SSCCE, which you also ignored completely.
>
>> Given all that, you make it impossible to help you, but let me try
>> anyway. I'm just a great guy that way.
>
>> But you're still going to have to provide an SSCCE. Read
>> <http://sscce.org/>
>> to learn about that.
>
>> You mentioned that "reading each character with a RandomAccessFile is
>> too slow". OK, then don't do that! Stream the data in, using a large
>> block size for the read, for example, using
>> <http://java.sun.com/javase/6/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader,
>> int)>
>> to establish the stream.
>
>> At that point your search for digits is nearly all memory-bound. On
>> most modern systems you should be able to fit the entire 13.7 MB in
>> memory at once, eliminating I/O as a limiting factor.
>
>> Now you just need an efficient algorithm. Perhaps a state machine
>> that scans your 13.7 MB in-memory buffer and spits out sequences of
>> digits to a handler, somewhat the way XML SAX parsers handle searches
>> for tags, would be useful.
>
>> Now for the best piece of advice when asking for help from Usenet:
>
>> <http://sscce.org/>
>> <http://sscce.org/>
>> <http://sscce.org/>
>
> Sorry about the mistake, but the file is actually 13 GB. I can read
> into a character array buffering about 30 million characters before
> the heap space is overflowed. This is still only a part of the file.
>
> The sscce site was down and not accessible when I tried. What I have
> been doing so far is something like this, in rough code:
>
> static int nchars = 27000000;
> int startpos = 0;
> File readfile = new File("../x.txt");
> FileReader frd;
> String searchs = "20020701";
> char[] arr = new char[nchars];
>
> while (more dates to search for)
> {
>     frd = new FileReader(readfile);  // reopen file
>     frd.skip(startpos);              // move to the file position
>                                      // where the last date was found
>     frd.read(arr, 0, nchars);        // 10
>     // find number of date occurrences in arr with pattern matching
>     // update searchs (first time to "20020702" and so on)
>     // startpos = startpos + (last place of pattern match)
>     // output result for this date
> }
>
> In all, this tends to take one to two minutes per run of the loop.
> What I would like to do is either a) preprocess the file so that I
> get an input file where only numbers are present, or b) change the
> read call at label 10 so that it reads only numbers instead of all
> the next characters. Thank you so much for your help!!

13 GB in two minutes is a throughput of about 110 MB/s,
which doesn't seem ludicrously slow.

What's your target?

BugBear
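For what it's worth, here is a minimal sketch of the single-pass scan
Lew describes: stream the file once through a BufferedReader with a
large buffer and tally every date in that one pass, instead of
reopening and re-skipping the file for each date. It assumes the
dates appear as standalone 8-digit runs; the class name, file name,
and 1 MB buffer size are illustrative guesses, not anything from the
thread.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count every maximal run of exactly 8 digits
// (e.g. "20020701") in one streaming pass over the file.
public class DateCounter {

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        BufferedReader in =
                new BufferedReader(new FileReader("../x.txt"), 1 << 20);
        StringBuilder run = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c >= '0' && c <= '9') {
                if (run.length() < 9) {   // cap growth: a run longer
                    run.append((char) c); // than 8 digits is not a date
                }
            } else {
                tally(counts, run);       // a digit run just ended
            }
        }
        tally(counts, run);               // flush the final run
        in.close();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }

    // Record a completed digit run iff it was exactly 8 digits long,
    // then reset the accumulator for the next run.
    private static void tally(Map<String, Integer> counts,
                              StringBuilder run) {
        if (run.length() == 8) {
            String date = run.toString();
            Integer n = counts.get(date);
            counts.put(date, n == null ? 1 : n + 1);
        }
        run.setLength(0);
    }
}

Reading blocks into a char[] and scanning them would be faster still,
but even this per-character version visits the 13 GB exactly once,
rather than once per date.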
From: Alan Malloy on 10 May 2010 05:57
bugbear wrote:
> Hakan wrote:
>> Lew wrote:
>>> [snipped: advice to stream the data in and to post an SSCCE]
>>
>> Sorry about the mistake, but the file is actually 13 GB. I can read
>> into a character array buffering about 30 million characters before
>> the heap space is overflowed. This is still only a part of the file.
>>
>> [snipped: the rough code of the per-date read loop]
>>
>> In all, this tends to take one to two minutes per run of the loop.
>
> 13 GB in two minutes is a throughput of about 110 MB/s,
> which doesn't seem ludicrously slow.
>
> What's your target?
>
> BugBear

He's complaining that he's spending two minutes in each iteration of
his loop, but the loop reads only 27 MB per iteration.

--
Cheers,
Alan (San Jose, California, USA)
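To put numbers on Alan's correction (taking one byte per character on
disk):

    27 MB per iteration / 60-120 s  ~=  0.2 to 0.45 MB/s effective

versus the ~110 MB/s bugbear computed by assuming the whole 13 GB
went by in those two minutes. Much of the gap is re-reading:
FileReader cannot seek, and Reader.skip() works by decoding and
discarding characters, so each iteration's freshly opened FileReader
re-reads everything before startpos. The further the scan advances
into the 13 GB file, the more each iteration costs -- exactly the
pattern a single-pass design avoids.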