From: Arne Vajhøj on
On 10-02-2010 18:59, EJP wrote:
> On 10/02/2010 10:28 PM, Michael Powe wrote:
>> I have a little time to complete this project and I'd like to build
>> something more efficient, that won't peg the CPU for an hour.
>
> Fix your code. It only takes a few seconds to read a file of practically
> any size. In my experience the only way you can take an hour to process
> any file on modern equipment is if you read the whole file into memory
> via concatenation of Strings and then process it, which is the wrong
> approach from every possible point of view. Process a line at a time.

I agree completely with your point.

Huge files may still take time to read from the disk though.
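
For reference, a minimal line-at-a-time sketch (file names and the
transform() stub are placeholders, not the OP's actual code):

import java.io.*;

public class CsvToLog {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("input.csv"));
        BufferedWriter out = new BufferedWriter(new FileWriter("output.log"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(transform(line));   // the real csv -> IIS log conversion goes here
                out.newLine();
            }
        } finally {
            out.close();
            in.close();
        }
    }

    private static String transform(String line) {
        return line; // stand-in for the actual transformation
    }
}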

Arne

From: markspace on
Arne Vajhøj wrote:
> On 10-02-2010 18:59, EJP wrote:
>> On 10/02/2010 10:28 PM, Michael Powe wrote:
>>> I have a little time to complete this project and I'd like to build
>>> something more efficient, that won't peg the CPU for an hour.
>>
>> Fix your code. It only takes a few seconds to read a file of practically
>> any size. In my experience the only way you can take an hour to process
>> any file on modern equipment is if you read the whole file into memory
>> via concatenation of Strings and then process it, which is the wrong
>> approach from every possible point of view. Process a line at a time.
>
> I agree completely with your point.
>
> Huge files may still take time to read from the disk though.


The OP said > 1 GB, so we don't know if he meant up to 2 GB or if he's
talking about 10 GB or 100 GB or 1000 GB. So a little clarification
here would help, I think.
From: Roedy Green on
On Wed, 10 Feb 2010 06:28:14 -0500, Michael Powe
<michael+gnus(a)trollope.org> wrote, quoted or indirectly quoted someone
who said :

>
>I am tasked with writing an application to process some large text
>files, i.e. > 1 GB. The input will be csv and the output will be in the
>format of an IIS web server log.

Do a little benchmark where you do nothing but read the giant file.

If all the time is spent processing the file, there's not much point in
fancy stuff to read the file.
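
Something along these lines, just reading and discarding lines, gives a
rough baseline (a sketch; the file name comes from the command line):

import java.io.*;

public class ReadOnlyBenchmark {
    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            long lines = 0;
            while (in.readLine() != null) {
                lines++;               // read and discard; no processing at all
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(lines + " lines in " + elapsed + " ms");
        } finally {
            in.close();
        }
    }
}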

Usually when things slow to a crawl it is because you have filled RAM
with objects you don't need, and that forces very frequent GC.

Before you start optimising, you first have to prove where the
bottlenecks are.
--
Roedy Green Canadian Mind Products
http://mindprod.com

Every compilable program in a sense works. The problem is with your unrealistic expectations of what it will do.
From: Alex on
On Feb 10, 9:07 pm, Arne Vajhøj <a...(a)vajhoej.dk> wrote:
> Huge files may still take time to read from the disk though.
A lot of time. I tried my skills in the Netflix $1,000,000 contest... on
my computer it took 15 minutes to read their entire data set (while doing
some calculation on it, for example). After I compressed it into a zip
archive, reading and uncompressing it with the same calculation took only
3 minutes.
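
Roughly what that looks like with java.util.zip (a sketch assuming a
single-entry zip archive; names are illustrative):

import java.io.*;
import java.util.zip.ZipInputStream;

public class ZippedRead {
    public static void main(String[] args) throws IOException {
        ZipInputStream zip = new ZipInputStream(
                new BufferedInputStream(new FileInputStream(args[0])));
        try {
            if (zip.getNextEntry() != null) {      // position at the first entry
                BufferedReader in = new BufferedReader(new InputStreamReader(zip));
                String line;
                while ((line = in.readLine()) != null) {
                    // the calculation on each line goes here
                }
            }
        } finally {
            zip.close();
        }
    }
}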

From: Daniel Pitts on
On 2/10/2010 3:28 AM, Michael Powe wrote:
> Hello,
>
> I am tasked with writing an application to process some large text
> files, i.e.> 1 GB. The input will be csv and the output will be in the
> format of an IIS web server log.
>
> I've done this sort of thing before. In the past, I've just
> brute-forced it, with a BufferedReader and BufferedWriter handling the
> input/output line by line.
>
> I have a little time to complete this project and I'd like to build
> something more efficient, that won't peg the CPU for an hour.
>
> My thought was to have a read thread and a write thread and create a
> buffer into which some amount of input would be written; and then, when
> a threshold was reached, the data would be written out.
>
> Is this a good idea? Are there better ways to manage this?
>
> And finally, I need pointers as to how I would create such a buffer.
> The threaded read/write part I can do.
>
> Thanks for any help.
>
> mp
>
Depending on how processor intensive the transformation is, you might
not gain anything from threading.
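
If you do want to try the two-thread read/write idea anyway, a minimal
producer/consumer sketch with a BlockingQueue might look like this (file
names and the queue size are made up; measure before assuming it helps):

import java.io.*;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TwoThreadCopy {
    private static final String EOF = new String("EOF");  // sentinel marking end of input

    public static void main(String[] args) throws Exception {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    BufferedReader in = new BufferedReader(new FileReader("input.csv"));
                    try {
                        String line;
                        while ((line = in.readLine()) != null) {
                            queue.put(line);
                        }
                        queue.put(EOF);
                    } finally {
                        in.close();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        reader.start();

        BufferedWriter out = new BufferedWriter(new FileWriter("output.log"));
        try {
            String line;
            while ((line = queue.take()) != EOF) {   // identity check against the sentinel
                out.write(line);                      // transformation would happen here
                out.newLine();
            }
        } finally {
            out.close();
        }
        reader.join();
    }
}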

If you are using regex to parse, you may be better off optimizing your
regexes, or using hand-coded parsing instead. A naive regex which "works"
may have some performance problems. Using greedy matching where
appropriate is one way to improve performance.
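
For example (a rough sketch; the hand-coded split is simplistic and
ignores quoted CSV fields):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitNotes {
    // Compile once, rather than calling String.split(",") on every line,
    // which re-compiles the pattern on each call.
    private static final Pattern COMMA = Pattern.compile(",");

    static String[] splitWithRegex(String line) {
        return COMMA.split(line);
    }

    // Hand-coded alternative: walk the line with indexOf, no regex involved.
    static List<String> splitByHand(String line) {
        List<String> fields = new ArrayList<String>();
        int start = 0;
        int comma;
        while ((comma = line.indexOf(',', start)) >= 0) {
            fields.add(line.substring(start, comma));
            start = comma + 1;
        }
        fields.add(line.substring(start));
        return fields;
    }
}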

--
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>