From: Martin Gregorie on 28 Apr 2010 15:57

On Wed, 28 Apr 2010 14:42:35 +0200, Hakan wrote:

> We need to scan a very big input file to see how many times each date
> occurs in it. This means that we want to check the number of times
> successive strings of the form "20020701", "20020702" and so on are in
> it from a given start to end date. The syntax is European format.

Sounds like a job for awk (gawk) or Perl to me, especially if it's more
or less a one-off or infrequent task.

> What is the most efficient way to do it? I have tried with 1) a system
> call to grep

Does this imply that grep is run multiple times? If so, awk or Perl
would be *much* faster. Both have major advantages over repeatedly
calling grep:

- you can use multiple patterns, e.g. one to limit processing to the
  required date range / section of the file delimited by a date range

- you can use another pattern to recognise the dates being counted, if
  that makes the code better structured

- both awk and Perl support associative arrays, which is exactly what
  you need for listing the unique dates in the file and counting the
  occurrences of each date.

Unless I missed something, a program that does the required scanning
and lists sorted results could be written in under 20 lines as an awk
script, and in not many more with Perl (a Java equivalent is sketched
after this post).

--
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
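The associative-array idea carries straight over to Java, where a map
plays the same role. A minimal sketch, assuming the dates appear as
free-standing 8-digit tokens and taking the file name from the command
line (the class name, the regex, and the I/O choices are illustrative
assumptions, not from the thread):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Count how many times each yyyyMMdd string occurs in a file.
    public class DateCounter {
        public static void main(String[] args) throws Exception {
            Pattern date = Pattern.compile("\\b20\\d{6}\\b");
            // TreeMap keeps the keys sorted, so the report needs no
            // separate sort step.
            Map<String, Integer> counts = new TreeMap<String, Integer>();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = date.matcher(line);
                while (m.find()) {
                    Integer n = counts.get(m.group());
                    counts.put(m.group(), n == null ? 1 : n + 1);
                }
            }
            in.close();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getKey() + " " + e.getValue());
            }
        }
    }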
From: ~Glynne on 28 Apr 2010 16:30

On Apr 28, 6:42 am, Hakan <H...(a)softhome.net> wrote:

> We need to scan a very big input file to see how many times each date
> occurs in it. This means that we want to check the number of times
> successive strings of the form "20020701", "20020702" and so on are in
> it from a given start to end date. The syntax is European format.
>
> What is the most efficient way to do it? I have tried with 1) a system
> call to grep and 2) a RandomAccessFile reading each character and
> moving the file pointer ahead, but none of them runs quickly enough.
> Another option might be to use pattern matching, but then we would
> still probably have the problem of searching through most of the file.
> The records come in order, such that no records of July 1 come after
> July 2. It would be great if anyone has ideas for this or has done it
> before.
>
> Thanks in advance!!
>
> Regards,
>
> Håkan Lane
>
> --
> Newsoffice.de - The online software for reading and writing on Usenet
> The signature can be customised as you like ;-)

Implement the following pseudo-code in the language of your choice
(a Java version is sketched after this post):

    last_date = fencepost
    counter = 0
    while readline
        extract date field
        if ( date != last_date )
            print last_date, counter
            last_date = date
            counter = 1
        else
            increment counter
    end while
    print last_date, counter

As for fast IO in Java, I would recommend

    BufferedReader stdin = new BufferedReader(
            new InputStreamReader(System.in, "ISO-8859-1"));

    while( (line = stdin.readLine()) != null )
    {
        ....
    }

~Glynne
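A minimal Java rendering of that pseudo-code, using the BufferedReader
setup above. It exploits the fact that Hakan's records arrive sorted by
date, so a single pass with no map is enough; the assumption that the
date is the first 8 characters of each record is illustrative, since
the thread never specifies the record layout:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Run-length count of sorted, date-stamped records read from stdin.
    public class DateRunCounter {
        public static void main(String[] args) throws Exception {
            BufferedReader stdin = new BufferedReader(
                    new InputStreamReader(System.in, "ISO-8859-1"));
            String lastDate = null;   // null plays the "fencepost" role
            int counter = 0;
            String line;
            while ((line = stdin.readLine()) != null) {
                String date = line.substring(0, 8); // assumed layout
                if (!date.equals(lastDate)) {
                    if (lastDate != null) {         // skip the sentinel
                        System.out.println(lastDate + " " + counter);
                    }
                    lastDate = date;
                    counter = 1;
                } else {
                    counter++;
                }
            }
            if (lastDate != null) {
                System.out.println(lastDate + " " + counter);
            }
        }
    }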
From: Arne Vajhøj on 28 Apr 2010 22:00

On 28-04-2010 13:54, Robert Kochem wrote:

> IMHO it is much better to use a FileInputStream in combination with a
> BufferedInputStream with a buffer larger than 10MB.

BufferedInputStream/FileInputStream with a large buffer, or
FileInputStream with a buffer in the application code, or
FileChannel.map should all give good performance (a sketch of the
FileChannel.map option follows below).

Arne
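Of those three, FileChannel.map is the least obvious to set up for a
very big file, because a single mapping cannot exceed Integer.MAX_VALUE
bytes. A minimal sketch that maps the file in windows; the 256 MB
window size is an arbitrary illustrative choice:

    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Scan a large file through a series of memory-mapped windows.
    // Note that a date string can straddle a window boundary; real
    // code would overlap the windows or carry state across them.
    public class MappedScan {
        private static final long WINDOW = 256L * 1024 * 1024;

        public static void main(String[] args) throws Exception {
            FileChannel ch = new FileInputStream(args[0]).getChannel();
            long size = ch.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer buf =
                        ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    // ... match date patterns against b here ...
                }
            }
            ch.close();
        }
    }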
From: Roedy Green on 29 Apr 2010 12:48
On Wed, 28 Apr 2010 14:42:35 +0200, Hakan <H.L(a)softhome.net> wrote,
quoted or indirectly quoted someone who said:

> We need to scan a very big input file to see how many times each date
> occurs in it. This means that we want to check the number of times
> successive strings of the form "20020701", "20020702" and so on are in
> it from a given start to end date. The syntax is European format.

You need to provide more detail, e.g. the hex patterns you are looking
for. Places you can gain speed (a sketch of points 1-3 follows this
post):

1. read bytes rather than chars, to save the conversion of bytes to
   chars, i.e. FileInputStream.

2. don't use a generic regex. Use some hard-coded char handling.
   Process a buffer full at a time, with some overlap between buffers
   or special handling to deal with the buffer boundaries.

3. whacking huge buffers. Optimal size to be found by experiment.

4. possible use of nio, though your file is presumably too big for a
   memory-mapped file.

--
Roedy Green Canadian Mind Products
http://mindprod.com
It's amazing how much structure natural languages have when you
consider who speaks them and how they evolved.
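A sketch of points 1-3 combined: raw bytes through a big buffer, a
hard-coded byte comparison instead of a regex, and a 7-byte overlap
between reads so a date straddling a buffer boundary is not missed. It
counts one date given on the command line; the 8 MB buffer size is an
illustrative guess, to be tuned by experiment as suggested:

    import java.io.FileInputStream;

    // Count occurrences of one 8-byte date string in a file, reading
    // raw bytes with no char conversion and no regex.
    public class ByteScan {
        public static void main(String[] args) throws Exception {
            byte[] target = args[1].getBytes("ISO-8859-1"); // e.g. "20020701"
            byte[] buf = new byte[8 * 1024 * 1024];
            FileInputStream in = new FileInputStream(args[0]);
            long count = 0;
            int carry = 0;
            int n;
            while ((n = in.read(buf, carry, buf.length - carry)) > 0) {
                int limit = carry + n;
                for (int i = 0; i + target.length <= limit; i++) {
                    int j = 0;
                    while (j < target.length && buf[i + j] == target[j]) {
                        j++;
                    }
                    if (j == target.length) {
                        count++;
                    }
                }
                // keep the tail so a match spanning the boundary is
                // completed on the next pass; a match already counted
                // cannot recur inside the carried bytes
                carry = Math.min(target.length - 1, limit);
                System.arraycopy(buf, limit - carry, buf, 0, carry);
            }
            in.close();
            System.out.println(args[1] + " " + count);
        }
    }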