From: Ole-Hjalmar Kristensen on 1 Sep 2009 04:06 I see the C++ version uses fgets, which is a reasonably fast way of getting a line. Is there any reason not to use fgets? fgets (OpenSolaris version) uses memccpy to actually copy the buffer until the newline: 43 extern int _filbuf(); 44 extern char *memccpy(); 45 46 char * 47 fgets(ptr, size, iop) 48 char *ptr; 49 register int size; 50 register FILE *iop; 51 { 52 char *p, *ptr0 = ptr; 53 register int n; 54 55 if ( !(iop->_flag & (_IOREAD|_IORW)) ) { 56 iop->_flag |= _IOERR; 57 return (NULL); 58 } 59 60 for (size--; size > 0; size -= n) { 61 if (iop->_cnt <= 0) { /* empty buffer */ 62 if (_filbuf(iop) == EOF) { 63 if (ptr0 == ptr) 64 return (NULL); 65 break; /* no more data */ 66 } 67 iop->_ptr--; 68 iop->_cnt++; 69 } 70 n = MIN(size, iop->_cnt); 71 if ((p = memccpy(ptr, (char *) iop->_ptr, '\n', n)) != NULL) 72 n = p - ptr; 73 ptr += n; 74 iop->_cnt -= n; 75 iop->_ptr += n; 76 _BUFSYNC(iop); 77 if (p != NULL) 78 break; /* found '\n' in buffer */ 79 } 80 *ptr = '\0'; 81 return (ptr0); 82 } >>>>> "GB" == Georg Bauhaus <rm.dash-bauhaus(a)futureapps.de> writes: GB> Ole-Hjalmar Kristensen schrieb: >> Character'read could very well be slower if there is no buffering >> involved. The way to read efficiently is to read a large buffer >> (multiple of disk block size) and chop it into lines yourself. >> GB> Indeed, stream attributes won't help. GB> However, replacing Text_IO.Get_Line---which we use now--- GB> might still be considered controversial: GB> - Is "chopping" an acceptable interpretation of "read line by line"? GB> - writing a correct double buffering scheme (or a buffer GB> with end-of-buffer triggers for Read_A_Line_From_Buffer) GB> is still a bit of work, unless there is something we could GB> copy. GB> A package containing a quick getline will definitely be GB> useful in general Unix programming, I should think. GB> What is being used in the Socket packages, BTW? GB> Equally useful will be a package for fast unbounded strings. GB> Replace_Slice is what prevents showcasing GNAT's Spitbol GB> pattern matching as one of the fastest things on this planet! -- C++: The power, elegance and simplicity of a hand grenade.
From: Georg Bauhaus on 1 Sep 2009 05:29 Ole-Hjalmar Kristensen schrieb: > I see the C++ version uses fgets, which is a reasonably fast way of > getting a line. Is there any reason not to use fgets? > fgets (OpenSolaris version) uses memccpy to actually copy the buffer > until the newline: There were two reasons not to import fgets: 1 - fgets turns out not to be faster than Stream_IO. 2 - It's C. ;-) The fasta entry at the Shootout which uses Unchecked_Conversion, not 'Address could even be improved a small bit if we change it to use 'Address. But then, this adds to the line count. Is it worth it, for a shorter program? There was a discussion leading to this (<4a7b4daa$0$31333$9b4e6d93(a)newsspool4.arcor-online.net>) early in August.
From: Ole-Hjalmar Kristensen on 1 Sep 2009 06:16 Or you could use the Unix read system call and roll your own get_line function along these lines. It will read 1000000 lines of 60 characters line by line in about a second. with Interfaces.C_Streams; use Interfaces.C_Streams; with Ada.Text_Io; procedure Get_Line_Test is In_Buf : String(1..8192); for In_Buf'Alignment use 8192; First: Integer := In_Buf'last; Last: Integer := 0; N : Size_T := 1; function Read(Fd : Int; Buffer : Voids; Nbyte : Size_T) return Size_T; pragma Import(C,read); function Empty_Buffer return Boolean is begin return First > Last; end Empty_Buffer; pragma Inline(Empty_Buffer); function End_File return Boolean is begin return N <= 0; end End_File; pragma Inline(End_File); -- Get line from stdin, return "" if end of file function Get_Line return String is begin -- Get buffer from file if empty if Empty_buffer then N := Read(0,In_Buf(1)'Address, Size_T(In_Buf'Last)); First := 1; Last := Integer(N); end if; if Empty_Buffer then return ""; end if; -- Look for end of line and return it for I in First..Last loop if In_Buf(I) = Ascii.LF then declare Start : Integer := First; begin First := I + 1; return In_Buf(Start..I-1); end; end if; end loop; -- Did not find end of line, seek further declare Copy: string := In_Buf(First..Last); begin Last := 0; return Copy & Get_Line; end; end Get_Line; pragma Inline(Get_Line); begin while not End_File loop declare Line : String := Get_Line; begin -- Do something with line here null; end; end loop; end Get_Line_Test; astra06:~/ada/div-ada> uname -a SunOS astra06 5.10 Generic_137137-09 sun4u sparc SUNW,Sun-Fire-V440 astra06:~/ada/div-ada> ls -l x.txt -rw------- 1 ok145024 clustra 60000000 Sep 1 12:03 x.txt astra06:~/ada/div-ada> time ./get_line_test < x.txt 0.49u 0.09s 0:00.59 98.3% -- C++: The power, elegance and simplicity of a hand grenade.
From: Ole-Hjalmar Kristensen on 1 Sep 2009 06:48 >>>>> "GB" == Georg Bauhaus <rm.dash-bauhaus(a)futureapps.de> writes: OK, I see. In GNAT, the stream_io is a relatively thin layer on top of the C streams, isn't it? So you will get essentially the same speed as the C standard IO library. But does Stream_Io have a version of get_line? I know the C streams interface has a binding to fgets, but that is just using fgets directly. How did you read line by line with Stream_Io? The only way I can think of to improve on fgets is to get rid of the copying, but would you then strictly be reading line by line? I am thinking of something like procedure get_line(buffer : in out string; first : out natural; last :out natural); This get_line would only need to copy when it encounters a string which extends past the end of the buffer, in the normal case it would only return the indices of the slice. GB> Ole-Hjalmar Kristensen schrieb: >> I see the C++ version uses fgets, which is a reasonably fast way of >> getting a line. Is there any reason not to use fgets? >> fgets (OpenSolaris version) uses memccpy to actually copy the buffer >> until the newline: GB> There were two reasons not to import fgets: GB> 1 - fgets turns out not to be faster than Stream_IO. GB> 2 - It's C. ;-) GB> The fasta entry at the Shootout which uses Unchecked_Conversion, GB> not 'Address could even be improved a small bit if we change GB> it to use 'Address. But then, this adds to the line count. GB> Is it worth it, for a shorter program? GB> There was a discussion leading to this GB> (<4a7b4daa$0$31333$9b4e6d93(a)newsspool4.arcor-online.net>) GB> early in August. -- C++: The power, elegance and simplicity of a hand grenade.
From: Georg Bauhaus on 1 Sep 2009 07:34
Ole-Hjalmar Kristensen schrieb: > astra06:~/ada/div-ada> uname -a > SunOS astra06 5.10 Generic_137137-09 sun4u sparc SUNW,Sun-Fire-V440 > astra06:~/ada/div-ada> ls -l x.txt > -rw------- 1 ok145024 clustra 60000000 Sep 1 12:03 x.txt > astra06:~/ada/div-ada> time ./get_line_test < x.txt > 0.49u 0.09s 0:00.59 98.3% > Comparing the two programs, they are on a par, almost exactly. However, yours is IMHO less messy. It also has return Copy & Get_Line; which, I guess, solves the problem of long lines, even at the expense of copying. Can we reduce the copying overhead? The copying would be have O(f(Item'Length)) with f(n) ~= s*(s+1)/2 where s = n/BUFSIZ, so it is in O(n**2). We could, I think, use a storage pool: - whenever copying is needed, allocate another buffer right next to the current buffer. Use the newly allocated buffer for the recursive call. - on reaching of line/file, reinterpret the sequence of buffers in the storage pool as a String. Finally, Free. Can this be done? (This won't make a Shootout program any smaller, but seems good for the general case...) |