From: Vinay on 12 Mar 2010 05:58

On 2010-03-11 19:58:53 -0800, refun said:

> Here's another way using a fill-pointer:
>
> (defun read-stream-to-string (stream)
>   (let ((string (make-array (file-length stream)
>                             :element-type 'character
>                             :initial-element #\Space
>                             :fill-pointer 0)))
>     (loop for char = (read-char stream nil 'done)
>           until (eql char 'done)
>           do (vector-push char string)
>           finally (return string))))
>
> (with-open-file (s #p"path goes here")
>   (read-stream-to-string s))
>
> It makes an array of file-length size, but keeps a fill-pointer of 0, then
> pushes each character into the array.
>
> Is this the way to keep down the consing and avoid copying the array data
> pointlessly?

The only problem I found with this is that it expects the size of the
resource to be known ... (file-length stream)

My requirement needs to cater to both fixed-size files on disk and streams
coming in over a socket.
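One way to meet that requirement (a minimal sketch, not from the thread) is
to drop FILE-LENGTH entirely and let the buffer grow: an adjustable array
plus VECTOR-PUSH-EXTEND works for sockets as well as files.  The function
name and the 4096 initial size are illustrative choices, not anything a
poster proposed.

(defun read-stream-to-string/growing (stream)
  ;; Start from an arbitrary initial size; :ADJUSTABLE T lets
  ;; VECTOR-PUSH-EXTEND grow the array whenever the fill pointer
  ;; reaches the current end, so no length needs to be known up front.
  (let ((string (make-array 4096
                            :element-type 'character
                            :adjustable t
                            :fill-pointer 0)))
    (loop for char = (read-char stream nil nil)   ; NIL marks end of stream
          while char
          do (vector-push-extend char string)
          finally (return string))))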
From: refun on 12 Mar 2010 06:47

In article <hnd6it$83o$1(a)news.eternal-september.org>, tfb(a)tfeb.org says...
>
> On 2010-03-12 03:58:53 +0000, refun said:
>
> > It makes an array of file-length size, but keeps a fill-pointer of 0,
> > then pushes each character into the array.
>
> This is not safe, because the number of characters in the file may not
> be the same as the file's length, depending on the encoding of the
> file.  This is a classic "the world is Unix and all characters are
> exactly 8 bits long" mistake.

I think one can usually assume that the physical representation (encoding)
of a character is at least 8 bits - it could be more, but very rarely would
it be less, and I haven't seen a CL implementation which supports encodings
where a character is stored in less than 8 bits (please let me know if
there is an actual CL implementation that supports less-than-8-bit
characters).  So if one assumes characters are at least 8 bits in size (or
more), then the string length will always be less than or equal to the file
size, thus using FILE-LENGTH as an upper bound would be safe in practice.

Note that since the FILL-POINTER is 0, (length string) would return 0 as
well; it would be increased by one by each VECTOR-PUSH(-EXTEND), and would
most likely never reach the FILE-LENGTH limit.

In the case that an encoding which uses less than 8 bits per character is
used (if one exists and is supported by current CL implementations), or the
file grew since FILE-LENGTH was FUNCALLed, then VECTOR-PUSH-EXTEND would be
safer (as per Vassil Nikolov's recommendation).
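A concrete check of the claim (a sketch, not from the thread): with a
multi-byte encoding, the decoded character count comes out below
FILE-LENGTH, which for character files is commonly reported in octets,
though that is implementation-dependent.  The :UTF-8 external-format
designator and the path are assumptions; external-format names vary by
implementation.

(with-open-file (s "/tmp/example.txt"
                   :direction :output
                   :if-exists :supersede
                   :external-format :utf-8)
  (write-string "naïve résumé" s))   ; 12 characters, 15 octets in UTF-8

(with-open-file (s "/tmp/example.txt" :external-format :utf-8)
  (list :file-length (file-length s)
        :characters  (loop for c = (read-char s nil nil)
                           while c count c)))
;; => e.g. (:FILE-LENGTH 15 :CHARACTERS 12) on an implementation that
;; reports FILE-LENGTH in octets for character streams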
From: Tim Bradshaw on 12 Mar 2010 10:22

On 2010-03-12 11:47:41 +0000, refun said:

> I think one can usually assume that the physical representation (encoding)
> of a character is at least 8 bits - it could be more, but very rarely would
> it be less, and I haven't seen a CL implementation which supports encodings
> where a character is stored in less than 8 bits (please let me know if
> there is an actual CL implementation that supports less-than-8-bit
> characters).  So if one assumes characters are at least 8 bits in size (or
> more), then the string length will always be less than or equal to the file
> size, thus using FILE-LENGTH as an upper bound would be safe in practice.

What about a system which stores files as records, with one record per line
(but no space for newlines)?  Or a system where the file is growing, or
compressed in storage (OK, the latter should really stash the original
length, but perhaps this compression is being done in user space and not by
the filesystem)?

I think making assumptions about the in-core size based on what the FS
reports is just a bad idea, really.
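In that spirit, one can avoid trusting the reported size altogether: the
sketch below (not from the thread; the name SLURP-STREAM and the 4096 chunk
size are arbitrary) reads fixed-size chunks with READ-SEQUENCE and
accumulates them in a string output stream, so the result's length is
determined only by what was actually decoded.

(defun slurp-stream (stream)
  (with-output-to-string (out)
    (let ((buffer (make-string 4096)))
      (loop for n = (read-sequence buffer stream)  ; characters actually read
            while (plusp n)
            do (write-string buffer out :end n)))))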
From: Tim X on 12 Mar 2010 18:38

Tim Bradshaw <tfb(a)tfeb.org> writes:

> On 2010-03-12 11:47:41 +0000, refun said:
>
>> I think one can usually assume that the physical representation (encoding)
>> of a character is at least 8 bits - it could be more, but very rarely would
>> it be less, and I haven't seen a CL implementation which supports encodings
>> where a character is stored in less than 8 bits (please let me know if
>> there is an actual CL implementation that supports less-than-8-bit
>> characters).  So if one assumes characters are at least 8 bits in size (or
>> more), then the string length will always be less than or equal to the file
>> size, thus using FILE-LENGTH as an upper bound would be safe in practice.
>
> What about a system which stores files as records, with one record per line
> (but no space for newlines)?  Or a system where the file is growing, or
> compressed in storage (OK, the latter should really stash the original
> length, but perhaps this compression is being done in user space and not by
> the filesystem)?
>
> I think making assumptions about the in-core size based on what the FS
> reports is just a bad idea, really.

Just to add, on some systems, such as Unix, there is also the concept of
sparse files.  Essentially, these are files that will 'lie' about their
length.  Many admins have been bitten when copying such files because the
copy will end up much larger than the original's reported size.

This may not refute the original claim of using file size, in that the
copied size will be equal or greater, but I think it does highlight the
danger of using file size to predict anything other than the size of that
file at that moment in time.

Tim

--
tcross (at) rapttech dot com dot au
From: Pascal J. Bourguignon on 12 Mar 2010 19:45
Tim X <timx(a)nospam.dev.null> writes:

> Tim Bradshaw <tfb(a)tfeb.org> writes:
>
>> On 2010-03-12 11:47:41 +0000, refun said:
>>
>>> I think one can usually assume that the physical representation (encoding)
>>> of a character is at least 8 bits - it could be more, but very rarely
>>> would it be less, and I haven't seen a CL implementation which supports
>>> encodings where a character is stored in less than 8 bits (please let me
>>> know if there is an actual CL implementation that supports less-than-8-bit
>>> characters).  So if one assumes characters are at least 8 bits in size
>>> (or more), then the string length will always be less than or equal to
>>> the file size, thus using FILE-LENGTH as an upper bound would be safe in
>>> practice.
>>
>> What about a system which stores files as records, with one record per
>> line (but no space for newlines)?  Or a system where the file is growing,
>> or compressed in storage (OK, the latter should really stash the original
>> length, but perhaps this compression is being done in user space and not
>> by the filesystem)?
>>
>> I think making assumptions about the in-core size based on what the FS
>> reports is just a bad idea, really.
>
> Just to add, on some systems, such as Unix, there is also the concept of
> sparse files.  Essentially, these are files that will 'lie' about their
> length.

Well, they don't really lie.  stat(2) reports two sizes:

    off_t  st_size;    /* file size, in bytes */
    quad_t st_blocks;  /* blocks allocated for file */
    u_long st_blksize; /* optimal file sys I/O ops blocksize */

so it is easy enough to know the number of bytes in the file (including the
elided sparse bytes), st_size, and the real space occupied by the file,
st_blocks*st_blksize.

> Many admins have been bitten when copying such files because the copy will
> end up much larger than the original's reported size.

GNU cp has --sparse=always.

> This may not refute the original claim of using file size, in that the
> copied size will be equal or greater, but I think it does highlight the
> danger of using file size to predict anything other than the size of that
> file at that moment in time.

clisp doesn't create a sparse file, but a CL implementation could easily do
so:

C/USER[5]> (with-open-file (f "/tmp/sparse"
                              :direction :io
                              :if-does-not-exist :create
                              :element-type '(unsigned-byte 8))
             (file-position f 102400)
             (write-byte 1 f))
1
C/USER[6]> (ls '-l "/tmp/sparse")
-  102401 Mar 13 01:43 /tmp/sparse
C/USER[7]> (ext:shell "du /tmp/sparse")
208     /tmp/sparse
NIL
C/USER[8]> (ext:shell "du -k /tmp/sparse")
104     /tmp/sparse
NIL

--
__Pascal Bourguignon__