From: Alexandre Ferrieux on 11 Jan 2010 12:22

On Jan 11, 7:25 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:
> On 10 jan, 16:31, "slebet...(a)yahoo.com" <slebet...(a)gmail.com> wrote:
> > On Jan 10, 11:09 pm, "Paul(a)Tcl3D" <p...(a)tcl3d.org> wrote:
> > > Hello,
> > >
> > > I have a question regarding Tcl binary strings.
> > > If running the following script creating two binary strings:
> > >
> > > set num 50
> > > for { set j 0 } { $j < $num } { incr j } {
> > >     append row1 [binary format c 0]}
> > > puts "row1: length=[string length $row1] bytelength=[string bytelength $row1]"
> > >
> > > for { set j 0 } { $j < $num } { incr j } {
> > >     append row2 [binary format c 1]}
> > > puts "row2: length=[string length $row2] bytelength=[string bytelength $row2]"
> > >
> > > I get the following output (tested with 8.4, 8.5, 8.6):
> > >
> > > row1: length=50 bytelength=100
> > > row2: length=50 bytelength=50
> > >
> > > Why do zero values occupy 2 bytes in a binary string?
> >
> > Don't use [string bytelength]. What you want is [string length].
> >
> > Because Tcl is implemented in C and because in C, strings are
> > terminated by nul (0x00), the tcl interpreter internally encodes nuls
> > as a special two-byte character. The [string bytelength] is really
> > there mainly for debugging purposes or to workaround any possible edge
> > cases not automatically handled by tcl. For everything else use
> > [string length].
>
> The reason is not so much that C uses NUL bytes to terminate
> strings, but that Tcl uses UTF-8 internally. With "counted strings"
> there is no need for this extra memory, but it is the UTF-8 encoding
> of NUL bytes.

No. In strict UTF-8 the encoding of NUL is \0. Only "modified UTF-8" (and Tcl's internal use) use C080.

And the reason for this, is to preserve the (C originated) contract that, for any Tcl-Obj with a string rep,

    obj->bytes[obj->length] is the first \0

Of course this contract seems unnecessary given the presence of "length" (counted strings), but it's been there for so long that nobody dares to remove it...

The reason I insist on this, is that [string bytelength] does not even give an accurate measurement of what's written to a channel with [fconfigure -encoding utf-8], since the channel will use strict UTF-8 with dark and cold zeroes.

-Alex
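
To illustrate the point about channel output, here is a minimal sketch (not part of the original thread). It assumes that [encoding convertto utf-8] applies the same strict UTF-8 conversion a channel configured with -encoding utf-8 would use, and it counts the resulting bytes with [string length]. The variable names chars, utf8 and wireBytes are made up for the illustration.

```tcl
# Build a 50-character string of NUL characters, as in the original question.
set row1 [string repeat [binary format c 0] 50]

# Number of characters, independent of the internal representation.
set chars [string length $row1]

# Convert explicitly to external (strict) UTF-8. The result is a binary
# string, so [string length] on it counts bytes.
set utf8 [encoding convertto utf-8 $row1]
set wireBytes [string length $utf8]

puts "chars=$chars wireBytes=$wireBytes"
```

If the description above holds, both numbers should come out as 50, whereas [string bytelength] reported 100 for the same data in the original question because it counts the internal C080 pairs.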
From: Arjen Markus on 12 Jan 2010 02:47

On 11 jan, 18:22, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> No. In strict UTF-8 the encoding of NUL is \0. Only "modified
> UTF-8" (and Tcl's internal use) use C080.
>
> And the reason for this, is to preserve the (C originated) contract
> that, for any Tcl-Obj with a string rep,
>
>     obj->bytes[obj->length] is the first \0
>
> Of course this contract seems unnecessary given the presence of
> "length" (counted strings), but it's been there for so long that
> nobody dares to remove it...
>
> The reason I insist on this, is that [string bytelength] does not even
> give an accurate measurement of what's written to a channel with
> [fconfigure -encoding utf-8], since the channel will use strict UTF-8
> with dark and cold zeroes.
>
> -Alex

So it is even more complicated than "just" UTF-8 ...

Indeed, all the more reason to ignore [string bytelength].

Regards,

Arjen