From: Alexandre Ferrieux on
On Jan 11, 7:25 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:
> On 10 jan, 16:31, "slebet...(a)yahoo.com" <slebet...(a)gmail.com> wrote:
>
> > On Jan 10, 11:09 pm, "Paul(a)Tcl3D" <p...(a)tcl3d.org> wrote:
>
> > > Hello,
>
> > > I have a question regarding Tcl binary strings.
>
> > > If running the following script creating two binary strings:
>
> > > set num 50
>
> > > for { set j 0 } { $j < $num } { incr j } {
> > >      append row1 [binary format c 0]}
>
> > > puts "row1: length=[string length $row1] bytelength=[string bytelength $row1]"
>
> > > for { set j 0 } { $j < $num } { incr j } {
> > >      append row2 [binary format c 1]}
>
> > > puts "row2: length=[string length $row2] bytelength=[string bytelength $row2]"
>
> > > I get the following output (tested with 8.4, 8.5, 8.6):
>
> > > row1: length=50 bytelength=100
> > > row2: length=50 bytelength=50
>
> > > Why do zero values occupy 2 bytes in a binary string?
>
> > Don't use [string bytelength]. What you want is [string length].
>
> > Because Tcl is implemented in C and because in C, strings are
> > terminated by nul (0x00), the tcl interpreter internally encodes nuls
> > as a special two-byte character. The [string bytelength] command is really
> > there mainly for debugging purposes or to work around any possible edge
> > cases not automatically handled by Tcl. For everything else use
> > [string length].
>
> The reason is not so much that C uses NUL bytes to terminate
> strings, but that Tcl uses UTF-8 internally. With "counted strings"
> there is no need for this extra memory, but it is the UTF-8 encoding
> of NUL bytes.

No. In strict UTF-8, NUL is encoded as the single byte \0. Only "modified
UTF-8" (which is what Tcl uses internally) encodes it as the two bytes C0 80.
The reason for this is to preserve the (C-originated) contract
that, for any Tcl_Obj with a string rep,

obj->bytes[obj->length] is the first \0

Of course this contract seems unnecessary given the presence of
"length" (counted strings), but it's been there for so long that
nobody dares to remove it...
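
To make the contract concrete, here is a minimal C sketch. It is not
Tcl's actual source: encode_modified_utf8 is a hypothetical helper, and
it only handles byte values 0..255 (as produced by [binary format c]),
treating each one as the code point of the same value. It shows how
writing NUL as C0 80 keeps the first real \0 exactly at bytes[length].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Tcl API.
 * Encodes n byte values (taken as code points U+0000..U+00FF) into out
 * using the "modified UTF-8" convention, then appends a terminating \0.
 * out must have room for 2*n + 1 bytes.
 * Returns the encoded length, not counting the terminator. */
static size_t encode_modified_utf8(const unsigned char *src, size_t n, char *out)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];

        if (c == 0x00) {
            /* NUL becomes the overlong pair C0 80 instead of a raw \0. */
            out[len++] = (char)0xC0;
            out[len++] = (char)0x80;
        } else if (c < 0x80) {
            /* Plain ASCII is stored as a single byte. */
            out[len++] = (char)c;
        } else {
            /* U+0080..U+00FF take the ordinary two-byte UTF-8 form. */
            out[len++] = (char)(0xC0 | (c >> 6));
            out[len++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[len] = '\0';

    /* The contract from the post: the first \0 sits at out[len]. */
    assert(strlen(out) == len);
    return len;
}
```

Fed 50 zero bytes, this sketch returns 100, which matches the
bytelength=100 reported for row1; 50 bytes of value 1 give 50.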

The reason I insist on this is that [string bytelength] does not even
give an accurate measurement of what's written to a channel with
[fconfigure -encoding utf-8], since the channel will use strict UTF-8,
where each zero goes out as a single \0 byte.
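
The mismatch can be counted in a few lines of C. This is only a sketch
under the same assumption as before (byte values 0..255 taken as code
points); strict_utf8_len and modified_utf8_len are hypothetical names,
not Tcl functions.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed in strict UTF-8, i.e. roughly what a
 * utf-8 channel would write. NUL is one byte here. */
static size_t strict_utf8_len(const unsigned char *src, size_t n)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        len += (src[i] < 0x80) ? 1 : 2;
    }
    return len;
}

/* Hypothetical helper: bytes needed in the modified UTF-8 form, i.e.
 * roughly what [string bytelength] reports. NUL costs two bytes (C0 80). */
static size_t modified_utf8_len(const unsigned char *src, size_t n)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        len += (src[i] != 0x00 && src[i] < 0x80) ? 1 : 2;
    }
    return len;
}
```

For 50 zero bytes the two functions return 50 and 100 respectively; for
any data without zeroes they agree.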

-Alex
From: Arjen Markus on
On 11 jan, 18:22, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> On Jan 11, 7:25 am, Arjen Markus <arjen.markus...(a)gmail.com> wrote:
>
> > On 10 jan, 16:31, "slebet...(a)yahoo.com" <slebet...(a)gmail.com> wrote:
>
> > > On Jan 10, 11:09 pm, "Paul(a)Tcl3D" <p...(a)tcl3d.org> wrote:
>
> > > > Hello,
>
> > > > I have a question regarding Tcl binary strings.
>
> > > > If running the following script creating two binary strings:
>
> > > > set num 50
>
> > > > for { set j 0 } { $j < $num } { incr j } {
> > > >      append row1 [binary format c 0]}
>
> > > > puts "row1: length=[string length $row1] bytelength=[string bytelength $row1]"
>
> > > > for { set j 0 } { $j < $num } { incr j } {
> > > >      append row2 [binary format c 1]}
>
> > > > puts "row2: length=[string length $row2] bytelength=[string bytelength $row2]"
>
> > > > I get the following output (tested with 8.4, 8.5, 8.6):
>
> > > > row1: length=50 bytelength=100
> > > > row2: length=50 bytelength=50
>
> > > > Why do zero values occupy 2 bytes in a binary string?
>
> > > Don't use [string bytelength]. What you want is [string length].
>
> > > Because Tcl is implemented in C and because in C, strings are
> > > terminated by nul (0x00), the tcl interpreter internally encodes nuls
> > > as a special two-byte character. The [string bytelength] command is really
> > > there mainly for debugging purposes or to work around any possible edge
> > > cases not automatically handled by Tcl. For everything else use
> > > [string length].
>
> > The reason is not so much that C uses NUL bytes to terminate
> > strings, but that Tcl uses UTF-8 internally. With "counted strings"
> > there is no need for this extra memory, but it is the UTF-8 encoding
> > of NUL bytes.
>
> No. In strict UTF-8, NUL is encoded as the single byte \0. Only "modified
> UTF-8" (which is what Tcl uses internally) encodes it as the two bytes C0 80.
> The reason for this is to preserve the (C-originated) contract
> that, for any Tcl_Obj with a string rep,
>
>      obj->bytes[obj->length] is the first \0
>
> Of course this contract seems unnecessary given the presence of
> "length" (counted strings), but it's been there for so long that
> nobody dares to remove it...
>
> The reason I insist on this, is that [string bytelength] does not even
> give an accurate measurement of what's written to a channel with
> [fconfigure -encoding utf-8], since the channel will use strict UTF-8
> with dark and cold zeroes.
>
> -Alex

So it is even more complicated than "just" UTF-8 ... Indeed, all
the more reason to ignore [string bytelength].

Regards,

Arjen