From: Kevin Kenny on
Alexandre Ferrieux wrote:
> Yes, you're in line with Kevin who regrets the creation of the String
> type :}

Oh, String is useful, I grant that. But

(1) It's more heavyweight than it needs to be.
(2) We're too eager to use it when we don't need it.

If I had it to do over, I'd have a data structure that would index
into the string, giving the byte position of every Nth character for
some small N (16? 64? Would have to measure performance...), with
perhaps an optimisation to handle long strings of ASCII.
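
Roughly this shape of thing, at the C level (a sketch only; the
names and the stride are invented):

    #define STRIDE 64           /* every Nth character; tune by measuring */

    typedef struct CharIndex {
        int numChars;           /* total characters in the string */
        int numCheckpoints;     /* entries in byteOffset[] */
        int *byteOffset;        /* byteOffset[i] = byte position of
                                 * character i*STRIDE */
        int allAscii;           /* 1 => one byte per character; skip
                                 * the table and index directly */
    } CharIndex;

    /* Finding character n: start at byteOffset[n / STRIDE] and walk
     * forward n % STRIDE characters -- at most STRIDE-1 steps
     * instead of n. */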

A general String overhaul would also be A Good Idea. We really need
to consider UTF-8 normalisation (and perhaps even fix character
counting for combining forms). Laying the infrastructure for things
like bidi rendering would also be helpful.
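
To make the combining-forms point concrete: the same rendered glyph
can arrive in two encodings, and today we count them differently.

    /* Both render as a single e-acute: */
    const char nfc[] = "\xC3\xA9";    /* U+00E9: 1 code point, 2 bytes */
    const char nfd[] = "e\xCC\x81";   /* U+0065 U+0301: 2 code points,
                                       * 3 bytes */
    /* [string length] reports 1 for the first and 2 for the second;
     * normalisation (or grapheme-aware counting) would make them
     * agree. */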

So delving once again into our Unicode handling might be a useful
project.

--
73 de ke9tv/2, Kevin
From: tom.rmadilo on
On Jan 22, 2:51 pm, Kevin Kenny <kenn...(a)acm.org> wrote:
> Alexandre Ferrieux wrote:
> > Yes, you're in line with Kevin who regrets the creation of the String
> > type :}
>
> Oh, String is useful, I grant that.  But
>
> (1) It's more heavyweight than it needs to be.
> (2) We're too eager to use it when we don't need it.

I might be confused here, so smack me down if necessary. The problem
with any string in Tcl is that there is no index into the string.
Forget characters. The first thing needed is the ability to iterate
over the string one byte at a time. Then, you could easily create a
proc which transforms a generic string into a character string. For
binary data, this is a no-op. Maybe we need a parallel [octets]
command to support [string]. You could also generalize with a
[bitstring] command taking a -charbits option to handle different
character sets, at least those with a fixed bit length per character.
Anyway, this has the scent of a continuation, if you want to avoid
the extra storage required.
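
To make the [octets] idea concrete, here is roughly the shape it
might take at the C level (a hypothetical command, one subcommand
only, and untested):

    #include <string.h>
    #include <tcl.h>

    /*
     * octets length VALUE -- report the byte length of VALUE's
     * byte-array rep, ignoring characters entirely.
     */
    static int
    OctetsObjCmd(ClientData dummy, Tcl_Interp *interp,
                 int objc, Tcl_Obj *const objv[])
    {
        int length;

        if (objc != 3 || strcmp(Tcl_GetString(objv[1]), "length") != 0) {
            Tcl_WrongNumArgs(interp, 1, objv, "length value");
            return TCL_ERROR;
        }
        (void) Tcl_GetByteArrayFromObj(objv[2], &length);
        Tcl_SetObjResult(interp, Tcl_NewIntObj(length));
        return TCL_OK;
    }

    /* Registered with:
     *   Tcl_CreateObjCommand(interp, "octets", OctetsObjCmd,
     *                        NULL, NULL);
     * Byte-at-a-time iteration would just walk the pointer that
     * Tcl_GetByteArrayFromObj returns. */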

> If I had it to do over, I'd have a data structure that would index
> into the string, giving the byte position of every Nth character for
> some small N (16? 64?  Would have to measure performance...), with
> perhaps an optimisation to handle long strings of ASCII.

Personally I think all the cost should fall on the non-default users.
The optimization for the default case already exists, so anything
added on top of it is a long route back to the current situation.
Just create a separate string command for the non-default cases. For
ASCII this might even be faster; likewise for UTF-16.

The main point is that you need a separate non-blob-like structure
that can efficiently index into the blob (a string really is a blob,
sometimes dropping the l). The default structure is just a binary
string with fixed bit-length chars. The option is to transform this
into variable-length chars, which would require a memory-consuming
index. Still, the index could be sparse, only including segments
already found.
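
Something like this, say (made-up names, and only a sketch):

    /*
     * Lazily built character index: it only covers the part of the
     * string that lookups have already forced us to scan, so strings
     * never indexed by character cost nothing extra.
     */
    typedef struct LazyIndex {
        int scannedChars;    /* characters scanned so far */
        int scannedBytes;    /* bytes consumed by that scan */
        int numEntries;      /* checkpoints recorded so far */
        int *byteOffset;     /* grows as the scan advances */
    } LazyIndex;

    /* A lookup beyond scannedChars resumes the scan at scannedBytes
     * and appends checkpoints as it goes; anything earlier is a pure
     * table hit plus a short walk. */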

So first you need a metric, probably bits, to measure a string.
Fixed-length character sets can be mapped with simple arithmetic:
character n starts at bit n times the character width. Some character
sets can be mapped from either end (UTF-8), some only from the
beginning. None of these mappings will be easy without the fixed
metric and an efficient index into the metric.
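
UTF-8 is one of the from-either-end sets because continuation bytes
are self-marking: every byte of the form 10xxxxxx is a continuation,
so stepping backwards is just:

    /*
     * Step back one UTF-8 character: continuation bytes all match
     * 10xxxxxx, so back up until we find a byte that doesn't.
     */
    static const char *
    Utf8Prev(const char *p, const char *start)
    {
        do {
            --p;
        } while (p > start && (*p & 0xC0) == 0x80);
        return p;
    }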