From: Kevin Kenny on 22 Jan 2010 17:51

Alexandre Ferrieux wrote:
> Yes, you're in line with Kevin who regrets the creation of the String
> type :}

Oh, String is useful, I grant that. But

(1) It's more heavyweight than it needs to be.
(2) We're too eager to use it when we don't need it.

If I had it to do over, I'd have a data structure that would index
into the string, giving the byte position of every Nth character for
some small N (16? 64? Would have to measure performance...), with
perhaps an optimisation to handle long strings of ASCII.

A general String overhaul would also be A Good Idea. We really need to
consider UTF-8 normalisation (and perhaps even fix character counting
for combining forms). Laying the infrastructure for things like bidi
rendering would also be helpful. So delving once again into our
Unicode handling might be a useful project.

--
73 de ke9tv/2, Kevin
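[A rough script-level sketch of that idea, for concreteness. The real
structure would of course live in C inside the core; build_index,
byte_offset and the choice of N=16 here are purely illustrative, not a
proposal. It records the UTF-8 byte offset of every Nth character by
converting one chunk at a time:

    proc build_index {s {n 16}} {
        set index [dict create 0 0]      ;# char index -> byte offset
        set bytes 0
        for {set i 0} {$i < [string length $s]} {incr i $n} {
            set chunk [string range $s $i [expr {$i + $n - 1}]]
            incr bytes [string length [encoding convertto utf-8 $chunk]]
            dict set index [expr {$i + $n}] $bytes
        }
        return $index
    }

    # Byte offset of character k: jump to the nearest indexed position,
    # then measure only the (at most n-1) characters in between.
    proc byte_offset {s index k {n 16}} {
        set base [expr {($k / $n) * $n}]
        set off  [dict get $index $base]
        set tail [string range $s $base [expr {$k - 1}]]
        expr {$off + [string length [encoding convertto utf-8 $tail]]}
    }

Looking up character k then costs one dict fetch plus a scan of fewer
than N characters, instead of a scan from the start of the string.]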
From: tom.rmadilo on 22 Jan 2010 19:22
On Jan 22, 2:51 pm, Kevin Kenny <kenn...(a)acm.org> wrote:
> Alexandre Ferrieux wrote:
> > Yes, you're in line with Kevin who regrets the creation of the String
> > type :}
>
> Oh, String is useful, I grant that. But
>
> (1) It's more heavyweight than it needs to be.
> (2) We're too eager to use it when we don't need it.

I might be confused here, so smack me down if necessary. The problem
with any string in Tcl is that there is no index into the string.
Forget characters. The first thing needed is the ability to iterate
over the string one byte at a time. Then you could easily create a
proc which transforms a generic string into a character string. For
binary data, this is a no-op.

Maybe we need a parallel [octets] command to support [string]. You
could also generalize with [bitstring], with an option -charbits to
handle different character sets, in the case of fixed bit-length
character sets. Anyway, this has the scent of a continuation, if you
want to avoid the extra storage required.

> If I had it to do over, I'd have a data structure that would index
> into the string, giving the byte position of every Nth character for
> some small N (16? 64? Would have to measure performance...), with
> perhaps an optimisation to handle long strings of ASCII.

Personally, I think all the cost should go to the non-default users.
Optimization in this case already exists, so anything added is a long
route back to the current situation. Just create a separate string
command for the non-default cases. For ASCII this might be faster, and
also for UTF-16.

The main point is that you need a separate non-blob-like structure
that can efficiently index into the blob (a string really is a blob,
sometimes dropping the l). The default structure is just a binary
string with fixed bit-length chars. The option is to transform this
into variable-length chars, which would require a memory-consuming
index. Still, the index could be sparse, only including segments
already found.

So first you need a metric, probably bits, to measure a string.
Fixed-length character sets are mapped by addition, subtraction and
multiplication. Some character sets can be mapped from the beginning
or the end (UTF-8), some only from the beginning. None of these
mappings will be easy without the fixed metric and an efficient index
into the metric.
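[Script-level Tcl can already fake the byte-at-a-time walk, at the
cost of exactly the extra copy the post wants to avoid. A minimal
sketch, assuming the string's UTF-8 bytes are the blob that the
hypothetical [octets] command above would expose directly:

    # Iterate a string one byte at a time via its UTF-8 representation.
    # The convertto step copies the string into a byte array first.
    set bytes [encoding convertto utf-8 $s]
    binary scan $bytes c* raw
    foreach b $raw {
        puts [format %02x [expr {$b & 0xff}]]   ;# c* yields signed bytes
    }

On the fixed-vs-variable point: for a genuinely fixed-width set the
index really is pure arithmetic (character k of a UCS-4 string starts
at byte 4*k). Only variable-width encodings need the sparse index, and
UTF-8's distinguishable lead and continuation bytes are what let it be
scanned from either end.]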