From: Zhang Weiwu on 21 Jul 2010 00:32

Hello. Am I the first one to be confused by the encoding named "unicode" in the output of this statement?

% encoding names

My first guess was that the encoding is named "unicode" because Windows uses "Unicode" to mean "UTF-16LE with BOM": that is what you get when you save a text file in Windows Notepad and choose "Unicode", or when you save a spreadsheet as "Unicode Text" in Microsoft Excel. This understanding seemed correct when I ran tclkit on Windows. I have many data files in UTF-16LE, and in my Tcl script I do "fconfigure -encoding unicode" before reading them, which works fine.

Today I downloaded tclkit 8.5.1 for Mac OS on X11 and ran my application on Mac OS, and realized that the encoding name "unicode" has to be interpreted differently there. In fact, I had to convert my data files from UTF-16LE to UTF-8 to make the same Tcl script run correctly.

To make the script run on both Mac OS and Windows, it seems my only choice is to prepare the data in UTF-8 and change the script to always read files as UTF-8, which is unambiguous. This is a bit difficult because the data is produced on MS Windows, which always yields UTF-16LE. Adding a step to the already repetitive data preparation workflow isn't nice. Does anyone have a better idea?
From: Donal K. Fellows on 21 Jul 2010 05:27

On Jul 21, 5:32 am, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
> Hello. Am I the first one got confused of the encoding named "unicode"
> in the output of this statement?
>
> % encoding names
>
> I first guess the word "unicode" as encoding name are because Windows
> system uses "unicode" to mean "UTF-16LE WITH BOM", that is what happens
> when you save a text file in Notepad of Windows and choose "Unicode", or
> when you save spreadsheet as "Unicode Text" in Microsoft Excel.

Tcl doesn't add a BOM (that's a feature of a file, not of a data stream; a subtle difference I know) and produces characters in *host* endianness; it also doesn't parse a BOM on input for you. It also only handles characters in the BMP, but that's a general Tcl issue. (It's also really ugly to fix properly since it requires deep changes to the RE engine - the problems are character sets and what constitutes a single character - and the addition of a normalization engine, and there are licensing issues with some of the solutions people have suggested in the past.)

I wish I had something better to report.

Donal.
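The host-endianness point can be checked from a Tcl shell. This is only a minimal sketch, assuming an 8.5-era interpreter in which the "unicode" encoding behaves as described above; the variable names are illustrative and not from the thread.

```tcl
# Byte order of the machine Tcl is running on: "littleEndian" or "bigEndian".
set order $tcl_platform(byteOrder)

# Encode one ASCII character with the "unicode" encoding and dump the raw bytes as hex.
set bytes [encoding convertto unicode "A"]
binary scan $bytes H* hex

puts "host byte order: $order"
puts "\"A\" in the unicode encoding: $hex"
# If the encoding follows host endianness and writes no BOM, as described
# above, this should show 4100 on a little-endian host and 0041 on a
# big-endian one.
```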
From: Alexandre Ferrieux on 21 Jul 2010 06:20

On Jul 21, 11:27 am, "Donal K. Fellows" <donal.k.fell...(a)manchester.ac.uk> wrote:
> On Jul 21, 5:32 am, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
>
> > Hello. Am I the first one got confused of the encoding named "unicode"
> > in the output of this statement?
>
> > % encoding names
>
> > I first guess the word "unicode" as encoding name are because Windows
> > system uses "unicode" to mean "UTF-16LE WITH BOM", that is what happens
> > when you save a text file in Notepad of Windows and choose "Unicode", or
> > when you save spreadsheet as "Unicode Text" in Microsoft Excel.
>
> Tcl doesn't add a BOM (that's a feature of a file, not of a data
> stream; a subtle difference I know) and produces characters in *host*
> endianness; it also doesn't parse a BOM on input for you. It also only
> handles characters in the BMP, but that's a general Tcl issue. (It's
> also really ugly to fix properly since it requires deep changes to the
> RE engine - the problems are character sets and what constitutes a
> single character - and the addition of a normalization engine, and
> there are licensing issues with some of the solutions people have
> suggested in the past.)
>
> I wish I had something better to report.

Donal, you do have an OSX Tcl at hand, don't you? On that platform, does [fconfigure -encoding unicode] read a UTF-16LE file properly or not (assuming an x86 Mac, not an mc68k dinosaur ;-) when the characters are "not risky" (say ASCII)?

(The OP's wording makes it unclear whether there is really a platform-specific issue, or just a few warts in a specific file with strange characters or reversed byte order...)

-Alex
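One way to answer that question on any given machine is a small probe script. The following is a sketch only: the scratch file name is hypothetical, and it deliberately uses plain ASCII text with no BOM so that byte order is the only variable.

```tcl
# Hypothetical scratch file name; any writable path would do.
set path [file join [pwd] utf16le-probe.txt]

# Write "hi" as raw UTF-16LE bytes (68 00 69 00), with no BOM.
set chan [open $path w]
fconfigure $chan -translation binary
puts -nonewline $chan [binary format s* {104 105}]
close $chan

# Read it back the way the original script does.
set chan [open $path r]
fconfigure $chan -encoding unicode
set text [read $chan]
close $chan
file delete $path

if {$text eq "hi"} {
    puts "unicode read the file as UTF-16LE on this host"
} else {
    puts "unicode did NOT read the file as UTF-16LE on this host"
}
```

If the probe succeeds on an x86 Mac, the trouble is more likely in the data files themselves (for example a leading BOM) than in the platform.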
From: Joe English on 21 Jul 2010 20:15

Zhang Weiwu wrote:
>
> Hello. Am I the first one got confused of the encoding named "unicode"
> in the output of this statement?
>
> % encoding names

No, you are not. The *first* one to be confused by the encoding mistakenly called "unicode" in Tcl is the person who wrote the code in the first place :-)

> I first guess the word "unicode" as encoding name are because Windows
> system uses "unicode" to mean "UTF-16LE WITH BOM", that is what happens
> when you save a text file in Notepad of Windows and choose "Unicode", or
> when you save spreadsheet as "Unicode Text" in Microsoft Excel. This
> understanding seems to be correct when I run tclkit on Windows. I have
> many data files in UTF-16LE and in my tcl script I do "fconfigure
> -encoding unicode" before reading them, which works fine.

Tcl's "unicode" encoding is actually UCS-2, which uses 16-bit codepoints. When serialized to octets, it'll either be UCS-2LE or UCS-2BE, depending on the native byte order of the host computer.

(UCS-2[BE/LE] is a strict subset of UTF-16[BE/LE]. Since Tcl doesn't recognize characters outside the BMP, the distinction makes no real difference as far as Tcl is concerned.)

> Today I downloaded tclkit 8.5.1 for Mac OS on X11 and run my application
> on Mac OS, and realized encoding name "unicode" have to be interpreted
> differently. In fact I had to convert my data file from UTF-16LE to
> UTF-8 to make the same tcl script run correctly.

That's consistent with what I'd expect. In Tcl, "unicode" is compatible with UTF-16LE on Intel boxes, or with UTF-16BE everywhere else. (Or maybe it's the other way around. I never remember.)

> In order to make the script run on both Mac OS, it seems the only choice
> I have is to prepare data in UTF-8 only and change script to always read
> files in UTF-8, which is not ambiguous.

That's the most sensible thing to do, if it's practical.

> This is a bit difficult because
> the data producing workflow is done on MS Windows, always resulting
> UTF16-LE. Adding a step to the already repetitive data preparing
> workflow isn't nice. Do we have a better idea?

A better idea would be to add explicit "utf16le"/"utf16be" and/or "ucs2le/be" encodings to Tcl. (I'm somewhat surprised that that hasn't happened yet -- it's eminently sensible -- probably just that nobody's gotten around to it yet.)

--Joe English
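Until such explicit encodings are available, a script can decode UTF-16LE itself instead of relying on the host-dependent "unicode" name. The following is a minimal sketch, not something proposed in the thread: both proc names are hypothetical, and it shares the BMP-only limitation mentioned above, since each 16-bit unit becomes one Tcl character.

```tcl
# Hypothetical helper: decode UTF-16LE bytes into a Tcl string without
# using the host-dependent "unicode" encoding.
proc decodeUtf16le {bytes} {
    set units {}
    # "s*" reads every 16-bit unit as little-endian, whatever the host is.
    binary scan $bytes s* units
    set text ""
    foreach u $units {
        # binary scan yields signed values; mask them back to 0..65535.
        append text [format %c [expr {$u & 0xFFFF}]]
    }
    # Drop a leading BOM such as the one Notepad's "Unicode" option writes.
    if {[string index $text 0] eq "\uFEFF"} {
        set text [string range $text 1 end]
    }
    return $text
}

# Hypothetical wrapper: read a whole UTF-16LE file as raw bytes and decode it.
proc readUtf16leFile {path} {
    set chan [open $path r]
    fconfigure $chan -translation binary
    set bytes [read $chan]
    close $chan
    return [decodeUtf16le $bytes]
}
```

A call such as `set data [readUtf16leFile data.txt]` (file name hypothetical) would then behave the same on Windows, Mac OS, and big-endian hosts. Because the file is read in binary mode, Windows line endings arrive as "\r\n" and may need a `string map` afterwards.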
From: Zhang Weiwu on 21 Jul 2010 21:50

> Tcl doesn't add a BOM (that's a feature of a file, not of a data
> stream; a subtle difference I know) and produces characters in *host*
> endianness

Do you mean that "unicode" as an encoding name means UTF-16LE on Microsoft Windows, means something different on Mac OS (UTF-8, as far as my test showed), and might mean UTF-16BE on a big-endian system, say Linux on MIPS?

In that case, there should be a comment on the wiki or somewhere to warn against using "unicode" as an encoding name in scripts. Most developers would do as I did: test on Windows only (or on whatever system they work on), decide that "unicode" must mean the same thing on other OSs and systems too (in my case, UTF-16LE), and ship applications that break on other OSs (I just did it!).