What encoding is the encoding named "unicode"? [TCL]

From: Donal K. Fellows on 22 Jul 2010 05:02

On Jul 21, 11:20 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> Donal, you do have an OSX Tcl at hand, don't you?
>
> On that platform, does [fconfigure -encoding unicode] read a
> UTF-16LE file properly (assuming an x86 Mac, not an mc68k dinosaur ;-)
> or not, when the characters are "not risky" (say ASCII)?

Should do. The "unicode" encoding is always host-endian. :-( I don't
recommend it for files, frankly, but it is rather useful on a number
of platforms (well, Windows) for interacting with the OS itself. (OSX
uses UTF-8 throughout these days.)
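
A small probe can show what "host-endian" means on a given machine. This is only a sketch: the proc name unicodeProbe is made up for illustration, and it assumes the interpreter still lists an encoding called "unicode" (it checks before using it).

```tcl
# Hypothetical helper: report the host byte order and the raw bytes that
# the "unicode" encoding produces for the single ASCII character "A".
proc unicodeProbe {} {
    set result [list byteOrder $::tcl_platform(byteOrder)]
    # Only probe the encoding if this interpreter actually provides it.
    if {"unicode" in [encoding names]} {
        set hex ""
        binary scan [encoding convertto unicode "A"] H* hex
        lappend result bytes $hex
    }
    return $result
}

puts [unicodeProbe]
```

If the encoding is host-endian as described above, the hex digits for "A" should begin with 41 on a little-endian host and with 00 on a big-endian one.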

> (The OP's wording makes it unclear whether there is really a platform-
> specific issue, or just a few warts in a specific file with strange
> characters or reversed byte order...)

That's what I suspect.

Donal.

From: Alexandre Ferrieux on 22 Jul 2010 08:26

On Jul 22, 3:50 am, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
> > Tcl doesn't add a BOM (that's a feature of a file, not of a data
> > stream; a subtle difference I know) and produces characters in *host*
> > endianness
>
> Do you mean that "unicode" as an encoding name means UTF-16LE on
> Microsoft Windows, means something different (UTF-8, as far as I
> tested) on Mac OS, and might mean UTF-16BE on a big-endian system,
> say Linux on MIPS?
>
> In that case, there should be a comment on the wiki or somewhere warning
> against using "unicode" as an encoding name in scripts, because most
> developers would do as I did: test it on Windows only (or on their own
> working system only), decide that "unicode" must mean the same thing (in
> my case, UTF-16LE) on other OSs and systems too, and write
> applications that break on other OSs (I just did it!).

Let's spell things out precisely: do you confirm that, on your OSX
machine, given a file "file.txt" in UTF-8, the following code:

pack [text .t]
set f [open file.txt r]
fconfigure $f -encoding utf-8
.t insert 1.0 [read $f]

displays garbled characters in the text widget? If yes, how garbled?
From the beginning? Or only on certain characters?

(Note that I purposefully used Tk for the display, to avoid an extra
layer of confusion from a possible mismatch between stdout's encoding
and what the xterm thinks)

-Alex

From: Zhang Weiwu on 2 Aug 2010 08:00

On 22 Jul 2010 20:26, Alexandre Ferrieux wrote:
> Let's spell things out precisely: do you confirm that, on your OSX
> machine, given a file "file.txt" in UTF-8, the following code:
>
> pack [text .t]
> set f [open file.txt r]
> fconfigure $f -encoding utf-8
> .t insert 1.0 [read $f]
>
> displays garbled characters in the text widget? If yes, how garbled?
> From the beginning? Or only on certain characters?
>
Hi. Thank you for trying to sort out the problem. The fact that Joe English
and Donal K. Fellows, who know how things work internally (or, in
fact, how they don't), addressed the issue directly while you
didn't get it shows how obscure this issue is to language users
like us. I think I should clarify it further, for you and for other
programmers who come to this post through Google, by using a program to
demonstrate the problem.

I see you tried to demonstrate the problem with a sample program, but your
sample program does not use the encoding named "unicode", and thus cannot
demonstrate the problem. You used the encoding named "UTF-8", which is
a Unicode encoding, but is not 'the encoding named "unicode"'. (Yes, this is
confusing.)

Here is the code to demonstrate this problem. Basically, it demonstrates
that "unicode" as an encoding name means something different on Mac OS
than it does on Windows/Linux. The demonstration involves two steps:

First step:

Run the following code on Linux (in my case Ubuntu 10.04 x86-32) or
on Windows (both have same output):

write.tcl:

set f [open test.txt w]
fconfigure $f -encoding unicode
puts $f "Hello World"
close $f

You can tell the output is UTF-16 by looking at the length:
$ ls -lh test.txt
-rw-rw-r-- 1 almustafa users 24 2010-08-02 19:17 test.txt

The file would be 12 bytes if it were in UTF-8. An octal dump confirms
that test.txt is in UTF-16LE.
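
The same check can be done from Tcl itself by reading the file in binary mode. This is a minimal sketch; the proc name hexHead is hypothetical, not part of Tcl.

```tcl
# Hypothetical helper: return the first bytes of a file as hex digits,
# to see which byte of each 16-bit unit comes first.
proc hexHead {path {count 8}} {
    set f [open $path r]
    fconfigure $f -translation binary   ;# no encoding or EOL conversion
    set data [read $f $count]
    close $f
    set hex ""
    binary scan $data H* hex
    return $hex
}

if {[file exists test.txt]} {
    puts "size: [file size test.txt] bytes"
    puts "head: [hexHead test.txt]"
}
```

A UTF-16LE file that starts with "Hell" gives 480065006c006c00 (low byte first); the big-endian form would be 00480065006c006c.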

Second Step:

Copy this test.txt file as-is to Mac OS on PPC (I used scp), and run
this code:

display.tcl:

#!/usr/bin/wish
package require Tk
pack [text .t]
set f [open test.txt r]
fconfigure $f -encoding unicode
.t insert 1.0 [read $f]

The output is a window with a text widget containing garbled text.

If what Joe English and Donal K. Fellows said is strictly true, the
meaning of "unicode" as an encoding name should have changed when Apple
switched from PPC to x86. Thus, any Tcl code written on the Mac OS X PPC
version using "unicode" is doomed to fail on the Mac OS X x86 version. Or,
more plainly, any Tcl script using -encoding unicode written on Mac OS X
before 2005 is doomed to stop working after 2010, and to have mixed results
between 2005 and 2010.

I am also surprised that UTF-16LE as an encoding name has never been
implemented, but I don't know how we users can make this implementation
happen. It does make sense: because Windows makes wide use of UTF-16LE,
supporting this encoding would reduce the manual conversion needed
by Windows users. It happened to me.
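
Until an explicit UTF-16LE encoding name is available, one possible workaround is to read the file as bytes and decode the 16-bit units by hand, since [binary scan] has a format letter with a fixed little-endian byte order. The sketch below is an illustration only: readUtf16le is a made-up name, it handles only characters in the Basic Multilingual Plane (surrogate pairs are not combined), and it leaves line endings untouched.

```tcl
# Hypothetical helper: decode a UTF-16LE file independently of the host
# byte order. Assumption: BMP characters only, no surrogate pairing.
proc readUtf16le {path} {
    set f [open $path r]
    fconfigure $f -translation binary
    set data [read $f]
    close $f
    set units {}
    # "s*" reads every 16-bit unit as little-endian, whatever the host is.
    binary scan $data s* units
    set text ""
    set first 1
    foreach u $units {
        set u [expr {$u & 0xFFFF}]   ;# undo sign extension
        # Skip a byte order mark if it is the very first unit.
        if {$first && $u == 0xFEFF} { set first 0; continue }
        set first 0
        append text [format %c $u]
    }
    return $text
}
```

With this, the display script could use ".t insert 1.0 [readUtf16le test.txt]" instead of relying on "-encoding unicode".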

From: tom.rmadilo on 2 Aug 2010 14:20

On Aug 2, 5:00 am, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
> If what Joe English and Donal K. Fellows said is strictly true, the
> meaning of "unicode" as an encoding name should have changed when Apple
> switched from PPC to x86. Thus, any Tcl code written on the Mac OS X PPC
> version using "unicode" is doomed to fail on the Mac OS X x86 version. Or,
> more plainly, any Tcl script using -encoding unicode written on Mac OS X
> before 2005 is doomed to stop working after 2010, and to have mixed results
> between 2005 and 2010.
>
> I am also surprised that UTF-16LE as an encoding name has never been
> implemented, but I don't know how we users can make this implementation
> happen. It does make sense: because Windows makes wide use of UTF-16LE,
> supporting this encoding would reduce the manual conversion needed
> by Windows users. It happened to me.

I'm not sure that you have identified the problem. You created a file
on one computer/configuration where "unicode" means one thing. You
then transferred the file in binary mode to another computer/configuration
where "unicode" means something else. Since you never
performed any translation between the two, you should expect garbage.

The exact same thing happens when you create a text file on a *nix
system, transfer it in binary mode to Windows, and then try to read the
file in "Notepad".

I suggest three keys to a long term solution:

1. Always use UTF-8 to store textual data (data created or already
processed by the application).
2. Always transfer files as binary data.
3. Allow the application to handle the necessary translations/interpretations;
that is, don't hard-code the translation mode if you
expect to handle multiple encodings. Encoding detection might require
reading an entire file, which is the reason for the first two
suggestions.
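
As a rough illustration of suggestion 1, a pair of helpers can pin both the encoding and the line-ending translation, so that the bytes on disk do not depend on the platform that wrote them. The proc names saveUtf8 and loadUtf8 are invented for this sketch.

```tcl
# Hypothetical helpers: store and load text as UTF-8 with LF line endings,
# so the file content is the same on every platform.
proc saveUtf8 {path text} {
    set f [open $path w]
    fconfigure $f -encoding utf-8 -translation lf
    puts -nonewline $f $text
    close $f
}

proc loadUtf8 {path} {
    set f [open $path r]
    fconfigure $f -encoding utf-8 -translation lf
    set text [read $f]
    close $f
    return $text
}
```

A file written this way can be copied in binary mode between machines and read back with loadUtf8 without any further conversion.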

The only question I have is whether it is possible to read/write UTF-16LE
and UTF-16BE on any system, or whether this is fixed by the detected system
type. That would suck, but could be worked around by suggestion 1.
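
One way to get either byte order regardless of the host is to bypass the encoding system and build the bytes with [binary format], which has separate little-endian (s) and big-endian (S) 16-bit format letters. This is a sketch under assumptions: encodeUtf16 is a made-up name, and only BMP characters are handled.

```tcl
# Hypothetical helper: encode a string as UTF-16 with an explicit byte
# order ("le" or "be"). Assumption: BMP characters only.
proc encodeUtf16 {text order} {
    set units {}
    foreach ch [split $text ""] {
        lappend units [scan $ch %c]   ;# character -> code point
    }
    switch -- $order {
        le      { return [binary format s* $units] }
        be      { return [binary format S* $units] }
        default { error "order must be le or be" }
    }
}
```

The result is a byte string, so it has to be written to a channel configured with "-translation binary".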
