From: Donal K. Fellows on 22 Jul 2010 05:02

On Jul 21, 11:20 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com> wrote:
> Donal, you do have an OSX Tcl at hand, don't you ?
>
> On that platform, does [fconfigure -encoding unicode] allow to read an
> UTF-16LE (assuming an x86 mac, not an mc68k dinosaur ;-) properly or
> not, when the characters are "not risky" (say ASCII) ?

Should do. The "unicode" encoding is always host-endian. :-( I don't recommend it for files, frankly, but it is rather useful on a number of platforms (well, Windows) for interacting with the OS itself. (OSX uses UTF-8 throughout these days.)

> (The OP's wording makes it unclear whether there is really a platform-
> specific issue, or just a few warts in a specific file with strange
> characters or reversed byte order...)

That's what I suspect.

Donal.
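To make the "host-endian" point concrete, here is a minimal sketch that encodes one ASCII character with the "unicode" encoding and prints the raw bytes next to the host byte order. It assumes Tcl 8.5 or 8.6, where "unicode" is a 16-bit host-endian encoding; the variable names are only for illustration.

```tcl
# Encode "A" (U+0041) with the "unicode" encoding and inspect the raw bytes.
# Assumption: Tcl 8.5/8.6, where "unicode" means 16-bit units in host order.
set bytes [encoding convertto unicode "A"]

# Turn the byte string into hex digits so the byte order is visible.
binary scan $bytes H* hex

puts "host byte order: $tcl_platform(byteOrder)"
puts "bytes for A:     $hex"
```

On a little-endian host this should show 4100, on a big-endian host 0041, which is why a file written with -encoding unicode on one kind of machine reads back as garbage on the other.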
From: Alexandre Ferrieux on 22 Jul 2010 08:26

On Jul 22, 3:50 am, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
> > Tcl doesn't add a BOM (that's a feature of a file, not of a data
> > stream; a subtle difference I know) and produces characters in *host*
> > endianness
>
> Do you mean, that "unicode" as an encoding name, means UTF-16LE in
> Microsoft Windows, and means something differently (as I tested, means
> UTF-8) in Mac OS, and if I use it on big endian system, say Linux on
> MIPS arch, it might mean UTF-16BE?
>
> In this case, there should be a comment on wiki or somewhere to warn
> against use of "unicode" as encoding name in scripts. Because most
> developer would do like me: test it on Windows only (or on his / her
> working system only) and decide, emm, Unicode must mean this (IN my
> case, I think "unicode" means UTF-16LE) on other OSs and systems too,
> and make applications that breaks on other OS (I just did it!).

Let's spell things out precisely: do you confirm that, on your OSX machine, given a file "file.txt" in UTF-8, the following code:

    pack [text .t]
    set f [open file.txt r]
    fconfigure $f -encoding utf-8
    .t insert 1.0 [read $f]

displays garbled characters in the text widget ? If yes, how garbled ? From the beginning ? Or only on certain characters ?

(Note that I purposefully used Tk for the display, to avoid an extra layer of confusion from a possible mismatch between stdout's encoding and what the xterm thinks)

-Alex
From: Zhang Weiwu on 2 Aug 2010 08:00

On 22 Jul 2010 20:26, Alexandre Ferrieux wrote:
> Lets spell things out precisely: do you confirm that, on your OSX
> machine, given a file "file.txt" in UTF-8, the following code:
>
>   pack [text .t]
>   set f [open file.txt r]
>   fconfigure $f -encoding utf-8
>   .t insert 1.0 [read $f]
>
> displays garbled characters in the text widget ? If yes, how garbled ?
> From the beginning ? Or only on certain characters ?

Hi. Thank you for trying to sort out the problem. The fact that Joe English and Donal K. Fellows, who know how things work internally (in fact, how they don't), addressed the issue directly while you didn't get it shows how obscure this issue is to language users like us. I think I should clarify it further, for you and for other programmers who come to this post through Google, with a program that demonstrates the problem.

I see you tried to demonstrate the problem with a sample program, but your sample program does not use the encoding named "unicode", so it cannot demonstrate the problem. You used the encoding named "utf-8", which is Unicode, but is not 'the encoding named "unicode"'. (Yes, this is confusing.)

Here is the code to demonstrate the problem. Basically it shows that "unicode" as an encoding name means something different on Mac OS than it does on Windows/Linux. The demonstration involves 2 steps:

First step:

  Run the following code on Linux (in my case Ubuntu 10.04 x86-32) or on Windows (both give the same output):

  write.tcl:

    set f [open test.txt w]
    fconfigure $f -encoding unicode
    puts $f "Hello World"
    close $f

  You can tell the output is UTF-16 by looking at the length:

    $ ls -lh test.txt
    -rw-rw-r-- 1 almustafa users 24 2010-08-02 19:17 test.txt

  The file would be 12 bytes if it were in UTF-8 (11 characters plus a newline); at 2 bytes per character it is 24. An octal dump confirms test.txt is in UTF-16LE.

Second step:

  Copy this test.txt file as-is to Mac OS on ppc (I used scp), and run this code:

  display.tcl:

    #!/usr/bin/wish
    package require Tk
    pack [text .t]
    set f [open test.txt r]
    fconfigure $f -encoding unicode
    .t insert 1.0 [read $f]

  The output is a window with a text field containing garbled text.

If what Joe English and Donal K. Fellows said is strictly true, the meaning of "unicode" as an encoding name should have changed when Apple switched from PPC to x86. Thus, any Tcl code written on the ppc version of Mac OS X using "unicode" is doomed to fail on the x86 version. Or more plainly, any Tcl script using -encoding unicode written on Mac OS X before 2005 is doomed to stop working after 2010, and to give mixed results between 2005 and 2010.

I am also surprised that UTF-16LE as an encoding name has never been implemented, but I don't know how we users can make this implementation happen. It does make sense: Windows makes wide use of UTF-16LE, so supporting this encoding would reduce the manual conversion Windows users might otherwise need. It happened to me.
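Until an explicit UTF-16LE encoding name is available, one possible workaround is to read the file as raw bytes and swap the byte order by hand on big-endian hosts before handing the data to the "unicode" encoding. The sketch below assumes Tcl 8.5/8.6 (16-bit host-endian "unicode"); the proc name readUtf16le is made up for this example and is not part of Tcl.

```tcl
# Hypothetical helper: decode a UTF-16LE file the same way on any host.
# Assumption: "unicode" is the 16-bit host-endian encoding of Tcl 8.5/8.6.
proc readUtf16le {path} {
    # Read raw bytes, with no encoding or end-of-line translation.
    set f [open $path r]
    fconfigure $f -translation binary
    set data [read $f]
    close $f

    # Drop a leading little-endian byte order mark (FF FE) if there is one.
    binary scan [string range $data 0 1] H* bom
    if {$bom eq "fffe"} {
        set data [string range $data 2 end]
    }

    # On big-endian hosts, rewrite each 16-bit unit in big-endian order
    # so that it matches what "unicode" expects there.
    if {$::tcl_platform(byteOrder) eq "bigEndian"} {
        binary scan $data s* units           ;# parse as little-endian 16-bit
        set data [binary format S* $units]   ;# emit as big-endian 16-bit
    }

    return [encoding convertfrom unicode $data]
}
```

With this, the display step above could use `.t insert 1.0 [readUtf16le test.txt]` instead of configuring the channel with -encoding unicode. Like the "unicode" encoding itself in these Tcl versions, it only handles 16-bit units.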
From: tom.rmadilo on 2 Aug 2010 14:20

On Aug 2, 5:00 am, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
> If strictly what Joe English and Donal K. Fellow said are true, the
> meaning of "unicode" as encoding name should have changed when Apple
> switches from PPC to x86. Thus, any TCL code made on Mac OS X ppc
> version using "unicode" is doomed to fail on Mac OS X x86 version. Or
> more plainly, any TCL script using -encoding unicode written on Mac OS X
> before 2005 are doomed to stop working after 2010, and have mixed result
> between 2005 and 2010.
>
> I am also surprised UTF-16LE as an encoding name has never been
> implemented. But I don't know how can we users make this implementation
> happen. It do make sense, because Windows make wide use of UTF16LE,
> supporting this encoding would reduce potential human conversion needed
> for Windows users. It happened on me.

I'm not sure that you have identified the problem. You created a file on one computer/configuration where "unicode" means one thing. You then transferred the file in binary mode to another computer/configuration where "unicode" means something else. Since you never performed any translation from one to the other, you should expect garbage.

The exact same thing happens when you create a text file on *nix systems, binary transfer it to Windows and then try to read the file in "Notepad".

I suggest three keys to a long term solution:

1. Always use UTF-8 to store textual data (data created or already processed by the application).
2. Always transfer files as binary data.
3. Allow the application to handle the necessary translations/interpretations... that is, don't hard code the translation mode if you expect to handle multiple encodings.

Encoding detection might require reading an entire file, which is the reason for the first two suggestions.

The only question I have is whether it is possible to read/write UTF-16LE and UTF-16BE on any system, or whether this is fixed by the detected system type. That would suck, but could be avoided by following suggestion 1.
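As a rough illustration of suggestion 3, an application could look at the first bytes of a file for a byte order mark instead of hard coding the encoding. This is only a sketch: the proc name sniffBom and its return labels are invented here, files without a BOM (such as the test.txt discussed above, since Tcl does not write one) come back as "unknown", and a UTF-32LE BOM would be mistaken for UTF-16LE.

```tcl
# Hypothetical helper: guess an encoding label from a leading byte order mark.
# The labels are just strings for the caller; they are not necessarily
# encoding names that a given Tcl build accepts.
proc sniffBom {data} {
    # Look at no more than the first three bytes, as lowercase hex digits.
    binary scan [string range $data 0 2] H* head
    switch -glob -- $head {
        efbbbf  { return utf-8 }
        fffe*   { return utf-16le }
        feff*   { return utf-16be }
        default { return unknown }
    }
}
```

The data argument is expected to be raw bytes, for example the result of reading a channel configured with `fconfigure $f -translation binary`. When the answer is "unknown", the application still has to fall back to a default or inspect more of the file.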