From: Alexandre Ferrieux on 2 Aug 2010 18:47

On Aug 2, 2:00 pm, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
> On 2010-07-22 20:26, Alexandre Ferrieux wrote:
> > Let's spell things out precisely: do you confirm that, on your OSX
> > machine, given a file "file.txt" in UTF-8, the following code:
> >
> >   pack [text .t]
> >   set f [open file.txt r]
> >   fconfigure $f -encoding utf-8
> >   .t insert 1.0 [read $f]
> >
> > displays garbled characters in the text widget? If yes, how garbled?
> > From the beginning? Or only on certain characters?
>
> Hi. Thank you for trying to sort out the problem. The fact that Joe
> English and Donal K. Fellows, with knowledge of how things work
> internally (in fact, of how they don't), posted addressing the issue
> directly while you didn't get it, is an example of how obscure this
> issue is to language users like us.

Ahem... beware of hasty conclusions. I was merely trying to extract
facts from what was essentially interpretation. See below.

> Copy this test.txt file as-is to Mac OS on ppc

Ah, ppc is big-endian, of course... no mystery here, just the obvious
fact that without a BOM, UTF-16 is only meaningful within the same
endianness.

So all this fuss about a would-be OSX-Tcl bug is merely a complaint
about the (admitted) ambiguity of the "unicode" encoding, meaning
UTF-16 in native byte order, right? If yes, then why not simply
encourage Joe's proposal of order-explicit variants?

-Alex
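For what it's worth, a minimal Tcl 8.x sketch of such an order-explicit
read, done by hand (read_utf16 is a name I just made up, not an existing
API; it assumes an even byte count and leaves any leading U+FEFF alone):

  # Read a UTF-16 file with an explicit byte order ("le" or "be") on
  # Tcl 8.x, where the built-in "unicode" encoding assumes native order.
  proc read_utf16 {path order} {
      set f [open $path rb]          ;# raw bytes, no decoding
      set bytes [read $f]
      close $f
      set native [expr {$::tcl_platform(byteOrder) eq "littleEndian"
                        ? "le" : "be"}]
      if {$order ne $native} {
          # Swap each 16-bit unit: scan the values as little-endian
          # ints, then format the same values back as big-endian.
          binary scan $bytes s* words
          set bytes [binary format S* $words]
      }
      return [encoding convertfrom unicode $bytes]
  }

  # e.g. reading the ppc-written file on a little-endian machine:
  # set text [read_utf16 test.txt be]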
From: Fredrik Karlsson on 3 Aug 2010 07:30

Hi all,

sorry for hijacking this thread a bit, but a related question came to
me - what is the standard / good way of detecting UTF-16 encoding in a
file, rather than the assumed UTF-8 encoding (which in my domain can
be assumed except under somewhat rare circumstances)?

Code snippets would be very helpful.

/Fredrik
From: tom.rmadilo on 3 Aug 2010 20:37

On Aug 3, 4:30 am, Fredrik Karlsson <dargo...(a)gmail.com> wrote:
> Hi all,
>
> sorry for hijacking this thread a bit, but a related question came to
> me - what is the standard / good way of detecting UTF-16 encoding in a
> file, rather than the assumed UTF-8 encoding (which in my domain can
> be assumed except under somewhat rare circumstances)?
>
> Code snippets would be very helpful.

If you only have random chunks of a file or byte stream, there is no
way to tell which is which. In fact, this is true of every byte
stream: the best you can do is rule out encodings or, more usefully,
make educated guesses.

The problem with UTF-16 is that there is no way to detect the byte
order given a random chunk of a file or byte stream: UTF-16 has state.
Usually you have to use something called a byte order mark, or BOM
(FFFE or FEFF). But:

'For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order
mark must not be used because the names of these character sets
already determine the byte order. If encountered, an initial U+FEFF
must be interpreted as a (deprecated) "zero width no-break space".'

Of course, if you don't know the charset, you are screwed. Another
reason to use UTF-8, or binary data, for all external communication.
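To make the guessing concrete, a rough Tcl sketch (the name
guess_encoding, the 1 KB sample size and the 10% NUL threshold are
arbitrary choices of mine, not any standard):

  # Heuristic: check for a BOM first, then fall back to a crude
  # NUL-byte ratio test. Mostly-ASCII UTF-16 text is roughly half NUL
  # bytes, while ordinary UTF-8 text contains none.
  proc guess_encoding {path} {
      set f [open $path rb]          ;# raw bytes, no decoding
      set head [read $f 1024]
      close $f
      # Note: binary scan produces lowercase hex digits.
      if {[binary scan $head H4 bom] == 1} {
          switch -- $bom {
              feff { return utf-16be }
              fffe { return utf-16le }
          }
      }
      set len [string length $head]
      set nuls [regexp -all {\x00} $head]
      if {$len > 0 && double($nuls) / $len > 0.1} {
          return utf-16-unknown-order ;# no BOM: order still a guess
      }
      return utf-8
  }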
From: Fredrik Karlsson on 4 Aug 2010 04:35

Hi!

Thank you very much for your answer. However, my question is simpler
than that. I understand that UTF-16 is generally tricky because there
may not be a BOM. In my application, however, I know that the data to
be processed is most likely UTF-8, but may also be UTF-16 with a BOM.
So, what I need is just a safe and robust way of checking the first
two elements in the file. This is basically what I have come up with:

---
set infile [open utf16.TextGrid]

fconfigure $infile -encoding utf-8

set cont [read $infile]

if {[string equal -length 2 $cont "þÿ"] ||
    [string equal -length 2 $cont "ÿþ"]} {
    puts "UTF-16"
} else {
    puts "UTF-8"
}

close $infile
---

Is this safe? What else can I do?

/Fredrik
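One alternative I have been considering is to peek at the raw bytes in
binary mode before deciding how to configure the channel (just a
sketch, reusing the file name from above):

  # Compare the BOM bytes themselves, so the test does not depend on
  # how the utf-8 decoder happens to treat the invalid lead bytes.
  set infile [open utf16.TextGrid rb]
  set bom [read $infile 2]
  close $infile
  if {$bom eq "\xFE\xFF" || $bom eq "\xFF\xFE"} {
      puts "UTF-16"
  } else {
      puts "UTF-8"
  }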
From: Uwe Klein on 4 Aug 2010 05:45

Fredrik Karlsson wrote:
> Hi!
>
> Thank you very much for your answer. However, my question is simpler
> than that. I understand that UTF-16 is generally tricky because there
> may not be a BOM. In my application, however, I know that the data to
> be processed is most likely UTF-8, but may also be UTF-16 with a BOM.
> So, what I need is just a safe and robust way of checking the first
> two elements in the file. This is basically what I have come up with:
>
> ---
> set infile [open utf16.TextGrid]
>
> fconfigure $infile -encoding utf-8
>
> set cont [read $infile]
>
> if {[string equal -length 2 $cont "þÿ"] ||
>     [string equal -length 2 $cont "ÿþ"]} {
>     puts "UTF-16"
> } else {
>     puts "UTF-8"
> }
>
> close $infile
> ---
>
> Is this safe? What else can I do?
>
> /Fredrik

proc determine_encoding file {
    set infile [open $file]
    fconfigure $infile -translation binary
    set head [read $infile 4]
    close $infile
    binary scan $head H8 hhex
    # binary scan produces lowercase hex, so normalize before matching.
    # The UTF-32-LE BOM (FFFE0000) must be tested before the UTF-16-LE
    # prefix FFFE*, or it would be misreported.
    # ref from: http://en.wikipedia.org/wiki/Byte_Order_Mark
    switch -glob -- [string toupper $hhex] {
        FFFE0000 { return utf-32-LE }
        FFFE* { return utf-16-LE }
        FEFF* { return utf-16-BE }
        0000FEFF { return utf-32-BE }
        EFBBBF* { return utf-8 }
        .... { # insert other encodings/filetypes... }
        default { return utf-8 ;# ?? }
    }
}

uwe
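Used e.g. like this (note the return values are just labels, not
registered Tcl encoding names, so map them to channel settings
yourself):

  set enc [determine_encoding utf16.TextGrid]
  puts "detected: $enc"
  if {[string match utf-16* $enc]} {
      # handle UTF-16 here (byte-swap or configure accordingly)
  }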