From: Alexandre Ferrieux on 2 Aug 2010 18:47

On Aug 2, 2:00 pm, Zhang Weiwu <zhangweiwu+J...(a)realss.com> wrote:
> On 2010-07-22 20:26, Alexandre Ferrieux wrote:
> > Let's spell things out precisely: do you confirm that, on your OSX
> > machine, given a file "file.txt" in UTF-8, the following code:
> >
> >   pack [text .t]
> >   set f [open file.txt r]
> >   fconfigure $f -encoding utf-8
> >   .t insert 1.0 [read $f]
> >
> > displays garbled characters in the text widget? If yes, how garbled?
> > From the beginning? Or only on certain characters?
>
> Hi. Thank you for trying to sort out the problem. The fact that Joe
> English and Donal K. Fellows, with knowledge of how things work
> internally (in fact, of how they don't), posted addressing the issue
> directly while you didn't get it, is an example of how obscure this
> issue is to language users like us.

Ahem... beware of hasty conclusions. I was merely trying to extract
facts from what was essentially interpretation. See below.

> Copy this test.txt file as-is to Mac OS on ppc

Ah, ppc is big-endian, of course... no mystery here, just the obvious
fact that without a BOM, UTF-16 is only meaningful within the same
endianness.

So all this fuss about a would-be OSX-Tcl bug is merely a complaint
about the (admitted) ambiguity of the "unicode" encoding, meaning
UTF-16 in native byte order, right? If yes, then why not simply
encourage Joe's proposal of order-explicit variants?

-Alex
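For what it's worth, a minimal Tcl 8.x sketch of such an order-explicit
read, done by hand (read_utf16 is a name I just made up, not an existing
API; it assumes an even byte count and leaves any leading U+FEFF alone):

  # Read a UTF-16 file with an explicit byte order ("le" or "be") on
  # Tcl 8.x, where the built-in "unicode" encoding assumes native order.
  proc read_utf16 {path order} {
      set f [open $path rb]          ;# raw bytes, no decoding
      set bytes [read $f]
      close $f
      set native [expr {$::tcl_platform(byteOrder) eq "littleEndian"
                        ? "le" : "be"}]
      if {$order ne $native} {
          # Swap each 16-bit unit: scan the values as little-endian
          # ints, then format the same values back as big-endian.
          binary scan $bytes s* words
          set bytes [binary format S* $words]
      }
      return [encoding convertfrom unicode $bytes]
  }

  # e.g. reading the ppc-written file on a little-endian machine:
  # set text [read_utf16 test.txt be]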
From: Fredrik Karlsson on 3 Aug 2010 07:30

Hi all,

sorry for hijacking this thread a bit, but a related question came to
me - what is the standard / good way of detecting UTF-16 encoding in a
file, rather than the assumed UTF-8 encoding (which in my domain can
be assumed except under somewhat rare circumstances)?

Code snippets would be very helpful.

/Fredrik
From: tom.rmadilo on 3 Aug 2010 20:37

On Aug 3, 4:30 am, Fredrik Karlsson <dargo...(a)gmail.com> wrote:
> Hi all,
>
> sorry for hijacking this thread a bit, but a related question came to
> me - what is the standard / good way of detecting UTF-16 encoding in a
> file, rather than the assumed UTF-8 encoding (which in my domain can
> be assumed except under somewhat rare circumstances)?
>
> Code snippets would be very helpful.

If you only have random chunks of a file or byte stream, there is no
way to tell which is which. In fact, this is true of every byte
stream: the best you can do is rule out encodings or, more usefully,
make educated guesses.

The problem with UTF-16 is that there is no way to detect the byte
order given a random chunk of a file or byte stream: UTF-16 has state.
Usually you have to use something called a byte order mark, or BOM
(FFFE or FEFF). But:

'For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order
mark must not be used because the names of these character sets
already determine the byte order. If encountered, an initial U+FEFF
must be interpreted as a (deprecated) "zero width no-break space".'

Of course, if you don't know the charset, you are screwed. Another
reason to use UTF-8, or binary data, for all external communication.
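To make the guessing concrete, a rough Tcl sketch (the name
guess_encoding, the 1 KB sample size and the 10% NUL threshold are
arbitrary choices of mine, not any standard):

  # Heuristic: check for a BOM first, then fall back to a crude
  # NUL-byte ratio test. Mostly-ASCII UTF-16 text is roughly half NUL
  # bytes, while ordinary UTF-8 text contains none.
  proc guess_encoding {path} {
      set f [open $path rb]          ;# raw bytes, no decoding
      set head [read $f 1024]
      close $f
      # Note: binary scan produces lowercase hex digits.
      if {[binary scan $head H4 bom] == 1} {
          switch -- $bom {
              feff { return utf-16be }
              fffe { return utf-16le }
          }
      }
      set len [string length $head]
      set nuls [regexp -all {\x00} $head]
      if {$len > 0 && double($nuls) / $len > 0.1} {
          return utf-16-unknown-order ;# no BOM: order still a guess
      }
      return utf-8
  }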
From: Fredrik Karlsson on 4 Aug 2010 04:35

Hi!

Thank you very much for your answer. However, my question is simpler
than that. I understand that UTF-16 is generally tricky because there
may not be a BOM. In my application, however, I know that the data to
be processed is most likely UTF-8, but may also be UTF-16 with a BOM.
So, what I need is just a safe and robust way of checking the first
two elements in the file. This is basically what I have come up with:

---
set infile [open utf16.TextGrid]

fconfigure $infile -encoding utf-8

set cont [read $infile]

if {[string equal -length 2 $cont "þÿ"] ||
    [string equal -length 2 $cont "ÿþ"]} {
    puts "UTF-16"
} else {
    puts "UTF-8"
}

close $infile
---

Is this safe? What else can I do?

/Fredrik
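One alternative I have been considering is to peek at the raw bytes in
binary mode before deciding how to configure the channel (just a
sketch, reusing the file name from above):

  # Compare the BOM bytes themselves, so the test does not depend on
  # how the utf-8 decoder happens to treat the invalid lead bytes.
  set infile [open utf16.TextGrid rb]
  set bom [read $infile 2]
  close $infile
  if {$bom eq "\xFE\xFF" || $bom eq "\xFF\xFE"} {
      puts "UTF-16"
  } else {
      puts "UTF-8"
  }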
From: Uwe Klein on 4 Aug 2010 05:45

Fredrik Karlsson wrote:
> Hi!
>
> Thank you very much for your answer. However, my question is simpler
> than that. I understand that UTF-16 is generally tricky because there
> may not be a BOM. In my application, however, I know that the data to
> be processed is most likely UTF-8, but may also be UTF-16 with a BOM.
> So, what I need is just a safe and robust way of checking the first
> two elements in the file. This is basically what I have come up with:
>
> ---
> set infile [open utf16.TextGrid]
>
> fconfigure $infile -encoding utf-8
>
> set cont [read $infile]
>
> if {[string equal -length 2 $cont "þÿ"] ||
>     [string equal -length 2 $cont "ÿþ"]} {
>     puts "UTF-16"
> } else {
>     puts "UTF-8"
> }
>
> close $infile
> ---
>
> Is this safe? What else can I do?
>
> /Fredrik

proc determine_encoding file {
    set infile [open $file]
    fconfigure $infile -translation binary
    set head [read $infile 4]
    close $infile
    binary scan $head H8 hhex
    # binary scan produces lowercase hex, so normalize before matching.
    # The UTF-32-LE BOM (FFFE0000) must be tested before the UTF-16-LE
    # prefix FFFE*, or it would be misreported.
    # ref from: http://en.wikipedia.org/wiki/Byte_Order_Mark
    switch -glob -- [string toupper $hhex] {
        FFFE0000 { return utf-32-LE }
        FFFE* { return utf-16-LE }
        FEFF* { return utf-16-BE }
        0000FEFF { return utf-32-BE }
        EFBBBF* { return utf-8 }
        .... { # insert other encodings/filetypes... }
        default { return utf-8 ;# ?? }
    }
}

uwe
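Used e.g. like this (note the return values are just labels, not
registered Tcl encoding names, so map them to channel settings
yourself):

  set enc [determine_encoding utf16.TextGrid]
  puts "detected: $enc"
  if {[string match utf-16* $enc]} {
      # handle UTF-16 here (byte-swap or configure accordingly)
  }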