From: tom.rmadilo on 4 Aug 2010 12:42

On Aug 4, 2:45 am, Uwe Klein <uwe_klein_habertw...(a)t-online.de> wrote:
> Fredrik Karlsson wrote:
> > Hi!
> >
> > Thank you very much for your answer. However, my question is simpler
> > than that. I understand that UTF-16 is generally tricky because
> > there may not be a BOM.
> > In my application, however, I know that the data to be processed is
> > most likely UTF-8, but may also be UTF-16 with a BOM. So, what I need
> > is just a safe and robust way of checking the first two elements in
> > the file. This is basically what I have come up with:
> >
> > ---
> > set infile [open utf16.TextGrid]
> >
> > fconfigure $infile -encoding utf-8
> >
> > set cont [read $infile]
> >
> > if {[string equal -length 2 $cont "þÿ"] || [string equal -length 2 $cont "ÿþ"]} {
> >     puts "UTF-16"
> > } else {
> >     puts "UTF-8"
> > }
> >
> > close $infile
> > ---
> >
> > Is this safe? What else can I do?
> >
> > /Fredrik
>
> proc determine_encoding file {
>     set infile [open $file]
>     fconfigure $infile -encoding binary
>
>     set head [read $infile 4]
>     close $infile
>     binary scan $head H8 hhex
>     # binary scan yields lowercase hex digits; uppercase them so the
>     # patterns below can match
>     set hhex [string toupper $hhex]
>
>     # ref from: http://en.wikipedia.org/wiki/Byte_Order_Mark
>
>     switch -glob -- $hhex \
>     FFFE* {
>         return utf-16-LE
>     } FEFF* {
>         return utf-16-BE
>     } 0000FEFF {
>         return utf-32-BE
>     } EFBBBF* {
>         return utf-8
>     } .... {
>         # insert other encodings/filetypes...
>     } default {
>         return utf-8 ;# ??
>     }
>
> }

Looks good to me. Note that Uwe has configured the channel in binary mode, which is critical. Depending on the application, you may need or want to remove the BOM; UTF-8, for instance, doesn't need one, since its bytes are always in the same order.

Also, refer to the reference used in the above proc: it gives some hints and links to other resources that explain the many problems associated with the BOM. It is basically an application-level issue and probably shouldn't be hard-coded into your channel code.
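
For anyone who wants to drop the BOM at the application level, as suggested above, here is a minimal sketch. It assumes Tcl 8.5 or later (for lassign) and that the raw bytes were already read from a channel in binary mode. The proc names bom_info and strip_bom are made up for illustration, and the labels they return are descriptive only; they are not necessarily names that [encoding names] reports in your Tcl version. Note also that FFFE0000 is ambiguous: it could be a UTF-32 LE BOM or a UTF-16 LE BOM followed by a NUL character, and this sketch simply assumes the former.

```tcl
# Hypothetical helper: return {label bomLength} for a byte string.
proc bom_info {bytes} {
    # Hex-encode at most the first four bytes. hex is preset so that
    # an empty input still leaves it defined.
    set hex ""
    binary scan [string range $bytes 0 3] H* hex
    set hex [string toupper $hex]
    # Longer signatures are tested before their shorter prefixes:
    # FFFE0000 (UTF-32 LE) also starts with FFFE (UTF-16 LE).
    switch -glob -- $hex {
        0000FEFF { return {utf-32-BE 4} }
        FFFE0000 { return {utf-32-LE 4} }
        EFBBBF*  { return {utf-8 3} }
        FEFF*    { return {utf-16-BE 2} }
        FFFE*    { return {utf-16-LE 2} }
        default  { return {unknown 0} }
    }
}

# Hypothetical helper: return {label payload}, where payload is the
# input with any recognized BOM removed. The payload is still raw
# bytes; decoding it is left to the caller.
proc strip_bom {bytes} {
    lassign [bom_info $bytes] label bomLen
    return [list $label [string range $bytes $bomLen end]]
}
```

To use it on a file, open the file, fconfigure the channel with -translation binary, read the contents, and pass them to strip_bom. What to do with an "unknown" result (Fredrik's case would treat it as UTF-8) is an application decision.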