From: Alf P. Steinbach on 14 Mar 2010 20:37

* Mark Tolonen:
>
> "Terry Reedy" <tjreedy(a)udel.edu> wrote in message
> news:hnjkuo$n16$1(a)dough.gmane.org...
> On 3/14/2010 4:40 PM, Guillermo wrote:
>> Adding the byte that some call a 'utf-8 bom' makes the file an invalid
>> utf-8 file.
>
> Not true. From http://unicode.org/faq/utf_bom.html:
>
> Q: When a BOM is used, is it only in 16-bit Unicode text?
> A: No, a BOM can be used as a signature no matter how the Unicode text
> is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising
> the BOM will be whatever the Unicode character FEFF is converted into by
> that transformation format. In that form, the BOM serves to indicate
> both that it is a Unicode file, and which of the formats it is in.
> Examples:
>
> Bytes          Encoding Form
> 00 00 FE FF    UTF-32, big-endian
> FF FE 00 00    UTF-32, little-endian
> FE FF          UTF-16, big-endian
> FF FE          UTF-16, little-endian
> EF BB BF       UTF-8

Well, technically true, and Terry was wrong about "There is no such thing as a utf-8 'byte order mark'. The concept is an oxymoron.".

It's true that as a descriptive term "byte order mark" is an oxymoron for UTF-8. But in this particular context it's not a descriptive term, and it's not only technically allowed, as you point out, but sometimes required.

However, some tools are unable to process UTF-8 files with a BOM. The most annoying example is the GCC compiler suite, in particular g++, which in its Windows MinGW manifestation insists on UTF-8 source code without a BOM, while Microsoft's compiler needs the BOM to recognize the file as UTF-8. The only way I found to satisfy both compilers, apart from restricting the source to ASCII, or perhaps to Windows ANSI with wide character literals restricted to ASCII (exploiting a bug in g++ that lets it handle narrow character literals with non-ASCII chars), is to preprocess the source code.
But that's not a general solution, since the g++ preprocessor, via another bug, accepts some constructs (which then compile nicely) that the compiler rejects when explicit preprocessing isn't used.

So it's a mess.

Cheers,

- Alf
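
For anyone who wants to poke at the byte signatures from the FAQ table quoted above, here is a minimal Python sketch. The helper names `detect_bom` and `strip_utf8_bom` and the `BOMS` list are made up for illustration; only the `codecs.BOM_*` constants and the `utf-8-sig` codec come from the standard library. Note that stripping the BOM from a copy of a file is a different thing from the explicit preprocessing step described in the post; it is shown only because some tools reject the UTF-8 signature.

```python
import codecs

# Signatures from the FAQ table. UTF-32 is tested before UTF-16 because
# FF FE 00 00 begins with the UTF-16 little-endian signature FF FE.
# (Hypothetical helper table; the names are illustrative only.)
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
]


def detect_bom(data):
    """Return (encoding form, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def strip_utf8_bom(data):
    """Drop a leading EF BB BF, leaving other input unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + "int main() {}".encode("utf-8")
    print(detect_bom(sample))          # ('UTF-8', 3)
    print(strip_utf8_bom(sample))      # bytes without the signature
    # The 'utf-8-sig' codec skips a leading BOM when decoding.
    print(sample.decode("utf-8-sig"))
```

The ordering in `BOMS` is the one design choice that matters: checking the two-byte UTF-16 signatures first would misreport a UTF-32 little-endian file. A file that genuinely starts with FF FE 00 00 is still ambiguous in principle, so treat the result as a guess rather than proof.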