From: python on 24 Mar 2010 10:52

I assume there's no standard library function that wraps codecs.open() to sniff a file's BOM header and open the file with the appropriate encoding?

My reading of the docs leads me to believe that there are 5 types of possible BOM headers, with multiple names (synonyms?) for the same BOM encoding type.

BOM = '\xff\xfe'
BOM_LE = '\xff\xfe'
BOM_UTF16 = '\xff\xfe'
BOM_UTF16_LE = '\xff\xfe'
BOM_BE = '\xfe\xff'
BOM32_BE = '\xfe\xff'
BOM_UTF16_BE = '\xfe\xff'
BOM64_BE = '\x00\x00\xfe\xff'
BOM_UTF32_BE = '\x00\x00\xfe\xff'
BOM64_LE = '\xff\xfe\x00\x00'
BOM_UTF32 = '\xff\xfe\x00\x00'
BOM_UTF32_LE = '\xff\xfe\x00\x00'
BOM_UTF8 = '\xef\xbb\xbf'

Is the process of writing a BOM sniffer really as simple as detecting one of these 5 header types and then calling codecs.open() with the appropriate encoding= parameter?

Note: I'm only interested in Unicode encodings. I am not interested in any of the non-Unicode encodings supported by the codecs module.

Thank you,
Malcolm
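A rough sketch of the sniffer described in the question, under some assumptions: it targets Python 3 and uses the built-in open() in place of codecs.open(), the helper names sniff_bom and open_with_bom are made up for illustration, and falling back to plain UTF-8 when no BOM is found is a choice, not something the thread specifies. The UTF-32 BOMs are tested before the UTF-16 ones because '\xff\xfe\x00\x00' starts with '\xff\xfe'.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
# The "utf-8-sig", "utf-16" and "utf-32" codecs consume the BOM themselves,
# so the decoded text does not start with U+FEFF.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(path, default="utf-8"):
    """Return a codec name chosen from the file's first bytes (hypothetical helper).

    `default` is an assumed fallback for files without a BOM.
    Caveat: a UTF-16 LE file whose first character is U+0000 has the same
    leading bytes as a UTF-32 LE BOM and would be misdetected.
    """
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is 4 bytes
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default


def open_with_bom(path, default="utf-8"):
    """Open `path` for reading text with the sniffed encoding (hypothetical helper)."""
    return open(path, "r", encoding=sniff_bom(path, default))
```

Usage would look like `with open_with_bom("data.txt") as f: text = f.read()`, where data.txt is a placeholder file name.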
From: Lawrence D'Oliveiro on 25 Mar 2010 19:16

In message <mailman.1139.1269442366.23598.python-list(a)python.org>, python(a)bdurham.com wrote:

> BOM_UTF8 = '\xef\xbb\xbf'

Since when does UTF-8 need a BOM?
From: Irmen de Jong on 25 Mar 2010 19:21

On 26-3-2010 0:16, Lawrence D'Oliveiro wrote:
> In message <mailman.1139.1269442366.23598.python-list(a)python.org>,
> python(a)bdurham.com wrote:
>
>> BOM_UTF8 = '\xef\xbb\xbf'
>
> Since when does UTF-8 need a BOM?

It doesn't, but it is allowed. Not recommended though. Unfortunately several tools, such as notepad.exe, have a tendency to silently add it when saving files.

-irmen
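For files that may or may not carry the optional UTF-8 BOM, Python's "utf-8-sig" codec is one way to cope. The snippet below is only an illustration, assuming Python 3; the sample bytes are made up.

```python
import codecs

# Bytes as a BOM-adding editor might save them (sample data, not from the thread).
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
assert data.decode("utf-8") == "\ufeffhello"

# "utf-8-sig" drops a leading BOM if present and changes nothing otherwise.
assert data.decode("utf-8-sig") == "hello"
assert b"hello".decode("utf-8-sig") == "hello"

# When encoding, "utf-8-sig" writes the BOM at the start.
assert "hello".encode("utf-8-sig") == data
```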