From: Terry Reedy on 14 Mar 2010 17:37

On 3/14/2010 4:40 PM, Guillermo wrote:
> Hi,
>
> I would appreciate it if someone could point out what I am doing wrong
> here.
>
> Basically, I need to save a string containing non-ASCII characters to
> a file encoded in UTF-8.
>
> If I stay in Python, everything seems to work fine, but the moment I
> try to read the file with another Windows program, everything goes to
> hell.
>
> So here's the script unicode2file.py:
> ===================================================================
> # encoding=utf-8
> import codecs
>
> f = codecs.open("m.txt", mode="w", encoding="utf8")
> a = u"mañana"
> print repr(a)
> f.write(a)
> f.close()
>
> f = codecs.open("m.txt", mode="r", encoding="utf8")
> a = f.read()
> print repr(a)
> f.close()
> ===================================================================
>
> That gives the expected output; both calls to repr() yield the same
> result.
>
> But now, if I do type m.txt in cmd.exe, I get garbled characters
> instead of "ñ".
>
> I then open the file with my editor (Sublime Text), and I see "mañana"
> normally. I save (nothing to be saved, really), go back to the DOS
> prompt, do type m.txt, and I get the same garbled characters again.
>
> I then open the file m.txt with Notepad, and I see "mañana" normally.
> I save (again, no actual modifications), go back to the DOS prompt, do
> type m.txt, and this time it works! I get "mañana". When Notepad opens
> the file, the encoding is already UTF-8, so short of a UTF-8 BOM being

There is no such thing as a UTF-8 'byte order mark'. The concept is an oxymoron.

> added to the file, I don't know what happens when I save the
> unmodified file. Also, I would think that the Python script should
> save a valid UTF-8 file in the first place...

Adding the byte that some call a 'utf-8 bom' makes the file an invalid UTF-8 file.
However, I suspect that Notepad wrote the file in the system encoding, which can encode n-with-tilde and which cmd.exe does understand. If you started with a file containing encoded Cyrillic, Arabic, Hindi, and Chinese characters (for instance), I suspect you would get a different result.

tjr
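The difference the thread is circling around can be checked directly. A minimal sketch (written for modern Python; the thread's script is Python 2): Python's plain "utf-8" codec never writes a signature, while "utf-8-sig" prepends the EF BB BF bytes on write and strips them on read, which matches what Notepad's save produced here.

```python
import codecs
import os
import tempfile

sample = u"mañana"
path = os.path.join(tempfile.mkdtemp(), "m.txt")

# What the original script did: plain UTF-8, no signature bytes.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(sample)
with open(path, "rb") as f:
    assert not f.read().startswith(codecs.BOM_UTF8)

# What Notepad's save did: UTF-8 with the EF BB BF signature prepended.
with codecs.open(path, "w", encoding="utf-8-sig") as f:
    f.write(sample)
with open(path, "rb") as f:
    data = f.read()
assert data.startswith(codecs.BOM_UTF8)

# "utf-8-sig" also strips the signature on read, so round-tripping is safe.
assert data.decode("utf-8-sig") == sample
```

Decoding with "utf-8-sig" works whether or not the signature is present, so it is a forgiving choice when reading files that may have passed through Notepad.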
From: Guillermo on 14 Mar 2010 17:53

> The console is commonly using Code Page 437, which is most compatible
> with old DOS programs since it can display line-drawing characters. You
> can change the code page to UTF-8 with
>
> chcp 65001

That's another issue in my actual script. A twofold problem, actually:

1) For me, chcp gives 850, and I'm relying on that to decode the bytes I get back from the console. I suppose this is bound to fail, because another Windows installation might have a different default codepage.

2) My script gets output from a Popen call (to execute a Powershell script [the new Windows shell language] from Python; it does make sense!). I suppose changing the Windows codepage for a single Popen call isn't straightforward/possible?

Right now, I only get the desired result if I decode the output from Popen as "cp850".
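The cp850 dependency described in point 1 is easy to demonstrate. A small sketch: under code page 850, "ñ" comes back from the console as the single byte 0xA4, which is valid cp850 but not valid UTF-8, so the decode step only works if the codepage is known (or forced) in advance.

```python
# What cmd.exe hands back for "mañana" under code page 850:
# "ñ" is the single byte 0xA4 in that codepage.
raw = b"ma\xa4ana"
assert raw.decode("cp850") == u"mañana"

# The same bytes are not valid UTF-8, so guessing the wrong encoding
# fails outright rather than merely garbling the text.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    decodable_as_utf8 = False
else:
    decodable_as_utf8 = True
assert not decodable_as_utf8
```

This is why hard-coding "cp850" works on one machine but is fragile across Windows installations with different OEM codepages.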
From: Neil Hodgson on 14 Mar 2010 18:15

Guillermo:
> 2) My script gets output from a Popen call (to execute a Powershell
> script [the new Windows shell language] from Python; it does make sense!).
> I suppose changing the Windows codepage for a single Popen call isn't
> straightforward/possible?

You could try SetConsoleOutputCP and SetConsoleCP.

Neil
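The Win32 calls Neil names can be reached from Python via ctypes. A hedged sketch (the function name and return convention here are illustrative, not from the thread); it is a no-op on non-Windows platforms, and 65001 is the UTF-8 code page:

```python
import sys

def force_console_utf8():
    """Switch the current console to code page 65001 (UTF-8).

    Windows-only; returns True on success, False elsewhere or on failure.
    """
    if sys.platform != "win32":
        return False
    import ctypes
    kernel32 = ctypes.windll.kernel32
    # SetConsoleCP controls input decoding; SetConsoleOutputCP controls
    # output. Both return nonzero on success.
    return bool(kernel32.SetConsoleCP(65001) and
                kernel32.SetConsoleOutputCP(65001))

ok = force_console_utf8()
```

Note this changes the code page for the whole console, not per-Popen-call; a child process inherits the console of its parent unless given a new one.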
From: Guillermo on 14 Mar 2010 18:21

> 2) My script gets output from a Popen call (to execute a Powershell
> script [the new Windows shell language] from Python; it does make sense!).
> I suppose changing the Windows codepage for a single Popen call isn't
> straightforward/possible?

Never mind. I'm able to change Windows' codepage to 65001 from within the Powershell script, and I get back a string encoded in UTF-8 with a BOM, so problem solved!

Thanks for the help,
Guillermo
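The "UTF-8 with a BOM" output Guillermo describes still needs the signature stripped on the Python side. A short sketch (the captured bytes are simulated, not from the actual script): decoding with "utf-8-sig" handles the marker whether or not it is present, so it is a safe default for output captured from a console at code page 65001.

```python
import codecs

# Simulated Popen stdout from a console at code page 65001:
# UTF-8 bytes with a leading EF BB BF signature.
captured = codecs.BOM_UTF8 + u"mañana".encode("utf-8")

# Plain "utf-8" would leave a stray U+FEFF at the start of the string;
# "utf-8-sig" strips it.
text = captured.decode("utf-8-sig")
assert text == u"mañana"
assert captured.decode("utf-8") == u"\ufeff" + u"mañana"
```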
From: Mark Tolonen on 14 Mar 2010 20:02

"Terry Reedy" <tjreedy(a)udel.edu> wrote in message news:hnjkuo$n16$1(a)dough.gmane.org...
On 3/14/2010 4:40 PM, Guillermo wrote:
> Adding the byte that some call a 'utf-8 bom' makes the file an invalid
> utf-8 file.

Not true. From http://unicode.org/faq/utf_bom.html:

Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be whatever the Unicode character FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. Examples:

Bytes        Encoding Form
00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian
EF BB BF     UTF-8

-Mark
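The table Mark quotes can be cross-checked against the constants in Python's codecs module, which match it byte for byte; each signature is simply U+FEFF pushed through the corresponding encoder.

```python
import codecs

# The codecs constants match the FAQ's table byte for byte.
assert codecs.BOM_UTF32_BE == b"\x00\x00\xfe\xff"
assert codecs.BOM_UTF32_LE == b"\xff\xfe\x00\x00"
assert codecs.BOM_UTF16_BE == b"\xfe\xff"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Each is the character U+FEFF encoded in the corresponding form.
assert u"\ufeff".encode("utf-8") == codecs.BOM_UTF8
assert u"\ufeff".encode("utf-16-be") == codecs.BOM_UTF16_BE
assert u"\ufeff".encode("utf-16-le") == codecs.BOM_UTF16_LE
```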