From: Guillermo on 14 Mar 2010 16:40

Hi,

I would appreciate it if someone could point out what I am doing wrong here.

Basically, I need to save a string containing non-ascii characters to a file encoded in utf-8.

If I stay in python, everything seems to work fine, but the moment I try to read the file with another Windows program, everything goes to hell.

So here's the script unicode2file.py:

===================================================================
# encoding=utf-8
import codecs

f = codecs.open("m.txt", mode="w", encoding="utf8")
a = u"mañana"
print repr(a)
f.write(a)
f.close()

f = codecs.open("m.txt", mode="r", encoding="utf8")
a = f.read()
print repr(a)
f.close()
===================================================================

That gives the expected output: both calls to repr() yield the same result.

But now, if I do type m.txt in cmd.exe, I get garbled characters instead of "ñ".

I then open the file with my editor (Sublime Text), and I see "mañana" normally. I save (nothing to be saved, really), go back to the dos prompt, do type m.txt and I get the same garbled characters again.

I then open the file m.txt with notepad, and I see "mañana" normally. I save (again, no actual modifications), go back to the dos prompt, do type m.txt and this time it works! I get "mañana". When notepad opens the file, the encoding is already UTF-8, so short of a UTF-8 bom being added to the file, I don't know what happens when I save the unmodified file. Also, I would think that the python script should save a valid utf-8 file in the first place...

What's going on here?

Regards,
Guillermo

From: Neil Hodgson on 14 Mar 2010 17:05

Guillermo:

> I then open the file m.txt with notepad, and I see "mañana" normally.
> I save (again, no actual modifications), go back to the dos prompt, do
> type m.txt and this time it works! I get "mañana". When notepad opens
> the file, the encoding is already UTF-8, so short of a UTF-8 bom being
> added to the file,

That is what happens: the file now starts with a BOM \xEF\xBB\xBF as you can see with a hex editor.

> I don't know what happens when I save the
> unmodified file. Also, I would think that the python script should
> save a valid utf-8 file in the first place...

It's just as valid UTF-8 without a BOM. People have different opinions on this but for compatibility, I think it is best to always start UTF-8 files with a BOM.

Neil

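For anyone wanting to reproduce what Notepad does from Python itself, here is a minimal sketch. It assumes the standard "utf-8-sig" codec, which is not mentioned in the thread, and a hypothetical file name m_bom.txt in a temporary directory. It writes the string with a BOM and then inspects the raw bytes, much as a hex editor would.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

text = u"ma\xf1ana"  # "mañana", escaped so the source encoding does not matter

# Hypothetical file name, placed in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "m_bom.txt")

# The "utf-8-sig" codec prepends the UTF-8 BOM (EF BB BF) when writing.
f = codecs.open(path, mode="w", encoding="utf-8-sig")
f.write(text)
f.close()

# Read the raw bytes back to check the first three bytes of the file.
f = open(path, "rb")
raw = f.read()
f.close()

has_bom = raw.startswith(codecs.BOM_UTF8)
print(has_bom)
print(repr(raw[:3]))

# Decoding with "utf-8-sig" strips the BOM again, so the text round-trips.
decoded = raw.decode("utf-8-sig")
```

Reading with plain "utf8" instead would leave a leading U+FEFF character in the string, so the reading side should use "utf-8-sig" as well if the file may carry a BOM.
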
From: Guillermo on 14 Mar 2010 17:22

> That is what happens: the file now starts with a BOM \xEF\xBB\xBF as
> you can see with a hex editor.

Is this an enforced convention under Windows, then? My head's aching after so much pulling at my hair, but I have the feeling that the problem only arises when text travels through the dos console...

Cheers,
Guillermo

From: Joaquin Abian on 14 Mar 2010 17:25

On 14 Mar, 22:22, Guillermo <guillermo.lis...(a)googlemail.com> wrote:
> > That is what happens: the file now starts with a BOM \xEF\xBB\xBF as
> > you can see with a hex editor.
>
> Is this an enforced convention under Windows, then? My head's aching
> after so much pulling at my hair, but I have the feeling that the
> problem only arises when text travels through the dos console...
>
> Cheers,
> Guillermo

Search for BOM on Wikipedia. The article there discusses Notepad's behavior.

ja

From: Neil Hodgson on 14 Mar 2010 17:35

Guillermo:

> Is this an enforced convention under Windows, then? My head's aching
> after so much pulling at my hair, but I have the feeling that the
> problem only arises when text travels through the dos console...

The console commonly uses Code Page 437, which is most compatible with old DOS programs since it can display line drawing characters. You can change the code page to UTF-8 with

chcp 65001

Now, "type m.txt" with the original BOM-less file and it should be OK. You may also need to change the console font to one that is Unicode compatible like Lucida Console.

Neil
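
To see where the garbled characters come from, here is a small sketch (not from the thread) that decodes the UTF-8 bytes of "mañana" the way a Code Page 437 console would interpret them. Python's "cp437" codec stands in for the console, which is an assumption about how cmd.exe maps the bytes to glyphs.

```python
# -*- coding: utf-8 -*-
text = u"ma\xf1ana"  # "mañana"

# UTF-8 encodes the single character U+00F1 as two bytes: C3 B1.
utf8_bytes = text.encode("utf-8")

# A CP437 console treats each byte as its own character, so C3 and B1
# become a box-drawing character (U+251C) and a shade block (U+2592).
as_cp437 = utf8_bytes.decode("cp437")

# Interpreting the same bytes as UTF-8 (what chcp 65001 asks the console
# to do) recovers the original text.
as_utf8 = utf8_bytes.decode("utf-8")

# Print code points rather than the characters themselves, so the output
# does not depend on the console's own encoding.
print([hex(ord(c)) for c in as_cp437])
print([hex(ord(c)) for c in as_utf8])
```

The bytes in the file are the same in both cases; only the interpretation applied by the reader changes.
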