From: Joe Goldthwaite on 28 Jul 2010 14:32

Hi,

I've got an ASCII file with some Latin characters. Specifically \xe1 and \xfc. I'm trying to import it into a PostgreSQL database that's running in Unicode mode. The Unicode converter chokes on those two characters.

I could just manually replace those two characters with something valid, but if any other invalid characters show up in later versions of the file, I'd like to handle them correctly.

I've been playing with the Unicode stuff and I found out that I could convert both those characters correctly using the latin1 codec like this:

    import unicodedata

    s = '\xe1\xfc'
    print unicode(s, 'latin1')

The above works. When I try to convert my file, however, I still get an error:

    import unicodedata

    input = file('ascii.csv', 'r')
    output = file('unicode.csv', 'w')

    for line in input.xreadlines():
        output.write(unicode(line, 'latin1'))

    input.close()
    output.close()

    Traceback (most recent call last):
      File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
        output.write(unicode(line, 'latin1'))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
    295: ordinal not in range(128)

I'm stuck using Python 2.4.4, which may be handling the strings differently depending on whether they're in the program or coming from the file. I just haven't been able to figure out how to get the Unicode conversion working on the file data.

Can anyone explain what is going on?
From: MRAB on 28 Jul 2010 15:20

Joe Goldthwaite wrote:
> I've got an Ascii file with some latin characters. Specifically \xe1
> and \xfc. I'm trying to import it into a Postgresql database that's
> running in Unicode mode. The Unicode converter chokes on those two
> characters.
> [snip]
> Can anyone explain what is going on?

What you need to remember is that files contain bytes. When you say "ASCII file" what you mean is that the file contains bytes which represent text encoded as ASCII, and such a file by definition can't contain bytes outside the range 0-127. Therefore your file isn't an ASCII file. So then you've decided to treat it as a file containing bytes which represent text encoded as Latin-1.
You're reading bytes from a file, decoding them to Unicode, and then trying to write them to a file, but the output file expects bytes (did I say that files contain bytes? :-)), so it's trying to encode back to bytes using the default encoding, which is ASCII. u'\xe1' can't be encoded as ASCII, therefore UnicodeEncodeError is raised.
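MRAB's point can be seen directly at an interpreter prompt. A minimal sketch, in modern Python 3 syntax (the thread itself uses Python 2.4, where plain string literals are byte strings):

```python
# Bytes must be decoded to text; text must be encoded back to bytes.
raw = b'\xe1\xfc'              # the Latin-1 bytes from the thread
text = raw.decode('latin-1')   # bytes -> text (code points U+00E1, U+00FC)
assert text == u'\xe1\xfc'

# Writing text to a byte-oriented file means encoding it again.
# The ASCII codec can't represent these characters, which is
# exactly the UnicodeEncodeError in the traceback above.
try:
    text.encode('ascii')
except UnicodeEncodeError as e:
    print(e.reason)            # 'ordinal not in range(128)'
```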
From: Thomas Jollans on 28 Jul 2010 15:21

On 07/28/2010 08:32 PM, Joe Goldthwaite wrote:
> [snip]
> input = file('ascii.csv', 'r')
> output = file('unicode.csv','w')

output is still a binary file - there are no unicode files. You need to encode the text somehow.

> Traceback (most recent call last):
>   File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
>     output.write(unicode(line,'latin1'))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in
> position 295: ordinal not in range(128)

By default, Python tries to encode strings using ASCII. This, obviously, won't work here. Do you know which encoding your database expects? I'd assume it'd understand UTF-8. Everybody uses UTF-8.

> for line in input.xreadlines():
>     output.write(unicode(line,'latin1'))

unicode(line, 'latin1') is unicode; you need it to be a UTF-8 byte string:

    unicode(line, 'latin1').encode('utf-8')

or:

    line.decode('latin1').encode('utf-8')
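Thomas's decode-then-encode fix, applied to a whole file, can be sketched in Python 3, where the codec can be attached to the file object itself. File names follow the thread's example; the sample data here is invented:

```python
import io

# Create a small stand-in for the poster's Latin-1 ascii.csv.
with io.open('ascii.csv', 'wb') as f:
    f.write(b'caf\xe9,\xfcber\n')

# Decode Latin-1 on read and encode UTF-8 on write: the codec layer
# does both conversions, so the loop body never touches raw bytes.
with io.open('ascii.csv', 'r', encoding='latin-1') as src, \
        io.open('unicode.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)

with io.open('unicode.csv', 'rb') as f:
    print(f.read())            # b'caf\xc3\xa9,\xc3\xbcber\n'
```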
From: John Nagle on 28 Jul 2010 15:29

On 7/28/2010 11:32 AM, Joe Goldthwaite wrote:
> [snip]
> for line in input.xreadlines():
>     output.write(unicode(line,'latin1'))
>
> input.close()
> output.close()

Try this, which will get you a UTF-8 file, the usual standard for Unicode in a file.

    for rawline in input:
        unicodeline = unicode(line, 'latin1')        # Latin-1 to Unicode
        output.write(unicodeline.encode('utf-8'))    # Unicode to UTF-8

John Nagle
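John's approach (encode to UTF-8 on the way out) can also be done by wrapping the output file with codecs.open(), which already existed in Python 2.4. A sketch in Python 3 syntax, with an invented file name:

```python
import codecs

# codecs.open() returns a stream that encodes on write, so text
# can be written directly without an explicit .encode() call.
with codecs.open('out.csv', 'w', encoding='utf-8') as out:
    out.write(u'\xe1\xfc\n')   # the two characters from the thread

with open('out.csv', 'rb') as f:
    print(f.read())            # b'\xc3\xa1\xc3\xbc\n' (UTF-8 bytes)
```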
From: Thomas Jollans on 28 Jul 2010 15:40

On 07/28/2010 09:29 PM, John Nagle wrote:
> for rawline in input :
>     unicodeline = unicode(line,'latin1') # Latin-1 to Unicode
>     output.write(unicodeline.encode('utf-8')) # Unicode to UTF-8

You got your blocks wrong - and the loop variable is rawline, but the body uses line.