From: Joe Goldthwaite on 28 Jul 2010 14:32

Hi,

I've got an ASCII file with some Latin characters. Specifically \xe1 and \xfc. I'm trying to import it into a PostgreSQL database that's running in Unicode mode. The Unicode converter chokes on those two characters.

I could just manually replace those two characters with something valid, but if any other invalid characters show up in later versions of the file, I'd like to handle them correctly.

I've been playing with the Unicode stuff and I found out that I could convert both those characters correctly using the latin1 codec like this:

    import unicodedata

    s = '\xe1\xfc'
    print unicode(s, 'latin1')

The above works. When I try to convert my file, however, I still get an error:

    import unicodedata

    input = file('ascii.csv', 'r')
    output = file('unicode.csv', 'w')

    for line in input.xreadlines():
        output.write(unicode(line, 'latin1'))

    input.close()
    output.close()

    Traceback (most recent call last):
      File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
        output.write(unicode(line, 'latin1'))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
    295: ordinal not in range(128)

I'm stuck using Python 2.4.4, which may be handling the strings differently depending on whether they're in the program or coming from the file. I just haven't been able to figure out how to get the Unicode conversion working on the file data.

Can anyone explain what is going on?
From: MRAB on 28 Jul 2010 15:20

Joe Goldthwaite wrote:
> I've got an Ascii file with some latin characters. Specifically \xe1
> and \xfc. I'm trying to import it into a Postgresql database that's
> running in Unicode mode. The Unicode converter chokes on those two
> characters.
> [snip]
> Can anyone explain what is going on?

What you need to remember is that files contain bytes. When you say "ASCII file" what you mean is that the file contains bytes which represent text encoded as ASCII, and such a file by definition can't contain bytes outside the range 0-127. Therefore your file isn't an ASCII file. So then you've decided to treat it as a file containing bytes which represent text encoded as Latin-1.
You're reading bytes from a file, decoding them to Unicode, and then trying to write them to a file, but the output file expects bytes (did I say that files contain bytes? :-)), so it's trying to encode back to bytes using the default encoding, which is ASCII. u'\xe1' can't be encoded as ASCII, therefore UnicodeEncodeError is raised.
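MRAB's point can be seen directly at an interpreter prompt. A minimal sketch, in modern Python 3 syntax (the thread itself uses Python 2.4, where plain string literals are byte strings):

```python
# Bytes must be decoded to text; text must be encoded back to bytes.
raw = b'\xe1\xfc'              # the Latin-1 bytes from the thread
text = raw.decode('latin-1')   # bytes -> text (code points U+00E1, U+00FC)
assert text == u'\xe1\xfc'

# Writing text to a byte-oriented file means encoding it again.
# The ASCII codec can't represent these characters, which is
# exactly the UnicodeEncodeError in the traceback above.
try:
    text.encode('ascii')
except UnicodeEncodeError as e:
    print(e.reason)            # 'ordinal not in range(128)'
```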
From: Thomas Jollans on 28 Jul 2010 15:21

On 07/28/2010 08:32 PM, Joe Goldthwaite wrote:
> [snip]
> input = file('ascii.csv', 'r')
> output = file('unicode.csv','w')

output is still a binary file - there are no unicode files. You need to encode the text somehow.

> Traceback (most recent call last):
>   File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
>     output.write(unicode(line,'latin1'))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in
> position 295: ordinal not in range(128)

By default, Python tries to encode strings using ASCII. This, obviously, won't work here. Do you know which encoding your database expects? I'd assume it'd understand UTF-8. Everybody uses UTF-8.

> for line in input.xreadlines():
>     output.write(unicode(line,'latin1'))

unicode(line, 'latin1') is unicode; you need it to be a UTF-8 byte string:

    unicode(line, 'latin1').encode('utf-8')

or:

    line.decode('latin1').encode('utf-8')
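Thomas's decode-then-encode fix, applied to a whole file, can be sketched in Python 3, where the codec can be attached to the file object itself. File names follow the thread's example; the sample data here is invented:

```python
import io

# Create a small stand-in for the poster's Latin-1 ascii.csv.
with io.open('ascii.csv', 'wb') as f:
    f.write(b'caf\xe9,\xfcber\n')

# Decode Latin-1 on read and encode UTF-8 on write: the codec layer
# does both conversions, so the loop body never touches raw bytes.
with io.open('ascii.csv', 'r', encoding='latin-1') as src, \
        io.open('unicode.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)

with io.open('unicode.csv', 'rb') as f:
    print(f.read())            # b'caf\xc3\xa9,\xc3\xbcber\n'
```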
From: John Nagle on 28 Jul 2010 15:29

On 7/28/2010 11:32 AM, Joe Goldthwaite wrote:
> [snip]
> for line in input.xreadlines():
>     output.write(unicode(line,'latin1'))
>
> input.close()
> output.close()

Try this, which will get you a UTF-8 file, the usual standard for Unicode in a file.

    for rawline in input:
        unicodeline = unicode(line, 'latin1')        # Latin-1 to Unicode
        output.write(unicodeline.encode('utf-8'))    # Unicode to UTF-8

John Nagle
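John's approach (encode to UTF-8 on the way out) can also be done by wrapping the output file with codecs.open(), which already existed in Python 2.4. A sketch in Python 3 syntax, with an invented file name:

```python
import codecs

# codecs.open() returns a stream that encodes on write, so text
# can be written directly without an explicit .encode() call.
with codecs.open('out.csv', 'w', encoding='utf-8') as out:
    out.write(u'\xe1\xfc\n')   # the two characters from the thread

with open('out.csv', 'rb') as f:
    print(f.read())            # b'\xc3\xa1\xc3\xbc\n' (UTF-8 bytes)
```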
From: Thomas Jollans on 28 Jul 2010 15:40

On 07/28/2010 09:29 PM, John Nagle wrote:
> for rawline in input :
>     unicodeline = unicode(line,'latin1') # Latin-1 to Unicode
>     output.write(unicodeline.encode('utf-8')) # Unicode to UTF-8

You got your blocks wrong - and the loop variable is rawline, but the body uses line.