umlauts [Python]


From: Arian Kuschki on 17 Oct 2009 09:54

Hi all

this has been bugging me for a long time and I do not seem to be able to
understand what to do. I always have problems when dealing with input text
that contains umlauts. Consider the following:

In [1]: import urllib

In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")

In [3]: xml = f.read()

In [4]: f.close()

In [5]: print xml
------> print(xml)
<?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"
><forecast_information><cit
y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
data=""/><longitude_e6 data=""/><forecast_date
data="2009-10-17"/><current_date_time data="2009-10
-17 14:20:00 +0000"/><unit_system
data="SI"/></forecast_information><current_conditions><condition data="Meistens
bew�kt"/><temp_f data="43"/><temp_c data="6"/><h
umidity data="Feuchtigkeit: 87�%"/><icon
data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
Windgeschwindigkeiten von 13 km/h"/></curr
ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
data="1"/><high data="7"/><icon
data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
data="So."/><low data="-1"/><high data="8"/><icon
data="/ig/images/weather/chance_of_sno
w.gif"/><condition data="Vereinzelt
Schnee"/></forecast_conditions><forecast_conditions><day_of_week
data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
mages/weather/mostly_sunny.gif"/><condition data="Teils
sonnig"/></forecast_conditions><forecast_conditions><day_of_week
data="Di."/><low data="0"/><high data="8"
/><icon data="/ig/images/weather/sunny.gif"/><condition
data="Klar"/></forecast_conditions></weather></xml_api_reply>

As you can see, the umlauts in the XML are not displayed properly. When I want
to process this text (for example with xml.sax), I get error messages because
the parser can't read it.

I've tried to read up on this and there is a lot of information on the web, but
nothing seems to work for me. For example, I tried setting the source encoding
to UTF-8 with # -*- coding: utf-8 -*- and I tried the decode() string method.

I always have this kind of problem when input contains umlauts, not just in
this case. My locale (on Ubuntu) is en_GB.UTF-8.

Cheers
Arian

From: Diez B. Roggisch on 17 Oct 2009 11:51

Arian Kuschki wrote:
> I've tried to read up on this and there is a lot of information on the web, but
> nothing seems to work for me. For example setting the coding to UTF like this:
> # -*- coding: utf-8 -*- or using the decode() string method.

The encoding of the Python source file has nothing to do with this. It's
only relevant for unicode literals (in Python 2.x, that's u"...").
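
To make that point concrete, here is a minimal sketch. It is written in Python 3 syntax (an assumption, the thread itself uses Python 2), and the names latin1_source, utf8_source and literal_from are made up for illustration. It compiles the same one-line module stored in two different encodings and shows that the coding line only tells Python how to read the literal in the source. Bytes that arrive at runtime are not touched by it.

```python
# The same tiny module, saved once as ISO-8859-1 and once as UTF-8.
# Only the bytes of the literal and the coding line differ.
latin1_source = b'# -*- coding: iso-8859-1 -*-\ntext = "\xf6"\n'
utf8_source = b'# -*- coding: utf-8 -*-\ntext = "\xc3\xb6"\n'


def literal_from(source_bytes):
    """Compile the module source and return the value of its `text` literal."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["text"]


# Both files yield the same one-character string, o-umlaut (U+00F6).
same_literal = literal_from(latin1_source) == literal_from(utf8_source)

# Data received at runtime is a different matter: these bytes stay exactly
# as they are, whatever coding line the script itself carries.
runtime_bytes = b"bew\xf6lkt"
```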

>
> I always have this kind of problem when input contains umlauts, not just in
> this case. My locale (on Ubuntu) is en_GB.UTF-8.

If we assume the data on the website is correct (it appears to be when I
open it in Firefox), then your problem is most probably your display/terminal.

What does this show you in your interactive interpreter?

>>> print "\xc3\xb6"
ö

For me, it's o-umlaut, ö. This is because the above bytes are the
sequence for ö in utf-8.

If this shows something else, you need to adjust your terminal settings.
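
A rough way to run the same check without relying on what the terminal draws is sketched below, in Python 3 syntax (an assumption, since the session above is Python 2). The helper names describe_output_encoding and renders_as are hypothetical. It reports the encoding Python believes stdout uses and shows how the two test bytes come out when interpreted as UTF-8 and as ISO-8859-1.

```python
import sys

O_UMLAUT_UTF8 = b"\xc3\xb6"  # the two bytes used in the test above


def describe_output_encoding(stream=sys.stdout):
    """Return the encoding Python believes the output stream uses."""
    return getattr(stream, "encoding", None) or "unknown"


def renders_as(raw, encoding):
    """Decode raw bytes the way a terminal using `encoding` would read them."""
    return raw.decode(encoding, errors="replace")


# ascii() keeps the output printable even on a misconfigured terminal.
print("stdout encoding:", describe_output_encoding())
print("read as UTF-8:     ", ascii(renders_as(O_UMLAUT_UTF8, "utf-8")))
print("read as ISO-8859-1:", ascii(renders_as(O_UMLAUT_UTF8, "iso-8859-1")))
```

If the first line does not report UTF-8, the locale seen by Python and the terminal settings are probably worth comparing.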

Diez

From: StarWing on 17 Oct 2009 12:11


try this?

# vim: set fencoding=utf-8:
import urllib
import xml.sax as sax, xml.sax.handler as handler

f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
xml = f.read()
xml = xml.decode("cp1252")
f.close()

class my_handler(handler.ContentHandler):
    def startElement(self, name, attrs):
        print "begin:", name, attrs

    def endElement(self, name):
        print "end:", name

sax.parseString(xml, my_handler())

From: MRAB on 17 Oct 2009 12:14

The string you received from the website is a bytestring and you're just
printing it to your console, which is configured for UTF-8. However, the
bytestring isn't valid UTF-8, so the console is replacing the invalid
parts with the funny characters.

You should decode the bytestring to Unicode and then re-encode it to
UTF-8. I don't know what encoding the website is actually using; here
I'm assuming ISO-8859-1:

print xml.decode("iso-8859-1").encode("utf-8")
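
Rather than guessing the encoding, it can often be read from the Content-Type header of the HTTP response. The sketch below is in Python 3 syntax (an assumption, the thread uses Python 2) and works on a sample header string instead of a live request. The helper names and the ISO-8859-1 fallback are illustrative choices, not something the site is known to use. In Python 3 the same get_content_charset() method is available on the headers object of a urllib.request.urlopen() response.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value.

    `default` is an assumed fallback for headers that name no charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def to_utf8(raw, source_encoding):
    """Decode bytes from their real encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Sample data standing in for the downloaded XML: o-umlaut as the single
# ISO-8859-1 byte 0xF6, which is not valid UTF-8 on its own.
raw = b"Meistens bew\xf6lkt"
encoding = charset_from_content_type("text/xml; charset=ISO-8859-1")
utf8_bytes = to_utf8(raw, encoding)
```

After the conversion the umlaut is the two-byte sequence \xc3\xb6, which a UTF-8 console can display.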

From: Diez B. Roggisch on 17 Oct 2009 12:50

StarWing wrote:
> try this?
>
> # vim: set fencoding=utf-8:
> import urllib
> import xml.sax as sax, xml.sax.handler as handler
>
> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> xml = f.read()
> xml = xml.decode("cp1252")
> f.close()
>
> class my_handler(handler.ContentHandler):
>     def startElement(self, name, attrs):
>         print "begin:", name, attrs
>
>     def endElement(self, name):
>         print "end:", name
>
> sax.parseString(xml, my_handler())

This is wrong. XML is a *byte*-based format that explicitly declares its
encoding. So decoding a byte string to a unicode object and then passing
that to a parser stops working the moment you have data that

- is outside your default system encoding (usually ascii), or
- was declared in an encoding that differs from the system encoding

Besides, I don't see where the whole SAX-stuff is supposed to do
anything the direct print and the decode() don't do - smells like
cargo-cult to me.
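
As an illustration of handing the parser bytes and letting it apply the declared encoding, here is a minimal sketch in Python 3 syntax (an assumption, the thread uses Python 2). The sample document and the ConditionCollector class are made up for the example. Unlike the reply from the weather service shown above, the sample carries an explicit encoding declaration in its XML prolog.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ConditionCollector(ContentHandler):
    """Collect the `data` attribute of every <condition> element."""

    def __init__(self):
        super().__init__()
        self.conditions = []

    def startElement(self, name, attrs):
        if name == "condition":
            self.conditions.append(attrs.getValue("data"))


# Raw bytes, not decoded beforehand. The prolog tells the parser that the
# single byte 0xF6 is to be read as ISO-8859-1, i.e. o-umlaut.
document = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<weather><condition data="Meistens bew\xf6lkt"/></weather>'
)

collector = ConditionCollector()
xml.sax.parseString(document, collector)
```

The handler receives already decoded text, so no decode() call is needed anywhere in the application code.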

Diez
