From: Arian Kuschki on 17 Oct 2009 09:54

Hi all

this has been bugging me for a long time and I do not seem to be able to understand what to do. I always have problems when dealing with input text that contains umlauts. Consider the following:

In [1]: import urllib
In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
In [3]: xml = f.read()
In [4]: f.close()
In [5]: print xml
------> print(xml)
<?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"><forecast_information><city data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6 data=""/><longitude_e6 data=""/><forecast_date data="2009-10-17"/><current_date_time data="2009-10-17 14:20:00 +0000"/><unit_system data="SI"/></forecast_information><current_conditions><condition data="Meistens bew�kt"/><temp_f data="43"/><temp_c data="6"/><humidity data="Feuchtigkeit: 87�%"/><icon data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit Windgeschwindigkeiten von 13 km/h"/></current_conditions><forecast_conditions><day_of_week data="Sa."/><low data="1"/><high data="7"/><icon data="/ig/images/weather/chance_of_rain.gif"/><condition data="Vereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week data="So."/><low data="-1"/><high data="8"/><icon data="/ig/images/weather/chance_of_snow.gif"/><condition data="Vereinzelt Schnee"/></forecast_conditions><forecast_conditions><day_of_week data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/images/weather/mostly_sunny.gif"/><condition data="Teils sonnig"/></forecast_conditions><forecast_conditions><day_of_week data="Di."/><low data="0"/><high data="8"/><icon data="/ig/images/weather/sunny.gif"/><condition data="Klar"/></forecast_conditions></weather></xml_api_reply>

As you can see the umlauts in the XML are not displayed properly.
When I want to process this text (for example with xml.sax), I get error messages because the parser can't read it.

I've tried to read up on this and there is a lot of information on the web, but nothing seems to work for me. For example setting the coding to UTF-8 like this:

# -*- coding: utf-8 -*-

or using the decode() string method.

I always have this kind of problem when input contains umlauts, not just in this case. My locale (on Ubuntu) is en_GB.UTF-8.

Cheers
Arian

From: Diez B. Roggisch on 17 Oct 2009 11:51

Arian Kuschki wrote:
> [...]
> I've tried to read up on this and there is a lot of information on the web, but
> nothing seems to work for me. For example setting the coding to UTF like this:
> # -*- coding: utf-8 -*- or using the decode() string method.

The encoding of the Python source file has nothing to do with this. It's only relevant for unicode literals (in Python 2.x, that's u"...").

> I always have this kind of problem when input contains umlauts, not just in
> this case. My locale (on Ubuntu) is en_GB.UTF-8.

If we assume the data on the website is correct (it appears to be when I open it in Firefox), then your problem is most probably your display/terminal. What does this show you in your interactive interpreter?

>>> print "\xc3\xb6"
ö

For me, it's o-umlaut, ö. This is because the above bytes are the UTF-8 sequence for ö. If this shows something else, you need to adjust your terminal settings.

Diez

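A minimal sketch of the byte-level point in the reply above, written in Python 3 syntax (the thread's own snippets are Python 2). The helper name `decodes_as_utf8` is made up for illustration. It shows that ö is two bytes in UTF-8 but a single byte in ISO-8859-1, and that the single-byte form is not valid UTF-8, which is one plausible reason a UTF-8 terminal would show a replacement character.

```python
# Python 3 sketch; the thread itself uses Python 2.
O_UMLAUT = "\u00f6"  # the character ö

utf8_bytes = O_UMLAUT.encode("utf-8")         # two bytes: C3 B6
latin1_bytes = O_UMLAUT.encode("iso-8859-1")  # one byte: F6


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is a valid UTF-8 byte sequence (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Printing bytes objects shows their escaped form, independent of the terminal.
print(utf8_bytes, decodes_as_utf8(utf8_bytes))
print(latin1_bytes, decodes_as_utf8(latin1_bytes))
```

If the terminal test prints ö correctly, the terminal is fine and the bytes coming from the server are the more likely culprit.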
From: StarWing on 17 Oct 2009 12:11

On 17 Oct, 9:54 pm, Arian Kuschki <arian.kusc...(a)googlemail.com> wrote:
> [...]
> As you can see the umlauts in the XML are not displayed properly.
> [...]

Try this?

# vim: set fencoding=utf-8:
import urllib
import xml.sax as sax, xml.sax.handler as handler

f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
xml = f.read()
xml = xml.decode("cp1252")
f.close()

class my_handler(handler.ContentHandler):
    def startElement(self, name, attrs):
        print "begin:", name, attrs

    def endElement(self, name):
        print "end:", name

sax.parseString(xml, my_handler())

From: MRAB on 17 Oct 2009 12:14

Arian Kuschki wrote:
> [...]
> As you can see the umlauts in the XML are not displayed properly.
> [...]
> I always have this kind of problem when input contains umlauts, not just in
> this case. My locale (on Ubuntu) is en_GB.UTF-8.

The string you received from the website is a bytestring and you're just printing it to your console, which is configured for UTF-8. However, the bytestring isn't valid UTF-8, so the console is replacing the invalid parts with the funny characters.

You should decode the bytestring to Unicode and then re-encode it to UTF-8. I don't know what encoding the website is actually using; here I'm assuming ISO-8859-1:

print xml.decode("iso-8859-1").encode("utf-8")

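The decode-then-re-encode step above can be sketched as follows, in Python 3 syntax. The function name `transcode` and the sample bytes are assumptions made for illustration. In particular, the sample assumes, as the reply does, that the server sent ISO-8859-1; the thread does not confirm the actual encoding.

```python
# Python 3 sketch; ISO-8859-1 is an assumption carried over from the reply above.
def transcode(raw: bytes,
              source_encoding: str = "iso-8859-1",
              target_encoding: str = "utf-8") -> bytes:
    """Decode bytes to text using source_encoding, then encode as target_encoding."""
    text = raw.decode(source_encoding)
    return text.encode(target_encoding)


# Made-up fragment modelled on the weather feed: byte F6 is ö in ISO-8859-1.
sample = b'<condition data="Meistens bew\xf6lkt"/>'
converted = transcode(sample)

# The single byte F6 has become the two-byte UTF-8 sequence C3 B6.
print(converted)
```

In Python 3, `print()` on a `str` encodes for the terminal itself, so the explicit re-encode is mainly needed when writing bytes to a file or socket.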
From: Diez B. Roggisch on 17 Oct 2009 12:50

StarWing wrote:
> [...]
> try this?
>
> # vim: set fencoding=utf-8:
> import urllib
> import xml.sax as sax, xml.sax.handler as handler
>
> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> xml = f.read()
> xml = xml.decode("cp1252")
> f.close()
>
> class my_handler(handler.ContentHandler):
>     def startElement(self, name, attrs):
>         print "begin:", name, attrs
>
>     def endElement(self, name):
>         print "end:", name
>
> sax.parseString(xml, my_handler())

This is wrong. XML is a *byte*-based format, which explicitly states its encoding. So decoding a byte string to a unicode object and then passing that to a parser stops working the moment either of these holds:

- the data is outside your default system encoding (usually ascii)
- the system encoding and the declared encoding differ

Besides, I don't see what the whole SAX stuff is supposed to do that the direct print and the decode() don't do. Smells like cargo cult to me.

Diez
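A rough sketch of handing bytes, not decoded text, to the parser, in Python 3 syntax. The sample documents, the class `ConditionCollector` and the helper `parse_conditions` are made up for illustration, and ISO-8859-1 is again only an assumption about the feed. It compares a document that declares its encoding with one that does not (the feed shown in the thread starts with `<?xml version="1.0"?>` and declares none).

```python
# Python 3 sketch; sample documents and names are hypothetical.
import xml.sax
from xml.sax.handler import ContentHandler


class ConditionCollector(ContentHandler):
    """Collect the data attribute of every <condition> element."""

    def __init__(self):
        super().__init__()
        self.conditions = []

    def startElement(self, name, attrs):
        if name == "condition":
            self.conditions.append(attrs.get("data"))


def parse_conditions(raw: bytes) -> list:
    """Feed raw bytes to the SAX parser and return the collected values."""
    collector = ConditionCollector()
    xml.sax.parseString(raw, collector)
    return collector.conditions


BODY = b'<weather><condition data="Meistens bew\xf6lkt"/></weather>'

# Case 1: the document declares its encoding, so the parser decodes the bytes itself.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY
from_declared = parse_conditions(declared)

# Case 2: no declaration, so the parser assumes UTF-8 and rejects byte F6.
undeclared = b'<?xml version="1.0"?>' + BODY
try:
    parse_conditions(undeclared)
    undeclared_failed = False
except xml.sax.SAXParseException:
    undeclared_failed = True

# Case 3: the real encoding is known out of band (assumed ISO-8859-1 here),
# so the bytes are transcoded to UTF-8 and still passed to the parser as bytes.
from_transcoded = parse_conditions(undeclared.decode("iso-8859-1").encode("utf-8"))
```

In a real client the out-of-band encoding would normally come from the HTTP Content-Type header, not from a guess.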