From: Diez B. Roggisch on 17 Oct 2009 12:50

StarWing wrote:
> On Oct 17, 9:54 pm, Arian Kuschki <arian.kusc...(a)googlemail.com> wrote:
>> Hi all
>>
>> this has been bugging me for a long time and I do not seem to be able to
>> understand what to do. I always have problems when dealing with input text
>> that contains umlauts. Consider the following:
>>
>> In [1]: import urllib
>>
>> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>
>> In [3]: xml = f.read()
>>
>> In [4]: f.close()
>>
>> In [5]: print xml
>> ------> print(xml)
>> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
>> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0">
>> <forecast_information><city data="Munich, BY"/><postal_code data="Muenchen"/>
>> <latitude_e6 data=""/><longitude_e6 data=""/>
>> <forecast_date data="2009-10-17"/>
>> <current_date_time data="2009-10-17 14:20:00 +0000"/>
>> <unit_system data="SI"/></forecast_information>
>> <current_conditions><condition data="Meistens bew�kt"/>
>> <temp_f data="43"/><temp_c data="6"/>
>> <humidity data="Feuchtigkeit: 87�%"/>
>> <icon data="/ig/images/weather/mostly_cloudy.gif"/>
>> <wind_condition data="Wind: W mit Windgeschwindigkeiten von 13 km/h"/>
>> </current_conditions>
>> <forecast_conditions><day_of_week data="Sa."/><low data="1"/><high data="7"/>
>> <icon data="/ig/images/weather/chance_of_rain.gif"/>
>> <condition data="Vereinzelt Regen"/></forecast_conditions>
>> <forecast_conditions><day_of_week data="So."/><low data="-1"/><high data="8"/>
>> <icon data="/ig/images/weather/chance_of_snow.gif"/>
>> <condition data="Vereinzelt Schnee"/></forecast_conditions>
>> <forecast_conditions><day_of_week data="Mo."/><low data="-4"/><high data="8"/>
>> <icon data="/ig/images/weather/mostly_sunny.gif"/>
>> <condition data="Teils sonnig"/></forecast_conditions>
>> <forecast_conditions><day_of_week data="Di."/><low data="0"/><high data="8"/>
>> <icon data="/ig/images/weather/sunny.gif"/>
>> <condition data="Klar"/></forecast_conditions></weather></xml_api_reply>
>>
>> As you can see the umlauts in the XML are not displayed properly. When I want
>> to process this text (for example with xml.sax), I get error messages because
>> the parser can't read this.
>>
>> I've tried to read up on this and there is a lot of information on the web, but
>> nothing seems to work for me. For example setting the coding to UTF like this:
>> # -*- coding: utf-8 -*- or using the decode() string method.
>>
>> I always have this kind of problem when input contains umlauts, not just in
>> this case. My locale (on Ubuntu) is en_GB.UTF-8.
>>
>> Cheers
>> Arian
>
> try this?
>
> # vim: set fencoding=utf-8:
> import urllib
> import xml.sax as sax, xml.sax.handler as handler
>
> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> xml = f.read()
> xml = xml.decode("cp1252")
> f.close()
>
> class my_handler(handler.ContentHandler):
>     def startElement(self, name, attrs):
>         print "begin:", name, attrs
>
>     def endElement(self, name):
>         print "end:", name
>
> sax.parseString(xml, my_handler())

This is wrong. XML is a *byte*-based format that explicitly states its encoding. So decoding a byte string to a unicode object and then passing it to a parser stops working the very moment

 - you have data that is outside your default system encoding (usually ascii)
 - the system encoding and the declared encoding differ

Besides, I don't see what the whole SAX stuff is supposed to do that the direct print and the decode() don't do - smells like cargo cult to me.

Diez
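Diez's point that the parser should receive bytes together with an honest encoding declaration can be sketched as follows. This is a minimal illustration in Python 3 (the thread itself uses Python 2), with no network access: the sample document, the `ConditionCollector` handler, and the `parse_conditions` helper are all hypothetical names made up for the example, and the bytes are assumed to be ISO-8859-1 as MRAB guesses later in the thread.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ConditionCollector(ContentHandler):
    """Hypothetical handler: collects the data attribute of <condition> elements."""

    def __init__(self):
        super().__init__()
        self.conditions = []

    def startElement(self, name, attrs):
        if name == "condition":
            self.conditions.append(attrs.getValue("data"))


def parse_conditions(raw_bytes):
    """Feed the raw bytes to the parser; the parser does the decoding itself."""
    handler = ConditionCollector()
    xml.sax.parseString(raw_bytes, handler)
    return handler.conditions


# Made-up sample body, encoded the way the server is assumed to have sent it.
body = '<weather><condition data="Meistens bew\u00f6lkt"/></weather>'.encode("iso-8859-1")

# With a matching encoding declaration the parser handles the umlaut.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
declared_result = parse_conditions(declared)

# Without a declaration the parser assumes UTF-8 and rejects the stray byte.
undeclared = b'<?xml version="1.0"?>' + body
try:
    parse_conditions(undeclared)
    undeclared_ok = True
except xml.sax.SAXParseException:
    undeclared_ok = False
```

The second case mirrors what the original poster describes: the document carries no encoding declaration, so the parser assumes UTF-8 and stops at the first byte that is not valid UTF-8.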
From: Diez B. Roggisch on 17 Oct 2009 12:54

MRAB wrote:
> Arian Kuschki wrote:
>> [...]
>> As you can see the umlauts in the XML are not displayed properly. When
>> I want to process this text (for example with xml.sax), I get error
>> messages because the parser can't read this.
>> [...]
>> I always have this kind of problem when input contains umlauts, not
>> just in this case. My locale (on Ubuntu) is en_GB.UTF-8.
>>
> The string you received from the website is a bytestring and you're just
> printing it to your console, which is configured for UTF-8. However, the
> bytestring isn't valid UTF-8, so the console is replacing the invalid
> parts with the funny characters.

This is weird. I looked at the site in Firefox, and it was displayed correctly, including umlauts. The page info dialog claims the page is UTF-8, and the XML itself says so as well (implicitly, through the missing encoding declaration), but it clearly is *not* UTF-8. One would expect Google to be better at this...

Diez
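Whether a response really is UTF-8, whatever the page claims, can be checked directly on the bytes. A small sketch in Python 3, where `is_valid_utf8` is a hypothetical helper and the sample strings are made up rather than taken from the real response:

```python
def is_valid_utf8(raw_bytes):
    """Return True if raw_bytes decodes cleanly as UTF-8, False otherwise."""
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text in two encodings: only one of them is valid UTF-8.
as_latin1 = "Meistens bew\u00f6lkt".encode("iso-8859-1")
as_utf8 = "Meistens bew\u00f6lkt".encode("utf-8")

latin1_is_utf8 = is_valid_utf8(as_latin1)
utf8_is_utf8 = is_valid_utf8(as_utf8)
```

Pure ASCII input passes this check under either encoding, so the test only says something once the data contains non-ASCII characters such as the umlauts here.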
From: StarWing on 17 Oct 2009 12:55

On Oct 18, 12:14 am, MRAB <pyt...(a)mrabarnett.plus.com> wrote:
> Arian Kuschki wrote:
> > [...]
> > As you can see the umlauts in the XML are not displayed properly. When I want
> > to process this text (for example with xml.sax), I get error messages because
> > the parser can't read this.
> > [...]
>
> The string you received from the website is a bytestring and you're just
> printing it to your console, which is configured for UTF-8. However, the
> bytestring isn't valid UTF-8, so the console is replacing the invalid
> parts with the funny characters.
>
> You should decode the bytestring to Unicode and then re-encode it to
> UTF-8. I don't know what encoding the website is actually using; here
> I'm assuming ISO-8859-1:
>
> print xml.decode("iso-8859-1").encode("utf-8")

In 2.6, str.decode returns a unicode object, so you can print it directly. In 3.1, str.encode returns bytes, so you can also print it directly. So just decode("cp1252"); that is enough.
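The decode and re-encode step discussed here can be sketched without the network. This is a Python 3 illustration under stated assumptions: the sample bytes are made up, and the stray byte after "87" is assumed to be a non-breaking space (0xA0), which the thread does not confirm.

```python
# Made-up sample: the bytes as an ISO-8859-1 server would send them.
raw = "Meistens bew\u00f6lkt, Feuchtigkeit: 87\u00a0%".encode("iso-8859-1")

# MRAB's guess (ISO-8859-1) and StarWing's guess (cp1252) agree for these
# bytes; the two encodings only differ in the range 0x80-0x9F.
text_latin1 = raw.decode("iso-8859-1")
text_cp1252 = raw.decode("cp1252")

# Re-encoding gives bytes that a UTF-8 console or parser can consume.
utf8_bytes = text_latin1.encode("utf-8")

# In Python 3, printing the decoded str shows the umlaut, while printing a
# bytes object shows its repr (b'...') rather than the text.
bytes_repr = repr(utf8_bytes)
```

So for the weather text either code page gives the same result; the choice only matters if the data contains bytes in the 0x80-0x9F range.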
From: StarWing on 17 Oct 2009 13:02

On Oct 18, 12:50 am, "Diez B. Roggisch" <de...(a)nospam.web.de> wrote:
> StarWing wrote:
> > [...]
> > xml = f.read()
> > xml = xml.decode("cp1252")
> > [...]
> > sax.parseString(xml, my_handler())
>
> This is wrong. XML is a *byte*-based format, which explicitly states
> encodings. So decoding a byte-string to a unicode-object and then
> passing it to a parser is not working in the very moment you have data that
>
>  - is outside your default-system-encoding (usually ascii)
>  - the system-encoding and the declared decoding differ
>
> Besides, I don't see where the whole SAX-stuff is supposed to do
> anything the direct print and the decode() don't do - smells like
> cargo-cult to me.
>
> Diez

Yes, XML is a *byte*-based format, and so are UTF-8 and the code pages (cp936, cp1252, etc.), so XML usually declares its encoding in the header. But that didn't work here.

In Python 2.6, sys.getdefaultencoding() returns 'ascii', I can't use sys.setdefaultencoding(), and f.read() returns a str. So the result must be undecoded, byte-based data (i.e. raw XML data), and decoding it with the right code page is safe (notice the web page is google.de). In Python 3.1, read() returns a bytes object, so we *must* decode it, or we can't pass it to a parser.
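Picking "the right code page" does not have to be a guess based on the site's country; an HTTP response may name its charset in the Content-Type header. A hedged sketch in Python 3: the header values below are invented, the thread does not say what the server actually sent, and `charset_from_content_type` and `decode_body` are hypothetical helpers, with cp1252 as the fallback only because that is the guess made in this thread.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw_bytes, content_type, fallback="cp1252"):
    """Decode with the declared charset, else with the (assumed) fallback.

    An unknown charset name raises LookupError; this sketch does not catch it.
    """
    charset = charset_from_content_type(content_type) or fallback
    return raw_bytes.decode(charset)


raw = "Meistens bew\u00f6lkt".encode("iso-8859-1")

# Invented header values, for illustration only.
with_charset = decode_body(raw, "text/xml; charset=ISO-8859-1")
without_charset = decode_body(raw, "text/xml")
```

If the goal is to parse the XML rather than print it, the result can be re-encoded to UTF-8, or the raw bytes can be handed to the parser with a correct encoding declaration, as Diez argues above.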