From: Diez B. Roggisch on 17 Oct 2009 14:17

StarWing wrote:
> On Oct 18, 12:50 am, "Diez B. Roggisch" <de...(a)nospam.web.de> wrote:
>> StarWing wrote:
>>> On Oct 17, 9:54 pm, Arian Kuschki <arian.kusc...(a)googlemail.com>
>>> wrote:
>>>> Hi all
>>>> this has been bugging me for a long time and I do not seem to be able to
>>>> understand what to do. I always have problems when dealing with input text
>>>> that contains umlauts. Consider the following:
>>>> In [1]: import urllib
>>>> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>>> In [3]: xml = f.read()
>>>> In [4]: f.close()
>>>> In [5]: print xml
>>>> ------> print(xml)
>>>> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
>>>> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0">
>>>> <forecast_information><city data="Munich, BY"/>
>>>> <postal_code data="Muenchen"/><latitude_e6 data=""/>
>>>> <longitude_e6 data=""/><forecast_date data="2009-10-17"/>
>>>> <current_date_time data="2009-10-17 14:20:00 +0000"/>
>>>> <unit_system data="SI"/></forecast_information>
>>>> <current_conditions><condition data="Meistens bew kt"/>
>>>> <temp_f data="43"/><temp_c data="6"/>
>>>> <humidity data="Feuchtigkeit: 87 %"/>
>>>> <icon data="/ig/images/weather/mostly_cloudy.gif"/>
>>>> <wind_condition data="Wind: W mit Windgeschwindigkeiten von 13 km/h"/>
>>>> </current_conditions>
>>>> <forecast_conditions><day_of_week data="Sa."/><low data="1"/>
>>>> <high data="7"/><icon data="/ig/images/weather/chance_of_rain.gif"/>
>>>> <condition data="Vereinzelt Regen"/></forecast_conditions>
>>>> <forecast_conditions><day_of_week data="So."/><low data="-1"/>
>>>> <high data="8"/><icon data="/ig/images/weather/chance_of_snow.gif"/>
>>>> <condition data="Vereinzelt Schnee"/></forecast_conditions>
>>>> <forecast_conditions><day_of_week data="Mo."/><low data="-4"/>
>>>> <high data="8"/><icon data="/ig/images/weather/mostly_sunny.gif"/>
>>>> <condition data="Teils sonnig"/></forecast_conditions>
>>>> <forecast_conditions><day_of_week data="Di."/><low data="0"/>
>>>> <high data="8" /><icon data="/ig/images/weather/sunny.gif"/>
>>>> <condition data="Klar"/></forecast_conditions></weather></xml_api_reply>
>>>> As you can see the umlauts in the XML are not displayed properly. When I want
>>>> to process this text (for example with xml.sax), I get error messages because
>>>> the parser can't read this.
>>>> I've tried to read up on this and there is a lot of information on the web, but
>>>> nothing seems to work for me. For example setting the coding to UTF like this:
>>>> # -*- coding: utf-8 -*- or using the decode() string method.
>>>> I always have this kind of problem when input contains umlauts, not just in
>>>> this case. My locale (on Ubuntu) is en_GB.UTF-8.
>>>> Cheers
>>>> Arian
>>> try this?
>>> # vim: set fencoding=utf-8:
>>> import urllib
>>> import xml.sax as sax, xml.sax.handler as handler
>>> f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>>> xml = f.read()
>>> xml = xml.decode("cp1252")
>>> f.close()
>>> class my_handler(handler.ContentHandler):
>>>     def startElement(self, name, attrs):
>>>         print "begin:", name, attrs
>>>     def endElement(self, name):
>>>         print "end:", name
>>> sax.parseString(xml, my_handler())
>> This is wrong. XML is a *byte*-based format, which explicitly states
>> encodings. So decoding a byte-string to a unicode-object and then
>> passing it to a parser stops working the very moment you have data that
>>
>> - is outside your default system encoding (usually ascii)
>> - the system encoding and the declared encoding differ
>>
>> Besides, I don't see where the whole SAX-stuff is supposed to do
>> anything the direct print and the decode() don't do - smells like
>> cargo-cult to me.
>>
>> Diez
>
> Yes, XML is a *byte*-based format, and so are utf-8 and the code pages
> (cp936, cp1252, etc.), so XML usually declares its encoding at the head.
> But that didn't work here.
>
> In Python 2.6, sys.getdefaultencoding() returns 'ascii', I can't use
> sys.setdefaultencoding(), and f.read() returns a str, so it must be an
> undecoded, byte-based format (i.e. raw XML data). So using the right code
> page to decode it is safe (notice the webpage is google.de).
>
> In Python 3.1, read() returns a bytes object, so we *must* decode it,
> otherwise we can't pass it into a parser.

You didn't get my point. An XML parser only *takes* a byte-string. Decoding is its business. So your last sentence above is wrong.

Because regardless of the Python version, if you feed the parser a unicode object, Python will first encode that to a byte-string, possibly giving a UnicodeError (maybe this automated conversion has gone in Py3K, but then you get a type error instead).

So to make the above work (if one wants to parse the xml), the proper thing to do would be

xml = xml.decode("cp1252").encode("utf-8")

and then feed that. Of course the really good thing would be to fix the webpage, but that's beyond our capabilities I fear...

Diez

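The decode-then-encode step described above can be sketched as follows. This is only an illustration in Python 3 syntax (the thread itself uses Python 2). RAW is a hypothetical stand-in for the server response, not the real payload, and ConditionCollector, parse_conditions and raw_parse_fails are names made up for the example.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical stand-in for the response: ISO-8859-1 bytes with no
# encoding declaration, so the parser assumes UTF-8.
RAW = (
    '<?xml version="1.0"?><weather>'
    '<condition data="Meistens bew\u00f6lkt"/></weather>'
).encode("iso-8859-1")


class ConditionCollector(ContentHandler):
    """Collects the data attribute of every <condition> element."""

    def __init__(self):
        super().__init__()
        self.conditions = []

    def startElement(self, name, attrs):
        if name == "condition":
            self.conditions.append(attrs.getValue("data"))


def parse_conditions(xml_bytes):
    """Feed bytes to the SAX parser and return the collected values."""
    handler = ConditionCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.conditions


def raw_parse_fails(xml_bytes):
    """Return True if the parser rejects the bytes as not well-formed."""
    try:
        parse_conditions(xml_bytes)
    except xml.sax.SAXParseException:
        return True
    return False


# Transcode to UTF-8 so the bytes match what the parser assumes.
utf8_bytes = RAW.decode("iso-8859-1").encode("utf-8")
conditions = parse_conditions(utf8_bytes)
```

The parser is handed bytes in both cases; the only difference is whether those bytes agree with the encoding it assumes when no declaration is present.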
From: I V on 17 Oct 2009 14:57

On Sat, 17 Oct 2009 18:54:10 +0200, Diez B. Roggisch wrote:
> This is weird. I looked at the site in Firefox - and it was displayed
> correctly, including umlauts. Bringing up the info-dialog claims the
> page is UTF-8, the XML itself says so as well (implicit, through the
> missing declaration of an encoding) - but it clearly is *not* utf-8.

The headers correctly identify it as ISO-8859-1, which overrides the implicit specification of UTF-8. I'm not sure why Firefox is reporting it as UTF-8 (it does that for me, too); I can see the umlauts, so it's clearly processing it as ISO-8859-1.

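If the charset in the HTTP header is what counts, one way to act on it is to read that parameter and transcode the body before parsing. The sketch below is in Python 3 and uses the standard library's email.message.Message to parse the header value. The helper names charset_from_content_type and to_utf8 are invented for illustration, and the fallback to "utf-8" when no charset is given is an assumption that follows the reasoning in this thread.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    The default is an assumption: it mirrors the implicit UTF-8 that an
    XML document without an encoding declaration falls back to.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def to_utf8(body, content_type):
    """Transcode a response body to UTF-8 using the header's charset."""
    charset = charset_from_content_type(content_type)
    return body.decode(charset).encode("utf-8")


# Hypothetical body and header, standing in for the urllib response.
latin1_body = "Meistens bew\u00f6lkt".encode("iso-8859-1")
utf8_body = to_utf8(latin1_body, "text/xml; charset=ISO-8859-1")
```

With a real response object from urllib.request.urlopen, response.headers.get_content_charset() offers the same lookup without building a Message by hand.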
From: Arian Kuschki on 17 Oct 2009 13:50

Whoa, that was quick! Thanks for all the answers, I'll try to recapitulate.

>What does this show you in your interactive interpreter?
>
>>>> print "\xc3\xb6"
>ö
>
>For me, it's o-umlaut, ö. This is because the above bytes are the
>sequence for ö in utf-8.
>
>If this shows something else, you need to adjust your terminal settings.

For me it also prints the correct o-umlaut (ö), so that was not the problem.

All of the below result in xml that shows all umlauts correctly when printed:

xml.decode("cp1252")
xml.decode("cp1252").encode("utf-8")
xml.decode("iso-8859-1")
xml.decode("iso-8859-1").encode("utf-8")

But when I then want to parse the xml, it only works if I do both decode and encode. If I only decode, I get the following error:

SAXParseException: <unknown>:1:1: not well-formed (invalid token)

Do I understand right that since the encoding was not specified in the xml response, it should have been utf-8 by default? And that if it had indeed been utf-8 I would not have had the encoding problem in the first place?

Anyway, thanks everybody, this has helped me a lot.

Arian

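The byte-level picture behind the question above can be checked directly. This is a small Python 3 sketch; LATIN1_BYTES is a made-up sample chosen to resemble the garbled "bew kt" value, and is_valid_utf8 is a helper invented for the example.

```python
# Hypothetical sample: "bewölkt" as a Latin-1 server would send it.
LATIN1_BYTES = b"bew\xf6lkt"

# cp1252 and ISO-8859-1 agree on the umlauts, which is why both
# decode() variants print correctly.
as_cp1252 = LATIN1_BYTES.decode("cp1252")
as_latin1 = LATIN1_BYTES.decode("iso-8859-1")


def is_valid_utf8(data):
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# After re-encoding, the same text is a valid UTF-8 byte sequence.
utf8_bytes = as_latin1.encode("utf-8")
```

The original bytes are not valid UTF-8, so a parser that assumes UTF-8 in the absence of a declaration rejects them, while the re-encoded bytes pass.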
From: Arian Kuschki on 17 Oct 2009 13:54

I just checked and I see the following in the headers:

Content-Type text/xml; charset=UTF-8

Where does it say ISO-8859-1?

From: I V on 17 Oct 2009 15:56

On Sat, 17 Oct 2009 21:24:59 +0330, Arian Kuschki wrote:
> I just checked and I see the following in the headers: Content-Type
> text/xml; charset=UTF-8
>
> Where does it say ISO-8859-1?

In the headers returned via urllib (and via wget). But checking in Firefox, it does indeed specify UTF-8 in the content type. Using wget, but specifying the same User-Agent header that Firefox uses, I get the same UTF-8 Content-Type that I see in Firefox. How bizarre.

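The User-Agent experiment described above could be repeated from Python roughly like this. It is a Python 3 sketch under assumptions: BROWSER_UA is a hypothetical browser string, build_request and declared_charset are names invented here, and the URL from the thread may no longer respond, so the network call is defined but not executed.

```python
import urllib.request

# URL as given in the thread; the service may no longer be available.
URL = "http://www.google.de/ig/api?weather=Muenchen"

# Hypothetical browser-like User-Agent string, for illustration only.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=None):
    """Build a request, optionally overriding the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def declared_charset(request, timeout=10):
    """Send the request and return the charset from its Content-Type."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_charset()


default_request = build_request(URL)
browser_request = build_request(URL, BROWSER_UA)
```

Calling declared_charset on each request and comparing the two results would show whether a server varies its declared charset by User-Agent.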