From: Mister Yu on 1 Apr 2010 06:56
hi experts,

i m new to python, i m writing crawlers to extract data from some chinese websites, and i run into a encoding problem.

i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4' which is encoded in "gb2312", but i have no idea of how to convert it back to utf-8

to re-create this one is easy:

this will work
============================
>>> su = u"中文".encode('gb2312')
>>> su
u
>>> print su.decode('gb2312')
中文 -> (same as the original string)
============================

but this doesn't,why
===========================
>>> su = u'\xd6\xd0\xce\xc4'
>>> su
u'\xd6\xd0\xce\xc4'
>>> print su.decode('gb2312')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
===========================

thank you
From: Chris Rebert on 1 Apr 2010 07:22
2010/4/1 Mister Yu <eryan.yu(a)gmail.com>:
> hi experts,
>
> i m new to python, i m writing crawlers to extract data from some
> chinese websites, and i run into a encoding problem.
>
> i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> which is encoded in "gb2312",

No! Instances of type 'unicode' (i.e. strings with a leading 'u') ***aren't encoded at all***.

> but i have no idea of how to convert it
> back to utf-8

To convert u'\xd6\xd0\xce\xc4' to UTF-8, do
u'\xd6\xd0\xce\xc4'.encode('utf-8')

> to re-create this one is easy:
>
> this will work
> ============================
> >>> su = u"中文".encode('gb2312')
> >>> su
> u
> >>> print su.decode('gb2312')
> 中文 -> (same as the original string)
>
> ============================
> but this doesn't,why
> ===========================
> >>> su = u'\xd6\xd0\xce\xc4'
> >>> su
> u'\xd6\xd0\xce\xc4'
> >>> print su.decode('gb2312')

You can't decode a unicode string, it's already been decoded!

One decodes a bytestring to get a unicode string.
One **encodes** a unicode string to get a bytestring.

So the last line of your example should be:
print su.encode('gb2312')

Only call .encode() on things of type 'unicode'.
Only call .decode() on things of type 'str'.
[When using Python 2.x that is. Python 3.x renames the types in question.]

Cheers,
Chris
--
http://blog.rebertia.com
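
The encode/decode rule in this reply can be sketched as follows. This is a minimal illustration, assuming Python 3, where Python 2's `unicode` type is called `str` and Python 2's `str` (a bytestring) is called `bytes`. The variable names are made up for the example.

```python
# Python 3: text is `str`, encoded data is `bytes`.
text = "\u4e2d\u6587"                 # the two characters 中文, as code points

# Encoding turns text into bytes; the result depends on the codec chosen.
gb_bytes = text.encode("gb2312")
utf8_bytes = text.encode("utf-8")

print(gb_bytes)      # b'\xd6\xd0\xce\xc4'
print(utf8_bytes)    # b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding turns bytes back into text, using the same codec that produced them.
assert gb_bytes.decode("gb2312") == text
assert utf8_bytes.decode("utf-8") == text

# Converting between encodings always goes through text: bytes -> str -> bytes.
converted = gb_bytes.decode("gb2312").encode("utf-8")
assert converted == utf8_bytes
```

In Python 3 the mistake from the original post cannot be written at all, since `str` has no `.decode()` method and `bytes` has no `.encode()` method.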
From: Mister Yu on 1 Apr 2010 07:38
On Apr 1, 7:22 pm, Chris Rebert <c...(a)rebertia.com> wrote:
> 2010/4/1 Mister Yu <eryan...(a)gmail.com>:
>
> > hi experts,
>
> > i m new to python, i m writing crawlers to extract data from some
> > chinese websites, and i run into a encoding problem.
>
> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> > which is encoded in "gb2312",
>
> No! Instances of type 'unicode' (i.e. strings with a leading 'u')
> ***aren't encoded at all***.
>
> > but i have no idea of how to convert it
> > back to utf-8
>
> To convert u'\xd6\xd0\xce\xc4' to UTF-8, do u'\xd6\xd0\xce\xc4'.encode('utf-8')
>
> > to re-create this one is easy:
>
> > this will work
> > ============================
> > >>> su = u"中文".encode('gb2312')
> > >>> su
> > u
> > >>> print su.decode('gb2312')
> > 中文 -> (same as the original string)
>
> > ============================
> > but this doesn't,why
> > ===========================
> > >>> su = u'\xd6\xd0\xce\xc4'
> > >>> su
> > u'\xd6\xd0\xce\xc4'
> > >>> print su.decode('gb2312')
>
> You can't decode a unicode string, it's already been decoded!
>
> One decodes a bytestring to get a unicode string.
> One **encodes** a unicode string to get a bytestring.
>
> So the last line of your example should be:
> print su.encode('gb2312')
>
> Only call .encode() on things of type 'unicode'.
> Only call .decode() on things of type 'str'.
> [When using Python 2.x that is. Python 3.x renames the types in question.]
>
> Cheers,
> Chris
> --
> http://blog.rebertia.com

hi, thanks for the tips.

but i m still not very sure how to convert a unicode object ** u'\xd6\xd0\xce\xc4' ** back to "中文", the string it's supposed to be?

thanks. sorry i m really new to python.
From: Mister Yu on 1 Apr 2010 07:47
===========================================
print u'\xd6\xd0\xce\xc4'.encode('utf-8')
ÖÐÎÄ (the result is supposed to be "中文" but not something like this)
===========================================
>>> su = u"中文".encode('gb2312')
>>> su
'\xd6\xd0\xce\xc4'
===========================================
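
A small sketch of why encoding that object to UTF-8 prints accented Latin letters rather than Chinese. It assumes Python 3 syntax (the thread uses Python 2), and `s`, `names` and `utf8_bytes` are illustrative names only. The point is that the unicode object holds the code points U+00D6, U+00D0, U+00CE and U+00C4, which are Latin letters in their own right.

```python
import unicodedata

# Python 3 spelling of the Python 2 literal u'\xd6\xd0\xce\xc4'.
s = "\xd6\xd0\xce\xc4"

# Look up the official Unicode name of each code point in the string.
names = [unicodedata.name(ch) for ch in s]
for ch, name in zip(s, names):
    print("U+%04X %s" % (ord(ch), name))

# Encoding to UTF-8 faithfully encodes those four Latin letters,
# two bytes each, so no Chinese text can come out of this step.
utf8_bytes = s.encode("utf-8")
print(utf8_bytes)   # b'\xc3\x96\xc3\x90\xc3\x8e\xc3\x84'
```

So the UTF-8 output is correct for the text the object actually contains; the object itself is what is wrong.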
From: Chris Rebert on 1 Apr 2010 08:13
On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu <eryan.yu(a)gmail.com> wrote:
> On Apr 1, 7:22 pm, Chris Rebert <c...(a)rebertia.com> wrote:
>> 2010/4/1 Mister Yu <eryan...(a)gmail.com>:
>> > hi experts,
>>
>> > i m new to python, i m writing crawlers to extract data from some
>> > chinese websites, and i run into a encoding problem.
>>
>> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
>> > which is encoded in "gb2312",
<snip>
> hi, thanks for the tips.
>
> but i m still not very sure how to convert a unicode object **
> u'\xd6\xd0\xce\xc4' ** back to "中文" the string it supposed to be?

Ah, my apologies! I overlooked something (sorry, it's early in the morning where I am).

What you have there is ***really*** screwy. It's the 2 Chinese characters, encoded in gb2312, and then somehow cast *directly* into a 'unicode' string (which ought never to be done).

In answer to your original question (after some experimentation):

gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
unicode_string = gb2312_bytes.decode('gb2312')
utf8_bytes = unicode_string.encode('utf-8') #as you wanted

If possible, I'd look at the code that's giving you that funky "string" in the first place and see if it can be fixed to give you either a proper bytestring or proper unicode string rather than the bastardized mess you're currently having to deal with.

Apologies again and Cheers,
Chris
--
http://blog.rebertia.com
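
For readers on Python 3, the same repair can be sketched like this. It is an illustration under assumptions, not the method from the thread: the helper name `repair_mislabeled_text` is hypothetical, and it swaps the `chr(ord(c))` loop for `.encode('latin-1')`, which does the same job because Latin-1 maps code points 0 to 255 onto the identical byte values.

```python
def repair_mislabeled_text(mangled, real_encoding="gb2312"):
    """Recover text whose raw bytes were turned into code points one-for-one.

    `mangled` must contain only code points below 256; anything else means
    the string was not produced by this kind of mistake, and encoding to
    Latin-1 will raise UnicodeEncodeError.
    """
    raw = mangled.encode("latin-1")      # code points 0-255 -> the original bytes
    return raw.decode(real_encoding)     # decode those bytes with the right codec


mangled = "\xd6\xd0\xce\xc4"             # Python 3 spelling of u'\xd6\xd0\xce\xc4'
fixed = repair_mislabeled_text(mangled)
assert fixed == "\u4e2d\u6587"           # the two characters 中文

utf8_bytes = fixed.encode("utf-8")       # the UTF-8 form asked for in the thread
print(utf8_bytes)                        # b'\xe4\xb8\xad\xe6\x96\x87'
```

This is still a workaround. As the reply says, the better fix is upstream: have the crawler decode the raw page bytes with the page's real encoding (gb2312 here) so the mangled string never exists.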