From: Ben Finney on 17 May 2006 02:20

"manstey" <manstey(a)csu.edu.au> writes:

> 1. Here is my input data file, line 2:
> gn1:1,1.2 R")$I73YT R")$IYT(a)ncfsa

Your program is reading this using the 'utf-8' encoding. When it does so, all the characters you show above will be read in happily as you see them (so long as you view them with the 'utf-8' encoding), and converted to Unicode characters representing the same thing.

Do you have any other information that might indicate this is *not* utf-8 encoded data?

> 2. Here is my output data file, line 2:
> u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT',
> u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '',
> '', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'

As you can see, reading the file with 'utf-8' encoding and writing it out again as 'utf-8' encoding, the characters (as you posted them in the message) have been faithfully preserved by Unicode processing and encoding.

Bear in mind that when you present the "input data file, line 2" to us, your message is itself encoded using a particular character encoding. (In the case of the message where you wrote the above, it's 'utf-8'.) This means we may or may not be seeing the exact same bytes you see in the input file; we're seeing characters in the encoding you used to post the message.

You need to know what encoding was used when the data in that file was written. You can then read the file using that encoding, and convert the characters to unicode for processing inside your program. When you write them out again, you can choose the 'utf-8' encoding as you have done.

Have you read this excellent article on understanding the programming implications of character sets and Unicode?

"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
<URL:http://www.joelonsoftware.com/articles/Unicode.html>

Ben Finney
From: manstey on 17 May 2006 06:19

Hi Martin,

Thanks very much. Your def comma_separated_utf8(items): approach raises an exception in codecs.py, so I tried = u", ".join(word_info + parse + gloss), which works perfectly.

So I want to understand exactly why this works. word_info and parse and gloss are all tuples. does str convert the three into an ascii string? but the join method retains their unicode status.

In the text file, the unicode characters appear perfectly, so I'm very happy.

cheers
matthew
From: Martin v. Löwis on 17 May 2006 18:16
manstey wrote:

> Thanks very much. Your def comma_separated_utf8(items): approach raises
> an exception in codecs.py, so I tried = u", ".join(word_info + parse +
> gloss), which works perfectly. So I want to understand exactly why this
> works. word_info and parse and gloss are all tuples. does str convert
> the three into an ascii string?

Correct: str() converts a tuple into a string of the form (contents), where contents is the comma-separated repr() of each tuple element. repr(a_unicode_string) creates a \x or \u representation of any non-ASCII characters.

> but the join method retains their unicode status.

Correct. The result is a Unicode string if the joiner is a Unicode string and all tuple elements are Unicode strings. If an element is not a Unicode string, a conversion to Unicode is attempted.

> In the text file, the unicode characters appear perfectly, so I'm very
> happy.

Glad it works.

Regards,
Martin
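
The difference between str() on a tuple and join() on its elements can be checked with a short sketch. The tuple contents below are made up for illustration (the thread does not show them in full), and `to_text` is a hypothetical helper. The code is written to run on both Python 2 and Python 3; the escaped \u form in the str() result is what Python 2 prints, while Python 3's repr() shows printable non-ASCII characters directly.

```python
# Illustrative values only; the real word_info, parse and gloss are not shown in the thread.
word_info = (u"gn", u"1")
parse = (u"nc",)
gloss = (u"\u0254",)   # U+0254, which UTF-8 encodes as the two bytes C9 94

items = word_info + parse + gloss   # concatenating tuples yields one tuple

# str() of a tuple: parentheses around the comma-separated repr() of each element.
as_str = str(items)
rebuilt = "(" + ", ".join(repr(item) for item in items) + ")"

# join() works on the elements themselves, so the result is the actual text.
joined = u", ".join(items)


def to_text(item, encoding="utf-8"):
    """Decode byte strings explicitly; return Unicode strings unchanged.

    Hypothetical helper; assumes any byte strings are UTF-8 encoded.
    """
    if isinstance(item, bytes):
        return item.decode(encoding)
    return item


# A byte string mixed in with Unicode strings, as in the last field of the posted output.
mixed = (u"gn", b"\xc9\x94")
joined_mixed = u", ".join(to_text(item) for item in mixed)
```

Decoding explicitly avoids depending on the implicit conversion: in Python 2 that conversion uses the default codec (normally ASCII) and fails on bytes such as '\xc9\x94', and in Python 3 joining str and bytes raises a TypeError outright.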