From: Ben Finney on 17 May 2006 00:47

"manstey" <manstey(a)csu.edu.au> writes:

> I'm a newbie at python, so I don't really understand how your answer
> solves my unicode problem.

Since your replies fail to give any context of the existing discussion, I could only go by the content of what you'd written in that message. I didn't see a problem with anything Unicode -- I saw three objects being added together, which you told us were function objects. That's the problem I pointed out.

-- 
 \     "When a well-packaged web of lies has been sold to the masses |
  `\   over generations, the truth will seem utterly preposterous and |
_o__)  its speaker a raving lunatic." -- Dresden James |
Ben Finney

From: Martin v. Löwis on 17 May 2006 01:08

manstey wrote:

> input_file = open(input_file_loc, 'r')
> output_file = open(output_file_loc, 'w')
> for line in input_file:
>     output_file.write(str(word_info + parse + gloss)) # = three
> functions that return tuples
>
> (u'F', u'\u0254') are two of the many unicode tuple elements returned
> by the three functions.
>
> What am I doing wrong?

Well, the primary problem is that you don't tell us what you are really doing. For example, it is very hard to believe that this is the actual code that you are running: If word_info, parse, and gloss are functions, the code should read

    input_file = open(input_file_loc, 'r')
    output_file = open(output_file_loc, 'w')
    for line in input_file:
        output_file.write(str(word_info() + parse() + gloss()))

I.e. you need to call the functions for this code to make any sense. You have probably chosen to edit the code in order to not show us your real code. Unfortunately, since you are a newbie in Python, you make errors in doing so, and omit important details. That makes it very difficult to help you.

Regards,
Martin

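A minimal sketch of the point about calling the functions, written for Python 3 (the thread itself uses Python 2). The three function bodies are hypothetical stand-ins that return small tuples; the real word_info, parse, and gloss are not shown at this point in the thread.

```python
# Hypothetical stand-ins for the poster's functions; each returns a tuple.
def word_info():
    return (u'gn', u'1')

def parse():
    return (u'n', u'c')

def gloss():
    return (u'F', u'\u0254')

# Adding the function objects themselves raises TypeError...
try:
    word_info + parse + gloss
    add_error = None
except TypeError as exc:
    add_error = exc

# ...whereas calling them first concatenates the returned tuples.
combined = word_info() + parse() + gloss()
```

Here add_error holds the TypeError raised by the first attempt, and combined is a single six-element tuple.
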
From: manstey on 17 May 2006 01:29

OK, I apologise for not being clearer.

1. Here is my input data file, line 2:

    gn1:1,1.2 R")$I73YT R")$IYT@ncfsa

2. Here is my output data file, line 2:

    u'gn', u'1', u'1', u'1', u'2', u'-', u'R")$I73YT', u'R")$IYT', u'R")$IYT', u'@', u'ncfsa', u'nc', '', '', '', u'f', u's', u'a', '', '', '', '', '', '', '', '', u'B.:R")$I^YT', u'b.:cv)cv^yc', '\xc9\x94'

3. Here is my main program:

    # -*- coding: UTF-8 -*-
    import codecs
    import splitFunctions
    import surfaceIPA

    # Constants for file location

    # Working directory constants
    dir_root = 'E:\\'
    dir_relative = '2 Core\\2b Data\\Data Working\\'

    # Input file constants
    input_file_name = 'in.grab.txt'
    input_file_loc = dir_root + dir_relative + input_file_name
    # Initialise input file
    input_file = codecs.open(input_file_loc, 'r', 'utf-8')

    # Output file constants
    output_file_name = 'out.grab.txt'
    output_file_loc = dir_root + dir_relative + output_file_name
    # Initialise output file
    output_file = codecs.open(output_file_loc, 'w', 'utf-8') # unicode

    i = 0
    for line in input_file:
        if line[0] != '>': # Ignore headers
            i += 1
            if i != 1:
                word_info = splitFunctions.splitGrab(line, i)
                parse=splitFunctions.splitParse(word_info[10])
                gloss=surfaceIPA.surfaceIPA(word_info[6],word_info[8],word_info[9],parse)
                a=str(word_info + parse + gloss).encode('utf-8')
                a=a[1:len(a)-1]
                output_file.write(a)
                output_file.write('\n')
    input_file.close()
    output_file.close()

    print 'done'

4. Here is my problem:

At the end of my output file, where my unicode character \u0254 (OPEN O) appears, the file has '\xc9\x94'

What I want is an output file like:

    'gn', '1', '1', '1', '2', '-', ..... 'ɔ'

where ɔ is an open O, and would display correctly in the appropriate font.

Once I can get it to display properly, I will rewrite gloss so that it returns a proper translation of 'R")$I73YT', which will be a string of unicode characters.

Is this clearer? The other two functions are basic. splitGrab turns 'gn1:1,1.2 R")$I73YT R")$IYT@ncfsa' into 'gn 1 1 1 2 R")$I73YT R")$IYT @ ncfsa' and splitParse turns the final piece of this 'ncfsa' into 'n c f s a'. They have to be done separately as splitParse involves some translation and program logic. SurfaceIPA reads in 'R")$I73YT' and other data to produce the unicode string. At the moment it just returns two dummy strings and u'\u0254'.encode('utf-8').

All help is appreciated!

Thanks

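For reference, the '\xc9\x94' reported in the post is the two-byte UTF-8 encoding of U+0254, displayed as escapes. The following Python 3 check is only an illustration and is not part of the poster's program; the variable names are made up for the example.

```python
open_o = u'\u0254'                  # LATIN SMALL LETTER OPEN O
encoded = open_o.encode('utf-8')    # the two bytes 0xC9 0x94

# repr() of a byte string shows escapes rather than the character.
# (Python 3 adds a b prefix; Python 2 showed '\xc9\x94' without it.)
escaped = repr(encoded)

# Decoding the bytes as UTF-8 gives the single character back.
round_trip = encoded.decode('utf-8')
```

So the bytes themselves are correct UTF-8; the escapes appear when the byte string is rendered with repr(), which is what happens to each element when str() is applied to the whole tuple.
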
From: Martin v. Löwis on 17 May 2006 02:08

manstey wrote:

> a=str(word_info + parse + gloss).encode('utf-8')
> a=a[1:len(a)-1]
>
> Is this clearer?

Indeed. The problem is your usage of str() to "render" the output. As word_info+parse+gloss is a list (or is it a tuple?), str() will already produce "Python source code", i.e. an ASCII byte string that can be read back into the interpreter; all Unicode is gone from that string. If you want comma-separated output, you should do this:

    def comma_separated_utf8(items):
        result = []
        for item in items:
            result.append(item.encode('utf-8'))
        return ", ".join(result)

and then

    a = comma_separated_utf8(word_info + parse + gloss)

Then you don't have to drop the parentheses from a anymore, as it won't have parentheses in the first place.

As the encoding will be done already in the output file, the following should also work:

    a = u", ".join(word_info + parse + gloss)

This would make "a" a comma-separated unicode string, so that the subsequent output_file.write(a) encodes it as UTF-8.

If that doesn't work, I would like to know what the exact value of gloss is, do

    print "GLOSS IS", repr(gloss)

to print it out.

Regards,
Martin

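A rough Python 3 sketch of the second suggestion (join the items, then let the file object encode). The sample tuple and the temporary output path are hypothetical, and the sketch assumes every item is text rather than an already-encoded byte string; ascii() is used only to approximate what str() produced in Python 2.

```python
import io
import os
import tempfile

# Hypothetical sample standing in for word_info + parse + gloss;
# every item is assumed to be text, not an already-encoded byte string.
items = (u'gn', u'1', u'\u0254')

# ascii() approximates Python 2's str() on a tuple: source-code-like
# text in which the non-ASCII character has become an escape sequence.
rendered = ascii(items)

# Joining the items keeps the real character in the string.
line = u", ".join(items)

# Let the file object do the UTF-8 encoding on write.
path = os.path.join(tempfile.mkdtemp(), 'out.grab.txt')
with io.open(path, 'w', encoding='utf-8', newline='\n') as output_file:
    output_file.write(line + u'\n')

# Read the raw bytes back to see what was stored on disk.
with io.open(path, 'rb') as check:
    raw = check.read()
```

The file then contains the open o as the two bytes 0xC9 0x94, with no quotes, parentheses, or escape sequences around it.
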
From: Tim Roberts on 17 May 2006 02:12

"manstey" <manstey(a)csu.edu.au> wrote:
>
>I have done more reading on unicode and then tried my code in IDLE
>rather than WING IDE, and discovered that it works fine in IDLE, so I
>think WING has a problem with unicode.

Rather, its output defaults to ASCII.

>So, assuming I now work in IDLE, all I want help with is how to read in
>an ascii string and convert its letters to various unicode values and
>save the resulting 'string' to a utf-8 text file. Is this clear?
>
>so in pseudo code
>1. F is converted to \u0254, $ is converted to \u0283, C is converted
>to \u02A6\02C1, etc.
>(i want to do this using a dictionary TRANSLATE={'F':u'\u0254', etc)
>2. I read in a file with lines like:
>F$
>FCF$
>$$C$ etc
>3. I convert this to
>\u0254\u0283
>\u0254\u02A6\02C1\u0254 etc
>4. i save the results in a new file
>
>when i read the new file in a unicode editor (EmEditor), i don't see
>\u0254\u02A6\02C1\u0254, but I see the actual characters (open o, esh,
>ts digraph, modified letter reversed glottal stop, etc.

Of course. Isn't that exactly what you wanted? The Python string u"\u0254" contains one character (Latin small open o). It does NOT contain 6 characters. If you write that to a file, that file will contain 1 character -- 2 bytes.

If you actually want the 6-character string \u0254 written to a file, then you need to escape the \u special code: "\\u0254". However, I don't see what good that would do you. The \u escape is a Python source code thing.

>I'm sure this is straightforward but I can't get it to work.

I think it is working exactly as you want.

-- 
- Tim Roberts, timr(a)probo.com
  Providenza & Boekelheide, Inc.
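The dictionary-based conversion from the quoted pseudo code, and the one-character versus six-character distinction, can be sketched in Python 3 roughly as follows. The translate() helper is a hypothetical name, and the sketch assumes the "\02C1" in the post was meant to be \u02C1.

```python
# Mapping taken from the poster's pseudo code. The post writes "\02C1";
# this sketch assumes \u02C1 was intended.
TRANSLATE = {
    u'F': u'\u0254',        # open o
    u'$': u'\u0283',        # esh
    u'C': u'\u02A6\u02C1',  # ts digraph + modifier letter reversed glottal stop
}

def translate(text):
    """Replace each mapped character; leave unmapped ones unchanged."""
    return u''.join(TRANSLATE.get(ch, ch) for ch in text)

converted = [translate(s) for s in (u'F$', u'FCF$', u'$$C$')]

one_char = u'\u0254'                 # a single character...
as_utf8 = one_char.encode('utf-8')   # ...stored as two bytes in UTF-8
escaped = u'\\u0254'                 # six literal characters: \ u 0 2 5 4
```

Writing the converted strings to a file opened with a UTF-8 encoding stores the characters themselves, which is why a Unicode-aware editor shows the glyphs rather than the \u escapes.
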