From: Stefan Behnel on 29 Jul 2010 08:54

William Johnston, 29.07.2010 14:12:
> I have a Python app that parses XML files and then writes to text files.

XML or HTML?

> However, the output text file is "sometimes" encoded in some Asian language.
>
> Here is my code:
>
>
> encoding = "iso-8859-1"
>
> clean_sent = nltk.clean_html(sent.text)
>
> clean_sent = clean_sent.encode(encoding, "ignore");
>
>
> I also tried "UTF-8" encoding, but received the same results.

What result?

Maybe the NLTK cannot determine the encoding of the HTML file (because the file is broken and/or doesn't correctly specify its own encoding) and thus fails to decode it?

Stefan
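
To illustrate the decoding point above, here is a minimal sketch in Python 3 (the original post predates it). It is not the poster's code and does not use NLTK. The names decode_html_bytes, declared_encoding and fallback are hypothetical, and the try-in-order strategy is only one possible approach, assuming the raw bytes of the input file are available. It also shows one way plain Latin text can come out looking like an Asian script: bytes decoded with the wrong codec. Whether that is what happened here is not established in the thread.

```python
def decode_html_bytes(raw, declared_encoding=None, fallback="iso-8859-1"):
    """Decode raw bytes to text.

    Tries the declared encoding (if any), then UTF-8, then the fallback.
    Returns a (text, encoding_used) tuple. All names here are hypothetical.
    """
    for enc in (declared_encoding, "utf-8", fallback):
        if not enc:
            continue
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown codec: try the next candidate.
            continue
    # Only reached if every candidate failed (iso-8859-1 never fails,
    # since it maps every byte value to a character).
    return raw.decode(fallback, "replace"), fallback


# Decoding ASCII bytes as UTF-16 pairs them up into CJK code points,
# which is one possible source of "Asian-looking" output.
garbled = b"hell".decode("utf-16-le")

# Decoding with the right codec and re-encoding explicitly keeps the text intact.
text, used = decode_html_bytes("café".encode("iso-8859-1"))
output_bytes = text.encode("utf-8")
```

The design choice is to decode once, at the input boundary, with a known codec, and to encode once, at the output boundary. Calling encode(..., "ignore") on text that was already decoded incorrectly cannot repair it, which would be consistent with the same result appearing for both "iso-8859-1" and "UTF-8".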