From: Stefan Behnel
William Johnston, 29.07.2010 14:12:
> I have a Python app that parses XML files and then writes to text files.

XML or HTML?


> However, the output text file is "sometimes" encoded in some Asian language.
>
> Here is my code:
>
>
> encoding = "iso-8859-1"
>
> clean_sent = nltk.clean_html(sent.text)
>
> clean_sent = clean_sent.encode(encoding, "ignore")
>
>
> I also tried "UTF-8" encoding, but received the same results.

What result?

Maybe NLTK cannot determine the encoding of the HTML file (because the
file is broken and/or doesn't correctly specify its own encoding) and thus
fails to decode it?
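
One way to check this would be to decode the raw bytes yourself before
handing the text to anything else. The following is only a rough sketch in
Python 3 and is not what NLTK does internally. The helper name
decode_markup, the regular expression, and the sample document are all made
up for illustration. It looks for a declared charset near the top of the
file and falls back to a default if the declaration is missing or wrong:

```python
import re

# Hypothetical helper, not part of NLTK: find a declared charset/encoding
# (e.g. <meta charset="..."> or <?xml ... encoding="..."?>) in the raw bytes.
_DECLARED = re.compile(
    rb"""(?:charset|encoding)\s*=\s*["']?([A-Za-z0-9._-]+)""", re.I
)

def decode_markup(raw, fallback="utf-8"):
    """Decode raw bytes with the declared encoding, else with the fallback."""
    match = _DECLARED.search(raw[:1024])
    declared = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(declared)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or bytes that do not fit the declaration:
        # decode leniently so that bad characters stay visible as U+FFFD.
        return raw.decode(fallback, errors="replace")

# Made-up sample input: a Latin-1 document that declares its encoding.
raw = ('<html><head><meta charset="iso-8859-1"></head>'
       '<body>caf\xe9</body></html>').encode("iso-8859-1")

text = decode_markup(raw)    # a proper (unicode) string
out = text.encode("utf-8")   # encode once, explicitly, when writing
```

If the decoded text already looks wrong at this point, the problem is in the
input file or its declaration, not in the final encode() call.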

Stefan