From: Paulo da Silva on 5 Jun 2010 19:03

I need to read text files and process each line using string comparisons and regular expressions.

I have a Python 2 program that uses <file object>.readline to read each line as a string. Processing it was then a trivial job.

With Python 3 I get error messages like:

  File "./pp1.py", line 93, in RL
    line=inf.readline()
  File "/usr/lib64/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4963-4965: invalid data

How do I handle this? If I use <file object>.read on a file opened as binary, I get a <bytes> object. How do I handle that? Regular expressions, comparisons with strings, ...?

Thanks for any help.
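A minimal sketch of the binary-mode route raised in the question: a <bytes> object can either be matched with a bytes pattern or decoded to a string once the encoding is known. The sample data and the choice of ISO-8859-15 are assumptions made for illustration only.

```python
import re

# Hypothetical sample: a line encoded as ISO-8859-15, which is not valid UTF-8.
raw = "preço: 10€\n".encode("iso-8859-15")

# Option 1: stay in bytes and use a bytes pattern (note the b prefix).
m = re.search(rb"(\d+)", raw)
digits = m.group(1)  # a bytes object

# Option 2: decode to str with the right encoding, then use normal str tools.
text = raw.decode("iso-8859-15")
starts_ok = text.startswith("preço")

# Decoding the same bytes as UTF-8 fails, much like the traceback in the post.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Bytes patterns and str patterns cannot be mixed, so whichever option is chosen has to be applied consistently to both the data and the regular expressions.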
From: Chris Rebert on 5 Jun 2010 19:41

On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva <psdasilva.nospam(a)netcabonospam.pt> wrote:
> I need to read text files and process each line using string
> comparisons and regexp.
>
> I have a python2 program that uses <file object>.readline to read each
> line as a string. Then, processing it was a trivial job.
>
> With python3 I got error messages like:
>   File "./pp1.py", line 93, in RL
>     line=inf.readline()
>   File "/usr/lib64/python3.1/codecs.py", line 300, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position
> 4963-4965: invalid data
>
> How do I handle this?

Specify the encoding of the text when opening the file, using the `encoding` parameter. For Windows-1252, for example:

    your_file = open("path/to/file.ext", 'r', encoding='cp1252')

Cheers,
Chris
--
http://blog.rebertia.com
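As a rough illustration of the `encoding` parameter in a line-by-line loop, the sketch below writes a small cp1252 file and reads it back as text. The temporary file, its contents, and the pattern are made up for the example; only `open(..., encoding=...)` comes from the advice above.

```python
import os
import re
import tempfile

# Create a small cp1252-encoded file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("café 12\nnaïve 7\n".encode("cp1252"))

pattern = re.compile(r"(\w+) (\d+)")
rows = []

# With the encoding given, each line arrives as a correctly decoded str.
with open(path, "r", encoding="cp1252") as inf:
    for line in inf:
        m = pattern.match(line)
        if m:
            rows.append((m.group(1), int(m.group(2))))

os.remove(path)
```

Once the lines are str objects, string comparisons and regular expressions can work on them as ordinary text, with no bytes handling needed.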
From: python on 5 Jun 2010 19:49

Chris,

> Specify the encoding of the text when opening the file using the
> `encoding` parameter. For Windows-1252 for example:
>
>     your_file = open("path/to/file.ext", 'r', encoding='cp1252')

This looks similar to the codecs module's functionality. Do you know if the codecs module is still required in Python 3.x?

Thank you,
Malcolm
From: Paulo da Silva on 5 Jun 2010 21:24

On 06-06-2010 00:41, Chris Rebert wrote:
> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
> <psdasilva.nospam(a)netcabonospam.pt> wrote:
....
>
> Specify the encoding of the text when opening the file using the
> `encoding` parameter. For Windows-1252 for example:
>
>     your_file = open("path/to/file.ext", 'r', encoding='cp1252')
>

OK! This fixes my current problem. I used encoding="iso-8859-15", which is how my text files are encoded.

But what about the more general case where the encoding of the text file is unknown? Is there anything like "autodetect"?
From: MRAB on 5 Jun 2010 22:14

Paulo da Silva wrote:
> On 06-06-2010 00:41, Chris Rebert wrote:
>> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
>> <psdasilva.nospam(a)netcabonospam.pt> wrote:
> ...
>
>> Specify the encoding of the text when opening the file using the
>> `encoding` parameter. For Windows-1252 for example:
>>
>>     your_file = open("path/to/file.ext", 'r', encoding='cp1252')
>>
>
> OK! This fixes my current problem. I used encoding="iso-8859-15". This
> is how my text files are encoded.
> But what about a more general case where the encoding of the text file
> is unknown? Is there anything like "autodetect"?
>

An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'. How could you tell which was the correct encoding?

Well, if the file contained words in a certain language and some of the characters were wrong, then you'd know that the encoding was wrong. This does imply, though, that you'd need to know what the language should look like!

You could try different encodings, and for each one try to identify what could be words, then look them up in dictionaries for various languages to see whether they are real words...
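The trial-and-lookup idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `guess_encoding` is a hypothetical helper, the candidate list and the two-word "dictionary" are invented for the example, and a real word list would have to be far larger.

```python
import re

def guess_encoding(data, candidates, known_words):
    """Return the candidate whose decoding yields the most known words."""
    best, best_score = None, -1
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        words = re.findall(r"\w+", text.lower())
        score = sum(1 for w in words if w in known_words)
        if score > best_score:
            best, best_score = enc, score
    return best

# Hypothetical Polish sample encoded as cp1250, with a tiny made-up vocabulary.
data = "żółty łoś".encode("cp1250")
vocab = {"żółty", "łoś"}
guess = guess_encoding(data, ["utf-8", "cp1252", "cp1250"], vocab)
```

Note that the same bytes also decode without error as cp1252, just into the wrong characters, which is why the score has to come from the word lookup and not from whether decoding succeeds.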