writing \feff at the beginning of a file [Python]

From: Jean-Michel Pichavant on 13 Aug 2010 05:45

Hello python world,

I'm trying to update the content of $Microsoft$ VC2005 project files
using a Python application.
Since those files are XML data, I assumed I could easily do that.

My problem is that VC somehow thinks that the file is corrupted and
updates it as follows:

-<?xml version='1.0' encoding='UTF-8'?>
+?<feff><?xml version="1.0" encoding="UTF-8"?>

Actually, <feff> is displayed in a different color by vim, telling me
that this is some kind of special character code (I'm not familiar with
such things).
After googling that, I have a clue: it could be some Unicode character used
to indicate something ... well, I don't really know ("UTF-8 files
sometimes start with a byte-order marker (BOM) to indicate that they are
encoded in UTF-8.").

My problem is however simpler: how do I add such a character at the
beginning of the file?
I tried

f = open('paf', 'w')
f.write(u'\ufeff')

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
position 0: ordinal not in range(128)

The error may be explicit, but I have no idea how to proceed. Any
clue?

JM

From: Tim Golden on 13 Aug 2010 05:58

On 13/08/2010 10:45, Jean-Michel Pichavant wrote:
> My problem is however simpler: how do I add such a character at the
> beginning of the file?
> I tried
>
> f = open('paf', 'w')

f = open("paf", "wb")
f.write("\xef\xbb\xbf")  # UTF-8 BOM; "\xfe\xff" is the UTF-16 big-endian BOM
f.close()

TJG

From: Ulrich Eckhardt on 13 Aug 2010 06:43

Jean-Michel Pichavant wrote:
> My problem is however simpler: how do I add such a character [a BOM]
> at the beginning of the file?
> I tried
>
> f = open('paf', 'w')
> f.write(u'\ufeff')
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
> position 0: ordinal not in range(128)

Try the codecs module to open the file, which will then do all the
transcoding between internal texts and external UTF-8 for you.
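
A minimal sketch of that suggestion, assuming the plain 'utf-8' codec and a scratch file in a temporary directory (the path and the single XML line are placeholders, not the real project file). It writes the BOM character followed by the XML declaration and reads the raw bytes back:

```python
import codecs
import os
import tempfile

# Hypothetical scratch location; a real script would target the project file.
path = os.path.join(tempfile.mkdtemp(), 'paf')

# codecs.open() returns a file object that encodes unicode text as it is
# written, so no implicit ASCII conversion takes place.
f = codecs.open(path, 'w', 'utf-8')
f.write(u'\ufeff')  # U+FEFF is stored as the bytes EF BB BF in UTF-8
f.write(u'<?xml version="1.0" encoding="UTF-8"?>\n')
f.close()

# Read the file back in binary mode to inspect what actually landed on disk.
with open(path, 'rb') as check:
    raw = check.read()
```

Here `raw` should begin with the three BOM bytes, followed by the XML declaration. On Python 3 the built-in open(path, 'w', encoding='utf-8') does the same job.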

Uli

From: MRAB on 13 Aug 2010 13:45

Jean-Michel Pichavant wrote:
> Hello python world,
>
> I'm trying to update the content of $Microsoft$ VC2005 project files
> using a Python application.
> Since those files are XML data, I assumed I could easily do that.
>
> My problem is that VC somehow thinks that the file is corrupted and
> updates it as follows:
>
> -<?xml version='1.0' encoding='UTF-8'?>
> +?<feff><?xml version="1.0" encoding="UTF-8"?>
>
>
> Actually, <feff> is displayed in a different color by vim, telling me
> that this is some kind of special character code (I'm not familiar with
> such things).
> After googling that, I have a clue: it could be some Unicode character used
> to indicate something ... well, I don't really know ("UTF-8 files
> sometimes start with a byte-order marker (BOM) to indicate that they are
> encoded in UTF-8.").
>
> My problem is however simpler: how do I add such a character at the
> beginning of the file?
> I tried
>
> f = open('paf', 'w')
> f.write(u'\ufeff')
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
> position 0: ordinal not in range(128)
>
> The error may be explicit, but I have no idea how to proceed. Any
> clue?
>
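In Python 2 the default encoding is 'ascii', so writing a unicode string
to a file opened with plain open() tries to encode it as ASCII, which
fails for u'\ufeff'. What you want is 'utf-8'.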

Use codecs.open() instead, with the 'utf-8-sig' encoding, which will
include the BOM.
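
A short sketch of this approach, assuming a scratch file in a temporary directory and a one-line placeholder document. It writes with 'utf-8-sig', then reads the file back twice to show that the same codec strips the BOM on input while plain 'utf-8' keeps it as a leading U+FEFF:

```python
import codecs
import os
import tempfile

# Hypothetical scratch location and placeholder content for illustration.
path = os.path.join(tempfile.mkdtemp(), 'project.xml')
text = u'<?xml version="1.0" encoding="UTF-8"?>\n'

# The 'utf-8-sig' codec prepends the BOM on the first write.
out = codecs.open(path, 'w', 'utf-8-sig')
out.write(text)
out.close()

# Decoding with 'utf-8-sig' drops a leading BOM if one is present.
src = codecs.open(path, 'r', 'utf-8-sig')
with_bom_stripped = src.read()
src.close()

# Decoding with plain 'utf-8' leaves the BOM in the text as U+FEFF.
src = codecs.open(path, 'r', 'utf-8')
with_bom_kept = src.read()
src.close()
```

Reading with 'utf-8-sig' is usually the convenient choice when the file goes through an XML parser afterwards, since the text then starts directly with the declaration.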

From: Nobody on 13 Aug 2010 14:04

On Fri, 13 Aug 2010 11:45:28 +0200, Jean-Michel Pichavant wrote:

> I'm trying to update the content of $Microsoft$ VC2005 project files
> using a Python application.
> Since those files are XML data, I assumed I could easily do that.
>
> My problem is that VC somehow thinks that the file is corrupted and
> updates it as follows:
>
> -<?xml version='1.0' encoding='UTF-8'?>
> +?<feff><?xml version="1.0" encoding="UTF-8"?>
>
>
> Actually, <feff> is displayed in a different color by vim, telling me
> that this is some kind of special character code (I'm not familiar with
> such things).

U+FEFF is a "byte order mark" or BOM. Each Unicode-based encoding (UTF-8,
UTF-16, UTF-16-LE, etc) will encode it differently, so it enables a
program reading the file to determine the encoding before reading any
actual data.
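
To make the "encoded differently" point concrete, here is a small sketch that encodes U+FEFF under three codecs and uses the result to guess an encoding from the first bytes of a file. The function name sniff_encoding is made up for this example, and the sketch deliberately ignores UTF-32, whose little-endian BOM starts with the same two bytes as the UTF-16-LE one:

```python
import codecs

bom = u'\ufeff'

# The same character yields a different byte sequence under each encoding.
encoded = {
    'utf-8': bom.encode('utf-8'),          # EF BB BF
    'utf-16-le': bom.encode('utf-16-le'),  # FF FE
    'utf-16-be': bom.encode('utf-16-be'),  # FE FF
}


def sniff_encoding(data):
    """Guess the encoding of `data` (bytes) from a leading BOM.

    Hypothetical helper: returns None when no known BOM is found.
    """
    candidates = (
        ('utf-8', codecs.BOM_UTF8),
        ('utf-16-le', codecs.BOM_UTF16_LE),
        ('utf-16-be', codecs.BOM_UTF16_BE),
    )
    for name, marker in candidates:
        if data.startswith(marker):
            return name
    return None
```

Note that the bytes FE FF are what UTF-16 big-endian produces, so a file meant to be UTF-8 needs EF BB BF instead.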

> My problem is however simpler: how do I add such a character at the
> beginning of the file?
> I tried

Either:

1. Open the file as binary and write '\xef\xbb\xbf' to the file:

f = open('foo.txt', 'wb')
f.write('\xef\xbb\xbf')

[You can also use the constant BOM_UTF8 from the codecs module.]

2. Open the file as utf-8 and write u'\ufeff' to the file:

import codecs
f = codecs.open('foo.txt', 'w', 'utf-8')
f.write(u'\ufeff')

3. Open the file as utf-8-sig and write at least an empty string (the
codec emits the BOM on the first write):

import codecs
f = codecs.open('foo.txt', 'w', 'utf-8-sig')
f.write('')

The utf-8-sig codec automatically writes a BOM at the beginning of the
file. It is present in Python 2.5 and later.
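
For a file that already exists, such as a project file rewritten by another tool, option 1 can be turned into a small helper that adds the BOM only when it is missing. This is a sketch under assumptions: ensure_utf8_bom and the demo file are invented for illustration, and the whole file is read into memory, which is fine for small XML files:

```python
import codecs
import os
import tempfile


def ensure_utf8_bom(path):
    """Prepend the UTF-8 BOM to `path` unless it is already there.

    Hypothetical helper; returns True if the file was modified.
    """
    with open(path, 'rb') as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, avoids writing a second BOM
    with open(path, 'wb') as f:
        f.write(codecs.BOM_UTF8 + data)
    return True


# Demo on a scratch file that starts without a BOM.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.xml')
with open(demo_path, 'wb') as f:
    f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')

first_call = ensure_utf8_bom(demo_path)   # adds the BOM
second_call = ensure_utf8_bom(demo_path)  # finds it and leaves the file alone

with open(demo_path, 'rb') as f:
    result = f.read()
```

Checking before writing keeps the helper safe to run repeatedly, since a doubled BOM would show up as a stray U+FEFF character in the document text.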
