From: Andrew Dunstan on


Peter Eisentraut wrote:
> But now we're back to the original problem. Certain editors insert BOMs
> at the beginning of the file. And that is by any definition before the
> embedded client encoding declaration. I think the only ways to solve
> this are:
>
> 1) Ignore the BOM if a client encoding declaration of UTF8 appears in a
> narrowly defined location near the beginning of the file (XML and
> PEP-0263 style). For *example*, we could ignore the BOM if the file
> starts with exactly "<BOM>\encoding UTF8\n". Would probably not work
> well in practice.
>
> 2) Parse two alternative versions of the file, one with the BOM ignored
> and one with the BOM not ignored, until you need to make a decision.
> Hilariously complicated, but would perhaps solve the problem.
>
> 3) Give up, do nothing.
>
>

4) set the client encoding before the file is read in any of the ways
that have already been discussed and then allow psql to eat the BOM.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Peter Eisentraut on
On Wed, 2009-11-18 at 08:52 -0500, Andrew Dunstan wrote:
> 4) set the client encoding before the file is read in any of the ways
> that have already been discussed and then allow psql to eat the BOM.

This is certainly a workaround, just like piping the file through a
suitable sed expression would be, but conceptually, the client encoding
is a property of the file and should therefore be marked in the file.



From: Tom Lane on
Peter Eisentraut <peter_e(a)gmx.net> writes:
> This is certainly a workaround, just like piping the file through a
> suitable sed expression would be, but conceptually, the client encoding
> is a property of the file and should therefore be marked in the file.

In a perfect world things would be like that, but the world is
imperfect. When only one of the available encodings even pretends
to have a marking convention, and even that one convention is broken,
imagining that you can fix it is just a recipe for making things worse.

regards, tom lane


From: Peter Eisentraut on
On Mon, 2009-11-16 at 22:37 +0200, Peter Eisentraut wrote:
> On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> > Sure. Client encoding is declared in body of a file, but BOM is
> > in head of the file. So, we should always ignore BOM sequence
> > at the file head no matter what client encoding is used.
> >
> > The attached patch replaces the BOM with white spaces, but it does not
> > change client encoding automatically. I think we can always ignore
> > client encoding at the replacement because SQL command cannot start
> > with BOM sequence. If we don't ignore the sequence, execution of
> > the script must fail with syntax error.
>
> OK, I think the consensus here is:
>
> - Eat BOM at beginning of file (as you implemented)
>
> - Only when client encoding is UTF-8 --> please fix that
>
> I'm not sure if replacing a BOM by three spaces is a good way to
> implement "eating", because it might throw off a column indicator
> somewhere, say, but I couldn't reproduce a problem. Note that the U
> +FEFF character is defined as *zero-width* non-breaking space.

I have committed a change that implements the above.
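
For readers following along, the two points of the consensus (eat a BOM at
the beginning of the file, and only when the client encoding is UTF-8) can be
sketched roughly as below. This is an illustration in Python, not the
committed psql code; the function name strip_utf8_bom and its parameters are
made up for the example. It removes the three BOM bytes outright rather than
replacing them with spaces, which is one possible answer to the column
indicator concern, not necessarily what the commit does.

```python
import codecs


def strip_utf8_bom(first_line: bytes, client_encoding: str) -> bytes:
    """Drop a leading UTF-8 BOM, but only when the client encoding is UTF-8.

    Hypothetical helper for illustration; not part of psql.
    """
    # Accept common spellings such as "UTF8", "utf-8" and "UTF_8".
    normalised = client_encoding.replace("-", "").replace("_", "").upper()

    # codecs.BOM_UTF8 is the byte sequence EF BB BF (U+FEFF encoded as UTF-8).
    if normalised == "UTF8" and first_line.startswith(codecs.BOM_UTF8):
        return first_line[len(codecs.BOM_UTF8):]

    # Any other client encoding: leave the bytes alone, since EF BB BF
    # could be legitimate data there.
    return first_line
```

Only the first line of a script would be passed through such a check, since
a BOM is meaningful only at the very start of the file.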


