From: Andrew Dunstan on 18 Nov 2009 08:52

Peter Eisentraut wrote:
> But now we're back to the original problem.  Certain editors insert BOMs
> at the beginning of the file.  And that is by any definition before the
> embedded client encoding declaration.  I think the only ways to solve
> this are:
>
> 1) Ignore the BOM if a client encoding declaration of UTF8 appears in a
> narrowly defined location near the beginning of the file (XML and
> PEP-0263 style).  For *example*, we could ignore the BOM if the file
> starts with exactly "<BOM>\encoding UTF8\n".  Would probably not work
> well in practice.
>
> 2) Parse two alternative versions of the file, one with the BOM ignored
> and one with the BOM not ignored, until you need to make a decision.
> Hilariously complicated, but would perhaps solve the problem.
>
> 3) Give up, do nothing.
>

4) set the client encoding before the file is read in any of the ways that have already been discussed and then allow psql to eat the BOM.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
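For illustration only, option 1 from the quoted list could be sketched roughly as follows. This is not psql code: the function name `strip_declared_bom` and the constant `UTF8_DECLARATION` are hypothetical, and the sketch assumes the "narrowly defined location" means the declaration line follows the BOM byte for byte.

```python
import codecs

# Assumption: the only accepted form is the UTF-8 BOM followed immediately by
# the exact line "\encoding UTF8\n", as in the example from the quoted message.
UTF8_DECLARATION = codecs.BOM_UTF8 + b"\\encoding UTF8\n"


def strip_declared_bom(data: bytes) -> bytes:
    """Drop a leading BOM only when the encoding declaration follows it exactly."""
    if data.startswith(UTF8_DECLARATION):
        # Keep the declaration itself; remove only the three BOM bytes.
        return data[len(codecs.BOM_UTF8):]
    # Any other input, including a BOM without the declaration, is left alone.
    return data
```

A file that starts with a BOM but declares its encoding a line later, or spells the declaration differently, is passed through unchanged, which may be why the quoted message expects this option not to work well in practice.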
From: Peter Eisentraut on 18 Nov 2009 09:06

On Wed, 2009-11-18 at 08:52 -0500, Andrew Dunstan wrote:
> 4) set the client encoding before the file is read in any of the ways
> that have already been discussed and then allow psql to eat the BOM.

This is certainly a workaround, just like piping the file through a suitable sed expression would be, but conceptually, the client encoding is a property of the file and should therefore be marked in the file.
From: Tom Lane on 18 Nov 2009 10:18

Peter Eisentraut <peter_e(a)gmx.net> writes:
> This is certainly a workaround, just like piping the file through a
> suitable sed expression would be, but conceptually, the client encoding
> is a property of the file and should therefore be marked in the file.

In a perfect world things would be like that, but the world is imperfect.  When only one of the available encodings even pretends to have a marking convention, and even that one convention is broken, imagining that you can fix it is just a recipe for making things worse.

regards, tom lane
From: Peter Eisentraut on 21 Nov 2009 18:59

On Mon, 2009-11-16 at 22:37 +0200, Peter Eisentraut wrote:
> On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> > Sure. Client encoding is declared in body of a file, but BOM is
> > in head of the file. So, we should always ignore BOM sequence
> > at the file head no matter what client encoding is used.
> >
> > The attached patch replaces BOM with white spaces, but it does not
> > change client encoding automatically. I think we can always ignore
> > client encoding at the replacement because SQL command cannot start
> > with BOM sequence. If we don't ignore the sequence, execution of
> > the script must fail with syntax error.
>
> OK, I think the consensus here is:
>
> - Eat BOM at beginning of file (as you implemented)
>
> - Only when client encoding is UTF-8 --> please fix that
>
> I'm not sure if replacing a BOM by three spaces is a good way to
> implement "eating", because it might throw off a column indicator
> somewhere, say, but I couldn't reproduce a problem.  Note that the
> U+FEFF character is defined as *zero-width* non-breaking space.

I have committed a change that implements the above.
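As a rough illustration of the two points of consensus quoted above (eat the BOM at the beginning of the file, and only when the client encoding is UTF-8), a minimal sketch might look like this. It is not the committed psql change: the function name `eat_bom` and the way the encoding name is normalized are assumptions, and the sketch removes the three BOM bytes outright rather than replacing them with spaces.

```python
import codecs


def eat_bom(first_line: bytes, client_encoding: str) -> bytes:
    """Strip a leading UTF-8 BOM from a script's first line.

    The BOM is removed only when the client encoding is UTF-8; for any
    other encoding the bytes are returned untouched.
    """
    # Assumption: treat spellings such as "UTF8", "utf-8" and "UTF_8" alike.
    normalized = client_encoding.replace("-", "").replace("_", "").upper()
    if normalized == "UTF8" and first_line.startswith(codecs.BOM_UTF8):
        return first_line[len(codecs.BOM_UTF8):]
    return first_line
```

Only the first line of the file would be passed through such a function; the byte sequence EF BB BF appearing later in a script is ordinary data and is left as it is.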