From: Peter Eisentraut on 16 Nov 2009 15:37

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> Sure. Client encoding is declared in body of a file, but BOM is
> in head of the file. So, we should always ignore BOM sequence
> at the file head no matter what client encoding is used.
>
> The attached patch replaces BOM with white spaces, but it does not
> change client encoding automatically. I think we can always ignore
> client encoding at the replacement because SQL command cannot start
> with BOM sequence. If we don't ignore the sequence, execution of
> the script must fail with syntax error.

OK, I think the consensus here is:
- Eat BOM at beginning of file (as you implemented)
- Only when client encoding is UTF-8 --> please fix that

I'm not sure if replacing a BOM by three spaces is a good way to implement "eating", because it might throw off a column indicator somewhere, say, but I couldn't reproduce a problem. Note that the U+FEFF character is defined as *zero-width* non-breaking space.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
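The two conditions in the consensus above (BOM at the very start of the file, and only when the client encoding is UTF-8) can be sketched in a few lines. This is an illustrative Python sketch, not the psql patch itself, which is C; the function name `eat_bom` and the use of Python's codec registry to recognise spellings such as "UTF8" are assumptions made for the example.

```python
import codecs


def eat_bom(data: bytes, client_encoding: str) -> bytes:
    """Drop a leading UTF-8 BOM, but only when the client encoding is UTF-8.

    `eat_bom` is a hypothetical helper for illustration only.
    """
    try:
        # Normalise spellings such as "UTF8", "utf-8", "utf_8".
        is_utf8 = codecs.lookup(client_encoding).name == "utf-8"
    except LookupError:
        # Names Python does not know (e.g. "SQL_ASCII") are treated as non-UTF-8.
        is_utf8 = False

    if is_utf8 and data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

With any other client encoding the bytes pass through unchanged, so a file in a different encoding is never modified.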
From: Tom Lane on 16 Nov 2009 16:01

Peter Eisentraut <peter_e(a)gmx.net> writes:
> I'm not sure if replacing a BOM by three spaces is a good way to
> implement "eating", because it might throw off a column indicator
> somewhere, say, but I couldn't reproduce a problem. Note that the
> U+FEFF character is defined as *zero-width* non-breaking space.

So wouldn't it be better to remove the three bytes, rather than replace with spaces? The latter will certainly confuse clients that think that "column 1" means what they think is the first character. A syntax error in the first line of the file should be sufficient to demonstrate the issue.

regards, tom lane
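The column concern can be made concrete with a small Python sketch. It only compares byte offsets after the two treatments of the BOM; it does not reproduce psql's actual error reporting, and the helper names and the deliberately misspelled statement are made up for the example.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def replace_with_spaces(data: bytes) -> bytes:
    """Hypothetical helper: overwrite a leading BOM with three spaces."""
    if data.startswith(BOM):
        return b" " * len(BOM) + data[len(BOM):]
    return data


def remove_bytes(data: bytes) -> bytes:
    """Hypothetical helper: drop a leading BOM entirely."""
    if data.startswith(BOM):
        return data[len(BOM):]
    return data


# A script whose first statement contains a typo on line 1.
script = BOM + b"SELEC 1;\n"

# 1-based column at which the bad keyword starts under each treatment.
col_padded = replace_with_spaces(script).index(b"SELEC") + 1
col_stripped = remove_bytes(script).index(b"SELEC") + 1
```

Here `col_padded` is 4 and `col_stripped` is 1. A user whose editor hides the BOM sees the keyword in column 1, so only the removal variant matches what they see.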
From: Itagaki Takahiro on 16 Nov 2009 19:31

Peter Eisentraut <peter_e(a)gmx.net> wrote:
> OK, I think the consensus here is:
> - Eat BOM at beginning of file (as you implemented)
> - Only when client encoding is UTF-8 --> please fix that

Are they AND condition? If so, this patch will be useless. Please remember \encoding or SET client_encoding appear *after* BOM at beginning of file. I'll agree if the condition is "Eat BOM at beginning of file and <<set client encoding to UTF-8>>", like:
  Defining Python Source Code Encodings:
  http://www.python.org/dev/peps/pep-0263/

> I'm not sure if replacing a BOM by three spaces is a good way to
> implement "eating", because it might throw off a column indicator
> somewhere, say, but I couldn't reproduce a problem. Note that the
> U+FEFF character is defined as *zero-width* non-breaking space.

I assumed psql discards whitespace automatically, but I see it is more robust to remove BOM bytes explicitly. I'll fix it.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
From: Tom Lane on 16 Nov 2009 19:37

Itagaki Takahiro <itagaki.takahiro(a)oss.ntt.co.jp> writes:
> Please remember \encoding or SET client_encoding appear
> *after* BOM at beginning of file. I'll agree if the condition is
> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>",

As has been stated multiple times, that will not get accepted, because it will *break* files in other encodings that chance to match the BOM pattern.

regards, tom lane
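The objection rests on the fact that the BOM byte pattern is not exclusive to UTF-8. As a small Python illustration (the variable names are invented for the example), the same three bytes are also valid text in a single-byte encoding such as LATIN1:

```python
import codecs

head = codecs.BOM_UTF8  # b"\xef\xbb\xbf"

# Interpreted as UTF-8, the bytes are the single character U+FEFF.
as_utf8 = head.decode("utf-8")

# Interpreted as Latin-1 (ISO 8859-1), the same bytes are three ordinary
# printable characters: U+00EF, U+00BB, U+00BF.
as_latin1 = head.decode("latin-1")
```

Because the bytes alone cannot tell the two readings apart, inferring UTF-8 from them would reinterpret a file that was written in another encoding.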
From: Andrew Dunstan on 16 Nov 2009 19:51

Itagaki Takahiro wrote:
> Peter Eisentraut <peter_e(a)gmx.net> wrote:
>
>> OK, I think the consensus here is:
>> - Eat BOM at beginning of file (as you implemented)
>> - Only when client encoding is UTF-8 --> please fix that
>
> Are they AND condition? If so, this patch will be useless.
> Please remember \encoding or SET client_encoding appear
> *after* BOM at beginning of file. I'll agree if the condition is
> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>",
> like:
> Defining Python Source Code Encodings:
> http://www.python.org/dev/peps/pep-0263/

As previously discussed we should not be automagically setting the client encoding, nor inferring it from the presence of a BOM.

As for when it can be set, unless I'm mistaken you should be able to set it before any file is opened, if you need to, using PGOPTIONS or psql "dbname=mydb options='-c client_encoding=utf8'".

Of course, if the server encoding is utf8 then, in the absence of it being set using those methods, the client encoding will start as utf8 also.

cheers

andrew
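As a rough sketch of the PGOPTIONS route mentioned above, a wrapper could prepare the environment before psql ever opens the script. The helper name `psql_invocation` and the example file and database names are hypothetical. The sketch only builds the command line and environment and does not run anything; it also overwrites any PGOPTIONS value already present, which a real wrapper might prefer to merge.

```python
import os


def psql_invocation(script_path, dbname, base_env=None):
    """Hypothetical helper: build argv and environment for running a script
    with the client encoding fixed to UTF-8 before the file is read."""
    env = dict(os.environ if base_env is None else base_env)
    # Passed to the server at connection start, i.e. before psql reads the file.
    env["PGOPTIONS"] = "-c client_encoding=utf8"
    argv = ["psql", "-d", dbname, "-f", script_path]
    return argv, env


argv, env = psql_invocation("script.sql", "mydb", base_env={})
```

The resulting pair could then be handed to something like `subprocess.run(argv, env=env)`.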