UTF8 with BOM support in psql [PgSql]

From: Peter Eisentraut on 16 Nov 2009 15:37

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> Sure. Client encoding is declared in body of a file, but BOM is
> in head of the file. So, we should always ignore BOM sequence
> at the file head no matter what client encoding is used.
>
> The attached patch replaces the BOM with white spaces, but it does not
> change client encoding automatically. I think we can always ignore
> client encoding at the replacement because SQL command cannot start
> with BOM sequence. If we don't ignore the sequence, execution of
> the script must fail with syntax error.

OK, I think the consensus here is:

- Eat BOM at beginning of file (as you implemented)

- Only when client encoding is UTF-8 --> please fix that

I'm not sure if replacing a BOM by three spaces is a good way to
implement "eating", because it might throw off a column indicator
somewhere, say, but I couldn't reproduce a problem. Note that the
U+FEFF character is defined as *zero-width* non-breaking space.
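
As a rough illustration of the two consensus points, the rule can be read as: drop the three BOM bytes from the start of the file, but only when the client encoding is already UTF-8. The Python below is only a sketch of that rule, not the psql patch itself; the names `is_utf8` and `eat_bom` and the way encoding names are normalized are assumptions made for the example.

```python
import codecs


def is_utf8(client_encoding: str) -> bool:
    # Hypothetical normalization: treat "UTF8", "utf-8" and "UTF_8" alike.
    return client_encoding.replace("-", "").replace("_", "").lower() == "utf8"


def eat_bom(first_line: bytes, client_encoding: str) -> bytes:
    """Remove a leading UTF-8 BOM, but only when the client encoding is UTF-8."""
    if is_utf8(client_encoding) and first_line.startswith(codecs.BOM_UTF8):
        return first_line[len(codecs.BOM_UTF8):]
    # Any other encoding: leave the bytes untouched.
    return first_line
```

With a non-UTF-8 client encoding the line is returned unchanged, so a file that merely happens to begin with the same three bytes is not altered.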

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on 16 Nov 2009 16:01

Peter Eisentraut <peter_e(a)gmx.net> writes:
> I'm not sure if replacing a BOM by three spaces is a good way to
> implement "eating", because it might throw off a column indicator
> somewhere, say, but I couldn't reproduce a problem. Note that the
> U+FEFF character is defined as *zero-width* non-breaking space.

So wouldn't it be better to remove the three bytes, rather than
replace with spaces? The latter will certainly confuse clients that
think that "column 1" means what they think is the first character.
A syntax error in the first line of the file should be sufficient
to demonstrate the issue.

regards, tom lane
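
The column-offset concern can be seen in a small sketch. It assumes a client that counts columns in bytes from the start of the line, and it uses a deliberately misspelled keyword as the error location; `column_of` is a hypothetical helper, not part of psql.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def column_of(line: bytes, token: bytes) -> int:
    """Return the 1-based byte column at which token starts in line."""
    return line.index(token) + 1


raw = BOM + b"SELEC 1;"               # misspelled keyword on the first line
padded = raw.replace(BOM, b"   ", 1)  # BOM replaced with three spaces
stripped = raw[len(BOM):]             # BOM removed

print(column_of(padded, b"SELEC"))    # 4
print(column_of(stripped, b"SELEC"))  # 1
```

Only the stripped variant reports the error at the column where the user sees the first character of the file.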

From: Itagaki Takahiro on 16 Nov 2009 19:31

Peter Eisentraut <peter_e(a)gmx.net> wrote:

> OK, I think the consensus here is:
> - Eat BOM at beginning of file (as you implemented)
> - Only when client encoding is UTF-8 --> please fix that

Are these an AND condition? If so, this patch will be useless.
Please remember that \encoding or SET client_encoding appears
*after* the BOM at the beginning of the file. I'll agree if the condition is
"Eat BOM at beginning of file and <<set client encoding to UTF-8>>",
like:
Defining Python Source Code Encodings:
http://www.python.org/dev/peps/pep-0263/

> I'm not sure if replacing a BOM by three spaces is a good way to
> implement "eating", because it might throw off a column indicator
> somewhere, say, but I couldn't reproduce a problem. Note that the
> U+FEFF character is defined as *zero-width* non-breaking space.

I assumed psql discards whitespace automatically, but I see it is
more robust to remove the BOM bytes explicitly. I'll fix it.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From: Tom Lane on 16 Nov 2009 19:37

Itagaki Takahiro <itagaki.takahiro(a)oss.ntt.co.jp> writes:
> Please remember that \encoding or SET client_encoding appears
> *after* the BOM at the beginning of the file. I'll agree if the condition is
> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>",

As has been stated multiple times, that will not get accepted,
because it will *break* files in other encodings that chance to
match the BOM pattern.

regards, tom lane
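
To make the objection concrete: the bytes EF BB BF are ordinary printable characters in single-byte encodings such as Latin-1, so a byte-level check cannot tell a UTF-8 BOM from legitimate text. The sketch below is illustrative only; `starts_with_bom_bytes` is a hypothetical helper.

```python
import codecs


def starts_with_bom_bytes(data: bytes) -> bool:
    """Byte-level check that knows nothing about the file's real encoding."""
    return data.startswith(codecs.BOM_UTF8)


# U+00EF, U+00BB and U+00BF are valid, printable Latin-1 characters.
latin1_text = "\u00ef\u00bb\u00bf rest of line"
latin1_bytes = latin1_text.encode("latin-1")

print(starts_with_bom_bytes(latin1_bytes))  # True
```

Switching the client encoding to UTF-8 on the strength of this match would misread the rest of such a file.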

From: Andrew Dunstan on 16 Nov 2009 19:51

Itagaki Takahiro wrote:
> Peter Eisentraut <peter_e(a)gmx.net> wrote:
>
>
>> OK, I think the consensus here is:
>> - Eat BOM at beginning of file (as you implemented)
>> - Only when client encoding is UTF-8 --> please fix that
>>
>
> Are these an AND condition? If so, this patch will be useless.
> Please remember that \encoding or SET client_encoding appears
> *after* the BOM at the beginning of the file. I'll agree if the condition is
> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>",
> like:
> Defining Python Source Code Encodings:
> http://www.python.org/dev/peps/pep-0263/
>

As previously discussed we should not be automagically setting the
client encoding, nor inferring it from the presence of a BOM.

As for when it can be set, unless I'm mistaken you should be able to set
it before any file is opened, if you need to, using PGOPTIONS or psql
"dbname=mydb options='-c client_encoding=utf8'". Of course, if the
server encoding is utf8 then, in the absence of it being set using those
methods, the client encoding will start as utf8 also.
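
A minimal sketch of the PGOPTIONS route, written as a Python wrapper that only builds the command line and environment and does not run anything. The helper name `psql_invocation` and the script name are assumptions for the example; `mydb` is the database name used above.

```python
import os


def psql_invocation(script_path: str, dbname: str, encoding: str = "utf8"):
    """Build argv and environment so client_encoding is set before the file is read."""
    env = dict(os.environ)
    # PGOPTIONS passes command-line options to the server at connection time.
    env["PGOPTIONS"] = f"-c client_encoding={encoding}"
    argv = ["psql", "-f", script_path, dbname]
    return argv, env


argv, env = psql_invocation("script.sql", "mydb")
print(argv)              # ['psql', '-f', 'script.sql', 'mydb']
print(env["PGOPTIONS"])  # -c client_encoding=utf8
```

The pair could then be handed to something like `subprocess.run(argv, env=env)`, so the encoding is already in effect when psql opens the script.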

cheers

andrew
