UTF8 with BOM support in psql [PgSql]

From: Peter Eisentraut on 21 Oct 2009 06:00

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> The attached patch replaces BOM with white spaces, but it does not
> change client encoding automatically. I think we can always ignore
> client encoding at the replacement because SQL command cannot start
> with BOM sequence. If we don't ignore the sequence, execution of
> the script must fail with syntax error.

I feel that psql is the wrong place to fix this. BOMs in UTF-8 should
be ignored everywhere, all the time.

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andrew Dunstan on 21 Oct 2009 09:08

Peter Eisentraut wrote:
> On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
>
>> The attached patch replaces BOM with white spaces, but it does not
>> change client encoding automatically. I think we can always ignore
>> client encoding at the replacement because SQL command cannot start
>> with BOM sequence. If we don't ignore the sequence, execution of
>> the script must fail with syntax error.
>>
>
> I feel that psql is the wrong place to fix this. BOMs in UTF-8 should
> be ignored everywhere, all the time.
>
>

I suggest you re-read the Unicode FAQ on the subject. That is not the
conclusion I came to after I read it. Quite the reverse in fact.

cheers

andrew

From: Peter Eisentraut on 24 Oct 2009 17:33

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> So, we should always ignore BOM sequence
> at the file head no matter what client encoding is used.

I think we can't do that. That byte sequence might be valid user data
in other encodings.
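
To make that concern concrete, here is a minimal Python sketch (not part of the patch under discussion; ISO-8859-1 is just one illustrative choice of "other encoding"). It shows that the three bytes forming the UTF-8 BOM are ordinary printable characters when the same bytes are read as Latin-1.

```python
import codecs

# The UTF-8 encoding of U+FEFF, i.e. the byte order mark.
bom = codecs.BOM_UTF8  # b'\xef\xbb\xbf'

# In ISO-8859-1 every byte maps to a character, so the same three
# bytes are legitimate text in that encoding.
as_latin1 = bom.decode("latin-1")
print(repr(as_latin1))

# Decoded as UTF-8, the three bytes are the single code point U+FEFF.
as_utf8 = bom.decode("utf-8")
print(repr(as_utf8))
```

So a tool that strips these bytes without knowing the encoding could, in principle, be removing user data.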

From: Peter Eisentraut on 14 Nov 2009 05:46

On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
> Client encoding is declared in body of a file, but BOM is
> in head of the file. So, we should always ignore BOM sequence
> at the file head no matter what client encoding is used.
>
> The attached patch replaces BOM with white spaces, but it does not
> change client encoding automatically. I think we can always ignore
> client encoding at the replacement because SQL command cannot start
> with BOM sequence. If we don't ignore the sequence, execution of
> the script must fail with syntax error.

I don't know what the best solution is here. The BOM encoded as UTF-8
is valid data in other encodings. Of course, there is your point that
such data cannot be at the start of an SQL command.

There is also the notion of how files are handled on Unix. Normally,
you'd assume that all of

psql -f file.sql
psql < file.sql
cat file.sql | psql
cat file1.sql file2.sql | psql

behave consistently. That would require that the BOM is ignored in the
middle of the data stream (which is legal and required per Unicode
standard) and that this only happens if the character set is actually
Unicode.
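
As a rough illustration of what that requirement would mean, here is a Python sketch. The helper name `drop_boms` and the way the client encoding is passed in are assumptions made for the example; this is not how psql is implemented. It removes the UTF-8 BOM wherever it occurs, but only when the stream is declared to be UTF-8.

```python
import codecs

def drop_boms(data: bytes, client_encoding: str) -> bytes:
    """Remove every UTF-8 BOM from data, but only for a UTF-8 stream.

    Hypothetical helper for illustration only. Caveat: it also removes
    U+FEFF inside string literals, which may not be what is wanted.
    """
    normalized = client_encoding.upper().replace("-", "").replace("_", "")
    if normalized != "UTF8":
        # In other encodings these bytes may be valid user data.
        return data
    return data.replace(codecs.BOM_UTF8, b"")

file1 = codecs.BOM_UTF8 + b"SELECT 1;\n"
file2 = codecs.BOM_UTF8 + b"SELECT 2;\n"
stream = file1 + file2  # what "cat file1.sql file2.sql" would produce

as_utf8 = drop_boms(stream, "UTF8")
as_latin1 = drop_boms(stream, "LATIN1")
```

With this approach all four invocations would see the same data for a UTF-8 stream, at the cost of needing to know the encoding before the first byte is interpreted.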

Any ideas?

From: Andrew Dunstan on 14 Nov 2009 08:06

Peter Eisentraut wrote:
> On Wed, 2009-10-21 at 13:11 +0900, Itagaki Takahiro wrote:
>
>> Client encoding is declared in body of a file, but BOM is
>> in head of the file. So, we should always ignore BOM sequence
>> at the file head no matter what client encoding is used.
>>
>> The attached patch replaces BOM with white spaces, but it does not
>> change client encoding automatically. I think we can always ignore
>> client encoding at the replacement because SQL command cannot start
>> with BOM sequence. If we don't ignore the sequence, execution of
>> the script must fail with syntax error.
>>
>
> I don't know what the best solution is here. The BOM encoded as UTF-8
> is valid data in other encodings. Of course, there is your point that
> such data cannot be at the start of an SQL command.
>
> There is also the notion of how files are handled on Unix. Normally,
> you'd assume that all of
>
> psql -f file.sql
> psql < file.sql
> cat file.sql | psql
> cat file1.sql file2.sql | psql
>
> behave consistently. That would require that the BOM is ignored in the
> middle of the data stream (which is legal and required per Unicode
> standard) and that this only happens if the character set is actually
> Unicode.
>
>
>

Cases 2 and 3 should be indistinguishable from psql's POV, although case
3 wins a "Useless Use of cat" award.

If we are only eating a BOM at the start of a file, which was the
consensus IIRC, and we treat STDIN as a file for this purpose, then we
would eat the leading BOM on file.sql and file1.sql in all the cases
above but not on file2.sql since we would not have any idea where the
file boundary was. That last case strikes me as a not very likely usage
(I'm pretty sure I've never used it, at least). A file containing:

\i file1.sql
\i file2.sql

would be the workaround if needed.
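
The difference between the piped case and the \i workaround can be sketched in Python as follows. This is only an illustration of the "eat a BOM at the start of an input" policy; `eat_leading_bom` is a hypothetical helper, not a psql function.

```python
import codecs

def eat_leading_bom(data: bytes) -> bytes:
    """Drop a UTF-8 BOM only if it is the very first thing in the input."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

file1 = codecs.BOM_UTF8 + b"SELECT 1;\n"
file2 = codecs.BOM_UTF8 + b"SELECT 2;\n"

# cat file1.sql file2.sql | psql: a single input, so only the first
# BOM is removed and the one from file2.sql stays in the stream.
piped = eat_leading_bom(file1 + file2)

# \i file1.sql followed by \i file2.sql: each file is its own input,
# so each leading BOM is removed.
included = eat_leading_bom(file1) + eat_leading_bom(file2)
```

Here `piped` still contains the second file's BOM bytes, while `included` does not.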

As for handling the fact that client encoding can't be set in a script
until after the leading BOM, there is always

PGOPTIONS="-c client_encoding=UTF8"

or similar.

cheers

andrew
