UTF8 with BOM support in psql [PgSql]

Prev: Controlling changes in plpgsql variableresolution
Next: ProcessUtility_hook

From: Bruce Momjian on 20 Oct 2009 01:58

Itagaki Takahiro wrote:
> UTF8 encoding text files with BOM (Byte Order Mark) are commonly
> used in Windows, though BOM was designed for UTF16 text originally.
> However, psql cannot read such format even if we set client encoding
> to UTF8. Is it worth supporting those format in psql?
>
> When psql opens a file with -f or \i, it checks first 3 bytes of the
> file. If they are BOM, discard the 3 bytes and change client encoding
> to UTF8 automatically.
>
> Is this change reasonable? Comments welcome.

Seems there is community support for accepting BOM:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

Should I add this as a TODO item?

--
Bruce Momjian <bruce(a)momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Itagaki Takahiro on 20 Oct 2009 02:18

Bruce Momjian <bruce(a)momjian.us> wrote:

> Itagaki Takahiro wrote:
> > When psql opens a file with -f or \i, it checks first 3 bytes of the
> > file. If they are BOM, discard the 3 bytes and change client encoding
> > to UTF8 automatically.
>
> Seems there is community support for accepting BOM:
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

Thank yor for information.
I read the thread that we discussed about BOM handling in *data types*.
I agree the decision in the thead that we should not skip BOM characters,
but we can handle BOM in a different way in the head of *files* for psql
and COPY input.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Peter Eisentraut on 20 Oct 2009 05:54

On Tue, 2009-10-20 at 14:41 +0900, Itagaki Takahiro wrote:
> UTF8 encoding text files with BOM (Byte Order Mark) are commonly
> used in Windows, though BOM was designed for UTF16 text originally.
> However, psql cannot read such format even if we set client encoding
> to UTF8. Is it worth supporting those format in psql?

psql doesn't have a problem, but the backend's lexer doesn't parse the
BOM as whitespace. Since the lexer is byte-based, it will presumably
have problems with anything outside of ASCII that Unicode considers
whitespace.

> When psql opens a file with -f or \i, it checks first 3 bytes of the
> file. If they are BOM, discard the 3 bytes and change client encoding
> to UTF8 automatically.

While I see that the Unicode standard supports using a UTF-8 encoded BOM
as UTF-8 signature, I wonder if those bytes can usefully appear in a
leading position in other encodings.

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on 20 Oct 2009 10:36

Bruce Momjian <bruce(a)momjian.us> writes:
> Seems there is community support for accepting BOM:
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php

That discussion has approximately nothing to do with the
much-more-invasive change that Itagaki-san is suggesting.

In particular I think an automatic change of client_encoding isn't
particularly a good idea --- wouldn't you have to change it back later,
and is there any possibility of creating a security issue from such
behavior? Remember that client_encoding *IS* tied to security issues
such as backslash escape handling.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andrew Dunstan on 20 Oct 2009 11:13

Tom Lane wrote:
> Bruce Momjian <bruce(a)momjian.us> writes:
>
>> Seems there is community support for accepting BOM:
>> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01625.php
>>
>
> That discussion has approximately nothing to do with the
> much-more-invasive change that Itagaki-san is suggesting.
>
> In particular I think an automatic change of client_encoding isn't
> particularly a good idea --- wouldn't you have to change it back later,
> and is there any possibility of creating a security issue from such
> behavior? Remember that client_encoding *IS* tied to security issues
> such as backslash escape handling.
>
>
>

Yeah, I don't think we should be second-guessing the user on the encoding.

What I think we might sensibly do is to eat the leading BOM of an SQL
file iff the client encoding is UTF8, and otherwise treat it as just
bytes in whatever the encoding is.

Should we also do the same for files passed via \copy? What about
streams on stdin? What about files read from the backend via COPY?

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

| Next | Last
Pages: 1 2 3 4 5 6 7 8 9
Prev: Controlling changes in plpgsql variableresolution
Next: ProcessUtility_hook