UTF8 with BOM support in psql [PgSql]

From: Tom Lane on 20 Oct 2009 11:51

Andrew Dunstan <andrew(a)dunslane.net> writes:
> What I think we might sensibly do is to eat the leading BOM of an SQL
> file iff the client encoding is UTF8, and otherwise treat it as just
> bytes in whatever the encoding is.

That seems relatively non-risky.
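
For illustration only, the rule proposed above (drop a leading BOM only when the client encoding is UTF8, otherwise leave the bytes alone) might be sketched like this. It is written in Python for brevity, while psql itself is C. The function name strip_leading_bom and the way encoding names are normalized are assumptions, not anything taken from psql.

```python
import codecs


def strip_leading_bom(data, client_encoding):
    """Drop a leading UTF-8 BOM only when the client encoding is UTF8.

    Hypothetical helper: neither the name nor the encoding-name
    normalization below comes from psql.
    """
    # Treat spellings such as "UTF8", "utf-8" and "Utf_8" as the same name.
    normalized = client_encoding.lower().replace("-", "").replace("_", "")
    if normalized == "utf8" and data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    # In any other encoding the three bytes are ordinary data.
    return data


script = codecs.BOM_UTF8 + b"SELECT 1;\n"
print(strip_leading_bom(script, "UTF8"))
print(strip_leading_bom(script, "LATIN1"))
```

Only the head of the input is examined, so a BOM anywhere else is passed through unchanged.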

> Should we also do the same for files passed via \copy? What about
> streams on stdin? What about files read from the backend via COPY?

Not thrilled about doing this on stdin --- you have no good
justification for assuming that start of stdin corresponds to a file
boundary somewhere. COPY files, maybe.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: "Kevin Grittner" on 20 Oct 2009 11:54

Andrew Dunstan <andrew(a)dunslane.net> wrote:

> What I think we might sensibly do is to eat the leading BOM of an
> SQL file iff the client encoding is UTF8, and otherwise treat it as
> just bytes in whatever the encoding is.

Only at the beginning of the file or stream? What happens when people
concatenate files? Would it make sense to treat BOM as whitespace in
UTF-8, or maybe ignore it entirely?

-Kevin

From: Magnus Hagander on 20 Oct 2009 12:02

2009/10/20 Tom Lane <tgl(a)sss.pgh.pa.us>:
> Andrew Dunstan <andrew(a)dunslane.net> writes:
>> What I think we might sensibly do is to eat the leading BOM of an SQL
>> file iff the client encoding is UTF8, and otherwise treat it as just
>> bytes in whatever the encoding is.
>
> That seems relatively non-risky.

+1.

>> Should we also do the same for files passed via \copy? What about
>> streams on stdin? What about files read from the backend via COPY?
>
> Not thrilled about doing this on stdin --- you have no good
> justification for assuming that start of stdin corresponds to a file
> boundary somewhere. COPY files, maybe.

Yeah, that seems a lot more error-prone.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

From: David Christensen on 20 Oct 2009 12:02

On Oct 20, 2009, at 10:51 AM, Tom Lane wrote:

> Andrew Dunstan <andrew(a)dunslane.net> writes:
>> What I think we might sensibly do is to eat the leading BOM of an SQL
>> file iff the client encoding is UTF8, and otherwise treat it as just
>> bytes in whatever the encoding is.
>
> That seems relatively non-risky.

Is that only when the default client encoding is set to UTF8
(PGCLIENTENCODING, whatever), or will it be coded to work with the
following:

$ PGCLIENTENCODING=...nonutf8...
$ psql -f <file>

Where <file> is:
<BOM>
....

SET CLIENT ENCODING 'utf8';

....
EOF

Regards,

David
--
David Christensen
End Point Corporation
david(a)endpoint.com

From: Itagaki Takahiro on 21 Oct 2009 00:11

David Christensen <david(a)endpoint.com> wrote:

> Is that only when the default client encoding is set to UTF8
> (PGCLIENTENCODING, whatever), or will it be coded to work with the
> following:
>
> $ psql -f <file>
> Where <file> is:
> <BOM>
> SET CLIENT ENCODING 'utf8';

Sure. The client encoding is declared in the body of the file, but the
BOM is at its head. So we should always ignore a BOM sequence at the
head of the file, no matter what client encoding is in use.

The attached patch replaces the BOM with white space, but it does not
change the client encoding automatically. I think we can always ignore
the client encoding when doing the replacement, because an SQL command
cannot start with a BOM sequence. If we don't ignore the sequence,
execution of the script is bound to fail with a syntax error.
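
As a rough sketch of the idea, and not the attached patch itself (psql is written in C), the replacement could look like the following in Python. The name blank_leading_bom is hypothetical, and using exactly three spaces so that the line length stays the same is an assumption of this sketch.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def blank_leading_bom(first_line):
    """Replace a BOM at the head of the first line with spaces.

    Hypothetical helper. The client encoding is deliberately not
    consulted, following the reasoning that no SQL command can
    start with these three bytes.
    """
    if first_line.startswith(BOM):
        # Three spaces keep the byte length of the line unchanged.
        return b" " * len(BOM) + first_line[len(BOM):]
    return first_line


line = BOM + b"SET client_encoding = 'utf8';\n"
print(blank_leading_bom(line))
```

Only the first line of the file would be passed through such a function, so a BOM later in the input is not touched.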

This patch does nothing about COPY and \copy commands. It might be
possible to add BOM handling code around AllocateFile() in CopyFrom()
to support "COPY FROM 'utf8file-with-bom.tsv'", but we need another
approach for "COPY FROM STDIN". For example,
$ cat utf8bom-1.tsv utf8bom-2.tsv | psql -c "COPY FROM STDIN"
might contain a BOM sequence in the middle of the input stream.
Anyway, those changes would come in a separate patch in the future.
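
To make the stdin problem concrete, here is a small Python sketch with made-up file contents. It shows that stripping a BOM only at the head of a concatenated stream leaves the second file's BOM in the data. The helper bom_offsets and the sample rows are assumptions for illustration.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def bom_offsets(data):
    """Return the byte offsets of every BOM sequence in data."""
    offsets = []
    pos = data.find(BOM)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BOM, pos + len(BOM))
    return offsets


# Two hypothetical tab-separated files, each saved with a BOM.
file1 = BOM + b"1\talpha\n"
file2 = BOM + b"2\tbeta\n"

# What the reader of the pipe sees when both files are concatenated.
stream = file1 + file2

# Head-only handling removes the first BOM and misses the second one.
head_stripped = stream[len(BOM):] if stream.startswith(BOM) else stream

print(bom_offsets(stream))
print(bom_offsets(head_stripped))
```

The second list is not empty: the BOM that began the second file now sits at the start of a data row, where it would be read as part of the first column.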

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
