From: Tom Lane
Andrew Dunstan <andrew(a)dunslane.net> writes:
> As for when it can be set, unless I'm mistaken you should be able to set
> it before any file is opened, if you need to, using PGOPTIONS or psql
> "dbname=mydb options='-c client_encoding=utf8'". Of course, if the
> server encoding is utf8 then, in the absence of it being set using those
> methods, the client encoding will start as utf8 also.

It could also be set in ~/.psqlrc, which would probably be the most
convenient method for regular users of UTF8 files who need to talk
to non-UTF8 databases.

regards, tom lane


From: Itagaki Takahiro

Tom Lane <tgl(a)sss.pgh.pa.us> wrote:

> Andrew Dunstan <andrew(a)dunslane.net> writes:
> > if you need to, using PGOPTIONS or psql
> > "dbname=mydb options='-c client_encoding=utf8'".
>
> It could also be set in ~/.psqlrc, which would probably be the most
> convenient method for regular users of UTF8 files who need to talk
> to non-UTF8 databases.

It's nonsense. Users often work with scripts written in different encodings
at the same time. The encoding information should be carried in the script
file itself; we should not force users to open script files and check their
encoding before executing them.

BTW, I have an idea to improve the handling of per-file encoding.
Currently, an encoding set inside a file included with \i remains in
effect after the file ends. But shouldn't the setting be reverted at
the end of the file? i.e.

=# \encoding SJIS
=# \i script-in-utf8.sql
=# -- encoding should be SJIS here.

If the encoding setting is reverted,
> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>"
will be much safer.
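
A minimal sketch of that revert-at-end-of-file behaviour, using stand-in
helpers (run_included_file, apply_encoding, run_script are illustrative
names only, not psql's actual functions):

/*
 * Sketch only: save and restore the client encoding around an included
 * file, so that an encoding change made inside the file (by a BOM or by
 * \encoding) does not leak into the outer session.
 */
#include <stdio.h>

static char session_encoding[64] = "SJIS";   /* stand-in for the session's client encoding */

static void
apply_encoding(const char *name)
{
    /* real psql would update pset.encoding and tell the server */
    snprintf(session_encoding, sizeof(session_encoding), "%s", name);
    printf("client encoding is now %s\n", session_encoding);
}

static void
run_script(const char *filename)
{
    /* pretend the file starts with a UTF-8 BOM and switches the encoding */
    printf("running %s\n", filename);
    apply_encoding("UTF8");
}

static void
run_included_file(const char *filename)
{
    char saved[64];

    snprintf(saved, sizeof(saved), "%s", session_encoding);

    run_script(filename);    /* may change the client encoding */

    apply_encoding(saved);   /* revert at the end of the file, as proposed */
}

int
main(void)
{
    run_included_file("script-in-utf8.sql");   /* encoding ends up back at SJIS */
    return 0;
}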

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




From: Tom Lane
Itagaki Takahiro <itagaki.takahiro(a)oss.ntt.co.jp> writes:
> If the encoding setting is reverted,
>> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>"
> will be much safer.

This isn't going to happen, so please stop wasting our time arguing
about it.

regards, tom lane


From: Itagaki Takahiro

Tom Lane <tgl(a)sss.pgh.pa.us> wrote:

> Itagaki Takahiro <itagaki.takahiro(a)oss.ntt.co.jp> writes:
> > If the encoding setting is reverted,
> >> "Eat BOM at beginning of file and <<set client encoding to UTF-8>>"
> > will be much safer.
>
> This isn't going to happen, so please stop wasting our time arguing
> about it.

Ok, sorry. But I still cannot accept this restriction.
>> - Only when client encoding is UTF-8 --> please fix that

The attached patch is a new proposal for the feature.
When we find a BOM at the beginning of a file, we set "expected_encoding" to UTF8.
Before every execution of a query, if pset.encoding is not UTF8, we check that
the query string contains no non-ASCII characters and throw an error if it does.
Encoding declarations are typically written in ASCII characters only, so we can
postpone the encoding check until a non-ASCII character appears.

Since the default value of expected_encoding is SQL_ASCII, which passes
through all characters, the patch does nothing to scripts without a BOM.
(Nothing other than a BOM sets expected_encoding.)
If the client encoding is UTF8, the BOM is simply skipped and the script body
is unaffected. A BOM is skipped even when the client encoding is not UTF8,
but in that case an error can be thrown if there is no explicit encoding
declaration.
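
A minimal sketch of the check described above, under the assumption that a
UTF-8 BOM sets expected_encoding and that each query is scanned for non-ASCII
bytes before being sent. The names here (skip_bom, check_query_encoding,
client_encoding) are illustrative placeholders mirroring the description,
not the patch's actual code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum encoding { SQL_ASCII, UTF8, OTHER };

static enum encoding expected_encoding = SQL_ASCII;  /* set to UTF8 when a BOM is seen */
static enum encoding client_encoding   = OTHER;      /* e.g. SJIS */

/* Consume a UTF-8 BOM at the start of the script, if present. */
static const char *
skip_bom(const char *buf)
{
    if (strncmp(buf, "\xEF\xBB\xBF", 3) == 0)
    {
        expected_encoding = UTF8;
        return buf + 3;
    }
    return buf;
}

/* Called before sending each query. */
static bool
check_query_encoding(const char *query)
{
    if (expected_encoding == UTF8 && client_encoding != UTF8)
    {
        for (const unsigned char *p = (const unsigned char *) query; *p; p++)
        {
            if (*p >= 0x80)
            {
                fprintf(stderr,
                        "non-ASCII character in a UTF-8 (BOM) file, "
                        "but client encoding is not UTF8\n");
                return false;
            }
        }
    }
    return true;   /* pure ASCII (or nothing to check): safe to send */
}

int
main(void)
{
    const char *script = "\xEF\xBB\xBFSELECT 'caf\xC3\xA9';";
    const char *body = skip_bom(script);

    if (!check_query_encoding(body))
        return 1;             /* error path: non-ASCII with mismatched encoding */
    printf("query sent: %s\n", body);
    return 0;
}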

As far as I can see, the patch solves most of the problems raised in this
discussion. Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From: Peter Eisentraut
On Tue, 2009-11-17 at 14:19 +0900, Itagaki Takahiro wrote:
> The attached patch is a new proposal for the feature.
> When we find a BOM at the beginning of a file, we set "expected_encoding" to UTF8.
> Before every execution of a query, if pset.encoding is not UTF8, we check that
> the query string contains no non-ASCII characters and throw an error if it does.
> Encoding declarations are typically written in ASCII characters only, so we can
> postpone the encoding check until a non-ASCII character appears.
>
> Since the default value of expected_encoding is SQL_ASCII, which passes
> through all characters, the patch does nothing to scripts without a BOM.
> (Nothing other than a BOM sets expected_encoding.)
> If the client encoding is UTF8, the BOM is simply skipped and the script body
> is unaffected. A BOM is skipped even when the client encoding is not UTF8,
> but in that case an error can be thrown if there is no explicit encoding
> declaration.

I think I could support using the presence of the BOM as a fall-back
indicator of encoding in the absence of any other declaration. It seems to
me, however, that the description above ignores the existence of
encodings other than SQL_ASCII and UTF8.

Also, when the proposed patch to set the encoding from the locale
appears, we need to make this logic more precise. Something like:

1. set client_encoding or \encoding, otherwise
2. if BOM found, then UTF8, otherwise
3. by locale environment, otherwise
4. SQL_ASCII (= server encoding, effectively)
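
Something like this, as a rough sketch of that resolution order
(explicit_encoding, bom_seen, and locale_encoding are placeholder names,
not from any actual patch):

#include <stddef.h>
#include <stdio.h>

static const char *
resolve_client_encoding(const char *explicit_encoding,  /* client_encoding or \encoding */
                        int bom_seen,                    /* UTF-8 BOM found at start of file */
                        const char *locale_encoding)     /* derived from the locale, if any */
{
    if (explicit_encoding != NULL)
        return explicit_encoding;        /* 1. explicit setting wins */
    if (bom_seen)
        return "UTF8";                   /* 2. BOM implies UTF-8 */
    if (locale_encoding != NULL)
        return locale_encoding;          /* 3. fall back to the locale */
    return "SQL_ASCII";                  /* 4. effectively the server encoding */
}

int
main(void)
{
    /* no explicit setting, BOM seen, locale says LATIN1: prints UTF8 */
    printf("%s\n", resolve_client_encoding(NULL, 1, "LATIN1"));
    return 0;
}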

Also, I'm not sure this logic should apply only when we send a query. It
might be better to do it in the lexer, at the point where we find a
non-ASCII character and no client encoding other than SQL_ASCII has been
set yet.

