From: Itagaki Takahiro on

Peter Eisentraut <peter_e(a)gmx.net> wrote:

> I think I could support using the presence of the BOM as a fall-back
> indicator of encoding in absence of any other declaration.

What is the difference between the fall-back and <<set client encoding to
UTF-8 if BOM found>>? I read this discussion as saying that we cannot accept
any automatic encoding detection (properly speaking, detection is OK, but
automatic assignment is not). We should not have any fall-back mechanism, no?
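
To make the three positions concrete, here is a minimal Python sketch. It is
not psql code (psql is written in C); the function names and the policy labels
"detect_only", "fallback" and "always" are made up for this illustration.

```python
import codecs


def detect_bom_encoding(data: bytes):
    """Return "UTF8" if the data starts with a UTF-8 BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF8"
    return None


def encoding_to_assign(data: bytes, declared, policy: str):
    """Return the encoding to assign automatically, or None for no change.

    declared -- encoding the user set explicitly, or None (hypothetical input)
    policy   -- "detect_only": notice the BOM but never assign anything
                "fallback":    assign only if nothing was declared
                "always":      assign whenever a BOM is found
    """
    detected = detect_bom_encoding(data)
    if detected is None or policy == "detect_only":
        return None
    if policy == "always":
        return detected
    if policy == "fallback":
        return detected if declared is None else None
    raise ValueError(f"unknown policy: {policy}")
```

Under this reading, "fallback" and "always" differ only when an encoding has
already been declared; with no declaration both assign UTF8 automatically.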

> Also, when the proposed patch to set the encoding from the locale
> appears, we need to make this logic more precise.

The encoding-from-locale feature will be useful, but the patch does *not*
set any encodings. The reason is the same as above.

> Also, I'm not sure if we need this logic only when we send a query. It
> might be better to do this in the lexer when we find a non-ASCII
> character and we don't have a client encoding != SQL_ASCII set yet.

Absolutely, but isn't that an issue independent of the BOM? Multi-byte scripts
without a declared encoding are always dangerous, whether a BOM is present or
not. I'd say we can always throw an error when we find queries that contain
multi-byte characters and there is no prior encoding declaration.
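
As a rough sketch of the check proposed here (Python for illustration only;
the exception class and function name are hypothetical, and treating
SQL_ASCII as "no encoding declared" is an assumption taken from the quoted
text):

```python
class MissingEncodingError(Exception):
    """Raised when non-ASCII bytes appear with no declared encoding."""


def check_query_bytes(query: bytes, client_encoding: str = "SQL_ASCII") -> None:
    """Reject queries containing bytes >= 0x80 while the encoding is SQL_ASCII."""
    if client_encoding.upper() != "SQL_ASCII":
        return  # an encoding was declared, nothing to check
    for offset, byte in enumerate(query):
        if byte >= 0x80:
            raise MissingEncodingError(
                f"non-ASCII byte 0x{byte:02x} at offset {offset} "
                "with no client encoding declared"
            )
```

Pure-ASCII scripts pass untouched; any script with a byte above 0x7F fails
unless an encoding was set beforehand.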

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andrew Dunstan on


Itagaki Takahiro wrote:
> Multi-byte scripts
> without encoding are always dangerous whether BOM is present or not.
> I'd say we can always throw an error when we find queries that contain
> multi-byte characters if no prior encoding declaration.

You will break a gazillion scripts that today work quite happily if you do.

I think you have really not thought out these proposals well.

Maybe there is a case for an extra command line switch to set the initial
client encoding for psql, which would make that a little easier and less
obscure to do. Would that make things simpler for you?

cheers

andrew




From: Tom Lane on
Peter Eisentraut <peter_e(a)gmx.net> writes:
> I think I could support using the presence of the BOM as a fall-back
> indicator of encoding in absence of any other declaration. It seems to
> me, however, that the description above ignores the existence of
> encodings other than SQL_ASCII and UTF8.

Yeah. This entire proposal rests on the assumption that UTF8 is the
only encoding that really matters, and introducing a possibility of
breaking things for users of other encodings is acceptable damage.
I do not think that supporting a deprecated-by-standards behavior
is worth that.

Even assuming that we had consensus on a behavior that involved
silently changing client_encoding, I do not believe that it's practical
to implement it in an acceptable fashion. Just issuing a SET behind the
user's back will not work in a number of scenarios:

* We are inside a transaction when \i is called, and the file contains
a ROLLBACK.

* We are inside a failed transaction when \i is called --- the SET won't
even work at all.

* Same two cases inside a savepoint.

* The file contains a \c command.

If you expect that the previous client_encoding should be restored at
the end of the \i inclusion (as I certainly would) then you have the
first three hazards at file end as well, except that now the odds of
being inside a failed transaction are significantly higher. Also,
what if the file contained a SET CLIENT_ENCODING command itself?
How should that interact with this?
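
The first two hazards can be shown with a toy model. This is a Python
sketch, not PostgreSQL code: the ToySession class is invented for
illustration and models only two facts, that a SET made inside a transaction
is undone by ROLLBACK and that commands are rejected inside a failed
transaction.

```python
class ToySession:
    """Toy model of a backend session; not PostgreSQL code."""

    def __init__(self, client_encoding="LATIN1"):
        self.client_encoding = client_encoding
        self.in_transaction = False
        self.failed = False
        self._saved = None  # value to restore on ROLLBACK

    def begin(self):
        self.in_transaction = True
        self.failed = False
        self._saved = self.client_encoding

    def fail(self):
        """Simulate an error inside the open transaction."""
        if self.in_transaction:
            self.failed = True

    def set_client_encoding(self, value):
        """Return True if the SET took effect, False if it was rejected."""
        if self.failed:
            return False  # commands are rejected until the transaction ends
        self.client_encoding = value
        return True

    def rollback(self):
        if self.in_transaction:
            # The SET is undone together with the rest of the transaction.
            self.client_encoding = self._saved
        self.in_transaction = False
        self.failed = False
```

In the first case the hidden SET succeeds and is then lost at the ROLLBACK
inside the file; in the second it never takes effect at all.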

Lastly, a silent change of client_encoding would also affect the
encoding of notice and error messages that come out while the \i
file is running. I fail to find that non-astonishing, either.

I think that the only way this sort of behavior could be implemented
without a bunch of broken corner cases would be if we put the
responsibility of encoding conversion inside psql, so that switching its
idea of the encoding was just a local change rather than something it
had to ask the backend to do, and it could be careful to apply the
encoding only to the data coming from the \i file. Which is possible,
perhaps, but it hardly seems that slightly-more-convenient BOM handling
is worth it.
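
A minimal sketch of that client-side approach, in Python for illustration
(psql itself is C, and the function name is hypothetical). It uses Python
codec names such as "utf-8" and "latin-1" rather than PostgreSQL encoding
names, and it assumes the file's encoding is already known:

```python
import codecs


def recode_included_file(raw: bytes, file_encoding: str,
                         session_encoding: str) -> bytes:
    """Convert an included file's bytes to the session's client encoding.

    The session encoding itself is never changed; only the data read
    from the file is converted, locally, before it would be sent.
    """
    if codecs.lookup(file_encoding).name == "utf-8" and \
            raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]  # eat the BOM
    text = raw.decode(file_encoding)
    # Raises UnicodeEncodeError if a character has no representation
    # in the session encoding.
    return text.encode(session_encoding)
```

Because nothing is sent to the backend to switch encodings, transaction
state, \c, and the encoding of notices and errors are all unaffected.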

regards, tom lane


From: Peter Eisentraut on
On Tue, 2009-11-17 at 09:31 +0900, Itagaki Takahiro wrote:
> Peter Eisentraut <peter_e(a)gmx.net> wrote:
>
> > OK, I think the consensus here is:
> > - Eat BOM at beginning of file (as you implemented)
> > - Only when client encoding is UTF-8 --> please fix that
>
> Are they an AND condition? If so, this patch will be useless.
> Please remember that \encoding or SET client_encoding appears
> *after* the BOM at the beginning of the file.

Presumably, if you have editors throwing in BOM marks without asking,
you have an environment where either

a) You can set the client encoding to UTF8 in the environment, so it
applies by default, or

b) The server encoding is UTF8, so the client encoding will default to
that.

Together, that should cover a lot of cases. Not perfect, but far from
useless.
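
For case (a), one way to set the client encoding in the environment is
libpq's PGCLIENTENCODING variable. A small Python sketch of preparing such
an invocation (the helper name and the script path are placeholders):

```python
import os


def psql_invocation(script_path: str, base_env=None):
    """Build a psql command and an environment with UTF8 as client encoding."""
    env = dict(os.environ if base_env is None else base_env)
    env["PGCLIENTENCODING"] = "UTF8"  # applies by default to the session
    cmd = ["psql", "-f", script_path]
    return cmd, env
```

The pair can be passed to subprocess.run(cmd, env=env) on a machine where
psql is installed.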



From: Peter Eisentraut on
On Tue, 2009-11-17 at 00:59 -0800, Chuck McDevitt wrote:
> Or is there a plan to read and convert the UTF-16 or UTF-32 to UTF-8,
> so psql and PostgreSQL understand it?
> (BTW, that would actually be nice on Windows, where UTF-16 is common).

Well, someone could implement UTF-16 or UTF-whatever as client encoding.
But I have not heard of any concrete proposals about that.
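
For reference, a conversion of the kind asked about could be done on the
client side before the script reaches psql. This is only a Python sketch
under that assumption, not a proposal for a client encoding; the function
name is made up:

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_utf8(raw: bytes) -> bytes:
    """Strip a leading BOM and return the content as BOM-less UTF-8."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(codec).encode("utf-8")
    return raw  # no BOM: leave the bytes untouched, do not guess
```

One caveat: a UTF-16 LE file whose first character after the BOM is U+0000
would be misread as UTF-32 LE, since the byte sequences coincide.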

