UTF8 with BOM support in psql [PgSql]

From: Itagaki Takahiro on 17 Nov 2009 23:03

Andrew Dunstan <andrew(a)dunslane.net> wrote:

> Itagaki Takahiro wrote:
> > Multi-byte scripts
> > without encoding are always dangerous whether BOM is present or not.
> > I'd say we can always throw an error when we find queries that contain
> > multi-byte characters if there is no prior encoding declaration.
>
> You will break a gazillion scripts that today work quite happily if you do.

Sure. That's why I didn't send a patch for it :)
If we ever did so, we would need a boolean option to disable the check.

> Maybe there is a case for an extra command line switch to set the initial
> client encoding for psql, which would make that a little easier and less
> obscure to do. Would that make things simpler for you?

No. There are complicated reasons for this on Windows in Japan. The client
encoding is always SJIS because of a Windows restriction, but the database is
initialized with UTF8. Simple interactive work in psql is done under the SJIS
encoding, but some scripts are written in UTF8 because it matches the server
encoding. (Of course, such a script is executed as "psql -f utf8.sql > out.txt".)

I don't want users to check the encoding of scripts before executing --
it is far from fail-safe.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andrew Dunstan on 17 Nov 2009 23:22

Itagaki Takahiro wrote:
> I don't want users to check the encoding of scripts before executing --
> it is far from fail-safe.

That's what we require in all other cases. Why should UTF8 be special?
If I have a script in Latin1 and Postgres thinks it's UTF8 it will
probably explode. Same for the reverse situation. Second-guessing the
user strikes me as being quite as dangerous as what you're trying to
cure, for all the reasons Tom outlined earlier today. What is more, you
will teach Windows users to rely on the client encoding being set in
UTF8 scripts without their doing anything, and then when they get on
another platform they will not understand why it doesn't work because
the BOMs will be missing.
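
[Editorial illustration, not part of the original message.] The mismatch described above can be reproduced outside PostgreSQL. The sketch below uses Python only to show what happens to the bytes; the sample string and the helper name `decodes_as_utf8` are made up for illustration, and nothing here claims to show how psql or the server actually reacts.

```python
# Illustration only: the same text encoded two ways, then misinterpreted.
latin1_bytes = "caf\u00e9".encode("latin-1")  # one byte (0xE9) for the accented letter
utf8_bytes = "caf\u00e9".encode("utf-8")      # two bytes (0xC3 0xA9) for the same letter


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Latin1 bytes treated as UTF8: rejected as an invalid byte sequence.
latin1_ok = decodes_as_utf8(latin1_bytes)

# UTF8 bytes treated as Latin1: accepted silently, but the text is garbled.
misread = utf8_bytes.decode("latin-1")
```

The first direction fails loudly, while the reverse direction succeeds and quietly yields wrong characters, which is arguably the more dangerous of the two.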

cheers

andrew

From: Itagaki Takahiro on 17 Nov 2009 23:35

Andrew Dunstan <andrew(a)dunslane.net> wrote:

> Itagaki Takahiro wrote:
> > I don't want users to check the encoding of scripts before executing --
> > it is far from fail-safe.
>
> That's what we require in all other cases. Why should UTF8 be special?

No. I wasn't thinking about UTF-8 or the BOM at that point.
I assumed we were discussing the following line:

> > I'd say we can always throw an error when we find queries that contain
> > multi-byte characters if there is no prior encoding declaration.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From: Peter Eisentraut on 18 Nov 2009 04:11

On Wed, 2009-11-18 at 12:52 +0900, Itagaki Takahiro wrote:
> Peter Eisentraut <peter_e(a)gmx.net> wrote:
>
> > Together, that should cover a lot of cases. Not perfect, but far from
> > useless.
>
> For Japanese users on Windows, the client encoding is always set to SJIS
> because of the restriction of cmd.exe. But the script file can be written
> in UTF8 with BOM. I don't think we should depend on client encoding.

Set by whom, how, and because of what restriction?

From: Peter Eisentraut on 18 Nov 2009 04:18

On Tue, 2009-11-17 at 23:22 -0500, Andrew Dunstan wrote:
> Itagaki Takahiro wrote:
> > I don't want users to check the encoding of scripts before executing --
> > it is far from fail-safe.
>
> That's what we require in all other cases. Why should UTF8 be special?

But now we're back to the original problem. Certain editors insert BOMs
at the beginning of the file. And that is by any definition before the
embedded client encoding declaration. I think the only ways to solve
this are:

1) Ignore the BOM if a client encoding declaration of UTF8 appears in a
narrowly defined location near the beginning of the file (XML and
PEP-0263 style). For *example*, we could ignore the BOM if the file
starts with exactly "<BOM>\encoding UTF8\n". Would probably not work
well in practice.

2) Parse two alternative versions of the file, one with the BOM ignored
and one with the BOM not ignored, until you need to make a decision.
Hilariously complicated, but would perhaps solve the problem.

3) Give up, do nothing.
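
[Editorial illustration, not part of the original message.] Option 1 could be sketched roughly as follows. This is a minimal Python sketch under the assumption that the rule is exactly the example given above (the file must start with the BOM immediately followed by "\encoding UTF8" and a newline); the function name `strip_declared_bom` is hypothetical and this is not how psql is implemented.

```python
BOM = b"\xef\xbb\xbf"                # UTF-8 encoding of U+FEFF
DECLARATION = b"\\encoding UTF8\n"   # psql meta-command, exact spelling assumed


def strip_declared_bom(script: bytes) -> bytes:
    """Drop a leading BOM only when the encoding declaration follows it directly.

    Any other input, including a BOM followed by something else,
    is returned unchanged.
    """
    if script.startswith(BOM + DECLARATION):
        return script[len(BOM):]
    return script


declared = strip_declared_bom(BOM + DECLARATION + b"SELECT 1;\n")
undeclared = strip_declared_bom(BOM + b"SELECT 1;\n")
```

The strictness is the weak point: a different spelling of the encoding name, extra whitespace, or a leading comment would all defeat the match, which fits the remark that this would probably not work well in practice.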
