[GENERAL] psql weird behaviour with charset encodings [PgSql]

Prev: Two C-interface issues on -Testers
Next: plpython3

From: hgonzalez on 7 May 2010 21:48

> However, it appears that glibc's printf
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding.

Well, I myself have problems to believe that :-)
This would be nasty... Are you sure?

I couldn reproduce that.
I made a quick test, passing a utf-8 encoded string
(5 bytes correspoding to 4 unicode chars: "niño")
And my glib (same Fedora 12) seems to count bytes,
as it should.

#include<stdio.h>
main () {
char s[] = "ni\xc3\xb1o";
printf("|%.*s|\n",5,s);
}

This, compiled with gcc 4.4.3, run with my root locale (utf8)
did not padded a blank. ie it worked as expected.

Hernán

From: hgonzalez on 8 May 2010 18:50

Well, I finally found some related -rather old- issues in Bugzilla (glib)

http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308
http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

The last explains why they do not consider it a bug:

ISO C99 requires for %.*s to only write complete characters that fit below
the
precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1
characters as shown in the input file you provided, some of the strings are
not valid UTF-8 strings, therefore sprintf fails with -1 because of the
encoding error. That's not a bug in glibc.

It's clear, though it's also rather ugly, from a specification point of
view (we must
count raw bytes for the width field, but also must decode the utf8 chars
for finding
character boundaries). I guess we must live with that.

Hernán J. González

From: Tom Lane on 8 May 2010 21:24

hgonzalez(a)gmail.com writes:
> http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

> The last explains why they do not consider it a bug:

> ISO C99 requires for %.*s to only write complete characters that fit below
> the
> precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1
> characters as shown in the input file you provided, some of the strings are
> not valid UTF-8 strings, therefore sprintf fails with -1 because of the
> encoding error. That's not a bug in glibc.

Yeah, that was about the position I thought they'd take.

So the bottom line here is that we're best off to avoid %.*s because
it may fail if the string contains data that isn't validly encoded
according to libc's idea of the prevailing encoding. I think that
means the patch I committed earlier is still a good idea, but the
comments need a bit of adjustment. Will fix.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

|
Pages: 1
Prev: Two C-interface issues on -Testers
Next: plpython3