From: hgonzalez on 7 May 2010 21:48

> However, it appears that glibc's printf code interprets the parameter as
> the number of *characters* to print, and to determine what's a character
> it assumes the string is in the environment LC_CTYPE's encoding.

Well, I myself have trouble believing that :-) This would be nasty... Are you sure? I couldn't reproduce it. I made a quick test, passing a UTF-8 encoded string (5 bytes corresponding to 4 Unicode chars: "niño"), and my glibc (same Fedora 12) seems to count bytes, as it should.

    #include <stdio.h>

    int main(void)
    {
        char s[] = "ni\xc3\xb1o";   /* "niño" in UTF-8: 5 bytes, 4 chars */
        printf("|%.*s|\n", 5, s);
        return 0;
    }

This, compiled with gcc 4.4.3 and run with my root locale (utf8), did not pad with a blank, i.e. it worked as expected.

Hernán
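To make the bytes-versus-characters distinction in the test above concrete, here is a minimal sketch that counts both for the same string. The helper name utf8_char_count is hypothetical (it is not part of glibc or PostgreSQL), and it assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count UTF-8 characters (code points) in a
 * NUL-terminated string by skipping continuation bytes (10xxxxxx).
 * Assumes the input is well-formed UTF-8; it does no validation.
 */
static size_t
utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++)
    {
        /* Every byte that is not a continuation byte starts a character. */
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "ni\xc3\xb1o", strlen() reports the byte length while utf8_char_count() reports the number of characters, which is exactly the gap between the two possible readings of the precision argument.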
From: hgonzalez on 8 May 2010 18:50

Well, I finally found some related (rather old) issues in Bugzilla (glibc):

http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308
http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

The last explains why they do not consider it a bug:

  ISO C99 requires for %.*s to only write complete characters that fit below the precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1 characters as shown in the input file you provided, some of the strings are not valid UTF-8 strings, therefore sprintf fails with -1 because of the encoding error. That's not a bug in glibc.

It's clear, though it's also rather ugly from a specification point of view (we must count raw bytes for the precision, but must also decode the UTF-8 chars to find character boundaries). I guess we must live with that.

Hernán J. González
From: Tom Lane on 8 May 2010 21:24

hgonzalez(a)gmail.com writes:
> http://sources.redhat.com/bugzilla/show_bug.cgi?id=649
> The last explains why they do not consider it a bug:
> ISO C99 requires for %.*s to only write complete characters that fit below
> the precision number of bytes. If you are using say UTF-8 locale, but
> ISO-8859-1 characters as shown in the input file you provided, some of the
> strings are not valid UTF-8 strings, therefore sprintf fails with -1
> because of the encoding error. That's not a bug in glibc.

Yeah, that was about the position I thought they'd take. So the bottom line here is that we're best off to avoid %.*s because it may fail if the string contains data that isn't validly encoded according to libc's idea of the prevailing encoding.

I think that means the patch I committed earlier is still a good idea, but the comments need a bit of adjustment. Will fix.

regards, tom lane
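As an illustration of what avoiding %.*s can look like, here is a minimal sketch of a byte-bounded copy that never interprets the data as characters. The helper name copy_bounded_bytes and its signature are hypothetical; this is not the patch referred to above, only one way to get the same truncation without going through printf.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy at most maxbytes bytes of src into dst
 * (capacity dstsize), stopping early at a NUL, and always NUL-terminate.
 * The bytes are never decoded, so the locale's encoding cannot make
 * the copy fail. Returns the number of bytes copied, excluding the
 * terminator.
 */
static size_t
copy_bounded_bytes(char *dst, size_t dstsize, const char *src, size_t maxbytes)
{
    size_t len;

    assert(dst != NULL && src != NULL && dstsize > 0);

    /* Length of src, capped at maxbytes, without reading past a NUL. */
    for (len = 0; len < maxbytes && src[len] != '\0'; len++)
        ;

    /* Leave room for the terminator. */
    if (len > dstsize - 1)
        len = dstsize - 1;

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

A caller that would otherwise write snprintf(buf, size, "%.*s", n, str) could call copy_bounded_bytes(buf, size, str, n) with a non-negative n instead. The trade-off is that the cut may fall in the middle of a multibyte character, which is what a pure byte count implies.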