From: rebelde on 31 May 2010 09:11

Hello,

We are porting a UTF-8-ready application from Linux to SunOS 5.9 and have run into the following puzzling problem. After a lot of digging I was able to reduce it to this snippet of C code:

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <locale.h>

    main()
    {
        char *asc = "a";
        char *utf = "\303\204";   /* this is an UTF-8 German A with dots */
        char buf[80];

        setlocale(LC_ALL, "");

        sprintf(buf, "%-*.*s", 16, 16, asc);
        printf("strlen of buf with ascii char %d\n", strlen(buf));
        printf("[%s]\n", buf);

        sprintf(buf, "%-*.*s", 16, 16, utf);
        printf("strlen of buf utf char %d\n", strlen(buf));
        printf("[%s]\n", buf);

        exit(0);
    }

If you compile and run it, you will see that in some environments the resulting string is not 16 bytes long (as expected), but 17:

    $ ./a.out
    strlen of buf with ascii char 16
    [a               ]
    strlen of buf utf char 16
    [Ä              ]

    $ LC_ALL="" ./a.out
    strlen of buf with ascii char 16
    [a               ]
    strlen of buf utf char 16
    [Ä              ]

    $ LC_ALL=de_DE.UTF-8 ./a.out
    strlen of buf with ascii char 16
    [a               ]
    strlen of buf utf char 17    <*******************************************
    [Ä               ]

    $ LC_ALL=de_DE.UTF-8 ./a.out | od -t x1
    0000000 73 74 72 6c 65 6e 20 6f 66 20 62 75 66 20 77 69
    0000020 74 68 20 61 73 63 69 69 20 63 68 61 72 20 31 36
    0000040 0a 5b 61 20 20 20 20 20 20 20 20 20 20 20 20 20
    0000060 20 20 5d 0a 73 74 72 6c 65 6e 20 6f 66 20 62 75
    0000100 66 20 75 74 66 20 63 68 61 72 20 31 37 0a 5b c3
    0000120 84 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
    0000140 5d 0a
    0000142

i.e. the problem shows up when the source buffer contains a 2-byte UTF-8 character and you

1) have LC_ALL=de_DE.UTF-8 in the environment *and*
2) pick that locale up inside the program with setlocale(LC_ALL, "").

You can also see that the output is the plain 2-byte UTF-8 code for the German letter A with dots, followed by 15 blanks, which gives 17 bytes in the case of "Ä" (and 16 in the case of "a").

The behaviour is the same on SunOS 5.9 and SunOS 5.10, but it does not occur on FreeBSD 8.x or on Linux SLES10. The man page of setlocale(3C) does not mention any influence of the locale settings on sprintf(3C), only on things where it is logical, like strftime, ctype, ...

What does this mean? Is this a bug? IMHO sprintf(3C) should just add bytes to a buffer as described by its format string. It should count a 2-byte string (the UTF-8 "Ä") as two bytes, regardless of what the two bytes mean, and fill the rest of the buffer with (in our case) 14 blanks.

Any idea or any pointer to an explanation?

Thanks in advance

Matthias
--
http://www.unixarea.de/

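
One way to read the od dump above is that, in the UTF-8 locale, the field width of 16 was applied to characters rather than bytes: one character plus 15 blanks is 16 characters but 17 bytes. The following is a minimal sketch of that arithmetic, not code from the thread. `utf8_chars` is a hypothetical helper, and it assumes its input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the thread): count the characters in a
 * UTF-8 string by skipping continuation bytes, which have the bit
 * pattern 10xxxxxx. Assumes the input is valid UTF-8.
 */
size_t utf8_chars(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;                /* lead byte or plain ASCII byte */
    }
    return n;
}
```

Applied to the 17-byte buffer from the dump (c3 84 followed by 15 blanks), this counts 16 characters, which matches the requested width if the width is taken in characters.
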
From: rebelde on 1 Jun 2010 09:53

Drazen Kacar wrote:
> With your example program on my system (Solaris 10):
>
> {morrigan}~/trash> cc loc.c
> "loc.c", line 7: warning: old-style declaration or incorrect type for: main
> {morrigan}~/trash> LC_ALL=en_US.UTF-8 ./a.out
> strlen of buf with ascii char 16
> [a               ]
> strlen of buf utf char 17
> [Ä               ]
> {morrigan}~/trash> c99 loc.c
> "loc.c", line 7: warning: old-style declaration or incorrect type for: main
> {morrigan}~/trash> LC_ALL=en_US.UTF-8 ./a.out
> strlen of buf with ascii char 16
> [a               ]
> strlen of buf utf char 16
> [Ä              ]
>
>> what does this mean? is this a bug?
>
> Take a look at standards(5) and define your compilation environment to
> better suit your needs. (Invoking c99 is just the simplest way to get
> a standard-conforming environment. It's not necessarily the best for your
> needs.)
>

Hello Drazen,

Thanks for your reply and hints. I had already looked at standards(5), but it was not really clear to me what was meant by 'columns of screen display'; now I understand what the idea is... In our case, the result of sprintf(3C) is to be stored in database columns and needs to be the exact number of bytes, rather than something longer.

We're using a gcc

    $ gcc --version
    gcc (GCC) 3.4.6
    ....

which does not know the -xc99 flag:

    $ gcc -xc99 str.c
    gcc: language c99 not recognized

I will check what the best way to solve this would be...

Thanks again

Matthias
--
Matthias Apitz
t +49-89-61308 351 - f +49-89-61308 399 - m +49-170-4527211
e <guru(a)unixarea.de> - w http://www.unixarea.de/
Solidarity with the zionistic pirates of Israel? Not in my name!
¿Solidaridad con los piratas sionistas de Israel? ¡No en mi nombre!

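
For the database case described above, where the field has to be an exact number of bytes, one option is to avoid the formatter altogether and pad at the byte level. This is only a sketch under that assumption, not something proposed in the thread. `pad_bytes` is a hypothetical name, and the caller must supply a buffer of at least width + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the thread): copy at most `width` bytes
 * of `src` into `dst`, left-justified, pad with blanks to exactly `width`
 * bytes and terminate the result. `dst` must hold width + 1 bytes.
 * It works on bytes only, so it does not consult the locale.
 */
void pad_bytes(char *dst, const char *src, size_t width)
{
    size_t n;

    assert(dst != NULL && src != NULL);
    n = strlen(src);
    if (n > width)
        n = width;                      /* truncate to the field width */
    memcpy(dst, src, n);                /* copy the source bytes */
    memset(dst + n, ' ', width - n);    /* blank-fill the remainder */
    dst[width] = '\0';
}
```

With a `char buf[17]`, calling `pad_bytes(buf, utf, 16)` on the 2-byte string from the test program leaves 2 bytes plus 14 blanks in `buf`. Note that truncation here is also byte-wise, so a source string longer than the field could be cut in the middle of a multibyte character; whether that matters depends on how the database column is read back.
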
From: Paul Floyd on 1 Jun 2010 14:07

On Tue, 01 Jun 2010 15:53:58 +0200, rebelde <guru(a)unixarea.de> wrote:
> number of bytes, rather than something longer.
>
> We're using a gcc

If you want standards, Sun Studio is better.

> $ gcc --version
> gcc (GCC) 3.4.6
> ...

gcc -std=c99 -pedantic is the equivalent.

A bientot
Paul
--
Paul Floyd http://paulf.free.fr

From: rebelde on 2 Jun 2010 04:17

Paul Floyd wrote:
>> $ gcc --version
>> gcc (GCC) 3.4.6
>> ...
>
> gcc -std=c99 -pedantic is the equivalent.
>

But that also gives 17 bytes for %-16.16s in the case of a UTF-8 character:

    $ gcc -std=c99 -pedantic str.c
    str.c:8: warning: return type defaults to `int'
    $ LC_ALL="" ./a.out
    strlen of buf with ascii char 16
    [a               ]
    strlen of buf utf char 16
    [Ä              ]
    $ LC_ALL=de_DE.UTF-8 ./a.out
    strlen of buf with ascii char 16
    [a               ]
    strlen of buf utf char 17
    [Ä               ]
    $

matthias
--
Matthias Apitz
t +49-89-61308 351 - f +49-89-61308 399 - m +49-170-4527211
e <guru(a)unixarea.de> - w http://www.unixarea.de/
Solidarity with the zionistic pirates of Israel? Not in my name!
¿Solidaridad con los piratas sionistas de Israel? ¡No en mi nombre!

From: Paul Floyd on 3 Jun 2010 16:00

On Wed, 02 Jun 2010 10:17:30 +0200, rebelde <guru(a)unixarea.de> wrote:
> Paul Floyd wrote:
>
>>> $ gcc --version
>>> gcc (GCC) 3.4.6
>>> ...
>>
>> gcc -std=c99 -pedantic is the equivalent.
>>
>
> But that also gives 17 bytes for %-16.16s in the case of a UTF-8 character:

OK, so GCC isn't conforming to the standards.

A bientot
Paul
--
Paul Floyd http://paulf.free.fr