UTF8 and std::string [C++]


From: kanze on 14 Jun 2006 06:31

Pete Becker wrote:
> Wu Yongwei wrote:

> > A gotcha under Windows: wchar_t is 2 bytes wide.

> wchar_t is a type defined by the compiler. For some Windows
> compilers it's 2 bytes wide, for others it isn't.

Is that true? I'm not that familiar with the Windows world, but
I know that a compiler for a given platform doesn't have
unlimited freedom. At the very least, it must be compatible
with the system API. (Not according to the standard, of course,
but practically, to be usable.) And I was under the impression
that the Windows API (unlike Unix) used wchar_t in some places.
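The width in question can be checked directly on any given compiler. The following is only a sketch, assuming C++11 or later; the helper names wchar_bits and wchar_holds_any_code_point are made up for illustration and are not part of any API mentioned in this thread.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t, in bits, for the compiler that builds this file.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when one wchar_t can hold any Unicode code point (up to U+10FFFF),
// which needs at least 21 bits. A 16-bit wchar_t cannot, so text stored
// in it has to use surrogate pairs for code points above U+FFFF.
constexpr bool wchar_holds_any_code_point() {
    return wchar_bits() >= 21;
}
```

Printing wchar_bits() from a small test program shows what a particular compiler and set of options actually uses.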

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Alf P. Steinbach on 14 Jun 2006 18:29

* Alf P. Steinbach:
> * Pete Becker:
>> Wu Yongwei wrote:
>>
>>> A gotcha under Windows: wchar_t is 2 bytes wide.
>> wchar_t is a type defined by the compiler. For some Windows compilers
>> it's 2 bytes wide, for others it isn't.
>
> Is there a C++ compiler for 32-bit Windows where wchar_t isn't 32 bits
> by default?

Sorry, 2*8 = 16 bits, of course?

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?


From: Maxim Yegorushkin on 14 Jun 2006 18:25

Pete Becker wrote:

> ... then UTF8 is also fixed-width, so long as you are sure you will
> never have characters represented by values greater than 0xff.

No greater than 0x7f, because any byte greater than or equal to 0x80 is
part of a multibyte character in UTF-8.
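A small sketch of that rule, assuming the std::string holds well-formed UTF-8. The function names is_ascii_only and count_code_points are hypothetical, chosen only for this illustration.

```cpp
#include <cstddef>
#include <string>

// True if every byte is below 0x80, i.e. the text is plain ASCII and
// UTF-8 really is one byte per character for it.
bool is_ascii_only(const std::string& utf8) {
    for (unsigned char byte : utf8) {
        if (byte >= 0x80) return false;
    }
    return true;
}

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For "caf\xC3\xA9" (the word "café"), size() reports 5 bytes while count_code_points reports 4, so byte offsets and character positions diverge as soon as one byte is 0x80 or above.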


From: Maxim Yegorushkin on 14 Jun 2006 18:27

Eugene Gershnik wrote:

[]

> UTF-8 has special properties that make it very attractive for many
> applications. In particular it guarantees that no byte of multi-byte
> entry corresponds to a standalone single byte. Thus with UTF-8 you can
> still search for English-only strings (like /, \\ or .) using
> single-byte algorithms like strchr().
> It can also be used (with caution) with std::string, unlike UTF-16
> and UTF-32 for which you will have to invent a character type and write
> traits.
> IMO UTF-8 (and UTF-8 locales) is probably the best way to use Unicode
> on Unix. Apparently I am also backed by known experts
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

Another good link about Linux Unicode programming is
http://www-128.ibm.com/developerworks/linux/library/l-linuni.html


From: kanze on 14 Jun 2006 18:31

Eugene Gershnik wrote:
> Bronek Kozicki wrote:

[...]
> UTF-8 has special properties that make it very attractive for
> many applications. In particular it guarantees that no byte of
> multi-byte entry corresponds to a standalone single byte. Thus
> with UTF-8 you can still search for English-only strings (like
> /, \\ or .) using single-byte algorithms like strchr().

It also means that you cannot use std::find to search for an é.
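Both halves of that point can be sketched in a few lines, assuming the string holds well-formed UTF-8. The helper names find_slash, contains_byte and find_encoded are invented for this illustration.

```cpp
#include <algorithm>
#include <string>

// Safe: '/' is ASCII, and no byte of a multi-byte UTF-8 sequence is ever
// equal to an ASCII byte, so a single-byte search cannot hit a false match.
std::string::size_type find_slash(const std::string& utf8) {
    return utf8.find('/');
}

// Not meaningful for non-ASCII characters: U+00E9 is the two bytes
// 0xC3 0xA9 in UTF-8, and the lead byte 0xC3 is shared by every code
// point from U+00C0 to U+00FF, so a one-element std::find proves little.
bool contains_byte(const std::string& utf8, unsigned char byte) {
    return std::find(utf8.begin(), utf8.end(),
                     static_cast<char>(byte)) != utf8.end();
}

// Works: search for the whole encoded sequence as a substring.
std::string::size_type find_encoded(const std::string& utf8,
                                    const std::string& needle) {
    return utf8.find(needle);
}
```

With the needle "\xC3\xA9", find_encoded locates the character as a unit, whereas contains_byte(text, 0xC3) also returns true for a text that contains only "\xC3\xA0" (U+00E0).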

> It can also be used (with caution) with std::string, unlike
> UTF-16 and UTF-32 for which you will have to invent a
> character type and write traits.

Agreed, but in practice, if you are using UTF-8 in std::string,
your strings aren't compatible with the third party libraries
using std::string in their interface. Arguably, you want a
different type, so that the compiler will catch errors.
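One possible shape for such a type is sketched below, assuming C++11 or later. Utf8String is a hypothetical name, not a class from any library discussed here, and this minimal version does not validate its input.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Thin wrapper marking a byte sequence as UTF-8 text. The constructor is
// explicit, so a plain std::string is never converted silently.
class Utf8String {
public:
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}

    // Deliberate, visible access to the raw bytes, for example when
    // calling a library that expects std::string.
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

// The compiler rejects implicit conversion in either direction.
static_assert(!std::is_convertible<std::string, Utf8String>::value,
              "std::string must not convert implicitly to Utf8String");
static_assert(!std::is_convertible<Utf8String, std::string>::value,
              "Utf8String must not convert implicitly to std::string");
```

Passing a Utf8String where a std::string parameter is expected then fails to compile, and the caller has to write .bytes() to cross the boundary on purpose.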

> IMO UTF-8 (and UTF-8 locales) is probably the best way to use
> Unicode on Unix. Apparently I am also backed by known experts
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

That article only really speaks of external representations.
For which there's not really much choice: for better or for
worse, we live in an 8-bit world -- all modern architectures
have 8 bit bytes, all of the Internet protocols are octet
oriented, etc. And the only 8 bit code which can handle all
languages is UTF-8.

Internally, it depends on the application, and what you are
doing with the strings. For many applications, I think that
UTF-8 is a good choice, even for internal use. For others, I'd
go with UTF-32.

> UTF-16 is a good option on platforms that directly support it
> like Windows, AIX or Java. UTF-32 is probably not a good
> option anywhere ;-)

I can't think of any context where UTF-16 would be my choice.
It seems to have all of the weaknesses of UTF-8 (e.g.
multi-byte), plus a few of its own (byte order in external
files), and no added benefits -- UTF-8 will usually use less
space. Any time you need true random access to characters,
however, UTF-32 is the way to go. The one exception might be if
you could be sure of not having to handle surrogates; if
internationalisation were limited to Europe, for example.
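The size and surrogate arguments can be made concrete with a short sketch, assuming C++11 or later for char16_t and char32_t. The names utf8_bytes, utf16_units, SurrogatePair and to_surrogates are illustrative only.

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-8.
std::size_t utf8_bytes(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// 16-bit code units needed to encode one code point in UTF-16.
std::size_t utf16_units(char32_t cp) {
    return cp >= 0x10000 ? 2 : 1;
}

struct SurrogatePair {
    char16_t high;
    char16_t low;
};

// Splits a code point in the range U+10000..U+10FFFF into its UTF-16
// surrogate pair. Precondition: cp lies in that range.
SurrogatePair to_surrogates(char32_t cp) {
    const char32_t offset = cp - 0x10000;
    return {static_cast<char16_t>(0xD800 + (offset >> 10)),
            static_cast<char16_t>(0xDC00 + (offset & 0x3FF))};
}
```

U+00E9 takes 2 bytes in UTF-8 and one 16-bit unit (also 2 bytes) in UTF-16, while ASCII takes 1 byte against 2. A code point such as U+1D11E needs two UTF-16 units (0xD834, 0xDD1E), so indexing by code unit is not indexing by character once surrogates can appear.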

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

