From: Ambarish Sridharanarayanan on 15 Jun 2006 10:55

In article <1150115446.201432.287990@u72g2000cwu.googlegroups.com>, kanze wrote:

> Having said that, I think a lot depends on the application. I
> certainly wouldn't like to have to write an editor using UTF-8,
> for example.

I suspect you'd prefer UTF-32, since presumably characters are fixed-width. Unfortunately the story doesn't end there. Applications like editors would have to think in terms of glyphs, and glyphs can comprise multiple Unicode characters. Indeed, most Indic glyphs are composed of 2 or more characters. Even in Latin-derived languages, you could have the character 'a' followed by the combining diaeresis character, and you might want to treat the 2 characters as a combined element when using the editor.

At the end of the day, it might boil down to a perf trade-off based on your target languages/scripts.
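A minimal sketch of the combining-character point, using C++11's char32_t (which postdates this thread); the strings and output are illustrative:

    #include <iostream>
    #include <string>

    int main() {
        std::u32string precomposed = U"\u00E4";   // the glyph as one code point
        std::u32string decomposed  = U"a\u0308";  // 'a' + combining diaeresis
        // UTF-32 gives fixed-width code points, not fixed-width glyphs:
        std::cout << precomposed.size() << ' '    // prints 1
                  << decomposed.size()  << '\n';  // prints 2
    }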
From: Pete Becker on 15 Jun 2006 10:56

Pete Becker wrote:
> kanze wrote:
>
>> Is that true? I'm not that familiar with the Windows world, but
>> I know that a compiler for a given platform doesn't have
>> unlimited freedom. At the very least, it must be compatible
>> with the system API. (Not according to the standard, of course,
>> but practically, to be usable.) And I was under the impression
>> that the Windows API (unlike Unix) used wchar_t in some places.
>
> The Windows API uses WCHAR, which is a macro or a typedef (haven't
> looked it up recently) for a suitably sized integer type.

I just checked, and it's a typedef for wchar_t. That's a "recent" change (i.e. in the past six or seven years). I'm pretty sure it used to be just some integer type.

--
Pete Becker
Roundhouse Consulting, Ltd.
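A quick compile-time probe of the claim, assuming a Windows toolchain where <windows.h> supplies WCHAR (the check itself is not from the thread):

    #include <windows.h>
    #include <type_traits>

    // Fails to compile on an SDK where WCHAR is anything but wchar_t.
    static_assert(std::is_same<WCHAR, wchar_t>::value,
                  "WCHAR is a typedef for wchar_t on this SDK");

    int main() {}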
From: kanze on 15 Jun 2006 11:03

Pete Becker wrote:
> kanze wrote:
>> Is that true? I'm not that familiar with the Windows world,
>> but I know that a compiler for a given platform doesn't have
>> unlimited freedom. At the very least, it must be compatible
>> with the system API. (Not according to the standard, of
>> course, but practically, to be usable.) And I was under the
>> impression that the Windows API (unlike Unix) used wchar_t
>> in some places.
>
> The Windows API uses WCHAR, which is a macro or a typedef
> (haven't looked it up recently) for a suitably sized integer
> type.

But it surely cannot be an arbitrary type (e.g. long long). All it means, I think, is that Windows provides two parallel APIs, one using char, and one using wchar_t. The macro or typedef is just a technique to avoid having to maintain two distinct, almost identical copies, and to allow the user to choose one API or the other easily from the command line of the compiler (supposing he has written his own code to use this macro as well, of course).

What I was trying to say was that the Windows API accepts wchar_t* strings in some places (or under some conditions). The system API defines these as UTF-16 encoded strings of 16-bit elements. If a compiler decided to implement wchar_t as a 32-bit type (for example), the user couldn't pass wchar_t* strings (including wide character string literals) to the system API. The consequences of this restriction are so bad that no compiler would actually do this.

--
James Kanze                                    GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
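A sketch of the mechanism kanze describes, modeled on the real SDK's TCHAR/UNICODE pattern but with stand-in names (the trailing underscores mark them as mock-ups, not the actual Windows API):

    #include <cstdio>

    // Two parallel entry points, one per character type, as in the
    // SDK's FooA/FooW pairs:
    void MessageBoxA_(const char* text)    { std::printf("narrow: %s\n", text); }
    void MessageBoxW_(const wchar_t* text) { std::wprintf(L"wide: %ls\n", text); }

    // One macro selects the whole family from the compiler command line:
    #ifdef UNICODE
    typedef wchar_t TCHAR_;
    #define TEXT_(s)    L##s
    #define MessageBox_ MessageBoxW_
    #else
    typedef char TCHAR_;
    #define TEXT_(s)    s
    #define MessageBox_ MessageBoxA_
    #endif

    int main() {
        MessageBox_(TEXT_("hello"));  // narrow or wide, chosen by -DUNICODE
    }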
From: Eugene Gershnik on 15 Jun 2006 11:00

kanze wrote:
> Eugene Gershnik wrote:
>> Bronek Kozicki wrote:

[...]

>> UTF-8 has special properties that make it very attractive for
>> many applications. In particular it guarantees that no byte of a
>> multi-byte entry corresponds to a standalone single byte. Thus
>> with UTF-8 you can still search for English-only strings (like
>> /, \\ or .) using single-byte algorithms like strchr().
>
> It also means that you cannot use std::find to search for an é.

Neither can you with UTF-32 or anything else, since AFAIK é may be encoded as e followed by the thingy on top or as a single unit é. ;-) In any event my point was that in many contexts (system programming, networking) you almost never look for anything above 0x7F, even though you have to store it. Also note that you *can* use std::find with a filtering iterator (which is easy to write), sacrificing performance.

Then again nobody uses std::find on strings. You either use basic_string::find or strstr() and similar, which both work fine on é in UTF-8 as long as you pass it as a string and not a single char. The only thing that doesn't work well with UTF-8 is access at an arbitrary index, but I doubt any software except maybe document editors really needs to do it.

>> It can also be used (with caution) with std::string, unlike
>> UTF-16 and UTF-32, for which you will have to invent a
>> character type and write traits.
>
> Agreed, but in practice, if you are using UTF-8 in std::string,
> your strings aren't compatible with the third party libraries
> using std::string in their interface.

This depends on the library. If it only looks for characters below 0x7F and passes the rest unmodified, I stay compatible. Most libraries fall in this category. That's why so much Unix code works perfectly in a UTF-8 locale even though it wasn't written with it in mind.

> Arguably, you want a different type, so that the compiler will
> catch errors.

Yes. When I want maximum safety I create struct utf8_char {...}; with the same size and alignment as char. Then I specialize char_traits, delegating to char_traits<char>, and have typedef basic_string<utf8_char> utf8_string (sketched below). This creates a string binary compatible with std::string but with a different type. It gives me type safety, but I am still able to reinterpret_cast pointers and references between std::string and utf8_string if I want to. I know it is undefined behavior, but it works extremely well on all compilers I have to deal with (and I suspect on all compilers in existence).

>> UTF-16 is a good option on platforms that directly support it,
>> like Windows, AIX or Java. UTF-32 is probably not a good
>> option anywhere ;-)
>
> I can't think of any context where UTF-16 would be my choice.

Any code written for NT-based Windows, for example. The system pretty much forces you into it. All the system APIs (not some but *all*) that deal with strings accept UTF-16. None of them accept UTF-8 or UTF-32. There is also no notion of a UTF-8 locale. If you select anything but UTF-16 for your application you will have to convert everywhere.

> It seems to have all of the weaknesses of UTF-8 (e.g.
> multi-byte), plus a few of its own (byte order in external
> files), and no added benefits -- UTF-8 will usually use less
> space. Any time you need true random access to characters,
> however, UTF-32 is the way to go.

Well, as long as you don't need to look up characters *yourself*, but only get results from libraries that already understand UTF-16, the problems above disappear.
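One plausible reconstruction of the utf8_char/utf8_string technique described above. The names come from the post; the traits body is an assumption, delegating each operation to char_traits<char>:

    #include <cstring>
    #include <string>

    struct utf8_char { char v; };  // same size and alignment as char

    namespace std {
    template<> struct char_traits<utf8_char> {
        typedef utf8_char                       char_type;
        typedef char_traits<char>::int_type     int_type;
        typedef char_traits<char>::pos_type     pos_type;
        typedef char_traits<char>::off_type     off_type;
        typedef char_traits<char>::state_type   state_type;

        static void assign(char_type& a, const char_type& b) { a = b; }
        static char_type* assign(char_type* p, size_t n, char_type c)
            { memset(p, c.v, n); return p; }
        static bool eq(char_type a, char_type b) { return a.v == b.v; }
        static bool lt(char_type a, char_type b)
            { return char_traits<char>::lt(a.v, b.v); }
        static int compare(const char_type* a, const char_type* b, size_t n)
            { return memcmp(a, b, n); }
        static size_t length(const char_type* s)
            { return strlen(reinterpret_cast<const char*>(s)); }
        static const char_type* find(const char_type* s, size_t n,
                                     const char_type& c)
            { return static_cast<const char_type*>(memchr(s, c.v, n)); }
        static char_type* move(char_type* d, const char_type* s, size_t n)
            { memmove(d, s, n); return d; }
        static char_type* copy(char_type* d, const char_type* s, size_t n)
            { memcpy(d, s, n); return d; }
        static int_type to_int_type(char_type c)
            { return char_traits<char>::to_int_type(c.v); }
        static char_type to_char_type(int_type i)
            { char_type c = { char_traits<char>::to_char_type(i) }; return c; }
        static bool eq_int_type(int_type a, int_type b) { return a == b; }
        static int_type eof() { return char_traits<char>::eof(); }
        static int_type not_eof(int_type i)
            { return char_traits<char>::not_eof(i); }
    };
    }

    // Binary compatible with std::string, but a distinct type:
    typedef std::basic_string<utf8_char> utf8_string;

    int main() {
        utf8_string s;
        // The reinterpret_cast trick from the post -- formally undefined
        // behavior, relied on here exactly as described:
        std::string& bytes = reinterpret_cast<std::string&>(s);
        (void)bytes;
    }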
Win32, ICU and Rosette all use UTF-16 as their base character type (well, Rosette supports UTF-32 too), so it is easier to just use it everywhere.

On the most fundamental level, to do I18N correctly strings have to be dealt with as indivisible units. When you want to perform some operation you pass the string to a library and get the results back. No hand-written iteration can be expected to deal with pre-composed vs. composite, different canonical forms and all the other garbage Unicode brings us. If a string is an indivisible unit, then it doesn't really matter what this unit is, as long as it is what your libraries expect to see.

--
Eugene
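To illustrate the strings-as-indivisible-units point with ICU (one of the libraries named above; the example itself is an illustration, assuming ICU is installed and linked with -licuuc):

    #include <unicode/unistr.h>  // icu::UnicodeString holds UTF-16 internally
    #include <iostream>
    #include <string>

    int main() {
        // Hand the whole string to the library instead of iterating
        // code units by hand:
        icu::UnicodeString s =
            icu::UnicodeString::fromUTF8("Stra\xC3\x9F" "e");  // "Straße"
        s.foldCase();  // "strasse": a per-unit loop would miss that
                       // one character folds to two
        std::string out;
        s.toUTF8String(out);
        std::cout << out << '\n';
    }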
From: Eugene Gershnik on 15 Jun 2006 16:37
Pete Becker wrote:
>> The Windows API uses WCHAR, which is a macro or a typedef (haven't
>> looked it up recently) for a suitably sized integer type.
>
> I just checked, and it's a typedef for wchar_t. That's a "recent"
> change (i.e. in the past six or seven years). I'm pretty sure it used
> to be just some integer type.

No, what is a recent change is the fact that wchar_t has become a true separate type on the VC compiler. It used to be a typedef to unsigned short. Even today you have a choice of using a compatibility mode in which wchar_t is still unsigned short. The compiler status is:

VC 6 and before: wchar_t is unsigned short.

VC 7 and 7.1: by default wchar_t is unsigned short, but it is possible to get standard behavior by passing a flag to the compiler (-Zc:wchar_t).

VC 8: by default wchar_t is a separate type, but it is possible to revert to the old behavior (-Zc:wchar_t-).

In any event, all of this has nothing to do with the WCHAR/wchar_t representation, which has to be 2 bytes and store UTF-16 in little-endian format.

--
Eugene
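A small probe of the distinction Eugene describes (the overload pair is an illustration, not from the thread). With a genuine wchar_t type (VC 8 default, or -Zc:wchar_t), both overloads coexist and the wide literal picks the second; under -Zc:wchar_t- (or VC 6), wchar_t *is* unsigned short, so the pair fails to compile as a redefinition:

    #include <cstdio>

    void f(unsigned short) { std::puts("unsigned short"); }
    void f(wchar_t)        { std::puts("wchar_t"); }  // collides when wchar_t
                                                      // is only a typedef

    int main() {
        f(L'x');  // calls f(wchar_t) when wchar_t is a distinct type
    }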