From: Éric Malenfant on 15 Jun 2006 16:32

Pete Becker wrote:
> kanze wrote:
>> Is that true? I'm not that familiar with the Windows world, but
>> I know that a compiler for a given platform doesn't have
>> unlimited freedom. At the very least, it must be compatible
>> with the system API. (Not according to the standard, of course,
>> but practically, to be usable.) And I was under the impression
>> that the Windows API (unlike Unix) used wchar_t in some places.
>
> The Windows API uses WCHAR, which is a macro or a typedef (haven't
> looked it up recently) for a suitably sized integer type.

FWIW: In the Windows platform SDK installed with VC6 and VC7.1, WCHAR
is a typedef in the header winnt.h:

    typedef wchar_t WCHAR;
From: Pete Becker on 15 Jun 2006 16:34

kanze wrote:
> What I was trying to say was that the Windows API accepts
> wchar_t* strings in some places (or under some conditions).

Back in the olden days those took pointers to unsigned short, if I
remember correctly. It's only lately that they became wchar_t, imposing
all sorts of heinous constraints on compilers. <g>

--
Pete Becker
Roundhouse Consulting, Ltd.
From: kanze on 16 Jun 2006 05:25

Eugene Gershnik wrote:
> kanze wrote:
>> Eugene Gershnik wrote:
>>> Bronek Kozicki wrote:
>> [...]
>>> UTF-8 has special properties that make it very attractive for
>>> many applications. In particular it guarantees that no byte of
>>> a multi-byte entry corresponds to a standalone single byte. Thus
>>> with UTF-8 you can still search for English-only strings (like
>>> /, \\ or .) using single-byte algorithms like strchr().
>>
>> It also means that you cannot use std::find to search for an 'é'.
>
> Neither can you with UTF-32 or anything else, since AFAIK 'é' may
> be encoded as e followed by the thingy on top, or as a single
> unit 'é'. ;-)

Not according to Unicode, at least not in correctly formed Unicode
sequences. But that's not the point. The point is more the opposite:
simplistic solutions like looking for a single character are just that:
simplistic. The fact that you can find certain characters with such a
single character search in UTF-8 is a marginal advantage, at best.

> In any event my point was that in many contexts (system
> programming, networking) you almost never look for anything
> above 0x7F, even though you have to store it.

For the moment. Although in most of those contexts, you have to deal
with binary data as well, which means that any simple text handling
will fail.

> Also note that you *can* use std::find with a filtering
> iterator (which is easy to write), sacrificing performance.
> Then again, nobody uses std::find on strings. You either use
> basic_string::find or strstr() and similar. Which both work
> fine on 'é' in UTF-8, as long as you pass it as a string and not
> a single char.

Agreed. But then, a lot of other multibyte character sets will work in
that case as well. I'm not saying that UTF-8 doesn't have any
advantages. But the fundamental reason for using it is much simpler:
it's the only game in town. I know of no other internationally
standardized 8 bit code set which covers all languages.
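To make the point we agree on concrete, a minimal sketch (the example
strings are my own, and it assumes both the literal and the data are
UTF-8):

    #include <cassert>
    #include <cstring>
    #include <string>

    int main()
    {
        // "Sémard" in UTF-8: 'é' is the two byte sequence 0xC3 0xA9.
        std::string const s = "S\xC3\xA9mard";

        // Searching for 'é' as a *string* works: no byte of a UTF-8
        // multi-byte sequence can be confused with an ASCII byte.
        assert(s.find("\xC3\xA9") == 1);
        assert(std::strstr(s.c_str(), "\xC3\xA9") == s.c_str() + 1);

        // Single byte searches remain safe for characters below 0x80.
        assert(s.find('m') == 3);

        // But no single char value can represent 'é', so std::find and
        // strchr simply have nothing to look for.
    }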
> The only thing that doesn't work well with UTF-8 is access at
> arbitrary index, but I doubt any software except maybe document
> editors really needs to do it.

I don't know. I know that what I do doesn't need it, but I don't know
too much about what others might be doing.

>>> It can also be used (with caution) with std::string, unlike UTF-16
>>> and UTF-32, for which you will have to invent a character type and
>>> write traits.
>>
>> Agreed, but in practice, if you are using UTF-8 in std::string, your
>> strings aren't compatible with the third party libraries using
>> std::string in their interface.
>
> This depends on the library. If it only looks for characters below
> 0x7F and passes the rest unmodified, I stay compatible. Most
> libraries fall in this category. That's why so much Unix code works
> perfectly in a UTF-8 locale even though it wasn't written with it in
> mind.

Are you kidding? I've not found this to be the case at all. Most Unix
tools are extremely primitive, and line things up in columns based on
byte count (which also imposes fixed width fonts -- rather limiting as
well).

Note that the problem is NOT trivial. If everything were UTF-8, it
would be easy to adopt. But everything isn't UTF-8, and we cannot
change the past. The file systems I work on do have characters like
'é' in them, already encoded in ISO 8859-1. If you create a filename
using UTF-8 in the same directory, ls is going to have one hell of a
time displaying the directory contents correctly. Except that ls
doesn't worry about displaying them correctly. It just spits them out,
and counts on xterm doing the job correctly. And xterm delegates the
job to a font, which has one specific encoding (which both ls and
xterm ignore).

This is one case where Windows has the edge on Unix: Windows imposes a
specific encoding for filenames. IMHO, it would have been better if
they had followed the Plan 9 example and chosen UTF-8, but anything is
better than the Unix solution, where nothing is defined, every
application does whatever it feels like, and filenames with anything
other than basic US ASCII end up causing a lot of problems.

>> Arguably, you want a different type, so that the compiler will
>> catch errors.
>
> Yes. When I want maximum safety I create struct utf8_char {...};
> with the same size and alignment as char. Then I specialize
> char_traits, delegating to char_traits<char>, and have typedef
> basic_string<utf8_char> utf8_string. This creates a string binary
> compatible with std::string but with a different type. It gives me
> type safety, but I am still able to reinterpret_cast pointers and
> references between std::string and utf8_string if I want to. I know
> it is undefined behavior, but it works extremely well on all
> compilers I have to deal with (and I suspect on all compilers in
> existence).

And you doubtlessly have to convert a lot :-). Or do you also create
all of the needed facets in locale? Still, it doesn't work if the code
you're interfacing with is trying to line data up using character
counts and doesn't expect multi-byte characters, or if, like a lot of
software here in Europe, it assumes ISO 8859-1.
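If I've understood the technique correctly, it amounts to something
along these lines (a rough sketch of my own, not your actual code;
every member of the traits just forwards to char_traits<char>):

    #include <cstddef>
    #include <string>

    // A character type distinct from char, with the same size and layout.
    struct utf8_char { char value; };

    namespace std {
        template<> struct char_traits<utf8_char> {
            typedef utf8_char                     char_type;
            typedef char_traits<char>::int_type   int_type;
            typedef char_traits<char>::off_type   off_type;
            typedef char_traits<char>::pos_type   pos_type;
            typedef char_traits<char>::state_type state_type;

            static void assign(char_type& d, const char_type& s) { d = s; }
            static bool eq(const char_type& a, const char_type& b)
                { return a.value == b.value; }
            static bool lt(const char_type& a, const char_type& b)
                { return char_traits<char>::lt(a.value, b.value); }

            // The array operations simply delegate to char_traits<char>.
            static int compare(const char_type* a, const char_type* b, size_t n)
                { return char_traits<char>::compare(c(a), c(b), n); }
            static size_t length(const char_type* s)
                { return char_traits<char>::length(c(s)); }
            static const char_type* find(const char_type* s, size_t n,
                                         const char_type& ch)
                { return u(char_traits<char>::find(c(s), n, ch.value)); }
            static char_type* move(char_type* d, const char_type* s, size_t n)
                { char_traits<char>::move(c(d), c(s), n); return d; }
            static char_type* copy(char_type* d, const char_type* s, size_t n)
                { char_traits<char>::copy(c(d), c(s), n); return d; }
            static char_type* assign(char_type* d, size_t n, char_type ch)
                { char_traits<char>::assign(c(d), n, ch.value); return d; }

            static int_type to_int_type(const char_type& ch)
                { return char_traits<char>::to_int_type(ch.value); }
            static char_type to_char_type(const int_type& i)
                { char_type ch = { char_traits<char>::to_char_type(i) };
                  return ch; }
            static bool eq_int_type(const int_type& a, const int_type& b)
                { return char_traits<char>::eq_int_type(a, b); }
            static int_type eof() { return char_traits<char>::eof(); }
            static int_type not_eof(const int_type& i)
                { return char_traits<char>::not_eof(i); }

        private:
            static const char* c(const char_type* p)
                { return reinterpret_cast<const char*>(p); }
            static char*       c(char_type* p)
                { return reinterpret_cast<char*>(p); }
            static const char_type* u(const char* p)
                { return reinterpret_cast<const char_type*>(p); }
        };
    }

    // Binary compatible with std::string, but a distinct type, so UTF-8
    // and locale-encoded text can't be mixed by accident.
    typedef std::basic_string<utf8_char> utf8_string;

The reinterpret_cast between std::string and utf8_string that you
mention then relies on the two specializations having identical layout,
which, as you say, nothing guarantees, but which implementations do in
practice provide. Stream support would still need the locale facets, of
course.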
>>> UTF-16 is a good option on platforms that directly support it, like
>>> Windows, AIX or Java. UTF-32 is probably not a good option anywhere
>>> ;-)
>>
>> I can't think of any context where UTF-16 would be my choice.
>
> Any code written for NT-based Windows, for example. The system pretty
> much forces you into it.

In the same way that Unix forces you into US ASCII, yes.

> All the system APIs (not some but *all*) that deal with strings
> accept UTF-16. None of them accept UTF-8 or UTF-32. There is also no
> notion of a UTF-8 locale. If you select anything but UTF-16 for your
> application you will have to convert everywhere.

They've got to support UTF-8 somewhere; it's the standard encoding for
all of the Internet protocols. I'd probably treat Windows the same way
I treat Unix: use the straight 8 bit interface, make sure all of the
strings the system is concerned with are pure US ASCII, and do the
rest myself.

>> It seems to have all of the weaknesses of UTF-8 (e.g. multi-byte),
>> plus a few of its own (byte order in external files), and no added
>> benefits -- UTF-8 will usually use less space. Any time you need
>> true random access to characters, however, UTF-32 is the way to go.
>
> Well, as long as you don't need to look up characters *yourself* but
> only get results from libraries that already understand UTF-16, the
> problems above disappear. Win32, ICU and Rosette all use UTF-16 as
> their base character type (well, Rosette supports UTF-32 too), so it
> is easier to just use it everywhere.
>
> On the most fundamental level, to do I18N correctly strings have to
> be dealt with as indivisible units. When you want to perform some
> operation, you pass the string to a library and get the results back.
> No hand-written iteration can be expected to deal with pre-composed
> vs. composite, the different canonical forms and all the other
> garbage Unicode brings us.
>
> If a string is an indivisible unit, then it doesn't really matter
> what this unit is, as long as it is what your libraries expect to
> see.

So we basically agree :-). All that's missing is the libraries. (I
know, some exist. But all too often, you don't have a choice.)

--
James Kanze                                          GABI Software
Conseils en informatique orientée objet/
                    Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
From: Pete Becker on 16 Jun 2006 05:22

Eugene Gershnik wrote:
> Pete Becker wrote:
>> Pete Becker wrote:
>>> The Windows API uses WCHAR, which is a macro or a typedef (haven't
>>> looked it up recently) for a suitably sized integer type.
>>
>> I just checked, and it's a typedef for wchar_t. That's a "recent"
>> change (i.e. in the past six or seven years). I'm pretty sure it
>> used to be just some integer type.
>
> No, what is a recent change is the fact that wchar_t has become a
> true separate type in the VC compiler. It used to be a typedef for
> unsigned short. Even today you have a choice of using a compatibility
> mode in which wchar_t is still unsigned short.

I'm well aware of the history of wchar_t in MS compilers. I was
talking about the definition of WCHAR in the Windows headers, which,
believe it or not, at one time didn't ship with the compiler.

> In any event, all of this has nothing to do with WCHAR/wchar_t
> representations, which have to be 2 bytes and store UTF-16 in
> little-endian format.

WCHAR has to be 2 bytes and store UTF-16 in little-endian format,
because that's the way that the Windows API was designed. More
recently, wchar_t has to do the same, because WCHAR is now defined as
wchar_t. There's no essential connection, just the artificial one that
the Windows headers create.

--
Pete Becker
Roundhouse Consulting, Ltd.
From: Eugene Gershnik on 17 Jun 2006 06:19

Pete Becker wrote:
> I'm well aware of the history of wchar_t in MS compilers. I was
> talking about the definition of WCHAR in the Windows headers, which,
> believe it or not, at one time didn't ship with the compiler.

I do believe it, since I remember it well ;-) In any case WCHAR was
obviously intended to stand for wchar_t. When the compilers didn't
uniformly provide it, unsigned short was used as a substitute. This is
still the case with many older Windows libraries.
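Which is also why, with a native wchar_t, code written against those
older libraries suddenly needs casts. A contrived sketch (the function
name is invented, not from any real header):

    // Old-style interface: WCHAR spelled out as unsigned short, as in
    // libraries built before compilers reliably provided wchar_t.
    void old_api(const unsigned short*) {}

    int main()
    {
        const wchar_t* s = L"hello";

        // old_api(s);   // no longer compiles once wchar_t is a distinct type;
                         // it compiled silently when wchar_t was unsigned short.
        old_api(reinterpret_cast<const unsigned short*>(s));
        // (On Windows both types are 16 bits, so the cast is merely ugly.)
    }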
> WCHAR has to be 2 bytes and store UTF-16 in little-endian format,
> because that's the way that the Windows API was designed. More
> recently, wchar_t has to do the same, because WCHAR is now defined as
> wchar_t. There's no essential connection, just the artificial one
> that the Windows headers create.

Well, you can also argue that the compiler's char doesn't have to be
compatible with Win32's CHAR, or void* with LPVOID. Can a compiler do
something like this? Yes. Would it be usable? Probably not.

--
Eugene