From: Ambarish Sridharanarayanan on 15 Jun 2006 10:55

In article <1150115446.201432.287990@u72g2000cwu.googlegroups.com>, kanze wrote:

> Having said that, I think a lot depends on the application. I
> certainly wouldn't like to have to write an editor using UTF-8,
> for example.

I suspect you'd prefer UTF-32, since presumably characters are fixed-width. Unfortunately the story doesn't end there. Applications like editors would have to think in terms of glyphs, and glyphs can comprise multiple Unicode characters. Indeed, most Indic glyphs are composed of 2 or more characters. Even in Latin-derived languages, you could have the character 'a' followed by the combining diaeresis character, and you might want to treat the 2 characters as a combined element when using the editor.

At the end of the day, it might boil down to a perf trade-off based on your target languages/scripts.
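A minimal sketch of the combining-character point, using C++11's char32_t (which postdates this thread); the strings and output are illustrative:

    #include <iostream>
    #include <string>

    int main() {
        std::u32string precomposed = U"\u00E4";   // the glyph as one code point
        std::u32string decomposed  = U"a\u0308";  // 'a' + combining diaeresis
        // UTF-32 gives fixed-width code points, not fixed-width glyphs:
        std::cout << precomposed.size() << ' '    // prints 1
                  << decomposed.size()  << '\n';  // prints 2
    }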
From: Pete Becker on 15 Jun 2006 10:56

Pete Becker wrote:
> kanze wrote:
>
>> Is that true? I'm not that familiar with the Windows world, but
>> I know that a compiler for a given platform doesn't have
>> unlimited freedom. At the very least, it must be compatible
>> with the system API. (Not according to the standard, of course,
>> but practically, to be usable.) And I was under the impression
>> that the Windows API (unlike Unix) used wchar_t in some places.
>
> The Windows API uses WCHAR, which is a macro or a typedef (haven't
> looked it up recently) for a suitably sized integer type.

I just checked, and it's a typedef for wchar_t. That's a "recent" change (i.e. in the past six or seven years). I'm pretty sure it used to be just some integer type.

--
Pete Becker
Roundhouse Consulting, Ltd.
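A quick compile-time probe of the claim, assuming a Windows toolchain where <windows.h> supplies WCHAR (the check itself is not from the thread):

    #include <windows.h>
    #include <type_traits>

    // Fails to compile on an SDK where WCHAR is anything but wchar_t.
    static_assert(std::is_same<WCHAR, wchar_t>::value,
                  "WCHAR is a typedef for wchar_t on this SDK");

    int main() {}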
From: kanze on 15 Jun 2006 11:03

Pete Becker wrote:
> kanze wrote:
>> Is that true? I'm not that familiar with the Windows world,
>> but I know that a compiler for a given platform doesn't have
>> unlimited freedom. At the very least, it must be compatible
>> with the system API. (Not according to the standard, of
>> course, but practically, to be usable.) And I was under the
>> impression that the Windows API (unlike Unix) used wchar_t
>> in some places.
>
> The Windows API uses WCHAR, which is a macro or a typedef
> (haven't looked it up recently) for a suitably sized integer
> type.

But it surely cannot be an arbitrary type (e.g. long long). All it means, I think, is that Windows provides two parallel APIs, one using char, and one using wchar_t. The macro or typedef is just a technique to avoid having to maintain two distinct, almost identical copies, and to allow the user to choose one API or the other easily from the command line of the compiler (supposing he has written his own code to use this macro as well, of course).

What I was trying to say was that the Windows API accepts wchar_t* strings in some places (or under some conditions). The system API defines these as UTF-16 encoded strings of 16-bit elements. If a compiler decided to implement wchar_t as a 32-bit type (for example), the user couldn't pass wchar_t* strings (including wide character string literals) to the system API. The consequences of this restriction are so bad that no compiler would actually do this.

--
James Kanze                                    GABI Software
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
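A sketch of the mechanism kanze describes, modeled on the real SDK's TCHAR/UNICODE pattern but with stand-in names (the trailing underscores mark them as mock-ups, not the actual Windows API):

    #include <cstdio>

    // Two parallel entry points, one per character type, as in the
    // SDK's FooA/FooW pairs:
    void MessageBoxA_(const char* text)    { std::printf("narrow: %s\n", text); }
    void MessageBoxW_(const wchar_t* text) { std::wprintf(L"wide: %ls\n", text); }

    // One macro selects the whole family from the compiler command line:
    #ifdef UNICODE
    typedef wchar_t TCHAR_;
    #define TEXT_(s)    L##s
    #define MessageBox_ MessageBoxW_
    #else
    typedef char TCHAR_;
    #define TEXT_(s)    s
    #define MessageBox_ MessageBoxA_
    #endif

    int main() {
        MessageBox_(TEXT_("hello"));  // narrow or wide, chosen by -DUNICODE
    }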
From: Eugene Gershnik on 15 Jun 2006 11:00

kanze wrote:
> Eugene Gershnik wrote:
>> Bronek Kozicki wrote:

[...]

>> UTF-8 has special properties that make it very attractive for
>> many applications. In particular it guarantees that no byte of a
>> multi-byte entry corresponds to a standalone single byte. Thus
>> with UTF-8 you can still search for English-only strings (like
>> /, \\ or .) using single-byte algorithms like strchr().
>
> It also means that you cannot use std::find to search for an é.

Neither can you with UTF-32 or anything else, since AFAIK é may be encoded as e followed by the thingy on top or as a single unit é. ;-) In any event my point was that in many contexts (system programming, networking) you almost never look for anything above 0x7F, even though you have to store it. Also note that you *can* use std::find with a filtering iterator (which is easy to write), sacrificing performance.

Then again nobody uses std::find on strings. You either use basic_string::find or strstr() and similar, which both work fine on é in UTF-8 as long as you pass it as a string and not a single char. The only thing that doesn't work well with UTF-8 is access at an arbitrary index, but I doubt any software except maybe document editors really needs to do it.

>> It can also be used (with caution) with std::string, unlike
>> UTF-16 and UTF-32, for which you will have to invent a
>> character type and write traits.
>
> Agreed, but in practice, if you are using UTF-8 in std::string,
> your strings aren't compatible with the third party libraries
> using std::string in their interface.

This depends on the library. If it only looks for characters below 0x7F and passes the rest unmodified, I stay compatible. Most libraries fall in this category. That's why so much Unix code works perfectly in a UTF-8 locale even though it wasn't written with it in mind.

> Arguably, you want a different type, so that the compiler will
> catch errors.

Yes. When I want maximum safety I create struct utf8_char {...}; with the same size and alignment as char. Then I specialize char_traits, delegating to char_traits<char>, and have typedef basic_string<utf8_char> utf8_string (sketched below). This creates a string binary compatible with std::string but with a different type. It gives me type safety, but I am still able to reinterpret_cast pointers and references between std::string and utf8_string if I want to. I know it is undefined behavior, but it works extremely well on all compilers I have to deal with (and I suspect on all compilers in existence).

>> UTF-16 is a good option on platforms that directly support it,
>> like Windows, AIX or Java. UTF-32 is probably not a good
>> option anywhere ;-)
>
> I can't think of any context where UTF-16 would be my choice.

Any code written for NT-based Windows, for example. The system pretty much forces you into it. All the system APIs (not some but *all*) that deal with strings accept UTF-16. None of them accept UTF-8 or UTF-32. There is also no notion of a UTF-8 locale. If you select anything but UTF-16 for your application you will have to convert everywhere.

> It seems to have all of the weaknesses of UTF-8 (e.g.
> multi-byte), plus a few of its own (byte order in external
> files), and no added benefits -- UTF-8 will usually use less
> space. Any time you need true random access to characters,
> however, UTF-32 is the way to go.

Well, as long as you don't need to look up characters *yourself*, but only get results from libraries that already understand UTF-16, the problems above disappear.
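One plausible reconstruction of the utf8_char/utf8_string technique described above. The names come from the post; the traits body is an assumption, delegating each operation to char_traits<char>:

    #include <cstring>
    #include <string>

    struct utf8_char { char v; };  // same size and alignment as char

    namespace std {
    template<> struct char_traits<utf8_char> {
        typedef utf8_char                       char_type;
        typedef char_traits<char>::int_type     int_type;
        typedef char_traits<char>::pos_type     pos_type;
        typedef char_traits<char>::off_type     off_type;
        typedef char_traits<char>::state_type   state_type;

        static void assign(char_type& a, const char_type& b) { a = b; }
        static char_type* assign(char_type* p, size_t n, char_type c)
            { memset(p, c.v, n); return p; }
        static bool eq(char_type a, char_type b) { return a.v == b.v; }
        static bool lt(char_type a, char_type b)
            { return char_traits<char>::lt(a.v, b.v); }
        static int compare(const char_type* a, const char_type* b, size_t n)
            { return memcmp(a, b, n); }
        static size_t length(const char_type* s)
            { return strlen(reinterpret_cast<const char*>(s)); }
        static const char_type* find(const char_type* s, size_t n,
                                     const char_type& c)
            { return static_cast<const char_type*>(memchr(s, c.v, n)); }
        static char_type* move(char_type* d, const char_type* s, size_t n)
            { memmove(d, s, n); return d; }
        static char_type* copy(char_type* d, const char_type* s, size_t n)
            { memcpy(d, s, n); return d; }
        static int_type to_int_type(char_type c)
            { return char_traits<char>::to_int_type(c.v); }
        static char_type to_char_type(int_type i)
            { char_type c = { char_traits<char>::to_char_type(i) }; return c; }
        static bool eq_int_type(int_type a, int_type b) { return a == b; }
        static int_type eof() { return char_traits<char>::eof(); }
        static int_type not_eof(int_type i)
            { return char_traits<char>::not_eof(i); }
    };
    }

    // Binary compatible with std::string, but a distinct type:
    typedef std::basic_string<utf8_char> utf8_string;

    int main() {
        utf8_string s;
        // The reinterpret_cast trick from the post -- formally undefined
        // behavior, relied on here exactly as described:
        std::string& bytes = reinterpret_cast<std::string&>(s);
        (void)bytes;
    }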
Win32, ICU and Rosette all use UTF-16 as their base character type (well, Rosette supports UTF-32 too), so it is easier to just use it everywhere.

On the most fundamental level, to do I18N correctly strings have to be dealt with as indivisible units. When you want to perform some operation you pass the string to a library and get the results back. No hand-written iteration can be expected to deal with pre-composed vs. composite, different canonical forms and all the other garbage Unicode brings us. If a string is an indivisible unit, then it doesn't really matter what this unit is, as long as it is what your libraries expect to see.

--
Eugene
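To illustrate the strings-as-indivisible-units point with ICU (one of the libraries named above; the example itself is an illustration, assuming ICU is installed and linked with -licuuc):

    #include <unicode/unistr.h>  // icu::UnicodeString holds UTF-16 internally
    #include <iostream>
    #include <string>

    int main() {
        // Hand the whole string to the library instead of iterating
        // code units by hand:
        icu::UnicodeString s =
            icu::UnicodeString::fromUTF8("Stra\xC3\x9F" "e");  // "Straße"
        s.foldCase();  // "strasse": a per-unit loop would miss that
                       // one character folds to two
        std::string out;
        s.toUTF8String(out);
        std::cout << out << '\n';
    }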
From: Eugene Gershnik on 15 Jun 2006 16:37
Pete Becker wrote:
>> The Windows API uses WCHAR, which is a macro or a typedef (haven't
>> looked it up recently) for a suitably sized integer type.
>
> I just checked, and it's a typedef for wchar_t. That's a "recent"
> change (i.e. in the past six or seven years). I'm pretty sure it used
> to be just some integer type.

No, what is a recent change is the fact that wchar_t has become a true separate type on the VC compiler. It used to be a typedef to unsigned short. Even today you have a choice of using a compatibility mode in which wchar_t is still unsigned short. The compiler status is:

VC 6 and before: wchar_t is unsigned short.

VC 7 and 7.1: by default wchar_t is unsigned short, but it is possible to get standard behavior by passing a flag to the compiler (-Zc:wchar_t).

VC 8: by default wchar_t is a separate type, but it is possible to revert to the old behavior (-Zc:wchar_t-).

In any event, all of this has nothing to do with the WCHAR/wchar_t representation, which has to be 2 bytes and store UTF-16 in little-endian format.

--
Eugene
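A small probe of the distinction Eugene describes (the overload pair is an illustration, not from the thread). With a genuine wchar_t type (VC 8 default, or -Zc:wchar_t), both overloads coexist and the wide literal picks the second; under -Zc:wchar_t- (or VC 6), wchar_t *is* unsigned short, so the pair fails to compile as a redefinition:

    #include <cstdio>

    void f(unsigned short) { std::puts("unsigned short"); }
    void f(wchar_t)        { std::puts("wchar_t"); }  // collides when wchar_t
                                                      // is only a typedef

    int main() {
        f(L'x');  // calls f(wchar_t) when wchar_t is a distinct type
    }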