From: Eugene Gershnik on 13 Jun 2006 18:33

Bronek Kozicki wrote:
> jrm wrote:
> > std::wstring might not be a good idea according to the details section
> > here from ustring class:
>
> why not? std::wstring is typically implemented on top of Unicode support of
> target platform, and character type used is typically some fixed-width Unicode
> encoding, like UTF16 (on Windows) or UTF32 (on Linux; I do not know about
> other flavours of Unix).

wchar_t is locale dependent on Solaris. It is UTF-32 for UTF-8 locales and something proprietary on others. This question has been beaten to death in this NG in the past. The simple conclusion is standard C++ wchar_t != Unicode. IIRC P.J. Plauger once explained here why it should be considered a good thing.

> UTF8 is not character type (neither UTF16 or UTF32
> are, but at least they are fixed width, so they can map to wchar_t) but fancy
> encoding.

UTF-16 is *not* fixed width. It is a variable width encoding where a Unicode character can be represented by 1 or 2 16-bit units. At least this was so last time I checked. I wouldn't be surprised if some new Unicode standard broke it further.

UTF-32 is the only fixed length encoding for Unicode available today. Again see caveat above. It is also very wasteful if the bulk of your text processing is ASCII compatible (note that 4 bytes is the *worst* case for UTF-8).

UTF-8 has special properties that make it very attractive for many applications. In particular it guarantees that no byte of a multi-byte sequence can be mistaken for a standalone single-byte character. Thus with UTF-8 you can still search for English-only strings (like /, \\ or .) using single-byte algorithms like strchr(). It can also be used (with caution) with std::string, unlike UTF-16 and UTF-32, for which you will have to invent a character type and write traits.

IMO UTF-8 (and UTF-8 locales) is probably the best way to use Unicode on Unix. Apparently I am also backed by known experts:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux

UTF-16 is a good option on platforms that directly support it like Windows, AIX or Java. UTF-32 is probably not a good option anywhere ;-)

-- 
Eugene

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
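
The strchr() point in the post above can be illustrated with a minimal sketch. It assumes the input is well-formed, NUL-terminated UTF-8 and that the character searched for is 7-bit ASCII; the helper name find_ascii is hypothetical and only std::strchr comes from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: byte offset of the first occurrence of an ASCII
// character in a NUL-terminated UTF-8 string, or std::string::npos.
// This works because every byte of a multi-byte UTF-8 sequence has its
// high bit set, so it can never compare equal to a 7-bit ASCII byte.
std::size_t find_ascii(const char* utf8, char ascii_char) {
    // Precondition (assumption of this sketch): the needle is 7-bit ASCII.
    assert(static_cast<unsigned char>(ascii_char) < 0x80);
    const char* hit = std::strchr(utf8, ascii_char);
    return hit ? static_cast<std::size_t>(hit - utf8) : std::string::npos;
}
```

The returned value is a byte offset, not a character count, which is part of the "with caution" caveat when UTF-8 is stored in std::string.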
From: Alf P. Steinbach on 14 Jun 2006 06:18

* Pete Becker:
> Wu Yongwei wrote:
>
>> A gotcha under Windows: wchar_t is 2 bytes wide.
>
> wchar_t is a type defined by the compiler. For some Windows compilers
> it's 2 bytes wide, for others it isn't.

Is there a C++ compiler for 32-bit Windows where wchar_t isn't 16 bits by default?

If such a compiler exists it would be unable to compile existing source code based on the identity assumption C++ wchar_t === Windows WCHAR.

-- 
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
From: Jeff Koftinoff on 14 Jun 2006 06:20

Bronek Kozicki wrote:
> Jeff Koftinoff wrote:
> > But UTF-16 and UTF-32 both are potentially multi-code-point per
> > character encodings... See the "Grapheme Boundaries" section of:
>
> they are best one can get now.
>
>
> B.
>

Right, but a 'best' solution would be to use a string class that can iterate multi-byte characters.

--jeffk++
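
As a rough sketch of what iterating multi-byte characters could involve, the function below walks a UTF-8 std::string one code point at a time. It assumes C++11 (for char32_t) and well-formed input, which it only checks with assert; the name decode_utf8 is hypothetical. It stops at code points and does not group them into grapheme clusters, so the multi-code-point issue raised in the quoted text is left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 string into a sequence of code points.
// Assumes well-formed input; malformed input trips an assert in this sketch.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;   // total bytes in this sequence
        char32_t cp = lead;    // payload bits gathered so far
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        else { assert(lead < 0x80); }  // 0x80..0xBF would be a stray continuation byte
        assert(i + len <= s.size());   // sequence must not be truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char next = static_cast<unsigned char>(s[i + k]);
            assert((next & 0xC0) == 0x80);  // continuation bytes look like 10xxxxxx
            cp = (cp << 6) | (next & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A real string class would more likely expose an iterator than return a vector; the vector only keeps the sketch short.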
From: Pete Becker on 14 Jun 2006 06:28

Bronek Kozicki wrote:
>
> why not? std::wstring is typically implemented on top of Unicode support of
> target platform, and character type used is typically some fixed-width Unicode
> encoding, like UTF16 (on Windows) or UTF32 (on Linux; I do not know about
> other flavours of Unix).

UTF16 is not fixed-width, unless you are sure you will never have characters represented by surrogate pairs. But if you're willing to do that, then UTF8 is also fixed-width, so long as you are sure you will never have characters represented by values greater than 0x7f. It's just a question of how much stuff you're willing to ignore in order to claim that a representation is fixed width.

-- 
Pete Becker
Roundhouse Consulting, Ltd.
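
The surrogate-pair point can be made concrete with a short sketch of UTF-16 encoding for a single code point. It assumes C++11 (char16_t, char32_t, std::u16string), and the name encode_utf16 is hypothetical. Code points below 0x10000 take one 16-bit unit; everything above takes two.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF);              // upper limit of the Unicode code space
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Code that treats each 16-bit unit as one character is correct only while the second branch is never taken, which is the assumption described in the post above.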
From: kanze on 14 Jun 2006 06:30

Pete Becker wrote:
> jrm wrote:
> > std::wstring might not be a good idea according to the details section
> > here from ustring class:
> > <snip
> > src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details>
> > In a perfect world the C++ Standard Library would contain a
> > UTF-8 string class. Unfortunately, the C++ standard doesn't
> > mention UTF-8 at all. Note that std::wstring is not a UTF-8
> > string class because it contains only fixed-width characters
> > (where width could be 32, 16, or even 8 bits).
> > </snip>

> Back in the olden days, the Japanese tried to work with
> multi-byte representations of Japanese characters. The result
> of that experience was that they insisted that C add wide
> character support so they wouldn't have to.

Times change. UTF-8 was designed with some of the problems encountered in the Japanese encodings in mind.

Having said that, I think a lot depends on the application. I certainly wouldn't like to have to write an editor using UTF-8, for example. But for a lot of applications (including things like compilers and interpreters), text handling is limited to reading input sequentially, cutting it up into tokens, then only comparing the tokens or pasting them together for output text. As long as you're only accessing any string object sequentially (which is the case in such applications), UTF-8 can be made to work quite well.

-- 
James Kanze
GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
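
The sequential style of processing described above might look like the following sketch, which cuts UTF-8 text into tokens at an ASCII delimiter without decoding anything. It assumes well-formed UTF-8 and a 7-bit ASCII delimiter; the name split_utf8 is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split UTF-8 text at an ASCII delimiter in one
// sequential pass. Multi-byte sequences are copied through untouched,
// because none of their bytes can equal a 7-bit ASCII delimiter.
std::vector<std::string> split_utf8(const std::string& text, char ascii_delim) {
    // Precondition (assumption of this sketch): the delimiter is 7-bit ASCII.
    assert(static_cast<unsigned char>(ascii_delim) < 0x80);
    std::vector<std::string> tokens;
    std::string current;
    for (char byte : text) {
        if (byte == ascii_delim) {
            // End of a token; runs of delimiters produce no empty tokens.
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(byte);
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

The resulting tokens are still valid UTF-8, so comparing them with operator== or concatenating them for output needs no knowledge of the encoding. Anything that requires random access by character, as in the editor example, is outside what this sketch covers.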