UTF8 and std::string [C++]

From: Alf P. Steinbach on 11 Jun 2006 17:28

* jrm:
> std::wstring might not be a good idea according to the details section
> here from ustring class:
>
> <snip
> src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details>

I see nothing there that says std::wstring with UTF-16 or UTF-32 would
be a bad choice.

However, if more than 16-bit Unicode (the original Unicode, now the
Basic Multilingual Plane of full Unicode) is required, then on a C++
implementation with 16-bit wchar_t -- such as a Windows C++ compiler
-- a std::wstring has the same potential problem as a std::string has
with UTF-8, that it doesn't support the variable length encoding.
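
To make the surrogate issue concrete, here is a minimal sketch (not from
the original post) of how a code point outside the Basic Multilingual
Plane turns into two 16-bit code units in UTF-16. It uses C++11's
char16_t and std::u16string so that it behaves the same however wide
wchar_t happens to be; encode_utf16 is a hypothetical helper name, and
the sketch assumes its argument is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (not a surrogate, <= U+10FFFF).
std::u16string encode_utf16(char32_t cp)
{
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        // Inside the BMP: one 16-bit code unit is enough.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the 20 remaining bits into a
        // high (lead) surrogate and a low (trail) surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

For U+1D11E this yields the two units 0xD834 0xDD1E, so size() reports 2
for what a reader would call one character. A 16-bit std::wstring that
holds UTF-16 behaves the same way.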

On the third hand, if the platform is exclusively Windows (NT family),
then std::wstring corresponds directly to what's required for system
calls, so that in most cases no conversion is required, either way.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Bronek Kozicki on 11 Jun 2006 17:34

Jeff Koftinoff wrote:
> But UTF-16 and UTF-32 both are potentially multi-code-point per
> character encodings... See the "Grapheme Boundaries" section of:

They are the best one can get now.

B.

From: Bronek Kozicki on 11 Jun 2006 17:33

jrm wrote:
> std::wstring might not be a good idea according to the details section
> here from ustring class:

Why not? std::wstring is typically implemented on top of the Unicode support
of the target platform, and the character type used is typically some
fixed-width Unicode encoding, like UTF16 (on Windows) or UTF32 (on Linux; I do
not know about other flavours of Unix). UTF8 is not a character type (neither
are UTF16 or UTF32, but at least they are fixed width, so they can map to
wchar_t) but a fancy encoding. And the typical place for data encoding is not
in data processing but in input/output. Anything that can be represented in
UTF8 can also be represented in UTF32 and in UTF16 (or almost anything: there
are surrogates to compensate for the shorter characters in UTF16, but I'm not
sure how much value they provide).
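
As an illustration of doing the decoding at the input/output boundary,
the following is a minimal sketch (not from the original post) that
turns UTF-8 bytes into one UTF-32 code point per element. The name
utf8_to_utf32 is hypothetical, std::u32string is used instead of
std::wstring so the element width does not depend on the compiler, and
the sketch assumes well-formed input: it does not reject overlong forms
or stray continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-32, one code point per element.
// Assumption: the input is well-formed UTF-8.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow

        // The lead byte says how long the sequence is.
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }

        assert(i + extra < in.size());  // sequence must not be truncated

        // Each continuation byte contributes its low six bits.
        for (std::size_t j = 1; j <= extra; ++j) {
            const unsigned char c = static_cast<unsigned char>(in[i + j]);
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

After such a conversion, indexing and size() work per code point, which
is the property the post is after. Code that receives data from
untrusted sources would need real validation on top of this.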

B.

From: Pete Becker on 11 Jun 2006 17:40

Wu Yongwei wrote:

>
> A gotcha under Windows: wchar_t is 2 bytes wide.
>

wchar_t is a type defined by the compiler. For some Windows compilers
it's 2 bytes wide, for others it isn't.
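
Since the width is the compiler's choice, code that depends on it can
check at compile time rather than assume. A minimal sketch (not from the
original post; the two constant names are hypothetical):

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t, in bits, for the compiler building this file.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// True when one wchar_t can hold any Unicode code point (up to U+10FFFF).
// With a 16-bit wchar_t this is false, and UTF-16 surrogates come into play.
constexpr bool wchar_holds_any_code_point = (WCHAR_MAX >= 0x10FFFF);
```

A program that requires one wchar_t per code point could then write
static_assert(wchar_holds_any_code_point, "...") and fail to build on
compilers where that does not hold.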

--

Pete Becker
Roundhouse Consulting, Ltd.

From: Pete Becker on 11 Jun 2006 17:41

jrm wrote:

> std::wstring might not be a good idea according to the details section
> here from ustring class:
>
> <snip
> src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details>
>
> In a perfect world the C++ Standard Library would contain a UTF-8
> string class. Unfortunately, the C++ standard doesn't mention UTF-8 at
> all. Note that std::wstring is not a UTF-8 string class because it
> contains only fixed-width characters (where width could be 32, 16, or
> even 8 bits).
>
> </snip>
>

Back in the olden days, the Japanese tried to work with multi-byte
representations of Japanese characters. The result of that experience
was that they insisted that C add wide character support so they
wouldn't have to.

--

Pete Becker
Roundhouse Consulting, Ltd.
