UTF8 and std::string [C++]

From: jrm on 9 Jun 2006 19:37

Hi,

Recently I stumbled onto this class:

http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html

The interface looks very similar to std::string but I haven't tried it.

Ravi

Dave wrote:
> A few weeks ago I looked for an implementation of std::string that can
> handle UTF8 strings. I was thinking that the STL iterator abstraction
> would be nice for iterating over a variable length encoded string. So
> far I haven't found anything. Does anybody know of a UTF8 std::string
> implementation?
>
> I'm really curious how the char_traits template was implemented to
> handle variable length character encodings.
>
> Thanks,
> Dave
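
To make the iteration idea concrete, here is a minimal sketch of stepping through a UTF-8 encoded std::string one code point at a time. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx. The helper names (is_continuation, next_code_point, count_code_points) are made up for illustration and do not come from any library, and the code assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns true for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Given the byte offset of the start of a code point, returns the byte
// offset of the start of the next one (or s.size() at the end).
// Assumption: s holds well-formed UTF-8; nothing is validated here.
inline std::size_t next_code_point(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return s.size();
    ++pos;  // step over the lead byte
    while (pos < s.size() &&
           is_continuation(static_cast<unsigned char>(s[pos]))) {
        ++pos;  // skip the continuation bytes of this code point
    }
    return pos;
}

// Counts code points by walking the string with next_code_point.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t pos = 0; pos < s.size(); pos = next_code_point(s, pos)) {
        ++n;
    }
    return n;
}
```

With this, count_code_points("abc") is 3, while a string holding the two bytes 0xC3 0xA9 (U+00E9) counts as a single code point even though size() reports 2. An iterator class could wrap the same stepping logic, but it could only be a forward or bidirectional iterator, not a random-access one.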

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Dave on 9 Jun 2006 19:43

Thanks for all of the helpful replies. I came to the same conclusion
after doing further research since originally posting. It looks like
std::wstring and locale conversions when doing I/O are the way to go.
That approach gives a robust solution that can read standard ASCII,
UTF8, and wide character text files equally.
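
As a rough illustration of converting at the I/O boundary, the sketch below decodes UTF-8 bytes into fixed-width 32-bit code points by hand. It is not the locale or codecvt machinery mentioned above, just the same idea written out. It uses std::u32string and char32_t, which come from C++11 and so are newer than this thread; decode_utf8 is a hypothetical helper name, and the validation is deliberately incomplete (overlong forms and surrogate values are not rejected).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points.
// Throws std::runtime_error on a bad lead byte, a bad continuation byte,
// or a sequence that is cut off at the end of the input.
// Simplification: overlong encodings and surrogate values are accepted.
inline std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);  // append the low six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Plain ASCII input passes through unchanged, since every ASCII byte is its own one-byte UTF-8 sequence, which is why one reader of this kind can handle both ASCII and UTF-8 files.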

I like this group. There are always good answers in here. Thanks again.

From: jrm on 10 Jun 2006 15:37

std::wstring might not be a good idea, according to the Details section
of the Glib::ustring class documentation:

<snip src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details>

In a perfect world the C++ Standard Library would contain a UTF-8
string class. Unfortunately, the C++ standard doesn't mention UTF-8 at
all. Note that std::wstring is not a UTF-8 string class because it
contains only fixed-width characters (where width could be 32, 16, or
even 8 bits).

</snip>

Dave wrote:
> Thanks for all of the helpful replies. I came to the same conclusion
> after doing further research since originally posting. It looks like
> std::wstring and locale conversions when doing I/O are the way to go.
> That approach gives a robust solution that can read standard ASCII,
> UTF8, and wide character text files equally.
>
> I like this group. There are always good answers in here. Thanks again.

From: Wu Yongwei on 10 Jun 2006 15:48

Dave wrote:
> Thanks for all of the helpful replies. I came to the same conclusion
> after doing further research since originally posting. It looks like
> std::wstring and locale conversions when doing I/O are the way to go.
> That approach gives a robust solution that can read standard ASCII,
> UTF8, and wide character text files equally.
>
> I like this group. There are always good answers in here. Thanks again.

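A gotcha under Windows: wchar_t is 2 bytes wide, so a std::wstring
there holds UTF-16 code units, and code points above U+FFFF take two of
them (a surrogate pair). Depending on your application, this may or may
not matter.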
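
To show what the 2-byte limit means in practice, here is a small sketch that encodes a single code point as UTF-16 code units. It uses char16_t and std::u16string (C++11, so newer than this thread) in place of a 16-bit wchar_t, so that it behaves the same on every platform. encode_utf16 is a hypothetical helper name, and the input is assumed to be a valid code point.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Assumption: cp is at most 0x10FFFF and is not itself a surrogate value.
inline std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit is enough.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low
    }
    return out;
}
```

For example, U+20AC fits in one unit, while U+1F600 becomes the pair 0xD83D 0xDE00. Code that indexes a 16-bit wide string as if each element were a whole character will split such pairs.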

ICU is a more robust way to treat Unicode characters, I believe.

Best regards,

Yongwei

From: Jeff Koftinoff on 10 Jun 2006 15:52

Bronek Kozicki wrote:
> Dave wrote:
> > A few weeks ago I looked for an implementation of std::string that can
> > handle UTF8 strings. I was thinking that the STL iterator abstraction
>
> I suggest that for your normal data processing needs you stick with
> fixed-width Unicode encodings, like UTF16 or UTF32 - most std::wstring
> implementations directly support one or another. Use UTF8 only for
> input/output using IO specific for your platform and/or its support functions.
> The reason is simple - efficiency.
>

But in both UTF-16 and UTF-32 a single user-perceived character can
still take more than one code point. See the "Grapheme Boundaries"
section of:
http://www.unicode.org/unicode/uni2book/ch05.pdf

And from:

http://www.unicode.org/reports/tr19/tr19-9.html

| In any event, however, Unicode code points do not necessarily match
| user-expectations for "characters". For example, the following are not
| represented by a single code point: a combining character sequences
| such as <g, acute>; a conjoining jamo sequence; or the Devanagari
| conjunct "ksha". These are better matched by grapheme boundaries, as
| explained in Chapter 5, Implementation Guidelines and in UTR #18:
| Unicode Regular Expression Guidelines.
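
A tiny sketch of the <g, acute> case: even in a fixed-width 32-bit string, the sequence is two elements but one user-perceived character. The counting function below is only a crude approximation that folds code points from the Combining Diacritical Marks block (U+0300 to U+036F) into the preceding base character. It is not the full grapheme boundary algorithm and knows nothing about jamo or Devanagari conjuncts; both helper names are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// True for code points in the Combining Diacritical Marks block.
inline bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Counts code points that are not combining diacritics, so that each
// diacritic is treated as part of the character before it.
// Approximation only: a leading diacritic with no base is not counted.
inline std::size_t count_clusters_approx(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) {
        if (!is_combining_diacritic(cp)) ++n;
    }
    return n;
}
```

For the string U"g\u0301", size() is 2 but count_clusters_approx gives 1. Anything beyond this toy case is better left to a library that implements grapheme segmentation, such as ICU.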

--jeffk++
