From: jrm on 9 Jun 2006 19:37

Hi,

Recently I stumbled onto this class:
http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html

The interface looks very similar to std::string but I haven't tried it.

Ravi

Dave wrote:
> A few weeks ago I looked for an implementation of std::string that can
> handle UTF8 strings. I was thinking that the STL iterator abstraction
> would be nice for iterating over a variable length encoded string. So
> far I haven't found anything. Does anybody know of a UTF8 std::string
> implementation?
>
> I'm really curious how the char_traits template was implemented to
> handle variable length character encodings.
>
> Thanks,
> Dave

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
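To make the "iterating over a variable length encoded string" idea concrete, here is a minimal sketch of stepping through UTF-8 held in a plain std::string, one code point at a time. It is not how Glib::ustring is implemented; the helper names (utf8_sequence_length, utf8_next, utf8_count) are hypothetical, and the code assumes the input is already well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with lead byte `b`.
// Returns 0 for a continuation byte or an invalid lead byte.
inline std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: return the index of the next code point's lead byte.
// Assumes `s` is well-formed UTF-8 and `pos` is on a lead byte; overlong
// forms are not detected here.
inline std::size_t utf8_next(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    std::size_t n = utf8_sequence_length(static_cast<unsigned char>(s[pos]));
    assert(n != 0 && pos + n <= s.size());
    return pos + n;
}

// Count code points by walking the string one sequence at a time.
inline std::size_t utf8_count(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t pos = 0; pos < s.size(); pos = utf8_next(s, pos)) {
        ++count;
    }
    return count;
}
```

Note that size() on the std::string still reports bytes, so any code that needs "the Nth character" has to walk from the start, which is the cost a UTF-8 string class has to hide or document.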
From: Dave on 9 Jun 2006 19:43

Thanks for all of the helpful replies. I came to the same conclusion after doing further research since originally posting. It looks like std::wstring and locale conversions when doing I/O are the way to go. That approach gives a robust solution that can read standard ASCII, UTF8, and wide character text files equally.

I like this group. There's always good answers in here. Thanks again.
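The "convert at the I/O boundary, process fixed-width internally" approach can be sketched as below. This is an illustration under stated assumptions, not the locale/codecvt machinery mentioned in the post: it decodes by hand into std::u32string (C++11 or later) rather than std::wstring, so the result does not depend on the platform's wchar_t width. The function name decode_utf8 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into fixed-width UTF-32 code units.
// Throws std::runtime_error on a bad lead byte, a bad continuation byte,
// or a truncated sequence. Sketch only: overlong forms and encoded
// surrogate code points are not rejected.
inline std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);  // append six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    assert(out.size() <= in.size());  // never more code points than bytes
    return out;
}
```

After decoding, indexing and size() operate on code points; the matching encoder would run once more on output.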
From: jrm on 10 Jun 2006 15:37

std::wstring might not be a good idea, according to the details section of the ustring class documentation:

<snip src=http://www.gtkmm.org/docs/glibmm-2.4/docs/reference/html/classGlib_1_1ustring.html#_details>
In a perfect world the C++ Standard Library would contain a UTF-8 string class. Unfortunately, the C++ standard doesn't mention UTF-8 at all. Note that std::wstring is not a UTF-8 string class because it contains only fixed-width characters (where width could be 32, 16, or even 8 bits).
</snip>

Dave wrote:
> Thanks for all of the helpful replies. I came to the same conclusion
> after doing further research since originally posting. It looks like
> std::wstring and locale conversions when doing I/O are the way to go.
> That approach gives a robust solution that can read standard ASCII,
> UTF8, and wide character text files equally.
>
> I like this group. There's always good answers in here. Thanks again.
From: Wu Yongwei on 10 Jun 2006 15:48

Dave wrote:
> Thanks for all of the helpful replies. I came to the same conclusion
> after doing further research since originally posting. It looks like
> std::wstring and locale conversions when doing I/O are the way to go.
> That approach gives a robust solution that can read standard ASCII,
> UTF8, and wide character text files equally.
>
> I like this group. There's always good answers in here. Thanks again.

A gotcha under Windows: wchar_t is 2 bytes wide. Depending on your application, this might or might not matter. ICU is a more robust way to handle Unicode characters, I believe.

Best regards,

Yongwei
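Why a 2-byte wchar_t matters: with 16-bit code units, any code point above U+FFFF has to be stored as a surrogate pair, so one code point no longer fits in one element. The sketch below uses char16_t (C++11 or later) to stand in for a 16-bit wchar_t; encode_utf16 is a hypothetical helper name, not a standard library or ICU function.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
inline std::u16string encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF);              // must be inside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates are not valid input
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit is enough.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the remaining 20 bits in two halves.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+20AC (the euro sign) yields one unit, while U+1D11E (musical symbol G clef) yields the two units 0xD834 0xDD1E, so length and indexing on a 16-bit wide string count code units rather than code points.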
From: Jeff Koftinoff on 10 Jun 2006 15:52

Bronek Kozicki wrote:
> Dave wrote:
> > A few weeks ago I looked for an implementation of std::string that can
> > handle UTF8 strings. I was thinking that the STL iterator abstraction
>
> I suggest that for your normal data processing needs you stick with
> fixed-width Unicode encodings, like UTF16 or UTF32 - most std::wstring
> implementations directly support one or another. Use UTF8 only for
> input/output using IO specific for your platform and/or its support functions.
> The reason is simple - efficiency.
>

But UTF-16 and UTF-32 are both potentially multi-code-point-per-character encodings...

See the "Grapheme Boundaries" section of:
http://www.unicode.org/unicode/uni2book/ch05.pdf

And from:
http://www.unicode.org/reports/tr19/tr19-9.html

| In any event, however, Unicode code points do not necessarily match user-expectations for
| "characters". For example, the following are not represented by a single code point: a
| combining character sequence such as <g, acute>; a conjoining jamo sequence; or the
| Devanagari conjunct "ksha". These are better matched by grapheme boundaries, as
| explained in Chapter 5, Implementation Guidelines and in UTR #18: Unicode Regular
| Expression Guidelines.

--jeffk++
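
The <g, acute> case can be shown with a few lines: even in UTF-32, where every element is a whole code point, that one user-perceived character occupies two elements (U+0067 followed by U+0301). The sketch below is a deliberate simplification and not the grapheme boundary rules referenced above: it treats only the Combining Diacritical Marks block (U+0300..U+036F) as combining, and both function names are hypothetical. Real code would use a library such as ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Rough check, assumption for illustration only: covers just the
// Combining Diacritical Marks block, not every combining character.
inline bool is_basic_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count "base character plus following marks" groups.
// This is NOT full grapheme cluster segmentation: jamo sequences and
// Devanagari conjuncts, for example, are not handled.
inline std::size_t count_simple_clusters(const std::u32string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // A mark at index 0 has no base, so it starts its own group.
        if (i == 0 || !is_basic_combining_mark(s[i])) {
            ++n;
        }
    }
    return n;
}
```

With the string U"g\u0301", size() reports two code points while the group count is one. The Devanagari "ksha" sequence U+0915 U+094D U+0937 is three code points, and this simplified counter does not merge it, which is the kind of case that calls for proper grapheme boundary support.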