From: german diago on 16 Feb 2010 22:40

Hello. I've been trying out the support for UTF strings in C++0x (from gcc svn). I looked at the current draft, N3000, for the language, and I have a question.

The length() member function says it returns the number of char16_t, char32_t or chars in a string, depending on the basic character type.

But the number of chars that a symbol is encoded in is variable, at least for the UTF-8 encoding (and I believe that's also true for UTF-16). So these functions don't return the real number of symbols in each string, but the number of chars, depending on the size of the char. So to calculate the real number of symbols, you cannot rely on a standard function. I think a standard function to calculate the number of "symbols", not the number of chars of a string, should be included, maybe with another name, since length() should be kept for compatibility.

Or is there one I'm not aware of? Thanks for your time.

-- [ See http://www.gotw.ca/resources/clcm.htm for info about ] [ comp.lang.c++.moderated. First time posters: Do this! ]
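To make the distinction in the question concrete, here is a minimal sketch of counting code points in a UTF-8 string held in a plain std::string. The function name count_code_points_utf8 is hypothetical (it is not part of any standard or draft), and the code assumes the input is well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not a standard function.
// Counts Unicode code points in a UTF-8 encoded string by counting every
// byte that is NOT a continuation byte (continuation bytes look like
// 10xxxxxx). Assumes well-formed UTF-8; malformed input is not detected.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // a lead byte or an ASCII byte starts a new code point
        }
    }
    return count;
}
```

For the string "caf\xC3\xA9" ("café" with a precomposed é), length() reports 5 code units while this helper reports 4 code points.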
From: Joshua Maurice on 17 Feb 2010 08:13

On Feb 17, 7:40 am, german diago <germandi...(a)gmail.com> wrote:
> Hello. I've been trying support for utf strings in c++0x (from gcc
> svn). I looked at the current draft N3000 for the language, and I have
> a question.
>
> The length() member function says it returns the number of char16_t,
> char32_t or chars in a string, depending on the basic character type.
>
> But the number of chars that a symbol is encoded in, at least for
> utf-8 encoding (and I believe it's also true for utf-16) is variable.
> So these functions don't return the real number of symbols in each
> string, but the number of chars, depending on the size of the char.
> So to calculate the real number of symbols, you cannot rely on a
> standard function. I think a standard function to calculate the number
> of "symbols", not the number of chars of a string, should be included,
> maybe with another name, since length should be kept for
> compatibility.
>
> Or is there one I'm not aware of? Thanks for your time.

Then there should also be a function to return the total number of grapheme clusters. Analogously, there probably ought to be iterators for (1) encoding units, (2) symbols, aka Unicode code points, and (3) grapheme clusters, aka what the end user thinks of as a character.

Honestly, I haven't reviewed it yet, but I hold out little hope that we'll actually have this basic functionality, and thus we'll still be stuck with ICU for the foreseeable future.
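As a rough illustration of moving from the first level (encoding units) to the second (code points), the sketch below decodes a UTF-8 std::string into a std::u32string, after which ordinary iteration visits one code point at a time. The name decode_utf8 is hypothetical, and the code assumes well-formed UTF-8: overlong forms, stray continuation bytes and truncated sequences are not diagnosed. The third level, grapheme clusters, needs Unicode property data and is not attempted here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not a standard function.
// Decodes well-formed UTF-8 into a sequence of code points.
// No validation: malformed input yields unspecified code points.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;  // number of bytes in this sequence
        char32_t cp;      // payload bits taken from the lead byte
        if (lead < 0x80) {
            len = 1; cp = lead;            // 0xxxxxxx
        } else if ((lead & 0xE0) == 0xC0) {
            len = 2; cp = lead & 0x1F;     // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            len = 3; cp = lead & 0x0F;     // 1110xxxx
        } else {
            len = 4; cp = lead & 0x07;     // 11110xxx (assumed)
        }
        // Append 6 payload bits from each continuation byte (10xxxxxx).
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, decode_utf8("a\xE2\x82\xAC") yields two code points, U+0061 and U+20AC, from four code units.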
From: Mathias Gaunard on 17 Feb 2010 08:14

On 17 Feb, 15:40, german diago <germandi...(a)gmail.com> wrote:
> The length() member function says it returns the number of char16_t,
> char32_t or chars in a string, depending on the basic character type.

Of course, since u16string is simply basic_string<char16_t>. It's just a means to store Unicode, and no string operation is Unicode-aware.

> I think a standard function to calculate the number
> of "symbols", not the number of chars of a string, should be included,
> maybe with another name, since length should be kept for
> compatibility.

And what purpose would that function serve, alone? Ideally you would need a whole set of Unicode support primitives.

Also, you might be misguided in thinking that a Unicode code point is a "symbol". A grapheme is closer to that idea, and can be made of an arbitrary number of code points (or rather up to 32 if you restrict yourself to stream-safe Unicode strings).
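The gap between code points and graphemes can be seen even in UTF-32, where every code unit is a whole code point. The sketch below is a deliberately naive approximation, not real grapheme segmentation: it only treats the Combining Diacritical Marks block (U+0300 to U+036F) as extending the previous character, which is an assumption made purely for illustration. Both function names are hypothetical. Proper segmentation follows the Unicode text segmentation rules and needs the full character property tables, which is what a library such as ICU provides.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true only for U+0300..U+036F (Combining Diacritical
// Marks). Real grapheme rules cover far more than this one block.
bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: naive approximation of the number of user-perceived
// characters. Every code point that is not a combining diacritic is counted
// as the start of a new one. A string that begins with a combining mark is
// undercounted; that case is ignored here.
std::size_t approx_grapheme_count(const std::u32string& s) {
    std::size_t count = 0;
    for (char32_t cp : s) {
        if (!is_combining_diacritic(cp)) {
            ++count;
        }
    }
    return count;
}
```

For the sequence U+0065 U+0301 ("e" followed by a combining acute accent), length() is 2 but the approximation gives 1, matching what a reader sees as a single "é".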
From: CornedBee on 17 Feb 2010 08:13

On Feb 17, 4:40 pm, german diago <germandi...(a)gmail.com> wrote:
> But the number of chars that a symbol is encoded in, at least for
> utf-8 encoding (and I believe it's also true for utf-16) is variable.
> So these functions don't return the real number of symbols in each
> string, but the number of chars, depending on the size of the char.
> So to calculate the real number of symbols, you cannot rely on a
> standard function. I think a standard function to calculate the number
> of "symbols", not the number of chars of a string, should be included,
> maybe with another name, since length should be kept for
> compatibility.

UTF-16 is also variable-length, yes.

The problem is that providing a function that calculates the number of code points opens a huge can of worms. The moment you do it, people will start asking about the number of graphemes, and about normalization forms, and within minutes you're looking at the job of implementing Unicode collation. C++0x simply doesn't have time for that.
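The variable-length nature of UTF-16 comes from surrogate pairs: code points above U+FFFF take two char16_t units. A minimal sketch of counting code points in a std::u16string follows. The name count_code_points_utf16 is hypothetical, and the code assumes well-formed UTF-16 (every low surrogate is preceded by a high surrogate); unpaired surrogates are not detected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not a standard function.
// Counts code points in well-formed UTF-16 by skipping low (trailing)
// surrogates, 0xDC00..0xDFFF, which only ever complete a pair that was
// started by a high surrogate. Unpaired surrogates are not diagnosed.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        const bool is_low_surrogate = unit >= 0xDC00 && unit <= 0xDFFF;
        if (!is_low_surrogate) {
            ++count;
        }
    }
    return count;
}
```

For U+1D11E (musical G clef), encoded as the pair 0xD834 0xDD1E, length() is 2 while this helper returns 1.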