From: Peter Olcott on 16 May 2010 08:34

Since the reason for using encodings other than UTF-8 is speed and ease of use, a string that is as fast and easy to use as the strings of other encodings, and that often takes less space, would be superior to those alternatives.

I have derived a design for a utf8string that implements the most useful subset of std::string. I match the std::string interface to keep the learning curve to an absolute minimum.

I just figured out a way to make most utf8string operations take about the same amount of time and space as std::string operations. All of the other utf8string operations take a minimal amount of additional time and space over std::string. These operations involve construction/validation and converting to and from Unicode code points.

  class utf8string {
    unsigned int BytePerCodepoint;
    std::vector<unsigned char> Data;
    std::vector<unsigned int> Index;
  };

I use this regular expression, found at this link:
http://www.w3.org:80/2005/03/23-lex-U

  1   ['\u0000'-'\u007F']
  2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
  3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
  4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
  6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

to build a finite state machine (DFA) recognizer for UTF-8 strings. There is no faster or simpler way to validate and divide a string of bytes into its corresponding Unicode code points than a finite state machine.

Since most (if not all) character sets always have a consistent number of bytes per code point, BytePerCodepoint can be used to get to any specific code point in the UTF-8 encoded data quickly. For ASCII strings this value is one.

In those rare cases where a single utf8string has byte sequences of differing lengths representing its code points, the
  std::vector<unsigned int> Index;
is derived. This std::vector stores the subscript within Data where each code point begins. It is derived once during construction, which is when validation occurs. A flag value of zero assigned to BytePerCodepoint indicates that the Index is needed.

For the ASCII character set, utf8string is just as fast as std::string and uses hardly any more space. For other character sets utf8string is most often just as fast as std::string, and uses a minimal increment of additional space only when needed. Even Chinese most often takes only three bytes per code point.
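[Illustration, not part of the original post: a minimal sketch of how the layout described above could give constant-time access to a code point. The accessor name CodePointAt and the decoding helper are hypothetical, the constructor that performs validation and builds Index is omitted, and the decoder assumes the bytes were already validated.]

    #include <cstdint>
    #include <vector>

    // Sketch of the layout described in the post. BytePerCodepoint == 0 is the
    // flag meaning the string mixes sequence lengths and Index must be used.
    class utf8string {
      unsigned int BytePerCodepoint;      // 1..4, or 0 when lengths vary
      std::vector<unsigned char> Data;    // raw UTF-8 bytes
      std::vector<unsigned int> Index;    // byte offset of each code point,
                                          // populated only when BytePerCodepoint == 0
    public:
      // Return the code point at position N (hypothetical accessor).
      uint32_t CodePointAt(unsigned int N) const {
        unsigned int Offset = (BytePerCodepoint != 0)
                                ? N * BytePerCodepoint   // uniform width: O(1) arithmetic
                                : Index.at(N);           // mixed widths: O(1) table lookup
        return DecodeAt(Offset);
      }

    private:
      // Decode one UTF-8 sequence starting at Offset. Assumes Data was
      // validated during construction, as the post describes.
      uint32_t DecodeAt(unsigned int Offset) const {
        unsigned char Lead = Data.at(Offset);
        if (Lead < 0x80)                                  // 1-byte sequence
          return Lead;
        if (Lead < 0xE0)                                  // 2-byte sequence
          return ((Lead & 0x1Fu) << 6) | (Data.at(Offset + 1) & 0x3Fu);
        if (Lead < 0xF0)                                  // 3-byte sequence
          return ((Lead & 0x0Fu) << 12)
               | ((Data.at(Offset + 1) & 0x3Fu) << 6)
               |  (Data.at(Offset + 2) & 0x3Fu);
        return ((Lead & 0x07u) << 18)                     // 4-byte sequence
             | ((Data.at(Offset + 1) & 0x3Fu) << 12)
             | ((Data.at(Offset + 2) & 0x3Fu) << 6)
             |  (Data.at(Offset + 3) & 0x3Fu);
      }
    };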
From: Leigh Johnston on 16 May 2010 09:21

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message news:hsSdnSJTmcrZe3LWnZ2dnUVZ_q6dnZ2d(a)giganews.com...
> Since the reason for using encodings other than UTF-8 is speed and ease
> of use, a string that is as fast and easy to use as the strings of other
> encodings, and that often takes less space, would be superior to those
> alternatives.
>
> [full design description snipped]

Why do you insist on flogging this dead horse? I suspect most of us are happy storing UTF-8 in an ordinary std::string and converting (to std::wstring for example) as and when required; I certainly am. Your solution has little general utility: working with UTF-16 (std::wstring) can be more efficient than constantly decoding individual code points from UTF-8 as you suggest.

/Leigh
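[Illustration, not part of the original post: Leigh's alternative, keeping UTF-8 in an ordinary std::string and widening only at the point of use, sketched with the C++11 std::wstring_convert facility. That facility arrived after this 2010 thread and was deprecated in C++17, and on platforms where wchar_t is 16 bits std::codecvt_utf8 yields UCS-2, so treat this as illustrative only.]

    #include <codecvt>
    #include <locale>
    #include <string>

    // Keep UTF-8 in a plain std::string; widen only at the point of use.
    std::wstring widen(const std::string& utf8) {
      // Decodes UTF-8 to UCS-4 (or UCS-2 where wchar_t is 16 bits);
      // throws std::range_error on malformed input.
      std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
      return conv.from_bytes(utf8);
    }

    // Narrow back to UTF-8 when storing or transmitting.
    std::string narrow(const std::wstring& wide) {
      std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
      return conv.to_bytes(wide);
    }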
From: Öö Tiib on 16 May 2010 09:51

On 16 mai, 15:34, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
> Since the reason for using encodings other than UTF-8 is speed and ease
> of use, a string that is as fast and easy to use as the strings of other
> encodings, and that often takes less space, would be superior to those
> alternatives.

If you care so much ... perhaps throw together your utf8string and let us see it. Perhaps test & profile it first to compare it with Glib::ustring:
http://library.gnome.org/devel/glibmm/2.23/classGlib_1_1ustring.html

I suspect UTF-8 will gradually fade into history, for reasons similar to why 256-color video modes and raster graphic formats went away. GUIs are already often made with Java or C# (for lack of C++ devs), and these use UTF-16 internally. Notice that modern processor architectures are already optimized in a way that makes byte-level operations often slower.
From: Peter Olcott on 16 May 2010 10:37

On 5/16/2010 8:21 AM, Leigh Johnston wrote:
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:hsSdnSJTmcrZe3LWnZ2dnUVZ_q6dnZ2d(a)giganews.com...
>> [full design description snipped]
>
> Why do you insist on flogging this dead horse?

I just came up with this improved design this morning.

> I suspect most of us are happy storing UTF-8 in an ordinary std::string
> and converting (to std::wstring for example) as and when required; I
> certainly am. Your solution has little general utility: working with
> UTF-16 (std::wstring) can be more efficient than constantly decoding
> individual code points from UTF-8 as you suggest.
>
> /Leigh

Neither std::string nor std::wstring knows anything at all about Unicode. All Unicode-based operations require very substantial manual intervention to work correctly with std::string or std::wstring. utf8string makes all of this transparent to the user.

There are very few instances where a utf8string needs to be converted to individual code points. In almost all cases there is no need for this. If you are mixing character sets with differing byte-length encodings (such as Chinese and English) in the same utf8string, then this would be needed. I can't imagine any other reason to need to translate from UTF-8 to code points.

UTF-8 is the standard Unicode data-interchange format. This aspect is crucial to internet-based applications. Unlike other encodings, UTF-8 works the same way on every machine architecture, requiring no accounting or adaptation for things such as little or big endian.

utf8string handles all of the conversions needed transparently. Most often no conversion is needed. Because of this it is easier to use than the methods that you propose. It always works for any character set with maximum speed and less space.

If the use is focused on Asian character sets, then a UTF-16 string would take less space. If an application must handle every character set, then the space savings for ASCII will likely outweigh the additional space cost relative to UTF-16. The reason for this is that studies have shown that the United States consumes about one half of the world's supply of software. In any case, conversions can be provided between utf8string and utf16string; utf16string would have an identical design.
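[Illustration, not part of the original post: a minimal sketch of the "converting to and from Unicode code points" step the post refers to, here only the encoding direction. The function name is hypothetical and error handling is reduced to a bool. Because the output is a byte sequence, no little/big endian handling is involved, which is the interchange property argued above.]

    #include <cstdint>
    #include <string>

    // Append one code point to a UTF-8 byte string. Returns false for values
    // outside the Unicode range or inside the surrogate block.
    bool AppendUtf8(uint32_t cp, std::string& out) {
      if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
      if (cp <= 0x7F) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
      } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
      } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
      } else {                                // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
      }
      return true;
    }

Decoding is the reverse: read the lead byte to learn the sequence length, then merge the low six bits of each continuation byte.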
From: Peter Olcott on 16 May 2010 10:46
On 5/16/2010 8:51 AM, Öö Tiib wrote:
> On 16 mai, 15:34, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
>> Since the reason for using encodings other than UTF-8 is speed and ease
>> of use, a string that is as fast and easy to use as the strings of
>> other encodings, and that often takes less space, would be superior to
>> those alternatives.
>
> If you care so much ... perhaps throw together your utf8string and let
> us see it. Perhaps test & profile it first to compare it with
> Glib::ustring:
> http://library.gnome.org/devel/glibmm/2.23/classGlib_1_1ustring.html
>
> I suspect UTF-8 will gradually fade into history, for reasons similar
> to why 256-color video modes and raster graphic formats went away.
> GUIs are already often made with Java or C# (for lack of C++ devs),
> and these use UTF-16 internally. Notice that modern processor
> architectures are already optimized in a way that makes byte-level
> operations often slower.

UTF-8 is the best Unicode data-interchange format because it works exactly the same way across every machine architecture without the need for separate adaptations. It also stores the entire ASCII character set in a single byte per code point.

I will put it together because it will become one of my standard tools. The design is now essentially complete. Coding this updated design will go very quickly. I will put it on my website and provide a free license for any use as long as the copyright notice remains in the source code.
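[Illustration, not part of the original post: a small self-contained example of the two claims above, that UTF-8 bytes are architecture-independent while UTF-16 code units involve a byte-order decision, and that ASCII costs one byte per code point in UTF-8.]

    #include <cstdio>
    #include <string>

    int main() {
      // 'A' followed by U+00C9 (É): the UTF-8 bytes 0x41 0xC3 0x89 are the
      // same on every architecture; no byte-order mark or byte swapping applies.
      const std::string utf8 = "A\xC3\x89";

      // The same text as UTF-16 code units 0x0041 0x00C9; how they are laid
      // out in memory or on the wire depends on a byte-order decision.
      const std::u16string utf16 = u"A\u00C9";

      std::printf("UTF-8 bytes: %zu, UTF-16 code units: %zu\n",
                  utf8.size(), utf16.size());   // prints 3 and 2
    }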