From: Peter Olcott on 16 May 2010 08:34

Since the reason for using encodings other than UTF-8 is speed and ease of use, a string that is as fast and easy to use as the strings of other encodings, and that often takes less space, would be superior to those alternatives.

I have derived a design for a utf8string that implements the most useful subset of std::string. I match the std::string interface to keep the learning curve to an absolute minimum.

I just figured out a way to make most utf8string operations take about the same amount of time and space as std::string operations. All of the other utf8string operations take a minimal amount of additional time and space over std::string. These operations involve construction/validation and converting to and from Unicode code points.

  class utf8string {
    unsigned int BytePerCodepoint;
    std::vector<unsigned char> Data;
    std::vector<unsigned int> Index;
  };

I use this regular expression, found at this link:
http://www.w3.org:80/2005/03/23-lex-U

  1   ['\u0000'-'\u007F']
  2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
  3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
  4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
  6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
  9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

to build a finite state machine (DFA) recognizer for UTF-8 strings. There is no faster or simpler way to validate and divide a string of bytes into its corresponding Unicode code points than a finite state machine.

Since most (if not all) character sets always have a consistent number of bytes per code point, BytePerCodepoint can be used to get to any specific code point in the UTF-8 encoded data quickly. For ASCII strings this value is one.

In those rare cases where a single utf8string has byte sequences of differing lengths representing its code points, the
  std::vector<unsigned int> Index;
is derived. This std::vector stores the subscript within Data where each code point begins. It is derived once during construction, which is when validation occurs. A flag value of zero assigned to BytePerCodepoint indicates that the Index is needed.

For the ASCII character set, utf8string is just as fast as std::string and uses hardly any more space. For other character sets utf8string is most often just as fast as std::string, and uses a minimal increment of additional space only when needed. Even Chinese most often takes only three bytes per code point.
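[Illustration, not part of the original post: a minimal sketch of how the layout described above could give constant-time access to a code point. The accessor name CodePointAt and the decoding helper are hypothetical, the constructor that performs validation and builds Index is omitted, and the decoder assumes the bytes were already validated.]

    #include <cstdint>
    #include <vector>

    // Sketch of the layout described in the post. BytePerCodepoint == 0 is the
    // flag meaning the string mixes sequence lengths and Index must be used.
    class utf8string {
      unsigned int BytePerCodepoint;      // 1..4, or 0 when lengths vary
      std::vector<unsigned char> Data;    // raw UTF-8 bytes
      std::vector<unsigned int> Index;    // byte offset of each code point,
                                          // populated only when BytePerCodepoint == 0
    public:
      // Return the code point at position N (hypothetical accessor).
      uint32_t CodePointAt(unsigned int N) const {
        unsigned int Offset = (BytePerCodepoint != 0)
                                ? N * BytePerCodepoint   // uniform width: O(1) arithmetic
                                : Index.at(N);           // mixed widths: O(1) table lookup
        return DecodeAt(Offset);
      }

    private:
      // Decode one UTF-8 sequence starting at Offset. Assumes Data was
      // validated during construction, as the post describes.
      uint32_t DecodeAt(unsigned int Offset) const {
        unsigned char Lead = Data.at(Offset);
        if (Lead < 0x80)                                  // 1-byte sequence
          return Lead;
        if (Lead < 0xE0)                                  // 2-byte sequence
          return ((Lead & 0x1Fu) << 6) | (Data.at(Offset + 1) & 0x3Fu);
        if (Lead < 0xF0)                                  // 3-byte sequence
          return ((Lead & 0x0Fu) << 12)
               | ((Data.at(Offset + 1) & 0x3Fu) << 6)
               |  (Data.at(Offset + 2) & 0x3Fu);
        return ((Lead & 0x07u) << 18)                     // 4-byte sequence
             | ((Data.at(Offset + 1) & 0x3Fu) << 12)
             | ((Data.at(Offset + 2) & 0x3Fu) << 6)
             |  (Data.at(Offset + 3) & 0x3Fu);
      }
    };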
From: Leigh Johnston on 16 May 2010 09:21

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message news:hsSdnSJTmcrZe3LWnZ2dnUVZ_q6dnZ2d(a)giganews.com...
> Since the reason for using encodings other than UTF-8 is speed and ease
> of use, a string that is as fast and easy to use as the strings of other
> encodings, and that often takes less space, would be superior to those
> alternatives.
>
> [full design description snipped]

Why do you insist on flogging this dead horse? I suspect most of us are happy storing UTF-8 in an ordinary std::string and converting (to std::wstring for example) as and when required; I certainly am. Your solution has little general utility: working with UTF-16 (std::wstring) can be more efficient than constantly decoding individual code points from UTF-8 as you suggest.

/Leigh
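[Illustration, not part of the original post: Leigh's alternative, keeping UTF-8 in an ordinary std::string and widening only at the point of use, sketched with the C++11 std::wstring_convert facility. That facility arrived after this 2010 thread and was deprecated in C++17, and on platforms where wchar_t is 16 bits std::codecvt_utf8 yields UCS-2, so treat this as illustrative only.]

    #include <codecvt>
    #include <locale>
    #include <string>

    // Keep UTF-8 in a plain std::string; widen only at the point of use.
    std::wstring widen(const std::string& utf8) {
      // Decodes UTF-8 to UCS-4 (or UCS-2 where wchar_t is 16 bits);
      // throws std::range_error on malformed input.
      std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
      return conv.from_bytes(utf8);
    }

    // Narrow back to UTF-8 when storing or transmitting.
    std::string narrow(const std::wstring& wide) {
      std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
      return conv.to_bytes(wide);
    }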
From: Öö Tiib on 16 May 2010 09:51

On 16 mai, 15:34, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
> Since the reason for using encodings other than UTF-8 is speed and ease
> of use, a string that is as fast and easy to use as the strings of other
> encodings, and that often takes less space, would be superior to those
> alternatives.

If you care so much ... perhaps throw together your utf8string and let us see it. Perhaps test & profile it first to compare it with Glib::ustring:
http://library.gnome.org/devel/glibmm/2.23/classGlib_1_1ustring.html

I suspect UTF-8 will gradually fade into history, for reasons similar to why 256-color video modes and raster graphic formats went away. GUIs are already often made with Java or C# (for lack of C++ devs), and these use UTF-16 internally. Notice that modern processor architectures are already optimized in a way that makes byte-level operations often slower.
From: Peter Olcott on 16 May 2010 10:37

On 5/16/2010 8:21 AM, Leigh Johnston wrote:
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:hsSdnSJTmcrZe3LWnZ2dnUVZ_q6dnZ2d(a)giganews.com...
>> [full design description snipped]
>
> Why do you insist on flogging this dead horse?

I just came up with this improved design this morning.

> I suspect most of us are happy storing UTF-8 in an ordinary std::string
> and converting (to std::wstring for example) as and when required; I
> certainly am. Your solution has little general utility: working with
> UTF-16 (std::wstring) can be more efficient than constantly decoding
> individual code points from UTF-8 as you suggest.
>
> /Leigh

Neither std::string nor std::wstring knows anything at all about Unicode. All Unicode-based operations require very substantial manual intervention to work correctly with std::string or std::wstring. utf8string makes all of this transparent to the user.

There are very few instances where a utf8string needs to be converted to individual code points. In almost all cases there is no need for this. If you are mixing character sets with differing byte-length encodings (such as Chinese and English) in the same utf8string, then this would be needed. I can't imagine any other reason to need to translate from UTF-8 to code points.

UTF-8 is the standard Unicode data-interchange format. This aspect is crucial to internet-based applications. Unlike other encodings, UTF-8 works the same way on every machine architecture, requiring no accounting or adaptation for things such as little or big endian.

utf8string handles all of the conversions needed transparently. Most often no conversion is needed. Because of this it is easier to use than the methods that you propose. It always works for any character set with maximum speed and less space.

If the use is focused on Asian character sets, then a UTF-16 string would take less space. If an application must handle every character set, then the space savings for ASCII will likely outweigh the additional space cost relative to UTF-16. The reason for this is that studies have shown that the United States consumes about one half of the world's supply of software. In any case, conversions can be provided between utf8string and utf16string; utf16string would have an identical design.
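[Illustration, not part of the original post: a minimal sketch of the "converting to and from Unicode code points" step the post refers to, here only the encoding direction. The function name is hypothetical and error handling is reduced to a bool. Because the output is a byte sequence, no little/big endian handling is involved, which is the interchange property argued above.]

    #include <cstdint>
    #include <string>

    // Append one code point to a UTF-8 byte string. Returns false for values
    // outside the Unicode range or inside the surrogate block.
    bool AppendUtf8(uint32_t cp, std::string& out) {
      if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;
      if (cp <= 0x7F) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
      } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
      } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
      } else {                                // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
      }
      return true;
    }

Decoding is the reverse: read the lead byte to learn the sequence length, then merge the low six bits of each continuation byte.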
From: Peter Olcott on 16 May 2010 10:46
On 5/16/2010 8:51 AM, Öö Tiib wrote:
> On 16 mai, 15:34, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:
>> Since the reason for using encodings other than UTF-8 is speed and ease
>> of use, a string that is as fast and easy to use as the strings of
>> other encodings, and that often takes less space, would be superior to
>> those alternatives.
>
> If you care so much ... perhaps throw together your utf8string and let
> us see it. Perhaps test & profile it first to compare it with
> Glib::ustring:
> http://library.gnome.org/devel/glibmm/2.23/classGlib_1_1ustring.html
>
> I suspect UTF-8 will gradually fade into history, for reasons similar
> to why 256-color video modes and raster graphic formats went away.
> GUIs are already often made with Java or C# (for lack of C++ devs),
> and these use UTF-16 internally. Notice that modern processor
> architectures are already optimized in a way that makes byte-level
> operations often slower.

UTF-8 is the best Unicode data-interchange format because it works exactly the same way across every machine architecture without the need for separate adaptations. It also stores the entire ASCII character set in a single byte per code point.

I will put it together because it will become one of my standard tools. The design is now essentially complete. Coding this updated design will go very quickly. I will put it on my website and provide a free license for any use as long as the copyright notice remains in the source code.
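[Illustration, not part of the original post: a small self-contained example of the two claims above, that UTF-8 bytes are architecture-independent while UTF-16 code units involve a byte-order decision, and that ASCII costs one byte per code point in UTF-8.]

    #include <cstdio>
    #include <string>

    int main() {
      // 'A' followed by U+00C9 (É): the UTF-8 bytes 0x41 0xC3 0x89 are the
      // same on every architecture; no byte-order mark or byte swapping applies.
      const std::string utf8 = "A\xC3\x89";

      // The same text as UTF-16 code units 0x0041 0x00C9; how they are laid
      // out in memory or on the wire depends on a byte-order decision.
      const std::u16string utf16 = u"A\u00C9";

      std::printf("UTF-8 bytes: %zu, UTF-16 code units: %zu\n",
                  utf8.size(), utf16.size());   // prints 3 and 2
    }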