From: Peter Olcott on 17 May 2010 10:42

On 5/17/2010 3:22 AM, Oliver Regenfelder wrote:
> Hello,
>
> Joseph M. Newcomer wrote:
>> This one makes no sense. There will be ORDERS OF MAGNITUDE greater
>> differences in input time if you take rotational latency and seek time
>> into consideration (in fact, opening the file will have orders of
>> magnitude more variance than the cost of a UTF-8 to UTF-16 or even
>> UTF-32 conversion, because of the directory lookup time variance).
>
> Do yourself a favor Peter and believe him!
> A harddisk takes, half guessed (seek + half a rotation @ 7,200 rpm),
> ~12-14 ms to reach a sector for IO, and that is only the raw hardware
> delay. On networks you will have roundtrip times of maybe 60 ms or more
> (strongly depending on your INET connection and server location). So any
> computational effort for your string conversion doesn't matter,
> especially as your script language files won't be in the gigabyte range.
>
> Best regards,
>
> Oliver

I thought that he was saying that it takes much more time to read UTF-8 from disk than it takes to read UTF-32 from disk. This would be absurd. That it takes much more time to read either UTF-8 or UTF-32 from disk than it takes to convert either to the other, I already knew.

In any case, UTF-32 will be the internal representation of my GUI scripting language's string type. I will stick with UTF-8 for the lexical analyzer and the symbol table.
From: Peter Olcott on 17 May 2010 10:48

On 5/17/2010 8:30 AM, Leigh Johnston wrote:
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:_redna1LAMZaomzWnZ2dnUVZ_tOdnZ2d(a)giganews.com...
>> On 5/17/2010 1:35 AM, Mihai N. wrote:
>>>> I studied the derivation of the above regular expression in
>>>> considerable depth. I understand UTF-8 encoding quite well. So far
>>>> I have found no error.
>>>
>>>> It is published on w3c.org.
>>>
>>> Stop repeating this nonsense.
>>>
>>> The URL is http://www.w3.org/2005/03/23-lex-U and the post states:
>>> "It is not endorsed by the W3C members, team, or any working group."
>>>
>>> It is a hack implemented by someone, and it happens to be on the w3c
>>> server. This is not enough to make it right. If I post something on
>>> the free blogging space offered by Microsoft, will you take it as law
>>> and say "it is published on microsoft.com"?
>>>
>> Do you know of any faster way to validate and divide a UTF-8 sequence
>> into its constituent code point parts than a regular expression
>> implemented as a finite state machine? (Please don't cite a software
>> package; I am only interested in the underlying methodology.)
>>
>> To the very best of my knowledge (and I have a patent on a finite
>> state recognizer), a regular expression implemented as a finite state
>> machine is the fastest and simplest of every way that can possibly
>> exist to validate a UTF-8 sequence and divide it into its constituent
>> parts.
>
> My utf8_to_wide free function is not a finite state machine and it is
> pretty fast. It takes a std::string as input and returns a std::wstring
> as output. KISS.
>
> /Leigh

I couldn't imagine how to do this without using a finite state machine. How did you do it?

std::wstring will not help me because I have confirmed my original decision to use UTF-32 as my internal representation, and MS Windows only has a 16-bit std::wstring.
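[Editorial note: the finite-state-machine approach Peter describes can be illustrated with a small hand-written DFA. This is a sketch, not the w3c tables he cites; state and function names are mine, and the transitions follow the well-formed UTF-8 byte ranges in Table 3-7 of the Unicode standard (which rejects overlong forms, surrogates, and values above U+10FFFF).]

```cpp
#include <cstdint>
#include <cstddef>

// States of a hand-written DFA that accepts exactly the well-formed
// UTF-8 byte sequences of Table 3-7 of the Unicode standard.
enum State {
    START,    // expecting a lead byte
    TAIL1,    // one generic continuation byte (0x80..0xBF) remains
    TAIL2,    // two generic continuation bytes remain
    TAIL3,    // three generic continuation bytes remain
    E0_TAIL,  // after 0xE0: next byte must be 0xA0..0xBF (no overlongs)
    ED_TAIL,  // after 0xED: next byte must be 0x80..0x9F (no surrogates)
    F0_TAIL,  // after 0xF0: next byte must be 0x90..0xBF (no overlongs)
    F4_TAIL,  // after 0xF4: next byte must be 0x80..0x8F (<= U+10FFFF)
    REJECT
};

State step(State s, std::uint8_t b) {
    switch (s) {
    case START:
        if (b <= 0x7F) return START;
        if (b >= 0xC2 && b <= 0xDF) return TAIL1;
        if (b == 0xE0) return E0_TAIL;
        if (b == 0xED) return ED_TAIL;
        if (b >= 0xE1 && b <= 0xEF) return TAIL2;  // 0xE0/0xED handled above
        if (b == 0xF0) return F0_TAIL;
        if (b >= 0xF1 && b <= 0xF3) return TAIL3;
        if (b == 0xF4) return F4_TAIL;
        return REJECT;  // 0x80..0xC1 and 0xF5..0xFF can never start a sequence
    case TAIL1:   return (b >= 0x80 && b <= 0xBF) ? START : REJECT;
    case TAIL2:   return (b >= 0x80 && b <= 0xBF) ? TAIL1 : REJECT;
    case TAIL3:   return (b >= 0x80 && b <= 0xBF) ? TAIL2 : REJECT;
    case E0_TAIL: return (b >= 0xA0 && b <= 0xBF) ? TAIL1 : REJECT;
    case ED_TAIL: return (b >= 0x80 && b <= 0x9F) ? TAIL1 : REJECT;
    case F0_TAIL: return (b >= 0x90 && b <= 0xBF) ? TAIL2 : REJECT;
    case F4_TAIL: return (b >= 0x80 && b <= 0x8F) ? TAIL2 : REJECT;
    default:      return REJECT;
    }
}

bool valid_utf8(const std::uint8_t *p, std::size_t n) {
    State s = START;
    for (std::size_t i = 0; i < n; ++i)
        if ((s = step(s, p[i])) == REJECT) return false;
    return s == START;  // must not end in the middle of a sequence
}
```

Note the machine needs dedicated states after 0xE0, 0xED, 0xF0, and 0xF4 precisely because the first continuation byte is range-restricted there; a regexp over byte ranges encodes the same distinctions, which is why the two formulations are equivalent.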
From: Leigh Johnston on 17 May 2010 11:23

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message news:7PydnQ-UH8CoymzWnZ2dnUVZ_iydnZ2d(a)giganews.com...
> I couldn't imagine how to do this without using a finite state machine.
> How did you do it?

Writing such a free function is not rocket science, so I will not bore people by describing it here. N.B. my function is not quite the same as yours, as it doesn't "validate" a UTF-8 sequence; instead, if a particular byte (>= 0x80) is not part of a valid UTF-8 sequence, my function will use mbtowc as a fallback (using the default locale) rather than signalling an invalid sequence.

> std::wstring will not help me because I have confirmed my original
> decision to use UTF-32 as my internal representation, and MS Windows
> only has a 16-bit std::wstring.

If you are developing for Windows only, it makes sense to use UTF-16 as the internal representation, i.e. use std::wstring.

/Leigh
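[Editorial note: Leigh does not show his code, but a branch-based decoder with no state-machine table might look like the following sketch. It is not his actual function: it substitutes U+FFFD (REPLACEMENT CHARACTER) for invalid bytes rather than falling back to mbtowc, and it targets the UTF-32 representation Peter wants rather than std::wstring.]

```cpp
#include <string>
#include <vector>
#include <cstdint>
#include <cstddef>

// Decode UTF-8 with a plain branching loop -- no DFA table.
// Invalid bytes and malformed sequences become U+FFFD.
std::vector<char32_t> utf8_to_utf32(const std::string &in) {
    std::vector<char32_t> out;
    std::size_t i = 0, n = in.size();
    while (i < n) {
        std::uint8_t b = static_cast<std::uint8_t>(in[i]);
        std::size_t len;
        char32_t cp;
        if (b < 0x80)                { len = 1; cp = b; }         // ASCII
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }  // 1110xxxx
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }  // 11110xxx
        else { out.push_back(0xFFFD); ++i; continue; }            // stray byte

        bool ok = i + len <= n;  // enough bytes left?
        for (std::size_t k = 1; ok && k < len; ++k) {
            std::uint8_t c = static_cast<std::uint8_t>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;   // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        static const char32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (ok && (cp < min_cp[len] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(0xFFFD); ++i; }   // resync at next byte
    }
    return out;
}
```

The post-hoc range check replaces the per-state restrictions a DFA would carry, which is why the loop stays short; whether it beats a table-driven machine is a benchmarking question, not an architectural one.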
From: Joseph M. Newcomer on 17 May 2010 11:48

The underlying technology is discussed in the Unicode documentation and on www.unicode.org. There is a set of APIs that deliver character information, including the class information, which are part of the Unicode support in Windows.

But the point is, thinking of Unicode code points by writing a regexp for UTF-8 is not a reasonable approach. Or to put it bluntly: the regexp set you show is wrong, I have shown it is wrong, and you have to start thinking correctly about the problem.

joe

On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>On 5/17/2010 1:35 AM, Mihai N. wrote:
>>
>>> I studied the derivation of the above regular expression in considerable
>>> depth. I understand UTF-8 encoding quite well. So far I have found no
>>> error.
>>
>>> It is published on w3c.org.
>>
>> Stop repeating this nonsense.
>>
>> The URL is http://www.w3.org/2005/03/23-lex-U and the post states:
>> "It is not endorsed by the W3C members, team, or any working group."
>>
>> It is a hack implemented by someone, and it happens to be on the w3c server.
>> This is not enough to make it right. If I post something on the free blogging
>> space offered by Microsoft, will you take it as law and say "it is published
>> on microsoft.com"?
>>
>Do you know of any faster way to validate and divide a UTF-8 sequence
>into its constituent code point parts than a regular expression
>implemented as a finite state machine? (Please don't cite a software
>package; I am only interested in the underlying methodology.)
>
>To the very best of my knowledge (and I have a patent on a finite state
>recognizer), a regular expression implemented as a finite state machine
>is the fastest and simplest of every way that can possibly exist to
>validate a UTF-8 sequence and divide it into its constituent parts.

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on 17 May 2010 12:04
See below...

On Mon, 17 May 2010 09:29:33 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>On 5/17/2010 1:44 AM, Joseph M. Newcomer wrote:
>>> If you are not a liar then show an error in the above regular
>>> expression, I dare you.
>> ***
>> I have already pointed out that it is insufficient for lexically
>> recognizing accent marks or invalid combinations of accent marks. So
>> the requirement of demonstrating an error is trivially met.
>>
>> In addition, the regexp values do not account for directional changes
>> in the parse, which is essential, for reasons I explained in another
>> response.
>
>I have always defined correct to mean valid UTF-8 sequences (according
>to the UTF-8 specification), and now you are presenting the red herring
>that it does not validate code point sequences. It is not supposed to
>validate code point sequences.
****
Oh, so it doesn't matter if it is *correct*, as long as it is *correct*. I thought you said "semantically correct", and semantics necessarily implies "meaning". What you were asking, or should have asked, is something along the lines of "does this set of regular expressions define the set of valid UTF-8 character sequences?" Then you started talking about codepoints, which of course means correct Unicode representations, and I already pointed out why that doesn't work.

So what are you asking? Or is this another "magic morphing question" that will change with every response? State PRECISELY what the question is. Otherwise, you leave us guessing as to what you are really asking.
****
>
>The reason that I ALWAYS ask you to explain your reasoning is that this
>most often provides the invalid assumptions that you are making.
*****
Oh. And your assumptions (that conversion time and space apparently matter) are always valid?
You would not be discussing the way to construct a utf8string unless you thought there was a reason that such a kludge mattered, and you argued it based on time and space, which means you made an invalid assumption: that these matter in the slightest! Out here in the Real World, we make decisions based on global performance and correctness goals, and include metrics like development cost, portability across localization, maintainability, and similar parameters. UTF-8 as an internal representation fails on all these scores.

And why do you think you can't have a CString of UTF-32 characters? We have CStringA and CStringW, and with a little work you could create a CString32 that had all the right properties and derived from the base class. It is a minor exercise, the kind I might assign to a beginning C++ programmer. Then you could write a UTF-8-to-UTF-32 conversion (for example, BOOL cv8_32(const CStringA & utf8, CString32 & result);) and you would avoid most of the problems you are trying to solve.

Note that the FSM required to parse multibyte sequences is not based on ranges, but on the high-order bits of the first byte; if you had read the Unicode 5.0 documentation for this, you would have seen the table. I am not next to my computer right now, so the Unicode 5.0 book, which is normally within arm's reach, is actually next door, or I'd give you a page number.

I would not use a regexp to syntactically validate an input UTF-8 string. If I were writing cv8_32 I would follow the encoding rules for multibyte UTF-8 characters that are specified in the book. The regexp approach is just wrong.
****
****
>
>> It would be easier if you had expressed it as Unicode codepoints; then
>> it would be easy to show the numerous failures. I'm sorry, I thought
>> you had already applied exhaustive categorical reasoning to this,
>> which would have demonstrated the errors.
>
>This level of detail is not relevant to the specific problem that I am
>solving.
>The problem is providing a minimal cost way to permit people to
>write GUI scripts in their native language. Anything that goes beyond
>the scope of this problem is explicitly out-of-scope.
****
Huh? The correct approach is to let them write the scripts in whatever editor they like to use, save them as UTF-8, read them in, and convert them to UTF-32. After that, you have to figure out what parsing them means. For example, I believe the regexp you have given does not properly handle numbers. But I already pointed this out.

joe

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
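[Editorial note: the lead-byte dispatch Joe describes -- sequence length determined by the high-order bits of the first byte rather than by value ranges -- can be sketched as follows. The function name is mine; range checks on the continuation bytes (for overlongs, surrogates, and the U+10FFFF ceiling) still have to follow this classification.]

```cpp
#include <cstdint>

// Sequence length implied by a UTF-8 lead byte, read from its
// high-order bits; returns 0 for a byte that cannot begin a sequence.
int utf8_seq_len(std::uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx  ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;  // 10xxxxxx continuation byte, or 0xF8..0xFF: never a lead
}
```

A converter in the cv8_32 style would call this once per character, then consume and range-check that many continuation bytes, with no regexp anywhere in the pipeline.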