From: Joseph M. Newcomer on

But the point is that no sane person who wanted to split up a UTF-8 sequence into code
points would EVER consider something as remarkably silly as using regexps like you showed.
The parsing is absolutely trivial: you look at the high-order bits of the leading byte, and
this tells you how many bytes are involved! Duh! And if you want to "validate", you just
look at the high-order bits and the low-order values of the next few bytes! Double Duh!

It is a trivial single-pass loop, no try-and-rescan as suggested by the regexp model. You
have obviously confused a trivial problem with a complex problem. It is frankly an
exceptionally poor choice for the goal you set, but it was not clear what you were trying
to do. It is blindingly obvious that this is the worst possible choice, but to realize
this, you actually have to RTFM (table 3.6, page 103, Unicode 5.0 book; the rain drove me
inside so I'm back in my office).
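For concreteness, a rough, untested sketch of that single-pass loop might look like the
following (the function name and the choice of recording code-point start offsets are purely
illustrative; it checks only the structural rules described above and does not reject
overlong forms or surrogates):

// Untested sketch: one pass over a UTF-8 buffer, using only the lead byte's
// high-order bits to find each code point's length, and checking that the
// following bytes are 10xxxxxx continuations. It records where each code
// point starts. (Overlong forms and surrogates are not rejected here.)
#include <cstddef>
#include <vector>

bool split_utf8(const unsigned char *s, std::size_t n,
                std::vector<std::size_t> &starts)
{
    std::size_t i = 0;
    while (i < n)
    {
        std::size_t len;
        if      ((s[i] & 0x80) == 0x00) len = 1;   // 0xxxxxxx
        else if ((s[i] & 0xE0) == 0xC0) len = 2;   // 110xxxxx
        else if ((s[i] & 0xF0) == 0xE0) len = 3;   // 1110xxxx
        else if ((s[i] & 0xF8) == 0xF0) len = 4;   // 11110xxx
        else return false;                         // stray continuation or 0xF8..0xFF
        if (i + len > n) return false;             // truncated sequence
        for (std::size_t k = 1; k < len; ++k)
            if ((s[i + k] & 0xC0) != 0x80)         // continuation must be 10xxxxxx
                return false;
        starts.push_back(i);
        i += len;
    }
    return true;
}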

If your only tool is a hammer, all your problems look like nails.
joe

On Mon, 17 May 2010 13:42:51 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/17/2010 10:48 AM, Joseph M. Newcomer wrote:
>> The underlying technology is discussed in the Unicode documentation and on
>> www.unicode.org. There are a set of APIs that deliver character information including the
>> class information which are part of the Unicode support in Windows. But the point is,
>> thinking of Unicode code points by writing a regexp for UTF-8 is not a reasonable
>> approach.
>>
>> Or to put it bluntly, the regexp set you show is wrong, I have shown it is wrong, and you
>> have to start thinking correctly about the problem.
>> joe
>
>No you did not show that it was wrong for its intended purpose of
>validating byte sequences as valid UTF-8 and dividing these sequences
>into their corresponding code points.
>
>You merely provided examples of things that it was not intended to do,
>which in no way shows that it is in any way incorrect when measured
>against its intended purpose.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on
See below..
On Mon, 17 May 2010 14:43:05 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/17/2010 2:25 PM, I V wrote:
>> On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott wrote:
>>> Do you know of any faster way to validate and divide a UTF-8 sequence
>>> into its constituent code point parts than a regular expression
>>> implemented as a finite state machine? (please don't cite a software
>>> package, I am only interested in the underlying methodology).
>>
>> A finite state machine sounds like a good plan, but I'd be a bit
>> surprised if a regular expression was faster than a state machine
>> specifically written to parse UTF-8. Aside from the unnecessary
>> generality of regular expressions (I don't really know if that would
>> actually make them slower in this case), I would guess a regular
>> expression engine wouldn't take advantage of the way that UTF-8 encodes
>> the meaning of each byte (single-byte codepoint, first byte of multi-byte
>> code-point, or continuation of a multi-byte codepoint) in the most-
>> significant two bits of the byte.
>
>I was originally thinking that I would only need 256 * 4 bytes to encode
>a complete DFA recognizer using the simplest possible design. Now it
>looks like I need 256 ^ 4 bytes to encode the simplest possible design.
>Apparently Lex knows how to make this much more concise, thus not the
>simplest possible design.
****
It is equally obvious that the simplest possible design requires NO table, but simply a
couple of bit-masks and if-statements! I didn't fully appreciate the depths of your
ignorance when I first answered the question. It was perfectly clear that a DFA would NOT
be the correct model, especially with the horrible regexp you were trying to use. But I
had no idea that you were totally clueless about UTF-8 implementation!
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joshua Maurice on
On May 17, 12:25 pm, I V <ivle...(a)gmail.com> wrote:
> On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott wrote:
> > Do you know of any faster way to validate and divide a UTF-8 sequence
> > into its constituent code point parts than a regular expression
> > implemented as a finite state machine? (please don't cite a software
> > package, I am only interested in the underlying methodology).
>
> A finite state machine sounds like a good plan, but I'd be a bit
> surprised if a regular expression was faster than a state machine
> specifically written to parse UTF-8. Aside from the unnecessary
> generality of regular expressions (I don't really know if that would
> actually make them slower in this case), I would guess a regular
> expression engine wouldn't take advantage of the way that UTF-8 encodes
> the meaning of each byte (single-byte codepoint, first byte of multi-byte
> code-point, or continuation of a multi-byte codepoint) in the most-
> significant two bits of the byte.

This sounds like a little overkill to me, all of this talk of regular
expressions, finite state machines, etc.

Can't you just do something like the following? I understand that it is,
in fact, a finite state machine, but it uses no frameworks, no regular
expressions, etc. I'd expect that this is pretty good in terms of speed
and readability. It would be quite simple to add some code using bit
operations to convert from the utf8 array to Unicode code points.

//COMPLETELY UNTESTED
// Checks that the buffer [utf8str_start, utf8str_end) is structurally valid
// UTF-8: each lead byte's high-order bits give the sequence length, and each
// following byte must be a 10xxxxxx continuation byte. It does not reject
// overlong encodings, UTF-16 surrogates, or values above U+10FFFF.
bool validate_utf8(unsigned char * utf8str_start, unsigned char * utf8str_end)
{
    for (unsigned char * x = utf8str_start; x != utf8str_end; )
    {
        if ((*x & 0x80) == 0)                                                       // 0xxxxxxx: ASCII
        {
            ++x;
        }
        else if ((*x & (0x80 + 0x40 + 0x20)) == (0x80 + 0x40))                      // 110xxxxx: 2-byte sequence
        {
            if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
                return false;
            ++x;
        }
        else if ((*x & (0x80 + 0x40 + 0x20 + 0x10)) == (0x80 + 0x40 + 0x20))        // 1110xxxx: 3-byte sequence
        {
            if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
                return false;
            if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
                return false;
            ++x;
        }
        else if ((*x & (0x80 + 0x40 + 0x20 + 0x10 + 0x08)) == (0x80 + 0x40 + 0x20 + 0x10))  // 11110xxx: 4-byte sequence
        {
            if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
                return false;
            if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
                return false;
            if (++x == utf8str_end || (*x & (0x80 + 0x40)) != (0x80))
                return false;
            ++x;
        }
        else                                                                        // stray continuation byte or 11111xxx lead
            return false;
    }
    return true;
}
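
As noted above, converting the already-validated bytes to code points takes only a few more
bit operations. A rough, untested sketch in the same style (the helper name is illustrative,
and it assumes the buffer has already passed validate_utf8 or an equivalent check):

// Untested sketch: decode one already-validated UTF-8 sequence starting at x,
// advance x past it, and return the code point. No bounds checking is done,
// because validation is assumed to have been done first.
unsigned long decode_one(unsigned char *&x)
{
    if ((*x & 0x80) == 0)                       // 0xxxxxxx
        return *x++;
    if ((*x & 0xE0) == 0xC0)                    // 110xxxxx 10xxxxxx
    {
        unsigned long cp = (*x++ & 0x1FUL) << 6;
        cp |= (*x++ & 0x3F);
        return cp;
    }
    if ((*x & 0xF0) == 0xE0)                    // 1110xxxx + 2 continuations
    {
        unsigned long cp = (*x++ & 0x0FUL) << 12;
        cp |= (*x++ & 0x3FUL) << 6;
        cp |= (*x++ & 0x3F);
        return cp;
    }
    unsigned long cp = (*x++ & 0x07UL) << 18;   // 11110xxx + 3 continuations
    cp |= (*x++ & 0x3FUL) << 12;
    cp |= (*x++ & 0x3FUL) << 6;
    cp |= (*x++ & 0x3F);
    return cp;
}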
From: Peter Olcott on
On 5/17/2010 4:20 PM, Joseph M. Newcomer wrote:
>
> But the point is that no sane person who wanted to split up a UTF-8 sequence into code
> points would EVER consider something as remarkably silly as using regexps like you showed.
> The parsing is absolutely trivial: you look at the high-order bits of the leading byte, and
> this tells you how many bytes are involved! Duh! And if you want to "validate", you just
> look at the high-order bits and the low-order values of the next few bytes! Double Duh!
>
> It is a trivial single-pass loop, no try-and-rescan as suggested by the regexp model. You
> have obviously confused a trivial problem with a complex problem. It is frankly an
> exceptionally poor choice for the goal you set, but it was not clear what you were trying
> to do. It is blindingly obvious that this is the worst possible choice, but to realize
> this, you actually have to RTFM (table 3.6, page 103, Unicode 5.0 book; the rain drove me
> inside so I'm back in my office).

I completed the detailed design of the DFA that validates any valid UTF-8
byte sequence and translates it into UTF-32. It cannot be done any faster
or more simply. The state transition matrix takes exactly 2 KB.

A regular expression is apparently mandatory in my case because of my Lex
requirement. It might be possible to derive some kludge around this, but
I abhor kludges.
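
For comparison, a table-driven recognizer along those lines is often laid out as a 256-entry
byte-class table plus a small state-transition matrix. The sketch below is untested and
purely illustrative (these are not the poster's actual tables), and in this simplified form
it does not reject overlong 3- and 4-byte forms, surrogates, or values above U+10FFFF:

// Untested, simplified sketch of a table-driven UTF-8 recognizer: a byte-class
// function plus a states-x-classes transition matrix. A full validator (per
// the Unicode well-formedness table) must also exclude overlong 3-/4-byte
// forms, surrogates, and code points above U+10FFFF, which this does not.
#include <cstddef>

enum { ASCII, CONT, LEAD2, LEAD3, LEAD4, BAD };                 // byte classes
enum { S_OK, S_1MORE, S_2MORE, S_3MORE, S_ERR, NSTATES = 5 };   // DFA states

static unsigned char byte_class(unsigned char b)
{
    if (b < 0x80) return ASCII;   // 0xxxxxxx
    if (b < 0xC0) return CONT;    // 10xxxxxx continuation
    if (b < 0xC2) return BAD;     // 0xC0/0xC1: always overlong
    if (b < 0xE0) return LEAD2;   // 110xxxxx
    if (b < 0xF0) return LEAD3;   // 1110xxxx
    if (b < 0xF5) return LEAD4;   // 11110xxx (0xF0..0xF4)
    return BAD;                   // 0xF5..0xFF never appear in UTF-8
}

static const unsigned char transition[NSTATES][6] = {
    /* S_OK    */ { S_OK,  S_ERR,   S_1MORE, S_2MORE, S_3MORE, S_ERR },
    /* S_1MORE */ { S_ERR, S_OK,    S_ERR,   S_ERR,   S_ERR,   S_ERR },
    /* S_2MORE */ { S_ERR, S_1MORE, S_ERR,   S_ERR,   S_ERR,   S_ERR },
    /* S_3MORE */ { S_ERR, S_2MORE, S_ERR,   S_ERR,   S_ERR,   S_ERR },
    /* S_ERR   */ { S_ERR, S_ERR,   S_ERR,   S_ERR,   S_ERR,   S_ERR },
};

bool dfa_validate_utf8(const unsigned char *s, std::size_t n)
{
    unsigned state = S_OK;
    for (std::size_t i = 0; i < n; ++i)
        state = transition[state][byte_class(s[i])];
    return state == S_OK;         // must end exactly on a code-point boundary
}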

>
> If your only tool is a hammer, all your problems look like nails.
> joe
>
> On Mon, 17 May 2010 13:42:51 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> On 5/17/2010 10:48 AM, Joseph M. Newcomer wrote:
>>> The underlying technology is discussed in the Unicode documentation and on
>>> www.unicode.org. There are a set of APIs that deliver character information including the
>>> class information which are part of the Unicode support in Windows. But the point is,
>>> thinking of Unicode code points by writing a regexp for UTF-8 is not a reasonable
>>> approach.
>>>
>>> Or to put it bluntly, the regexp set you show is wrong, I have shown it is wrong, and you
>>> have to start thinking correctly about the problem.
>>> joe
>>
>> No you did not show that it was wrong for its intended purpose of
>> validating byte sequences as valid UTF-8 and dividing these sequences
>> into their corresponding code points.
>>
>> You merely provided examples of things that it was not intended to do,
>> which in no way shows that it is in any way incorrect when measured
>> against its intended purpose.
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on
On 5/17/2010 4:23 PM, Joseph M. Newcomer wrote:
> See below..
> On Mon, 17 May 2010 14:43:05 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> On 5/17/2010 2:25 PM, I V wrote:
>>> On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott wrote:
>>>> Do you know of any faster way to validate and divide a UTF-8 sequence
>>>> into its constituent code point parts than a regular expression
>>>> implemented as a finite state machine? (please don't cite a software
>>>> package, I am only interested in the underlying methodology).
>>>
>>> A finite state machine sounds like a good plan, but I'd be a bit
>>> surprised if a regular expression was faster than a state machine
>>> specifically written to parse UTF-8. Aside from the unnecessary
>>> generality of regular expressions (I don't really know if that would
>>> actually make them slower in this case), I would guess a regular
>>> expression engine wouldn't take advantage of the way that UTF-8 encodes
>>> the meaning of each byte (single-byte codepoint, first byte of multi-byte
>>> code-point, or continuation of a multi-byte codepoint) in the most-
>>> significant two bits of the byte.
>>
>> I was originally thinking that I would only need 256 * 4 bytes to encode
>> a complete DFA recognizer using the simplest possible design. Now it
>> looks like I need 256 ^ 4 bytes to encode the simplest possible design.
>> Apparently Lex knows how to make this much more concise, thus not the
>> simplest possible design.
> ****
> It is equally obvious that the simplest possible design requires NO table, but simply a
> couple of bit-masks and if-statements! I didn't fully appreciate the depths of your
> ignorance when I first answered the question. It was perfectly clear that a DFA would NOT
> be the correct model, especially with the horrible regexp you were trying to use. But I
> had no idea that you were totally clueless about UTF-8 implementation!
> joe

Maybe it is much simpler for me than it would be for others because of
my strong bias towards DFA recognizers. I bet my DFA recognizer is at
least twice as fast as any other method for validating UTF-8 and
converting it to code points. I am estimating about 20 machine clocks
per code point.

> ****
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm