From: Peter Olcott on 17 May 2010 16:24 On 5/17/2010 2:47 PM, Pete Delgado wrote: > "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message > news:a_WdndCS8YLmBGzWnZ2dnUVZ_o6dnZ2d(a)giganews.com... >> On 5/17/2010 12:42 PM, Pete Delgado wrote: >>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message >>> news:lI2dnTWeE-MF0GzWnZ2dnUVZ_gadnZ2d(a)giganews.com... >>>> My source code encoding will be UTF-8. My interpreter is written in Yacc >>>> and Lex, and 75% complete. This makes a UTF-8 regular expression >>>> mandatory. >>> >>> Peter, >>> You keep mentioning that your interpreter is 75% complete. Forgive my >>> morbid >>> curiosity but what exactly does that mean? >>> >>> -Pete >>> >>> >> >> The Yacc and Lex are done and working and correctly translate all input >> into a corresponding abstract syntax tree. The control flow portion of the >> code generator is done and correctly translates control flow statements >> into corresponding jump code with minimum branches. The detailed design >> for everything else is complete. The everything else mostly involves how >> to handle all of the data types including objects, and also including >> elemental operations upon these data types and objects. > > Peter, > Correct me if I'm wrong, but it sounds to me as if you are throwing > something at Yacc and Lex and are getting something you think is > "reasonable" out of them but you haven't been able to test the validity of > the output yet since the remainder of the coding to your spec is yet to be > done. Is that a true statement? > > > -Pete > > That is just as true as the statement that no one is more Jewish than the pope. I started with a known good Yacc and Lex specification for "C" http://www.lysator.liu.se/c/ANSI-C-grammar-y.html http://www.lysator.liu.se/c/ANSI-C-grammar-l.html I carefully constructed code to build a very clean abstract syntax tree. I made absolute minimal transformations to the original Yacc and Lex spec to attain the subset of C++ that I will be implementing. (Basically the "C" language plus classes and minus pointers). I used this abstract syntax tree to generate jump code with the absolute minimum number of branches for the control flow statements that this C++ subset will be providing. Testing has indicated that these results have been correctly achieved. Then I designed the whole infrastructure to handle every type of elemental operation upon the fundamental data types, as well as the aggregate data types.
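For readers who, like Joe later in the thread, wonder what "jump code with minimum branches" might mean: one plausible reading is the classic bottom-test translation of loops, where the condition is tested after the body so that each iteration costs a single conditional branch. The sketch below is purely illustrative and is not Peter's generator; the emitter stubs (emitBody, emitCondition) and the label scheme are invented for the example.

#include <cstdio>

static int nextLabel = 0;

// Stand-ins for the real AST emitters; here they only print placeholders.
void emitCondition() { std::puts("    <evaluate loop condition, set flags>"); }
void emitBody()      { std::puts("    <loop body>"); }

// Bottom-test layout for "while (C) S": enter at the test, then loop with
// exactly one conditional branch per iteration.
void genWhile() {
    int testLbl = nextLabel++;
    int bodyLbl = nextLabel++;
    std::printf("    jmp  L%d\n", testLbl);   // skip the body on entry
    std::printf("L%d:\n", bodyLbl);
    emitBody();
    std::printf("L%d:\n", testLbl);
    emitCondition();
    std::printf("    jnz  L%d\n", bodyLbl);   // back-branch only while C holds
}

int main() { genWhile(); return 0; }

The naive top-test layout executes both a conditional branch and an unconditional back-branch on every iteration; moving the test to the bottom removes the latter.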
From: Joseph M. Newcomer on 17 May 2010 17:38 See below... On Mon, 17 May 2010 14:05:54 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote: >On 5/17/2010 11:24 AM, Joseph M. Newcomer wrote: >> See below... >> On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: >> >>>> Huh? What's this got to do with the encoding? >>> (1) Lex requires a RegEx >> *** >> Your regexp should be in terms of UTF32, not UTF8. > >Wrong, Lex can not handle data larger than bytes. *** The wrong choice of tooling. Just because the tooling is stupid is no reason to cater to its failures. It is too bad, but lex was done in a different era when everyone used 7-bit ASCII. Using antiquated tools generates problems that in a sane world would not have to exist. I'd think that converting lex to 32-bit would be a simpler solution. **** > >> **** >>> >>> (2) I still must convert from UTF-8 to UTF-32, and I don't think that a >>> faster or simpler way to do this besides a regular expression >>> implemented as a finite state machine can possibly exist. >> **** >> Actually, there is; you obviously know nothing about UTF-8, or you would know that the >> high-order bits of the first byte tell you the length of the encoding, and the FSM is >> written entirely in terms of the actual encoding, and is never written as a regexp. > >Ah I see, if I don't know everything then I must know nothing; I think >that this logic is flawed. None of the docs that I read mentioned this >nuance. It may prove to be useful. It looks like this will be most >helpful when translating from UTF-32 to UTF-8, and not the other way >around. **** You are asking questions based on such elementary properties of the encoding that there is little excuse for defending your position. You did not start out by saying "I know nothing about UTF-8 at all..." in which case the simple answer is (a) read about UTF-8 encoding (I've given you the citation, including the page number) (b) realize that your base approach is almost certainly wrong. **** > >It would still seem to be slower and more complex than a DFA based >finite state machine for validating a UTF-8 byte sequence. It also looks >like it would be slower for translating from UTF-8 to UTF-32. **** If I didn't have to leave in 10 minutes, I'd write the UTF-8 to UTF-32 conversion. But I have to put together some items for my presentation (find the extension cord, among other tasks) which doesn't quite leave enough time to write the code, which will take almost the entire 10 minutes to write. But when I get back tonight, I'll write it and time myself. But speed is not an issue here, and lex was never intended to support UTF-8. It sounds like you are complaining that punched cards don't support full Unicode. Wrong tool for the job. joe Joseph M. Newcomer [MVP] email: newcomer(a)flounder.com Web: http://www.flounder.com MVP Tips: http://www.flounder.com/mvp_tips.htm
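The conversion Joe says he would write that evening is straightforward to sketch from the property he cites: the high-order bits of the lead byte give the length of the sequence. The function below is only a minimal illustration in C++, not his timed code, and it deliberately skips most validation (overlong forms, surrogates, values above U+10FFFF); invalid lead bytes are replaced with U+FFFD.

#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 to UTF-32 by reading the sequence length out of the lead
// byte's high bits, then folding in the 6 payload bits of each 10xxxxxx
// continuation byte.
std::vector<uint32_t> utf8_to_utf32(const std::string& in) {
    std::vector<uint32_t> out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len;
        uint32_t cp;
        if      (b < 0x80)           { len = 1; cp = b;        }  // 0xxxxxxx
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }  // 110yyyyy
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }  // 1110zzzz
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }  // 11110uuu
        else                         { len = 1; cp = 0xFFFD;   }  // invalid lead byte
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

No regular expression or generated state table is involved; the only "state machine" is the implicit state of the loop.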
From: Peter Olcott on 17 May 2010 23:29 On 5/17/2010 4:38 PM, Joseph M. Newcomer wrote: > See below... > On Mon, 17 May 2010 14:05:54 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: > >> On 5/17/2010 11:24 AM, Joseph M. Newcomer wrote: >>> See below... >>> On Mon, 17 May 2010 09:04:07 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: >>> >>>>> Huh? What's this got to do with the encoding? >>>> (1) Lex requires a RegEx >>> *** >>> Your regexp should be in terms of UTF32, not UTF8. >> >> Wrong, Lex can not handle data larger than bytes. > *** > The wrong choice of tooling. Just because the tooling is stupid is no reason to cater to > its failures. > > It is too bad, but lex was done in a different era when everyone used 7-bit ASCII. Using > antiquated tools generates problems that in a sane world would not have to exist. I'd > think that converting lex to 32-bit would be a simpler solution. > **** >> >>> **** >>>> >>>> (2) I still must convert from UTF-8 to UTF-32, and I don't think that a >>>> faster or simpler way to do this besides a regular expression >>>> implemented as a finite state machine can possibly exist. >>> **** >>> Actually, there is; you obviously know nothing about UTF-8, or you would know that the >>> high-order bits of the first byte tell you the length of the encoding, and the FSM is >>> written entirely in terms of the actual encoding, and is never written as a regexp. >> >> Ah I see, if I don't know everything then I must know nothing; I think >> that this logic is flawed. None of the docs that I read mentioned this >> nuance. It may prove to be useful. It looks like this will be most >> helpful when translating from UTF-32 to UTF-8, and not the other way >> around. > **** > You are asking questions based on such elementary properties of the encoding that there is > little excuse for defending your position. You did not start out by saying "I know > nothing about UTF-8 at all..." in which case the simple answer is > (a) read about UTF-8 encoding (I've given you the citation, including > the page number) > (b) realize that your base approach is almost certainly wrong. > **** >> >> It would still seem to be slower and more complex than a DFA based >> finite state machine for validating a UTF-8 byte sequence. It also looks >> like it would be slower for translating from UTF-8 to UTF-32. > **** > If I didn't have to leave in 10 minutes, I'd write the UTF-8 to UTF-32 conversion. But I > have to put together some items for my presentation (find the extension cord, among other > tasks) which doesn't quite leave enough time to write the code, which will take almost the > entire 10 minutes to write. But when I get back tonight, I'll write it and time myself. > > But speed is not an issue here, and lex was never intended to support UTF-8. It sounds > like you are complaining that punched cards don't support full Unicode. Wrong tool for > the job. > joe I have made Lex handle UTF-8 in an optimal way, thus you are wrong yet again. I do now agree with you that UTF-32 is generally much better than UTF-8 for internal representation. I want my GUI scripting language to be like the original Turbo Pascal 2.0: at least 100-fold faster compiles than the next best alternative. > Joseph M. Newcomer [MVP] > email: newcomer(a)flounder.com > Web: http://www.flounder.com > MVP Tips: http://www.flounder.com/mvp_tips.htm
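The DFA-based validation Peter keeps referring to also does not require lex; a hand-rolled checker whose only state is the number of continuation bytes still expected is enough to test structural well-formedness. The sketch below is an assumption about what such a checker could look like, not Peter's posted regular expression, and it does not reject overlong encodings, surrogates, or values above U+10FFFF.

#include <string>

// Structural UTF-8 check: every lead byte must be followed by the right
// number of 10xxxxxx continuation bytes, and the input must not end in
// the middle of a sequence.
bool looksLikeUtf8(const std::string& s) {
    int expect = 0;                                // continuation bytes still owed
    for (unsigned char b : s) {
        if (expect == 0) {
            if      (b < 0x80)           expect = 0;   // ASCII
            else if ((b & 0xE0) == 0xC0) expect = 1;   // 110xxxxx
            else if ((b & 0xF0) == 0xE0) expect = 2;   // 1110xxxx
            else if ((b & 0xF8) == 0xF0) expect = 3;   // 11110xxx
            else return false;                         // stray continuation or bad lead
        } else {
            if ((b & 0xC0) != 0x80) return false;      // must be 10xxxxxx
            --expect;
        }
    }
    return expect == 0;
}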
From: Joseph M. Newcomer on 18 May 2010 00:12 See below... On Mon, 17 May 2010 15:24:03 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote: >On 5/17/2010 2:47 PM, Pete Delgado wrote: >> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message >> news:a_WdndCS8YLmBGzWnZ2dnUVZ_o6dnZ2d(a)giganews.com... >>> On 5/17/2010 12:42 PM, Pete Delgado wrote: >>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message >>>> news:lI2dnTWeE-MF0GzWnZ2dnUVZ_gadnZ2d(a)giganews.com... >>>>> My source code encoding will be UTF-8. My interpreter is written in Yacc >>>>> and Lex, and 75% complete. This makes a UTF-8 regular expression >>>>> mandatory. >>>> >>>> Peter, >>>> You keep mentioning that your interpreter is 75% complete. Forgive my >>>> morbid >>>> curiosity but what exactly does that mean? >>>> >>>> -Pete >>>> >>>> >>> >>> The Yacc and Lex are done and working and correctly translate all input >>> into a corresponding abstract syntax tree. The control flow portion of the >>> code generator is done and correctly translates control flow statements >>> into corresponding jump code with minimum branches. The detailed design >>> for everything else is complete. The everything else mostly involves how >>> to handle all of the data types including objects, and also including >>> elemental operations upon these data types and objects. >> >> Peter, >> Correct me if I'm wrong, but it sounds to me as if you are throwing >> something at Yacc and Lex and are getting something you think is >> "reasonable" out of them but you haven't been able to test the validity of >> the output yet since the remainder of the coding to your spec is yet to be >> done. Is that a true statement? >> >> >> -Pete >> >> > >That is just as true as the statement that no one is more Jewish than >the pope. > >I started with a known good Yacc and Lex specification for "C" > http://www.lysator.liu.se/c/ANSI-C-grammar-y.html > http://www.lysator.liu.se/c/ANSI-C-grammar-l.html **** I presume these are Yacc and Lex specifications, but why do you claim they are "known good" specifications? And what do they have to do with UTF-8 encodings? You don't have a good grasp of what constitutes a "letter" beyond the ASCII-7 spec, so how can you extend it to support UTF-8? **** > >I carefully constructed code to build a very clean abstract syntax tree. >I made absolute minimal transformations to the original Yacc and Lex >spec to attain the subset of C++ that I will be implementing. >(Basically the "C" language plus classes and minus pointers). > >I used this abstract syntax tree to generate jump code with the absolute >minimum number of branches for the control flow statements that this >C++ subset will be providing. Testing has indicated that these results >have been correctly achieved. **** What is "jump code"? I've been in the compiler business since 1970 and I've never heard of this code style. We assign tasks of generating code to first-semester compiler course students; it isn't all that challenging. I did a google of "jump code" and got a lot of useless stuff about advertising links in Web pages. Somehow, I don't believe you are generating these from your compiler. **** > >Then I designed the whole infrastructure to handle every type of >elemental operation upon the fundamental data types, as well as the >aggregate data types. **** How do you know how much code is required, so that you know you are 75% done?
I was once consulting, and the company said "Yes, you've identified a serious bug, but we can't fix it now during the coding phase, because we are 37.247% done with the coding" (they were coding something the like of which had never existed before, but they could tell to three decimal places how far they were into the coding) "...So we'll let it go to the testing phase, and fix it there" (never mind it would cost several times as much to fix it; it was somebody else's schedule). So I am always deeply suspicious of any "n% done" statements under any conditions (let alone something like "75%", which means "more than 74% and less than 76%"). An estimate of "3/4 done" (which means you know to within one part in 4 how much might be left) makes sense. But as a physicist I was taught to never add a decimal point or digit of precision beyond what is actually known. "7/10" is not the same as "70%" except in abstract math. "70%" means "not 69%, not 71%, I know to one part in 100 how accurate I am" but "7/10" means "more than 6/10 and less than 8/10" which would translate to "70%±10%". So you have to not just quote a value, but also your confidence in the value. In Pennsylvania, it was announced we have "75% compliance" with the 2010 census. This leads to the question, if we are attempting to find out how many people we have, how do we know that 75% of them have responded? We don't actually know how many people we have, because we have a partial response, but how do we know it is 75%? 75% of how many? Well, we don't know how many, that's what we're trying to figure out... joe **** Joseph M. Newcomer [MVP] email: newcomer(a)flounder.com Web: http://www.flounder.com MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on 18 May 2010 00:23
See below... On Mon, 17 May 2010 14:53:39 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote: >On 5/17/2010 11:51 AM, Joseph M. Newcomer wrote: >> See below... >> On Mon, 17 May 2010 08:57:58 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: >>> >>> >>> I still MUST have a correct UTF-8 RegEx because my interpreter is 75% >>> completed using Lex and Yacc. Besides this I need a good way to parse >>> UTF-8 to convert it to UTF-32. >> **** >> No, it sucks. For reasons I have pointed out. You can easily write a UTF-32 converter >> just based on the table in the Unicode 5.0 manual! > >Lex can ONLY handle bytes. Lex apparently can handle the RegEx that I >posted. I am basically defining every UTF-8 byte sequence above the >ASCII range as a valid Letter that can be used in an Identifier. >[A-Za-z_] can also be used as a Letter. **** No, you are saying "because I chose to use an obsolete piece of software, I am constrained to think in obsolete terms". Note that it makes no sense to declare every UTF-8 codepoint that translates to UTF-32 > 0xFF as a letter. Only in some fantasy world could this possibly make sense, since you are declaring that all localized digits, localized punctuation marks, and math symbols must be letters! I find it hard to imagine how you can say this with your stated goal of letting people code in their native language. Imagine if I allowed ASCII-7 identifiers like "A+B" or "$%^&", but that's what you are doing if you recognize international punctuation marks and digits as "letters". So I don't believe you have a clue about how to handle UTF-8. I'd first look into fixing lex to handle modern coding practice. Otherwise, you have a tool that doesn't solve your problem. You have to write lexical rules that recognize ONLY letters above U+00FF, and for digits, you have to allow digit sequences that include localized digits. And if I use the symbols ABCDEFGHIJ to represent some sequence of localized digits, then you should write rules that allow numbers to be written like "123" and "ABC" (try to remember that by "A" I mean the localized value of the digit "1"), but then how do you handle a sequence like A1A? Does it translate to the number we would write as "111" or is it syntactically illegal? Essentially, doing this in UTF-8 is pretty miserable, but if you don't do it, you are parsing nonsense phrases. joe **** > >> >> I realized that I have this information on a slide in my course, which is on my laptop, so >> with a little copy-and-paste-and-reformat, here's the table. Note that no massive FSM >> recognition is required to do the conversion, and it is even questionable as to whether an >> FSM is required at all! >> >> All symbols represent bits, and x, y, u, z and w are metasymbols for bits that can be >> either 0 or 1
>>
>> UTF-32 00000000 00000000 00000000 0xxxxxxx
>> UTF-16 00000000 0xxxxxxx
>> UTF-8  0xxxxxxx
>>
>> UTF-32 00000000 00000000 00000yyy yyxxxxxx
>> UTF-16 00000yyy yyxxxxxx
>> UTF-8  110yyyyy 10xxxxxx
>>
>> UTF-32 00000000 00000000 zzzzyyyy yyxxxxxx
>> UTF-16 zzzzyyyy yyxxxxxx
>> UTF-8  1110zzzz 10yyyyyy 10xxxxxx
>>
>> UTF-32 00000000 000uuuuu zzzzyyyy yyxxxxxx
>> UTF-16 110110ww wwzzzzyy 110111yy yyxxxxxx*
>> UTF-8  11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
>>
>> uuuuu = wwww + 1
>> Joseph M. Newcomer [MVP] >> email: newcomer(a)flounder.com >> Web: http://www.flounder.com >> MVP Tips: http://www.flounder.com/mvp_tips.htm > >I was aware of the encodings between UTF-8 and UTF-32; the encoding to >UTF-16 looks a little clumsy when we get to four UTF-8 bytes. Joseph M.
Newcomer [MVP] email: newcomer(a)flounder.com Web: http://www.flounder.com MVP Tips: http://www.flounder.com/mvp_tips.htm
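The "clumsy" four-byte case Peter mentions corresponds to the last row of the quoted table: subtract 0x10000 from the code point (which is exactly the uuuuu = wwww + 1 relation) and split the remaining 20 bits across a surrogate pair. A minimal sketch keyed to that table, not the conversion Joe said he would write and time:

#include <cstdint>
#include <vector>

// Encode one UTF-32 code point as UTF-16 code units.
std::vector<uint16_t> utf32_to_utf16(uint32_t cp) {
    if (cp < 0x10000)                        // rows 1-3 of the table: one unit, unchanged
        return { static_cast<uint16_t>(cp) };
    cp -= 0x10000;                           // 20 bits remain: wwwwzzzzyy yyyyxxxxxx
    uint16_t hi = 0xD800 | static_cast<uint16_t>(cp >> 10);    // 110110ww wwzzzzyy
    uint16_t lo = 0xDC00 | static_cast<uint16_t>(cp & 0x3FF);  // 110111yy yyxxxxxx
    return { hi, lo };
}

Going the other direction applies the same relation in reverse: combine the ten payload bits of each surrogate and add 0x10000 back.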