From: Peter Olcott on 14 May 2010 09:27

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
news:68gpu599cjcsm3rjh1ptc6e9qu977smdph(a)4ax.com...
> No, an extremely verbose "You are going about this completely wrong".
> joe

Which still avoids rather than answers my question. This was at one time a
very effective ruse to hide the fact that you don't know the answer. I can
see through this ruse now, so there is no sense in my attempting to justify
my design decision to you. That would simply be a waste of time.

> On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>> Ah, so in other words an extremely verbose "I don't know".
>> Let me take a different approach. Can postings on www.w3.org
>> generally be relied upon?
>>
>> "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
>> news:gprou55bvl3rgp2qmp6v3euk20ucf865mi(a)4ax.com...
>>> See below...
>>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
>>> <NoSpam(a)OCR4Screen.com> wrote:
>>>>
>>>> "Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>>>> news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d(a)giganews.com...
>>>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d(a)giganews.com...
>>>>>> Is this Regular Expression for UTF-8 Correct?
>>>>>>
>>>>>> The solution is based on the GREEN portions of the first chart shown
>>>>>> on this link:
>>>>>> http://www.w3.org/2005/03/23-lex-U
>>> ****
>>> Note that in the "green" areas, we find
>>>
>>> U0482       Cyrillic thousands sign
>>> U055A       Armenian apostrophe
>>> U055C       Armenian exclamation mark
>>> U05C3       Hebrew punctuation SOF Pasuq
>>> U060C       Arabic comma
>>> U066B       Arabic decimal separator
>>> U0700-U0709 Assorted Syriac punctuation marks
>>> U0966-U096F Devanagari digits 0..9
>>> U09E6-U09EF Bengali digits 0..9
>>> U09F2-U09F3 Bengali rupee marks
>>> U0A66-U0A6F Gurmukhi digits 0..9
>>> U0AE6-U0AEF Gujarati digits 0..9
>>> U0B66-U0B6F Oriya digits 0..9
>>> U0BE6-U0BEF Tamil digits 0..9
>>> U0BF0-U0BF2 Tamil indicators for 10, 100, 1000
>>> U0BF3-U0BFA Tamil punctuation marks
>>> U0C66-U0C6F Telugu digits 0..9
>>> U0CE6-U0CEF Kannada digits 0..9
>>> U0D66-U0D6F Malayalam digits 0..9
>>> U0E50-U0E59 Thai digits 0..9
>>> U0ED0-U0ED9 Lao digits 0..9
>>> U0F20-U0F29 Tibetan digits 0..9
>>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>>> U1040-U1049 Myanmar digits 0..9
>>> U1360-U1368 Ethiopic punctuation marks
>>> U1369-U137C Ethiopic numeric values (digits, tens of digits, etc.)
>>> U17E0-U17E9 Khmer digits 0..9
>>> U1800-U180E Mongolian punctuation marks
>>> U1810-U1819 Mongolian digits 0..9
>>> U1946-U194F Limbu digits 0..9
>>> U19D0-U19D9 New Tai Lue digits 0..9
>>>
>>> ...at which point I realized I was wasting my time, because I was
>>> attempting to disprove what is a Really Dumb Idea, which is to write
>>> applications that actually work on UTF-8 encoded text.
>>>
>>> You are free to convert these to UTF-8, but in addition, if I've read
>>> some of the encodings correctly, the non-green areas preclude what are
>>> clearly "letters" in other languages.
>>>
>>> Forget UTF-8. It is a transport mechanism used at input and output
>>> edges. Use Unicode internally.
>>> ****
>>>>>>
>>>>>> A semantically identical regular expression is also found on the
>>>>>> above link under "Validating lex Template":
>>>>>>
>>>>>> 1 ['\u0000'-'\u007F']
>>>>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>>>>> 3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
>>>>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>>
>>>>>> Here is my version; the syntax is different, but the UTF-8 portion
>>>>>> should be semantically identical.
>>>>>>
>>>>>> UTF8_BYTE_ORDER_MARK  [\xEF][\xBB][\xBF]
>>>>>>
>>>>>> ASCII  [\x0-\x7F]
>>>>>>
>>>>>> U1  [a-zA-Z_]
>>>>>> U2  [\xC2-\xDF][\x80-\xBF]
>>>>>> U3  [\xE0][\xA0-\xBF][\x80-\xBF]
>>>>>> U4  [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>>>>> U5  [\xED][\x80-\x9F][\x80-\xBF]
>>>>>> U6  [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>>>>> U7  [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>> U8  [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>> U9  [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>>>>
>>>>>> UTF8  {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>
>>>>>> // This identifies the "Letter" portion of an Identifier.
>>>>>> L  {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>
>>>>>> I guess that most of the analysis may simply boil down to whether or
>>>>>> not the original source from the link is considered reliable. I had
>>>>>> forgotten this original source when I first asked this question;
>>>>>> that is why I am reposting the same question again.
>>>>>
>>>>> What has this got to do with C++? What is your C++ language question?
>>>>>
>>>>> /Leigh
>>>>
>>>> I will be implementing a utf8string to supplement std::string and will
>>>> be using a regular expression to quickly divide up UTF-8 bytes into
>>>> Unicode code points.
>>> ***
>>> For someone who had an unholy fixation on "performance", why would you
>>> choose such a slow mechanism for doing recognition?
>>>
>>> I can imagine a lot of alternative approaches, including having a table
>>> of 65,536 "character masks" for Unicode characters, including
>>> on-the-fly updating of the table, and extensions to support surrogates,
>>> which would outperform any regular-expression-based approach.
>>>
>>> What is your criterion for what constitutes a "letter"? Frankly, I have
>>> no interest in decoding something as bizarre as UTF-8 encodings to see
>>> if you covered the foreign delimiters, numbers, punctuation marks, etc.
>>> properly, and it makes no sense to do so. So there is no way I would
>>> waste my time trying to understand an example that should not exist at
>>> all.
>>>
>>> Why do you seem to choose the worst possible choice when there is more
>>> than one way to do something? The choices are (a) work in 8-bit ANSI,
>>> (b) work in UTF-8, (c) work in Unicode. Of these, the worst possible
>>> choice is (b), followed by (a). (c) is clearly the winner.
>>>
>>> So why are you using something as bizarre as UTF-8 internally? UTF-8
>>> has ONE role, which is to write Unicode out in an 8-bit encoding, and
>>> read Unicode in an 8-bit encoding. You do NOT want to write the program
>>> in terms of UTF-8!
>>> joe
>>> ****
>>>>
>>>> Since there are no UTF-8 groups, or even Unicode groups, I must post
>>>> these questions to groups that are at most indirectly related to this
>>>> subject matter.
>>>>
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Peter Olcott on 14 May 2010 09:36

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D78F42C2233MihaiN(a)207.46.248.16...
>
>> I can imagine a lot of alternative approaches, including having a table
>> of 65,536 "character masks" for Unicode characters
>
> As we know, 65,536 (FFFF) is not enough; Unicode code points go to
> 10FFFF :-)
>
>> What is your criterion for what constitutes a "letter"?
>
> The best way to attack the identification is by using Unicode properties.
> Each code point has attributes indicating if it is a letter
> (General Category).
>
> A good starting point is this:
> http://unicode.org/reports/tr31/tr31-1.html
>
> But this only shows that basing that on some UTF-8 kind of thing is not
> the way. And how are you going to deal with combining characters?
> Normalization?

I am going to handle this simplistically. Every code point above the ASCII
range will be considered an alphanumeric character. Eventually I will
augment this to further divide these code points into smaller categories.
Unicode is supposed to have a way to do this, but I never could find
anything as simple as a table of the mapping of Unicode code points to
their category.

> There are very good reasons why the rule of thumb is:
> - UTF-16 or UTF-32 for processing
> - UTF-8 for storage/exchange
>
> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
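[As it happens, the simple table Olcott says he could not find does exist:
UnicodeData.txt in the Unicode Character Database maps each assigned code
point to its General Category, which is what TR31's identifier rules are
built from. Below is a rough C++ sketch of loading it; the function name is
mine, and the ranged "First/Last" entries are deliberately left unhandled.]

#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Builds a code point -> General Category map ("Lu", "Ll", "Nd", ...) from
// UnicodeData.txt (http://unicode.org/Public/UNIDATA/UnicodeData.txt).
// Each line is semicolon-separated: field 0 is the code point in hex,
// field 2 is the General Category.
// Caveat (simplification): large CJK/Hangul ranges appear as paired
// "..., First>" / "..., Last>" lines and would need expansion in real use.
std::map<char32_t, std::string> LoadGeneralCategories(const char* path)
{
    std::map<char32_t, std::string> categories;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty())
            continue;
        std::istringstream fields(line);
        std::string cp, name, category;
        if (std::getline(fields, cp, ';') &&
            std::getline(fields, name, ';') &&
            std::getline(fields, category, ';'))
            categories[static_cast<char32_t>(std::stoul(cp, nullptr, 16))]
                = category;
    }
    return categories;
}

[With such a table in hand, "is this a letter" becomes a category test on
the decoded code point (the L* categories, plus Nl and a few others per
TR31), rather than anything expressible over raw UTF-8 bytes -- which is
Mihai's point.]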
From: Peter Olcott on 14 May 2010 09:41

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D781110C7D27MihaiN(a)207.46.248.16...
>> Can postings on www.w3.org generally be relied upon?
>
> For official documents, in general yes.
> Unless it is some private post that says something like:
> "It is not endorsed by the W3C members, team, or any working group."
> (see http://www.w3.org/2005/03/23-lex-U)
>
> And it also does not mean that a solution that is good enough for some
> basic UTF-8 validation of HTML is the right tool for writing a compiler.

I am internationalizing the language that I am creating within the
timeframe that I have. UTF-8 is the standard encoding for internet
applications. It works across every platform equally well without
adaptation. It does not care about little or big endian; it simply works
everywhere correctly.

> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
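[Both of Mihai's rules of thumb can hold at once: keep UTF-8 in files and
on the wire, and convert to code points at the input edge. Here is a
minimal sketch of that edge conversion, assuming the bytes have already
passed a well-formedness check such as the one sketched earlier; the
function name and the trust-the-input policy are illustrative assumptions,
not anything from the thread.]

#include <cstddef>
#include <string>
#include <vector>

// Converts well-formed UTF-8 to a sequence of code points (UTF-32).
// Because UTF-8 is a byte stream, the result is identical on little- and
// big-endian machines, with no BOM or byte swapping -- the portability
// property claimed above. Input is assumed already validated; feeding an
// ill-formed sequence to this sketch would read out of bounds.
std::vector<char32_t> DecodeValidUtf8(const std::string& bytes)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len;
        char32_t cp;
        if      (b0 < 0x80) { cp = b0;        len = 1; }  // 0xxxxxxx: ASCII
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }  // 110xxxxx
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }  // 1110xxxx
        else                { cp = b0 & 0x07; len = 4; }  // 11110xxx
        for (std::size_t k = 1; k < len; ++k)   // 10xxxxxx trailing bytes
            cp = (cp << 6)
               | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

[Internal processing then operates on char32_t values, and UTF-8 reappears
only when writing output -- the division of labor both Newcomer and Mihai
are advocating.]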
From: Pete Delgado on 14 May 2010 12:38

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
news:jfvou5ll41i9818ub01a4mgbfvetg4giu1(a)4ax.com...
> Actually, what it does is give us another opportunity to point out how
> really bad this design choice is, and thus Peter can tell us all we are
> fools for not answering a question that should never have been asked, not
> because it is inappropriate for the group, but because it represents the
> worst possible design decision that could be made.
> joe

Come on Joe, give Mr. Olcott some credit. I'm sure that he could dream up
an even worse design, as he did with his OCR project, once he is given
(and ignores) input from the professionals whose input he claims to seek. ;)

-Pete
From: Peter Olcott on 14 May 2010 12:53
"Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in message news:uU4O0P48KHA.1892(a)TK2MSFTNGP05.phx.gbl... > > "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in > message news:jfvou5ll41i9818ub01a4mgbfvetg4giu1(a)4ax.com... >> Actually, what it does is give us another opportunity to >> point how how really bad this >> design choice is, and thus Peter can tell us all we are >> fools for not answering a question >> that should never have been asked, not because it is >> inappropriate for the group, but >> because it represents the worst-possible-design decision >> that could be made. >> joe > > Come on Joe, give Mr. Olcott some credit. I'm sure that he > could dream up an even worse design as he did with his OCR > project once he is given (and ignores) input from the > professionals whos input he claims to seek. ;) > > > -Pete > > Most often I am not looking for "input from professionals", I am looking for answers to specific questions. I now realize that every non-answer response tends to be a mask for the true answer of "I don't know". |