Is this Regular Expression for UTF-8 Correct?? [MFC]

From: Pete Delgado on 14 May 2010 14:12

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:v5OdnTkHxKqRHXDWnZ2dnUVZ_uGdnZ2d(a)giganews.com...
>
> "Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in message
> news:uU4O0P48KHA.1892(a)TK2MSFTNGP05.phx.gbl...
>>
>> "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in message
>> news:jfvou5ll41i9818ub01a4mgbfvetg4giu1(a)4ax.com...
>>> Actually, what it does is give us another opportunity to point out how
>>> really bad this design choice is, and thus Peter can tell us all we are
>>> fools for not answering a question that should never have been asked,
>>> not because it is inappropriate for the group, but because it represents
>>> the worst-possible design decision that could be made.
>>> joe
>>
>> Come on Joe, give Mr. Olcott some credit. I'm sure that he could dream up
>> an even worse design as he did with his OCR project once he is given (and
>> ignores) input from the professionals whose input he claims to seek. ;)
>>
>>
>> -Pete
>>
>>
>
> Most often I am not looking for "input from professionals", I am looking
> for answers to specific questions.

Which is one reason why your projects consistently fail. If you have a few
days, take a look at the book "Programming Pearls" by Jon Bentley,
specifically the first chapter. Sometimes making sure you are asking the
*right* question is more important than getting an answer to a question.
You seem to have a problem with that particular concept.

>
> I now realize that every non-answer response tends to be a mask for the
> true answer of "I don't know".

In my case, you should change "I don't know" in your sentence above to: "I
don't care"...

To clarify:

* I don't care to answer off-topic questions
* I don't care to answer questions where the answer will be ignored
* I don't care to have to justify a correct answer against an incorrect
  answer
* I don't care to answer questions where the resident SME (Mihai) has
  already guided you
* I don't care to feed the trolls

HTH

-Pete

From: Peter Olcott on 14 May 2010 14:44

"Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in message
news:O8vhKE58KHA.980(a)TK2MSFTNGP04.phx.gbl...
>
>> Most often I am not looking for "input from
>> professionals", I am looking for answers to specific
>> questions.
>
> Which is one reason why your projects consistently fail.
> If you have a few

None of my projects have ever failed. Some of my projects
inherently take an enormous amount of time to complete.

> days, take a look at the book "Programming Pearls" by Jon
> Bentley, specifically the first chapter. Sometimes making
> sure you are asking the *right* question is more important
> than getting an answer to a question. You seem to have a
> problem with that particular concept.

Yes, especially in those cases where I have already thought
the problem through completely using categorically
exhaustively complete reasoning.

In those rare instances anything at all besides a direct
answer to a direct question can only be a waste of time for
me.

From: Pete Delgado on 14 May 2010 17:24

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:f-6dnTCV1ce3B3DWnZ2dnUVZ_hCdnZ2d(a)giganews.com...
>
> "Pete Delgado" <Peter.Delgado(a)NoSpam.com> wrote in message
> news:O8vhKE58KHA.980(a)TK2MSFTNGP04.phx.gbl...
>>
>>> Most often I am not looking for "input from professionals", I am looking
>>> for answers to specific questions.
>>
>> Which is one reason why your projects consistently fail. If you have a
>> few
>
> None of my projects have ever failed. Some of my projects inherently take
> an enormous amount of time to complete.

ROTFL

OK Peter.. If you say so... :-) I suppose that is the benefit of doing
development solely for your own amusement. You can take inordinate amounts of
time and not have to care if the market passes you by or if the relevancy of
the software is diminished.

>
>> days, take a look at the book "Programming Pearls" by Jon
>> Bentley, specifically the first chapter. Sometimes making sure you are
>> asking the *right* question is more important than getting an answer to a
>> question. You seem to have a problem with that particular concept.
>
> Yes, especially in those cases where I have already thought the problem
> through completely using categorically exhaustively complete reasoning.

That *sounds* nice, but if one considers your recent questions here as a
gauge of your success at reasoning out the problem and coming up with a
realistic, workable solution, it seems that your words and deeds do not
match.

>
> In those rare instances anything at all besides a direct answer to a
> direct question can only be a waste of time for me.

...which is why, long ago, I suggested that you simply hire a consultant.

-Pete

From: Bill Snyder on 14 May 2010 18:59

On Fri, 14 May 2010 17:24:27 -0400, "Pete Delgado"
<Peter.Delgado(a)NoSpam.com> wrote:

>
>"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>news:f-6dnTCV1ce3B3DWnZ2dnUVZ_hCdnZ2d(a)giganews.com...

>> None of my projects have ever failed. Some of my projects inherently take
>> an enormous amount of time to complete.
>
>ROTFL
>
>OK Peter.. If you say so... :-) I suppose that is the benefit of doing
>development solely for your own amusement. You can take inordinate amounts of
>time and not have to care if the market passes you by or if the relevancy of
>the software is diminished.

A project with an infinitely-extensible deadline can never fail;
it can only require more work.

--
Bill Snyder [This space unintentionally left blank]

From: Peter Olcott on 14 May 2010 23:42

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:68gpu599cjcsm3rjh1ptc6e9qu977smdph(a)4ax.com...
> No, an extremely verbose "You are going about this
> completely wrong".
> joe
>
> On Thu, 13 May 2010 18:14:47 -0500, "Peter Olcott"
> <NoSpam(a)OCR4Screen.com> wrote:
>
>>Ah so in other words an extremely verbose, "I don't know".
>>Let me take a different approach. Can postings on www.w3.org
>>generally be relied upon?
>>
>>"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
>>message news:gprou55bvl3rgp2qmp6v3euk20ucf865mi(a)4ax.com...
>>> See below...
>>> On Thu, 13 May 2010 15:36:24 -0500, "Peter Olcott"
>>> <NoSpam(a)OCR4Screen.com> wrote:
>>>
>>>>
>>>>"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
>>>>news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d(a)giganews.com...
>>>>> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in
>>>>> message
>>>>> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d(a)giganews.com...
>>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>>>
>>>>>> The solution is based on the GREEN portions of the first chart shown
>>>>>> on this link:
>>>>>> http://www.w3.org/2005/03/23-lex-U
>>> ****
>>> Note that in the "green" areas, we find
>>>
>>> U0482 Cyrillic thousands sign
>>> U055A Armenian apostrophe
>>> U055C Armenian exclamation mark
>>> U05C3 Hebrew punctuation SOF Pasuq
>>> U060C Arabic comma
>>> U066B Arabic decimal separator
>>> U0700-U0709 Assorted Syriac punctuation marks
>>> U0966-U096F Devanagari digits 0..9
>>> U09E6-U09EF Bengali digits 0..9
>>> U09F2-U09F3 Bengali rupee marks
>>> U0A66-U0A6F Gurmukhi digits 0..9
>>> U0AE6-U0AEF Gujarati digits 0..9
>>> U0B66-U0B6F Oriya digits 0..9
>>> U0BE6-U0BEF Tamil digits 0..9
>>> U0BF0-U0BF2 Tamil indicators for 10, 100, 1000
>>> U0BF3-U0BFA Tamil punctuation marks
>>> U0C66-U0C6F Telugu digits 0..9
>>> U0CE6-U0CEF Kannada digits 0..9
>>> U0D66-U0D6F Malayalam digits 0..9
>>> U0E50-U0E59 Thai digits 0..9
>>> U0ED0-U0ED9 Lao digits 0..9
>>> U0F20-U0F29 Tibetan digits 0..9
>>> U0F2A-U0F33 Miscellaneous Tibetan numeric symbols
>>> U1040-U1049 Myanmar digits 0..9
>>> U1360-U1368 Ethiopic punctuation marks
>>> U1369-U137C Ethiopic numeric values (digits, tens of digits, etc.)
>>> U17E0-U17E9 Khmer digits 0..9
>>> U1800-U180E Mongolian punctuation marks
>>> U1810-U1819 Mongolian digits 0..9
>>> U1946-U194F Limbu digits 0..9
>>> U19D0-U19D9 New Tai Lue digits 0..9

Do you know anywhere where I can get a table that maps all
of the code points to their category?

>>> ...at which point I realized I was wasting my time, because I was
>>> attempting to disprove what is a Really Dumb Idea, which is to write
>>> applications that actually work on UTF-8 encoded text.
>>>
>>> You are free to convert these to UTF-8, but in addition, if I've read
>>> some of the encodings correctly, the non-green areas preclude what are
>>> clearly "letters" in other languages.
>>>
>>> Forget UTF-8. It is a transport mechanism used at input and output
>>> edges. Use Unicode internally.

That is how I intend to use it. To internationalize my GUI
scripting language the interpreter will accept UTF-8 input
as its source code files. It is substantially implemented
using Lex and Yacc specifications for "C" that have been
adapted to implement a subset of C++.

It was far easier (and far less error prone) to add the C++
that I needed to the "C" specification than it would have
been to remove what I do not need from the C++
specification.

The actual language itself will store its strings as 32-bit
codepoints. The SymbolTable will not bother to convert its
strings from UTF-8. It turns out that UTF-8 byte sort order
is identical to Unicode code point sort order.
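
The sort-order claim can be checked with a short C++ sketch. The names
encode_code_point, encode_utf8 and bytes_less are hypothetical helpers, not
part of the interpreter described here. The encoder assumes every input is a
valid Unicode scalar value (no surrogates, nothing above U+10FFFF), and the
property only holds when the bytes are compared as unsigned values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
// Assumes cp is a valid scalar value; no validation is done here.
std::string encode_code_point(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: encode a whole string of code points.
std::string encode_utf8(const std::u32string& s) {
    std::string out;
    for (char32_t cp : s) out += encode_code_point(cp);
    return out;
}

// Byte-wise "less than" on UTF-8 strings. The casts matter: with a signed
// char, lead bytes such as 0xC3 would compare below ASCII.
bool bytes_less(const std::string& a, const std::string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char x, char y) {
            return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
        });
}
```

For two strings a and b of code points, bytes_less(encode_utf8(a),
encode_utf8(b)) should agree with a < b, so a symbol table keyed on the raw
UTF-8 bytes would iterate in code point order.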

I am implementing a utf8string that will provide the most
useful subset of the std::string interface. I need the
regular expression for Lex, and it can also be easily
converted into a DFA that very quickly and correctly breaks
a UTF-8 string up into its constituent code points.
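
As a rough illustration of splitting UTF-8 into code points, the sketch
below hand-codes the same byte ranges as the U2..U9 definitions quoted
further down, in place of a Lex-generated DFA. decode_utf8 is a
hypothetical name, not the actual utf8string interface, and the choice to
return false on the first malformed sequence is an assumption made for
brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: append the code points of `in` to `out`.
// Returns false at the first byte sequence outside the valid ranges.
bool decode_utf8(const std::string& in, std::vector<char32_t>& out) {
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;                 // bytes in this sequence
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of 2nd byte
        char32_t cp = 0;
        if (b0 <= 0x7F) {                    // ASCII
            len = 1; cp = b0;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {   // U2
            len = 2; cp = static_cast<char32_t>(b0 & 0x1F);
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {   // U3..U6
            len = 3; cp = static_cast<char32_t>(b0 & 0x0F);
            if (b0 == 0xE0) lo = 0xA0;       // U3: excludes overlong forms
            if (b0 == 0xED) hi = 0x9F;       // U5: excludes surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {   // U7..U9
            len = 4; cp = static_cast<char32_t>(b0 & 0x07);
            if (b0 == 0xF0) lo = 0x90;       // U7: excludes overlong forms
            if (b0 == 0xF4) hi = 0x8F;       // U9: nothing above U+10FFFF
        } else {
            return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        }
        if (n - i < len) return false;       // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if (b < lo || b > hi) return false;
            cp = (cp << 6) | static_cast<char32_t>(b & 0x3F);
            lo = 0x80; hi = 0xBF;            // later bytes use the full range
        }
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

Only the second byte of a sequence ever has a narrowed range, which is why
the nine alternatives in the regular expression reduce to a handful of
special cases on the lead byte.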

Do you know anywhere where I can get a table that maps all
of the code points to their category?

It is a shame that Microsoft will be killing this group next
month. Where will we go?

>>> ****
>>>>>>
>>>>>> A semantically identical regular expression is also found
>>>>>> on the above link under Validating lex Template
>>>>>>
>>>>>> 1 ['\u0000'-'\u007F']
>>>>>> 2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
>>>>>> 3 | ( '\u00E0' ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 5 | ( '\u00ED' ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
>>>>>> 6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 7 | ( '\u00F0' ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>> 9 | ( '\u00F4' ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
>>>>>>
>>>>>> Here is my version. The syntax is different, but the UTF8
>>>>>> portion should be semantically identical.
>>>>>>
>>>>>> UTF8_BYTE_ORDER_MARK [\xEF][\xBB][\xBF]
>>>>>>
>>>>>> ASCII [\x0-\x7F]
>>>>>>
>>>>>> U1 [a-zA-Z_]
>>>>>> U2 [\xC2-\xDF][\x80-\xBF]
>>>>>> U3 [\xE0][\xA0-\xBF][\x80-\xBF]
>>>>>> U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
>>>>>> U5 [\xED][\x80-\x9F][\x80-\xBF]
>>>>>> U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
>>>>>> U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>> U8 [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
>>>>>> U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
>>>>>>
>>>>>> UTF8 {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>
>>>>>> // This identifies the "Letter" portion of an Identifier.
>>>>>> L {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
>>>>>>
>>>>>> I guess that most of the analysis may simply boil down to
>>>>>> whether or not the original source from the link is
>>>>>> considered reliable. I had forgotten this original source
>>>>>> when I first asked this question, that is why I am
>>>>>> reposting this same question again.
>>>>>
>>>>> What has this got to do with C++? What is your C++
>>>>> language question?
>>>>>
>>>>> /Leigh
>>>>
>>>>I will be implementing a utf8string to supplement
>>>>std::string and will be using a regular expression to
>>>>quickly divide up UTF-8 bytes into Unicode CodePoints.
>>> ***
>>> For someone who had an unholy fixation on "performance", why would you
>>> choose such a slow mechanism for doing recognition?
>>>
>>> I can imagine a lot of alternative approaches, including having a table
>>> of 65,536 "character masks" for Unicode characters, including on-the-fly
>>> updating of the table, and extensions to support surrogates, which would
>>> outperform any regular expression based approach.
>>>
>>> What is your criterion for what constitutes a "letter"? Frankly, I have
>>> no interest in decoding something as bizarre as UTF-8 encodings to see
>>> if you covered the foreign delimiters, numbers, punctuation marks, etc.
>>> properly, and it makes no sense to do so. So there is no way I would
>>> waste my time trying to understand an example that should not exist at
>>> all.
>>>
>>> Why do you seem to choose the worst possible choice when there is more
>>> than one way to do something? The choices are (a) work in 8-bit ANSI
>>> (b) work in UTF-8 (c) work in Unicode. Of these, the worst possible
>>> choice is (b), followed by (a). (c) is clearly the winner.
>>>
>>> So why are you using something as bizarre as UTF-8 internally? UTF-8 has
>>> ONE role, which is to write Unicode out in an 8-bit encoding, and read
>>> Unicode in an 8-bit encoding. You do NOT want to write the program in
>>> terms of UTF-8!
>>> joe
>>> ****
>>>>
>>>>Since there are no UTF-8 groups, or even Unicode groups, I
>>>>must post these questions to groups that are at most
>>>>indirectly related to this subject matter.
>>>>
>>> Joseph M. Newcomer [MVP]
>>> email: newcomer(a)flounder.com
>>> Web: http://www.flounder.com
>>> MVP Tips: http://www.flounder.com/mvp_tips.htm
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
