Is this UTF-8 regular expression semantically correct? [MFC]

From: Peter Olcott on 11 May 2010 10:38

BYTE_ORDER_MARK [\xEF][\xBB][\xBF]
ASCII [\x0-\x7f]
U2 [\xC2-\xDF][\x80-\xBF]
U3 [\xE0][\xA0-\xBF][\x80-\xBF]
U4 [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5 [\xED][\x80-\x9F][\x80-\xBF]
U6 [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7 [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8 [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9 [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
U {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
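
One way to check the table above empirically is to hand-code the same byte ranges and run known good and bad sequences through them. The following is only a sketch under that assumption: the function names utf8_sequence_length and is_valid_utf8 are made up for illustration, and the ranges are copied from the ASCII and U2 to U9 definitions rather than from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence that
// starts at s[i], or 0 if the bytes match none of ASCII / U2..U9 above.
std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    // Byte at index k as 0..255, or -1 past the end of the string.
    auto byte = [&s](std::size_t k) -> int {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : -1;
    };
    auto between = [](int b, int lo, int hi) { return lo <= b && b <= hi; };
    // Ordinary continuation byte [\x80-\xBF].
    auto cont = [&](std::size_t k) { return between(byte(k), 0x80, 0xBF); };

    const int b0 = byte(i);
    if (between(b0, 0x00, 0x7F)) return 1;                                  // ASCII
    if (between(b0, 0xC2, 0xDF)) return cont(i + 1) ? 2 : 0;                // U2
    if (b0 == 0xE0)                                                         // U3
        return (between(byte(i + 1), 0xA0, 0xBF) && cont(i + 2)) ? 3 : 0;
    if (between(b0, 0xE1, 0xEC) || between(b0, 0xEE, 0xEF))                 // U4, U6
        return (cont(i + 1) && cont(i + 2)) ? 3 : 0;
    if (b0 == 0xED)                                                         // U5
        return (between(byte(i + 1), 0x80, 0x9F) && cont(i + 2)) ? 3 : 0;
    if (b0 == 0xF0)                                                         // U7
        return (between(byte(i + 1), 0x90, 0xBF) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    if (between(b0, 0xF1, 0xF3))                                            // U8
        return (cont(i + 1) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    if (b0 == 0xF4)                                                         // U9
        return (between(byte(i + 1), 0x80, 0x8F) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    return 0;  // 0x80-0xC1 and 0xF5-0xFF never start a sequence
}

// True if the whole string is a concatenation of sequences matching U.
bool is_valid_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}
```

With these ranges, overlong forms such as C0 80, the surrogate range starting at ED A0 80, and anything above F4 8F BF BF are rejected, which is what the U3, U5, and U9 rows are there for.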

From: Mihai N. on 13 May 2010 06:00

> I am writing a compiler that takes UTF-8 input, so I must
> have a correct regular expression to be used by the lexical
> analyzer.

The more I look at them, the wronger they seem :-)
Really, if you write a compiler, then forger regexp.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Peter Olcott on 13 May 2010 07:47

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D771E851DEFBMihaiN(a)207.46.248.16...
>
>> I am writing a compiler that takes UTF-8 input, so I must
>> have a correct regular expression to be used by the lexical
>> analyzer.
>
> The more I look at them, the wronger they seem :-)
> Really, if you write a compiler, then forger regexp.
>

I have no idea what you are saying about forger regexp.
I have been able to derive a process for reverse-engineering
and empirically validating the correct regular expression.

I guess that you didn't bother to look at an almost
identical regular expression that has been published and
critiqued for years.
http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex

From: Peter Olcott on 13 May 2010 07:50

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D771E851DEFBMihaiN(a)207.46.248.16...
>
>> I am writing a compiler that takes UTF-8 input, so I must
>> have a correct regular expression to be used by the lexical
>> analyzer.
>
> The more I look at them, the wronger they seem :-)
> Really, if you write a compiler, then forger regexp.

Ah, maybe you are saying "forget regexp". I can't, because the
compiler is based on lex and yacc. I am writing a simplified
C++ interpreter by slightly modifying the correct lex and
yacc syntax for C.

From: Joseph M. Newcomer on 13 May 2010 11:21

Regular expressions are often used to define the lexical components of a language.

This does not suggest that using a regexp recognizer is a sensible implementation of a
compiler.

In general, we build FSMs to recognize lexical elements, and PDAs (Push Down Automata, not
pocket-sized little computers) to recognize syntactic elements.

Often these are generated by programs such as Bison and YACC, and in many cases are just
hand-written. Personally, I write my lexers as a switch-based FSM, and use recursive
descent to write my parser. I throw exceptions when there are lexical or syntactic
errors.
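
A switch-based FSM lexer of the kind described above might look roughly like the sketch below. It is not Joe's actual code: the token kinds, the State enum, and the tokenize function are invented for illustration, and it only handles ASCII identifiers and decimal numbers separated by whitespace.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

enum class TokenKind { Identifier, Number };  // hypothetical token set

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits src into identifiers and decimal numbers; throws std::runtime_error
// on a lexical error.
std::vector<Token> tokenize(const std::string& src) {
    enum class State { Start, InIdentifier, InNumber };
    std::vector<Token> tokens;
    State state = State::Start;
    std::string text;

    // Run one step past the end, treating end of input as a space, so that
    // a token still being accumulated is flushed.
    for (std::size_t i = 0; i <= src.size();) {
        const unsigned char c =
            i < src.size() ? static_cast<unsigned char>(src[i]) : ' ';
        switch (state) {
        case State::Start:
            if (std::isspace(c)) {
                ++i;                              // skip whitespace
            } else if (std::isalpha(c) || c == '_') {
                state = State::InIdentifier;      // re-examine c in the new state
            } else if (std::isdigit(c)) {
                state = State::InNumber;
            } else {
                throw std::runtime_error("unexpected character at offset " +
                                         std::to_string(i));
            }
            break;
        case State::InIdentifier:
            if (std::isalnum(c) || c == '_') {
                text += static_cast<char>(c);
                ++i;
            } else {
                tokens.push_back({TokenKind::Identifier, text});
                text.clear();
                state = State::Start;
            }
            break;
        case State::InNumber:
            if (std::isdigit(c)) {
                text += static_cast<char>(c);
                ++i;
            } else if (std::isalpha(c) || c == '_') {
                throw std::runtime_error("malformed number at offset " +
                                         std::to_string(i));
            } else {
                tokens.push_back({TokenKind::Number, text});
                text.clear();
                state = State::Start;
            }
            break;
        }
    }
    return tokens;
}
```

For example, tokenize("foo 42 _bar9") yields an identifier, a number, and an identifier, while tokenize("12ab") throws. A real lexer would add operators, comments, and string literals as further states.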

I thought I understood the question until the phrase "writing a compiler" appeared.

Taking UTF-8 input is not the same as saying "I am doing lexical analysis on UTF-8 text".
Typically, what I would do is take UTF-8 input and immediately convert it to Unicode, and
work in terms of Unicode internally.
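
On Windows the conversion would typically go through MultiByteToWideChar with CP_UTF8. As a portable illustration of the same idea, here is a sketch that decodes UTF-8 into code points up front so that the lexer never sees raw bytes. The function name decode_utf8 and the choice of std::u32string are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into Unicode code points and throws
// std::invalid_argument on malformed input.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        char32_t lowest = 0;     // smallest code point allowed for this length
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1; lowest = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2; lowest = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3; lowest = 0x10000;
        } else {
            throw std::invalid_argument("invalid UTF-8 lead byte");
        }
        if (in.size() - i <= extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates, and values past U+10FFFF.
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("code point not allowed in UTF-8");
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Once the input is a sequence of code points, the lexer's character tests work on whole characters rather than on lead and continuation bytes.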

The problem here is defining the "alphabetic" and "numeric" characters; fortunately,
isalpha, isalnum, isdigit, etc. seem to be locale-aware, and you could always use the
Unicode-related APIs to determine a character class. Also, check out the Unicode tab in
my Locale Explorer.
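
As a rough illustration of deferring character classes to the runtime, the sketch below decides whether a code point may start an identifier. The function name is_identifier_start and the identifier rule itself are assumptions for this example; for anything beyond ASCII it simply asks std::iswalpha, whose answer depends on the current C locale and on the platform.

```cpp
#include <cassert>
#include <cwchar>
#include <cwctype>

// Hypothetical helper: may this code point start an identifier?
bool is_identifier_start(char32_t cp) {
    if (cp == U'_') return true;
    if (cp < 0x80) {
        // ASCII letters are decided directly, independent of the locale.
        return (cp >= U'A' && cp <= U'Z') || (cp >= U'a' && cp <= U'z');
    }
    // A code point that does not fit in one wchar_t (for example anything
    // outside the BMP where wchar_t is 16 bits) is conservatively rejected.
    if (cp > static_cast<char32_t>(WCHAR_MAX)) return false;
    // Beyond ASCII, defer to the current locale's notion of "alphabetic".
    return std::iswalpha(static_cast<std::wint_t>(cp)) != 0;
}
```

Because the non-ASCII branch is locale-dependent, a compiler that needs identical behavior on every machine may prefer a fixed table of ranges instead.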
joe

On Thu, 13 May 2010 03:00:01 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:

>
>> I am writing a compiler that takes UTF-8 input, so I must
>> have a correct regular expression to be used by the lexical
>> analyzer.
>
>The more I look at them, the wronger they seem :-)
>Really, if you write a compiler, then forger regexp.

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
