From: Peter Olcott on 11 May 2010 10:38

BYTE_ORDER_MARK  [\xEF][\xBB][\xBF]
ASCII            [\x0-\x7f]
U2               [\xC2-\xDF][\x80-\xBF]
U3               [\xE0][\xA0-\xBF][\x80-\xBF]
U4               [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5               [\xED][\x80-\x9F][\x80-\xBF]
U6               [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7               [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8               [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9               [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
U                {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}
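
[Editor's sketch] One way to check the byte ranges above empirically is to hand-code the same nine alternatives and run known good and bad sequences through them. The following is a minimal C++ sketch, not Peter's code: the function names `utf8_sequence_length` and `is_valid_utf8` are made up for illustration, and the byte-order mark is treated as an ordinary three-byte (U6) sequence rather than handled separately.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the length (1-4) of the well-formed UTF-8 sequence that starts at
// s[i], or 0 if the bytes there match none of ASCII, U2 ... U9.
// Hypothetical helper: the name and signature are assumptions for this sketch.
std::size_t utf8_sequence_length(const std::string& s, std::size_t i) {
    assert(i < s.size());
    // Byte at position k as 0..255, or -1 when k is past the end.
    auto byte = [&](std::size_t k) -> int {
        return k < s.size() ? static_cast<unsigned char>(s[k]) : -1;
    };
    auto in = [](int b, int lo, int hi) { return b >= lo && b <= hi; };
    auto cont = [&](std::size_t k) { return in(byte(k), 0x80, 0xBF); };

    const int b0 = byte(i);
    const int b1 = byte(i + 1);
    if (in(b0, 0x00, 0x7F)) return 1;                                    // ASCII
    if (in(b0, 0xC2, 0xDF)) return cont(i + 1) ? 2 : 0;                  // U2
    if (b0 == 0xE0) return (in(b1, 0xA0, 0xBF) && cont(i + 2)) ? 3 : 0;  // U3
    if (in(b0, 0xE1, 0xEC)) return (cont(i + 1) && cont(i + 2)) ? 3 : 0; // U4
    if (b0 == 0xED) return (in(b1, 0x80, 0x9F) && cont(i + 2)) ? 3 : 0;  // U5
    if (in(b0, 0xEE, 0xEF)) return (cont(i + 1) && cont(i + 2)) ? 3 : 0; // U6
    if (b0 == 0xF0)                                                      // U7
        return (in(b1, 0x90, 0xBF) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    if (in(b0, 0xF1, 0xF3))                                              // U8
        return (cont(i + 1) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    if (b0 == 0xF4)                                                      // U9
        return (in(b1, 0x80, 0x8F) && cont(i + 2) && cont(i + 3)) ? 4 : 0;
    return 0; // 0x80-0xC1 and 0xF5-0xFF can never start a sequence
}

// True if the whole string is a concatenation of sequences matching U.
bool is_valid_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t len = utf8_sequence_length(s, i);
        if (len == 0) return false;
        i += len;
    }
    return true;
}
```

The narrowed second-byte ranges are what do the real work: U3 and U7 reject overlong encodings, U5 rejects the surrogate range U+D800..U+DFFF, and U9 rejects anything above U+10FFFF.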
From: Mihai N. on 13 May 2010 06:00

> I am writing a compiler that takes UTF-8 input, so I must
> have a correct regular expression to be used by the lexical
> analyzer.

The more I look at them, the wronger they seem :-)
Really, if you write a compiler, then forger regexp.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
From: Peter Olcott on 13 May 2010 07:47

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D771E851DEFBMihaiN(a)207.46.248.16...
>> I am writing a compiler that takes UTF-8 input, so I must
>> have a correct regular expression to be used by the
>> lexical analyzer.
>
> The more I look at them, the wronger they seem :-)
> Really, if you write a compiler, then forger regexp.

I have no idea what you are saying about forger regexp. I have been able to derive a process for reverse-engineering and empirically validating the correct regular expression.

I guess that you didn't bother to look at an almost identical regular expression that has been published and critiqued for years.
http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex
From: Peter Olcott on 13 May 2010 07:50

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D771E851DEFBMihaiN(a)207.46.248.16...
>> I am writing a compiler that takes UTF-8 input, so I must
>> have a correct regular expression to be used by the
>> lexical analyzer.
>
> The more I look at them, the wronger they seem :-)
> Really, if you write a compiler, then forger regexp.

Ah, maybe you are saying forget regexp. I can't, because the compiler is based on lex and yacc. I am writing a simplified C++ interpreter by slightly modifying the correct lex and yacc syntax for "C".
From: Joseph M. Newcomer on 13 May 2010 11:21
Regular expressions are often used to define the lexical components of a language. This does not suggest that using a regexp recognizer is a sensible implementation of a compiler. In general, we build FSMs to recognize lexical elements, and PDAs (Push Down Automata, not pocket-sized little computers) to recognize syntactic elements. Often these are generated by programs such as Bison and YACC, and in many cases are just hand-written. Personally, I write my lexers as a switch-based FSM, and use recursive descent to write my parser. I throw exceptions when there are lexical or syntactic errors.

I thought I understood the question until the phrase "writing a compiler" appeared. Taking UTF-8 input is not the same as saying "I am doing lexical analysis on UTF-8 text". Typically, what I would do is take UTF-8 input and immediately convert it to Unicode, and work in terms of Unicode internally. The problem here is defining the "alphabetic" and "numeric" characters; fortunately, isalpha, isalnum, isdigit, etc. seem to be locale-aware, and you could always use the Unicode-related APIs to determine a character class. Also, check out the Unicode tab in my Locale Explorer.
joe

On Thu, 13 May 2010 03:00:01 -0700, "Mihai N." <nmihai_year_2000(a)yahoo.com> wrote:
>> I am writing a compiler that takes UTF-8 input, so I must
>> have a correct regular expression to be used by the lexical
>> analyzer.
>
>The more I look at them, the wronger they seem :-)
>Really, if you write a compiler, then forger regexp.

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
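
[Editor's sketch] The approach Joe describes (decode UTF-8 to Unicode code points up front, then run a switch-based FSM that throws on errors) might look roughly like the C++ below. This is an illustration under stated assumptions, not his code: `decode_utf8`, `lex`, `Token`, and the two token kinds are invented for the example, and the character classification is a placeholder that treats every non-ASCII code point as an identifier character where a real lexer would consult proper Unicode character-class data.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Decode UTF-8 into code points, throwing on malformed input.
// Hypothetical helper written for this sketch.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b0 < 0x80)                     { len = 1; cp = b0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + len > in.size()) throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) throw std::runtime_error("invalid continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        const bool overlong = (len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000);
        const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (overlong || surrogate || cp > 0x10FFFF)
            throw std::runtime_error("ill-formed UTF-8 sequence");
        out.push_back(cp);
        i += len;
    }
    return out;
}

enum class TokenKind { Identifier, Number };
struct Token { TokenKind kind; std::u32string text; };

// ASSUMPTION: placeholder classification, not real Unicode character classes.
bool is_ident_char(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c == U'_' || c >= 0x80;
}
bool is_digit_char(char32_t c) { return c >= U'0' && c <= U'9'; }
bool is_space_char(char32_t c) { return c == U' ' || c == U'\t' || c == U'\n' || c == U'\r'; }

// Switch-based FSM over code points; throws on a character no state accepts.
std::vector<Token> lex(const std::u32string& src) {
    enum class State { Start, InIdentifier, InNumber };
    std::vector<Token> tokens;
    std::u32string text;
    State state = State::Start;
    // The loop runs one step past the end so the final token gets flushed.
    for (std::size_t i = 0; i <= src.size();) {
        const bool at_end = (i == src.size());
        const char32_t c = at_end ? U' ' : src[i];
        switch (state) {
        case State::Start:
            if (at_end || is_space_char(c)) ++i;                 // skip whitespace
            else if (is_ident_char(c)) state = State::InIdentifier;
            else if (is_digit_char(c)) state = State::InNumber;
            else throw std::runtime_error("unexpected character");
            break;
        case State::InIdentifier:
            if (!at_end && (is_ident_char(c) || is_digit_char(c))) { text.push_back(c); ++i; }
            else { tokens.push_back({TokenKind::Identifier, text}); text.clear(); state = State::Start; }
            break;
        case State::InNumber:
            if (!at_end && is_digit_char(c)) { text.push_back(c); ++i; }
            else { tokens.push_back({TokenKind::Number, text}); text.clear(); state = State::Start; }
            break;
        }
    }
    return tokens;
}
```

Calling `lex(decode_utf8(source))` keeps all byte-level concerns in the decoder, so the FSM never sees a partial character and the token rules can be stated in terms of code points rather than byte ranges.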