From: Peter Olcott on 14 May 2010 14:38

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message news:Xns9D78D4CBF598MihaiN(a)207.46.248.16...
>> I can't because the compiler is based on lex and yacc. I am writing a
>> simplified C++ interpreter by slightly modifying the correct lex and
>> yacc syntax for "C".
>
> Saying that the expression is used within a lex/yacc context makes a big
> difference, because that implies that there is a state machine
> somewhere that can track the context.

Possibly, but I was really only looking for a yes or no answer. Also, I am unaware of any reasonable alternative to a finite state machine for processing regular expressions. From my point of view, regular expressions and finite state machines are mutually dependent upon each other. I see no other view that could possibly be correct.

I guess no one here, or anywhere else, knows whether or not the regular expression is correct. This leaves me with the much more time-consuming option of empirical validation.

> Otherwise it is like saying "I am writing a compiler that takes C input"
> and then showing a regular expression like if|else|while|do
> that you use to detect the C keywords.
> A regexp using that will accept "bif" as input; lex will not :-)
>
> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
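Mihai's remark about `if|else|while|do` and "bif" can be tried out with a small sketch. This is only an illustration in Python's `re` module, not lex itself: `fullmatch` is used here as a rough stand-in for a scanner that must account for the whole token, and the helper names `search_hit` and `whole_token_hit` are made up for this example.

```python
import re

# The keyword pattern quoted in the thread, taken out of any scanner context.
KEYWORDS = re.compile(r"if|else|while|do")

def search_hit(text):
    """Unanchored search: True if a keyword occurs anywhere inside text."""
    return KEYWORDS.search(text) is not None

def whole_token_hit(text):
    """Anchored match: True only if the entire text is exactly one keyword."""
    return KEYWORDS.fullmatch(text) is not None

if __name__ == "__main__":
    for token in ("if", "bif", "while", "dodo"):
        print(token, search_hit(token), whole_token_hit(token))
```

The same pattern therefore gives different answers depending on how it is applied, which is why the surrounding context (library search versus a lex rule set) matters to the question.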
From: Mihai N. on 15 May 2010 06:23

> Possibly, but I was really only looking for a yes or no
> answer.

If you wanted a yes/no answer, you should have given complete info (like the fact that you are talking about a lex context). Otherwise you will very likely get a wrong answer.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
From: Peter Olcott on 15 May 2010 10:20

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message news:Xns9D7922762D9EBMihaiN(a)207.46.248.16...
>> Possibly, but I was really only looking for a yes or no
>> answer.
>
> If you wanted a yes/no answer, you should have given complete info
> (like the fact that you are talking about a lex context).
> Otherwise you will very likely get a wrong answer.

I don't see why this would be the case for a yes or no question.
From: Joseph M. Newcomer on 17 May 2010 01:40

Because without the context it is not a valid question.

For example, since this is a C++/MFC group, the question might have been in terms of a regexp library, which suggests you are using UTF-8 internally, which would be wrong.

But as stated, the question is wrong, because you are presuming an over-simplified concept of "letter", for which I have already pointed out there are failures (numbers in other languages). You would have to deal with all accent marks, and while some languages have e-umlaut, i-umlaut and y-umlaut, these are not letters in German, and so you have to take into account the localization context to determine if they really are "letters". And in Chinese, a single glyph may be a "word" and thus two of these in sequence would be syntactically illegal. So how do you define "letter"? And in some cases, the accent mark is a separate codepoint, so a separate UTF-8 encoding, but you can't combine that accent mark with any but a few letters, so the regexp does not account for these at all!

What about RTL encodings? In Hebrew, which I will simplify for NG syntax, if I wanted to write ABC it would appear as CBA because of the right-to-left nature of that language. But if I wanted to write ABC 123 DEF it would appear as FED*123$CBA where the $ represents the token that says "change to RTL" and the * represents the token that says "change to LTR". Read the Unicode documentation! (RTFM!) So if you are parsing this into tokens, is it "FED" "123" "CBA" or "ABC" "321" "DEF" or "ABC" "123" "DEF"? If you can't answer this question, then you can't ask the one about the regexp being correct.

What if I have a lexically illegal sequence of accent marks and characters? What if I have the sequence '`a? If 'a means à and `a means á (I'm not talking about the ANSI characters; here ' means U0300 and ` means U0301), what does '`a or `'a mean? Whoops, lexical error. There is no rule in your regexp that detects this, therefore, it is wrong. (In UTF-32 these would be U00000300 U00000061 and U00000301 U00000061, and in UTF-8 these would be "cc 80 61 cc 81 61".)

So the simple answer is "It is completely and utterly insufficient, and its correctness is problematic, and it does not define even what a letter is", and even if you convert to UTF-32 you have not solved this problem.
joe

So the simplest answer is "No", under no imaginable conditions is this collection of regexps even CLOSE to being usable, and even if expressed in UTF-32, there is no possible way something this overly-simplistic could be construed to make sense, and the real problem is vastly more complicated than you have imagined!
joe

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
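The byte values quoted in the post above can be checked with a short Python sketch. It keeps the post's ordering (accent mark first, then the letter); note that in Unicode a combining mark normally follows its base letter, so the last lines show that ordering as well. The helper name `utf8_hex` is made up for this illustration.

```python
import unicodedata

GRAVE = "\u0300"  # COMBINING GRAVE ACCENT
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))

if __name__ == "__main__":
    # Sequence as written in the post: mark, letter, mark, letter.
    as_posted = GRAVE + "a" + ACUTE + "a"
    print(utf8_hex(as_posted))

    # Each accent is its own code point with its own two-byte encoding.
    print(unicodedata.name(GRAVE), utf8_hex(GRAVE))
    print(unicodedata.name(ACUTE), utf8_hex(ACUTE))

    # Base letter followed by the mark composes to one code point under NFC.
    print(utf8_hex(unicodedata.normalize("NFC", "a" + GRAVE)))
```

A check that only looks at byte-level well-formedness accepts all of these sequences equally; whether a given mix of marks and letters is meaningful is a separate question from whether the bytes are valid UTF-8.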
From: Peter Olcott on 17 May 2010 10:17
On 5/17/2010 12:40 AM, Joseph M. Newcomer wrote:
> So the simplest answer is "No", under no imaginable conditions is this collection of
> regexps even CLOSE to being usable, and even if expressed in UTF-32, there is no possible
> way something this overly-simplistic could be construed to make sense, and the real
> problem is vastly more complicated than you have imagined!
> joe

You are taking the incorrect approach of assuming that if a solution does not provide support for every possible issue, then the solution does not solve the problem. The failure in this approach is that for many problems most of these issues are entirely moot.

For the purpose of creating an interpreted GUI scripting language that permits people to write GUI scripts in their native language, I only need to be able to handle UTF-8 input and make sure that it is valid UTF-8. There is no need for me to validate this any further.
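For readers who want to see what "make sure that it is valid UTF-8" can look like as a regular expression, here is a minimal Python sketch. It is not the expression under discussion, which does not appear in this part of the thread; it is a byte-level pattern written from the well-formed byte ranges in the Unicode Standard (Table 3-7), and the names `WELL_FORMED_UTF8`, `is_well_formed_utf8` and `decodes_strictly` are made up for this example. As the thread suggests, empirical validation is one way to gain confidence, so the pattern is compared against Python's strict decoder.

```python
import re

# One alternative per row of the well-formed UTF-8 byte sequence table.
WELL_FORMED_UTF8 = re.compile(
    rb"(?:[\x00-\x7f]"                      # U+0000..U+007F
    rb"|[\xc2-\xdf][\x80-\xbf]"             # U+0080..U+07FF
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"         # U+0800..U+0FFF (no overlongs)
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"  # U+1000..U+CFFF, U+E000..U+FFFF
    rb"|\xed[\x80-\x9f][\x80-\xbf]"         # U+D000..U+D7FF (no surrogates)
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"      # U+10000..U+3FFFF
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"          # U+40000..U+FFFFF
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{2}"      # U+100000..U+10FFFF
    rb")*"
)

def is_well_formed_utf8(data):
    """True if the whole byte string matches the pattern above."""
    return WELL_FORMED_UTF8.fullmatch(data) is not None

def decodes_strictly(data):
    """Reference check using Python's built-in strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

if __name__ == "__main__":
    samples = [b"abc", b"\xcc\x80a\xcc\x81a", b"\xc0\xaf",
               b"\xed\xa0\x80", b"\xf4\x90\x80\x80", b"\xe2\x82"]
    for sample in samples:
        print(sample, is_well_formed_utf8(sample), decodes_strictly(sample))
```

A pattern like this answers only the well-formedness question. It says nothing about letters, combining sequences, or text direction, which is the part of the problem the earlier posts argue about.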