Is this UTF-8 regular expression semantically correct? [MFC]


From: Peter Olcott on 14 May 2010 14:38

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D78D4CBF598MihaiN(a)207.46.248.16...
>> I can't because the
>> compiler is based on lex and yacc. I am writing a
>> simplified
>> C++ interpreter by slightly modifying the correct lex and
>> yacc syntax for "C".
>
> Saying that the expression is used in a lex/yacc context
> makes a big
> difference, because that implies that there is a state
> machine
> somewhere that can track the context.
>

Possibly, but, I was really only looking for a yes or no
answer.

Also I am unaware of any reasonable alternative to a finite
state machine for processing regular expressions. From my
point of view regular expressions and finite state machines
are mutually dependent upon each other. I see no other view
that could possibly be correct.

I guess no one here, or anywhere else, knows whether or not
the regular expression is correct. This leaves me with the
much more time-consuming option of empirical validation.

> Otherwise it is like saying
> "I am writing a compiler that takes C input"
> and then showing a regular expression like if|else|while|do
> that you use to detect the C keywords.
> A regexp using that will accept "bif" as input, lex will
> not :-)
>
>
> --
> Mihai Nita [Microsoft MVP, Visual C++]
> http://www.mihai-nita.net
> ------------------------------------------
> Replace _year_ with _ to get the real email
>
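
The "bif" point quoted above can be reproduced outside lex. The following is a minimal sketch, assuming the C++11 <regex> library stands in for "a regexp" used on its own; the function names containsKeyword and isKeyword are hypothetical and not from the thread. An unanchored search finds "if" inside "bif", whereas a whole-token match, which is closer to what a lexer decides once it has isolated a token, does not.

```cpp
#include <regex>
#include <string>

// Unanchored search: true if ANY substring of `text` matches one of the
// alternatives. This is how the bare pattern behaves without lexer context.
bool containsKeyword(const std::string& text) {
    static const std::regex keyword("if|else|while|do");
    return std::regex_search(text, keyword);
}

// Whole-token match: true only if the ENTIRE string is one of the
// alternatives. A lexer effectively makes this decision after it has
// split the input into tokens.
bool isKeyword(const std::string& token) {
    static const std::regex keyword("if|else|while|do");
    return std::regex_match(token, keyword);
}
```

With these definitions, containsKeyword("bif") is true while isKeyword("bif") is false, so the same pattern gives different answers depending on the context in which it is applied.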

From: Mihai N. on 15 May 2010 06:23

> Possibly, but, I was really only looking for a yes or no
> answer.

If you wanted a yes/no answer you should give complete info
(like the fact that you are talking about a lex context).
Otherwise you will very likely get a wrong answer.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Peter Olcott on 15 May 2010 10:20

"Mihai N." <nmihai_year_2000(a)yahoo.com> wrote in message
news:Xns9D7922762D9EBMihaiN(a)207.46.248.16...
>
>
>> Possibly, but, I was really only looking for a yes or no
>> answer.
>
> If you wanted a yes/no answer you should give complete
> info
> (like the fact that you are talking about a lex context).
> Otherwise you will very likely get a wrong answer.

I don't see why this would be the case for a yes or no
question.


From: Joseph M. Newcomer on 17 May 2010 01:40

Because without the context it is not a valid question.

For example, since this is a C++/MFC group, the question might have been in terms of a
regexp library, which suggests you are using UTF-8 internally, which would be wrong.

But as stated, the question is wrong, because you are presuming an over-simplified concept
of "letter", for which I have already pointed out there are failures (numbers in other
languages). You would have to deal with all accent marks, and while some languages have
e-umlaut, i-umlaut and y-umlaut, these are not letters in German, and so you have to take
into account the localization context to determine if they really are "letters". And in
Chinese, a single glyph may be a "word" and thus two of these in sequence would be
syntactically illegal. So how do you define "letter"? And in some cases, the accent mark
is a separate codepoint, so a separate UTF-8 encoding, but you can't combine that accent
mark with any but a few letters, so the regexp does not account for these at all!

What about RTL encodings? In Hebrew, which I will simplify for NG syntax, if I wanted to
write ABC it would appear as CBA because of the right-to-left nature of that language. But
if I wanted to write ABC 123 DEF it would appear as FED*123$CBA where the $ represents the
token that says "change to RTL" and the * represents the token that says "change to LTR".
Read the Unicode documentation! (RTFM!) So if you are parsing this into tokens, is it
"FED" "123" "CBA" or "ABC" "321" "DEF" or "ABC" "123" "DEF"? If you can't answer this
question, then you can't ask the one about the regexp being correct. What if I have a
lexically illegal sequence of accent marks and characters? What if I have the sequence
'`a? If 'a means á and `a means à (I'm not talking about the ANSI characters; here '
means U+0301 and ` means U+0300), what does '`a or `'a mean? Whoops, lexical error. There
is no rule in your regexp that detects this, therefore, it is wrong. (In UTF-32 these would
be 0x00000061 0x00000301 and 0x00000061 0x00000300, because a combining mark is stored
after its base letter, and in UTF-8 these would be "61 cc 81" and "61 cc 80".)
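
The byte values mentioned above can be checked with a small encoder. This is a minimal sketch of the standard UTF-8 encoding rule; the helper name encodeUtf8 is hypothetical and not part of any library named in this thread. It shows that U+0300 and U+0301 each encode to two bytes (cc 80 and cc 81), while U+0061 is the single byte 61.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode scalar value as UTF-8.
// Returns an empty string for values that are not scalar values
// (above U+10FFFF, or in the surrogate range U+D800..U+DFFF).
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;
    }
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Concatenating encodeUtf8(0x61) and encodeUtf8(0x0301) gives the three bytes 61 cc 81, a base letter followed by one combining mark. A byte-level regexp sees only that each sequence is well formed; it says nothing about whether the combination of marks is meaningful.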

So the simple answer is "It is completely and utterly insufficient, and its correctness is
problematic, and it does not define even what a letter is", and even if you convert to
UTF-32 you have not solved this problem.
joe

So the simplest answer is "No", under no imaginable conditions is this collection of
regexps even CLOSE to being usable, and even if expressed in UTF-32, there is no possible
way something this overly-simplistic could be construed to make sense, and the real
problem is vastly more complicated than you have imagined!
joe

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on 17 May 2010 10:17

On 5/17/2010 12:40 AM, Joseph M. Newcomer wrote:
> Because without the context it is not a valid question.
>
> For example, since this is a C++/MFC group, the question might have been in terms of a
> regexp library, which suggests you are using UTF-8 internally, which would be wrong.
>
> But as stated, the question is wrong, because you are presuming an over-simplified concept
> of "letter", for which I have already pointed out there are failures (numbers in other
> languages). You would have to deal with all accent marks, and while some languages have
> e-umlaut, i-umlaut and y-umlaut, these are not letters in German, and so you have to take
> into account the localization context to determine if they really are "letters". And in
> Chinese, a single glyph may be a "word" and thus two of these in sequence would be
> syntactically illegal. So how do you define "letter"? And in some cases, the accent mark
> is a separate codepoint, so a separate UTF-8 encoding, but you can't combine that accent
> mark with any but a few letters, so the regexp does not account for these at all!
>
> What about RTL encodings? In Hebrew, which I will simplify for NG syntax, if I wanted to
> write ABC it would appear as CBA because of the right-to-left nature of that language. But
> if I wanted to write ABC 123 DEF it would appear as FED*123$CBA where the $ represents the
> token that says "change to RTL" and the * represents the token that says "change to LTR".
> Read the Unicode documentation! (RTFM!) So if you are parsing this into tokens, is it
> "FED" "123" "CBA" or "ABC" "321" "DEF" or "ABC" "123" "DEF"? If you can't answer this
> question, then you can't ask the one about the regexp being correct. What if I have a
> lexically illegal sequence of accent marks and characters? What if I have the sequence
> '`a? If 'a means á and `a means à (I'm not talking about the ANSI characters; here '
> means U+0301 and ` means U+0300), what does '`a or `'a mean? Whoops, lexical error. There
> is no rule in your regexp that detects this, therefore, it is wrong. (In UTF-32 these would
> be 0x00000061 0x00000301 and 0x00000061 0x00000300, because a combining mark is stored
> after its base letter, and in UTF-8 these would be "61 cc 81" and "61 cc 80".)
>
> So the simple answer is "It is completely and utterly insufficient, and its correctness is
> problematic, and it does not define even what a letter is", and even if you convert to
> UTF-32 you have not solved this problem.
> joe
>
> So the simplest answer is "No", under no imaginable conditions is this collection of
> regexps even CLOSE to being usable, and even if expressed in UTF-32, there is no possible
> way something this overly-simplistic could be construed to make sense, and the real
> problem is vastly more complicated than you have imagined!
> joe

You are taking the incorrect approach of assuming that if a solution does
not provide support for every possible issue, then this solution does not
solve the problem. The failure in this approach is that for many
problems most of these issues are entirely moot.

For the purpose of creating an interpreted GUI scripting language that
permits people to write GUI scripts in their native language, I only need
to be able to handle UTF-8 input and make sure that it is valid UTF-8.
There is no need for me to validate this any further.
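
As an illustration of what "valid UTF-8" means at the byte level, here is a minimal sketch of a well-formedness check. It assumes the byte ranges from the Unicode Standard's table of well-formed UTF-8 sequences (no overlong forms, no surrogates, nothing above U+10FFFF); the function name isValidUtf8 is hypothetical, and this is not the regexp under discussion, only a reference that such a regexp could be tested against.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` consists only of well-formed UTF-8 sequences.
bool isValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) {               // ASCII, one byte
            ++i;
            continue;
        }
        std::size_t len = 0;            // total bytes in this sequence
        unsigned char lo = 0x80;        // allowed range for the second byte
        unsigned char hi = 0xBF;
        if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }   // excludes overlong forms
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }   // excludes surrogates
        else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; }
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }   // excludes overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }   // caps at U+10FFFF
        else { return false; }          // 80..C1 and F5..FF never start a sequence

        if (n - i < len) {
            return false;               // sequence truncated at end of input
        }
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) {
            return false;
        }
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) {
                return false;           // not a continuation byte
            }
        }
        i += len;
    }
    return true;
}
```

This check accepts any sequence of valid code points, including a letter followed by several combining marks. Whether such input is a sensible identifier is a separate question that byte-level validation does not address.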

