Is this UTF-8 regular expression semantically correct? [MFC]

From: Peter Olcott on 20 May 2010 10:51

On 5/20/2010 2:17 AM, Mihai N. wrote:
>
>> I thought that it choked on anything besides ASCII. So are you implying
>> that it can take Unicode within any encoding?
>
> Can take Unicode in some Unicode form.
> It can take the Unicode form accepted by the compiler.
> Some compilers understand UTF-16, some understand UTF-8,
> some understand none.
> But even the ones that don't understand anything other than ASCII
> should still accept the escaped form (\uXXXX)
>
> int x\u6565rap = 123;
> is a perfectly valid name.
>
> (and if the compiler accepts utf-8 or utf-16, you can use some
> human readable form)
>
>
>
Ah so my idea to allow UTF-8 encoded identifiers is really not all that
bad.
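
For illustration, the escaped form Mihai describes can be written so that it compiles regardless of the source file's encoding. This is only a minimal sketch, assuming a compiler that accepts universal-character-names in identifiers (recent GCC, Clang, and MSVC releases generally do); the helper names `read_escaped_identifier` and `check_escaped_identifier` are hypothetical and not from the thread.

```cpp
#include <cassert>

// The identifier below embeds the universal-character-name for U+6565,
// so only ASCII bytes appear in the source file.
int x\u6565rap = 123;

// Hypothetical helper: reads the variable back through the same spelling.
int read_escaped_identifier() {
    return x\u6565rap;
}

// Small self-check that the escaped name really designates the variable.
void check_escaped_identifier() {
    assert(read_escaped_identifier() == 123);
}
```

If the compiler also accepts UTF-8 source, the same identifier could be typed with the literal character in place of the escape, which is the human-readable form mentioned above.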

From: Peter Olcott on 20 May 2010 11:24

On 5/20/2010 2:39 AM, Joseph M. Newcomer wrote:
> Note that an identifier is defined as incorporating "other implementation-defined
> characters". If someone is claiming to extend C syntax to include localized letters, then
> it should be philosophically consistent with the localized environment and define letters
> to be consistent with that environment, or alternatively, be inclusive and include all
> letters in all localized environments. Letters in a localized environment would not
> include digits in a localized environment, punctuation marks of a localized environment,
> etc.
>

It has already taken me 12 years (since 1998) and I still don't have a
product. If I take the time to learn about and do all of these little
things, I will be dead before I am done.

> Peter makes one of the common mistakes he is so fond of: he fastens on ONE implementation
> by ONE vendor and makes a claim that it is DEFINITIVE. You can't even argue that Intel's
> C++ compiler or gcc "prove" that this is true for ALL compilers, since they are intended
> to be clones of each other and historically they all date back to the PDP-11 C++ compiler
> which only used ASCII-7, so they are clones of that, except for extending the syntax to
> more modern constructs. So he comes along and says "I'm going to extend this" and as soon
> as I point out that the extensions have serious problems, he says "but the regular C++
> language does not work that way!" which seems to beg the question of what is meant by
> creating an extension that meets the requirements of allowing "native language". There
> are interesting questions about accent marks, vowel marks, combining characters, localized
> punctuation, localized digits, etc., but when I raised these, I was informed that the
> extensions to support "native language coding" did NOT mean "support native language
> coding" but meant "support something that allows native language programmers to write
> identifiers in their native language that don't even make sense lexically in the native
> language", and while making claims about how fast the recognizer is, refuse to limit the
> productions because the copy-and-paste lex rules would actually require WORK to make them
> correct, so he argues that it is not "convenient" to do it right.

I err on the side of abstracting out too many details, and you err on
the side of including so many details that all of the development budget
would be eaten up by the feasibility study.

> I guess I don't respect doing a job wrong, and rationalizations that say "wrong is OK,
> because whatever it is that I have defined is necessarily right, whether it is right or
> not". There are some VERY interesting questions about combining accent marks and
> combining characters, but if we ignore those, there is ZERO excuse for not writing
> productions based on localized letters or digits (other than the copy-and-paste solution
> no longer works!) because it cannot POSSIBLY affect the performance of the lexer! He even
> says it can't, so the only remaining reason is the need to actually THINK about the
> problem, instead of accepting an unsanctioned and unsupported regexp rule set

If all that needs to be done is to map some local code points to some
ASCII characters this may be implemented before I release my GUI
scripting language.

The grammatical productions would still be written using the ASCII
character set. The lexical specification would map differing character
sets to their corresponding ASCII equivalents.

It is the rat's nest of complexity of grapheme clusters that causes me to
say whoa, too much, let's stop here.

> Note that the lexical rules require that localized characters be mapped to the base
> character set, so a thai digit character should map to the corresponding 0..9 value, and a
> conforming compiler that allowed Thai input would do so because the C++ standard requires
> that it do so. So his argument about why his extended C++ does not have to treat a
> localized comma as a comma or a localized semicolon as a semicolon does not make sense;
> the standard says that the input character set is implementation-specific but must map to
> the base character set,

That sounds reasonable and relatively easy.
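
A rough sketch of the mapping idea discussed here, for a lexer that has already decoded its input into code points: localized digits and punctuation are folded to their ASCII counterparts before tokenizing. The function names `to_base` and `normalize_to_base` are hypothetical, and the table covers only a handful of code points chosen for illustration (Thai and Arabic-Indic digits, and a few localized commas and semicolons). It is not a statement of what any particular C++ compiler does.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: map one localized digit or punctuation code point to
// its ASCII counterpart; every other code point is returned unchanged.
char32_t to_base(char32_t c) {
    if (c >= 0x0E50 && c <= 0x0E59)   // Thai digits zero..nine
        return static_cast<char32_t>(U'0' + (c - 0x0E50));
    if (c >= 0x0660 && c <= 0x0669)   // Arabic-Indic digits zero..nine
        return static_cast<char32_t>(U'0' + (c - 0x0660));
    switch (c) {
        case 0x055D: return U',';     // ARMENIAN COMMA
        case 0x060C: return U',';     // ARABIC COMMA
        case 0xFF0C: return U',';     // FULLWIDTH COMMA
        case 0x061B: return U';';     // ARABIC SEMICOLON
        case 0xFF1B: return U';';     // FULLWIDTH SEMICOLON
        default:     return c;
    }
}

// Apply to_base to every code point of an already-decoded input string.
std::u32string normalize_to_base(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) out.push_back(to_base(c));
    return out;
}

// Small self-check: Thai one, Armenian comma, Thai two folds to "1,2".
void check_normalize_to_base() {
    assert(normalize_to_base(U"\u0E51\u055D\u0E52") == U"1,2");
}
```

With this step in front of the tokenizer, the grammar itself can stay written in ASCII, which is the arrangement described in the message above. A production table would be generated from the Unicode data files rather than typed by hand.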

> so the argument that if I treat the following sequence in some
> language "A,B" that if I use a localized comma with localized letters this is, by his
> rules, necessarily an identifier. It means that under the mapping requirements it does
> not translate and therefore his assertion is (no big surprise here) gibberish.

Now you have finally explained your view sufficiently so that I can see
what you are saying makes perfect sense.

> But why are we arguing over this? We KNOW his design is wrong; only HE can defend his bad
> decisions by rationalizing them to himself. The first time a customer programmer
> complains "But you SAID your extensions supported UTF-8 input, and I wrote this code in my
> native language and it is correct" he can explain to a PAYING CUSTOMER why his
> implementation makes no sense.

I wish you had explained it this well earlier on; it would have
avoided a lot of wasted time.

This is still probably out of scope for my first release. My first
release will probably only support ASCII. The idea of home grown capitol
is to get some sales quickly and as these sales provide self sufficient
positive cash flow, then proceed with additional development.

>
> Note that you should also cite section 2.3 and the footnotes on page 16.
> joe
>
> On Thu, 20 May 2010 00:59:44 -0400, "Pete Delgado"<Peter.Delgado(a)NoSpam.com> wrote:
>
>>
>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>> news:gomdnfY9-INibm7WnZ2dnUVZ_sEAAAAA(a)giganews.com...
>>> On 5/19/2010 12:55 AM, Mihai N. wrote:
>>>>
>>>>> So C++ can take UTF-8 Identifiers?
>>>>
>>>> No, it can take Unicode identifiers.
>>>> The exact transformation format is not relevant.
>>>>
>>>>
>>> I thought that it choked on anything besides ASCII. So are you implying
>>> that it can take Unicode within any encoding?
>>
>> *Read* the C++ standards documents. It explains *everything*. For
>> information about identifiers, see section 2.11. There are draft copies of
>> the current proposed standard available for free :
>>
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf
>>
>> There is no need to "imply" anything. As usual, Mihai is correct in matters
>> such as this and your "thought" was wrong.
>>
>> -Pete
>>
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Joseph M. Newcomer on 20 May 2010 12:38

On Thu, 20 May 2010 10:24:13 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/20/2010 2:39 AM, Joseph M. Newcomer wrote:
>> Note that an identifier is defined as incorporating "other implementation-defined
>> characters". If someone is claiming to extend C syntax to include localized letters, then
>> it should be philosophically consistent with the localized environment and define letters
>> to be consistent with that environment, or alternatively, be inclusive and include all
>> letters in all localized environments. Letters in a localized environment would not
>> include digits in a localized environment, punctuation marks of a localized environment,
>> etc.
>>
>
>It has already taken me 12 years (since 1998) and I still don't have a
>product. If I take the time to learn about and do all of these little
>things, I will be dead before I am done.
****
How is it that other people have managed to gain knowledge of lots of topics, such as
virtual memory, threading, computer architectures, etc. and are still alive? I'm pretty
good, but people like Nigel Horspool are incredibly knowledgeable, and he's younger than I
am. In fact, I hang out with a group of incredibly knowledgeable people, more than half
of whom are much younger than I am, and they have not had any problem learning new things
very quickly. IFIP working group 2.4. See http://wg24.cs.uvic.ca/ContentWG24.shtml .
****
>
>> Peter makes one of the common mistakes he is so fond of: he fastens on ONE implementation
>> by ONE vendor and makes a claim that it is DEFINITIVE. You can't even argue that Intel's
>> C++ compiler or gcc "prove" that this is true for ALL compilers, since they are intended
>> to be clones of each other and historically they all date back to the PDP-11 C++ compiler
>> which only used ASCII-7, so they are clones of that, except for extending the syntax to
>> more modern constructs. So he comes along and says "I'm going to extend this" and as soon
>> as I point out that the extensions have serious problems, he says "but the regular C++
>> language does not work that way!" which seems to beg the question of what is meant by
>> creating an extension that meets the requirements of allowing "native language". There
>> are interesting questions about accent marks, vowel marks, combining characters, localized
>> punctuation, localized digits, etc., but when I raised these, I was informed that the
>> extensions to support "native language coding" did NOT mean "support native language
>> coding" but meant "support something that allows native language programmers to write
>> identifiers in their native language that don't even make sense lexically in the native
>> language", and while making claims about how fast the recognizer is, refuse to limit the
>> productions because the copy-and-paste lex rules would actually require WORK to make them
>> correct, so he argues that it is not "convenient" to do it right.
>
>I err on the side of abstracting out too many details, and you err on
>the side of including so many details that all of the development budget
>would be eaten up by the feasibility study.
****
What "feasibility study"? Why would this even enter the discussion? The "feasibility" is
to look at the Unicode character set and write a set of lex productions that include only
those sequences called "letters". An analogous study of "digits" can allow writing lex
rules about "digits". There. That's the "feasibility study"; the result: it is trivial
to do. So where is the problem? How many people need to form a committee, and are they
producing a printed report to management, and what is their timeline for this complex
study? Oh, I missed the fact that a one-person project comes with a built-in bureaucracy.
****
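
As a rough illustration of the point about writing separate productions for letters and digits, the sketch below checks identifiers against explicit range tables and compares the result with the "everything that is not ASCII-7 is a letter" rule. All names here (`Range`, `kLetters`, `kDigits`, `is_identifier`, `naive_is_identifier`) are hypothetical, and the tables list only a few ranges (ASCII, Armenian letters, Thai consonants, Thai and Arabic-Indic digits) as an assumption for the example; a real rule set would be derived from the Unicode character database.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct Range { char32_t lo, hi; };

// Illustrative "letter" ranges only; far from complete.
static const Range kLetters[] = {
    {U'A', U'Z'}, {U'a', U'z'}, {U'_', U'_'},
    {0x0531, 0x0556},   // Armenian capital letters
    {0x0561, 0x0587},   // Armenian small letters
    {0x0E01, 0x0E2E},   // Thai consonants
};

// Illustrative "digit" ranges only.
static const Range kDigits[] = {
    {U'0', U'9'},
    {0x0660, 0x0669},   // Arabic-Indic digits
    {0x0E50, 0x0E59},   // Thai digits
};

template <std::size_t N>
bool in_any(const Range (&table)[N], char32_t c) {
    for (const Range& r : table)
        if (c >= r.lo && c <= r.hi) return true;
    return false;
}

bool is_letter(char32_t c) { return in_any(kLetters, c); }
bool is_digit(char32_t c)  { return in_any(kDigits, c); }

// identifier := letter (letter | digit)*
bool is_identifier(const std::u32string& s) {
    if (s.empty() || !is_letter(s[0])) return false;
    for (std::size_t i = 1; i < s.size(); ++i)
        if (!is_letter(s[i]) && !is_digit(s[i])) return false;
    return true;
}

// The rule being criticized: any code point outside ASCII counts as a letter.
bool naive_is_identifier(const std::u32string& s) {
    if (s.empty()) return false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        bool ascii_letter = (c >= U'A' && c <= U'Z') ||
                            (c >= U'a' && c <= U'z') || c == U'_';
        bool ascii_digit  = (c >= U'0' && c <= U'9');
        bool non_ascii    = (c >= 0x80);
        if (!(ascii_letter || non_ascii || (i > 0 && ascii_digit))) return false;
    }
    return true;
}

// Two Armenian letters joined by ARMENIAN COMMA (U+055D): the table-driven
// rule rejects the sequence, the naive rule accepts it as one identifier.
void check_identifier_rules() {
    assert(!is_identifier(U"\u0531\u055D\u0532"));
    assert(naive_is_identifier(U"\u0531\u055D\u0532"));
}
```

The two rules do the same amount of work per code point in this sketch; they differ only in which ranges are listed, which is the sense in which tightening the productions should not slow the recognizer.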
>
>> I guess I don't respect doing a job wrong, and rationalizations that say "wrong is OK,
>> because whatever it is that I have defined is necessarily right, whether it is right or
>> not". There are some VERY interesting questions about combining accent marks and
>> combining characters, but if we ignore those, there is ZERO excuse for not writing
>> productions based on localized letters or digits (other than the copy-and-paste solution
>> no longer works!) because it cannot POSSIBLY affect the performance of the lexer! He even
>> says it can't, so the only remaining reason is the need to actually THINK about the
>> problem, instead of accepting an unsanctioned and unsupported regexp rule set
>
>If all that needs to be done is to map some local code points to some
>ASCII characters this may be implemented before I release my GUI
>scripting language.
****
But why did you have to argue about this?
****
>
>The grammatical productions would still be written using the ASCII
>character set. The lexical specification would map differing character
>sets to their corresponding ASCII equivalents.
****
If you believe that convert-to-base-language argument, yes, and you need to read the C++
standard for what is meant by the "base character set". I did, it took me perhaps 10
minutes. I have no idea how long it would take your feasibility study committee.
****
>
>It is the rat's nest of complexity of grapheme clusters that causes me to
>say whoa, too much, let's stop here.
****
Yes, and you can make some statements about that. For example, I've been told that in
Hebrew vowel marks are considered redundant, so you might tackle such issues
incrementally as you get feedback from various users. But this is not the same as saying
"everything that is not ASCII-7 is a letter".
****
>
>
>> Note that the lexical rules require that localized characters be mapped to the base
>> character set, so a thai digit character should map to the corresponding 0..9 value, and a
>> conforming compiler that allowed Thai input would do so because the C++ standard requires
>> that it do so. So his argument about why his extended C++ does not have to treat a
>> localized comma as a comma or a localized semicolon as a semicolon does not make sense;
>> the standard says that the input character set is implementation-specific but must map to
>> the base character set,
>
>That sounds reasonable and relatively easy.
****
Yes, and if you had taken the ten minutes to read the standard, you would have realized
that, too!
****
>
>> so the argument that if I treat the following sequence in some
>> language "A,B" that if I use a localized comma with localized letters this is, by his
>> rules, necessarily an identifier. It means that under the mapping requirements it does
>> not translate and therefore his assertion is (no big surprise here) gibberish.
>
>Now you have finally explained your view sufficiently so that I can see
>what you are saying makes perfect sense.
****
But I said this on day one, and it was so screamingly obvious I don't know why you didn't
get the AHA! event then!
****
>
>> But why are we arguing over this? We KNOW his design is wrong; only HE can defend his bad
>> decisions by rationalizing them to himself. The first time a customer programmer
>> complains "But you SAID your extensions supported UTF-8 input, and I wrote this code in my
>> native language and it is correct" he can explain to a PAYING CUSTOMER why his
>> implementation makes no sense.
>
>I wish you had explained it this well earlier on; it would have
>avoided a lot of wasted time.
****
I leave a certain amount of work as an Exercise For The Reader. I don't feel I have to
explain every single little detail to an experienced programmer.
****
>
>This is still probably out of scope for my first release. My first
>release will probably only support ASCII. The idea of home grown capitol
****
For someone who accuses me of understanding nothing because I make a couple typos, you
should be a lot more careful about your own typos. "Capital" is, by one definition, money
to invest; "capitol" is a building in which certain legislative bodies meet. Be careful,
or someone may state that because you are the sort of person who cannot spell-check that
you are necessarily a babbling idiot.
****
>is to get some sales quickly and as these sales provide self sufficient
>positive cash flow, then proceed with additional development.
****
But you must not misrepresent the product in the way you have.
joe
****
>
>>
>> Note that you should also cite section 2.3 and the footnotes on page 16.
>> joe
>>
>> On Thu, 20 May 2010 00:59:44 -0400, "Pete Delgado"<Peter.Delgado(a)NoSpam.com> wrote:
>>
>>>
>>> "Peter Olcott"<NoSpam(a)OCR4Screen.com> wrote in message
>>> news:gomdnfY9-INibm7WnZ2dnUVZ_sEAAAAA(a)giganews.com...
>>>> On 5/19/2010 12:55 AM, Mihai N. wrote:
>>>>>
>>>>>> So C++ can take UTF-8 Identifiers?
>>>>>
>>>>> No, it can take Unicode identifiers.
>>>>> The exact transformation format is not relevant.
>>>>>
>>>>>
>>>> I thought that it choked on anything besides ASCII. So are you implying
>>>> that it can take Unicode within any encoding?
>>>
>>> *Read* the C++ standards documents. It explains *everything*. For
>>> information about identifiers, see section 2.11. There are draft copies of
>>> the current proposed standard available for free :
>>>
>>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf
>>>
>>> There is no need to "imply" anything. As usual, Mihai is correct in matters
>>> such as this and your "thought" was wrong.
>>>
>>> -Pete
>>>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

From: Peter Olcott on 20 May 2010 13:01

On 5/20/2010 11:38 AM, Joseph M. Newcomer wrote:
>
>
> On Thu, 20 May 2010 10:24:13 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> If all that needs to be done is to map some local code points to some
>> ASCII characters this may be implemented before I release my GUI
>> scripting language.
> ****
> But why did you have to argue about this?

To get you to explain the reasoning behind your dogmatic statements so
that I could see that there was a reasonable basis for what you were
claiming.

>>> so the argument that if I treat the following sequence in some
>>> language "A,B" that if I use a localized comma with localized letters this is, by his
>>> rules, necessarily an identifier. It means that under the mapping requirements it does
>>> not translate and therefore his assertion is (no big surprise here) gibberish.
>>
>> Now you have finally explained your view sufficiently so that I can see
>> what you are saying makes perfect sense.
> ****
> But I said this on day one, and it was so screamingly obvious I don't know why you didn't
> get the AHA! event then!

It made no sense that there would be something such as a localized
comma. Because it made no sense (and still makes no sense) I thought
that you were just pulling my chain. Even if there was such a thing as a
localized comma (and apparently there is) I thought that C/C++
standardized on the ASCII comma.

I coined a term long ago [ignorance squared]. What this means is that
there is no possible way for any person lacking knowledge to accurately
quantify the specific degree of this lack of knowledge because this
requires having the knowledge to measure the lack against.

To a person who lacks knowledge this lack can only appear to be
disagreement. Only the person who has the knowledge can accurately
quantify the degree of the lack.

Ignorance squared means that one is even ignorant of their own
ignorance (or at least of the degree of this ignorance).

From: Joseph M. Newcomer on 20 May 2010 14:07

See below...
On Thu, 20 May 2010 12:01:12 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/20/2010 11:38 AM, Joseph M. Newcomer wrote:
>>
>>
>> On Thu, 20 May 2010 10:24:13 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> If all that needs to be done is to map some local code points to some
>>> ASCII characters this may be implemented before I release my GUI
>>> scripting language.
>> ****
>> But why did you have to argue about this?
>
>To get you to explain the reasoning behind your dogmatic statements so
>that I could see that there was a reasonable basis for what you were
>claiming.
***
I made no dogmatic statements; I merely pointed out the screamingly obvious defects in the
design.
****
>
>>>> so the argument that if I treat the following sequence in some
>>>> language "A,B" that if I use a localized comma with localized letters this is, by his
>>>> rules, necessarily an identifier. It means that under the mapping requirements it does
>>>> not translate and therefore his assertion is (no big surprise here) gibberish.
>>>
>>> Now you have finally explained your view sufficiently so that I can see
>>> what you are saying makes perfect sense.
>> ****
>> But I said this on day one, and it was so screamingly obvious I don't know why you didn't
>> get the AHA! event then!
>
>It made no sense that there would be something such as a localized
>comma. Because it made no sense (and still makes no sense) I thought
>that you were just pulling my chain. Even if there was such a thing as a
>localized comma (and apparently there is) I thought that C/C++
>standardized on the ASCII comma.
****
That is a stupid statement. All you had to do was read the Unicode standard and you would
see that there ARE localized punctuation marks! I brought up the list of Unicode code
points and it took me less than five minutes to discover this! I even gave you the
precise code points, so you could not POSSIBLY have missed the idea that there are
localized punctuation marks! And you could have verified my observations just by looking
at the Unicode standard! RTFM!!!!

And the C++ standard is very clear about what is going on; if there are character set
transformations required to create a legitimate C++ program, these are handled by a
mechanism outside the standard. And in any case, since you explicitly said you were
EXTENDING the character set, why should you revert to insisting that the ASCII-7 standard
is what a programmer programming in his or her "native language" should adhere to? I
merely pointed out an obvious inconsistency in your reasoning. You reverted to saying "I
said X, but I meant something other than X" which puts us back in the world of the Magic
Morphing Requirements.
***
>
>I coined a term long ago [ignorance squared]. What this means is that
>there is no possible way for any person lacking knowledge to accurately
>quantify the specific degree of this lack of knowledge because this
>requires having the knowledge to measure the lack against.
****
Hmm. But I had attempted to correct your ignorance; in particular, I remember
specifically giving the Unicode code point for the Armenian Comma and several other
localized punctuation marks, and you made the assumption I was "yanking your chain". Now
THAT's a manifestation of ignorance squared! When someone corrects you by stating a fact,
and you find the fact "inconvenient", it does not make you smarter; it only proves that
you like remaining ignorant.
****
>
>To a person who lacks knowledge this lack can only appear to be
>disagreement. Only the person who has the knowledge can accurately
>quantify the degree of the lack.
****
I had done that, by pointing out ranges of localized digits, and localized punctuation
marks, and you chose to both ignore me and argue that such things didn't matter, which was
inconsistent with your stated design goal (program in the localized language). I even
pointed out that I had used my Locale Explorer to find these, and it is a free download
(and the table I use in it is directly from the Unicode Web site, and is the official,
sanctioned, data, at least as of the time I downloaded it; it is potentially obsolete, but
it already contained enough information to show you were wrong)
****
>
>Ignorance squared means that one is even ignorant of their own
>ignorance (or at least of the degree of this ignorance).
***
So what do you call an insistence on remaining ignorant, even when others are supplying
knowledge you didn't have?
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
