From: Peter Olcott on 22 May 2010 10:28

On 5/22/2010 5:16 AM, Joseph M. Newcomer wrote:
> See below...
> On Fri, 21 May 2010 15:23:25 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>
>> It would probably take me much longer than 40 hours just to find the
>> exhaustive list of every local code point that must be mapped to an
>> ASCII code point. The whole rest of this adaptation would be nearly
>> trivial.
> ****
> Why do you care about ASCII code points? You explicitly said you are
> implementing an EXTENSION to C++ syntax, for a language which is NOT C++
> but your private scripting language! So what in the world does the C++
> specification have to do with your EXTENSION to the syntax????

C++ requires that every non-ASCII character be mapped to the ASCII set. I am
extending this so that I only require semantically significant non-ASCII
characters to be mapped to the ASCII set. This approach (as you already know
if you have the compiler design experience that you claim) is simple because
it can reuse the same parser and only requires changes to the lexer.

This design requires obtaining the exhaustively complete set of every
character in every language that must be mapped to those characters within
C++ that have semantic significance. This includes all C++ punctuation marks
as well as the local sets of numeric digits. Finding the local sets of
numeric digits would be easy enough. Finding out whether it is reasonable to
map a Chinese semicolon (if one even exists) to the ASCII semicolon, and
doing the same for every punctuation mark in every language, would take more
time than I have unless someone else has already done this.

> If you say "I wish to ignore the limitations of the C++ language" and then
> you say "I am forced to do a bad implementation because I have to adhere
> to the limitations of the C++ language", how can we resolve these two
> positions?
> ****
>>
>>> Assume that you only have novice levels of
>>> understanding of Unicode and any learning must also be included in this
>>> 40 hour budget.
> *****
> It does not take much experience to read the Unicode tables and see what
> are letters and what are digits and what are punctuation marks! And it
> does not take hours of study to do this!
> ****

Determining which local punctuation mark can be mapped to which ASCII
punctuation mark, specifically taking into account all of the subtle nuances
of semantic distinction, will take longer than I have. A concrete example is
that the comma is used as a decimal point in some countries.

>>>
>>> Since my language would not treat any code point above ASCII as
>>> lexically or syntactically significant, I still think that my approach
>>> within my budget is optimal.
> *****
> Oh, what happened to that stated specification of allowing people to
> program in their native character set? Oh, that was just a Magic Morphing
> Requirement which is no longer true. Never mind.
> ****
>>>
>>> What I learned from you is that if and when I do decide to map local
>>> punctuation and digits to their corresponding ASCII equivalents, then I
>>> would need to restrict the use of these remapped code points from being
>>> used within identifiers. Until then it makes little difference.
> *****
> But it is so trivial to do the job right in the first place! You treat
> anything recognizably called a "letter" as a letter, anything recognizably
> called a "digit" as a digit, write lexical rules for a number which has
> productions of the form

That would be wrong. Rejecting a combining mark as not a Letter, and thus
not valid in an identifier, would be incorrect. That is why I take the
opposite approach. Anything that is used in ways that a Letter is not used
(C++ significant punctuation and numeric digits) is not a Letter. Everything
else is a Letter in terms of its use in an identifier.

The hard part is deriving the table mapping local punctuation marks to their
ASCII equivalents while specifically taking into account possibly very great
depths of subtle nuance in semantic meaning. Just last night I looked in the
Unicode table and found many code points that had a letter with an implied
comma embedded within its meaning. The comma was being used as a diacritical
mark.

> thai_number = [0-9] (where 0-9 represent the code points for a thai number)
> chinese_number = [0-9] (where 0-9 represent the code points for a chinese number)
> english_Number = [0-9] (where 0-9 represent the code points \u0030 to \u0039)
>
> number = thai_number | chinese_number | english_number | ...lots of others...
>
> Note that converting a Chinese number to a binary representation is a bit
> trickier, because Chinese has a symbol for "ten", so you need to know the
> syntax for doing the conversion, but that's a trivial detail. That's what
> you worry about in the other 35 hours.
> joe
> ****
>>>
>>> I also learned from you that this next step of localization provides
>>> much more functionality for relatively little cost.
> Joseph M. Newcomer [MVP]
> email: newcomer(a)flounder.com
> Web: http://www.flounder.com
> MVP Tips: http://www.flounder.com/mvp_tips.htm
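[The approach described above (reuse the parser, change only the lexer)
amounts to a small folding pass over the source plus an inverted identifier
rule. Below is a minimal sketch of that idea, written in Java since the
thread later uses Java as its reference point for internationalized
languages; the specific mapping entries (fullwidth punctuation, fullwidth
and Arabic-Indic digits) are illustrative assumptions, not the exhaustive
per-language table whose cost is the actual point of contention.]

import java.util.HashMap;
import java.util.Map;

// Sketch of the pre-lexing pass: only code points that carry C++-style
// semantic significance are folded to ASCII; everything else is left
// alone and counts as an identifier character.
public final class AsciiFold {

    private static final Map<Integer, Character> SIGNIFICANT = new HashMap<>();
    static {
        SIGNIFICANT.put(0xFF1B, ';'); // FULLWIDTH SEMICOLON
        SIGNIFICANT.put(0xFF0C, ','); // FULLWIDTH COMMA
        SIGNIFICANT.put(0xFF08, '('); // FULLWIDTH LEFT PARENTHESIS
        SIGNIFICANT.put(0xFF09, ')'); // FULLWIDTH RIGHT PARENTHESIS
        SIGNIFICANT.put(0xFF1D, '='); // FULLWIDTH EQUALS SIGN
        for (int i = 0; i <= 9; i++) {
            SIGNIFICANT.put(0xFF10 + i, (char) ('0' + i)); // fullwidth digits
            SIGNIFICANT.put(0x0660 + i, (char) ('0' + i)); // Arabic-Indic digits
        }
    }

    // Folds a semantically significant code point to ASCII,
    // otherwise returns it unchanged.
    static int fold(int codePoint) {
        Character ascii = SIGNIFICANT.get(codePoint);
        return ascii != null ? ascii : codePoint;
    }

    // The inverted identifier rule: anything that does not fold down to
    // ASCII punctuation is usable in an identifier.
    static boolean isIdentifierChar(int codePoint) {
        int folded = fold(codePoint);
        return folded > 0x7F
            || Character.isLetterOrDigit(folded)
            || folded == '_';
    }

    public static void main(String[] args) {
        String source = "ｘ＝１２３；";           // hypothetical fullwidth source text
        StringBuilder folded = new StringBuilder();
        source.codePoints().forEach(cp -> folded.appendCodePoint(fold(cp)));
        System.out.println(folded);               // prints: ｘ=123;
        System.out.println(isIdentifierChar('ｘ')); // true: unmapped, so usable in names
    }
}

[The code around the table is trivial; the table itself, deciding for each
script which marks really do carry the semantics of the ASCII punctuation,
is the part estimated above to take more than the available time.]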
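[Newcomer's quoted aside about Chinese numbers refers to the fact that
Chinese numerals are written around multiplier characters such as 十 (ten)
and 百 (hundred), so converting one to a binary value is a small grammar
rather than a per-digit substitution. A rough sketch, covering only values
below 10,000 and assuming the common numeral characters:]

public final class ChineseNumeral {

    private static final String DIGITS      = "〇一二三四五六七八九"; // values 0..9
    private static final String MULTIPLIERS = "十百千";               // 10, 100, 1000

    // Converts e.g. "二十三" to 23 and "五百零六" to 506 (below 10,000 only).
    static int parse(String s) {
        int total = 0, pending = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int digit = (c == '零') ? 0 : DIGITS.indexOf(c);
            int multiplier = MULTIPLIERS.indexOf(c);
            if (digit >= 0) {
                pending = digit;                         // a digit waits for its multiplier
            } else if (multiplier >= 0) {
                int factor = (int) Math.pow(10, multiplier + 1);
                total += (pending == 0 ? 1 : pending) * factor; // a bare "十" means 10
                pending = 0;
            } else {
                throw new IllegalArgumentException("not a numeral: " + s);
            }
        }
        return total + pending;                          // trailing units digit
    }

    public static void main(String[] args) {
        System.out.println(parse("二十三"));   // 23
        System.out.println(parse("五百零六")); // 506
    }
}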
From: Hector Santos on 22 May 2010 10:41

Peter Olcott wrote:

> On 5/22/2010 5:03 AM, Joseph M. Newcomer wrote:
>> See below...
>> On Fri, 21 May 2010 14:43:07 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>>
>>> On 5/21/2010 2:33 PM, Joseph M. Newcomer wrote:
>>>> :-)!!!! And I can decode that even without looking up the actual
>>>> codepoints! Yes, I've been seriously tempted, but as I said in the
>>>> last tedious thread, I think I must suffer from OCD because I keep
>>>> trying to educate him, in spite of his resistance to it!
>>>> joe
>>>
>>> I did acknowledge that you did make your point as soon as you provided
>>> me with enough reasoning to make your point.
>> ****
>> Sadly, all of this was so evident that I didn't see a need to keep
>> drilling down when the correct issues were screamingly obvious. You
>> should have been able to determine all of this on your own from my
>> first responses.
>> joe
>
> Within the context of the basic assumption (and I have already said this
> several times but you still don't get it) that C++ requires ASCII at the
> lexical level, everything that you said about how I was treating
> identifiers was utter nonsense gibberish.
>
> ONLY after this incorrect assumption was corrected could anything that
> you said about how I was treating identifiers make any sense at all.
>
> The ONLY reason that C++ does not allow any character in an identifier
> is that it would screw up the parser. If it would not screw up the
> parser then any character at all could be used in an identifier. It took
> you an enormous amount of time to explain why it would screw up the
> parser. You kept insisting upon arbitrary historical convention as your
> criterion for correct identifiers without pointing out how the parser
> would be screwed up.

It reminds me of the classic "Press any key to continue" and someone
responding: "Where is the ANY key?"

Pedro, when people say "Eat food," the term EAT implies many basic ideas
about the process of obtaining a food item, moving it towards your mouth,
putting it into your mouth, and beginning the chewing and swallowing
process. Do you need this level of attention?

What you are not grasping is that when you begin to talk about compiler (or
translator) design, there is a natural presumption that you have some basic
level of understanding of the basic requirements of the concept. Besides,
why aren't you in any of the COMP.* groups discussing compiler design
concepts? Why the MFC group? Do you have that much disdain for everyone? Or
do you need to prove something about yourself we don't already know?

--
HLS
From: Mihai N. on 23 May 2010 04:22

> C++ requires that every non-ASCII character be mapped to the ASCII set.

Where did you get this from?

> I looked in the Unicode table and found many code points that had a
> letter with an implied comma embedded within its meaning. The comma was
> being used as a diacritical mark.

I am not sure what you are referring to.
I don't know of any comma used as a diacritical mark.
If you are talking about things like U+0219 (LATIN SMALL LETTER S WITH
COMMA BELOW), that is a stand-alone letter, not a letter with a
diacritical mark.
Like saying "O with a small squiggle" when talking about Q.

The Unicode names describe the characters using plain ASCII, but
do not imply anything about the meaning of the thing.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
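[Mihai's claim is easy to verify against the Unicode character database. A
quick probe in Java (used here only because the JDK exposes the Unicode
general category without extra libraries) shows that U+0219 is classified as
a lowercase letter, not as punctuation and not as a combining mark:]

public class CategoryCheck {
    public static void main(String[] args) {
        int cp = 0x0219; // ș
        System.out.println(Character.getName(cp));
        // prints the Unicode name: LATIN SMALL LETTER S WITH COMMA BELOW
        System.out.println(Character.getType(cp) == Character.LOWERCASE_LETTER); // true  (Ll)
        System.out.println(Character.getType(cp) == Character.NON_SPACING_MARK); // false (not Mn)
        System.out.println(Character.isLetter(cp));                              // true
    }
}

[The "COMMA BELOW" in the name is just the English description of the
glyph's shape; the character's properties say "letter", exactly as Mihai
describes.]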
From: Peter Olcott on 23 May 2010 09:48

On 5/23/2010 3:22 AM, Mihai N. wrote:
>
>> C++ requires that every non-ASCII character be mapped to the ASCII set.
>
> Where did you get this from?

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3035.pdf

2.2 Physical source file characters are mapped, in an implementation-defined
manner, to the basic source character set. The set of physical source file
characters accepted is implementation-defined.

2.3 The basic source character set consists of 96 characters: the space
character, the control characters representing horizontal tab, vertical tab,
form feed, and new-line, plus the following 91 graphical characters:

>> I looked in the Unicode table and found many code points that had a
>> letter with an implied comma embedded within its meaning. The comma was
>> being used as a diacritical mark.
>
> I am not sure what you are referring to.
> I don't know of any comma used as a diacritical mark.
> If you are talking about things like U+0219 (LATIN SMALL LETTER S WITH
> COMMA BELOW), that is a stand-alone letter, not a letter with a
> diacritical mark.
> Like saying "O with a small squiggle" when talking about Q.
>
> The Unicode names describe the characters using plain ASCII, but
> do not imply anything about the meaning of the thing.

Since the above examples had the term "Comma" embedded within their names,
it was possible for them to contain a nuance of the semantic meaning of the
punctuation mark. In any case it seems that Joe may have been wrong about
this.

Take the Java language as an example of how computer languages are
internationalized:
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.1

Instead of using the local symbol for a comma and translating it into the
ASCII comma, Java takes a different approach: the ASCII character "," is
used. It seems that native speakers of the Java language think that this
is perfectly reasonable. Java also requires [0-9] digits. It seems that
Java takes essentially the same approach that I am taking and only allows
non-ASCII characters within identifiers.

Here is another useful link:
http://en.wikipedia.org:80/wiki/Non-English-based_programming_languages
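[For reference, this is what the cited Java rule actually permits: letters
from any script inside identifiers, while the surrounding punctuation,
operators, and literal digits stay ASCII. A small sketch, which compiles
provided the source file is saved and compiled as UTF-8; the identifier
names are of course just made-up examples:]

public class IdentifierDemo {
    public static void main(String[] args) {
        int größe = 10;           // non-ASCII letters are legal in identifiers
        int 数量 = größe + 5;      // including CJK letters
        System.out.println(数量);  // but '=', '+', ';' and the digits must be ASCII

        // JLS 3.8 defines "Java letter" in terms of these methods:
        System.out.println(Character.isJavaIdentifierStart('数')); // true: a letter
        System.out.println(Character.isJavaIdentifierPart('，'));  // false: FULLWIDTH COMMA is punctuation
    }
}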
From: Mihai N. on 23 May 2010 14:52
> 2.2
> Physical source file characters are mapped, in an implementation-defined
> manner, to the basic source character set. The set of physical source
> file characters accepted is implementation-defined.

Editing the quote to make a point is cheating. The quote is:
"Physical source file characters are mapped, in an implementation-defined
manner, to the basic source character set (introducing new-line characters
for end-of-line indicators) if necessary."

Note the "if necessary"? This might mean that there can be an
implementation-defined way to map other commas (like the Arabic or
Mongolian comma) to the ASCII comma.

> Since the above examples had the term "Comma" embedded within their names,
> it was possible for them to contain a nuance of the semantic meaning of
> the punctuation mark.

That is not the case, believe me.
This might be true for other things (like accent grave, or acute), but even
then it would be locale dependent. For some countries A with acute (U+00C1)
is a letter, for some it is an accent, for some it is a tone mark. But
that's not true for the comma.
And there is no way to tell how something is used unless you know about it
(it is not captured in the Unicode tables).

> It seems that native speakers of the Java language think that this
> is perfectly reasonable.

Did you talk to them?
And does anyone claim that Java allows people to write code in their own
language?

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email