From: Pete Delgado on 20 May 2010 00:59

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:gomdnfY9-INibm7WnZ2dnUVZ_sEAAAAA(a)giganews.com...
> On 5/19/2010 12:55 AM, Mihai N. wrote:
>>
>>> So C++ can take UTF-8 Identifiers?
>>
>> No, it can take Unicode identifiers.
>> The exact transformation format is not relevant.
>>
>>
> I thought that it choked on anything besides ASCII. So are you implying
> that it can take Unicode within any encoding?

*Read* the C++ standards documents. They explain *everything*. For
information about identifiers, see section 2.11. There are draft copies of
the current proposed standard available for free:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf

There is no need to "imply" anything. As usual, Mihai is correct in
matters such as this and your "thought" was wrong.

-Pete
From: Mihai N. on 20 May 2010 03:17

> I thought that it choked on anything besides ASCII. So are you implying
> that it can take Unicode within any encoding?

It can take Unicode in some Unicode form: whichever form is accepted by
the compiler. Some compilers understand UTF-16, some understand UTF-8,
some understand none.
But even the ones that don't understand anything other than ASCII should
still accept the escaped form (\uXXXX):

   int x\u6565rap = 123;

is a perfectly valid name.
(And if the compiler accepts UTF-8 or UTF-16, you can use a more
human-readable form.)

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
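To make Mihai's one-liner concrete, here is a minimal sketch of a complete
translation unit built around it. The identifier spelling is taken
directly from his post; whether it compiles depends on the compiler's
support for universal character names in identifiers (the draft standard
they cite, n3092 section 2.11, permits it, but compiler support in 2010
varied, e.g. gcc required -fextended-identifiers).

   // Minimal sketch: a universal-character-name (\u6565, a CJK
   // ideograph) used inside an identifier, as the identifier grammar
   // in the draft standard permits. This is an illustration of the
   // mechanism, not a portability claim.
   #include <iostream>

   int x\u6565rap = 123;   // \u6565 is part of the name itself

   int main()
   {
       std::cout << x\u6565rap << std::endl;   // prints 123
       return 0;
   }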
From: Joseph M. Newcomer on 20 May 2010 03:39

Note that an identifier is defined as incorporating "other
implementation-defined characters". If someone is claiming to extend C++
syntax to include localized letters, then it should be philosophically
consistent with the localized environment and define letters to be
consistent with that environment, or alternatively, be inclusive and
include all letters in all localized environments. Letters in a localized
environment would not include that environment's digits, punctuation
marks, and so on.

Peter makes one of the common mistakes he is so fond of: he fastens on ONE
implementation by ONE vendor and claims that it is DEFINITIVE. You can't
even argue that Intel's C++ compiler or gcc "prove" that this is true for
ALL compilers, since they are intended to be clones of each other, and
historically they all date back to the PDP-11 C compiler, which only used
ASCII-7; they are clones of that, except for extending the syntax to more
modern constructs.

So he comes along and says "I'm going to extend this", and as soon as I
point out that the extensions have serious problems, he says "but the
regular C++ language does not work that way!", which begs the question of
what is meant by creating an extension that allows "native language"
programming.

There are interesting questions about accent marks, vowel marks, combining
characters, localized punctuation, localized digits, etc., but when I
raised these, I was informed that the extensions to support "native
language coding" did NOT mean "support native language coding" but meant
"support something that allows native-language programmers to write
identifiers that don't even make sense lexically in their native
language". And while making claims about how fast the recognizer is, he
refuses to tighten the productions because the copy-and-paste lex rules
would actually require WORK to make correct; so he argues that it is not
"convenient" to do it right. I guess I don't respect doing a job wrong, or
rationalizations that say "wrong is OK, because whatever I have defined is
necessarily right, whether it is right or not".

There are some VERY interesting questions about combining accent marks and
combining characters, but even if we ignore those, there is ZERO excuse
for not writing productions based on localized letters or digits (other
than that the copy-and-paste solution no longer works!), because it cannot
POSSIBLY affect the performance of the lexer! He even says it can't, so
the only remaining reason is the need to actually THINK about the problem,
instead of accepting an unsanctioned and unsupported regexp rule set.

Note that the lexical rules require that localized characters be mapped to
the base character set, so a Thai digit character should map to the
corresponding 0..9 value, and a conforming compiler that allowed Thai
input would do so because the C++ standard requires it. So his argument
about why his extended C++ does not have to treat a localized comma as a
comma, or a localized semicolon as a semicolon, does not make sense: the
standard says that the input character set is implementation-specific but
must map to the base character set. Yet if I write the sequence "A,B" in
some language using a localized comma and localized letters, his rules say
the whole thing is necessarily a single identifier. Under the mapping
requirements that does not translate, and therefore his assertion is (no
big surprise here) gibberish.

But why are we arguing over this? We KNOW his design is wrong; only HE can
defend his bad decisions by rationalizing them to himself. The first time
a customer programmer complains "But you SAID your extensions supported
UTF-8 input, and I wrote this code in my native language and it is
correct", he can explain to a PAYING CUSTOMER why his implementation makes
no sense.

Note that you should also cite section 2.3 and the footnotes on page 16.
				joe

On Thu, 20 May 2010 00:59:44 -0400, "Pete Delgado"
<Peter.Delgado(a)NoSpam.com> wrote:

>
>"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>news:gomdnfY9-INibm7WnZ2dnUVZ_sEAAAAA(a)giganews.com...
>> On 5/19/2010 12:55 AM, Mihai N. wrote:
>>>
>>>> So C++ can take UTF-8 Identifiers?
>>>
>>> No, it can take Unicode identifiers.
>>> The exact transformation format is not relevant.
>>>
>>>
>> I thought that it choked on anything besides ASCII. So are you implying
>> that it can take Unicode within any encoding?
>
>*Read* the C++ standards documents. They explain *everything*. For
>information about identifiers, see section 2.11. There are draft copies of
>the current proposed standard available for free:
>
>http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3092.pdf
>
>There is no need to "imply" anything. As usual, Mihai is correct in matters
>such as this and your "thought" was wrong.
>
>-Pete
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
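As an illustration of the digit-mapping requirement Newcomer describes
(this is a sketch, not code from the thread; the helper name is invented),
a lexer that accepted Thai input would fold the Thai digits, which occupy
U+0E50 through U+0E59, to their base 0..9 values before numeric
conversion:

   // Sketch of folding localized digits to base-character values, as
   // the mapping requirement described above would demand of a
   // conforming lexer. Only ASCII and Thai digits are handled here;
   // a real lexer would cover every digit range it claims to accept.
   #include <cstdint>

   // Returns the numeric value 0..9 of a code point, or -1 if the
   // code point is not a digit this (hypothetical) lexer recognizes.
   int digit_value(char32_t cp)
   {
       if (cp >= U'0' && cp <= U'9')           // basic (ASCII) digits
           return static_cast<int>(cp - U'0');
       if (cp >= 0x0E50 && cp <= 0x0E59)       // Thai digits
           return static_cast<int>(cp - 0x0E50);
       return -1;                              // not a recognized digit
   }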
From: Joseph M. Newcomer on 20 May 2010 03:46

See below...
On Wed, 19 May 2010 14:54:03 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com>
wrote:

>On 5/19/2010 2:31 PM, Joseph M. Newcomer wrote:
>> See below...
>> On Wed, 19 May 2010 10:01:36 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>
>>> On 5/19/2010 1:32 AM, Joseph M. Newcomer wrote:
>>>> See below...
>>>> On Tue, 18 May 2010 20:48:26 -0500, Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>>>
>>>>>
>>>>> A church that I used to go to had an expression, "What would Jesus do?"
>>>>> as their measure of correct behavior. In my case I have an analogous
>>>>> measure, "What would C++ do?"
>>>>>
>>>>> Would C++ permit digits other than ASCII [0-9] ???
>>>> ***
>>>> How about
>>>>
>>>> "Would a person who claims that his language allows programmers to
>>>> program in their native language, yet creates a compiler in which their
>>>> native digits are considered letters, be lying in his teeth about his
>>>> claim?"
>>>> ****
>>>
>>> The claim was merely imprecisely (thus incorrectly) stated. What I meant
>>> was that identifiers can be written in the native language, and C++
>>> language constraints must be otherwise maintained.
>> ****
>> This is a typical pattern:
>>
>> Peter: "X is true"
>> World: "X is false"
>> Peter: "I meant to say, X is true under the following conditions"
>> World: "X is false in two of those three conditions"
>> Peter: "No, I REALLY meant that X is true only under conditions when it is true,
>>         and I'm going to ignore all the conditions where it is false, and
>>         define them out of existence by stating they were not part of
>>         my design"
>>>
>>> Will C++ allow anything other than an ASCII comma between parameters?
>>> What are the limits on exactly how much of C++ is Unicode-aware? I
>>> already know that std::wstring is totally clueless. I assumed based on
>>> this that all of C++ was generally clueless about Unicode.
>> ****
>> But you said you were EXTENDING the language to be a C-like language that
>> supported localization! Did you mean something different? (See Magic
>> Morphing Requirements)
>
>I was not aware of any other issues pertaining to the localization of a
>language based on C++ than providing a way to write identifiers in the
>native language. I had thought that C++ required all users to use ASCII
>numeric digits.
***
What do you mean "I thought"? Does it mean "I once heard a rumor about
this" or "I found something on someone's Web page", or what? It certainly
cannot mean "I read the C++ Standard", because that is very explicit about
stating that it is the responsibility of the input mechanism to map the
input character set to the base character set, which clearly suggests that
localized digits are permissible for numbers!
****
>
>To exactly what extent does the current C++ provide for localization?
>Does the current C++ allow you to use anything other than an ASCII comma
>to separate parameters?
****
It explicitly states that the input mechanism is responsible for mapping
from the input character set to the base character set. And you have
seriously confused the issue by saying "I am extending the lexical rules
to allow UTF-8" and then saying "But if you use a character which is not
in conformance with the ASCII-7 subset, I am not obligated to honor it".
Duh!

Since you have been given a citation to the C++ Standard, I suggest
reading it. Why should I copy-and-paste from the document to save you the
effort? I just read it, and it is pretty explicit about what constitutes
correct behavior. And your implementation is not even CLOSE to supporting
correct behavior!

Note that if you say "I am extending the lexical rules", then you must
actually do so, and not say "What I meant by that was I am extending the
lexical rules, except when it would require that I do a little work beyond
copy-and-paste to make them correct".
				joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
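For concreteness, here is a sketch (with a hypothetical function name, not
anyone's shipping code) of the first thing "extending the lexical rules to
allow UTF-8" obligates a lexer to do: decode the input byte stream into
code points before classifying anything, which is exactly the mapping from
the input character set that Newcomer cites.

   // Sketch of UTF-8 decoding as the lexer's first phase. Malformed
   // sequences collapse to U+FFFD; overlong forms and surrogates are
   // not rejected here, as a production decoder would require.
   #include <cstddef>
   #include <string>

   char32_t next_code_point(const std::string& src, std::size_t& i)
   {
       unsigned char b = static_cast<unsigned char>(src[i++]);
       if (b < 0x80)
           return b;                        // single-byte (ASCII) case

       int trail;                           // continuation byte count
       char32_t cp;
       if      ((b & 0xE0) == 0xC0) { trail = 1; cp = b & 0x1F; }
       else if ((b & 0xF0) == 0xE0) { trail = 2; cp = b & 0x0F; }
       else if ((b & 0xF8) == 0xF0) { trail = 3; cp = b & 0x07; }
       else return 0xFFFD;                  // invalid lead byte

       for (int k = 0; k < trail; ++k) {
           if (i >= src.size())
               return 0xFFFD;               // truncated sequence
           unsigned char c = static_cast<unsigned char>(src[i]);
           if ((c & 0xC0) != 0x80)
               return 0xFFFD;               // not a continuation byte
           cp = (cp << 6) | (c & 0x3F);
           ++i;
       }
       return cp;
   }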
From: Hector Santos on 20 May 2010 09:17
Mihai N. wrote:

>> I thought that it choked on anything besides ASCII. So are you implying
>> that it can take Unicode within any encoding?
>
> It can take Unicode in some Unicode form: whichever form is accepted by
> the compiler. Some compilers understand UTF-16, some understand UTF-8,
> some understand none.
> But even the ones that don't understand anything other than ASCII should
> still accept the escaped form (\uXXXX):
>
>    int x\u6565rap = 123;
>
> is a perfectly valid name.
> (And if the compiler accepts UTF-8 or UTF-16, you can use a more
> human-readable form.)

Half the problem with all this is that there is no context for
applicability. Overall, you have these fundamental ergonomics or
interfaces:

 - text editing (creation)
 - text compiling (translation)
 - data transfer (heterogeneous networking)
 - display rendering (old and new user devices)

What else?

He was clearly wrong about C/C++ supporting only ASCII, and that's only
because he is not enough of a programmer to know it isn't true. But even
if it were true, so what? Not everyone is using C/C++ only. If Unicode
editing is necessary, a developer will find the tools. There are other
languages, and creation/translation/rendering is pretty much well defined
(complex, but well defined).

OTOH, data transfer is generally the problem, and for us that has been the
main focus over the past year or so, because of the new IETF requirements
for mail transport protocols. UTF-8 encoding has made this easy.

Next come the other parts for us. I will have questions about this
soon. :)

--
HLS
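Hector's data-transfer point can be made concrete with one small check (a
sketch; the function name is invented): the internationalized mail
transport extensions he alludes to are only needed when the data actually
leaves 7-bit ASCII, which a single pass over the bytes can decide.

   // Sketch: decide whether message data stays within 7-bit ASCII or
   // carries non-ASCII (e.g. UTF-8) bytes that need an
   // internationalized transport path.
   #include <string>

   bool needs_utf8_transport(const std::string& data)
   {
       for (std::string::size_type k = 0; k < data.size(); ++k)
           if (static_cast<unsigned char>(data[k]) > 0x7F)
               return true;    // non-ASCII byte: 8-bit path needed
       return false;           // pure 7-bit ASCII: legacy transport is fine
   }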