From: Peter Olcott on 13 May 2010 19:22
On 5/13/2010 6:12 PM, Sam wrote:
> Victor Bazarov writes:
>
>> On 5/13/2010 6:04 PM, Peter Olcott wrote:
>>> "Ian Collins"<ian-news(a)hotmail.com> wrote in message
>>> news:8539h9F7f1U1(a)mid.individual.net...
>>>> On 05/14/10 08:06 AM, Peter Olcott wrote:
>>>>> Is this Regular Expression for UTF-8 Correct??
>>>>
>>>> It's a fair bet you are off-topic in all the groups you
>>>> have cross posted to. Why don't you pick a group for a
>>>> language with built in UTF8 and regexp support (PHP?) and
>>>> badger them?
>>>>
>>>> --
>>>> Ian Collins
>>>
>>> What does this question have to do with the C++ language?
>>
>> It does not have to have anything to do with C++. A post on the
>> topicality of another post is *always on topic*.
>>
>>> At least my question is indirectly related to C++ by making
>>> a utf8string for the C++ language from the regular
>>> expression.
>>
>> <sarcasm>
>> I am about to hold a party where I expect my colleagues to show up.
>> They are all C++ programmers. Would the question on what to feed them,
>> or whether 1970s pop music is going to be appropriate, be on topic in
>> comp.lang.c++? It's *indirectly related* to C++, isn't it?
>> </sarcasm>
>>
>>> Your question is not even indirectly related to the C++
>>> language.
>>
>> See above.
>
> This guy is a tool. He re-posted this question a second time because
> when he first posted that snippet nobody cared either. But after
> watching the struggle in the original thread, the ugly carnage appealed
> to the infinitesimally small humanitarian aspect of my psyche
> sufficiently enough to motivate myself into actually looking at the
> regexp monstrosity. But after I explained why that spaghetti of a regexp
> does not jive with RFC 2279, he got all huffy about it. He was confident
> that I was wrong, and that the regular expression was right. But I was
> able to explain my reasoning, by referencing directly to the contents of
> RFC 2279, and he was unable to explain why he thought I was wrong,
> instead sprinkling more URLs to some apparently orphaned web pages that
> said something else.
>
> Which raised an obvious question: if he was so sure that his regular
> expressions were correct, why was he asking? What exactly is the part of
> RFC 2279 that he didn't understand?
>
> It seems to be his personality trait: when he asks a question, he thinks
> he knows what the answer is, and every other answer is wrong. I can't
> figure out what the real reason for asking the question must be, but I
> think I really don't want to know the answer.
>
> It remains to be seen how long it will take him to figure out that the
> difficulty he has in getting someone answer this might be, just might
> be, due to the simple fact that this is one of these things that can be
> answered simply by RTFMing. Really, UTF-8 is not some patented trade
> secret. Its specifications are openly available, to anyone who wants to
> read them. And anyone who reads them should be able to figure out the
> correct regexp for themselves. It's not rocket science.
>
> Amusingly, he's been trying to find the answer to this question longer
> than it took myself, originally, to read RFC 2279, and implement
> encoding and decoding of Unicode using UTF-8. In C++. Well, in C
> actually, but it's still technically valid C++. Which, I guess, makes
> this on-topic, under the new rules that just came down, by fiat.
>
>

This time I found the original source of a semantically identical
regular expression that you berated so rudely.
http://www.w3.org/2005/03/23-lex-U

Who knows, maybe www.w3.org is wrong and you are right?
From: Sam on 13 May 2010 19:40
Peter Olcott writes:

> This time I found the original source of a semantically identical
> regular expression that you berated so rudely.
> http://www.w3.org/2005/03/23-lex-U
>
> Who knows, maybe www.w3.org is wrong and you are right?

And as I wrote in the first thread, I suspected that the regular
expression mish-mash's actual purpose was to validate some defined
subset of the entire Unicode range, as encoded in UTF-8.
See message <cone.1273539693.340713.2085.500(a)commodore.email-scan.com>,
where I wrote:

> I think what that regexp really does is match a subset of all valid
> UTF-8 sequences that corresponds with a subset of Unicodes that the
> author was interested in. It doesn't match all valid UTF-8 sequences,
> which the non-regexp version does.

And reading the "www.w3.org" link, it's clear that's exactly what it
does, and what the criteria are.

Still, you replied as follows, in
<cvydnfcDPJR5MHXWnZ2dnUVZ_vOdnZ2d(a)giganews.com>:

> I think that your understanding might be less than complete. If you read
> the commentary you will see that your view is not supported.

Obviously, it's your thoughts that turned out to be "less than complete".
That regular expression does not validate whether an arbitrary octet
stream is a UTF-8-encoded unicode value sequence. That regular expression
checks whether an arbitrary octet stream is a UTF-8-encoded unicode value
sequence and all unicode values belong to a specific, defined subset of
the entire unicode value range.
From: Peter Olcott on 13 May 2010 19:58
On 5/13/2010 6:40 PM, Sam wrote:
>> This time I found the original source of a semantically identical
>> regular expression that you berated so rudely.
>> http://www.w3.org/2005/03/23-lex-U
>>
>> Who knows, maybe www.w3.org is wrong and you are right?
>
> And as I wrote in the first thread, I suspected that the regular
> expression mish-mash's actual purpose was to validate some defined
> subset of the entire Unicode range, as encoded in UTF-8.

And this view is clearly incorrect. It validates the entire set of
UTF-8 encodings. Here is a quote:

"This pattern does not restrict to the set of
defined UCS characters, instead to the set that
is permitted by UTF-8 encoding."

The difference is the missing D800-DFFF High and Low surrogates that are
not legal in UTF-8. All of the other CodePoints from 0-10FFFF are
represented.
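
The rule stated above (every code point from 0 to 10FFFF except the
surrogates D800-DFFF) can also be checked without a regular expression.
What follows is only a minimal sketch, not the pattern from the w3.org
page and not anyone's code from this thread: the function name
is_valid_utf8 is hypothetical, and the byte ranges are assumed from the
well-formed UTF-8 syntax given in RFC 3629.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (name is an assumption, not from the thread).
// Returns true if `s` is well-formed UTF-8 as described in RFC 3629:
// no overlong forms, no surrogates (U+D800..U+DFFF), nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) {            // single-byte (ASCII) sequence
            ++i;
            continue;
        }
        std::size_t len = 0;         // number of continuation bytes expected
        unsigned char lo = 0x80;     // allowed range of the FIRST continuation
        unsigned char hi = 0xBF;     // byte; later ones are always 0x80..0xBF
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 1; }
        else if (b0 == 0xE0)               { len = 2; lo = 0xA0; } // no overlongs
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 2; }
        else if (b0 == 0xED)               { len = 2; hi = 0x9F; } // no surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { len = 2; }
        else if (b0 == 0xF0)               { len = 3; lo = 0x90; } // no overlongs
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 3; }
        else if (b0 == 0xF4)               { len = 3; hi = 0x8F; } // <= U+10FFFF
        else return false;           // 0x80..0xC1 and 0xF5..0xFF never lead

        if (i + len >= n) return false;    // sequence is truncated

        for (std::size_t k = 1; k <= len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            const unsigned char min = (k == 1) ? lo : 0x80;
            const unsigned char max = (k == 1) ? hi : 0xBF;
            if (c < min || c > max) return false;
        }
        i += len + 1;
    }
    return true;
}
```

With this sketch, the bytes ED A0 80 (what a UTF-8 encoding of the
surrogate D800 would look like) are rejected, while F4 8F BF BF (10FFFF)
is accepted. Restricting only the first continuation byte per lead byte
is what excludes overlong forms, surrogates and values past 10FFFF
without decoding the code point.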
From: Sam on 13 May 2010 21:01
Peter Olcott writes:

> On 5/13/2010 6:40 PM, Sam wrote:
>>> This time I found the original source of a semantically identical
>>> regular expression that you berated so rudely.
>>> http://www.w3.org/2005/03/23-lex-U
>>>
>>> Who knows, maybe www.w3.org is wrong and you are right?
>>
>> And as I wrote in the first thread, I suspected that the regular
>> expression mish-mash's actual purpose was to validate some defined
>> subset of the entire Unicode range, as encoded in UTF-8.
>
> And this view is clearly incorrect. It validates the entire set of
> UTF-8 encodings. Here is a quote:
>
> "This pattern does not restrict to the set of
> defined UCS characters, instead to the set that
> is permitted by UTF-8 encoding."
>
> The difference is the missing D800-DFFF High and Low surrogates that are
> not legal in UTF-8. All of the other CodePoints from 0-10FFFF are
> represented.

Since you claim to know so much about UTF-8 encoding and decoding --
even more than RFC 2279 -- it's a wonder you had to ask your question at
all. It seems that you already knew the answer to the question.

Good luck UTF-8 encoding and decoding.
From: Peter Olcott on 13 May 2010 21:24
On 5/13/2010 8:01 PM, Sam wrote:
> Peter Olcott writes:
>
>> On 5/13/2010 6:40 PM, Sam wrote:
>>>> This time I found the original source of a semantically identical
>>>> regular expression that you berated so rudely.
>>>> http://www.w3.org/2005/03/23-lex-U
>>>>
>>>> Who knows, maybe www.w3.org is wrong and you are right?
>>>
>>> And as I wrote in the first thread, I suspected that the regular
>>> expression mish-mash's actual purpose was to validate some defined a
>>> subset of the entire Unicode range, as encoded in UTF-8.
>>
>> And this view is clearly incorrect. It validates the the entire set of
>> UTF-8 encodings. Here is a quote:
>>
>> "This pattern does not restrict to the set of
>> defined UCS characters, instead to the set that
>> is permitted by UTF-8 encoding."
>>
>> The difference is the missing D800-DFFF High and Low surrogates that
>> are not legal in UTF-8. All of the other CodePoints from 0-10FFFF are
>> represented.
>
> Since you claim to know so much about UTF-8 encoding and decoding --
> even more than RFC 2279 -- it's a wonder you had to ask your question at
> all. It seems that you already knew the answer to the question.

http://tools.ietf.org/html/rfc3629
This memo obsoletes and replaces RFC 2279.

>
> Good luck UTF-8 encoding and decoding.
>

Thanks.