From: Peter Olcott on 13 May 2010 16:06

Is this Regular Expression for UTF-8 Correct??

The solution is based on the GREEN portions of the first chart shown
on this link:
http://www.w3.org/2005/03/23-lex-U

A semantically identical regular expression is also found on the above
link under "Validating lex Template":

1   ['\u0000'-'\u007F']
2 | (['\u00C2'-'\u00DF'] ['\u0080'-'\u00BF'])
3 | ( '\u00E0'           ['\u00A0'-'\u00BF'] ['\u0080'-'\u00BF'])
4 | (['\u00E1'-'\u00EC'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
5 | ( '\u00ED'           ['\u0080'-'\u009F'] ['\u0080'-'\u00BF'])
6 | (['\u00EE'-'\u00EF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
7 | ( '\u00F0'           ['\u0090'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
8 | (['\u00F1'-'\u00F3'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])
9 | ( '\u00F4'           ['\u0080'-'\u008F'] ['\u0080'-'\u00BF'] ['\u0080'-'\u00BF'])

Here is my version, the syntax is different, but the UTF8 portion should
be semantically identical.

UTF8_BYTE_ORDER_MARK  [\xEF][\xBB][\xBF]

ASCII  [\x0-\x7F]

U1  [a-zA-Z_]
U2  [\xC2-\xDF][\x80-\xBF]
U3  [\xE0][\xA0-\xBF][\x80-\xBF]
U4  [\xE1-\xEC][\x80-\xBF][\x80-\xBF]
U5  [\xED][\x80-\x9F][\x80-\xBF]
U6  [\xEE-\xEF][\x80-\xBF][\x80-\xBF]
U7  [\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
U8  [\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
U9  [\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]

UTF8  {ASCII}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}

// This identifies the "Letter" portion of an Identifier.
L  {U1}|{U2}|{U3}|{U4}|{U5}|{U6}|{U7}|{U8}|{U9}

I guess that most of the analysis may simply boil down to whether or not
the original source from the link is considered reliable. I had forgotten
this original source when I first asked this question, that is why I am
reposting this same question again.
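One way to cross-check the byte ranges in the post above without a lexer generator is to write the same nine alternatives out by hand. The following is only a minimal sketch, assuming a C++11 compiler; the names `in_range`, `utf8_match` and `is_valid_utf8` are hypothetical and do not come from the post or the W3C page. Each branch is labelled with the rule (ASCII, U2..U9) it is meant to mirror.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if b lies in the inclusive range [lo, hi].
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
    return b >= lo && b <= hi;
}

// Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence that
// starts at s[i], or 0 if the bytes there match none of ASCII, U2..U9.
std::size_t utf8_match(const std::string& s, std::size_t i) {
    if (i >= s.size()) return 0;
    const std::size_t left = s.size() - i;  // bytes available from position i
    auto at = [&](std::size_t k) { return static_cast<unsigned char>(s[i + k]); };
    auto cont = [&](std::size_t k) { return in_range(at(k), 0x80, 0xBF); };
    const unsigned char b0 = at(0);

    if (b0 <= 0x7F) return 1;                        // ASCII
    if (in_range(b0, 0xC2, 0xDF))                    // U2
        return (left >= 2 && cont(1)) ? 2 : 0;
    if (in_range(b0, 0xE0, 0xEF)) {                  // U3..U6
        if (left < 3) return 0;
        unsigned char lo = 0x80, hi = 0xBF;
        if (b0 == 0xE0) lo = 0xA0;  // U3: rejects overlong 3-byte forms
        if (b0 == 0xED) hi = 0x9F;  // U5: rejects surrogates U+D800..U+DFFF
        return (in_range(at(1), lo, hi) && cont(2)) ? 3 : 0;
    }
    if (in_range(b0, 0xF0, 0xF4)) {                  // U7..U9
        if (left < 4) return 0;
        unsigned char lo = 0x80, hi = 0xBF;
        if (b0 == 0xF0) lo = 0x90;  // U7: rejects overlong 4-byte forms
        if (b0 == 0xF4) hi = 0x8F;  // U9: caps the range at U+10FFFF
        return (in_range(at(1), lo, hi) && cont(2) && cont(3)) ? 4 : 0;
    }
    return 0;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
}

// True if the whole string is a concatenation of well-formed sequences.
bool is_valid_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ) {
        const std::size_t n = utf8_match(s, i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}
```

Feeding both this function and the generated lexer the same boundary inputs (for example the bytes C0 80, ED A0 80 and F4 90 80 80, all of which should be rejected) is one way to gain confidence that the two agree.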
From: Leigh Johnston on 13 May 2010 16:27

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d(a)giganews.com...
> Is this Regular Expression for UTF-8 Correct??
>
> [...]

What has this got to do with C++? What is your C++ language question?

/Leigh
From: Peter Olcott on 13 May 2010 16:36

"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
news:GsGdnbYz-OIj_XHWnZ2dnUVZ8uCdnZ2d(a)giganews.com...
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:3sudnRDN849QxnHWnZ2dnUVZ_qKdnZ2d(a)giganews.com...
>> Is this Regular Expression for UTF-8 Correct??
>>
>> [...]
>
> What has this got to do with C++? What is your C++
> language question?
>
> /Leigh

I will be implementing a utf8string to supplement std::string and will be
using a regular expression to quickly divide up UTF-8 bytes into Unicode
CodePoints.

Since there are no UTF-8 groups, or even Unicode groups I must post these
questions to groups that are at most indirectly related to this subject
matter.
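The step described above, dividing UTF-8 bytes into code points, can also be sketched without a regular expression. What follows is a hand-written decoder, not the poster's implementation: the function name `utf8_to_code_points` is hypothetical, and instead of the per-lead-byte ranges it decodes first and then rejects overlong forms, surrogates and values above U+10FFFF, which should accept the same byte sequences as the ranges in the original post.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 byte string into code points.
// Throws std::invalid_argument on any ill-formed sequence.
std::vector<std::uint32_t> utf8_to_code_points(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len;       // total bytes in this sequence
        std::uint32_t cp;      // payload bits taken from the lead byte
        std::uint32_t min_cp;  // smallest code point allowed for this length
        if (b0 <= 0x7F)               { len = 1; cp = b0;        min_cp = 0x0;     }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80;    }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800;   }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (bytes.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte must be 10xxxxxx and contributes 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3Fu);
        }

        // Reject overlong encodings, surrogates, and values past U+10FFFF.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong, surrogate, or out-of-range");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A hypothetical utf8string class could keep the raw bytes in a std::string and call a routine like this only when code-point access is needed. Whether that beats a table-driven lexer on speed is not something this sketch measures.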
From: Leigh Johnston on 13 May 2010 16:41

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d(a)giganews.com...
>>
>> What has this got to do with C++? What is your C++ language question?
>>
>> /Leigh
>
> I will be implementing a utf8string to supplement std::string and will be
> using a regular expression to quickly divide up UTF-8 bytes into Unicode
> CodePoints.
>
> Since there are no UTF-8 groups, or even Unicode groups I must post these
> questions to groups that are at most indirectly related to this subject
> matter.

Wrong: off-topic is off-topic. If I chose to write a Tetris game in C++
it would be inappropriate to ask about the rules of Tetris in this
newsgroup even if there was not a more appropriate newsgroup.

/Leigh
From: Peter Olcott on 13 May 2010 16:54

"Leigh Johnston" <leigh(a)i42.co.uk> wrote in message
news:v7CdnY8dPrNy_nHWnZ2dnUVZ8t-dnZ2d(a)giganews.com...
>
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:xMOdnahxZJNX_3HWnZ2dnUVZ_hCdnZ2d(a)giganews.com...
>>>
>>> What has this got to do with C++? What is your C++
>>> language question?
>>>
>>> /Leigh
>>
>> I will be implementing a utf8string to supplement
>> std::string and will be using a regular expression to
>> quickly divide up UTF-8 bytes into Unicode CodePoints.
>>
>> Since there are no UTF-8 groups, or even Unicode groups I
>> must post these questions to groups that are at most
>> indirectly related to this subject matter.
>
> Wrong: off-topic is off-topic. If I chose to write a
> Tetris game in C++ it would be inappropriate to ask about
> the rules of Tetris in this newsgroup even if there was
> not a more appropriate newsgroup.
>
> /Leigh

I think that posting to the next most relevant group(s) where a directly
relevant group does not exist is right, and thus you are simply wrong.