From: Peter Olcott on 17 May 2010 10:42

On 5/17/2010 3:22 AM, Oliver Regenfelder wrote:
> Hello,
>
> Joseph M. Newcomer wrote:
>> This one makes no sense. There will be ORDERS OF MAGNITUDE greater
>> differences in input time if you take rotational latency and seek time
>> into consideration (in fact, opening the file will have orders of
>> magnitude more variance than the cost of a UTF-8 to UTF-16 or even
>> UTF-32 conversion, because of the directory lookup time variance).
>
> Do yourself a favor Peter and believe him!
> A harddisk takes, half guessed (seek + half a rotation @ 7,200 rpm),
> ~12-14 ms to reach a sector for IO, and that is only the raw hardware
> delay. On networks you will have roundtrip times of maybe 60 ms or more
> (strongly depending on your INET connection and server location). So any
> computational effort for your string conversion doesn't matter,
> especially as your script language files won't be in the gigabyte range.
>
> Best regards,
>
> Oliver

I thought that he was saying that it takes much more time to read UTF-8 from disk than it takes to read UTF-32 from disk. This would be absurd. That it takes much more time to read either UTF-8 or UTF-32 from disk than it takes to convert either to the other, I already knew.

In any case, UTF-32 will be the internal representation of my GUI scripting language's string type. I will stick with UTF-8 for the lexical analyzer and the symbol table.
From: Peter Olcott on 17 May 2010 10:48

On 5/17/2010 8:30 AM, Leigh Johnston wrote:
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:_redna1LAMZaomzWnZ2dnUVZ_tOdnZ2d(a)giganews.com...
>> On 5/17/2010 1:35 AM, Mihai N. wrote:
>>>> I studied the derivation of the above regular expression in
>>>> considerable depth. I understand UTF-8 encoding quite well. So far
>>>> I have found no error.
>>>
>>>> It is published on w3c.org.
>>>
>>> Stop repeating this nonsense.
>>>
>>> The URL is http://www.w3.org/2005/03/23-lex-U and the post states:
>>> "It is not endorsed by the W3C members, team, or any working group."
>>>
>>> It is a hack implemented by someone, and it happens to be on the w3c
>>> server. This is not enough to make it right. If I post something on
>>> the free blogging space offered by Microsoft, will you take it as law
>>> and say "it is published on microsoft.com"?
>>>
>> Do you know of any faster way to validate and divide a UTF-8 sequence
>> into its constituent code point parts than a regular expression
>> implemented as a finite state machine? (Please don't cite a software
>> package; I am only interested in the underlying methodology.)
>>
>> To the very best of my knowledge (and I have a patent on a finite
>> state recognizer), a regular expression implemented as a finite state
>> machine is the fastest and simplest of every way that can possibly
>> exist to validate a UTF-8 sequence and divide it into its constituent
>> parts.
>
> My utf8_to_wide free function is not a finite state machine and it is
> pretty fast. It takes a std::string as input and returns a std::wstring
> as output. KISS.
>
> /Leigh

I couldn't imagine how to do this without using a finite state machine. How did you do it?

std::wstring will not help me because I have confirmed my original decision to use UTF-32 as my internal representation, and MS Windows only has a 16-bit std::wstring.
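[Editorial note: the finite-state-machine approach Peter describes can be illustrated with a small hand-written DFA. This is a sketch, not the w3c tables he cites; state and function names are mine, and the transitions follow the well-formed UTF-8 byte ranges in Table 3-7 of the Unicode standard (which rejects overlong forms, surrogates, and values above U+10FFFF).]

```cpp
#include <cstdint>
#include <cstddef>

// States of a hand-written DFA that accepts exactly the well-formed
// UTF-8 byte sequences of Table 3-7 of the Unicode standard.
enum State {
    START,    // expecting a lead byte
    TAIL1,    // one generic continuation byte (0x80..0xBF) remains
    TAIL2,    // two generic continuation bytes remain
    TAIL3,    // three generic continuation bytes remain
    E0_TAIL,  // after 0xE0: next byte must be 0xA0..0xBF (no overlongs)
    ED_TAIL,  // after 0xED: next byte must be 0x80..0x9F (no surrogates)
    F0_TAIL,  // after 0xF0: next byte must be 0x90..0xBF (no overlongs)
    F4_TAIL,  // after 0xF4: next byte must be 0x80..0x8F (<= U+10FFFF)
    REJECT
};

State step(State s, std::uint8_t b) {
    switch (s) {
    case START:
        if (b <= 0x7F) return START;
        if (b >= 0xC2 && b <= 0xDF) return TAIL1;
        if (b == 0xE0) return E0_TAIL;
        if (b == 0xED) return ED_TAIL;
        if (b >= 0xE1 && b <= 0xEF) return TAIL2;  // 0xE0/0xED handled above
        if (b == 0xF0) return F0_TAIL;
        if (b >= 0xF1 && b <= 0xF3) return TAIL3;
        if (b == 0xF4) return F4_TAIL;
        return REJECT;  // 0x80..0xC1 and 0xF5..0xFF can never start a sequence
    case TAIL1:   return (b >= 0x80 && b <= 0xBF) ? START : REJECT;
    case TAIL2:   return (b >= 0x80 && b <= 0xBF) ? TAIL1 : REJECT;
    case TAIL3:   return (b >= 0x80 && b <= 0xBF) ? TAIL2 : REJECT;
    case E0_TAIL: return (b >= 0xA0 && b <= 0xBF) ? TAIL1 : REJECT;
    case ED_TAIL: return (b >= 0x80 && b <= 0x9F) ? TAIL1 : REJECT;
    case F0_TAIL: return (b >= 0x90 && b <= 0xBF) ? TAIL2 : REJECT;
    case F4_TAIL: return (b >= 0x80 && b <= 0x8F) ? TAIL2 : REJECT;
    default:      return REJECT;
    }
}

bool valid_utf8(const std::uint8_t *p, std::size_t n) {
    State s = START;
    for (std::size_t i = 0; i < n; ++i)
        if ((s = step(s, p[i])) == REJECT) return false;
    return s == START;  // must not end in the middle of a sequence
}
```

Note the machine needs dedicated states after 0xE0, 0xED, 0xF0, and 0xF4 precisely because the first continuation byte is range-restricted there; a regexp over byte ranges encodes the same distinctions, which is why the two formulations are equivalent.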
From: Leigh Johnston on 17 May 2010 11:23

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message news:7PydnQ-UH8CoymzWnZ2dnUVZ_iydnZ2d(a)giganews.com...
> I couldn't imagine how to do this without using a finite state machine.
> How did you do it?

Writing such a free function is not rocket science, so I will not bore people by describing it here. N.B. my function is not quite the same as yours, as it doesn't "validate" a UTF-8 sequence; instead, if a particular byte (>= 0x80) is not part of a valid UTF-8 sequence, my function will use mbtowc as a fallback (using the default locale) rather than signalling an invalid sequence.

> std::wstring will not help me because I have confirmed my original
> decision to use UTF-32 as my internal representation, and MS Windows
> only has a 16-bit std::wstring.

If you are developing for Windows only, it makes sense to use UTF-16 as the internal representation, i.e. use std::wstring.

/Leigh
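[Editorial note: Leigh does not show his code, but a branch-based decoder with no state-machine table might look like the following sketch. It is not his actual function: it substitutes U+FFFD (REPLACEMENT CHARACTER) for invalid bytes rather than falling back to mbtowc, and it targets the UTF-32 representation Peter wants rather than std::wstring.]

```cpp
#include <string>
#include <vector>
#include <cstdint>
#include <cstddef>

// Decode UTF-8 with a plain branching loop -- no DFA table.
// Invalid bytes and malformed sequences become U+FFFD.
std::vector<char32_t> utf8_to_utf32(const std::string &in) {
    std::vector<char32_t> out;
    std::size_t i = 0, n = in.size();
    while (i < n) {
        std::uint8_t b = static_cast<std::uint8_t>(in[i]);
        std::size_t len;
        char32_t cp;
        if (b < 0x80)                { len = 1; cp = b; }         // ASCII
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }  // 1110xxxx
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }  // 11110xxx
        else { out.push_back(0xFFFD); ++i; continue; }            // stray byte

        bool ok = i + len <= n;  // enough bytes left?
        for (std::size_t k = 1; ok && k < len; ++k) {
            std::uint8_t c = static_cast<std::uint8_t>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;   // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        static const char32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (ok && (cp < min_cp[len] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(0xFFFD); ++i; }   // resync at next byte
    }
    return out;
}
```

The post-hoc range check replaces the per-state restrictions a DFA would carry, which is why the loop stays short; whether it beats a table-driven machine is a benchmarking question, not an architectural one.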
From: Joseph M. Newcomer on 17 May 2010 11:48

The underlying technology is discussed in the Unicode documentation and on www.unicode.org. There is a set of APIs that deliver character information, including the class information, which are part of the Unicode support in Windows.

But the point is, thinking of Unicode code points by writing a regexp for UTF-8 is not a reasonable approach. Or to put it bluntly: the regexp set you show is wrong, I have shown it is wrong, and you have to start thinking correctly about the problem.

joe

On Mon, 17 May 2010 08:08:22 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>On 5/17/2010 1:35 AM, Mihai N. wrote:
>>
>>> I studied the derivation of the above regular expression in considerable
>>> depth. I understand UTF-8 encoding quite well. So far I have found no
>>> error.
>>
>>> It is published on w3c.org.
>>
>> Stop repeating this nonsense.
>>
>> The URL is http://www.w3.org/2005/03/23-lex-U and the post states:
>> "It is not endorsed by the W3C members, team, or any working group."
>>
>> It is a hack implemented by someone, and it happens to be on the w3c server.
>> This is not enough to make it right. If I post something on the free blogging
>> space offered by Microsoft, will you take it as law and say "it is published
>> on microsoft.com"?
>>
>Do you know of any faster way to validate and divide a UTF-8 sequence
>into its constituent code point parts than a regular expression
>implemented as a finite state machine? (Please don't cite a software
>package; I am only interested in the underlying methodology.)
>
>To the very best of my knowledge (and I have a patent on a finite state
>recognizer), a regular expression implemented as a finite state machine
>is the fastest and simplest of every way that can possibly exist to
>validate a UTF-8 sequence and divide it into its constituent parts.

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Joseph M. Newcomer on 17 May 2010 12:04
See below...

On Mon, 17 May 2010 09:29:33 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>On 5/17/2010 1:44 AM, Joseph M. Newcomer wrote:
>>> If you are not a liar then show an error in the above regular
>>> expression, I dare you.
>> ***
>> I have already pointed out that it is insufficient for lexically
>> recognizing accent marks or invalid combinations of accent marks. So
>> the requirement of demonstrating an error is trivially met.
>>
>> In addition, the regexp values do not account for directional changes
>> in the parse, which is essential, for reasons I explained in another
>> response.
>
>I have always defined correct to mean valid UTF-8 sequences (according
>to the UTF-8 specification), and now you are presenting the red herring
>that it does not validate code point sequences. It is not supposed to
>validate code point sequences.
****
Oh, so it doesn't matter if it is *correct*, as long as it is *correct*. I thought you said "semantically correct", and semantics necessarily implies "meaning". What you were asking, or should have asked, is something along the lines of "does this set of regular expressions define the set of valid UTF-8 character sequences?" Then you started talking about codepoints, which of course means correct Unicode representations, and I already pointed out why that doesn't work.

So what are you asking? Or is this another "magic morphing question" that will change with every response? State PRECISELY what the question is. Otherwise, you leave us guessing as to what you are really asking.
****
>
>The reason that I ALWAYS ask you to explain your reasoning is that this
>most often provides the invalid assumptions that you are making.
*****
Oh. And your assumptions (that conversion time and space apparently matter) are always valid?
You would not be discussing the way to construct a utf8string unless you thought there was a reason that such a kludge mattered, and you argued it based on time and space, which means you made an invalid assumption: that these matter in the slightest! Out here in the Real World, we make decisions based on global performance and correctness goals, and include metrics like development cost, portability across localization, maintainability, and similar parameters. UTF-8 as an internal representation fails on all these scores.

And why do you think you can't have a CString of UTF-32 characters? We have CStringA and CStringW, and with a little work you could create a CString32 that had all the right properties and derived from the base class. It is a minor exercise, the kind I might assign to a beginning C++ programmer. Then you could write a UTF-8-to-UTF-32 conversion (for example, BOOL cv8_32(const CStringA & utf8, CString32 & result);) and you would avoid most of the problems you are trying to solve.

Note that the FSM required to parse multibyte sequences is not based on ranges, but on the high-order bits of the first byte; if you had read the Unicode 5.0 documentation for this, you would have seen the table. I am not next to my computer right now, so the Unicode 5.0 book, which is normally within arm's reach, is actually next door, or I'd give you a page number.

I would not use a regexp to syntactically validate an input UTF-8 string. If I were writing cv8_32 I would follow the encoding rules for multibyte UTF-8 characters that are specified in the book. The regexp approach is just wrong.
****
****
>
>> It would be easier if you had expressed it as Unicode codepoints; then
>> it would be easy to show the numerous failures. I'm sorry, I thought
>> you had already applied exhaustive categorical reasoning to this,
>> which would have demonstrated the errors.
>
>This level of detail is not relevant to the specific problem that I am
>solving.
>The problem is providing a minimal cost way to permit people to
>write GUI scripts in their native language. Anything that goes beyond
>the scope of this problem is explicitly out-of-scope.
****
Huh? The correct approach is to let them write the scripts in whatever editor they like to use, save them as UTF-8, read them in, and convert them to UTF-32. After that, you have to figure out what parsing them means. For example, I believe the regexp you have given does not properly handle numbers. But I already pointed this out.

joe

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
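[Editorial note: the lead-byte dispatch Joe describes -- sequence length determined by the high-order bits of the first byte rather than by value ranges -- can be sketched as follows. The function name is mine; range checks on the continuation bytes (for overlongs, surrogates, and the U+10FFFF ceiling) still have to follow this classification.]

```cpp
#include <cstdint>

// Sequence length implied by a UTF-8 lead byte, read from its
// high-order bits; returns 0 for a byte that cannot begin a sequence.
int utf8_seq_len(std::uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx  ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;  // 10xxxxxx continuation byte, or 0xF8..0xFF: never a lead
}
```

A converter in the cv8_32 style would call this once per character, then consume and range-check that many continuation bytes, with no regexp anywhere in the pipeline.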