From: Mihai N. on 18 May 2010 03:44

> the fastest and simplest possible way to validate and divide any UTF-8
> sequence into its constituent code point parts is a regular expression
> implemented as a finite state machine

Sorry, where did you get this one from?

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Joshua Maurice on 18 May 2010 05:51

On May 18, 12:38 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote:
> > //COMPLETELY UNTESTED
>
> Then most likely wrong :-)

Yes. It was there just for demonstration purposes, to show how easy the
code is, and why I might consider "regex" and "state machine libraries"
or whatever to be overkill. I will wait patiently for his code and
compare it to what I whipped off the top of my head.

From: James Kanze on 18 May 2010 10:18

On 16 May, 14:51, Öö Tiib <oot...(a)hot.ee> wrote:
> On 16 mai, 15:34, "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote:

> I suspect UTF8 fades gradually into history. Reasons are
> similar like 256 color video-modes and raster-graphic formats
> went. GUI-s are already often made with java or C# (for lack
> of C++ devs) and these use UTF16 internally. Notice that
> modern processor architectures are already optimized in the
> way that byte-level operations are often slower.

The network is still 8 bits, UTF-8. As are the disks; using UTF-16 on
an external support simply doesn't work. Also, UTF-8 may result in less
memory use, and thus less paging.

If all you're doing is simple operations, searching for a few ASCII
delimiters and copying the delimited substrings, for example, UTF-8
will probably be significantly faster: the CPU will always read a word
at a time, even if you access it byte by byte, and you'll usually get
more characters per word using UTF-8.

If you need full and complete support, as in an editor, for example,
UTF-32 is the best general solution. For a lot of things in between,
UTF-16 is a good compromise. But the trade-offs only concern internal
representation. Externally, the world is 8 bits, and UTF-8 is the only
solution.

--
James Kanze
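
To make the "simple operations" point concrete, here is a minimal sketch
(not code from the thread; the function name is made up): splitting a
UTF-8 string on an ASCII delimiter can be done byte by byte, because a
byte below 0x80 never occurs inside a multi-byte UTF-8 sequence, so no
decoding is needed at all.

    // Split a UTF-8 string on an ASCII delimiter (delim must be < 0x80).
    // Works on raw bytes; multi-byte sequences are never split, because
    // their bytes are all >= 0x80 and can never equal the delimiter.
    #include <string>
    #include <vector>

    std::vector<std::string> split_utf8(const std::string& s, char delim)
    {
        std::vector<std::string> out;
        std::string::size_type start = 0;
        for (;;) {
            std::string::size_type pos = s.find(delim, start);  // plain byte search
            if (pos == std::string::npos) {
                out.push_back(s.substr(start));     // last field
                return out;
            }
            out.push_back(s.substr(start, pos - start));
            start = pos + 1;
        }
    }

The same holds for any ASCII delimiter (',', '\n', '/', ...), which is
why this kind of substring work never needs to look at code points.
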
From: Oliver Regenfelder on 18 May 2010 10:26

Hello,

Peter Olcott wrote:
> I completed the detailed design on the DFA that would validate and
> translate any valid UTF-8 byte sequence into UTF-32. It can not be done
> faster or simpler. The state transition matrix only takes exactly 2 KB.

Who cares about DFAs and state transition matrix sizes when all you
want to do is convert UTF-8 to UTF-32? That is a handful of if/else and
switch statements in your programming language of choice, plus error
handling.

Best regards,

Oliver
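
A minimal sketch of the "if/else plus error handling" approach Oliver
describes, decoding one code point at a time. The function and names
are illustrative, not anyone's actual code from the thread.

    #include <cstddef>
    #include <cstdint>

    // Decode one code point from a UTF-8 buffer.
    // Returns the number of bytes consumed, or 0 on error.
    std::size_t decode_utf8(const unsigned char* p, std::size_t len,
                            std::uint32_t& cp)
    {
        if (len == 0) return 0;
        unsigned char b0 = p[0];

        if (b0 < 0x80) {                        // 1 byte: U+0000..U+007F
            cp = b0;
            return 1;
        }
        std::size_t n;                          // expected sequence length
        if      ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; }
        else return 0;                          // stray continuation or bad lead byte

        if (len < n) return 0;                  // truncated sequence
        for (std::size_t i = 1; i < n; ++i) {
            if ((p[i] & 0xC0) != 0x80) return 0;    // not a continuation byte
            cp = (cp << 6) | (p[i] & 0x3F);
        }

        // Reject overlong encodings, surrogates, and out-of-range values.
        static const std::uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[n]) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        if (cp > 0x10FFFF) return 0;
        return n;
    }

A caller would simply loop over the buffer, advancing by the returned
length and treating a return of 0 as a validation error.
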
From: Oliver Regenfelder on 18 May 2010 10:29
Hello,

Peter Olcott wrote:
> Maybe it is much simpler for me than it would be for others because of
> my strong bias towards DFA recognizers.

I would say it is exactly the opposite: your strong bias towards DFA
recognizers lets you completely forget about the abstraction level you
are currently dealing with.

> I bet my DFA recognizer is at
> least twice as fast as any other method for validating UTF-8 and
> converting it to code points.
> I am estimating about 20 machine clocks
> per code point.

You might want to reread some of the postings regarding optimization
from the earlier threads. Have you been a hardware engineer before, by
any chance?

Best regards,

Oliver
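
For comparison, here is one possible shape of the table-driven DFA being
debated. This is a sketch under my own assumptions, not Peter's design:
it only validates (converting to UTF-32 would additionally accumulate
the payload bits), and its transition matrix is 9 states x 256 bytes =
2304 bytes rather than exactly 2 KB. All names are made up.

    #include <cstddef>
    #include <cstring>

    enum {
        ACCEPT = 0,           // start of a sequence / sequence complete
        CONT1, CONT2, CONT3,  // 1, 2 or 3 unrestricted continuation bytes expected
        E0, ED, F0, F4,       // restricted second byte after these lead bytes
        REJECT                // dead state
    };

    static unsigned char dfa[9][256];   // transition matrix: 9 * 256 = 2304 bytes

    static void set_range(int from, unsigned lo, unsigned hi, int to)
    {
        for (unsigned b = lo; b <= hi; ++b) dfa[from][b] = (unsigned char)to;
    }

    static void init_dfa()              // call once before valid_utf8()
    {
        std::memset(dfa, REJECT, sizeof dfa);     // everything rejects by default
        set_range(ACCEPT, 0x00, 0x7F, ACCEPT);    // ASCII
        set_range(ACCEPT, 0xC2, 0xDF, CONT1);     // 2-byte leads (C0/C1 are overlong)
        dfa[ACCEPT][0xE0] = E0;                   // 3-byte leads
        set_range(ACCEPT, 0xE1, 0xEC, CONT2);
        dfa[ACCEPT][0xED] = ED;                   // ED xx could encode surrogates
        set_range(ACCEPT, 0xEE, 0xEF, CONT2);
        dfa[ACCEPT][0xF0] = F0;                   // 4-byte leads
        set_range(ACCEPT, 0xF1, 0xF3, CONT3);
        dfa[ACCEPT][0xF4] = F4;                   // F5..FF would exceed U+10FFFF
        set_range(CONT1, 0x80, 0xBF, ACCEPT);
        set_range(CONT2, 0x80, 0xBF, CONT1);
        set_range(CONT3, 0x80, 0xBF, CONT2);
        set_range(E0,    0xA0, 0xBF, CONT1);      // forbid overlong 3-byte forms
        set_range(ED,    0x80, 0x9F, CONT1);      // forbid surrogates U+D800..U+DFFF
        set_range(F0,    0x90, 0xBF, CONT2);      // forbid overlong 4-byte forms
        set_range(F4,    0x80, 0x8F, CONT2);      // forbid values above U+10FFFF
    }

    bool valid_utf8(const unsigned char* p, std::size_t len)
    {
        int state = ACCEPT;
        for (std::size_t i = 0; i < len; ++i) {
            state = dfa[state][p[i]];
            if (state == REJECT) return false;
        }
        return state == ACCEPT;                   // reject truncated final sequences
    }

The inner loop is one table lookup and one branch per byte; whether that
actually lands near an estimate like 20 machine clocks per code point,
or beats the straightforward if/else decoder, is something only a
measurement can settle.
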