From: Joseph M. Newcomer on 19 May 2010 15:19 See below... On Wed, 19 May 2010 10:14:48 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote: >On 5/19/2010 5:39 AM, James Kanze wrote: >> On May 18, 8:17 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote: >>> On 5/18/2010 9:34 AM, James Kanze wrote: >>>> On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote: >>>>> On 5/17/2010 1:35 AM, Mihai N. wrote: >> >>>>> a regular expression implemented as a finite state machine >>>>> is the fastest and simplest possible way of every way that >>>>> can possibly exist to validate a UTF-8 sequence and divide >>>>> it into its constituent parts. >> >>>> It all depends on the formal specification; one of the >>>> characteristics of UTF-8 is that you don't have to look at >>>> every character to find the length of a sequence. And >>>> a regular expression generally will have to look at every >>>> character. >> >>> Validation and translation to UTF-32 concurrently can not be >>> done faster than a DFA recognizer, in fact it must always be >>> slower. >> >> UTF-8 was designed intentionally in a way that it doesn't >> require a complete DFA to handle, but can be handled faster. >> Complete DFA's are usually slower than caluculations on modern >> processors, since they require memory accesses, and memory is >> often the limiting factor. >> >> In fact, there is no "must always be slower". There are too >> many variables involved to be able to make such statements. >> >> -- >> James Kanze > >This is the essence of my optimal design, try and show one that is >faster for matching 50 keywords. **** How do you mean "recognize"? If you are talking about reserved words like "if" or "else", then you need a perfect hash algorithm in addition. **** > >Looking up an ActionCode switch statement value based on a >state_transition_matrix that is indexed by current_state and >current_input_byte. 
> >unsigned char States[8][256]; > >The above is the state transition matrix for validating and translating >UTF-8 to UTF-32. > >The above design completely proves my point to everyone with sufficent >knowledge of UTF-8 and DFA state transition matrices. > >I will also add that I only have twelve ActionCodes including >InvalidByteError and the OutOfdata sentinel. Joseph M. Newcomer [MVP] email: newcomer(a)flounder.com Web: http://www.flounder.com MVP Tips: http://www.flounder.com/mvp_tips.htm
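As a rough illustration of the table-driven design quoted above, here is a minimal sketch of a DFA that validates UTF-8 with an 8-row, 256-column transition table. It is not Peter Olcott's actual code. The state numbering, the `kReject` sentinel, and the names `set_range`, `build_table` and `is_valid_utf8` are assumptions made for this example, and it only validates (no translation to UTF-32, no ActionCodes).

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

namespace utf8_sketch {

// Hypothetical state numbering (not taken from the thread):
// 0 = start/accept, 1..3 = that many plain continuation bytes still expected,
// 4 = after E0, 5 = after ED, 6 = after F0, 7 = after F4.
constexpr std::uint8_t kStart = 0;
constexpr std::uint8_t kReject = 0xFF;  // sentinel value, not a row in the table

using Table = std::array<std::array<std::uint8_t, 256>, 8>;

// Set t[state][b] = next for every byte b in [lo, hi].
inline void set_range(Table& t, int state, int lo, int hi, std::uint8_t next) {
    assert(0 <= state && state < 8 && 0 <= lo && lo <= hi && hi <= 0xFF);
    for (int b = lo; b <= hi; ++b) t[state][b] = next;
}

inline Table build_table() {
    Table t;
    for (auto& row : t) row.fill(kReject);   // anything not listed is invalid
    set_range(t, 0, 0x00, 0x7F, 0);  // ASCII
    set_range(t, 0, 0xC2, 0xDF, 1);  // 2-byte lead (C0 and C1 would be overlong)
    set_range(t, 0, 0xE0, 0xE0, 4);  // 3-byte lead, second byte restricted
    set_range(t, 0, 0xE1, 0xEC, 2);
    set_range(t, 0, 0xED, 0xED, 5);  // must not encode surrogates
    set_range(t, 0, 0xEE, 0xEF, 2);
    set_range(t, 0, 0xF0, 0xF0, 6);  // 4-byte lead, second byte restricted
    set_range(t, 0, 0xF1, 0xF3, 3);
    set_range(t, 0, 0xF4, 0xF4, 7);  // must stay at or below U+10FFFF
    set_range(t, 1, 0x80, 0xBF, 0);
    set_range(t, 2, 0x80, 0xBF, 1);
    set_range(t, 3, 0x80, 0xBF, 2);
    set_range(t, 4, 0xA0, 0xBF, 1);
    set_range(t, 5, 0x80, 0x9F, 1);
    set_range(t, 6, 0x90, 0xBF, 2);
    set_range(t, 7, 0x80, 0x8F, 2);
    return t;
}

// One table lookup per input byte; valid only if we end in the start state.
inline bool is_valid_utf8(const std::string& s) {
    static const Table table = build_table();
    std::uint8_t state = kStart;
    for (unsigned char byte : s) {
        state = table[state][byte];
        if (state == kReject) return false;
    }
    return state == kStart;
}

}  // namespace utf8_sketch
```

Whether this beats a branch-based decoder is exactly what the thread disputes: the sketch does one memory access per byte, which is the cost James Kanze points to.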
From: Joshua Maurice on 19 May 2010 17:02 On May 19, 1:50 am, Öö Tiib <oot...(a)hot.ee> wrote: > On May 19, 8:24 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote: > > > > I perhaps have too low experience with sophisticated text processing. > > > Simple std::sort(), wide char literals of C++ and boost::wformat plus > > > full set of conversion functions is all i need really. > > > It depends a lot what you need. > > > Sorting is locale-sensitive (German, Swedish, French, Spanish, all > > have different sorting rules). > > The CRT (and STL, and boost) are pretty dumb when dealing with things > > in a locale sensitive way (meaning that they usualy don't :-) > > Yes, sorting in real alphabetic order for user is perhaps business of > GUI. GUI has to display it. GUI however usually has its WxStrings or > FooStrings anyway. I hate when someone leaks these weirdos to > application mechanics layer. Internal application logic is often best > made totally locale-agnostic and not caring about positioning in GUI > and if the end-users write from up to down or from right to left. > > So text in electronic interfaces layer are bytes, text in application > layer are wchar_t and text in user interface layer are whatever weirdo > rules there. If maintainer forgets to convert in interface between > layers he gets compiler warnings or errors. That makes life easy, but > i suspect my problems with texts are more trivial than these of some > others. First, as I mentioned in the other current thread on Unicode, please stop saying "wchar_t" and "wstring" as though that means something, or is at all a useful portable tool. wchar_t is 16 bits on windows, and 32 bits on most Unix-like systems IIRC. (Yes, the other thread listed some more exceptions.) So, either you're suggesting an entirely not portable solution with wstring, or you are suggesting that it makes sense to use UTF32 on Unix-like computers and UTF16 on windows computers, a quite silly statement. 
Then, locales in my experience have not been terribly portable, not portable enough for my company's product which runs on nearly all computer OSs known to man, including Windows, Win x64, the soon to be "desupported by Windows" Windows Itanium, Linux, z Linux, OS 2, HPUX IPF, and so on. Moreover, it's not terribly practical to tell our customers "you have to install these 'x' locales". Moreover, the locales of the same name on different OSs have been known to have subtly different behavior.

Finally, I can't think of a useful example off the top of my head where sorting based on locale would be required except when "printing", to the screen, file, etc., but this doesn't convince me that there is no use for it. As a potential example, should you have to bring in an entire GUI framework just to implement the Unix utility "sort" except with an additional locale option? That seems silly to me.
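To make the 16-bit versus 32-bit point above concrete, here is a small sketch of how a single code point is laid out in UTF-16 code units. The function name `encode_utf16` is made up for this example, and it assumes the input is already a valid Unicode scalar value (no error checking). A code point above U+FFFF fits in one 32-bit unit but needs two 16-bit units, which is why code that indexes "one unit per character" does not carry over between the two widths.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode scalar value as UTF-16 code units.
// Assumption: cp is valid (not in D800..DFFF, not above 10FFFF); not checked.
inline std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit is enough.
        return {static_cast<std::uint16_t>(cp)};
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    const std::uint32_t v = cp - 0x10000;
    return {static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high surrogate
            static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))};  // low surrogate
}
```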
From: Joseph M. Newcomer on 19 May 2010 18:19 See below... On Wed, 19 May 2010 14:02:28 -0700 (PDT), Joshua Maurice <joshuamaurice(a)gmail.com> wrote: >On May 19, 1:50 am, Öö Tiib <oot...(a)hot.ee> wrote: >> On May 19, 8:24 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote: >> >> > > I perhaps have too low experience with sophisticated text processing. >> > > Simple std::sort(), wide char literals of C++ and boost::wformat plus >> > > full set of conversion functions is all i need really. >> >> > It depends a lot what you need. >> >> > Sorting is locale-sensitive (German, Swedish, French, Spanish, all >> > have different sorting rules). >> > The CRT (and STL, and boost) are pretty dumb when dealing with things >> > in a locale sensitive way (meaning that they usualy don't :-) >> >> Yes, sorting in real alphabetic order for user is perhaps business of >> GUI. GUI has to display it. GUI however usually has its WxStrings or >> FooStrings anyway. I hate when someone leaks these weirdos to >> application mechanics layer. Internal application logic is often best >> made totally locale-agnostic and not caring about positioning in GUI >> and if the end-users write from up to down or from right to left. >> >> So text in electronic interfaces layer are bytes, text in application >> layer are wchar_t and text in user interface layer are whatever weirdo >> rules there. If maintainer forgets to convert in interface between >> layers he gets compiler warnings or errors. That makes life easy, but >> i suspect my problems with texts are more trivial than these of some >> others. > >First, as I mentioned in the other current thread on Unicode, please >stop saying "wchar_t" and "wstring" as though that means something, or >is at all a useful portable tool. wchar_t is 16 bits on windows, and >32 bits on most Unix-like systems IIRC. (Yes, the other thread listed >some more exceptions.) 
So, either you're suggesting an entirely not >portable solution with wstring, or you are suggesting that it makes >sense to use UTF32 on Unix-like computers and UTF16 on windows >computers, a quite silly statement. **** wchar_t is whatever the implementor of the compiler wants it to be. The Microsoft C compiler implements it as a 16-bit value, but with the advent of the extended ranges of Unicode, 32-bit makes a lot more sense. If your compiler defines wchar_t as 16 bits, then it implies UTF-16 encoding, meaning surrogates are required for characters above U+FFFF. Code that was written assuming UTF-32 encoding would not be portable down to UTF-16 encoding. But this doesn't change the fact that sorting is not a part of the GUI, but an abstract concept based on localized conventions. **** > >Then, locales in my experience have not been terribly portable, not >portable enough for my company's product which runs on nearly all >computer OSs known to man, including windows, win x64, the so to be >"desupported by windows" windows itanium, Linux, z Linux, OS 2, HPUX >IPF, and so on. Moreover, it's not terribly practical to tell our >customers "you have to install these 'x' locales". Moreover, the >locales of the same name on different OSs have been known to have >subtly different behavior. **** Well, the locale names are supposed to be the ISO standard string designators; if the runtime does not support it, that constitutes a "bug". I was never able to find any code that was truly "portable" on Unix systems because of the plethora of C compilers and runtimes that existed (in fact, I believe that code portability is largely a myth which the C programmers like to propagate; fifteen years of trying to either write code that would port or trying to port allegedly "portable" code convinced me that this is substantially more difficult than anyone suspects). Character encoding, as I tell my students, is less than 10% of the problem of portability. 
You have to worry about collating sequences, date formats, time formats, etc. And neither date formats nor time formats have simple solutions (e.g., in Norway, they use a 12-hour clock, unless the application is going to deal with mass transit in which case, by law, it must use a 24-hour clock; in Japan, contracts are not legal if they use the "Western" dates such as 23-Jan-10 or 10/23/10 or 23/10/2010; instead, they must use the date based on the nth year, kth month, mth day of the reign of the emperor <name here>). Localization is not a trivial problem, and I've had interesting issues arise even when I have tried to be extremely careful of the problems. **** > >Finally, I can't think of a useful example off the top of my head >where sorting based on locale would be required except when >"printing", to the screen, file, etc., but this doesn't convince me >that there is no use for it. As a potential example, should you have >to bring in an entire GUI framework just to implement the Unix utility >"sort" except with an additional locale option? That seems silly to >me. **** You don't need a GUI framework to sort. At least not in Windows; the CompareString API does it by returning a code to indicate the relative ordering of the two strings being compared. It takes a locale specifier (LCID). My major applications that required localized sorting (à la "locale") had no GUI at all, only output files or output to the printer. And there was no GUI framework at all in them, because one ran on the DECSystem-10 under the TOPS-10 operating system, and the other was written to run on MS-DOS. All of this preceded the notion of "locale" in the CRT or OS. The DECSystem-10 project was written in the SAIL language, which used only the ASCII-7 character set. In principle, you should have something like a -l<LCID HERE> option to a utility program, or the name of the locale in the ISO standard notation. I have no idea how the idea of "localized sorting requires a GUI" even arose! 
joe **** Joseph M. Newcomer [MVP] email: newcomer(a)flounder.com Web: http://www.flounder.com MVP Tips: http://www.flounder.com/mvp_tips.htm
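The point that locale-aware sorting needs no GUI can be sketched in standard C++ as well. This is not the Win32 CompareString API mentioned above but a swapped-in illustration using `std::locale`, whose function-call operator compares two strings through the locale's collate facet. The function name `sort_lines` is hypothetical. A command-line locale option would presumably be turned into a `std::locale` by name; such names are platform-specific, and the constructor throws `std::runtime_error` if the named locale is not available.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Sort lines using the collation rules of the given locale.
// std::locale::operator() compares two strings via the locale's
// std::collate facet, so the locale object itself serves as the comparator.
inline void sort_lines(std::vector<std::string>& lines, const std::locale& loc) {
    std::sort(lines.begin(), lines.end(), loc);
}
```

With `std::locale::classic()` the ordering is plain byte order; a named locale, where installed, would apply that locale's own collation rules.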
From: Öö Tiib on 19 May 2010 18:33 On May 20, 12:02 am, Joshua Maurice <joshuamaur...(a)gmail.com> wrote: > On May 19, 1:50 am, Öö Tiib <oot...(a)hot.ee> wrote: > > On May 19, 8:24 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote: > > > > > I perhaps have too low experience with sophisticated text processing. > > > > Simple std::sort(), wide char literals of C++ and boost::wformat plus > > > > full set of conversion functions is all i need really. > > > > It depends a lot what you need. > > > > Sorting is locale-sensitive (German, Swedish, French, Spanish, all > > > have different sorting rules). > > > The CRT (and STL, and boost) are pretty dumb when dealing with things > > > in a locale sensitive way (meaning that they usualy don't :-) > > > Yes, sorting in real alphabetic order for user is perhaps business of > > GUI. GUI has to display it. GUI however usually has its WxStrings or > > FooStrings anyway. I hate when someone leaks these weirdos to > > application mechanics layer. Internal application logic is often best > > made totally locale-agnostic and not caring about positioning in GUI > > and if the end-users write from up to down or from right to left. > > > So text in electronic interfaces layer are bytes, text in application > > layer are wchar_t and text in user interface layer are whatever weirdo > > rules there. If maintainer forgets to convert in interface between > > layers he gets compiler warnings or errors. That makes life easy, but > > i suspect my problems with texts are more trivial than these of some > > others. > > First, as I mentioned in the other current thread on Unicode, please > stop saying "wchar_t" and "wstring" as though that means something, or > is at all a useful portable tool. wchar_t is 16 bits on windows, and > 32 bits on most Unix-like systems IIRC. (Yes, the other thread listed > some more exceptions.) 
So, either you're suggesting an entirely not > portable solution with wstring, or you are suggesting that it makes > sense to use UTF32 on Unix-like computers and UTF16 on windows > computers, a quite silly statement. Now ... it seems that there is a strange misunderstanding. For anyone converting from whatever char sequence to whatever wchar_t sequence, it is a highly platform-dependent operation anyway. I have in no way said that such operations are portable. Since wstring is used for holding texts internally, sizeof(wchar_t) does not affect anything. The major property of wchar_t for me is that it is different from char on all platforms I know, so I get warnings/errors from tools on attempts to mechanically assign one to the other. > Then, locales in my experience have not been terribly portable, not > portable enough for my company's product which runs on nearly all > computer OSs known to man, including windows, win x64, the so to be > "desupported by windows" windows itanium, Linux, z Linux, OS 2, HPUX > IPF, and so on. You managed to somehow have portability in string-to-string conversions? Congrats. I have abandoned all hope there. Different code is used for conversions platform by platform. The platform makers (and not only they) seemingly fight with each other to make their data incompatible, so why should I hope there will be peace and portability any day? Is there something new? The same goes on with dates, values with measurement units and even plain floating point numbers ... you name it. Plain text is no different. > Moreover, it's not terribly practical to tell our > customers "you have to install these 'x' locales". Moreover, the > locales of the same name on different OSs have been known to have > subtly different behavior. Exactly! So portability and localization are possible only by having a converter for each platform that knows the quirks of that platform. Whether sizeof(wchar_t) is 2 or 4 does not matter at all, since the code that produces it is different anyway. 
> Finally, I can't think of a useful example off the top of my head > where sorting based on locale would be required except when > "printing", to the screen, file, etc., but this doesn't convince me > that there is no use for it. No need to nail me. I only confirm that I have not met a need for it, but I cannot prove that it does not exist. I fight problems that I meet in the field, not theoretical possibilities. ;) > As a potential example, should you have > to bring in an entire GUI framework just to implement the Unix utility > "sort" except with an additional locale option? That seems silly to > me. No. GUI sorts if there is a GUI, and printing is part of the GUI (if it really deserves to be named GUI, that is). If it goes elsewhere then it is not a GUI, so why should I sort without a user to see it? As for GUI, I am optimistic there. GUI sorts based on the things it uses. For example: bool QString::operator< ( const QString & other ) const {} In case of a theoretical failure on a particular case/platform/locale I would get a defect report, could forward the bug to Nokia and meanwhile write some custom operator to be used instead: bool hack::broken_platform_name_here::less( const QString & one, const QString & another); In practice however it seems to work, or is classified as a cosmetic or minor problem. Such problems do not affect success.
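The layering argument in this post (bytes at the interface, wide strings inside, and the compiler complaining when a conversion is forgotten) can be sketched as follows. The function `wire_to_app` is hypothetical and deliberately handles 7-bit ASCII only, on the assumption that the wide execution character set is ASCII-compatible; as the post says, a real converter would be written per platform.

```cpp
#include <stdexcept>
#include <string>
#include <type_traits>

// The two string types do not convert into each other implicitly, so a
// forgotten conversion between layers is a compile error, not a silent bug.
static_assert(!std::is_convertible<std::string, std::wstring>::value,
              "byte strings must not silently become wide strings");
static_assert(!std::is_convertible<std::wstring, std::string>::value,
              "wide strings must not silently become byte strings");

// Hypothetical boundary function between the interface layer (bytes) and the
// application layer (wide strings). Accepts 7-bit ASCII only and rejects
// everything else rather than guessing at an encoding.
inline std::wstring wire_to_app(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        if (b > 0x7F) {
            throw std::invalid_argument("non-ASCII byte at layer boundary");
        }
        out.push_back(static_cast<wchar_t>(b));
    }
    return out;
}
```

Note that the type check guards whole strings; a single `char` still converts implicitly to `wchar_t`, so that case relies on compiler warnings rather than errors.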
From: Pete Delgado on 20 May 2010 00:13
"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message news:5O2dnS2UptANt2nWnZ2dnUVZ_rqdnZ2d(a)giganews.com... > Here are the actual results from the working prototype of my original DFA > based glyph recognition engine. > http://www.ocr4screen.com/Unique.html > The new algorithm is much better than this. The salient points that you fail to mention are that the alternative solutions can perform OCR on *any* font while your implementation requires the customer to tell the OCR system which font (including all specifics such as point size) is being used. In addition, the other systems can perform when the font is not consistent in the document or if different font weights are used; your implementation cannot and will fail miserably. All in all, very misleading. PS: The information used in my critique of your OCR system was obtained by looking at your prior posts as well as your patent and is not merely conjecture. -Pete