New utf8string design may make UTF-8 the superior encoding [MFC]

From: Joseph M. Newcomer on 19 May 2010 15:19

See below...
On Wed, 19 May 2010 10:14:48 -0500, Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:

>On 5/19/2010 5:39 AM, James Kanze wrote:
>> On May 18, 8:17 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>> On 5/18/2010 9:34 AM, James Kanze wrote:
>>>> On 17 May, 14:08, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>>> On 5/17/2010 1:35 AM, Mihai N. wrote:
>>
>>>>> a regular expression implemented as a finite state machine
>>>>> is the fastest and simplest possible way of every way that
>>>>> can possibly exist to validate a UTF-8 sequence and divide
>>>>> it into its constituent parts.
>>
>>>> It all depends on the formal specification; one of the
>>>> characteristics of UTF-8 is that you don't have to look at
>>>> every character to find the length of a sequence. And
>>>> a regular expression generally will have to look at every
>>>> character.
>>
>>> Validation and translation to UTF-32 concurrently can not be
>>> done faster than a DFA recognizer, in fact it must always be
>>> slower.
>>
>> UTF-8 was designed intentionally in a way that it doesn't
>> require a complete DFA to handle, but can be handled faster.
>> Complete DFAs are usually slower than calculations on modern
>> processors, since they require memory accesses, and memory is
>> often the limiting factor.
>>
>> In fact, there is no "must always be slower". There are too
>> many variables involved to be able to make such statements.
>>
>> --
>> James Kanze
>
>This is the essence of my optimal design, try and show one that is
>faster for matching 50 keywords.
****
How do you mean "recognize"? If you are talking about reserved words like "if" or "else",
then you need a perfect hash algorithm in addition.
****
>
>Looking up an ActionCode switch statement value based on a
>state_transition_matrix that is indexed by current_state and
>current_input_byte.
>
>unsigned char States[8][256];
>
>The above is the state transition matrix for validating and translating
>UTF-8 to UTF-32.
>
>The above design completely proves my point to everyone with sufficient
>knowledge of UTF-8 and DFA state transition matrices.
>
>I will also add that I only have twelve ActionCodes including
>InvalidByteError and the OutOfdata sentinel.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
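
[Editorial sketch] The table-driven design described above (a matrix indexed by current state and current input byte) can be illustrated roughly as follows. This is not the poster's actual table or ActionCode set, neither of which appears in the thread. The eight states, the 0xFF error marker, and the names in the utf8_sketch namespace are assumptions made for illustration, and the code assumes C++17 for std::optional. The table is filled from the well-formed byte ranges of UTF-8, so overlong forms and surrogates are rejected.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string>

namespace utf8_sketch {

// Hypothetical error marker stored in the table (not from the thread).
constexpr unsigned char kError = 0xFF;

// Same shape as "unsigned char States[8][256]": next state by (state, byte).
using Table = std::array<std::array<unsigned char, 256>, 8>;

inline void fill(Table& t, int state, int lo, int hi, unsigned char next) {
    assert(0 <= lo && lo <= hi && hi <= 0xFF);
    for (int b = lo; b <= hi; ++b) t[state][b] = next;
}

// States: 0 = start/accept, 1..3 = that many continuation bytes still due,
// 4 = after E0, 5 = after ED, 6 = after F0, 7 = after F4 (restricted ranges).
inline Table build_table() {
    Table t;
    for (auto& row : t) row.fill(kError);
    fill(t, 0, 0x00, 0x7F, 0);   // ASCII
    fill(t, 0, 0xC2, 0xDF, 1);   // 2-byte lead (C0, C1 would be overlong)
    fill(t, 0, 0xE0, 0xE0, 4);
    fill(t, 0, 0xE1, 0xEC, 2);
    fill(t, 0, 0xED, 0xED, 5);
    fill(t, 0, 0xEE, 0xEF, 2);
    fill(t, 0, 0xF0, 0xF0, 6);
    fill(t, 0, 0xF1, 0xF3, 3);
    fill(t, 0, 0xF4, 0xF4, 7);
    fill(t, 1, 0x80, 0xBF, 0);
    fill(t, 2, 0x80, 0xBF, 1);
    fill(t, 3, 0x80, 0xBF, 2);
    fill(t, 4, 0xA0, 0xBF, 1);   // excludes overlong 3-byte forms
    fill(t, 5, 0x80, 0x9F, 1);   // excludes surrogates U+D800..U+DFFF
    fill(t, 6, 0x90, 0xBF, 2);   // excludes overlong 4-byte forms
    fill(t, 7, 0x80, 0x8F, 2);   // excludes values above U+10FFFF
    return t;
}

// Validates UTF-8 and translates it to UTF-32 in one pass.
// Returns std::nullopt on an invalid byte or a truncated sequence.
inline std::optional<std::u32string> decode(const std::string& in) {
    static const Table table = build_table();
    std::u32string out;
    unsigned char state = 0;
    char32_t cp = 0;
    for (unsigned char b : in) {
        const unsigned char next = table[state][b];
        if (next == kError) return std::nullopt;
        if (state == 0) {
            // Lead byte: keep only its payload bits.
            cp = b < 0x80 ? b : b < 0xE0 ? (b & 0x1F)
               : b < 0xF0 ? (b & 0x0F) : (b & 0x07);
        } else {
            // Continuation byte: append its low six bits.
            cp = (cp << 6) | (b & 0x3F);
        }
        if (next == 0) out.push_back(cp);
        state = next;
    }
    if (state != 0) return std::nullopt;  // input ended mid-sequence
    return out;
}

}  // namespace utf8_sketch
```

Whether a lookup table like this beats branch-based arithmetic on a given processor is exactly the point in dispute in this thread; the sketch only shows the technique and makes no speed claim.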

From: Joshua Maurice on 19 May 2010 17:02

On May 19, 1:50 am, Öö Tiib <oot...(a)hot.ee> wrote:
> On May 19, 8:24 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote:
>
> > > I perhaps have too low experience with sophisticated text processing.
> > > Simple std::sort(), wide char literals of C++ and boost::wformat plus
> > > full set of conversion functions is all i need really.
>
> > It depends a lot what you need.
>
> > Sorting is locale-sensitive (German, Swedish, French, Spanish, all
> > have different sorting rules).
> > The CRT (and STL, and boost) are pretty dumb when dealing with things
> > in a locale sensitive way (meaning that they usually don't :-)
>
> Yes, sorting in real alphabetic order for user is perhaps business of
> GUI. GUI has to display it. GUI however usually has its WxStrings or
> FooStrings anyway. I hate when someone leaks these weirdos to
> application mechanics layer. Internal application logic is often best
> made totally locale-agnostic and not caring about positioning in GUI
> and if the end-users write from up to down or from right to left.
>
> So text in electronic interfaces layer are bytes, text in application
> layer are wchar_t and text in user interface layer are whatever weirdo
> rules there. If maintainer forgets to convert in interface between
> layers he gets compiler warnings or errors. That makes life easy, but
> i suspect my problems with texts are more trivial than these of some
> others.

First, as I mentioned in the other current thread on Unicode, please
stop saying "wchar_t" and "wstring" as though that means something, or
is at all a useful portable tool. wchar_t is 16 bits on windows, and
32 bits on most Unix-like systems IIRC. (Yes, the other thread listed
some more exceptions.) So, either you're suggesting an entirely not
portable solution with wstring, or you are suggesting that it makes
sense to use UTF32 on Unix-like computers and UTF16 on windows
computers, a quite silly statement.

Then, locales in my experience have not been terribly portable, not
portable enough for my company's product which runs on nearly all
computer OSs known to man, including windows, win x64, the soon to be
"desupported by windows" windows itanium, Linux, z Linux, OS 2, HPUX
IPF, and so on. Moreover, it's not terribly practical to tell our
customers "you have to install these 'x' locales". Moreover, the
locales of the same name on different OSs have been known to have
subtly different behavior.

Finally, I can't think of a useful example off the top of my head
where sorting based on locale would be required except when
"printing", to the screen, file, etc., but this doesn't convince me
that there is no use for it. As a potential example, should you have
to bring in an entire GUI framework just to implement the Unix utility
"sort" except with an additional locale option? That seems silly to
me.
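
[Editorial sketch] The "sort with an additional locale option" example can be approximated in standard C++ without any GUI framework, since a std::locale object is itself usable as a string comparator. The function names below are hypothetical. The fallback to the classic "C" locale is an assumption made for illustration; it reflects the concern above that a named locale may simply not be installed, and it does nothing about locales of the same name behaving differently across systems.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: look up a locale by name, falling back to the
// classic "C" locale when the runtime does not have it installed.
inline std::locale make_locale_or_classic(const std::string& name) {
    try {
        return std::locale(name.c_str());
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Sorts lines using the collation rules of the given locale.
// std::locale::operator() compares two strings via the collate facet.
inline std::vector<std::string> sort_with_locale(std::vector<std::string> lines,
                                                 const std::locale& loc) {
    std::sort(lines.begin(), lines.end(), loc);
    return lines;
}
```

A command-line tool would pass the value of its locale option to make_locale_or_classic and then write the sorted lines to its output; which locale names are accepted is up to the platform.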

From: Joseph M. Newcomer on 19 May 2010 18:19

See below...
On Wed, 19 May 2010 14:02:28 -0700 (PDT), Joshua Maurice <joshuamaurice(a)gmail.com> wrote:

>On May 19, 1:50 am, Öö Tiib <oot...(a)hot.ee> wrote:
>> On May 19, 8:24 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote:
>>
>> > > I perhaps have too low experience with sophisticated text processing.
>> > > Simple std::sort(), wide char literals of C++ and boost::wformat plus
>> > > full set of conversion functions is all i need really.
>>
>> > It depends a lot what you need.
>>
>> > Sorting is locale-sensitive (German, Swedish, French, Spanish, all
>> > have different sorting rules).
>> > The CRT (and STL, and boost) are pretty dumb when dealing with things
>> > in a locale sensitive way (meaning that they usually don't :-)
>>
>> Yes, sorting in real alphabetic order for user is perhaps business of
>> GUI. GUI has to display it. GUI however usually has its WxStrings or
>> FooStrings anyway. I hate when someone leaks these weirdos to
>> application mechanics layer. Internal application logic is often best
>> made totally locale-agnostic and not caring about positioning in GUI
>> and if the end-users write from up to down or from right to left.
>>
>> So text in electronic interfaces layer are bytes, text in application
>> layer are wchar_t and text in user interface layer are whatever weirdo
>> rules there. If maintainer forgets to convert in interface between
>> layers he gets compiler warnings or errors. That makes life easy, but
>> i suspect my problems with texts are more trivial than these of some
>> others.
>
>First, as I mentioned in the other current thread on Unicode, please
>stop saying "wchar_t" and "wstring" as though that means something, or
>is at all a useful portable tool. wchar_t is 16 bits on windows, and
>32 bits on most Unix-like systems IIRC. (Yes, the other thread listed
>some more exceptions.) So, either you're suggesting an entirely not
>portable solution with wstring, or you are suggesting that it makes
>sense to use UTF32 on Unix-like computers and UTF16 on windows
>computers, a quite silly statement.
****
wchar_t is whatever the implementor of the compiler wants it to be. The Microsoft C
compiler implements it as a 16-bit value, but with the advent of the extended ranges of
Unicode, 32-bit makes a lot more sense. If your compiler defines wchar_t as 16 bits, then
it implies UTF-16 encoding, meaning surrogates are required for characters above U+FFFF. Code
that was written assuming UTF-32 encoding would not be portable down to UTF-16 encoding.

But this doesn't change the fact that sorting is not a part of the GUI, but an abstract
concept based on localized conventions.
****
>
>Then, locales in my experience have not been terribly portable, not
>portable enough for my company's product which runs on nearly all
>computer OSs known to man, including windows, win x64, the soon to be
>"desupported by windows" windows itanium, Linux, z Linux, OS 2, HPUX
>IPF, and so on. Moreover, it's not terribly practical to tell our
>customers "you have to install these 'x' locales". Moreover, the
>locales of the same name on different OSs have been known to have
>subtly different behavior.
****
Well, the locale names are supposed to be the ISO standard string designators; if the
runtime does not support it, that constitutes a "bug".

I was never able to find any code that was truly "portable" on Unix systems because of the
plethora of C compilers and runtimes that existed (in fact, I believe that code
portability is largely a myth which the C programmers like to propagate; fifteen years of
trying to either write code that would port or trying to port allegedly "portable" code
convinced me that this is substantially more difficult than anyone suspects).

Character encoding, as I tell my students, is less than 10% of the problem of portability.
You have to worry about collating sequences, date formats, time formats, etc. And neither
date formats nor time formats have simple solutions (e.g., in Norway, they use a 12-hour
clock, unless the application is going to deal with mass transit in which case, by law, it
must use a 24-hour clock; in Japan, contracts are not legal if they use the "Western"
dates such as 23-Jan-10 or 10/23/10 or 23/10/2010; instead, they must use the date based
on the nth year, kth month, mth day of the reign of the emperor <name here>). Localization
is not a trivial problem, and I've had interesting issues arise even when I have tried to
be extremely careful of the problems.
****
>
>Finally, I can't think of a useful example off the top of my head
>where sorting based on locale would be required except when
>"printing", to the screen, file, etc., but this doesn't convince me
>that there is no use for it. As a potential example, should you have
>to bring in an entire GUI framework just to implement the Unix utility
>"sort" except with an additional locale option? That seems silly to
>me.
****
You don't need a GUI framework to sort. At least not in Windows; the CompareString API
does it by returning a code to indicate the relative ordering of the two strings being
compared. It takes a locale specifier (LCID). My major applications that required
localized sorting (à la "locale") had no GUI at all, only output files or output to the
printer. And there was no GUI framework at all in them, because one ran on the DECSystem
10 under the TOPS-10 operating system, and the other was written to run on MS-DOS. All of
this preceded the notion of "locale" in the CRT or OS. The DECSystem-10 project was
written in the SAIL language, which used only the ASCII-7 character set. In principle,
you should have something like -l<LCID HERE> option to a utility program, or the name of
the locale in the ISO standard notation. I have no idea how the idea of "localized
sorting requires a GUI" even arose!
joe
****
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
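
[Editorial sketch] The remark above about surrogates being required above U+FFFF when wchar_t is 16 bits can be made concrete. The function below encodes one Unicode code point as UTF-16 code units; its name is hypothetical, and it uses char16_t rather than wchar_t so that the result does not depend on the compiler's choice of wchar_t width.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as one or two UTF-16 code units.
// Precondition: cp is a valid scalar value (not a surrogate, <= U+10FFFF).
inline std::u16string to_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp <= 0xFFFF) {
        // Basic Multilingual Plane: a single 16-bit unit suffices.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into two halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Code that assumes one array element per character holds for the first branch only, which is why code written for 32-bit wchar_t does not carry over unchanged to a 16-bit wchar_t.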

From: Öö Tiib on 19 May 2010 18:33

On May 20, 12:02 am, Joshua Maurice <joshuamaur...(a)gmail.com> wrote:
> On May 19, 1:50 am, Öö Tiib <oot...(a)hot.ee> wrote:
> > On May 19, 8:24 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote:
>
> > > > I perhaps have too low experience with sophisticated text processing.
> > > > Simple std::sort(), wide char literals of C++ and boost::wformat plus
> > > > full set of conversion functions is all i need really.
>
> > > It depends a lot what you need.
>
> > > Sorting is locale-sensitive (German, Swedish, French, Spanish, all
> > > have different sorting rules).
> > > The CRT (and STL, and boost) are pretty dumb when dealing with things
> > > in a locale sensitive way (meaning that they usually don't :-)
>
> > Yes, sorting in real alphabetic order for user is perhaps business of
> > GUI. GUI has to display it. GUI however usually has its WxStrings or
> > FooStrings anyway. I hate when someone leaks these weirdos to
> > application mechanics layer. Internal application logic is often best
> > made totally locale-agnostic and not caring about positioning in GUI
> > and if the end-users write from up to down or from right to left.
>
> > So text in electronic interfaces layer are bytes, text in application
> > layer are wchar_t and text in user interface layer are whatever weirdo
> > rules there. If maintainer forgets to convert in interface between
> > layers he gets compiler warnings or errors. That makes life easy, but
> > i suspect my problems with texts are more trivial than these of some
> > others.
>
> First, as I mentioned in the other current thread on Unicode, please
> stop saying "wchar_t" and "wstring" as though that means something, or
> is at all a useful portable tool. wchar_t is 16 bits on windows, and
> 32 bits on most Unix-like systems IIRC. (Yes, the other thread listed
> some more exceptions.) So, either you're suggesting an entirely not
> portable solution with wstring, or you are suggesting that it makes
> sense to use UTF32 on Unix-like computers and UTF16 on windows
> computers, a quite silly statement.

Now ... seems that there is strange misunderstanding. For anyone
converting between whatever char sequence to whatever wchar_t sequence
it is a highly platform-dependent operation anyway. I have in no way said
that such operations are portable. Since wstring is used for
internally holding texts the sizeof(wchar_t) is not affecting
anything. The major property of wchar_t for me is that it is different
from char on all platforms i know and so i get warnings/errors from
tools on attempts to mechanically assign one to other.
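
[Editorial sketch] The type-distinction argument above can be checked at compile time. The static_asserts below state that char and wchar_t are different types and that a std::string cannot be assigned to a std::wstring. The widen_ascii function is a hypothetical stand-in for a platform-specific converter at the layer boundary and handles ASCII only. Note that a single char still converts implicitly to wchar_t, so the protection applies to strings and pointers rather than to individual characters.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

static_assert(!std::is_same<char, wchar_t>::value,
              "char and wchar_t are distinct types");
static_assert(!std::is_assignable<std::wstring&, const std::string&>::value,
              "a byte string cannot be assigned to a wide string by accident");

// Hypothetical boundary converter: ASCII bytes to wide characters.
// A real converter would be platform-specific and handle full encodings.
inline std::wstring widen_ascii(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        assert(b < 0x80);  // this sketch accepts ASCII input only
        out.push_back(static_cast<wchar_t>(b));
    }
    return out;
}
```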

> Then, locales in my experience have not been terribly portable, not
> portable enough for my company's product which runs on nearly all
> computer OSs known to man, including windows, win x64, the soon to be
> "desupported by windows" windows itanium, Linux, z Linux, OS 2, HPUX
> IPF, and so on.

You managed to somehow have portability in string-to-string
conversions? Congrats. I have abandoned all hope there. Different code
is used for conversions platform-by-platform. The platform makers (and
not only) seemingly fight with each other to make their data
incompatible so why should i hope there will be peace and portability
any day? Is there something new? Same goes on with dates, values with
measurement units and even plain floating point numbers ... only name
it. Plain text is nothing different.

> Moreover, it's not terribly practical to tell our
> customers "you have to install these 'x' locales". Moreover, the
> locales of the same name on different OSs have been known to have
> subtly different behavior.

Exactly! So portability and localization is possible only by having
converter for each platform that does know the quirks of platform. If
sizeof(wchar_t) is 2 or 4 does not matter at all since code that
produces it is anyway different.

> Finally, I can't think of a useful example off the top of my head
> where sorting based on locale would be required except when
> "printing", to the screen, file, etc., but this doesn't convince me
> that there is no use for it.

No need to nail me. I only confirm that i have not met a need for it,
but i can not prove that it does not exist. I fight problems that i
meet on field, not theoretical possibilities. ;)

> As a potential example, should you have
> to bring in an entire GUI framework just to implement the Unix utility
> "sort" except with an additional locale option? That seems silly to
> me.

No. GUI sorts if there is GUI and printing is part of GUI (if it
really deserves to be named GUI that is). If it goes elsewhere then it
is not a GUI and so why should i sort without user to see it? As for
GUI I am optimistic there. GUI sorts based on the things it uses. For
example:

bool QString::operator< ( const QString & other ) const {}

In theoretical failure on particular case/platform/locale i would get
defect report, can forward a bug to Nokia and meanwhile write some
custom operator to be used instead:

bool hack::broken_platform_name_here::less( const QString & one,
const QString & another);

In practice however it seems to work or is classified cosmetic or
minor problem. Such do not affect success.

From: Pete Delgado on 20 May 2010 00:13

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:5O2dnS2UptANt2nWnZ2dnUVZ_rqdnZ2d(a)giganews.com...
> Here are the actual results from the working prototype of my original DFA
> based glyph recognition engine.
> http://www.ocr4screen.com/Unique.html
> The new algorithm is much better than this.

The salient point that you fail to mention is that the alternative
solutions can perform OCR on *any* font, while your implementation requires
the customer to tell the OCR system which font (including all specifics such
as point size) is being used. In addition, the other systems can perform
when the font is not consistent in the document or when different font
weights are used; your implementation cannot and will fail miserably.

All in all, very misleading.

PS: The information used in my critique of your OCR system was obtained by
looking at your prior posts as well as your patent and is not merely
conjecture.

-Pete
