Designing a Finite State Machine DFA Recognizer for UTF-8 [MFC]

Prev: Designing a Finite State Machine DFA Recognizer for UTF-8
Next: Simple Valication Check... Question

From: Öö Tiib on 19 May 2010 14:20

On May 19, 9:08 pm, Hector Santos <sant9...(a)gmail.com> wrote:
> You are too SNEAKY of a person. You
> are not honest and you are the type of person that will steal from
> others and claim it your own.

But he uses a legal way, doesn't he? He does not steal. He asks nicely
and people tell him. "Ask, and it will be given to you": all by the
book, the words of Jesus himself.

From: Hector Santos on 19 May 2010 15:15

On May 19, 2:42 pm, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote in message
>
> news:P6idnX4azPfvs2nWnZ2dnUVZ_oudnZ2d(a)giganews.com...
>
> >> Whilst what you say is technically correct I try to avoid writing code
> >> which does not check against an end iterator when iterating over a
> >> sequence, just personal preference (due to a slight concern re safety).
> >> We are probably only talking about an extra CPU instruction or two to
> >> check for end of sequence in the main loop along with the O(1) check of
> >> the final state when the main loop is exited. Your solution would also
> >> require making a copy of the input sequence to allow appending of the
> >> sentinel unless you consider mutating input parameters to be OK. My
>
> > The main purpose of this is to read in a file of UTF-8 to be converted to
> > UTF-32. I don't have to mutate the input at all, the user must know to
> > append the 0xFF byte.
>
> Are you for real? That sounds like a really stupid idea.
>

Keep in mind, with Peter, worthiness and value have no meaning.

If it doesn't exist, then he believes he can file a patent on it.
It only costs $200 or less to file a one-page Provisional Patent, which
gives you one year to complete the full patent. During that time,
Patent Trolls will test the market to see how strong a footing their IP
claims have, both technically and in the marketplace.

That is what Patent Trolls do, and he has admitted as much in the
archives: he does not need to have any real work done beyond
documenting the idea as reasonably sound. Even if he did it with
some simple coding, his belief is that no deviation from this simple
coding can escape the overall idea, hence it doesn't matter if
you have another working and more complete method.

That is exactly what's going on here, and once again he cleverly
disguises it as an innocent question post, but the sicko eventually
reveals his true intent one way or another.

It doesn't matter if it has any value to anyone.

--
HLS

From: Paul Bibbings on 19 May 2010 17:01

On May 19, 9:44 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> On 5/19/2010 2:51 PM, Leigh Johnston wrote:
>
> > "Peter Olcott" <NoS...(a)OCR4Screen.com> wrote in message
> >news:W_ydnU32Ptzn3WnWnZ2dnUVZ_qgAAAAA(a)giganews.com...
>
> >> Again you forget the primary purpose of this whole line-of-reasoning.
> >> The goal is to show that it is not possible to construct a faster
> >> lexer than the one based on a state transition matrix.
>
> > This contradicts what you said earlier, i.e.:
>
> >>>>> The main purpose of this is to read in a file of UTF-8 to be converted
> >>>>> to UTF-32. I don't have to mutate the input at all, the user must know
> >>>>> to append the 0xFF byte.
>
> > What is the difference between "primary purpose" and "main purpose"?
>
> > I give up, your replies are too troll-like whether intentionally or not..
>
> > /Leigh
>
> The purpose of the software is to validate UTF-8 and translate it into
> UTF-32 as fast as possible.
>
> The purpose of this thread is to show that alternatives to state
> transition matrix based solutions to this problem must be slower.

Think. Anyone. Before you press another key, think.

Let's clear this one out, and let's do it today.

From: James Kanze on 20 May 2010 12:11

On May 19, 7:31 pm, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> On 5/19/2010 1:00 PM, Leigh Johnston wrote:

[...]
> The main purpose of this is to read in a file of UTF-8 to be converted
> to UTF-32. I don't have to mutate the input at all, the user must know
> to append the 0xFF byte.

In the file?

[...]
> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
> the data is corrupted.

Am I the only one who senses a problem here? If you're reading
from an external source (a file), then you have to assume that
the file might contain anything; people do pass in the wrong
filename, and your program has to handle that gracefully
(error message, etc.).
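
To make the point about untrusted input concrete, here is a minimal
sketch of a validator that checks against the end of the buffer on
every step instead of relying on an appended 0xFF sentinel. It is not
the table-driven DFA discussed in this thread, only a plain
branch-based check; the name isValidUtf8 and the std::string interface
are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true only if `s` is well-formed UTF-8.
// Rejects stray bytes such as 0xFF, truncated sequences, overlong
// encodings, surrogates, and values above U+10FFFF.
inline bool isValidUtf8(const std::string& s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const unsigned char* end = p + s.size();

    while (p != end) {
        unsigned char b = *p++;
        if (b < 0x80) continue;                 // single-byte (ASCII)

        std::size_t need;                       // continuation bytes expected
        char32_t cp;                            // code point being assembled
        char32_t min;                           // smallest value for this length
        if ((b & 0xE0) == 0xC0)      { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return false;                      // continuation byte or 0xF8..0xFF

        // End-of-input check: a truncated sequence is an error, not a crash.
        if (static_cast<std::size_t>(end - p) < need) return false;

        for (; need != 0; --need) {
            unsigned char c = *p++;
            if ((c & 0xC0) != 0x80) return false;   // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
    }
    return true;
}
```

A caller reading a file would run this (or an equivalent check) once on
input and report an error if it returns false, so a wrong filename or a
corrupted file is handled gracefully.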

Of course, this should be done on input. Internally, if you
continue to use UTF-8 (rather than converting to UTF-16 or
UTF-32), you can assume correct UTF-8. But in that case, the
fastest way to advance to the next character is almost certainly
'p += byteCount(*p)', rather than a DFA; if you assume correct
UTF-8, there's no need to look at each byte of a character.
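
A minimal sketch of what 'p += byteCount(*p)' might look like follows.
The post does not define byteCount, so the body below is an assumption:
it derives the sequence length from the lead byte alone. The
countCodePoints function is likewise hypothetical and only shows the
stepping loop. Both assume the text has already been validated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the sequence that starts with
// lead byte `b`. Returns 0 for a byte that cannot start a sequence.
// It does not reject 0xC0/0xC1 or other invalid leads; validation is
// assumed to have happened earlier, on input.
inline std::size_t byteCount(unsigned char b)
{
    if (b < 0x80) return 1;            // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                          // continuation byte or 0xF8..0xFF
}

// Counts code points by jumping from lead byte to lead byte.
// Precondition: `s` holds valid UTF-8 (the assert documents this).
inline std::size_t countCodePoints(const std::string& s)
{
    std::size_t count = 0;
    const char* p = s.data();
    const char* end = p + s.size();
    while (p != end) {
        std::size_t n = byteCount(static_cast<unsigned char>(*p));
        assert(n != 0 && n <= static_cast<std::size_t>(end - p));
        p += n;   // skip the whole sequence without reading its tail bytes
        ++count;
    }
    return count;
}
```

The loop touches only one byte per character, which is the reason it can
beat a per-byte DFA once the input is known to be well formed.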

--
James Kanze

From: James Kanze on 20 May 2010 12:14

On May 19, 7:42 pm, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
> Firstly my requirement is for conversion to UTF-16 not UTF-32.

Just curious, but wouldn't the simplest way to do this be to
convert to UTF-32, then check whether you need surrogates or
not?
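
As a rough sketch of that second step, the function below takes one
already-decoded UTF-32 code point and appends it to a UTF-16 string,
splitting it into a surrogate pair when it lies above U+FFFF. The name
appendUtf16 is made up for illustration, and the input is assumed to be
a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append code point `cp` to `out` as UTF-16.
// Precondition: cp is a Unicode scalar value (not a surrogate,
// not above U+10FFFF).
inline void appendUtf16(std::u16string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit, no surrogates needed.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary plane: 20 bits split across a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low
    }
}
```

For example, U+20AC stays a single unit, while U+1F600 becomes the pair
0xD83D 0xDE00.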

--
James Kanze
