From: Peter Olcott on
State 0
00-7F ASCII
C2-DF goto State 1 // Two Byte
E0-EF goto State 2 // Three Byte
F0-F4 goto State 4 // Four Byte
else Error
State 1
80-BF
else Error
State 2
80-BF goto State 3
else Error
State 3
80-BF
else Error
State 4
80-BF goto State 5
else Error
State 5
80-BF goto State 6
else Error
State 6
80-BF
else Error
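
As a rough sketch of how the transitions listed above could be walked in C++ (C++11 or later), the function below maps a state and an input byte to the next state. The names NextState and IsStructurallyValidUtf8 are made up for illustration and are not from the original post. It assumes a four-byte sequence is one lead byte plus exactly three continuation bytes, so the four-byte chain uses states 4 through 6. It checks only the byte ranges given in the listing, so it does not reject overlong forms such as E0 80 80, surrogates such as ED A0 80, or values above U+10FFFF such as F4 90 80 80.

```cpp
#include <cstdint>
#include <vector>

// Returns the next state for (state, byte), or -1 for an "else Error" row.
// States 0-3 follow the listing; states 4-6 consume the three continuation
// bytes of a four-byte sequence (an assumption stated in the lead-in).
int NextState(int state, std::uint8_t b) {
    const bool cont = (b >= 0x80 && b <= 0xBF);  // continuation byte range
    switch (state) {
    case 0:
        if (b <= 0x7F) return 0;               // ASCII
        if (b >= 0xC2 && b <= 0xDF) return 1;  // two-byte lead
        if (b >= 0xE0 && b <= 0xEF) return 2;  // three-byte lead
        if (b >= 0xF0 && b <= 0xF4) return 4;  // four-byte lead
        return -1;
    case 1: return cont ? 0 : -1;  // last byte of a two-byte sequence
    case 2: return cont ? 3 : -1;
    case 3: return cont ? 0 : -1;  // last byte of a three-byte sequence
    case 4: return cont ? 5 : -1;
    case 5: return cont ? 6 : -1;
    case 6: return cont ? 0 : -1;  // last byte of a four-byte sequence
    default: return -1;
    }
}

// True when every byte is accepted and the input does not end mid-sequence.
bool IsStructurallyValidUtf8(const std::vector<std::uint8_t>& data) {
    int state = 0;
    for (std::uint8_t b : data) {
        state = NextState(state, b);
        if (state < 0) return false;
    }
    return state == 0;
}
```

This version uses an explicit length rather than a sentinel, which sidesteps the question of which byte value marks the end of the data.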

// Holds ActionCodes, indexed by CurrentState and Data[N]
uint8 States[8][256];

// This is the input data to be transformed
std::vector<uint8> Data; // last byte holds the sentinel value 11

Twelve ActionCodes
00 InvalidByteError
01 FirstByteOfOneByte
02 FirstByteOfTwoBytes
03 FirstByteOfThreeBytes
04 FirstByteOfFourBytes
05 SecondByteOfTwoBytes
06 SecondByteOfThreeBytes
07 SecondByteOfFourBytes
08 ThirdByteOfThreeBytes
09 ThirdByteOfFourBytes
10 FourthByteOfFourBytes
11 OutOfData (Sentinel)


From: Leigh Johnston on
"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:5PednSUdPpMbl2nWnZ2dnUVZ_j6dnZ2d(a)giganews.com...
> // This is the input data to be transformed
> std::vector<uint8> Data; // LastByte hold sentinel value 11
>

What if a character with the same ASCII value as your sentinel is present in
the input data?

Really this UTF-8 / specific algorithm spam has little to do with the C++
language which is what this newsgroup is about: try posting to
comp.programming instead.

/Leigh

From: Peter Olcott on
On 5/19/2010 11:11 AM, Leigh Johnston wrote:
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:5PednSUdPpMbl2nWnZ2dnUVZ_j6dnZ2d(a)giganews.com...
>> // This is the input data to be transformed
>> std::vector<uint8> Data; // LastByte hold sentinel value 11
>>
>
> What if a character with the same ASCII value as your sentinel is
> present in the input data?
>
> Really this UTF-8 / specific algorithm spam has little to do with the
> C++ language which is what this newsgroup is about: try posting to
> comp.programming instead.
>
> /Leigh

The state transition table holds only values from zero to eleven.
These values are ActionCodes that drive a C++ switch statement. The
input data is used only as the second subscript into the
States[8][256] table; the first subscript is the CurrentState.

CurrentState is initialized to zero and changed (as needed) in each
case of the C++ switch statement.

From: Peter Olcott on
On 5/19/2010 11:11 AM, Leigh Johnston wrote:
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:5PednSUdPpMbl2nWnZ2dnUVZ_j6dnZ2d(a)giganews.com...
>> // This is the input data to be transformed
>> std::vector<uint8> Data; // LastByte hold sentinel value 11
>>
>
> What if a character with the same ASCII value as your sentinel is
> present in the input data?

Make 0x0 or 0xFF the sentinel value, and remove it from the Error states.
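
A minimal sketch of that suggestion, assuming 0xFF is chosen (it can never occur in well-formed UTF-8, whereas 0x00 is a legitimate character). The ActionCode numbering follows the list posted earlier in the thread; kSentinel, Fill, BuildTable and Decode are hypothetical names, and the table is indexed as States[CurrentState][byte]. Only state 0 maps 0xFF to OutOfData, so a sentinel that arrives in the middle of a multi-byte sequence is still an error. As with the state listing, only byte ranges are checked (C++11 or later).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ActionCodes 0-11, in the order given earlier in the thread.
enum ActionCode : std::uint8_t {
    InvalidByteError, FirstByteOfOneByte, FirstByteOfTwoBytes,
    FirstByteOfThreeBytes, FirstByteOfFourBytes, SecondByteOfTwoBytes,
    SecondByteOfThreeBytes, SecondByteOfFourBytes, ThirdByteOfThreeBytes,
    ThirdByteOfFourBytes, FourthByteOfFourBytes, OutOfData
};

const std::uint8_t kSentinel = 0xFF;  // hypothetical name for the end marker

std::uint8_t States[8][256];  // static storage: starts as InvalidByteError

void Fill(int state, int lo, int hi, ActionCode action) {
    for (int b = lo; b <= hi; ++b) States[state][b] = action;
}

void BuildTable() {
    Fill(0, 0x00, 0x7F, FirstByteOfOneByte);
    Fill(0, 0xC2, 0xDF, FirstByteOfTwoBytes);
    Fill(0, 0xE0, 0xEF, FirstByteOfThreeBytes);
    Fill(0, 0xF0, 0xF4, FirstByteOfFourBytes);
    Fill(0, kSentinel, kSentinel, OutOfData);  // removed from the error set
    Fill(1, 0x80, 0xBF, SecondByteOfTwoBytes);
    Fill(2, 0x80, 0xBF, SecondByteOfThreeBytes);
    Fill(3, 0x80, 0xBF, ThirdByteOfThreeBytes);
    Fill(4, 0x80, 0xBF, SecondByteOfFourBytes);
    Fill(5, 0x80, 0xBF, ThirdByteOfFourBytes);
    Fill(6, 0x80, 0xBF, FourthByteOfFourBytes);
}

// Decodes data into code points. Precondition: BuildTable() has run and the
// last element of data is kSentinel, so the loop needs no bounds check.
// Returns false on InvalidByteError or on a sentinel before the last byte.
bool Decode(const std::vector<std::uint8_t>& data,
            std::vector<std::uint32_t>& out) {
    int state = 0;
    std::uint32_t cp = 0;
    for (std::size_t n = 0; ; ++n) {
        const std::uint8_t b = data[n];
        switch (States[state][b]) {
        case FirstByteOfOneByte:    out.push_back(b); break;
        case FirstByteOfTwoBytes:   cp = b & 0x1F; state = 1; break;
        case FirstByteOfThreeBytes: cp = b & 0x0F; state = 2; break;
        case FirstByteOfFourBytes:  cp = b & 0x07; state = 4; break;
        case SecondByteOfThreeBytes:
            cp = (cp << 6) | (b & 0x3F); state = 3; break;
        case SecondByteOfFourBytes:
            cp = (cp << 6) | (b & 0x3F); state = 5; break;
        case ThirdByteOfFourBytes:
            cp = (cp << 6) | (b & 0x3F); state = 6; break;
        case SecondByteOfTwoBytes:
        case ThirdByteOfThreeBytes:
        case FourthByteOfFourBytes:  // final byte: emit and return to state 0
            out.push_back((cp << 6) | (b & 0x3F)); state = 0; break;
        case OutOfData:
            return n + 1 == data.size();  // a stray 0xFF earlier is an error
        default:
            return false;  // InvalidByteError
        }
    }
}
```

Because every byte other than the sentinel either advances the loop or returns, and the final 0xFF always returns (OutOfData in state 0, InvalidByteError in any other state), the loop cannot run past the end of a buffer that satisfies the precondition.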

>
> Really this UTF-8 / specific algorithm spam has little to do with the
> C++ language which is what this newsgroup is about: try posting to
> comp.programming instead.
>
> /Leigh

From: Peter Olcott on
On 5/19/2010 11:37 AM, Leigh Johnston wrote:
>
>
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:re6dna39C89cj2nWnZ2dnUVZ_vidnZ2d(a)giganews.com...

> Stop spamming this newsgroup with your UTF-8 DFA irrelevance.
>
> /Leigh

Several people specifically requested that I back up my claims, so I did.