Prev: Does anyone copyright or patent their applications?
Next: Designing a Finite State Machine DFA Recognizer for UTF-8
From: Peter Olcott on 19 May 2010 11:57

State 0
  00-7F  ASCII
  C2-DF  goto State 1   // Two Byte
  E0-EF  goto State 2   // Three Byte
  F0-F4  goto State 4   // Four Byte
  else   Error
State 1
  80-BF
  else   Error
State 2
  80-BF  goto State 3
  else   Error
State 3
  80-BF
  else   Error
State 4
  80-BF  goto State 5
  else   Error
State 5
  80-BF  goto State 6
  else   Error
State 6
  80-BF  goto State 7
  else   Error
State 7
  80-BF
  else   Error

// Holds ActionCodes Indexed by NextState and Data[N]
uint8 States[256][8];

// This is the input data to be transformed
std::vector<uint8> Data; // LastByte hold sentinel value 11

Twelve ActionCodes
  00 InvalidByteError
  01 FirstByteOfOneByte
  02 FirstByteOfTwoBytes
  03 FirstByteOfThreeBytes
  04 FirstByteOfFourBytes
  05 SecondByteOfTwoBytes
  06 SecondByteOfThreeBytes
  07 SecondByteOfFourBytes
  08 ThirdByteOfThreeBytes
  09 ThirdByteOfFourBytes
  10 FourthByteOfFourBytes
  11 OutOfData (Sentinel)
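As an illustration only (not part of the original post), the byte ranges listed above can be loaded into a lookup table and walked one byte at a time. This is a minimal sketch under stated assumptions: the names `Table`, `kError`, `fillRange`, `buildTable`, and `matchesByteRanges` are hypothetical, the table stores the next state rather than an ActionCode, and it uses seven states (0 through 6) in which State 6 accepts the fourth byte of a four-byte sequence, so the listing's State 7 is left out. Because it checks only the ranges given in the post, it still accepts some byte sequences that strict UTF-8 validation rejects, such as the overlong form E0 80 80.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical marker standing in for the listing's "else Error".
constexpr std::uint8_t kError = 0xFF;

// One row of 256 next-state entries per state (states 0..6 in this sketch).
using Table = std::array<std::array<std::uint8_t, 256>, 7>;

// Set t[state][b] = next for every byte b in the inclusive range [lo, hi].
void fillRange(Table& t, int state, int lo, int hi, std::uint8_t next) {
    assert(0 <= lo && lo <= hi && hi <= 0xFF);
    for (int b = lo; b <= hi; ++b) t[state][b] = next;
}

Table buildTable() {
    Table t;
    for (auto& row : t) row.fill(kError);   // default: "else Error"
    fillRange(t, 0, 0x00, 0x7F, 0);         // ASCII, stay in State 0
    fillRange(t, 0, 0xC2, 0xDF, 1);         // two-byte lead
    fillRange(t, 0, 0xE0, 0xEF, 2);         // three-byte lead
    fillRange(t, 0, 0xF0, 0xF4, 4);         // four-byte lead
    fillRange(t, 1, 0x80, 0xBF, 0);         // 2nd of 2, character complete
    fillRange(t, 2, 0x80, 0xBF, 3);         // 2nd of 3
    fillRange(t, 3, 0x80, 0xBF, 0);         // 3rd of 3, character complete
    fillRange(t, 4, 0x80, 0xBF, 5);         // 2nd of 4
    fillRange(t, 5, 0x80, 0xBF, 6);         // 3rd of 4
    fillRange(t, 6, 0x80, 0xBF, 0);         // 4th of 4, character complete
    return t;
}

// Returns true if every byte follows the ranges above and the input
// ends on a character boundary (back in State 0).
bool matchesByteRanges(const std::vector<std::uint8_t>& data) {
    static const Table table = buildTable();
    std::uint8_t state = 0;
    for (std::uint8_t byte : data) {
        state = table[state][byte];
        if (state == kError) return false;
    }
    return state == 0;
}
```

With this table, `matchesByteRanges({0xC3, 0xA9})` succeeds, while a lone lead byte such as `{0xC3}` fails because the input ends in State 1.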
From: Leigh Johnston on 19 May 2010 12:11

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:5PednSUdPpMbl2nWnZ2dnUVZ_j6dnZ2d(a)giganews.com...
> // This is the input data to be transformed
> std::vector<uint8> Data; // LastByte hold sentinel value 11
>

What if a character with the same ASCII value as your sentinel is present in the input data?

Really this UTF-8 / specific algorithm spam has little to do with the C++ language which is what this newsgroup is about: try posting to comp.programming instead.

/Leigh
From: Peter Olcott on 19 May 2010 12:33

On 5/19/2010 11:11 AM, Leigh Johnston wrote:
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:5PednSUdPpMbl2nWnZ2dnUVZ_j6dnZ2d(a)giganews.com...
>> // This is the input data to be transformed
>> std::vector<uint8> Data; // LastByte hold sentinel value 11
>>
>
> What if a character with the same ASCII value as your sentinel is
> present in the input data?
>
> Really this UTF-8 / specific algorithm spam has little to do with the
> C++ language which is what this newsgroup is about: try posting to
> comp.programming instead.
>
> /Leigh

The state transition table only holds values between zero and eleven. These values are ActionCodes used in a C++ switch statement. The input data is only used as the second subscript into the States[8][256] table. The first subscript is the CurrentState. The CurrentState is initialized to zero and changed (as needed) in each of the cases of the C++ switch statement.
From: Peter Olcott on 19 May 2010 12:47

On 5/19/2010 11:11 AM, Leigh Johnston wrote:
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:5PednSUdPpMbl2nWnZ2dnUVZ_j6dnZ2d(a)giganews.com...
>> // This is the input data to be transformed
>> std::vector<uint8> Data; // LastByte hold sentinel value 11
>>
>
> What if a character with the same ASCII value as your sentinel is
> present in the input data?

Make 0x0 or 0xFF the sentinel value, and remove it from the Error states.

>
> Really this UTF-8 / specific algorithm spam has little to do with the
> C++ language which is what this newsgroup is about: try posting to
> comp.programming instead.
>
> /Leigh
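For readers following the sentinel discussion, here is a hedged sketch (not the poster's code) of how the pieces described in the replies above might fit together: a table indexed by CurrentState and then by the input byte, holding the twelve ActionCodes, a switch that updates the state, and 0xFF appended to the input as the sentinel. The byte 0xFF never occurs in valid UTF-8, which is what makes it usable here. The names `ActionTable`, `kSentinelByte`, `buildActionTable`, and `decodeWithSentinel` are assumptions, as are the use of seven states (0 through 6) and the choice to treat a sentinel byte anywhere before the end of the input as an error.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The twelve ActionCodes from the post, numbered 0 through 11.
enum ActionCode : std::uint8_t {
    InvalidByteError, FirstByteOfOneByte, FirstByteOfTwoBytes,
    FirstByteOfThreeBytes, FirstByteOfFourBytes, SecondByteOfTwoBytes,
    SecondByteOfThreeBytes, SecondByteOfFourBytes, ThirdByteOfThreeBytes,
    ThirdByteOfFourBytes, FourthByteOfFourBytes, OutOfData
};

constexpr std::uint8_t kSentinelByte = 0xFF;  // assumed sentinel choice

// First subscript: current state. Second subscript: input byte.
using ActionTable = std::array<std::array<std::uint8_t, 256>, 7>;

ActionTable buildActionTable() {
    ActionTable t{};  // zero-initialized: every entry is InvalidByteError
    auto fill = [&t](int state, int lo, int hi, ActionCode a) {
        for (int b = lo; b <= hi; ++b) t[state][b] = a;
    };
    fill(0, 0x00, 0x7F, FirstByteOfOneByte);
    fill(0, 0xC2, 0xDF, FirstByteOfTwoBytes);
    fill(0, 0xE0, 0xEF, FirstByteOfThreeBytes);
    fill(0, 0xF0, 0xF4, FirstByteOfFourBytes);
    fill(1, 0x80, 0xBF, SecondByteOfTwoBytes);
    fill(2, 0x80, 0xBF, SecondByteOfThreeBytes);
    fill(3, 0x80, 0xBF, ThirdByteOfThreeBytes);
    fill(4, 0x80, 0xBF, SecondByteOfFourBytes);
    fill(5, 0x80, 0xBF, ThirdByteOfFourBytes);
    fill(6, 0x80, 0xBF, FourthByteOfFourBytes);
    t[0][kSentinelByte] = OutOfData;  // sentinel is legal only between characters
    return t;
}

// Decodes `data`, whose last byte must be kSentinelByte, into code points.
// Returns false on InvalidByteError or on a sentinel before the last byte.
bool decodeWithSentinel(const std::vector<std::uint8_t>& data,
                        std::vector<char32_t>& out) {
    assert(!data.empty() && data.back() == kSentinelByte);
    static const ActionTable table = buildActionTable();
    std::uint8_t currentState = 0;
    char32_t cp = 0;
    // No bounds test in the loop: the trailing sentinel always returns.
    for (std::size_t n = 0;; ++n) {
        const std::uint8_t byte = data[n];
        switch (table[currentState][byte]) {
        case FirstByteOfOneByte:     out.push_back(byte); break;
        case FirstByteOfTwoBytes:    cp = byte & 0x1F; currentState = 1; break;
        case FirstByteOfThreeBytes:  cp = byte & 0x0F; currentState = 2; break;
        case FirstByteOfFourBytes:   cp = byte & 0x07; currentState = 4; break;
        case SecondByteOfThreeBytes: cp = (cp << 6) | (byte & 0x3F); currentState = 3; break;
        case SecondByteOfFourBytes:  cp = (cp << 6) | (byte & 0x3F); currentState = 5; break;
        case ThirdByteOfFourBytes:   cp = (cp << 6) | (byte & 0x3F); currentState = 6; break;
        case SecondByteOfTwoBytes:
        case ThirdByteOfThreeBytes:
        case FourthByteOfFourBytes:  // last byte of a character
            out.push_back((cp << 6) | (byte & 0x3F));
            currentState = 0;
            break;
        case OutOfData:
            return n + 1 == data.size();
        default:                     // InvalidByteError
            return false;
        }
    }
}
```

In this sketch the sentinel is a byte value in the data and OutOfData (11) is the ActionCode the table returns for it, so an input byte with the value 11 is just an ordinary ASCII control character and is not mistaken for the end of the data.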
From: Peter Olcott on 19 May 2010 12:55

On 5/19/2010 11:37 AM, Leigh Johnston wrote:
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:re6dna39C89cj2nWnZ2dnUVZ_vidnZ2d(a)giganews.com...
> Stop spamming this newsgroup with your UTF-8 DFA irrelevance.
>
> /Leigh

Several people specifically requested that I back up my claims, so I did.