Designing a Finite State Machine DFA Recognizer for UTF-8 [MFC]

From: Leigh Johnston on 20 May 2010 12:21

"James Kanze" <james.kanze(a)gmail.com> wrote in message
news:d24c65d5-e98b-4822-bc3b-57a4a844955e(a)j27g2000vbp.googlegroups.com...
> On May 19, 7:42 pm, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
>> Firstly my requirement is for conversion to UTF-16 not UTF-32.
>
> Just curious, but wouldn't the simplest way to do this be to
> convert to UTF-32, then check whether you need surrogates or
> not?
>

Not for me. I develop for Windows, whose native Unicode encoding is UTF-16,
making UTF-32 pretty useless on that platform.

/Leigh
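
For reference, the "convert to UTF-32, then check whether you need surrogates" step James describes can be sketched as follows. This is a minimal illustration, not code from either poster; the function name appendUtf16 is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: appends the UTF-16 encoding of one Unicode scalar
// value to `out`. Returns false (and appends nothing) if `cp` is not a
// valid scalar value.
bool appendUtf16(std::uint32_t cp, std::vector<std::uint16_t>& out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;                        // out of range, or a surrogate value
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));   // BMP: one code unit
    } else {
        cp -= 0x10000;                       // 20 significant bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return true;
}
```

Whether going through a 32-bit intermediate costs anything measurable compared with a direct UTF-8 to UTF-16 conversion is not settled in this thread.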

From: Peter Olcott on 20 May 2010 12:26

On 5/20/2010 11:11 AM, James Kanze wrote:
> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>
> [...]
>> The main purpose of this is to read in a file of UTF-8 to be converted
>> to UTF-32. I don't have to mutate the input at all, the user must know
>> to append the 0xFF byte.
>
> In the file?
>
> [...]
>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>> the data is corrupted.
>
> Am I the only one who senses a problem here? If you're reading
> from an external source (a file), then you have to assume that
> the file might contain anything; people do pass in the wrong
> filename, and your program has to handle that gracefully.
> (Error message, etc.)
>
> Of course, this should be done on input. Internally, if you
> continue to use UTF-8 (rather than converting to UTF-16 or
> UTF-32), you can assume correct UTF-8. But in that case, the
> fastest way to advance to the next character is almost certainly
> 'p += byteCount(*p)', rather than a DFA; if you assume correct
> UTF-8, there's no need to look at each character.
>
> --
> James Kanze

I must be validating UTF-8 as well as converting it to UTF-32. Only a
DFA can do this very quickly.
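
For comparison, the 'p += byteCount(*p)' approach quoted above might look like the sketch below. It assumes the input has already been validated, exactly as James stipulates; the body of byteCount and the helper countCodePoints are illustrative guesses, not code posted in the thread.

```cpp
#include <cstddef>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Assumes `lead` is a valid lead byte; no validation is performed.
inline std::size_t byteCount(unsigned char lead)
{
    if (lead < 0x80) return 1;   // 0xxxxxxx
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    return 4;                    // 11110xxx
}

// Hypothetical helper: counts code points in [p, end), assuming the range
// is already known to be valid UTF-8.
std::size_t countCodePoints(const unsigned char* p, const unsigned char* end)
{
    std::size_t n = 0;
    while (p < end) {
        p += byteCount(*p);      // skip the whole sequence without reading it
        ++n;
    }
    return n;
}
```

On unvalidated input this can step past `end`, which is why it only applies after validation has been done on input.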

From: Leigh Johnston on 20 May 2010 12:33

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:wtadnXf96NMo_2jWnZ2dnUVZ_r6dnZ2d(a)giganews.com...
> On 5/20/2010 11:11 AM, James Kanze wrote:
>> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>>
>> [...]
>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>> to append the 0xFF byte.
>>
>> In the file?
>>
>> [...]
>>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>>> the data is corrupted.
>>
>> Am I the only one who senses a problem here? If you're reading
>> from an external source (a file), then you have to assume that
>> the file might contain anything; people do pass in the wrong
>> filename, and your program has to handle that gracefully.
>> (Error message, etc.)
>>
>> --
>> James Kanze
>
> I must be validating UTF-8 as well as converting it to UTF-32. Only a DFA
> can do this very quickly.

You didn't respond to JK's point. If you require the file to contain 0xFF
as the last byte, then when the wrong file is given by mistake your algorithm
will overrun the buffer, because it relies only on the sentinel to detect the
end. This is a crash waiting to happen. Better not to rely on a sentinel
at all and instead check on each iteration whether the end has been reached.
We are only talking about an extra CPU instruction per iteration (compare and
conditional jump versus unconditional jump).

/Leigh
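
A validating converter that checks for the end on every iteration, as suggested above, could be sketched like this. It is a plain branch-based decoder, not the table-driven DFA under discussion, and the name utf8ToUtf32 is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decodes [p, end) as UTF-8 into `out`. Returns false on the first
// malformed, truncated, overlong, surrogate, or out-of-range sequence;
// code points decoded before the error remain in `out`.
bool utf8ToUtf32(const unsigned char* p, const unsigned char* end,
                 std::vector<std::uint32_t>& out)
{
    while (p != end) {                        // explicit end check, no sentinel
        std::uint32_t cp = *p++;
        std::size_t extra;                    // continuation bytes expected
        std::uint32_t lowest;                 // smallest legal value (rejects overlongs)
        if (cp < 0x80)      { extra = 0; lowest = 0; }
        else if (cp < 0xC2) return false;     // stray continuation or overlong lead
        else if (cp < 0xE0) { extra = 1; lowest = 0x80;    cp &= 0x1F; }
        else if (cp < 0xF0) { extra = 2; lowest = 0x800;   cp &= 0x0F; }
        else if (cp < 0xF5) { extra = 3; lowest = 0x10000; cp &= 0x07; }
        else return false;                    // 0xF5..0xFF never occur in UTF-8
        if (static_cast<std::size_t>(end - p) < extra)
            return false;                     // sequence truncated by end of input
        for (; extra != 0; --extra) {
            const unsigned char c = *p++;
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3Fu);
        }
        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        out.push_back(cp);
    }
    return true;
}
```

Because every read is preceded by a bounds check, a wrong or corrupted file produces an error return rather than a read past the buffer.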

From: Peter Olcott on 20 May 2010 12:43

On 5/20/2010 11:33 AM, Leigh Johnston wrote:
>
>
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:wtadnXf96NMo_2jWnZ2dnUVZ_r6dnZ2d(a)giganews.com...
>> On 5/20/2010 11:11 AM, James Kanze wrote:
>>> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>>>
>>> [...]
>>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>>> to append the 0xFF byte.
>>>
>>> In the file?
>>>
>>> [...]
>>>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>>>> the data is corrupted.
>>>
>>> Am I the only one who senses a problem here? If you're reading
>>> from an external source (a file), then you have to assume that
>>> the file might contain anything; people do pass in the wrong
>>> filename, and your program has to handle that gracefully.
>>> (Error message, etc.)
>>>
>>> --
>>> James Kanze
>>
>> I must be validating UTF-8 as well as converting it to UTF-32. Only a
>> DFA can do this very quickly.
>
> You didn't respond to JK's point. If you require the file to contain
> 0xFF as the last byte

I do not require this. Probably the best tradeoff among the various design
alternatives, with maximum speed as the binding constraint, is that the
user passes me a mutable std::vector<unsigned char>. My code
appends the required 0xFF and later removes it.

I could also provide an overloaded immutable function that is slower
because it must copy all of the data.
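
One way the append-then-remove scheme might be made exception-safe is a small RAII guard, sketched below. The names SentinelGuard and countAsciiPrefix are hypothetical, and the scan is only a stand-in for the real recognizer; it shows a loop that relies on the sentinel instead of an end check.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical guard: appends the 0xFF sentinel on construction and
// removes it again on destruction, even if the scan throws.
class SentinelGuard {
public:
    explicit SentinelGuard(std::vector<unsigned char>& buf) : buf_(buf)
    {
        buf_.push_back(0xFF);
    }
    ~SentinelGuard() { buf_.pop_back(); }
    SentinelGuard(const SentinelGuard&) = delete;
    SentinelGuard& operator=(const SentinelGuard&) = delete;
private:
    std::vector<unsigned char>& buf_;
};

// Stand-in for the recognizer: counts the leading ASCII bytes.
// The loop has no end check; it stops at the sentinel at the latest,
// because 0xFF is not below 0x80.
std::size_t countAsciiPrefix(std::vector<unsigned char>& buf)
{
    SentinelGuard guard(buf);
    const unsigned char* p = buf.data();
    while (*p < 0x80)
        ++p;
    return static_cast<std::size_t>(p - buf.data());
}
```

Note that push_back may reallocate when the vector has no spare capacity, which copies all of the data anyway, so the mutable overload only avoids a copy if the caller reserved room for the extra byte.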

> then if a wrong file is given by mistake your
> algorithm will perform a buffer overrun as you only rely on the sentinel
> to check for end. This is a crash waiting to happen. Better to not rely
> on a sentinel at all and check if end has been reached each iteration,
> we are only talking about an extra CPU instruction per iteration
> (compare and conditional jump versus unconditional jump).
>
> /Leigh

From: Joseph M. Newcomer on 20 May 2010 13:30

See below...
On Thu, 20 May 2010 17:33:53 +0100, "Leigh Johnston" <leigh(a)i42.co.uk> wrote:

>
>
>"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>news:wtadnXf96NMo_2jWnZ2dnUVZ_r6dnZ2d(a)giganews.com...
>> On 5/20/2010 11:11 AM, James Kanze wrote:
>>> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>>>
>>> [...]
>>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>>> to append the 0xFF byte.
>>>
>>> In the file?
>>>
>>> [...]
>>>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>>>> the data is corrupted.
****
This was one of the most stupid ideas I have heard proposed in a long time.

Note that every UTF-8 string will be terminated with \x00 (NUL) if it is a canonical
representation. The typical way to read a file in Windows is to simply allocate a buffer
of filesize+sizeof(WCHAR), read in the entire contents of the file, then, given the
number of bytes read, append two \x00 bytes (which will be one NUL character if it is a
UTF-16 encoding) to the buffer. Then you can look for a BOM; if one is found, then you
adjust the start point to be just past the BOM; if it is UTF-16BE, on Windows you then run
through and swap the bytes of each UTF-16 character before working with the data. If it
is UTF-8, then you treat it as UTF-8 for whatever reason you want UTF-8; if it is
UTF-16LE, then you treat it as Windows' native UTF-16 encoding and do with it what you
want. But because two \x00 bytes have been appended, it is already a NUL-terminated
string. This is not Rocket Science, and it does not impose on the end user the need to
insert a non-standard character at the end of the file. What, exactly, is the problem
that appending a \xFF to the file solves that appending a \x00 byte after the file is read
does not?

Note this algorithm can be generalized to support the possibility of UTF-32LE and UTF-32BE
input files. But I leave that generalization as an Exercise For The Reader.
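
The BOM check and byte swap described above could be sketched as follows, operating on a buffer that has already been read into memory. The names Encoding, BomResult, detectBom and swapUtf16Bytes are hypothetical, and the sketch deliberately ignores UTF-32 (whose little-endian BOM also begins with FF FE), matching the simplification in the text.

```cpp
#include <cstddef>
#include <utility>

enum class Encoding { Utf8, Utf16LE, Utf16BE, Unknown };

struct BomResult {
    Encoding encoding;       // encoding implied by the BOM, if any
    std::size_t bomLength;   // bytes to skip to get past the BOM
};

// Inspects the first bytes of the buffer for a byte order mark.
BomResult detectBom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return { Encoding::Utf8, 3 };
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return { Encoding::Utf16LE, 2 };
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return { Encoding::Utf16BE, 2 };
    return { Encoding::Unknown, 0 };   // no BOM: caller must decide
}

// Swaps each pair of bytes in place, turning UTF-16BE into UTF-16LE
// (or the reverse). A trailing odd byte is left untouched.
void swapUtf16Bytes(unsigned char* data, std::size_t size)
{
    for (std::size_t i = 0; i + 1 < size; i += 2)
        std::swap(data[i], data[i + 1]);
}
```

A caller would read the file into a buffer of filesize plus two bytes, zero the last two bytes, call detectBom, and start processing at data + bomLength.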

Requiring the user put some weird character at the end of the file is just a stupid
design. No sane designer (let alone a superb designer) would impose such an unbelievably
stupid requirement!
joe

>>>
>>> Am I the only one who senses a problem here? If you're reading
>>> from an external source (a file), then you have to assume that
>>> the file might contain anything; people do pass in the wrong
>>> filename, and your program has to handle that gracefully.
>>> (Error message, etc.)
>>>
>>> --
>>> James Kanze
>>
>> I must be validating UTF-8 as well as converting it to UTF-32. Only a DFA
>> can do this very quickly.
>
>You didn't respond to JK's point. If you require the file to contain 0xFF
>as the last byte then if a wrong file is given by mistake your algorithm
>will perform a buffer overrun as you only rely on the sentinel to check for
>end. This is a crash waiting to happen. Better to not rely on a sentinel
>at all and check if end has been reached each iteration, we are only talking
>about an extra CPU instruction per iteration (compare and conditional jump
>versus unconditional jump).
>
>/Leigh
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
