From: Leigh Johnston on 20 May 2010 12:21

"James Kanze" <james.kanze(a)gmail.com> wrote in message
news:d24c65d5-e98b-4822-bc3b-57a4a844955e(a)j27g2000vbp.googlegroups.com...
> On May 19, 7:42 pm, "Leigh Johnston" <le...(a)i42.co.uk> wrote:
>> Firstly my requirement is for conversion to UTF-16 not UTF-32.
>
> Just curious, but wouldn't the simplest way to do this be to
> convert to UTF-32, then check whether you need surrogates or
> not?

Not for me. I develop for Windows, whose native Unicode encoding is UTF-16, which makes UTF-32 pretty useless on that platform.

/Leigh
From: Peter Olcott on 20 May 2010 12:26

On 5/20/2010 11:11 AM, James Kanze wrote:
> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>
> [...]
>> The main purpose of this is to read in a file of UTF-8 to be converted
>> to UTF-32. I don't have to mutate the input at all, the user must know
>> to append the 0xFF byte.
>
> In the file?
>
> [...]
>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>> the data is corrupted.
>
> Am I the only one who senses a problem here. If you're reading
> from an external source (a file), then you have to assume that
> the file might contain anything; people do pass in the wrong
> filename, and your program has to handle that gracefully.
> (Error message, etc.)
>
> Of course, this should be done on input. Internally, if you
> continue to use UTF-8 (rather than converting to UTF-16 or
> UTF-32), you can assume correct UTF-8. But in that case, the
> fastest way to advance to the next character is almost certainly
> 'p += byteCount(*p)', rather than a DFA; if you assume correct
> UTF-8, there's no need to look at each character.
>
> --
> James Kanze

I must validate the UTF-8 as well as convert it to UTF-32. Only a DFA can do this very quickly.
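For readers following along, the 'p += byteCount(*p)' idea quoted above can be sketched roughly as follows. This is only an illustration: the body of byteCount and the countCodePoints wrapper are assumptions, not code posted in the thread, and the sketch is only meaningful for input that has already been validated.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of a UTF-8 sequence, taken from its lead byte.
// Assumes validated input; any unexpected byte counts as length 1 so the
// caller's loop always makes progress.
inline std::size_t byteCount(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // not a lead byte in valid UTF-8
}

// Counts code points by hopping from lead byte to lead byte, never
// inspecting continuation bytes. Correct only for well-formed UTF-8.
inline std::size_t countCodePoints(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size();
         i += byteCount(static_cast<unsigned char>(s[i]))) {
        ++n;
    }
    return n;
}
```

The loop skips continuation bytes entirely, which is why it is fast and also why it cannot double as a validator.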
From: Leigh Johnston on 20 May 2010 12:33

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:wtadnXf96NMo_2jWnZ2dnUVZ_r6dnZ2d(a)giganews.com...
> On 5/20/2010 11:11 AM, James Kanze wrote:
>> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>>
>> [...]
>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>> to append the 0xFF byte.
>>
>> In the file?
>>
>> [...]
>>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>>> the data is corrupted.
>>
>> Am I the only one who senses a problem here. If you're reading
>> from an external source (a file), then you have to assume that
>> the file might contain anything; people do pass in the wrong
>> filename, and your program has to handle that gracefully.
>> (Error message, etc.)
>>
>> --
>> James Kanze
>
> I must be validating UTF-8 as well as converting it to UTF-32. Only a DFA
> can do this very quickly.

You didn't respond to JK's point. If you require the file to contain 0xFF as the last byte, then when a wrong file is given by mistake your algorithm will overrun the buffer, because you rely only on the sentinel to detect the end. This is a crash waiting to happen. Better not to rely on a sentinel at all and to check whether the end has been reached on each iteration; we are only talking about an extra CPU instruction per iteration (compare and conditional jump versus unconditional jump).

/Leigh
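As an illustration of the "check for the end on each iteration" approach, here is a minimal sketch of a validating UTF-8 to UTF-32 decoder that uses explicit bounds checks and no sentinel. It is a plain branch-based decoder, not the DFA under discussion, and the function name utf8ToUtf32 and its signature are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decodes UTF-8 into UTF-32 code points. Returns false on malformed input
// (stray or invalid lead byte, truncation, bad continuation byte, overlong
// form, surrogate, or value above U+10FFFF).
inline bool utf8ToUtf32(const std::vector<unsigned char>& in,
                        std::vector<std::uint32_t>& out) {
    out.clear();
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {                        // explicit end check, no sentinel
        const unsigned char b = in[i];
        std::uint32_t cp = 0;
        std::uint32_t minValue = 0;        // smallest value this length may encode
        std::size_t len = 0;
        if (b < 0x80)                { cp = b;        len = 1; minValue = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; minValue = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; minValue = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; minValue = 0x10000; }
        else return false;                 // continuation byte or 0xF8..0xFF as lead

        if (len > n - i) return false;     // sequence runs past the end of the buffer
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3Fu);
        }
        if (cp < minValue) return false;                 // overlong encoding
        if (cp > 0x10FFFF) return false;                 // outside Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

With the bounds checks in place, a wrong or corrupt file, including one containing 0xFF, is rejected with a false return rather than read past the end of the buffer.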
From: Peter Olcott on 20 May 2010 12:43

On 5/20/2010 11:33 AM, Leigh Johnston wrote:
>
>
> "Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
> news:wtadnXf96NMo_2jWnZ2dnUVZ_r6dnZ2d(a)giganews.com...
>> On 5/20/2010 11:11 AM, James Kanze wrote:
>>> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>>>
>>> [...]
>>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>>> to append the 0xFF byte.
>>>
>>> In the file?
>>>
>>> [...]
>>>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>>>> the data is corrupted.
>>>
>>> Am I the only one who senses a problem here. If you're reading
>>> from an external source (a file), then you have to assume that
>>> the file might contain anything; people do pass in the wrong
>>> filename, and your program has to handle that gracefully.
>>> (Error message, etc.)
>>>
>>> --
>>> James Kanze
>>
>> I must be validating UTF-8 as well as converting it to UTF-32. Only a
>> DFA can do this very quickly.
>
> You didn't respond to JK's point. If you require the file to contain
> 0xFF as the last byte

I do not require this. Probably the best tradeoff among the design alternatives, with maximum speed as the binding constraint, is for the user to pass me a mutable std::vector<unsigned char>. My code appends the required 0xFF and later removes it. I could also provide an overloaded function that takes immutable input, which would be slower because it must copy all of the data.

> then if a wrong file is given by mistake your
> algorithm will perform a buffer overrun as you only rely on the sentinel
> to check for end. This is a crash waiting to happen. Better to not rely
> on a sentinel at all and check if end has been reached each iteration,
> we are only talking about an extra CPU instruction per iteration
> (compare and conditional jump versus unconditional jump).
>
> /Leigh
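The append-then-remove scheme described in this post might look something like the sketch below. The SentinelGuard class and the firstSentinel function are hypothetical names invented for illustration; the scan shown only locates the first 0xFF byte and stands in for the real DFA loop, which was not posted.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical RAII helper: appends the 0xFF sentinel on construction and
// removes it on destruction, so the caller's vector is restored even if
// the processing code exits early.
class SentinelGuard {
public:
    explicit SentinelGuard(std::vector<unsigned char>& buf) : buf_(buf) {
        buf_.push_back(0xFF);   // may reallocate, which copies the data once
    }
    ~SentinelGuard() { buf_.pop_back(); }
    SentinelGuard(const SentinelGuard&) = delete;
    SentinelGuard& operator=(const SentinelGuard&) = delete;
private:
    std::vector<unsigned char>& buf_;
};

// Returns the index of the first 0xFF byte. A result equal to the original
// size means the data itself contained no 0xFF; a smaller result means a
// stray 0xFF (corrupt UTF-8) was found in the data.
inline std::size_t firstSentinel(std::vector<unsigned char>& buf) {
    SentinelGuard guard(buf);
    const unsigned char* p = buf.data();
    while (*p != 0xFF) ++p;     // no end check: the appended byte stops the scan
    return static_cast<std::size_t>(p - buf.data());
}
```

This one-byte-at-a-time scan cannot run past the appended sentinel. A loop that consumes several bytes per step, as a multi-byte decoder does, would still need to make sure a truncated final sequence cannot carry it beyond that byte.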
From: Joseph M. Newcomer on 20 May 2010 13:30
See below...

On Thu, 20 May 2010 17:33:53 +0100, "Leigh Johnston" <leigh(a)i42.co.uk> wrote:
>
>
>"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
>news:wtadnXf96NMo_2jWnZ2dnUVZ_r6dnZ2d(a)giganews.com...
>> On 5/20/2010 11:11 AM, James Kanze wrote:
>>> On May 19, 7:31 pm, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>>>> On 5/19/2010 1:00 PM, Leigh Johnston wrote:
>>>
>>> [...]
>>>> The main purpose of this is to read in a file of UTF-8 to be converted
>>>> to UTF-32. I don't have to mutate the input at all, the user must know
>>>> to append the 0xFF byte.
>>>
>>> In the file?
>>>
>>> [...]
>>>> Any UTF-8 to UTF-32 converter would not have 0xFF in its input unless
>>>> the data is corrupted.
****
This was one of the most stupid ideas I have heard proposed in a long time. Note that every UTF-8 string will be terminated with \x00 (NUL) if it is a canonical representation.

The typical way to read a file in Windows is to simply allocate a buffer of filesize+sizeof(WCHAR), read in the entire contents of the file, then, given the number of bytes read, append two \x00 bytes (which will be one NUL character if it is a UTF-16 encoding) to the buffer. Then you can look for a BOM; if one is found, you adjust the start point to be just past the BOM. If it is UTF-16BE, on Windows you then run through and swap the bytes of each UTF-16 character before working with the data. If it is UTF-8, then you treat it as UTF-8 for whatever reason you want UTF-8; if it is UTF-16LE, then you treat it as Windows' native UTF-16 encoding and do with it what you want. But because two \x00 bytes have been appended, it is already a NUL-terminated string.

This is not Rocket Science, and it does not impose on the end user the need to insert a non-standard character at the end of the file. What, exactly, is the problem that appending a \xFF to the file solves that appending a \x00 byte after the file is read does not?

Note this algorithm can be generalized to support the possibility of UTF-32LE and UTF-32BE input files. But I leave that generalization as an Exercise For The Reader.

Requiring the user to put some weird character at the end of the file is just a stupid design. No sane designer (let alone a superb designer) would impose such an unbelievably stupid requirement!
joe
>>>
>>> Am I the only one who senses a problem here. If you're reading
>>> from an external source (a file), then you have to assume that
>>> the file might contain anything; people do pass in the wrong
>>> filename, and your program has to handle that gracefully.
>>> (Error message, etc.)
>>>
>>> --
>>> James Kanze
>>
>> I must be validating UTF-8 as well as converting it to UTF-32. Only a DFA
>> can do this very quickly.
>
>You didn't respond to JK's point. If you require the file to contain 0xFF
>as the last byte then if a wrong file is given by mistake your algorithm
>will perform a buffer overrun as you only rely on the sentinel to check for
>end. This is a crash waiting to happen. Better to not rely on a sentinel
>at all and check if end has been reached each iteration, we are only talking
>about an extra CPU instruction per iteration (compare and conditional jump
>versus unconditional jump).
>
>/Leigh

Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
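The read, terminate, and BOM-check steps described in this post can be sketched in portable C++ roughly as follows. This is an illustration under assumptions: it uses std::ifstream rather than any Windows API, the names Encoding, BomInfo, detectBom, swapUtf16Bytes, and readWithTerminator are invented for the example, and UTF-32 BOMs are deliberately not handled (a UTF-32LE file would be misreported as UTF-16LE here).

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

enum class Encoding { NoBom, Utf8, Utf16LE, Utf16BE };

struct BomInfo {
    Encoding encoding;
    std::size_t length;   // number of BOM bytes to skip
};

// Looks for a UTF-8 or UTF-16 byte order mark at the start of the buffer.
inline BomInfo detectBom(const std::vector<unsigned char>& b) {
    if (b.size() >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return {Encoding::Utf8, 3};
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return {Encoding::Utf16LE, 2};
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return {Encoding::Utf16BE, 2};
    return {Encoding::NoBom, 0};
}

// Swaps each byte pair from 'start' onward, turning UTF-16BE into UTF-16LE.
// A trailing odd byte, if any, is left untouched.
inline void swapUtf16Bytes(std::vector<unsigned char>& b, std::size_t start) {
    for (std::size_t i = start; i + 1 < b.size(); i += 2)
        std::swap(b[i], b[i + 1]);
}

// Reads the whole file and appends two zero bytes, so the buffer is
// NUL-terminated whether it is later treated as UTF-8 or UTF-16.
// Returns false if the file cannot be opened.
inline bool readWithTerminator(const std::string& path,
                               std::vector<unsigned char>& buf) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    buf.assign(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>());
    buf.push_back(0);
    buf.push_back(0);
    return true;
}
```

A caller would read the file, call detectBom, skip the reported number of BOM bytes, and call swapUtf16Bytes only for the Utf16BE case; the terminator is added in memory, so nothing has to be written into the file itself.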