From: Peter Olcott on 31 May 2010 13:29 On 5/31/2010 11:35 AM, Daniel T. wrote: > Peter Olcott<NoSpam(a)OCR4Screen.com> wrote: > >> I used the two tables from this link as the basis for my design: >> http://en.wikipedia.org/wiki/UTF-8 > > I suggest you use http://unicode.org/ for your source. Why use a > secondary source when the primary source is easily available? Wading though all that to get what I need takes too long. > >> I would like this reviewed for algorithm correctness: > > Surely your tests have already shown whether the algorithm is correct. > >> void UnicodeEncodingConversion:: >> toUTF8(std::vector<uint32_t>& UTF32, std::vector<uint8_t>& UTF8) { >> uint8_t Byte; >> uint32_t CodePoint; >> UTF8.reserve(UTF32.size() * 4); // worst case >> for (uint32_t N = 0; N< UTF32.size(); N++) { >> CodePoint = UTF32[N]; > > I suggest you use an iterator instead of an integer for the loop. That > way you wont need the extraneous variable. > >> if (CodePoint<= 0x7F) { >> Byte = CodePoint; >> UTF8.push_back(Byte); >> } >> else if (CodePoint<= 0x7FF) { >> Byte = 0xC0 | (CodePoint>> 6); >> UTF8.push_back(Byte); >> Byte = 0x80 | (CodePoint& 0x3F); >> UTF8.push_back(Byte); >> } >> else if (CodePoint<= 0xFFFF) { >> Byte = 0xE0 | (CodePoint>> 12); >> UTF8.push_back(Byte); >> Byte = 0x80 | ((CodePoint>> 6)& 0x3F); >> UTF8.push_back(Byte); >> Byte = 0x80 | (CodePoint& 0x3F); >> UTF8.push_back(Byte); >> } >> else if (CodePoint<= 0x10FFFF) { > > The codes 10FFFE and 10FFFF are guaranteed not to be unicode > characters... So then Wikipedia is wrong? http://en.wikipedia.org/wiki/Unicode 16 100000�10FFFF Supplementary Private Use Area-B > >> Byte = 0xF0 | (CodePoint>> 18); >> UTF8.push_back(Byte); >> Byte = 0x80 | ((CodePoint>> 12)& 0x3F); >> UTF8.push_back(Byte); >> Byte = 0x80 | ((CodePoint>> 6)& 0x3F); >> UTF8.push_back(Byte); >> Byte = 0x80 | (CodePoint& 0x3F); >> UTF8.push_back(Byte); >> } >> else >> printf("%d is outside of the Unicode range!\n", CodePoint); > > Throw is more appropriate here. > >> } >> } So it looks otherwise correct? I am guessing that I should probably also screen out the high and low surrogates.
From: Giovanni Dicanio on 31 May 2010 14:16 "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote: >> UTF8.reserve(UTF32.size() * 4); // worst case > **** > Note that this will call malloc(), which will involve setting a lock, then > searching for a > block to allocate, then releasing the lock. Since you have been a fanatic > about > performance, why is it you put a very expensive operation like 'reserve' > in your code? > > While it is perfectly reasonable, it seems inconsistent with your > previously-stated goals. Joe: I'm not sure if you are ironic or something :) ... but I believe that std::vector::reserve() with a proper capacity value, followed by several push_back()s, is very efficient. Sure, not as efficient as a static stack-allocated array, but very efficient. > No, the CORRECT way to write such code is to either throw an exception (if > you are in C++, > which you clearly are) or return a value indicating the error (for > example, in C, an In this case, I'm for exception. Thanks to exception, you could use the precious function return value to actually return the resulting buffer (UTF8 string), instead of passing it as a reference to the function: // Updated prototype: // - use 'const' correctness for utf32 // - return resulting utf8 // - may throw on error std::vector<uint8_t> toUTF8(const std::vector<uint32_t> & utf32); Note that thanks to the move semantics (i.e. the new "&&" thing of C++0x, available in VC10 a.k.a. VS2010), you don't pay for extra useless copies in returning potentially big objects. Giovanni
From: Daniel T. on 31 May 2010 14:24 Peter Olcott <NoSpam(a)OCR4Screen.com> wrote: > On 5/31/2010 11:35 AM, Daniel T. wrote: > > The codes 10FFFE and 10FFFF are guaranteed not to be unicode > > characters... > > So then Wikipedia is wrong? > http://en.wikipedia.org/wiki/Unicode > 16 100000�10FFFF Supplementary Private Use Area-B According to unicode.org, apparently yes. You'd know that if you hadn't been lazy and only consulted a secondary source. > So it looks otherwise correct? Does it pass all your tests? You do have tests don't you?
From: Daniel T. on 31 May 2010 14:41 "Leigh Johnston" <leigh(a)i42.co.uk> wrote: > "Daniel T." <daniel_t(a)earthlink.net> wrote: > > Peter Olcott <NoSpam(a)OCR4Screen.com> wrote: > > > > > void UnicodeEncodingConversion:: > > > toUTF8(std::vector<uint32_t>& UTF32, std::vector<uint8_t>& UTF8) { > > > uint8_t Byte; > > > uint32_t CodePoint; > > > UTF8.reserve(UTF32.size() * 4); // worst case > > > for (uint32_t N = 0; N < UTF32.size(); N++) { > > > CodePoint = UTF32[N]; > > > > I suggest you use an iterator instead of an integer for the loop. That > > way you wont need the extraneous variable. > > Then the iterator would be extraneous surely? Unless you mean CodePoint is > the extraneous variable which it isn't as it is accessed multiple times and > dereferencing an iterator multiple times would not be as efficient modulo > any compiler optimizations; it certainly is not as clear as using a > temporary (IMO). Our opinions must differ then. Such micro-optimizations would be barely perceptible even in a contrived example. "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system." -- Andrew Hunt CodePoint and UTF32[N] are two representations that both refer to the same piece of knowledge. Why the unnecessary duplication?
From: Peter Olcott on 31 May 2010 14:49
On 5/31/2010 1:16 PM, Giovanni Dicanio wrote: > "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote: > >>> UTF8.reserve(UTF32.size() * 4); // worst case >> **** >> Note that this will call malloc(), which will involve setting a lock, >> then searching for a >> block to allocate, then releasing the lock. Since you have been a >> fanatic about >> performance, why is it you put a very expensive operation like >> 'reserve' in your code? >> >> While it is perfectly reasonable, it seems inconsistent with your >> previously-stated goals. > > Joe: I'm not sure if you are ironic or something :) ... but I believe > that std::vector::reserve() with a proper capacity value, followed by > several push_back()s, is very efficient. > Sure, not as efficient as a static stack-allocated array, but very > efficient. He needed to find some excuse to denigrate my code. He has had a personal grudge against me for several months. I don't really know what I said to offend him, but, it must have occurred sometime after he sung very high praises about my patent a few months ago. >> No, the CORRECT way to write such code is to either throw an exception >> (if you are in C++, >> which you clearly are) or return a value indicating the error (for >> example, in C, an > The "correct" way to handle an error when testing code for the first time is to use a printf() statement, or other easy to use debugging construct. When the code moves to production, then either of the other two suggestions may be appropriate. > In this case, I'm for exception. > Thanks to exception, you could use the precious function return value to > actually return the resulting buffer (UTF8 string), instead of passing > it as a reference to the function: > > // Updated prototype: > // - use 'const' correctness for utf32 > // - return resulting utf8 > // - may throw on error > std::vector<uint8_t> toUTF8(const std::vector<uint32_t> & utf32); For most compilers this requires making an extra copy. > > Note that thanks to the move semantics (i.e. the new "&&" thing of > C++0x, available in VC10 a.k.a. VS2010), you don't pay for extra useless > copies in returning potentially big objects. > > Giovanni > > > Counting on this results in code that does not have the same performance characteristics across multiple platforms. |