From: Öö Tiib on 18 May 2010 19:01
On 18 May, 17:18, James Kanze <james.ka...(a)gmail.com> wrote:
> The network is still 8 bits UTF-8. As are the disks; using
> UTF-16 on an external support simply doesn't work.
>
> Also, UTF-8 may result in less memory use, and thus less paging.
>
> If all you're doing are simple operations, searching for a few
> ASCII delimiters and copying the delimited substrings, for
> example, UTF-8 will probably be significantly faster: the CPU
> will always read a word at a time, even if you access it byte by
> byte, and you'll usually get more characters per word using
> UTF-8.
>
> If you need full and complete support, as in an editor, for
> example, UTF-32 is the best general solution. For a lot of
> things in between, UTF-16 is a good compromise.
>
> But the trade-offs only concern internal representation.
> Externally, the world is 8 bits, and UTF-8 is the only solution.

I would honestly be extremely glad if it were the only solution.
Real-life applications throw in texts in all possible forms, and they
expect responses in all possible forms. For example, texts in
financial transactions in most of Northern Europe assume that
"/\{}[]" means something like "ÄäÅåÖö" (I do not remember the correct
order, but something like that).

I prefer to convert incoming texts into std::wstring. Outgoing texts
I convert back to whatever the recipient expects (UTF-8 is really a
relief there, true). All I need is a set of conversion functions. If
the text is going to the user interface, then a std::wstring goes
out, and it is the UI's business to convert it further into CString
or QString or whatever it prefers and to sort it out for the user.

I perhaps have too little experience with sophisticated text
processing. Simple std::sort(), the wide char literals of C++, and
boost::wformat plus a full set of conversion functions are really all
I need. Peter Olcott raises a lot of noise about it, and so it makes
me a bit interested. :)
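A minimal sketch of one such conversion function, converting a 7-bit national-variant byte stream into std::wstring at the boundary. The table is an assumption: it uses the ISO 646-SE assignment as I understand it ([ \ ] for Ä Ö Å and { | } for ä ö å), which is not exactly the set of characters recalled from memory in the post. The function names are hypothetical.

```cpp
#include <string>

// Decode one byte of a 7-bit Swedish/Finnish national variant (assumed
// here to be ISO 646-SE) into a Unicode code point. Only the six
// positions that differ from ASCII in this variant are remapped.
inline wchar_t decode_iso646_se(char c) {
    switch (c) {
        case '[':  return L'\u00C4';  // Ä
        case '\\': return L'\u00D6';  // Ö
        case ']':  return L'\u00C5';  // Å
        case '{':  return L'\u00E4';  // ä
        case '|':  return L'\u00F6';  // ö
        case '}':  return L'\u00E5';  // å
        default:   return static_cast<wchar_t>(static_cast<unsigned char>(c));
    }
}

// Hypothetical boundary function: bytes from the wire in, wide text out.
inline std::wstring from_iso646_se(const std::string& bytes) {
    std::wstring out;
    out.reserve(bytes.size());
    for (char c : bytes) out.push_back(decode_iso646_se(c));
    return out;
}
```

The reverse direction would be a second function with the inverse table, plus a policy for characters the 7-bit variant cannot represent.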
From: Mihai N. on 19 May 2010 01:24
> I perhaps have too little experience with sophisticated text
> processing. Simple std::sort(), the wide char literals of C++, and
> boost::wformat plus a full set of conversion functions are really all
> I need.

It depends a lot on what you need.

Sorting is locale-sensitive (German, Swedish, French, Spanish, all
have different sorting rules).
The CRT (and STL, and boost) are pretty dumb when dealing with things
in a locale-sensitive way (meaning that they usually don't :-)

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
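To illustrate the point about locale-sensitive sorting: plain std::sort() on std::wstring compares code units, while passing a std::locale as the comparator uses that locale's collate facet. The sketch below is only an illustration; the helper names are hypothetical, locale names are platform-specific, and how good the resulting order is depends entirely on the locale data the platform ships.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the named locale if the platform provides it, otherwise the
// classic "C" locale. Locale names such as "sv_SE.UTF-8" are
// platform-specific and may not be installed.
inline std::locale locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Sorts using the locale's collate facet. std::locale::operator()
// compares two strings through std::collate, so the locale object
// itself can be passed to std::sort as the comparator.
inline void sort_for_locale(std::vector<std::wstring>& words,
                            const std::locale& loc) {
    std::sort(words.begin(), words.end(), loc);
}
```

With the classic locale this degenerates to code point order, which puts "Zebra" before "apple"; with a Swedish or German locale, where available, the same call may order accented letters differently.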
From: Pete Delgado on 19 May 2010 01:39
"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>
> The finite state machine's detailed design is now completed. Its state
> transition matrix only takes 2048 bytes. It will be faster than any other
> possible method.

So once again you find yourself with a *design* that is complete but
you have not done any *coding*? Yet you claim that it will be faster
than any other possible method?

Is anyone else noticing a pattern here??? (no pun intended...)

-Pete
From: Öö Tiib on 19 May 2010 04:50
On May 19, 8:24 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote:
> > I perhaps have too little experience with sophisticated text
> > processing. Simple std::sort(), the wide char literals of C++, and
> > boost::wformat plus a full set of conversion functions are really all
> > I need.
>
> It depends a lot on what you need.
>
> Sorting is locale-sensitive (German, Swedish, French, Spanish, all
> have different sorting rules).
> The CRT (and STL, and boost) are pretty dumb when dealing with things
> in a locale-sensitive way (meaning that they usually don't :-)

Yes, sorting in true alphabetical order for the user is perhaps the
business of the GUI. The GUI has to display it. The GUI, however,
usually has its own WxStrings or FooStrings anyway. I hate it when
someone leaks these weirdos into the application mechanics layer.
Internal application logic is often best made totally
locale-agnostic, not caring about positioning in the GUI or whether
the end users write from top to bottom or from right to left.

So text in the electronic interfaces layer is bytes, text in the
application layer is wchar_t, and text in the user interface layer is
whatever weirdo rules there. If a maintainer forgets to convert at
the interface between layers, he gets compiler warnings or errors.
That makes life easy, but I suspect my problems with text are more
trivial than those of some others.
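A small sketch of the layering argument, under the assumption that the wire layer uses std::string and the application layer uses std::wstring. The aliases and function names are hypothetical. The static_asserts only document what the compiler already enforces: neither string type converts implicitly to the other, so a forgotten conversion at a layer boundary does not compile.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

using WireText = std::string;   // bytes at the external interface
using AppText  = std::wstring;  // wide characters inside the application

// Neither type converts implicitly to the other, so handing the wrong
// one across a layer boundary is a compile-time error.
static_assert(!std::is_convertible<WireText, AppText>::value,
              "wire bytes must be converted explicitly");
static_assert(!std::is_convertible<AppText, WireText>::value,
              "application text must be converted explicitly");

// Hypothetical converter for a wire format assumed to be 7-bit ASCII.
// A real system would have one such function per external encoding.
inline AppText to_app_text(const WireText& in) {
    AppText out;
    out.reserve(in.size());
    for (unsigned char c : in) out.push_back(static_cast<wchar_t>(c));
    return out;
}

// Application logic accepts only AppText.
inline std::size_t app_length(const AppText& text) { return text.size(); }
```

Calling app_length() with a WireText argument is rejected by the compiler; app_length(to_app_text(bytes)) is accepted.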
From: James Kanze on 19 May 2010 06:21
On May 19, 12:01 am, Öö Tiib <oot...(a)hot.ee> wrote:
> On 18 May, 17:18, James Kanze <james.ka...(a)gmail.com> wrote:

    [...]
> > But the trade-offs only concern internal representation.
> > Externally, the world is 8 bits, and UTF-8 is the only solution.

> I would honestly be extremely glad if it were the only solution.
> Real-life applications throw in texts in all possible forms, and they
> expect responses in all possible forms.

Yes. I meant that it is the only solution if you are making the
choice yourself. In practice, there are a lot of other solutions
being used; they don't work, except in limited environments, but
they are widely used.

> For example, texts in financial transactions in most of
> Northern Europe assume that "/\{}[]" means something like
> "ÄäÅåÖö" (I do not remember the correct order, but something like
> that).

> I prefer to convert incoming texts into std::wstring. Outgoing
> texts I convert back to whatever the recipient expects (UTF-8 is
> really a relief there, true). All I need is a set of
> conversion functions. If the text is going to the user interface,
> then a std::wstring goes out, and it is the UI's business to convert
> it further into CString or QString or whatever it prefers
> and to sort it out for the user.

In theory, the conversion should take place in the filebuf, using
the imbued locale.

> I perhaps have too little experience with sophisticated text
> processing. Simple std::sort(), the wide char literals of C++, and
> boost::wformat plus a full set of conversion functions are really all
> I need. Peter Olcott raises a lot of noise about it, and so it makes
> me a bit interested. :)

There can be advantages to using UTF-8 internally, as well as at
the interface level, and if you're not doing anything too
complicated, it can work quite nicely. But only as long as your
manipulations stay simple.

--
James Kanze
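One example of a manipulation that stays simple in UTF-8 is the case mentioned earlier in the thread: searching for a few ASCII delimiters and copying the delimited substrings. A minimal sketch, with a hypothetical function name, assuming the input is well-formed UTF-8:

```cpp
#include <string>
#include <vector>

// Splits UTF-8 text on an ASCII delimiter by scanning bytes. This is
// safe without decoding because every byte of a multi-byte UTF-8
// sequence has its high bit set, so it can never equal an ASCII
// delimiter.
inline std::vector<std::string> split_utf8(const std::string& text,
                                           char delim) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = text.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(text.substr(start));  // last field
            return fields;
        }
        fields.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
}
```

The same byte-wise approach stops working once the operation depends on character boundaries or character properties, for instance truncating to N characters or case folding, which is where a wider internal representation starts to pay off.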