New utf8string design may make UTF-8 the superior encoding [MFC]

Prev: UTF-8 string in MBCS project
Next: Love Potion for Miss Blandish

From: Öö Tiib on 18 May 2010 19:01

On 18 mai, 17:18, James Kanze <james.ka...(a)gmail.com> wrote:
> The network is still 8 bits UTF-8. As are the disks; using
> UTF-16 on an external support simply doesn't work.
>
> Also, UTF-8 may result in less memory use, and thus less paging.
>
> If all you're doing are simple operations, searching for a few
> ASCII delimiters and copying the delimited substrings, for
> example, UTF-8 will probably be significantly faster: the CPU
> will always read a word at a time, even if you access it byte by
> byte, and you'll usually get more characters per word using
> UTF-8.
>
> If you need full and complete support, as in an editor, for
> example, UTF-32 is the best general solution. For a lot of
> things in between, UTF-16 is a good compromise.
>
> But the trade-offs only concern internal representation.
> Externally, the world is 8 bits, and UTF-8 is the only solution.

I would honestly be extremely glad if it were the only solution. Real-life
applications receive text in all possible forms, and they expect
responses in all possible forms as well. For example, texts in financial
transactions in most of Northern Europe assume that "/\{}[]" means
something like "ÄäÅåÖö" (I do not remember the correct order, but
something like that).

I prefer to convert incoming texts into std::wstring. Outgoing texts I
convert back to whatever the recipient expects (UTF-8 is really good news
there, true). All I need is a set of conversion functions. If the text
is going to the user interface, then a std::wstring goes there, and it is
the business of the UI to convert it further into CString or QString or
whatever it uses and sort it out for the user.
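
A rough sketch of what one such pair of conversion functions could look like for the UTF-8 case. The names utf8_to_wide and wide_to_utf8 are made up for illustration and are not from any library. The sketch assumes wchar_t holds UTF-32 where it is 4 bytes wide and UTF-16 where it is 2 bytes wide, and for brevity it does not reject overlong encodings or encoded surrogates, which a production decoder should.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into a std::wstring.
// Throws std::runtime_error on truncated or malformed input.
std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF) throw std::runtime_error("code point out of range");
        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: emit a UTF-16 surrogate pair.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}

// Hypothetical helper: encode a std::wstring as UTF-8 bytes.
std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = static_cast<char32_t>(in[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            // 16-bit wchar_t: combine a surrogate pair into one code point.
            char32_t lo = static_cast<char32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp > 0x10FFFF) throw std::runtime_error("code point out of range");
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Other external encodings would each need their own pair of functions like this at the boundary; the application layer only ever sees std::wstring.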

Perhaps I have too little experience with sophisticated text processing.
Plain std::sort(), the wide character literals of C++, and boost::wformat,
plus a full set of conversion functions, are really all I need. Peter Olcott
is making a lot of noise about it, and so it makes me a bit
interested. :)

From: Mihai N. on 19 May 2010 01:24

> I perhaps have too low experience with sophisticated text processing.
> Simple std::sort(), wide char literals of C++ and boost::wformat plus
> full set of conversion functions is all i need really.

It depends a lot what you need.

Sorting is locale-sensitive (German, Swedish, French, Spanish, all
have different sorting rules).
The CRT (and STL, and boost) are pretty dumb when dealing with things
in a locale-sensitive way (meaning that they usually don't :-)
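
For what it is worth, standard C++ does let you pass a std::locale object to std::sort as the comparator, in which case the comparison goes through that locale's collate facet. A minimal sketch follows; the helper names sort_for_locale and locale_or_classic are made up for illustration, and how good the resulting order is depends entirely on the locale data the platform ships.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: sort strings by the collation rules of `loc`.
// std::locale::operator() compares two strings through the locale's
// std::collate facet, so the locale object itself serves as the comparator.
void sort_for_locale(std::vector<std::wstring>& words, const std::locale& loc) {
    std::sort(words.begin(), words.end(), loc);
}

// Hypothetical helper: construct a named locale, falling back to the
// classic "C" locale when the platform does not provide that name.
// Locale names are platform-specific.
std::locale locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

A call such as sort_for_locale(words, locale_or_classic("sv_SE.UTF-8")) would request Swedish collation on systems that know that name (the name is an assumption and differs on Windows). With the classic locale the order is plain code-unit order, which is what a bare std::sort gives anyway.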

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Pete Delgado on 19 May 2010 01:39

"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote in message
news:hcednaq6ks5ml2_WnZ2dnUVZ_tqdnZ2d(a)giganews.com...
>
> The finite state machine's detailed design is now completed. Its state
> transition matrix only takes 2048 bytes. It will be faster than any other
> possible method.

So once again you find yourself with a *design* that is complete but you
have not done any *coding*? Yet you claim that it will be faster than any
other possible method?

Is anyone else noticing a pattern here??? (no pun intended...)

-Pete

From: Öö Tiib on 19 May 2010 04:50

On May 19, 8:24 am, "Mihai N." <nmihai_year_2...(a)yahoo.com> wrote:
> > I perhaps have too low experience with sophisticated text processing.
> > Simple std::sort(), wide char literals of C++ and boost::wformat plus
> > full set of conversion functions is all i need really.
>
> It depends a lot what you need.
>
> Sorting is locale-sensitive (German, Swedish, French, Spanish, all
> have different sorting rules).
> The CRT (and STL, and boost) are pretty dumb when dealing with things
> in a locale-sensitive way (meaning that they usually don't :-)

Yes, sorting in real alphabetical order for the user is perhaps the
business of the GUI. The GUI has to display it. The GUI, however, usually
has its own WxStrings or FooStrings anyway. I hate it when someone leaks
these weirdos into the application logic layer. Internal application logic
is often best made totally locale-agnostic, not caring about positioning
in the GUI or whether the end users write top to bottom or right to left.

So text in the electronic interface layer is bytes, text in the
application layer is wchar_t, and text in the user interface layer is
whatever weirdo type rules there. If a maintainer forgets to convert at
the boundary between layers, he gets compiler warnings or errors. That
makes life easy, but I suspect my problems with text are more trivial
than those of some others.
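
A small sketch of why a forgotten conversion shows up at compile time: std::string and std::wstring are unrelated types, so neither converts implicitly to the other. The aliases WireText and AppText and the two functions are made up for illustration, and Latin-1 is only an example of an external encoding.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Text as it arrives on an electronic interface: raw bytes.
using WireText = std::string;
// Text inside the application layer: wide characters.
using AppText = std::wstring;

// Neither type converts implicitly to the other, so handing wire bytes
// to an application-layer function is rejected by the compiler.
static_assert(!std::is_convertible<WireText, AppText>::value,
              "bytes must be converted explicitly");
static_assert(!std::is_convertible<AppText, WireText>::value,
              "wide text must be converted explicitly");

// Hypothetical application-layer function: accepts only AppText.
std::size_t length_in_units(const AppText& text) {
    return text.size();
}

// Hypothetical boundary conversion for Latin-1 input. Latin-1 byte
// values coincide with the first 256 Unicode code points, so each byte
// widens directly.
AppText from_latin1(const WireText& bytes) {
    AppText out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        out += static_cast<wchar_t>(static_cast<unsigned char>(c));
    }
    return out;
}
```

Writing length_in_units("abc") or passing a WireText directly fails to compile; length_in_units(from_latin1(bytes)) is the only way through.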

From: James Kanze on 19 May 2010 06:21

On May 19, 12:01 am, Öö Tiib <oot...(a)hot.ee> wrote:
> On 18 mai, 17:18, James Kanze <james.ka...(a)gmail.com> wrote:

[...]
> > But the trade-offs only concern internal representation.
> > Externally, the world is 8 bits, and UTF-8 is the only solution.

> I would be honestly extremely glad if it was the only solution. Real
> life applications throw in texts in all possible forms also they await
> responses in all possible forms.

Yes. I meant it is the only solution if you are choosing
yourself. In practice, there are a lot of other solutions being
used; they don't work, except in limited environments, but they
are being widely used.

> For example texts in financial transactions done in most
> Northern Europe assume that "/\{}[]" means something like
> "ÄäÅåÖö" (i do not remember correct order, but something like
> that).

> I prefer to convert incoming texts into std::wstring. Outgoing
> texts i convert back to whatever they await (UTF-8 is really
> relaxing news there, true). All what i need is a set of
> conversion functions. If it is going to user interface then
> std::wstring goes and it is business of UI to convert it
> further into CString or QString or whatever they enjoy there
> and sort it out for user.

In theory, the conversion should take place in the filebuf,
using the imbued locale.

> I perhaps have too low experience with sophisticated text processing.
> Simple std::sort(), wide char literals of C++ and boost::wformat plus
> full set of conversion functions is all i need really. Peter Olcott
> raises lot of noise around it and so it makes me a bit
> interested. :)

There can be advantages to using UTF-8 internally, as well as at
the interface level, and if you're not doing too complicated
things, it can work quite nicely. But only as long as your
manipulations aren't too complicated.
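
The simple case mentioned earlier in the thread (searching for a few ASCII delimiters and copying the delimited substrings) can be done on UTF-8 with plain byte operations, because every byte of a multi-byte UTF-8 sequence has its high bit set and so can never be mistaken for an ASCII character. A minimal sketch, with split_utf8 as a made-up name:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split UTF-8 text on an ASCII delimiter.
// Precondition: delim is an ASCII character (value below 0x80).
// The text is never decoded; fields are copied out as raw bytes and
// remain valid UTF-8 if the input was valid UTF-8.
std::vector<std::string> split_utf8(const std::string& text, char delim) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (;;) {
        std::size_t pos = text.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(text.substr(start));  // last field
            break;
        }
        fields.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

Anything that needs to know where characters begin and end, such as truncating to N characters or case conversion, no longer fits this byte-wise approach.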

--
James Kanze
