UNICODE charsets and widechars [wxWindows]

From: Manuel Martín on 3 Oct 2006 12:25

Hi

Reading docs, lists and wiki about charsets and UNICODE, I'm getting a
bit confused.

IMHO the docs use somewhat dated vocabulary: when they say "converts
between the UTF-8 encoding and Unicode", I think this is not fully valid,
because UTF-8 is part of UNICODE.

So I decided to write some clarifying text, but I also think an "expert"
should confirm it. Here it is:

First, some definitions:

(A) 'ASCII-7'  : Chars are stored using 7 bits (128 different chars).
(B) 'ASCII-E'  : Chars are stored using 8 bits, but positions from 128
                 on are used differently for different languages.
(C) 'charset'  : How positions 128-255 are used in ASCII-E. An inverted
                 question mark is 0xBF in ISO-8859-1 (Latin-1), but
                 the same byte 0xBF in WINDOWS-1250 means a 'z'
                 with a dot above it.
                 Examples of charsets (same as 'code pages') are
                 ISO-8859-1, WINDOWS-1252, KOI8-R.
(D) 'UNICODE'  : For a complete description see www.unicode.org.
                 In short: a way of managing _all_ possible chars
                 simultaneously.
                 UNICODE speaks about 'code points' (positions on the
                 list) and leaves 'char' as a representation of a code
                 point. UNICODE defines some ways of 'encoding', i.e.
                 how bytes are organized: UTF-8, UTF-16,...
(E) 'UTF-16'   : A UNICODE encoding that uses 2 bytes for each char.
                 This allows more than 65000 chars, but may not be
                 enough (e.g. Chinese needs more glyphs).
(F) 'UTF-32'   : A UNICODE encoding that uses 4 bytes for each char.
(G) 'UTF-8'    : A UNICODE encoding that uses a variable number of
                 bytes for each char (1 to 4 since RFC 3629; the
                 original design allowed up to 6). The first 128
                 'positions' (code points) are identical to ASCII-7.
(H) 'widechar' : A name for how an OS manages UTF-16 or UTF-32 chars.
                 You use 'wchar' in your code and Windows XP
                 understands it as a 2-byte char and some Unices as a
                 4-byte char.
(I) 'multibyte': A sequence of bytes that is supposed to be a UNICODE
                 char. If it is a 2-byte sequence, it may be a UTF-16
                 char but perhaps a UTF-8 one. UNICODE defines a way
                 of telling this unambiguously.
(J) 'font'     : A table of correspondences between chars and their
                 representations (graphic drawings) on screen or
                 printer.
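
To make the "variable number of bytes" in the UTF-8 definition concrete,
here is a minimal sketch in plain C++ (no wx involved). The function name
encode_utf8 is made up for this example, and it follows the RFC 3629 form
of UTF-8, which stops at 4 bytes per code point:

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (RFC 3629: at most 4 bytes).
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                        // 1 byte, same as 7-bit ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                     // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, 'A' (U+0041) stays a single byte, the inverted question
mark (U+00BF) becomes the two bytes C2 BF, and the euro sign (U+20AC)
becomes the three bytes E2 82 AC.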

And now, how wx manages it all:

1) If you want to use UNICODE chars, compile wx and your app passing
_UNICODE to the compiler. It is also possible to use widechars in an ANSI
(non-UNICODE) build, but you have to convert them.

2) Use macro wxT() for all strings. Use wxString instead of C style
strings.

3) If you need an ASCII-E char in a literal string, don't write it
directly (your compiler may raise an error). You can pass a charset
parameter to the compiler, but it is preferable to tell wxString which
charset to use (see the wxString constructors with conversion).

4) Conversions:
wxMBConv is the base class for conversions between widechar and
multibyte. Some specialized classes are wxConvLocal, wxConvUTF8, etc.

wxCSConv converts between ASCII-E and widechars. Be aware that converting
to ASCII-E can give undesired representations for some chars, because
they may not exist in that charset (see (C) above).
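
Why converting to an 8-bit charset can lose chars is easy to show without
wx. The sketch below is NOT how wxCSConv works internally; to_latin1 and
Latin1Result are hypothetical names, and replacing with '?' is just a
choice made for the example. ISO-8859-1 covers exactly U+0000..U+00FF, so
anything above that has nowhere to go:

```cpp
#include <cstddef>
#include <string>

// Result of a lossy conversion: the bytes plus how many chars were replaced.
struct Latin1Result {
    std::string bytes;
    std::size_t replaced;
};

// Convert UTF-32 text to ISO-8859-1. Code points above U+00FF do not exist
// in that charset, so this sketch substitutes '?' and counts the losses.
Latin1Result to_latin1(const std::u32string& text)
{
    Latin1Result r{std::string(), 0};
    for (char32_t cp : text) {
        if (cp <= 0xFF) {
            r.bytes += static_cast<char>(cp);   // same position in Latin-1
        } else {
            r.bytes += '?';                     // not representable
            ++r.replaced;
        }
    }
    return r;
}
```

For example, U+00BF (inverted question mark) survives as byte 0xBF, while
U+017C ('z' with a dot above, which does exist in WINDOWS-1250) comes out
as '?'.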

wxEncodingConverter is capable of converting strings between two 8-bit
encodings/charsets. It's useful if you allow replacing some chars with
similar ones (e.g. stripping accents).

5) Fonts
wxFont, wxFontEnumerator and wxFontMapper allow working with different
charsets.

6) Notes:
You should add an encoding-descriptor to your data files (as HTML does)
and use it when reading the data.

wxLocale reads catalogs using their encoding descriptor.
wxLocale changes the application locale in its Init(), so take care when
using printf(), wxConvLibc, wxConvLocal ...

Remember wx does not really convert. It relies on OS libraries to do
the 'real' work. If some conversion is not possible, it's the OS's fault.
You can find apps on the web that do this job regardless of OS
capabilities.

TIA
Manolo

---------------------------------------------------------------------
To unsubscribe, e-mail: wx-users-unsubscribe(a)lists.wxwidgets.org
For additional commands, e-mail: wx-users-help(a)lists.wxwidgets.org

From: Vadim Zeitlin on 3 Oct 2006 13:21

On Tue, 03 Oct 2006 18:25:56 +0200 Manuel Martín <mmartin(a)ceyd.es> wrote:

MM> So I decided to write some clarifying text, but I also think an
MM> "expert" should confirm it.

It's globally correct but a few remarks are in order:

MM> IMHO the docs use somewhat dated vocabulary: when they say "converts
MM> between the UTF-8 encoding and Unicode", I think this is not fully
MM> valid, because UTF-8 is part of UNICODE.

No, UTF-8 is just one possible encoding of Unicode, as you say yourself
below. In general, "Unicode" in relation to wxWidgets means "wchar_t".
While this is not totally correct either (especially under Windows, which
uses 16-bit wchar_t), it's more or less true as there is a one-to-one
mapping between at least the BMP (Basic Multilingual Plane) or, in the case
of Unix systems where wchar_t is 32-bit, the entire Unicode code space, and
the wide characters.

MM> (E) 'UTF-16' : A UNICODE encoding that uses 2 bytes for each char.

At least 2 bytes. Code points outside the BMP need a surrogate pair, i.e.
4 bytes.

MM> This allows more than 65000 chars, but may not be
MM> enough (e.g. Chinese needs more glyphs).

Not the usual Chinese ideograms though, they're part of the BMP. So "most"
of the commonly used symbols need only 2 bytes in UTF-16.
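
A small sketch of the "at least 2 bytes" point, in plain C++ rather than
wx. encode_utf16 is a made-up helper name; it returns the 16-bit code units
for one code point, so a result of size 1 means 2 bytes and size 2 means 4:

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// BMP code points take one unit; everything above U+FFFF takes a
// surrogate pair. Invalid input (a lone surrogate value or anything
// above U+10FFFF) yields an empty vector.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return units;                           // reserved for surrogates
    if (cp < 0x10000) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else if (cp <= 0x10FFFF) {
        cp -= 0x10000;                          // 20 bits remain
        units.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
    return units;
}
```

A common ideogram such as U+4E2D fits in one unit, while U+20000 (from the
CJK extension block outside the BMP) becomes the pair D840 DC00.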

MM> (H) 'widechar': A name for how an OS manages UTF-16 or UTF-32 chars.
MM> You use 'wchar' in your code and Windows XP understands it
MM> as a 2-byte char and some Unices as a 4-byte char.

It's wchar_t and, AFAIK, all Unices use 32 bit wchar_t.
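
If you want to check what your own compiler does, a tiny sketch like the
following reports the width of wchar_t at compile time. The two function
names are invented for this example; only sizeof and CHAR_BIT come from the
language and standard library:

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the platform this is compiled for.
constexpr std::size_t wchar_bits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when a single wchar_t can hold any code point up to U+10FFFF,
// which needs 21 bits.
constexpr bool wchar_holds_any_code_point()
{
    return wchar_bits() >= 21;
}
```

With a 16-bit wchar_t the second function returns false, which is exactly
the case where characters outside the BMP need two wide chars.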

MM> (I) 'multibyte': A sequence of bytes that is supposed to be a UNICODE

Not necessarily, there are many multibyte non-Unicode encodings.

MM> 1) If you want to use UNICODE chars, compile wx and your app passing
MM> _UNICODE to compiler.

There should rarely be a need to define this directly. If you use VC IDE
project files you just select one of the "Unicode" build configurations. If
you're under Unix, configure the library with the --enable-unicode switch.

MM> 2) Use macro wxT() for all strings. Use wxString instead of C style
MM> strings.

Maybe not quite all, but it's surely good advice to do it by default.

MM> 3) If you need an ASCII-E char in a literal string, don't write it
MM> directly (your compiler may raise an error). You can pass a charset
MM> parameter to the compiler, but it is preferable to tell wxString which
MM> charset to use (see the wxString constructors with conversion).

Yes. You can also encode wide chars inside the program using the \uxxxx
escape sequence, although older compilers don't support this.
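
As an illustration of the \uxxxx form, here is a sketch using a plain
standard C++ wide literal (in a Unicode wx build the same escapes would
normally go inside wxT(), but that part is not shown here). The variable
name is arbitrary:

```cpp
#include <string>

// U+00BF INVERTED QUESTION MARK and U+00E9 LATIN SMALL LETTER E WITH ACUTE
// written as universal character names, so the source file itself stays
// plain 7-bit ASCII and no source charset option is needed.
const std::wstring question = L"\u00bfQu\u00e9?";
```

The string holds four wide chars: the inverted question mark, 'Q', 'u'
replaced by nothing, 'e' with acute, and the final '?' make it U+00BF, 'Q',
'u', U+00E9, '?', i.e. five in total.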

Regards,
VZ

--
TT-Solutions: wxWidgets consultancy and technical support
http://www.tt-solutions.com/
