From: Manuel Martín on 3 Oct 2006 12:25

Hi

Reading docs, lists and wiki about charsets and UNICODE, I'm getting a
bit confused.

IMHO the docs are somewhat dated in their vocabulary: when they say
"converts between the UTF-8 encoding and Unicode", I think this is not
fully valid, because UTF-8 is part of UNICODE.

So I decided to write some clarifying text, but I also think an
"expert" should confirm it. Here it is:

First, some definitions:

(A) 'ASCII-7' : Chars are stored using 7 bits (128 different chars).
(B) 'ASCII-E' : Chars are stored using 8 bits, but positions 128 and
                above are used differently for different languages.
(C) 'charset' : How positions 128-255 are used in ASCII-E. An inverted
                question mark is 0xBF in ISO-8859-1 (Latin-1), but the
                same byte 0xBF in WINDOWS-1250 means a 'z' with a dot
                on top of it. Examples of charsets (same as 'code
                pages') are ISO-8859-1, WINDOWS-1252, KOI8-R.
(D) 'UNICODE' : For a complete description see www.unicode.org.
                In short: a way of managing _all_ possible chars
                simultaneously. UNICODE speaks about 'code points'
                (positions on the list) and leaves 'char' as a
                representation of a code point. UNICODE defines some
                ways of 'encoding', i.e. how bytes are organized:
                UTF-8, UTF-16,...
(E) 'UTF-16' : A UNICODE encoding that uses 2 bytes for each char.
                This allows more than 65000 chars, but may not be
                enough (e.g. Chinese needs more glyphs).
(F) 'UTF-32' : A UNICODE encoding that uses 4 bytes for each char.
(G) 'UTF-8'  : A UNICODE encoding that uses a variable (1 to 6) number
                of bytes for each char. The first 128 'positions' (code
                points) are identical to ASCII-7.
(H) 'widechar': A name for how an OS manages UTF-16 or UTF-32 chars. You
                use 'wchar' in your code and Windows XP understands it
                as a 2-byte char and some Unices as a 4-byte char.
(I) 'multibyte': A sequence of bytes that is supposed to be a UNICODE
                char. If it is a 2-byte sequence, it may be a UTF-16
                char but perhaps a UTF-8 one. UNICODE defines a way of
                telling this unambiguously.
(J) 'font'    : A table of correspondences between chars and their
                representations (graphic drawing) on screen or printer.

And now, how wx manages it all:

1) If you want to use UNICODE chars, compile wx and your app passing
   _UNICODE to the compiler. It is also possible to use widechars in an
   ANSI build (not UNICODE), but you have to convert them.

2) Use macro wxT() for all strings. Use wxString instead of C style
   strings.

3) If you need an ASCII-E char in a literal string, don't write it
   directly (your compiler may raise an error). You can pass a charset
   parameter to the compiler, but it is preferred to tell wxString the
   charset to use (see wxString constructors with conversion).

4) Conversions:
   wxMBConv is the base class for conversions between widechar and
   multibyte. Some specialized classes are wxConvLocal, wxConvUTF8, etc.
   wxCSConv converts between ASCII-E and widechars. Be aware that you
   can get undesired representations for some chars when converting to
   ASCII-E, because they may not exist in that charset (see (C) above).
   wxEncodingConverter is capable of converting strings between two
   8-bit encodings/charsets. It's useful if you allow replacing a char
   with a similar one (e.g. stripping accents).

5) Fonts:
   wxFont, wxFontEnumerator and wxFontMapper allow working with
   different charsets.

6) Notes:
   You should add an encoding descriptor to your data files (as HTML
   does) and use it when reading the data.
   wxLocale reads catalogs using their encoding descriptor.
   wxLocale changes the application locale in its Init(), so be careful
   when using printf(), wxConvLibc, wxConvLocal ...
   Remember wx does not really convert. It relies on OS libraries to do
   the 'real' work. If some conversion is not possible, it's the OS's
   fault. You can find on the web some apps that do this job regardless
   of OS capabilities.
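
To make (C) and (G) a bit more concrete, here is a minimal sketch in plain C++ with no wx calls at all. EncodeUtf8 and DecodeByteBF are hypothetical helper names made up for this illustration; they are not wxWidgets API. The sketch assumes its input is a valid code point and does no error checking. The layout used is the 1-to-4-byte one from RFC 3629; the longer 5- and 6-byte forms of the original UTF-8 design are not generated.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of wxWidgets): encode one Unicode code
// point as UTF-8. Assumes cp is a valid code point; nothing is validated.
std::string EncodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: identical to 7-bit ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: the code point that the single byte 0xBF stands for
// under two different charsets. Only this one byte is covered; a real
// converter needs the complete table of each charset.
std::uint32_t DecodeByteBF(bool isLatin1)
{
    return isLatin1 ? 0x00BF   // ISO-8859-1: inverted question mark
                    : 0x017C;  // WINDOWS-1250: small z with dot above
}
```

So the same input byte ends up as two different UTF-8 sequences depending on which charset you declare for it: EncodeUtf8(DecodeByteBF(true)) gives the bytes C2 BF, while EncodeUtf8(DecodeByteBF(false)) gives C5 BC. This is why the charset has to be known before any conversion can be trusted.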

TIA
Manolo

---------------------------------------------------------------------
To unsubscribe, e-mail: wx-users-unsubscribe(a)lists.wxwidgets.org
For additional commands, e-mail: wx-users-help(a)lists.wxwidgets.org

From: Vadim Zeitlin on 3 Oct 2006 13:21

On Tue, 03 Oct 2006 18:25:56 +0200 Manuel Martín <mmartin(a)ceyd.es> wrote:

MM> So I decided to write some clarifying text, but I also think an
MM> "expert" should confirm it.

 It's globally correct but a few remarks are in order:

MM> IMHO the docs are somewhat dated in their vocabulary: when they say
MM> "converts between the UTF-8 encoding and Unicode", I think this is
MM> not fully valid, because UTF-8 is part of UNICODE.

 No, UTF-8 is just one possible encoding of Unicode, as you say yourself
below. In general, "Unicode" in relation to wxWidgets means "wchar_t".
While this is not totally correct either (especially under Windows, which
uses 16 bit wchar_t), it's more or less true as there is a one to one
mapping between at least the BMP (or, in case of Unix systems where
wchar_t is 32 bit, the entire Unicode code space) and the wide characters.

MM> (E) 'UTF-16' : A UNICODE encoding that uses 2 bytes for each char.

 At least 2 bytes. The composites need more.

MM>                 This allows more than 65000 chars, but may not be
MM>                 enough (e.g. Chinese needs more glyphs).

 Not the usual Chinese ideograms though, they're part of the BMP. So "most"
of the commonly used symbols need only 2 bytes in UTF-16.

MM> (H) 'widechar': A name for how an OS manages UTF-16 or UTF-32 chars. You
MM>                 use 'wchar' in your code and Windows XP understands it
MM>                 as a 2-byte char and some Unices as a 4-byte char.

 It's wchar_t and, AFAIK, all Unices use 32 bit wchar_t.

MM> (I) 'multibyte': A sequence of bytes that is supposed to be a UNICODE

 Not necessarily, there are many multibyte non-Unicode encodings.

MM> 1) If you want to use UNICODE chars, compile wx and your app passing
MM>    _UNICODE to the compiler.

 There should rarely be a need to define this directly. If you use VC IDE
project files you just select one of the "Unicode" build configurations.
If you're under Unix, configure the library with the --enable-unicode
switch.

MM> 2) Use macro wxT() for all strings. Use wxString instead of C style
MM>    strings.

 Maybe not quite all but it's surely good advice to do it by default.

MM> 3) If you need an ASCII-E char in a literal string, don't write it
MM>    directly (your compiler may raise an error). You can pass a charset
MM>    parameter to the compiler, but it is preferred to tell wxString the
MM>    charset to use (see wxString constructors with conversion).

 Yes. You can also encode wide chars inside the program using the \uxxxx
escape sequence, although older compilers don't support this.

 Regards,
VZ
--
TT-Solutions: wxWidgets consultancy and technical support
http://www.tt-solutions.com/
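
The "at least 2 bytes" remark and the wchar_t size question can be sketched in plain C++ as follows. EncodeUtf16, kWcharHoldsAnyCodePoint and kInvertedQuestionMark are hypothetical names chosen for this illustration, not wxWidgets API. The sketch assumes valid code points as input and 8-bit bytes, and it relies on a compiler that accepts \uxxxx escapes in wide literals.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of wxWidgets): encode one code point as
// UTF-16 code units. Assumes cp is a valid code point; nothing is validated.
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Inside the BMP: one 16-bit unit, i.e. 2 bytes
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Outside the BMP: two 16-bit units (a surrogate pair), i.e. 4 bytes
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
    }
    return units;
}

// True when a single wchar_t is wide enough for any code point (the 32 bit
// case); false where wchar_t is 16 bit and only the BMP fits in one unit.
// Assumes 8-bit bytes.
constexpr bool kWcharHoldsAnyCodePoint = sizeof(wchar_t) >= 4;

// A wide literal written with a \uxxxx escape instead of a raw 8-bit char,
// so the source file itself stays pure ASCII.
const wchar_t kInvertedQuestionMark[] = L"\u00BF";
```

A common Chinese ideogram such as U+4E2D comes out as a single unit here, which matches the point that the usual ideograms are part of the BMP, while something like U+1D11E (a musical symbol) needs the two-unit form.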