UTF-8 and char [Unix Programming]

From: Muhammed on 21 Dec 2009 06:53

Hi All.

My code should support UTF-8 characters (all languages, such as Chinese
and Arabic). I have used chars in the code. Is that OK, or do I need to
use wide chars? (Are they available on Unix platforms?)

From: Mikko Rauhala on 21 Dec 2009 09:44

On Mon, 21 Dec 2009 03:53:36 -0800 (PST), Muhammed <doublemaster007(a)gmail.com>
wrote:
> My code should support UTF-8 characters (all languages, such as Chinese
> and Arabic). I have used chars in the code. Is that OK, or do I need to
> use wide chars? (Are they available on Unix platforms?)

Wide characters (wchar_t) are generally available on (modern)
Unix platforms (often using the UTF-32 representation internally
in contrast to UTF-16 on Windows).

However, it's not required to use wide characters for Unicode
support; in fact, due to the arguable cumbersomeness of C wide
character support, many people prefer to do exactly what you
seem to be doing: storing the strings as UTF-8 inside plain
old C strings (char arrays).
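
To illustrate what storing UTF-8 in plain char arrays means in practice,
here is a minimal sketch in standard C. The function name
utf8_count_code_points is made up for this example, and it assumes the
input is already valid UTF-8 (it does no validation). It shows the main
thing to keep in mind: strlen() counts bytes, not characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string.
 * Assumes valid UTF-8; no validation is performed.
 * Continuation bytes look like 10xxxxxx, so every byte that is NOT a
 * continuation byte starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "h\xC3\xA9" "llo" (the word "héllo" spelled out as bytes),
strlen() reports 6 while the function above reports 5. Ordinary byte-wise
operations such as strcpy() and strcmp() for equality keep working
unchanged on such strings.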

This is done, for example, by the popular GTK+ toolkit (used by
e.g. GNOME) and its text rendering library, Pango. You might find
some useful Unicode/UTF-8 utility functions in GLib[1].

The downside is of course that if you want to mix and match
code that uses wchar_t with code that uses UTF-8 char arrays,
you'll have to do conversions where appropriate. GLib should be
of help for that, too:

On systems using GNU iconv, you can use "WCHAR_T" as a source
or target codeset for g_convert() and friends. However, the stock
iconv on a given system doesn't necessarily support WCHAR_T (or
any other codeset you need), so you might have to write a bit of
architecture-specific code that knows what the local wchar_t type
actually is and, if it's either UTF-16 or UTF-32 (~UCS-4), uses
g_utf8_to_ucs4() and similar functions from GLib. Of course, this
only matters if you need to mix the two kinds of Unicode handling
code.
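
As a rough sketch of the "WCHAR_T" codeset route, here is the same idea
using the plain POSIX iconv() interface directly instead of g_convert().
The helper name utf8_to_wchar and its return convention are made up for
this example. It assumes an iconv that knows "WCHAR_T" (GNU iconv does)
and whose iconv() takes a char ** input argument (some older systems
declare it const char **).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Convert a NUL-terminated UTF-8 string to wchar_t using iconv.
 * Returns the number of wide characters written (not counting the
 * terminating L'\0'), or (size_t)-1 on failure, including the case
 * where this iconv does not support the "WCHAR_T" codeset. */
size_t utf8_to_wchar(const char *utf8, wchar_t *out, size_t out_cap)
{
    assert(utf8 != NULL && out != NULL);
    if (out_cap == 0)
        return (size_t)-1;

    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)utf8;          /* iconv does not modify the input bytes */
    size_t in_left = strlen(utf8);
    char *dst = (char *)out;
    size_t out_left = (out_cap - 1) * sizeof(wchar_t);  /* keep room for L'\0' */

    size_t rc = iconv(cd, &in, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)             /* invalid input or output buffer too small */
        return (size_t)-1;

    size_t n = (size_t)(dst - (char *)out) / sizeof(wchar_t);
    out[n] = L'\0';
    return n;
}
```

A caller would pass a wchar_t buffer and its capacity in elements, for
example utf8_to_wchar(text, buf, 64), and treat (size_t)-1 as the signal
to fall back to the architecture-specific path described above.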

(Also, if your wchar_t is non-Unicode, you're probably better
off not touching that. You can check if the __STDC_ISO_10646__
macro is defined; if it is, your wchar_t does use ISO-10646
characters, which for code point mapping purposes is the
same thing as Unicode.)
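
A small sketch of that check, in standard C. The two function names are
made up for this example. The second function is an extra assumption of
mine rather than something the standard macro tells you: it compares
WCHAR_MAX against U+10FFFF to see whether one wchar_t can hold any code
point (UTF-32 style) or not (UTF-16 style).

```c
#include <stddef.h>
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
 * ISO 10646 (Unicode) code points, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if a single wchar_t is wide enough for any code point up
 * to U+10FFFF, 0 if it is narrower (e.g. a 16-bit wchar_t). */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

If wchar_is_iso10646() returns 0, the advice above applies: it is
probably better not to assume anything about what a wchar_t value means.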

Hope this helps and all.

[1] http://library.gnome.org/devel/glib/2.22/glib-Unicode-Manipulation.html
http://library.gnome.org/devel/glib/2.22/glib-Character-Set-Conversion.html

--
Mikko Rauhala <mjr(a)iki.fi> - http://www.iki.fi/mjr/blog/
The Finnish Pirate Party - http://piraattipuolue.fi/
World Transhumanist Association - http://transhumanism.org/
Singularity Institute - http://singinst.org/

From: Muhammed on 22 Dec 2009 00:51

Thank you so much. It helped me to some extent; I need to check more.
