about encoding UTF-8 and UTF-16 [CSharp]


From: Tony Johansson on 31 Mar 2010 11:15

Hi!

This character "ñ" is represented as 241 in UTF-16.
The code point of ñ is U+00F1.
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
(195 177 decimal) as UTF-8.

My first question:
When UTF-8 uses 2 bytes for a character, is it then common that UTF-16 has zeros
in the high-order byte, as in this case where 241 fits in one byte?

My second question: does a code page include all the Unicode standards UTF-8,
UTF-16 and UTF-32? If not,
where is, for example, this character "ñ" defined for the different Unicode
standards?

//Tony

From: Peter Duniho on 31 Mar 2010 11:30

Tony Johansson wrote:
> Hi!
>
> This character "ñ" is represented as 241 in UTF-16.
> The code point of ñ is U+00F1.
> This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
> (195 177 decimal) as UTF-8.
>
> My first question:
> When UTF-8 uses 2 bytes for a character, is it then common that UTF-16 has zeros
> in the high-order byte, as in this case where 241 fits in one byte?

Define "common". But absent a specific definition from you, I'd say
"no". UTF-8 can easily represent far more than 256 different characters
using two bytes, while 256 is the absolute theoretical maximum that
UTF-16 could represent with a value having the form 00xx, where "x" is a
hexadecimal digit.
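
To put rough numbers on that comparison, here is a minimal C# sketch. The class and method names are made up for illustration; only Encoding.UTF8.GetByteCount is a real framework call. It walks the 16-bit range, skips surrogate halves, and counts how many code points take two bytes in UTF-8, and how many of those also have a zero high-order byte in UTF-16.

```csharp
using System;
using System.Text;

public static class TwoByteOverlapCount
{
    // Surrogate halves are not characters on their own, so they are skipped.
    static bool IsSurrogate(int cp)
    {
        return cp >= 0xD800 && cp <= 0xDFFF;
    }

    // Number of code points in U+0000..U+FFFF that need exactly two bytes in UTF-8.
    public static int CountTwoByteUtf8()
    {
        int count = 0;
        for (int cp = 0; cp <= 0xFFFF; cp++)
        {
            if (IsSurrogate(cp)) continue;
            if (Encoding.UTF8.GetByteCount(new[] { (char)cp }) == 2) count++;
        }
        return count;
    }

    // Same count, restricted to values of the form 00xx (zero high-order byte in UTF-16).
    public static int CountTwoByteUtf8WithZeroHighByte()
    {
        int count = 0;
        for (int cp = 0; cp <= 0xFFFF; cp++)
        {
            if (IsSurrogate(cp)) continue;
            bool twoBytes = Encoding.UTF8.GetByteCount(new[] { (char)cp }) == 2;
            if (twoBytes && (cp >> 8) == 0) count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountTwoByteUtf8());                 // 1920
        Console.WriteLine(CountTwoByteUtf8WithZeroHighByte()); // 128
    }
}
```

So only 128 of the 1920 two-byte UTF-8 characters (U+0080 to U+00FF) have a zero high-order byte in UTF-16.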

> My second question: does a code page include all the Unicode standards UTF-8,
> UTF-16 and UTF-32? If not,
> where is, for example, this character "ñ" defined for the different Unicode
> standards?

In general, Unicode is a superset of each of the various code pages.
So, no: a given code page is not going to be able to include all of the
Unicode characters.

Finally, note as has been mentioned before: UTF-8, -16, and -32 are
_encodings_ for Unicode, while Unicode is the character set. Each of
the encodings can represent all of the characters in Unicode, and the
actual code point within Unicode for any given character is always the
same. Only the value in a specific encoding changes, and you can find
ALL of this information on the http://www.unicode.org/ web site
(including, for example, code point and encoding values for a given
character, such as 'ñ').

Pete

From: Jeff Johnson on 31 Mar 2010 12:01

"Tony Johansson" <johansson.andersson(a)telia.com> wrote in message
news:%23hQhuTO0KHA.260(a)TK2MSFTNGP05.phx.gbl...

> This character "ñ" is represented as 241 in UTF-16.
> The code point of ñ is U+00F1.
> This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
> (195 177 decimal) as UTF-8.
>
> My first question:
> When UTF-8 uses 2 bytes for a character, is it then common that UTF-16 has
> zeros in the high-order byte, as in this case where 241 fits in one byte?

Honestly Tony, in this case, who cares? I realize that you're trying to
learn things but I think you need to pick and choose what you want to dive
deeply into, and in my opinion the internal workings of UTF-8 and UTF-16
shouldn't concern you. UTF-8 is a middleman. It exists to bridge the gap
between single-byte code pages and the new, global world of Unicode. Data
stored in UTF-8 is almost always translated into something else (like .NET
translates everything to UTF-16) so you should really only know how to USE
UTF-8 without worrying about its guts. (Unless you're trying to write your
own UTF-8 encoder/decoder, of course.)

> My second question: does a code page include all the Unicode standards
> UTF-8, UTF-16 and UTF-32? If not,
> where is, for example, this character "ñ" defined for the different Unicode
> standards?

Code pages do not "include Unicode standards." Let me see if I can come up
with a good analogy.

If you know anything about bitmaps, you know that there are indexed bitmaps
and true-color bitmaps. An indexed bitmap is like a paint-by-number set
(assuming you're old enough to remember those things and they existed where
you grew up). You have a limited supply of colors and each color is mapped
to a number (the index). Perhaps 0 = Red, 1 = White, 2 = Purple, etc. You
cannot use any color outside the range of your given color palette. Let's
say this color palette has 256 entries for this example, and therefore each
index value fits nicely into one byte. You define your bitmap by specifying
a bunch of indexes (bytes) that indicate which color is to be applied to
each pixel. So your bitmap data might contain 0 0 0 2 2 1, meaning three
pixels of red, two pixels of purple, and one pixel of white. Six pixels, six
bytes. Moderately compact.

On the other hand, in a true color image each pixel is represented by three
(or four, if you want transparency) bytes. There is no color palette because
those bytes can represent any color. So now your six pixels look like this:
0xFF0000 0xFF0000 0xFF0000 0xFF00FF 0xFF00FF 0xFFFFFF. Six pixels, 18 bytes.
Big, but flexible as far as colors go.

A code page is like an indexed image. Single-byte code pages contain 256
"slots," each of which can represent a character. Each code page
has a table somewhere which tells it how to map each index (0 - 255) to a
specific Unicode character (identified by its "code point").

Unicode itself is like the entire color spectrum (or at least it's pretty
close).

The Windows-1252 code page (Latin 1 or something like that) maps 65 -> A
(U+0041), 34 -> " (U+0022), 42 -> * (U+002A), and so on. Many other code
pages have similar mappings for indexes 0 - 127, but when you get to 128 -
255 you tend to see more variation. For example, and I'm totally making this
up, a Russian code page might map 165 to U+0427 whereas a Spanish code page
might map it to your ñ, U+00F1.
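
The "table of slots" idea can be sketched in C#. This is only an illustration under some assumptions: it uses ISO-8859-1 as a stand-in because that code page is available on every .NET runtime without extra setup (Windows-1252 may need a registered code page provider on newer runtimes), and the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class CodePageLookupDemo
{
    // Exception fallbacks make "no slot for this character" show up as an
    // exception instead of being silently replaced with '?'.
    static readonly Encoding Latin1 = Encoding.GetEncoding(
        "iso-8859-1",
        EncoderFallback.ExceptionFallback,
        DecoderFallback.ExceptionFallback);

    // Looks up which Unicode code point a one-byte index maps to.
    public static int CodePointOf(byte index)
    {
        return Latin1.GetString(new[] { index })[0];
    }

    // True when the code page has a slot for the given character.
    public static bool HasSlotFor(char c)
    {
        try
        {
            Latin1.GetBytes(new[] { c });
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine("65  -> U+" + CodePointOf(65).ToString("X4"));   // U+0041
        Console.WriteLine("241 -> U+" + CodePointOf(241).ToString("X4"));  // U+00F1
        Console.WriteLine(HasSlotFor('\u00F1')); // True:  n with tilde is in the table
        Console.WriteLine(HasSlotFor('\u0427')); // False: Cyrillic Che is not
    }
}
```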

UTF-8, on the other hand, is not a mapping but rather an encoding, which
takes a Unicode code point and stores it in 1 to 4 bytes (encoding), or
takes 1 to 4 bytes and translates that into a Unicode code point (decoding).
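
For the two-byte case, the bit shuffling is small enough to sketch by hand. The helper below is a hypothetical illustration, not how the framework implements it: it handles only code points U+0080 to U+07FF, and Main compares its result with the real Encoding.UTF8.

```csharp
using System;
using System.Text;

public static class Utf8TwoByteSketch
{
    // Encodes a code point in U+0080..U+07FF as two UTF-8 bytes.
    public static byte[] Encode(int codePoint)
    {
        if (codePoint < 0x80 || codePoint > 0x7FF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        byte lead = (byte)(0xC0 | (codePoint >> 6));    // 110xxxxx: upper 5 bits
        byte trail = (byte)(0x80 | (codePoint & 0x3F)); // 10xxxxxx: lower 6 bits
        return new[] { lead, trail };
    }

    // Reverses Encode: strips the marker bits and joins the payload bits.
    public static int Decode(byte lead, byte trail)
    {
        return ((lead & 0x1F) << 6) | (trail & 0x3F);
    }

    public static void Main()
    {
        byte[] manual = Encode(0xF1);
        byte[] framework = Encoding.UTF8.GetBytes("\u00F1");

        Console.WriteLine(BitConverter.ToString(manual));    // C3-B1
        Console.WriteLine(BitConverter.ToString(framework)); // C3-B1
        Console.WriteLine(Decode(0xC3, 0xB1).ToString("X4")); // 00F1
    }
}
```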

Unicode is like the center of a wheel (the hub), and code pages are the
spokes. Everything ultimately goes through the hub. UTF-8 and friends are
not spokes; they are more like "transport mechanisms" and are not directly
related to code pages.

From: Arne Vajhøj on 31 Mar 2010 19:47

On 31-03-2010 11:15, Tony Johansson wrote:
> This character "ñ" is represented as 241 in UTF-16.

It is a 16 bit integer with the value 241.

> The code point of ñ is U+00F1.

That may be the common notation.

> This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
> (195 177 decimal) as UTF-8.

It must be:

0x00 0xF1 for UTF-16 bytes (big endian)
0x00 0x00 0x00 0xF1 for UTF-32 bytes (big endian)
0xC3 0xB1 for UTF-8 bytes
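
These byte sequences can be checked with a short C# sketch. The class and the Hex helper are names invented for this example; the Encoding members are the standard ones. The byte order matters for UTF-16 and UTF-32: .NET's Encoding.Unicode and Encoding.UTF32 are little endian, so the big-endian variants are shown next to them.

```csharp
using System;
using System.Text;

public static class EncodingBytesDemo
{
    // Returns the encoded bytes of text as space-separated hex, e.g. "C3 B1".
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text)).Replace("-", " ");
    }

    public static void Main()
    {
        // Written as an escape so the encoding of the source file does not matter.
        string s = "\u00F1";

        Console.WriteLine("UTF-8     : " + Hex(Encoding.UTF8, s));                     // C3 B1
        Console.WriteLine("UTF-16 BE : " + Hex(Encoding.BigEndianUnicode, s));         // 00 F1
        Console.WriteLine("UTF-16 LE : " + Hex(Encoding.Unicode, s));                  // F1 00
        Console.WriteLine("UTF-32 BE : " + Hex(new UTF32Encoding(true, false), s));    // 00 00 00 F1
        Console.WriteLine("UTF-32 LE : " + Hex(Encoding.UTF32, s));                    // F1 00 00 00
    }
}
```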

> My first question.
> When UTF-8 uses 2 bytes for a character, is it then common that UTF-16 has zeros
> in the high-order byte, as in this case where 241 fits in one byte?

It is the case for characters that are also in ISO-8859-1.

So yes - it is common for western texts.

> My second question does a code page include all the Unicode standards UTF-8,
> UTF-16 and UTF-32.

CP 1200 and 1201 = UTF-16 (little and big endian) [well - actually
UCS-2, but let us ignore that difference ...]

CP 65000 = UTF-7 [nobody uses that]

CP 65001 = UTF-8

> If not,
> where is, for example, this character "ñ" defined for the different Unicode
> standards?

In CP 1252 (which is approx. ISO-8859-1) it is a single byte 241 (0xF1).

Arne

From: Tim Roberts on 2 Apr 2010 01:49

"Tony Johansson" <johansson.andersson(a)telia.com> wrote:
>
>This character "ñ" is represented as 241 in UTF-16.
>The code point of ñ is U+00F1.
>This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
>(195 177 decimal) as UTF-8.
>
>My first question:
>When UTF-8 uses 2 bytes for a character, is it then common that UTF-16 has zeros
>in the high-order byte, as in this case where 241 fits in one byte?

You can look all of this up. These are international standards, and there
are very good reasons for this design.

The lowest 128 Unicode code points (U+0000 to U+007F) map to one-byte
encodings in UTF-8. The next 1,920 code points (U+0080 to U+07FF) map to
two-byte encodings. The next 63,488 code points (U+0800 to U+FFFF) map to
three-byte encodings. Anything from U+10000 upward requires four bytes.

So, yes, the Unicode code points from U+0080 to U+00FF always take two
bytes in UTF-8.
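
Those ranges translate directly into a small C# check. This is a sketch with hypothetical names (ExpectedLength, ActualLength); it compares the range rule against what Encoding.UTF8 actually reports for a few sample code points. Surrogate values (U+D800 to U+DFFF) are not valid input to char.ConvertFromUtf32 and are deliberately left out of the samples.

```csharp
using System;
using System.Text;

public static class Utf8LengthDemo
{
    // UTF-8 length predicted from the code point ranges.
    public static int ExpectedLength(int codePoint)
    {
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    // UTF-8 length as reported by the framework encoder.
    public static int ActualLength(int codePoint)
    {
        return Encoding.UTF8.GetByteCount(char.ConvertFromUtf32(codePoint));
    }

    public static void Main()
    {
        int[] samples = { 0x41, 0x7F, 0x80, 0xF1, 0x7FF, 0x800, 0x20AC, 0xFFFF, 0x10000, 0x1F600 };
        foreach (int cp in samples)
        {
            Console.WriteLine("U+{0:X4}: expected {1}, actual {2}",
                cp, ExpectedLength(cp), ActualLength(cp));
        }
    }
}
```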
--
Tim Roberts, timr(a)probo.com
Providenza & Boekelheide, Inc.
