about encoding UTF-8 and UTF-16 [CSharp]


From: Jeff Johnson on 2 Apr 2010 11:40

"Tim Roberts" <timr(a)probo.com> wrote in message
news:d01br5d743h8428470dlsoaj1ngthi7j0b(a)4ax.com...

> So, yes, the Unicode code points from U+0080 to U+00FF always take two
> bytes in UTF-8.

But the "opposite" is not true! That is, just because the UTF-8 encoding
yields 2 bytes does not suggest that the UTF-16 encoding will "likely" have
0 in the MSB. If there are 1920 possible 2-byte UTF-8 sequences and only 128
of them represent U+0080 - U+00FF, then that accounts for only 6.667% of the
possible 2-byte sequences. So back to Tony's question:

>> When UTF-8 encoding is using 2 bytes is it then common that UTF-16 has
>> zeros in the highorder byte as it is in this case where 241 fits in one
>> byte ?

I would say "Don't count on it."
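
Those figures are easy to check by brute force. The following is only a rough sketch in Python, chosen for illustration; the helper names utf8_len and utf16_high_byte are made up for this example. It counts every code point whose UTF-8 form is exactly two bytes and then sees how many of them have a zero high-order byte in UTF-16.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(chr(cp).encode("utf-8"))


def utf16_high_byte(cp: int) -> int:
    """High-order byte of the UTF-16 code unit (big-endian, no BOM)."""
    return chr(cp).encode("utf-16-be")[0]


# Scan U+0000..U+07FF; only U+0080..U+07FF encode to two UTF-8 bytes.
two_byte = [cp for cp in range(0x0000, 0x0800) if utf8_len(cp) == 2]
zero_high = [cp for cp in two_byte if utf16_high_byte(cp) == 0]

share = len(zero_high) / len(two_byte)
print(len(two_byte), len(zero_high), f"{share:.3%}")  # 1920 128 6.667%

# Tony's case: U+00F1 (241) is two bytes in UTF-8, high byte 0 in UTF-16.
print(chr(0xF1).encode("utf-8").hex(), chr(0xF1).encode("utf-16-be").hex())  # c3b1 00f1

# A counterexample: U+03A9 is also two bytes in UTF-8, but its high byte is 03.
print(chr(0x3A9).encode("utf-8").hex(), chr(0x3A9).encode("utf-16-be").hex())  # cea9 03a9
```

So the zero high-order byte only shows up for U+0080 - U+00FF; the other 1792 two-byte UTF-8 code points (U+0100 - U+07FF) have a non-zero high-order byte in UTF-16.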

From: Tim Roberts on 4 Apr 2010 00:06

"Jeff Johnson" <i.get(a)enough.spam> wrote:
>
>But the "opposite" is not true! That is, just because the UTF-8 encoding
>yields 2 bytes does not suggest that the UTF-16 encoding will "likely" have
>0 in the MSB. If there are 1920 possible 2-byte UTF-8 sequences and only 128
>of them represent U+0080 - U+00FF, then that accounts for only 6.667% of the
>possible 2-byte sequences. So back to Tony's question:
>
>>> When UTF-8 encoding is using 2 bytes is it then common that UTF-16 has
>>> zeros in the highorder byte as it is in this case where 241 fits in one
>>> byte ?
>
>I would say "Don't count on it."

You're right. The question I read was not the question he really asked.
--
Tim Roberts, timr(a)probo.com
Providenza & Boekelheide, Inc.
