Why use an encoding other than UTF-8 when it supports almost every language? [CSharp]

From: Tony Johansson on 25 Mar 2010 11:05

Hi!

The Unicode encoding UTF-8 can use up to 24 bits per character. UTF-8 supports
almost all languages, so what is the reason
to use a Unicode encoding other than UTF-8?

//Tony

From: Tony Johansson on 25 Mar 2010 11:09

I must correct myself: UTF-8 can use up to 48 bits.

//Tony
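
For reference, the byte counts can be checked directly in C#. As far as I know, the current UTF-8 definition (RFC 3629) uses 1 to 4 bytes per code point, and the 6-byte (48-bit) form belongs to the older design. The sketch below is only an illustration: the class and method names (Utf8LengthDemo, Utf8ByteCount) are made up for this example, while Encoding.UTF8.GetByteCount and char.ConvertFromUtf32 are the standard .NET calls.

```csharp
using System;
using System.Text;

static class Utf8LengthDemo
{
    // Hypothetical helper: number of bytes the .NET UTF-8 encoder
    // emits for a single Unicode code point.
    public static int Utf8ByteCount(int codePoint)
    {
        // ConvertFromUtf32 yields one char, or a surrogate pair above U+FFFF.
        return Encoding.UTF8.GetByteCount(char.ConvertFromUtf32(codePoint));
    }

    static void Main()
    {
        Console.WriteLine(Utf8ByteCount(0x41));     // U+0041 (Latin A)       -> 1
        Console.WriteLine(Utf8ByteCount(0xE9));     // U+00E9 (e with acute)  -> 2
        Console.WriteLine(Utf8ByteCount(0x20AC));   // U+20AC (euro sign)     -> 3
        Console.WriteLine(Utf8ByteCount(0x10FFFF)); // highest code point     -> 4
    }
}
```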

From: Maate on 25 Mar 2010 12:33

Hey, I'm not sure, but I would guess that UTF-8 is slightly more
expensive to parse than other unicode encodings. For example, when
reading UTF-16 encoded text the parser would know that it has to read
exactly two bytes per character. On the other hand, if UTF-8 encoded,
the number of bytes to read per character will depend on the
information stored in individual bits. Consider a simple example: the C#
expression "my test string".Substring(5, 1) is easy to evaluate in UTF-16,
but with UTF-8 the parser would have to decode every character from the
beginning in order to determine which bytes actually represent the
character at index 5, perhaps making it at least 5 times as expensive.
This probably also explains why, for example, the .NET CLR stores text as
UTF-16 internally: it probably makes it easier (and faster) to manipulate
and search text.
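
To make the scanning cost concrete, here is a rough sketch of what locating "character number 5" in a UTF-8 buffer involves. It is not how any particular library does it; the names Utf8IndexDemo and ByteOffsetOfCodePoint are invented for this example, and the sample string with one two-byte character is my own assumption. The only fact it relies on is that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```csharp
using System;
using System.Text;

static class Utf8IndexDemo
{
    // Hypothetical helper: byte offset at which code point number `index`
    // (0-based) starts in a UTF-8 buffer. It has to walk from the start,
    // so the cost grows with the index.
    public static int ByteOffsetOfCodePoint(byte[] utf8, int index)
    {
        int seen = 0;
        for (int i = 0; i < utf8.Length; i++)
        {
            // Continuation bytes look like 10xxxxxx; any other byte
            // starts a new code point.
            bool startsCodePoint = (utf8[i] & 0xC0) != 0x80;
            if (startsCodePoint)
            {
                if (seen == index) return i;
                seen++;
            }
        }
        throw new ArgumentOutOfRangeException("index");
    }

    static void Main()
    {
        // Assumed sample: the thread's string with U+00EB in place of 'e',
        // so that one character needs two bytes in UTF-8.
        string text = "my t\u00EBst string";
        byte[] utf8 = Encoding.UTF8.GetBytes(text);

        Console.WriteLine(text.Substring(5, 1));           // s  (direct index into UTF-16 code units)
        Console.WriteLine(ByteOffsetOfCodePoint(utf8, 5)); // 6  (found only after scanning bytes 0..6)
    }
}
```

For pure ASCII text the byte offset equals the character index, but the code cannot know that without looking at every byte before it.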

Anyway, just some thoughts :-)

Br. Morten

From: Chris Dunaway on 25 Mar 2010 12:58

http://www.joelonsoftware.com/articles/Unicode.html

From: Konrad Neitzel on 25 Mar 2010 15:12

Hi all!

"Maate" <maate(a)retkomma.dk> wrote in message
news:cb98f95e-6f15-45c7-bc05-44e0b96f922d(a)e7g2000yqf.googlegroups.com...
> Hey, I'm not sure, but I would guess that UTF-8 is slightly more
> expensive to parse than other unicode encodings.
Why is that? UTF-16 is not fixed at 2 bytes per character either. It can use
more bytes per character if required (one reason why there is also a UTF-32).

> For example, when
> reading UTF-16 encoded text the parser would know that it has to read
> exactly two bytes per character. On the other hand, if UTF-8 encoded,
> the number of bytes to read per character will depend on the
> information stored in individual bits.

And yes, that can be the important point. Whenever you want random access
to characters without parsing every character up to the one you want to
read, you must be careful that you really know how many bytes each
character has.

UTF-16 is not fixed at 2 bytes! That is a common mistake. Characters outside
the Basic Multilingual Plane take 4 bytes (a surrogate pair). If you want a
fixed 2-byte encoding, UCS-2 could be chosen, but then you do not support all
the characters that UTF-16 supports!
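
A small C# sketch of the point above, using a character outside the Basic Multilingual Plane. The class and method names (Utf16SurrogateDemo, CountCodePoints) are made up for this illustration, and U+1D11E is just a sample character I picked; the char and Encoding calls are the standard .NET ones.

```csharp
using System;
using System.Text;

static class Utf16SurrogateDemo
{
    // Hypothetical helper: counts Unicode code points in a .NET string,
    // treating a high/low surrogate pair as one character.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                i++; // skip the low half of the pair
            }
            count++;
        }
        return count;
    }

    static void Main()
    {
        string clef = "\U0001D11E"; // U+1D11E MUSICAL SYMBOL G CLEF

        Console.WriteLine(clef.Length);                         // 2 (UTF-16 code units)
        Console.WriteLine(CountCodePoints(clef));               // 1 (actual character)
        Console.WriteLine(Encoding.Unicode.GetByteCount(clef)); // 4 (UTF-16 bytes)
        Console.WriteLine(Encoding.UTF32.GetByteCount(clef));   // 4 (UTF-32 bytes)
    }
}
```

So string.Length and the string indexer work on 16-bit code units, not on characters; only UTF-32 gives one fixed-size unit per code point.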

More details can be found at:
http://en.wikipedia.org/wiki/UTF
http://en.wikipedia.org/wiki/UTF-16
http://en.wikipedia.org/wiki/UTF-32

Konrad
