From: Tony Johansson on 31 Mar 2010 11:15

Hi!

The character "ñ" is represented as 241 in UTF-16.
The code point of ñ is U+00F1.
This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1 (195 177 decimal) in UTF-8.

My first question: when UTF-8 uses 2 bytes to encode a character, is it common for UTF-16 to have zeros in the high-order byte, as in this case where 241 fits in one byte?

My second question: does a code page include all the Unicode standards UTF-8, UTF-16 and UTF-32? If not, where is a character such as "ñ" defined for the different Unicode standards?

//Tony
From: Peter Duniho on 31 Mar 2010 11:30

Tony Johansson wrote:
> Hi!
>
> This character "ñ" is represented as 241 in UTF-16.
> The code point of ñ is U+00F1
> This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
> (195 177 decimal) as UTF-8.
>
> My first question.
> When UTF-8 encoding is using 2 bytes is it then common that UTF-16 has zeros
> in the highorder byte as it is in this case where 241 fits in one byte ?

Define "common". But absent a specific definition from you, I'd say "no". UTF-8 can represent far more than 256 different characters using two bytes (1,920 of them), while 256 is the absolute theoretical maximum that UTF-16 could represent with a value of the form 00xx, where "x" is a hexadecimal digit.

> My second question does a code page include all the Unicode standards UTF-8,
> UTF-16 and UTF-32.if not
> where are for example this character "ñ" defined for the different Unicode
> standards ?

In general, Unicode is a superset of each of the various code pages. So, no: a given code page is not going to be able to include all of the Unicode characters.

Finally, note, as has been mentioned before: UTF-8, -16, and -32 are _encodings_ for Unicode, while Unicode is the character set. Each of the encodings can represent all of the characters in Unicode, and the code point within Unicode for any given character is always the same. Only the value in a specific encoding changes, and you can find ALL of this information on the http://www.unicode.org/ web site (including, for example, code point and encoding values for a given character, such as 'ñ').

Pete
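The counting argument in the reply above can be checked with a short Python sketch. It is an illustration only, not part of the original thread: the variable names are made up here, and big-endian UTF-16 is assumed so that the high-order byte comes first.

```python
# Find every BMP code point that needs exactly two bytes in UTF-8,
# then see how many of those have a zero high-order byte in UTF-16.
two_byte = [
    cp for cp in range(0x10000)
    if not 0xD800 <= cp <= 0xDFFF              # surrogates cannot be encoded
    and len(chr(cp).encode("utf-8")) == 2
]

# "utf-16-be" puts the high-order byte first and writes no byte order mark.
zero_high = [cp for cp in two_byte if chr(cp).encode("utf-16-be")[0] == 0x00]

print(hex(two_byte[0]), hex(two_byte[-1]))     # range covered by two-byte UTF-8
print(len(two_byte), len(zero_high))           # 1920 128
```

Only the 128 code points U+0080 to U+00FF combine a two-byte UTF-8 form with a zero high-order byte in UTF-16, so the pattern holds for a small minority of the two-byte range.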
From: Jeff Johnson on 31 Mar 2010 12:01

"Tony Johansson" <johansson.andersson(a)telia.com> wrote in message news:%23hQhuTO0KHA.260(a)TK2MSFTNGP05.phx.gbl...

> This character "ñ" is represented as 241 in UTF-16.
> The code point of ñ is U+00F1
> This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
> (195 177 decimal) as UTF-8.
>
> My first question.
> When UTF-8 encoding is using 2 bytes is it then common that UTF-16 has
> zeros
> in the highorder byte as it is in this case where 241 fits in one byte ?

Honestly Tony, in this case, who cares? I realize that you're trying to learn things but I think you need to pick and choose what you want to dive deeply into, and in my opinion the internal workings of UTF-8 and UTF-16 shouldn't concern you. UTF-8 is a middleman. It exists to bridge the gap between single-byte code pages and the new, global world of Unicode. Data stored in UTF-8 is almost always translated into something else (like .NET translates everything to UTF-16) so you should really only know how to USE UTF-8 without worrying about its guts. (Unless you're trying to write your own UTF-8 encoder/decoder, of course.)

> My second question does a code page include all the Unicode standards
> UTF-8,
> UTF-16 and UTF-32.if not
> where are for example this character "ñ" defined for the different Unicode
> standards ?

Code pages do not "include Unicode standards." Let me see if I can come up with a good analogy.

If you know anything about bitmaps, you know that there are indexed bitmaps and true-color bitmaps. An indexed bitmap is like a paint-by-number set (assuming you're old enough to remember those things and they existed where you grew up). You have a limited supply of colors and each color is mapped to a number (the index). Perhaps 0 = Red, 1 = White, 2 = Purple, etc. You cannot use any color outside the range of your given color palette.
Let's say this color palette has 256 entries for this example, and therefore each index value fits nicely into one byte. You define your bitmap by specifying a bunch of indexes (bytes) that indicate which color is to be applied to each pixel. So your bitmap data might contain 0 0 0 2 2 1, meaning three pixels of red, two pixels of purple, and one pixel of white. Six pixels, six bytes. Moderately compact.

On the other hand, in a true color image each pixel is represented by three (or four, if you want transparency) bytes. There is no color palette because those bytes can represent any color. So now your six pixels look like this: 0xFF0000 0xFF0000 0xFF0000 0xFF00FF 0xFF00FF 0xFFFFFF. Six pixels, 18 bytes. Big, but flexible as far as colors go.

A code page is like an indexed image. Single-byte code pages contain 256 "slots," each of which can represent a character (a glyph). Each code page has a table somewhere which tells it how to map each index (0 - 255) to a specific Unicode character (called a "code point"). Unicode itself is like the entire color spectrum (or at least it's pretty close).

The Windows-1252 code page (Latin 1 or something like that) maps 65 -> A (U+0041), 34 -> " (U+0022), 42 -> * (U+002A), and so on. Many other code pages have similar mappings for indexes 0 - 127, but when you get to 128 - 255 you tend to see more variation. For example, and I'm totally making this up, a Russian code page might map 165 to U+0427 whereas a Spanish code page might map it to your ñ, U+00F1.

UTF-8, on the other hand, is not a mapping but rather an encoding, which takes a Unicode code point and stores it in 1 to 4 bytes (encoding), or takes 1 to 4 bytes and translates that into a Unicode code point (decoding).

Unicode is like the center of a wheel (the hub), and code pages are the spokes. Everything ultimately goes through the hub. UTF-8 and friends are not spokes; they are more like "transport mechanisms" and are not directly related to code pages.
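The lookup-table versus encoding distinction can be tried out with Python's built-in codecs. This is a minimal sketch, not something from the thread: the choice of cp1252, cp1251 (a Cyrillic code page) and cp437 (the original IBM PC code page) is an assumption made for illustration, as are the variable names.

```python
import unicodedata

raw = b"\xf1"  # index 241 in a single-byte code page

# A code page is a lookup table: the same index yields a different
# code point depending on which table is consulted.
decoded = {name: raw.decode(name) for name in ("cp1252", "cp1251", "cp437")}
for name, ch in decoded.items():
    print(f"{name}: 0xF1 -> U+{ord(ch):04X} {unicodedata.name(ch)}")

# Converting between code pages goes through Unicode, the "hub".
hub = raw.decode("cp1252")       # cp1252 byte 0xF1 -> U+00F1
in_cp437 = hub.encode("cp437")   # U+00F1 -> cp437 byte 0xA4

# UTF-8 is computed from the code point rather than looked up.
# Two-byte form: 110xxxxx 10xxxxxx, valid for U+0080..U+07FF.
code_point = ord(hub)
manual_utf8 = bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
print(in_cp437.hex(), manual_utf8.hex(" "))  # a4 c3 b1
```

Byte 0xF1 comes out as ñ under cp1252, as a Cyrillic letter under cp1251 and as the plus-minus sign under cp437, while the UTF-8 bytes C3 B1 follow from bit arithmetic on U+00F1 with no table involved.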
From: Arne Vajhøj on 31 Mar 2010 19:47

On 31-03-2010 11:15, Tony Johansson wrote:
> This character "ñ" is represented as 241 in UTF-16.

It is a 16 bit integer with the value 241.

> The code point of ñ is U+00F1

That may be the common notation.

> This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
> (195 177 decimal) as UTF-8.

It must be:

0x00 0xF1 for UTF-16 bytes (big endian)
0x00 0x00 0x00 0xF1 for UTF-32 bytes (big endian)
0xC3 0xB1 for UTF-8 bytes

> My first question.
> When UTF-8 encoding is using 2 bytes is it then common that UTF-16 has zeros
> in the highorder byte as it is in this case where 241 fits in one byte ?

It is the case for characters that are also in ISO-8859-1.

So yes - it is common for western texts.

> My second question does a code page include all the Unicode standards UTF-8,
> UTF-16 and UTF-32.

CP 1200 and 1201 = UTF-16 (little and big endian) [well - actually UCS-2, but let us ignore that difference ...]
CP 65000 = UTF-7 [nobody uses that]
CP 65001 = UTF-8

> if not
> where are for example this character "ñ" defined for the different Unicode
> standards ?

In CP 1252 (which is approx. ISO-8859-1) it is a single byte 241 (0xF1).

Arne
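The byte layouts listed in this post can be reproduced with a few lines of Python, shown here as a sketch rather than anything the posters ran. The codec names are Python's own; the `-be` and `-le` suffixes fix the byte order and suppress the byte order mark, and `layout` is just an illustrative name.

```python
ch = "\u00f1"  # ñ, code point U+00F1

encodings = ["utf-8", "utf-16-be", "utf-16-le",
             "utf-32-be", "utf-32-le", "cp1252"]

# Map each encoding name to the bytes it produces, as spaced hex.
layout = {name: ch.encode(name).hex(" ") for name in encodings}

for name, hex_bytes in layout.items():
    print(f"{name:10} {hex_bytes}")
```

The big-endian rows match the listing above (00 F1 and 00 00 00 F1). The little-endian rows reverse the bytes (F1 00 and F1 00 00 00), which is the order code page 1200 uses.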
From: Tim Roberts on 2 Apr 2010 01:49

"Tony Johansson" <johansson.andersson(a)telia.com> wrote:
>
>This character "ñ" is represented as 241 in UTF-16.
>The code point of ñ is U+00F1
>This is 0xF1 (or 241 decimal) in UTF-16 or UTF-32, and C3 B1
>(195 177 decimal) as UTF-8.
>
>My first question.
>When UTF-8 encoding is using 2 bytes is it then common that UTF-16 has zeros
>in the highorder byte as it is in this case where 241 fits in one byte ?

You can look all of this up. These are international standards, and there are very good reasons for this design.

The lowest 128 Unicode code points (U+0000 to U+007F) map to one-byte encodings in UTF-8. The next 1,920 code points (U+0080 to U+07FF) map to two-byte encodings. The next 63,488 code points (U+0800 to U+FFFF) map to three-byte encodings. Anything from U+10000 upward requires four bytes.

So, yes, the Unicode code points from U+0080 to U+00FF always take two bytes in UTF-8.
--
Tim Roberts, timr(a)probo.com
Providenza & Boekelheide, Inc.
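The range boundaries quoted in this post can be turned into a small function and checked against Python's own encoder. This is a sketch under stated assumptions: `utf8_length` and `range_sizes` are hypothetical names introduced here, and the three-byte count includes the 2,048 surrogate code points (U+D800 to U+DFFF), which fall in that range but are never actually encoded.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for the given code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond the Unicode range")

# Number of code points in each length class, from the boundaries alone.
range_sizes = {
    1: 0x80,
    2: 0x800 - 0x80,
    3: 0x10000 - 0x800,
    4: 0x110000 - 0x10000,
}
print(range_sizes)  # {1: 128, 2: 1920, 3: 63488, 4: 1048576}

# Cross-check a few samples against the built-in encoder.
for sample in ("A", "\u00f1", "\u20ac", "\U0001F600"):
    assert utf8_length(ord(sample)) == len(sample.encode("utf-8"))
```

The first three counts agree with the figures in the post, and U+00F1 lands in the two-byte class as stated.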