From: Roedy Green on 22 Dec 2009 14:01

Let's say you wanted to include some 32-bit characters in Java String literals.

I understand what the stream would look like in UTF-8 or an int[], but what I am curious about is the cleanest way to create string literals in a Java program containing such awkward characters.

--
Roedy Green Canadian Mind Products
http://mindprod.com
If you think it's expensive to hire a professional to do the job, wait until you hire an amateur. ~ Red Adair (born: 1915-06-18 died: 2004-08-07 at age: 89)
From: Peter Duniho on 22 Dec 2009 14:19

Roedy Green wrote:
> Let's say you wanted to include some 32-bit characters in Java String
> literals.
>
> I understand what the stream would look like in UTF-8 or a int[], but
> what I am curious about is the cleanest way to create string literals
> in a Java program containing such awkward characters.

The Java class java.lang.String uses UTF-16. For supplementary characters (i.e. those that require more than 16 bits), you use surrogate pairs. But each half of a pair is just a regular 16-bit code unit, so you specify it in a String literal just like you'd specify any other 16-bit literal data.

I haven't done a lot of experimentation with Java and Unicode text source files, but it's possible you can just enter the characters normally, and the compiler will handle things for you.

Otherwise, you can always use the '\uXXXX' character literal syntax to specify the characters. You'd just need a pair of such escapes to specify a single 32-bit character, using the appropriate surrogate pair rather than the raw 32-bit value split in half.

For more detail, see:
http://java.sun.com/javase/6/docs/api/java/lang/Character.html#unicode
http://java.sun.com/javase/6/docs/api/java/lang/String.html

Pete
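A minimal sketch of the surrogate-pair approach Pete describes, assuming U+1D11E (MUSICAL SYMBOL G CLEF) as the example character; the class and constant names are hypothetical. It shows that the literal occupies two 16-bit code units in the String but counts as a single code point.

```java
final class SupplementaryLiteral {
    // Assumed example character: U+1D11E MUSICAL SYMBOL G CLEF,
    // written as its surrogate pair D834 DD1E.
    static final String CLEF = "\uD834\uDD1E";

    public static void main(String[] args) {
        // Number of 16-bit code units in the String.
        System.out.println(CLEF.length());                         // prints 2
        // Number of Unicode code points in the String.
        System.out.println(CLEF.codePointCount(0, CLEF.length())); // prints 1
        // The code point that the pair decodes to.
        System.out.println(Integer.toHexString(CLEF.codePointAt(0))); // prints 1d11e
    }
}
```

Methods such as length() and charAt() work on code units, so code that has to treat supplementary characters as single units would normally go through the codePoint* methods instead.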
From: Thomas Pornin on 22 Dec 2009 15:47

According to Roedy Green <see_website(a)mindprod.com.invalid>:
> Let's say you wanted to include some 32-bit characters in Java String
> literals.

Technically, Unicode code points fit in the 0..0x10FFFF range, using 21 bits at most. The JLS says this in section 3.1:

<< The Unicode standard was originally designed as a fixed-width 16-bit character encoding. It has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, using the hexadecimal U+n notation. Characters whose code points are greater than U+FFFF are called supplementary characters. To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range, (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding. A few APIs, primarily in the Character class, use 32-bit integers to represent code points as individual entities. The Java platform provides methods to convert between the two representations. >>

So basically, if you want to represent in the source code (e.g. in a String literal) a code point beyond the first plane, then you use a pair of \uxxxx sequences, for the two surrogates. E.g., if you want to have a String literal with U+10C22 (that's OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish), then you first convert 0x10C22 to a surrogate pair:

1. Subtract 0x10000: you get 0xC22.
2. Take the upper (u) and lower (l) 10 bits: you get u=0x3 and l=0x022 (i.e. (u << 10) + l == 0xC22).
3. The high surrogate is 0xD800 + u; the low surrogate is 0xDC00 + l.

Therefore you get this:

    public class Foo {
        public static final String BAR = "an old Turkic letter: \uD803\uDC22";
    }

Note that this is an ASCII-compatible representation of a Java source file, which conceptually consists of a sequence of 16-bit code units.

Now, with javac from Sun's JDK 1.6.0_16, I can use a UTF-8 representation of the source code. This allows me to use old Turkic letters directly. For instance, the encoding of the source Foo.java could look like this:

    00000000  70 75 62 6c 69 63 20 63  6c 61 73 73 20 46 6f 6f  |public class Foo|
    00000010  20 7b 0a 09 70 75 62 6c  69 63 20 73 74 61 74 69  | {..public stati|
    00000020  63 20 66 69 6e 61 6c 20  53 74 72 69 6e 67 20 42  |c final String B|
    00000030  41 52 20 3d 20 22 61 6e  20 6f 6c 64 20 54 75 72  |AR = "an old Tur|
    00000040  6b 69 63 20 6c 65 74 74  65 72 3a 20 f0 90 b0 a2  |kic letter: ....|
    00000050  22 3b 0a 7d 0a                                    |";.}.|

We see that in the source code, the "f0 90 b0 a2" UTF-8 sequence (the four-byte encoding of U+10C22) was used. Javac accepts this, and it yields the same Foo.class as before. I still recommend using the two \uxxxx sequences detailed above, for maximum portability (ASCII works everywhere and is resilient to the various abuse suffered by text in emails or Usenet messages).

You may want to look at the resulting .class file. In classfiles, a "modified UTF-8" format is used for String literals, in which surrogates are encoded separately. Thus, regardless of how I gave the old Turkic letter to the Java compiler, the .class file will contain the 6-byte sequence "ed a0 83 ed b0 a2" (UTF-8 encoding of U+D803, then UTF-8 encoding of U+DC22).

--Thomas Pornin
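The conversion described in this post can be sketched in Java as follows. This is only an illustration: the class and helper names (SurrogateCheck, highSurrogate, lowSurrogate, hex, modifiedUtf8) are hypothetical, and it assumes that DataOutputStream.writeUTF, which writes a two-byte length followed by modified UTF-8, produces the same bytes for this string as the class-file encoding does.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SurrogateCheck {
    // Subtract 0x10000, then add the upper 10 bits to 0xD800.
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Subtract 0x10000, then add the lower 10 bits to 0xDC00.
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Modified UTF-8 as written by writeUTF, minus its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int cp = 0x10C22; // OLD TURKIC LETTER ORKHON EM
        char hi = highSurrogate(cp);
        char lo = lowSurrogate(cp);
        String s = new String(new char[] { hi, lo });

        System.out.println(Integer.toHexString(hi) + " " + Integer.toHexString(lo)); // d803 dc22
        // Cross-check against the library's own conversion.
        System.out.println(s.equals(new String(Character.toChars(cp))));             // true
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));                 // f0 90 b0 a2
        System.out.println(hex(modifiedUtf8(s)));                                    // ed a0 83 ed b0 a2
    }
}
```

In ordinary code, Character.toChars does the surrogate arithmetic for you; the hand-written helpers are there only to mirror the three steps.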
From: Mayeul on 23 Dec 2009 11:09

Andreas Leitgeb wrote:
> Thomas Pornin <pornin(a)bolet.org> quoted the JLS section 3.1:
>> << The Unicode standard was originally designed as a fixed-width 16-bit
>> character encoding. It has since been changed to allow for characters
>> whose representation requires more than 16 bits. The range of legal
>> code points is now U+0000 to U+10FFFF
>
> I have problems understanding why the surrogate code points are counted
> twice: once as their code points isolated and then again as the code-points
> that are reached by an adjacent pair of them.

It makes defining UTF-16 easy and less error-prone. Yet I guess the range of legal code points is still U+0000 to U+10FFFF, excluding the surrogates range in the middle.

--
Mayeul
From: Tom Anderson on 23 Dec 2009 11:17

On Wed, 23 Dec 2009, Andreas Leitgeb wrote:
> Thomas Pornin <pornin(a)bolet.org> quoted the JLS section 3.1:
>> << The Unicode standard was originally designed as a fixed-width 16-bit
>> character encoding. It has since been changed to allow for characters
>> whose representation requires more than 16 bits. The range of legal
>> code points is now U+0000 to U+10FFFF
>
> I have problems understanding why the surrogate code points are counted
> twice: once as their code points isolated and then again as the code-points
> that are reached by an adjacent pair of them.

The range is a bound - all legal code points are inside it. It doesn't mean that all numbers inside it are legal code points. There are plenty of numbers which aren't mapped to any character, and so aren't legal code points - the surrogates are just a special case of those. I reckon.

tom

--
X is for ... EXECUTION!
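For what it is worth, the Java API itself treats the range purely as a bound: Character.isValidCodePoint only checks that a value lies in 0..0x10FFFF, so it accepts surrogate values too. A small sketch of the distinction discussed here, where the class name and the isNonSurrogate helper are hypothetical and written just for this illustration:

```java
final class CodePointRange {
    // Hypothetical helper: inside the 0..0x10FFFF bound and not in the
    // surrogate block U+D800..U+DFFF.
    static boolean isNonSurrogate(int cp) {
        return Character.isValidCodePoint(cp)
                && !(cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE);
    }

    public static void main(String[] args) {
        System.out.println(Character.isValidCodePoint(0xD800));   // true: inside the bound
        System.out.println(isNonSurrogate(0xD800));               // false: a surrogate value
        System.out.println(isNonSurrogate(0x10C22));              // true
        System.out.println(Character.isValidCodePoint(0x110000)); // false: outside the bound
    }
}
```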