From: Roedy Green on
Let's say you wanted to include some 32-bit characters in Java String
literals.

I understand what the stream would look like in UTF-8 or an int[], but
what I am curious about is the cleanest way to create string literals
in a Java program containing such awkward characters.
--
Roedy Green Canadian Mind Products
http://mindprod.com
If you think it's expensive to hire a professional to do the job, wait until you hire an amateur.
~ Red Adair (born: 1915-06-18 died: 2004-08-07 at age: 89)
From: Peter Duniho on
Roedy Green wrote:
> Let's say you wanted to include some 32-bit characters in Java String
> literals.
>
> I understand what the stream would look like in UTF-8 or an int[], but
> what I am curious about is the cleanest way to create string literals
> in a Java program containing such awkward characters.

The Java class java.lang.String uses UTF-16. For supplementary
characters (i.e. those that require more than 16 bits), you use
surrogate pairs. Each half of such a pair is just a regular 16-bit
code unit, so you specify them in a String literal just like you'd
specify any other 16-bit character data.

I haven't done a lot of experimentation with Java and Unicode text
source files, but it's possible you can just enter the characters
normally, and the compiler will handle things for you. Otherwise, for
sure you can always use the '\uXXXX' character literal syntax to specify
the characters. You'd just need a pair of such characters to specify a
single 32-bit character, using the appropriate surrogate pair rather
than the raw 32-bit character split in half.
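
For example, something like this should work (a minimal sketch; U+1D11E,
MUSICAL SYMBOL G CLEF, corresponds to the surrogate pair \uD834\uDD1E):

public class ClefDemo {
	public static void main(String[] args) {
		// Two 16-bit escapes together encode one supplementary character.
		String clef = "a clef: \uD834\uDD1E";
		System.out.println(clef.length());                          // 10 code units
		System.out.println(clef.codePointCount(0, clef.length()));  // 9 code points
	}
}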

For more detail, see:
http://java.sun.com/javase/6/docs/api/java/lang/Character.html#unicode
http://java.sun.com/javase/6/docs/api/java/lang/String.html

Pete
From: Thomas Pornin on
According to Roedy Green <see_website(a)mindprod.com.invalid>:
> Let's say you wanted to include some 32-bit characters in Java String
> literals.

Technically Unicode code points fit in the 0..0x10FFFF range, using
21 bits at most.

The JLS says this in section 3.1:

<< The Unicode standard was originally designed as a fixed-width 16-bit
character encoding. It has since been changed to allow for characters
whose representation requires more than 16 bits. The range of legal
code points is now U+0000 to U+10FFFF, using the hexadecimal U+n
notation. Characters whose code points are greater than U+FFFF are
called supplementary characters. To represent the complete range of
characters using only 16-bit units, the Unicode standard defines an
encoding called UTF-16. In this encoding, supplementary characters
are represented as pairs of 16-bit code units, the first from the
high-surrogates range, (U+D800 to U+DBFF), the second from the
low-surrogates range (U+DC00 to U+DFFF). For characters in the range
U+0000 to U+FFFF, the values of code points and UTF-16 code units are
the same.

The Java programming language represents text in sequences of 16-bit
code units, using the UTF-16 encoding. A few APIs, primarily in the
Character class, use 32-bit integers to represent code points as
individual entities. The Java platform provides methods to convert
between the two representations. >>
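
In practice, those conversion methods look like this (a minimal sketch,
using APIs available since Java 5):

public class CodePointDemo {
	public static void main(String[] args) {
		int cp = 0x10400;  // DESERET CAPITAL LETTER LONG I
		// 32-bit code point -> 16-bit code units (a surrogate pair here).
		char[] units = Character.toChars(cp);          // { 0xD801, 0xDC00 }
		String s = new String(units);
		// 16-bit code units -> 32-bit code point.
		System.out.printf("U+%X%n", s.codePointAt(0)); // prints U+10400
	}
}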

So basically, if you want to represent a code point beyond the first
plane in the source code (e.g. in a String literal), then you use a
pair of \uxxxx escape sequences, one for each surrogate.

E.g., if you want to have a String literal with U+10C22 (that's
OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish),
then you first convert 0x10C22 to a surrogate pair:
1. subtract 0x10000: you get 0xC22
2. get the upper (u) and lower (l) 10 bits; you get u=0x3 and l=0x022
(i.e. (u << 10) + l == 0xC22)
3. the high surrogate is 0xD800 + u, the low surrogate is 0xDC00 + l.

Therefore you get this:

public class Foo {
	public static final String BAR = "an old Turkic letter: \uD803\uDC22";
}

Note that this is an ASCII-compatible representation of a Java source
file, which conceptually consists of a sequence of 16-bit code units.
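
As a sanity check, the arithmetic in steps 1-3 can be compared against
what java.lang.Character computes (a minimal sketch; Character.toChars
has existed since Java 5):

public class SurrogateCheck {
	public static void main(String[] args) {
		int cp = 0x10C22;
		int v = cp - 0x10000;                      // step 1: 0xC22
		char high = (char) (0xD800 + (v >> 10));   // step 3: 0xD803
		char low = (char) (0xDC00 + (v & 0x3FF));  // step 3: 0xDC22
		char[] expected = Character.toChars(cp);
		System.out.println(high == expected[0] && low == expected[1]); // true
	}
}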


Now, with javac from Sun's JDK 1.6.0_16, I can use a UTF-8 representation
of the source code. This allows me to use old Turkic letters directly.
For instance, the encoding of the source Foo.java could look like
this:

00000000 70 75 62 6c 69 63 20 63 6c 61 73 73 20 46 6f 6f |public class Foo|
00000010 20 7b 0a 09 70 75 62 6c 69 63 20 73 74 61 74 69 | {..public stati|
00000020 63 20 66 69 6e 61 6c 20 53 74 72 69 6e 67 20 42 |c final String B|
00000030 41 52 20 3d 20 22 61 6e 20 6f 6c 64 20 54 75 72 |AR = "an old Tur|
00000040 6b 69 63 20 6c 65 74 74 65 72 3a 20 f0 90 b0 a2 |kic letter: ....|
00000050 22 3b 0a 7d 0a |";.}.|

We see that in the source code, the "f0 90 b0 a2" UTF-8 sequence was
used. Javac accepts this, and it yields the same Foo.class as
previously. I still recommend using the two \uxxxx sequences detailed
above, for maximum portability (ASCII works everywhere and is resilient
to the various abuses suffered by text in emails or Usenet messages).
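
(If you do take the UTF-8 route, it is safer to state the source
encoding explicitly instead of relying on the platform default, e.g.
javac -encoding UTF-8 Foo.java; -encoding is a standard javac option.)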


You may want to look at the resulting .class file. In the classfiles,
a "modified UTF-8" format is used for String literals, in which
surrogates are encoded separately. Thus, regardless of how I gave the
old Turkic letter to the Java compiler, the .class file will contain
the 6-byte sequence "ed a0 83 ed b0 a2" (UTF-8 encoding of U+D803,
then UTF-8 encoding of U+DC22).
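
Incidentally, DataOutputStream.writeUTF uses the same "modified UTF-8"
as the classfile format, so those bytes can be observed without a hex
editor (a minimal sketch; note that writeUTF prepends a two-byte length):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
	public static void main(String[] args) throws IOException {
		ByteArrayOutputStream buf = new ByteArrayOutputStream();
		DataOutputStream out = new DataOutputStream(buf);
		out.writeUTF("\uD803\uDC22");          // U+10C22 as a surrogate pair
		byte[] b = buf.toByteArray();
		for (int i = 2; i < b.length; i++) {   // skip the 2-byte length prefix
			System.out.printf("%02x ", b[i] & 0xFF);
		}
		System.out.println();                  // prints: ed a0 83 ed b0 a2
	}
}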


--Thomas Pornin
From: Mayeul on
Andreas Leitgeb wrote:
> Thomas Pornin <pornin(a)bolet.org> quoted the JLS section 3.1:
>> << The Unicode standard was originally designed as a fixed-width 16-bit
>> character encoding. It has since been changed to allow for characters
>> whose representation requires more than 16 bits. The range of legal
>> code points is now U+0000 to U+10FFFF
>
> I have problems understanding why the surrogate code points are counted
> twice: once as isolated code points, and then again as the code points
> that are reached by an adjacent pair of them.

It makes defining UTF-16 easy and less error-prone.

Yet I guess the range of legal code points is still U+0000 to
U+10FFFF, excluding the surrogate range in the middle.

--
Mayeul
From: Tom Anderson on
On Wed, 23 Dec 2009, Andreas Leitgeb wrote:

> Thomas Pornin <pornin(a)bolet.org> quoted the JLS section 3.1:
>> << The Unicode standard was originally designed as a fixed-width 16-bit
>> character encoding. It has since been changed to allow for characters
>> whose representation requires more than 16 bits. The range of legal
>> code points is now U+0000 to U+10FFFF
>
> I have problems understanding why the surrogate code points are counted
> twice: once as isolated code points, and then again as the code points
> that are reached by an adjacent pair of them.

The range is a bound - all legal code points are inside it. It doesn't
mean that all numbers inside it are legal code points. There are plenty of
numbers which aren't mapped to any character, and so aren't legal code
points - the surrogates are just a special case of those. I reckon.

tom

--
X is for ... EXECUTION!