From: Andreas Leitgeb on
Tom Anderson <twic(a)urchin.earth.li> wrote:
>> Thomas Pornin <pornin(a)bolet.org> quoted the JLS section 3.1:
>>> << The Unicode standard was originally designed as a fixed-width 16-bit
>>> character encoding. It has since been changed to allow for characters
>>> whose representation requires more than 16 bits. The range of legal
>>> code points is now U+0000 to U+10FFFF
>> I have trouble understanding why the surrogate code points are counted
>> twice: once as isolated code points, and then again as the code points
>> that are reached by an adjacent pair of them.
> The range is a bound - all legal code points are inside it. It doesn't
> mean that all numbers inside it are legal code points. There are plenty of
> numbers which aren't mapped to any character, and so aren't legal code
> points - the surrogates are just a special case of those. I reckon.

Thanks, that was my mistake: I somehow took "range" as implying "all in
the range". A code point with no character mapped to it wasn't
necessarily illegal in my mind, but a lone surrogate was.
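
The distinction is visible from java.lang.Character (a minimal sketch;
note that isDefined's answer depends on the Unicode version the JDK
implements):

public class CodePointCheck {
    public static void main(String[] args) {
        // In range and assigned: a legal code point with a character.
        System.out.println(Character.isValidCodePoint(0x10C22)); // true
        // In range, but a surrogate: reserved for UTF-16 pairs,
        // never a character on its own.
        System.out.println(Character.isHighSurrogate('\uD800'));  // true
        // In range, possibly unassigned: isDefined reports whether a
        // character is actually mapped to this code point.
        System.out.println(Character.isDefined(0x10C22));
    }
}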

From: Roedy Green on
On 22 Dec 2009 20:47:39 GMT, Thomas Pornin <pornin(a)bolet.org> wrote,
quoted or indirectly quoted someone who said :

>E.g., if you want to have a String literal with U+10C22 (that's
>OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish),
>then you first convert 0x10C22 to a surrogate pair:
> 1. subtract 0x10000: you get 0xC22
> 2. get the upper (u) and lower (l) 10 bits; you get u=0x3 and l=0x022
> (i.e. (u << 10) + l == 0xC22)
> 3. the high surrogate is 0xD800 + u, the low surrogate is 0xDC00 + l.

That is what I was afraid of. I am doing that now to generate tables
of char entities and the equivalent hex and \u entities on various
pages of mindprod.com, e.g. http://mindprod.com/jgloss/html5.html
which shows the new HTML entities in HTML 5.

Here is my code:

// offset into the supplementary range
final int extract = theCharNumber - 0x10000;
// top 10 bits -> high surrogate, bottom 10 bits -> low surrogate
final int high = ( extract >>> 10 ) + 0xd800;
final int low = ( extract & 0x3ff ) + 0xdc00;
// emits e.g. &quot;\uD803\uDC22&quot; into the generated HTML,
// which a browser renders as "\uD803\uDC22"
sb.append( "&quot;\\u" );
sb.append( StringTools.toLzHexString( high, 4 ) );
sb.append( "\\u" );
sb.append( StringTools.toLzHexString( low, 4 ) );
sb.append( "&quot;" );


I started to think about what would be needed to make this less
onerous.

1. an applet to convert hex to a surrogate pair.

2. allow \u12345 in string literals. However, that would break
existing code: \u12345 currently means "\u1234" followed by "5"
(see the snippet after this list).

3. So you have to pick another letter, e.g. \c12345; for code point. It
needs a terminator, so that in future it could also handle \c123456;
I don't know what that might break.

4. Introduce 32-bit CodePoint string literals with extensible \u
mechanism. E.g. CString b = c"\u12345;Hello";

5. specify weird chars with named entities to make the code more
readable. Entities in String literals would be translated to binary
at compile time, so the entities would not exist at run-time. The
HTML 5 set would be greatly extended to give pretty well every Unicode
glyph a name.
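
As a quick check of point 2's premise (an illustrative snippet; \u0041
is used instead of \u1234 so the effect is easy to see):

public class EscapeDemo {
    public static void main(String[] args) {
        // Unicode escapes are processed before the string is lexed:
        // \u0041 becomes 'A', and the trailing '5' is just another
        // character, so this is the two-character string "A5".
        String s = "\u00415";
        System.out.println(s);          // A5
        System.out.println(s.length()); // 2
    }
}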

P.S. I have been poking around in HTML 5. The W3C did an odd thing: they
REDEFINED the entities &lang; and &rang; to different glyphs than they
had in HTML 4. I don't think they have ever done anything like that
before. I hope it was just an error. I have written to the W3C asking
whether they really meant to do that.



--
Roedy Green Canadian Mind Products
http://mindprod.com
If you think it's expensive to hire a professional to do the job, wait until you hire an amateur.
~ Red Adair (born: 1915-06-18 died: 2004-08-07 at age: 89)
From: Roedy Green on
On Tue, 22 Dec 2009 18:01:17 -0800, Roedy Green
<see_website(a)mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>I started to think about what would be needed to make this less
>onerous.

If you had only a few, you could create a library of named constants for
them and glue them together with compile-time concatenation. With only
a little cleverness, a compiler would avoid embedding constants it did
not use.
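
A minimal sketch of that idea (the class and constant names here are
invented):

public final class Uni {
    // Each constant is the UTF-16 surrogate pair for one code point.
    public static final String ORKHON_EM  = "\uD803\uDC22"; // U+10C22
    public static final String DOMINO_V66 = "\uD83C\uDC93"; // U+1F093

    private Uni() {}
}

Because these are compile-time constant expressions, something like
"fish: " + Uni.ORKHON_EM is folded into a single string literal in the
class that uses it, and constants you never reference are never inlined.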


Is any OS, JVM, utility, browser etc. capable of rendering a code
point above 0xffff? I get the impression all we can do is embed them
in UTF-8 files.
--
Roedy Green Canadian Mind Products
http://mindprod.com
If you think it's expensive to hire a professional to do the job, wait until you hire an amateur.
~ Red Adair (born: 1915-06-18 died: 2004-08-07 at age: 89)
From: Steven Simpson on
On 23/12/09 02:01, Roedy Green wrote:
> 3. So you have to pick another letter: e.g. \c12345; for codepoint. IT
> needs a terminator, so that in future it could also handle \c123456;
> I don't know what that might break.
>

IIRC, C99 introduced \uXXXX and \UXXXXXXXX.

--
ss at comp dot lancs dot ac dot uk

From: Thomas Pornin on
According to Roedy Green <see_website(a)mindprod.com.invalid>:
> Is any OS, JVM, utility, browser etc. capable of rendering a code
> point above 0xffff?

Oh yes, plenty.

Well, at least on my system (Linux with Ubuntu 9.10). For instance,
if I write this HTML file:

<html>
<body>
<p>&#x1F093;</p>
</body>
</html>

then both Firefox and Chromium display the "DOMINO TILE VERTICAL-06-06"
as they should. Now if I write this Java code:

public class Foo {
    public static void main(String[] args)
    {
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(0x1F093);
        System.out.println(sb.toString());
    }
}

and run it in a standard terminal (GNOME Terminal 2.28.1 on that
system), then the domino tile is displayed. If I redirect the output in
a file, I can edit it just fine with the vim text editor; the domino
tile is being handled as a single character, just like it is supposed to
be.

Internally, C programs that wish to handle the full Unicode range on
Linux use the 'wide character' type (wchar_t), which on Linux is defined
to be a 32-bit integer, so there is nothing special about the 0xFFFF
limit. In practice, Unicode display trouble usually stems from the
limited availability of fonts with exotic characters (although Linux has
a fair share of such fonts), from double-width characters in monospace
fonts, and from right-to-left scripts, all of which are orthogonal to
the 16/32-bit issue.


The same is not true on Windows, which switched to Unicode earlier, when
code points were 16-bit only; on Windows, wchar_t and "wide string
literals" use 16-bit characters, and recent versions of Windows have to
resort to UTF-16 to process the higher planes, just as Java does. I have
been told that the OS itself is perfectly capable of processing and
displaying all of the Unicode planes, but it can be expected that some
applications are not up to it yet.
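
The same UTF-16 compromise is easy to observe from Java itself; a small
sketch:

public class SupplementaryDemo {
    public static void main(String[] args) {
        String s = new StringBuilder().appendCodePoint(0x1F093).toString();
        // One code point above 0xFFFF occupies two UTF-16 code units.
        System.out.println(s.length());                          // 2
        System.out.println(s.codePointCount(0, s.length()));     // 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f093
    }
}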

C# is a latecomer (2001) but uses a 16-bit char type. This may be an
artefact of imitating Java; it may also be an attempt to ease the
conversion of C or C++ code for Windows into C#.


--Thomas Pornin