From: Andreas Leitgeb on 23 Dec 2009 09:31

Thomas Pornin <pornin(a)bolet.org> quoted JLS section 3.1:
> << The Unicode standard was originally designed as a fixed-width 16-bit
> character encoding. It has since been changed to allow for characters
> whose representation requires more than 16 bits. The range of legal
> code points is now U+0000 to U+10FFFF >>

I have problems understanding why the surrogate code points are counted twice: once as isolated code points, and then again as the code points that are reached by an adjacent pair of them. In my understanding that would leave 0x10F7FF really legal code points, as the surrogates wouldn't be legal as single code points, but only as pairs.

But then again, perhaps my own understanding of "legal code points" just differs from some common definition.
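[Editorial note: the distinction the poster is asking about is visible in Java's own Character API. The following is a small illustrative sketch, not part of the original thread: surrogate values pass the plain range check (U+0000..U+10FFFF) but are classified as SURROGATE rather than as characters in their own right, while a supplementary code point maps to a pair of chars.]

    // Illustrative sketch only: how Java 5's Character API treats surrogate
    // values versus supplementary code points.
    public class CodePointDemo {
        public static void main(String[] args) {
            int surrogate = 0xD800;       // a high-surrogate value
            int supplementary = 0x10400;  // a code point above U+FFFF

            // Both fall inside U+0000..U+10FFFF, so both pass the range check.
            System.out.println(Character.isValidCodePoint(surrogate));      // true
            System.out.println(Character.isValidCodePoint(supplementary));  // true

            // The surrogate belongs to the SURROGATE category: it is reserved
            // for UTF-16 encoding and does not denote a character on its own.
            System.out.println(Character.getType(surrogate) == Character.SURROGATE); // true

            // The supplementary code point needs two chars (a surrogate pair).
            char[] pair = Character.toChars(supplementary);
            System.out.println(pair.length);                        // 2
            System.out.println(Character.isHighSurrogate(pair[0])); // true
            System.out.println(Character.isLowSurrogate(pair[1]));  // true
        }
    }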
From: Thomas Pornin on 23 Dec 2009 16:09

According to Andreas Leitgeb <avl(a)logic.at>:
> I have problems understanding why the surrogate code points are counted
> twice: once as their code points isolated and then again as the code-points
> that are reached by an adjacent pair of them.

Not all values from 0 to 0x10FFFF are legal code points by themselves. For instance, 0xFFFE and 0xFFFF are explicitly defined to be illegal as code point values, not only now but also for future versions of Unicode (this makes BOM detection unambiguous).

Surrogates are not legal "alone", but it is quite handy that old Unicode systems (those which only know of 16-bit code units, such as Java pre-5) will accept surrogates like just any other non-special code point: thus, surrogate pairs can be smuggled into a 16-bit-only system, and that's called UTF-16. This is somewhat equivalent to pushing UTF-8 data into an application which expects ASCII: all ASCII characters keep the same encoding, and we just hope that the application will store the bytes in the 0x80..0xF7 range unmolested.

--Thomas Pornin
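[Editorial note: a minimal sketch, using only the standard library, of what that "smuggling" looks like from Java 5's side. The char-level API still sees two 16-bit units, exactly as a pre-Java-5 system would, while the code-point API reassembles the pair.]

    // Sketch: a supplementary character stored in a String as a surrogate pair.
    public class SurrogatePairDemo {
        public static void main(String[] args) {
            // U+1D11E (MUSICAL SYMBOL G CLEF), written as its surrogate pair.
            String s = "\uD834\uDD1E";

            // The 16-bit view: two char values.
            System.out.println(s.length());                                      // 2
            System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D834 DD1E

            // The code-point view (Java 5+): the pair counts as one character.
            System.out.println(s.codePointCount(0, s.length()));                 // 1
            System.out.printf("U+%X%n", s.codePointAt(0));                       // U+1D11E
        }
    }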
From: Roedy Green on 23 Dec 2009 18:33

On Wed, 23 Dec 2009 09:40:04 +0000, Steven Simpson <ss(a)domain.invalid> wrote, quoted or indirectly quoted someone who said:
> IIRC, C99 introduced \uXXXX and \UXXXXXXXX.

It would make sense to follow suit. Life is complicated enough already for people who code in more than one language each day.
--
Roedy Green Canadian Mind Products
http://mindprod.com
If you think it's expensive to hire a professional to do the job, wait until you hire an amateur. ~ Red Adair (born: 1915-06-18, died: 2004-08-07 at age 89)
From: neuneudr on 24 Dec 2009 04:16

On Dec 24, 4:55 am, Owen Jacobson <angrybald...(a)gmail.com> wrote:
...
> In the interests of science, what characters do you see on the next line?
>
> ð ð ð ð ð ð ð¡

Debian Lenny / browser Iceweasel 3.0.6 (Firefox re-branded for true freedom ;)

I see boxes with tiny hex codes in them, not corresponding to the characters. But then I can select them and paste them into an xterm, where I see only '? ? ? ? ?' thingies, yet the file I paste them into from the terminal (using cat > aa.txt) contains the correct characters, as shown by a hexdump:

$ hexdump aa.txt
0000000 90f0 8084 f020 8590 2080 90f0 9086 f020
0000010 8c90 2080 90f0 8090 f020 9190 2090 9df0
0000020 a184 000a

:)
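[Editorial note: hexdump's default output groups the file into 16-bit little-endian words, so "90f0" in the dump is really the byte pair F0 90. A rough sketch, with the bytes transcribed from the dump above (assuming the final "000a" word is just the trailing newline), that un-swaps the pairs and lets Java decode the UTF-8:]

    // Sketch: decode the bytes shown in the hexdump above and print the
    // supplementary (above-U+FFFF) code points they contain.
    import java.io.UnsupportedEncodingException;

    public class DecodeDump {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // Bytes from the dump, already un-swapped from hexdump's 16-bit words.
            int[] raw = {
                0xf0, 0x90, 0x84, 0x80, 0x20, 0xf0, 0x90, 0x85, 0x80, 0x20,
                0xf0, 0x90, 0x86, 0x90, 0x20, 0xf0, 0x90, 0x8c, 0x80, 0x20,
                0xf0, 0x90, 0x90, 0x80, 0x20, 0xf0, 0x90, 0x91, 0x90, 0x20,
                0xf0, 0x9d, 0x84, 0xa1, 0x0a
            };
            byte[] bytes = new byte[raw.length];
            for (int i = 0; i < raw.length; i++) {
                bytes[i] = (byte) raw[i];
            }

            String s = new String(bytes, "UTF-8");

            // Walk the string by code point; each 4-byte UTF-8 sequence comes out
            // as one supplementary code point (a surrogate pair inside the String).
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                if (cp > 0xFFFF) {
                    System.out.printf("U+%X%n", cp);
                }
                i += Character.charCount(cp);
            }
        }
    }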
From: neuneudr on 24 Dec 2009 04:22

On Dec 22, 9:47 pm, Thomas Pornin <por...(a)bolet.org> wrote:
> ...
> ...(ASCII works everywhere... This

Here we've got a mix of Windows, Linux and OS X devs, so we're using scripts called at (Ant) build time that enforce that all .java files:

a) use a subset of ASCII in their names
b) contain only ASCII characters

You can't build an app with non-ASCII characters in our .java files and you certainly can't commit them :) It's in the guidelines. Better safe than sorry :)
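[Editorial note: the poster's actual build script isn't shown in the thread. As a hypothetical sketch of that kind of check — class name, messages and behaviour are illustrative only — a plain Java program like the following could be wired into the build and made to fail it with a non-zero exit status:]

    // Hypothetical sketch of an ASCII-only check for .java files.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class AsciiCheck {
        public static void main(String[] args) throws IOException {
            boolean ok = true;
            for (String arg : args) {
                ok &= check(new File(arg));
            }
            if (!ok) {
                System.exit(1); // non-zero exit fails the build
            }
        }

        private static boolean check(File f) throws IOException {
            if (f.isDirectory()) {
                boolean ok = true;
                for (File child : f.listFiles()) {
                    ok &= check(child);
                }
                return ok;
            }
            if (!f.getName().endsWith(".java")) {
                return true;
            }
            boolean ok = true;
            // a) the file name itself must be plain ASCII
            for (char c : f.getName().toCharArray()) {
                if (c > 0x7F) {
                    System.err.println("Non-ASCII file name: " + f);
                    ok = false;
                    break;
                }
            }
            // b) the file contents must be plain ASCII bytes
            FileInputStream in = new FileInputStream(f);
            try {
                int b;
                while ((b = in.read()) != -1) {
                    if (b > 0x7F) {
                        System.err.println("Non-ASCII byte in " + f);
                        ok = false;
                        break;
                    }
                }
            } finally {
                in.close();
            }
            return ok;
        }
    }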