From: Andreas Leitgeb on 23 Dec 2009 12:07

Tom Anderson <twic(a)urchin.earth.li> wrote:
>> Thomas Pornin <pornin(a)bolet.org> quoted the JLS section 3.1:
>>> << The Unicode standard was originally designed as a fixed-width 16-bit
>>> character encoding. It has since been changed to allow for characters
>>> whose representation requires more than 16 bits. The range of legal
>>> code points is now U+0000 to U+10FFFF
>> I have problems understanding why the surrogate code points are counted
>> twice: once as their code points isolated and then again as the code-points
>> that are reached by an adjacent pair of them.
> The range is a bound - all legal code points are inside it. It doesn't
> mean that all numbers inside it are legal code points. There are plenty of
> numbers which aren't mapped to any character, and so aren't legal code
> points - the surrogates are just a special case of those. I reckon.

Thanks, that was my catch: I somehow mistakenly took "range" as implying "all in the range" - and a code point with no char mapped to it wasn't necessarily illegal in my mind, but a single surrogate was.
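For what it's worth, the "range is only a bound" reading can be poked at from Java itself. The following is a minimal sketch, not anything from the JLS: the class name CodePointRange and the method classify are made up for illustration, and only the java.lang.Character calls are real API. It suggests that Java treats "inside U+0000..U+10FFFF" and "is a surrogate" as two separate questions.

```java
// Hypothetical demo class; only the java.lang.Character calls are real API.
final class CodePointRange {

    /** Returns a short description of how Java classifies the given number. */
    static String classify(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            return "outside U+0000..U+10FFFF";
        }
        switch (Character.getType(cp)) {
            case Character.SURROGATE:
                return "in range, but a surrogate code point";
            case Character.UNASSIGNED:
                return "in range, but no character assigned";
            default:
                return "in range, neither surrogate nor unassigned";
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x0041, 0xD800, 0x10FFFF, 0x110000 };
        for (int cp : samples) {
            System.out.printf("U+%04X: %s%n", cp, classify(cp));
        }
        // An adjacent high/low pair decodes to a single supplementary code point.
        int combined = Character.toCodePoint('\uD803', '\uDC22');
        System.out.printf("pair decodes to U+%X%n", combined);
    }
}
```

So a lone 0xD800 passes the range check, but it is flagged as a surrogate rather than as a character, which matches the distinction discussed above.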
From: Roedy Green on 22 Dec 2009 21:01

On 22 Dec 2009 20:47:39 GMT, Thomas Pornin <pornin(a)bolet.org> wrote, quoted or indirectly quoted someone who said :

>E.g., if you want to have a String literal with U+10C22 (that's
>OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish),
>then you first convert 0x10C22 to a surrogate pair:
> 1. subtract 0x10000: you get 0xC22
> 2. get the upper (u) and lower (l) 10 bits; you get u=0x3 and l=0x022
>    (i.e. (u << 10) + l == 0xC22)
> 3. the high surrogate is 0xD800 + u, the low surrogate is 0xDC00 + l.

That is what I was afraid of. I am doing that now to generate tables of char entities and the equivalent hex and \u entities on various pages of mindprod.com, e.g. http://mindprod.com/jgloss/html5.html which shows the new HTML entities in HTML 5.

Here is my code:

    // theCharNumber is a code point above 0xffff
    final int extract = theCharNumber - 0x10000;
    final int high = ( extract >>> 10 ) + 0xd800;   // upper 10 bits
    final int low = ( extract & 0x3ff ) + 0xdc00;   // lower 10 bits
    sb.append( "\"\\u" );
    sb.append( StringTools.toLzHexString( high, 4 ) );
    sb.append( "\\u" );
    sb.append( StringTools.toLzHexString( low, 4 ) );
    sb.append( "\"" );

I started to think about what would be needed to make this less onerous.

1. An applet to convert hex to a surrogate pair.
2. Allow \u12345 in string literals. However, that would break existing code: \u12345 currently means "\u1234" + "5".
3. So you have to pick another letter, e.g. \c12345; for codepoint. It needs a terminator, so that in future it could also handle \c123456; I don't know what that might break.
4. Introduce 32-bit CodePoint string literals with an extensible \u mechanism, e.g. CString b = c"\u12345;Hello";
5. Specify weird chars with named entities to make the code more readable. Entities in String literals would be translated to binary at compile time, so the entities would not exist at run-time. The HTML 5 set would be greatly extended to give pretty well every Unicode glyph a name.

P.S. I have been poking around in HTML 5. W3C did an odd thing. They REDEFINED the entities &lang; and &rang; to different glyphs from HTML 4. I don't think they have ever done anything like that before. I hope it was just an error. I have written the W3C asking if they really meant to do that.

-- 
Roedy Green Canadian Mind Products
http://mindprod.com
If you think it's expensive to hire a professional to do the job, wait until you hire an amateur. ~ Red Adair (born: 1915-06-18 died: 2004-08-07 at age: 89)
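The three steps quoted above can be put into a small self-contained helper and cross-checked against the JDK. This is only a sketch under assumptions: the class SurrogateEscapes and its method toEscapes are hypothetical names, and String.format stands in for the StringTools.toLzHexString helper used in the post. Character.toChars and Character.isSupplementaryCodePoint are standard java.lang API.

```java
// Hypothetical helper; names are illustrative, not from any library.
final class SurrogateEscapes {

    // Today this literal is parsed as U+1234 followed by the digit 5.
    static final String FIVE_DIGITS = "\u12345";

    /** Builds the two-escape source form for a code point above 0xFFFF. */
    static String toEscapes(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not above 0xFFFF: " + codePoint);
        }
        int offset = codePoint - 0x10000;        // step 1
        int high = 0xD800 + (offset >>> 10);     // steps 2 and 3, upper 10 bits
        int low = 0xDC00 + (offset & 0x3FF);     // steps 2 and 3, lower 10 bits

        // Cross-check the manual arithmetic against the JDK's own encoder.
        char[] pair = Character.toChars(codePoint);
        if (pair[0] != high || pair[1] != low) {
            throw new AssertionError("manual split disagrees with Character.toChars");
        }
        return String.format("\\u%04X\\u%04X", high, low);
    }

    public static void main(String[] args) {
        System.out.println(toEscapes(0x10C22));
        System.out.println(toEscapes(0x1F093));
        System.out.println(FIVE_DIGITS.length());
    }
}
```

The FIVE_DIGITS constant illustrates idea 2 in the list: a five-digit \u escape already has a meaning, which is why widening it would break existing code.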
From: Roedy Green on 23 Dec 2009 03:30

On Tue, 22 Dec 2009 18:01:17 -0800, Roedy Green <see_website(a)mindprod.com.invalid> wrote, quoted or indirectly quoted someone who said :

>I started to think about what would be needed to make this less
>onerous.

If you had only a few, you could create a library of named constants for them, and glue them together with compile-time concatenation. With only a little cleverness, a compiler would avoid embedding constants it did not use.

Is any OS, JVM, utility, browser etc. capable of rendering a code point above 0xffff? I get the impression all we can do is embed them in UTF-8 files.

-- 
Roedy Green Canadian Mind Products
http://mindprod.com
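A rough sketch of what such a constants library might look like follows. The class names Glyphs and GlyphDemo and the constant names are invented here for illustration; nothing like this ships with the JDK. Because the fields are static final Strings initialised from literals, concatenating them yields a compile-time constant, and javac copies the resulting string into the class that uses it.

```java
// Hypothetical constants holder; names are illustrative only.
final class Glyphs {
    private Glyphs() {}

    // U+10C22, OLD TURKIC LETTER ORKHON EM, as a surrogate pair
    static final String OLD_TURKIC_ORKHON_EM = "\uD803\uDC22";

    // U+1F093, DOMINO TILE VERTICAL-06-06, as a surrogate pair
    static final String DOMINO_TILE_VERTICAL_06_06 = "\uD83C\uDC93";
}

final class GlyphDemo {
    // Constant expression: the concatenation is folded at compile time.
    static final String GREETING = "Fish: " + Glyphs.OLD_TURKIC_ORKHON_EM;

    public static void main(String[] args) {
        System.out.println(GREETING);
        System.out.println("chars: " + GREETING.length());
        System.out.println("code points: "
                + GREETING.codePointCount(0, GREETING.length()));
    }
}
```

The surrogate arithmetic then lives in one place, and call sites read as names rather than hex.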
From: Steven Simpson on 23 Dec 2009 04:40

On 23/12/09 02:01, Roedy Green wrote:
> 3. So you have to pick another letter: e.g. \c12345; for codepoint. IT
> needs a terminator, so that in future it could also handle \c123456;
> I don't know what that might break.

IIRC, C99 introduced \uXXXX and \UXXXXXXXX.

-- 
ss at comp dot lancs dot ac dot uk
From: Thomas Pornin on 23 Dec 2009 07:58

According to Roedy Green <see_website(a)mindprod.com.invalid>:
> Is any OS, JVM, utility, browser etc. capable of rendering a code
> point above 0xffff?

Oh yes, plenty. Well, at least on my system (Linux with Ubuntu 9.10). For instance, if I write this HTML file:

    <html>
    <body>
    <p>🂓</p>
    </body>
    </html>

then both Firefox and Chromium display the "DOMINO TILE VERTICAL-06-06" as they should. Now if I write this Java code:

    public class Foo {
        public static void main(String[] args) {
            StringBuilder sb = new StringBuilder();
            // appends the surrogate pair for U+1F093
            sb.appendCodePoint(0x1F093);
            System.out.println(sb.toString());
        }
    }

and run it in a standard terminal (GNOME Terminal 2.28.1 on that system), then the domino tile is displayed. If I redirect the output to a file, I can edit it just fine with the vim text editor; the domino tile is handled as a single character, just like it is supposed to be.

Internally, C programs which wish to handle the full Unicode range on Linux use the 'wide character' type (wchar_t) which, on Linux, is defined to be a 32-bit integer. Therefore there is nothing special about the 0xFFFF limit. In practice, Unicode display trouble usually stems from limited availability of fonts with exotic characters (although Linux has a fair share of such fonts), double-width characters in monospace fonts, and right-to-left scripts, all of which are orthogonal to the 16/32-bit issue.

The same is not true in Windows, which switched to Unicode earlier, when code points were 16-bit only; on Windows, wchar_t and the "wide string literals" use 16-bit characters, and recent versions of Windows have to resort to UTF-16 to process higher planes, just like Java. I have been told that the OS is plainly able to process and display all of the Unicode planes, but it can be expected that some applications are not up to it yet.

C# is a late-comer (2001) but uses a 16-bit char type. This may be an artefact of Java imitation. This may also be an attempt to ease conversion of C or C++ code for Windows into C# code.

--Thomas Pornin
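To make the "just like Java" remark concrete, here is a small sketch of how the domino tile looks from inside a Java String. The class name DominoDemo and the helper fromCodePoint are made up for this example; the StringBuilder and String methods are standard. Whether the glyph actually renders still depends on the terminal and fonts, as described above.

```java
// Hypothetical demo class; only the String/StringBuilder calls are real API.
final class DominoDemo {

    /** Builds a one-code-point String, as in the Foo example above. */
    static String fromCodePoint(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        String tile = fromCodePoint(0x1F093);

        // length() counts 16-bit chars, so a supplementary character counts twice.
        System.out.println("chars: " + tile.length());
        System.out.println("code points: " + tile.codePointCount(0, tile.length()));

        // The two chars are the high and low surrogate of the pair.
        System.out.printf("units: %04X %04X%n",
                (int) tile.charAt(0), (int) tile.charAt(1));

        // codePointAt reassembles the pair into the original value.
        System.out.printf("code point: U+%X%n", tile.codePointAt(0));
    }
}
```

Code that indexes by char, such as charAt or substring, can split such a pair in the middle, which is the usual source of trouble with the higher planes in UTF-16 based APIs.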