From: Peter Duniho on 29 Jan 2010 02:59

moonhkt wrote:
> Hi All
> Why using utf-8, the hex value return 51cc and 6668 ?
>
> od -cx utf8_file01.text
>
> 22e5 878c e699 a822 with " befor and after

I don't understand the above. Are you trying to suggest that the text 'with " befor and after' is part of the output of the "od" program? If so, why does it not appear to match up with the binary values written out?

And if the characters you're concerned with are at index 101 and 102, why only eight bytes in the file? And if the file is UTF-8, why are you dumping its contents as shorts? Why not just bytes?

Frankly, the whole question doesn't make much sense to me. That said, the basic answer to your question is, I believe: UTF-8 and UTF-16 are different, so of course the bytes used to represent a character in a UTF-8 file are going to look different from the bytes used to represent the same character in a UTF-16 data structure.

Pete

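To make that last point concrete, here is a minimal sketch that encodes the same two characters both ways and prints the bytes. The class name EncodingCompare and the hex helper are made up for illustration, and java.nio.charset.StandardCharsets assumes Java 7 or later; on older JDKs the charset names "UTF-8" and "UTF-16BE" can be passed to getBytes as strings instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original program.
class EncodingCompare {
    // Formats bytes as lowercase two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u51cc\u6668";
        // Same characters, two encodings, different byte sequences.
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

The UTF-8 line should show e5 87 8c e6 99 a8, while the UTF-16BE line shows 51 cc 66 68, which are the values 51cc and 6668 from the question.
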
From: John B. Matthews on 29 Jan 2010 06:14

In article <990608dd-46fb-4280-88b7-f86dcd520c21(a)2g2000prl.googlegroups.com>, moonhkt <moonhkt(a)gmail.com> wrote:

[...]
> My Question is input utf-16 hex value, when write to file with UTF8
> codepage, the data will encode to UTF-8 ?

When I run your program, I get this file content:

    $ hd out_utf.text
    000000: e5 87 8c e6 99 a8 0a ?..?.?.

> Do you know hwo to input hex value of utf-8?

Do you mean like this?

    String a = "\u51cc\u6668";
    String b = new String(new byte[] {
        (byte) 0xe5, (byte) 0x87, (byte) 0x8c,
        (byte) 0xe6, (byte) 0x99, (byte) 0xa8
    });
    System.out.println("a.equals(b) is " + a.equals(b));

This prints "a.equals(b) is true".

For reference:

    $ cat ~/bin/hd
    #!/usr/bin/hexdump -f
    "%06.6_ax: " 16/1 "%02x " " "
    16/1 "%_p" "\n"

-- 
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>

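One caveat worth spelling out: new String(byte[]) decodes with the platform default charset, so the comparison above comes out true only where that default is UTF-8. A minimal sketch that names the charset explicitly follows. The class name ExplicitCharsetDecode is made up for illustration, and StandardCharsets assumes Java 7 or later; new String(bytes, "UTF-8") is the older equivalent.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
class ExplicitCharsetDecode {
    // The six UTF-8 bytes from the hex dump, without the trailing newline.
    static final byte[] UTF8_BYTES = {
        (byte) 0xe5, (byte) 0x87, (byte) 0x8c,
        (byte) 0xe6, (byte) 0x99, (byte) 0xa8
    };

    // Decodes the bytes as UTF-8 regardless of the platform default charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String a = "\u51cc\u6668";
        String b = decodeUtf8(UTF8_BYTES);
        System.out.println("a.equals(b) is " + a.equals(b));
    }
}
```
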
From: Peter Duniho on 29 Jan 2010 12:38

moonhkt wrote:
> Text file just have two utf-8 chinease character.
> cat out_utf.text
> 凌晨

Then your original post still doesn't make sense to me. A two-character file can't have characters at offsets 101 and 102.

> od -cx out_utf.text
> 0000000 207 214 231 \n
> e587 8ce6 99a8 0a00
> 0000007

Looks fine to me, given the code you posted.

> java to build utf-8 data, input using utf-16 value. I does not know
> how to input utf-8 hex value.

Input how? As a literal? Or reading from a file?

> My Question is input utf-16 hex value, when write to file with UTF8
> codepage, the data will encode to UTF-8 ?

Of course. If the encoding for the writer is UTF-8, the output is UTF-8. That's the whole point of using an encoding-specific class.

> Do you know hwo to input hex value of utf-8 ? I tried \0xe5 not works.

Reading UTF-8 works just like writing UTF-8, except you use an InputStreamReader instead of an OutputStreamWriter.

But the Java char and String data structures use UTF-16, not UTF-8. The only time you'll see the actual UTF-8 bytes is if you read the file as raw bytes. By definition, if you use a character-encoding-specific class like InputStreamReader, it will automatically convert from UTF-8 to UTF-16 for you.

If you want UTF-8 data in your program, then don't use something whose entire purpose is to convert the input encoding (like UTF-8) to the encoding used in Java (UTF-16).

Of course, if you do that, you can't use the UTF-8 data as actual character data in your Java program. But presumably you have some other reason for wanting UTF-8 data instead.

Pete

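The difference between going through a reader and reading raw bytes can be sketched roughly as follows. The class name Utf8RoundTrip and its two helper methods are made up for illustration, the file is a throwaway temp file, and java.nio.file.Files assumes Java 7 or later.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class name, used only for this illustration.
class Utf8RoundTrip {
    // Chars (UTF-16 in memory) go in, UTF-8 bytes land in the file.
    static void writeUtf8(Path file, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            w.write(text);
        }
    }

    // UTF-8 bytes come from the file, chars (UTF-16 in memory) come out.
    static String readUtf8(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                Files.newInputStream(file), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("out_utf", ".text");
        try {
            writeUtf8(file, "\u51cc\u6668\n");
            // Raw bytes: this is the only place the UTF-8 values are visible.
            for (byte b : Files.readAllBytes(file)) {
                System.out.printf("%02x ", b & 0xff);
            }
            System.out.println();
            // Through the reader: the values are UTF-16 code units again.
            for (char c : readUtf8(file).toCharArray()) {
                System.out.printf("%04x ", (int) c);
            }
            System.out.println();
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The first loop should print the seven bytes seen in the od output (e5 87 8c e6 99 a8 0a); the second prints 51cc 6668 000a, because the reader has already converted the data to Java chars.
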
From: markspace on 29 Jan 2010 13:33

moonhkt wrote:
> But, I want Print out UTF-8 hex value How to Print ? e.g U+51CC to e5
> 87 8c.
> What coding can handle this ?

Oh, I see. Try this:

    package test;

    import java.io.UnsupportedEncodingException;

    public class UtfOut {
        public static void main( String[] args )
                throws UnsupportedEncodingException
        {
            String a = "\u51cc\u6668";
            byte [] buf = a.getBytes( "UTF-8" );
            for( byte b : buf ) {
                System.out.printf( "%02X ", b );
            }
            System.out.println( );
        }
    }

You could also use a ByteArrayOutputStream.

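The ByteArrayOutputStream route might look something like the sketch below: a UTF-8 OutputStreamWriter is wrapped around an in-memory stream and the encoded bytes are collected from it. The class name Utf8ViaStream and the encode helper are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
class Utf8ViaStream {
    // Pushes the text through a UTF-8 writer and returns the bytes it produced.
    static byte[] encode(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(text);
        } // closing the writer flushes any pending bytes into buf
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (byte b : encode("\u51cc\u6668")) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

For a string that is already in memory, getBytes is shorter; the stream version is mostly useful when the text is produced piecemeal by code that already writes to a Writer.
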
From: Roedy Green on 30 Jan 2010 04:51

On Thu, 28 Jan 2010 23:40:07 -0800 (PST), moonhkt <moonhkt(a)gmail.com> wrote, quoted or indirectly quoted someone who said :

>Hi All
>Why using utf-8, the hex value return 51cc and 6668 ?

UTF-8 is a mixture of 8 bit chars, and magic 8-bit sequences that turn into 16 bit and 32 bit code sequences. To see how the algorithm works see

http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/codepoint.html

-- 
Roedy Green Canadian Mind Products
http://mindprod.com
Computers are useless. They can only give you answers.
~ Pablo Picasso (born: 1881-10-25 died: 1973-04-08 at age: 91)

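For readers who want to see the bit shuffling without following the links, here is a rough hand-rolled sketch of the standard UTF-8 scheme, in which each code point becomes one to four bytes. It is not taken from the linked pages, the class name Utf8ByHand is made up for illustration, and in real code String.getBytes("UTF-8") does the same job.

```java
// Hypothetical class name, used only for this illustration.
class Utf8ByHand {
    // Encodes a single Unicode code point as UTF-8 (1 to 4 bytes).
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        if (cp < 0x80) {
            // 7 bits: 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            // 11 bits: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp < 0x10000) {
            // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x51CC, 0x6668 }) {
            for (byte b : encode(cp)) {
                System.out.printf("%02x ", b & 0xff);
            }
        }
        System.out.println();
    }
}
```

U+51CC is 0101 0001 1100 1100 in binary. Split 4/6/6 and prefixed with 1110, 10 and 10, it becomes e5 87 8c, which is why the file shows those bytes rather than 51 cc.
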