From: Peter Duniho on
moonhkt wrote:
> Hi All
> Why using utf-8, the hex value return 51cc and 6668 ?
>
> od -cx utf8_file01.text
>
> 22e5 878c e699 a822 with " befor and after

I don't understand the above. Are you trying to suggest that the text
'with " befor and after' is part of the output of the "od" program? If
so, why does it not appear to match up with the binary values written
out? And if the characters you're concerned with are at index 101 and
102, why only eight bytes in the file? And if the file is UTF-8, why
are you dumping its contents as shorts? Why not just bytes?

Frankly, the whole question doesn't make much sense to me. That said,
the basic answer to your question is, I believe: UTF-8 and UTF-16 are
different, so of course the bytes used to represent a character in a
UTF-8 file are going to look different from the bytes used to represent
the same character in a UTF-16 data structure.
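
To make that concrete, here is a minimal sketch that dumps the same two characters in both encodings. The class name EncodingCompare and the hex() helper are made up for illustration; the only library call that matters is String.getBytes(Charset).

```java
import java.nio.charset.StandardCharsets;

class EncodingCompare {
    // Hypothetical helper: format bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u51cc\u6668";
        // Same two characters, two different byte representations.
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));    // e5 87 8c e6 99 a8
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 51 cc 66 68
    }
}
```

The UTF-16BE bytes are just the code unit values 51cc and 6668, while UTF-8 needs three bytes for each of these characters.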

Pete
From: John B. Matthews on
In article
<990608dd-46fb-4280-88b7-f86dcd520c21(a)2g2000prl.googlegroups.com>,
moonhkt <moonhkt(a)gmail.com> wrote:

[...]
> My Question is input utf-16 hex value, when write to file with UTF8
> codepage, the data will encode to UTF-8 ?

When I run your program, I get this file content:

$ hd out_utf.text
000000: e5 87 8c e6 99 a8 0a ?..?.?.

> Do you know how to input hex value of utf-8?

Do you mean like this?

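String a = "\u51cc\u6668";
// Decode explicitly as UTF-8; new String(byte[]) alone uses the
// platform default charset, which may not be UTF-8.
String b = new String(new byte[] {
    (byte) 0xe5, (byte) 0x87, (byte) 0x8c,
    (byte) 0xe6, (byte) 0x99, (byte) 0xa8
}, java.nio.charset.StandardCharsets.UTF_8);
System.out.println("a.equals(b) is " + a.equals(b));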

This prints "a.equals(b) is true".

For reference: $ cat ~/bin/hd
#!/usr/bin/hexdump -f
"%06.6_ax: " 16/1 "%02x " " "
16/1 "%_p" "\n"

--
John B. Matthews
trashgod at gmail dot com
<http://sites.google.com/site/drjohnbmatthews>
From: Peter Duniho on
moonhkt wrote:
> Text file just has two utf-8 Chinese characters.
> cat out_utf.text
> 凌晨

Then your original post still doesn't make sense to me. A two character
file can't have characters at offset 101 and 102.

> od -cx out_utf.text
> 0000000 207 214 231 \n
> e587 8ce6 99a8 0a00
> 0000007

Looks fine to me, given the code you posted.

> java to build utf-8 data, input using utf-16 value. I do not know
> how to input utf-8 hex value.

Input how? As a literal? Or reading from a file?

> My Question is input utf-16 hex value, when write to file with UTF8
> codepage, the data will encode to UTF-8 ?

Of course. If the encoding for the writer is UTF-8, the output is
UTF-8. That's the whole point of using an encoding-specific class.
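
As a rough illustration (not your actual program, which I'm only guessing at), the sketch below writes the two characters through a UTF-8 OutputStreamWriter into a temporary file and then reads the raw bytes back. The class and method names are invented for the example, and the temp file stands in for out_utf.text.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteUtf8 {
    // Hypothetical helper: write text through a UTF-8 writer to a temp file,
    // then return the raw bytes that ended up on disk.
    static byte[] writeAndReadBack(String text) {
        try {
            Path tmp = Files.createTempFile("out_utf", ".text");
            try {
                try (Writer w = new OutputStreamWriter(
                        new FileOutputStream(tmp.toFile()), StandardCharsets.UTF_8)) {
                    w.write(text); // UTF-16 chars in, UTF-8 bytes out
                }
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (byte b : writeAndReadBack("\u51cc\u6668\n")) {
            System.out.printf("%02x ", b);
        }
        System.out.println(); // prints: e5 87 8c e6 99 a8 0a
    }
}
```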

> Do you know how to input hex value of utf-8 ? I tried \0xe5, which does not work.

Reading UTF-8 works just like writing UTF-8, except you use an
InputStreamReader instead of an OutputStreamWriter. But, the Java char
and String data structures use UTF-16, not UTF-8. The only time you'll
see the actual UTF-8 bytes is if you read the file as raw bytes.

By definition, if you use a character-encoding-specific class like
InputStreamReader, it will automatically convert from UTF-8 to UTF-16
for you.
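
A small sketch of that conversion, assuming the UTF-8 bytes come from an in-memory array rather than a file (ReadUtf8 and decode() are names I made up for the example):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadUtf8 {
    // Hypothetical helper: decode UTF-8 bytes into a Java String by
    // reading them through an InputStreamReader.
    static String decode(byte[] utf8) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c); // each value read is a UTF-16 code unit
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = decode(new byte[] {
            (byte) 0xe5, (byte) 0x87, (byte) 0x8c,
            (byte) 0xe6, (byte) 0x99, (byte) 0xa8
        });
        for (char c : s.toCharArray()) {
            System.out.printf("%04x ", (int) c);
        }
        System.out.println(); // prints: 51cc 6668
    }
}
```

Six bytes go in and two chars come out, which is why the program never sees e5 87 8c once the reader has done its job.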

If you want UTF-8 data in your program, then don't use something whose
entire purpose is to convert the input encoding (like UTF-8) to the
encoding used in Java (UTF-16). Of course, if you do that, you can't
use UTF-8 as actual character data in your Java program. But presumably
you have some other reason for wanting UTF-8 data instead.

Pete
From: markspace on
moonhkt wrote:

> But I want to print out the UTF-8 hex value. How do I print it? E.g. U+51CC
> to e5 87 8c.
> What code can handle this?


Oh, I see.

Try this:


package test;
import java.io.UnsupportedEncodingException;

public class UtfOut {
    public static void main( String[] args )
            throws UnsupportedEncodingException
    {
        String a = "\u51cc\u6668";

        byte [] buf = a.getBytes( "UTF-8" );

        for( byte b : buf ) {
            System.out.printf( "%02X ", b );
        }
        System.out.println( );

    }
}


You could also use a ByteArrayOutputStream.
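
A sketch of that ByteArrayOutputStream variant, in case it helps. The class name UtfOutStream and the utf8Hex() helper are just illustrative; the idea is to wrap the in-memory stream in a UTF-8 OutputStreamWriter and then inspect the bytes it collected.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class UtfOutStream {
    // Hypothetical helper: encode text as UTF-8 into memory and
    // return the bytes as uppercase hex pairs.
    static String utf8Hex(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : buf.toByteArray()) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u51cc\u6668")); // prints: E5 87 8C E6 99 A8
    }
}
```

This is mostly useful when you already have code that writes to a Writer and you want to see exactly which bytes it produces.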
From: Roedy Green on
On Thu, 28 Jan 2010 23:40:07 -0800 (PST), moonhkt <moonhkt(a)gmail.com>
wrote, quoted or indirectly quoted someone who said :

>Hi All
>Why using utf-8, the hex value return 51cc and 6668 ?

UTF-8 is a mixture of single-byte characters (plain ASCII) and multi-byte
sequences of two to four bytes that decode to 16-bit and larger code points.

To see how the algorithm works see
http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/codepoint.html
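
For the impatient, here is a rough sketch of the bit-shuffling done by hand, so you can see how U+51CC ends up as e5 87 8c. The class Utf8ByHand and its encode() method are invented for this post, and it assumes a valid code point (it does not reject surrogates or values above U+10FFFF). In real code, just use String.getBytes with a UTF-8 charset.

```java
class Utf8ByHand {
    // Hypothetical helper: encode one code point as UTF-8 bytes.
    // Assumes cp is a valid Unicode scalar value; no validation is done.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        for (byte b : encode(0x51CC)) {
            System.out.printf("%02x ", b);
        }
        System.out.println(); // prints: e5 87 8c
    }
}
```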
--
Roedy Green Canadian Mind Products
http://mindprod.com
Computers are useless. They can only give you answers.
~ Pablo Picasso (born: 1881-10-25 died: 1973-04-08 at age: 91)