Read utf-8 char one by one [Java Programming]


From: moonhkt on 27 Jan 2010 03:56

Hi all,

How can I read UTF-8 characters one by one?

The code below does not work.

import java.nio.charset.Charset;
import java.io.*;
import java.lang.String;

public class read_utf_char {
    public static void main(String[] args) {
        File aFile = new File("utf8_test.text");
        try {
            String str = "";
            char[] ch = new char[];
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(aFile), "UTF8"));
            while ( in.read(ch) != -1 )
            {
                System.out.print(ch);
            }
        } catch (UnsupportedEncodingException e) {
        } catch (IOException e) {
        }
    }
}

From: Mayeul on 27 Jan 2010 04:29

moonhkt wrote:
> Hi all,
>
> How can I read UTF-8 characters one by one?
>
> The code below does not work.

As far as I know, it works if your UTF-8 stream contains only BMP
characters (characters with code point 0xFFFF or below).

But it is indeed incorrect in the general case, where you can't assume
that all characters are in the BMP. This is a known Java limitation: a
Java char is a 16-bit UTF-16 code unit, so a character outside the BMP
takes two chars (a surrogate pair).
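
To make the limitation concrete, here is a minimal sketch, assuming the
UTF-8 data is already in a byte array rather than in a file. The class
and method names (SurrogateReadDemo, readCodeUnits) are made up for
illustration. It reads one char at a time, the way Reader.read() does,
and shows that a character outside the BMP arrives as two surrogates.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original post.
class SurrogateReadDemo {
    // Decodes UTF-8 bytes and collects every value returned by Reader.read(),
    // i.e. every UTF-16 code unit, one at a time.
    static List<Integer> readCodeUnits(byte[] utf8) {
        List<Integer> units = new ArrayList<>();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                units.add(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return units;
    }

    public static void main(String[] args) {
        // U+00FC is in the BMP; U+1D11E (musical symbol G clef) is not.
        String text = "\u00fc" + new String(Character.toChars(0x1D11E));
        for (int unit : readCodeUnits(text.getBytes(StandardCharsets.UTF_8))) {
            System.out.printf("U+%04X surrogate=%b%n",
                    unit, Character.isSurrogate((char) unit));
        }
        // Prints:
        // U+00FC surrogate=false
        // U+D834 surrogate=true
        // U+DD1E surrogate=true
    }
}
```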

In the general case, you just don't read Unicode characters one by one
from a stream. Either you convert the stream to a String first (and then
use a clever combination of String.codePointAt() and
Character.charCount(); read the JavaDoc), or you read while looking for
your delimiters, storing whatever is *not* a delimiter in a char buffer,
untouched. You do not write it out directly. For instance,
BufferedReader implements reading line by line. I suppose other
implementations allow reading with a different delimiter.
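
The codePointAt()/charCount() combination can be sketched as follows,
assuming the stream content has already been converted to a String. The
class and method names (CodePointWalk, codePoints) are only
illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original post.
class CodePointWalk {
    // Walks the string one Unicode code point at a time.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // full code point starting at index i
            result.add(cp);
            i += Character.charCount(cp);   // advance by 1 char (BMP) or 2 (surrogate pair)
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E written as a surrogate pair, then 'b'.
        for (int cp : codePoints("a\uD834\uDD1Eb")) {
            System.out.printf("U+%04X%n", cp);
        }
        // Prints:
        // U+0061
        // U+1D11E
        // U+0062
    }
}
```

The loop index advances by Character.charCount(cp) rather than by 1,
which is what keeps a surrogate pair from being visited twice.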

--
Mayeul

From: Lothar Kimmeringer on 27 Jan 2010 07:17

moonhkt wrote:

> The code below does not work.

[...]

> char[] ch = new char[];

Because it doesn't compile.

What exactly doesn't work? Do you get wrong output, or do you
get an exception (which you ignore in the source you provided)? A
bit more information would really help us answer with
more than "something will be wrong in your code".
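
For reference, a minimal sketch of the posted loop with both points
addressed: the array creation gets a size, and an IOException is
reported instead of being swallowed. The names (ReadUtf8Chars,
readChars) are hypothetical, and an in-memory byte array stands in for
utf8_test.text so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical rework of the posted class, not the original poster's code.
class ReadUtf8Chars {
    // Decodes the stream as UTF-8 and collects it one char at a time.
    static String readChars(InputStream raw) {
        StringBuilder out = new StringBuilder();
        char[] ch = new char[1]; // an array creation expression needs a size
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(raw, StandardCharsets.UTF_8))) {
            while (in.read(ch) != -1) {
                out.append(ch[0]);
            }
        } catch (IOException e) {
            // Surface the failure instead of leaving the catch block empty.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: sample bytes replace the file from the original post.
        byte[] sample = "\u00fc\u34d7 Welcome".getBytes(StandardCharsets.UTF_8);
        System.out.println(readChars(new ByteArrayInputStream(sample)));
    }
}
```

Each element read this way is still a UTF-16 code unit, so this only
yields whole characters for BMP-only input.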

Regards, Lothar
--
Lothar Kimmeringer E-Mail: spamfang(a)kimmeringer.de
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!

From: moonhkt on 27 Jan 2010 10:33

On Jan 27, 8:17 pm, Lothar Kimmeringer <news200...(a)kimmeringer.de>
wrote:
> moonhkt wrote:
> > The code below does not work.
>
> [...]
>
> > char[] ch = new char[];
>
> Because it doesn't compile.
>
> What exactly doesn't work? Do you get wrong output, or do you
> get an exception (which you ignore in the source you provided)? A
> bit more information would really help us answer with
> more than "something will be wrong in your code".

Thanks. I found the example below, but I cannot get the UTF-8 char code from it.

class CodePointAtstring
{
    public static void main(String[] args)
    {
        // Declaration of String
        String a = "\u00fc" + "\u34d7" + "Welcome to Rose india";
        // Displays the actual String declared above
        System.out.println("GIVEN STRING IS=" + a);
        // Returns the character (Unicode code point) at the specified index.
        System.out.println("Unicode code point at position 0 IN THE STRING IS=" + a.codePointAt(0));
        System.out.println("Unicode code point at position 1 IN THE STRING IS=" + a.codePointAt(1));
        System.out.println("Unicode code point at position 2 IN THE STRING IS=" + a.codePointAt(2));
        System.out.println("Unicode code point at position 3 IN THE STRING IS=" + a.codePointAt(3));
        System.out.println("Unicode code point at position 6 IN THE STRING IS=" + a.codePointAt(6));
    }
}

Output
java CodePointAtstring
GIVEN STRING IS=³?Welcome to Rose india
Unicode code point at position 0 IN THE STRING IS=252
Unicode code point at position 1 IN THE STRING IS=13527
Unicode code point at position 2 IN THE STRING IS=87
Unicode code point at position 3 IN THE STRING IS=101
Unicode code point at position 6 IN THE STRING IS=111

From: RedGrittyBrick on 27 Jan 2010 11:12

moonhkt wrote:
> On Jan 27, 8:17 pm, Lothar Kimmeringer <news200...(a)kimmeringer.de>
> wrote:
>> moonhkt wrote:
>>> The code below does not work.
>> [...]
>>
>>> char[] ch = new char[];
>> Because it doesn't compile.
>>
>> What exactly doesn't work? Do you get wrong output, or do you
>> get an exception (which you ignore in the source you provided)? A
>> bit more information would really help us answer with
>> more than "something will be wrong in your code".
>
> Thanks. I found the example below, but I cannot get the UTF-8 char code from it.

What do you mean by "UTF-8 char code"? Strictly speaking, there is no
such thing. You might mean "Unicode code point" or "sequence of octets
in UTF-8 encoding".

>
> class CodePointAtstring
> {
>     public static void main(String[] args)
>     {
>         // Declaration of String
>         String a = "\u00fc" + "\u34d7" + "Welcome to Rose india";
>         // Displays the actual String declared above
>         System.out.println("GIVEN STRING IS=" + a);
>         // Returns the character (Unicode code point) at the specified index.
>         System.out.println("Unicode code point at position 0 IN THE STRING IS=" + a.codePointAt(0));
>         System.out.println("Unicode code point at position 1 IN THE STRING IS=" + a.codePointAt(1));
>         System.out.println("Unicode code point at position 2 IN THE STRING IS=" + a.codePointAt(2));
>         System.out.println("Unicode code point at position 3 IN THE STRING IS=" + a.codePointAt(3));
>         System.out.println("Unicode code point at position 6 IN THE STRING IS=" + a.codePointAt(6));
>     }
> }
>
> Output
> java CodePointAtstring
> GIVEN STRING IS=³?Welcome to Rose india
> Unicode code point at position 0 IN THE STRING IS=252
> Unicode code point at position 1 IN THE STRING IS=13527
> Unicode code point at position 2 IN THE STRING IS=87
> Unicode code point at position 3 IN THE STRING IS=101
> Unicode code point at position 6 IN THE STRING IS=111
>

That seems completely reasonable to me because 252 = 0x00fc and 13527 =
0x34d7.

Nothing in your program has anything to do with UTF-8 encoding.
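
If what you want is the sequence of octets in UTF-8 encoding, a minimal
sketch could look like this. The class and method names (Utf8Octets,
utf8Hex) are made up for illustration; String.getBytes() with
StandardCharsets.UTF_8 is the standard call that does the encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original post.
class Utf8Octets {
    // Returns the UTF-8 encoding of one code point as space-separated hex octets.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String a = "\u00fc" + "\u34d7" + "W";
        for (int i = 0; i < a.length(); ) {
            int cp = a.codePointAt(i);
            System.out.printf("U+%04X -> %s%n", cp, utf8Hex(cp));
            i += Character.charCount(cp);
        }
        // Prints:
        // U+00FC -> C3 BC
        // U+34D7 -> E3 93 97
        // U+0057 -> 57
    }
}
```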

--
RGB
