Read utf-8 char one by one [Java Programming]

Prev: split UTF-8 string to multi UTF8-file
Next: How to get an include-path of jni.h that is able to bedifferent on different platforms.

From: moonhkt on 27 Jan 2010 20:00

Hi All
I want to output the characters in the string one by one.
Right now, codePointAt just prints the code point values.

On Jan 28, 12:12 am, RedGrittyBrick <RedGrittyBr...(a)spamweary.invalid>
wrote:
> moonhkt wrote:
> > On Jan 27, 8:17 pm, Lothar Kimmeringer <news200...(a)kimmeringer.de>
> > wrote:
> >> moonhkt wrote:
> >>> The code below does not work.
> >> [...]
>
> >>>   char[] ch = new char[];
> >> Because it doesn't compile.
>
> >> What exactly doesn't work? Do you get wrong output, or do you
> >> get an exception (which you ignore in the source you provided)? A
> >> bit more information would really help to be able to answer
> >> more than "something will be wrong in your code".
>
> >> Regards, Lothar
> >> --
> >> Lothar Kimmeringer                E-Mail: spamf...(a)kimmeringer.de
> >>                PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)
>
> >> Always remember: The answer is forty-two, there can only be wrong
> >>                  questions!
>
> > Thanks. I found the example below, but I cannot get the UTF-8 char code.
>
> What do you mean by "UTF-8 char code"? Strictly speaking there is no
> such thing. You might mean "Unicode code-point" or "sequence of octets
> in UTF8-encoding"
>
> > class CodePointAtstring
> > {
> >   public static void main(String[] args)
> >   {
> >     // Declaration of String
> >     String a = "\u00fc" + "\u34d7" + "Welcome to Rose india";
> >     // Displays the actual String declared above
> >     System.out.println("GIVEN STRING IS=" + a);
> >     // Returns the character (Unicode code point) at the specified index.
> >     System.out.println("Unicode code point at position 0 IN THE STRING IS=" + a.codePointAt(0));
> >     System.out.println("Unicode code point at position 1 IN THE STRING IS=" + a.codePointAt(1));
> >     System.out.println("Unicode code point at position 2 IN THE STRING IS=" + a.codePointAt(2));
> >     System.out.println("Unicode code point at position 3 IN THE STRING IS=" + a.codePointAt(3));
> >     System.out.println("Unicode code point at position 6 IN THE STRING IS=" + a.codePointAt(6));
> >   }
> > }
>
> > Output
> > java CodePointAtstring
> > GIVEN STRING IS=³?Welcome to Rose india
> > Unicode code point at position 0 IN THE STRING IS=252
> > Unicode code point at position 1 IN THE STRING IS=13527
> > Unicode code point at position 2 IN THE STRING IS=87
> > Unicode code point at position 3 IN THE STRING IS=101
> > Unicode code point at position 6 IN THE STRING IS=111
>
> That seems completely reasonable to me because 252 = 0x00fc and 13527 =
> 0x34d7.
>
> Nothing in your program has anything to do with UTF-8 encoding.
>
> --
> RGB

From: Lew on 27 Jan 2010 20:49

Please, do not top-post.

moonhkt wrote:
> I want to output the characters in the string one by one.
> Right now, codePointAt just prints the code point values.

'codePointAt()' doesn't print anything. How are you actually printing it?

'codePointAt()' returns an int, not a character.
<http://java.sun.com/javase/6/docs/api/java/lang/String.html#codePointAt(int)>

Most methods that output an int show the int value, not the equivalent
character. If you want to display an int as a character, you have to use a
method that will do that. I don't know offhand of a method in the standard
API that does that, but perusal of the Javadocs might reveal one, otherwise
you'll have to code one yourself or find a third-party library that already
has such.

--
Lew
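
For reference, the standard API does have methods that turn an int code point
into text: Character.toChars(int), StringBuilder.appendCodePoint(int), and the
%c conversion of printf. What follows is only a minimal sketch; the class name
CodePointToText and the helper toText are made up for illustration, and whether
the characters display correctly still depends on the console's encoding and
font.

```java
public class CodePointToText {

    // Hypothetical helper: convert one Unicode code point (an int) to a String.
    // Character.toChars returns one char for a BMP code point and a
    // surrogate pair (two chars) for a supplementary one.
    static String toText(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String a = "\u00fc\u34d7Welcome";
        int cp = a.codePointAt(0);          // 252, i.e. U+00FC

        // Printing the int shows the number; converting it first shows the character.
        System.out.println(cp);
        System.out.println(toText(cp));

        // The %c conversion also accepts an int holding a valid code point.
        System.out.printf("%c%n", cp);

        // StringBuilder.appendCodePoint is convenient when building up text.
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(0x1D11E);        // MUSICAL SYMBOL G CLEF
        System.out.println(sb.length());    // 2: stored as a surrogate pair
    }
}
```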

From: Roedy Green on 28 Jan 2010 01:11

On Wed, 27 Jan 2010 16:12:18 +0000, RedGrittyBrick
<RedGrittyBrick(a)spamweary.invalid> wrote, quoted or indirectly quoted
someone who said :

>
>What do you mean by "UTF-8 char code"? Strictly speaking there is no
>such thing. You might mean "Unicode code-point" or "sequence of octets
>in UTF8-encoding"

The point of an encoding is that it hides the details of how 16-bit chars
are inserted into an 8-bit stream. All you are interested in is the 16-bit
Java char value, or perhaps the Java code point value if you have 32-bit
chars embedded as well.
--
Roedy Green Canadian Mind Products
http://mindprod.com
Computers are useless. They can only give you answers.
~ Pablo Picasso (born: 1881-10-25 died: 1973-04-08 at age: 91)
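
If what is wanted really is the "sequence of octets in UTF-8 encoding", one
way to look at it is to encode the string explicitly and dump the bytes. This
is a rough sketch, assuming Java 7 or later for StandardCharsets; the class
name Utf8Octets and the helper utf8Hex are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Octets {

    // Hypothetical helper: hex dump of the UTF-8 encoding of a string.
    static String utf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00fc"));        // c3 bc
        System.out.println(utf8Hex("\u34d7"));        // e3 93 97
        System.out.println(utf8Hex("\uD834\uDD1E"));  // f0 9d 84 9e
    }
}
```

The same char or code point value can thus take two, three, or four octets in
UTF-8, which is why the octets and the code point are different questions.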

From: moonhkt on 28 Jan 2010 09:35

Yes, this is what I want.
But my output is not the same as yours. You are correct.

Run in Jcreator 4.5 version
--------------------Configuration: <Default>--------------------
GIVEN STRING IS=ç¾¹?î¢elcome to Rose India ??.
Length of string is 27
CodePoints in string is 26
Character[0] is ç¾¹
Character[1] is ??
Character[2] is W
Character[3] is e
Character[4] is l
Character[5] is c
Character[6] is o
Character[7] is m
Character[8] is e
Character[9] is
Character[10] is t
Character[11] is o
Character[12] is
Character[13] is R
Character[14] is o
Character[15] is s
Character[16] is e
Character[17] is
Character[18] is I
Character[19] is n
Character[20] is d
Character[21] is i
Character[22] is a
Character[23] is
Character[24] is ?
Character[25] is ?
Character[26] is .

Process completed.

On Jan 28, 6:38 pm, RedGrittyBrick <RedGrittyBr...(a)spamweary.invalid>
wrote:
> moonhkt wrote:
> > RedGrittyBrick wrote:
> >> moonhkt wrote:
> >>> Lothar Kimmeringer wrote:
> >>>> moonhkt wrote:
>
> >>>>> The code below does not work.
>
> >>>> [...]
> >>>> Because it doesn't compile. What exactly doesn't work? Do you
> >>>> get wrong output, or do you get an exception (which you ignore in
> >>>> the source you provided)? A bit more information would really help
> >>>> to be able to answer more than "something will be wrong in your
> >>>> code". Regards,
>
> >>> Thanks. I found the example below, but I cannot get the UTF-8 char
> >>> code.
>
> >> What do you mean by "UTF-8 char code"? Strictly speaking there is
> >> no such thing. You might mean "Unicode code-point" or "sequence of
> >> octets in UTF8-encoding"
>
> >> [...]
>
> >> Nothing in your program has anything to do with UTF-8 encoding.
>
> > Hi All, I want to output the characters in the string one by one.
> > Right now, codePointAt just prints the code point values.
>
> Why not use String's length() and charAt() methods?
>
> I assume you can disregard characters outside Unicode's Base
> Multilingual Plane (BMP) - if not, I think you'll have to check for
> surrogate pairs. Characters outside the BMP are too big for a char.
>
> -------------------------------------8<-----------------------------------
> import java.io.PrintStream;
> import java.io.UnsupportedEncodingException;
>
> public class UnicodeChars {
>   public static void main(String[] args)
>       throws UnsupportedEncodingException {
>
>     // I want console output in UTF-8
>     PrintStream sysout = new PrintStream(System.out, true, "UTF-8");
>
>     // \u00fc is LATIN SMALL LETTER U WITH DIAERESIS;
>     // \u34d7 is a character in CJK Unified Ideographs Extension A.
>     // "\uD834\uDD1E" is the surrogate pair for character U+1D11E.
>     // U+1D11E is MUSICAL SYMBOL G CLEF;
>     String a = "\u00fc\u34d7Welcome to Rose India \uD834\uDD1E.";
>
>     int n = a.length();
>     sysout.println("GIVEN STRING IS=" + a);
>     sysout.printf("Length of string is %d%n", n);
>     sysout.printf("CodePoints in string is %d%n",
>         a.codePointCount(0, n));
>     for (int i = 0; i < n; i++) {
>       sysout.printf("Character[%d] is %s%n", i, a.charAt(i));
>     }
>   }
> }
>
> -------------------------------------8<-----------------------------------
> GIVEN STRING IS=ü㓗Welcome to Rose India 𝄞.
> Length of string is 27
> CodePoints in string is 26
> Character[0] is ü
> Character[1] is 㓗
> Character[2] is W
> Character[3] is e
> Character[4] is l
> Character[5] is c
> Character[6] is o
> Character[7] is m
> Character[8] is e
> Character[9] is
> Character[10] is t
> Character[11] is o
> Character[12] is
> Character[13] is R
> Character[14] is o
> Character[15] is s
> Character[16] is e
> Character[17] is
> Character[18] is I
> Character[19] is n
> Character[20] is d
> Character[21] is i
> Character[22] is a
> Character[23] is
> Character[24] is ?
> Character[25] is ?
> Character[26] is .
>
> --
> RGB
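
The loop quoted above steps through the string one char at a time, which is
why the G clef comes out as two '?' entries at indices 24 and 25: each half of
a surrogate pair is printed on its own. A sketch of the surrogate-aware
variant follows, stepping by code point with codePointAt and
Character.charCount. The class name CodePointWalk and the helper characters
are made up for illustration, and the diamond operator assumes Java 7 or
later.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalk {

    // Hypothetical helper: one String per code point, keeping surrogate
    // pairs together instead of splitting them into two chars.
    static List<String> characters(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);   // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "\u00fc\u34d7Welcome to Rose India \uD834\uDD1E.";
        List<String> chars = characters(a);
        System.out.println(chars.size());   // 26, the same as codePointCount
        for (int i = 0; i < chars.size(); i++) {
            System.out.printf("Character[%d] is %s (U+%04X)%n",
                    i, chars.get(i), chars.get(i).codePointAt(0));
        }
    }
}
```

Whether each entry then displays properly is a separate matter of the
console's encoding and font, as the differing outputs in this thread show.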

From: Lew on 28 Jan 2010 14:25

On Jan 28, 12:57 pm, RedGrittyBrick <RedGrittyBr...(a)spamweary.invalid>
wrote:
> PLEASE DON'T TOP-POST, PLEASE PUT YOUR REPLY AT THE BOTTOM, BELOW ANY
> QUOTED TEXT. THANKS!
>

Actually, it's better to post inline, with comments interspersed with
quoted material.

--
Lew
