From: moonhkt on 27 Jan 2010 03:56

Hi All

how to read utf-8 char one by one ?

Below not work.

    import java.nio.charset.Charset ;
    import java.io.*;
    import java.lang.String;

    public class read_utf_char {
        public static void main(String[] args) {
            File aFile = new File("utf8_test.text");
            try {
                String str = "";
                char[] ch = new char[];
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(aFile), "UTF8"));
                while ( in.read(ch) != -1 ) {
                    System.out.print(ch);
                }
            } catch (UnsupportedEncodingException e) {
            } catch (IOException e) {
            }
        }
    }
From: Mayeul on 27 Jan 2010 04:29

moonhkt wrote:
> Hi All
>
> how to read utf-8 char one by one ?
>
> Below not work.

As far as I know, it works if your UTF-8 stream contains only BMP characters (characters with code point 0xFFFF or below). But it is indeed incorrect in the general case, where you can't assume that all characters are in the BMP. This is a known Java limitation: a Java char is a 16-bit UTF-16 code unit, so a character outside the BMP occupies two chars.

In the general case, you just don't read Unicode characters one by one from a stream. Either you convert the stream to a String first (and then use a clever combination of String.codePointAt() and Character.charCount(); read the JavaDoc), or you read looking for your delimiters, storing whatever is *not* a delimiter in a char buffer, untouched. You do not write it out directly.

For instance, BufferedReader implements reading line by line. I suppose other implementations allow reading with a different delimiter.

--
Mayeul
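The String-based approach Mayeul describes could be sketched roughly as follows. This is only an illustration, not code from the thread: the class name CodePointWalk and the method codePointsOf are made up for the example, the input is built in memory instead of being read from utf8_test.text, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
final class CodePointWalk {

    // Walks a String code point by code point, stepping over surrogate pairs.
    static List<Integer> codePointsOf(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // full code point starting at index i
            result.add(cp);
            i += Character.charCount(cp);      // 1 char for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the file contents: U+00FC, U+34D7 and U+1D11E (outside the BMP),
        // encoded as UTF-8 bytes and decoded back into a String.
        byte[] utf8 = "\u00fc\u34d7\uD834\uDD1E".getBytes(StandardCharsets.UTF_8);
        String text = new String(utf8, StandardCharsets.UTF_8);

        for (int cp : codePointsOf(text)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

With a real file, the String would instead come from reading the whole stream through an InputStreamReader, as in the original post. The point of the loop is that the index advances by Character.charCount(cp) rather than by 1, so a surrogate pair is reported as one code point.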
From: Lothar Kimmeringer on 27 Jan 2010 07:17

moonhkt wrote:
> Below not work.
[...]
> char[] ch = new char[];

Because it doesn't compile.

What exactly doesn't work. Do you get a wrong output, do you get an exception (you ignore in the source you provided). A bit more information would really help to be able to answer more than "something will be wrong in your code".

Regards, Lothar
--
Lothar Kimmeringer E-Mail: spamfang(a)kimmeringer.de
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong questions!
From: moonhkt on 27 Jan 2010 10:33

On Jan 27, 8:17 pm, Lothar Kimmeringer <news200...(a)kimmeringer.de> wrote:
> moonhkt wrote:
> > Below not work.
>
> [...]
>
> > char[] ch = new char[];
>
> Because it doesn't compile.
>
> What exactly doesn't work. Do you get a wrong output, do you
> get an exception (you ignore in the source you provided). A
> bit more information would really help to be able to answer
> more than "something will be wrong in your code".
>
> Regards, Lothar
> --
> Lothar Kimmeringer E-Mail: spamf...(a)kimmeringer.de
> PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)
>
> Always remember: The answer is forty-two, there can only be wrong
> questions!

Thank. I get below Example. But I can not get the UTF-8 char code.

    class CodePointAtstring
    {
        public static void main(String[] args)
        {
            // Declaration of String
            String a="\u00fc" + "\u34d7"+ "Welcome to Rose india";
            //Displays the Actual String declared above
            System.out.println("GIVEN STRING IS="+a);
            // Returns the character (Unicode code point) at the specified index.
            System.out.println("Unicode code point at position 0 IN THE STRING IS="+a.codePointAt(0));
            System.out.println("Unicode code point at position 1 IN THE STRING IS="+a.codePointAt(1));
            System.out.println("Unicode code point at position 2 IN THE STRING IS="+a.codePointAt(2));
            System.out.println("Unicode code point at position 3 IN THE STRING IS="+a.codePointAt(3));
            System.out.println("Unicode code point at position 6 IN THE STRING IS="+a.codePointAt(6));
        }
    }

Output

    java CodePointAtstring
    GIVEN STRING IS=³?Welcome to Rose india
    Unicode code point at position 0 IN THE STRING IS=252
    Unicode code point at position 1 IN THE STRING IS=13527
    Unicode code point at position 2 IN THE STRING IS=87
    Unicode code point at position 3 IN THE STRING IS=101
    Unicode code point at position 6 IN THE STRING IS=111
From: RedGrittyBrick on 27 Jan 2010 11:12
moonhkt wrote:
> On Jan 27, 8:17 pm, Lothar Kimmeringer <news200...(a)kimmeringer.de>
> wrote:
>> moonhkt wrote:
>>> Below not work.
>> [...]
>>
>>> char[] ch = new char[];
>> Because it doesn't compile.
>>
>> What exactly doesn't work. Do you get a wrong output, do you
>> get an exception (you ignore in the source you provided). A
>> bit more information would really help to be able to answer
>> more than "something will be wrong in your code".
>>
>> Regards, Lothar
>> --
>> Lothar Kimmeringer E-Mail: spamf...(a)kimmeringer.de
>> PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)
>>
>> Always remember: The answer is forty-two, there can only be wrong
>> questions!
>
> Thank. I get below Example. But I can not get the UTF-8 char code.

What do you mean by "UTF-8 char code"? Strictly speaking there is no such thing. You might mean "Unicode code-point" or "sequence of octets in UTF8-encoding"

>
> class CodePointAtstring
> {
>     public static void main(String[] args)
>     {
>         // Declaration of String
>         String a="\u00fc" + "\u34d7"+ "Welcome to Rose india";
>         //Displays the Actual String declared above
>         System.out.println("GIVEN STRING IS="+a);
>         // Returns the character (Unicode code point) at the specified index.
>         System.out.println("Unicode code point at position 0 IN THE STRING IS="+a.codePointAt(0));
>         System.out.println("Unicode code point at position 1 IN THE STRING IS="+a.codePointAt(1));
>         System.out.println("Unicode code point at position 2 IN THE STRING IS="+a.codePointAt(2));
>         System.out.println("Unicode code point at position 3 IN THE STRING IS="+a.codePointAt(3));
>         System.out.println("Unicode code point at position 6 IN THE STRING IS="+a.codePointAt(6));
>     }
> }
>
> Output
> java CodePointAtstring
> GIVEN STRING IS=³?Welcome to Rose india
> Unicode code point at position 0 IN THE STRING IS=252
> Unicode code point at position 1 IN THE STRING IS=13527
> Unicode code point at position 2 IN THE STRING IS=87
> Unicode code point at position 3 IN THE STRING IS=101
> Unicode code point at position 6 IN THE STRING IS=111
>

That seems completely reasonable to me because 252 = 0x00fc and 13527 = 0x34d7.

Nothing in your program has anything to do with UTF-8 encoding.

--
RGB
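To make the distinction between a Unicode code point and its sequence of octets in UTF-8 concrete, here is a minimal sketch. It is an illustration only, not something posted in the thread: the class name Utf8Octets and the helper utf8Hex are invented for the example, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
final class Utf8Octets {

    // Returns the UTF-8 encoding of a single code point as space-separated hex octets.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The first two characters of the string from the post, plus 'W'.
        String a = "\u00fc\u34d7W";
        for (int i = 0; i < a.length(); ) {
            int cp = a.codePointAt(i);
            System.out.printf("U+%04X (decimal %d) -> UTF-8 octets %s%n",
                    cp, cp, utf8Hex(cp));
            i += Character.charCount(cp);
        }
        // Prints:
        // U+00FC (decimal 252) -> UTF-8 octets C3 BC
        // U+34D7 (decimal 13527) -> UTF-8 octets E3 93 97
        // U+0057 (decimal 87) -> UTF-8 octets 57
    }
}
```

The code point values match the 252 and 13527 in the output above. The octets only appear once the String is explicitly encoded with getBytes, which is the step missing from the posted program.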