UTF-8 Reading for URL - not working [Java Programming]


From: Lothar Kimmeringer on 23 Feb 2010 14:45

Amith wrote:

> My problem is the UTF-8 string which i read from the URL is considered
> as unicode.. i need it as UTF-8
>
> i want it to be printed as "ನಮ್ಸ್ಕರಗುರು" and not as "\u0CA8\u0CAE\u0CCD
> \u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1"

What is this line for:
fullString = fullString + new String(inputLine.getBytes(),"UTF-8")

First of all, use StringBuilder rather than String concatenation.
Second, why do you create a byte array from a String, only to
create a new String from it and append that to an existing one?
Just doing
fullString += inputLine
should be enough (and should solve your problem, by the way). As said
above, switch to a StringBuilder as the next step.
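
A minimal sketch of that advice, assuming the stream really does carry UTF-8 bytes. The class and method names (ReadAllLines, readAll) are made up for illustration. The only decoding step is the InputStreamReader; each line is appended as-is.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ReadAllLines {
    // Hypothetical helper: decodes the bytes as UTF-8 exactly once (in the
    // InputStreamReader) and collects the lines in a StringBuilder.
    // Like the original loop, it drops the line terminators.
    static String readAll(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        StringBuilder fullString = new StringBuilder();
        String inputLine;
        while ((inputLine = reader.readLine()) != null) {
            fullString.append(inputLine); // no getBytes()/new String() round trip
        }
        return fullString.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(): two lines of UTF-8 encoded Kannada text.
        byte[] utf8 = "\u0CA8\u0CAE\n\u0CB8".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(utf8));
        System.out.println(text.equals("\u0CA8\u0CAE\u0CB8"));
    }
}
```

The sketch leaves closing the stream to the caller.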

Regards, Lothar
--
Lothar Kimmeringer E-Mail: spamfang(a)kimmeringer.de
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!

From: Amith on 23 Feb 2010 14:58

even if it is fullString = fullString + inputLine;
it doesn't work. I have tried it. Some more useless experiments led me
to this:
fullString = fullString + new String(inputLine.getBytes(),"UTF-8")

From: Lew on 23 Feb 2010 15:12

Amith wrote:
> My problem is the UTF-8 string which i [sic] read from the URL is considered
> as unicode.. i [sic] need it as UTF-8

UTF-8 *is* Unicode!

> i [sic] want it to be printed as "ನಮ್ಸ್ಕರಗುರು" and not as "\u0CA8\u0CAE\u0CCD
> \u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1"

> public class URLReader {
> public static void main(String[] args) throws Exception {
> URL url = new URL("http://www.google.com/transliterate/indic?
> tlqt=1&langpair=en|kn&text=namskara%20guru&&tl_app=1");
> BufferedReader in = new BufferedReader(
> new InputStreamReader(
> url.openStream(), "UTF8"));
>
> String inputLine = "";

No need to initialize 'inputLine' to a value you are just going to throw away.

> String fullString = "";
>
>
> while ((inputLine = in.readLine()) != null)
> fullString = fullString + new String(inputLine.getBytes(),"UTF-8");

This is silly. Just do what Lothar said and add the String to the String.
I'm also pretty sure this isn't correct anyway, because the way you defined the
BufferedReader will have already converted the bytes from UTF-8 on the way in
to 'inputLine', so the no-argument 'getBytes()' will re-encode the String using
the platform's default charset, which is not necessarily UTF-8. Reconverting
those bytes to String using UTF-8 only works when the default charset happens
to be UTF-8; otherwise it corrupts the text. In any event, using
straightforward String concatenation, or as Lothar suggested, StringBuilder
concatenation, should keep encoding issues out of the way.
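
To make the round-trip problem concrete, here is a small sketch. The helper name roundTrip is hypothetical, and the charset is passed explicitly so the result does not depend on the machine; ISO-8859-1 merely stands in for "a default charset that is not UTF-8".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Hypothetical helper: encodes with one charset, then decodes as UTF-8,
    // which is what new String(s.getBytes(), "UTF-8") does when the
    // platform default charset is 'encodeWith'.
    static String roundTrip(String s, Charset encodeWith) {
        return new String(s.getBytes(encodeWith), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String kannada = "\u0CA8\u0CAE";

        // Same charset both ways: the text survives unchanged.
        System.out.println(roundTrip(kannada, StandardCharsets.UTF_8).equals(kannada));

        // ISO-8859-1 cannot represent Kannada, so getBytes() substitutes a
        // replacement byte for each character and the original text is lost.
        System.out.println(roundTrip(kannada, StandardCharsets.ISO_8859_1).equals(kannada));
    }
}
```

Either way the round trip adds nothing: at best it returns the String it started with.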

Strings in Java internally will always be UTF-16.

> String string = fullString.substring(fullString.indexOf("[\"") + 2,
> fullString.indexOf("\",]"));
> System.out.println(string);

This will display the String using the platform's default encoding.
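
If the default encoding is the obstacle, one option is to wrap System.out in a PrintStream with an explicit encoding. This is only a sketch; the names Utf8Console and utf8Stream are invented here, and whether the glyphs actually show up still depends on the terminal and its fonts.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {
    // Hypothetical helper: wraps any OutputStream in an auto-flushing
    // PrintStream that encodes characters as UTF-8 instead of the
    // platform default.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        out.println("\u0CA8\u0CAE\u0CCD");
    }
}
```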

> in.close();

This should be in a 'finally' block tightly associated with the input loop.

> }
> }
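
On the 'finally' point: from Java 7 on, try-with-resources gives the same guarantee as a 'finally' block with less ceremony. A sketch, with made-up names (CloseReader, readAndClose), that takes any Reader so it can be tried without a network connection:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CloseReader {
    // Hypothetical helper: reads every line, and closes the reader when the
    // try block exits, whether normally or because of an exception.
    static String readAndClose(Reader source) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            StringBuilder fullString = new StringBuilder();
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                fullString.append(inputLine);
            }
            return fullString.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAndClose(new StringReader("line one\nline two")));
    }
}
```

Closing the BufferedReader also closes the Reader it wraps.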

Do not use TAB characters for indentation of Usenet posts. Use spaces, up to
four per indent level. To get help you might want to keep the code readable.

--
Lew

From: Lothar Kimmeringer on 23 Feb 2010 16:00

Amith wrote:

> even if it is fullString = fullString + inputLine;

Then it's quite likely that the stream you open is not
delivering bytes of UTF-8 encoded data.

Regards, Lothar

From: markspace on 23 Feb 2010 16:16

Lothar Kimmeringer wrote:
> Amith wrote:
>
>> even if it is fullString = fullString + inputLine;
>
> Then it's quite likely that the stream you open is not
> delivering bytes of UTF-8 encoded data

or the stream actually contains the string "\u0CA8\u0CAE\u0CCD" etc.
I.e., it's UTF-8 with something else encoded on top of that.

Or the problem is he doesn't have the right glyphs installed on his
system, so he can't see the Kannada characters.

All of which sum up to "it's not in the code you've shown us."
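
If the first guess is right and the response body literally contains backslash-u sequences (JSON allows \uXXXX escapes, though the thread does not confirm that this is what the server sends), the escapes have to be decoded after reading. A rough sketch follows; UnicodeUnescape and unescape are invented names, and the regex handles only plain four-digit escapes, not escaped backslashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // Matches a literal backslash, the letter u, and four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each escape sequence with the
    // UTF-16 code unit it names and leaves all other text untouched.
    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Twelve characters of escaped text in, two Kannada characters out.
        String raw = "\\u0CA8\\u0CAE";
        String decoded = unescape(raw);
        System.out.println(raw.length() + " -> " + decoded.length());
    }
}
```

For a real JSON response, a proper JSON parser would be the safer choice, since it handles every escape form.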
