From: Steven Simpson on
On 23/02/10 17:53, Amith wrote:
> URL url = new URL("http://www.google.com/transliterate/indic?
> tlqt=1&langpair=en|kn&text=namskara%20guru&&tl_app=1");
>

Loaded into Firefox, the charset is reported to be UTF-8. However,
using nc to type the HTTP request directly, the response comes back as:

HTTP/1.1 200 OK
Set-Cookie: S=indic-transliteration=sFKFTAMZZsqRwb6I4zcSWw; path=/; domain=.google.com
Date: Tue, 23 Feb 2010 22:43:57 GMT
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Cache-Control: no-cache, must-revalidate
Content-Type: text/javascript; charset=ISO-8859-1
Set-Cookie: PREF=ID=b732c5deb8245815:TM=1266965037:LM=1266965037:S=kdb9-XF7mkGvw1Ej; expires=Thu, 23-Feb-2012 22:43:57 GMT; path=/; domain=.google.com
Server: TFE/0.0
X-XSS-Protection: 0
Transfer-Encoding: chunked
Connection: close

74

[
{
"ew" : "namskara guru",
"hws" : [
"\u0CA8\u0CAE\u0CCD\u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1",
]
},
]

0


So the server seems to have chosen to send plain ISO-8859-1, with the
characters that charset cannot represent written as JavaScript \uXXXX
escapes, which happen to look the same as Java's.

I'd suggest sending the request with "Accept-Charset: UTF-8", but I
still got Latin1 when I tried.

I used the "User Agent Switcher" extension in Firefox to select a
different User-Agent string (Lynx, for example), and got the Latin1
version there too, so that's what the server is switching on. (That's
an exceedingly daft thing for a server to do, btw, when there's a
perfectly good Accept-Charset header to use instead.)

Looks like you might have to spoof a conventional browser to get the
right charset: set the User-Agent field. Otherwise, you'll have to
decode the characters yourself by parsing out the escape sequences.
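
Both options can be sketched roughly as follows. This is only an
illustration, not tested against the server: the class and method names
are made up, and the User-Agent string is an arbitrary Firefox-like
value that I'm assuming the server would treat as a conventional
browser. The decoder handles only plain four-digit escapes and does not
special-case an escaped backslash in front of the 'u'.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapes {
    // Hypothetical browser-like value; any conventional browser string may do.
    static final String BROWSER_UA =
        "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0";

    // Matches a backslash, a 'u', and exactly four hex digits.
    private static final Pattern ESCAPE =
        Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Option 1: open the connection with a spoofed User-Agent header.
    // Nothing is sent until the caller asks for the input stream.
    static URLConnection openAsBrowser(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", BROWSER_UA);
        return conn;
    }

    // Option 2: replace each backslash-u escape with the character it names.
    // Surrogate pairs written as two escapes come out as two chars, which
    // is already how Java strings store them.
    static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The literal below contains a real backslash followed by u0041.
        System.out.println(decode("\\u0041bc")); // prints "Abc"
    }
}
```

Feeding the "hws" string from the response above through decode()
should give the Kannada text as ordinary Java chars.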

> BufferedReader in = new BufferedReader(
> new InputStreamReader(
> url.openStream(), "UTF8"));
>

Don't assume the charset; parse it out of the Content-Type field.
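
Something along these lines would do it. It's a minimal sketch: the
class and method names are invented for illustration, and the caller
picks the fallback charset to use when the header doesn't name one
(ISO-8859-1 is the traditional HTTP default for text types, but that
choice is an assumption on my part).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {
    // Finds "charset=<name>", optionally quoted, anywhere in the header value.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or the fallback
    // if the header is missing, has no charset parameter, or names a
    // charset this JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        String header = "text/javascript; charset=ISO-8859-1";
        System.out.println(charsetOf(header, StandardCharsets.UTF_8));
    }
}
```

With a URLConnection called conn, the reader would then be built as
new InputStreamReader(conn.getInputStream(),
charsetOf(conn.getContentType(), StandardCharsets.ISO_8859_1)).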

--
ss at comp dot lancs dot ac dot uk

From: Roedy Green on
On Tue, 23 Feb 2010 09:53:47 -0800 (PST), Amith <amithgc(a)gmail.com>
wrote, quoted or indirectly quoted someone who said :

>Hello all,
>
>I have a problem, when i read a webpage contents (with UTF-8
>characterset) and try to display it.. it is just considered as unicode
>string
>please help me
>
>here is the code
>
>
>import java.net.*;
>import java.io.*;
>
> while ((inputLine = in.readLine()) != null)
> fullString = fullString + new String(inputLine.getBytes(),"UTF-8");
>

The text comes in dribs and drabs. See
http://mindprod.com/products.html#HTML for code that reads it properly
and won't go into a tight loop reading empty strings.
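
As a rough idea of the shape (this is a generic sketch, not the code
from that page, and the names are made up): read into a char buffer
until read() reports end of stream, and let the Reader do the decoding
once, rather than re-encoding each line with getBytes().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadAll {
    // Collects everything the reader delivers, however it is chunked.
    // read() blocks until at least one char is available and returns -1
    // at end of stream, so the loop neither spins nor stops early.
    static String readAll(Reader in) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the network stream here.
        System.out.println(readAll(new StringReader("namskara guru")));
    }
}
```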
--
Roedy Green Canadian Mind Products
http://mindprod.com

Imagine an architect who would never admit to making sketches, blueprints or erecting scaffolds. In his view, the finished building speaks for itself. How could a young architect learn from such a man? Mathematicians traditionally refuse ever to disclose the intuitions that lead them to a conjecture, or the empirical tests to see if it were likely true, or the initial proofs. They are like chefs who refuse to disclose their recipes, ingredients or techniques.
From: Joshua Cranmer on
On 02/23/2010 06:06 PM, Steven Simpson wrote:
> I used the "User Agent Switcher" extension in Firefox to select a
> different User-Agent string (Lynx, for example), and got the Latin1
> version there too, so that's what the server is switching on. (That's
> an exceedingly daft thing for a server to do, btw, when there's a
> perfectly good Accept-Charset header to use instead.)

I thought UA sniffing went out of fashion years ago. Then, while doing
other work, I discovered that Google Wave sniffed in a rather limited
manner. Now I see that Google sniffs here too, so I'm never going to
trust Google's actual text results when I see them in the browser window.

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth