From: Steven Simpson on 23 Feb 2010 18:06

On 23/02/10 17:53, Amith wrote:
> URL url = new URL("http://www.google.com/transliterate/indic?
> tlqt=1&langpair=en|kn&text=namskara%20guru&&tl_app=1");
>

Loaded into Firefox, the charset is reported to be UTF-8. However, using nc to type the HTTP request directly, the response comes back as:

HTTP/1.1 200 OK
Set-Cookie: S=indic-transliteration=sFKFTAMZZsqRwb6I4zcSWw; path=/; domain=.google.com
Date: Tue, 23 Feb 2010 22:43:57 GMT
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Cache-Control: no-cache, must-revalidate
Content-Type: text/javascript; charset=ISO-8859-1
Set-Cookie: PREF=ID=b732c5deb8245815:TM=1266965037:LM=1266965037:S=kdb9-XF7mkGvw1Ej; expires=Thu, 23-Feb-2012 22:43:57 GMT; path=/; domain=.google.com
Server: TFE/0.0
X-XSS-Protection: 0
Transfer-Encoding: chunked
Connection: close

74
[
{
"ew" : "namskara guru",
"hws" : [
"\u0CA8\u0CAE\u0CCD\u0CB8\u0CCD\u0C95\u0CB0\u0C97\u0CC1\u0CB0\u0CC1",
]
},
]

0

So the server seems to have chosen to send just ISO-8859-1, with the unacceptable characters encoded according to Javascript, and so they happen to look the same as in Java.

I'd suggest sending the request with "Accept-charset: UTF-8", but I still got Latin1 when I tried. I used the "User Agent Switcher" extension in Firefox to select a different User-Agent string (Lynx, for example), and got the Latin1 version there too, so that's what the server is switching on. (That's an exceedingly daft thing for a server to do, btw, when there's a perfectly good Accept-Charset header to use instead.)

Looks like you might have to spoof a conventional browser to get the right charset: set the User-Agent field. Otherwise, you'll have to decode the characters yourself by parsing out the escape sequences.

> BufferedReader in = new BufferedReader(
> new InputStreamReader(
> url.openStream(), "UTF8"));
>

Don't assume the charset, parse it out from the Content-Type field.

-- 
ss at comp dot lancs dot ac dot uk
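For illustration, the suggestions in the post above (set the User-Agent field, take the charset from the Content-Type field, and otherwise decode the \uXXXX escapes by hand) might look roughly like this in Java. This is a sketch and not code from the thread: the class and method names are made up, the ISO-8859-1 fallback is an assumption, StandardCharsets needs Java 7 or later, and the unescaper only handles plain \uXXXX sequences (it does not treat a doubled backslash specially, so a real JSON parser is the safer choice).

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; none of these names come from the thread.
public class CharsetFromHeader {

    /** Returns the charset named in a Content-Type value, or fallback if absent or unknown. */
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the parameter name "charset=".
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback;
    }

    /** Opens url with the given User-Agent and decodes the body as the server declares. */
    public static Reader open(URL url, String userAgent) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        // getContentType() connects if necessary and returns the Content-Type header value.
        // Assumption: fall back to ISO-8859-1 when the header names no usable charset.
        Charset cs = charsetOf(conn.getContentType(), StandardCharsets.ISO_8859_1);
        return new InputStreamReader(conn.getInputStream(), cs);
    }

    /** Replaces each backslash-u escape (four hex digits) with the character it names. */
    public static String unescapeUnicode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int code = escapeAt(s, i);
            if (code >= 0) {
                out.append((char) code);
                i += 6; // skip the backslash, the 'u' and four hex digits
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** Returns the code unit of a backslash-u escape starting at i, or -1 if there is none. */
    private static int escapeAt(String s, int i) {
        if (i + 6 > s.length() || s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') {
            return -1;
        }
        int code = 0;
        for (int k = i + 2; k < i + 6; k++) {
            int d = Character.digit(s.charAt(k), 16);
            if (d < 0) {
                return -1;
            }
            code = code * 16 + d;
        }
        return code;
    }
}
```

With something like this, open(url, userAgent) would be called with a browser-like User-Agent string of your choosing, and unescapeUnicode would only be needed if the server still answers with the escaped Latin1 form.
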
From: Roedy Green on 23 Feb 2010 19:14

On Tue, 23 Feb 2010 09:53:47 -0800 (PST), Amith <amithgc(a)gmail.com> wrote, quoted or indirectly quoted someone who said :

>Hello all,
>
>I have a problem, when i read a webpage contents (with UTF-8
>characterset) and try to display it.. it is just considered as unicode
>string
>please help me
>
>here is the code
>
>
>import java.net.*;
>import java.io.*;
>
> while ((inputLine = in.readLine()) != null)
> fullString = fullString + new String(inputLine.getBytes(),"UTF-8");
>

The text comes in dribs and drabs. See http://mindprod.com/products.html#HTML for code to do that properly that won't go into a tight loop reading empty strings.

-- 
Roedy Green Canadian Mind Products
http://mindprod.com

Imagine an architect who would never admit to making sketches, blueprints or erecting scaffolds. In his view, the finished building speaks for itself. How could a young architect learn from such a man? Mathematicians traditionally refuse ever to disclose the intuitions that lead them to a conjecture, or the empirical tests to see if it were likely true, or the initial proofs. They are like chefs who refuse to disclose their recipes, ingredients or techniques.
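As a rough sketch of the "dribs and drabs" point (this is not the code behind the link above, and the class and method names are made up): a Reader can hand back fewer characters than the buffer holds on any given call, so the usual approach is to loop until read() returns -1. The characters are already decoded by the Reader, so there is no need for the new String(inputLine.getBytes(), "UTF-8") step from the quoted loop, which round-trips the text through the platform default charset and can corrupt it.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper class; the name is not from the thread.
public class ReadAll {

    /** Reads every character from in until end of stream and returns them as one String. */
    public static String readFully(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        // read() blocks until at least one character is available (for a non-empty
        // buffer) and returns -1 only at end of stream, so this loop does not spin.
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

Unlike the readLine() loop, this keeps the line terminators, and the StringBuilder avoids building a new String on every iteration.
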
From: Joshua Cranmer on 23 Feb 2010 20:11

On 02/23/2010 06:06 PM, Steven Simpson wrote:
> I used the "User Agent Switcher" extension in Firefox to select a
> different User-Agent string (Lynx, for example), and got the Latin1
> version there too, so that's what the server is switching on. (That's
> an exceedingly daft thing for a server to do, btw, when there's a
> perfectly good Accept-Charset header to use instead.)

I thought UA sniffing went out of fashion years ago. Then I discovered that Google Wave sniffed in a rather limited manner when doing other work. Now I see that Google sniffs here too. Now I'm never going to trust Google's actual text results when I see them in the browser window.

-- 
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth