From: Lew on 21 Jan 2010 14:43

dk wrote:
> @BugBear: yeah the xml [sic] is a well formed and properly validated xml [sic].
>

That didn't answer his question. Answer his question. "Have you checked that your data IS valid UTF-8 ?"

Clearly there is an improperly-encoded character in your XML file. Find that and fix it.

> @Roedy: write now I'm using ultraEdit and inserting the characters
> from the ASCII table that it has. I have even tried seeing it in hex
> mode and I got the same value from both the places.
>

ASCII != UTF-8. That hex value for the bad character, does it match the UTF-8 code point for that character? It's four bytes long? What character is it, and what is the hex value you observe? (Note: that's four questions, so there ought to be four answers.)

> Meanwhile I have found something more interesting while reading the
> input stream from my xml [sic] if I exclusively define it to be formatted to
> UTF-8 in getByteStream it is working fine. Now here is this a Java bug
> (1.5.0.12)? or something else?
>

It's not a Java bug.

> Now this has led to a confusion. I thought ISO-8859-1 is a charset

Did you mean "encoding"?

> which is subset of UTF-8. Then why didn't UTF-8 work whereas
> ISO-8859-1 worked?
>

Because you were wrong. The two encodings differ.

If you have an assumption, let's call it an hypothesis, and the evidence contradicts the hypothesis, then the hypothesis is wrong. Simple.

--
Lew
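To make "the two encodings differ" concrete: ISO-8859-1 and UTF-8 produce the same bytes only for the ASCII range (0x00 to 0x7F). A rough sketch follows, assuming the troublesome character is something like U+00E9 (the thread never says which character it is). The class name `EncodingDemo` and its helper methods are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    /** Returns true if the bytes decode as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumed sample character: U+00E9 (e with acute accent).
        String s = "\u00E9";
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1: " + hex(latin1)); // E9
        System.out.println("UTF-8:      " + hex(utf8));   // C3 A9

        // The single ISO-8859-1 byte is not a valid UTF-8 sequence on its own.
        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

So a file saved as ISO-8859-1 but declared (or assumed) to be UTF-8 will fail on the first non-ASCII byte, which would be consistent with the symptom described.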
From: Arne Vajhøj on 21 Jan 2010 22:10

On 21-01-2010 10:03, dk wrote:
> Meanwhile I have found something more interesting while reading the
> input stream from my xml if I exclusively define it to be formatted to
> UTF-8 in getByteStream it is working fine. Now here is this a Java bug
> (1.5.0.12)? or something else?

If you post the XML input and the Java code, then we can tell you.

Arne
From: Roedy Green on 22 Jan 2010 03:36

On Thu, 21 Jan 2010 07:03:23 -0800 (PST), dk <dhirendraism(a)gmail.com> wrote, quoted or indirectly quoted someone who said :

>@Roedy: write now I'm using ultraEdit and inserting the characters
>from the ASCII table that it has. I have even tried seeing it in hex
>mode and I got the same value from both the places.

You need to know what the hex SHOULD look like. See
http://mindprod.com/jgloss/utf8.html

You need a tool to see what it DOES look like. See
http://www.sweetscape.com/010editor/
http://funduc.com/otsoft.htm#hexview

And a tool to validate the encoding:
http://mindprod.com/jgloss/native2asciiexe.html
http://mindprod.com/applet/ecodingrecogniser.html

--
Roedy Green Canadian Mind Products
http://mindprod.com
Responsible Development is the style of development I aspire to now. It can be summarized by answering the question, "How would I develop if it were my money?" I'm amazed how many theoretical arguments evaporate when faced with this question.
~ Kent Beck (born: 1961 age: 49), evangelist for extreme programming.
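If no validation tool is at hand, a few lines of Java can report where a file first stops being valid UTF-8. This is only a sketch: `FindBadUtf8` and `firstMalformedOffset` are hypothetical names, and the file path is whatever you pass on the command line. It relies on the strict `CharsetDecoder` leaving the input buffer positioned at the start of the malformed sequence when it reports an error.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class FindBadUtf8 {
    /** Returns the offset of the first malformed byte, or -1 if everything decodes. */
    static int firstMalformedOffset(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // start of the offending sequence
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: discard decoded chars and keep going
        }
    }

    public static void main(String[] args) throws Exception {
        // With no argument, fall back to a small built-in sample:
        // 'a', a lone ISO-8859-1 byte 0xE9, 'b'.
        byte[] data = args.length > 0
            ? Files.readAllBytes(Paths.get(args[0]))
            : new byte[] { 'a', (byte) 0xE9, 'b' };
        int offset = firstMalformedOffset(data);
        if (offset < 0) {
            System.out.println("Valid UTF-8");
        } else {
            System.out.printf("First bad byte 0x%02X at offset %d%n",
                data[offset] & 0xFF, offset);
        }
    }
}
```

The reported offset can then be looked up in a hex viewer such as the ones linked above.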
From: bugbear on 22 Jan 2010 04:53
Mike Schilling wrote:
> It may be a clue that 4-byte UTF-8 sequences only occur with
> surrogates, which there are two reasonable ways to encode:
>
> 1. Encode the code point as 4 bytes
> 2. Encode each 16-bit "char" as 3 bytes
>
> Only 1 is correct, but I'm sure there's lots of non-surrogate-aware
> code that does 2.

Good information - thank you.

BugBear
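The two options Mike lists can be compared byte for byte. The sketch below uses U+1D11E (musical G clef) purely as an example of a character that needs a surrogate pair in a Java String; nothing in the thread says this is the character in question. `SurrogateEncoding` and `threeByteForm` are illustrative names, and `threeByteForm` deliberately reproduces the incorrect option 2.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateEncoding {
    /** Applies the 3-byte UTF-8 bit layout to a single 16-bit char (option 2). */
    static byte[] threeByteForm(char c) {
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),
            (byte) (0x80 | ((c >> 6) & 0x3F)),
            (byte) (0x80 | (c & 0x3F))
        };
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // One code point, stored in Java as two chars: D834 DD1E.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length()); // 2

        // Option 1 (correct): the code point as a single 4-byte sequence.
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_8)));
        // F0 9D 84 9E

        // Option 2 (incorrect UTF-8): each surrogate char as 3 bytes.
        System.out.println(hex(threeByteForm(clef.charAt(0)))
            + " " + hex(threeByteForm(clef.charAt(1))));
        // ED A0 B4 ED B4 9E
    }
}
```

A six-byte run starting with ED A0 through ED BF in a hex dump is therefore a hint that some non-surrogate-aware code wrote the file.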