From: Lew on
dk wrote:
> @BugBear: yeah the xml [sic] is a well formed and properly validated xml [sic].
>

That didn't answer his question. Answer his question.
"Have you checked that your data IS valid UTF-8 ?"

Clearly there is an improperly-encoded character in your XML file.
Find that and fix it.

> @Roedy: right now I'm using UltraEdit and inserting the characters
> from the ASCII table that it has. I have even tried seeing it in hex
> mode and I got the same value from both the places.
>

ASCII != UTF-8.

That hex value for the bad character, does it match the UTF-8 byte
sequence for that character's code point? Is it four bytes long? What
character is it, and what is the hex value you observe? (Note: that's
four questions, so there ought to be four answers.)
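
One way to get the expected hex to compare against is to let Java
encode the character and print the bytes. A minimal sketch; the class
name Utf8Hex and the two sample characters are illustrative choices,
not anything from the original poster's file, and StandardCharsets
needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class Utf8Hex {
    /** Returns the UTF-8 bytes of s as space-separated upper-case hex. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent): two bytes in UTF-8.
        System.out.println(utf8Hex("\u00E9"));                               // C3 A9
        // U+1D11E (musical G clef): outside the BMP, four bytes in UTF-8.
        System.out.println(utf8Hex(new String(Character.toChars(0x1D11E)))); // F0 9D 84 9E
    }
}
```

If the hex editor shows a single byte such as E9 where this prints
C3 A9, the file is not UTF-8 at that position.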

> Meanwhile I have found something more interesting while reading the
> input stream from my xml [sic] if I explicitly define it to be formatted to
> UTF-8 in getByteStream it is working fine. Now, is this a Java bug
> (1.5.0.12)? or something else?
>

It's not a Java bug.
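
For reference, stating the encoding explicitly on a byte stream can be
done through org.xml.sax.InputSource.setEncoding. A minimal sketch,
assuming a DOM parse of an in-memory document; the class name, the
rootText helper, and the sample XML are hypothetical, since the
original code was never posted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ExplicitEncodingParse {
    /** Parses XML bytes, telling the parser which encoding the bytes use. */
    static String rootText(byte[] xmlBytes, String encoding) {
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            // Explicit encoding for the byte stream, instead of relying on detection.
            source.setEncoding(encoding);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Sample document whose bytes really are UTF-8.
        byte[] xml = "<r>caf\u00E9</r>".getBytes(StandardCharsets.UTF_8);
        System.out.println(rootText(xml, "UTF-8"));
    }
}
```

The declared encoding has to match what the bytes actually are; naming
UTF-8 does not repair a file that was saved as something else.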

> Now this has led to a confusion. I thought ISO-8859-1 is a charset

Did you mean "encoding"?

> which is subset of UTF-8. Then why didn't UTF-8 work whereas
> ISO-8859-1 worked?
>

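Because you were wrong. The two encodings differ: they agree only on
the ASCII range (0x00 to 0x7F). ISO-8859-1 stores each of the values
0x80 to 0xFF as a single byte; UTF-8 encodes those same characters as
two bytes.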

If you have an assumption, let's call it a hypothesis, and the
evidence contradicts the hypothesis, then the hypothesis is wrong.
Simple.
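
The hypothesis is easy to test in a few lines. A minimal sketch,
assuming Java 7 or later for StandardCharsets; the class name and the
sample strings are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingCompare {
    /** True if s encodes to the same bytes in ISO-8859-1 and in UTF-8. */
    static boolean sameBytes(String s) {
        return Arrays.equals(
                s.getBytes(StandardCharsets.ISO_8859_1),
                s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Pure ASCII text: both encodings produce identical bytes.
        System.out.println(sameBytes("plain ASCII")); // true
        // U+00E9 is one byte (E9) in ISO-8859-1 but two bytes (C3 A9) in UTF-8.
        System.out.println(sameBytes("caf\u00E9"));   // false
    }
}
```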

--
Lew
From: Arne Vajhøj on
On 21-01-2010 10:03, dk wrote:
> Meanwhile I have found something more interesting while reading the
> input stream from my xml if I exclusively define it to be formatted to
> UTF-8 in getByteStream it is working fine. Now here is this a Java bug
> (1.5.0.12)? or something else?

If you post the XML input and the Java code, then we can
tell you.

Arne
From: Roedy Green on
On Thu, 21 Jan 2010 07:03:23 -0800 (PST), dk <dhirendraism(a)gmail.com>
wrote, quoted or indirectly quoted someone who said :

>@Roedy: right now I'm using UltraEdit and inserting the characters
>from the ASCII table that it has. I have even tried seeing it in hex
>mode and I got the same value from both the places.

You need to know what the hex SHOULD look like.
See http://mindprod.com/jgloss/utf8.html

You need a tool to see what it DOES look like.
See http://www.sweetscape.com/010editor/
http://funduc.com/otsoft.htm#hexview

And a tool to validate the encoding:
http://mindprod.com/jgloss/native2asciiexe.html
http://mindprod.com/applet/ecodingrecogniser.html
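
The same check can also be done from Java with a strict decoder. A
minimal sketch, not a description of how the tools above work; the
class and method names are hypothetical, and StandardCharsets needs
Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    /** True if data decodes as UTF-8 with no malformed or unmappable input. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = {(byte) 0xC3, (byte) 0xA9}; // U+00E9 encoded as UTF-8
        byte[] bad = {(byte) 0xE9, 0x61};         // ISO-8859-1 byte E9, then 'a'
        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false
    }
}
```

Reading the whole XML file into a byte array and passing it to
isValidUtf8 answers the "IS it valid UTF-8" question directly.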


--
Roedy Green Canadian Mind Products
http://mindprod.com
Responsible Development is the style of development I aspire to now. It can be summarized by answering the question, "How would I develop if it were my money?" I'm amazed how many theoretical arguments evaporate when faced with this question.
~ Kent Beck (born: 1961 age: 49), evangelist for extreme programming.
From: bugbear on
Mike Schilling wrote:
> It may be a clue that 4-byte UTF-8 sequences only occur with
> surrogates, which there are two reasonable ways to encode:
>
> 1. Encode the code point as 4 bytes
> 2. Encode each 16-bit "char" as 3 bytes
>
> Only 1 is correct, but I'm sure there's lots of non-surrogate-aware
> code that does 2.
>
>

Good information - thank you.
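
The two encodings Mike describes can be put side by side. A minimal
sketch: the class and method names are made up, U+1D11E is just a
sample character, and the 3-bytes-per-char method is written by hand
to imitate non-surrogate-aware code (the JDK's own UTF-8 encoder does
option 1).

```java
import java.nio.charset.StandardCharsets;

class SurrogateEncoding {
    /** Option 1 (correct): encode the code point itself; 4 bytes when above U+FFFF. */
    static byte[] asCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Option 2 (incorrect): encode each 16-bit char as its own 3-byte sequence.
     * Only meaningful here for code points above U+FFFF, i.e. surrogate pairs.
     */
    static byte[] perCharThreeBytes(int codePoint) {
        char[] units = Character.toChars(codePoint);
        byte[] out = new byte[units.length * 3];
        int i = 0;
        for (char c : units) {
            out[i++] = (byte) (0xE0 | (c >> 12));
            out[i++] = (byte) (0x80 | ((c >> 6) & 0x3F));
            out[i++] = (byte) (0x80 | (c & 0x3F));
        }
        return out;
    }

    /** Space-separated upper-case hex, for comparing against a hex editor. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // musical G clef, surrogate pair D834 DD1E in UTF-16
        System.out.println(hex(asCodePoint(clef)));       // F0 9D 84 9E
        System.out.println(hex(perCharThreeBytes(clef))); // ED A0 B4 ED B4 9E
    }
}
```

A six-byte run starting with ED A0..AF in the hex view is the
signature of option 2.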

BugBear