Facing exception: Invalid byte 2 of 4-byte UTF-8 sequence. [Java Programming]


From: dk on 21 Jan 2010 05:13

Hi All,

When I use some UTF-8 characters in my XML and parse it with the JDOM
parser, I get the exception below:

Malformed XML, Caused by: 'Invalid byte 2 of 4-byte UTF-8 sequence.'
    at com.clarify.boss.utility.xml.SimpleXmlParser.build(SimpleXmlParser.java:236)
    at com.clarify.boss.msf.handler.RespHeaderInitiateHandler.getStandardHeader(RespHeaderInitiateHandler.java:366)
    at com.clarify.boss.msf.handler.RespHeaderInitiateHandler.execute(RespHeaderInitiateHandler.java:289)
    at com.clarify.boss.utility.appcontroller.support.AbstractHandler.execute(AbstractHandler.java:42)
    at com.clarify.boss.utility.appcontroller.support.ApplicationControllerImpl.handleRequest(ApplicationControllerImpl.java:174)
    at com.clarify.boss.utility.appcontroller.support.ApplicationControllerImpl.execute(ApplicationControllerImpl.java:311)
    at com.clarify.boss.msf.support.ServiceFaultPublisherAB.executeImpl(ServiceFaultPublisherAB.java:87)
    at com.clarify.boss.common.base.BossActionBeanBase.execute(BossActionBeanBase.java:125)
    at com.clarify.boss.sa.msf.xbean.InvokeResponseXB.executeImpl(InvokeResponseXB.java:198)
    at com.clarify.cbo.XBeanImpl.baselineExecuteImpl_(XBeanImpl.java:275)
    at com.amdocs.oss.sm.core.common.XBeanBase.baselineExecuteImpl_(XBeanBase.java:75)
    at com.clarify.cbo.XBeanImpl.execute(XBeanImpl.java:197)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:615)
    at com.clarify.sam.JavaDispatch.invokeMethodImp(JavaDispatch.java:396)
    at com.clarify.sam.JavaDispatch.invokeMethod(JavaDispatch.java:348)
    at com.clarify.sam.ActionBeanService.invokeBeanMethod(ActionBeanService.java:509)
    at com.clarify.sam.ActionBeanService.invokeAifOperation(ActionBeanService.java:128)
    at com.clarify.sam.AppFrameworkBindingHandler.executeOperation(AppFrameworkBindingHandler.java:69)
    at com.amdocs.aif.consumer.ServiceContext.executeWithRetries(ServiceContext.java:900)
    at com.amdocs.aif.consumer.ServiceContext.executeOperationImpl(ServiceContext.java:756)
    at com.amdocs.aif.consumer.ServiceContext.executeOperation(ServiceContext.java:676)
    at com.amdocs.aif.consumer.ServiceContext.executeOperation(ServiceContext.java:323)
    at com.clarify.boss.errorhandler.resolver.ResolverLauncherSynchXB.executeImpl(ResolverLauncherSynchXB.java:157)
    ... 35 more
Caused by: org.jdom.input.JDOMParseException: Error on line 72: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:770)
    at com.clarify.boss.utility.xml.SimpleXmlParser.build(SimpleXmlParser.java:231)
    ... 60 more
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
    ... 62 more

I have declared the encoding to be used while parsing, in my xml as
UTF-8:
<?xml version="1.0" encoding="UTF-8"?>

Initially I suspected that the XML backup had some problem, because
the same XML worked as input on the same application server but failed
from a friend's machine. Could this be the cause?

But now I have found something even more interesting. I tried changing
the encoding to ISO-8859-1, i.e. <?xml version="1.0"
encoding="ISO-8859-1"?>, and to my surprise it worked.

Now this has confused me. I thought ISO-8859-1 was a charset that is a
subset of UTF-8. Then why did ISO-8859-1 work where UTF-8 didn't?

And lastly, I can't change this encoding in my XML, because I would
then have to rerun all the regression tests on my application. So
please let me know where I have gone wrong.

The Java code that I'm using is:

/*
 * (non-Javadoc)
 *
 * @see com.clarify.boss.utility.xml.XmlParser#build(org.springframework.core.io.Resource)
 */
public Document build(Resource source) {
    try {
        return (getSystemId() == null
                ? getSaxBuilder().build(source.getInputStream())
                : getSaxBuilder().build(source.getInputStream(), getSystemId()));
    } catch (Exception e) {
        e.printStackTrace();
        BossErrorCode bossErrorCode = new BossErrorCode(ErrorCode.BOSS_MALFORMED_XML);
        throw new BossException(bossErrorCode, new String[] {e.getCause().getMessage()}, e);
    }
}

The SAX builder method is:

/**
 * Getter method for the <b>saxBuilder</b> property
 *
 * @return Returns the saxBuilder.
 */
private PropertyAwareSAXBuilder getSaxBuilder() {
    if (saxBuilder == null) {
        PropertyAwareSAXBuilder myParser = new PropertyAwareSAXBuilder(isValidate());

        myParser.setFeature("http://apache.org/xml/features/validation/schema", isValidate());
        myParser.setFeature("http://xml.org/sax/features/namespaces", true);

        //CatalogResolver myResolver = new CatalogResolver();
        CatalogResolver myResolver = getCatalogResolver();

        myParser.setEntityResolver(myResolver);
        setSaxBuilder(myParser);

        Iterator it = getProperties().keySet().iterator();
        while (it.hasNext()) {
            String name = (String) it.next();
            saxBuilder.setProperty(name, getProperties().get(name));
        }
    }
    return saxBuilder;
}

Regards,
Dhirendra

From: bugbear on 21 Jan 2010 05:15

dk wrote:
> Hi All,
>
> While I'm trying to use some UTF-8 characters in my xml while parsing
> the xml using JDOM parser I'm getting this below exception:

Have you checked that your data IS valid UTF-8 ?

BugBear
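
One way to run the check suggested above is a strict decode that rejects malformed bytes instead of replacing them. The sketch below is only an illustration: the class and method names (Utf8Check, isValidUtf8) are made up for this example, and it uses StandardCharsets, which needs Java 7 or later. It also hints at why the ISO-8859-1 declaration "works": in ISO-8859-1 the byte 0xF1 is simply "ñ", but in UTF-8 any byte of the form 11110xxx announces a 4-byte sequence, so if the next byte is plain ASCII a strict decoder has to reject it, which is a likely (though unconfirmed) source of the "Invalid byte 2 of 4-byte UTF-8 sequence" message.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JDOM or Xerces.
class Utf8Check {

    /** Returns true only if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "<a>se\u00F1or</a>";
        // Same characters, two different byte encodings.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1); // ñ becomes the single byte 0xF1
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);        // ñ becomes 0xC3 0xB1

        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

To test a real file, the bytes could be read with java.nio.file.Files.readAllBytes and passed to the same method.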

From: Roedy Green on 21 Jan 2010 08:26

On Thu, 21 Jan 2010 02:13:27 -0800 (PST), dk <dhirendraism(a)gmail.com>
wrote, quoted or indirectly quoted someone who said :

>
>While I'm trying to use some UTF-8 characters in my xml while parsing
>the xml using JDOM parser I'm getting this below exception:

Partition your problem. Is it that the file is malformed or is the
problem getting the XML parser to understand the file is in UTF-8
encoding?

You can examine your file in a hex viewer if you are familiar with
UTF-8 encoding, or you could feed it to the Sun utility native2ascii
to see if it likes it.

See http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/encoding.html

You could also give up and use entities (NCRs).
see http://mindprod.com/jgloss/xml.html#AWKWARD
--
Roedy Green Canadian Mind Products
http://mindprod.com
Responsible Development is the style of development I aspire to now. It can be summarized by answering the question, "How would I develop if it were my money?" I'm amazed how many theoretical arguments evaporate when faced with this question.
~ Kent Beck (born: 1961 age: 49) , evangelist for extreme programming .
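
The "use entities (NCRs)" fallback mentioned above can be sketched roughly as follows. This is a minimal illustration, not code from any library: NcrEscaper and toNcr are hypothetical names. It rewrites every non-ASCII code point as a numeric character reference, so the resulting text is pure ASCII and reads the same under either a UTF-8 or an ISO-8859-1 declaration. It is meant for text content and attribute values; it does not escape markup characters such as & or <.

```java
// Hypothetical helper for illustration only.
class NcrEscaper {

    /** Replaces each non-ASCII code point with a hexadecimal NCR such as &#xF1;. */
    static String toNcr(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);          // handles surrogate pairs as one code point
            if (cp < 0x80) {
                out.append((char) cp);             // ASCII passes through unchanged
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNcr("se\u00F1or")); // se&#xF1;or
    }
}
```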

From: dk on 21 Jan 2010 10:03

On Jan 21, 6:26 pm, Roedy Green <see_webs...(a)mindprod.com.invalid>
wrote:
> On Thu, 21 Jan 2010 02:13:27 -0800 (PST), dk <dhirendra...(a)gmail.com>
> wrote, quoted or indirectly quoted someone who said :
>
>
>
> >While I'm trying to use some UTF-8 characters in my xml while parsing
> >the xml using JDOM parser I'm getting this below exception:
>
> Partition your problem. Is it that the file is malformed or is the
> problem getting the XML parser to understand the file is in UTF-8
> encoding?
>
> You can examine your file in a hex viewer if you are familiar with
> UTF-8 encoding, or you could feed it to the Sun utility native2ascii
> to see if it likes it.
>
> See http://mindprod.com/jgloss/utf.html
> http://mindprod.com/jgloss/encoding.html
>
> You could also give up and use entities (NCRs).
> see http://mindprod.com/jgloss/xml.html#AWKWARD
> --
> Roedy Green Canadian Mind Products
> http://mindprod.com
> Responsible Development is the style of development I aspire to now. It can be summarized by answering the question, "How would I develop if it were my money?" I'm amazed how many theoretical arguments evaporate when faced with this question.
> ~ Kent Beck (born: 1961 age: 49) , evangelist for extreme programming .

@BugBear: yes, the XML is well-formed and properly validated.

@Roedy: right now I'm using UltraEdit and inserting the characters
from the ASCII table that it has. I have even tried viewing it in hex
mode, and I got the same value in both places.

Meanwhile I have found something more interesting: while reading the
input stream from my XML, if I explicitly set the encoding to UTF-8 in
getByteStream, it works fine. Is this a Java bug (1.5.0.12), or
something else?
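
One possible explanation for the observation above, offered as an assumption rather than a diagnosis: when bytes are decoded in Java code first (for example with new String(bytes, charset) or an InputStreamReader), malformed sequences are silently replaced with U+FFFD instead of raising an error, so the parser never sees the bad bytes. The sketch below only demonstrates that replacement behaviour; LenientDecode and decodeLeniently are hypothetical names, and nothing here is taken from the poster's getByteStream code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class.
class LenientDecode {

    /** Decodes bytes as UTF-8; malformed input is replaced, never reported. */
    static String decodeLeniently(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xF1 is "ñ" in ISO-8859-1 but an incomplete 4-byte lead byte in UTF-8.
        byte[] bytes = {'a', (byte) 0xF1, 'o'};
        String decoded = decodeLeniently(bytes);

        // No exception is thrown; the bad byte shows up as the replacement character.
        System.out.println(decoded.indexOf('\uFFFD') >= 0); // true
    }
}
```

If this is what is happening, the parse "works" but the original character is lost, so it is worth checking the parsed text for U+FFFD before treating the problem as solved.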

From: Mike Schilling on 21 Jan 2010 13:07

It may be a clue that 4-byte UTF-8 sequences only occur with
surrogates, which there are two reasonable ways to encode:

1. Encode the code point as 4 bytes
2. Encode each 16-bit "char" as 3 bytes

Only 1 is correct, but I'm sure there's lots of non-surrogate-aware
code that does 2.
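
The two encodings listed above can be compared side by side. The sketch below is an illustration under stated assumptions: SurrogateEncoding, hex and encodeCharsSeparately are hypothetical names, U+1F600 is an arbitrary supplementary character chosen for the demo, and encodeCharsSeparately hand-rolls the non-surrogate-aware approach (option 2) purely to show what such bytes look like.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration class.
class SurrogateEncoding {

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Option 2: encodes every 16-bit char on its own, ignoring surrogate pairing. */
    static byte[] encodeCharsSeparately(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600)); // one code point, two chars

        // Option 1: the correct encoding, one 4-byte sequence.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80

        // Option 2: 3 bytes per surrogate, which is not valid UTF-8.
        System.out.println(hex(encodeCharsSeparately(s)));           // ED A0 BD ED B8 80
    }
}
```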
