Parse (recovered) corrupt xml files and automatically repair them. [CSharp]


From: Anna on 9 Jan 2010 06:03

I want to parse (recovered) corrupt XML files and automatically repair them
for forensic purposes. (Some elements are missing or not properly closed.)
I know the original XML schema.
(When I read the corrupt XML file, an XmlException is raised which indicates
the problem.)
What's the best approach to solve this problem?

I do appreciate any advice.

Anna
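
For reference, the XmlException mentioned above carries the location of the first well-formedness error. The following is a minimal sketch of reading it; the class and method names (WellFormedness, FirstError) are hypothetical, and only the first error is reported because XmlReader stops there.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WellFormedness
{
    // Returns null when the text is well-formed XML, otherwise a short
    // description of the first error the reader runs into.
    public static string FirstError(string xmlText)
    {
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xmlText)))
            {
                // Reading to the end forces the whole document to be checked.
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            // LineNumber and LinePosition are 1-based positions in the input.
            return $"line {ex.LineNumber}, position {ex.LinePosition}: {ex.Message}";
        }
    }
}
```

The reported position only says where parsing failed, not what the intended markup was, so it is a starting point for a repair rather than a repair in itself.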

From: Anna on 10 Jan 2010 07:49

> If the markup is not well-formed then I don't think any of the XML APIs in
> the .NET framework help, they all want well-formed markup.

I was afraid of that.
So any advice on the best approach to solve this problem? Should I write my
own code?

Anna

From: Anna on 12 Jan 2010 02:02

Thanks, I'll give it a try.

Anna

"Martin Honnen" <mahotrash(a)yahoo.de> wrote in message
news:%23dPB1sfkKHA.5604(a)TK2MSFTNGP04.phx.gbl...
> Anna wrote:
>>> If the markup is not well-formed then I don't think any of the XML APIs
>>> in the .NET framework help, they all want well-formed markup.
>>
>> I was afraid of that.
>> So any advice on what's the best approach to solve this problem, writing
>> my own code ?
>
> You will need to find out exactly which rules the markup you have
> follows, or whether there are any rules at all. The only other
> markup language I know is SGML. It allows omitting certain tags and not
> quoting certain attribute values, but there are clear rules for how the
> parser has to infer elements or find out where an attribute value ends.
> There is a .NET implementation of an SGML parser, SgmlReader
> (http://developer.mindtouch.com/SgmlReader) which can be used to convert
> "HTML tag soup" to XHTML. There is also the HTML Tidy application, which
> does the same. So studying the code of such applications can help.
>
>
> --
>
> Martin Honnen --- MVP XML
> http://msmvps.com/blogs/martin_honnen/

From: Richard.Williams.20 on 21 Jan 2010 15:01

I did something like this in the past, but can't find the code.
Here is what I did.

I defined a template in the form:

m:company
m:department
m:employee
o:salary

This defines the hierarchy of the XML. m: means a mandatory element, o: means
an optional element.

I then parsed the input XML and built a stack of elements, doing the
following as I parsed the file.
- completed incomplete nodes
- ensured that the elements were in the correct hierarchy
- added missing (mandatory) elements with default values

I remember there were some situations where the XML simply could not
be repaired automatically. So this won't be the perfect solution, but
it will be a start. I used biterscripting for easy parsing,
stack-building, etc. Check on http://www.biterscripting.com/helppages_samplescripts.html
if there are any sample scripts you can reuse.
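
Since the thread asks about C#, here is a rough sketch of the template-and-stack idea described above. It is not the original (lost) code. It assumes the template is a single chain in which each entry is nested inside the previous one, uses a deliberately naive regex tokenizer, and makes up the names TemplateRepair, Repair and the "?" default value for illustration. Tags not listed in the template, comments and declarations are dropped, text is trimmed, and broken attribute quoting or stray "&" characters are not repaired.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TemplateRepair
{
    const string DefaultValue = "?"; // hypothetical placeholder for inserted elements

    public static string Repair(string template, string input)
    {
        // Template: one "m:name" or "o:name" per line; entry i is assumed to be
        // the child of entry i-1, so stack depth equals template index.
        var names = new List<string>();
        var mandatory = new List<bool>();
        foreach (var line in template.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var entry = line.Trim();
            if (entry.Length < 3 || entry[1] != ':') continue;
            mandatory.Add(entry[0] == 'm');
            names.Add(entry.Substring(2));
        }

        var output = new StringBuilder();
        // One flag per open element: has it received a child element yet?
        var open = new Stack<bool>();

        void Open(int level, string attrs)
        {
            if (open.Count > 0) { open.Pop(); open.Push(true); } // parent now has a child
            open.Push(false);
            output.Append('<').Append(names[level]).Append(attrs).Append('>');
        }

        void Close()
        {
            int child = open.Count; // template index one level below the top element
            if (!open.Peek() && child < names.Count && mandatory[child])
            {
                // Add the missing mandatory child; give it a default value
                // when nothing mandatory is nested below it.
                Open(child, "");
                bool isLeaf = child + 1 >= names.Count || !mandatory[child + 1];
                if (isLeaf) output.Append(DefaultValue);
                Close();
            }
            output.Append("</").Append(names[open.Count - 1]).Append('>');
            open.Pop();
        }

        // Alternatives: a complete tag, a run of text, or a broken tag fragment.
        var token = new Regex(@"<(/?)([A-Za-z_][\w.\-]*)([^<>]*)>|([^<]+)|<[^<>]*>?");
        foreach (Match m in token.Matches(input))
        {
            if (m.Groups[4].Success) // text: kept only inside an open element
            {
                var text = m.Value.Trim();
                if (open.Count > 0 && text.Length > 0) output.Append(text);
                continue;
            }
            if (!m.Groups[2].Success) continue;        // broken fragment, comment, etc.
            int level = names.IndexOf(m.Groups[2].Value);
            if (level < 0) continue;                   // tag not in the template

            // Complete any unclosed elements at this level or deeper.
            while (open.Count > level) Close();
            if (m.Groups[1].Value == "/") continue;    // close tag handled; strays ignored

            // Restore the hierarchy by opening any missing ancestors.
            while (open.Count < level) Open(open.Count, "");
            string attrs = m.Groups[3].Value;
            bool selfClosing = attrs.EndsWith("/", StringComparison.Ordinal);
            Open(level, selfClosing ? attrs.TrimEnd('/') : attrs);
            if (selfClosing) Close();
        }
        while (open.Count > 0) Close();                // close whatever is still open
        return output.ToString();
    }
}
```

With the four-line template above (one entry per line), the input <company><employee>Bob<salary>100</employee><employee>Eve comes back as <company><department><employee>Bob<salary>100</salary></employee><employee>Eve</employee></department></company>. For forensic work it would probably be worth logging every inserted or dropped token instead of changing the input silently.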
