From: bugbear on 3 Jun 2010 05:51

I've been tasked with handling/parsing some XML. It's been spec'd "by committee", and is composed of many sub-specs, all qualified by namespace.

The spec has evolved over time. Old files still exist, conforming to old specs. Old files with faults still exist. Namespace declarations and their use in the files show numerous faults and inconsistencies.

I need to parse (well, "work with") as many of these files as possible.

Since the spec was written by committee, the tag names are enormous (30+ characters!).

Now to my technical question: the use of namespaces in these files is (actually) redundant - the tags are so long, and the tag nesting so over-the-top, that all XPaths are unambiguous.

Since I have to deal with as many files as possible, it would be a convenience to me to simply ignore namespaces.

So - using XML::LibXML, is there a way of using XPaths, without namespaces?

BugBear
From: Joe Kesselman on 3 Jun 2010 20:04

> So - using XML::LibXML, is there a way
> of using XPaths, without namespaces?

Can't vouch for that tool.

You can, if you insist on doing so, write XPaths which specifically test the local name rather than the qualified name:

/*[local-name()="foo"]/@*[local-name()="bar"]

though in some processors the performance of this variant will be inferior to the proper namespace-aware path. And of course the increased verbosity makes it harder to write, harder to read, and harder to maintain.

If at all possible, I really recommend hammering on people to fix the documents and use namespaces correctly. This will continue to cause problems, and not every XML tool will let you construct this sort of workaround. You can pay the cost to fix them now, or you can wait and fix them in a complete panic (probably at greater cost) later.

--
Joe Kesselman, http://www.love-song-productions.com/people/keshlam/index.html
{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
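For XML::LibXML in particular, the local-name() approach described above might look roughly like the sketch below. It is only an illustration, not something posted in the thread: the sample document, its element and attribute names, and the namespace URI are all invented, and it assumes XML::LibXML is installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document; the names and the namespace URI are made up.
my $xml = <<'XML';
<a:order xmlns:a="http://example.com/ns/a" a:id="42">
  <a:item>widget</a:item>
  <a:item>gadget</a:item>
</a:order>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Each step tests only the local name, so the namespace a node
# actually belongs to (if any) plays no part in the match.
my @items = map { $_->textContent }
    $doc->findnodes('/*[local-name()="order"]/*[local-name()="item"]');

# The same trick works on the attribute axis.
my $id = $doc->findvalue('/*[local-name()="order"]/@*[local-name()="id"]');

print "items: @items\n";
print "id: $id\n";
```

No prefix has to be registered for these queries, which is the point when the namespace declarations in the input cannot be trusted. The cost is the verbosity and the possible performance penalty already mentioned.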
From: bugbear on 4 Jun 2010 05:22

Joe Kesselman wrote:
>> So - using XML::LibXML, is there a way
>> of using XPaths, without namespaces?
>
> Can't vouch for that tool.
>
> You can, if you insist on doing so, write XPaths which are specifically
> testing the localname rather than the qualified name
> /*[local-name()="foo"]/@*[local-name()="bar"]
> though in some processors the performance of this variant will be
> inferior to the proper namespace-aware path. And of course the increased
> verbosity makes it harder to write, harder to read, and harder to maintain.

Hmm. Didn't know that. In perl, I could probably overload an entry point that transformed "normal" XPaths into that form.

I also considered walking the entire tree REMOVING namespaces, but that doesn't sound like a high performance solution. I'm only changing to XML::LibXML (from XML::DOM) due to the improved parsing speed.

XML::DOM allows namespaces to be wiped out in the parser:

my $parser = new XML::DOM::Parser(Namespaces => 1);

which is what I currently do. Actually, this feature is in XML::Parser::Expat, of which XML::DOM::Parser is a sub-class.

> * Namespaces
> When this option is given with a true value, then the parser does namespace
> processing. By default, namespace processing is turned off. When it is turned
> on, the parser consumes xmlns attributes and strips off prefixes from element
> and attributes names where those prefixes have a defined namespace. A name's
> namespace can be found using the "namespace" method and two names can be
> checked for absolute equality with the "eq_name" method.

> If at all possible, I really recommend hammering on people to fix the
> documents and use namespaces correctly.

Too late. Legacy applications and legacy files make this impossible.

BugBear
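The idea of transforming "normal" XPaths into the local-name() form could be sketched along these lines. The helper name localize_xpath is hypothetical, and this is deliberately not a real XPath parser: it only rewrites plain, prefix-free name steps such as /a/b/@c, and leaves anything else (predicates, functions, wildcards, explicit axes) untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite each plain name step of a simple location
# path so that it compares local names only.
#   /a/b/@c  ->  /*[local-name()="a"]/*[local-name()="b"]/@*[local-name()="c"]
# Steps that are not a bare name (text(), *, ., item[1], ...) pass through
# unchanged, so they remain namespace-aware.
sub localize_xpath {
    my ($path) = @_;

    # Split on runs of slashes, keeping the separators in the list.
    my @parts = split m{(/+)}, $path;

    for my $step (@parts) {
        next if $step eq '' or $step =~ m{^/+$};
        if ($step =~ /^(@?)([A-Za-z_][\w.-]*)$/) {
            # $1 is '@' for attribute steps, empty for element steps.
            $step = sprintf '%s*[local-name()="%s"]', $1, $2;
        }
    }
    return join '', @parts;
}

print localize_xpath('/order/item/@id'), "\n";
```

A wrapper around whichever query method is in use could call such a helper before passing the path on. Paths with predicates would need a proper XPath tokenizer; this sketch just leaves those steps as they are.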
From: Joe Kesselman on 4 Jun 2010 21:21

bugbear wrote:
>> If at all possible, I really recommend hammering on people to fix the
>> documents and use namespaces correctly.
>
> Too late. Legacy applications and legacy files make this impossible.

Understood. As I say, that's going to continue to add to their costs in the future, but if they can't/won't get everything fixed now, that's their choice.

"The customer is not always right. The customer is the one with the money. Sometimes you have to choose between being right and getting the money."

(This is one reason for always having file formats -- in XML or any other representation -- carry version numbers. That gives you some hope of being able to recognize newer data, and process it more efficiently, while still supporting the "quirks mode" needed by older/sloppier instances.)

--
Joe Kesselman, http://www.love-song-productions.com/people/keshlam/index.html
{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
From: Peter Flynn on 6 Jun 2010 09:15

bugbear wrote:
[...]
> I also considered walking the entire tree REMOVING namespaces,
> but that doesn't sound like a high performance solution.

sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?

///Peter