ignoring namespaces? [Perl]


From: bugbear on 3 Jun 2010 05:51

I've been tasked with handling/parsing some XML.

It's been spec'd "by committee", and is composed
of many sub-spec's all qualified by namespace.

The spec has evolved over time.

Old files still exist, conforming to old specs.
Old files still exist with faults.

Namespace declarations and their use in the files
show numerous faults and inconsistencies.

I need to parse (well, "work with")
as many of these files as possible.

Since the spec was written by committee,
the tag names are enormous (30+ characters!).

Now to my technical question:

The use of namespaces in these files
is (actually) redundant - the tags are so long,
and the tag nesting so over-the-top that all
XPaths are unambiguous.

Since I have to deal with as many files
as possible, it would be a convenience to me
to simply ignore namespaces.

So - using XML::LibXML, is there a way
of using XPaths, without namespaces?

BugBear

From: Joe Kesselman on 3 Jun 2010 20:04

> So - using XML::LibXML, is there a way
> of using XPaths, without namespaces?

Can't vouch for that tool.

You can, if you insist on doing so, write XPaths which are specifically
testing the localname rather than the qualified name
/*[local-name()="foo"]/@*[local-name()="bar"]
though in some processors the performance of this variant will be
inferior to the proper namespace-aware path. And of course the increased
verbosity makes it harder to write, harder to read, and harder to maintain.
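
For XML::LibXML specifically, a minimal sketch of this approach might look like the following. The sample document and the urn:example:* namespace URIs are invented for illustration; it assumes the same element can turn up under different namespace URIs in different generations of the files.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: the same element appears under two different
# namespace URIs, as might happen across revisions of a spec.
my $xml = <<'XML';
<batch>
  <a:record xmlns:a="urn:example:v1"><a:title>old</a:title></a:record>
  <b:record xmlns:b="urn:example:v2"><b:title>new</b:title></b:record>
</batch>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Each step tests the local name only, so prefix and namespace URI are ignored.
my @titles = map { $_->textContent }
    $doc->findnodes('//*[local-name()="record"]/*[local-name()="title"]');

print join(',', @titles), "\n";
```

Both title elements are returned even though they live in different namespaces, which is the point of the workaround and also its risk: same-named elements from unrelated namespaces would match too.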

If at all possible, I really recommend hammering on people to fix the
documents and use namespaces correctly. This will continue to cause
problems, and not every XML tool will let you construct this sort of
workaround. You can pay the cost to fix them now, or you can wait and
fix them in a complete panic (probably at greater cost) later.

--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."

From: bugbear on 4 Jun 2010 05:22

Joe Kesselman wrote:
>> So - using XML::LibXML, is there a way
>> of using XPaths, without namespaces?
>
> Can't vouch for that tool.
>
> You can, if you insist on doing so, write XPaths which are specifically
> testing the localname rather than the qualified name
> /*[local-name()="foo"]/@*[local-name()="bar"]
> though in some processors the performance of this variant will be
> inferior to the proper namespace-aware path. And of course the increased
> verbosity makes it harder to write, harder to read, and harder to maintain.

Hmm. Didn't know that. In perl, I could probably overload an entry point
that transformed "normal" XPaths into that form.
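
A rough sketch of such a transformation, in plain Perl with no XML module involved. The function name ignore_ns_xpath is made up for illustration, and it only handles simple location paths: any step that is not a bare name or @name (wildcards, text(), explicit axes, steps with predicates) is passed through unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite each plain name step of a simple location
# path so that it compares local names only. A prefix, if present, is dropped.
sub ignore_ns_xpath {
    my ($xpath) = @_;
    my $name = qr/[A-Za-z_][\w.-]*/;
    my @steps;
    for my $step (split m{/}, $xpath, -1) {
        my $out = $step;
        if ($step =~ /^\@(?:${name}:)?(${name})$/) {
            # attribute step, e.g. @id or @ns:id
            $out = qq{\@*[local-name()="$1"]};
        }
        elsif ($step =~ /^(?:${name}:)?(${name})$/) {
            # element step, e.g. item or ns:item
            $out = qq{*[local-name()="$1"]};
        }
        push @steps, $out;
    }
    return join '/', @steps;
}

print ignore_ns_xpath('/root/ns:item/@id'), "\n";
```

The result can be handed to findnodes as-is. Splitting on "/" is the weak point: a predicate that itself contains a path would be cut in the wrong place, so anything beyond simple paths would need a real XPath tokenizer.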

I also considered walking the entire tree REMOVING namespaces,
but that doesn't sound like a high performance solution.

I'm only changing to XML::LibXML (from XML::DOM) due
to the improved parsing speed.

XML::DOM allows namespaces to be wiped out in the parser
(my $parser = new XML::DOM::Parser(Namespaces => 1);)
which is what I currently do.

Actually, this feature is in XML::Parser::Expat
of which XML::DOM::Parser is a sub-class.

> * Namespaces
> When this option is given with a true value, then the parser does namespace processing. By default,
> namespace processing is turned off. When it is turned on, the parser consumes xmlns attributes and strips
> off prefixes from element and attributes names where those prefixes have a defined namespace. A name's
> namespace can be found using the "namespace" method and two names can be checked for absolute equality
> with the "eq_name" method.
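
To see the effect of that option in isolation, here is a small sketch using XML::Parser directly rather than XML::DOM. The input string and its urn:example:v1 URI are invented for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;
my $parser = XML::Parser->new(
    Namespaces => 1,    # passed through to XML::Parser::Expat
    Handlers   => {
        Start => sub {
            my ($expat, $element) = @_;
            # With namespace processing on, the name stringifies to the
            # local part; the prefix has already been stripped.
            push @seen, "$element";
        },
    },
);

$parser->parse('<a:root xmlns:a="urn:example:v1"><a:child/></a:root>');
print "@seen\n";
```

As the quoted documentation says, stripping only applies to prefixes with a defined namespace, so files with undeclared prefixes may still show prefixed names.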

> If at all possible, I really recommend hammering on people to fix the
> documents and use namespaces correctly.

Too late. Legacy applications and legacy files make this impossible.

BugBear

From: Joe Kesselman on 4 Jun 2010 21:21

bugbear wrote:
>> If at all possible, I really recommend hammering on people to fix the
>> documents and use namespaces correctly.
>
> Too late. Legacy applications and legacy files make this impossible.

Understood. As I say, that's going to continue to add to their costs in
the future, but if they can't/won't get everything fixed now, that's
their choice.

"The customer is not always right. The customer is the one with the
money. Sometimes you have to choose between being right and getting the
money."

(This is one reason for always having file formats -- in XML or any
other representation -- carry version numbers. That gives you some hope
of being able to recognize newer data, and process it more efficiently,
while still supporting the "quirks mode" needed by older/sloppier
instances.)


From: Peter Flynn on 6 Jun 2010 09:15

bugbear wrote:
[...]
> I also considered walking the entire tree REMOVING namespaces,
> but that doesn't sound like a high performance solution.

sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?

///Peter
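
The same text-level idea can be sketched in Perl, run on the raw file contents before they reach the parser. The helper name strip_tag_prefixes is hypothetical, and the pattern is deliberately narrower than the sed one: it only matches an XML-style name immediately after "<" or "</", so colons inside attribute values are left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the prefix from element start and end tags in
# raw XML text. Attributes, including xmlns declarations, are not touched.
sub strip_tag_prefixes {
    my ($xml) = @_;
    $xml =~ s{<(/?)[A-Za-z_][\w.-]*:}{<$1}g;
    return $xml;
}

print strip_tag_prefixes(
    '<a:root xmlns:a="urn:x"><a:child>hi</a:child></a:root>'
), "\n";
```

This is still a hack rather than a parse: it would also rewrite tag-like text inside comments or CDATA sections, and prefixed attributes whose prefixes were never declared would remain a problem for a namespace-aware parser.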
