ignoring namespaces? [Perl]


From: bugbear on 7 Jun 2010 04:55

Peter Flynn wrote:
> bugbear wrote:
> [...]
>> I also considered walking the entire tree REMOVING namespaces,
>> but that doesn't sound like a high performance solution.
>
> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?

Given that my problem
(corrupt data) cannot be solved
by a "squeaky clean" solution (*),
that's strangely appealing.

BugBear

(*) "one cannot proceed from the informal
to the formal by formal means" Alan Perlis

From: Peter Flynn on 7 Jun 2010 07:32

bugbear wrote:
> Peter Flynn wrote:
>> bugbear wrote:
>> [...]
>>> I also considered walking the entire tree REMOVING namespaces,
>>> but that doesn't sound like a high performance solution.
>>
>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?
>
> Given that my problem (corrupt data) cannot be solved by a "squeaky
> clean" solution (*), that's strangely appealing.

I always counsel to avoid the non-XML approach because it carries no
guarantee that the object you elect to operate on is actually what you
think it is.

(At least, a formal XML method like XSLT/XPath doesn't have any
"guarantee" as such, but at least I can be reasonably certain that if I
select the fifth paragraph of section 4 of chapter 6, then that is what
I will get, leaving aside my own programming errors.)

But there are times (and invalid XML is one of them) when a combination
of sed, awk, grep, tr, and the rest of the tribe, including Perl, Emacs,
Python, and your own personal favourite, is the only viable solution.

sed has the advantage and disadvantage of being spectacularly fast: get
it wrong and it will eat your data. Properly tested, however, the above
will remove all namespace prefixes from element type names within the
document element. It will not remove the xmlns:* namespace binding
attributes from the root element start-tag, nor will it remove
namespace prefixes from attributes anywhere (the addition of more REs,
alternations, subexpressions, and backreferences to achieve this is left
as an exercise to the reader :-). Because it is unparsed, it *will*
remove the namespace prefixes from examples of XML markup in CDATA
marked sections in documentation, for example.
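
A rough Perl rendering of that substitution may make the limits concrete. This is a sketch, not the sed one-liner itself: the sub name strip_element_prefixes and the sample markup are made up for illustration, and the pattern is deliberately tightened so that the prefix may contain only name characters, which stops it running past the element name the way [^:]* can.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): strip a namespace prefix
# from element names only. The prefix is limited to name characters,
# so the match cannot extend into the attributes of an unprefixed tag.
sub strip_element_prefixes {
    my ($text) = @_;
    $text =~ s{<(/?)[A-Za-z_][\w.-]*:}{<$1}g;
    return $text;
}

# Made-up sample: a binding attribute, a prefixed attribute, and
# example markup inside a CDATA marked section.
my $sample = join '',
    '<doc xmlns:a="urn:x">',
    '<a:item a:id="1">text</a:item>',
    '<![CDATA[<a:item/>]]>',
    '</doc>';

print strip_element_prefixes($sample), "\n";
```

In the result the element names lose their prefix, while xmlns:a and a:id are left alone, and the example markup inside the CDATA section is rewritten as well, which is exactly the unparsed behaviour described above.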

P. Lepin wrote:
> Peter Flynn wrote:
>> bugbear wrote:
>> [...]
>>> I also considered walking the entire tree REMOVING namespaces,
>>> but that doesn't sound like a high performance solution.
>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?
>
> Haven't posted anything for a long while, but I cannot keep quiet
> after seeing this.
>
> That's barbarous, sir! Just barbarous!
> (smileys implied)

Peh. I have seen *far* worse [better], both in the Humanities and the
Natural Sciences, trying to coerce evilly-formed documents into XML :-)

///Peter
--
XML FAQ: http://xml.silmaril.ie/

From: Martijn Lievaart on 7 Jun 2010 08:53

On Mon, 07 Jun 2010 09:55:07 +0100, bugbear wrote:

> Peter Flynn wrote:
>> bugbear wrote:
>> [...]
>>> I also considered walking the entire tree REMOVING namespaces, but
>>> that doesn't sound like a high performance solution.
>>
>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?
>
> Given that my problem
> (corrupt data) cannot be solved
> by a "squeaky clean" solution (*),
> that's strangely appealing.

It is also very error prone, but may be acceptable. To improve on the
above solution, split it into two steps. In the first step, a custom
program (instead of sed) cleans up the files and produces clean files
without namespaces; in the second step, your program(s) process those
clean files.

By creating a separate program for the first step, you can have it do
checks to see if the output it produces is sensible and die (to let you
investigate the problem) if it is not.

After cleaning the files, the programs that process them (second step)
don't have to carry convoluted logic to deal with the dirty files.

M4
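
A minimal sketch of such a first-step program, under stated assumptions: the sub name clean_or_die and both sanity checks are invented here for illustration, the checks are cheap regex heuristics rather than real XML parsing, and markup hidden in comments or CDATA sections would confuse them.

```perl
use strict;
use warnings;

# Hypothetical first-step cleaner: strip element prefixes, then refuse
# to hand over output that fails some cheap sanity checks.
sub clean_or_die {
    my ($dirty) = @_;
    (my $clean = $dirty) =~ s{<(/?)[A-Za-z_][\w.-]*:}{<$1}g;

    # Check 1: no prefixed element name may survive (e.g. from <a:b:c>).
    die "prefixed tag survived: $1\n"
        if $clean =~ m{(</?[A-Za-z_][\w.-]*:[\w.-]+)};

    # Check 2: start-tags and end-tags must balance.
    # Empty-element tags (ending in "/>") are not counted.
    my $open  = () = $clean =~ m{<[A-Za-z_][^<>]*(?<!/)>}g;
    my $close = () = $clean =~ m{</[A-Za-z_][^<>]*>}g;
    die "unbalanced tags: $open start, $close end\n" if $open != $close;

    return $clean;
}

my $clean = clean_or_die(
    '<monday:Application Name="App1"/>'
  . '<AppProfile:Guid>586e3456dt</AppProfile:Guid>'
);
print "$clean\n";
```

The second-step programs then read only what clean_or_die returned, so a file that trips either check stops the pipeline instead of being processed half-cleaned.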

From: sln on 7 Jun 2010 18:02

On Mon, 07 Jun 2010 12:32:15 +0100, Peter Flynn <peter.nosp(a)m.silmaril.ie> wrote:

>bugbear wrote:
>> Peter Flynn wrote:
>>> bugbear wrote:
>>> [...]
>>>> I also considered walking the entire tree REMOVING namespaces,
>>>> but that doesn't sound like a high performance solution.
>>>
>>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?
>>
>> Given that my problem (corrupt data) cannot be solved by a "squeaky
>> clean" solution (*), that's strangely appealing.
>
>I always counsel to avoid the non-XML approach because it carries no
>guarantee that the object you elect to operate on is actually what you
>think it is.
>
>(At least, a formal XML method like XSLT/XPath doesn't have any
>"guarantee" as such, but at least I can be reasonably certain that if I
>select the fifth paragraph of section 4 of chapter 6, then that is what
>I will get, leaving aside my own programming errors.)
>
>But there are times (and invalid XML is one of them) when a combination
>of sed, awk, grep, tr, and the rest of the tribe, including Perl, Emacs,
>Python, and your own personal favourite, is the only viable solution.
>
>sed has the advantage and disadvantage of being spectacularly fast: get
>it wrong and it will eat your data. Properly tested, however, the above
>will remove all namespace prefixes from element type names within the
>document element. It will not remove the xmlns:* namespace binding
>attributes from the root element start-tag, nor will it remove
>namespace prefixes from attributes anywhere (the addition of more REs,
>alternations, subexpressions, and backreferences to achieve this is left
>as an exercise to the reader :-). Because it is unparsed, it *will*
>remove the namespace prefixes from examples of XML markup in CDATA
>marked sections in documentation, for example.
>

This might parse it (with a slight bit of validation)
using regex, while changing just specific parts of the source xml
dealing with namespace in tags and/or attributes.

-sln

# -----------------------------------------------------------
# rx_xml_fixnamespace.pl
# -sln, 6/7/2010
#
# Util to search/replace xml namespace from tags/attributes
# -----------------------------------------------------------

use 5.010;    # \K and named captures (?<NAME>...) need Perl 5.10 or later
use strict;
use warnings;

## Initialization
##

my $Name        = "[A-Za-z_:][\\w:.-]*";
my $SkipName    = "[A-Za-z_][\\w.-]*";
my $rxskip_tag  = "(?: $SkipName )";    # Tag names to skip (no prefix)
my $rxskip_attr = "(?: $SkipName )";    # Attribute names to skip (no prefix)
my $rxtag       = "(?: $Name )";        # Tag names to fix
my $rxattr      = "(?: $Name )";        # Attribute names to fix

use re 'eval';
my $topen = 0;

my $Rxmarkup = qr
{
  (?(?{$topen}) # Begin Conditional

      # Have open <TAG> ?
      (?:
          # Try to match next attribute
          (?:
              \s*=\s* (?:".*?"|'.*?') \K
            |
              \s* (?<=\s)
              (?: $rxskip_attr \K | \K (?<ATTR> $rxattr) )
              (?= \s*=\s* (?:".*?"|'.*?'))
          )
          (?= [^>]*? \s* /? > )
        |
          # No more attr's
          (?{$topen = 0})
      )
    |
      # Look for new open or close <TAG>
      (?:
          [^<]*
          (?:
              # Things that hide markup:
              # - Comments/CDATA
              (?: <!
                  (?:
                      \[CDATA\[.*?\]\]
                    | --.*?--
                    | \[[A-Z][A-Z\ ]*\[.*?\]\]
                  )
                  > \K
              )
            |
              # Specific markup we seek:
              # - TAG
              <
              (?:
                  /* $rxskip_tag \K (?= \s* /* >)
                |
                  /* \K (?<TAG> $rxtag ) (?= \s* /* >)
                |
                  (?: $rxskip_tag \K | \K (?<TAG> $rxtag ) )
                  (?= \s [^>]*? \s* /? > )
                  (?{$topen = 1})
              )
          )
        |
          < \K
      )
  ) # End Conditional
}xs;

## Code
##

my $xml = join '', <DATA>;
$xml =~ s/$Rxmarkup/ fixnamespace( $+{TAG}, $+{ATTR} ) /eg;
print "\n",$xml;

exit (0);

## Subs
##

sub fixnamespace {

    if (defined $_[0]) {
        my $tag = $_[0];
        if ($tag =~ s/^[^:]*://) {
            print "Replaced\t$_[0]\n with \t$tag\n";
        }
        return $tag;
    }
    if (defined $_[1]) {
        my $attr = $_[1];
        if ($attr =~ s/^[^:]*://) {
            print "Replaced\t$_[1]\n with \t$attr\n";
        }
        return $attr;
    }
    return "";
}

__DATA__

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>

<Profile xmlns="xxxxxxxxx" name="" version="1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" junk="">

<monday:Application Name="App1" Id="/Local/App/App1"
Id2="/Local/App/App2" services="1" policy=""
StartApp="" Bal="5" sessInt="500" WaterMark="1.0"/>

<AppProfileGuid>586e3456dt</AppProfileGuid>

</Profile>

<Application
Name="App99" Id='/Dummy/Test/iii' Services="3"
policy="99" monday:StartApp="2" Bal="7" sessInt="27"
tuesday:WaterMark="4.3" />

<wednesday:Application Id="/testing"
Name="App100" monday:Id="/Dum
my/Test/iii
" Services="4"
policy="99" StartApp="2" Bal="7" sessInt="27"
WaterMark="4.3"/>

<Application
Name="Yyee" Id="/Dat/Inp/Out" Services="5"
policy="88" StartApp="" Bal="1" sessInt="8"
thrusday:WaterMark="2.1"/>

<![CDATA[ <Applic:ation Name="App" Id=""/> ]]>

<AppProfile:Guid>586e3456dt</AppProfile:Guid>
<AppProfile:Guid>a46y2hktt7</AppProfile:Guid>
<AppProfile:Guid>mi6j77mae6</AppProfile:Guid>
</Profile>

From: Peter Flynn on 9 Jun 2010 17:02

sln(a)netherlands.com wrote:
> On Mon, 07 Jun 2010 12:32:15 +0100, Peter Flynn <peter.nosp(a)m.silmaril.ie> wrote:
[...]
>> I always counsel to avoid the non-XML approach
[...]
> This might parse it (with a slight bit of validation)

It occurs to me that you can combine both methods, iff the document is
well-formed.

Run onsgmls -wxml /usr/share/xml/declaration/xml.dcl doc.xml >doc.esis
to get the ESIS, and then tweak the W3C's esis2xml.py script to re-form
the XML document, omitting the namespaces. Or write your own in Perl...
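
A rough idea of what "write your own in Perl" could look like, strictly as a sketch. It assumes the usual ESIS record types as I understand them from the nsgmls output format ("A" attribute lines preceding the "(" start-element line, ")" for end-elements, "-" for data with backslash escapes); the sub names are invented here, other record types are simply ignored, and none of this has been checked against real onsgmls output.

```perl
use 5.010;
use strict;
use warnings;

# Undo the backslash escapes assumed for ESIS data and attribute values:
# \\ (backslash), \n (record end), \| (SDATA bracket), \nnn (octal).
sub esis_unescape {
    my ($s) = @_;
    $s =~ s/\\(\\|n|\||[0-7]{3})/
        $1 eq '\\' ? '\\' : $1 eq 'n' ? "\n" : $1 eq '|' ? '' : chr(oct($1))
    /ge;
    return $s;
}

# Escape text for use in XML content or a double-quoted attribute value.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Drop everything up to and including the first colon of a name.
sub local_name {
    my ($name) = @_;
    $name =~ s/^[^:]*://;
    return $name;
}

# Hypothetical converter: ESIS lines in, namespace-free XML text out.
sub esis_to_xml {
    my @lines = @_;
    my $xml   = '';
    my @attrs;
    for my $line (@lines) {
        chomp $line;
        next unless $line =~ /^(.)(.*)$/s;
        my ($cmd, $arg) = ($1, $2);
        if ($cmd eq 'A') {
            # "Aname TYPE value": belongs to the next "(" line.
            my ($name, $type, $value) = split / /, $arg, 3;
            next if ($type // '') eq 'IMPLIED';
            next if $name =~ /^xmlns(?::|$)/;    # drop namespace bindings
            push @attrs, sprintf ' %s="%s"', local_name($name),
                xml_escape(esis_unescape($value // ''));
        }
        elsif ($cmd eq '(') {
            $xml .= '<' . local_name($arg) . join('', @attrs) . '>';
            @attrs = ();
        }
        elsif ($cmd eq ')') {
            $xml .= '</' . local_name($arg) . '>';
        }
        elsif ($cmd eq '-') {
            $xml .= xml_escape(esis_unescape($arg));
        }
        # Anything else (processing instructions, "C", ...) is ignored.
    }
    return $xml;
}

# Made-up ESIS lines standing in for the contents of doc.esis.
print esis_to_xml(
    'Axmlns:m CDATA urn:x',
    'AName CDATA App1',
    '(Profile',
    '(m:Guid',
    '-586e3456dt',
    ')m:Guid',
    ')Profile',
    'C',
), "\n";
```

Fed the real file, the call would be something like esis_to_xml(<STDIN>) with doc.esis on standard input. Note that dropping prefixes this way can make two different names collide, so the result deserves a look before anything else consumes it.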

///Peter
