From: bugbear on 7 Jun 2010 04:55

Peter Flynn wrote:
> bugbear wrote:
> [...]
>> I also considered walking the entire tree REMOVING namespaces,
>> but that doesn't sound like a high performance solution.
>
> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?

Given that my problem (corrupt data) cannot be solved
by a "squeaky clean" solution (*), that's strangely appealing.

  BugBear

(*) "one cannot proceed from the informal to the formal
by formal means" Alan Perlis
From: Peter Flynn on 7 Jun 2010 07:32

bugbear wrote:
> Peter Flynn wrote:
>> bugbear wrote:
>> [...]
>>> I also considered walking the entire tree REMOVING namespaces,
>>> but that doesn't sound like a high performance solution.
>>
>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?
>
> Given that my problem (corrupt data) cannot be solved by a "squeaky
> clean" solution (*), that's strangely appealing.

I always counsel to avoid the non-XML approach because it carries no
guarantee that the object you elect to operate on is actually what you
think it is.

(At least, a formal XML method like XSLT/XPath doesn't have any
"guarantee" as such, but at least I can be reasonably certain that if I
select the fifth paragraph of section 4 of chapter 6, then that is what
I will get, leaving aside my own programming errors.)

But there are times (and invalid XML is one of them) when a combination
of sed, awk, grep, tr, and the rest of the tribe, including Perl, Emacs,
Python, and your own personal favourite, are the only viable solution.

sed has the advantage and disadvantage of being spectacularly fast: get
it wrong and it will eat your data. Properly tested, however, the above
will remove all namespace prefixes to element type names within the
document element. It will not remove the xmlns:* namespace binding
attributes from the root element start-tag, nor will it remove
namespace prefixes from attributes anywhere (the addition of more REs,
alternations, subexpressions, and backreferences to achieve this is left
as an exercise to the reader :-). Because it is unparsed, it *will*
remove the namespace prefixes from examples of XML markup in CDATA
marked sections in documentation, for example.

P. Lepin wrote:
> Peter Flynn wrote:
>> bugbear wrote:
>> [...]
>>> I also considered walking the entire tree REMOVING namespaces,
>>> but that doesn't sound like a high performance solution.
>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?
>
> Haven't posted anything for a long while, but I cannot keep quiet
> after seeing this.
>
> That's barbarous, sir! Just barbarous!
> (smileys implied)

Peh. I have seen *far* worse [better], both in the Humanities and the
Natural Sciences, trying to coerce evilly-formed documents into XML :-)

///Peter
--
XML FAQ: http://xml.silmaril.ie/
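
To make the behaviour described above concrete, here is a minimal Python sketch of the same substitution. SED_STYLE is a direct rendering of the sed expression from the thread. TIGHTER is an assumed variant, not something posted in the thread, that only lets the prefix consist of name-like characters. The reason for the variant: `[^:]*` is not confined to the tag name, so on an unprefixed element that carries a prefixed attribute it can run on into the attribute list. Both names and the helper function are hypothetical, and Python's `[^:]` also matches newlines, which line-oriented sed does not.

```python
import re

# Direct rendering of the sed expression: "<", optional slashes, then
# anything up to and including the first colon.
SED_STYLE = re.compile(r"<(/*)([^:]*:)")

# Assumed tighter variant: the prefix must look like an XML name
# (approximated with \w, "." and "-"), so the match cannot leave the tag name.
TIGHTER = re.compile(r"<(/?)[A-Za-z_][\w.-]*:(?=[A-Za-z_])")


def strip_prefixes(pattern, text):
    """Replace each match with "<" plus the optional slash that was captured."""
    return pattern.sub(r"<\1", text)


if __name__ == "__main__":
    samples = [
        '<a:p b:id="1">t</a:p>',   # prefixed element, prefixed attribute
        '<item x:id="1"/>',        # unprefixed element, prefixed attribute
        '<![CDATA[<a:p/>]]>',      # markup example inside CDATA
    ]
    for sample in samples:
        print(sample)
        print("  sed-style:", strip_prefixes(SED_STYLE, sample))
        print("  tighter:  ", strip_prefixes(TIGHTER, sample))
```

As in the description above, neither pattern touches xmlns:* declarations or attribute prefixes on prefixed elements, and both rewrite markup examples inside CDATA sections, because nothing here is parsed.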
From: Martijn Lievaart on 7 Jun 2010 08:53

On Mon, 07 Jun 2010 09:55:07 +0100, bugbear wrote:

> Peter Flynn wrote:
>> bugbear wrote:
>> [...]
>>> I also considered walking the entire tree REMOVING namespaces, but
>>> that doesn't sound like a high performance solution.
>>
>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?
>
> Given that my problem
> (corrupt data) cannot be solved
> by a "squeaky clean" solution (*),
> that's strangely appealing.

It is also very error prone, but may be acceptable.

To improve on the above solution, split it into two steps. In the first
step, a custom program (instead of sed) cleans up the files and produces
clean files without namespaces. In the second step, the other program(s)
process those clean files.

By creating a separate program for the first step, you can have it check
whether the output it produces is sensible and die (to let you
investigate the problem) if it is not. After cleaning the files, all
programs that process them (second step) don't have to carry convoluted
logic to deal with the dirty files.

M4
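
A minimal sketch of such a first-step cleaner, in Python rather than sed. Everything here is an assumption made for illustration: the function name `clean`, the PREFIX pattern, and the two sanity checks chosen (the result must be well-formed, and no element name may still contain a colon). The check uses Python's bundled expat parser with namespace processing left off, so undeclared prefixes do not count as errors.

```python
import re
import sys
import xml.parsers.expat

# Hypothetical pattern: strips a name-like prefix from element start/end tags.
PREFIX = re.compile(r"<(/?)[A-Za-z_][\w.-]*:(?=[A-Za-z_])")


def clean(text):
    """Step one: strip element prefixes, then verify the result or die."""
    cleaned = PREFIX.sub(r"<\1", text)

    leftovers = []

    def on_start(name, attrs):
        # Record any element whose name still carries a prefix.
        if ":" in name:
            leftovers.append(name)

    parser = xml.parsers.expat.ParserCreate()  # no namespace processing
    parser.StartElementHandler = on_start
    try:
        parser.Parse(cleaned, True)
    except xml.parsers.expat.ExpatError as err:
        sys.exit(f"cleaned output is not well-formed: {err}")
    if leftovers:
        sys.exit(f"prefixed elements survived: {leftovers}")
    return cleaned


if __name__ == "__main__":
    # Read dirty XML on stdin, write clean XML on stdout, or die with a message.
    sys.stdout.write(clean(sys.stdin.read()))
```

The second-step programs then only ever see files that passed these checks. Which checks are worth making depends on how the data is corrupt, which the thread does not say.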
From: sln on 7 Jun 2010 18:02

On Mon, 07 Jun 2010 12:32:15 +0100, Peter Flynn <peter.nosp(a)m.silmaril.ie> wrote:

>bugbear wrote:
>> Peter Flynn wrote:
>>> bugbear wrote:
>>> [...]
>>>> I also considered walking the entire tree REMOVING namespaces,
>>>> but that doesn't sound like a high performance solution.
>>>
>>> sed -e "s+<\([/]*\)\([^:]*:\)+<\1+g" ?
>>
>> Given that my problem (corrupt data) cannot be solved by a "squeaky
>> clean" solution (*), that's strangely appealing.
>
>I always counsel to avoid the non-XML approach because it carries no
>guarantee that the object you elect to operate on is actually what you
>think it is.
>
>(At least, a formal XML method like XSLT/XPath doesn't have any
>"guarantee" as such, but at least I can be reasonably certain that if I
>select the fifth paragraph of section 4 of chapter 6, then that is what
>I will get, leaving aside my own programming errors.)
>
>But there are times (and invalid XML is one of them) when a combination
>of sed, awk, grep, tr, and the rest of the tribe, including Perl, Emacs,
>Python, and your own personal favourite, are the only viable solution.
>
>sed has the advantage and disadvantage of being spectacularly fast: get
>it wrong and it will eat your data. Properly tested, however, the above
>will remove all namespace prefixes to element type names within the
>document element. It will not remove the xmlns:* namespace binding
>attributes from the root element start-tag, nor will it remove
>namespace prefixes from attributes anywhere (the addition of more REs,
>alternations, subexpressions, and backreferences to achieve this is left
>as an exercise to the reader :-). Because it is unparsed, it *will*
>remove the namespace prefixes from examples of XML markup in CDATA
>marked sections in documentation, for example.
>

This might parse it (with a slight bit of validation) using regex,
while changing just specific parts of the source xml dealing with
namespace in tags and/or attributes.
-sln

# -----------------------------------------------------------
# rx_xml_fixnamespace.pl
# -sln, 6/7/2010
#
# Util to search/replace xml namespace from tags/attributes
# -----------------------------------------------------------

use strict;
use warnings;

## Initialization
##

my $Name     = "[A-Za-z_:][\\w:.-]*";
my $SkipName = "[A-Za-z_][\\w.-]*";

my $rxskip_tag  = "(?: $SkipName )";  # Skip tags
my $rxskip_attr = "(?: $SkipName )";  # Skip attribute's

my $rxtag  = "(?: $Name )";  # Tags
my $rxattr = "(?: $Name )";  # Attribute's

use re 'eval';

my $topen = 0;

my $Rxmarkup = qr
{
  (?(?{$topen}) # Begin Conditional

     # Have open <TAG> ?
     (?:
        # Try to match next attribute
        (?:
            \s*=\s* (?:".*?"|'.*?') \K
          |
            \s* (?<=\s)
            (?:
                $rxskip_attr \K
              | \K (?<ATTR> $rxattr)
            )
            (?= \s*=\s* (?:".*?"|'.*?'))
        )
        (?= [^>]*? \s* /? > )
      |
        # No more attr's
        (?{$topen = 0})
     )
   |
     # Look for new open or close <TAG>
     (?:
        [^<]*
        (?:
            # Things that hide markup:
            #  - Comments/CDATA
            (?: <!
                (?:
                    \[CDATA\[.*?\]\]
                  | --.*?--
                  | \[[A-Z][A-Z\ ]*\[.*?\]\]
                )
                > \K
            )
          |
            # Specific markup we seek:
            #  - TAG
            <
            (?:
                /* $rxskip_tag \K (?= \s* /* >)
              | /* \K (?<TAG> $rxtag ) (?= \s* /* >)
              | (?: $rxskip_tag \K | \K (?<TAG> $rxtag ) )
                (?= \s [^>]*? \s* /? > )
                (?{$topen = 1})
            )
        )
      |
        < \K
     )
  ) # End Conditional
}xs;

## Code
##

my $xml = join '', <DATA>;

$xml =~ s/$Rxmarkup/ fixnamespace( $+{TAG}, $+{ATTR} ) /eg;

print "\n",$xml;

exit (0);

## Subs
##

sub fixnamespace
{
    if (defined $_[0]) {
        my $tag = $_[0];
        if ($tag =~ s/^[^:]*://) {
            print "Replaced\t$_[0]\n with \t$tag\n";
        }
        return $tag;
    }
    if (defined $_[1]) {
        my $attr = $_[1];
        if ($attr =~ s/^[^:]*://) {
            print "Replaced\t$_[1]\n with \t$attr\n";
        }
        return $attr;
    }
    return "";
}

__DATA__
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<Profile xmlns="xxxxxxxxx" name="" version="1.1"
xmlns:xsi="http:// www.w3.org/2001/XMLSchema-instance" junk="">
<monday:Application Name="App1" Id="/Local/App/App1"
Id2="/Local/App/App2" services="1" policy="" StartApp="" Bal="5"
sessInt="500" WaterMark="1.0"/>
<AppProfileGuid>586e3456dt</AppProfileGuid>
</Profile>
<Application Name="App99" Id='/Dummy/Test/iii' Services="3"
policy="99" monday:StartApp="2" Bal="7" sessInt="27"
tuesday:WaterMark="4.3" />
<wednesday:Application Id="/testing" Name="App100" monday:Id="/Dum my/Test/iii "
Services="4" policy="99" StartApp="2" Bal="7" sessInt="27"
WaterMark="4.3"/>
<Application Name="Yyee" Id="/Dat/Inp/Out" Services="5"
policy="88" StartApp="" Bal="1" sessInt="8"
thrusday:WaterMark="2.1"/>
<![CDATA[ <Applic:ation Name="App" Id=""/> ]]>
<AppProfile:Guid>586e3456dt</AppProfile:Guid>
<AppProfile:Guid>a46y2hktt7</AppProfile:Guid>
<AppProfile:Guid>mi6j77mae6</AppProfile:Guid>
</Profile>
From: Peter Flynn on 9 Jun 2010 17:02

sln(a)netherlands.com wrote:
> On Mon, 07 Jun 2010 12:32:15 +0100, Peter Flynn <peter.nosp(a)m.silmaril.ie> wrote:
[...]
>> I always counsel to avoid the non-XML approach
[...]
> This might parse it (with a slight bit of validation)

It occurs to me that you can combine both methods, iff the document is
well-formed. Run

  onsgmls -wxml /usr/share/xml/declaration/xml.dcl doc.xml >doc.esis

to get the ESIS, and then tweak the W3C's esis2xml.py script to re-form
the XML document, omitting the namespaces. Or write your own in Perl...

///Peter
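
The same parse-then-re-form idea can be sketched in Python. This swaps a different toolchain in for the one suggested above: instead of onsgmls and esis2xml.py it uses Python's bundled expat parser with namespace processing left off, so prefixed names arrive as plain strings and undeclared prefixes are not errors. Like the ESIS route, it only works if the document is well-formed. The function names are hypothetical, and the sketch deliberately drops comments, processing instructions and the XML declaration, and flattens CDATA sections into escaped text.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape, quoteattr


def local(name):
    """Drop everything up to and including the first colon, if any."""
    return name.split(":", 1)[-1]


def strip_namespaces(text):
    """Re-form a well-formed document without prefixes or xmlns declarations."""
    out = []
    parser = xml.parsers.expat.ParserCreate()  # namespace processing stays off
    parser.ordered_attributes = True  # attrs arrive as [name, value, name, ...]

    def start(name, attrs):
        pairs = zip(attrs[0::2], attrs[1::2])
        # Caveat: stripping attribute prefixes can create duplicate attribute
        # names (e.g. Id and monday:Id on one element); this sketch does not check.
        kept = [(local(k), v) for k, v in pairs
                if k != "xmlns" and not k.startswith("xmlns:")]
        attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in kept)
        out.append(f"<{local(name)}{attr_text}>")

    def end(name):
        out.append(f"</{local(name)}>")

    def chars(data):
        # CDATA content also arrives here, so markup examples are escaped
        # rather than rewritten.
        out.append(escape(data))

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(text, True)  # raises ExpatError if not well-formed
    return "".join(out)


if __name__ == "__main__":
    doc = '<a:doc xmlns:a="urn:x"><a:p b:id="1">t</a:p><![CDATA[<a:p/>]]></a:doc>'
    print(strip_namespaces(doc))
```

Because the input is actually parsed, the prefix inside the CDATA example is left alone, which is the one thing the unparsed sed approach cannot promise.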