From: Ivan Shmakov on 14 Apr 2010 13:43

IIRC, some time ago someone asked here about a better way
to extract specific data from HTML files. Having learned the
basics of the XSLT 1.0 language almost two years ago, I cannot
help feeling that it is such a way.

Consider, e.g.:

$ xsltproc --html href.xsl \
      http://en.wikipedia.org/wiki/
#column-one
#searchInput
/wiki/Wikipedia
…
http://wikimediafoundation.org/wiki/Privacy_policy
/wiki/Wikipedia:About
/wiki/Wikipedia:General_disclaimer
$

The XSLT code is as follows:

$ cat href.xsl
<?xml version="1.0"?>
<!-- -*- XML -*- -->
<!-- href.xsl — Extract the payload of <a href="" /> -->

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="text" />

  <!-- Any element: just descend into its children. -->
  <xsl:template match="*">
    <!--
    <xsl:message>* Processing <xsl:value-of
        select="local-name (.)" />
    </xsl:message>
    -->
    <xsl:apply-templates />
  </xsl:template>

  <!-- <a>: process only its href attribute. -->
  <xsl:template match="a">
    <!--
    <xsl:message>* Processing [a] <xsl:value-of
        select="local-name (.)" />
    </xsl:message>
    -->
    <xsl:apply-templates select="@href" />
  </xsl:template>

  <!-- Print the attribute value followed by a newline. -->
  <xsl:template match="a/@href">
    <xsl:value-of select="." />
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <xsl:template match="@*|text()|comment()">
    <!-- do nothing -->
  </xsl:template>

</xsl:stylesheet>
<!-- href.xsl ends here -->
$

(Sort of Awk-ish, isn't it?)

-- 
FSF associate member #7257

From: pk on 14 Apr 2010 16:29

Ivan Shmakov wrote:

> </xsl:stylesheet>
> <!-- href.xsl ends here -->
> $
>
> (Sort of Awk-ish, isn't it?)

You may want to look into xmlgawk, in case you don't know already.

From: Thomas 'PointedEars' Lahn on 16 Apr 2010 18:28

Ivan Shmakov wrote:

> IIRC, some time ago someone asked here about a better way
> to extract specific data from HTML files. Having learned the
> basics of the XSLT 1.0 language almost two years ago, I cannot
> help feeling that it is such a way.

HTML is not necessarily well-formed, so generally you cannot apply XSLT to
it. You can try to convert it to XHTML with e.g. htmltidy(1), and if you
are lucky you can apply XSLT to the result (BTDT), or you can transform
XML/XHTML to HTML with XSLT.

> Consider, e.g.:
>
> $ xsltproc --html href.xsl \
>       http://en.wikipedia.org/wiki/

This works by coincidence because the referred original document is written
in Valid XHTML, not HTML. However, for extracting specific data out of
markup documents you would use XPath directly; XSLT using XPath is a
possibility (and a less efficient one at that), but not a necessity.

What does this have to do with *x shells anyway?

> (Sort of Awk-ish, isn't it?)

Yes, like PHP 5 is sort of C++-ish. (Was that your question?)

PointedEars
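
For comparison, the "XPath directly" suggestion above can be sketched without any stylesheet. This is only an illustration, not something posted in the thread: it uses Python's standard xml.etree.ElementTree module, which implements just a subset of XPath 1.0, and the sample markup and the helper names extract_hrefs and is_well_formed are made up for the example. It also shows the well-formedness caveat: a strict XML parser rejects tag soup outright.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML fragment, invented for this illustration.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a href="#column-one">Jump</a></p>
    <p><a href="/wiki/Wikipedia:About">About</a>
       <a name="no-href">plain anchor</a></p>
  </body>
</html>"""

# Prefix-to-namespace mapping used in the path expression below.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_hrefs(markup):
    """Return the href of every <a href="..."> element, in document order."""
    root = ET.fromstring(markup)
    # ElementTree cannot select attribute nodes (no //a/@href), so select
    # the elements that carry the attribute and read it from each one.
    return [a.get("href") for a in root.findall(".//x:a[@href]", NS)]


def is_well_formed(markup):
    """Report whether a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    for href in extract_hrefs(XHTML):
        print(href)
    # An unclosed <br> is fine in HTML but is not well-formed XML.
    print(is_well_formed("<p>one<br>two</p>"))
```

The namespace mapping is needed because XHTML elements live in the http://www.w3.org/1999/xhtml namespace; a full XPath 1.0 engine would accept //x:a/@href and return the attribute nodes themselves.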