a better way to extract data from XHTML (XML) [Shell]


From: Ivan Shmakov on 14 Apr 2010 13:43

IIRC, some time ago someone asked here about a better way
to extract specific data from HTML files. Having learned the
basics of the XSLT 1.0 language almost two years ago, I cannot
help feeling that XSLT is such a way.

Consider, e. g.:

$ xsltproc --html href.xsl \
http://en.wikipedia.org/wiki/
#column-one
#searchInput
/wiki/Wikipedia
…
http://wikimediafoundation.org/wiki/Privacy_policy
/wiki/Wikipedia:About
/wiki/Wikipedia:General_disclaimer
$

The XSLT code is as follows:

$ cat href.xsl
<?xml version="1.0"?>

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

  <xsl:output method="text" />

  <!-- Any element: descend into its children. -->
  <xsl:template match="*">
    <xsl:apply-templates />
  </xsl:template>

  <!-- An "a" element: process only its href attribute. -->
  <xsl:template match="a">
    <xsl:apply-templates select="@href" />
  </xsl:template>

  <!-- The href of an "a" element: print its value, then a newline. -->
  <xsl:template match="a/@href">
    <xsl:value-of select="." />
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Other attributes, text, and comments: print nothing. -->
  <xsl:template match="@*|text()|comment()" />

</xsl:stylesheet>

$

(Sort of Awk-ish, isn't it?)

--
FSF associate member #7257

From: pk on 14 Apr 2010 16:29

Ivan Shmakov wrote:

> </xsl:stylesheet>
> 
> $
>
> (Sort of Awk-ish, isn't it?)

You may want to look into xmlgawk, in case you don't know already.

From: Thomas 'PointedEars' Lahn on 16 Apr 2010 18:28

Ivan Shmakov wrote:

> IIRC, some time ago someone asked here about a better way
> to extract specific data from HTML files. Having learned the
> basics of the XSLT 1.0 language almost two years ago, I cannot
> help feeling that XSLT is such a way.

HTML is not necessarily well-formed XML, so in general you cannot apply
XSLT to it. You can try converting it to XHTML first, e.g. with HTML Tidy
(tidy(1)), and if you are lucky you can apply XSLT to the result (BTDT),
or you can transform XML/XHTML to HTML with XSLT.
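
To illustrate the well-formedness point, here is a minimal sketch of a different route: a tolerant HTML parser that does not need well-formed input at all. It uses Python's standard html.parser module rather than XSLT. The names HrefCollector, extract_hrefs_from_html and TAG_SOUP are made up for this example, and the sample markup is invented.

```python
from html.parser import HTMLParser

# Invented sample: unclosed <p> and <a> elements, upper-case names.
# This is not well-formed XML, so a strict XML parser would reject it.
TAG_SOUP = (
    '<p>first<br><a href="/wiki/Wikipedia">W'
    '<p>second <A HREF="#searchInput">s</a>'
)


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag that is seen."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs_from_html(text):
    """Return all href values of <a> elements, in document order."""
    collector = HrefCollector()
    collector.feed(text)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    for href in extract_hrefs_from_html(TAG_SOUP):
        print(href)
```

The parser works on start tags as they arrive, so missing end tags do not matter for this particular task. It would not be enough for queries that depend on the document tree.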

> Consider, e. g.:
>
> $ xsltproc --html href.xsl \
> http://en.wikipedia.org/wiki/

This works by coincidence because the referenced document is written in
Valid XHTML, not HTML. However, for extracting specific data out of
markup documents you would use XPath directly; XSLT using XPath is a
possibility (and a less efficient one at that), but not a necessity.
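
As a rough sketch of what "XPath directly" can look like, the following uses the limited XPath subset in Python's standard xml.etree.ElementTree module. It assumes the input is well-formed XHTML in the XHTML namespace. The names extract_hrefs and SAMPLE are made up for this example, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Invented, well-formed XHTML sample; the second <a> has no href.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a href="#column-one">skip</a> <a name="top">anchor only</a></p>
    <p><a href="/wiki/Wikipedia">Wikipedia</a></p>
  </body>
</html>"""


def extract_hrefs(xhtml_text):
    """Return the href of every XHTML <a> element that has one."""
    root = ET.fromstring(xhtml_text)
    # Elements of a real XHTML document live in the XHTML namespace,
    # so the path needs a prefix bound to that namespace.
    namespaces = {"x": XHTML_NS}
    return [a.get("href") for a in root.iterfind(".//x:a[@href]", namespaces)]


if __name__ == "__main__":
    for href in extract_hrefs(SAMPLE):
        print(href)
```

The path .//x:a[@href] selects the same elements the stylesheet's "a" template handles, with the [@href] predicate skipping anchors that carry no href. ElementTree cannot select attribute nodes themselves, so the attribute value is read from each matched element.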

What does this have to do with *x shells anyway?

> (Sort of Awk-ish, isn't it?)

Yes, like PHP 5 is sort of C++-ish. (Was that your question?)

PointedEars
