From: Vinay on
Is there a library available to parse HTML ? I need to extract certain
tags like links and images from the body.


--- news://freenews.netfront.net/ - complaints: news(a)netfront.net ---
From: Captain Obvious on
V> Is there a library available to parse HTML ? I need to extract certain
V> tags like links and images from the body.

If you don't care about getting it 100% correct for all pages, just use
regexes, e.g. with the cl-ppcre library.

It usually works very well when you need to scrape one particular site, not
"sites in general".
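
A minimal sketch of the regex approach, for illustration only. It is written
in Python's re module as a stand-in; the same patterns should carry over to
cl-ppcre, which uses Perl-compatible syntax. The names LINK_RE, IMG_RE and
extract_links_and_images are made up for this example, and the patterns
assume quoted attribute values, so unquoted ones are silently missed.

```python
import re

# Deliberately loose patterns: they only handle quoted attribute values
# and know nothing about comments, scripts, or nested markup.
LINK_RE = re.compile(r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)
IMG_RE = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def extract_links_and_images(html):
    """Return (hrefs, srcs) matched by the two patterns, in document order."""
    return LINK_RE.findall(html), IMG_RE.findall(html)


sample = '<body><a href="/about">About</a> <IMG alt="logo" src=\'logo.png\'></body>'
links, images = extract_links_and_images(sample)
print(links, images)
```

This is the "good enough for one site" trade-off: tune the patterns against
the pages you actually scrape, and expect them to break on markup they were
not written for.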

From: Petter Gustad on
Vinay <vinay(a)vmmenon.org> writes:

> Is there a library available to parse HTML ? I need to extract certain
> tags like links and images from the body.

Use the Drakma client:
http://weitz.de/drakma/


Or net.html.parser:parse-html; see this older post for a simple example:
http://groups.google.no/group/comp.lang.lisp/msg/cda1a24ac3b50a43

There are several other options available:
http://www.cliki.net/HTML
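
For comparison with regexes, here is a rough sketch of what parser-based
extraction looks like. It uses Python's standard html.parser purely as a
stand-in and does not show the API of net.html.parser or any other Lisp
library listed above. The class name LinkImageCollector is hypothetical.

```python
from html.parser import HTMLParser


class LinkImageCollector(HTMLParser):  # hypothetical name, for illustration
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names;
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") is not None:
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src") is not None:
            self.images.append(attrs["src"])


collector = LinkImageCollector()
collector.feed('<body><a href=about.html>About</a><IMG SRC="logo.png"><a name=top></a></body>')
collector.close()
print(collector.links, collector.images)
```

Unlike the regex route, the parser copes with unquoted attribute values and
mixed-case tag names without any extra work in the calling code.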

Petter
From: Vinay on
On 2010-05-11 07:02:46 -0700, Petter Gustad <newsmailcomp6(a)gustad.com> said:

> Vinay <vinay(a)vmmenon.org> writes:
>
>> Is there a library available to parse HTML ? I need to extract certain
>> tags like links and images from the body.
>
> Use the Drakma client:
> http://weitz.de/drakma/
>
>
> Or net.html.parser:parse-html see this older post for a simple example:
> http://groups.google.no/group/comp.lang.lisp/msg/cda1a24ac3b50a43
>
> There's several other options available:
> http://www.cliki.net/HTML
>
> Petter

Thanks.

Closure HTML (http://common-lisp.net/project/closure/closure-html/)
seems pretty simple to use. Any other suggestions welcome ...


From: vanekl on
Vinay wrote:
snip
>
> Closure HTML (http://common-lisp.net/project/closure/closure-html/)
> seems pretty simple to use. Any other suggestions welcome ...
>

First I tried Python's Beautiful Soup because I've read many people say it's
quite good. It choked on the first complicated page I tried to parse, so I
gave that up.

Then I tried Closure HTML (:closure-common :closure-html :cxml-stp), which
parsed malformed HTML much better than Beautiful Soup, but it still wasn't
perfect.

Then I tried using the following Ruby packages,
hpricot
mechanize
open-uri
iconv
and this combination proved the most robust and productive of the three HTML
parsing platforms for me. If you are trying to parse malformed HTML, I highly
recommend hpricot, even if you don't know Ruby. It saved me a great deal
of time.