From: Vinay on 11 May 2010 09:34

Is there a library available to parse HTML? I need to extract certain tags, like links and images, from the body.
From: Captain Obvious on 11 May 2010 09:45

V> Is there a library available to parse HTML ? I need to extract certain
V> tags like links and images from the body.

If you don't care about getting it 100% correct for all pages, just use regexes, e.g. the cl-ppcre library. It usually works very well when you need to scrape one particular site, not "sites in general".
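The regex approach suggested above could look roughly like the following. This is only a sketch: it assumes Quicklisp is available for loading cl-ppcre, and the names `*sample-html*` and `extract-attribute` are made up for illustration. As the post says, it is not a real parser; it will miss unquoted attribute values and can be fooled by markup inside comments or scripts.

```lisp
;; Assumption: Quicklisp is installed; cl-ppcre is the library named in the post.
(ql:quickload :cl-ppcre)

;; Hypothetical sample input, not taken from the thread.
(defparameter *sample-html*
  "<p><a href=\"http://example.com/\">home</a> <img src=\"logo.png\"> <A HREF='/about'>about</A></p>")

(defun extract-attribute (tag attribute html)
  "Return a list of the quoted values of ATTRIBUTE on each TAG start tag in HTML.
Regex-based and therefore approximate."
  (let ((scanner (cl-ppcre:create-scanner
                  ;; Builds e.g. <a\b[^>]*?\bhref\s*=\s*[\"']([^\"']*)[\"']
                  (format nil "<~A\\b[^>]*?\\b~A\\s*=\\s*[\"']([^\"']*)[\"']"
                          tag attribute)
                  :case-insensitive-mode t))
        (results '()))
    ;; Bind VALUE to the first register group of every match, in order.
    (cl-ppcre:do-register-groups (value) (scanner html)
      (push value results))
    (nreverse results)))

;; Usage:
;; (extract-attribute "a" "href" *sample-html*)
;; (extract-attribute "img" "src" *sample-html*)
```

For a single known site with regular markup this is often enough; for arbitrary pages one of the real parsers mentioned elsewhere in the thread is the safer choice.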
From: Petter Gustad on 11 May 2010 10:02

Vinay <vinay(a)vmmenon.org> writes:

> Is there a library available to parse HTML ? I need to extract certain
> tags like links and images from the body.

Use the Drakma client:
http://weitz.de/drakma/

Or net.html.parser:parse-html; see this older post for a simple example:
http://groups.google.no/group/comp.lang.lisp/msg/cda1a24ac3b50a43

There are several other options available:
http://www.cliki.net/HTML

Petter
From: Vinay on 11 May 2010 10:14

On 2010-05-11 07:02:46 -0700, Petter Gustad <newsmailcomp6(a)gustad.com> said:

> Vinay <vinay(a)vmmenon.org> writes:
>
>> Is there a library available to parse HTML ? I need to extract certain
>> tags like links and images from the body.
>
> Use the Drakma client:
> http://weitz.de/drakma/
>
> Or net.html.parser:parse-html see this older post for a simple example:
> http://groups.google.no/group/comp.lang.lisp/msg/cda1a24ac3b50a43
>
> There's several other options available:
> http://www.cliki.net/HTML
>
> Petter

Thanks.

Closure HTML (http://common-lisp.net/project/closure/closure-html/) seems pretty simple to use. Any other suggestions welcome ...
From: vanekl on 11 May 2010 11:59

Vinay wrote:
snip
>
> Closure HTML (http://common-lisp.net/project/closure/closure-html/)
> seems pretty simple to use. Any other suggestions welcome ...
>

First I tried Python's Beautiful Soup, because I've read many people say it's quite good. It choked on the first complicated page I tried to parse, so I gave that up.

Then I tried Closure HTML (:closure-common :closure-html :cxml-stp), and this was able to parse malformed HTML much better than Beautiful Soup, but it wasn't perfect.

Then I tried using the following Ruby packages,

hpricot
mechanize
open-uri
iconv

and this combination proved the most robust and productive of the three HTML parsing platforms for me. If you are trying to parse misshapen HTML, I highly recommend hpricot, even if you don't know Ruby. It saved me a great deal of time.
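For the Closure HTML route discussed in the thread, pulling links and images out of a page might be sketched as follows. It assumes Quicklisp for loading the :closure-html and :cxml-stp systems named above, and the function name `collect-links-and-images` is hypothetical. The page is parsed into a cxml-stp document and walked recursively; how well this copes with a given malformed page is not something the sketch guarantees.

```lisp
;; Assumption: Quicklisp is installed; these are the systems named in the post.
(ql:quickload '(:closure-html :cxml-stp))

(defun collect-links-and-images (html)
  "Parse the string HTML and return two values: the href values of all
<a> elements and the src values of all <img> elements, in document order."
  (let ((document (chtml:parse html (cxml-stp:make-builder)))
        (links '())
        (images '()))
    ;; Visit every node; only element nodes have names and attributes.
    (stp:do-recursively (node document)
      (when (typep node 'stp:element)
        (let ((name (stp:local-name node)))
          (cond ((equal name "a")
                 (let ((href (stp:attribute-value node "href")))
                   (when href (push href links))))
                ((equal name "img")
                 (let ((src (stp:attribute-value node "src")))
                   (when src (push src images))))))))
    (values (nreverse links) (nreverse images))))

;; Usage (the input string is a made-up example):
;; (collect-links-and-images
;;  "<html><body><a href='/x'>x</a><img src='y.png'></body></html>")
```

The page text itself could be fetched with Drakma's `drakma:http-request`, as suggested earlier in the thread, and passed in as the `html` argument.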