From: Drew Crampsie on
Drew Crampsie <drewc(a)tech.coop> writes:
> "Captain Obvious" <udodenko(a)users.sourceforge.net> writes:
>
>> V> Is there a library available to parse HTML ? I need to extract certain
>> V> tags like links and images from the body.
>>
>> If you don't care about getting it 100% correct for all pages, just
>> use regexes.
>> E.g. cl-ppcre library.
[snip]
>
> So many nice API's exist for querying and manipulating an XML tree, yet
> you suggest the author roll their own? Why brought this on?
^What

posted this before i had my coffee :)

Anyways, to back this up, here is the cxml-stp code to get all a-hrefs and
img-src's out of almost any mostly well formed HTML page :

CL-USER> (defun show-links-and-images (url)
           (let* ((str (drakma:http-request url))
                  (document (chtml:parse str (cxml-stp:make-builder))))
             (stp:do-recursively (a document)
               (when (and (typep a 'stp:element)
                          (or (equal (stp:local-name a) "img")
                              (equal (stp:local-name a) "a")))
                 (print (or (stp:attribute-value a "src")
                            (stp:attribute-value a "href")))))))
SHOW-LINKS-AND-IMAGES
CL-USER> (show-links-and-images "http://common-lisp.net")

"logo120x80.png"
"http://www.lisp.org/"
"http://www.cliki.net/"
"http://planet.lisp.org/"
"http://www.cl-user.net/"
.... etc

The example code at
http://common-lisp.net/project/closure/closure-html/examples.html
contains all you need to know... this code was a trivial cut and paste
job.
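
(Assuming you have the libraries installed -- with Quicklisp, say, something
like this should pull in everything the example uses, though exactly how you
load them will depend on your setup:)

(ql:quickload '("drakma" "closure-html" "cxml-stp"))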

Cheers,

drewc





From: Captain Obvious on
??>> Have I mentioned parser?

DC> No, but the OP did :

DC> "Is there a library available to parse HTML ? I need to extract certain
DC> tags like links and images from the body."

I thought he might not know that you don't need to fully parse HTML to
extract links and images from the body.
And I think I was pretty clear that it is a half-assed solution.
That's just an option.

DC> My point is that, when someone asks for a parser, telling them that they
DC> can make a crappy half-assed one via regexps is a terrible bit of advice.

I wrote that you can do this in a "crappy half-assed way" without a parser at
all. That is a different thing.

I have used things like that a few times; it took me maybe 5 minutes to get it
working and it worked 100% well (for a certain site).
So what's wrong with it?

DC> Or it might not... hiring cheap labour to do it by hand might work too
DC> and will likely be the most robust of all... but the OP asked for a
DC> parser.

Well... Do you know that 95% of posts here on comp.lang.lisp which start
with "Help me to fix this macro..." are not about macros but about a person
not understanding something?
It just might be that the person looking for a parser can do the thing he
wants without a parser.

If he really needs a parser, he can just ignore my comment and listen to
people who have provided links to various parsers.

So what's wrong with it, really?

??>>>> It usually works very well when you need to scrap a certain site,
??>>>> not "sites in general".

DC> So, if you'd like your code to work more than once, avoid regexps?
DC> I can agree with you there.

Well, I don't know what he is doing. Sometimes people need to scrap a
certain site or a few of them.
Regexes are fine for that. They might break if the layout of the site
changes.
But a parser-based solution might break too (that is, the way you need to
traverse the DOM changes).

??>> Therefore, it is complex and is probably more error prone. Also,
??>> slower.

DC> Non sequitur ... and absolute bollocks.

Of course it might be the other way around, but, generally, more complex
things are more prone to errors.
And when you extract more information and do it in a more complex way, that
takes more time.

E.g. if it is a DOM-based parser, I'm pretty sure that the full DOM of a page
takes more memory than a bunch of links.

DC> I applaud your use of the sexp syntax for regexps, but, this following
DC> code actually fetches any given webpage from the internet and collects
DC> images and links, something similar to what the OP may have wanted.

DC> (defun links-and-images (url &aux links-and-images)
DC>   (flet ((collect (e name attribute)
DC>            (when (equal (stp:local-name e) name)
DC>              (push (stp:attribute-value e attribute)
DC>                    links-and-images))))
DC>     (stp:do-recursively (e (chtml:parse (drakma:http-request url)
DC>                                         (cxml-stp:make-builder)))
DC>       (when (typep e 'stp:element)
DC>         (collect e "a" "href")
DC>         (collect e "img" "src")))
DC>     links-and-images))

DC> Does more, actually readable and works for a significantly larger
DC> portion of the interwebs.

Well, you see, I was not interested in all links on that page, I was
interested only in those with class "title" that are http:// links.
So the regex says exactly that.
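
Something like this, roughly (an untested sketch with cl-ppcre, not the exact
regex I used; it assumes the class attribute comes before href, as it did on
that site, and that the page is already in a string named page):

(cl-ppcre:do-register-groups (href)
    ("<a\\s[^>]*class=\"title\"[^>]*href=\"(http://[^\"]*)\"" page)
  (print href))
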
I don't know exactly what the OP wants; I guess we don't have the full story here.

DC> Can you show me a regexp based solution that meets the op's
DC> specification and works for an equal or greater number of pages?

I don't see a specification here. He did not say that he just wants all links.

DC> If you can show me that then we'll compare speed if you like.

That would be interesting...

DC> Lets make it realistic and a little more interesting than the average
DC> usenet pissing match. Perhaps we can get the OP back

Well, sorry, I probably don't have time for this.

DC> No solution is going to work with all broken html, that's
DC> impossible. Is your advice that, because it's possible it may not work
DC> with a small portion of HTML, to ensure it's limited to an even smaller
DC> portion?

If you use a regex like this to get a tags: <a\s[^>]+> -- I honestly don't see
a lot of ways it can break.
In fact, I don't see any. Well, I wrote this in 5 seconds; if I thought about
it for an hour or so I think I'd have an absolutely bulletproof link-extracting
regex.
The good thing about it is that it absolutely does not care about context. It
might also be a bad thing, depending on what you're doing.
So I don't agree on "smaller portion." Fixing HTML in general is harder than
fixing it for a specific task.
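
E.g. getting the tags out with that regex is a one-liner (again an untested
sketch, assuming the page is already in a string named page):

(cl-ppcre:all-matches-as-strings "(?i)<a\\s[^>]+>" page)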

DC> As i previously stated, i think that's horrible advice.

Ok, ok, I get it. But it is just your opinion :)

DC> The way CLOSURE-HTML (and a good many HTML parsers) work is by cleaning
DC> up the HTML to make an XML-like parse tree, and then using established
DC> API's to work with that. Makes sense to use tools that are designed to
DC> solve the problem you are trying to solve unless those tools are
DC> deficient in some way, non?

Having a full DOM is overkill. If you're only working with small sites that's
ok, but the DOM of a larger page can eat a lot of memory.

DC> That's where they'd end up while attempting to use your advice to solve
DC> their problem, in my opinion. Sure, a regexp based solution can be
DC> convinced to work, but it's not the right tool for the job.

So a DOM parser might not be the right tool either. Sometimes a job requires a
very specific tool and general ones are deficient.

DC> “Whenever faced with a problem, some people say `Lets use AWK.'
DC> Now, they have two problems.” -- D. Tilbrook

Overly simplistic regexes are inferior to formal parsers, but HTML as people
use it is not formally defined, so it is inapplicable here.

From: Drew Crampsie on
"Captain Obvious" <udodenko(a)users.sourceforge.net> writes:

> ??>> Have I mentioned parser?
>
> DC> No, but the OP did :
>
> DC> "Is there a library available to parse HTML ? I need to extract certain
> DC> tags like links and images from the body."
>
> I thought he might not know that you don't need to fully parse HTML to
> extract links and images from the body.

Let's say that was true, and that the OP was quite naive and had never
used regular expressions to extract data from text, or never thought to
apply regular expressions to this particular problem.

If that were the case, i think it would also be likely that the OP was
not all that familiar with regular expressions to begin with. I'm of the
opinion that this is unlikely, and the OP had already rejected regular
expressions as the wrong solution.

In the former case, learning regular expressions in order to extract
links and images from HTML is not something i would recommend.

> And I think I was pretty clear that it is a half-assed
> solution. That's just an option.

It's a terrible solution, and i can't see why you're still defending
it.

>
> DC> My point is that, when someone asks for a parser, telling them that they
> DC> can make a crappy half-assed one via regexps is a terrible bit of advice.
>
> I wrote that you can do this in a "crappy half-assed way" without a
> parser at all. That is a different thing.
>
> I have used things like that a few times; it took me maybe 5 minutes to get
> it working and it worked 100% well (for a certain site).
> So what's wrong with it?

It took me a lot less to get mine working, and it works for more than
one site. If the site changes, mine will still extract the images and
links, yours will not. Also, mine was a complete and working piece of
code.

>
> DC> Or it might not... hiring cheap labour to do it by hand might work too
> DC> and will likely be the most robust of all... but the OP asked for a
> DC> parser.
>
> Well... Do you know that 95% of posts here on comp.lang.lisp which
> start with "Help me to fix this macro..." are not about macros but
> about a person not understanding something?

What you don't seem to understand is that regular expressions, for
extracting things from HTML, are almost always the wrong
solution. Similar to using a macro when a function is what is called
for, or using EVAL when a macro would do.

> It just might be that the person looking for a parser can do the thing he
> wants without a parser.

Since the OP is not around to comment on what their exact needs were, we
have to assume that, having long since chosen closure-html as the thing
that will do the thing he wants, the OP was in fact looking to do the
kind of things a parser does.

> If he really needs a parser, he can just ignore my comment and listen
> to people who have provided links to various parsers.

As they did.

>
> So what's wrong with it, really?

What worries me is not so much that the OP might have taken your advice,
but rather that you thought it was good advice to give. My contrary
demonstrations are as much for your benefit as for those who may still
be following this thread.

> ??>>>> It usually works very well when you need to scrap a certain site,
> ??>>>> not "sites in general".
>
> DC> So, if you'd like your code to work more than once, avoid regexps?
> DC> I can agree with you there.
>
> Well, I don't know what he is doing.

Seemed to me like he was trying to extract links and images and the like
from html.

> Sometimes people need to scrap a
> certain site or a few of them.

'scrape', scrap is what i'd do to your code if you tried to get it past
review.

> Regexes are fine for that.

If you want to write brittle code that is prone to breakage and only
works on one site as long as that site stays relatively static, rather
than fairly solid code that works on a majority of sites that are
allowed to change significantly, and you don't know anything about how
to use HTML parsers and the surrounding toolsets, i'd still recommend
you learn to use the right tools for the job.

> They might break if the layout of the site
> changes.

Indeed they might.

> But a parser-based solution might break too (that is, the way you need
> to traverse the DOM changes).

This is nonsense. For extracting links and images (that is, a and img
tags, and their attributes, in a useful data structure), a parser based
solution will track html changes significantly better than a regular
expression based solution.

For any scraping task where the specifics of the document structure are
involved, either solution is going to have problems.. so what are you
trying to say? That changes in input structure may break code that depends on
it? Hardly an argument for regexps.

> ??>> Therefore, it is complex and is probably more error prone. Also,
> ??>> slower.
>
> DC> Non sequitur ... and absolute bollocks.
>
> Of course it might be the other way around, but, generally, more
> complex things are more prone to errors.

CXML and companion libraries are excellent code that is well tested, and
the problem they solve (xml parsing, manipulation, and unparsing) is not
that difficult.

The closure-html library is able to understand more HTML, and allows the
user to do more with the data with less code, than the solutions you've
presented.


> And when you extract more information and do it in a more complex
> way, that takes more time.

You have not shown that it takes significantly more time or is any
more complex for a task of reasonable size.

For any less complex tasks, regular expressions themselves are too
complex and take too much time. Read on and i will be happy to show
this.

>
> E.g. if it is a DOM-based parser, I'm pretty sure that the full DOM of a
> page takes more memory than a bunch of links.

If you knew anything about the topic at hand, you'd know that there are
many many solutions for XML that do not involve a 'full DOM'. SAX, for
example... the 'Streaming API for XML'.

Here is a modification of my previous code (which was not DOM based, so
enough about DOM) to use the SAX interface and not create a data
structure that represents the entire document :

(defclass links-and-images-handler (sax:default-handler)
  ((links-and-images :accessor links-and-images
                     :initform nil)))

(defmethod sax:end-document ((handler links-and-images-handler))
  (links-and-images handler))

(defmethod sax:start-element ((handler links-and-images-handler)
                              uri local-name qname attributes)
  (flet ((collect (element attribute)
           (when (string-equal element local-name)
             (let ((attribute
                     (find attribute attributes
                           :key #'sax:attribute-local-name
                           :test #'string-equal)))
               (when attribute
                 (push (sax:attribute-value attribute)
                       (links-and-images handler)))))))
    (collect "a" "href")
    (collect "img" "src")))

(defun sax-links-and-images (url)
  (chtml:parse (drakma:http-request url :want-stream t)
               (make-instance 'links-and-images-handler)))
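
It's used the same way as the STP version; a call like

(sax-links-and-images "http://common-lisp.net")

should hand back the href and src values it collected (pushed onto a list, so
in reverse document order).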

Also notice that it doesn't even make a string out of the webpage
itself, but rather reads from the stream and parses it
incrementally. I'm sure that having to read the entire file into
memory in order to run a regular expression over it is going to take more
memory than, well, not doing that.

>
> DC> I applaud your use of the sexp syntax for regexps, but, this
> DC> following code actually fetches any given webpage from the
> DC> internet and collects images and links, something similar to what
> DC> the OP may have wanted.
>
> DC> (defun links-and-images (url &aux links-and-images)
> DC>   (flet ((collect (e name attribute)
> DC>            (when (equal (stp:local-name e) name)
> DC>              (push (stp:attribute-value e attribute)
> DC>                    links-and-images))))
> DC>     (stp:do-recursively (e (chtml:parse (drakma:http-request url)
> DC>                                         (cxml-stp:make-builder)))
> DC>       (when (typep e 'stp:element)
> DC>         (collect e "a" "href")
> DC>         (collect e "img" "src")))
> DC>     links-and-images))
>
> DC> Does more, actually readable and works for a significantly larger
> DC> portion of the interwebs.
>
> Well, you see, I was not interested in all links on that page, I was
> interested only in those with class "title" that are http:// links.
> So the regex says exactly that.

Does it now? What if the class comes after the href?

Ok, here's that code modified to your new spec:

(defun http-links-with-class (url &aux links-and-images)
  (flet ((collect (e class name attribute)
           (when (equal (stp:local-name e) name)
             (when (equal (stp:attribute-value e "class") class)
               (let ((value (stp:attribute-value e attribute)))
                 (when (string-equal "http://"
                                     (subseq value 0 7))
                   (push value links-and-images)))))))
    (stp:do-recursively (e (chtml:parse (drakma:http-request url)
                                        (cxml-stp:make-builder))
                           links-and-images)
      (when (typep e 'stp:element)
        (collect e "title" "a" "href")))))

Please note that it actually works for a greater percentage of pages,
including those where the class attribute is not directly before the
href attribute.

[snip]

> If you use a regex like this to get a tags: <a\s[^>]+> -- I honestly
> don't see a lot of ways it can break.
> In fact, I don't see any.

If that's all you want to achieve, why bother with a regular
expression at all?

It's complex, and more error prone, and therefore slower, than a
hand-rolled function :

(defvar *page* (drakma:http-request "http://common-lisp.net"))

(defun match-html-a (stream)
  (declare (optimize (speed 3) (space 3)))
  (loop
    :for char := (read-char stream nil)
    :while char
    :when (and (eql char #\<)
               (member (peek-char nil stream) '(#\a #\A)))
      :collect (loop
                 :for char := (read-char stream nil)
                 :while char :collect char into stack
                 :until (eql char #\>)
                 :finally (return (coerce (cons #\< stack) 'string)))))

(locally (declare (optimize (speed 3) (space 3)))
  (let ((scanner (cl-ppcre:create-scanner
                  "<a\\s[^>]*>"
                  :case-insensitive-mode t
                  :multi-line-mode t)))
    (defun ppcre-match-html-a (string)
      (nreverse
       (let (links)
         (cl-ppcre:do-matches-as-strings
             (string scanner string links)
           (push string links)))))))

CL-USER> (equalp (ppcre-match-html-a *page*)
                 (with-input-from-string (s *page*)
                   (match-html-a s)))
=> T



CL-USER> (time (dotimes (n 1024)
                 (with-input-from-string (page *page*)
                   (match-html-a page))))

Evaluation took:
0.778 seconds of real time
0.776049 seconds of total run time (0.776049 user, 0.000000 system)
[ Run times consist of 0.044 seconds GC time,
and 0.733 seconds non-GC time. ]
99.74% CPU
1,553,424,264 processor cycles
63,615,928 bytes consed

NIL
CL-USER> (time (dotimes (n 1024)
                 (ppcre-match-html-a *page*)))

Evaluation took:
1.612 seconds of real time
1.612101 seconds of total run time (1.608101 user, 0.004000 system)
[ Run times consist of 0.024 seconds GC time,
and 1.589 seconds non-GC time. ]
100.00% CPU
3,214,872,948 processor cycles
47,587,008 bytes consed

NIL
CL-USER> (time (dotimes (n 1024)
                 (with-input-from-string (page *page*)
                   ;; level the playing field
                   (ppcre-match-html-a *page*))))

Evaluation took:
1.942 seconds of real time
1.948122 seconds of total run time (1.948122 user, 0.000000 system)
[ Run times consist of 0.044 seconds GC time,
and 1.905 seconds non-GC time. ]
100.31% CPU
3,874,149,564 processor cycles
85,368,544 bytes consed

NIL
CL-USER>


> Well, I wrote this in 5 seconds; if I thought
> about it for an hour or so I think I'd have an absolutely bulletproof
> link-extracting regex.

MATCH-HTML-A took a little longer than 5 seconds, but not much, and i
didn't have to use regular expressions, which in this case add
complexity.

> The good thing about it is that it absolutely does not care about
> context. It might also be a bad thing, depending on what you're doing.
> So I don't agree on "smaller portion." Fixing HTML in general is
> harder than fixing it for a specific task.

The thing is, we want context at some point. All we have now is a list
of strings that look like "<a ... >". If we want to extract the actual
href, we have to work a little bit harder :


CL-USER> (with-input-from-string (page *page*)
           (match-html-a page))
("<a href=\"http://www.lisp.org/\">"
 "<a href=\"http://www.cliki.net/\">"
 "<a href=\"http://planet.lisp.org/\">" ...)

(let ((scanner (cl-ppcre:create-scanner "\\s+" :multi-line-mode t)))
  (defun match-a-href-value (string)
    (dolist (attribute (cl-ppcre:split scanner string))
      (when (and (> (length attribute) 5)
                 (string-equal
                  (subseq attribute 0 4) "href"))
        (return-from match-a-href-value
          (second (split-sequence:split-sequence
                   #\" attribute)))))))

(defun linear-match-a-href (string)
  (mapcar #'match-a-href-value
          (ppcre-match-html-a string)))

CL-USER> (linear-match-a-href *page*)
("http://www.lisp.org/"
 "http://www.cliki.net/"
 "http://planet.lisp.org/" ...)

CL-USER> (time (dotimes (n 1024)
                 (linear-match-a-href *page*)))

Evaluation took:
2.529 seconds of real time
2.528158 seconds of total run time (2.528158 user, 0.000000 system)
[ Run times consist of 0.032 seconds GC time,
and 2.497 seconds non-GC time. ]
99.96% CPU
5,043,773,340 processor cycles
77,095,968 bytes consed

Maybe that's not the best way to do that, but that's the naive
ad-hoc implementation i came up with off the top of my head.

A SAX based version might look like this :

(defclass links-handler (sax:default-handler)
  ((links :accessor links
          :initform nil)))

(defmethod sax:end-document ((handler links-handler))
  (nreverse (links handler)))

(defmethod sax:start-element ((handler links-handler)
                              uri local-name qname attributes)
  (when (string-equal "a" local-name)
    (let ((attribute
            (find "href" attributes
                  :key #'sax:attribute-local-name
                  :test #'string-equal)))
      (when attribute
        (push (sax:attribute-value attribute)
              (links handler))))))

(defun sax-match-html-a-href (string)
  (chtml:parse string (make-instance 'links-handler)))


CL-USER> (time (dotimes (n 1024)
                 (sax-match-html-a-href *page*)))

Evaluation took:
2.959 seconds of real time
2.952185 seconds of total run time (2.904182 user, 0.048003 system)
[ Run times consist of 0.160 seconds GC time,
and 2.793 seconds non-GC time. ]
99.76% CPU
5,903,549,172 processor cycles
282,859,128 bytes consed

Those run times are comparable, and if we're not storing the web pages
in memory, the differences will be lost in the i/o latency. At this
point, the regexp based solution starts to take on the characteristics
of a parser, and also becomes more prone to error as it requires new
untested code.

So what is the advantage of the half assed regular expression based
solution?

It's not code size, nor run time, nor ease of use. Only for a very
simple task like this one is the amount of effort anywhere near
comparable.


> DC> As i previously stated, i think that's horrible advice.
>
> Ok, ok, I get it. But it is just your opinion :)

One that i can back up with experience, and actual code. If you'd ever
had to work with some other coder's regexp-based pseudo-parser, or even
your own if you've made that mistake (as i have), you'd recognize it
as a good opinion.

[snipped more DOM-related nonsense]

> DC> “Whenever faced with a problem, some people say `Lets use AWK.'
> DC> Now, they have two problems.” -- D. Tilbrook
>
> Overly simplistic regexes are inferior to formal parsers, but HTML as
> people use it is not formally defined, so it is inapplicable here.

This is fallacious as well. The HTML that CLOSURE-HTML is designed to
parse is the HTML as people use it. Just as an ad-hoc regular expression
will attempt to extract meaning from improperly structured HTML, so does
the parser behind CHTML:PARSE.
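
For instance, using the LHTML builder that comes with closure-html (the exact
output here is from memory, so treat it as approximate):

CL-USER> (chtml:parse "<p>one<p>two<a href=foo>three"
                      (chtml:make-lhtml-builder))
(:HTML NIL (:HEAD NIL)
 (:BODY NIL (:P NIL "one") (:P NIL "two" (:A ((:HREF "foo")) "three"))))

Unclosed tags, the missing html/head/body skeleton and the unquoted attribute
all get patched up into something you can actually work with.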

It is possible to construct input that chtml rejects but an ad-hoc
regexp based solution might accept, just as i can easily construct a
string that the regular expression will reject but a hand-rolled
solution will accept. This proves nothing, unless you can make an
argument that the hand-rolled solution is in fact a better solution than
CLOSURE-HTML. That i'd be willing to listen to.

Obviously, the solution that works is better than the one that
doesn't. In the case where two tools are arguably equally good (the very
simple case we presented above), using the one that is designed for the
job is most likely, in all cases, going to be the right idea.

If you'll indulge me one more cliché:

"It is tempting, if the only tool you have is a hammer, to treat
everything as if it were a nail."
-- Abraham Maslow, 1966

Cheers,

drewc






From: Drew Crampsie on
Tim X <timx(a)nospam.dev.null> writes:

> Drew Crampsie <drewc(a)tech.coop> writes:
>
>> "Captain Obvious" <udodenko(a)users.sourceforge.net> writes:
>>
>>> ??>> If you don't care about getting it 100% correct for all pages, just
>>> ??>> use regexes.
>>> ??>> E.g. cl-ppcre library.
>>>
>>> DC> This is one of the worst pieces of advice i've ever seen given on this
>>> DC> newsgroup. Please do not attempt to roll your own HTML parser using
>>> DC> CL-PPCRE

> Drew,
>
> thank you for doing this.

Someone had to :).

> I cannot express how frustrating it is when people use REs in this
> context for more than a single page/url hack. I have lost count of
> the number of times I have had to fix broken systems that were due
> precisely to the use of REs to process HTML pages.

This, more than the specific unsuitability of the tool to the task, is
the reason i spoke up. I've been there, it's hell, and completely
avoidable.

Hopefully, the epic followup i just posted will end any questions about
the matter! :)

Cheers,

drewc





>
> The RE solution is not a good solution. It /can/ work in a very limited
> context, but as soon as you try to apply this approach in a more
> generalised solution, you end up with a maintenance nightmare. Worse
> still, it means that you will need somone to maintain this solution that
> is experienced and understands how REs work. Unfortunately, few people
> actually do understand this. Trivial REs are easy, but a soon as they
> begin to get a little complex, you really need a deeper understanding of
> how they work, anchoring, backtracking etc.
>
> For the OP, there are a number of tools, such as 'tidy', which can
> clean up HTML and in turn make the parsers work better. It is true
> that HTML is broken in many ways, making it hard to process reliably.
> Many HTML generation libraries are extremely inefficient and buggy -
> take a look at the HTML generated by many MS programs such as outlook to
> see what I mean. However, this does not mean you cannot parse it.
> Obviously you can or we would not have any working web browsers. Using
> tools like 'tidy' to clean up the HTML before parsing means that the
> parser doesn't have to work as hard and may not need to deal with as
> many exceptions.
>
> Avoid the RE solution unless you are dealing with a single page that is
> fairly simple. If you are looking for something more general, use one of
> the parsers and something like 'tidy'. When it fails, extend the parser
> and incrementally improve it.
>
> Tim
From: RG on
In article <87d3ws4zef.fsf(a)tech.coop>, Drew Crampsie <drewc(a)tech.coop>
wrote:

[A ton of useful info on HTML and XML parsing]

Wow, that has to be one of the most content-full posts ever to C.L.L.
Thanks for taking the time to write it!

rg