From: Drew Crampsie on 12 May 2010 18:42

Drew Crampsie <drewc(a)tech.coop> writes:

> "Captain Obvious" <udodenko(a)users.sourceforge.net> writes:
>
>> V> Is there a library available to parse HTML ? I need to extract certain
>> V> tags like links and images from the body.
>>
>> If you don't care about getting it 100% correct for all pages, just
>> use regexes.
>> E.g. cl-ppcre library.

[snip]

> So many nice API's exist for querying and manipulating an XML tree, yet
> you suggest the author roll their own? Why brought this on?
                                         ^What

posted this before i had my coffee :)

Anyways, to back this up, here is the cxml-stp code to get all a-hrefs
and img-src's out of almost any mostly well formed HTML page:

CL-USER> (defun show-links-and-images (url)
           (let* ((str (drakma:http-request url))
                  (document (chtml:parse str (cxml-stp:make-builder))))
             (stp:do-recursively (a document)
               (when (and (typep a 'stp:element)
                          (or (equal (stp:local-name a) "img")
                              (equal (stp:local-name a) "a")))
                 (print (or (stp:attribute-value a "src")
                            (stp:attribute-value a "href")))))))
SHOW-LINKS-AND-IMAGES
CL-USER> (show-links-and-images "http://common-lisp.net")

"logo120x80.png"
"http://www.lisp.org/"
"http://www.cliki.net/"
"http://planet.lisp.org/"
"http://www.cl-user.net/"
.... etc

The example code at
http://common-lisp.net/project/closure/closure-html/examples.html
contains all you need to know... this code was a trivial cut and paste
job.

Cheers,

drewc
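A note for anyone reproducing the examples in this thread: the libraries used throughout (drakma, closure-html, cxml-stp, cl-ppcre) are all loadable as ASDF systems. The Quicklisp call below is only one way to pull them in and is an assumption on my part; the posts themselves never say how the systems were installed.

;; Assumed setup, not taken from the posts: load the systems used in
;; this thread.  Any ASDF-based setup works equally well.
(ql:quickload '("drakma"        ; HTTP client used to fetch the pages
                "closure-html"  ; lenient HTML parser behind CHTML:PARSE
                "cxml-stp"      ; STP document model (pulls in cxml, whose SAX layer is used later)
                "cl-ppcre"))    ; Perl-compatible regular expressions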
From: Captain Obvious on 13 May 2010 16:52

 ??>> Have I mentioned parser?

 DC> No, but the OP did :

 DC> "Is there a library available to parse HTML ? I need to extract certain
 DC> tags like links and images from the body."

I thought he might not know that you don't need to fully parse HTML to
extract links and images from the body.
And I think I was pretty clear about that it is a half-assed solution.
That's just an option.

 DC> My point is that, when someone asks for a parser, telling them that can
 DC> make a crappy half-assed one via regexps is a terrible bit of advice.

I wrote that you can do this in "crappy half-assed way" without a parser
at all. That it is a different thing.

I used thing like that few times, it took me maybe 5 minutes to get it
working and it was working 100% well (for a certain site).
So what's wrong with it?

 DC> Or it might not... hiring cheap labour to do it by hand might work too
 DC> and will likely be the most robust of all... but the OP asked for a
 DC> parser.

Well... Do you know that 95% of posts here on comp.lang.lisp which start
with "Help me to fix this macro..." are not about macros but about a
person not understanding something?

It just might be that person looking for a parser can do thing he wants
without a parser.

If he really needs a parser, he can just ignore my comment and listen to
people who have provided links to various parsers.

So what's wrong with it, really?

 ??>>>> It usually works very well when you need to scrap a certain site,
 ??>>>> not "sites in general".

 DC> So, if you'd like your code to work more than once, avoid regexps?
 DC> I can agree with you there.

Well, I don't know what he is doing. Sometimes people need to scrap a
certain site or few of them. Regexes are fine for that. They might get
broken if they change layout on the site.

But parser-based solution might get broken too (that is, way you need to
traverse DOM changes).

 ??>> Therefore, it is complex and is probably more error prone. Also,
 ??>> slower.

 DC> Non sequitur ... and absolute bollocks.

Of course it might be the other way around, but, generally, more complex
things are more prone to errors. And also when you extract more
information and do it in more complex way that takes more time.

E.g. if it is DOM-based parser, I'm pretty sure that full DOM of a page
takes more memory than a bunch of links.

 DC> I applaud your use of the sexp syntax for regexps, but, this following
 DC> code actually fetches any given webpage from the internet and collects
 DC> images and links, something similar to what the OP may have wanted.

 DC> (defun links-and-images (url &aux links-and-images)
 DC>   (flet ((collect (e name attribute)
 DC>            (when (equal (stp:local-name e) name)
 DC>              (push (stp:attribute-value e attribute)
 DC>                    links-and-images))))
 DC>     (stp:do-recursively (e (chtml:parse (drakma:http-request url)
 DC>                                         (cxml-stp:make-builder)))
 DC>       (when (typep e 'stp:element)
 DC>         (collect e "a" "href")
 DC>         (collect e "img" "src")))
 DC>     links-and-images))

 DC> Does more, actually readable and works for a significantly larger
 DC> portion of the interwebs.

Well, you see, I was not interested in all links on that page, I was
interested only on those with class "title" and are http:// links.
So regex says exactly that.

I don't know what exactly OP wants, I guess we don't have a full story
here.

 DC> Can you show me a regexp based solution that meets the op's
 DC> specification and works for a equal or greater number of pages?

I don't see specification here. He did not say that he just wants all
links.

 DC> If you can show me that then we'll compare speed if you like.

That would be interesting...

 DC> Lets make it realistic and a little more interesting than the average
 DC> usenet pissing match. Perhaps we can get the OP back

Well, sorry, I probably don't have time for this.

 DC> No solution is going to work with all broken html, that's
 DC> impossible. Is your advice that, because it's possible it may not work
 DC> with a small portion of HTML, to ensure it's limited to an even smaller
 DC> portion?

If you use regex like this to get a tags: <a\s[^>]+> -- I honestly don't
see a lot of ways how it can get broken.
In fact, I don't see any.

Well, I wrote this in 5 seconds, if I'd think on it for a hour or so I
think I'll have absolutely bulletproof link-extracting regex.

Good thing about it is that it absolutely does not care about context.
It might be also a bad thing, depending on what's you doing.
So I don't agree on "smaller portion." Fixing HTML is general is harder
than fixing for a specific task.

 DC> As i previously stated, i think that's horrible advice.

Ok, ok, I get it. But it is just your opinion :)

 DC> The way CLOSURE-HTML (and a good many HTML parsers) work is by cleaning
 DC> up the HTML to make an XML-like parse tree, and then using established
 DC> API's to work with that. Makes sense to use tools that are designed to
 DC> solve the problem you are trying to solve unless those tools are
 DC> deficient in some way, non?

Having full DOM is an overkill. If you're only working with small sites
that's ok, but a DOM of a larger thing can eat lots of memory.

 DC> That's where they'd end up while attempting to use your advice to solve
 DC> their problem, in my opinion. Sure, a regexp based solution can be
 DC> convinced to work, but it's not the right tool for the job.

So DOM parser might be not the right tool either. Sometimes job requires
very specific tool and general ones are deficient.

 DC> “Whenever faced with a problem, some people say `Lets use AWK.'
 DC> Now, they have two problems.” -- D. Tilbrook

Overly simplistic regexes are inferior to formal parsers, but HTML as
people use it is not formally defined, so it is inapplicable here.
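For readers who want to see concretely what the regex-based approach defended above looks like in cl-ppcre, here is a minimal sketch of extracting http:// hrefs from a-tags carrying class "title". The URL handling, the attribute order, and the exact pattern are illustrative assumptions, and the sketch inherits every caveat raised in this thread (in particular, it silently misses tags where href precedes class).

;; A minimal, hedged sketch of the regex approach described above.
;; The pattern assumes class="..." appears before href="..." inside the
;; tag, which is exactly the kind of fragility discussed in this thread.
(defun regex-title-links (url)
  (let ((page (drakma:http-request url))
        (links '()))
    ;; (?is) = case-insensitive, dot matches newline.
    (cl-ppcre:do-register-groups (href)
        ("(?is)<a\\s[^>]*class=\"title\"[^>]*href=\"(http://[^\"]*)\"" page)
      (push href links))
    (nreverse links)))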
From: Drew Crampsie on 18 May 2010 18:29

"Captain Obvious" <udodenko(a)users.sourceforge.net> writes:

> ??>> Have I mentioned parser?
>
> DC> No, but the OP did :
>
> DC> "Is there a library available to parse HTML ? I need to extract certain
> DC> tags like links and images from the body."
>
> I thought he might not know that you don't need to fully parse HTML to
> extract links and images from the body.

Let's say that was true, and that the OP was quite naive and had never
used regular expressions to extract data from text, or never thought to
apply regular expressions to this particular problem. If that were the
case, i think it would also be likely that the OP was not all that
familiar with regular expressions to begin with.

I'm of the opinion that this is unlikely, and the OP had already
rejected regular expressions as the wrong solution. In the former case,
learning regular expressions in order to extract links and images from
HTML is not something i would recommend.

> And I think I was pretty clear about that it is a half-assed
> solution. That's just an option.

It's a terrible solution, and i can't see why you're still defending it.

> DC> My point is that, when someone asks for a parser, telling them that can
> DC> make a crappy half-assed one via regexps is a terrible bit of advice.
>
> I wrote that you can do this in "crappy half-assed way" without a
> parser at all. That it is a different thing.
>
> I used thing like that few times, it took me maybe 5 minutes to get it
> working and it was working 100% well (for a certain site).
> So what's wrong with it?

It took me a lot less to get mine working, and it works for more than
one site. If the site changes, mine will still extract the images and
links, yours will not. Also, mine was a complete and working piece of
code.

> DC> Or it might not... hiring cheap labour to do it by hand might work too
> DC> and will likely be the most robust of all... but the OP asked for a
> DC> parser.
>
> Well... Do you know that 95% of posts here on comp.lang.lisp which
> start with "Help me to fix this macro..." are not about macros but
> about a person not understanding something?

What you don't seem to understand is that regular expressions, for
extracting things from HTML, are almost always the wrong solution.
Similar to using a macro when a function is what is called for, or using
EVAL when a macro would do.

> It just might be that person looking for a parser can do thing he
> wants without a parser.

Since the OP is not around to comment on what their exact needs were, we
have to assume that, having long since chosen closure-html as the thing
that will do the thing he wants, the OP was in fact looking to do the
kind of things a parser does.

> If he really needs a parser, he can just ignore my comment and listen
> to people who have provided links to various parsers.

As they did.

> So what's wrong with it, really?

What worries me is not so much that the OP might have taken your advice,
but rather that you thought it was good advice to give. My contrary
demonstrations are as much for your benefit as for those who may still
be following this thread.

> ??>>>> It usually works very well when you need to scrap a certain site,
> ??>>>> not "sites in general".
>
> DC> So, if you'd like your code to work more than once, avoid regexps?
> DC> I can agree with you there.
>
> Well, I don't know what he is doing.

Seemed to me like he was trying to extract links and images and the like
from html.
> Sometimes people need to scrap a
> certain site or few of them.

'scrape', scrap is what i'd do to your code if you tried to get it past
review.

> Regexes are fine for that.

If you want to write brittle code that is prone to breakage and only
works on one site as long as that site stays relatively static, rather
than fairly solid code that works on a majority of sites that are
allowed to change significantly, and you don't know anything about how
to use HTML parsers and the surrounding toolsets, i'd still recommend
you learn to use the right tools for the job.

> They might get broken if they change layout
> on the site.

Indeed they might.

> But parser-based solution might get broken too (that is, way you need
> to traverse DOM changes).

This is nonsense. For extracting links and images (that is, a and img
tags, and their attributes, in a useful data structure), a parser based
solution will track html changes significantly better than a regular
expression based solution.

For any scraping task where the specifics of the document structure are
involved, either solution is going to have problems.. so what are you
trying to say? That changes in input structure may break code that
depends on it? Hardly an argument for regexps.

> ??>> Therefore, it is complex and is probably more error prone. Also,
> ??>> slower.
>
> DC> Non sequitur ... and absolute bollocks.
>
> Of course it might be the other way around, but, generally, more
> complex things are more prone to errors.

CXML and companion libraries are excellent code that is well tested, and
the problem they solve (xml parsing, manipulation, and unparsing) is not
that difficult. The closure-html library is able to understand more
HTML, and allow the user to do more with the data with less code, than
the solutions you've presented.

> And also when you extract more information and do it in more complex
> way that takes more time.

You have not shown that it takes significantly more time or is any more
complex for a task of reasonable size. For any less complex tasks,
regular expressions themselves are too complex and take too much time.
Read on and i will be happy to show this.

> E.g. if it is DOM-based parser, I'm pretty sure that full DOM of a
> page takes more memory than a bunch of links.

If you knew anything about the topic at hand, you'd know that there are
many many solutions for XML that do not involve a 'full DOM'. SAX, for
example... the 'Simple API for XML'. Here is a modification of my
previous code (which was not DOM based, so enough about DOM) to use the
SAX interface and not create a data structure that represents the entire
document :

(defclass links-and-images-handler (sax:default-handler)
  ((links-and-images :accessor links-and-images :initform nil)))

(defmethod sax:end-document ((handler links-and-images-handler))
  (links-and-images handler))

(defmethod sax:start-element ((handler links-and-images-handler)
                              uri local-name qname attributes)
  (flet ((collect (element attribute)
           (when (string-equal element local-name)
             (let ((attribute (find attribute attributes
                                    :key #'sax:attribute-local-name
                                    :test #'string-equal)))
               (when attribute
                 (push (sax:attribute-value attribute)
                       (links-and-images handler)))))))
    (collect "a" "href")
    (collect "img" "src")))

(defun sax-links-and-images (url)
  (chtml:parse (drakma:http-request url :want-stream t)
               (make-instance 'links-and-images-handler)))

Also notice that it doesn't even make a string out of the webpage
itself, but rather reads from the stream and parses it incrementally.
I'm sure that having to read the entire file into memory in order to run
a regular expression over it is going to take more memory than, well,
not doing that.

> DC> I applaud your use of the sexp syntax for regexps, but, this
> DC> following code actually fetches any given webpage from the
> DC> internet and collects images and links, something similar to what
> DC> the OP may have wanted.
>
> DC> (defun links-and-images (url &aux links-and-images)
> DC>   (flet ((collect (e name attribute)
> DC>            (when (equal (stp:local-name e) name)
> DC>              (push (stp:attribute-value e attribute)
> DC>                    links-and-images))))
> DC>     (stp:do-recursively (e (chtml:parse (drakma:http-request url)
> DC>                                         (cxml-stp:make-builder)))
> DC>       (when (typep e 'stp:element)
> DC>         (collect e "a" "href")
> DC>         (collect e "img" "src")))
> DC>     links-and-images))
>
> DC> Does more, actually readable and works for a significantly larger
> DC> portion of the interwebs.
>
> Well, you see, I was not interested in all links on that page, I was
> interested only on those with class "title" and are http:// links.
> So regex says exactly that.

Does it now? What if the class comes after the href? Ok, here's that
code modified to your new spec:

(defun http-links-with-class (url &aux links-and-images)
  (flet ((collect (e class name attribute)
           (when (equal (stp:local-name e) name)
             (when (equal (stp:attribute-value e "class") class)
               (let ((value (stp:attribute-value e attribute)))
                 (when (string-equal "http://" (subseq value 0 7))
                   (push value links-and-images)))))))
    (stp:do-recursively (e (chtml:parse (drakma:http-request url)
                                        (cxml-stp:make-builder))
                           links-and-images)
      (when (typep e 'stp:element)
        (collect e "title" "a" "href")))))

Please note that it actually works for a greater percentage of pages,
including those where the class attribute is not directly before the
href attribute.

[snip]

> If you use regex like this to get a tags: <a\s[^>]+> -- I honestly
> don't see a lot of ways how it can get broken.
> In fact, I don't see any.

If that's all you want to achieve, why bother with regular expressions
at all? It's complex, and more error prone, and therefore slower, than a
hand-rolled function :

(defvar *page* (drakma:http-request "http://common-lisp.net"))

(defun match-html-a (stream)
  (declare (optimize (speed 3) (space 3)))
  (loop :for char := (read-char stream nil)
        :while char
        :when (and (eql char #\<)
                   (member (peek-char nil stream) '(#\a #\A)))
        :collect (loop :for char := (read-char stream nil)
                       :while char
                       :collect char into stack
                       :until (eql char #\>)
                       :finally (return (coerce (cons #\< stack) 'string)))))

(locally (declare (optimize (speed 3) (space 3)))
  (let ((scanner (cl-ppcre:create-scanner "<a\\s[^>]*>"
                                          :case-insensitive-mode t
                                          :multi-line-mode t)))
    (defun ppcre-match-html-a (string)
      (nreverse
       (let (links)
         (cl-ppcre:do-matches-as-strings (string scanner string links)
           (push string links)))))))

CL-USER> (equalp (ppcre-match-html-a *page*)
                 (with-input-from-string (s *page*)
                   (match-html-a s)))
=> T

CL-USER> (time (dotimes (n 1024)
                 (with-input-from-string (page *page*)
                   (match-html-a page))))
Evaluation took:
  0.778 seconds of real time
  0.776049 seconds of total run time (0.776049 user, 0.000000 system)
  [ Run times consist of 0.044 seconds GC time, and 0.733 seconds non-GC time. ]
  99.74% CPU
  1,553,424,264 processor cycles
  63,615,928 bytes consed
NIL

CL-USER> (time (dotimes (n 1024) (ppcre-match-html-a *page*)))
Evaluation took:
  1.612 seconds of real time
  1.612101 seconds of total run time (1.608101 user, 0.004000 system)
  [ Run times consist of 0.024 seconds GC time, and 1.589 seconds non-GC time. ]
  100.00% CPU
  3,214,872,948 processor cycles
  47,587,008 bytes consed
NIL

CL-USER> (time (dotimes (n 1024)
                 (with-input-from-string (page *page*) ;level the playing field
                   (ppcre-match-html-a *page*))))
Evaluation took:
  1.942 seconds of real time
  1.948122 seconds of total run time (1.948122 user, 0.000000 system)
  [ Run times consist of 0.044 seconds GC time, and 1.905 seconds non-GC time. ]
  100.31% CPU
  3,874,149,564 processor cycles
  85,368,544 bytes consed
NIL
CL-USER>

> Well, I wrote this in 5 seconds, if I'd
> think on it for a hour or so I think I'll have absolutely bulletproof
> link-extracting regex.

MATCH-HTML-A took a little longer than 5 seconds, but not much, and i
didn't have to use regular expressions, which in this case add
complexity.

> Good thing about it is that it absolutely does not care about
> context. It might be also a bad thing, depending on what's you doing.
> So I don't agree on "smaller portion." Fixing HTML is general is
> harder than fixing for a specific task.

The thing is, we want context at some point. All we have now is a list
of strings that look like "<a ... >". If we want to extract the actual
href, we have to work a little bit harder :

CL-USER> (with-input-from-string (page *page*)
           (match-html-a page))
("<a href=\"http://www.lisp.org/\">" "<a href=\"http://www.cliki.net/\">"
 "<a href=\"http://planet.lisp.org/\">" ...)

(let ((scanner (cl-ppcre:create-scanner "\\s+" :multi-line-mode t)))
  (defun match-a-href-value (string)
    (dolist (attribute (cl-ppcre:split scanner string))
      (when (and (> (length attribute) 5)
                 (string-equal (subseq attribute 0 4) "href"))
        (return-from match-a-href-value
          (second (split-sequence:split-sequence #\" attribute)))))))

(defun linear-match-a-href (string)
  (mapcar #'match-a-href-value (ppcre-match-html-a string)))

CL-USER> (linear-match-a-href *page*)
("http://www.lisp.org/" "http://www.cliki.net/"
 "http://planet.lisp.org/" ...)

CL-USER> (time (dotimes (n 1024) (linear-match-a-href *page*)))
Evaluation took:
  2.529 seconds of real time
  2.528158 seconds of total run time (2.528158 user, 0.000000 system)
  [ Run times consist of 0.032 seconds GC time, and 2.497 seconds non-GC time. ]
  99.96% CPU
  5,043,773,340 processor cycles
  77,095,968 bytes consed

Maybe that's not the best way to do that, but that's the naive ad-hoc
implementation i came up with off the top of my head. A SAX based
version might look like this :

(defclass links-handler (sax:default-handler)
  ((links :accessor links :initform nil)))

(defmethod sax:end-document ((handler links-handler))
  (nreverse (links handler)))

(defmethod sax:start-element ((handler links-handler)
                              uri local-name qname attributes)
  (when (string-equal "a" local-name)
    (let ((attribute (find "href" attributes
                           :key #'sax:attribute-local-name
                           :test #'string-equal)))
      (when attribute
        (push (sax:attribute-value attribute) (links handler))))))

(defun sax-match-html-a-href (string)
  (chtml:parse string (make-instance 'links-handler)))

CL-USER> (time (dotimes (n 1024) (sax-match-html-a-href *page*)))
Evaluation took:
  2.959 seconds of real time
  2.952185 seconds of total run time (2.904182 user, 0.048003 system)
  [ Run times consist of 0.160 seconds GC time, and 2.793 seconds non-GC time. ]
  99.76% CPU
  5,903,549,172 processor cycles
  282,859,128 bytes consed

Those run times are comparable, and if we're not storing the web pages
in memory, the differences will be lost in the i/o latency. At this
point, the regexp based solution starts to take on the characteristics
of a parser, and also becomes more prone to error as it requires new
untested code.

So what is the advantage of the half assed regular expression based
solution? It's not code size, nor run time, nor ease of use. Only for a
very simple task like this one is the amount of effort anywhere near
comparable.

> DC> As i previously stated, i think that's horrible advice.
>
> Ok, ok, I get it. But it is just your opinion :)

One that i can back up with experience, and actual code. If you'd ever
had to work with some other coder's regexp-based pseudo-parser, or even
your own if you've made that mistake (as i have), you'd recognize it as
a good opinion.

[snipped more DOM-related nonsense]

> DC> “Whenever faced with a problem, some people say `Lets use AWK.'
> DC> Now, they have two problems.” -- D. Tilbrook
>
> Overly simplistic regexes are inferior to formal parsers, but HTML as
> people use it is not formally defined, so it is inapplicable here.

This is fallacious as well. The HTML that CLOSURE-HTML is designed to
parse is the HTML as people use it. Just as an ad-hoc regular expression
will attempt to extract meaning from improperly structured HTML, so does
the parser behind CHTML:PARSE.

It is possible to construct input that chtml rejects but an ad-hoc
regexp based solution might accept, just as i can easily construct a
string that the regular expression will reject but a hand-rolled
solution will accept. This proves nothing, unless you can make an
argument that the hand-rolled solution is in fact a better solution than
CLOSURE-HTML. That i'd be willing to listen to.

Obviously, the solution that works is better than the one that doesn't.
In the case where two tools are arguably equally good (the very simple
case we presented above), using the one that is designed for the job is
most likely, in all cases, going to be the right idea.

If you'll indulge me one more cliché:

"It is tempting, if the only tool you have is a hammer, to treat
everything as if it were a nail." -- Abraham Maslow, 1966

Cheers,

drewc
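To make the last point concrete, here is one small, hedged illustration of input that the thread's <a\s[^>]+> pattern mishandles: a ">" inside a quoted attribute value ends the match early, while the parser-based code above receives the full attribute list. The sample tag is made up purely for the demonstration.

;; Illustrative only: a ">" inside a quoted attribute value truncates
;; the match produced by the <a\s[^>]+> pattern discussed above.
(cl-ppcre:scan-to-strings "(?i)<a\\s[^>]+>"
                          "<a href=\"/x\" title=\"a > b\">link</a>")
;; => "<a href=\"/x\" title=\"a >"
;; The match stops at the first #\>, cutting the tag off mid-attribute.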
From: Drew Crampsie on 18 May 2010 18:34

Tim X <timx(a)nospam.dev.null> writes:

> Drew Crampsie <drewc(a)tech.coop> writes:
>
>> "Captain Obvious" <udodenko(a)users.sourceforge.net> writes:
>>
>>> ??>> If you don't care about getting it 100% correct for all pages, just
>>> ??>> use regexes.
>>> ??>> E.g. cl-ppcre library.
>>>
>>> DC> This is one of the worst pieces of advice i've ever seen given on this
>>> DC> newsgroup. Please do not attempt to roll your own HTML parser using
>>> DC> CL-PPCRE
>
> Drew,
>
> thank you for doing this.

Someone had to :).

> I cannot express how frustrating it is when people use REs in this
> context for more than a single page/url hack. I have lost count of
> the number of times I have had to fix broken systems that were due
> precisely to the use of REs to process HTML pages.

This, more than the specific unsuitability of the tool to the task, is
the reason i spoke up. I've been there, it's hell, and completely
avoidable.

Hopefully, the epic followup i just posted will end any questions about
the matter! :)

Cheers,

drewc

> The RE solution is not a good solution. It /can/ work in a very limited
> context, but as soon as you try to apply this approach in a more
> generalised solution, you end up with a maintenance nightmare. Worse
> still, it means that you will need someone to maintain this solution
> that is experienced and understands how REs work. Unfortunately, few
> people actually do understand this. Trivial REs are easy, but as soon
> as they begin to get a little complex, you really need a deeper
> understanding of how they work, anchoring, backtracking etc.
>
> For the OP, there are a number of tools, such as 'tidy', which can
> clean up HTML and in turn make the parsers work better. It is true
> that HTML is broken in many ways, making it hard to process reliably.
> Many HTML generation libraries are extremely inefficient and buggy -
> take a look at the HTML generated by many MS programs such as Outlook
> to see what I mean. However, this does not mean you cannot parse it.
> Obviously you can, or we would not have any working web browsers. Using
> tools like 'tidy' to clean up the HTML before parsing means that the
> parser doesn't have to work as hard and may not need to deal with as
> many exceptions.
>
> Avoid the RE solution unless you are dealing with a single page that is
> fairly simple. If you are looking for something more general, use one
> of the parsers and something like 'tidy'. When it fails, extend the
> parser and incrementally improve it.
>
> Tim
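As a rough sketch of the 'tidy' pre-cleaning pipeline Tim describes, the code below shells out to an installed tidy binary and feeds the cleaned markup to chtml:parse. The use of sb-ext:run-program (SBCL-specific) and the particular tidy flags are assumptions of mine, not something given in the thread; treat it as one possible shape of the idea rather than a tested recipe.

;; Sketch only: pre-clean HTML with the external `tidy' program before
;; parsing.  Assumes SBCL and a tidy binary on PATH; adjust the flags
;; and the process invocation for other setups.
(defun tidy-then-parse (url)
  (let* ((raw (drakma:http-request url))
         (clean (with-output-to-string (out)
                  (with-input-from-string (in raw)
                    ;; -q keeps tidy quiet, -asxhtml asks for well-formed
                    ;; XHTML on stdout; stderr is discarded.
                    (sb-ext:run-program "tidy" '("-q" "-asxhtml")
                                        :search t :input in
                                        :output out :error nil)))))
    (chtml:parse clean (cxml-stp:make-builder))))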
From: RG on 18 May 2010 20:27

In article <87d3ws4zef.fsf(a)tech.coop>, Drew Crampsie <drewc(a)tech.coop>
wrote:

[A ton of useful info on HTML and XML parsing]

Wow, that has to be one of the most content-full posts ever to C.L.L.
Thanks for taking the time to write it!

rg