Prev: NYC LOCAL: Tuesday 11 May 2010 Lisp NYC Meet and NYLUG Hack Meet
Next: Other online lisp places
From: Dmitry Statyvka on 19 May 2010 05:13

>>>>> Drew Crampsie writes:

[...]

>> And I think I was pretty clear that it is a half-assed
>> solution. That's just an option.

DC> It's a terrible solution, and i can't see why you're still
DC> defending it.

I can. Because it may be a pretty good solution under certain
conditions. For example, when one wants to extract just some links (not
all of them) or pieces of text from a given page on a given site, and
the page is generated by [...]
From: Dmitry Statyvka on 19 May 2010 16:40

>>>>> Drew Crampsie writes:

DC> Dmitry Statyvka <dmitry(a)statyvka.org.ua> writes:
>>>>>>> Drew Crampsie writes:
>>
>> [...]
>>
>> >> And I think I was pretty clear that it is a half-assed
>> >> solution. That's just an option.
>>
DC> It's a terrible solution, and i can't see why you're still
DC> defending it.
>>
>> I can. Because it may be a pretty good solution under certain
>> conditions.

DC> Compared to what? I've shown for the simple cases where a regexp
DC> may be a 'pretty good' solution, a hand-rolled matcher is the same
DC> amount of code and effort, and significantly faster.

DC> For anything more, i've shown that using a proper parser is less
DC> code, less effort, and actually works in more cases.

It is, provided a proper parser can parse what we downloaded. It's
wrong to assume that any downloaded page will be valid HTML. And it's
wrong to suppose that driving the pipeline "HTML-client |
HTML-beautifier | parser + content extractor | business logic" will be
simpler, faster, or whatever than driving "HTML-client | content
extractor | business logic" in every possible case. That was the point
of Cap's original post, and it's obvious.

[...]

DC> I figure you meant to 'cancel' and send instead,

I meant to 'save' and send instead.

DC> as nobody using their real name would in their right mind defend a
DC> regexp based solution in the face of the evidence offered.)

For the sake of clarity, in my opinion: a regexp-based solution can be
simpler, more manageable and faster than an HTML-parsing-based
solution. I have no good example at hand, sorry. I just saw it in real
development (and real support too) a few years ago.

Dmitry.
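[Editor's note: the regexp-based approach Dmitry is defending might look
something like the following CL-PPCRE sketch. The function name and
pattern are illustrative, not code from the thread, and it assumes
cl-ppcre has been loaded, e.g. via Quicklisp.]

```lisp
;; Illustrative sketch: pull href values out of raw HTML with CL-PPCRE.
;; It deliberately ignores HTML structure, which is exactly the
;; trade-off under debate: quick to write, but brittle against markup
;; it was not written for (single quotes, unquoted attributes, etc.).
(defun href-values (html)
  "Return the href value of every <a ... href=\"...\"> in HTML."
  (let ((links '()))
    (ppcre:do-register-groups (url)
        ("(?i)<a\\s[^>]*href=\"([^\"]*)\"" html (nreverse links))
      (push url links))))
```

At the REPL, (href-values "<a href=\"/one\">x</a> <a href=\"/two\">y</a>")
would collect ("/one" "/two").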
From: Drew Crampsie on 19 May 2010 20:26

Dmitry Statyvka <dmitry(a)statyvka.org.ua> writes:

>>>>>> Drew Crampsie writes:
>
> DC> Dmitry Statyvka <dmitry(a)statyvka.org.ua> writes:
> >>>>>>> Drew Crampsie writes:
> >>
> >> [...]
> >>
> >> >> And I think I was pretty clear that it is a half-assed
> >> >> solution. That's just an option.
> >>
> DC> It's a terrible solution, and i can't see why you're still
> DC> defending it.
> >>
> >> I can. Because it may be a pretty good solution under certain
> >> conditions.
>
> DC> Compared to what? I've shown for the simple cases where a regexp
> DC> may be a 'pretty good' solution, a hand-rolled matcher is the same
> DC> amount of code and effort, and significantly faster.
>
> DC> For anything more, i've shown that using a proper parser is less
> DC> code, less effort, and actually works in more cases.
>
> It is, provided a proper parser can parse what we downloaded. It's
> wrong to assume that any downloaded page will be valid HTML. And it's
> wrong to suppose that driving the pipeline "HTML-client |
> HTML-beautifier | parser + content extractor | business logic" will be
> simpler, faster, or whatever than driving "HTML-client | content
> extractor | business logic" in every possible case. That was the point
> of Cap's original post, and it's obvious.

If your input is corrupted in such a way as to be unparsable, then
perhaps fixing your input before parsing it is a better idea than
making a one-off matcher, unless your data format is simple enough
that a parser is not needed...

And, as i've shown, for the simple cases where a regexp-based solution
is usable, it's just as easy to hand-craft a matcher, so regexps could
be said to suffer from the same 'over-complex' problem in the instances
you claim they are useful.

So none of this is convincing me that regular expressions are the
right tool for extracting information from HTML.
When you tell me they are good for extracting information from things
that are not HTML, don't be offended if i say 'who gives a toss?'. I've
shown the two pipelines to be of similar speed and similar effort for a
toy problem, and also demonstrated that for a larger problem, the
parser gains a significant advantage.

Have you ever used CLOSURE-HTML? CL-PPCRE? Can you whip open a repl and
back up your assertions with actual code? If not, then whatever
hypotheticals you are discussing are not something that interests me,
and are certainly not the topic under discussion (extracting
information from HTML using Common Lisp).

> DC> as nobody using their real name would in their right mind defend a
> DC> regexp based solution in the face of the evidence offered.)
>
> For the sake of clarity, in my opinion: a regexp-based solution can be
> simpler, more manageable and faster than an HTML-parsing-based
> solution.

I recognize your right to your opinion, but your unwillingness to back
it up with any code, or even an example, sets off my bullshit alarm.
Since you have no refutation at hand, i'm not going to turn it off. :P

> I have no good example at hand, sorry. I just saw it in real
> development (and real support too) a few years ago.

In this case, is 'real' a mess of perl or php code? Perhaps something
written by morons? Is this a solution involving parsing HTML in order
to retrieve information such as links and images, which is what we are
talking about here... non?

I'd like to see an example that's a little more substantial, but i
doubt your ability to produce it. I'm willing to be swayed with a good
argument, but i'm pretty sure it doesn't exist. :)

Cheers,

drewc

> Dmitry.
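[Editor's note: the parser-based pipeline Drew advocates can be
sketched with CLOSURE-HTML's LHTML builder. This is illustrative, not
code from the thread; EXTRACT-LINKS is a made-up helper, and the
commented call assumes closure-html has been loaded.]

```lisp
;; Illustrative sketch: parse the page with CLOSURE-HTML, then walk the
;; resulting LHTML tree.  Each LHTML element is a list of the form
;; (tag attribute-alist . children); text nodes are plain strings.
;; CLOSURE-HTML is error-tolerant, so mildly broken real-world HTML
;; still yields a tree to walk.
(defun extract-links (lhtml)
  "Collect the :HREF attribute of every :A node in an LHTML tree."
  (let ((links '()))
    (labels ((walk (node)
               (when (consp node)
                 (destructuring-bind (tag attrs &rest children) node
                   (when (eq tag :a)
                     (let ((href (second (assoc :href attrs))))
                       (when href (push href links))))
                   (mapc #'walk children)))))
      (walk lhtml)
      (nreverse links))))

;; Assuming closure-html is loaded:
;; (extract-links (chtml:parse "<p><a href='/one'>x</a></p>"
;;                             (chtml:make-lhtml-builder)))
```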
From: Dmitry Statyvka on 20 May 2010 18:39

[...]

DC> If your input is corrupted in such a way as to be unparsable, then
DC> perhaps fixing your input before parsing it is a better idea than
DC> making a one-off matcher, unless your data format is simple enough
DC> that a parser is not needed...

Almost agreed. It's a better idea to fix the input before parsing it,
unless making a matcher takes less effort. And sure, the effort of
support matters as well.

[...]

DC> So none of this is convincing me that regular expressions are the
DC> right tool for extracting information from HTML.

Regexps are just one such tool. Note: not for parsing HTML, but for
extracting information.

[...]

DC> Have you ever used CLOSURE-HTML? CL-PPCRE? Can you whip open a repl
DC> and back up your assertions with actual code? If not, then whatever
DC> hypotheticals you are discussing are not something that interests
DC> me, and are certainly not the topic under discussion (extracting
DC> information from HTML using Common Lisp).

Yes, yes, no. I have no time to write an epic post on a trivial
subject, sorry. :-)

[...]

>> I have no good example at hand, sorry. I just saw it in real
>> development (and real support too) a few years ago.

DC> In this case, is 'real' a mess of perl or php code?

No, they used C#.

DC> Perhaps something written by morons?

I have no reason to think so. The system seemed well designed, the code
was well structured, most sites were processed by parser-based
extractors, and the regexp-based extractors seemed to fit into the
system well enough...

DC> Is this a solution involving parsing HTML in order to retrieve
DC> information such as links and images, which is what we are talking
DC> about here... non?

The goal was to extract posts and comments from several forums.

[...]

Dmitry.
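[Editor's note: Drew's other claim, that for the simple cases a
hand-rolled matcher is as little effort as a regexp, can be illustrated
with nothing but standard SEARCH, POSITION, and SUBSEQ. A hypothetical
sketch, not code from the thread.]

```lisp
;; Illustrative one-off matcher: scan for the literal string href="
;; and collect everything up to the closing double quote.  It makes
;; the same assumptions (and shares the same brittleness) as the
;; regexp approach, but needs no library at all.
(defun extract-hrefs (html)
  "Return the value of every href=\"...\" attribute in HTML."
  (loop with needle = "href=\""
        for start = (search needle html)
          then (search needle html :start2 end)
        while start
        for open = (+ start (length needle))
        for end = (or (position #\" html :start open) (length html))
        collect (subseq html open end)))
```

At the REPL, (extract-hrefs "<a href=\"/a\">1</a> <a href=\"/b\">2</a>")
would collect ("/a" "/b").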