From: Dmitry Statyvka on
>>>>> Drew Crampsie writes:

[...]

>> And I think I was pretty clear that it is a half-assed
>> solution. That's just an option.

DC> It's a terrible solution, and i can't see why you're still
DC> defending it.

I can. Because it may be a pretty good solution under certain
conditions. For example, when one wants to extract just some links (not
all) or pieces of text from a given page on a given site, and the page
is generated by

[...]
From: Dmitry Statyvka on
>>>>> Drew Crampsie writes:

DC> Dmitry Statyvka <dmitry(a)statyvka.org.ua> writes:
>>>>>>> Drew Crampsie writes:
>>
>> [...]
>>
>> >> And I think I was pretty clear that it is a half-assed
>> >> solution. That's just an option.
>>
DC> It's a terrible solution, and i can't see why you're still
DC> defending it.
>>
>> I can. Because it may be a pretty good solution in certain
>> conditions.

DC> Compared to what? I've shown for the simple cases where a regexp
DC> may be a 'pretty good' solution, a hand-rolled matcher is the same
DC> amount of code and effort, and significantly faster.

DC> For anything more, i've shown that using a proper parser is less
DC> code, less effort, and actually works in more cases.

It is, if a proper parser can parse what we downloaded. It's wrong to
assume that any downloaded page will be valid HTML. It's wrong to
suppose that driving the pipeline "HTML-client | HTML-beautifier |
parser + content extractor | business logic" will be simpler, faster,
or whatever than driving "HTML-client | content extractor | business
logic" in every possible case. That's the point of Cap's original post.
And it's obvious.
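As an illustration of the shorter pipeline being argued for here, a
minimal sketch (not from the thread; it assumes the CL-PPCRE system is
loaded, and EXTRACT-HREFS is a hypothetical helper name):

```lisp
;; Sketch of "HTML-client | content extractor": pull href values
;; straight out of the raw markup with CL-PPCRE, with no parsing or
;; beautifying step. It behaves the same whether the page is valid
;; HTML or not -- which is both its appeal and its limitation.
(defun extract-hrefs (html)
  "Return the href values of all <a> tags found in HTML."
  (let ((links '()))
    (cl-ppcre:do-register-groups (href)
        ("<a[^>]+href=[\"']([^\"']+)[\"']" html)
      (push href links))
    (nreverse links)))

;; (extract-hrefs "<p><a href=\"/one\">1</a> <a href='/two'>2</a></p>")
;; => ("/one" "/two")
```

Note the half-assed part: this sketch misses unquoted attribute values
and happily matches links inside comments or scripts, which is exactly
the kind of trade-off being debated.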

[...]

DC> I figure you meant to 'cancel' and send instead,

I meant to 'save' and send instead.

DC> as nobody using their real name would in their right mind defend a
DC> regexp based solution in the face of the evidence offered.)

For the sake of clarity, in my opinion: a regexp-based solution can be
simpler, more manageable, and faster than an HTML-parsing-based
solution. I have no good example at hand, sorry. I just saw it in real
development (and real support too) a few years ago.

Dmitry.

From: Drew Crampsie on
Dmitry Statyvka <dmitry(a)statyvka.org.ua> writes:

>>>>>> Drew Crampsie writes:
>
> DC> Dmitry Statyvka <dmitry(a)statyvka.org.ua> writes:
> >>>>>>> Drew Crampsie writes:
> >>
> >> [...]
> >>
> >> >> And I think I was pretty clear that it is a half-assed
> >> >> solution. That's just an option.
> >>
> DC> It's a terrible solution, and i can't see why you're still
> DC> defending it.
> >>
> >> I can. Because it may be a pretty good solution in certain
> >> conditions.
>
> DC> Compared to what? I've shown for the simple cases where a regexp
> DC> may be a 'pretty good' solution, a hand-rolled matcher is the same
> DC> amount of code and effort, and significantly faster.
>
> DC> For anything more, i've shown that using a proper parser is less
> DC> code, less effort, and actually works in more cases.
>
> It is, if a proper parser can parse what we downloaded. It's wrong to
> assume that any downloaded page will be valid HTML. It's wrong to
> suppose that driving the pipeline "HTML-client | HTML-beautifier |
> parser + content extractor | business logic" will be simpler, faster,
> or whatever than driving "HTML-client | content extractor | business
> logic" in every possible case. That's the point of Cap's original
> post. And it's obvious.

If your input is corrupted in such a way as to be unparsable, then
perhaps fixing your input before parsing it is a better idea than making
a one-off matcher, unless your data format is simple enough that a
parser is not needed...

And, as i've shown, for the simple cases where a regexp based solution
is usable, it's just as easy to hand-craft a matcher, so regexps could
be said to suffer from the same 'over complex' problem in the instances
you claim they are useful.

So none of this is convincing me that regular expressions are the right
tool for extracting information from HTML. When you tell me they are
good for extracting information from things that are not HTML, don't be
offended if i say 'who gives a toss?'.

I've shown the two pipelines to be of similar speed and similar effort
for a toy problem, and also demonstrated that for a larger problem, the
parser gains a significant advantage.

Have you ever used CLOSURE-HTML? CL-PPCRE? Can you whip open a repl and
back up your assertions with actual code? If not, then whatever
hypotheticals you are discussing are not something that interest me, and
are certainly not the topic under discussion (extracting information
from HTML using Common Lisp).
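For comparison with the sketch above's approach, a minimal sketch of
the parser-based route (not from the thread; it assumes the
closure-html system is loaded, and EXTRACT-HREFS/PARSED is a
hypothetical name). CLOSURE-HTML's parser is error-correcting, so
"invalid HTML" is repaired during parsing, browser-style, rather than
rejected:

```lisp
;; Sketch of "HTML-client | parser + content extractor": parse the
;; page into an LHTML tree and walk it, collecting :HREF attributes
;; of :A elements. LHTML nodes look like (TAG ATTRS CHILD...), where
;; ATTRS is a list of (KEYWORD VALUE) pairs and text nodes are strings.
(defun extract-hrefs/parsed (html)
  "Collect href attributes by walking the LHTML tree CHTML:PARSE builds."
  (let ((tree (chtml:parse html (chtml:make-lhtml-builder)))
        (links '()))
    (labels ((walk (node)
               (when (consp node)
                 (destructuring-bind (tag attrs &rest children) node
                   (when (eq tag :a)
                     (let ((href (second (assoc :href attrs))))
                       (when href (push href links))))
                   (mapc #'walk children)))))
      (walk tree))
    (nreverse links)))

;; Even with unquoted attributes and unclosed tags, the error-correcting
;; parse recovers a usable tree:
;; (extract-hrefs/parsed "<p><a href=/one>1 <a href=/two>2")
```

The design point this illustrates: the traversal code never touches
the raw markup, so tag-soup repair happens once, in the parser, instead
of being re-invented in every extractor.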


> DC> as nobody using their real name would in their right mind defend a
> DC> regexp based solution in the face of the evidence offered.)
>
> For the sake of clarity, in my opinion: a regexp-based solution can
> be simpler, more manageable, and faster than an HTML-parsing-based
> solution.

I recognize your right to your opinion, but your unwillingness to back
it up with any code, or even an example, sets off my bullshit
alarm. Since you have no refutation at hand, i'm not going to turn it
off. :P

> I have no good example at hand, sorry. I just saw it in real
> development (and real support too) a few years ago.

In this case, is 'real' a mess of perl or php code? Perhaps something
written by morons? Is this a solution involving parsing HTML in order
to retrieve information such as links and images, which is what we are
talking about here... non?

I'd like to see an example that's a little more substantial, but i doubt
your ability to produce it. I'm willing to be swayed with a good
argument, but i'm pretty sure it doesn't exist. :)


Cheers,

drewc


>
> Dmitry.
From: Dmitry Statyvka on

[...]

DC> If your input is corrupted in such a way as to be unparsable, then
DC> perhaps fixing your input before parsing it is a better idea than
DC> making a one-off matcher, unless your data format is simple enough
DC> that a parser is not needed...

Almost agreed. It's a better idea to fix the input before parsing it,
unless making a matcher takes less effort. Sure, the effort of support
matters too.

[...]

DC> So none of this is convincing me that regular expressions are the
DC> right tool for extracting information from HTML.

Regexps are just one such tool. Note: not for parsing HTML, but for
extracting information.

[...]

DC> Have you ever used CLOSURE-HTML? CL-PPCRE? Can you whip open a repl
DC> and back up your assertions with actual code? If not, then whatever
DC> hypotheticals you are discussing are not something that interest
DC> me, and are certainly not the topic under discussion (extracting
DC> information from HTML using Common Lisp).

Yes, yes, no. I have no time to write an epic post on a trivial
subject, sorry. :-)

[...]

>> I have no good example at hand, sorry. I just saw it in real
>> development (and real support too) a few years ago.

DC> In this case, is 'real' a mess of perl or php code?

No, they used C#.

DC> Perhaps something written by morons?

I have no reason to think so. The system seemed well designed, the
code was well structured, most sites were processed by parser-based
extractors, and the regexp-based extractors seemed to fit into the
system well enough...

DC> Is this a solution involving parsing HTML in order to retrieve
DC> information such as links and images, which is what we are talking
DC> about here... non?

The goal was to extract posts and comments from several forums.

[...]

Dmitry.