From: Ben Morrow on 24 Mar 2010 19:40

Quoth Tad McClellan <tadmc(a)seesig.invalid>:
>
> "pattern matching" is not at all the same as "parsing".
>
> Regular expressions are *great* for pattern matching.
>
> It is mathematically impossible to do a proper parse of a context-free
> language such as HTML with nothing more than regular expressions.
>
> They do not contain the requisite power.
>
> Google for the "Chomsky hierarchy".
>
> HTML allows a table within a table within a table within a table,
> to an arbitrary depth. ie. it is not "regular".

Perl's regexen are not regular. With the new features in 5.10 it's easy
to match something like that (it was possible before with (??{}), but
not easy):

    perl -E'"[[][[][]]]" =~ m!(?<nest> \[ (?&nest)* \] )!x
        and say $+{nest}'
    [[][[][]]]

Building a proper grammar for something like HTML would be harder,
especially if you wanted to keep it readable, but I expect it would be
possible. Certainly something simple that tracked comment/not-comment/
tag/not-tag would not be too hard, and would be sufficient for many
purposes.

> > I think there's an argument that, considering you can do this so
> > easily (in under 15 lines of code) without the overhead of
> > unnecessary includes, my way would be more efficient.
>
> Do you want easy and wrong or hard and correct?

I want easy and correct, so I'll use a module :).

> "pattern matching" is often "good enough", but you should realize
> its fragility so that you can assess whether it is worth the ease
> of implementation or not.

I just quoted that because I think it bears repeating.

Ben
From: Jürgen Exner on 24 Mar 2010 20:29

"Kyle T. Jones" <KBfoMe(a)realdomain.net> wrote:
>Jürgen Exner wrote:
>> "Kyle T. Jones" <KBfoMe(a)realdomain.net> wrote:
>>> Tad McClellan wrote:
>>>> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>>>>> Steve wrote:
>>>>>> like lets say I searched a site
>>>>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>>>>> would want to extract only the links that said hello in the title.
>>>>> Read up on perl regular expressions.
>>>>
>>>> While reading up on regular expressions is certainly a good idea,
>>>> it is a horrid idea for the purposes of parsing HTML.
>>>>
>>> Ummm. Could you expand on that?
[...]
>> Regular expressions recognize regular languages. But HTML is a
>> context-free language and therefore cannot be recognized solely by a
>> regular parser.
[...]
>> But you cannot! Ever heard of the Chomsky Hierarchy? No recollection
>> of Theory of Computer Languages or Basics of Compiler Construction?
>> What do people learn in Computer Science today?
>
>But isn't the Chomsky Hierarchy completely irrelevant in this (forgive
>the pun) context? Surely you "get" that my input is analyzed in terms
>of being nothing more or less than a sequence of characters - that it
>was originally written in HTML, or any other CFG-based language, is
>meaningless - both syntactical and semantical considerations of that
>original language are irrelevant in the (again, forgive me) context of
>what I'm attempting - which is simply to match one finite sequence of
>characters against another finite sequence of characters - I could care
>less what those characters mean, what href indicates, what a <body> tag
>is, etc.

True. If you know exactly what format your input can possibly have (and
if that input can be described using a finite state automaton) then by
all means yes, go for it. REs are perfect for such tasks.

But that is not what you have been asking; see the Subject of this
thread.
>I believe what you say above is true - to truly "parse" the page AS
>HTML is beyond the ability of REs - but I'm not parsing anything AS
>HTML, if that makes sense. In fact, to take that a step further, I'm
>not "parsing" period - so perhaps it was a mistake for me to use that
>term. I meant to use the term colloquially, sorry if that caused any
>confusion.

Well, yes and no. If you are in control of the format and you know
exactly what format is allowed and which formats are not allowed, then
you are right.

But if you are not in control of the input format, e.g. you are reading
from a third-party web page or you get your input data from finance or
marketing or the subsidiary on the opposite side of the world, then your
code must be able to handle any legal HTML, because the format could be
changed on you at any time. Which in turn means you must formally parse
the HTML code as HTML code; there is just no way around it.

jue
From: sln on 24 Mar 2010 21:36

On Wed, 24 Mar 2010 23:40:52 +0000, Ben Morrow <ben(a)morrow.me.uk>
wrote:
>
>Quoth Tad McClellan <tadmc(a)seesig.invalid>:
>>
>> "pattern matching" is not at all the same as "parsing".
>>
>> Regular expressions are *great* for pattern matching.
>>
>> It is mathematically impossible to do a proper parse of a
>> context-free language such as HTML with nothing more than regular
>> expressions.
>>
>> They do not contain the requisite power.
>>
>> Google for the "Chomsky hierarchy".
>>
>> HTML allows a table within a table within a table within a table,
>> to an arbitrary depth. ie. it is not "regular".
>
>Perl's regexen are not regular. With the new features in 5.10 it's easy
>to match something like that (it was possible before with (??{}), but
>not easy):
>
> perl -E'"[[][[][]]]" =~ m!(?<nest> \[ (?&nest)* \] )!x
>     and say $+{nest}'
> [[][[][]]]
  ^^^^^^^^^^

All this shows is balanced '[' ']' character matching using the
recursive ability of the 5.10 engine. Could this be an example where
each square bracket stands for a markup instruction, like <tag>? It
certainly doesn't pertain to the '<' angle bracket, the parsing
delimiter of the instruction.

HTML does not require closing tags, so as embedded markup instructions
interspersed with content are parsed, a guess is made, if errors are
found, about where to discontinue the instruction as applied to the
context - and, in general, about where the nesting stops.

There is a separation between the markup instruction and the content
via the markup delimiter '<'. That is the first level of parsing:
extracting the instruction from its delimiter, and thereby the content.
The second level is structuring the markup instruction within the
content. When a complete discrete structure is obtained, the document
processor renders it, a chunk at a time, mid-stream.
The first level, separating markup instructions from their delimiters
(and, as a side effect, exposing content), can be done by any language
that can compare characters. The second level can be done by any
language that can maintain a stack or nested variables.

There is no place for balanced-text processing in the first level of
parsing markup instructions. Instructions within instructions are NOT
well formed and will be kicked out by processors.

So essentially, as slow as it can be, if the aim is to peel away
delimiters to expose the markup instruction, regular expressions work
great. C processors are about 100-500 times faster, but they don't have
the ability to give extended (look-ahead) errors, nor will they
self-correct and continue.

In most cases a regular expression can identify errant markup
instruction syntax while correctly encapsulating the delimiting
expression. If there is an errant '<' delimiter in content, it is not
well formed but is still captured as content and easily reported.
Overall, there is no requirement for processors to stop on input that
is not well formed, but most do because they are full-featured and
compliant. Most go out and bring in includes, do substitutions,
reparse, etc. No, you won't get that with regular expressions, but
there is nothing stopping anybody from using them to parse out markup
instructions and content, nothing at all. Comparing characters is all
you do.

The reason regex is so slow is that it does pattern matching with
backtracking, grouping, etc. This doesn't mean it can't compare
characters; it sure can, and in a variable way which allows looking
ahead, which has benefits over state processing. As long as the regex
takes into account ALL possible markup instructions and delimiters as
exclusionary items, there is no reason why it can't be used to find
specific sub-patterns, either in content or in the markup instructions
themselves. And it can drive over and re-align after discrete syntax
errors without stopping.
All in all, it's a niche parser, and perfect at times when a DOM or SAX
parser is just too cumbersome - too much code overhead for something
simple.

-sln
From: Kyle T. Jones on 26 Mar 2010 01:14

Tad McClellan wrote:

Thanks for the reply - in particular, some of the code you provided and
corrected was interesting and informative.

You make a big deal about my use of the term "parse" throughout - I
sure felt as if I was being chastised. I was kind of surprised that I
did use it, to be honest. I figured I must have used it casually - and
mentioned such in another response:

"I believe what you say above is true - to truly "parse" the page AS
HTML is beyond the ability of REs - but I'm not parsing anything AS
HTML, if that makes sense. In fact, to take that a step further, I'm
not "parsing" period - so perhaps it was a mistake for me to use that
term. I meant to use the term colloquially, sorry if that caused any
confusion." - me

I'll attempt to stay away from such casual use of that particular term
in future interactions here.

As for suggestions that I google "Chomsky hierarchy" - all my peeps got
a kick out of that one.

Cheers.
From: Peter J. Holzer on 28 Mar 2010 06:05

On 2010-03-24 23:10, Tad McClellan <tadmc(a)seesig.invalid> wrote:
> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>> Tad McClellan wrote:
>>> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>>>> Steve wrote:
>>>
>>>>> like lets say I searched a site that had 15 news links and 3 of
>>>>> them said "Hello" in the title. I would want to extract only the
>>>>> links that said hello in the title.
>>>> Read up on perl regular expressions.
>>>
>>> While reading up on regular expressions is certainly a good idea,
>>> it is a horrid idea for the purposes of parsing HTML.
>>>
>>
>> Ummm. Could you expand on that?
>
> I think the FAQ answer does a pretty good job of it.
>
>> My initial reaction would be something like - I'm pretty sure *any*
>> method, including the use of HTML::LinkExtor, or XML transform (both
>> outlined upthread), involves using regular expressions "for the
>> purposes of parsing HTML".
>
> "pattern matching" is not at all the same as "parsing".
>
> Regular expressions are *great* for pattern matching.
>
> It is mathematically impossible to do a proper parse of a context-free
> language such as HTML with nothing more than regular expressions.
>
> They do not contain the requisite power.
>
> Google for the "Chomsky hierarchy".
>
> HTML allows a table within a table within a table within a table,
> to an arbitrary depth. ie. it is not "regular".

However, for extracting links you don't need to process nested tables.
You can view the file as a linear sequence of tags and text. And this
can be done with a regular grammar; you don't need a context-free
grammar.

> Try it with this:
>
> -------------------
> my $contents = '
> <html><body>
> <!--
> this is NOT a link...
> <a href="google.com">Google</a>
> -->
> </body></html>
> ';
> -------------------

Comments in HTML can also be described by regular expressions - no need
to write a context-free grammar for that.
But this is a good example of why you should use an existing module
instead of rolling your own: when you roll your own it is easy to
forget about special cases like this. A module which has been in use by
lots of people for some time is unlikely to contain such a bug.

> Also, your code does not address the OP's question.
>
> It tests the URL for a string rather than testing the <a> tag's
> _contents_.

A tag doesn't have content; an element has.

> That is, he wanted to test
>
> <a href="...">...</a>
>               ^^^
>               ^^^ here
>
> rather than
>
> <a href="...">...</a>
>          ^^^
>          ^^^

There are two tags in this snippet:

* <a href="...">
* </a>

The a element consists of the start tag, the end tag, and the content,
which is enclosed between the two tags. For some elements the end tag,
and for some even the start tag, can be omitted, but the element is
still there.

hp