From: Ben Morrow on 24 Mar 2010 19:40

Quoth Tad McClellan <tadmc(a)seesig.invalid>:
>
> "pattern matching" is not at all the same as "parsing".
>
> Regular expressions are *great* for pattern matching.
>
> It is mathematically impossible to do a proper parse of a context-free
> language such as HTML with nothing more than regular expressions.
>
> They do not contain the requisite power.
>
> Google for the "Chomsky hierarchy".
>
> HTML allows a table within a table within a table within a table,
> to an arbitrary depth. ie. it is not "regular".

Perl's regexen are not regular. With the new features in 5.10 it's easy
to match something like that (it was possible before with (??{}), but
not easy):

    perl -E'"[[][[][]]]" =~ m!(?<nest> \[ (?&nest)* \] )!x
        and say $+{nest}'
    [[][[][]]]

Building a proper grammar for something like HTML would be harder,
especially if you wanted to keep it readable, but I expect it would be
possible. Certainly something simple that tracked comment/not-comment/
tag/not-tag would not be too hard, and would be sufficient for many
purposes.

> > I think there's an argument that, considering you can do this so
> > easily (in under 15 lines of code) without the overhead of
> > unnecessary includes, my way would be more efficient.
>
> Do you want easy and wrong or hard and correct?

I want easy and correct, so I'll use a module :).

> "pattern matching" is often "good enough", but you should realize
> its fragility so that you can assess whether it is worth the ease
> of implementation or not.

I just quoted that because I think it bears repeating.

Ben
From: Jürgen Exner on 24 Mar 2010 20:29

"Kyle T. Jones" <KBfoMe(a)realdomain.net> wrote:
>Jürgen Exner wrote:
>> "Kyle T. Jones" <KBfoMe(a)realdomain.net> wrote:
>>> Tad McClellan wrote:
>>>> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>>>>> Steve wrote:
>>>>>> like lets say I searched a site
>>>>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>>>>> would want to extract only the links that said hello in the title.
>>>>> Read up on perl regular expressions.
>>>>
>>>> While reading up on regular expressions is certainly a good idea,
>>>> it is a horrid idea for the purposes of parsing HTML.
>>>>
>>> Ummm. Could you expand on that?
[...]
>> Regular expressions recognize regular languages. But HTML is a
>> context-free language and therefore cannot be recognized solely by a
>> regular parser.
[...]
>> But you cannot! Ever heard of the Chomsky Hierarchy? No recollection
>> of Theory of Computer Languages or Basics of Compiler Construction?
>> What do people learn in Computer Science today?
>
>But isn't the Chomsky Hierarchy completely irrelevant in this (forgive
>the pun) context? Surely you "get" that my input is analyzed in terms
>of being nothing more or less than a sequence of characters - that it
>was originally written in HTML, or any other CFG-based language, is
>meaningless - both syntactical and semantical considerations of that
>original language are irrelevant in the (again, forgive me) context of
>what I'm attempting - which is simply to match one finite sequence of
>characters against another finite sequence of characters - I could care
>less what those characters mean, what href indicates, what a <body> tag
>is, etc.

True. If you know exactly what format your input can possibly have (and
if that input can be described using a finite state automaton) then by
all means yes, go for it. REs are perfect for such tasks.

But that is not what you have been asking; see the Subject of this
thread.
>I believe what you say above is true - to truly "parse" the page AS
>HTML is beyond the ability of REs - but I'm not parsing anything AS
>HTML, if that makes sense. In fact, to take that a step further, I'm
>not "parsing" period - so perhaps it was a mistake for me to use that
>term. I meant to use the term colloquially, sorry if that caused any
>confusion.

Well, yes and no. If you are in control of the format and you know
exactly what format is allowed and which formats are not allowed, then
you are right.

But if you are not in control of the input format, e.g. you are reading
from a third-party web page or you get your input data from finance or
marketing or the subsidiary on the opposite side of the world, then your
code must be able to handle any legal HTML, because the format could be
changed on you at any time. Which in turn means you must formally parse
the HTML code as HTML code; there is just no way around it.

jue
From: sln on 24 Mar 2010 21:36

On Wed, 24 Mar 2010 23:40:52 +0000, Ben Morrow <ben(a)morrow.me.uk>
wrote:
>
>Quoth Tad McClellan <tadmc(a)seesig.invalid>:
>>
>> "pattern matching" is not at all the same as "parsing".
>>
>> Regular expressions are *great* for pattern matching.
>>
>> It is mathematically impossible to do a proper parse of a
>> context-free language such as HTML with nothing more than regular
>> expressions.
>>
>> They do not contain the requisite power.
>>
>> Google for the "Chomsky hierarchy".
>>
>> HTML allows a table within a table within a table within a table,
>> to an arbitrary depth. ie. it is not "regular".
>
>Perl's regexen are not regular. With the new features in 5.10 it's easy
>to match something like that (it was possible before with (??{}), but
>not easy):
>
> perl -E'"[[][[][]]]" =~ m!(?<nest> \[ (?&nest)* \] )!x
>     and say $+{nest}'
> [[][[][]]]
  ^^^^^^^^^^

All this shows is balanced '[' ']' character matching using the
recursive ability of the 5.10 engine. Could this be an example where
each square bracket stands for a markup instruction, like <tag>? It
certainly doesn't pertain to the '<' angle bracket, the parsing
delimiter of the instruction.

HTML does not require closing tags, so as embedded markup instructions
interspersed with content are parsed, a guess is made, if errors are
found, about where to discontinue the instruction as applied to the
context - and, in general, about where the nesting stops.

There is a separation between the markup instruction and the content
via the markup delimiter '<'. That is the first level of parsing:
extracting the instruction from its delimiter, and thereby the content.
The second level is structuring the markup instruction within the
content. When a complete discrete structure is obtained, the document
processor renders it, a chunk at a time, mid-stream.
The first level, separating markup instructions from their delimiters
(and, as a side effect, exposing content), can be done by any language
that can compare characters. The second level can be done by any
language that can maintain a stack or nested variables.

There is no place for balanced-text processing in the first level of
parsing markup instructions. Instructions within instructions are NOT
well formed and will be kicked out by processors.

So essentially, as slow as it can be, if the aim is to peel away
delimiters to expose the markup instruction, regular expressions work
great. C processors are about 100-500 times faster, but they don't have
the ability to give extended (look-ahead) errors, nor will they
self-correct and continue.

In most cases a regular expression can identify errant markup
instruction syntax while correctly encapsulating the delimiting
expression. If there is an errant '<' delimiter in content, it is not
well formed but is still captured as content and easily reported.
Overall, there is no requirement for processors to stop on input that
is not well formed, but most do because they are full-featured and
compliant. Most go out and bring in includes, do substitutions,
reparse, etc. No, you won't get that with regular expressions, but
there is nothing stopping anybody from using them to parse out markup
instructions and content, nothing at all. Comparing characters is all
you do.

The reason regex is so slow is that it does pattern matching with
backtracking, grouping, etc. This doesn't mean it can't compare
characters; it sure can, and in a variable way which allows looking
ahead, which has benefits over state processing. As long as the regex
takes into account ALL possible markup instructions and delimiters as
exclusionary items, there is no reason why it can't be used to find
specific sub-patterns, either in content or in the markup instructions
themselves. And it can drive over and re-align after discrete syntax
errors without stopping.
All in all, it's a niche parser, and perfect at times when a DOM or SAX
parser is just too cumbersome - too much code overhead for something
simple.

-sln
From: Kyle T. Jones on 26 Mar 2010 01:14

Tad McClellan wrote:

Thanks for the reply - in particular, some of the code you provided and
corrected was interesting and informative.

You make a big deal about my use of the term "parse" throughout - I
sure felt as if I was being chastised. I was kind of surprised that I
did use it, to be honest. I figured I must have used it casually - and
mentioned such in another response:

"I believe what you say above is true - to truly "parse" the page AS
HTML is beyond the ability of REs - but I'm not parsing anything AS
HTML, if that makes sense. In fact, to take that a step further, I'm
not "parsing" period - so perhaps it was a mistake for me to use that
term. I meant to use the term colloquially, sorry if that caused any
confusion." - me

I'll attempt to stay away from such casual use of that particular term
in future interactions here.

As for suggestions that I google "Chomsky hierarchy" - all my peeps got
a kick out of that one.

Cheers.
From: Peter J. Holzer on 28 Mar 2010 06:05

On 2010-03-24 23:10, Tad McClellan <tadmc(a)seesig.invalid> wrote:
> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>> Tad McClellan wrote:
>>> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>>>> Steve wrote:
>>>
>>>>> like lets say I searched a site that had 15 news links and 3 of
>>>>> them said "Hello" in the title. I would want to extract only the
>>>>> links that said hello in the title.
>>>> Read up on perl regular expressions.
>>>
>>> While reading up on regular expressions is certainly a good idea,
>>> it is a horrid idea for the purposes of parsing HTML.
>>>
>>
>> Ummm. Could you expand on that?
>
> I think the FAQ answer does a pretty good job of it.
>
>> My initial reaction would be something like - I'm pretty sure *any*
>> method, including the use of HTML::LinkExtor, or XML transform (both
>> outlined upthread), involves using regular expressions "for the
>> purposes of parsing HTML".
>
> "pattern matching" is not at all the same as "parsing".
>
> Regular expressions are *great* for pattern matching.
>
> It is mathematically impossible to do a proper parse of a context-free
> language such as HTML with nothing more than regular expressions.
>
> They do not contain the requisite power.
>
> Google for the "Chomsky hierarchy".
>
> HTML allows a table within a table within a table within a table,
> to an arbitrary depth. ie. it is not "regular".

However, for extracting links you don't need to process nested tables.
You can view the file as a linear sequence of tags and text. And this
can be done with a regular grammar; you don't need a context-free
grammar.

> Try it with this:
>
> -------------------
> my $contents = '
> <html><body>
> <!--
> this is NOT a link...
> <a href="google.com">Google</a>
> -->
> </body></html>
> ';
> -------------------

Comments in HTML can also be described by regular expressions - no need
to write a context-free grammar for that.
But this is a good example of why you should use an existing module
instead of rolling your own: when you roll your own it is easy to
forget about special cases like this. A module which has been in use by
lots of people for some time is unlikely to contain such a bug.

> Also, your code does not address the OP's question.
>
> It tests the URL for a string rather than testing the <a> tag's
> _contents_.

A tag doesn't have content; an element has.

> That is, he wanted to test
>
> <a href="...">...</a>
>               ^^^
>               ^^^ here
>
> rather than
>
> <a href="...">...</a>
>          ^^^
>          ^^^

There are two tags in this snippet:

* <a href="...">
* </a>

The a element consists of the start tag, the end tag, and the content,
which is enclosed between the two tags. For some elements the end tag,
and for some even the start tag, can be omitted, but the element is
still there.

hp