From: Alan Silver on 25 Feb 2010 15:15

>Try the HTMLAgilityPack, it's much better for getting the information
>you want.
>
>See Codeplex.com/HtmlAgilityPack

Harumph! If I'd seen that earlier, I could have saved a good few hours
of frustration! Mind you, what I ended up with was very neat and
compact, so I can't complain I suppose.

Thanks for pointing that one out. It certainly deserves a close look.

--
Alan Silver
(anything added below this line is nothing to do with me)
From: Michael Wojcik on 2 Mar 2010 12:18

Alan Silver wrote:
>
> Say I want to look for a link to the domain www.fred.com, then the regex...
>
> <a .*?nofollow.*?http://www\.fred\.com.*?>.*?</a>
>
> ...will match the following...
>
> <a rel="nofollow" href="http://www.fred.com">fred</a>
>
> ...which is right, but it will also match...
>
> <a href="http://www.cnn.com/" rel="nofollow">CNN</a><a
> href="http://www.fred.com">fred</a>
>
> ...which I don't want. It seems that the regex is matching the nofollow
> part to the first link, and so telling me that the whole HTML fragment
> contains a nofollow link to www.fred.com. This is wrong.

It's difficult to do this completely reliably, because implementing
the entire HTML DTD *plus* violations of it that are accepted by
common UAs (browsers and such) in a DFA is very complicated.

If we make some assumptions about the quality of the HTML you're
dealing with, though, we can simplify it considerably. Let's say that
it has to be well-formed, and that there's no whitespace between "<"
and "a" of an anchor tag. Then you can prevent your regex above from
spanning multiple anchor elements by:

- Ensuring you don't span the end of the <a> tag when matching the
  attributes you're looking for within it. Change ".*" in that part of
  the regex to "[^>]*", so the subexpression will stop at the closing
  ">".

- Ensuring we don't capture "</a" between the "<a>" tag and the closing
  "</a>" tag - that is, that we stop at the first "</a>" and don't
  continue on to a later one, swallowing additional entire anchor
  elements in the process. You can do that with a regex that matches:

  - any number of:
    - any number of characters that aren't "<", then
    - either:
      - "<" followed by a character that isn't "/", or
      - "</" followed by a character that isn't "a"

That can be expressed by this regular expression:

([^<]*((<[^/])|(</[^a]))*)*

(Read it from the inside out. "(<[^/])" is "'<' followed by a
character that isn't '/'", and so on.)
That gives us:

<a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>([^<]*((<[^/])|(</[^a]))*)*</a>

(That's probably going to be wrapped. It should be all on one line,
obviously.)

Also note that you no longer need the "?" after the "*". In ".*?" it
makes the match lazy (as few characters as possible), but "[^>]*"
cannot run past the closing ">" either way, so the plain "*" does the
job.

This works with your examples above. It also correctly handles child
elements of the anchor element (other than <a> within <a>, which isn't
valid HTML):

<a rel="nofollow" href="http://www.fred.com">f<b>r</b>ed</a>

It seems to me that there ought to be a way to handle the second half
of that regex with negative lookahead, which might be simpler, but I
couldn't get that to work with a couple of quick tries.

All this is assuming you actually need to match the entire anchor
element in the HTML source for some reason. If you just want to verify
whether the <a> tag is present with those attributes, you can ignore
what comes after the closing ">" and greatly simplify the regex.

--
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University
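
The full-element pattern above can be tried out with a short script. This is only a sketch in Python's re module, which is an assumption (the thread never names a regex engine); the names FULL_ANCHOR, LOOKAHEAD_ANCHOR, SAMPLES and matches are made up for illustration. The second pattern is one possible reading of the negative-lookahead idea, not something posted in the thread, and the child-element sample uses the www. host so that it fits the pattern.

```python
import re

# The full-element pattern exactly as posted above.
FULL_ANCHOR = re.compile(
    r'<a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>'
    r'([^<]*((<[^/])|(</[^a]))*)*</a>'
)

# Assumed lookahead variant (not from the thread): take one character
# at a time, but only while the text ahead is not the closing "</a>".
LOOKAHEAD_ANCHOR = re.compile(
    r'<a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>'
    r'(?:(?!</a>).)*</a>',
    re.DOTALL,  # let "." cross line breaks inside the anchor text
)

SAMPLES = {
    # nofollow link to www.fred.com: should match
    "plain": '<a rel="nofollow" href="http://www.fred.com">fred</a>',
    # nofollow belongs to the CNN link only: should not match
    "two_links": ('<a href="http://www.cnn.com/" rel="nofollow">CNN</a>'
                  '<a href="http://www.fred.com">fred</a>'),
    # child element inside the anchor text: should match
    "child": '<a rel="nofollow" href="http://www.fred.com">f<b>r</b>ed</a>',
}

def matches(pattern, html):
    """Return True if the pattern is found anywhere in the HTML text."""
    return pattern.search(html) is not None

if __name__ == "__main__":
    for name, html in SAMPLES.items():
        print(name, matches(FULL_ANCHOR, html), matches(LOOKAHEAD_ANCHOR, html))
```

One difference worth knowing about: the posted pattern rejects any closing tag whose name starts with "a", so an anchor containing <abbr>...</abbr> is not matched by it, whereas the lookahead variant only stops at the literal "</a>".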
From: Alan Silver on 3 Mar 2010 09:37

Wow, what a comprehensive reply! Comments below...

In article <hmjm6d01qpp(a)news4.newsguy.com>, Michael Wojcik
<mwojcik(a)newsguy.com> writes
>It's difficult to do this completely reliably, because implementing
>the entire HTML DTD *plus* violations of it that are accepted by
>common UAs (browsers and such) in a DFA is very complicated.

Yup, I was assuming (perhaps foolishly) that such a simple thing as an
anchor tag might be generally well-formed ;-)

<snip>
><a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>([^<]*((<[^/])|(</[^a]))*)*</a>
<Snip>
>All this is assuming you actually need to match the entire anchor
>element in the HTML source for some reason. If you just want to verify
>whether the <a> tag is present with those attributes, you can ignore
>what comes after the closing ">" and greatly simplify the regex.

I realised after posting that I was only interested in the opening part
of the tag, as my interest here is whether or not the link is there, and
if there is a nofollow value set. I ignored the anchor text and closing
tag.

So, how does your regex compare with the one I posted a couple of days
ago? I solved the problem I had in a similar way to yours (I think), and
ended up with...

<a [^<>]+nofollow[^<>]+http://www\.fred\.com[^<>]+>

This one only matches if there is a nofollow. I need to detect that, so
I had one regex to check for an anchor tag...

<a .*?http://www\.fred\.com.*?>.*?</a>

...and then the previous regex to match a nofollow before the href and a
similar one for when the nofollow is after the href.

Is there anything to choose between your method and mine? I'm a rank
beginner at regexes, so if yours has some distinct advantage, please
explain what. It could just be that they are two slightly different ways
of doing the same thing, I don't know.

Thanks very much for the reply

--
Alan Silver
(anything added below this line is nothing to do with me)
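
The three-check approach described in this post might look roughly like the sketch below. It is written in Python's re module as an assumption, since the thread does not name an engine. Only NOFOLLOW_BEFORE is quoted from the post. NOFOLLOW_AFTER is a guess at the "similar one", and LINK is a hypothetical opening-tag-only replacement for the any-link check, chosen so that it cannot span two tags. The function name classify is also made up.

```python
import re

# Hypothetical opening-tag-only check for "is there a link at all".
LINK = re.compile(r'<a [^<>]*http://www\.fred\.com[^<>]*>')

# Quoted from the post: nofollow appears before the href.
NOFOLLOW_BEFORE = re.compile(
    r'<a [^<>]+nofollow[^<>]+http://www\.fred\.com[^<>]+>'
)

# Assumed mirror image: nofollow appears after the href.
NOFOLLOW_AFTER = re.compile(
    r'<a [^<>]+http://www\.fred\.com[^<>]+nofollow[^<>]+>'
)

def classify(html):
    """Report whether the HTML links to www.fred.com and how."""
    if not LINK.search(html):
        return "no link"
    if NOFOLLOW_BEFORE.search(html) or NOFOLLOW_AFTER.search(html):
        return "nofollow link"
    return "followed link"

if __name__ == "__main__":
    print(classify('<a rel="nofollow" href="http://www.fred.com">fred</a>'))
    print(classify('<a href="http://www.fred.com" rel="nofollow">fred</a>'))
    print(classify('<a href="http://www.cnn.com/" rel="nofollow">CNN</a>'
                   '<a href="http://www.fred.com">fred</a>'))
```

Note that a page holding both a followed and a nofollow link to the same site would be reported as "nofollow link" here, because each check looks at the whole text rather than at one tag.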
From: Michael Wojcik on 3 Mar 2010 12:48

Alan Silver wrote:
> In article <hmjm6d01qpp(a)news4.newsguy.com>, Michael Wojcik
> <mwojcik(a)newsguy.com> writes
>> It's difficult to do this completely reliably, because implementing
>> the entire HTML DTD *plus* violations of it that are accepted by
>> common UAs (browsers and such) in a DFA is very complicated.
>
> Yup, I was assuming (perhaps foolishly) that such a simple thing as an
> anchor tag might be generally well-formed ;-)

Alas, with HTML, you never know (unless you validate the HTML). User
Agents will accept all sorts of garbage, so many authors don't feel
any need to create valid markup.

But usually you can get by with some assumptions and live with a small
probability of encountering bogus markup that doesn't work.

>> <a [^>]*nofollow[^>]*http://www\.fred\.com[^>]*>([^<]*((<[^/])|(</[^a]))*)*</a>
>>
> <Snip>
>> All this is assuming you actually need to match the entire anchor
>> element in the HTML source for some reason. If you just want to verify
>> whether the <a> tag is present with those attributes, you can ignore
>> what comes after the closing ">" and greatly simplify the regex.
>
> I realised after posting that I was only interested in the opening part
> of the tag, as my interest here is whether or not the link is there, and
> if there is a nofollow value set. I ignored the anchor text and closing
> tag.
>
> So, how does your regex compare with the one I posted a couple of days
> ago? I solved the problem I had in a similar way to yours (I think), and
> ended up with...
>
> <a [^<>]+nofollow[^<>]+http://www\.fred\.com[^<>]+>

If you remove the part of mine that captures the element content and
closing tag, your regex and mine have a few differences, but in
practice they should be equally usable.

You're eliminating "<" from inside the <a> tag. It shouldn't appear
there (unless the page uses the SGML short tag syntax, but I've never
seen anyone do so), so in practice my "[^>]" and your "[^<>]" will
produce the same results. Use whichever you prefer. (Some people might
find yours more readable, due to its visual symmetry.)

You're using the + operator where I use the * operator. We expect that
at least one character will be matched in all of those places, so
again this shouldn't make any difference in practice.

> This one only matches if there is a nofollow. I need to detect that, so
> I had one regex to check for an anchor tag...
>
> <a .*?http://www\.fred\.com.*?>.*?</a>
>
> ...and then the previous regex to match a nofollow before the href and a
> similar one for when the nofollow is after the href.

You could combine all three of these into a single expression, but
frankly if all you're looking for is whether you have a match - you're
not capturing groups or anything like that - I'd stick with the three
regexes you have now. They work, and they're easier to read,
understand, and maintain.

People who write a lot of regexes tend to start viewing them as an
opportunity for cleverness to the point of obscurity, like TECO macros
were back in the day. Personally, I'm a fan of readability and
maintainability. Where I have hard-coded regexes in my code, I usually
split the string up into component parts with comments, so the reader
can see what I'm doing.

--
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University
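
As an illustration of splitting a hard-coded regex into commented component parts, here is a minimal sketch using the nofollow-before-href pattern quoted earlier in this post. It assumes Python; the constant names are invented for the example and are not anyone's actual code from the thread.

```python
import re

# Each piece of the pattern gets its own name and comment.
TAG_OPEN = r'<a '                     # start of an anchor tag, then a space
IN_TAG = r'[^<>]+'                    # one or more characters inside the tag
NOFOLLOW = r'nofollow'                # the rel value being looked for
TARGET_URL = r'http://www\.fred\.com' # link target, with the dots escaped
TAG_CLOSE = r'>'                      # end of the opening tag

# Joined, the parts give the same pattern as the one-line version.
NOFOLLOW_THEN_HREF = re.compile(
    TAG_OPEN + IN_TAG + NOFOLLOW + IN_TAG + TARGET_URL + IN_TAG + TAG_CLOSE
)

if __name__ == "__main__":
    tag = '<a rel="nofollow" href="http://www.fred.com">fred</a>'
    print(NOFOLLOW_THEN_HREF.pattern)
    print(NOFOLLOW_THEN_HREF.search(tag) is not None)
```

Plain concatenation is used here rather than a verbose-mode pattern because the literal space after "<a" would need escaping in verbose mode, which is easy to miss.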
From: Alan Silver on 7 Mar 2010 14:12

In article <hmm9l4027gr(a)news3.newsguy.com>, Michael Wojcik
<mwojcik(a)newsguy.com> writes
>People who write a lot of regexes tend to start viewing them as an
>opportunity for cleverness to the point of obscurity, like TECO macros
>were back in the day. Personally, I'm a fan of readability and
>maintainability. Where I have hard-coded regexes in my code, I usually
>split the string up into component parts with comments, so the reader
>can see what I'm doing.

Hee hee, I'm with you. I remember my (fairly brief) foray into Perl. I
got the same impression there - some people were only interested in how
short (and therefore unreadable) they could make their code.

Anyway, I'm glad what I did is basically the same as yours. I understand
it a lot better now.

Thanks very much for the help.

--
Alan Silver
(anything added below this line is nothing to do with me)