From: Alan Silver on 24 Feb 2010 13:16

Hello,

I'm trying to write some code to check for a link in some HTML that has been pulled from a web site. I think this should be easy with a RegEx, but I can't get my head round it.

To make sure it's clear, a normal HTML link looks like...

    <a href="http://www.microsoft.com/sompage.aspx">some page</a>

...but can also look like...

    <a href="http://www.microsoft.com/sompage.aspx" rel="nofollow">some page</a>

There are loads of other variations, but this is all that interests me right now.

I want to check the HTML to see...

1) Is there a link to my target URL (which will be given), and
2) Does that link have the rel="nofollow" part or not?

Anyone any ideas how I would do this? I've tried all sorts of things, but not got anything that works.

Just to throw a spanner in the works, the rel="nofollow" bit could appear before or after the href="whatever" bit.

I would be really grateful for any help here.

TIA

--
Alan Silver
(anything added below this line is nothing to do with me)
From: Alan Silver on 24 Feb 2010 14:01

Just to follow up on my own post, I've finally got something that nearly works, but it isn't quite there.

Say I want to look for a link to the domain www.fred.com, then the regex...

    <a .*?nofollow.*?http://www\.fred\.com.*?>.*?</a>

...will match the following...

    <a rel="nofollow" href="http://www.fred.com">fred</a>

...which is right, but it will also match...

    <a href="http://www.cnn.com/" rel="nofollow">CNN</a><a href="http://www.fred.com">fred</a>

...which I don't want. It seems that the regex is matching the nofollow part to the first link, and so telling me that the whole HTML fragment contains a nofollow link to www.fred.com. This is wrong.

So, how do I modify this regex so that it won't look at the nofollow part in another link?

Thanks for any help

--
Alan Silver
(anything added below this line is nothing to do with me)
From: Jesse Houwing on 25 Feb 2010 05:17

Try the HTMLAgilityPack; it's much better suited to getting the information you want. See Codeplex.com/HtmlAgilityPack

Jesse

--
Jesse Houwing
jesse.houwing at sogeti.nl
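For anyone who wants to try the parser route suggested above, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced in the project. The names ParsedLinkCheck and TryFindLink are made up for illustration, and the substring test on the href value is a simplification, not a proper URL comparison.

```csharp
using System;
using HtmlAgilityPack; // third-party package, not part of .NET itself

public static class ParsedLinkCheck
{
    // Hypothetical helper: returns true if an <a> whose href contains
    // targetUrl is found; noFollow reports whether that first matching
    // link carries "nofollow" in its rel attribute.
    public static bool TryFindLink(string html, string targetUrl, out bool noFollow)
    {
        noFollow = false;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return false;

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            if (href.IndexOf(targetUrl, StringComparison.OrdinalIgnoreCase) < 0) continue;

            // Attribute order in the source HTML no longer matters here.
            string rel = a.GetAttributeValue("rel", "");
            noFollow = rel.IndexOf("nofollow", StringComparison.OrdinalIgnoreCase) >= 0;
            return true;
        }
        return false;
    }
}
```

Because the parser exposes href and rel as separate attributes, the "before or after" problem from the original question does not arise.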
From: eBob.com on 25 Feb 2010 09:07

I haven't played with the HTMLAgilityPack or any other HTML parser so I can't compare that approach to RegEx.

I highly recommend Expresso from UltraPico for experimenting with regular expressions. (It's free.)

I think your problem is that .*? is sucking up too many characters and overflowing into another tag. So instead of matching . (any character) you could try matching any character other than "<".

Based on what you've told us, and just off the top of my head, I think my expression would look for, in pseudo regex,

    <a optional nofollow http://www\.fred\.com optional nofollow </a>

That would match some dumb html which had nofollow before and after the url, but I'd guess that doesn't matter. I don't know if there is a way in regex to insist that the nofollow can appear in one place or another but not both. But using "named groups" (I think that's the right terminology) you could determine where the nofollows had occurred.

Good Luck, Bob
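The overflow described above, and the fix of excluding angle brackets, can be sketched in C# roughly as follows. This is an illustration under assumptions, not code from the thread: the class name TagBoundaryDemo, the group names before and after, and the Classify helper are all hypothetical, and the domain www.fred.com is hard-coded to match the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagBoundaryDemo
{
    // The pattern from the follow-up post: "." may match "<" and ">", so a
    // match can start in one <a> tag and finish inside the next one.
    public static readonly Regex Loose = new Regex(
        @"<a .*?nofollow.*?http://www\.fred\.com.*?>.*?</a>",
        RegexOptions.Singleline);

    // Same idea, but [^<>] keeps the whole match inside one opening tag.
    // The optional named groups record where "nofollow" was seen.
    public static readonly Regex Bounded = new Regex(
        @"<a\s(?:[^<>]*?(?<before>nofollow))?[^<>]*?http://www\.fred\.com(?:[^<>]*?(?<after>nofollow))?[^<>]*>",
        RegexOptions.IgnoreCase);

    // Classifies the first link to www.fred.com as "none", "follow" or "nofollow".
    public static string Classify(string html)
    {
        Match m = Bounded.Match(html);
        if (!m.Success) return "none";

        bool noFollow = m.Groups["before"].Success || m.Groups["after"].Success;
        return noFollow ? "nofollow" : "follow";
    }
}
```

On the two-link fragment from the follow-up post, Loose still reports a match, while Classify reports "follow", because the nofollow in the CNN tag can no longer be borrowed by the fred link. The word "nofollow" is matched anywhere inside the tag, so a URL path containing that word would be misreported.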
From: Alan Silver on 25 Feb 2010 15:10
Hello,

Thanks for the reply. I have Expresso, which is very good, but doesn't necessarily tell you how to build the regex you want.

However, after some playing around, I came up with something that worked. As you pointed out, the .*? parts were matching stuff outside of the current tag. I replaced them with [^<>]+ to stop that, and it worked fine. I had to do two regexes, one to catch the nofollow before the href, and one for when it was after.

The code I ended up with was...

    Regex regLink = new Regex(@"<a .*?http://" + targetUrl.Replace(".", @"\.") + @".*?>.*?</a>", RegexOptions.Singleline);
    Regex regLinkNofollowL = new Regex(@"<a [^<>]+nofollow[^<>]+http://" + targetUrl.Replace(".", @"\.") + @"[^<>]+>", RegexOptions.Singleline);
    Regex regLinkNofollowR = new Regex(@"<a [^<>]+http://" + targetUrl.Replace(".", @"\.") + @"[^<>]+nofollow[^<>]+>", RegexOptions.Singleline);

The string variable targetUrl contains the domain name of the link I want to look for.

regLink.IsMatch(html) will be true if a link is found
regLinkNofollowL.IsMatch(html) will be true if the link has a nofollow before the href
regLinkNofollowR.IsMatch(html) will be true if the link has a nofollow after the href

Hope this is of some use to someone. Thanks again for the reply.

--
Alan Silver
(anything added below this line is nothing to do with me)
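As a possible variation on the code above, the "before" and "after" cases can be folded into one pass: find each opening tag that contains the target URL, then test that tag's own text for a rel value containing nofollow. This is only a sketch. The names LinkChecker and Count are hypothetical, Regex.Escape is swapped in for the manual Replace so that any regex metacharacter in targetUrl is escaped, and only http:// links are considered, as in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkChecker
{
    // Looks for rel="...nofollow..." (single or double quotes) inside one tag.
    private static readonly Regex RelNoFollow = new Regex(
        @"\brel\s*=\s*[""'][^""']*\bnofollow\b",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: counts links to targetUrl (a bare domain such as
    // "www.fred.com") and how many of those links are marked nofollow.
    public static void Count(string html, string targetUrl,
                             out int links, out int noFollowLinks)
    {
        links = 0;
        noFollowLinks = 0;

        // [^<>]* keeps each match inside a single opening <a ...> tag.
        var linkTag = new Regex(
            @"<a\s[^<>]*" + Regex.Escape("http://" + targetUrl) + @"[^<>]*>",
            RegexOptions.IgnoreCase);

        foreach (Match tag in linkTag.Matches(html))
        {
            links++;
            // Attribute order does not matter: the whole tag text is tested.
            if (RelNoFollow.IsMatch(tag.Value)) noFollowLinks++;
        }
    }
}
```

Calling LinkChecker.Count(html, "www.fred.com", out links, out noFollowLinks) then answers both of the original questions: links > 0 means a link exists, and comparing the two counts shows whether those links are nofollow. Like the regexes above, this matches the URL as a prefix, so a longer host name that merely starts with the target domain would also be counted.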