From: Steve on 19 Mar 2010 18:10

On Mar 19, 2:40 pm, Ben Morrow <b...(a)morrow.me.uk> wrote:
> Quoth Steve <st...(a)staticg.com>:
>
> > Based on what you all said, I can make a clearer description.
> > Essentially, I'm trying to search craigslist more efficiently. I want
>
> Are you sure craigslist's Terms of Use allow this? Most sites of this
> nature don't.
>
> > the link the <a> tag points to, as well as the description. Here is
> > code I used already that I made that gets me only the links:
> > -----------------------------
> >
> > #!/usr/bin/perl -w
> > use strict;
> > use LWP::Simple;
> > use HTML::LinkExtor;
> > use Data::Dumper;
> >
> > ###### VARIABLES ######
> > my $craigs = "http://seattle.craigslist.org";
> > my $source = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";
> > my $browser = 'google-chrome';
> >
> > ###### SEARCH #######
> >
> > my $page = get("$source");
> > my $parser = HTML::LinkExtor->new();
> >
> > $parser->parse($page);
> > my @links = $parser->links;
> > open LINKS, ">/home/me/Desktop/links.txt";
>
> Use 3-arg open.
> Use lexical filehandles.
> *Always* check the return value of open.
>
>     open my $LINKS, ">", "/home/me/Desktop/links.txt"
>         or die "can't write to 'links.txt': $!";
>
> You may wish to consider using the 'autodie' module from CPAN, which
> will do the 'or die' checks for you.
>
> > print LINKS Dumper \@links;
> >
> > open READLINKS, "</home/me/Desktop/links.txt";
> > open OUT, ">/home/me/Desktop/final.txt";
>
> As above.
>
> > while (<READLINKS>){
>
> Why are you writing the links out to a file only to read them in again?
> Just use the array you already have:
>
>     for (@links) {
>
> > if ( /html/ ){
> >     my $url = $_;
> >     for ($url){
> >         s/\'//g;
> >         s/^\s+//;
> >     }
> >
> >     print OUT "$craigs$url";
> >     }
> > }
> > open BROWSE, "</home/me/Desktop/final.txt";
>
> As above.
>
> Ben

I have no idea, but it's personal use. I don't see what's so bad about
it; if I was using my web browser I'd be doing the same thing.

Craigslist is just an example.

That's beside the point, though; I'm just doing it for fun/practice/
learning. Let's say we are using a different site then, perhaps one I'm
going to make - it makes no difference to me.

So is there any way I can do this, or...?
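Putting Ben's suggestions together - 3-arg open, a lexical filehandle, a
checked open, and looping over @links directly instead of round-tripping
through files - the script might end up looking something like the sketch
below. It is only a sketch built from the placeholder paths and query in
the post above, not tested against the live site; note also that
HTML::LinkExtor's links() returns [$tag, attr_name => attr_value, ...]
array refs rather than plain strings:

-----------------------------
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::LinkExtor;

my $craigs = "http://seattle.craigslist.org";
my $source = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";

my $page = get($source)
    or die "couldn't fetch '$source'";

my $parser = HTML::LinkExtor->new();
$parser->parse($page);

open my $out, ">", "/home/me/Desktop/final.txt"
    or die "can't write to 'final.txt': $!";

# each element of links() is [$tag, attr_name => attr_value, ...]
for my $link ($parser->links) {
    my ($tag, %attrs) = @$link;
    next unless $tag eq 'a' and defined $attrs{href};
    print {$out} "$craigs$attrs{href}\n" if $attrs{href} =~ /html/;
}

close $out or die "can't close 'final.txt': $!";
-----------------------------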
From: Ben Morrow on 19 Mar 2010 18:30

Quoth Steve <steve(a)staticg.com>:
>
> I have no idea, but it's personal use. I don't see what's so bad about
> it; if I was using my web browser I'd be doing the same thing.

That's not the point. If their TOS say 'no robots' then that means 'no
robots', not 'no robots unless it's for personal use and you can't see
why you shouldn't'. Apart from anything else, a lot of these sites make
money from ads, which you will completely bypass.

> Craigslist is just an example.
>
> That's beside the point, though; I'm just doing it for fun/practice/
> learning. Let's say we are using a different site then, perhaps one I'm
> going to make - it makes no difference to me.
>
> So is there any way I can do this, or...?

I've already suggested using XML::LibXML. Others have pointed you to an
example of using HTML::Parser. Pick one and try it.

Ben
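Neither post in this exchange shows what the XML::LibXML route actually
looks like. As a rough sketch - assuming a version of XML::LibXML recent
enough to have load_html(), and with the URL and the /Hello/ filter
standing in for whatever the real search would be:

-----------------------------
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use XML::LibXML;

my $page = get("http://www.example.com/")
    or die "couldn't fetch page";

# recover => 1 tells libxml2 to soldier on through the tag soup
# found in real-world HTML instead of dying on the first error
my $dom = XML::LibXML->load_html(
    string  => $page,
    recover => 1,
);

# XPath: every <a> element that actually has an href attribute
for my $a ($dom->findnodes('//a[@href]')) {
    my $href = $a->getAttribute('href');
    my $text = $a->textContent;
    print "$href\t$text\n" if $text =~ /Hello/;
}
-----------------------------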
From: Steve on 19 Mar 2010 18:39

On Mar 19, 3:30 pm, Ben Morrow <b...(a)morrow.me.uk> wrote:
> Quoth Steve <st...(a)staticg.com>:
>
> > I have no idea, but it's personal use. I don't see what's so bad about
> > it; if I was using my web browser I'd be doing the same thing.
>
> That's not the point. If their TOS say 'no robots' then that means 'no
> robots', not 'no robots unless it's for personal use and you can't see
> why you shouldn't'. Apart from anything else, a lot of these sites make
> money from ads, which you will completely bypass.
>
> > Craigslist is just an example.
>
> > That's beside the point, though; I'm just doing it for fun/practice/
> > learning. Let's say we are using a different site then, perhaps one I'm
> > going to make - it makes no difference to me.
>
> > So is there any way I can do this, or...?
>
> I've already suggested using XML::LibXML. Others have pointed you to an
> example of using HTML::Parser. Pick one and try it.
>
> Ben

I realize this; I'm not using craigslist. It was the first thing I could
think of for an example. This is for internal/personal use only, and I
don't like how you're labeling me as breaking a TOS over an _EXAMPLE_.
Notice how my home folder is changed to "me"? I'm putting as little
personal information here as possible, hence the craigslist example.
From: Tad McClellan on 19 Mar 2010 22:38

Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
> Steve wrote:
>
>> like lets say I searched a site
>> that had 15 news links and 3 of them said "Hello" in the title. I
>> would want to extract only the links that said hello in the title.
>
> Read up on perl regular expressions.

While reading up on regular expressions is certainly a good idea,
it is a horrid idea for the purposes of parsing HTML.

Have you read the FAQ answers that mention HTML?

    perldoc -q HTML

> for instance, taking the above, you might first split it into a
> "one-line per" array -
>
>     @stuff = split(/\n/, $content);
>
> then parse each line for hello -
>
>     foreach (@stuff) {
>         if ($_ =~ /Hello/) {
>             do whatever;
>         }
>     }

The code below prints "do whatever" 3 times, but there is only one link
containing "Hello"...

---------------------------
#!/usr/bin/perl
use warnings;
use strict;

# some perfectly valid HTML:
my $content = '
<html><body>
<p>Hello Kitty</p>
<a href = "hello.com"
>Hello</a
>
<!-- There is no Hello here -->
</body></html>
';

my @stuff = split /\n/, $content;

foreach (@stuff) {
    if (/Hello/) {
        print "do whatever\n";
    }
}
---------------------------

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"

The above message is a Usenet post. I don't recall having given anyone
permission to use it on a Web site.
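To close the loop on Tad's demonstration: run through an actual parser
instead of line-oriented regexes, the same document yields exactly the
one real link. A sketch using HTML::TokeParser, which ships in the same
HTML-Parser distribution pointed to earlier in the thread:

---------------------------
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# the same "perfectly valid HTML" as above
my $content = '
<html><body>
<p>Hello Kitty</p>
<a href = "hello.com"
>Hello</a
>
<!-- There is no Hello here -->
</body></html>
';

my $p = HTML::TokeParser->new(\$content);

# visit real <a> tags only; the <p> text and the comment never match
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href};
    my $text = $p->get_trimmed_text('/a');
    print "link: $href ($text)\n" if $text =~ /Hello/;
}

# prints exactly once: link: hello.com (Hello)
---------------------------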
From: Kyle T. Jones on 24 Mar 2010 14:54

Tad McClellan wrote:
> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>> Steve wrote:
>
>>> like lets say I searched a site
>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>> would want to extract only the links that said hello in the title.
>
>> Read up on perl regular expressions.
>
> While reading up on regular expressions is certainly a good idea,
> it is a horrid idea for the purposes of parsing HTML.

Ummm. Could you expand on that? My initial reaction would be something
like: I'm pretty sure *any* method, including the use of HTML::LinkExtor
or an XML transform (both outlined upthread), involves using regular
expressions "for the purposes of parsing HTML". At best, you're just
pushing the regex work down into the modules you include.

AFAIK - and feel free to correct me, I'll go take a look at some of the
relevant module code in a bit - every CPAN module involved with parsing
HTML uses fairly straightforward regex matching somewhere within that
module's methods.

I think there's an argument that, considering you can do this so easily
(in under 15 lines of code) without the overhead of unnecessary
includes, my way would be more efficient. We can run some benchmarks if
you want (see further down for working code).

> Have you read the FAQ answers that mention HTML?
>
>     perldoc -q HTML
>
>> for instance, taking the above, you might first split it into a
>> "one-line per" array -
>>
>>     @stuff = split(/\n/, $content);
>>
>> then parse each line for hello -
>>
>>     foreach (@stuff) {
>>         if ($_ =~ /Hello/) {
>>             do whatever;
>>         }
>>     }
>
> The code below prints "do whatever" 3 times, but there is only one link
> containing "Hello"...

I should have been clearer: the above wasn't a "solution" meant to be
copied, pasted, and put into use - it was just meant to illustrate the
basic operation.
I think this works fine:

-----------------------------
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $targeturl    = "http://www.google.com";
my $searchstring = "google";

my $contents   = get($targeturl);
my @semiparsed = split(/href/i, $contents);

foreach (@semiparsed) {
    if ($_ =~ /^\s*=\s*('|")(.*?)('|")/) {
        my $link = $2;
        if ($link =~ /$searchstring/i) {
            print "Link: $link\n";
        }
    }
}
-----------------------------

OUTPUT:

Link: http://images.google.com/imghp?hl=en&tab=wi
Link: http://video.google.com/?hl=en&tab=wv
Link: http://maps.google.com/maps?hl=en&tab=wl
Link: http://news.google.com/nwshp?hl=en&tab=wn
Link: http://www.google.com/prdhp?hl=en&tab=wf
Link: http://mail.google.com/mail/?hl=en&tab=wm
Link: http://www.google.com/intl/en/options/
Link: /url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
Link: https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/
Link: /aclk?sa=L&ai=CbpBLOFeqS_gX3ZmVB_SbuZINs_2WoQHf44OSEMHZnNkTEAEgwVRQpuf5xAJgPaoEhQFP0M0ypnTnQAI3b4WYFAHIvHiLv4iZWVehmiie-78BOdRJQOj6QayRkYYHH4cKXyaNmAp2rmQiiPSHxtEyaVD5OZo41Kxvy6SAeAAF6CIw-SQAFsLT-9iHRfJUcoYh4qlpGqGbC080ZVCWlUUipS404rornNJFmeGlP89sgXehqOfpe8uL&num=1&sig=AGiWqtw95aIEfk5F25oGM2i6eMwkBBuj6Q&q=http://www.google.com/doodle4google/

Or, if you're only interested in the http/https links, you can do this:

-----------------------------
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $targeturl    = "http://www.google.com";
my $searchstring = "google";

my $contents   = get($targeturl);
my @semiparsed = split(/href/i, $contents);

foreach (@semiparsed) {
    if ($_ =~ /^\s*=\s*('|")(http.*?)('|")/i) {
        my $link = $2;
        if ($link =~ /$searchstring/i) {
            print "Link: $link\n";
        }
    }
}
-----------------------------

OUTPUT:

Link: http://images.google.com/imghp?hl=en&tab=wi
Link: http://video.google.com/?hl=en&tab=wv
Link: http://maps.google.com/maps?hl=en&tab=wl
Link: http://news.google.com/nwshp?hl=en&tab=wn
Link: http://www.google.com/prdhp?hl=en&tab=wf
Link: http://mail.google.com/mail/?hl=en&tab=wm
Link: http://www.google.com/intl/en/options/
Link: https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/

Like I said, if you want to present a different method where you push
all the regex work off to an include like HTML::LinkExtor, please post
it, and I can run both using a benchmark module to determine which
method is more efficient. I could be way off here - maybe using one or
more of the modules mentioned in this thread somehow improves
efficiency. If so, please let me know.

By the way, I can think of wrenches to throw into this solution, too -
addressing the use of ' or " inside a link, for instance - but then, I
could throw "you probably won't ever see this but it's theoretically
possible" wrenches into most of the HTML-parsing CPAN modules, too,
so...

Cheers.
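Nobody in the thread ever posted the benchmark Kyle offers to run, but
with the core Benchmark module the comparison could be wired up along
these lines. This is just the harness, comparing his split-on-href regex
against an HTML::LinkExtor callback over the same fetched page; which
approach comes out faster (and whether it matters next to the network
fetch) is an empirical question that depends on the page being parsed:

-----------------------------
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::LinkExtor;
use LWP::Simple;

my $contents = get("http://www.google.com")
    or die "couldn't fetch page";

# run each sub for at least 2 CPU seconds and compare rates
cmpthese(-2, {
    regex_split => sub {
        my @links;
        foreach (split /href/i, $contents) {
            push @links, $2 if /^\s*=\s*('|")(.*?)('|")/;
        }
        return scalar @links;
    },
    linkextor => sub {
        my @links;
        my $p = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @links, $attr{href} if $tag eq 'a' and defined $attr{href};
        });
        $p->parse($contents);
        return scalar @links;
    },
});
-----------------------------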