From: Kyle T. Jones on 19 Mar 2010 14:58

Steve wrote:
> On Mar 19, 11:01 am, Jürgen Exner <jurge...(a)hotmail.com> wrote:
>> Steve <st...(a)staticg.com> wrote:
>>> I started a little project where I need to search web pages for their
>>> text and return the links of those pages to me. I am using
>>> LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
>>> done so far is a list of URL's from my search query of a website, but
>>> I want to be able to filter this content based on the pages contents.
>>> How can I do this? How can I get the content of a web page, and not
>>> just the URL?
>>
>> ???
>>
>> I don't understand.
>>
>>     use LWP::Simple;
>>     $content = get("http://www.whateverURL");
>>
>> will get you exactly the content of that web page and assign it to
>> $content and apparently you are doing that already.
>>
>> So what is your problem?
>>
>> jue
>
> Sorry I am a little overwhelmed with the coding so far (I'm not very
> good at perl). I have what you have posted, but my problem is that I
> would like to filter that content... like let's say I searched a site
> that had 15 news links and 3 of them said "Hello" in the title. I
> would want to extract only the links that said hello in the title.

Read up on Perl regular expressions. For instance, taking the above, you
might first split the content into a "one line per element" array -

    @stuff = split(/\n/, $content);

then check each line for hello -

    foreach (@stuff) {
        if ($_ =~ /Hello/) {
            # do whatever with the matching line
        }
    }

Cheers.
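A minimal, untested sketch of that line-by-line approach, assuming $content
was fetched with LWP::Simple as in jue's snippet and that "do whatever"
just means printing the matching lines:

    use strict;
    use warnings;
    use LWP::Simple;

    my $content = get("http://www.whateverURL")
        or die "couldn't fetch the page";

    my @stuff = split /\n/, $content;          # one line of HTML per element

    foreach my $line (@stuff) {
        print "$line\n" if $line =~ /Hello/;   # keep only lines mentioning Hello
    }

Note this matches "Hello" anywhere in the HTML line, not just in the link
text, which is part of why the replies below suggest a real HTML parser.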
From: Ben Morrow on 19 Mar 2010 14:53

Quoth Steve <steve(a)staticg.com>:
> On Mar 19, 11:01 am, Jürgen Exner <jurge...(a)hotmail.com> wrote:
> > Steve <st...(a)staticg.com> wrote:
> > > I started a little project where I need to search web pages for their
> > > text and return the links of those pages to me. I am using
> > > LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
> > > done so far is a list of URL's from my search query of a website, but
> > > I want to be able to filter this content based on the pages contents.
> > > How can I do this? How can I get the content of a web page, and not
> > > just the URL?
> >
> >     use LWP::Simple;
> >     $content = get("http://www.whateverURL");
> >
> > will get you exactly the content of that web page and assign it to
> > $content and apparently you are doing that already.
>
> Sorry I am a little overwhelmed with the coding so far (I'm not very
> good at perl). I have what you have posted, but my problem is that I
> would like to filter that content... like let's say I searched a site
> that had 15 news links and 3 of them said "Hello" in the title. I
> would want to extract only the links that said hello in the title.

Ah, you don't want the content *pointed to* by the link, you want the
content of the <a> element itself. I don't think you can use
HTML::LinkExtor for that.

I would start by building a DOM for the page, and then going through and
finding the <a> elements and checking their content. XML::LibXML (despite
the name) has a decent HTML parser, though you will probably want to set
the 'recover' option if you are parsing random HTML from the Web. You can
then use DOM methods like ->getElementsByTagName to find the <a> elements
and ->textContent to find their contents (ignoring further tags within
the <a> element).

Ben
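A minimal sketch of that DOM approach, assuming a reasonably recent
XML::LibXML, that $page already holds the HTML fetched with LWP::Simple,
and that "Hello" stands in for whatever word you are really after:

    use strict;
    use warnings;
    use XML::LibXML;

    # Parse tag-soup HTML; 'recover' keeps libxml from dying on bad markup.
    my $doc = XML::LibXML->load_html(
        string  => $page,
        recover => 1,
    );

    for my $a ($doc->getElementsByTagName('a')) {
        my $text = $a->textContent;            # text inside <a>...</a>, nested tags stripped
        my $href = $a->getAttribute('href');   # the link target
        next unless defined $href;
        print "$href\n" if $text =~ /Hello/i;  # keep links whose text mentions Hello
    }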
From: Steve on 19 Mar 2010 17:10

On Mar 19, 11:42 am, "J. Gleixner" <glex_no-s...(a)qwest-spam-no.invalid> wrote:
> Steve wrote:
> > On Mar 19, 11:01 am, Jürgen Exner <jurge...(a)hotmail.com> wrote:
> >> Steve <st...(a)staticg.com> wrote:
> >>> I started a little project where I need to search web pages for their
> >>> text and return the links of those pages to me. I am using
> >>> LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
> >>> done so far is a list of URL's from my search query of a website, but
> >>> I want to be able to filter this content based on the pages contents.
> >>> How can I do this? How can I get the content of a web page, and not
> >>> just the URL?
> >> ???
> >
> >> I don't understand.
> >
> >>     use LWP::Simple;
> >>     $content = get("http://www.whateverURL");
> >
> >> will get you exactly the content of that web page and assign it to
> >> $content and apparently you are doing that already.
> >
> >> So what is your problem?
> >
> >> jue
> >
> > Sorry I am a little overwhelmed with the coding so far (I'm not very
> > good at perl). I have what you have posted, but my problem is that I
> > would like to filter that content... like let's say I searched a site
> > that had 15 news links and 3 of them said "Hello" in the title. I
> > would want to extract only the links that said hello in the title.
>
> '"Hello" in the title'??.. The title element of the HTML????
> Or the 'a' element contains 'Hello'?? e.g. <a href="...">Hello Kitty</a>
>
> How are you using HTML::LinkExtor??
>
> That seems like the right choice.
>
> Why are you using Data::Dumper?
>
> That's helpful when debugging, or logging, so how are you using it?
>
> Post your very short example, because there's something you're
> missing and no one can tell what that is based on your description.

Based on what you all said, I can give a clearer description.
Essentially, I'm trying to search craigslist more efficiently. I want
the link the <a> tag points to, as well as the description. Here is
the code I have so far; it only gets me the links:
-----------------------------

#!/usr/bin/perl -w
use strict;

use LWP::Simple;
use HTML::LinkExtor;
use Data::Dumper;

###### VARIABLES ######
my $craigs  = "http://seattle.craigslist.org";
my $source  = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";
my $browser = 'google-chrome';

###### SEARCH #######

my $page   = get("$source");
my $parser = HTML::LinkExtor->new();

$parser->parse($page);
my @links = $parser->links;

open LINKS, ">/home/me/Desktop/links.txt";
print LINKS Dumper \@links;

open READLINKS, "</home/me/Desktop/links.txt";
open OUT, ">/home/me/Desktop/final.txt";

while (<READLINKS>){
    if ( /html/ ){
        my $url = $_;
        for ($url){
            s/\'//g;
            s/^\s+//;
        }
        print OUT "$craigs$url";
    }
}

open BROWSE, "</home/me/Desktop/final.txt";
system ($browser);
foreach (<BROWSE>){
    system ($browser, $_);
}

-----------------------------

I've since created a different script that's a little more cleaned up.
From: J. Gleixner on 19 Mar 2010 17:10

J. Gleixner wrote:
> Steve wrote:
>> On Mar 19, 11:01 am, Jürgen Exner <jurge...(a)hotmail.com> wrote:
>>> Steve <st...(a)staticg.com> wrote:
>>>> I started a little project where I need to search web pages for their
>>>> text and return the links of those pages to me. I am using
>>>> LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
>>>> done so far is a list of URL's from my search query of a website, but
>>>> I want to be able to filter this content based on the pages contents.
>>>> How can I do this? How can I get the content of a web page, and not
>>>> just the URL?
>>>
>>> ???
>>>
>>> I don't understand.
>>>
>>>     use LWP::Simple;
>>>     $content = get("http://www.whateverURL");
>>>
>>> will get you exactly the content of that web page and assign it to
>>> $content and apparently you are doing that already.
>>>
>>> So what is your problem?
>>>
>>> jue
>>
>> Sorry I am a little overwhelmed with the coding so far (I'm not very
>> good at perl). I have what you have posted, but my problem is that I
>> would like to filter that content... like let's say I searched a site
>> that had 15 news links and 3 of them said "Hello" in the title. I
>> would want to extract only the links that said hello in the title.
>
> '"Hello" in the title'??.. The title element of the HTML????
> Or the 'a' element contains 'Hello'?? e.g. <a href="...">Hello Kitty</a>
>
> How are you using HTML::LinkExtor??
>
> That seems like the right choice.

After looking at it further, HTML::LinkExtor only gives the attributes,
not the text that makes up the hyperlink. Seems like that would be a
useful enhancement.

This might help you:

http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.64/eg/hanchors
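A rough sketch along the lines of that hanchors example, using plain
HTML::Parser handlers to pair each href with the visible link text. It
assumes $page holds the HTML fetched with LWP::Simple, and the /Hello/
filter at the end is only illustrative:

    use strict;
    use warnings;
    use HTML::Parser;

    my @links;              # will hold [href, link text] pairs
    my ($href, $text);

    my $p = HTML::Parser->new(
        start_h => [ sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and defined $attr->{href}) {
                $href = $attr->{href};      # remember the target
                $text = '';                 # start collecting its text
            }
        }, 'tagname, attr' ],
        text_h  => [ sub { $text .= shift if defined $href }, 'dtext' ],
        end_h   => [ sub {
            if (shift eq 'a' and defined $href) {
                push @links, [ $href, $text ];
                undef $href;
            }
        }, 'tagname' ],
    );

    $p->parse($page);
    $p->eof;

    # Keep only links whose visible text matches the search word.
    print "$_->[0]\n" for grep { $_->[1] =~ /Hello/i } @links;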
From: Ben Morrow on 19 Mar 2010 17:40

Quoth Steve <steve(a)staticg.com>:
>
> Based on what you all said, I can give a clearer description.
> Essentially, I'm trying to search craigslist more efficiently. I want

Are you sure craigslist's Terms of Use allow this? Most sites of this
nature don't.

> the link the <a> tag points to, as well as the description. Here is
> the code I have so far; it only gets me the links:
> -----------------------------
>
> #!/usr/bin/perl -w
> use strict;
>
> use LWP::Simple;
> use HTML::LinkExtor;
> use Data::Dumper;
>
> ###### VARIABLES ######
> my $craigs  = "http://seattle.craigslist.org";
> my $source  = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";
> my $browser = 'google-chrome';
>
> ###### SEARCH #######
>
> my $page   = get("$source");
> my $parser = HTML::LinkExtor->new();
>
> $parser->parse($page);
> my @links = $parser->links;
>
> open LINKS, ">/home/me/Desktop/links.txt";

Use 3-arg open. Use lexical filehandles. *Always* check the return value
of open.

    open my $LINKS, ">", "/home/me/Desktop/links.txt"
        or die "can't write to 'links.txt': $!";

You may wish to consider using the 'autodie' module from CPAN, which will
do the 'or die' checks for you.

> print LINKS Dumper \@links;
>
> open READLINKS, "</home/me/Desktop/links.txt";
> open OUT, ">/home/me/Desktop/final.txt";

As above.

> while (<READLINKS>){

Why are you writing the links out to a file only to read them in again?
Just use the array you already have:

    for (@links) {

>     if ( /html/ ){
>         my $url = $_;
>         for ($url){
>             s/\'//g;
>             s/^\s+//;
>         }
>         print OUT "$craigs$url";
>     }
> }
>
> open BROWSE, "</home/me/Desktop/final.txt";

As above.

Ben
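Putting those suggestions together, a cleaned-up version of the filtering
step might look like this. It assumes the $craigs variable and the @links
array from the script above (HTML::LinkExtor returns each link as an array
reference of the form [tagname, attr => value, ...]), and the /html/ match
is just carried over from the original filter:

    use autodie;     # open/close and friends now die on failure automatically

    open my $out, '>', '/home/me/Desktop/final.txt';

    for my $link (@links) {
        my ($tag, %attr) = @$link;               # unpack [tag, attr => value, ...]
        next unless $tag eq 'a' and defined $attr{href};
        next unless $attr{href} =~ /html/;       # same filter the original applied
        print {$out} "$craigs$attr{href}\n";     # relative href -> absolute URL
    }

    close $out;

No temporary links.txt, no Data::Dumper round trip, and any I/O error is
reported instead of silently ignored.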