From: Steve on
On Mar 19, 2:40 pm, Ben Morrow <b...(a)morrow.me.uk> wrote:
> Quoth Steve <st...(a)staticg.com>:
>
> > Based on what you all said, I can give a clearer description.
> > Essentially, I'm trying to search craigslist more efficiently.  I want
>
> Are you sure craigslist's Terms of Use allow this? Most sites of this
> nature don't.
>
> > the link the <a> tag points to, as well as the description. Here is
> > code I already made that gets me only the links:
> > -----------------------------
>
> > #!/usr/bin/perl -w
> > use strict;
> > use LWP::Simple;
> > use HTML::LinkExtor;
> > use Data::Dumper;
>
> > ###### VARIABLES ######
> > my $craigs = "http://seattle.craigslist.org";
> > my $source = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";
> > my $browser = 'google-chrome';
>
> > ###### SEARCH #######
>
> > my $page = get("$source");
> > my $parser = HTML::LinkExtor->new();
>
> > $parser->parse($page);
> > my @links = $parser->links;
> > open LINKS, ">/home/me/Desktop/links.txt";
>
> Use 3-arg open.
> Use lexical filehandles.
> *Always* check the return value of open.
>
>     open my $LINKS, ">", "/home/me/Desktop/links.txt"
>         or die "can't write to 'links.txt': $!";
>
> You may wish to consider using the 'autodie' module from CPAN, which
> will do the 'or die' checks for you.
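>
> For instance, a minimal (untested) sketch with autodie:
>
>     use autodie;   # open/close now die on failure by themselves
>
>     open my $LINKS, ">", "/home/me/Desktop/links.txt";
>     print {$LINKS} Dumper \@links;
>     close $LINKS;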
>
> > print LINKS Dumper \@links;
>
> > open READLINKS, "</home/me/Desktop/links.txt";
> > open OUT, ">/home/me/Desktop/final.txt";
>
> As above.
>
> > while (<READLINKS>){
>
> Why are you writing the links out to a file only to read them in again?
> Just use the array you already have:
>
>     for (@links) {
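>
> Note that links() returns array refs of the form [$tag, %attrs]
> rather than plain URLs, so you will want something more like this
> untested sketch ($OUT being a lexical filehandle opened as above):
>
>     for my $link (@links) {
>         my ($tag, %attr) = @$link;
>         next unless $tag eq "a" and defined $attr{href};
>         print $OUT "$craigs$attr{href}\n" if $attr{href} =~ /html/;
>     }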
>
> >    if ( /html/ ){
> >            my $url = $_;
> >            for ($url){
> >                    s/\'//g;
> >                    s/^\s+//;
> >            }
>
> >            print OUT "$craigs$url";
> >    }
> > }
> > open BROWSE, "</home/me/Desktop/final.txt";
>
> As above.
>
> Ben

I have no idea, but it's personal use. I don't see what's so bad about
it; if I was using my web browser I'd be doing the same thing.
Craigslist is just an example.

That's beside the point, though; I'm just doing it for fun/practice/
learning. Let's say we're using a different site then, perhaps one
I'm going to make; it makes no difference to me.

So any way I can do this or...?
From: Ben Morrow on

Quoth Steve <steve(a)staticg.com>:
>
> I have no idea, but it's personal use. I don't see what's so bad about
> it; if I was using my web browser I'd be doing the same thing.

That's not the point. If their TOS say 'no robots' then that means 'no
robots', not 'no robots unless it's for personal use and you can't see
why you shouldn't'. Apart from anything else, a lot of these sites make
money from ads, which you will completely bypass.

> Craigslist is just an example.
>
> That's beside the point, though; I'm just doing it for fun/practice/
> learning. Let's say we're using a different site then, perhaps one
> I'm going to make; it makes no difference to me.
>
> So any way I can do this or...?

I've already suggested using XML::LibXML. Others have pointed you to an
example of using HTML::Parser. Pick one and try it.
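
To give a concrete starting point, here is a minimal sketch of the
XML::LibXML route (untested; it assumes a reasonably recent
XML::LibXML whose load_html() wraps libxml2's tag-soup-tolerant HTML
parser, and example.com stands in for whatever site you own):

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use XML::LibXML;

    my $html = get("http://example.com/") or die "fetch failed";

    my $doc = XML::LibXML->load_html(
        string  => $html,
        recover => 1,     # don't die on sloppy real-world markup
    );

    # every <a> with an href whose link text matches
    for my $a ($doc->findnodes('//a[@href]')) {
        my $text = $a->textContent;
        next unless $text =~ /Hello/;
        printf "%s => %s\n", $text, $a->getAttribute('href');
    }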

Ben

From: Steve on
On Mar 19, 3:30 pm, Ben Morrow <b...(a)morrow.me.uk> wrote:
> Quoth Steve <st...(a)staticg.com>:
>
> > I have no idea, but it's personal use.  I don't see what's so bad about
> > it; if I was using my web browser I'd be doing the same thing.
>
> That's not the point. If their TOS say 'no robots' then that means 'no
> robots', not 'no robots unless it's for personal use and you can't see
> why you shouldn't'. Apart from anything else, a lot of these sites make
> money from ads, which you will completely bypass.
>
> > Craigslist is just an example.
>
> > That's beside the point, though; I'm just doing it for fun/practice/
> > learning.  Let's say we're using a different site then, perhaps one
> > I'm going to make; it makes no difference to me.
>
> > So any way I can do this or...?
>
> I've already suggested using XML::LibXML. Others have pointed you to an
> example of using HTML::Parser. Pick one and try it.
>
> Ben

I realize this; I'm not using craigslist. It was the first thing I
could think of for an example. This is for internal/personal use
only, and I don't like how you're labeling me as breaking a TOS over
an _EXAMPLE_. Notice how my home folder is changed to "me"? I'm
putting as little personal information here as possible, hence the
craigslist example.
From: Tad McClellan on
Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
> Steve wrote:

>> like lets say I searched a site
>> that had 15 news links and 3 of them said "Hello" in the title. I
>> would want to extract only the links that said hello in the title.
>
> Read up on perl regular expressions.


While reading up on regular expressions is certainly a good idea,
it is a horrid idea for the purposes of parsing HTML.

Have you read the FAQ answers that mention HTML?

perldoc -q HTML


> for instance, taking the above, you might first split it into a
> "one-line per" array -
>
> @stuff=split(/\n/, $content);
>
> then parse each line for hello -
>
> foreach (@stuff) {
>     if ($_ =~ /Hello/) {
>         do whatever;
>     }
> }


The code below prints "do whatever" 3 times, but there is only one link
containing "Hello"...


---------------------------
#!/usr/bin/perl
use warnings;
use strict;

# some perfectly valid HTML:
my $content = '
<html><body>
<p>Hello
Kitty</p>
<a
href
=
"hello.com"
>Hello</a
>
<!--
There is no Hello here
-->
</body></html>
';

my @stuff = split /\n/, $content;
foreach (@stuff) {
    if (/Hello/) {
        print "do whatever\n";
    }
}
---------------------------
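
Feed the same $content to a real parser and you get exactly one link.
A minimal sketch (untested) using HTML::TokeParser, which ships with
the HTML::Parser distribution:

---------------------------
use HTML::TokeParser;

# parse from the in-memory string; only <a> start tags are returned
my $p = HTML::TokeParser->new(\$content);
while (my $tag = $p->get_tag('a')) {
    my $text = $p->get_trimmed_text('/a');
    print "do whatever: $tag->[1]{href}\n" if $text =~ /Hello/;
}
---------------------------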


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.
From: Kyle T. Jones on
Tad McClellan wrote:
> Kyle T. Jones <KBfoMe(a)realdomain.net> wrote:
>> Steve wrote:
>
>>> like lets say I searched a site
>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>> would want to extract only the links that said hello in the title.
>> Read up on perl regular expressions.
>
>
> While reading up on regular expressions is certainly a good idea,
> it is a horrid idea for the purposes of parsing HTML.
>

Ummm. Could you expand on that?

My initial reaction would be something like - I'm pretty sure *any*
method, including the use of HTML::LinkExtor, or an XML transform (both
outlined upthread), involves using regular expressions "for the purposes
of parsing HTML".

At best, you're just abstracting the regex work back to the includes.
AFAIK, and feel free to correct me (I'll go take a look at some of the
relevant module code in a bit), every CPAN module that is involved with
parsing HTML uses fairly straightforward regex matching somewhere within
that module's methods.

I think there's an argument that, considering you can do this so easily
(in under 15 lines of code) without the overhead of unnecessary
includes, my way would be more efficient. We can run some benchmarks if
you want (see further down for working code).

> Have you read the FAQ answers that mention HTML?
>
> perldoc -q HTML
>
>
>> for instance, taking the above, you might first split it into a
>> "one-line per" array -
>>
>> @stuff=split(/\n/, $content);
>>
>> then parse each line for hello -
>>
>> foreach (@stuff) {
>>     if ($_ =~ /Hello/) {
>>         do whatever;
>>     }
>> }
>
>
> The code below prints "do whatever" 3 times, but there is only one link
> containing "Hello"...
>

I should have been clearer - the above wasn't a "solution", meant to be
copied, pasted, and put into use - it was just meant to illustrate the
basic operation.

I think this works fine:

#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::Simple;

my $targeturl = "http://www.google.com";
my $searchstring = "google";
my $contents = get($targeturl);
my @semiparsed = split(/href/i, $contents);

foreach (@semiparsed) {
    if ($_ =~ /^\s*=\s*('|")(.*?)('|")/) {
        my $link = $2;
        if ($link =~ /$searchstring/i) {
            print "Link: $link\n";
        }
    }
}

OUTPUT:

Link: http://images.google.com/imghp?hl=en&tab=wi
Link: http://video.google.com/?hl=en&tab=wv
Link: http://maps.google.com/maps?hl=en&tab=wl
Link: http://news.google.com/nwshp?hl=en&tab=wn
Link: http://www.google.com/prdhp?hl=en&tab=wf
Link: http://mail.google.com/mail/?hl=en&tab=wm
Link: http://www.google.com/intl/en/options/
Link:
/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
Link:
https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/
Link:
/aclk?sa=L&ai=CbpBLOFeqS_gX3ZmVB_SbuZINs_2WoQHf44OSEMHZnNkTEAEgwVRQpuf5xAJgPaoEhQFP0M0ypnTnQAI3b4WYFAHIvHiLv4iZWVehmiie-78BOdRJQOj6QayRkYYHH4cKXyaNmAp2rmQiiPSHxtEyaVD5OZo41Kxvy6SAeAAF6CIw-SQAFsLT-9iHRfJUcoYh4qlpGqGbC080ZVCWlUUipS404rornNJFmeGlP89sgXehqOfpe8uL&num=1&sig=AGiWqtw95aIEfk5F25oGM2i6eMwkBBuj6Q&q=http://www.google.com/doodle4google/




Or, if you're only interested in the http/https links, you can do this:

#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::Simple;

my $targeturl = "http://www.google.com";
my $searchstring = "google";
my $contents = get($targeturl);
my @semiparsed = split(/href/i, $contents);

foreach (@semiparsed) {
    if ($_ =~ /^\s*=\s*('|")(http.*?)('|")/i) {
        my $link = $2;
        if ($link =~ /$searchstring/i) {
            print "Link: $link\n";
        }
    }
}

OUTPUT:

Link: http://images.google.com/imghp?hl=en&tab=wi
Link: http://video.google.com/?hl=en&tab=wv
Link: http://maps.google.com/maps?hl=en&tab=wl
Link: http://news.google.com/nwshp?hl=en&tab=wn
Link: http://www.google.com/prdhp?hl=en&tab=wf
Link: http://mail.google.com/mail/?hl=en&tab=wm
Link: http://www.google.com/intl/en/options/
Link:
https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/



Like I said, if you want to present a different method where you push
all the regex work off to an include like HTML::LinkExtor, please post
it, and I can run both using a benchmark module to determine which
method is more efficient. I could be way off here - maybe using one or
more of the modules mentioned in this thread somehow improves
efficiency. If so, please let me know.
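
For what it's worth, a rough (untested) harness for that comparison
might look like this, using the core Benchmark module:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::LinkExtor;
use LWP::Simple;

my $contents = get("http://www.google.com") or die "fetch failed";

# negative count = run each sub for at least 5 CPU seconds
cmpthese(-5, {
    regex => sub {
        my @links;
        foreach (split /href/i, $contents) {
            push @links, $2 if /^\s*=\s*('|")(.*?)('|")/;
        }
    },
    linkextor => sub {
        my @links;
        my $p = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @links, $attr{href} if defined $attr{href};
        });
        $p->parse($contents);
        $p->eof;
    },
});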

By the way - I can think of wrenches to throw into this solution, too -
addressing the use of ' or " inside a link, for instance - but, then, I
could throw "you prolly won't ever see this but it's theoretically
possible" wrenches into most of the HTML parsing CPAN modules, too, so...

Cheers.