From: Thomas Andersson on
I have set myself the task of creating a script that can collect data from
web pages and insert it into a MySQL database. I'm a complete noob at this,
though, and not even sure which language I need (to learn), but I think Perl
might be it. What I'm asking now is not for you to tell me how to do it, only
whether it's feasible or if I'm barking up the wrong tree (pointers on where
to find relevant information are welcome though).

First step would be to export a list of pids to be processed, each paired
with the last sid processed for that pid.
The script would read the list and set the first pid in the list as current.
Next step would be for it to append the current pid to a URL and load that
page, which contains a list.
From this page a list of sids needs to be collected until I hit the "last
processed" one. These might be spread over several pages, so it needs to
keep going either until it finds "last processed" or there are no further
pages to load (a fail, I guess).

Next comes the new sid list created in the previous step; each sid needs to
be processed and data collected.
Some basic data is collected from each sid, and then two possible (but not
always present) lists.
The basic data collected for the sid contains two values to be set as
variables; these decide how many data blocks need to be collected lower
down on the page.
Go to the first type of block, collect the data I want, and repeat as many
times as the first variable says.
Go to the second type of block and repeat.

Store the data collected in the previous steps in a text file named after
the pid; it should contain four sections of data to be inserted into four
database tables:
First section updates the pid with the new "last processed" sid.
Second section adds the sids with their info to the DB.
Third section adds the data from the type 1 blocks on the sid pages to the DB.
Fourth section adds the data from the type 2 blocks on the sid pages to the DB.

Close the file, load the next pid from the list, and repeat the process
until the pid list is empty.

I guess a bonus at the end would be if it could also insert all the
collected data into the DB as well.
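
To make that easier to follow, here's the rough outline I have in mind,
written as Perl-ish pseudocode (I don't know Perl yet, so every subroutine
name here is just a placeholder for one of the steps above):

  # Rough outline only -- the sub names are placeholders, not real code.
  for my $pid (read_pid_list()) {              # pid + its last processed sid
      my @new_sids = collect_new_sids($pid);   # walk pages until "last processed"
      open my $out, '>', "$pid.txt" or die "Can't write $pid.txt: $!";
      for my $sid (@new_sids) {
          my ($count1, $count2, @basic) = collect_basic_data($sid);
          my @type1 = collect_blocks($sid, 1, $count1);  # repeat $count1 times
          my @type2 = collect_blocks($sid, 2, $count2);  # repeat $count2 times
          write_sections($out, $sid, \@basic, \@type1, \@type2);
      }
      close $out;
  }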

Is this something Perl would be suitable for, or is there a better choice?
My system is Win 7 64-bit btw, running MySQL 5.1.

TIA
Thomas


From: RedGrittyBrick on
On 30/07/2010 15:45, Thomas Andersson wrote:
> I have set myself the task of creating a script that can collect data from
> web pages and insert it into a MySQL database. I'm a complete noob at this,
> though, and not even sure which language I need (to learn), but I think
> Perl might be it. What I'm asking now is not for you to tell me how to do
> it, only whether it's feasible or if I'm barking up the wrong tree (pointers
> on where to find relevant information are welcome though).

It is feasible using Perl.

Other languages have subroutine libraries. Perl has "modules" for
handling specific tasks. Some modules are "core modules" that are
included with a normal Perl installation. Other modules can be found in
an online repository called CPAN. You can search it at
http://search.cpan.org/

You will need a module for fetching web pages and for extracting data
from the retrieved HTML.
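
For example, fetching a page with LWP::UserAgent and listing its links
with HTML::TreeBuilder looks roughly like this (an untested sketch; the
URL is a placeholder):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use LWP::UserAgent;
  use HTML::TreeBuilder;

  my $ua  = LWP::UserAgent->new;
  my $res = $ua->get('http://www.example.com/list?pid=42');  # placeholder
  die 'Fetch failed: ', $res->status_line, "\n" unless $res->is_success;

  my $tree = HTML::TreeBuilder->new_from_content($res->decoded_content);
  for my $link ($tree->look_down(_tag => 'a')) {   # every link on the page
      print $link->attr('href') || '', "\t", $link->as_text, "\n";
  }
  $tree->delete;    # free the parse tree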

You will need a module for working with MySQL. Perl's Database Interface
module is called DBI. See http://dbi.perl.org/
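
You will also need the MySQL driver module, DBD::mysql. A minimal insert
then looks something like this (the database name, table and columns are
invented for the example):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;

  # Database name, table and columns are made up for this example.
  my $dbh = DBI->connect('dbi:mysql:database=scrape;host=localhost',
                         'user', 'password', { RaiseError => 1 });

  my $sth = $dbh->prepare('INSERT INTO sids (pid, sid, info) VALUES (?, ?, ?)');
  $sth->execute(42, 1001, 'collected data');   # placeholders handle quoting

  $dbh->disconnect;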

>
> First step would be to export a list of pids to be processed, each paired
> with the last sid processed for that pid.
> The script would read the list and set the first pid in the list as current.
> Next step would be for it to append the current pid to a URL and load that
> page, which contains a list.
> From this page a list of sids needs to be collected until I hit the "last
> processed" one. These might be spread over several pages, so it needs to
> keep going either until it finds "last processed" or there are no further
> pages to load (a fail, I guess).
>
> Next comes the new sid list created in the previous step; each sid needs to
> be processed and data collected.
> Some basic data is collected from each sid, and then two possible (but not
> always present) lists.
> The basic data collected for the sid contains two values to be set as
> variables; these decide how many data blocks need to be collected lower
> down on the page.
> Go to the first type of block, collect the data I want, and repeat as many
> times as the first variable says.
> Go to the second type of block and repeat.
>
> Store the data collected in the previous steps in a text file named after
> the pid; it should contain four sections of data to be inserted into four
> database tables:
> First section updates the pid with the new "last processed" sid.
> Second section adds the sids with their info to the DB.
> Third section adds the data from the type 1 blocks on the sid pages to the DB.
> Fourth section adds the data from the type 2 blocks on the sid pages to the DB.
>
> Close the file, load the next pid from the list, and repeat the process
> until the pid list is empty.
>
> I guess a bonus at the end would be if it could also insert all the
> collected data into the DB as well.
>
> Is this something Perl would be suitable for, or is there a better choice?
> My system is Win 7 64-bit btw, running MySQL 5.1.
>

I confess I can't fully follow your description, but I didn't notice
anything that would be difficult using Perl.

I suggest you start with a Perl script that just fetches a web page. If
you have problems, try to reproduce the problem in the smallest possible
Perl program and post that here with a short description of what you
expected to happen and what actually happened (cut & paste messages
rather than re-typing them).
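
Something as small as this is a good first test (LWP::Simple; the URL is
a placeholder):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use LWP::Simple;

  # Fetch one page and print it, so you can see what your script receives.
  my $html = get('http://www.example.com/');   # placeholder URL
  die "Couldn't fetch the page\n" unless defined $html;
  print $html;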

There's a Posting FAQ posted regularly in this newsgroup; it is worth
reading.

--
RGB
From: Tad McClellan on
Thomas Andersson <thomas(a)tifozi.net> wrote:
> I have set myself the task of creating a script that can collect data from
> web pages and insert it into a MySQL database.

> Is this something perl would be suitable for


Sure! I have written dozens of such programs in Perl.

perldoc -q HTML

How do I fetch an HTML file?

How do I automate an HTML form submission?

The WWW::Mechanize module is invaluable for this type of thing:

http://search.cpan.org/~petdance/WWW-Mechanize-1.64/
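
A rough sketch of walking a paginated list with WWW::Mechanize (the URL
and the "next" link text are assumptions about your site):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use WWW::Mechanize;

  my $mech = WWW::Mechanize->new(autocheck => 1);   # die on failed requests
  $mech->get('http://www.example.com/list?pid=42'); # placeholder URL

  while (1) {
      print $mech->content;   # extract the sids from this page here
      my $next = $mech->find_link(text_regex => qr/next/i);
      last unless $next;      # no further pages, so stop
      $mech->get($next->url_abs);
  }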

See also the Web Scraping Proxy:

http://www2.research.att.com/sw/tools/wsp/


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.
From: RedGrittyBrick on
On 30/07/2010 16:22, RedGrittyBrick wrote:
> On 30/07/2010 15:45, Thomas Andersson wrote:
>> I'm a complete noob at this, though, and not even sure which language
>> I need (to learn), but I think Perl might be it.

Oh yes, http://learn.perl.org/

--
RGB
From: bugbear on
RedGrittyBrick wrote:
> On 30/07/2010 15:45, Thomas Andersson wrote:
>> I have set myself the task of creating a script that can collect data from

>
> I confess I can't fully follow your description but I didn't notice
> anything that would be difficult using Perl.

I think he's re-inventing depth-first or breadth-first
search.

BugBear