From: Thomas Andersson on 30 Jul 2010 10:45

I have set myself a task to create a script that can collect data from web pages and insert it into a MySQL database. I'm a complete noob at this though, and not even sure what language I need (to learn), but I think Perl might be it. What I'm asking now is not for you to tell me how to do it, only whether it's feasible or if I'm barking up the wrong tree (pointers on where to find relevant information are welcome though).

First step would be to export a list of pids to be processed, each paired with the last sid processed for that pid. The script would read the list and set the first pid in the list as current.

Next step would be for it to add the current pid to a URL and load that page, which contains a list. From this page a list of sids needs to be collected until I hit the "last processed" one. These might be spread over several pages, so it needs to keep going either until it finds "last processed" or there are no further pages to load (a failure, I guess).

Next, each sid in the new list created in the previous step needs to be processed and data collected: some basic data is collected from each sid, and then two possible (but not always present) lists. The basic data collected for the sid contains two values to be set as variables; these decide how many data blocks need to be collected lower down on the page. Go to the first type of block, collect the data I want, and repeat as many times as the variable says; then go to the second type of block and repeat.

Store the data collected above in a text file named after the pid. It should contain 4 sections of data to be inserted into 4 databases:
First section: update the pid with the new "last processed" sid.
Second section: add sids with info to the DB.
Third section: add the data from type 1 blocks on the sid pages to the DB.
Fourth section: add the data from type 2 blocks on the sid pages to the DB.

Close the file, load the next pid from the list, and repeat the process until the pid list is empty.

I guess a bonus at the end would be if it could also insert all the data collected into the DB as well. Is this something Perl would be suitable for, or is there a better choice? My system is Win 7 64-bit btw, running MySQL 5.1.

TIA, Thomas
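The "collect sids until the last processed one" step described above can be sketched in Perl. This is only a sketch of the stopping logic: the pages are modelled as plain arrays of sids (newest first), and the function name is made up; a real script would be fetching and parsing HTML here instead.

```perl
use strict;
use warnings;

# Hypothetical sketch of the "collect sids until last processed" step.
# Each "page" is an array ref of sids, newest first.
sub collect_new_sids {
    my ($last_processed, @pages) = @_;
    my @new;
    for my $page (@pages) {
        for my $sid (@$page) {
            # Stop as soon as we reach a sid we have already handled.
            return \@new if $sid eq $last_processed;
            push @new, $sid;
        }
    }
    # Ran out of pages without finding "last processed" -- the failure
    # case the original poster mentions.
    return undef;
}
```

For example, with last processed sid `s3` and two pages `[s6, s5]` and `[s4, s3, s2]`, this returns the new sids `s6 s5 s4` and ignores everything from `s3` on.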
From: RedGrittyBrick on 30 Jul 2010 11:22

On 30/07/2010 15:45, Thomas Andersson wrote:
> I have set myself a task to create a script that can collect data from
> web pages and insert it into a MySQL database. I'm a complete noob at
> this though, and not even sure what language I need (to learn), but I
> think Perl might be it. What I'm asking now is not for you to tell me
> how to do it, only whether it's feasible or if I'm barking up the wrong
> tree (pointers on where to find relevant information are welcome though).

It is feasible using Perl.

Other languages have subroutine libraries. Perl has "modules" for handling specific tasks. Some modules are "core modules" that are included with a normal Perl installation. Other modules can be found in an online repository called CPAN. You can search it at http://search.cpan.org/

You will need a module for fetching web pages and for extracting data from the retrieved HTML. You will need a module for working with MySQL. Perl's Database Interface module is called DBI. See http://dbi.perl.org/

[rest of the task description snipped]

I confess I can't fully follow your description, but I didn't notice anything that would be difficult using Perl.

I suggest you start with a Perl script that just fetches a web page. If you have problems, try to reproduce the problem in the smallest possible Perl program and post that here with a short description of what you expected to happen and what actually happened (cut & paste messages rather than re-typing them).

There's a Posting FAQ posted regularly in this newsgroup; it is worth reading.

-- RGB
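The DBI prepare/execute pattern mentioned above can be sketched as follows. To keep the example self-contained it uses an in-memory SQLite database (assuming DBD::SQLite is installed) rather than a running MySQL server; the table and column names (`sids`, `pid`, `sid`) are made up for illustration. Against MySQL 5.1 only the connect string would change, to the `dbi:mysql:...` form described in the DBI documentation.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite stands in for MySQL so the sketch runs anywhere;
# swap the connect string for dbi:mysql:... against a real server.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do("CREATE TABLE sids (pid INTEGER, sid TEXT)");

# Placeholders (?) let DBI quote values safely -- never interpolate
# scraped data directly into an SQL string.
my $sth = $dbh->prepare("INSERT INTO sids (pid, sid) VALUES (?, ?)");
$sth->execute(42, $_) for qw(s4 s5 s6);

my ($count) = $dbh->selectrow_array("SELECT COUNT(*) FROM sids");
print "$count rows inserted\n";
```

The same prepared statement is reused for every row, which is both faster and safer than building a fresh SQL string per insert.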
From: Tad McClellan on 30 Jul 2010 11:23

Thomas Andersson <thomas(a)tifozi.net> wrote:
> I have set myself a task to create a script that can collect data from
> web pages and insert it into a MySQL database.
> Is this something Perl would be suitable for?

Sure! I have written dozens of such programs in Perl.

perldoc -q HTML
  How do I fetch an HTML file?
  How do I automate an HTML form submission?

The WWW::Mechanize module is invaluable for this type of thing: http://search.cpan.org/~petdance/WWW-Mechanize-1.64/

See also the Web Scraping Proxy: http://www2.research.att.com/sw/tools/wsp/

-- Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post. I don't recall having given anyone permission to use it on a Web site.
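A minimal WWW::Mechanize starting point for the kind of page-walking described in the first post might look like this. The URL and the "Next" link text are assumptions, not taken from the original poster's site, and this fetches a live page, so it needs network access and WWW::Mechanize installed.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical list URL with the current pid appended.
my $url  = 'http://example.com/list?pid=42';
my $mech = WWW::Mechanize->new( autocheck => 1 );   # die on HTTP errors

$mech->get($url);                 # fetch the first list page
print $mech->title, "\n";

# Walk every link on the page; a real scraper would filter these
# down to just the sid links it cares about.
for my $link ( $mech->find_all_links() ) {
    print $link->url_abs, "\n";
}

# Paging: keep following a "Next" link until there isn't one,
# e.g.  while ( $mech->follow_link( text => 'Next' ) ) { ... }
```

Mechanize keeps cookies and the current page as state, so following the "Next" link and re-scanning links is enough to implement the "keep going until last processed or no further pages" loop.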
From: RedGrittyBrick on 30 Jul 2010 11:26

On 30/07/2010 16:22, RedGrittyBrick wrote:
> On 30/07/2010 15:45, Thomas Andersson wrote:
>> I'm a complete noob at this though and not even sure what language
>> I need (to learn), but I think Perl might be it.

Oh yes, http://learn.perl.org/

-- RGB
From: bugbear on 30 Jul 2010 12:07
RedGrittyBrick wrote:
> On 30/07/2010 15:45, Thomas Andersson wrote:
>> I have set myself a task to create a script that can collect data from
>
> I confess I can't fully follow your description but I didn't notice
> anything that would be difficult using Perl.

I think he's re-inventing depth-first or breadth-first search.

BugBear