From: Chris Rebert on 15 Oct 2009 04:00

On Thu, Oct 15, 2009 at 12:39 AM, Raji Seetharaman <sraji.me(a)gmail.com> wrote:
> Hi all,
>
> I'm learning web scraping with Python from the following link:
> http://www.packtpub.com/article/web-scraping-with-python
>
> To work with it, mechanize needs to be installed.
> I installed mechanize using
>
> sudo apt-get install python-mechanize
>
> As given in the tutorial, I tried the code below:
>
> import mechanize
> BASE_URL = "http://www.packtpub.com/article-network"
> br = mechanize.Browser()
> data = br.open(BASE_URL).get_data()
>
> and received the following error:
>
>  File "webscrap.py", line 4, in <module>
>    data = br.open(BASE_URL).get_data()
>  File "/usr/lib/python2.6/dist-packages/mechanize/_mechanize.py", line 209, in open
>    return self._mech_open(url, data, timeout=timeout)
>  File "/usr/lib/python2.6/dist-packages/mechanize/_mechanize.py", line 261, in _mech_open
>    raise response
> mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Apparently that website's tutorial and its robots.txt are not in sync.

robots.txt is part of the Robot Exclusion Standard
(http://en.wikipedia.org/wiki/Robots_exclusion_standard) and is the
standard way websites specify which webpages should and should not be
accessed programmatically. In this case, that site's robots.txt forbids
access to the webpage in question by autonomous programs.

There's probably a way to tell mechanize to ignore robots.txt though,
given that the standard is not enforced server-side; programs just
follow it voluntarily.

Cheers,
Chris
--
http://blog.rebertia.com
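[Editor's note: a minimal sketch of the robots.txt workaround Chris alludes to, using mechanize's Browser.set_handle_robots() switch; this is not part of the original tutorial code.]

    import mechanize

    BASE_URL = "http://www.packtpub.com/article-network"

    br = mechanize.Browser()
    # Stop mechanize from fetching and honoring robots.txt before each request,
    # so the 403 "request disallowed by robots.txt" error is not raised client-side.
    br.set_handle_robots(False)

    data = br.open(BASE_URL).get_data()
    print len(data)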