From: MRAB on 30 Dec 2009 20:08

Brian D wrote:
> Thanks MRAB as well. I've printed all of the replies to retain with my
> pile of essential documentation.
>
> To follow up with a complete response, I'm ripping out of my mechanize
> module the essential components of the solution I got to work.
>
> The main body of the code passes a URL to the scrape_records function.
> The function attempts to open the URL five times.
>
> If the URL is opened, a values dictionary is populated and returned to
> the calling statement. If the URL cannot be opened, a fatal error is
> printed and the module terminates. There's a little sleep call in the
> function to leave time for any errant connection problem to resolve
> itself.
>
> Thanks to all for your replies. I hope this helps someone else:
>
> import urllib2, time
> from mechanize import Browser
>
> def scrape_records(url):
>     maxattempts = 5
>     br = Browser()
>     user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.16) Gecko/2009120208 Firefox/3.0.16 (.NET CLR 3.5.30729)'
>     br.addheaders = [('User-agent', user_agent)]
>     for count in xrange(maxattempts):
>         try:
>             print url, count
>             br.open(url)
>             break
>         except urllib2.URLError:
>             print 'URL error', count
>             # Pretend a failed connection was fixed
>             if count == 2:
>                 url = 'http://www.google.com'
>             time.sleep(1)
>             pass

'pass' isn't necessary.

>     else:
>         print 'Fatal URL error. Process terminated.'
>         return None
>     # Scrape page and populate valuesDict
>     valuesDict = {}
>     return valuesDict
>
> url = 'http://badurl'
> valuesDict = scrape_records(url)
> if valuesDict == None:

When checking whether or not something is a singleton, such as None, use
"is" or "is not" instead of "==" or "!=".

>     print 'Failed to retrieve valuesDict'
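To make the None comparison point concrete, here is a minimal, untested Python 2 sketch (the Weird class is invented purely for illustration): "==" asks the object to compare itself, which a class can override in surprising ways, while "is" tests identity and so is the reliable check for the None singleton.

class Weird(object):
    def __eq__(self, other):
        # Claims to be equal to everything, including None.
        return True

w = Weird()
print w == None    # True  -- misleading, because __eq__ lies
print w is None    # False -- the identity test gives the right answer

result = None
if result is None:    # preferred spelling for the singleton check
    print 'no result'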
From: Brian D on 30 Dec 2009 20:17

On Dec 30, 7:08 pm, MRAB <pyt...(a)mrabarnett.plus.com> wrote:
> Brian D wrote:
[...]
> >             time.sleep(1)
> >             pass
>
> 'pass' isn't necessary.
>
[...]
> > if valuesDict == None:
>
> When checking whether or not something is a singleton, such as None, use
> "is" or "is not" instead of "==" or "!=".
>
> >     print 'Failed to retrieve valuesDict'

I'm definitely acquiring some well-deserved schooling -- and it's
really appreciated. I'd seen the "is/is not" preference before, but it
just didn't stick.

I see now that "pass" is redundant -- thanks for catching that.

Cheers.
From: Steve Holden on 30 Dec 2009 20:55

Brian D wrote:
[...]
> I'm definitely acquiring some well-deserved schooling -- and it's
> really appreciated. I'd seen the "is/is not" preference before, but it
> just didn't stick.
>
Yes, a lot of people have acquired the majority of their Python
education from this list - I have certainly learned a thing or two from
it over the years, and had some very interesting discussions.

is/is not are about object identity. Saying

    a is b

is pretty much the same thing as saying

    id(a) == id(b)

so it's a test that two expressions are references to the exact same
object. So it works with None, since there is only ever one value of
<type 'NoneType'>. Be careful not to use it when there can be several
different but equal values, though.

> I see now that "pass" is redundant -- thanks for catching that.
>
regards
 Steve

--
Steve Holden           +1 571 484 6266   +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010  http://us.pycon.org/
Holden Web LLC                 http://www.holdenweb.com/
UPCOMING EVENTS:        http://holdenweb.eventbrite.com/
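A short, untested Python 2 illustration of that last caution -- equal values are not necessarily the same object, so "is" only makes sense for known singletons such as None:

a = [1, 2, 3]
b = [1, 2, 3]
print a == b          # True  -- the two lists have equal contents
print a is b          # False -- they are two distinct list objects
print id(a) == id(b)  # False -- the same test as 'is', spelled out

print None is None    # True  -- there is only ever one None object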
From: Aahz on 13 Jan 2010 20:27

In article <mailman.233.1262197919.28905.python-list(a)python.org>,
Philip Semanchuk <philip(a)semanchuk.com> wrote:
>
>While I don't fully understand what you're trying to accomplish by
>changing the URL to google.com after 3 iterations, I suspect that some
>of your trouble comes from using "while True". Your code would be
>clearer if the while clause actually stated the exit condition. Here's
>a suggestion (untested):
>
>MAX_ATTEMPTS = 5
>
>count = 0
>while count <= MAX_ATTEMPTS:
>    count += 1
>    try:
>        print 'attempt ' + str(count)
>        request = urllib2.Request(url, None, headers)
>        response = urllib2.urlopen(request)
>        if response:
>            print 'True response.'
>    except URLError:
>        print 'fail ' + str(count)

Note that you may have good reason for doing it differently:

MAX_ATTEMPTS = 5

def retry(url):
    count = 0
    while True:
        count += 1
        try:
            print 'attempt', count
            request = urllib2.Request(url, None, headers)
            response = urllib2.urlopen(request)
            if response:
                print 'True response'
        except URLError:
            if count < MAX_ATTEMPTS:
                time.sleep(5)
            else:
                raise

This structure is required in order for the raise to do a proper
re-raise.

BTW, your code is rather oddly indented, please stick with PEP8.

--
Aahz (aahz(a)pythoncraft.com)           <*>         http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur."  --Red Adair
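A self-contained, untested Python 2 sketch of that re-raise point -- flaky() here is a made-up stand-in for the failing urlopen() call. The bare "raise" inside the "except" block re-raises the exception currently being handled, original traceback included, once the attempts run out:

MAX_ATTEMPTS = 3

def flaky():
    # Stand-in for a call that keeps failing, e.g. urllib2.urlopen().
    raise ValueError('simulated connection failure')

def retry_flaky():
    count = 0
    while True:
        count += 1
        try:
            return flaky()
        except ValueError:
            if count < MAX_ATTEMPTS:
                print 'attempt', count, 'failed, retrying'
            else:
                raise    # bare raise: propagate the original exception

try:
    retry_flaky()
except ValueError, e:
    print 'gave up after %d attempts: %s' % (MAX_ATTEMPTS, e)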
From: Aahz on 13 Jan 2010 20:31
In article <hilruv$nuv$1(a)panix5.panix.com>, Aahz <aahz(a)pythoncraft.com> wrote:
>In article <mailman.233.1262197919.28905.python-list(a)python.org>,
>Philip Semanchuk <philip(a)semanchuk.com> wrote:
>>
>>While I don't fully understand what you're trying to accomplish by
>>changing the URL to google.com after 3 iterations, I suspect that some
>>of your trouble comes from using "while True".
[...]
>
>Note that you may have good reason for doing it differently:
>
>MAX_ATTEMPTS = 5
>
>def retry(url):
>    count = 0
>    while True:
>        count += 1
>        try:
>            print 'attempt', count
>            request = urllib2.Request(url, None, headers)
>            response = urllib2.urlopen(request)
>            if response:
>                print 'True response'
                 ^^^^^

Oops, that print should have been a return.

>        except URLError:
>            if count < MAX_ATTEMPTS:
>                time.sleep(5)
>            else:
>                raise
>
>This structure is required in order for the raise to do a proper
>re-raise.

--
Aahz (aahz(a)pythoncraft.com)           <*>         http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur."  --Red Adair
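Folding that correction back in, the retry helper would read roughly as follows (untested; the imports are added so the sketch stands alone, and the headers dict is a placeholder value rather than the one from the thread):

import time
import urllib2
from urllib2 import URLError

MAX_ATTEMPTS = 5
headers = {'User-agent': 'Mozilla/5.0'}    # placeholder for illustration

def retry(url):
    count = 0
    while True:
        count += 1
        try:
            print 'attempt', count
            request = urllib2.Request(url, None, headers)
            response = urllib2.urlopen(request)
            if response:
                return response    # the 'return' the follow-up post fixes
        except URLError:
            if count < MAX_ATTEMPTS:
                time.sleep(5)
            else:
                raise    # out of attempts: re-raise for the caller to handle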