From: yamamoto on 13 Jan 2010 07:46

Hi,
I am new to Python. I'd like to extract "a" tag from a website by
using "beautifulsoup" module.
but it doesnt work!

//sample.py

from BeautifulSoup import BeautifulSoup as bs
import urllib
url="http://www.d-addicts.com/forum/torrents.php"
doc=urllib.urlopen(url).read()
soup=bs(doc)
result=soup.findAll("a")
for i in result:
    print i


Traceback (most recent call last):
  File "C:\Users\falcon\workspace\p\pyqt\ex1.py", line 8, in <module>
    soup=bs(doc)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "C:\Python26\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36

any suggestion?
thanks in advance

From: Peter Otten on 13 Jan 2010 09:11

yamamoto wrote:

> Hi,
> I am new to Python. I'd like to extract "a" tag from a website by
> using "beautifulsoup" module.
> but it doesnt work!
[..]
> HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36
>
> any suggestion?

When BeautifulSoup encounters an error that it cannot fix, the first thing
you need is a better error message:

from BeautifulSoup import BeautifulSoup as bs
import urllib
import HTMLParser

url = "http://www.d-addicts.com/forum/torrents.php"
doc = urllib.urlopen(url).read()
#doc = doc.replace("\>", "/>")
try:
    soup=bs(doc)
except HTMLParser.HTMLParseError as e:
    # show the offending source line with a caret under the reported column
    lines = doc.splitlines(True)
    print lines[e.lineno-1].rstrip()
    print " " * e.offset + "^"
else:
    result = soup.findAll("a")
    for i in result:
        print i

Once you know the origin of the problem you can devise a manual fix. Here
you could uncomment the line

doc = doc.replace("\>", "/>")

Keep in mind though that what fixes this broken document may break another
(valid) one.

Peter
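
The two steps in that reply (show the failing line, then patch the markup) can be pulled out into small helpers and tried on a string, without fetching the page. This is only a sketch: the names error_context and patch_backslash_close and the sample markup are made up for illustration, and the line/column convention (1-based line, 0-based column) is assumed to be the one HTMLParser's getpos() reports.

```python
def error_context(doc, lineno, offset):
    """Return the offending line with a caret under the reported column.

    Assumes lineno is 1-based and offset is 0-based, as reported by
    HTMLParser's getpos().
    """
    lines = doc.splitlines(True)
    line = lines[lineno - 1].rstrip()
    return line + "\n" + " " * offset + "^"


def patch_backslash_close(doc):
    """Turn a stray backslash before '>' into a slash.

    This is a blunt text substitution: it also rewrites any legitimate
    occurrence of a backslash followed by '>' in the document.
    """
    return doc.replace("\\>", "/>")


# Hypothetical sample; line 2 ends its tag with a backslash instead of a slash.
sample = 'ok\n<img src="x.png" \\>\nok'

print(error_context(sample, 2, 17))
print(patch_backslash_close(sample))
```

Returning strings instead of printing inside the helpers keeps them usable from either Python 2 or Python 3, and makes it easy to check the patch against a known-good document before applying it to everything.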

From: John Nagle on 15 Jan 2010 15:25

It's just somebody pirating movies. Ineptly. Ignore.

John Nagle

yamamoto wrote:
> Hi,
> I am new to Python. I'd like to extract "a" tag from a website by
> using "beautifulsoup" module.
> but it doesnt work!
[..]

From: Phlip on 15 Jan 2010 15:17

John Nagle wrote:
> It's just somebody pirating movies. Ineptly. Ignore.

Anyone who leaves their movies hanging out in <a> tags, without a daily
download limit or a daily hashtag, deserves to be taught a lesson!

--
Phlip

From: John Bokma on 15 Jan 2010 16:56

yamamoto <blueskykind02(a)gmail.com> writes:

> Hi,
> I am new to Python. I'd like to extract "a" tag from a website by
> using "beautifulsoup" module.
> but it doesnt work!

[..]

> check_for_whole_start_tag
> self.error("malformed start tag")
> File "C:\Python26\lib\HTMLParser.py", line 115, in error
> raise HTMLParseError(message, self.getpos())
> HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36
>
> any suggestion?

I guess you're using 3.1.0. If yes, see:
http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

You might want to do:

sudo easy_install -U "BeautifulSoup==3.0.7a"

and try again.

--
John Bokma                                                               j3b

Hacking & Hiking in Mexico -  http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development
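
Before downgrading, it may help to confirm which release is actually installed. As far as I know the BeautifulSoup 3 module carries a `__version__` string, but treat that as an assumption and check your copy. The helper below is a hypothetical sketch (the name is_3_1_series is not part of any library) that only tests whether a version string falls in the 3.1 series, the one the linked page is about.

```python
def is_3_1_series(version):
    """Return True if a version string such as "3.1.0" is in the 3.1 series."""
    parts = version.split(".")
    return parts[:2] == ["3", "1"]


# Example version strings; with the library installed you could instead pass
# BeautifulSoup.__version__ (assumed attribute, verify on your install).
for candidate in ("3.1.0", "3.0.7a"):
    print(candidate + " -> " + str(is_3_1_series(candidate)))
```

Comparing the first two dot-separated fields, rather than using startswith("3.1"), avoids misreading a hypothetical "3.10" as part of the 3.1 series.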