Prev: MAKE UPTO $5000 MONTHLY! $2000 INYOUR FIRST 30 DAYS!
Next: using subprocess.Popen does not suppress terminal window on Windows
From: someone on 18 Jun 2010 11:41 Hello, does anyone know how to get html contents of an tag with BeautifulSoup? In example I'd like to get all html which is in first <p> tag, i.e. <span id="foo">This is paragraph</span> <b>one</b>. as unicode object p.contents gives me a list which I cannot join TypeError: sequence item 0: expected string, Tag found Thanks! from BeautifulSoup import BeautifulSoup import re doc = ['<html><head><title>Page title</title></head>', '<body><p id="firstpara" align="center"><span id="foo">This is paragraph</span> <b>one</b>.</p>', '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</ p>', '</body></html>'] soup = BeautifulSoup(''.join(doc)) #print soup.prettify() r = re.compile(r'<[^<]*?/?>') for i, p in enumerate(soup.findAll('p')): #print type(p) #<class 'BeautifulSoup.Tag'> #print type(p.contents) #list content = "".join(p.contents) #fails p_without_html = r.sub(' ', content) print p_without_html
From: someone on 18 Jun 2010 12:01
On Jun 18, 5:41 pm, someone <petshm...(a)googlemail.com> wrote: > Hello, > > does anyone know how to get html contents of an tag with > BeautifulSoup? In example I'd like to get all html which is in first > <p> tag, i.e. <span id="foo">This is paragraph</span> <b>one</b>. as > unicode object > > p.contents gives me a list which I cannot join TypeError: sequence > item 0: expected string, Tag found > > Thanks! > > from BeautifulSoup import BeautifulSoup > import re > > doc = ['<html><head><title>Page title</title></head>', > '<body><p id="firstpara" align="center"><span id="foo">This is > paragraph</span> <b>one</b>.</p>', > '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</ > p>', > '</body></html>'] > soup = BeautifulSoup(''.join(doc)) > #print soup.prettify() > r = re.compile(r'<[^<]*?/?>') > for i, p in enumerate(soup.findAll('p')): > #print type(p) #<class 'BeautifulSoup.Tag'> > #print type(p.contents) #list > content = "".join(p.contents) #fails > > p_without_html = r.sub(' ', content) > print p_without_html p.renderContents() was what I've looked for |