From: Johann Spies on 10 Dec 2009 04:15

I am trying to get CSV output from an HTML file.

With this code I had a little success:

=========================
from BeautifulSoup import BeautifulSoup
from string import replace, join
import re

f = open("configuration.html","r")
g = open("configuration.csv",'w')

soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t:
    rows = table.findAll('tr')
    for th in rows[0]:
        t = th.find(text=True)
        g.write(t)
        g.write(',')

#    print(','.join(t))

    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            try:
                t = td.find(text=True).replace('&nbsp;','')
                g.write(t)
            except:
                g.write ('')
            g.write(",")
        g.write("\n")
===============================

producing output like this:

RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1,,,,drop,Log,Any,,,
2,All Users(a)Any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4,,,,drop,None,Any,,,
....

It left out all the non-plaintext parts of <td></td>.

I then tried using t.renderContents and got something like this (one line broken into many for the sake of this email):

1,<img src=icons/group.png> <a href=#OBJ_sunetint> sunetint</A><BR>,
<img src=icons/gateway_cluster.png> <a href=#OBJ_Rainwall_Cluster >Rainwall_Cluster</A> <BR>,
<img src=icons/udp.png> <a href=#SVC_IKE >IKE</a><br>,
<img src=icons/drop.png> drop,
<img src=icons/log.png> Log ,
<img src=icons/any.png> Any<br> ,
<img src=icons/gateway_cluster.png> <a href=#OBJ_Rainwall_Cluster >Rainwall_Cluster</A> <BR> ,

How do I get Beautifulsoup to render (taking the above line as example)

    sunetint

for

    <img src=icons/group.png> <a href=#OBJ_sunetint>sunetint</A><BR>

and still provide the text parts in the <td>'s as plain text?

I have experimented a little with regular expressions, but have so far not found a solution.

Regards
Johann
--
Johann Spies          Telephone: 021-808 4599
Information Technology, Universiteit van Stellenbosch

"Lo, children are an heritage of the LORD: and the fruit of the womb is his reward." Psalms 127:3
From: Gabriel Genellina on 10 Dec 2009 22:23

On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jspies(a)sun.ac.za> wrote:

> How do I get Beautifulsoup to render (taking the above line as
> example)
>
> sunetint for <img src=icons/group.png> <a
> href=#OBJ_sunetint>sunetint</A><BR>
>
> and still provide the text-parts in the <td>'s with plain text?

Hard to tell if we don't see what's inside those <td>'s - please provide at least a few rows of the original HTML table.

--
Gabriel Genellina
From: Johann Spies on 11 Dec 2009 02:04

Gabriel Genellina wrote:
> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jspies(a)sun.ac.za>
> wrote:
>
>> How do I get Beautifulsoup to render (taking the above line as
>> example)
>>
>> sunetint for <img src=icons/group.png> <a
>> href=#OBJ_sunetint>sunetint</A><BR>
>>
>> and still provide the text-parts in the <td>'s with plain text?
>
> Hard to tell if we don't see what's inside those <td>'s - please
> provide at least a few rows of the original HTML table.
>

Thanks for your reply. Here are a few lines:

<!------- Rule 1 ------->
<tr style="background-color: #ffffff"><td class=normal>2</td><td><img src=icons/usrgroup.png> All Users(a)Any<br><td><im$
</td><td><img src=icons/any.png> Any<br></td><td><img src=icons/clientencrypt.png> clientencrypt</td><td><img src$
</td><td> </td></tr>
<!------- Rule 2 ------->
<tr style="background-color: #eeeeee"><td class=normal>3</td><td><img src=icons/any.png> Any<br><td><img src=icons/any$
</td><td> </td></tr>
<!------- Rule 3 ------->
<tr style="background-color: #ffffff"><td class=normal>4</td><td><img src=icons/group.png> <a href=#OBJ_Rainwall_Group$
<td><img src=icons/group.png> <a href=#OBJ_Rainwall_Group >Rainwall_Group</A> <BR>
</td><td><img src=icons/udp.png> <a href=#SVC_RainWall_Stop >RainWall_Stop</a><br></td><td><img src=icons/drop.png>&nb$
</td><td> </td></tr>
<!------- Rule 4 ------->
<tr style="background-color: #eeeeee"><td class=normal>5</td><td><img src=icons/host.png> <a href=#OBJ_Rainwall_Broadc$
<img src=icons/group.png> <a href=#OBJ_Rainwall_Group >Rainwall_Group</A> <BR>
<td><img src=icons/group.png> <a href=#OBJ_Rainwall_Group >Rainwall_Group</A> <BR>
<img src=icons/host.png> <a href=#OBJ_Rainwall_Broadcast >Rainwall_Broadcast</A> <BR>
</td><td><img src=icons/udp.png> <a href=#SVC_RainWall_Daemon >RainWall_Daemon</a><br></td><td><img src=icons/accept.p$
</td><td> </td></tr>

Regards
Johann
From: Gabriel Genellina on 13 Dec 2009 05:58

On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies <jspies(a)sun.ac.za> wrote:

> Gabriel Genellina wrote:
>> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies <jspies(a)sun.ac.za>
>> wrote:
>>
>>> How do I get Beautifulsoup to render (taking the above line as
>>> example)
>>>
>>> sunetint for <img src=icons/group.png> <a
>>> href=#OBJ_sunetint>sunetint</A><BR>
>>>
>>> and still provide the text-parts in the <td>'s with plain text?
>>
>> Hard to tell if we don't see what's inside those <td>'s - please
>> provide at least a few rows of the original HTML table.
>>
> Thanks for your reply. Here are a few lines:
>
> <!------- Rule 1 ------->
> <tr style="background-color: #ffffff"><td class=normal>2</td><td><img
> src=icons/usrgroup.png> All Users(a)Any<br><td><im$
> </td><td><img src=icons/any.png> Any<br></td><td><img
> src=icons/clientencrypt.png> clientencrypt</td><td><img src$
> </td><td> </td></tr>

I *think* I finally understand what you want (your previous example above confused me). If you want Rule 1 to generate a line like this:

2,All Users(a)Any,<im$,Any,clientencrypt,,

this code should serve as a starting point:

lines = []
# html holds the page source
soup = BeautifulSoup(html)
for table in soup.findAll("table"):
    for row in table.findAll("tr"):
        line = []
        for cell in row.findAll("td"):
            # join every text node in the cell into one string
            text = ' '.join(
                s.replace('\n',' ').replace('&nbsp;',' ')
                for s in cell.findAll(text=True)).strip()
            line.append(text)
        lines.append(line)

import csv
with open("output.csv","wb") as f:
    writer = csv.writer(f)
    writer.writerows(lines)

cell.findAll(text=True) returns a list of all text nodes inside a <td> cell; I replace every \n and &nbsp; in each text node with a space, and join them all. lines is a list of rows, each row being a list with one string per cell, as expected by the csv module used to write the output file.

--
Gabriel Genellina
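
For readers without BeautifulSoup at hand, the idea described above (collect every text node inside a cell, normalise whitespace, join, and hand the rows to the csv module) can be sketched with Python 3's standard library alone. This swaps in html.parser for BeautifulSoup, so it is not the code from the post. The names CellTextParser, table_to_csv and SAMPLE are made up for illustration, and SAMPLE is a shortened stand-in modelled on the rows quoted in this thread, not the real file.

```python
import csv
import io
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text nodes of every <td> cell, row by row (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; arrives as '\xa0'
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> being read, or None
        self._cell = None  # text nodes of the <td> being read, or None

    def _close_cell(self):
        if self._cell is not None:
            # split() treats newlines and non-breaking spaces as whitespace,
            # so this joins all text nodes and collapses the gaps between them.
            self._row.append(" ".join(" ".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:  # skip rows without any <td> (e.g. header rows)
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag == "td" and self._row is not None:
            self._close_cell()  # tolerate a missing </td>, as in the sample
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_csv(html):
    """Return the <td> text of all table rows in html as CSV text."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO(newline="")
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample, shortened from the rows quoted in the thread.
SAMPLE = """
<table>
<tr><td class=normal>1</td><td><img src=icons/group.png> <a href=#OBJ_sunetint>sunetint</A><BR></td><td><img src=icons/drop.png>&nbsp;drop</td><td>&nbsp;</td></tr>
<tr><td class=normal>2</td><td><img src=icons/usrgroup.png> All Users<br><td><img src=icons/any.png> Any<br></td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE))
```

The first sample row shows the case from the original question: the link text inside the <a> element ends up in the cell because every text node is kept, not just the first one. Using csv.writer instead of writing commas by hand also means a cell that itself contains a comma gets quoted.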
From: Johann Spies on 14 Dec 2009 01:58

On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:
> this code should serve as a starting point:

Thank you very much!

> cell.findAll(text=True) returns a list of all text nodes inside a
> <td> cell; I preprocess all \n and &nbsp; in each text node, and
> join them all. lines is a list of lists (each entry one cell), as
> expected by the csv module used to write the output file.

I have struggled a bit to find the documentation for (text=True). Most of the BeautifulSoup documentation I saw consisted of examples without explaining what the options do. Thanks for your explanation.

As far as I can see, no documentation was installed with the Debian package.

Regards
Johann