From: Stefan Behnel on 8 Feb 2010 03:59

Paul Rubin, 04.02.2010 02:51:
> John Nagle writes:
>> Analysis of each domain is performed in a separate process, but each
>> process uses multiple threads to read and process several web pages
>> simultaneously.
>>
>> Some of the threads go compute-bound for a second or two at a time as
>> they parse web pages.
>
> You're probably better off using separate processes for the different
> pages. If I remember, you were using BeautifulSoup, which, while very
> cool, is pretty doggone slow for use on large volumes of pages. I don't
> know if there's much that can be done about that without going off on a
> fairly messy C or C++ coding adventure. Maybe someday someone will do
> that.

Well, if multi-core performance is so important here, then there's a
pretty simple thing the OP can do: switch to lxml.

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan
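For readers who haven't tried it, here is a minimal sketch of what the
lxml.html API looks like; the sample markup and the helper name are
illustrative placeholders, not anything from the thread:

    import lxml.html

    def extract_links(html_text):
        # Parse an HTML string into an element tree; lxml.html is
        # tolerant of typical real-world markup.
        doc = lxml.html.fromstring(html_text)
        # XPath queries run directly against the parsed tree.
        return [a.get('href') for a in doc.xpath('//a[@href]')]

    page = "<html><body><a href='http://example.com/'>example</a></body></html>"
    print(extract_links(page))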
From: Paul Rubin on 8 Feb 2010 04:10

Stefan Behnel <stefan_ml(a)behnel.de> writes:
> Well, if multi-core performance is so important here, then there's a
> pretty simple thing the OP can do: switch to lxml.
>
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
only works on well-formed XML. The point of Beautiful Soup is that it
works on all kinds of garbage hand-written legacy HTML with mismatched
tags and other sorts of errors. Beautiful Soup is slower because it's
full of special cases and hacks for exactly that reason, and because it
is written in Python. Writing something that complex in C, to handle so
much potentially malicious input, would be a lot of work and very
difficult to make genuinely safe. Look at the many browser
vulnerabilities we've seen over the years due to that sort of problem,
for example. But for web crawling, you really do need to handle messy
and malformed HTML properly.
From: Antoine Pitrou on 8 Feb 2010 09:28

On Tue, 02 Feb 2010 15:02:49 -0800, John Nagle wrote:
> I know there's a performance penalty for running Python on a multicore
> CPU, but how bad is it? I've read the key paper
> ("www.dabeaz.com/python/GIL.pdf"), of course. It would be adequate if
> the GIL just limited Python to running on one CPU at a time, but it's
> worse than that; there's excessive overhead due to a lame locking
> implementation. Running CPU-bound multithreaded code on a dual-core CPU
> runs HALF AS FAST as on a single-core CPU, according to Beazley.

That result holds for certain types of workloads, and perhaps only on
certain OSes, so you should try benchmarking your own workload to see
whether it applies.

Two closing remarks:

- this should (hopefully) be fixed in 3.2, as exarkun noticed
- instead of spawning one thread per web page, you could use Twisted or
  another event-loop mechanism in order to process pages serially, in
  the order of arrival

Regards

Antoine.
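A rough sketch of that second suggestion, using Twisted's getPage client
API as it existed in releases of that era; the URLs and the parse()
callback are placeholders, not code from the thread:

    from twisted.internet import reactor
    from twisted.internet.defer import DeferredList
    from twisted.web.client import getPage

    URLS = ['http://example.com/a', 'http://example.com/b']

    def parse(body, url):
        # CPU-bound parsing runs here as each download completes,
        # one page at a time -- no thread per page.
        print('%s: %d bytes' % (url, len(body)))

    def fetch(url):
        d = getPage(url)           # Deferred that fires with the page body
        d.addCallback(parse, url)  # extra arguments are passed through
        return d

    dl = DeferredList([fetch(u) for u in URLS])
    dl.addCallback(lambda _: reactor.stop())
    reactor.run()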
From: J Kenneth King on 8 Feb 2010 11:21

Paul Rubin <no.email(a)nospam.invalid> writes:
> Stefan Behnel <stefan_ml(a)behnel.de> writes:
>> Well, if multi-core performance is so important here, then there's a
>> pretty simple thing the OP can do: switch to lxml.
>>
>> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
> only works on well-formed XML. The point of Beautiful Soup is that it
> works on all kinds of garbage hand-written legacy HTML with mismatched
> tags and other sorts of errors. [...] But for web crawling, you really
> do need to handle messy and malformed HTML properly.

If the difference is great enough, you might get a benefit from
analyzing all pages with lxml and throwing invalid pages into a bucket
for later processing with BeautifulSoup.
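A hypothetical sketch of that two-pass approach; the exception types
caught and the BeautifulSoup 3 import are assumptions based on the
libraries as they were at the time, not code from anyone in the thread:

    from lxml import etree, html
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import style

    soup_bucket = []  # pages lxml refused, queued for a slower second pass

    def first_pass(url, page_text):
        try:
            return html.fromstring(page_text)
        except (etree.ParserError, etree.XMLSyntaxError):
            # Too mangled even for lxml: set it aside for BeautifulSoup.
            soup_bucket.append((url, page_text))
            return None

    def second_pass():
        # Later, run the leftovers through the slower, more forgiving parser.
        return [(url, BeautifulSoup(text)) for url, text in soup_bucket]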
From: John Krukoff on 8 Feb 2010 11:43

On Mon, 2010-02-08 at 01:10 -0800, Paul Rubin wrote:
> Stefan Behnel <stefan_ml(a)behnel.de> writes:
>> Well, if multi-core performance is so important here, then there's a
>> pretty simple thing the OP can do: switch to lxml.
>>
>> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it
> only works on well-formed XML. The point of Beautiful Soup is that it
> works on all kinds of garbage hand-written legacy HTML with mismatched
> tags and other sorts of errors. [...] But for web crawling, you really
> do need to handle messy and malformed HTML properly.

Actually, lxml has an HTML parser which does pretty well with the
standard level of brokenness one finds most often on the web. And when
it falls down, it's easy to integrate BeautifulSoup as a slow backup for
when things go really wrong (as J Kenneth King mentioned earlier); see
the sketch below:

http://codespeak.net/lxml/lxmlhtml.html#parsing-html

At least in my experience, I haven't yet had to parse anything that lxml
couldn't handle, however.

--
John Krukoff <jkrukoff(a)ltgc.com>
Land Title Guarantee Company
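A small sketch of that fallback wiring, using the BeautifulSoup bridge
that ships with lxml (lxml.html.soupparser); the exception types caught
are an assumption, and BeautifulSoup must be installed alongside lxml:

    from lxml import etree, html
    from lxml.html import soupparser

    def robust_parse(page_text):
        try:
            # Fast path: lxml's own forgiving HTML parser.
            return html.fromstring(page_text)
        except (etree.ParserError, etree.XMLSyntaxError):
            # Slow path: route the really broken markup through
            # BeautifulSoup, but still get back an lxml element tree.
            return soupparser.fromstring(page_text)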