From: John Nagle on 3 Feb 2010 22:46

Paul Rubin wrote:
> John Nagle <nagle(a)animats.com> writes:
>> Analysis of each domain is performed in a separate process, but each
>> process uses multiple threads to read and process several web pages
>> simultaneously.
>>
>> Some of the threads go compute-bound for a second or two at a time as
>> they parse web pages.
>
> You're probably better off using separate processes for the different
> pages. If I remember, you were using BeautifulSoup, which while very
> cool, is pretty doggone slow for use on large volumes of pages. I don't
> know if there's much that can be done about that without going off on a
> fairly messy C or C++ coding adventure. Maybe someday someone will do
> that.

I already use separate processes for different domains. I could live
with Python's GIL as long as moving to a multicore server doesn't make
performance worse. That's why I asked about CPU dedication for each
process, to avoid thrashing at the GIL.

There's enough intercommunication between the threads working on a
single site that it's a pain to do them as subprocesses. And I
definitely don't want to launch subprocesses for each page; the Python
load time would be worse than the actual work. The subprocess module
assumes you're willing to launch a subprocess for each transaction.

The current program organization is that there's a scheduler process
which gets requests, prioritizes them, and runs the requested domains
through the site evaluation mill. The scheduler maintains a pool of
worker processes which get work requests via their input pipe, in
pickle format, and return results, again in pickle format. When not in
use, the worker processes sit there dormant, so there's no Python
launch cost for each transaction. If a worker process crashes, the
scheduler replaces it with a fresh one, and every few hundred uses each
worker process is replaced with a fresh copy, in case Python has a
memory leak. It's a lot like the way FCGI works.

Scheduling is managed using an in-memory table in MySQL, so the load
can be spread over a cluster if desired, with a scheduler process on
each machine.

So I already have a scalable architecture. The only problem is excess
overhead on multicore CPUs.

				John Nagle
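A minimal sketch of this kind of persistent worker pool over pickled pipes, using multiprocessing.Pipe (whose send()/recv() pickle objects automatically). The names evaluate_domain, WorkerHandle, and MAX_USES are illustrative, not taken from the program described above:

import multiprocessing

MAX_USES = 300                  # recycle each worker every few hundred requests

def evaluate_domain(request):
    """Placeholder for the real per-domain site evaluation (hypothetical)."""
    return {"domain": request, "status": "ok"}

def worker_loop(conn):
    """Sit dormant until a request arrives on the pipe, then send back a result.
    Connection.send()/recv() pickle and unpickle the objects automatically."""
    while True:
        request = conn.recv()
        if request is None:          # shutdown sentinel
            break
        conn.send(evaluate_domain(request))

class WorkerHandle:
    """Scheduler-side wrapper: restart a worker on crash, recycle after MAX_USES."""
    def __init__(self):
        self._spawn()

    def _spawn(self):
        self.conn, child_conn = multiprocessing.Pipe()
        self.proc = multiprocessing.Process(target=worker_loop, args=(child_conn,))
        self.proc.daemon = True
        self.proc.start()
        self.uses = 0

    def run(self, request):
        if not self.proc.is_alive() or self.uses >= MAX_USES:
            self.proc.terminate()    # fresh interpreter, in case of memory leaks
            self._spawn()
        self.conn.send(request)
        self.uses += 1
        return self.conn.recv()

if __name__ == "__main__":
    worker = WorkerHandle()
    print(worker.run("example.com"))

The scheduler keeps a list of such handles and hands each incoming request to an idle one; only the pipe traffic is per-transaction, never an interpreter launch.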
From: Steve Holden on 3 Feb 2010 22:50

John Nagle wrote:
> Paul Rubin wrote:
>> John Nagle <nagle(a)animats.com> writes:
>>> Analysis of each domain is performed in a separate process, but each
>>> process uses multiple threads to read and process several web pages
>>> simultaneously.
>>>
>>> Some of the threads go compute-bound for a second or two at a time as
>>> they parse web pages.
>>
>> You're probably better off using separate processes for the different
>> pages. If I remember, you were using BeautifulSoup, which while very
>> cool, is pretty doggone slow for use on large volumes of pages. I don't
>> know if there's much that can be done about that without going off on a
>> fairly messy C or C++ coding adventure. Maybe someday someone will do
>> that.
>
> I already use separate processes for different domains. I could
> live with Python's GIL as long as moving to a multicore server
> doesn't make performance worse. That's why I asked about CPU dedication
> for each process, to avoid thrashing at the GIL.

I believe it's already been said that the GIL thrashing is mostly MacOS
specific. You might also find something in the affinity module

  http://pypi.python.org/pypi/affinity/0.1.0

to ensure that each process in your pool runs on only one processor.

regards
 Steve

--
Steve Holden           +1 571 484 6266   +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010  http://us.pycon.org/
Holden Web LLC                 http://www.holdenweb.com/
UPCOMING EVENTS:        http://holdenweb.eventbrite.com/
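A minimal sketch of that kind of per-process CPU pinning. It uses os.sched_setaffinity from the standard library (Linux, Python 3.3 and later) rather than the affinity module linked above, and the worker body is a stand-in for the real parsing work:

import os
import multiprocessing

def pinned_worker(cpu_index, pages):
    # 0 means "the calling process"; after this call the OS scheduler will
    # not migrate this worker to another core.
    os.sched_setaffinity(0, {cpu_index})
    return sum(len(p) for p in pages)      # stand-in for compute-bound parsing

if __name__ == "__main__":
    cpus = sorted(os.sched_getaffinity(0))
    work = [["<html>...</html>"] * 1000 for _ in cpus]
    with multiprocessing.Pool(len(cpus)) as pool:
        results = [pool.apply_async(pinned_worker, (cpu, pages))
                   for cpu, pages in zip(cpus, work)]
        print([r.get() for r in results])

With each worker pinned to its own core, GIL contention inside one worker cannot bounce that process across CPUs, which is the thrashing effect being discussed.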
From: John Nagle on 4 Feb 2010 01:11

Steve Holden wrote:
> John Nagle wrote:
>> Paul Rubin wrote:
>>> John Nagle <nagle(a)animats.com> writes:
>>>> Analysis of each domain is performed in a separate process, but each
>>>> process uses multiple threads to read and process several web pages
>>>> simultaneously.
>>>>
>>>> Some of the threads go compute-bound for a second or two at a time as
>>>> they parse web pages.
>>> You're probably better off using separate processes for the different
>>> pages. If I remember, you were using BeautifulSoup, which while very
>>> cool, is pretty doggone slow for use on large volumes of pages. I don't
>>> know if there's much that can be done about that without going off on a
>>> fairly messy C or C++ coding adventure. Maybe someday someone will do
>>> that.
>> I already use separate processes for different domains. I could
>> live with Python's GIL as long as moving to a multicore server
>> doesn't make performance worse. That's why I asked about CPU dedication
>> for each process, to avoid thrashing at the GIL.
>>
> I believe it's already been said that the GIL thrashing is mostly MacOS
> specific. You might also find something in the affinity module

No, the original analysis was MacOS oriented, but the same mechanism
applies to fighting over the GIL on all platforms. There was some
pontification that it might be a MacOS-only issue, but no facts were
presented. It might be cheaper on C implementations with mutexes that
don't make system calls for the non-blocking cases.

				John Nagle
From: Paul Rubin on 4 Feb 2010 01:57

John Nagle <nagle(a)animats.com> writes:
> There's enough intercommunication between the threads working on
> a single site that it's a pain to do them as subprocesses. And I
> definitely don't want to launch subprocesses for each page; the
> Python load time would be worse than the actual work. The
> subprocess module assumes you're willing to launch a subprocess
> for each transaction.

Why not just use socketserver and have something like a fastcgi?
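A rough sketch of that suggestion: a long-lived socketserver service that accepts pickled analysis requests, so no fresh interpreter is started per transaction. The analyze_page body and the port number are hypothetical:

import pickle
import socketserver

def analyze_page(request):
    """Placeholder for the real BeautifulSoup parsing (hypothetical)."""
    return {"url": request.get("url"), "links": 0}

class AnalysisHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One pickled request per connection, one pickled reply back.
        request = pickle.load(self.rfile)
        pickle.dump(analyze_page(request), self.wfile)
        self.wfile.flush()

if __name__ == "__main__":
    # ForkingTCPServer handles each connection in a forked child; fork reuses
    # the already-loaded interpreter, so there is no per-request startup cost,
    # roughly the way FastCGI keeps its workers resident.
    with socketserver.ForkingTCPServer(("127.0.0.1", 9999), AnalysisHandler) as server:
        server.serve_forever()

A client would open a connection, pickle.dump() its request onto the socket, and pickle.load() the reply; the parent process stays resident between requests.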
From: Anh Hai Trinh on 4 Feb 2010 06:13

On Feb 4, 10:46 am, John Nagle <na...(a)animats.com> wrote:
>
> There's enough intercommunication between the threads working on
> a single site that it's a pain to do them as subprocesses. And I
> definitely don't want to launch subprocesses for each page; the
> Python load time would be worse than the actual work. The
> subprocess module assumes you're willing to launch a subprocess
> for each transaction.

You could perhaps use a process pool inside each domain worker to work
on the pages? There is multiprocessing.Pool and other implementations.

For example, in this library you can s/ThreadPool/ProcessPool/g and
this example would work:
<http://www.onideas.ws/stream.py/#retrieving-web-pages-concurrently>.

If you want to DIY with multiprocessing.Lock/Pipe/Queue, I don't
understand why it would be more of a pain to write your threads as
processes.

// aht
http://blog.onideas.ws
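A short sketch of the multiprocessing.Pool idea inside one domain worker, so the compute-bound parsing of several pages lands on separate cores; parse_page and the URLs are hypothetical stand-ins for the real fetching and BeautifulSoup work:

import multiprocessing
import urllib.request

def parse_page(url):
    """Fetch one page and do the compute-bound work on it (stand-in)."""
    with urllib.request.urlopen(url) as resp:
        html = resp.read()
    return url, len(html)              # real code would run BeautifulSoup here

if __name__ == "__main__":
    urls = ["http://example.com/", "http://example.org/"]
    with multiprocessing.Pool(processes=4) as pool:
        for url, size in pool.imap_unordered(parse_page, urls):
            print(url, size)

Because each pool worker is a separate process with its own GIL, the parsing runs in parallel, while results still come back to the one domain worker that coordinates the site.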