From: Moi on 25 Dec 2009 08:36

On Mon, 16 Nov 2009 17:33:32 +0000, Jon Harrop wrote:
>
> What other high-level solutions for parallel programming exist and what
> problems are best solved using them? For example, when is nested data
> parallelism preferable?

I have experimented with parallel matrix transposition.
It works by sending a list of pairs of tile numbers to swap and flip
into a pipe. The child processes consume the pairs (read() is atomic)
and do the actual work.

I think this is a natural candidate for parallelism, because the threads
are *known* not to interfere, only to compete for CPU and disk resources.

HTH,
AvK
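A minimal C sketch of the scheme Moi describes, not his actual code: the parent writes fixed-size pairs of tile numbers into a pipe and forked workers read them off and do the work. The pair enumeration, worker count, and do_swap_flip() are hypothetical placeholders.

/* Sketch of a pipe-based work queue for tile swaps (assumptions noted above).
 * Each record is 8 bytes, well under PIPE_BUF, so writes are atomic and the
 * pipe always holds a whole number of records; a read of sizeof(struct pair)
 * therefore returns a full record or 0 at EOF.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

struct pair { int tile_a, tile_b; };

static void do_swap_flip(struct pair p)
{
    /* placeholder for the real work: transpose both tiles and swap them */
    fprintf(stderr, "worker %d: tiles %d <-> %d\n",
            (int)getpid(), p.tile_a, p.tile_b);
}

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    enum { NWORKERS = 4, NTILES = 8 };           /* assumed sizes */

    for (int w = 0; w < NWORKERS; w++) {
        if (fork() == 0) {                       /* child: consume pairs */
            close(fd[1]);
            struct pair p;
            while (read(fd[0], &p, sizeof p) == sizeof p)
                do_swap_flip(p);
            _exit(0);
        }
    }

    close(fd[0]);                                /* parent: produce pairs */
    for (int a = 0; a < NTILES; a++)
        for (int b = a + 1; b < NTILES; b++) {
            struct pair p = { a, b };
            write(fd[1], &p, sizeof p);
        }
    close(fd[1]);                                /* EOF lets workers finish */

    while (wait(NULL) > 0)
        ;
    return 0;
}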
From: Patricia Shanahan on 25 Dec 2009 09:13

Moi wrote:
> On Mon, 16 Nov 2009 17:33:32 +0000, Jon Harrop wrote:
>
>> What other high-level solutions for parallel programming exist and what
>> problems are best solved using them? For example, when is nested data
>> parallelism preferable?
>
> I have experimented with parallel matrix transposition.
> It works by sending a list of pairs of tile numbers to swap and flip
> into a pipe. The child processes consume the pairs (read() is atomic)
> and do the actual work.
>
> I think this is a natural candidate for parallelism, because the threads
> are *known* not to interfere, only to compete for CPU and disk resources.

The threads also share cache lines, though not bytes within cache lines,
at the tile boundaries, unless you tune the tiling to the memory layout.

Patricia
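A small illustration of Patricia's point, under assumed numbers not taken from the thread: if a tile edge in bytes is not a multiple of the cache-line size, the end of one tile's row and the start of its neighbour's row share a line, and two workers can write to that same line (false sharing). Rounding the edge up to a line multiple avoids it, assuming the matrix row stride is itself line-aligned.

#include <stdio.h>

#define CACHE_LINE 64                    /* assumed line size in bytes */

/* Round a tile edge (in elements) up so the edge in bytes is a
 * multiple of the cache-line size. */
static size_t aligned_tile_edge(size_t edge_elems, size_t elem_size)
{
    size_t bytes   = edge_elems * elem_size;
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return rounded / elem_size;
}

int main(void)
{
    /* 1000 x 4-byte elements = 4000 B -> rounded to 4032 B = 1008 elements */
    printf("edge 1000 -> %zu elements per tile row\n",
           aligned_tile_edge(1000, 4));
    return 0;
}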
From: Moi on 25 Dec 2009 09:39
On Fri, 25 Dec 2009 06:13:30 -0800, Patricia Shanahan wrote:
> Moi wrote:
>
> The threads also share cache lines, though not bytes within cache lines,
> at the tile boundaries, unless you tune the tiling to the memory layout.
>
> Patricia

Yes, you are right: they compete for cache lines, too.

Given the enormous cost of bringing the disk pages into core, I tend to
ignore the memory cache. Accesses are page aligned, and a tile typically
consists of (pagesize / sizeof element) pages. That is 1024 pages for a
4-byte element on a 4 KB-page Intel box, giving a total footprint of 8 MB
per pair of tiles, which is bigger than my L2 cache.

Once all the pages are pulled in, I expect a thread to complete its
flip-and-swap task in one sweep, so a thread competes only with itself
for cache slots. So it is more or less semi-cache-oblivious ;-)

AvK
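A back-of-the-envelope check of Moi's numbers, assuming 4 KB pages and 4-byte elements as he states: an edge of pagesize/sizeof(element) elements makes each tile row exactly one page, so a tile is 1024 pages (4 MB) and a pair of tiles being swapped touches 8 MB.

#include <stdio.h>

int main(void)
{
    const size_t page = 4096, elem = 4;          /* assumed sizes */
    const size_t edge = page / elem;             /* 1024 elements per tile row */
    const size_t tile_bytes = edge * edge * elem;        /* 4 MB per tile */
    const size_t pair_bytes = 2 * tile_bytes;            /* 8 MB per swap */

    printf("tile edge: %zu elems, tile: %zu pages (%zu MB), pair: %zu MB\n",
           edge, tile_bytes / page, tile_bytes >> 20, pair_bytes >> 20);
    return 0;
}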