From: Moi on 25 Dec 2009 08:36

On Mon, 16 Nov 2009 17:33:32 +0000, Jon Harrop wrote:
>
> What other high-level solutions for parallel programming exist and what
> problems are best solved using them? For example, when is nested data
> parallelism preferable?

I have experimented with parallel matrix transposition.
It works by sending a list of pairs of tile numbers to swap and flip
into a pipe. The child processes consume the pairs (read() is atomic)
and do the actual work.

I think this is a natural candidate for parallelism, because the threads
are *known* not to interfere, only to compete for CPU and disk resources.

HTH,
AvK
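A minimal C sketch of the scheme Moi describes, not his actual code: the parent writes fixed-size pairs of tile numbers into a pipe and forked workers read them off and do the work. The pair enumeration, worker count, and do_swap_flip() are hypothetical placeholders.

/* Sketch of a pipe-based work queue for tile swaps (assumptions noted above).
 * Each record is 8 bytes, well under PIPE_BUF, so writes are atomic and the
 * pipe always holds a whole number of records; a read of sizeof(struct pair)
 * therefore returns a full record or 0 at EOF.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

struct pair { int tile_a, tile_b; };

static void do_swap_flip(struct pair p)
{
    /* placeholder for the real work: transpose both tiles and swap them */
    fprintf(stderr, "worker %d: tiles %d <-> %d\n",
            (int)getpid(), p.tile_a, p.tile_b);
}

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    enum { NWORKERS = 4, NTILES = 8 };           /* assumed sizes */

    for (int w = 0; w < NWORKERS; w++) {
        if (fork() == 0) {                       /* child: consume pairs */
            close(fd[1]);
            struct pair p;
            while (read(fd[0], &p, sizeof p) == sizeof p)
                do_swap_flip(p);
            _exit(0);
        }
    }

    close(fd[0]);                                /* parent: produce pairs */
    for (int a = 0; a < NTILES; a++)
        for (int b = a + 1; b < NTILES; b++) {
            struct pair p = { a, b };
            write(fd[1], &p, sizeof p);
        }
    close(fd[1]);                                /* EOF lets workers finish */

    while (wait(NULL) > 0)
        ;
    return 0;
}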
From: Patricia Shanahan on 25 Dec 2009 09:13

Moi wrote:
> On Mon, 16 Nov 2009 17:33:32 +0000, Jon Harrop wrote:
>
>> What other high-level solutions for parallel programming exist and what
>> problems are best solved using them? For example, when is nested data
>> parallelism preferable?
>
> I have experimented with parallel matrix transposition.
> It works by sending a list of pairs of tile numbers to swap and flip
> into a pipe. The child processes consume the pairs (read() is atomic)
> and do the actual work.
>
> I think this is a natural candidate for parallelism, because the threads
> are *known* not to interfere, only to compete for CPU and disk resources.

The threads also share cache lines, though not bytes within cache lines,
at the tile boundaries, unless you tune the tiling to the memory layout.

Patricia
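A small illustration of Patricia's point, under assumed numbers not taken from the thread: if a tile edge in bytes is not a multiple of the cache-line size, the end of one tile's row and the start of its neighbour's row share a line, and two workers can write to that same line (false sharing). Rounding the edge up to a line multiple avoids it, assuming the matrix row stride is itself line-aligned.

#include <stdio.h>

#define CACHE_LINE 64                    /* assumed line size in bytes */

/* Round a tile edge (in elements) up so the edge in bytes is a
 * multiple of the cache-line size. */
static size_t aligned_tile_edge(size_t edge_elems, size_t elem_size)
{
    size_t bytes   = edge_elems * elem_size;
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return rounded / elem_size;
}

int main(void)
{
    /* 1000 x 4-byte elements = 4000 B -> rounded to 4032 B = 1008 elements */
    printf("edge 1000 -> %zu elements per tile row\n",
           aligned_tile_edge(1000, 4));
    return 0;
}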
From: Moi on 25 Dec 2009 09:39
On Fri, 25 Dec 2009 06:13:30 -0800, Patricia Shanahan wrote:
> Moi wrote:
>
> The threads also share cache lines, though not bytes within cache lines,
> at the tile boundaries, unless you tune the tiling to the memory layout.
>
> Patricia

Yes, you are right: they compete for cache lines, too.

Given the enormous cost of bringing the disk pages into core, I tend to
ignore the memory cache. Accesses are page aligned, and a tile typically
consists of (pagesize / sizeof element) pages. That is 1024 pages for a
4-byte element on a 4 KB-page Intel box, giving a total footprint of 8 MB
per pair of tiles, which is bigger than my L2 cache.

Once all the pages are pulled in, I expect a thread to complete its
flip-and-swap task in one sweep, so a thread competes only with itself
for cache slots. So it is more or less semi-cache-oblivious ;-)

AvK
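A back-of-the-envelope check of Moi's numbers, assuming 4 KB pages and 4-byte elements as he states: an edge of pagesize/sizeof(element) elements makes each tile row exactly one page, so a tile is 1024 pages (4 MB) and a pair of tiles being swapped touches 8 MB.

#include <stdio.h>

int main(void)
{
    const size_t page = 4096, elem = 4;          /* assumed sizes */
    const size_t edge = page / elem;             /* 1024 elements per tile row */
    const size_t tile_bytes = edge * edge * elem;        /* 4 MB per tile */
    const size_t pair_bytes = 2 * tile_bytes;            /* 8 MB per swap */

    printf("tile edge: %zu elems, tile: %zu pages (%zu MB), pair: %zu MB\n",
           edge, tile_bytes / page, tile_bytes >> 20, pair_bytes >> 20);
    return 0;
}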