From: ibisek on
Hi Edric,

If I understand correctly, in the loop you posted the workers run synchronously and the labBroadcast command is blocking? Do the others wait until worker #1 sends the message?

Ibisek
From: Edric M Ellis on
"ibisek " <cervenka(a)fai.utb.cz> writes:

> If I understand correctly, in the loop you posted the workers run synchronously
> and the labBroadcast command is blocking? Do the others wait until worker #1
> sends the message?

Yes, the labBroadcast command is effectively blocking - all workers must make
the same labBroadcast call together. (Actually, we don't guarantee whether the
"root" worker sending the data will proceed before the others have received the
data - this depends on the size of the data relative to the buffering in the
transport layer - so it's best to assume that the call may block.)
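
For illustration, here is a minimal sketch of the call pattern inside an spmd
block (the variable names and data size are arbitrary, not taken from your
code):

  spmd
      if labindex == 1
          % the root worker supplies the data to broadcast
          data = labBroadcast(1, rand(1000));
      else
          % every other worker must make the matching call to receive;
          % assume this call may block until the broadcast completes
          data = labBroadcast(1);
      end
  end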

Could I ask a little more about why you need the shared access to the array? How
large is the dataset that you'd like to share? How does that compare to the
amount of memory in your machine? How long does each iteration take to run?

Cheers,

Edric.

From: ibisek on
Alright,

so I managed to make it work with the labSend commands. It shows quite significant communication overhead when the workers' jobs are short, but we'll see how it performs on larger jobs...
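
For reference, the pattern I ended up with looks roughly like the sketch below
(doJob is just a placeholder for the real per-worker computation):

  spmd
      result = doJob(labindex);          % placeholder per-worker job
      if labindex == 1
          % worker 1 gathers everyone's results
          results = cell(1, numlabs);
          results{1} = result;
          for src = 2:numlabs
              results{src} = labReceive(src);
          end
      else
          labSend(result, 1);            % everyone else sends to worker 1
      end
  end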

At this moment it works; as you stated, we need the same number of jobs for every worker (mod(numJobs, numLabs) == 0), otherwise it throws an error. I fixed this by extending the number of jobs so that every worker gets the same count.
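
The padding itself is simple - something like this sketch, assuming the jobs
live in a cell array called 'jobs' and nLabs is the pool size:

  numJobs = numel(jobs);
  pad = mod(-numJobs, nLabs);      % dummies needed so mod(numJobs, nLabs) == 0
  jobs(end+1 : end+pad) = {[]};    % workers simply skip the empty dummy jobs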

> Could I ask a little more about why you need the shared access to the array? How
> large is the dataset that you'd like to share? How does that compare to the
> amount of memory in your machine? How long does each iteration take to run?

I was developing a parallel version of an evolutionary algorithm. These algorithms work with a population of individuals, stored in an (m, n) matrix, where 'm' is the number of individuals and 'n' is the length of each individual's 'genome'. In other words, the rows represent individuals and the columns are their properties.

An individual represents a combination of parameters to some model, i.e. a possible solution. To improve these solutions we need to manipulate these vectors somehow. In our algorithm we migrate one individual towards another. It is something like a wolf running towards a forest, looking for food. Each step along the path gives the coordinates of its position in the forest, and the amount of food found there tells you the quality of that particular solution/position.

But back to the topic. One of the strategies is All-To-One, where there is one best position (individual) and all the others migrate towards it. This is easily parallelized with 'parfor', since the runs do not depend on one another. The matrix is sliced by rows (individuals); if better positions than the initial ones are found during a migration, they are stored back into the slice, and the whole array of individuals is reassembled just after the parfor-end block.
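
Schematically, the parfor version looks like the sketch below; fitness() and
migrate() stand in for our actual functions, and P is the (m, n) population
matrix:

  fit = zeros(m, 1);
  for i = 1:m
      fit(i) = fitness(P(i,:));           % placeholder fitness evaluation
  end
  [~, bestIdx] = max(fit);
  best = P(bestIdx, :);                   % the leader everyone migrates towards

  parfor i = 1:m
      candidate = migrate(P(i,:), best);  % placeholder migration step
      if fitness(candidate) > fit(i)      % keep the move only if it improves
          P(i,:) = candidate;             % sliced write-back; rows independent
      end
  end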

However, in another strategy - All-To-All-Adaptive - you pick individuals one by one and migrate each of them towards ALL the others. You pick the first one and migrate it towards the second one (searching for a better 'source of food'). Once you have searched the trail along this line, you go back to the best location/position you found, pick individual #3 and migrate again. After you have migrated towards all the individuals in the population, you STORE the best location/position/solution you found back into a row of the SHARED matrix of individuals. And this is the shared array I was looking for.

What we essentially do is work on a population of individuals (the rows of the matrix). The idea is to split this matrix among Matlab's workers (say we have 4 workers, so every worker gets a quarter of the rows; the exact fraction is not important - each worker has a list of rows it may modify, never touching another worker's rows for writing) and these workers do the job described above. Writing is restricted to the worker's own rows, but a worker can read the current state of the other rows and thus migrate its own individuals towards the others' individuals, which may concurrently be moving to new positions.
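
In code, the scheme I have in mind is roughly the sketch below. The interleaved
row assignment and migrateOne() are placeholders (m and numRounds are assumed
given), the broadcast loop at the end is the synchronization step, and P is an
ordinary (replicated) copy of the population on every worker:

  spmd
      myRows = labindex:numlabs:m;        % placeholder row ownership
      for iter = 1:numRounds
          for i = myRows
              % read the whole (possibly slightly stale) population,
              % but write only into our own rows
              P(i,:) = migrateOne(P, i);  % placeholder All-To-All-Adaptive step
          end
          % co-operative sync: each worker broadcasts its rows in turn,
          % so every copy of P agrees before the next round
          for src = 1:numlabs
              srcRows = src:numlabs:m;
              if labindex == src
                  P(srcRows,:) = labBroadcast(src, P(srcRows,:));
              else
                  P(srcRows,:) = labBroadcast(src);
              end
          end
      end
  end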

Does it make any sense? :)

The only collision here may appear when one worker is writing a newly found position into a row while another worker attempts to read the same row at the same moment. The result of that read depends on how Matlab matrices handle concurrent access to variables.

I think adding concurrently accessible variables to the Parallel Computing Toolbox would ease many people's lives :) On the other hand, it could also be a serious source of problems...

Edric, I would like to thank you very, very much for guiding me through this problem and giving me a helping hand. I owe you more than one beer :)

Ibisek

From: Edric M Ellis on
"ibisek " <cervenka(a)fai.utb.cz> writes:

>> Could I ask a little more about why you need the shared access to the array? How
>> large is the dataset that you'd like to share? How does that compare to the
>> amount of memory in your machine? How long does each iteration take to run?
>
> I was developing a parallel version of an evolutionary algorithm. These
> algorithms work with a population of individuals, stored in an (m, n) matrix,
> where 'm' is the number of individuals and 'n' is the length of each
> individual's 'genome'. In other words, the rows represent individuals and the
> columns are their properties.
> [...]

Thanks for taking the time to explain your application. I think I understand the
access patterns you're after now. Unfortunately, our current (co)distributed
arrays are designed to be used for linear algebra, primarily in the case where
the data is too large to fit into the RAM of a single machine. This results in
communication patterns that fit well with the "message passing" approach,
i.e. where both sender and receiver are actively involved, and both know what
data is being exchanged.

In your application, one interpretation is that you want "one-sided"
communication, where each worker can either request data from another worker
without the other worker's involvement, or else each worker can push out updates
to their data to the other workers. Either way, this is not a communication
pattern that we currently support (not least because the underlying technology -
MPI - does not have great support for this; also, there are the synchronization
issues to be taken care of).

I think the only way to address the "random access" nature of your application
is as I described earlier, and to have the workers actively and co-operatively
ensuring that all other workers are up-to-date after each round of calculation.
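
For example, if each worker owns a contiguous block of rows, then a gcat call
after each round acts as an "allgather" and rebuilds a consistent copy of the
population on every worker. This is only a sketch - updateRows() stands in for
your migration step, m and numRounds are assumed given, and P is replicated on
all workers:

  spmd
      % contiguous block of rows for this worker
      edges  = round(linspace(0, m, numlabs + 1));
      myRows = edges(labindex) + 1 : edges(labindex + 1);
      for iter = 1:numRounds
          local = updateRows(P, myRows);  % update only this worker's rows
          P     = gcat(local, 1);         % vertical concatenation across all
      end                                 % workers -> consistent P everywhere
  end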

Cheers,

Edric.

From: ibisek on
Exactly - synchronization plays the main role here.

The truth is, I haven't been considering running on multiple machines. Running our algorithm on a single multi-core computer is good enough for us (at least for the moment :). Keeping such a distributed, shared matrix synchronized across multiple machines might become evil, especially when it is all done automatically by the logic behind the scenes :)

And again, Edric, thank you for the support you've given me; I really, really appreciate the time you've spent! :)

Ibisek