From: Dan Tyndall on 10 Dec 2009 19:02

I've been running into the memory wall because I am dealing with really large arrays. Currently I am running my MATLAB script over a much smaller domain, but am taking advantage of six cores on a single compute node using the Parallel Computing Toolbox and parfor loops. My university has a license for the MATLAB Distributed Computing Server, and I would like to take advantage of that and run my program on multiple compute nodes so I don't run into the memory wall.

Here's the part of the code that I need help with:

-------------------------------------------------------------------------------------------------------------
% compute forward operator H
disp('Computing forward operator matrix');
radius = 6370;
h = sparse(numobs,numpts);
for k=1:numobs
    dx = radius .* cos(pi .* yo_lat_all(k) ./ 180.) .* pi .* (yo_lon_all(k) - xb_lon) ./ 180.;
    dy = radius .* pi .* (yo_lat_all(k) - xb_lat) ./ 180.;
    dist = sqrt(dx.*dx + dy.*dy);
    [min_val,min_index] = min(dist);
    h(k,min_index) = 1;
    yo_xbin_all(k) = min_index;
end
toc(tstart)

% compute PbH^T
disp('Computing PbH^T');
rad2 = rad * rad;
radz2 = radz * radz;
ht = h';   % store h' so you don't have to keep doing it each loop

pbht = zeros(numxb,numobs);
%pb_row = sparse(1,numxb);

parfor k=1:numxb   %loop over each row of bkg cov matrix
%for k=1:numxb     %debug single proc code
    pb_row = zeros(1,numxb);
    dx = radius .* cos(pi .* xb_lat ./ 180.) .* pi .* (xb_lon - xb_lon(k)) ./ 180.;
    dy = radius .* pi .* (xb_lat - xb_lat(k)) ./ 180.;
    dz = xb_felv - xb_felv(k);
    r2 = dx .* dx + dy .* dy;
    z2 = dz .* dz;
    %calc_pb_row = find(r2<300^2);
    pb_row(r2<maxr2tol) = sigb .* (exp(-r2(r2<maxr2tol)./rad2) .* exp(-z2(r2<maxr2tol)./radz2));
    % compute PbH^T
    pbht(k,:) = pb_row * ht;
end
clear pb_row;
clear ht;
pbht = sparse(pbht);
toc(tstart)
-------------------------------------------------------------------------------------------------------------

The issue I have is that I'm not sure how to tell MATLAB to slice the pbht array to make maximum use of the parfor loop. The nodes don't really need to communicate anything to each other, except when putting the array back together as a sparse matrix (I've tried computing it as a sparse matrix from the start, but that adds a huge amount of compute time compared with converting it to sparse at the end). What I would like MATLAB to do is slice the pbht array via the parfor command, sending each node only the parts it will actually do computations on. I've tried initializing the array by declaring a distributed.zeros array, but that always seems to send the program into an infinite loop (I'm not totally sure, but it always hits the maximum wall time, even when I am running a smaller domain).

So, in summary, my main questions are:
1. How do you create a (co)distributed array that takes advantage of the slicing in parfor loops (I'm not quite sure what the difference is between a codistributed array and a distributed array)?
2. Do I still initialize the program with the matlabpool open command I use for my parfor loops, or is it something different now?
3. Could the people at MathWorks make a data-parallel webinar? I've been to your live seminar on task-parallel MATLAB tools, and it was great... but the webinars you have on large data arrays don't really get into the parallel applications as deeply as I would like.

Thanks for your help!
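An incidental note on the forward-operator loop above: assigning h(k,min_index) = 1 into a sparse matrix inside a loop reallocates the matrix on every insertion. A sketch of the same computation with a single sparse() triplet call at the end, using the same variables as in the post (the sqrt is dropped, since the minimizer of the squared distance is the same):

-------------------------------------------------------------------------------------------------------------
min_index = zeros(1,numobs);
for k = 1:numobs
    dx = radius .* cos(pi .* yo_lat_all(k) ./ 180.) .* pi .* (yo_lon_all(k) - xb_lon) ./ 180.;
    dy = radius .* pi .* (yo_lat_all(k) - xb_lat) ./ 180.;
    [min_val, min_index(k)] = min(dx.*dx + dy.*dy);   % sqrt not needed for the argmin
end
h = sparse(1:numobs, min_index, ones(1,numobs), numobs, numpts);   % one allocation
yo_xbin_all = min_index;
-------------------------------------------------------------------------------------------------------------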
From: Jill Reese on 14 Dec 2009 13:16

Hi Dan. While I could be missing something, it seems like your operations could be performed more simply on a distributed array rather than using a parfor loop. You would want to open a matlabpool as usual, and then create a distributed array (at one point you'd mentioned calling distributed.zeros(...)).

Once you have a distributed array to work with, you have a couple of options. The easiest thing to do is to perform your operations on the distributed array itself. In your example, this would involve further vectorizing your code rather than looping over each row. If you need more control over how the array is stored across the workers, you might want to look into using an spmd block (single program, multiple data). The distributed array you have already created can be accessed within the spmd block as a codistributed array, over which you have more fine-grained control.

Again, I could have missed something that makes this problem more amenable to parfor. If you feel that is the case, please pass along more information (such as the size and type of variables like xb_lon, and the order of magnitude you expect for numobs, numpts, and numxb) that could help me isolate your issue.

Cheers,

Jill

"Dan Tyndall" <dtyndall(a)met.utah.edu> wrote in message <hfs26c$rju$1(a)fred.mathworks.com>...
> [snip]
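A minimal sketch of the two options Jill describes, using the R2009-era syntax from this thread (matlabpool; newer releases use parpool instead); the pool size and array names are placeholders:

-------------------------------------------------------------------------------------------------------------
matlabpool open 4                        % same pool used for parfor

% Option 1: a distributed array, used directly from the client.
% By default it is distributed along its last (column) dimension.
pbht = distributed.zeros(numxb, numobs);

% Option 2: an spmd block, where the same array is seen as a
% codistributed array and each lab works on just its own piece.
spmd
    localPart = getLocalPart(pbht);      % a copy of this lab's portion
    % ... operate on localPart; changes are not written back
    % automatically, so rebuild with codistributed.build if needed ...
end
-------------------------------------------------------------------------------------------------------------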
From: Dan Tyndall on 14 Dec 2009 15:39

"Jill Reese" <jill.reese(a)mathworks.com> wrote in message <hg5vdj$sur$1(a)fred.mathworks.com>...
> Hi Dan. While I could be missing something, it seems like your operations could be performed more simply on a distributed array rather than using a parfor loop. You would want to open a matlabpool as usual, and then create a distributed array (at one point you'd mentioned calling distributed.zeros(...)).

Yup, I was experimenting with the distributed arrays, but when I ran the code, it seemed to completely freeze the program (or get stuck in some sort of infinite loop). I'm not totally sure what was going wrong, but the program over one of the smaller domains did not complete in 72 hours on 4 nodes (it normally takes about 7 hours on a single compute node). It could be that I simply don't have the right syntax, because I was mixing parfor loops and distributed arrays.

> Once you have a distributed array to work with, you have a couple of options. The easiest thing to do is to perform your operations on the distributed array itself. In your example, this would involve further vectorizing your code rather than looping over each row. If you need more control over how the array is stored across the workers, you might want to look into using an spmd block (single program, multiple data). The distributed array you have already created can be accessed within the spmd block as a codistributed array, over which you have more fine-grained control.

The unfortunate thing about vectorization is that I believe I've already vectorized the code as much as possible. If I were to vectorize it further and store pb_row as the full Pb matrix, it would have dimensions of 700,000 x 700,000 (over one of my largest domains). That array by itself would consume a huge amount of memory, which is one of the reasons I break it up and compute a single row at a time as pb_row (a row vector of length 700,000). I never need pb_row again, which is why it is constantly overwritten. The matrix I do need to store, pbht, is much smaller: 700,000 x 3,000 (numxb x numobs). However, I am having problems storing even this matrix on a single machine (I am certain the code works, because I have used it over a smaller domain, where the matrix was about 210,000 x 300). It would be nice if I could have each worker store only the part of pbht that it does the computations on; that way there is very little communication overhead.

The issue I still run into with an spmd code block is: how do I find out which parts of pbht each lab owns? For example, with the parfor loop, for 2 labs over the smaller domain (numxb = 210,000), I would imagine lab 1 would perform computations on indices 1 to 105,000, and lab 2 on indices 105,001 to 210,000. But if I change the parfor block to an spmd block, how can I tell lab 1 to loop from 1 to 105,000 and lab 2 to loop from 105,001 to 210,000 (or just over the rows that MATLAB has assigned each lab to own)? Keep in mind that using 2 labs is just an example.

> Again, I could have missed something that makes this problem more amenable to parfor. If you feel that is the case, please pass along more information (such as the size and type of variables like xb_lon, and the order of magnitude you expect for numobs, numpts, and numxb) that could help me isolate your issue.

OK... xb_lon, xb_lat, and xb_felv are all one-dimensional arrays of length numxb. numxb (which shares its value with numpts; I forgot to rename it) is 700,000 for the domain I wish to run the code over (although I have successfully run the program over size 210,000). numobs has a value of 3,000 for the simulation I would like to run; for the successful smaller simulation it was 300.

Thanks for your help with this,

-Dan
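On the question of which rows each lab owns: inside an spmd block, globalIndices reports the global indices of a lab's local part, so the loop bounds fall out without hard-coding 1:105,000 and so on. A hedged sketch (not Dan's actual code) of the parfor body recast this way, with pbht distributed by rows:

-------------------------------------------------------------------------------------------------------------
spmd
    % Partition the numxb rows of pbht across the labs.
    codistr = codistributor1d(1, codistributor1d.defaultPartition(numxb), ...
                              [numxb numobs]);
    myrows = globalIndices(codistr, 1);   % e.g. 1:105000 on lab 1 of 2

    pbht_local = zeros(numel(myrows), numobs);
    for i = 1:numel(myrows)
        k = myrows(i);
        % ... same body as the parfor loop: build pb_row for row k,
        % then pbht_local(i,:) = pb_row * ht; ...
    end

    % Stitch the per-lab pieces into one codistributed array.
    pbht = codistributed.build(pbht_local, codistr);
end
-------------------------------------------------------------------------------------------------------------

An alternative with less bookkeeping is for k = drange(1:numxb) inside spmd, which splits the iterations across labs automatically, though each lab must then manage where its rows are stored.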
From: Dan Tyndall on 4 Jan 2010 14:26

So, I just wanted to close this thread. I've modified my script to take advantage of the spmd command, and that has definitely helped in testing larger domains, though not the largest domain I would like to test; that one leads to "out of memory" errors, which I consider a new problem, so I will start a new thread in a minute.

Thanks for your help, Jill.

-Dan