From: Sean Kelly on
I'm using some code to perform learned sparse coding for the visual system. For each set of basis functions I need to minimize their coefficients (I've chosen to use GPSR to do so). The parfor loop below should be sufficient and is structured identically to all the parfor loops I've written before, but this one shows zero speed gain. I can't see the time spent in GPSR_Basic while it runs under parfor, but the thread-blocking calls account for 90% of the runtime (according to the Profiler). I assume those roughly correspond to time spent in GPSR_Basic, since parfor overhead should be minimal with a local matlabpool.

If I run the exact same script with parfor changed to for, the wall time remains exactly the same. MATLAB reports no errors regarding the parallelization, but the resource utilization shown by Windows reveals that the 3 cores assigned to my 3 workers are not being used much: overall CPU utilization jumps from 4% to 35%, so one might expect that only a single core is busy. If I put a disp(idx2) at the beginning of the parfor body, I can see that 3 workers are in fact initialized, as I get 3 indices displayed concurrently each "tick".

So I guess the question is: without looking deeply into the construction of GPSR_Basic, what kinds of coding structure within it could cause it to be as slow in parallel as it is in serial? Each call to the function takes roughly 6 ms, so is it likely that my system is hitting an IO wall and the function simply cannot run any faster?


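parfor idx2 = 1:100
    % Pick the top-left corner of a random 16x16 patch and a random image.
    xmin = randi([1, 512-16]);
    ymin = randi([1, 512-16]);
    whichimg = randi([1, 10]);

    baseimage = im{whichimg}(ymin:ymin+15, xmin:xmin+15);

    % Flatten the patch into a 256x1 column vector.
    baseimage = reshape(baseimage, 16*16, 1);
    sigma = var(baseimage);   % computed but not used later in this loop

    % Solve for the coefficients of this patch under the basis set pred.
    x = GPSR_Basic(baseimage, pred, .1, 'Verbose', 0);

    % Reconstruct the patch from the coefficients.
    ihat = pred*x;

    % Store the residual and the coefficients for this iteration.
    resid(idx2,:) = baseimage - ihat;
    weights(idx2,:) = x;
end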
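One way to narrow this down is to time the serial and parallel versions of the same loop directly. With 100 calls at roughly 6 ms each, the total work is only about 0.6 s, so fixed costs such as sending im and pred to the workers may hide any speedup. The sketch below is only an illustration: the 256x64 size of pred, the random patches, and the backslash solve standing in for GPSR_Basic are all assumptions, not the code from the post. Without an open pool, parfor simply runs serially.

```matlab
% Timing harness: the same loop body run with for and with parfor.
% ASSUMPTIONS: pred is 256x64 and "pred \ y" is a placeholder for
% GPSR_Basic; neither comes from the original post.
n       = 100;
pred    = randn(256, 64);      % hypothetical basis set
patches = randn(256, n);       % hypothetical flattened 16x16 patches

% Serial reference.
wSerial = zeros(n, 64);
t0 = tic;
for k = 1:n
    wSerial(k, :) = (pred \ patches(:, k)).';
end
tSerial = toc(t0);

% Parallel version with an identical body.
wPar = zeros(n, 64);
t1 = tic;
parfor k = 1:n
    wPar(k, :) = (pred \ patches(:, k)).';
end
tPar = toc(t1);

fprintf('serial: %.3f s, parfor: %.3f s, ratio: %.2f\n', ...
        tSerial, tPar, tSerial / tPar);
```

If the ratio stays near 1 here as well, increasing n (so each worker gets more work per transfer) would show whether the fixed overhead or the solver itself is the limit.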
From: Edric M Ellis on
"Sean Kelly" <stkelly85(a)not.gmail.com> writes:

> So I guess the question is: without looking deeply into the
> construction of GPSR_Basic, what kinds of coding structure within it
> could cause it to be as slow in parallel as it is serial? Each call to
> the function takes 6 ms (roughly) so is it just likely that my system
> is hitting an IO wall and the function simply cannot run any faster?

It's certainly possible that you are hitting a wall with memory
bandwidth, especially if PARFOR is working as expected for you
otherwise. For example, addition of large arrays is typically memory
bound, since the amount of computation required for each element of the
array is small compared to the memory transfer required. (The ratio of
computation to memory transfer is sometimes referred to as "arithmetic
intensity", especially for GPU computations.)
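As a rough illustration of the difference, the sketch below times an element-wise addition against a matrix multiplication on the same inputs and estimates the arithmetic intensity of each. The array size and the byte counts are assumptions: the estimate supposes double precision and that each array crosses the memory bus exactly once, which is an idealization.

```matlab
% Arithmetic intensity estimate: flops per byte of memory traffic.
% ASSUMPTIONS: n is arbitrary; traffic counts two reads and one write
% of n-by-n doubles, ignoring caches and temporaries.
n = 1000;
A = rand(n);
B = rand(n);

tAdd = timeit(@() A + B);      % about n^2 flops
tMul = timeit(@() A * B);      % about 2*n^3 flops

bytesMoved   = 3 * 8 * n^2;    % read A, read B, write the result
intensityAdd = n^2 / bytesMoved;         % 1/24 flop per byte
intensityMul = (2 * n^3) / bytesMoved;   % n/12 flops per byte

fprintf('add: %.4f flop/byte, %.2e s\n', intensityAdd, tAdd);
fprintf('mul: %.1f flop/byte, %.2e s\n', intensityMul, tMul);
```

An operation with very low intensity, like the addition, tends to gain little from extra cores once the memory bus is saturated, whereas the multiplication has far more computation to share out per byte moved.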

Cheers,

Edric.