From: Marcin on
I'm trying to use Distributed Computing Server with the SGE scheduler. My configuration passes validation, but the pool I get always consists of a single lab. When I issue "matlabpool 1", everything is fine. When I issue "matlabpool 2", I get the following output in MATLAB on the client machine:

Starting matlabpool using the 'SGE-smart(a)dec120' configuration ...
Your job 1664 ("Job1.1") has been submitted
Your job 1665 ("Job1.2") has been submitted

and it gets stuck there.

Now, when I try qstat on the head node, it says that Job1.1 is running all the time (which is good), but Job1.2 runs for a moment and then finishes. When I look at the log files for both tasks (see below), there is an error for Task2, but I have no idea what it means. When I try "matlabpool 3" etc., it's always the first task that seems to be fine, and I get the same error for all the rest. It doesn't depend on which node executes the task (the same node works fine if it gets Task1 but fails if it gets Task2, 3, etc.). To make things even more complicated, my configuration passes the verification procedure without problems, although at the matlabpool stage I get "Connected to 1 lab" instead of 15.
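
For reference, this is roughly what I run on the head node to watch the jobs (the job id is taken from the submission output above, and "smart" is just the account the jobs run under):

qstat -u smart
qstat -j 1665

The second command is only my attempt to get more detail on the job that exits early.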

I suspect that it might be some problem with communication between the labs. However, I wasn't able to find anything in the documentation about how the labs actually communicate (protocol, port numbers, etc.).

---------------- Task1 -------------------------------------------
Executing: /opt/matlab/2009b/bin/worker -parallel

< M A T L A B (R) >
Copyright 1984-2009 The MathWorks, Inc.
Version 7.9.0.529 (R2009b) 64-bit (glnxa64)
August 12, 2009


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
About to find job proxy using location "Job1"
About to find task proxy using location "Job1/Task1"
Completed pre-execution phase
About to pPreJobEvaluate
About to pPreTaskEvaluate
About to add job dependencies
About to call jobStartup
About to call taskStartup
About to get evaluation data
About to pInstantiatePool
Pool instatiation complete
About to call poolStartup
Begin task function
End task function

---------------- Task2 -------------------------------------------
Executing: /opt/matlab/2009b/bin/worker -parallel

< M A T L A B (R) >
Copyright 1984-2009 The MathWorks, Inc.
Version 7.9.0.529 (R2009b) 64-bit (glnxa64)
August 12, 2009


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
About to find job proxy using location "Job1"
About to find task proxy using location "Job1/Task2"
Completed pre-execution phase
About to pPreJobEvaluate
About to pPreTaskEvaluate
Unexpected error in PreTaskEvaluate - MATLAB will now exit.
No appropriate method, property, or field pPreTaskEvaluate for class handle.handle.

Error in ==> dctEvaluateTask at 40
task.pPreTaskEvaluate;

Error in ==> distcomp_evaluate_filetask>iDoTask at 96
dctEvaluateTask(postFcns, finishFcn);

Error in ==> distcomp_evaluate_filetask at 38
iDoTask(handlers, postFcns);
From: Edric M Ellis on
"Marcin " <mb1234(a)gazeta.pl> writes:

> I'm trying to use Distributed Computing Server with SGE scheduler. My
> configuration passes validation, but the pool I get always consists of a
> single lab only. When I issue "matlabpool 1", everything is fine. When I issue
> "matlabpool 2", I get the following output in matlab on the client machine:
>
> Starting matlabpool using the 'SGE-smart(a)dec120' configuration ...
> Your job 1664 ("Job1.1") has been submitted
> Your job 1665 ("Job1.2") has been submitted

I'm not too familiar with SGE - is that the expected behaviour for a *parallel*
job under SGE? Shouldn't there be a single parallel job submitted using
something like "qsub ... -pe matlab 2" ? Are you using the integration scripts
in toolbox/distcomp/examples/integration/sge? If not, do you know what the
parallel "qsub" command line looks like?

> [...]
> About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
> About to find job proxy using location "Job1"
> About to find task proxy using location "Job1/Task2"
> Completed pre-execution phase
> About to pPreJobEvaluate
> About to pPreTaskEvaluate
> Unexpected error in PreTaskEvaluate - MATLAB will now exit.
> No appropriate method, property, or field pPreTaskEvaluate for class handle.handle.

When things end up as "handle.handle", that's usually a sign that the underlying
files for the job or task have been deleted. Not quite sure how you're ending up
there...
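
One thing that might be worth checking (just a guess): from the node that ran Task2, confirm that the shared job directory is actually visible, e.g.

ls -l /home/smart/PCWIN/Job1

The file-based SGE integration relies on every worker seeing that directory, so if the Task2 files aren't there when the worker looks for them, you could end up with this kind of "handle.handle" failure.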

Cheers,

Edric.
From: Marcin on
Hi,

Yes, I am using the integration scripts that came with MATLAB, although I had to modify them a bit, as the originals were not working at all (the job didn't even get submitted to the cluster).
I still think that there is a problem with communication between the labs, but I don't know how to check it.

Marcin

Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytw4oojdmcd.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> "Marcin " <mb1234(a)gazeta.pl> writes:
>
> > I'm trying to use Distributed Computing Server with SGE scheduler. My
> > configuration passes validation, but the pool I get always consists of a
> > single lab only. When I issue "matlabpool 1", everything is fine. When I issue
> > "matlabpool 2", I get the following output in matlab on the client machine:
> >
> > Starting matlabpool using the 'SGE-smart(a)dec120' configuration ...
> > Your job 1664 ("Job1.1") has been submitted
> > Your job 1665 ("Job1.2") has been submitted
>
> I'm not too familiar with SGE - is that the expected behaviour for a *parallel*
> job under SGE? Shouldn't there be a single parallel job submitted using
> something like "qsub ... -pe matlab 2" ? Are you using the integration scripts
> in toolbox/distcomp/examples/integration/sge? If not, do you know what the
> parallel "qsub" command line looks like?
>
> > [...]
> > About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
> > About to find job proxy using location "Job1"
> > About to find task proxy using location "Job1/Task2"
> > Completed pre-execution phase
> > About to pPreJobEvaluate
> > About to pPreTaskEvaluate
> > Unexpected error in PreTaskEvaluate - MATLAB will now exit.
> > No appropriate method, property, or field pPreTaskEvaluate for class handle.handle.
>
> When things end up as "handle.handle", that's usually a sign that the underlying
> files for the job or task have been deleted. Not quite sure how you're ending up
> there...
>
> Cheers,
>
> Edric.
From: Edric M Ellis on
"Marcin " <mb1234(a)gazeta.pl> writes:

> Yes, I am using the integration scripts which came with MATLAB, although I had
> to modify them a bit, as in the original they were not working at all (the job
> didn't even get submitted to the cluster). I still think that there is a
> problem with communication between the labs, but I don't know how to check it.

What does your parallel job "qsub" command line look like?

Cheers,

Edric.
From: Marcin on
Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwzl6adasz.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> "Marcin " <mb1234(a)gazeta.pl> writes:
>
> > Yes, I am using the integration scripts which came with MATLAB, although I had
> > to modify them a bit, as in the original they were not working at all (the job
> > didn't even get submitted to the cluster). I still think that there is a
> > problem with communication between the labs, but I don't know how to check it.
>
> What does your parallel job "qsub" command line look like?
>
> Cheers,
>
> Edric.

It's generated by the integration scripts and looks, for example, like this:

qsub -N Job2.8 -l q=matlab_pe -j yes -o "/home/smart/PCWIN_2009b/Job2/Task8.log" "/home/smart/PCWIN_2009b/Job2/sgeWrapper.sh"

Marcin