From: Marcin on 19 Nov 2009 11:27

I'm trying to use Distributed Computing Server with an SGE scheduler. My configuration passes validation, but the pool I get always consists of a single lab only. When I issue "matlabpool 1", everything is fine. When I issue "matlabpool 2", I get the following output in MATLAB on the client machine:

Starting matlabpool using the 'SGE-smart(a)dec120' configuration ...
Your job 1664 ("Job1.1") has been submitted
Your job 1665 ("Job1.2") has been submitted

and it gets stuck there. When I run qstat on the head node, it says that Job1.1 keeps running (which is good), but Job1.2 runs for a moment and then finishes. When I look at the log files for both tasks (see below), there is an error for Task2, but I have no idea what it means. When I try "matlabpool 3" and so on, it is always the first task that seems fine, and all the remaining tasks fail with this error. It doesn't depend on which node executes the task (the same node works fine if it gets Task1 but fails if it gets Task2, Task3, etc.).

To make things even more complicated, my configuration passes the verification procedure without problems, although at the matlabpool stage I get "Connected to 1 lab" instead of 15.

I suspect that it might be some problem with communication between the labs. However, I wasn't able to find anything in the documentation about how the labs actually communicate (protocol, port number, etc.).

---------------- Task1 -------------------------------------------
Executing: /opt/matlab/2009b/bin/worker -parallel

< M A T L A B (R) >
Copyright 1984-2009 The MathWorks, Inc.
Version 7.9.0.529 (R2009b) 64-bit (glnxa64)
August 12, 2009

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
About to find job proxy using location "Job1"
About to find task proxy using location "Job1/Task1"
Completed pre-execution phase
About to pPreJobEvaluate
About to pPreTaskEvaluate
About to add job dependencies
About to call jobStartup
About to call taskStartup
About to get evaluation data
About to pInstantiatePool
Pool instatiation complete
About to call poolStartup
Begin task function
End task function

---------------- Task2 -------------------------------------------
Executing: /opt/matlab/2009b/bin/worker -parallel

< M A T L A B (R) >
Copyright 1984-2009 The MathWorks, Inc.
Version 7.9.0.529 (R2009b) 64-bit (glnxa64)
August 12, 2009

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
About to find job proxy using location "Job1"
About to find task proxy using location "Job1/Task2"
Completed pre-execution phase
About to pPreJobEvaluate
About to pPreTaskEvaluate
Unexpected error in PreTaskEvaluate - MATLAB will now exit.

No appropriate method, property, or field pPreTaskEvaluate for class handle.handle.

Error in ==> dctEvaluateTask at 40
task.pPreTaskEvaluate;

Error in ==> distcomp_evaluate_filetask>iDoTask at 96
dctEvaluateTask(postFcns, finishFcn);

Error in ==> distcomp_evaluate_filetask at 38
iDoTask(handlers, postFcns);
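For reference, a few standard SGE commands can help show why the second job exits almost immediately. This is only a sketch: the user name "smart" is taken from the paths in the worker log above, the Task2 log path is an assumption based on that storage location, and qacct requires SGE accounting to be enabled.

# List the user's jobs from the head node; Job1.2 (1665) drops out of
# qstat as soon as its worker exits.
qstat -u smart

# After the second job has finished, its accounting record shows the
# exit status and the host it ran on.
qacct -j 1665

# Follow the per-task log on the shared file system while the pool is
# starting (path assumed from the storage location in the log above).
tail -f /home/smart/PCWIN/Job1/Task2.log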
From: Edric M Ellis on 25 Nov 2009 03:45

"Marcin " <mb1234(a)gazeta.pl> writes:

> I'm trying to use Distributed Computing Server with SGE scheduler. My
> configuration passes validation, but the pool I get always consists of a
> single lab only. When I issue "matlabpool 1", everything is fine. When I issue
> "matlabpool 2", I get the following output in matlab on the client machine:
>
> Starting matlabpool using the 'SGE-smart(a)dec120' configuration ...
> Your job 1664 ("Job1.1") has been submitted
> Your job 1665 ("Job1.2") has been submitted

I'm not too familiar with SGE - is that the expected behaviour for a *parallel* job under SGE? Shouldn't there be a single parallel job submitted using something like "qsub ... -pe matlab 2"? Are you using the integration scripts in toolbox/distcomp/examples/integration/sge? If not, do you know what the parallel "qsub" command line looks like?

> [...]
> About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
> About to find job proxy using location "Job1"
> About to find task proxy using location "Job1/Task2"
> Completed pre-execution phase
> About to pPreJobEvaluate
> About to pPreTaskEvaluate
> Unexpected error in PreTaskEvaluate - MATLAB will now exit.
> No appropriate method, property, or field pPreTaskEvaluate for class handle.handle.

When things end up as "handle.handle", that's usually a sign that the underlying files for the job or task have been deleted. Not quite sure how you're ending up there...

Cheers,

Edric.
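For reference, the submission shape Edric describes is a single qsub call that asks an SGE parallel environment for all the slots at once, rather than one qsub per task. A minimal sketch follows; the PE name "matlab" is taken from his example, while the wrapper-script name and log path are made up for illustration (the stock integration scripts presumably build a line of this general shape in their parallel submit function).

# One job that asks the "matlab" parallel environment for 2 slots, so
# all the labs for the pool are scheduled together (names and paths
# below are illustrative assumptions).
qsub -N Job1 -pe matlab 2 -j yes \
     -o "/home/smart/PCWIN/Job1/Job1.log" \
     "/home/smart/PCWIN/Job1/sgeParallelWrapper.sh"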
From: Marcin on 25 Nov 2009 07:17

Hi,

Yes, I am using the integration scripts which came with MATLAB, although I had to modify them a bit, as in the original they were not working at all (the job didn't even get submitted to the cluster). I still think that there is a problem with communication between the labs, but I don't know how to check it.

Marcin

Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytw4oojdmcd.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> "Marcin " <mb1234(a)gazeta.pl> writes:
>
> > I'm trying to use Distributed Computing Server with SGE scheduler. My
> > configuration passes validation, but the pool I get always consists of a
> > single lab only. When I issue "matlabpool 1", everything is fine. When I issue
> > "matlabpool 2", I get the following output in matlab on the client machine:
> >
> > Starting matlabpool using the 'SGE-smart(a)dec120' configuration ...
> > Your job 1664 ("Job1.1") has been submitted
> > Your job 1665 ("Job1.2") has been submitted
>
> I'm not too familiar with SGE - is that the expected behaviour for a *parallel*
> job under SGE? Shouldn't there be a single parallel job submitted using
> something like "qsub ... -pe matlab 2" ? Are you using the integration scripts
> in toolbox/distcomp/examples/integration/sge? If not, do you know what the
> parallel "qsub" command line looks like?
>
> > [...]
> > About to construct the storage object using constructor "makeFileStorageObject" and location "/home/smart/PCWIN"
> > About to find job proxy using location "Job1"
> > About to find task proxy using location "Job1/Task2"
> > Completed pre-execution phase
> > About to pPreJobEvaluate
> > About to pPreTaskEvaluate
> > Unexpected error in PreTaskEvaluate - MATLAB will now exit.
> > No appropriate method, property, or field pPreTaskEvaluate for class handle.handle.
>
> When things end up as "handle.handle", that's usually a sign that the underlying
> files for the job or task have been deleted. Not quite sure how you're ending up
> there...
>
> Cheers,
>
> Edric.
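One thing that can be checked directly on the SGE side, before digging into the lab-to-lab communication itself, is whether a parallel environment is configured at all and which queues offer it, since the "-pe" request Edric mentions depends on that. A sketch, assuming the PE is called "matlab" as in his example and using SGE's default queue name "all.q":

# List the parallel environments known to this SGE cell.
qconf -spl

# Show how the "matlab" PE is configured (slots, allocation rule, ...).
qconf -sp matlab

# Check which parallel environments the queue offers; "all.q" is just
# the default queue name and may differ on this cluster.
qconf -sq all.q | grep pe_list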
From: Edric M Ellis on 25 Nov 2009 07:54

"Marcin " <mb1234(a)gazeta.pl> writes:

> Yes, I am using the integration scripts which came with MATLAB, although I had
> to modify them a bit, as in the original they were not working at all (the job
> didn't even get submitted to the cluster). I still think that there is a
> problem with communication between the labs, but I don't know how to check it.

What does your parallel job "qsub" command line look like?

Cheers,

Edric.
From: Marcin on 25 Nov 2009 08:24
Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwzl6adasz.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> "Marcin " <mb1234(a)gazeta.pl> writes:
>
> > Yes, I am using the integration scripts which came with MATLAB, although I had
> > to modify them a bit, as in the original they were not working at all (the job
> > didn't even get submitted to the cluster). I still think that there is a
> > problem with communication between the labs, but I don't know how to check it.
>
> What does your parallel job "qsub" command line look like?
>
> Cheers,
>
> Edric.

It's generated by the integration scripts and looks, for example, like this:

qsub -N Job2.8 -l q=matlab_pe -j yes -o "/home/smart/PCWIN_2009b/Job2/Task8.log" "/home/smart/PCWIN_2009b/Job2/sgeWrapper.sh"

Marcin
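For what it's worth, this line requests a resource ("-l q=matlab_pe") rather than a parallel environment, and one such job is submitted per task, which matches the two separate "Your job ... has been submitted" messages earlier in the thread. Whether that is what the modified scripts are meant to do can be compared against the "-pe" form Edric asked about; while one of these jobs is still queued or running, its actual requests can be inspected as sketched below (job ID 1664 is taken from the thread).

# Detailed view of a submitted job: the "hard resource_list" line shows
# the -l request, and a "parallel environment" line appears only if the
# job was submitted with -pe.
qstat -j 1664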