From: Edric M Ellis on
"Marcin " <mb1234(a)gazeta.pl> writes:

> Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwzl6adasz.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
>> "Marcin " <mb1234(a)gazeta.pl> writes:
>>
>> > Yes, I am using the integration scripts which came with MATLAB, although I had
>> > to modify them a bit, as in the original they were not working at all (the job
>> > didn't even get submitted to the cluster). I still think that there is a
>> > problem with communication between the labs, but I don't know how to check it.
>>
>> What does your parallel job "qsub" command line look like?
>>
>> Cheers,
>>
>> Edric.
>
> It's generated by the integration scripts and looks for example like this:
>
> qsub -N Job2.8 -l q=matlab_pe -j yes -o "/home/smart/PCWIN_2009b/Job2/Task8.log" "/home/smart/PCWIN_2009b/Job2/sgeWrapper.sh"

Hmm, that command isn't actually submitting a parallel job, and it isn't using
the parallel wrapper script, so it's no surprise that it doesn't work.

You must submit something along the lines of

qsub ... -pe matlab 2 ... /path/to/Job#/sgeParallelWrapper.sh

otherwise there's no chance that a parallel job will function correctly. The
"-pe matlab 2" states that you need a "parallel environment" called "matlab",
and that you need two parallel processes. The script that you submit must be
something like the sgeParallelWrapper.sh which starts up the smpd daemons and
then uses mpiexec to launch the workers.
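
For what it's worth, here's a quick (untested) check you can run at the MATLAB
client to see which submit functions your generic scheduler object is actually
wired up with - the function names in the comments are from the shipping
non-shared SGE scripts, so adjust them if yours differ:

% "sched" here is assumed to be the generic scheduler object you submit through
get(sched, 'SubmitFcn')          % used for simple (distributed) jobs
get(sched, 'ParallelSubmitFcn')  % used for parallel jobs - should point at
                                 % something like sgeNonSharedParallelSubmitFcn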

I'd suggest going back to the shipping integration scripts (which should work
with only minor modifications) - what doesn't work when you use those?

Cheers,

Edric.
From: Marcin on
Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwvdgyd1en.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> [...]

That was it! Thank you, thank you, thank you a thousand times :)) I have discovered that sgeNonSharedSimpleSubmitFcn was indeed being called instead of sgeNonSharedParallelSubmitFcn. It's a shame, though, that MathWorks support didn't notice it...
From: Marcin on
"Marcin " <mb1234(a)gazeta.pl> wrote in message <hek92b$92v$1(a)fred.mathworks.com>...
> [...]

But now, I'm getting a new error:

>> pmode open 12
Starting pmode using the 'SGE-smart(a)dec120' configuration ...
Your job 2100 ("Job1") has been submitted

??? Error using ==> distcomp.interactiveclient.start at 103
The client lost connection to lab 6.
This might be due to network problems, or the interactive matlabpool job might have errored. This is causing:
java.io.IOException: An existing connection was forcibly closed by the remote host

Error in ==> pmode at 84
client.start('pmode', nlabs, config, 'opengui');

Sending a stop signal to all the labs ... stopped.

??? Error using ==> distcomp.interactiveclient.start at 119
Failed to initialize the interactive session.
This is caused by:
Java exception occurred:
com.mathworks.toolbox.distcomp.pmode.SessionDestroyedException
at com.mathworks.toolbox.distcomp.pmode.Session.getFileDependenciesAssistant(Session.java:146)

Error in ==> pmode at 84
client.start('pmode', nlabs, config, 'opengui');

When I try to create a smaller pool though, like pmode open 8 - it usually works. My cluster has 15 nodes, the total number of slots has been set to 75 (5 per node). There shouldn't be any connectivity problems, as it all runs on a separate gigabit network.

Interestingly, when I submit a parallel job like this:

clusterHost = 'dec120.bmth.ac.uk';
remoteDataLocation = '/home/smart';
sched = findResource('scheduler', 'type', 'generic');
% Use a local directory as the DataLocation
set(sched, 'DataLocation', struct('pc','C:/TEMP/MATLAB','unix','/home/smart'));
set(sched, 'ClusterMatlabRoot', '/opt/matlab/2009b');
set(sched, 'HasSharedFilesystem', false);
set(sched, 'ClusterOsType', 'unix');
set(sched, 'GetJobStateFcn', @sgeGetJobState);
set(sched, 'DestroyJobFcn', @sgeDestroyJob);
set(sched, 'SubmitFcn', {@sgeNonSharedSimpleSubmitFcn, clusterHost, remoteDataLocation});
set(sched, 'ParallelSubmitFcn', {@sgeNonSharedParallelSubmitFcn, clusterHost, remoteDataLocation});

parJob = createParallelJob(sched,'Min',15,'Max',15);
createTask(parJob, @labindex, 1);
submit(parJob);
waitForState(parJob);
results2 = getAllOutputArguments(parJob);

It finishes without error and all 15 nodes are involved (I know it by examining the log files on the cluster).

Many thanks, Marcin
From: Edric M Ellis on
"Marcin " <mb1234(a)gazeta.pl> writes:

> [...]

Hmm, glad we made *some* progress!

> But now, I'm getting a new error:
>
>>> pmode open 12
> Starting pmode using the 'SGE-smart(a)dec120' configuration ...
> Your job 2100 ("Job1") has been submitted
>
> ??? Error using ==> distcomp.interactiveclient.start at 103
> The client lost connection to lab 6.
> This might be due to network problems, or the interactive matlabpool job might have errored. This is causing:
> java.io.IOException: An existing connection was forcibly closed by the remote host
>
> Error in ==> pmode at 84
> client.start('pmode', nlabs, config, 'opengui');
>
> Sending a stop signal to all the labs ... stopped.
>
> ??? Error using ==> distcomp.interactiveclient.start at 119
> Failed to initialize the interactive session.
> This is caused by:
> Java exception occurred:
> com.mathworks.toolbox.distcomp.pmode.SessionDestroyedException
> at com.mathworks.toolbox.distcomp.pmode.Session.getFileDependenciesAssistant(Session.java:146)

That error basically means that the connection between the workers and the
client went away. Unfortunately, this is a relatively generic error that doesn't
really indicate what the cause might be.

> When I try to create a smaller pool though, like pmode open 8 - it usually
> works. My cluster has 15 nodes, the total number of slots has been set to 75
> (5 per node). There shouldn't be any connectivity problems, as it all runs on
> a separate gigabit network.

You say "usually works" - is there a pool size at which it always works, and a
size at which it always fails?

> Interestingly, when I submit a parallel job like this:
>
> clusterHost = 'dec120.bmth.ac.uk';
> remoteDataLocation = '/home/smart';
> sched = findResource('scheduler', 'type', 'generic');
> % Use a local directory as the DataLocation
> set(sched, 'DataLocation', struct('pc','C:/TEMP/MATLAB','unix','/home/smart'));
> set(sched, 'ClusterMatlabRoot', '/opt/matlab/2009b');
> set(sched, 'HasSharedFilesystem', false);
> set(sched, 'ClusterOsType', 'unix');
> set(sched, 'GetJobStateFcn', @sgeGetJobState);
> set(sched, 'DestroyJobFcn', @sgeDestroyJob);
> set(sched, 'SubmitFcn', {@sgeNonSharedSimpleSubmitFcn, clusterHost, remoteDataLocation});
> set(sched, 'ParallelSubmitFcn', {@sgeNonSharedParallelSubmitFcn, clusterHost, remoteDataLocation});
>
> parJob = createParallelJob(sched,'Min',15,'Max',15);
> createTask(parJob, @labindex, 1);
> submit(parJob);
> waitForState(parJob);
> results2 = getAllOutputArguments(parJob);
>
> It finishes without error and all 15 nodes are involved (I know it by
> examining the log files on the cluster).

Are the settings that you've got there identical to whatever you've got set for
the configuration used by pmode? I'd try

sched = findResource( 'scheduler', 'Configuration', '<configname>' )

rather than all the manual settings and see if that works...
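
For example (untested, and '<configname>' is just a placeholder for your actual
configuration name), you could re-run your labindex test through the
configuration-derived scheduler at the size that fails in pmode:

sched  = findResource('scheduler', 'Configuration', '<configname>');
parJob = createParallelJob(sched, 'Min', 12, 'Max', 12); % the size that fails in pmode
createTask(parJob, @labindex, 1);
submit(parJob);
waitForState(parJob);
getAllOutputArguments(parJob)

If that succeeds with 12 labs while pmode at 12 still fails, that would point at
the interactive connection back to the client rather than at the configuration.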

Cheers,

Edric.
From: Marcin on
Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwr5rld74d.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> "Marcin " <mb1234(a)gazeta.pl> writes:
>
> [...]
>
> Are the settings that you've got there identical to whatever you've got set for
> the configuration used by pmode? I'd try
>
> sched = findResource( 'scheduler', 'Configuration', '<configname>' )
>
> rather than all the manual settings and see if that works...
>
> Cheers,
>
> Edric.

Hi,

It doesn't make a difference, but I know a bit more about the problem now. There seem to be two nodes in my cluster that cause the problem when they are used together. When I create a pool with matlabpool or pmode and both of them end up in the pool, it crashes; if only one of them gets picked, it works. Strange, as all the nodes have exactly the same configuration. At least they should...