From: Marcin
"Marcin " <mb1234(a)gazeta.pl> wrote in message <heljbj$ft$1(a)fred.mathworks.com>...
> Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwr5rld74d.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> > "Marcin " <mb1234(a)gazeta.pl> writes:
> >
> > > [...]
> >
> > Hmm, glad we made *some* progress!
> >
> > > But now, I'm getting a new error:
> > >
> > >>> pmode open 12
> > > Starting pmode using the 'SGE-smart(a)dec120' configuration ...
> > > Your job 2100 ("Job1") has been submitted
> > >
> > > ??? Error using ==> distcomp.interactiveclient.start at 103
> > > The client lost connection to lab 6.
> > > This might be due to network problems, or the interactive matlabpool job might have errored. This is causing:
> > > java.io.IOException: An existing connection was forcibly closed by the remote host
> > >
> > > Error in ==> pmode at 84
> > > client.start('pmode', nlabs, config, 'opengui');
> > >
> > > Sending a stop signal to all the labs ... stopped.
> > >
> > > ??? Error using ==> distcomp.interactiveclient.start at 119
> > > Failed to initialize the interactive session.
> > > This is caused by:
> > > Java exception occurred:
> > > com.mathworks.toolbox.distcomp.pmode.SessionDestroyedException
> > > at com.mathworks.toolbox.distcomp.pmode.Session.getFileDependenciesAssistant(Session.java:146)
> >
> > That error basically means that the connection between the workers and the
> > client went away. Unfortunately, this is a relatively generic error that doesn't
> > really indicate what the cause might be.
> >
> > > When I try to create a smaller pool though, like pmode open 8 - it usually
> > > works. My cluster has 15 nodes, the total number of slots has been set to 75
> > > (5 per node). There shouldn't be any connectivity problems, as it all runs on
> > > a separate gigabit network.
> >
> > You say "usually works" - is there a point where it always works, and a point
> > where it always fails?
> >
> > > Interestingly, when I submit a parallel job like this:
> > >
> > > clusterHost = 'dec120.bmth.ac.uk';
> > > remoteDataLocation = '/home/smart';
> > > sched = findResource('scheduler', 'type', 'generic');
> > > % Use a local directory as the DataLocation
> > > set(sched, 'DataLocation', struct('pc','C:/TEMP/MATLAB','unix','/home/smart'));
> > > set(sched, 'ClusterMatlabRoot', '/opt/matlab/2009b');
> > > set(sched, 'HasSharedFilesystem', false);
> > > set(sched, 'ClusterOsType', 'unix');
> > > set(sched, 'GetJobStateFcn', @sgeGetJobState);
> > > set(sched, 'DestroyJobFcn', @sgeDestroyJob);
> > > set(sched, 'SubmitFcn', {@sgeNonSharedSimpleSubmitFcn, clusterHost, remoteDataLocation});
> > > set(sched, 'ParallelSubmitFcn', {@sgeNonSharedParallelSubmitFcn, clusterHost, remoteDataLocation});
> > >
> > > parJob = createParallelJob(sched,'Min',15,'Max',15);
> > > createTask(parJob, @labindex, 1);
> > > submit(parJob);
> > > waitForState(parJob);
> > > results2 = getAllOutputArguments(parJob);
> > >
> > > It finishes without error and all 15 nodes are involved (I know it by
> > > examining the log files on the cluster).
> >
> > Are the settings that you've got there identical to whatever you've got set for
> > the configuration used by pmode? I'd try
> >
> > sched = findResource( 'scheduler', 'Configuration', '<configname>' )
> >
> > rather than all the manual settings and see if that works...
> >
> > Cheers,
> >
> > Edric.
>
> Hi,
>
> It doesn't make a difference, but I know a bit more about the problem now. There seem to be two nodes in my cluster which, when used together, cause the problem. When I create a pool using matlabpool or pmode and both of them end up in the pool, it crashes; if only one of them gets picked, it works. Strange, as all the nodes have exactly the same configuration. At least they should...

As a workaround we have removed one of the problematic nodes from the cluster and the admin is currently investigating the issue. I have another question though: how can I monitor the progress of my parallel job other than using qstat, which doesn't tell me much?
From: Edric M Ellis
"Marcin " <mb1234(a)gazeta.pl> writes:

> [...]
> As a workaround we have removed one of the problematic nodes from the cluster
> and the admin is currently investigating the issue.

Just a wild stab in the dark here - occasionally, we see weird problems caused
by bogus localhost entries in /etc/hosts - in particular, lines like

"127.0.0.1 <stuff> <real-machine-name>"

cause problems.
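
For illustration (the hostname and addresses below are made up), an
/etc/hosts entry that maps the machine's real name onto the loopback
address:

127.0.0.1   localhost   node07.bmth.ac.uk   node07

can confuse things when a worker tries to tell the other processes where
to connect, whereas keeping the loopback and real entries separate:

127.0.0.1      localhost
192.168.10.7   node07.bmth.ac.uk   node07

should be safe.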

> I have another question though: how can I monitor the progress of my parallel
> job other than using qstat, which doesn't tell me much?

We don't have any built-in facilities, I'm afraid. (What sort of thing were you
after?) For now, your best bet is to write stuff out to a file that you can
access from your client.
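
Something along these lines inside your task function would work (the
directory is just an example - anywhere visible to both the workers and
the client will do, e.g. your shared DataLocation):

function logProgress(msg)
% Append a timestamped progress message to a per-lab log file.
fname = fullfile('/home/smart', sprintf('progress_lab%02d.log', labindex));
fid = fopen(fname, 'a');
fprintf(fid, '%s  lab %d: %s\n', datestr(now), labindex, msg);
fclose(fid);

You can then watch those files grow while the job runs.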

Cheers,

Edric.
From: Marcin
Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwd435cowq.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> "Marcin " <mb1234(a)gazeta.pl> writes:
>
> > [...]
> > As a workaround we have removed one of the problematic nodes from the cluster
> > and the admin is currently investigating the issue.
>
> Just a wild stab in the dark here - occasionally, we see weird problems caused
> by bogus localhost entries in /etc/hosts - in particular, lines like
>
> "127.0.0.1 <stuff> <real-machine-name>"
>
> cause problems.
>
> > I have another question though: how can I monitor the progress of my parallel
> > job other than using qstat, which doesn't tell me much?
>
> We don't have any built-in facilities, I'm afraid. (What sort of thing were you
> after?) For now, your best bet is to write stuff out to a file that you can
> access from your client.
>
> Cheers,
>
> Edric.

Well, I was thinking about a way to monitor the resource usage on all nodes, how many resources have been allocated to a particular job, and that kind of thing.
Thanks again for your help.
From: Marcin
"Marcin " <mb1234(a)gazeta.pl> wrote in message <hem8k7$d1q$1(a)fred.mathworks.com>...
> Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwd435cowq.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> > "Marcin " <mb1234(a)gazeta.pl> writes:
> >
> > > [...]
> > > As a workaround we have removed one of the problematic nodes from the cluster
> > > and the admin is currently investigating the issue.
> >
> > Just a wild stab in the dark here - occasionally, we see weird problems caused
> > by bogus localhost entries in /etc/hosts - in particular, lines like
> >
> > "127.0.0.1 <stuff> <real-machine-name>"
> >
> > cause problems.
> >
> > > I have another question though: how can I monitor the progress of my parallel
> > > job other than using qstat, which doesn't tell me much?
> >
> > We don't have any built-in facilities, I'm afraid. (What sort of thing were you
> > after?) For now, your best bet is to write stuff out to a file that you can
> > access from your client.
> >
> > Cheers,
> >
> > Edric.
>
> Well, I was thinking about a way to monitor the resource usage on all nodes, how many resources have been allocated to a particular job, and that kind of thing.
> Thanks again for your help.

Edric, I have another small problem. When I open pmode on all my cluster nodes and issue the maxNumCompThreads command, each of them returns 1, although the machines have quad-core CPUs. After I issue maxNumCompThreads('automatic'), it indeed changes to 4. Can I somehow force each worker to use more than one core at startup?

Thanks
From: Edric M Ellis
"Marcin " <mb1234(a)gazeta.pl> writes:

> Edric, I have another small problem. When I open pmode on all my cluster nodes
> and issue the maxNumCompThreads command, each of them returns 1, although the
> machines have quad-core CPUs. After I issue maxNumCompThreads('automatic'), it
> indeed changes to 4. Can I somehow force each worker to use more than one core
> at startup?

You should be able to use jobStartup.m to do that.

http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/jobstartup.html
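
As a sketch (assuming you can put this somewhere the workers will pick it
up, e.g. via the job's FileDependencies property), a jobStartup.m like

function jobStartup(job)
% Runs on each worker when the job starts; let MATLAB choose the
% number of computational threads from the number of cores.
maxNumCompThreads('automatic');

should have each worker using all four cores before your tasks run.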

Cheers,

Edric.