From: Jean-Baptiste Fiot on
Dear all,

I am trying to get a toy program to work on a cluster using DCS / PCT. I will try to be as precise as possible (apologies for the length of this message).

First, here are the versions I'm using:
-------------------------------------------------------------------------------------
MATLAB Version 7.9.0.529 (R2009b)
MATLAB License Number: 295465
Operating System: Linux 2.6.16.27-0.9-bigsmp #1 SMP Tue Feb 13 09:35:18 UTC 2007 i686
Java VM Version: Java 1.6.0_12-b04 with Sun Microsystems Inc. Java HotSpot(TM) Client VM mixed mode
-------------------------------------------------------------------------------------
MATLAB Version 7.9 (R2009b)
Image Processing Toolbox Version 6.4 (R2009b)
MATLAB Compiler Version 4.11 (R2009b)
Optimization Toolbox Version 4.3 (R2009b)
Parallel Computing Toolbox Version 4.2 (R2009b)
Signal Processing Toolbox Version 6.12 (R2009b)
Statistics Toolbox Version 7.2 (R2009b)

Here is my source code:
clc; clear all;
addpath('/tools/matlab/R2009b/bin/glnxa64');
addpath('/tools/matlab/R2009b/toolbox/distcomp/examples/integration/pbs/nonshared/unix');

CLUSTER_HOST = 'XXXX';
NUMBER_OF_WORKERS = 4;
REMOTE_DATA_LOCATION = 'XXXX';
DATA_LOCATION = 'XXXX';
CLUSTER_MATLAB_ROOT = '/tools/matlab/R2009b';

sched = findResource('scheduler', 'type', 'generic');
set(sched, 'DataLocation', DATA_LOCATION)
set(sched, 'ClusterMatlabRoot',CLUSTER_MATLAB_ROOT);
set(sched, 'HasSharedFilesystem', true)
set(sched, 'ClusterOsType', 'unix');

set(sched, 'SubmitFcn', {@pbsNonSharedSimpleSubmitFcn, CLUSTER_HOST, REMOTE_DATA_LOCATION});
j = createJob(sched);
createTask(j, @rand, 1, {3});
createTask(j, @rand, 1, {3});
createTask(j, @rand, 1, {3});
createTask(j, @rand, 1, {3});
submit(j)
waitForState(j)
results = getAllOutputArguments(j);
celldisp(results)

What happens:
The program gets stuck at the waitForState command. According to [http://www.mathworks.ch/matlabcentral/newsreader/view_thread/236046#600517], "When things get stuck in "waitForState" for much longer than they should, that generally means that execution on the cluster hasn't worked completely correctly." So I had a look at the log:

Task1.log tells me:
Begin PBS Prologue Wed Apr 7 14:49:36 EST 2010 1270615776
Job ID: 563582.XXXX
Username: XXXX
Group: XXXX
Nodes: XXXX
End PBS Prologue Wed Apr 7 14:49:37 EST 2010 1270615777

/tools/matlab/R2009b/bin/glnxa64/MATLAB: error while loading shared libraries: libmwma57.so: failed to map segment from shared object: Cannot allocate memory

Begin PBS Epilogue Wed Apr 7 14:49:42 EST 2010 1270615782
Job ID: 563582.XXXX
Username: XXXX
Group: XXXX
Job Name: MATLAB_Job4/Task1
Session: 18781
Limits: vmem=200mb,walltime=00:05:00
Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:01
Queue: normal
Account:
Nodes: XXXX
Killing leftovers...

End PBS Epilogue Wed Apr 7 14:49:43 EST 2010 1270615783

I haven't the foggiest idea how to fix this issue, so any hint would be appreciated!
Cheers
From: Edric M Ellis on
"Jean-Baptiste Fiot" <hellwoxx(a)gmail.com> writes:

> I am trying to get a toy program to work on a cluster using DCS /
> PCT. I will try to be as precise as possible (apologies for the
> length of this message).
> [...]
> What happens:
> The program gets stuck at the waitForState command. According to
> [http://www.mathworks.ch/matlabcentral/newsreader/view_thread/236046#600517],
> "When things get stuck in "waitForState" for much longer than they
> should, that generally means that execution on the cluster hasn't
> worked completely correctly." So I had a look at the log:
>
> Task1.log tells me:
> Begin PBS Prologue Wed Apr 7 14:49:36 EST 2010 1270615776
> Job ID: 563582.XXXX
> Username: XXXX
> Group: XXXX
> Nodes: XXXX
> End PBS Prologue Wed Apr 7 14:49:37 EST 2010 1270615777
>
> /tools/matlab/R2009b/bin/glnxa64/MATLAB: error while loading shared
> libraries: libmwma57.so: failed to map segment from shared object:
> Cannot allocate memory

I think the problem is the limits imposed by PBS which are preventing
MATLAB from starting up. In the epilogue we see:

> Begin PBS Epilogue Wed Apr 7 14:49:42 EST 2010 1270615782
> Job ID: 563582.XXXX
> Username: XXXX
> Group: XXXX
> Job Name: MATLAB_Job4/Task1
> Session: 18781
> Limits: vmem=200mb,walltime=00:05:00
> Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:01
> Queue: normal
> Account:
> Nodes: XXXX
> Killing leftovers...

The limit on vmem is 200mb. I think this is too small. You need to
modify the job submission to specify a larger value - I think the
virtual memory limit needs to be much larger, try 1Gb (the resident size
needed is much smaller - this is the amount of physical RAM being
consumed).

Also note that your walltime limit is 5 minutes - you may well wish to
increase this if your tasks are likely to take longer than that.
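
To check which limits PBS actually applied to a task, the "Limits:" and "Resources:" lines can be pulled out of the task log. The following is only a minimal shell sketch, assuming the epilogue format shown above; the function name show_pbs_limits is made up for illustration and is not part of PBS or MATLAB.

```shell
#!/bin/sh
# Hypothetical helper: print the limit and usage lines from a PBS task log.
# Assumes the epilogue contains lines starting with "Limits:" and
# "Resources:", as in the Task1.log excerpt quoted in this thread.
show_pbs_limits() {
    grep -E '^[[:space:]]*(Limits|Resources):' "$1"
}

# Only run against Task1.log if it exists in the current directory.
if [ -f Task1.log ]; then
    show_pbs_limits Task1.log
fi
```

If the "Limits:" line still shows vmem=200mb after you change the submission, the new value is not reaching PBS.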

Cheers,

Edric.
From: Jean-Baptiste Fiot on
Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwwrwjen66.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> I think the problem is the limits imposed by PBS which are preventing
> MATLAB from starting up. In the epilogue we see:
>
> > Begin PBS Epilogue Wed Apr 7 14:49:42 EST 2010 1270615782
> > Job ID: 563582.XXXX
> > Username: XXXX
> > Group: XXXX
> > Job Name: MATLAB_Job4/Task1
> > Session: 18781
> > Limits: vmem=200mb,walltime=00:05:00
> > Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:01
> > Queue: normal
> > Account:
> > Nodes: XXXX
> > Killing leftovers...
>
> The limit on vmem is 200mb. I think this is too small. You need to
> modify the job submission to specify a larger value - I think the
> virtual memory limit needs to be much larger, try 1Gb (the resident size
> needed is much smaller - this is the amount of physical RAM being
> consumed).
>
> Also note that your walltime limit is 5 minutes - you may well wish to
> increase this if your tasks are likely to take longer than that.
>
> Cheers,
>
> Edric.

Thanks Edric for your response.

I have a newbie question: can I set vmem and walltime in my MATLAB script, or does it have to be done somewhere else?
I found on [ http://rcsg.rice.edu/ada/docs/matlab.html ] that I can use the qalter command from a terminal to change the walltime of a job already in the queue, but that does not really suit me. I have also seen that people who call MATLAB from a batch script use "#PBS -l walltime=00:02:00,vmem=250MB", but that is not my situation. Finally, I looked in the scheduler properties and in the parallel configuration (menu Parallel > Manage configurations...) but found no walltime or vmem setting.

Thanks again
From: Edric M Ellis on
"Jean-Baptiste Fiot" <hellwoxx(a)gmail.com> writes:

> Edric M Ellis <eellis(a)mathworks.com> wrote in message
> <ytwwrwjen66.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
>> [...]
>> The limit on vmem is 200mb. I think this is too small. You need to
>> modify the job submission to specify a larger value [...]
>
> I have a newbie question: can I set vmem and walltime in my MATLAB
> script, or does it have to be done somewhere else? I found on
> [ http://rcsg.rice.edu/ada/docs/matlab.html ] that I can use the
> qalter command from a terminal to change the walltime of a job already
> in the queue, but that does not really suit me. I have also seen that
> people who call MATLAB from a batch script use "#PBS -l
> walltime=00:02:00,vmem=250MB", but that is not my situation. Finally,
> I looked in the scheduler properties and in the parallel configuration
> (menu Parallel > Manage configurations...) but found no walltime or
> vmem setting.

You can add the "#PBS" directive lines by modifying the submission
function pbsNonSharedSimpleSubmitFcn.m that you're using - see the
"createPBSSubmitScript" subfunction - there are some other #PBS
directives there, so hopefully it's clear where you should add
things.
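
For illustration, the submit script handed to PBS is an ordinary shell script, and lines starting with "#PBS" are read as submission options. The sketch below only shows the two resource lines that would need to end up in that script. The helper name pbs_resource_directives and the values 1gb and 00:30:00 are assumptions chosen as examples; how the shipped createPBSSubmitScript writes its lines is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: emit the two PBS resource directives discussed above.
# $1 = virtual memory limit (e.g. 1gb), $2 = walltime (HH:MM:SS).
# The values used below are examples only; check your site's queue limits.
pbs_resource_directives() {
    vmem="$1"
    walltime="$2"
    printf '#PBS -l vmem=%s\n' "$vmem"
    printf '#PBS -l walltime=%s\n' "$walltime"
}

# Print the directive lines for a 1 GB vmem limit and a 30 minute walltime.
pbs_resource_directives 1gb 00:30:00
```

The same two lines can also be combined into a single directive, in the form "#PBS -l walltime=...,vmem=..." quoted earlier in the thread.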

Cheers,

Edric.
From: Jean-Baptiste Fiot on
Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwsk77e93y.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> You can add the "#PBS" directive lines by modifying the submission
> function pbsNonSharedSimpleSubmitFcn.m that you're using - see the
> "createPBSSubmitScript" subfunction - there are some other #PBS
> directives there, so hopefully it's clear where you should add
> things.
>
> Cheers,
>
> Edric.

Thanks Edric, that fixed that error message.

(Now the log tells me that the licence checkout failed (licence manager error -18), so I have asked the administrators of my cluster to restart the licence manager; hopefully my toy program will work after that!)

Jean-Baptiste