From: Jean-Baptiste Fiot on 7 Apr 2010 01:14

Dear all,

I am trying to get a toy program to work on a cluster using DCS / PCT. I will try to be as accurate as possible (apologies for the length of this message).

First, here are the versions I'm using:

-------------------------------------------------------------------------------------
MATLAB Version 7.9.0.529 (R2009b)
MATLAB License Number: 295465
Operating System: Linux 2.6.16.27-0.9-bigsmp #1 SMP Tue Feb 13 09:35:18 UTC 2007 i686
Java VM Version: Java 1.6.0_12-b04 with Sun Microsystems Inc. Java HotSpot(TM) Client VM mixed mode
-------------------------------------------------------------------------------------
MATLAB                        Version 7.9   (R2009b)
Image Processing Toolbox      Version 6.4   (R2009b)
MATLAB Compiler               Version 4.11  (R2009b)
Optimization Toolbox          Version 4.3   (R2009b)
Parallel Computing Toolbox    Version 4.2   (R2009b)
Signal Processing Toolbox     Version 6.12  (R2009b)
Statistics Toolbox            Version 7.2   (R2009b)

Here is my source code:

clc; clear all;
addpath('/tools/matlab/R2009b/bin/glnxa64');
addpath('/tools/matlab/R2009b/toolbox/distcomp/examples/integration/pbs/nonshared/unix');

CLUSTER_HOST = 'XXXX';
NUMBER_OF_WORKERS = 4;
REMOTE_DATA_LOCATION = 'XXXX';
DATA_LOCATION = 'XXXX';
CLUSTER_MATLAB_ROOT = '/tools/matlab/R2009b';

sched = findResource('scheduler', 'type', 'generic');
set(sched, 'DataLocation', DATA_LOCATION)
set(sched, 'ClusterMatlabRoot',CLUSTER_MATLAB_ROOT);
set(sched, 'HasSharedFilesystem', true)
set(sched, 'ClusterOsType', 'unix');
set(sched, 'SubmitFcn', {@pbsNonSharedSimpleSubmitFcn, CLUSTER_HOST, REMOTE_DATA_LOCATION});

j = createJob(sched);
createTask(j, @rand, 1, {3});
createTask(j, @rand, 1, {3});
createTask(j, @rand, 1, {3});
createTask(j, @rand, 1, {3});
submit(j)
waitForState(j)
results = getAllOutputArguments(j);
celldisp(results)

What happens:
The program gets stuck at the waitForState command. According to [http://www.mathworks.ch/matlabcentral/newsreader/view_thread/236046#600517], "When things get stuck in "waitForState" for much longer than they should, that generally means that execution on the cluster hasn't worked completely correctly. " So I had a look at the log:

Task1.log tells me:
Begin PBS Prologue Wed Apr 7 14:49:36 EST 2010 1270615776
Job ID: 563582.XXXX
Username: XXXX
Group: XXXX
Nodes: XXXX
End PBS Prologue Wed Apr 7 14:49:37 EST 2010 1270615777

/tools/matlab/R2009b/bin/glnxa64/MATLAB: error while loading shared libraries: libmwma57.so: failed to map segment from shared object: Cannot allocate memory

Begin PBS Epilogue Wed Apr 7 14:49:42 EST 2010 1270615782
Job ID: 563582.XXXX
Username: XXXX
Group: XXXX
Job Name: MATLAB_Job4/Task1
Session: 18781
Limits: vmem=200mb,walltime=00:05:00
Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:01
Queue: normal
Account:
Nodes: XXXX
Killing leftovers...
End PBS Epilogue Wed Apr 7 14:49:43 EST 2010 1270615783

I haven't the foggiest idea how to fix this issue, so any hint would be appreciated!

Cheers

From: Edric M Ellis on 7 Apr 2010 03:28

"Jean-Baptiste Fiot" <hellwoxx(a)gmail.com> writes:

> I am trying to get a toy program to work on a cluster using DCS /
> PCT. I will try to be as accurate as possible (apologies for the
> length of this message).
> [...]
> What happens:
> The program gets stuck at the waitForState command. According to
> [http://www.mathworks.ch/matlabcentral/newsreader/view_thread/236046#600517],
> "When things get stuck in "waitForState" for much longer than they
> should, that generally means that execution on the cluster hasn't
> worked completely correctly. " So I had a look at the log:
>
> Task1.log tells me:
> Begin PBS Prologue Wed Apr 7 14:49:36 EST 2010 1270615776
> Job ID: 563582.XXXX
> Username: XXXX
> Group: XXXX
> Nodes: XXXX
> End PBS Prologue Wed Apr 7 14:49:37 EST 2010 1270615777
>
> /tools/matlab/R2009b/bin/glnxa64/MATLAB: error while loading shared
> libraries: libmwma57.so: failed to map segment from shared object:
> Cannot allocate memory

I think the problem is the limits imposed by PBS, which are preventing
MATLAB from starting up. In the epilogue we see:

> Begin PBS Epilogue Wed Apr 7 14:49:42 EST 2010 1270615782
> Job ID: 563582.XXXX
> Username: XXXX
> Group: XXXX
> Job Name: MATLAB_Job4/Task1
> Session: 18781
> Limits: vmem=200mb,walltime=00:05:00
> Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:01
> Queue: normal
> Account:
> Nodes: XXXX
> Killing leftovers...

The limit on vmem is 200mb. I think this is too small. You need to
modify the job submission to specify a larger value - I think the
virtual memory limit needs to be much larger, try 1 GB (the resident size
needed is much smaller - this is the amount of physical RAM being
consumed).

Also note that your walltime limit is 5 minutes - you may well wish to
increase this if your tasks are likely to take longer than that.

Cheers,

Edric.

From: Jean-Baptiste Fiot on 7 Apr 2010 06:01

Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwwrwjen66.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...

> I think the problem is the limits imposed by PBS, which are preventing
> MATLAB from starting up. In the epilogue we see:
>
> > Begin PBS Epilogue Wed Apr 7 14:49:42 EST 2010 1270615782
> > Job ID: 563582.XXXX
> > Username: XXXX
> > Group: XXXX
> > Job Name: MATLAB_Job4/Task1
> > Session: 18781
> > Limits: vmem=200mb,walltime=00:05:00
> > Resources: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:01
> > Queue: normal
> > Account:
> > Nodes: XXXX
> > Killing leftovers...
>
> The limit on vmem is 200mb. I think this is too small. You need to
> modify the job submission to specify a larger value - I think the
> virtual memory limit needs to be much larger, try 1 GB (the resident size
> needed is much smaller - this is the amount of physical RAM being
> consumed).
>
> Also note that your walltime limit is 5 minutes - you may well wish to
> increase this if your tasks are likely to take longer than that.
>
> Cheers,
>
> Edric.

Thanks Edric for your response.

I have a newbie question: can I set these vmem and walltime limits in my MATLAB script, or does it have to be done from somewhere else?

I found on [ http://rcsg.rice.edu/ada/docs/matlab.html ] that I can use the qalter command from a terminal to change the walltime of a job already in the queue. However, this does not really suit me. I have also seen that people using a batch script calling MATLAB use "#PBS -l walltime=00:02:00,vmem=250MB", but that is not my case. Finally, I looked in the scheduler properties and in the Parallel config (menu Parallel > Manage configurations...) but found no walltime or vmem setting.

Thanks again

From: Edric M Ellis on 7 Apr 2010 08:32

"Jean-Baptiste Fiot" <hellwoxx(a)gmail.com> writes:

> Edric M Ellis <eellis(a)mathworks.com> wrote in message
> <ytwwrwjen66.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
>> [...]
>> The limit on vmem is 200mb. I think this is too small. You need to
>> modify the job submission to specify a larger value [...]
>
> I have a newbie question: can I set these vmem and walltime limits in my
> MATLAB script, or does it have to be done from somewhere else? I found
> on [ http://rcsg.rice.edu/ada/docs/matlab.html ] that I can use the
> qalter command from a terminal to change the walltime of a job already
> in the queue. However, this does not really suit me. I have also seen
> that people using a batch script calling MATLAB use "#PBS -l
> walltime=00:02:00,vmem=250MB", but that is not my case. Finally, I
> looked in the scheduler properties and in the Parallel config (menu
> Parallel > Manage configurations...) but found no walltime or vmem
> setting.

You can add the "#PBS" directive lines by modifying the submission
function pbsNonSharedSimpleSubmitFcn.m that you're using - see the
"createPBSSubmitScript" subfunction - there are some other #PBS
directives there, so hopefully it's clear where you should add things.

Cheers,

Edric.

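For readers following along, the kind of change Edric describes might look roughly like the sketch below. It is not the shipped createPBSSubmitScript subfunction, whose actual contents are not shown in this thread: the function name writePBSHeader, its arguments, and the example values are hypothetical and chosen only for illustration. The sketch just shows one way to write a "#PBS -l" resource request line into a submit script from MATLAB.

```matlab
function writePBSHeader(scriptName, jobName, vmem, walltime)
% writePBSHeader  Hypothetical helper (an assumption, not part of PCT).
% Writes the first lines of a PBS submit script, including a resource
% request that raises the virtual memory and walltime limits.

fid = fopen(scriptName, 'w');
if fid < 0
    error('writePBSHeader:cannotOpen', ...
          'Cannot open %s for writing.', scriptName);
end

fprintf(fid, '#!/bin/sh\n');
fprintf(fid, '#PBS -N %s\n', jobName);

% Resource limits: vmem must be large enough for MATLAB to start
% (200mb was too small in the log above); walltime must cover the task.
fprintf(fid, '#PBS -l vmem=%s,walltime=%s\n', vmem, walltime);

fclose(fid);
end
```

Saved as writePBSHeader.m, it could be called as writePBSHeader('submit.sh', 'toyJob', '1gb', '00:30:00'). In the real submission function the equivalent fprintf line would sit next to the existing #PBS directives, and suitable values depend on the cluster's queue limits.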
From: Jean-Baptiste Fiot on 7 Apr 2010 20:06

Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwsk77e93y.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...

> You can add the "#PBS" directive lines by modifying the submission
> function pbsNonSharedSimpleSubmitFcn.m that you're using - see the
> "createPBSSubmitScript" subfunction - there are some other #PBS
> directives there, so hopefully it's clear where you should add
> things.
>
> Cheers,
>
> Edric.

Thanks Edric, that fixed that error message.

(Now the log tells me that the licence checkout failed (licence manager error -18), so I have asked the administrators of my cluster to restart the licence manager, and hopefully my toy program will work afterwards!)

Jean-Baptiste