From: Marcin on 26 Nov 2009 09:00

"Marcin " <mb1234(a)gazeta.pl> wrote in message <heljbj$ft$1(a)fred.mathworks.com>...
> Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwr5rld74d.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> > "Marcin " <mb1234(a)gazeta.pl> writes:
> > >
> > > [...]
> >
> > Hmm, glad we made *some* progress!
> >
> > > But now, I'm getting a new error:
> > >
> > > >> pmode open 12
> > > Starting pmode using the 'SGE-smart(a)dec120' configuration ...
> > > Your job 2100 ("Job1") has been submitted
> > >
> > > ??? Error using ==> distcomp.interactiveclient.start at 103
> > > The client lost connection to lab 6.
> > > This might be due to network problems, or the interactive matlabpool job
> > > might have errored. This is causing:
> > > java.io.IOException: An existing connection was forcibly closed by the remote host
> > >
> > > Error in ==> pmode at 84
> > > client.start('pmode', nlabs, config, 'opengui');
> > >
> > > Sending a stop signal to all the labs ... stopped.
> > >
> > > ??? Error using ==> distcomp.interactiveclient.start at 119
> > > Failed to initialize the interactive session.
> > > This is caused by:
> > > Java exception occurred:
> > > com.mathworks.toolbox.distcomp.pmode.SessionDestroyedException
> > > at com.mathworks.toolbox.distcomp.pmode.Session.getFileDependenciesAssistant(Session.java:146)
> >
> > That error basically means that the connection between the workers and the
> > client went away. Unfortunately, this is a relatively generic error that doesn't
> > really indicate what the cause might be.
> >
> > > When I try to create a smaller pool though, like pmode open 8 - it usually
> > > works. My cluster has 15 nodes, and the total number of slots has been set to 75
> > > (5 per node). There shouldn't be any connectivity problems, as it all runs on
> > > a separate gigabit network.
> >
> > You say "usually works" - is there a point where it always works, and a point
> > where it always fails?
> >
> > > Interestingly, when I submit a parallel job like this:
> > >
> > > clusterHost = 'dec120.bmth.ac.uk';
> > > remoteDataLocation = '/home/smart';
> > > sched = findResource('scheduler', 'type', 'generic');
> > > % Use a local directory as the DataLocation
> > > set(sched, 'DataLocation', struct('pc','C:/TEMP/MATLAB','unix','/home/smart'));
> > > set(sched, 'ClusterMatlabRoot', '/opt/matlab/2009b');
> > > set(sched, 'HasSharedFilesystem', false);
> > > set(sched, 'ClusterOsType', 'unix');
> > > set(sched, 'GetJobStateFcn', @sgeGetJobState);
> > > set(sched, 'DestroyJobFcn', @sgeDestroyJob);
> > > set(sched, 'SubmitFcn', {@sgeNonSharedSimpleSubmitFcn, clusterHost, remoteDataLocation});
> > > set(sched, 'ParallelSubmitFcn', {@sgeNonSharedParallelSubmitFcn, clusterHost, remoteDataLocation});
> > >
> > > parJob = createParallelJob(sched,'Min',15,'Max',15);
> > > createTask(parJob, @labindex, 1);
> > > submit(parJob);
> > > waitForState(parJob);
> > > results2 = getAllOutputArguments(parJob);
> > >
> > > It finishes without error and all 15 nodes are involved (I know it by
> > > examining the log files on the cluster).
> >
> > Are the settings that you've got there identical to whatever you've got set for
> > the configuration used by pmode? I'd try
> >
> > sched = findResource( 'scheduler', 'Configuration', '<configname>' )
> >
> > rather than all the manual settings and see if that works...
> >
> > Cheers,
> >
> > Edric.

Hi,

It doesn't make a difference, but I know a bit more about the problem now. It seems there are two nodes in my cluster which, when used together, cause the problem. So when I create a pool using matlabpool or pmode and both of them end up in the pool, it crashes; if only one of them gets picked, it works. Strange, as all of them have exactly the same configuration. At least they should... As a workaround we have removed one of the problematic nodes from the cluster, and the admin is currently investigating the issue.
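[For reference, combining Edric's suggestion with the minimal test job quoted above would look something like the sketch below. '<configname>' stands for whatever configuration name pmode is using; this is untested against this particular cluster.]

```matlab
% Build the scheduler object from the stored configuration rather than
% setting each property by hand, so pmode and batch jobs use identical settings.
sched = findResource('scheduler', 'Configuration', '<configname>');

% Minimal parallel job: each lab just returns its labindex, so the
% output confirms how many labs actually took part.
parJob = createParallelJob(sched, 'Min', 15, 'Max', 15);
createTask(parJob, @labindex, 1);
submit(parJob);
waitForState(parJob);
results = getAllOutputArguments(parJob);  % expect a 15x1 cell of lab indices
destroy(parJob);  % clean up the job's data files when done
```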
I have another question though: how can I monitor the progress of my parallel job other than using qstat, which doesn't tell me much?
From: Edric M Ellis on 26 Nov 2009 09:59

"Marcin " <mb1234(a)gazeta.pl> writes:
> [...]
> As a workaround we have removed one of the problematic nodes from the cluster
> and the admin is currently investigating the issue.

Just a wild stab in the dark here - occasionally, we see weird problems caused
by bogus localhost entries in /etc/hosts - in particular, lines like

"127.0.0.1 <stuff> <real-machine-name>"

cause problems.

> I have another question though: how can I monitor the progress of my parallel
> job other than using qstat, which doesn't tell me much?

We don't have any built-in facilities I'm afraid. (What sort of thing were you
after?) For now, your best bet is to write stuff out to a file that you can
access from your client.

Cheers,

Edric.
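[A sketch of that write-to-a-file approach: a task function in which each lab appends progress lines to its own file under the remote DataLocation ('/home/smart' from the earlier post). The file naming and message format here are purely illustrative.]

```matlab
function result = monitoredTask(numIterations)
%MONITOREDTASK Example task that logs its progress per lab.
% Each lab writes to its own file so the client can watch the files
% from the DataLocation. The /home/smart path is an assumption - use
% any directory that is visible from your client machine.
logFile = sprintf('/home/smart/progress_lab%d.txt', labindex);
fid = fopen(logFile, 'a');
for iter = 1:numIterations
    % ... the real work for this iteration would go here ...
    fprintf(fid, 'lab %d finished iteration %d of %d\n', ...
            labindex, iter, numIterations);
end
fclose(fid);
result = labindex;
end
```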
From: Marcin on 26 Nov 2009 10:59

Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwd435cowq.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> "Marcin " <mb1234(a)gazeta.pl> writes:
>
> > [...]
> > As a workaround we have removed one of the problematic nodes from the cluster
> > and the admin is currently investigating the issue.
>
> Just a wild stab in the dark here - occasionally, we see weird problems caused
> by bogus localhost entries in /etc/hosts - in particular, lines like
>
> "127.0.0.1 <stuff> <real-machine-name>"
>
> cause problems.
>
> > I have another question though: how can I monitor the progress of my parallel
> > job other than using qstat, which doesn't tell me much?
>
> We don't have any built-in facilities I'm afraid. (What sort of thing were you
> after?) For now, your best bet is to write stuff out to a file that you can
> access from your client.
>
> Cheers,
>
> Edric.

Well, I was thinking about a way to monitor the resource usage on all the nodes, how much resource has been allocated to a particular job, and that kind of thing. Thanks again for your help.
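[For per-host and per-job resource usage, the SGE command-line tools are probably the closest fit; they can also be invoked from the MATLAB client via system(). A sketch, assuming the SGE client tools are on the path; 2100 is the job id from the earlier pmode output.]

```matlab
% SGE's own tools report per-host and per-job resource usage:
system('qhost');          % load, memory and CPU summary per execution host
system('qstat -f');       % full queue listing: which slots on which hosts are busy
system('qstat -j 2100');  % detailed scheduling and usage info for one SGE job id
```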
From: Marcin on 28 Nov 2009 14:14

"Marcin " <mb1234(a)gazeta.pl> wrote in message <hem8k7$d1q$1(a)fred.mathworks.com>...
> Edric M Ellis <eellis(a)mathworks.com> wrote in message <ytwd435cowq.fsf(a)uk-eellis-deb5-64.mathworks.co.uk>...
> > "Marcin " <mb1234(a)gazeta.pl> writes:
> > >
> > > [...]
> > > As a workaround we have removed one of the problematic nodes from the cluster
> > > and the admin is currently investigating the issue.
> >
> > Just a wild stab in the dark here - occasionally, we see weird problems caused
> > by bogus localhost entries in /etc/hosts - in particular, lines like
> >
> > "127.0.0.1 <stuff> <real-machine-name>"
> >
> > cause problems.
> >
> > > I have another question though: how can I monitor the progress of my parallel
> > > job other than using qstat, which doesn't tell me much?
> >
> > We don't have any built-in facilities I'm afraid. (What sort of thing were you
> > after?) For now, your best bet is to write stuff out to a file that you can
> > access from your client.
> >
> > Cheers,
> >
> > Edric.
>
> Well, I was thinking about a way to monitor the resource usage on all the nodes,
> how much resource has been allocated to a particular job, and that kind of thing.
> Thanks again for your help.

Edric, I have another small problem. When I pmode to all my cluster nodes and issue the maxNumCompThreads command, each of them returns 1, although the machines have quad-core CPUs. After I issue maxNumCompThreads('automatic') it indeed changes to 4. Can I somehow force each worker to use more than one core at startup?

Thanks
From: Edric M Ellis on 30 Nov 2009 03:33
"Marcin " <mb1234(a)gazeta.pl> writes:
> Edric, I have another small problem. When I pmode to all my cluster nodes and
> issue the maxNumCompThreads command, each of them returns 1, although the
> machines have quad-core CPUs. After I issue maxNumCompThreads('automatic') it
> indeed changes to 4. Can I somehow force each worker to use more than one core
> at startup?

You should be able to use jobStartup.m to do that.

http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/jobstartup.html

Cheers,

Edric.
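[Following the link above, a minimal jobStartup.m along these lines should do it - a sketch only; see the linked page for where the file needs to live so the workers pick it up.]

```matlab
function jobStartup(job)
%JOBSTARTUP Runs on each worker as the job starts.
% Workers default to a single computational thread; let each one
% choose its thread count from the number of cores instead.
maxNumCompThreads('automatic');
end
```

One caveat: with 5 slots per node (as configured earlier in the thread) and quad-core machines, 4 threads per worker would oversubscribe the CPUs, so it may be worth reducing the slot count per node to match.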