From: J.B. Brown on 22 Jul 2010 10:51

Hello everyone, and thanks for taking the time to read this.

For quite some time, I have had a problem using Python's shell execution facilities in combination with a cluster computing environment (such as Sun Grid Engine (SGE)). In particular, I wish to repeatedly execute a number of commands in sub-shells or pipes within a single function, and each execution depends on the result of the previous one, so just writing a brute-force script file and executing the commands is not an option for me.

To isolate and exemplify my problem, I have created three files:
(1) one which exemplifies the spirit of the code I wish to execute in Python,
(2) one which serves as the SGE execution script file and actually calls Python to execute the code in (1), and
(3) a simple shell script which executes (2) a sufficient number of times that it fills all processors on my computing cluster and leaves an additional number of jobs in the queue.

Here is the spirit of the experiment/problem:

generateTest.py:
----------------------------------------------
# Constants
numParallelJobs = 100
testCommand = "continue"  # "os.popen( \"clear\" )"
loopSize = "1000"

# First, write the file containing the test script.
pythonScript = file( "testScript.py", "w" )
pythonScript.write(
"""
import os
for i in range( 0, """ + loopSize + """ ):
    for j in range( 0, """ + loopSize + """ ):
        for k in range( 0, """ + loopSize + """ ):
            for l in range( 0, """ + loopSize + """ ):
                """ + testCommand + """
""" )
pythonScript.close()

# Second, write the SGE script file that executes the Python script.
sgeScript = file( "testScript.sge", "w" )
sgeScript.write(
"""
#$ -cwd
#$ -N pythonTest
#$ -e /export/home/jbbrown/errorLog
#$ -o /export/home/jbbrown/outputLog
python testScript.py
""" )
sgeScript.close()

# Finally, write a script that submits the SGE script a specified number of times.
import os
launchScript = file( "testScript.sh", "w" )
for i in range( 0, numParallelJobs ):
    launchScript.write( "qsub testScript.sge" + os.linesep )
launchScript.close()
----------------------------------------------

Now, let's assume that I have about 50 processors available across 8 compute nodes, with one NFS-mounted disk. If I run the code as above, simply executing Python "continue" statements and doing nothing else, the cluster head node reports no serious NFS daemon load.

However, if I change the code to use the os.popen() call shown as a comment above, or use os.system(), the NFS daemon load on my system skyrockets within seconds of distributing the jobs to the compute nodes -- even though I'm doing nothing but executing the clear-screen command, which technically doesn't pipe any output to the location used for logging stdout. Even if I change the SGE script file to redirect standard output and error explicitly to /dev/null, I still have the same problem.

I believe the source of this problem is that os.popen() and os.system() spawn subshells which then read my shell resource files (.zshrc, .cshrc, .bashrc, etc.). But I don't see an alternative to os.popen2/3/4() or os.system(). os.exec*() cannot solve my problem, because it transfers execution to that program and stops executing the script which called os.exec*().
Without having to rewrite a considerable amount of code (which performs cross validation by repeatedly executing programs in a subshell) in a shell scripting language full of conditional statements, does anyone know of a way to execute external programs from the middle of a script without referencing the shell resource files located on an NFS-mounted directory? I have read through the help(os) documentation repeatedly, but just can't find a solution.

Even a small lead or thought would be greatly appreciated.

With thanks from humid Kyoto,
J.B. Brown
From: MRAB on 22 Jul 2010 11:31

J.B. Brown wrote:
> [snip problem description and test scripts]
>
> I believe the source of this problem is that os.popen() and os.system()
> spawn subshells which then read my shell resource files (.zshrc, .cshrc,
> .bashrc, etc.). But I don't see an alternative to os.popen2/3/4() or
> os.system(). os.exec*() cannot solve my problem, because it transfers
> execution to that program and stops executing the script which called
> os.exec*().
>
> Without having to rewrite a considerable amount of code (which performs
> cross validation by repeatedly executing programs in a subshell) in a
> shell scripting language full of conditional statements, does anyone know
> of a way to execute external programs from the middle of a script without
> referencing the shell resource files located on an NFS-mounted directory?
>
> Even a small lead or thought would be greatly appreciated.
>
Have you looked at the 'subprocess' module?
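A minimal sketch of that suggestion, with the [ "clear" ] argument list standing in for whatever external program the cross-validation code actually runs:

import subprocess

# Passing an argument list with the default shell=False execs the program
# directly in the child process -- no intermediate /bin/sh is started, so
# none of the shell resource files on the NFS home directory are read.
status = subprocess.call( [ "clear" ] )

# If the command's output is needed (as with os.popen()):
proc = subprocess.Popen( [ "clear" ],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE )
out, err = proc.communicate()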
From: Neil Hodgson on 22 Jul 2010 19:19

J.B. Brown:
> I believe the source of this problem is that os.popen() and os.system()
> spawn subshells which then read my shell resource files (.zshrc, .cshrc,
> .bashrc, etc.). But I don't see an alternative to os.popen2/3/4() or
> os.system(). os.exec*() cannot solve my problem, because it transfers
> execution to that program and stops executing the script which called
> os.exec*().

Call fork, then call exec from the new process. Search the web for "fork exec" to find examples in C.

Neil
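In Python, that fork/exec pattern might look roughly like the sketch below, again with [ "clear" ] as a stand-in command; os.fork() creates a child, os.execvp() replaces only the child with the external program, and os.waitpid() lets the calling script continue once the child finishes, all without starting a shell:

import os

def run_without_shell( argv ):
    """Fork, exec the external program in the child, and wait for it.

    argv is an argument list such as [ "clear" ]; since no shell is
    started, no .zshrc/.cshrc/.bashrc on the NFS home directory is read.
    Returns the raw exit status from os.waitpid().
    """
    pid = os.fork()
    if pid == 0:
        # Child process: replace it with the external program.
        try:
            os.execvp( argv[0], argv )
        finally:
            os._exit( 127 )  # reached only if the exec itself failed
    # Parent process: wait, then carry on with the rest of the script.
    _, status = os.waitpid( pid, 0 )
    return status

status = run_without_shell( [ "clear" ] )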