From: K on
Hi,

I was trying to evaluate definite integrals of different product
combinations of trigonometric functions like so:

ClearSystemCache[];
AbsoluteTiming[
Table[Integrate[
Sin[ph]*1/(2 Pi)*Sin[nn*ph]*Cos[mm*ph], {ph, Pi/2, Pi}], {nn, 0,
15}, {mm, 0, 15}];]

I included ClearSystemCache[] to get comparable results for successive
runs. Output of the actual matrix result is suppressed. On my dual
core AMD, I got this result from Mathematica 7.0.1 (Linux x86 64-bit)
for the above command:

{65.240614, Null}

Now I thought that this computation could be almost perfectly
parallelized by having, e.g., nn = 0, ..., 7 evaluated by one kernel
and nn = 8, ..., 15 by the other, and typed:

ParallelEvaluate[ClearSystemCache[]];
AbsoluteTiming[
ParallelTable[
Integrate[
Sin[ph]*1/(2 Pi)*Sin[nn*ph]*Cos[mm*ph], {ph, Pi/2, Pi}], {nn, 0,
15}, {mm, 0, 15}, Method -> "CoarsestGrained"];]

The result, however, was disappointing:

{76.993888, Null}

By the way, Kernels[] returns:

{KernelObject[1,local],KernelObject[2,local]}

This suggests that the parallel command was indeed evaluated by two
kernels. With Method -> "CoarsestGrained", I hoped to obtain the data
splitting I mentioned above. If I do the splitting and combining
myself, it even gets a bit worse:

ParallelEvaluate[ClearSystemCache[]];
AbsoluteTiming[
 job1 = ParallelSubmit[
   Table[Integrate[Sin[ph]*1/(2 Pi)*Sin[nn*ph]*Cos[mm*ph],
     {ph, Pi/2, Pi}], {nn, 0, 7}, {mm, 0, 15}]];
 job2 = ParallelSubmit[
   Table[Integrate[Sin[ph]*1/(2 Pi)*Sin[nn*ph]*Cos[mm*ph],
     {ph, Pi/2, Pi}], {nn, 8, 15}, {mm, 0, 15}]];
 {res1, res2} = WaitAll[{job1, job2}];
 Flatten[{{res1}, {res2}}, 2];]

The result:

{78.669442,Null}
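
For reference, my understanding is that "CoarsestGrained" scheduling
amounts to exactly the split I described: the iteration range is cut
into one contiguous chunk per kernel. A small Python sketch of that
chunking (illustrative only; the function name is my own, not part of
Mathematica):

```python
# Illustrative sketch (not Mathematica): "coarsest grained" scheduling
# splits the iteration range into one contiguous chunk per kernel.
def coarsest_grained_chunks(indices, n_kernels):
    """Split a list of indices into n_kernels contiguous chunks."""
    n = len(indices)
    base, extra = divmod(n, n_kernels)
    chunks, start = [], 0
    for k in range(n_kernels):
        size = base + (1 if k < extra else 0)
        chunks.append(indices[start:start + size])
        start += size
    return chunks

# With 2 kernels and nn = 0..15, this is exactly the 0..7 / 8..15 split:
print(coarsest_grained_chunks(list(range(16)), 2))
```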

I can't believe that the splitting and combining overhead on a single
machine (no network involved here) can eat up all the gain from
distributing the actual workload to two kernels. Does anyone have an
idea what is going wrong here?
Thanks,
K.

From: Eric Wort on
Hi K,

Some processors can adjust their clock speed on the fly, and on
Linux the machine will stay at a lower clock speed if only
low-priority processes are competing for CPU time.

By default, Mathematica launches subkernels with a lower than standard
priority, which can often cause this issue. If you look in the Parallel
tab of the Preferences dialog, there is an option entitled "Run kernels
at a lower process priority". Make sure that this is not checked if you
want the subkernels to run as quickly as possible.

I obtained the following results running your example on my system with
the option unchecked:

In[1]:= ClearSystemCache[];
AbsoluteTiming[
Table[Integrate[
Sin[ph]*1/(2 Pi)*Sin[nn*ph]*Cos[mm*ph], {ph, Pi/2, Pi}], {nn, 0,
15}, {mm, 0, 15}];]

Out[2]= {37.142150, Null}

In[3]:= LaunchKernels[2]

Out[3]= {KernelObject[1, "local"], KernelObject[2, "local"]}

In[4]:= ParallelEvaluate[ClearSystemCache[]];
AbsoluteTiming[
ParallelTable[
Integrate[
Sin[ph]*1/(2 Pi)*Sin[nn*ph]*Cos[mm*ph], {ph, Pi/2, Pi}], {nn, 0,
15}, {mm, 0, 15}, Method -> "CoarsestGrained"];]

Out[5]= {23.712933, Null}

Sincerely,
Eric Wort

K wrote:
> [...]


From: Patrick Scheibe on
Hi,

Here (Ubuntu 64-bit, 4 cores, Mathematica 7.0.1) the timing is 53 s
for the serial evaluation and 22 s for the parallel computation.

If I minimize the data-transfer overhead that arises when the
kernels return their results, the speed-up is more visible. Note the
changed step size and the extra semicolon (which discards each
integral's result):

ClearSystemCache[];
AbsoluteTiming[
Table[Integrate[
Sin[ph]*1/(2 Pi)*Sin[nn*ph]*Cos[mm*ph], {ph, Pi/2, Pi}];, {nn, 0,
15}, {mm, 0, 15, 1/2}];]

needs 145.314899 seconds

AbsoluteTiming[
ParallelTable[
Table[Integrate[
Sin[ph]*1/(2 Pi)*Sin[nn*ph]*Cos[mm*ph], {ph, Pi/2, Pi}];, {nn,
0, 15}],
{mm, 0, 15, 1/2}]
;]

needs 52.036152 seconds. Each run was made in a fresh Mathematica session.
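
To see why the semicolon matters, here is a toy cost model in Python
(all numbers are made up, just to show the effect): the result-transfer
term does not shrink with the number of kernels, so discarding the
per-integral results removes it entirely.

```python
def parallel_time(compute_time, n_kernels, n_results, transfer_per_result):
    """Toy model: perfectly parallel compute plus serial result transfer."""
    return compute_time / n_kernels + n_results * transfer_per_result

# 16 x 31 symbolic results at a made-up 0.05 s transfer cost each:
with_results = parallel_time(145.0, 4, 16 * 31, 0.05)
without_results = parallel_time(145.0, 4, 0, 0.05)  # the ";" discards them
print(round(with_results, 2), round(without_results, 2))
```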

Cheers
Patrick

On Tue, 2009-12-15 at 07:33 -0500, K wrote:
> [...]


From: Mark McClure on
On Tue, Dec 15, 2009 at 7:33 AM, K <kgspga(a)googlemail.com> wrote:
> I was trying to evaluate definite integrals of different product
> combinations of trigonometric functions like so:
> ...
> Now I thought that this computation could be almost perfectly
> parallelized by having, e.g., nn = 0,...,7 evaluated by one kernel
> and nn=8, ..., 15 by the other and typed:
> ...
> The result, however, was disappointing:

Two symbolic computations that appear superficially similar may
actually take vastly different amounts of time to perform and there
may be no general a priori way to determine which will take longer.
Thus, the computation of a large number of symbolic computations
typically parallelizes very poorly, since there is no way to break the
problems up into parts that take comparable times. In particular, the
integrals in your computation take a wide range of times to compute.
Here's a simple illustration of the range of timings in your
computation.

ClearSystemCache[];
timings = Table[Timing[Integrate[
Sin[ph] Sin[nn*ph]*Cos[mm*ph]/(2 Pi),
{ph, Pi/2, Pi}]][[1]],
{nn, 0, 15}, {mm, 0, 15}];
ListPlot[Flatten[timings]]


In contrast, here is a collection of trivial computations that take
similar amounts of time.

AbsoluteTiming[
Table[Total[RandomReal[{0, 1}, {500}]], {500}, {500}];
]
{2.781051, Null}

In this case, we do gain the expected benefit by performing the
computation in parallel.

LaunchKernels[2];
AbsoluteTiming[
ParallelTable[Total[RandomReal[{0, 1}, {500}]], {500}, {500}];
]
{1.632608, Null}
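
The imbalance point can also be made concrete in a language-neutral
way. Here is a short Python sketch with made-up task times: under a
contiguous ("coarsest grained") split, the wall time is the sum of the
slower half, while an interleaved assignment balances systematically
varying times.

```python
# Made-up task times that grow systematically with the index, mimicking
# integrals whose difficulty depends on a parameter (illustrative only).
times = [0.1 * (1 + i // 16) for i in range(256)]

serial = sum(times)
half = len(times) // 2
contiguous = max(sum(times[:half]), sum(times[half:]))   # coarsest-grained split
interleaved = max(sum(times[0::2]), sum(times[1::2]))    # round-robin split

print(f"serial {serial:.1f} s, contiguous split {contiguous:.1f} s "
      f"({serial / contiguous:.2f}x), interleaved {interleaved:.1f} s "
      f"({serial / interleaved:.2f}x)")
```

With genuinely unpredictable per-task times, of course, no static
assignment helps; only finer-grained dynamic scheduling does.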

Mark McClure

From: K on
Thank you all for your answers to my problem.

Eric Wort's suggestion of unchecking the lower process priority option
brought the timings down a bit, but not much. I'm now at 58 s/66 s for
serial/parallel evaluation. I also unchecked "Enable parallel
monitoring tools" to see whether the monitoring had any effect, but it
didn't.

Mark McClure's remark about the differing times for similar symbolic
computation tasks was very valuable, and the list plot of timings is
interesting to see. Timings can differ by a factor of 4 or more for
the different integrations. The actual time consumed for one integral
seems almost random. However, if I use Mark's code to generate the
timings table and then sum over the first and the second half of the
results, the totals are not nearly as uneven as the individual
timings:

In[8]:= Sum[Flatten[timings][[ii]],{ii,1,128}]
Out[8]= 29.2016
In[9]:= Sum[Flatten[timings][[ii]],{ii,129,256}]
Out[9]= 27.8508

Just in case, I also split the timings matrix along the other
dimension:

In[6]:= Sum[Flatten[Transpose[timings]][[ii]],{ii,1,128}]
Out[6]= 21.7057
In[7]:= Sum[Flatten[Transpose[timings]][[ii]],{ii,129,256}]
Out[7]= 35.3466

Here, we see a more noticeable difference. And indeed, if I watch the
kernels in the parallel kernel status window or the usage of the cores
in the ksysguard window of KDE, I find that one kernel finishes its
work in practically half the time the other kernel needs. However,
this behavior persists regardless of which variable (nn or mm) I
split on.
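
For completeness, the two half-sum comparisons can be done for any
timing matrix; a small Python sketch (the helper name and the 4x4
matrix are made up, standing in for the real 16x16 timings):

```python
def half_sums(matrix):
    """Sum the first and second half of the row-major flattening,
    i.e. the work each of two kernels gets under a contiguous split."""
    flat = [t for row in matrix for t in row]
    half = len(flat) // 2
    return sum(flat[:half]), sum(flat[half:])

# Made-up 4x4 "timings" whose columns get slower to the right:
m = [[1, 1, 2, 4],
     [1, 1, 2, 4],
     [1, 1, 2, 4],
     [1, 1, 2, 4]]
transposed = [list(col) for col in zip(*m)]

print(half_sums(m))           # split along nn: balanced halves
print(half_sums(transposed))  # split along mm: unbalanced halves
```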

In ksysguard, I also noticed that the main kernel, the Mathematica
front end, and a Java process spawned by Mathematica together take up
20%-40% of the CPU even right after I start Mathematica with an
empty notebook. The MathKernel uses 10%-20% even when no computation
is running at all. Is that normal? I'm on Fedora
11, KDE 4.3.3 with Linux kernel 2.6.30.9-102.fc11.x86_64, the
processor is an AMD Athlon(tm) 64 X2 Dual Core Processor 5600+.

In the parallel kernel status window, the Time column usually shows
about 25 s for the master kernel, 30 s for the first and 40 s for the
second local kernel for one evaluation of the integration matrix. Are
the ratios of these values approximately what you get for the
computation?
Regards,
K.

On 15 Dec, 13:33, K <kgs...(a)googlemail.com> wrote:
> [...]