From: Kevin Brown on
I have a trial version of the Parallel Computing Toolbox that I'm testing out. I'm using the parfor command to process some loops in parallel. The computation within each loop is relatively small (3 seconds or so), so I'm not getting the speed-up from 8 workers (one per core) that I had hoped for - in one instance it's only 3x faster, and in another only 20% faster.

I understand that the overhead from initializing and allocating memory for the workers is probably eating into the performance gain. Is there some way to examine where that time is spent, so that I can optimize my code better? Something like the profiler, but for parallel jobs? For example, if I knew that it was the memory transfer that was eating up the time, I could try to design the code to require less memory transfer.
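At the moment all I can do is crude end-to-end timing, something like the sketch below (myTask here just stands in for my real loop body), which shows the overall gap but not where the time goes:

```matlab
N = 100;
r = zeros(1, N);

% Time the serial version.
tic
for i = 1:N
    r(i) = myTask(i);   % placeholder for the per-iteration work
end
tSerial = toc;

% Time the parallel version of the same loop.
tic
parfor i = 1:N
    r(i) = myTask(i);
end
tParallel = toc;

fprintf('serial %.1fs, parfor %.1fs, speed-up %.1fx\n', ...
        tSerial, tParallel, tSerial / tParallel);
```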

Thanks in advance,
- Kevin
From: Edric M Ellis on
"Kevin Brown" <kevin.m.brown(a)philips.com> writes:

> I have a trial version of the Parallel Computing Toolbox that I'm testing
> out. I'm using the parfor command to process some loops in parallel. The
> computation within each loop is relatively small (3 seconds or so), so I'm
> not getting the speed-up from 8 workers (one per core) that I had hoped
> for - in one instance it's only 3x faster, and in another only 20% faster.
>
> I understand that the overhead from initializing and allocating memory for
> the workers is probably eating into the performance gain. Is there some way
> to examine where that time is spent, so that I can optimize my code better?
> Something like the profiler, but for parallel jobs? For example, if I knew
> that it was the memory transfer that was eating up the time, I could try to
> design the code to require less memory transfer.
>
> Thanks in advance, - Kevin

Unfortunately we don't currently have a good solution for profiling the data
transfer involved in PARFOR loops. You can track the time spent on the client
by running the standard profiler with the "-timer real" option (which measures
wall-clock time rather than CPU time, so time spent waiting for the workers is
counted) to see how much time is taken in the PARFOR loops themselves.
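For example, something like this (just a sketch - myTask here stands in for
whatever your loop body actually does):

```matlab
profile on -timer real   % wall-clock timer, so waits on the workers count
parfor i = 1:100
    out(i) = myTask(i);  % placeholder for the real per-iteration work
end
profile off
profile viewer           % look at the time attributed to the PARFOR line
```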

Cheers,

Edric.