Speed change of SUM [Matlab]


From: Jan Simon on 22 Feb 2010 15:25

Dear Rune!

> Well, if you aren't interested, I am a bit surprised that you
> brought up the subject in the first place - mind you; you *did*
> criticize matlab by comparing it to C. That alone is sufficient
> reason to question if you understand your own comparison.

Thanks for posting.
I've already explained what I'm interested in. You are welcome to answer or not.

Jan

From: Jan Simon on 22 Feb 2010 15:56

Dear Bruno, Oleg and Aarif!

Thanks!!!

I stacked the commands on one line only to keep the code in the newsgroup post compact. In real programs this would impede line-wise debugging and readability, and could even cost speed if the JIT compiler gets confused.

Summary:

> 2010A, 64 bit Prerelease, Intel core 2 duo E8500, 3.16 GHz
> x = rand(88999, 1);
> toc % Elapsed time is 0.102085 seconds.
> x = rand(89000, 1);
> toc % Elapsed time is 0.107495 seconds.

> 2007a, Athlon 64 X2 2.4 GHz
> x = rand(88999, 1);
> Elapsed time is 0.283402 seconds.
> x = rand(89000, 1);
> Elapsed time is 0.282451 seconds.

> 2009b, Intel Core 2 Duo 2.5 GHz Vista 32
> time drops around the 89.. limit

> 2009a, Pentium-M:
> time drops at 88999

My conclusion: the change in SUM speed at 89000 elements was not present in 2007a, appeared in 2009a and 2009b, and vanished again in the 64-bit 2010a prerelease. The behaviour does not depend on the number of cores (unless Oleg started his Matlab with -singleCompThread, which I do not assume).

Jan

From: Oleg Komarov on 22 Feb 2010 16:17

Apparently, when Matlab is started single-threaded, the matrix-size vs. time plot doesn't drop:
http://drop.io/lpwtozo

Oleg

From: Jan Simon on 22 Feb 2010 16:42

Dear Oleg!

> Apparently starting as single threaded the matrixSize-time plot doesn't drop:
> http://drop.io/lpwtozo

Better conclusion:
Matlab 2009a/b:
- multi-core machine: SUM is 50% faster for > 89000 elements.
- multi-core machine started with -singleCompThread: no speed change.
- single-core machine: SUM is 75% slower for > 89000 elements
  (independent of -singleCompThread; not nice, TMW!).
Matlab 2010a:
At least, no change in speed at the magic 89000 limit.
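One way to picture the behaviour listed above is a SUM that switches strategy at an element-count threshold: below it, one sequential loop; above it, the array is split into chunks whose partial sums are combined (in Matlab presumably one chunk per thread). The C sketch below emulates such a dispatch serially. It is an assumption, not Matlab's implementation, which is not public; the 89000 cutoff is taken from the timings in this thread, and the names sum_dispatch, SUM_THRESHOLD and NCHUNKS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff, taken from the timings reported in this thread */
#define SUM_THRESHOLD 89000
/* Hypothetical number of chunks (e.g. one per core) */
#define NCHUNKS 2

/* Plain left-to-right summation */
static double sum_sequential(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Hypothetical size-based dispatch: small inputs are summed in one loop,
   large inputs are split into NCHUNKS contiguous pieces whose partial sums
   are added at the end. A threaded version would give each chunk to its
   own thread; here the chunks are processed one after another. */
double sum_dispatch(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    if (n < SUM_THRESHOLD)
        return sum_sequential(x, n);

    double partial[NCHUNKS];
    size_t chunk = n / NCHUNKS;
    for (int c = 0; c < NCHUNKS; ++c) {
        size_t start = (size_t)c * chunk;
        /* the last chunk also takes the remainder */
        size_t len = (c == NCHUNKS - 1) ? n - start : chunk;
        partial[c] = sum_sequential(x + start, len);
    }

    double s = 0.0;
    for (int c = 0; c < NCHUNKS; ++c)
        s += partial[c];
    return s;
}
```

With 88999 elements the sequential path is taken, with 89000 or more the chunked one; any speed difference then depends entirely on whether the chunks really run in parallel, which fits the -singleCompThread observation.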

So I have arrived in the century of multi-threading: comparing "speed" requires defining it first.

Thanks, Jan

From: Rune Allnor on 23 Feb 2010 04:24

On 22 Feb, 17:46, Rune Allnor <all...(a)tele.ntnu.no> wrote:

> So if these numbers are representative for C compilers, your
> program, compiled with freeware, runs a factor 10'ish too slow.
> Which, by induction, might suggest that matlab runs a factor 5'ish
> slower than the fast C code.
>
> If correct - one would need to confirm your numbers by compiling
> your code with a state-of-the-art compiler - the question becomes
> why matlab runs that slow in the first place.

Of course I became a bit intrigued by this, so I couldn't
resist testing the two C compilers against the built-in
SUM function (matlab script and C code below). As before,
I compiled the same C code to two executables: testsum2lcc,
compiled with the LCC compiler, and testsum2msvc, compiled
with the MSVC 2008 C compiler with the /arch:SSE2 flag set.

The output is:

Results match
SUM : 100 runs in 3.59375 s
testsum2lcc : 100 runs in 3.78125 s
testsum2msvc : 100 runs in 1.75 s

which means the MSVC executable runs twice as fast as the
built-in function, which in turn runs some 5-10% faster than
the LCC executable.
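As a quick check of the arithmetic behind these statements, the ratios of the reported times can be computed directly. The helper names are illustrative only. With these particular numbers the built-in SUM comes out about 5% faster than the LCC build, the lower end of the quoted range.

```c
#include <assert.h>
#include <stdio.h>

/* Timings (seconds for 100 runs) copied from the output above */
static const double t_sum  = 3.59375;  /* built-in SUM */
static const double t_lcc  = 3.78125;  /* testsum2lcc  */
static const double t_msvc = 1.75;     /* testsum2msvc */

/* How many times faster `fast` is than `slow` */
double speedup(double slow, double fast)
{
    assert(fast > 0.0);
    return slow / fast;
}

void print_ratios(void)
{
    printf("SUM vs. MSVC: %.2f\n", speedup(t_sum, t_msvc)); /* prints 2.05 */
    printf("LCC vs. SUM:  %.3f\n", speedup(t_lcc, t_sum));  /* prints 1.052 */
}
```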

Do note that the data are fetched through a far-pointer
access, i.e. through a pointer into a different compilation
module, which is often considered to be slow. This far-pointer
access would explain why the gap between LCC and MSVC shrank
by a factor of about 5 compared to the test I did yesterday,
in which all the work was done on local variables.

It would be very interesting to see a similar test done
with the Intel compiler.
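To repeat the comparison with another compiler, such as the Intel compiler, without going through MEX, the same loop could be timed in plain C. This is a sketch under the assumption that the standard clock() function is an adequate stand-in for Matlab's cputime; some C runtimes, reportedly including Microsoft's, return elapsed time from clock() rather than processor time. The function names plain_sum and time_sum are hypothetical, and numbers from this harness are not directly comparable to the MEX timings above, since no Matlab call overhead is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Same loop as in the MEX file of this post */
double plain_sum(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Hypothetical harness: runs plain_sum `runs` times, stores the last result
   in *result and returns the time reported by clock(), in seconds. */
double time_sum(const double *x, size_t n, int runs, double *result)
{
    assert(runs > 0 && result != NULL);
    /* Reading the input pointer through a volatile variable on every run
       keeps an optimizing compiler from hoisting the unchanging sum out of
       the timing loop. */
    const double *volatile xp = x;
    volatile double sink = 0.0;

    clock_t start = clock();
    for (int r = 0; r < runs; ++r)
        sink = plain_sum(xp, n);
    clock_t stop = clock();

    *result = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Compiling this one file with each compiler (with and without flags such as /arch:SSE2) would isolate the code generation from the MEX interface.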

Ah, yes, I almost forgot: I ran this test with an old matlab
version (R2006a). If somebody tries this with a newer version,
keep in mind that there were some changes made recently,
where the SUM function was adapted to match the results
of parallel algorithms. This will first of all introduce
some overhead in the single-thread version of SUM, slowing it
even more, and also change the results somewhat so that the
matching test at the end will fail.

But none of this was in place in 2006, when my matlab version
was released.
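The point about changed results can be illustrated without Matlab: floating-point addition is not associative, so a summation that splits the data into chunks, as a parallel algorithm would, can legitimately give a different answer from a single left-to-right loop. The sketch below uses a contrived four-element input chosen to make the difference obvious; it says nothing about how Matlab itself orders its additions, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One left-to-right loop, as in the MEX file */
double sum_left_to_right(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Two halves summed separately, then combined: the order a
   two-thread reduction might use */
double sum_two_halves(const double *x, size_t n)
{
    assert(x != NULL);
    size_t half = n / 2;
    double a = sum_left_to_right(x, half);
    double b = sum_left_to_right(x + half, n - half);
    return a + b;
}
```

For the input {1e100, 1.0, -1e100, 1.0} the single loop returns 1.0 (the first 1.0 is absorbed by 1e100, the last one survives), while the two-halves version returns 0.0 (both 1.0 values are absorbed). An exact equality test such as the one at the end of the script above will therefore report deviant results.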

Rune

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clc
x = randn(1,10000000);
% Make sure all executables are loaded prior to
% timed run
s0 = sum(x);
s1 = testsum2lcc(x);
s2 = testsum2msvc(x);

Nruns = 100;

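% Time the built-in SUM. cputime measures CPU time, not wall-clock
% time; for multithreaded code, tic/toc may be the better measure.
t0 = 0;
for n = 1:Nruns
    ts = cputime;
    s0 = sum(x);
    te = cputime - ts;
    t0 = t0 + te;
end

% Time the LCC-compiled MEX file
t1 = 0;
for n = 1:Nruns
    ts = cputime;
    s1 = testsum2lcc(x);
    te = cputime - ts;
    t1 = t1 + te;
end

% Time the MSVC-compiled MEX file
t2 = 0;
for n = 1:Nruns
    ts = cputime;
    s2 = testsum2msvc(x);
    te = cputime - ts;
    t2 = t2 + te;
end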

if ~(s0 == s1 && s1 == s2)
    disp('Deviant results')
    disp('Expected in recent matlab versions due to parallel algorithms')
else
    disp('Results match')
end

fprintf('SUM : %d runs in %g s\n', Nruns, t0);
fprintf('testsum2lcc : %d runs in %g s\n', Nruns, t1);
fprintf('testsum2msvc : %d runs in %g s\n', Nruns, t2);

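/****************************************************************/
/* Sums all elements of a real double array.                     */
#include <stddef.h>
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    size_t i, N;
    double sum;
    const double *x;

    /* Basic input check: exactly one real double array */
    if (nrhs != 1 || !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0])) {
        mexErrMsgTxt("Expected one real double array as input.");
    }

    /* Count all elements, so column vectors work too
       (mxGetN returns only the number of columns) */
    N = mxGetNumberOfElements(prhs[0]);
    x = mxGetPr(prhs[0]);

    sum = 0.0;
    for (i = 0; i < N; ++i) {
        sum += x[i];
    }

    plhs[0] = mxCreateDoubleScalar(sum);
}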
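A common hand optimisation for a loop like the one above is to keep several independent partial sums, which shortens the dependency chain between consecutive additions and gives the compiler more room to use SIMD registers. Whether anything like this is why the MSVC build ran faster is not established in this thread; the sketch below is offered under that assumption only, and the function name sum_unrolled4 is hypothetical. Like any reordering, it can change the last bits of the result, the same effect mentioned above for recent SUM versions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant with four independent accumulators. Elements are
   assigned to the accumulators round-robin, so the additions happen in a
   different order than in a single-accumulator loop. */
double sum_unrolled4(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* leftover elements when n is not a multiple of 4 */
    for (; i < n; ++i)
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```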
