From: Jan Simon on
Dear Rune!


> Of course I became a bit intrigued by this, so I couldn't
> resist testing the two C compilers against the built-in
> SUM function (matlab script and C code below). As before,
> I compiled the same C code to two executables, testsum2lcc
> compiled with the LCC compiler and testsum2msvc compiled
> with the MSVS 2008 C compiler, where the /arch:SSE2 flag
> was set in the MSVC compiler.
>
> The output is:
> Results match
> SUM : 100 runs in 3.59375 s
> testsum2lcc : 100 runs in 3.78125 s
> testsum2msvc : 100 runs in 1.75 s

As expected: Better compilers compile better compilations.

> Ah, yes, I almost forgot: I ran this test with an old matlab
> version (R2006a). If somebody tries this with a newer version,
> keep in mind that there were some changes made recently,
> where the SUM function was adapted to match the results
> of parallel algorithms.

The behaviour of the parallelized SUM was exactly my problem.

But now give LCC a chance to use its optimizer:

% ---------------------------------------- 8< -------------
% SimpleSum.c, Jan Simon, Matlab 6.5 to 2009b
#include "mex.h"

// 32 bit array dimensions for Matlab 6.5:
#ifndef MWSIZE_MAX
#define mwSize int32_T // Defined in tmwtypes.h
#define mwIndex int32_T
#define MWSIZE_MAX MAX_int32_T
#endif

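double GetSum1(const double *X, mwSize N);

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
  double *X;
  mwSize N;

  // Reject inputs for which mxGetPr does not point to real double data:
  if (nrhs != 1 || !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) ||
      mxIsSparse(prhs[0])) {
    mexErrMsgTxt("SimpleSum: Input must be a real, full double array.");
  }

  X = mxGetPr(prhs[0]);
  N = mxGetNumberOfElements(prhs[0]);
  plhs[0] = mxCreateDoubleScalar(GetSum1(X, N));

  return;
}

// Summation in a separate function, which helps LCC's register allocation:
double GetSum1(const double *X, mwSize N)
{
  mwSize i;  // same type as N, so large arrays work with 64 bit dimensions
  double Sum = 0.0;
  for (i = 0; i < N; i++) {
    Sum += X[i];
  }
  return (Sum);
}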

% ---------------------------- >8 ------------------------

While Open Watcom handles your program as expected, the LCC from 2003 that ships with Matlab needs the calculations in a separate function to get the register allocation right.
With that change, the time for summing drops from 0.90 to 0.60 on my computer. I'd be interested to know whether this speed gain is reproducible.
I will compare this with MSVS 2008 when I find some time.
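
For anyone who wants to check the speed gain outside of Matlab, here is a minimal sketch of a timing harness in plain C. The names sum_in_function and time_sum are made up for this example and are not part of SimpleSum.c. It measures CPU time with clock() and says nothing about how a particular compiler allocates registers.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Summation kept in its own function, analogous to GetSum1. */
double sum_in_function(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i;
    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}

/* Hypothetical helper: CPU seconds spent on `runs` calls of fn(x, n).
   The last sum is written to *result so the caller can compare results.
   An aggressive optimizer may still simplify the repeated calls. */
double time_sum(double (*fn)(const double *, size_t),
                const double *x, size_t n, int runs, double *result)
{
    clock_t start;
    int r;
    assert(fn != NULL && result != NULL && runs > 0);
    start = clock();
    for (r = 0; r < runs; r++) {
        *result = fn(x, n);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Compile the same file with each compiler, call time_sum with a large array and the same number of runs, and compare the returned seconds. A variant with the loop written directly inside the caller can be timed in the same way.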

Kind regards, Jan
From: Jan Simon on
Dear Rune!

Sorry, this post is only loosely related to Matlab. At least I am going to publish the source for stable summation in the FEX.

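Snippet from the C-Mex implementation (_control87 is declared in <float.h>):
double GetSum1(double *X, double *Xf)
{
  // Sums from X up to, but excluding, Xf.
  double Sum = 0.0; // or: long double
  // Set the FPU precision control to a 64 bit mantissa:
  _control87(PC_64, MCW_PC); // LCC: _control87(_PC_64,_MCW_PC);
  for ( ; X < Xf; Sum += *X++) ; // empty loop
  // Back to the 53 bit mantissa of a double:
  _control87(PC_53, MCW_PC); // LCC: _control87(_PC_53,_MCW_PC);
  return (Sum);
}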

X = randn(1E7, 1);
tic; for i=1:10; v = sum(X); clear('v'); end; toc
tic; for i=1:10; v = mexsum(X); clear('v'); end; toc

Observations (1.5 GHz Pentium-M, Matlab 2009a, single-threaded):
- SUM: 0.94 sec
- LCC v2.4 (shipped with Matlab): 0.96 sec, same accuracy as SUM
- LCC v3.8: 0.66 sec, same accuracy as SUM.
  3 additional valid digits if "Sum" is a long double, but then 0.85 sec.
- Open Watcom 1.8: 0.70 sec
  3 additional valid digits, because the double "Sum" is actually accumulated in an 80 bit register (no further improvement with long double).
- MS VC++ 2008 Express: 0.50 sec
  The result is just 5% more accurate than SUM when compiled with /fp:fast
  and equal to SUM when compiled with /fp:precise (5% ?!? How can this be possible?).
  No differences between double and long double.
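
The effect of the wider accumulator can be tried without the precision control calls. This is only a sketch, not the FEX code: sum_double, sum_extended and extra_mantissa_bits are names made up for this example. It assumes that long double is wider than double, which holds for the 80 bit x87 format but not everywhere (compare the MS VC++ observation above).

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Plain accumulation in a double. */
double sum_double(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i;
    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}

/* Accumulate in long double and round to double once at the end. */
double sum_extended(const double *x, size_t n)
{
    long double sum = 0.0L;
    size_t i;
    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++) {
        sum += x[i];
    }
    return (double)sum;
}

/* Mantissa bits that long double offers beyond double.
   11 for the 80 bit x87 format, 0 if long double equals double. */
int extra_mantissa_bits(void)
{
    return LDBL_MANT_DIG - DBL_MANT_DIG;
}
```

11 extra bits correspond to about 3.3 decimal digits, which fits the 3 additional valid digits observed above. For the input {1e16, 1.0, -1e16}, sum_extended returns 1 whenever long double has at least a 64 bit mantissa, while an accumulation carried out strictly in double precision loses the 1.0.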

Jan