From: Jean-Francois on 11 Mar 2010 14:16

> // sum1.c, calculates the equivalent of sum(R<k,1)
> // called as sum1(R,k)
> #include "mex.h"
> void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
> {
>     mwSize i, j, m, n;
>     register mwSize s;
>     double *pr, *R;
>     double k;
>
>     m = mxGetM(prhs[0]);
>     n = mxGetN(prhs[0]);
>     R = mxGetPr(prhs[0]);
>     k = mxGetScalar(prhs[1]);
>     plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL);
>     pr = mxGetPr(plhs[0]);
>     for( j=0; j<n; j++ ) {
>         s = 0;
>         for( i=0; i<m; i++ ) {
>             if( (*R++) < k ) {
>                 ++s;
>             }
>         }
>         *pr++ = s;
>     }
> }

I tried using sum1.c, but saw no significant performance improvement over Matlab's sum. Let me point out that in performing sum(R<k), ~66% of the time is spent on (R<k) and the rest on the summation itself, which confirms what you said above about the overhead. Is there any way to make the comparison between R and k more efficient, either in Matlab or in a mex file?
From: Jan Simon on 11 Mar 2010 15:04

Dear Jean-Francois!

> > // sum1.c, calculates the equivalent of sum(R<k,1)
> > // called as sum1(R,k)
> I tried using sum1.c, but saw no significant performance improvement over Matlab's sum. Let me point out that in performing sum(R<k), ~66% of the time is spent on (R<k) and the rest on the summation itself, ...

I'm confused. sum1.c includes the R<k comparison already, so if 66% of the time of SUM(R<k,1) is spent on R<k, then I expect sum1.c to be faster. Can you show us your speed comparison in some detail? Which compiler do you use?

You can distribute the calculations for the different columns to different threads if your computer has multiple cores. E.g.:
http://www.mathworks.com/matlabcentral/fileexchange/21233
shows how this can be programmed.

Kind regards, Jan
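A generic sketch of that idea (not the code from the File Exchange link above), using POSIX threads and the variable names from sum1.c: each thread is given its own block of columns, so the writes to the output never overlap. On Windows this would need a pthreads port or native threads instead.

    /* Worker: counts elements < k in a block of columns. */
    #include <pthread.h>
    #include "mex.h"

    typedef struct {
        const double *R;    /* first column of this thread's block     */
        double *out;        /* matching slice of the output row vector */
        mwSize m, ncols;    /* rows per column, columns in this block  */
        double k;
    } block_t;

    static void *count_block(void *arg)
    {
        block_t *b = (block_t *) arg;
        mwSize i, j;
        for (j = 0; j < b->ncols; j++) {
            double s = 0.0;
            for (i = 0; i < b->m; i++) {
                if (b->R[j * b->m + i] < b->k) s += 1.0;
            }
            b->out[j] = s;
        }
        return NULL;
    }

    /* Inside mexFunction, after plhs[0] and pr have been set up for n columns: */
    pthread_t t;
    mwSize half = n / 2;
    block_t a = { R,            pr,        m, half,     k };
    block_t b = { R + half * m, pr + half, m, n - half, k };
    pthread_create(&t, NULL, count_block, &a);  /* first half in a second thread */
    count_block(&b);                            /* second half in this thread    */
    pthread_join(t, NULL);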
From: Oliver Woodford on 11 Mar 2010 17:43

"James Tursa" wrote:
> plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL);
> pr = mxGetPr(plhs[0]);
> for( j=0; j<n; j++ ) {
>     s = 0;
>     for( i=0; i<m; i++ ) {
>         if( (*R++) < k ) {
>             ++s;
>         }
>     }
>     *pr++ = s;
> }

James, I'm surprised a speed demon like you is using mxCreateDoubleMatrix like that. Since it sets the matrix to zero first, it drags the whole matrix through the cache once before you even write to it. Since you then set every entry later, you don't need to do that. See a discussion on the subject, and a solution, here:
http://wwwuser.gwdg.de/~mleuten/MATLABToolbox/CmexWrapper.html

Oliver
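For reference, a minimal sketch of the kind of trick the linked page discusses, applied to sum1.c: create the output as an empty array and attach an mxMalloc'd buffer to it, so that (in principle) nothing gets zero-initialized before the loop fills every element. Whether the buffer really comes back unzeroed is exactly what James questions below.

    /* Sketch: replaces the two allocation lines in sum1.c.
       mxCreateDoubleMatrix(0,0,...) allocates no data and zeroes nothing;
       the data buffer is attached afterwards with mxSetPr.               */
    plhs[0] = mxCreateDoubleMatrix(0, 0, mxREAL);
    mxSetM(plhs[0], 1);
    mxSetN(plhs[0], n);
    mxSetPr(plhs[0], (double *) mxMalloc(n * sizeof(double)));
    pr = mxGetPr(plhs[0]);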
From: Oliver Woodford on 11 Mar 2010 17:56

"Jan Simon" wrote:
> You can distribute the calculations for the different columns to different threads if your computer has multiple cores. E.g.:
> http://www.mathworks.com/matlabcentral/fileexchange/21233
> shows how this can be programmed.

For this simple iteration over each column, with no dependencies between each one, the easiest way (IMHO) to parallelize this is with an OpenMP pragma before the first (outer) for loop. And it's completely platform independent.

Oliver
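A minimal sketch of what that could look like for the loop in sum1.c; the shared running pointer *R++ has to be replaced by per-column indexing, and the loop index made signed because some OpenMP implementations require it. The compile flags shown are the usual gcc ones and may differ for other compilers.

    /* Sketch: parallel replacement for the column loop in sum1.c.
       Compile with OpenMP enabled, e.g. with gcc:
         mex CFLAGS="\$CFLAGS -fopenmp" LDFLAGS="\$LDFLAGS -fopenmp" sum1.c */
    #include <omp.h>      /* add at the top of the file */

    mwSignedIndex j;      /* signed loop index for OpenMP */
    #pragma omp parallel for
    for (j = 0; j < (mwSignedIndex) n; j++) {
        const double *col = R + j * m;   /* column j, instead of the shared R++ */
        mwSize i, s = 0;
        for (i = 0; i < m; i++) {
            if (col[i] < k) ++s;
        }
        pr[j] = (double) s;              /* disjoint writes, no locking needed */
    }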
From: James Tursa on 11 Mar 2010 18:47
"Oliver Woodford" <o.j.woodford.98(a)cantab.net> wrote in message <hnbrlo$6u1$1(a)fred.mathworks.com>... > "James Tursa" wrote: > > plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL); > > pr = mxGetPr(plhs[0]); > > for( j=0; j<n; j++ ) { > > s = 0; > > for( i=0; i<m; i++ ) { > > if( (*R++) < k ) { > > ++s; > > } > > } > > *pr++ = s; > > } > > James, I'm surprised a speed demon like you is using mxCreateDoubleMatrix like that. Since it sets the matrix to zero first it drags the whole matrix through the cache once, before you even write to it. Since you then set every entry later you don't need to do that. See a discussion on the subject, and solution, here: > http://wwwuser.gwdg.de/~mleuten/MATLABToolbox/CmexWrapper.html > > Oliver Yes, I am aware of that technique but hadn't thought of it for this application. Thanks for pointing it out. I don't know how much difference it will make, however, but I will try to run some tests later. I have made previous attempts at speed improvement tests like this using mxMalloc (I was attempting to create a fast C mex preallocation routine that didn't zero out the memory) but have noticed that mxMalloc (or a previous mxFree) *always* zeroes out the memory. I made this conclusion after allocating very large blocks, setting all the elements to non-zero, freeing the memory with mxFree, and then immediately using mxMalloc again to grab the exact same block of memory and noting that all the memory is suddenly zeroed out. Don't know if it is mxMalloc or mxFree that is doing it. Bottom line is using mxCreateDoubleMatrix with 1 x n vs creating 0 x 0 and then mxMalloc for the memory may *still* drag the memory through the cache once to get the memory set to zero and you may not have any control over this. This may be a security design in the MATLAB memory manager, or something else. I will quite frankly admit that I don't fully understand this behavior yet & who is doing the zeroing, and don't know all the implications with regards to cache memory, etc. James Tursa |