From: Jean-Francois
"us " <us(a)neurol.unizh.ch> wrote in message <hmre7f$hri$1(a)fred.mathworks.com>...
> "Jean-Francois " <jkagy(a)princeton.edu> wrote in message <hmrdko$b6n$1(a)fred.mathworks.com>...
> > Hi,
> >
> > I'm stuck with the following problem:
> >
> > I have a large matrix R, say of size 2000 by 500, and a threshold value k. For each row, I seek the number of elements that are below k. Hence, the output would be a 2000 by 1 vector.
> >
> > So far, I've used "Q = SUM( R < k , 2 )", which proved to be much faster than everything else I could think of, e.g. HISTC or (obviously) SORT+DIFF. A single calculation is on the order of milliseconds, so it wouldn't seem like a big deal. However, in the context of a Monte Carlo simulation, that summation turns out to be a huge bottleneck (> 60% of the run time); hence, even the slightest increase in efficiency would be very valuable.
> >
> > Thanks for your suggestion.
>
> before we get started: two important questions...
> - does this run in a loop
> - does your K change, eg, in the loop or according to some rule...
>
> us

Yes, it does. In fact, the whole code is as follows:

***************************************
function d = trshld(R,bins,k,p)

[ndraw, n] = size(R); Nb = length(bins)-1;
Bins_l = repmat(bins(1:end-1),[ndraw 1]);   % lower bin edges, one row per draw
Bins_h = repmat(bins(2:end),[ndraw 1]);     % upper bin edges
d = 0;
for j = 1:length(k)
    Q = sum((R < k(j)),2)/n;    % fraction of elements below k(j), per row
    Q = repmat(Q,[1 Nb]);
    Ql = (Q > Bins_l);          % strictly above the lower edge
    Qh = ~(Q > Bins_h);         % at or below the upper edge
    d = d + p(j)*sum(Ql.*Qh,1); % count of rows landing in each bin
end
d = d/ndraw;

***************************************
Basically, the code receives a vector 'k'. For each element of 'k', it first calculates, for each row of 'R', the fraction of elements that fall below k(j). Then it assigns the resulting fractions to pre-specified bins. 'p' is just a weight vector and is innocuous. There is a huge bottleneck at 'Q = sum((R < k(j)),2)/n', and another, somewhat smaller, at 'Q = repmat(Q,[1 Nb])'. Should the whole code go to a mex file, or parts of it? Thanks for your answer.
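
P.S. In case it helps, here is an untested sketch of the same calculation with the REPMAT binning replaced by HISTC (the name trshld_histc is just a placeholder). One caveat: HISTC bins are closed on the left, [bins(i), bins(i+1)), while the test above is closed on the right, (bins(i), bins(i+1)], so fractions that land exactly on a bin edge are counted differently.

***************************************
function d = trshld_histc(R,bins,k,p)

[ndraw, n] = size(R); Nb = length(bins)-1;
d = zeros(1,Nb);
for j = 1:length(k)
    Q = sum((R < k(j)),2)/n;    % fraction of elements below k(j), per row
    c = histc(Q,bins);          % c(i) counts bins(i) <= Q < bins(i+1); c(end) counts Q == bins(end)
    c = c(:).';                 % force a row vector
    c(Nb) = c(Nb) + c(Nb+1);    % fold Q == bins(end) into the last bin
    d = d + p(j)*c(1:Nb);
end
d = d/ndraw;

***************************************
This avoids allocating the ndraw-by-Nb arrays Q, Ql and Qh on every pass through the loop.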
From: James Tursa
"Jean-Francois " <jkagy(a)princeton.edu> wrote in message <hn6n5a$omh$1(a)fred.mathworks.com>...
>
> Should the whole code go to a mex file, or parts of it? Thanks for your answer.

Have you tried the provided mex code yet?

James Tursa
From: Jean-Francois
"James Tursa" <aclassyguy_with_a_k_not_a_c(a)hotmail.com> wrote in message <hn6qdo$fv9$1(a)fred.mathworks.com>...
> "Jean-Francois " <jkagy(a)princeton.edu> wrote in message <hn6n5a$omh$1(a)fred.mathworks.com>...
> >
> > Should the whole code go to a mex file, or parts of it? Thanks for your answer.
>
> Have you tried the provided mex code yet?
>
> James Tursa

With R a 2000 x 200 matrix and 14,000 runs, I got the following times:

1. sum(R<k,2) --> 17.515 s
2. sum(R<k,1) --> 12.545 s
3. mex code from post 6 --> 15.810 s
4. mex code from post 9 --> 20.341 s

I am not familiar with C, so I am still figuring out how to traverse columns as was suggested previously. Also, I'm not sure I understood correctly, but does traversing rows vs. columns make a difference in C, as it does in Matlab? Thanks all for the help.
From: Walter Roberson
Jean-Francois wrote:

> I am not familiar with C, so I am still figuring out how to traverse
> columns as was suggested previously. Also, I'm not sure I understood
> well, but does traversing rows vs columns make a difference, as it does
> in Matlab? Thanks all for the help.

Yes, in C (and in Matlab) it can make an enormous difference in the timings!
Whether the difference is major or minor depends on the quality of the
compiler, the details of the processor instruction set, and the size of the
matrix relative to the size of the processor's primary cache.

In general, processors are designed under the assumption that values that are
close together in the address space will be accessed fairly close together in
time. So when a value is read from (relatively slow) main memory into the
processor, a group of adjacent values is read in at the same time and held in
very fast memory in, or "very near", the CPU. When the CPU then goes to access
one of those nearby values, it is read from that very fast memory instead of
going through the slow interface to the main memory subsystem.

But there is a limit to how much fast memory is available, so when you go to
read in a new value, the fast-memory subsystem (the "primary cache") needs to
discard something (or write it out to main memory if needed) to make room.
There are different techniques for deciding what gets discarded. One commonly
used technique is to divide main memory into columns ("cache lines" is the
technical term): when a new item is read in, the item discarded or written
out is the old one that happened to come from the same column of memory. Thus
if you are handling an item at address R, and the processor then needs an item
from address R + k*S, where k is an integer and S is a fixed size dependent on
the system configuration, the item at R is discarded and replaced by the item
at R + k*S. Which is all well and fine unless your array stride happens to be
a multiple of S: if A(i,j) is a multiple of S away from A(i+1,j), you want to
avoid accesses that step along the first dimension, and if A(i,j) is a
multiple of S away from A(i,j+1), you want to avoid accesses that step along
the second dimension.

In Matlab, A(i+1,j) is always immediately adjacent to A(i,j) in memory
(column-major storage); in C, A(i,j+1) is always immediately adjacent to
A(i,j) (row-major storage). So in C you want the inner loop to vary the last
(column) index, and in Matlab the first (row) index -- because if you mix up
the orders, there is a chance that the hardware architecture will have to
throw A(i,j) out of fast memory in order to access the new value.
From: James Tursa
"Jean-Francois " <jkagy(a)princeton.edu> wrote in message <hn97se$b6$1(a)fred.mathworks.com>...
> "James Tursa" <aclassyguy_with_a_k_not_a_c(a)hotmail.com> wrote in message <hn6qdo$fv9$1(a)fred.mathworks.com>...
> > "Jean-Francois " <jkagy(a)princeton.edu> wrote in message <hn6n5a$omh$1(a)fred.mathworks.com>...
> > >
> > > Should the whole code go to a mex file, or parts of it? Thanks for your answer.
> >
> > Have you tried the provided mex code yet?
> >
> > James Tursa
>
> With R a 2000 x 200 matrix and 14,000 runs, I got the following times:
>
> 1. sum(R<k,2) --> 17.515 s
> 2. sum(R<k,1) --> 12.545 s
> 3. mex code from post 6 --> 15.810 s
> 4. mex code from post 9 --> 20.341 s
>
> I am not familiar with C, so I am still figuring out how to traverse columns as was suggested previously. Also, I'm not sure I understood correctly, but does traversing rows vs. columns make a difference in C, as it does in Matlab? Thanks all for the help.

Traversing rows vs. columns may or may not make a big difference ... it depends on how the algorithm is coded and potentially on the machine architecture. For the C mex routines, there will likely be very little difference between the timing of the post 6 sum2.c code, which sums along the rows, and the sum1.c code posted below, which sums along the columns. That is because both are coded so that they do not create a large intermediate array (as MATLAB does for the R<k calculation) and they traverse the R array only once. The only important difference is the allocation of the s array and the multiple passes over it in the sum2.c code, whereas in sum1.c a single scalar s is sufficient. That will in all likelihood cost sum2.c only a little time compared to sum1.c.

I would expect both to outperform the MATLAB code you have shown because of the overhead savings ... the amount of which depends on the computer, the MATLAB version, etc. If MATLAB's behind-the-scenes algorithm traverses the R array multiple times, you could easily see the run times double in comparison to the C code; but if MATLAB uses good algorithms, you may only see a small (e.g., 10% - 15%) improvement. I don't think you will be able to improve the speed of this calculation beyond what these C routines give you.

Of course, a better C compiler can sometimes make more efficient use of registers etc., particularly for looping code such as this. I have seen cases where Microsoft Visual C/C++ outperformed the supplied lcc compiler by a factor of 2-3 in speed. If you do any C work like this at all, IMO it is worth investing in a good C compiler.

James Tursa

// sum1.c, calculates the equivalent of sum(R<k,1)
// called as sum1(R,k)
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize i, j, m, n;
    register mwSize s;
    double *pr, *R;
    double k;

    m = mxGetM(prhs[0]);                /* number of rows of R */
    n = mxGetN(prhs[0]);                /* number of columns of R */
    R = mxGetPr(prhs[0]);
    k = mxGetScalar(prhs[1]);
    plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL);
    pr = mxGetPr(plhs[0]);
    for( j=0; j<n; j++ ) {              /* one pass per column ...            */
        s = 0;
        for( i=0; i<m; i++ ) {          /* ... advancing through R in storage order */
            if( (*R++) < k ) {
                ++s;
            }
        }
        *pr++ = s;                      /* count of elements < k in column j */
    }
}
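
To use it, compile once with mex (assuming you have a compiler configured via mex -setup) and then call it like the builtin. A quick check against sum, with timings that will of course vary by machine and compiler:

mex sum1.c
R = rand(2000, 200);
k = 0.5;
tic; q1 = sum(R < k, 1); toc
tic; q2 = sum1(R, k);    toc
isequal(q1, q2)    % should display 1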