From: Oliver Woodford on
"Jan Simon" wrote
> Although your arguments are convincing and I expected a remarkable acceleration too, I cannot reproduce any advantage of your method under Matlab 6.5, 7.7 and 7.8 using different compilers, e.g. OpenWatcom 1.8 and MSVC 2008 Express.
> Is it necessary to start Matlab with a specific memory manager?

I wouldn't call them my arguments or my approach - they're Marcel Leutenegger's. I must confess I haven't tested the theory much myself - I always took it as read. Perhaps the more fool me, given your and James' experience. Which does raise the question of why MATLAB insists on setting the buffer to zeros even when you don't need it to - essentially (based on what James has said) there seems to be little practical difference between mxMalloc and mxCalloc.

Oliver
From: Jean-Francois on
"Jan Simon" <matlab.THIS_YEAR(a)nMINUSsimon.de> wrote in message <hnbic6$7lb$1(a)fred.mathworks.com>...
> Dear Jean-Francois!
>
> > > // sum1.c, calculates the equivalent of sum(R<k,1)
> > > // called as sum1(R,k)
>
> > I tried using sum1.c, with no significantly improved performance over using Matlab's sum. Let me point out that in performing sum(R<k), ~66% of the time is spent on (R<k), and the rest on the summation itself, ...
>
> I'm confused. sum1.c includes the R<k comparison already, so if 66% of the time of SUM(R<k,1) is spent on R<k, then I expect sum1.c to be faster. Can you show us your speed comparison with some details?
> Which compiler do you use?
>
> You can distribute the calculations for the different columns to different threads, if your computer has multiple cores. E.g.:
> http://www.mathworks.com/matlabcentral/fileexchange/21233
> shows how this can be programmed.
>
> Kind regards, Jan

*************************
I'm using MSVC C++ Express. Which compiler do you recommend?

By the way, aren't Matlab's basic functions (SUM, TIMES, etc.) already using the system's multithreading capabilities?

So, I split SUM(R<k,1) into Q=(R<k) and SUM(Q,1) to see if the bottleneck arises from the comparison or from the summation. For R a 2000x200 matrix, I ran 122000 repetitions, and total times were:

- Q=(R<k) --> 73.259s
- SUM(Q,1) --> 37.812s

Hopefully, I answered your question. Do you have any suggestion at this stage? Thanks.
From: James Tursa on
"James Tursa" <aclassyguy_with_a_k_not_a_c(a)hotmail.com> wrote in message <hnbvdm$3i2$1(a)fred.mathworks.com>...
> "Oliver Woodford" <o.j.woodford.98(a)cantab.net> wrote in message <hnbrlo$6u1$1(a)fred.mathworks.com>...
> > "James Tursa" wrote:
> > > plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL);
> > > pr = mxGetPr(plhs[0]);
> > > for( j=0; j<n; j++ ) {
> > >     s = 0;
> > >     for( i=0; i<m; i++ ) {
> > >         if( (*R++) < k ) {
> > >             ++s;
> > >         }
> > >     }
> > >     *pr++ = s;
> > > }
> >
> > James, I'm surprised a speed demon like you is using mxCreateDoubleMatrix like that. Since it sets the matrix to zero first it drags the whole matrix through the cache once, before you even write to it. Since you then set every entry later you don't need to do that. See a discussion on the subject, and solution, here:
> > http://wwwuser.gwdg.de/~mleuten/MATLABToolbox/CmexWrapper.html
> >
> > Oliver
>
> Yes, I am aware of that technique but hadn't thought of it for this application. Thanks for pointing it out. I don't know how much difference it will make, however, but I will try to run some tests later.
>
> I have made previous attempts at speed improvement tests like this using mxMalloc (I was attempting to create a fast C mex preallocation routine that didn't zero out the memory) but have noticed that mxMalloc (or a previous mxFree) *always* zeroes out the memory. I came to this conclusion after allocating very large blocks, setting all the elements to non-zero, freeing the memory with mxFree, and then immediately using mxMalloc again to grab the exact same block of memory, noting that all the memory was suddenly zeroed out. I don't know whether it is mxMalloc or mxFree that is doing it. Bottom line: using mxCreateDoubleMatrix with 1 x n, vs creating 0 x 0 and then mxMalloc for the memory, may *still* drag the memory through the cache once to get it set to zero, and you may not have any control over this. This may be a security design in the MATLAB memory manager, or something else. I will quite frankly admit that I don't fully understand this behavior yet or who is doing the zeroing, and don't know all the implications with regards to cache memory, etc.
>
> James Tursa

I ran some spot checks using the raw mxMalloc method and got no discernible timing difference compared to the original method. I used various versions from R2006b to R2009b with lcc and MSVC8, so at least based on these timings I don't see any advantage. It *was* interesting to see how the sum calculation has changed from R2006b to R2009b, however. Using R=rand(3000) and k=0.5, the sum(R<k,1) calculation took about 0.20 seconds in R2006b. As the MATLAB version increases, the timing of this calculation drops significantly, partly because of the move to multi-threading and partly because of more efficient coding/compiling. At R2009b the same calculation takes 0.04 seconds ... quite an improvement over the R2006b runtime of 0.20 seconds. The equivalent calculation in a C mex routine compiled with MSVC8 was consistently 0.04 seconds for all MATLAB versions I tested.

The other thing I did was re-test the mxMalloc zeroing stuff. Re-confirmed that something was zeroing out the memory between mxFree and subsequent mxMalloc calls. Then I did the same test with malloc and free and got the same results ... something was zeroing out the memory. Then I ran similar test code for malloc and free on an Alpha mainframe and a Sun workstation and those machines did *not* zero out the memory between calls. So maybe this is a Windows security thing and not related to MATLAB or C memory managers at all.

James Tursa
From: Jean-Francois on
"James Tursa" <aclassyguy_with_a_k_not_a_c(a)hotmail.com> wrote in message <hne4it$6k0$1(a)fred.mathworks.com>...
> [... earlier messages quoted in full; snipped ...]

******************
So now I use a parfor loop instead of cellfun to call the function TRSHLD from post 12, with a 50% increase in speed. On top of that, using sum1.c instead of Matlab's sum inside TRSHLD gives an additional 33% gain, for an overall gain of 66%. On the other hand, when I used cellfun there was no noticeable difference between sum1.c and Matlab's sum. Is there an explanation for this?

Also, what is meant in Rune's post (#4) by 'storing vectors sequentially' and 'optimizing the compiler'? I use MSVC++, I guess with default settings. What should I do?
From: Jan Simon on
Dear Jean-Francois!

> Also, what is meant in Rune's post (#4) by 'storing vectors sequentially' and 'optimizing the compiler'? I use MSVC++, I guess with default settings. What should I do?

Storing vectors sequentially has the advantage that the function can access neighbouring elements. Neighbouring elements can be accessed faster, because it is "easier" to fetch them from RAM. E.g. the summation over rows of a matrix is slower than over columns:
x = rand(1000);
tic; for i=1:100; v = sum(x, 1); end; toc  % fast
tic; for i=1:100; v = sum(x, 2); end; toc  % slow
Transposing the input first is usually not recommended, because the TRANSPOSE itself suffers from accessing elements which are not neighbouring:
tic; for i=1:100; v = sum(transpose(x), 1); end; toc  % slow also
So it is recommended to create the arrays such that their orientation allows sequential access (the example here is contrived, but it is just a demonstration):
x = transpose(x); tic; for i=1:100; v = sum(x, 1); end; toc  % fast

You can try setting the /arch:SSE2 flag in the mexopts.bat file in the folder PREFDIR. This sometimes helps.

Kind regards, Jan