From: Jason Jury on
I built a mex version of "xcorr" that uses the FFTW library. I'm basically doing the same thing as Matlab "xcorr" (zero-pad 2 arrays to next pow-2, run fft on these arrays, multiply, run ifft on result). Within the C-program, I'm relying on "fftw_plan_dft_r2c_1d" and "...c2r..." to perform the computation. I'm not doing any multi-threading explicitly in the C-code, and I'm using Visual 2008 compiler.

I'm seeing a consistent speed improvement using my method over 2010A "xcorr" up to 2x for large arrays (5e5~2e6). This is done on Dell Latitude E6400 laptop with 2.5GHz Core2 Duo, 3.5GB of RAM. I suspect the reason is that the symmetry for real data is being used by my C program in the FFTW functions above, reducing comp's by ~1/2 (?). Using profiler on 2010A "xcorr" for large arrays, indeed nearly all the overhead is in the fft() and ifft() calls.

Some questions:
(1) is Matlab utilizing symmetry for real data when calculating FFT's, and if not why? (I recognize trade-offs between ease-of-use and horsepower)
(2) any other reasons why Matlab 2010A "xcorr" would run slower than a basic mex-compiled C version of the same? Perhaps I need to re-run after calling "fftw" in Matlab to get some "wisdom"--would this be a fairer comparison (even though I don't believe my C-code has done anything like this)?

Thanks in advance for any responses.
From: Rune Allnor on
On 9 apr, 19:02, "Jason Jury" <jason.c.j...(a)seagate.com> wrote:
> I built a mex version of "xcorr" that uses the FFTW library.  I'm basically doing the same thing as Matlab "xcorr" (zero-pad 2 arrays to next pow-2, run fft on these arrays, multiply, run ifft on result).  Within the C-program, I'm relying on "fftw_plan_dft_r2c_1d" and "...c2r..." to perform the computation.  I'm not doing any multi-threading explicitly in the C-code, and I'm using Visual 2008 compiler.  
>
> I'm seeing a consistent speed improvement using my method over 2010A "xcorr" up to 2x for large arrays (5e5~2e6).  This is done on Dell Latitude E6400 laptop with 2.5GHz Core2 Duo, 3.5GB of RAM.  I suspect the reason is that the symmetry for real data is being used by my C program in the FFTW functions above, reducing comp's by ~1/2 (?).  Using profiler on 2010A "xcorr" for large arrays, indeed nearly all the overhead is in the fft() and ifft() calls.
>
> Some questions:
> (1) is Matlab utilizing symmetry for real data when calculating FFT's, and if not why?  (I recognize trade-offs between ease-of-use and horsepower)

Matlab is using the FFTW library. The way matla data structures
are implemented one would assume that the real-valued versions
are used for real-valued data.

> (2) any other reasons why Matlab 2010A "xcorr" would run slower than a basic mex-compiled C version of the same?  

There is a lot of overhead inside matlab. It is notoriously slow
for anything but hardcore linear algebra. One can expect speed
improvements of a factor 10-50-100x as a matter of course, so
if matlab's FFT is only a factor 2 slower than your version, it's
in fact pretty fast.

But then, you have added a lot of unnecessary overhead by all that
power-of-2 nonsense at the start of your C routine. Loose that, and
the numbers might become more as one would expect.

Rune
From: Jason Jury on
Rune Allnor <allnor(a)tele.ntnu.no> wrote in message <b91e042d-53d8-46ce-956e-3e374337dbbf(a)g10g2000yqh.googlegroups.com>...
> Matlab is using the FFTW library. The way matla data structures
> are implemented one would assume that the real-valued versions
> are used for real-valued data.
>
> There is a lot of overhead inside matlab. It is notoriously slow
> for anything but hardcore linear algebra. One can expect speed
> improvements of a factor 10-50-100x as a matter of course, so
> if matlab's FFT is only a factor 2 slower than your version, it's
> in fact pretty fast.
>
> But then, you have added a lot of unnecessary overhead by all that
> power-of-2 nonsense at the start of your C routine. Loose that, and
> the numbers might become more as one would expect.
>
> Rune

Thanks for the note, Rune. I'm primarily concerned that Matlab2010A "xcorr" is slower than my "xcorr" though I was trying to replicate it as much as possible (e.g. using MEX to run mine within Matlab environment, zero-padding, using FFTW without any obvious tricks). This leads me to believe that Matlab "xcorr" is not optimized--I guess that's not uncommon. If I didn't bother to pad, ran the application outside Matlab as stand-alone compiled exe, etc., then I would expect more speed gains.

Part of my concern is that I'm interested in multi-thread and parallel processing to speed up either xcorr, and I'm wondering where the easiest gains could be had (investing in Matlab tools, creating my own mex tools, or some combination). Obviously the ultimate approach for speed is to do everything outside Matlab but I'm using it heavily for other things and I'm not a hard-core computer programmer.
From: Rune Allnor on
On 9 apr, 21:06, "Jason Jury" <jason.c.j...(a)seagate.com> wrote:

> Part of my concern is that I'm interested in multi-thread and parallel processing to speed up either xcorr, and I'm wondering where the easiest gains could be had

The terms "speed" and "easy" are contradictions in terms.
If you want eas of use, stick with matlab and pay the
performance penalty. If you want speed, be prepared to spend
a lot of time on very obscure details. Or a lot of $$$ on
somebody to take care of said obscure details for you.

Rune
From: kk KKsingh on
Just a small question how you are using FFTW with matlab can you show me example how to compiled FFTW with Matlab


"Jason Jury" <jason.c.jury(a)seagate.com> wrote in message <hpnmit$dfb$1(a)fred.mathworks.com>...
> I built a mex version of "xcorr" that uses the FFTW library. I'm basically doing the same thing as Matlab "xcorr" (zero-pad 2 arrays to next pow-2, run fft on these arrays, multiply, run ifft on result). Within the C-program, I'm relying on "fftw_plan_dft_r2c_1d" and "...c2r..." to perform the computation. I'm not doing any multi-threading explicitly in the C-code, and I'm using Visual 2008 compiler.
>
> I'm seeing a consistent speed improvement using my method over 2010A "xcorr" up to 2x for large arrays (5e5~2e6). This is done on Dell Latitude E6400 laptop with 2.5GHz Core2 Duo, 3.5GB of RAM. I suspect the reason is that the symmetry for real data is being used by my C program in the FFTW functions above, reducing comp's by ~1/2 (?). Using profiler on 2010A "xcorr" for large arrays, indeed nearly all the overhead is in the fft() and ifft() calls.
>
> Some questions:
> (1) is Matlab utilizing symmetry for real data when calculating FFT's, and if not why? (I recognize trade-offs between ease-of-use and horsepower)
> (2) any other reasons why Matlab 2010A "xcorr" would run slower than a basic mex-compiled C version of the same? Perhaps I need to re-run after calling "fftw" in Matlab to get some "wisdom"--would this be a fairer comparison (even though I don't believe my C-code has done anything like this)?
>
> Thanks in advance for any responses.