From: Martin Brown on
On Apr 19, 4:09 pm, John Larkin
<jjlar...(a)highNOTlandTHIStechnologyPART.com> wrote:
> On Sat, 19 Apr 2008 09:19:04 +0100, "Andrew Holme" <a...(a)nospam.co.uk>
> wrote:
>
> >"Antonio Pasini" <removethis_antonio.pas...(a)alice.it> wrote in message
> >news:480999E2.1010602(a)alice.it...
> >> On 19/04/2008 4:42, John Larkin wrote:
>
> >>> Does anybody have an estimate of how long it might take to do a
> >>> million-point FFT on a modern Intel PC? Google is no help.
>
> >>> Our data would be from a 16-bit ADC, but I get the impression that, on
> >>> an Intel machine, a floating-point fft might be faster than
> >>> fixed-point.

No contest.
>
> >>> John
>
> >> Have a look at http://www.fftw.org; it's one of the fastest free FFT
> >> libraries on PC. They maintain a huge benchmark suite.
>
> >I use FFTW.  A one-million point FFT in double precision arithmetic takes
> >172 ms on my 1.7 GHz Pentium.
>
> Cool! I had no idea if it was milliseconds or minutes. We'd only have
> to do it every second or so, so we have lots of time.

Worth pointing out here that if you have the option to choose your
array length then FFTs for exact powers of two are quite a bit faster
on almost all architectures.

And 2^20 = 1024^2 = 16^5, and radix 16 is probably the fastest
power-of-two FFT kernel in common use.

Radix-5 and radix-10 kernels, although very fast, can't compete with
the highly optimised radix 16, 8, 4 and 2 kernels. For 10^6 data
points, though, you will be limited by memory bandwidth and cache
behaviour to around 1000 Mflops (on a 3 GHz P4 box). For shorter
lengths that fit entirely in cache, a 1024-point FFT is roughly 2x
faster than the best 1000-point FFT.

You may also be able to steal a march by transforming a pair at the
same time with one complex transform and a diddle (there is a small
price to pay in crosstalk for this). Or by using a real to complex
conjugate symmetric version of the FFT (assuming your data is a real
valued sampled time series of some sort).
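
Roughly, the pair trick looks like this - just a sketch, with made-up
array names and length, using FFTW's complex interface for
concreteness:

/* Two real FFTs for the price of one complex FFT.  Pack x[] and y[]
   into the real and imaginary parts of one complex array, do a single
   complex transform, then separate the two spectra using conjugate
   symmetry.  N, x, y, X, Y are illustrative names. */
#include <complex.h>
#include <fftw3.h>

#define N 1048576   /* 2^20, say */

void two_real_ffts(const double *x, const double *y,
                   fftw_complex *X, fftw_complex *Y)
{
    fftw_complex *z = fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex *Z = fftw_malloc(sizeof(fftw_complex) * N);
    fftw_plan p = fftw_plan_dft_1d(N, z, Z, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int n = 0; n < N; n++)
        z[n] = x[n] + I * y[n];        /* interleave the two real signals */

    fftw_execute(p);

    for (int k = 0; k < N; k++) {
        int nk = (N - k) % N;          /* the "mirror" bin */
        X[k] =  0.5     * (Z[k] + conj(Z[nk]));   /* spectrum of x */
        Y[k] = -0.5 * I * (Z[k] - conj(Z[nk]));   /* spectrum of y */
    }

    fftw_destroy_plan(p);
    fftw_free(z);
    fftw_free(Z);
}

The crosstalk price is the small rounding leakage of one spectrum into
the other introduced by the separation step.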

> >See code and output below.
>
> >Time = 172 ms

FFTW is very good for this sort of thing.

Regards,
Martin Brown
From: Steven G. Johnson on
On Apr 19, 12:04 pm, Martin Brown <|||newspam...(a)nezumi.demon.co.uk>
wrote:
> Worth pointing out here that if you have the option to choose your
> array length then FFTs for exact powers of two are quite a bit faster
> on almost all architectures.

This is a bit of an old wives' tale. Powers of two have some
advantage in arithmetic cost over other highly composite sizes (powers
of 2/3/5), but it is not overwhelming. The main advantage of powers
of two is not intrinsic, but is simply because most FFT libraries are
mainly optimized for power-of-two sizes. In fact, because of limited
cache associativity and direct-mapped caches based on low-order
address bits, power-of-two sizes (where the subtransforms access data
at power-of-two strides) can sometimes have a decided *disadvantage*
for large sizes.

With a library like FFTW that is heavily optimized for non-power-of-
two sizes as well as for powers of two, the difference is fairly
small. With FFTW, it is often not worth it to increase a power-2/3/5
transform length to the next power-of-two size unless the length
increase is less than 10%. It is especially not worth it for multi-
dimensional transforms.
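
If you have a free choice of length, a tiny helper along these lines
(my own sketch, not part of FFTW) finds the nearest 2/3/5-smooth size
at or above a given n, so you can pad to that instead of jumping all
the way to the next power of two:

#include <stdio.h>

/* Returns nonzero if n has no prime factors other than 2, 3 and 5. */
static int is_235_smooth(long n)
{
    while (n % 2 == 0) n /= 2;
    while (n % 3 == 0) n /= 3;
    while (n % 5 == 0) n /= 5;
    return n == 1;
}

/* Smallest 2/3/5-smooth length >= n. */
long next_fast_len(long n)
{
    while (!is_235_smooth(n))
        n++;
    return n;
}

int main(void)
{
    printf("%ld\n", next_fast_len(4097));   /* 4320, well short of 8192 */
    return 0;
}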

For example, on my 2.6GHz Intel Core 2 Duo with gcc 4.1 and FFTW 3.2
in 64-mode for double precision with SSE2, size 800 is faster than
size 1024, sizes 3600 and 3840 are faster than size 4096, and sizes
25000 and 30000 are faster than 32768.
(You can try it yourself with FFTW's self-test program:
tests/bench -opatient 800 1024 3600 3840 4096 25000 30000 32768)

> 1000Mflops (on a 3GHz P4 box). On shorter lengths fitting entirely in
> cache a 1024 FFT is roughly 2x faster than the best 1000 FFT.

On my abovementioned 2.6GHz machine, the difference is closer to 25%
with FFTW 3.2. In double precision, size 1024 takes 7.84us, while
size 1000 takes 9.92us.

(And with regards to the original poster, size 1000000 is actually
sometimes faster than size 2^20.)

Regards,
Steven G. Johnson
From: Steven G. Johnson on
On Apr 19, 4:19 am, "Andrew Holme" <a...(a)nospam.co.uk> wrote:
> >> Does anybody have an estimate of how long it might take to do a
> >> million-point FFT on a modern Intel PC? Google is no help.
>
> I use FFTW. A one-million point FFT in double precision arithmetic takes
> 172 ms on my 1.7 GHz Pentium.

Actually, you should be able to do significantly better than that with
FFTW on a modern Intel machine (with SSE/SSE2 enabled). Especially if
you are doing lots of million-point FFTs and can afford to wait a few
minutes one time to run the planner with FFTW_PATIENT (and then save
the resulting plan with fftw_export_wisdom).
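
In rough outline it looks like this (a sketch only - the file name and
length are placeholders, but the calls are the standard FFTW 3 ones):

#include <stdio.h>
#include <fftw3.h>

int main(void)
{
    const int N = 1000000;
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * N);

    /* Reuse previously saved wisdom, if any, so planning is instant. */
    FILE *f = fopen("fft_wisdom.dat", "r");
    if (f) { fftw_import_wisdom_from_file(f); fclose(f); }

    /* FFTW_PATIENT can take minutes the first time; the plan it finds
       is then reused for every subsequent transform. */
    fftw_plan p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_PATIENT);

    f = fopen("fft_wisdom.dat", "w");
    if (f) { fftw_export_wisdom_to_file(f); fclose(f); }

    /* ... fill 'in' from the ADC each second, then: */
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}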

On my 2.6GHz Intel Core 2 Duo, in 64-bit mode, in double precision,
with FFTW 3.2alpha3 in FFTW_PATIENT mode, a million-point FFT takes
43ms, and in single precision (which should be sufficient for the
original poster's 16-bit ADC) a million-point FFT takes 22ms. Using
two CPUs, the million-point single-precision FFT takes 13ms.
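
For the two-CPU number, the only extra step is to turn on the threads
library before planning - again just a sketch, assuming a threaded
FFTW build (link with -lfftw3_threads -lpthread):

#include <fftw3.h>

fftw_plan plan_million_point_2cpu(fftw_complex *in, fftw_complex *out)
{
    fftw_init_threads();          /* once, before creating any plans */
    fftw_plan_with_nthreads(2);   /* subsequent plans use 2 threads  */
    return fftw_plan_dft_1d(1000000, in, out, FFTW_FORWARD, FFTW_PATIENT);
}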

Another poster suggested padding to 2^20 = 1048576 (the next power of
2), but on my machine that is actually slightly slower: it takes 49ms
in double precision and 26ms in single precision.

Regards,
Steven G. Johnson
From: Martin Brown on
On Apr 21, 5:03 am, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> On Apr 19, 4:19 am, "Andrew Holme" <a...(a)nospam.co.uk> wrote:
>
> > >> Does anybody have an estimate of how long it might take to do a
> > >> million-point FFT on a modern Intel PC? Google is no help.
>
> > I use FFTW.  A one-million point FFT in double precision arithmetic takes
> > 172 ms on my 1.7 GHz Pentium.
>
> Actually, you should be able to do significantly better than that with
> FFTW on a modern Intel machine (with SSE/SSE2 enabled).

I must have another look at FFTW then. I haven't seen the version that
exploits SSE2. And I did notice in the past that certain Microsoft
compilers made a poor fist of optimising the FFTW machine-generated
kernels. Which compiler did you use for these benchmarks?

> On my 2.6GHz Intel Core 2 Duo, in 64-bit mode, in double precision,
> with FFTW 3.2alpha3 in FFTW_PATIENT mode, a million-point FFT takes
> 43ms, and in single precision (which should be sufficient for the
> original poster's 16-bit ADC) a million-point FFT takes 22ms.  Using
> two CPUs, the million-point single-precision FFT takes 13ms.
>
> Another poster suggested padding to 2^20 = 1048576 (the next power of
> 2), but on my machine that is actually slightly slower: it takes 49ms
> in double precision and and 26ms in single precision.

Can't argue with these hard numbers or a famous FFT guru. Thinking
about it after seeing your other post, I concluded that for
cache-bandwidth-limited single transforms you could well be right,
since the 5% increase in length would make for at least a 5% increase
in absolute timing. I am a bit surprised, though, that an optimised
2^N transform cannot beat a radix-10 one in terms of time per element
processed - the overheads in radix 5 are quite a bit higher. Does the
saved wisdom provide or export a human-readable version of the actual
radix kernels used and their order of application?

I thought the problem of slower large 2^N transforms was confined to a
particularly unfortunate vintage IBM workstation design, but
apparently not. Thanks for posting the numbers.

I confess genuine surprise that 2^N no longer reigns supreme (if only
by a small margin) when you have a free choice of very long transform
length. You have to concede though that 2^N is a lot simpler to code.

Regards,
Martin Brown
From: Martin Brown on
On Apr 20, 8:24 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> On Apr 19, 12:04 pm, Martin Brown <|||newspam...(a)nezumi.demon.co.uk>
> wrote:
>
> > Worth pointing out here that if you have the option to choose your
> > array length then FFTs for exact powers of two are quite a bit faster
> > on almost all architectures.
>
> This is a bit of an old wives' tale.

I still believed it when I posted my reply. And I believe it is still
true provided that the array fits cleanly into cache.

>  Powers of two have some
> advantage in arithmetic cost over other highly composite sizes (powers
> of 2/3/5), but it is not overwhelming.  The main advantage of powers
> of two is not intrinsic, but is simply because most FFT libraries are
> mainly optimized for power-of-two sizes.  In fact, because of limited
> cache associativity and direct-mapped caches based on low-order
> address bits, power-of-two sizes (where the subtransforms access data
> at power-of-two strides) can sometimes have a decided *disadvantage*
> for large sizes.

Interesting. That makes sense of the numbers you posted for 10^6 vs
2^20. I had not run into this problem with longer transforms (or
hadn't noticed it).
>
> With a library like FFTW that is heavily optimized for non-power-of-
> two sizes as well as for powers of two, the difference is fairly
> small.  With FFTW, it is often not worth it to increase a power-2/3/5
> transform length to the next power-of-two size unless the length
> increase is less than 10%.  It is especially not worth it for multi-
> dimensional transforms.

Agreed absolutely.
>
> For example, on my 2.6GHz Intel Core 2 Duo with gcc 4.1 and FFTW 3.2
> in 64-mode for double precision with SSE2, size 800 is faster than
> size 1024, sizes 3600 and 3840 are faster than size 4096, and sizes
> 25000 and 30000 are faster than 32768.

The old code I have can only do radix 2, 3 and 5 (ex radio astronomy).
In 1-D benchmarks, 625, 640 and 768 are faster than 1024, and 3072 and
3125 are faster than 4096. I don't have figures to hand for anything
bigger. And obviously for 2-D the space savings can make other lengths
competitive.

> (And with regards to the original poster, size 1000000 is actually
> sometimes faster than size 2^20.)

Indeed. There must be something wrong with the old radix5/10 kernel I
use.

Regards,
Martin Brown