From: mike3 on
On Sep 20, 6:52 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> The performance for an FFT this large is usually determined more by
> memory access patterns; assembly isn't likely to help you much.
> (Note also that if your data are real then you can save at least
> another factor of 2; the number I quoted was for complex data.)

So why would the matrix approach be so slow (slower than the
radix-2 recursion!)?

By the way, I also tested my small-FFT code, that is, the routine used
for the single rows of the matrix (or after the radix-2 recursion, once
the subproblem is small enough to fit in cache). It is an iterative
loop, and for a 4096-element-length transform (complex elements) it got
around 1730 "mflops" (using the formula posted on the site you
mentioned, the one used to compute the "mflops" ratings of the
different FFT codes), with no assembly, the GCC compiler, and a 2.4 GHz
Intel Core 2 Quad Q6600 CPU. Note that this CPU is not listed among the
benchmarks on that site, but it has a lower GHz rating than one that is
(the Core 2 Duo at 3 GHz). Note also that this code uses precomputed
trig tables and does not scramble the data ("bit reversal", "radix
permutation"), as that is not necessary for the application at hand
(convolution of long sequences: one can instead use DIF followed by
DIT, and the scrambles cancel out).
From: Steven G. Johnson on
On Sep 21, 4:00 am, mike3 <mike4...(a)yahoo.com> wrote:
> So why would the matrix approach be so slow (slower than the
> radix-2 recursion!)?

I can't optimize your code for you remotely. The devil is in the
details with these things; the reference I already gave you lists some
general principles regarding efficient transposition, efficient trig
tables for large matrix sizes, etcetera. Good luck!
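[One of those general principles, cache-blocked (tiled) transposition,
can be sketched as follows. The tile size BLK is an illustrative guess
to be tuned for the machine at hand, not a value from the reference:]

```c
#include <complex.h>
#include <stddef.h>

#define BLK 32  /* tile edge length; tune to the cache in question */

/* Out-of-place transpose of an n-by-n row-major matrix of complex
   doubles, done tile by tile so each tile's reads and writes stay
   resident in cache instead of striding across the whole array. */
void transpose_blocked(size_t n, const double complex *in,
                       double complex *out)
{
    for (size_t ib = 0; ib < n; ib += BLK)
        for (size_t jb = 0; jb < n; jb += BLK)
            for (size_t i = ib; i < ib + BLK && i < n; i++)
                for (size_t j = jb; j < jb + BLK && j < n; j++)
                    out[j * n + i] = in[i * n + j];
}
```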

Steven
From: Patricia Shanahan on
mike3 wrote:
> On Sep 20, 6:52 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
>> The performance for an FFT this large is usually determined more by
>> memory access patterns; assembly isn't likely to help you much.
>> (Note also that if your data are real then you can save at least
>> another factor of 2; the number I quoted was for complex data.)
>
> So why would the matrix approach be so slow (slower than the
> radix-2 recursion!)?

My guess is that it has something to do with memory reference patterns.

It's many years since the last time I did any FFT optimization, but
managing the memory references was the key to making large FFTs
efficient then. The trend in hardware has been towards an even bigger
ratio between raw compute performance and memory access performance, so
I don't suppose that has changed.
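[To make the memory-reference point concrete: a row-column ("matrix")
FFT that transforms the columns of a row-major array in place walks
memory with stride n. The sketch below shows that access pattern in
isolation; for large n, nearly every inner-loop access misses the
cache, unlike a stride-1 row traversal of the same data:]

```c
#include <stddef.h>

/* Sum an n-by-n row-major array column by column. The inner loop
   advances by a stride of n elements, the same large-stride pattern
   a naive column pass of a matrix FFT generates; for large n this
   defeats the cache, unlike a row-by-row (stride-1) traversal. */
double sum_by_columns(size_t n, const double *a)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)        /* column index */
        for (size_t i = 0; i < n; i++)    /* row index: stride-n walk */
            s += a[i * n + j];
    return s;
}
```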

Patricia