From: mike3 on 21 Sep 2009 04:00

On Sep 20, 6:52 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> The performance for an FFT this large is usually determined more by
> memory access patterns; assembly isn't likely to help you much.
> (Note also that if your data are real then you can save at least
> another factor of 2; the number I quoted was for complex data.)

So why is the matrix approach so slow (slower than the radix-2 recursion!)?

By the way, I also tested my small-FFT code -- the routine used for the single rows of the matrix (or, after the radix-2 recursion, once the subtransform is small enough to fit in cache). It is an iterative loop, and for a 4096-element complex transform it got around 1730 "mflops" (using the formula posted on the site you mentioned, the one used to compute the "mflops" ratings of the different FFT codes), with no assembly, the GCC compiler, and a 2.4 GHz Intel Core 2 Quad Q6600 CPU. Note that this CPU is not listed among the benchmarks on that site, but it has a lower clock rate than one that is (the Core 2 Duo at 3 GHz).

Note that this code uses precomputed trig tables and also does not scramble the data ("bit reversal", "radix permutation"), as that is not necessary for the application at hand (convolution of long sequences): one can instead use DIF followed by DIT, and the scrambles cancel out.
From: Steven G. Johnson on 21 Sep 2009 17:23

On Sep 21, 4:00 am, mike3 <mike4...(a)yahoo.com> wrote:
> So why is the matrix approach so slow (slower than the
> radix-2 recursion!)?

I can't optimize your code for you remotely. The devil is in the details with these things; the reference I already gave you lists some general principles regarding efficient transposition, efficient trig tables for large sizes, etcetera.

Good luck!

Steven
From: Patricia Shanahan on 21 Sep 2009 17:43

mike3 wrote:
> On Sep 20, 6:52 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
>> The performance for an FFT this large is usually determined more by
>> memory access patterns; assembly isn't likely to help you much.
>> (Note also that if your data are real then you can save at least
>> another factor of 2; the number I quoted was for complex data.)
>
> So why is the matrix approach so slow (slower than the
> radix-2 recursion!)?

My guess is that it has something to do with memory reference patterns. It has been many years since I last did any FFT optimization, but managing the memory references was the key to making large FFTs efficient then. The trend in hardware has been toward an ever larger ratio between raw compute performance and memory access performance, so I don't suppose that has changed.

Patricia