From: mike3 on 12 Sep 2009 05:50

Hi. I made a program to run Bailey's "4-step" (aka "matrix") Fast Fourier
Transform algorithm, but it seemed to go _slower_ than a simple direct
recursive+vectorized-iterative approach. Why is that?
From: Malcolm McLean on 13 Sep 2009 14:27

"mike3" <mike4ty4(a)yahoo.com> wrote in message news:
> I made a program to run Bailey's "4-step" (aka "matrix") Fast Fourier
> Transform algorithm, but it seemed to go _slower_ than a simple direct
> recursive+vectorized-iterative approach. Why is that?

Could be any reason. By using the Fast Fourier Transform at all you have
basically reduced the problem to N log N multiplications. Whilst I'm not
familiar with Bailey's tweak, it will almost certainly get rid of a few of
the multiplications in exchange for more shuffling of data about in
memory, but it won't change the order analysis of the algorithm. This
could well compile to less efficient code, particularly for small N,
because of cache usage, because you are using the integer/float pipelines
in a less parallel way, or because the loops are harder for the compiler
to optimise.
From: Steven G. Johnson on 14 Sep 2009 00:26

On Sep 12, 5:50 am, mike3 <mike4...(a)yahoo.com> wrote:
> I made a program to run Bailey's "4-step" (aka "matrix") Fast Fourier
> Transform algorithm, but it seemed to go _slower_ than a simple direct
> recursive+vectorized-iterative approach. Why is that?

With our FFTW code (www.fftw.org), we found that a radix-sqrt(N) step
(what Bailey calls the 4-step algorithm) is beneficial only when N gets
really large (on the order of 2^20 or bigger), and even then you only do
a single step (to reduce it to size ~2^10 transforms done by other
methods).

Essentially, while you can prove that radix-sqrt(N) uses hierarchical
memory (caches) very well in an asymptotic sense, the constant factors
are not so good compared to bounded-radix code. The advantage of a
bounded-radix code is that the radix butterfly can be done essentially
completely in hard-coded straight-line code, including twiddle factors
(which translates into efficient usage of the registers and CPU pipeline,
at least if the radix is large enough), whereas radix-sqrt(N) requires
more overhead for the transpositions, the twiddle factors, and other
things that must now typically be done in separate array passes.

See also our chapter "Implementing FFTs in practice"
(http://cnx.org/content/m16336/latest/) in Burrus's FFT book for some
basic background on these questions.

Regards,
Steven G. Johnson

PS. Quite apart from all of these questions, there are general issues of
program optimization that come to bear when comparing "algorithms" in
this way, because you are always really comparing implementations, not
abstract algorithms, and the quality of the implementations matters a
lot. I would urge you to benchmark your code against some other available
routines, or just compare to the numbers posted on fftw.org/speed, to get
some idea of whether you are comparing reasonably efficient FFT
implementations. (If you compare extremely slow implementations, it's
hard to draw any sensible conclusions about the algorithms.)
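For concreteness, here is a minimal sketch of the radix-sqrt(N) (4-step)
decomposition under discussion, assuming N = n1*n2 with n1, n2 ~ sqrt(N)
and a hypothetical in-place helper fft() for the small subtransforms. It
is deliberately naive: a real implementation would precompute the twiddle
factors and cache-block the transpositions, which is exactly where the
constant-factor overhead described above comes from.

    #include <complex.h>
    #include <stddef.h>

    void fft(double complex *x, size_t n);  /* assumed: any FFT for the small sizes */

    /* 4-step FFT of x (row-major, n1 rows of n2 columns, N = n1*n2).
       The result comes out in transposed order: output bin n1*d + c
       lands at x[c*n2 + d], so one more transpose is needed if natural
       order is required. */
    void four_step_fft(double complex *x, double complex *scratch,
                       size_t n1, size_t n2)
    {
        const double tau = 6.283185307179586;   /* 2*pi */
        size_t N = n1 * n2, a, b, c;

        /* Transpose so each length-n1 column becomes contiguous. */
        for (a = 0; a < n1; ++a)
            for (b = 0; b < n2; ++b)
                scratch[b * n1 + a] = x[a * n2 + b];

        /* Step 1: n2 FFTs of size n1. */
        for (b = 0; b < n2; ++b)
            fft(scratch + b * n1, n1);

        /* Step 2: multiply element (c,b) by the twiddle factor
           exp(-2*pi*i*b*c/N).  (Calling cexp per element is slow; a
           real code would use a precomputed table.) */
        for (b = 0; b < n2; ++b)
            for (c = 0; c < n1; ++c)
                scratch[b * n1 + c] *= cexp(-I * tau * (double)(b * c) / (double)N);

        /* Step 3: transpose back so each length-n2 row is contiguous. */
        for (b = 0; b < n2; ++b)
            for (c = 0; c < n1; ++c)
                x[c * n2 + b] = scratch[b * n1 + c];

        /* Step 4: n1 FFTs of size n2. */
        for (c = 0; c < n1; ++c)
            fft(x + c * n2, n2);
    }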
From: mike3 on 15 Sep 2009 03:03

On Sep 13, 10:26 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> With our FFTW code (www.fftw.org), we found that a radix-sqrt(N) step
> (what Bailey calls the 4-step algorithm) is beneficial only when N gets
> really large (on the order of 2^20 or bigger), and even then you only
> do a single step (to reduce it to size ~2^10 transforms done by other
> methods).

Well, I'm running at a very large 2^26 = 67,108,864 elements for the
matrix algorithm and it still shows up as significantly slower than the
non-matrix one (which is a radix-2 recursion down to a size small enough
to switch to a fast iterative method with a precomputed root table).

> Essentially, while you can prove that radix-sqrt(N) uses hierarchical
> memory (caches) very well in an asymptotic sense, the constant factors
> are not so good compared to bounded-radix code. [...]

So what could be going on here?
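mike3's actual code isn't shown in the thread; as a point of reference, a
minimal sketch of the structure he describes (radix-2 recursion with a
cutoff to an iterative kernel, and a precomputed root table) might look
like the following. The cutoff value and the fft_iterative() kernel are
assumptions, and the table is assumed to hold w[k] = exp(-2*pi*i*k/N) for
the top-level size N.

    #include <complex.h>
    #include <stddef.h>

    #define FFT_CUTOFF 1024   /* hypothetical threshold for the switch */

    /* assumed: in-place iterative FFT using the same root table */
    void fft_iterative(double complex *x, size_t n,
                       const double complex *w, size_t stride);

    /* Call as fft_recursive(x, tmp, N, w, 1); tmp is scratch of size N.
       At each level the stride into the root table doubles, so
       w[k*stride] = exp(-2*pi*i*k/n) for the current size n. */
    static void fft_recursive(double complex *x, double complex *tmp,
                              size_t n, const double complex *w,
                              size_t stride)
    {
        size_t k;
        if (n <= FFT_CUTOFF) {              /* switch to iterative kernel */
            fft_iterative(x, n, w, stride);
            return;
        }
        /* split into even/odd halves (decimation in time) */
        for (k = 0; k < n / 2; ++k) {
            tmp[k]         = x[2 * k];
            tmp[k + n / 2] = x[2 * k + 1];
        }
        fft_recursive(tmp,         x, n / 2, w, 2 * stride);
        fft_recursive(tmp + n / 2, x, n / 2, w, 2 * stride);
        /* combine with twiddles from the precomputed root table */
        for (k = 0; k < n / 2; ++k) {
            double complex t = w[k * stride] * tmp[k + n / 2];
            x[k]         = tmp[k] + t;
            x[k + n / 2] = tmp[k] - t;
        }
    }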
From: Steven G. Johnson on 17 Sep 2009 11:47

On Sep 15, 3:03 am, mike3 <mike4...(a)yahoo.com> wrote:
> Well, I'm running at a very large 2^26 = 67,108,864 elements for the
> matrix algorithm and it still shows up as significantly slower than the
> non-matrix one (which is a radix-2 recursion down to a size small enough
> to switch to a fast iterative method with a precomputed root table).

Note that you probably only want to do 1 step of the radix-sqrt(N)
algorithm, and then switch to another algorithm for the size-2^13
subtransforms. You don't want to do radix-sqrt(N) recursively.

> So what could be going on here?

Well, there is always the possibility that your implementation is simply
very inefficient. (Even a simple square-matrix transposition, which is
one part of the 4-step algorithm, is surprisingly tricky to code
efficiently once you take cache lines into account. Dealing with memory
efficiently is nontrivial these days. Efficient FFTs are also nontrivial,
bearing little resemblance to textbook programs a la Numerical Recipes.)

Again, I strongly recommend that you compare with other available FFT
software to see if you are even in the ballpark of an efficient
implementation; otherwise you are comparing one slow code to another (in
which case you may need to rethink your whole approach before detailed
optimization becomes worthwhile).

Looking at the graphs on fftw.org/speed, on a 3GHz Intel Core Duo from a
couple of years ago (using only a single processor), both FFTW and the
Intel Math Kernel Library take a little over 4 seconds to do a size-2^26
complex-data FFT in double precision.

Steven
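As an illustration of the cache-blocking alluded to above, here is a
minimal sketch of a tiled out-of-place transpose of an n-by-n complex
matrix. The tile size B is a guess to be tuned per machine, and n is
assumed divisible by B; handling the in-place case and ragged edge tiles
is where the real trickiness lives. The point of the tiling is that a
naive transpose touches a fresh cache line on every strided write,
whereas a BxB tile reuses each line B times.

    #include <complex.h>
    #include <stddef.h>

    #define B 8   /* tile size: 8 complex doubles = 128 bytes per tile row */

    void transpose_blocked(const double complex *restrict in,
                           double complex *restrict out, size_t n)
    {
        size_t i, j, ii, jj;
        for (i = 0; i < n; i += B)
            for (j = 0; j < n; j += B)
                /* transpose one BxB tile; reads and writes both stay
                   within a small, cache-resident working set */
                for (ii = i; ii < i + B; ++ii)
                    for (jj = j; jj < j + B; ++jj)
                        out[jj * n + ii] = in[ii * n + jj];
    }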