From: mike3 on
On Sep 17, 9:47 am, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> On Sep 15, 3:03 am, mike3 <mike4...(a)yahoo.com> wrote:
>
> > Well, I'm running at a very large 2^26 = 67,108,864 elements for the
> > matrix algorithm, and it still shows up as significantly slower than
> > the non-matrix one (which is a radix-2 recursion down to a size small
> > enough to switch to a fast iterative method with a precomputed root
> > table).
>
> Note that you only probably want to do 1 step of the radix-sqrt(N)
> algorithm, and then switch to another algorithm for the size 2^13
> subtransforms.  You don't want to do radix-sqrt(N) recursively.
>
> > So what could be going on here?
>
> Well, there is always the possibility that your implementation is
> simply very inefficient.  (Even a simple square-matrix transposition,
> which is one part of the 4-step algorithm, is surprisingly tricky to
> code efficiently once you take cache lines into account.  Dealing with
> memory efficiently is nontrivial these days. Efficient FFTs are also
> nontrivial, bearing little resemblance to textbook programs a la
> Numerical Recipes.)
>

So where do I find out all this "non-trivial" "non-textbook" stuff,
then?
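(As an aside, the cache-blocked square transpose Steven alludes to above
looks roughly like the following toy Python sketch. The tile size B is a
made-up tunable; in a real C implementation you'd size it so a pair of
B-by-B tiles of complex doubles fits in L1 cache.)

```python
def blocked_transpose(a, n, B=8):
    """In-place transpose of an n-by-n matrix stored row-major in the
    flat list a, processed in B-by-B tiles so that (in a compiled
    language) each tile pair stays resident in cache while it is
    being swapped."""
    for ii in range(0, n, B):
        for jj in range(ii, n, B):            # upper-triangular tiles only
            for i in range(ii, min(ii + B, n)):
                # within a diagonal tile, start past the diagonal
                j0 = i + 1 if ii == jj else jj
                for j in range(j0, min(jj + B, n)):
                    a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]
```

The point is purely the access pattern: a naive row-by-row transpose
touches one of the two arrays with stride n, missing cache on nearly
every access once n is large, while the tiled version reuses each
fetched cache line B times.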

> Again, I strongly recommend that you compare with other available FFT
> software to see if you are even in the ballpark of an efficient
> implementation, otherwise you are comparing one slow code to another
> (in which case you may need to rethink your whole approach before
> detailed optimization becomes worthwhile).  Looking at the graphs on
> fftw.org/speed, on a 3GHz Intel Core Duo from a couple years ago
> (using only a single processor), both FFTW and the Intel Math Kernel
> Library take a little over 4 seconds to do a size 2^26 complex-data
> FFT in double precision.
>

That's some fast FFTs. Is that 1-thread performance?
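To make the one-step radix-sqrt(N) suggestion above concrete, here is a
toy Python sketch of the four-step decomposition. Naive O(n^2) DFTs
stand in for the length-n1 and length-n2 subtransforms purely for
clarity; nothing here is tuned, it just shows the column-FFT / twiddle /
row-FFT structure.

```python
import cmath

def dft(z):
    """Naive DFT, standing in for a real subtransform FFT."""
    N = len(z)
    return [sum(z[j] * cmath.exp(-2j * cmath.pi * j * k / N)
                for j in range(N)) for k in range(N)]

def four_step_fft(a, n1, n2):
    """DFT of a (length n1*n2), viewed as an n1-by-n2 matrix:
    column DFTs, twiddle multiply, row DFTs, transposed output order."""
    n = n1 * n2
    # 1. length-n1 DFT down each of the n2 columns
    cols = [dft([a[j1 * n2 + j2] for j1 in range(n1)]) for j2 in range(n2)]
    # 2. twiddle: multiply element (k1, j2) by exp(-2*pi*i*j2*k1/n)
    for j2 in range(n2):
        for k1 in range(n1):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)
    # 3/4. length-n2 DFT along each row, writing the output transposed
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[j2][k1] for j2 in range(n2)])
        for k2 in range(n2):
            out[k2 * n1 + k1] = row[k2]
    return out
```

For 2^26 points this means n1 = n2 = 2^13, so each subtransform fits
comfortably in cache, which is the whole reason to do exactly one step
of this decomposition and then switch algorithms.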

From: Steven G. Johnson on
On Sep 18, 7:14 pm, mike3 <mike4...(a)yahoo.com> wrote:
> So where do I find out all this "non-trivial" "non-textbook" stuff,
> then?

I gave you one reference already, and the citations in that chapter
will give you some more clues, and there is also a lot of source code
out there that one can read.

> That's some fast FFTs. Is that 1-thread performance?

Yes. (Note that I already mentioned it was "single processor"
performance.)

Steven
From: mike3 on
On Sep 18, 6:53 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> On Sep 18, 7:14 pm, mike3 <mike4...(a)yahoo.com> wrote:
>
> > So where do I find out all this "non-trivial" "non-textbook" stuff,
> > then?
>
> I gave you one reference already, and the citations in that chapter
> will give you some more clues, and there is also a lot of source code
> out there that one can read.
>
> > That's some fast FFTs. Is that 1-thread performance?
>
> Yes.  (Note that I already mentioned it was "single processor"
> performance.)
>

Man, then the FFTs I did are pretty slow... 23 seconds for the
same data length (67,108,864 elements) on one core of a 2.4 GHz Core 2
Quad. Could I get that one-core performance down to something like 9-12
seconds without needing assembler code in there? It doesn't need to be
the fastest thing on Earth, just a good deal faster than a "textbook"
implementation.
From: mike3 on
On Sep 18, 8:26 pm, mike3 <mike4...(a)yahoo.com> wrote:
> On Sep 18, 6:53 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
>
> > On Sep 18, 7:14 pm, mike3 <mike4...(a)yahoo.com> wrote:
>
> > > So where do I find out all this "non-trivial" "non-textbook" stuff,
> > > then?
>
> > I gave you one reference already, and the citations in that chapter
> > will give you some more clues, and there is also a lot of source code
> > out there that one can read.
>
> > > That's some fast FFTs. Is that 1-thread performance?
>
> > Yes.  (Note that I already mentioned it was "single processor"
> > performance.)
>
> Man, then the FFTs I did are pretty slow... 23 seconds for the
> same data length (67,108,864 elements) on one core of a 2.4 GHz Core 2
> Quad. Could I get that one-core performance down to something like 9-12
> seconds without needing assembler code in there? It doesn't need to be
> the fastest thing on Earth, just a good deal faster than a "textbook"
> implementation.

Actually I got that wrong: that's for two transforms (forward then
reverse) plus a convolution step (this is being used to convolve very
long sequences). One FFT is around 12 seconds. Could that be gotten
down to 8-9 seconds? Do I need assembler to get it down to that?
(Note that we don't need the bit reversal here: if one does the two
forward FFTs as DiF and the reverse FFT as DiT, the bit-reversal steps
cancel each other and so can be dropped from the program, and I don't
do them.) And this is without the Bailey 4-step thing, just a radix-2
method, which starts out recursive and then switches to iterative once
the data is smaller than cache, at which point it also switches to
precomputed trig factor tables.
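The bit-reversal cancellation described above can be sketched like this
(toy Python, not tuned): fft_dif leaves its output in bit-reversed
order, ifft_dit consumes bit-reversed input, and the pointwise product
in between doesn't care about the ordering since both operands are
permuted identically.

```python
import cmath

def fft_dif(a):
    """Iterative decimation-in-frequency FFT: natural-order input,
    bit-reversed output (the reversal pass is deliberately omitted)."""
    n = len(a)
    m = n
    while m >= 2:
        half = m // 2
        w_m = cmath.exp(-2j * cmath.pi / m)
        for start in range(0, n, m):
            w = 1.0 + 0j
            for j in range(half):
                u, v = a[start + j], a[start + j + half]
                a[start + j] = u + v
                a[start + j + half] = (u - v) * w
                w *= w_m
        m //= 2

def ifft_dit(a):
    """Iterative decimation-in-time inverse FFT: bit-reversed input,
    natural-order output, unscaled (caller divides by n)."""
    n = len(a)
    m = 2
    while m <= n:
        half = m // 2
        w_m = cmath.exp(2j * cmath.pi / m)
        for start in range(0, n, m):
            w = 1.0 + 0j
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half] * w
                a[start + j] = u + v
                a[start + j + half] = u - v
                w *= w_m
        m *= 2

def cyclic_convolve(x, y):
    """Cyclic convolution of two equal power-of-two-length complex
    sequences, with the two bit-reversal passes cancelled as above."""
    n = len(x)
    fft_dif(x)
    fft_dif(y)
    prod = [xi * yi for xi, yi in zip(x, y)]  # same permuted order on both
    ifft_dit(prod)
    return [p / n for p in prod]
```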
From: Steven G. Johnson on
The performance of an FFT this large is usually determined more by
memory access patterns than by arithmetic; assembly isn't likely to
help you much.
(Note also that if your data are real then you can save at least
another factor of 2; the number I quoted was for complex data.)
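(The usual way to get that real-input factor of two is to pack the even
and odd real samples into a half-length complex sequence, transform it,
and untangle afterward. A toy Python sketch, with a naive DFT standing
in for the half-length complex FFT:)

```python
import cmath

def dft(z):
    """Naive DFT, standing in for an efficient half-length complex FFT."""
    N = len(z)
    return [sum(z[j] * cmath.exp(-2j * cmath.pi * j * k / N)
                for j in range(N)) for k in range(N)]

def real_fft(x):
    """DFT of a real sequence x (even length n) via one complex DFT of
    length n/2.  Returns bins 0..n/2; the remaining bins follow from
    conjugate symmetry, X[n-k] = conj(X[k])."""
    n = len(x)
    N = n // 2
    # pack even/odd samples into one complex sequence and transform
    Z = dft([complex(x[2 * k], x[2 * k + 1]) for k in range(N)])
    out = []
    for k in range(N + 1):
        Zk = Z[k % N]
        Zc = Z[-k % N].conjugate()
        even = 0.5 * (Zk + Zc)        # DFT of the even-indexed samples
        odd = -0.5j * (Zk - Zc)       # DFT of the odd-indexed samples
        out.append(even + cmath.exp(-2j * cmath.pi * k / n) * odd)
    return out
```

Only half the data moves through the FFT proper, which matters here
precisely because the transform is memory-bound at these sizes.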