From: Steven G. Johnson on 28 Nov 2007 21:16

On Nov 28, 7:11 pm, robert bristow-johnson <r...(a)audioimagination.com> wrote:

> okay, Steven, using O&S for illustration, the pointers changes looks pretty knarly, even for N=8, but even if it's not, it's not in-place, is it? and if it's not in-place, is there not a cache hit? assuming you have the (double) memory and that cost is not a problem, you now have the top and bottom (i'm gonna assume radix-2 for the moment) of the butterfly you're reading (from the previous pass) and the top and bottom of the butterfly you are writing (which would be the same locations if you were doing this in-place). if it is not in-place, then you are ping-ponging and i would expect the number of cache misses to double.

First of all, I have no idea what you are talking about with "gnarly pointer changes" for N=8. An N=8 FFT is 60 floating-point operations that can be implemented in straight-line code: read into local variables (= registers), do calculations, write results.

Second, there are in-place algorithms that do not require a separate bit-reversal pass, some of which I pointed out in another message on this thread. Basically, they interleave the transpositions (bit-reversal = product of transpositions) with the computational stages.

Third, an out-of-place algorithm does not require that you "ping-pong" between two arrays (as in the Stockham algorithm). Actually, only one stage of the computation needs to work out-of-place (as in the recursive example code I posted) to write things in the right place; all other stages of the computation can work in-place. And if you look at the FFTW benchmarks, you'll see that properly written out-of-place code can indeed be very competitive.

Fourth, even if you have to do additional data motion or copying stages, sometimes this can be advantageous because it makes the data access during computation more consecutive. e.g. the Stockham ("ping-pong") algorithm was originally designed for vector machines where consecutive access was a huge win, and even now it has locality benefits (although we currently find that other algorithms work better for us). Or e.g. in the radix-sqrt(n) approach (dubbed the "four-step" and "six-step" algorithms by Bailey, although variants date back to Gentleman and Sande) you do explicit transpositions---essentially, a part of the bit-reversal permutation---interleaved with the computation, in order to improve locality, and this is a win for very large N or for very slow memory (e.g. distributed-memory systems).

Fifth, one is crazy to do radix-2 these days. You need to do the work in larger chunks to take full advantage of the CPU. I've never seen a radix-2 FFT that was not several times slower than the fastest code.

Ultimately, however, machines are complicated enough these days that theorizing about performance is not very persuasive; you have to actually try the different algorithms, and sometimes the results are unexpected. In the case of FFTs, people have tried quite a few algorithms, and there are many examples of people finding performance benefits from algorithms that avoid separate bit-reversal passes.

Regards,
Steven G. Johnson
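For concreteness, a minimal sketch of a recursive out-of-place FFT of the kind described above (natural-order output, no separate bit-reversal pass, with only the length-1 leaves copying from input to output) might look like the following. This is only an illustration, not the example code referred to in the post; the function name fft_rec and its arguments are invented here, and n is assumed to be a power of 2.

    #include <complex.h>
    #include <math.h>

    /* Illustrative sketch (not the code referred to above): recursive
       radix-2 decimation-in-time FFT, out-of-place, natural-order output.
       Only the n == 1 leaves copy from in to out; every butterfly pass
       afterwards operates in place on out.  "is" is the input stride. */
    static void fft_rec(int n, const double complex *in, int is,
                        double complex *out)
    {
        if (n == 1) {
            out[0] = in[0];                      /* the single out-of-place step */
            return;
        }
        fft_rec(n/2, in,      2*is, out);        /* even samples -> out[0 .. n/2) */
        fft_rec(n/2, in + is, 2*is, out + n/2);  /* odd samples  -> out[n/2 .. n) */
        for (int k = 0; k < n/2; ++k) {          /* combine, in place in out */
            double complex t = cexp(-2 * M_PI * I * k / n) * out[k + n/2];
            double complex e = out[k];
            out[k]       = e + t;
            out[k + n/2] = e - t;
        }
    }

    /* usage: fft_rec(N, input, 1, output); */

Because each recursion simply splits into even and odd samples by stride, the output lands in natural order without any explicit bit-reversed indexing, and the only out-of-place data motion is the copy at the leaves.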
From: robert bristow-johnson on 29 Nov 2007 23:37

On Nov 28, 5:30 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:

> On Nov 27, 1:46 pm, robert bristow-johnson
>
> > Steven, what's different (in terms of code) is the pointer arithmetic. doing all sorts of goofy pointer arithmetic might be more costly, in the aggregate throughout all of the FFT passes, than a single bit-reversing pass at the beginning or end.
>
> As someone who has actually implemented many of the algorithms without bit reversal, I can say that (a) the pointer arithmetic is not really that complicated and (b) the pointer arithmetic per se is not the dominant factor in the time anyway (compared to the memory access itself), and (c) it really does seem to be faster on modern general-purpose CPUs than the methods involving a separate array pass.

On Nov 28, 9:16 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:

> On Nov 28, 7:11 pm, robert bristow-johnson <r...(a)audioimagination.com> wrote:
>
> > okay, Steven, using O&S for illustration, the pointers changes looks pretty knarly, even for N=8, but even if it's not, it's not in-place, is it? and if it's not in-place, is there not a cache hit? assuming you have the (double) memory and that cost is not a problem, you now have the top and bottom (i'm gonna assume radix-2 for the moment) of the butterfly you're reading (from the previous pass) and the top and bottom of the butterfly you are writing (which would be the same locations if you were doing this in-place). if it is not in-place, then you are ping-ponging and i would expect the number of cache misses to double.
>
> First of all, I have no idea what you are talking about with "gnarly pointer changes" for N=8. An N=8 FFT is 60 floating-point operations that can be implemented in straight-line code: read into local variables (= registers), do calculations, write results.

well, i said "for illustration". in particular i was referring to Fig. 9-15 (p. 597) in my 1989 version of O&S. certainly if N=8 you can do it with straight-line code. that's beside the point. i was trying to infer, from the illustration, what the rhyme or reason there is in the particular pointer manipulation from that N=8 illustration. it still looks gnarly.

for the case of a general N (say, a power of 2) and much bigger than 2^3, what is the tight and efficient algorithm for determining the order of indexing for the node that you are writing to, and once you've determined that, what are the nodes that you're reading from? except for the first pass (which is just like the first pass of an in-place DIT), i cannot see what that rhyme or reason is. i imagine there may be one, but it doesn't look simple or elegant. so how could this be coded compactly and efficiently for a general N = 2^p? i cannot see how it is done, while i do not deny it can be done. i just don't see it. surely, if you have to look up those node indices from a table (and a different table for every N?), that may be quick, but it's not compact. this is what i mean by "goofy pointer arithmetic". and, it's clear that it is not in-place after the first pass. there *must* be another array space to write to.

so, there clearly is a scheme you know about (i never thought otherwise) that is something i hadn't seen. but using the one reference to a normal-in, normal-out structure that i have seen, i'll put it the way O&S did (p. 598): "In the flow graph of Fig. 9-15, the data are accessed nonsequentially, the computation is not in place, and a scheme for indexing the data is considerably more complicated than in either of the two previous cases [two in-place DIT structures]. Consequently, this structure has no apparent advantages."

i was reverberating that sentiment from the little bit that i know of the radix-(2^n) FFT. that information could very well be stale. it's also in the Yellow O&S (1975 - p. 300, Fig. 6.13, and they say the same thing), so it's more than 3 decades old.

sometimes i think that O&S is timeless, other times it's just dated.

r b-j
From: Steven G. Johnson on 30 Nov 2007 18:35
On Nov 29, 11:37 pm, robert bristow-johnson <r...(a)audioimagination.com> wrote:

> well, i said "for illustration". in particular i was referring to Fig. 9-15 (p. 597) in my 1989 version of O&S. certainly if N=8 you can do it with straight-line code. that's beside the point. i was trying to infer, from the illustration, what the rhyme or reason there is in the particular pointer manipulation from that N=8 illustration. it still looks gnarly. for the case of a general N (say, a power of 2) and much bigger than 2^3, what is the tight and efficient algorithm for determining the order of indexing for the node that you are writing to, and once you've determined that, what are the nodes that you're reading from? except for the first pass (which is just like the first pass of an in-place DIT), i cannot see what that rhyme or reason is. i imagine there may be one, but it doesn't look simple or elegant.

No, there are certainly simple patterns to the memory access for in-place algorithms without bit-reversal. They are not that complicated to implement (as my 15-line example in another post shows). You absolutely do not have to use a look-up table of indices! (One of the many reasons why I feel that butterfly diagrams are next to useless in truly understanding FFT algorithms.)

(Of course, if you compare a highly optimized in-order FFT like FFTW or MKL, it will be way faster than a textbook radix-2 algorithm even if you skip the bit-reversal stage. Highly optimized algorithms are always going to be way more complicated than textbook radix-2, as long as CPUs remain so complex.)

> "In the flow graph of Fig. 9-15, the data are accessed nonsequentially, the computation is not in place, and a scheme for indexing the data is considerably more complicated than in either of the two previous cases [two in-place DIT structures]. Consequently, this structure has no apparent advantages."

One reason to be cautious here is that O&S is comparing apples and oranges: FFTs with bit-reversed input or output, and FFTs with normal-order output. Of course, if you can get away with permuted output it will be simpler. That's not the comparison I'm making: I'm comparing only FFTs with normal-order output, which differ in whether you have a separate bit-reversal pass or whether you combine the permutations with the computation.

> i was reverberating that sentiment from the little bit that i know of the radix-(2^n) FFT. that information could very well be stale. it's also in the Yellow O&S (1975 - p. 300, Fig. 6.13, and they say the same thing), so it's more than 3 decades old.
>
> sometimes i think that O&S is timeless, other times it's just dated.

Comments on performance advantages from FFT algorithms that avoid a separate bit-reversal pass date back to the 60's, so in some sense it's not a matter of being out-of-date. On the other hand, it seems to be especially in the last 20 years that floating-point performance of CPUs has totally outstripped memory performance, and that makes non-arithmetic issues so much more relevant now.

The main problem is that textbooks are trying to teach students the basic ideas of FFT algorithms rather than performance hacks, and therefore they almost always limit their discussion to only two algorithms for power-of-two N: radix-2 decimation in time and radix-2 decimation in frequency. If you limit yourself to these two cases, you can't really have a useful discussion of performance optimization on modern CPUs, where performance is limited not by arithmetic counts but rather by memory access and register/pipeline utilization.

Steven
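For contrast with the recursive sketch earlier in the thread, the textbook structure under discussion (an in-place iterative radix-2 decimation-in-time FFT preceded by a separate bit-reversal permutation pass over the array) can be sketched roughly as follows; the name fft_textbook is invented for the illustration, and n is assumed to be a power of 2.

    #include <complex.h>
    #include <math.h>

    /* Illustrative sketch of the textbook approach discussed above:
       in-place iterative radix-2 decimation-in-time FFT whose first step
       is a separate bit-reversal permutation pass over the whole array. */
    static void fft_textbook(int n, double complex *x)
    {
        /* separate pass: permute x into bit-reversed order */
        for (int i = 0, j = 0; i < n; ++i) {
            if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
            int m = n >> 1;
            while (m >= 1 && j >= m) { j -= m; m >>= 1; }
            j += m;
        }
        /* log2(n) in-place butterfly passes */
        for (int len = 2; len <= n; len <<= 1) {
            double complex wlen = cexp(-2 * M_PI * I / len);
            for (int i = 0; i < n; i += len) {
                double complex w = 1.0;
                for (int k = 0; k < len/2; ++k) {
                    double complex u = x[i + k];
                    double complex t = w * x[i + k + len/2];
                    x[i + k]         = u + t;
                    x[i + k + len/2] = u - t;
                    w *= wlen;
                }
            }
        }
    }

The first loop is the separate array pass under discussion: it touches all n elements once before any arithmetic starts, which is exactly the extra pass over memory that the in-order, no-bit-reversal algorithms described above avoid.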