From: Steven G. Johnson on 21 Apr 2008 12:47

On Apr 21, 5:30 am, Martin Brown <|||newspam...(a)nezumi.demon.co.uk> wrote:
> Which compiler did you use for these benchmarks?

gcc 4.1.2

> absolute timing. I am a bit surprised though that an optimised 2^N
> transform cannot beat a radix 10 in terms of time per element
> processed though - overheads in radix 5 are quite a bit higher.

First, the arithmetic difference is not that great; on the order of 25%. Second, for transforms this big the time is dominated by memory access rather than by arithmetic. And, as I mentioned in my previous message, power-of-two strides have a disadvantage on modern memory architectures because of the way caches work (limited associativity based on low-order address bits).

As a fun little exercise, try writing code that performs an NxN matrix transpose (just two nested loops). Then time it as a function of N, and plot time/N*N vs. N. Without caches etc., the plot would be roughly constant. However, if you do this on any current general-purpose machine you will find it is anything but constant---you will see huge spikes whenever N is a power of 2, or for that matter whenever N is a multiple of a power of 2.

(Someday, perhaps, caches will use a proper hash of all the address bits and most of this wild variation will go away.)

> Does the saved wisdom provide or export a human readable version of the
> actual radix kernels used and their order of application?

The wisdom is not human readable. However, you can use fftw_print_plan to print a "nerd-readable" version of the plan. Note that the algorithm is not simply a choice of radices; it has many other choices besides that (see our Proc. IEEE paper on www.fftw.org).

> I confess genuine surprise that 2^N no longer reigns supreme (if only
> by a small margin) when you have a free choice of very long transform
> length. You have to concede though that 2^N is a lot simpler to code.

A lot of people are surprised; the myth of the overwhelming advantage of powers of two is deeply ingrained in DSP folklore.

2^N is only a "lot" simpler to code if you don't care about performance. If you care about performance, a textbook radix-2 implementation is worthless (5 to 40 times slower than the fastest FFTs, depending on N), you have to use larger radices anyway, and the complications involved in developing a highly optimized FFT aren't so centered around coding the radix butterfly. (For FFTW, all our butterflies and base cases are generated by a program anyway, so for us there is no difference.)

Regards, Steven G. Johnson
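[A minimal C sketch of the transpose-timing exercise described in the post above. The size range, repetition count, and use of clock() are illustrative choices, not from the post; error checking is omitted.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Naive two-loop transpose: B = A^T for an n x n matrix. */
    static void transpose(double *B, const double *A, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                B[j * n + i] = A[i * n + j];
    }

    int main(void)
    {
        const int reps = 10;
        for (int n = 64; n <= 2048; n += 16) {
            double *A = malloc((size_t)n * n * sizeof *A);
            double *B = malloc((size_t)n * n * sizeof *B);
            for (int i = 0; i < n * n; i++)
                A[i] = (double)i;

            clock_t t0 = clock();
            for (int r = 0; r < reps; r++)
                transpose(B, A, n);
            clock_t t1 = clock();

            /* print n and the time per element, in seconds */
            printf("%d %g\n", n,
                   (double)(t1 - t0) / CLOCKS_PER_SEC / reps / ((double)n * n));

            free(A);
            free(B);
        }
        return 0;
    }

[Plotting the second column against n on typical current hardware shows the spikes described above near n = 512, 1024, 2048 and other multiples of large powers of two.]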
From: Steven G. Johnson on 21 Apr 2008 12:52

On Apr 21, 5:57 am, Martin Brown <|||newspam...(a)nezumi.demon.co.uk> wrote:
> Indeed. There must be something wrong with the old radix5/10 kernel I
> use.

There's probably nothing wrong with it; it's just optimized mainly for the power-of-two case. Most FFT programs are. Programmers have limited time to devote to optimization, and so it is sensible to devote most of that time to powers of two, since they are the ingrained choice of most DSP practitioners.

In FFTW, the generation of the radix kernels etc. is automated, so it is somewhat unusual in that we don't have to make much of a tradeoff between optimizing powers of two and non-powers of two.

Steven
From: Steven G. Johnson on 21 Apr 2008 13:01

On Apr 21, 12:47 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> As a fun little exercise, try writing code that performs an NxN matrix
> transpose (just two nested loops). Then time it as a function of N,
> and plot time/N*N vs. N. Without caches etc., the plot would be
> roughly constant. However, if you do this on any current general-
> purpose machine you will find it is anything but constant---you will
> see huge spikes whenever N is a power of 2, or for that matter
> whenever N is a multiple of a power of 2.

(I should mention that there are faster ways to perform matrix transpositions than the simplest algorithm of two nested loops, in order to improve cache-line utilization; actually, FFTW has optimized in-place and out-of-place transpose routines for both square and non-square matrices [the nonsquare in-place case being *much* harder, of course]. But the magnitude of the timing variation of the two-loop transposition as a function of N is surprising to most programmers who have an oversimplified mental model of what is going on inside the CPU.)

Steven
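[For comparison with the naive loop, a generic cache-blocking sketch of the kind of improvement mentioned above. This is not FFTW's actual transpose routine; the tile size BLK = 32 is an arbitrary illustrative choice and the best value is machine-dependent.]

    /* Cache-blocked out-of-place transpose: B = A^T, processed in
     * BLK x BLK tiles so that the elements of A and B touched while a
     * tile is being copied stay within a small number of cache lines. */
    #define BLK 32

    static void transpose_blocked(double *B, const double *A, int n)
    {
        for (int ii = 0; ii < n; ii += BLK) {
            for (int jj = 0; jj < n; jj += BLK) {
                int imax = (ii + BLK < n) ? ii + BLK : n;
                int jmax = (jj + BLK < n) ? jj + BLK : n;
                for (int i = ii; i < imax; i++)
                    for (int j = jj; j < jmax; j++)
                        B[j * n + i] = A[i * n + j];
            }
        }
    }

[Substituting this for transpose() in the timing sketch above usually flattens the time-per-element curve considerably, although power-of-two N can still show a penalty because whole columns of tiles map to the same cache sets.]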
From: Bob Lied on 21 Apr 2008 13:24

> Steven G. Johnson wrote [2008-04-21 11:47 AM]:
> As a fun little exercise, try writing code that performs an NxN matrix
> transpose (just two nested loops). Then time it as a function of N,
> and plot time/N*N vs. N. Without caches etc., the plot would be
> roughly constant. However, if you do this on any current general-
> purpose machine you will find it is anything but constant---you will
> see huge spikes whenever N is a power of 2, or for that matter
> whenever N is a multiple of a power of 2.

You're right -- that was fun, and not at all what I would have predicted. I already knew my intuition about program performance was nearly useless, but this little exercise sure reinforces the rule: measure before optimizing.
From: Martin Brown on 22 Apr 2008 12:14
Steven G. Johnson wrote:
> On Apr 21, 5:30 am, Martin Brown <|||newspam...(a)nezumi.demon.co.uk>
> wrote:
>> Which compiler did you use for these benchmarks?
>
> gcc 4.1.2

Thanks.

>> absolute timing. I am a bit surprised though that an optimised 2^N
>> transform cannot beat a radix 10 in terms of time per element
>> processed though - overheads in radix 5 are quite a bit higher.
>
> First, the arithmetic difference is not that great; on the order of
> 25%. Second, for transforms this big the time is dominated by memory
> access rather than by arithmetic. And, as I mentioned in my previous
> message, power-of-two strides have a disadvantage on modern memory
> architectures because of the way caches work (limited associativity
> based on low-order address bits).

I hadn't realised the hit was so big.

> As a fun little exercise, try writing code that performs an NxN matrix
> transpose (just two nested loops). Then time it as a function of N,
> and plot time/N*N vs. N. Without caches etc., the plot would be
> roughly constant. However, if you do this on any current general-
> purpose machine you will find it is anything but constant---you will
> see huge spikes whenever N is a power of 2, or for that matter
> whenever N is a multiple of a power of 2.

I remember demand paged virtual memory. It was almost guaranteed that every year some new graduate student would transpose a (in those days big) 1024-square image with a pair of nested loops instead of using the optimised library functions. It could grind a VAX to a standstill, as almost every memory access generated a page fault.

> (Someday, perhaps, caches will use a proper hash of all the address
> bits and most of this wild variation will go away.)

Would an XOR of some midlevel bits be sufficient to break the degeneracy?

Thanks for the paper reference. I am curious how the wisdom works in a bit more detail.

>> I confess genuine surprise that 2^N no longer reigns supreme (if only
>> by a small margin) when you have a free choice of very long transform
>> length. You have to concede though that 2^N is a lot simpler to code.
>
> A lot of people are surprised; the myth of the overwhelming advantage
> of powers of two is deeply ingrained in DSP folklore.

It used to be true in the old days. Apart from some of the Winograd transforms, as I recall.

> 2^N is only a "lot" simpler to code if you don't care about
> performance. If you care about performance, a textbook radix-2
> implementation is worthless (5 to 40 times slower than the fastest
> FFTs, depending on N), you have to use larger radices anyway,

Yes. Although the best optimising compilers do seem to have narrowed the differences a bit. Compact inner loop code can help. Incidentally, do you find the split radix 16 length transform worth the effort of coding?

> For FFTW, all our
> butterflies and base cases are generated by a program anyway, so for
> us there is no difference.)

And that is a stroke of genius.

Regards,
Martin Brown

** Posted from http://www.teranews.com **
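[As an aside on Martin's XOR question above, a toy illustration of the kind of set-index hashing under discussion. It does not describe any real CPU; the cache geometry and the matrix size are invented for the example.]

    #include <stdio.h>
    #include <stdint.h>

    /* Made-up cache geometry: 64-byte lines and 1024 sets. */
    #define LINE_BITS 6
    #define SET_BITS  10
    #define NSETS     (1u << SET_BITS)

    /* Conventional set index: the low-order address bits just above
     * the line offset. */
    static unsigned set_plain(uint64_t addr)
    {
        return (unsigned)(addr >> LINE_BITS) & (NSETS - 1);
    }

    /* Hashed set index: XOR in the next-higher group of address bits. */
    static unsigned set_hashed(uint64_t addr)
    {
        return (unsigned)((addr >> LINE_BITS) ^
                          (addr >> (LINE_BITS + SET_BITS))) & (NSETS - 1);
    }

    int main(void)
    {
        /* Stride of one row of an 8192 x 8192 matrix of doubles
         * (64 KiB): the addresses touched when walking down a column. */
        uint64_t stride = 8192 * sizeof(double);
        for (int i = 0; i < 8; i++) {
            uint64_t addr = (uint64_t)i * stride;
            printf("step %d: plain set %4u, hashed set %4u\n",
                   i, set_plain(addr), set_hashed(addr));
        }
        return 0;
    }

[With the plain index every step lands in the same set, so even a highly associative cache starts evicting after a handful of rows; the XOR variant spreads the same addresses across distinct sets. Whether real hardware should pay the extra index-computation cost for such hashing is, of course, a separate trade-off.]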