From: Steven G. Johnson on
On Apr 21, 5:30 am, Martin Brown <|||newspam...(a)nezumi.demon.co.uk>
wrote:
> Which compiler did you use for these benchmarks?

gcc 4.1.2

> absolute timing. I am a bit surprised though that an optimised 2^N
> transform cannot beat a radix 10 in terms of time per element
> processed - overheads in radix 5 are quite a bit higher.

First, the arithmetic difference is not that great; on the order of
25%. Second, for transforms this big the time is dominated by memory
access rather than by arithmetic. And, as I mentioned in my previous
message, power-of-two strides have a disadvantage on modern memory
architectures because of the way caches work (limited associativity
based on low-order address bits).

As a fun little exercise, try writing code that performs an NxN matrix
transpose (just two nested loops). Then time it as a function of N,
and plot time/(N*N) vs. N. Without caches etc., the plot would be
roughly constant. However, if you do this on any current general-
purpose machine you will find it is anything but constant---you will
see huge spikes whenever N is a power of 2, or for that matter
whenever N is a multiple of a power of 2.
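
In case you want to try it, something along these lines will do (a
minimal sketch in C; the range of N, the repetition count, and the use
of clock() are arbitrary choices on my part, not part of the point):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int N;
    for (N = 64; N <= 2048; N += 16) {
        double *a = malloc(sizeof(double) * N * N);
        double *b = malloc(sizeof(double) * N * N);
        int i, j, rep, reps = 5;
        clock_t t0, t1;

        if (!a || !b) return 1;
        for (i = 0; i < N * N; ++i) a[i] = i;

        t0 = clock();
        for (rep = 0; rep < reps; ++rep)     /* repeat for timing resolution */
            for (i = 0; i < N; ++i)          /* the two nested loops */
                for (j = 0; j < N; ++j)
                    b[j * N + i] = a[i * N + j];
        t1 = clock();

        /* touch the result so the compiler cannot discard the loops */
        if (b[N - 1] < 0) printf("unexpected\n");

        printf("%d %g\n", N,
               (double)(t1 - t0) / CLOCKS_PER_SEC / reps / ((double)N * N));

        free(a);
        free(b);
    }
    return 0;
}

Plot the second column against N and look at what happens near
multiples of large powers of 2.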

(Someday, perhaps, caches will use a proper hash of all the address
bits and most of this wild variation will go away.)

> Does
> the saved wisdom provide or export a human readable version of the
> actual radix kernels used and their order of application?

The wisdom is not human readable. However, you can use
fftw_print_plan to print a "nerd-readable" version of the plan. Note
that the algorithm is not simply a choice of radices; it has many
other choices besides that (see our Proc. IEEE paper on www.fftw.org).
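
For instance, a minimal sketch (the transform length and planner flag
are arbitrary, just for illustration; compile with something like
gcc print_plan.c -lfftw3 -lm):

#include <stdio.h>
#include <fftw3.h>

int main(void)
{
    int n = 1000000;  /* arbitrary example length */
    fftw_complex *in = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    fftw_print_plan(p);   /* dump the chosen plan to stdout */
    printf("\n");

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}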

> I confess genuine surprise that 2^N no longer reigns supreme (if only
> by a small margin) when you have a free choice of very long transform
> length. You have to concede though that 2^N is a lot simpler to code.

A lot of people are surprised; the myth of the overwhelming advantage
of powers of two is deeply ingrained in DSP folklore.

2^N is only a "lot" simpler to code if you don't care about
performance. If you care about performance, a textbook radix-2
implementation is worthless (5 to 40 times slower than the fastest
FFTs, depending on N), you have to use larger radices anyway, and the
complications involved in developing a highly optimized FFT aren't
really centered on coding the radix butterfly. (For FFTW, all our
butterflies and base cases are generated by a program anyway, so for
us there is no difference.)

Regards,
Steven G. Johnson
From: Steven G. Johnson on
On Apr 21, 5:57 am, Martin Brown <|||newspam...(a)nezumi.demon.co.uk>
wrote:
> Indeed. There must be something wrong with the old radix5/10 kernel I
> use.

There's probably nothing wrong with it; it's just optimized mainly for
the power-of-two case. Most FFT programs are. Programmers have
limited time to devote to optimization, and so it is sensible to
devote most of that time to powers of two since they are the ingrained
choice of most DSP practitioners.

In FFTW, the generation of the radix kernels etc. is automated, so it
is somewhat unusual in that we don't have to make much of a tradeoff
between optimizing powers of two and non-powers of two.

Steven
From: Steven G. Johnson on
On Apr 21, 12:47 pm, "Steven G. Johnson" <stev...(a)alum.mit.edu> wrote:
> As a fun little exercise, try writing code that performs an NxN matrix
> transpose (just two nested loops). Then time it as a function of N,
> and plot time/(N*N) vs. N. Without caches etc., the plot would be
> roughly constant. However, if you do this on any current general-
> purpose machine you will find it is anything but constant---you will
> see huge spikes whenever N is a power of 2, or for that matter
> whenever N is a multiple of a power of 2.

(I should mention that there are faster ways to perform matrix
transpositions than the simplest algorithm of two nested loops, in
order to improve cache-line utilization; actually, FFTW has optimized
in-place and out-of-place transpose routines for both square and non-
square matrices [the non-square in-place case being *much* harder, of
course]. But the magnitude of the timing variation of the two-loop
transposition as a function of N is surprising to most programmers who
have an oversimplified mental model of what is going on inside the
CPU.)
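
To give the flavor of one such faster approach, here is the standard
cache-blocking idea in sketch form; this is not FFTW's actual code, and
B = 32 is just an arbitrary tuning parameter:

#include <stddef.h>

/* Out-of-place transpose of an n x n matrix of doubles, processed in
   B x B blocks so that the working parts of source and destination
   stay resident in cache. */
void transpose_blocked(const double *a, double *b, size_t n)
{
    const size_t B = 32;
    size_t ii, jj, i, j;
    for (ii = 0; ii < n; ii += B)
        for (jj = 0; jj < n; jj += B) {
            size_t imax = ii + B < n ? ii + B : n;
            size_t jmax = jj + B < n ? jj + B : n;
            for (i = ii; i < imax; ++i)
                for (j = jj; j < jmax; ++j)
                    b[j * n + i] = a[i * n + j];
        }
}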

Steven
From: Bob Lied on
> Steven G. Johnson wrote [2008-04-21 11:47 AM]:
> As a fun little exercise, try writing code that performs an NxN matrix
> transpose (just two nested loops). Then time it as a function of N,
> and plot time/(N*N) vs. N. Without caches etc., the plot would be
> roughly constant. However, if you do this on any current general-
> purpose machine you will find it is anything but constant---you will
> see huge spikes whenever N is a power of 2, or for that matter
> whenever N is a multiple of a power of 2.

You're right -- that was fun, and not at all what I would have
predicted. I already knew my intuition about program performance was
nearly useless, but this little exercise sure reinforces the lesson to
measure before optimizing.
From: Martin Brown on
Steven G. Johnson wrote:
> On Apr 21, 5:30 am, Martin Brown <|||newspam...(a)nezumi.demon.co.uk>
> wrote:
>> Which compiler did you use for these benchmarks?
>
> gcc 4.1.2

Thanks.
>
>> absolute timing. I am a bit surprised though that an optimised 2^N
>> transform cannot beat a radix 10 in terms of time per element
>> processed - overheads in radix 5 are quite a bit higher.
>
> First, the arithmetic difference is not that great; on the order of
> 25%. Second, for transforms this big the time is dominated by memory
> access rather than by arithmetic. And, as I mentioned in my previous
> message, power-of-two strides have a disadvantage on modern memory
> architectures because of the way caches work (limited associativity
> based on low-order address bits).

I hadn't realised the hit was so big.
>
> As a fun little exercise, try writing code that performs an NxN matrix
> transpose (just two nested loops). Then time it as a function of N,
> and plot time/(N*N) vs. N. Without caches etc., the plot would be
> roughly constant. However, if you do this on any current general-
> purpose machine you will find it is anything but constant---you will
> see huge spikes whenever N is a power of 2, or for that matter
> whenever N is a multiple of a power of 2.

I remember demand-paged virtual memory. It was almost guaranteed that
every year some new graduate student would transpose a 1024-square image
(big, in those days) with a pair of nested loops instead of using the
optimised library functions. It could grind a VAX to a standstill, as
almost every memory access generated a page fault.
>
> (Someday, perhaps, caches will use a proper hash of all the address
> bits and most of this wild variation will go away.)

Would an XOR of some midlevel bits be sufficient to break the degeneracy?
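
Something like the following toy set-index function is what I have in
mind; the line size, set count, and stride are made-up numbers, and real
hardware may well do something cleverer:

#include <stdio.h>
#include <stdint.h>

#define LINE_BITS 6   /* 64-byte cache lines: made-up toy parameters */
#define SET_BITS  9   /* 512 sets */

/* Conventional index: only the low-order address bits select the set,
   so a large power-of-two stride keeps hitting the same few sets. */
static unsigned set_index_plain(uint64_t addr)
{
    return (unsigned)((addr >> LINE_BITS) & ((1u << SET_BITS) - 1));
}

/* XOR-folding some mid-level bits into the index spreads such strides
   across all the sets. */
static unsigned set_index_xor(uint64_t addr)
{
    unsigned lo  = (unsigned)((addr >> LINE_BITS) & ((1u << SET_BITS) - 1));
    unsigned mid = (unsigned)((addr >> (LINE_BITS + SET_BITS))
                              & ((1u << SET_BITS) - 1));
    return lo ^ mid;
}

int main(void)
{
    uint64_t k;
    for (k = 0; k < 32; ++k) {
        uint64_t addr = k * 4096;   /* a power-of-two stride */
        printf("%2u  plain %3u  xor %3u\n", (unsigned)k,
               set_index_plain(addr), set_index_xor(addr));
    }
    return 0;
}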

Thanks for the paper reference. I am curious how the wisdom works in a
bit more detail.

>> I confess genuine surprise that 2^N no longer reigns supreme (if only
>> by a small margin) when you have a free choice of very long transform
>> length. You have to concede though that 2^N is a lot simpler to code.
>
> A lot of people are surprised; the myth of the overwhelming advantage
> of powers of two is deeply ingrained in DSP folklore.

It used to be true in the old days, apart from some of the Winograd
transforms, as I recall.
>
> 2^N is only a "lot" simpler to code if you don't care about
> performance. If you care about performance, a textbook radix-2
> implementation is worthless (5 to 40 times slower than the fastest
> FFTs, depending on N), you have to use larger radices anyway,

Yes, although the best optimising compilers do seem to have narrowed the
differences a bit. Compact inner-loop code can help. Incidentally, do you
find the length-16 split-radix transform worth the effort of coding?

> (For FFTW, all our
> butterflies and base cases are generated by a program anyway, so for
> us there is no difference.)

And that is a stroke of genius.

Regards,
Martin Brown