64 x 64 FFT algorithm for 4096 pt. FFT [DSP]

Prev: Multipath channel and downsampling
Next: Mixing different length FFTs

From: onkars on 1 Jul 2010 17:54

>On 6/29/2010 6:22 PM, onkars wrote:
>> I am looking for an FFT algorithm that can give me a 4096 pt. FFT using
t=
>wo
>> 64 pt. FFTs. Is it the mixed radix FFT?
>... plus second post of 7:28 PM
>
>4096=3D64x64 is not mixed radix. It's 2 stages of radix 64 butterflies,
>so it's radix 64. Mixed radix would be 4096 =3D 128x32, or 32x128, or
>256x16, or 16x256, or ... (it's a long list).
>
>Based on some of your previous posts, I presume that you're looking to
>do this in hardware. In that case, a 64x64 isn't a bad choice (there
>are better ones), because you can break the 64x64 down into a 4 stage
>pipeline of 8x8x8x8. Radix 8 is quite efficient in that you only have
>a couple of 'difficult' internal twiddles of the +/-.707.... type, and
>they can be done efficiently with a special purpose multiplier. So
>you'd have 4 radix 8 butterfly stages (each with a special purpose +/-.
>707... multiplier), plus 3 full scale multipliers for the inter-stage
>twiddles.
>
>There are a lot of additional concepts that could be applied, but I'm
>not really sure exactly what you're trying to do, so I won't get into
>them.
>
>To get an idea of the basic concepts, you may want to look at:
>
>L. R. Rabiner, B. Gold, Theory and Application of Digital Signal
>Processing. Englewood Cliffs, NJ, Prentice-Hall, 1975.
>
>Chapters 6 and 10 in the above deal with FFT and some basic pipeline
>architectures. Although the book is out of print, you may be able to
>find it at a local university library, or on EBAY.
>
>As for the twiddles between the stages, they're fairly simple to
>figure out. I could tell you what they are for a 64x64 graph, or an
>8x8x8x8, but then I'd be doing all the work, and you wouldn't really
>learn anything (here's a hint: note the pattern of the input/output
>order and twiddle sequence of each butterfly for the 8x8 graph in Fig.
>10.15 on p. 587 in Rabiner and Gold's book. Now imagine what the
>graph would look like if you had 2 stages of 64x64 butterflies).
>
>If you get through all of the above and feel a little ambitious, you
>might want to take a look at:
>
>K. J. McGee, =93Comments on =91A 64-Point Fourier Transform Chip for
Video
>Motion Compensation Using Phase Correlation,=94 IEEE J. Solid-State
>Circuits, vol. 33, no. 6, June 1998, pp. 928-932.
>
>In particular, pay attention to Section III in the above: Equivalence
>of Higher and Lower Radix Graphs. It has important implications for
>shuffle exchange algorithms and pipeline architectures (and, as noted,
>I originally developed the concept in 1981).
>
>Kevin McGee
>

Hi Kevin,

Thanks for your detailed response.

Actually I am working on building a 4096 pt. FFT processor (8 bit input
with scaling) which is aimed at athroughtput of 20+ GigaSamples/s.

I am targeting this design for the Virtex 6 FPGA. With some initial
resource estimation and experiments with the Xilinx core FFT, I realized
that the issue with an FPGA was not so much multipliers and memory -- it is
the number of slices that are the bottleneck.

For Virtex 6 I can use the DSP48s for complex multiplication (4 per complex
mult) -- Virtex 6 has 2000 of these. And it also has abundant Block RAMs.
So I have to efficiently use the slices -- saving adders, registers could
help this.

I figured out that I could realistically achieve 350 MHz on the FPGA --
which implies that to achieve 20 Gsamples/s I need approx. 64 samples
output per clock cycle @350 MHz.

So my goals sort of become:
1. Minimize slice usage -- even at the cost of extra memory usage or
slightly more mults (implies reducing adders and registers).
2. Reduce long routes/wires which could bring the max clock freq. below 350
MHz.
3. one thing is for sure -- use MDC structures -- using SDF and SDC
structures saves memory, which is not our goal.

And the options that I consider:
1. Use 64x64 structure -- with reordering memory (RAMs) in between the 2 64
stages. Also, here I intend to use the full-blown parallel-pipelined (i.e.
full row and full column parallelism) implementation of 64 stage
(internally using maybe radix 4 i.e. 2-squared implemenation). The
advantages of this arch is that control will be minimal and simple. Also,
single structure computes the FFT @350 MHz hence the inputs will always
arrive on the same pins -- which feed this FFT. The downside is that the
routes will be long -- and may implact the ability to achieve 350 MHz.

2. Another approach is using X number FFTs (unable to figure out X and fft
length) in parallel (each implemented as radix 4 MDC). And these X parallel
FFTs feeding the 64 inputs of single 64 pt. FFT implemented full blown
(full parallelism) -- which eventually gives 4096 pt. FFT.

3. Use 8x8x8x8 MDC structure FFT. Will need to use 8 of these to get 64
outputs per cycle(with input demux feeding the 8 FFTs). The radix 8 stages
will be implemented using 2^3 (or 2-cube radix). Good point is shorter
routes, but more commutators -- making it complex. Another issue is the
input buffering will have to be managed to achieve 100% utilization. Also,
there will have to be some kind of demux. between the input pins and 8
different FFTs OR use 8 different input pin sets and external demux.

4. similar to (2) but use 4x4x4x4x4x4 MDC. Use 16 of these. Advantage is
even smaller routes. Disadvantages -- more complex commutators and control.
Input pins to 16 FFT routing.

Do you have any suggestions? Any comments or any other different
architecture or some pointers/references I could use?

Thank you.

Onkar

From: kevin on 1 Jul 2010 23:43

On Jul 1, 5:54 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com>
wrote:

(snip)

Some general comments first:

There are a great many pipeline FFT architectures (>100, at least).
You should thoroughly review the patent literature, journals and
conference proceedings.

Not all pipelines are created equal. Back in the 1980's, I did a
comparative analysis of many architectures. Some use less memory than
others. Some allow for much higher input/output data speeds (i.e.:
given a certain technology speed, what's the highest clock rate you
can use?). In some, the arithmetic units operate only 50% or 25% of
the time. For one of the architectures, I had to view the original
patent, rather than the published articles, because I determined that
it couldn't possibly work unless you had some extra registers at the
input/output of the arithmetic units (sure enough, they're right there
in the patent, but none of the published articles show them).

You might want to take a look at Fig. 3 in the previously mentioned
1998 journal article of mine. It shows a 3 stage 64 point 4x4x4.
It's 100% efficient in that all arithmetic units operate with no dead
time. And one nice thing about it is that the outputs are in line
sequential order (see the figure in the paper for the sequence).

You could either use that as your first 64 point, dump the outputs
into a buffer, and then use the another 4x4x4 for the second 64
point. Or you could design a 16x16x16. Your arithmetic unit in that
case will have to handle the +/- .707... twiddles, plus the couple of
others that you get for a 16 point FFT.

As for some of your specific comments:

>> Reduce long routes/wires which could bring the max clock freq. below 350 MHz.

A persistent problem. In the late 1980's, I took a course at LSI
Logic for designing with their gate array. It was an R+D project, so
we didn't have the money in the budget to pay for a mask charge
(100K), but I was able to use their design tools to simulate a 64
point pipeline that I had designed. It took a lot of clever ideas to
get it to operate at a 3 nanosecond clock (folding shift registers
back on themselves so that data went in and out of them at spots near
the arithmetic units; fully pipelining both the adders and multipliers
so I could feed them at the maximum clock speed; using the pipelining
delay through the adders/multiplier such that they effectively doubled
as additional storage) . Although the pipeline could handle it, the I/
O pins would only allow for 5 nanosecond speeds.

You're using an FPGA, so your problems could actually be worse in that
you probably don't have access to the lower level design elements that
I had.

>> Use 64x64 structure -- with reordering memory (RAMs) in between the 2 64
>> stages. Also, here I intend to use the full-blown parallel-pipelined (i.e.
>> full row and full column parallelism)

The only time that I would consider using an array architecture like
that is if there were no other way to meet the speed requirement.

>> but more commutators

In custom NMOS and CMOS, they're not too difficult. Take a look at
the aforementioned 1998 paper and check out reference 4 (US patent
4396994):

http://www.freepatentsonline.com/4396994.html

The commutator operation can be seen and implemented as a barrel
shifter (see Fig. 3 in the above paper). However, I'm not sure that
you'll have access to the low level design elements that you'd need
when using an FPGA. You may be forced into using mux or demux.

>> similar to (2) but use 4x4x4x4x4x4 MDC. Use 16 of these. Advantage is
>> even smaller routes. Disadvantages -- more complex commutators and
>> control. Input pins to 16 FFT routing.

Alright. I'd have to think about it. I'll look at your post again
tonight.

Kevin McGee

From: kevin on 3 Jul 2010 01:38

On Jul 1, 5:54 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com>
wrote:

>I figured out that I could realistically achieve 350 MHz on the FPGA --
>which implies that to achieve 20 Gsamples/s I need approx. 64 samples
>output per clock cycle @350 MHz.

OK. I get it now why you are probably going to have to use an array
type arrangement to do the 64 point.

But I'm not clear exactly how your data is coming into the VIRTEX
slices - are you using the Ghz serial port? And how is the input data
partitioned and/or sequenced? Sequential in, I suppose.

You mentioned the need for 100% utilization for the input buffering.
Quite possibly you may find the following to be helpful. First,
things like bit reversal or matrix transpose are fairly easy out of a
RAM. For instance, in several bit reverse in/sequential out
architectures, the data is first bit reversed by reading/writing out
of RAM. Consider the simple case of an 8 point FFT. Data points 0 to
7 are loaded into sequential RAM locations, with the addressing
controlled by a counter. Now the counter has a 2 to 1 mux on every
bit output. Things are arranged such that the counter output is
either in sequential mode (i.e.: count from 0 to 7), or in bit
reversed mode (count in bit reversed order). When the second set of 8
data points arrives, you do a read/write of the location. The data
being read out will be the bit reversed sequence of your first 8
points. The second set of data will actually be stored in the bit
reversed locations. When the third set of 8 points arrives, you read/
write again with the counter in sequential mode. The second set of
data will also come out in bit reversed order. All you need do is
start out correctly and switch the counter mode every 8 data samples.
It actually works - I've done things that way for a very long time.
Try it using the 8 point example, it's really not at all difficult to
understand.

Here's another thing about the above point: take a look at the shift
cell / commutator arrangement that you see in a great many pipeline
architectures. They're all bit reversers or matrix transposers. If
you have a fast enough RAM that can tolerate the input speeds, you
might consider doing things that way. All you have to do is set up
the counter properly. Advantage: no more commutators.

And just another thought: presuming your input data is real only, have
you considered using the '2N real FFT with an N complex FFT'
technique. That would cut your FFT size down to 2048. There's a
little bit of extra processing involved, but it's roughly equivalent
to adding a column of butterflies at the end.

I really haven't had enough time to look at your post thoroughly.
I'll spend some time over the weekend on it. I also spent a little
time scanning the VIRTEX specs.

I could suggest other things or different architectures (or make
better comments on your suggestions), but I'd need to have a clearer
idea of just exactly how that input data is coming in.

Kevin McGee

From: kevin on 3 Jul 2010 01:43

On Jul 3, 1:38 am, kevin <kevinjmc...(a)netscape.net> wrote:
> On Jul 1, 5:54 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com>
> wrote:
>
> >I figured out that I could realistically achieve 350 MHz on the FPGA --
> >which implies that to achieve 20 Gsamples/s I need approx. 64 samples
> >output per clock cycle @350 MHz.
>
> OK. I get it now why you are probably going to have to use an array
> type arrangement to do the 64 point.
>
> But I'm not clear exactly how your data is coming into the VIRTEX
> slices - are you using the Ghz serial port? And how is the input data
> partitioned and/or sequenced? Sequential in, I suppose.
>
> You mentioned the need for 100% utilization for the input buffering.
> Quite possibly you may find the following to be helpful. First,
> things like bit reversal or matrix transpose are fairly easy out of a
> RAM. For instance, in several bit reverse in/sequential out
> architectures, the data is first bit reversed by reading/writing out
> of RAM. Consider the simple case of an 8 point FFT. Data points 0 to
> 7 are loaded into sequential RAM locations, with the addressing
> controlled by a counter. Now the counter has a 2 to 1 mux on every
> bit output. Things are arranged such that the counter output is
> either in sequential mode (i.e.: count from 0 to 7), or in bit
> reversed mode (count in bit reversed order). When the second set of 8
> data points arrives, you do a read/write of the location. The data
> being read out will be the bit reversed sequence of your first 8
> points. The second set of data will actually be stored in the bit
> reversed locations. When the third set of 8 points arrives, you read/
> write again with the counter in sequential mode. The second set of
> data will also come out in bit reversed order. All you need do is
> start out correctly and switch the counter mode every 8 data samples.
> It actually works - I've done things that way for a very long time.
> Try it using the 8 point example, it's really not at all difficult to
> understand.
>
> Here's another thing about the above point: take a look at the shift
> cell / commutator arrangement that you see in a great many pipeline
> architectures. They're all bit reversers or matrix transposers. If
> you have a fast enough RAM that can tolerate the input speeds, you
> might consider doing things that way. All you have to do is set up
> the counter properly. Advantage: no more commutators.
>
> And just another thought: presuming your input data is real only, have
> you considered using the '2N real FFT with an N complex FFT'
> technique. That would cut your FFT size down to 2048. There's a
> little bit of extra processing involved, but it's roughly equivalent
> to adding a column of butterflies at the end.
>
> I really haven't had enough time to look at your post thoroughly.
> I'll spend some time over the weekend on it. I also spent a little
> time scanning the VIRTEX specs.
>
> I could suggest other things or different architectures (or make
> better comments on your suggestions), but I'd need to have a clearer
> idea of just exactly how that input data is coming in.
>
> Kevin McGee

Meant to say: the counter outputs go to a 2 to 1 mux such that you can
select either sequential output or bit reversed output.

From: onkars on 3 Jul 2010 23:09

>On Jul 3, 1:38=A0am, kevin <kevinjmc...(a)netscape.net> wrote:
>> On Jul 1, 5:54 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com>
>> wrote:
>>
>> >I figured out that I could realistically achieve 350 MHz on the FPGA
--
>> >which implies that to achieve 20 Gsamples/s I need approx. 64 samples
>> >output per clock cycle @350 MHz.
>>
>> OK. =A0I get it now why you are probably going to have to use an array
>> type arrangement to do the 64 point.
>>
>> But I'm not clear exactly how your data is coming into the VIRTEX
>> slices - are you using the Ghz serial port? =A0And how is the input
data
>> partitioned and/or sequenced? =A0Sequential in, I suppose.
>>
>> You mentioned the need for 100% utilization for the input buffering.
>> Quite possibly you may find the following to be helpful. =A0First,
>> things like bit reversal or matrix transpose are fairly easy out of a
>> RAM. =A0For instance, in several bit reverse in/sequential out
>> architectures, the data is first bit reversed by reading/writing out
>> of RAM. =A0Consider the simple case of an 8 point FFT. =A0Data points 0
t=
>o
>> 7 are loaded into sequential RAM locations, with the addressing
>> controlled by a counter. =A0Now the counter has a 2 to 1 mux on every
>> bit output. =A0Things are arranged such that the counter output is
>> either in sequential mode (i.e.: count from 0 to 7), or in bit
>> reversed mode (count in bit reversed order). =A0When the second set of
8
>> data points arrives, you do a read/write of the location. =A0The data
>> being read out will be the bit reversed sequence of your first 8
>> points. =A0The second set of data will actually be stored in the bit
>> reversed locations. =A0When the third set of 8 points arrives, you
read/
>> write again with the counter in sequential mode. =A0The second set of
>> data will also come out in bit reversed order. =A0All you need do is
>> start out correctly and switch the counter mode every 8 data samples.
>> It actually works - I've done things that way for a very long time.
>> Try it using the 8 point example, it's really not at all difficult to
>> understand.
>>
>> Here's another thing about the above point: =A0take a look at the shift
>> cell / commutator arrangement that you see in a great many pipeline
>> architectures. =A0They're all bit reversers or matrix transposers. If
>> you have a fast enough RAM that can tolerate the input speeds, you
>> might consider doing things that way. =A0All you have to do is set up
>> the counter properly. =A0Advantage: no more commutators.
>>
>> And just another thought: presuming your input data is real only, have
>> you considered using the '2N real FFT with an N complex FFT'
>> technique. =A0That would cut your FFT size down to 2048. =A0There's a
>> little bit of extra processing involved, but it's roughly equivalent
>> to adding a column of butterflies at the end.
>>
>> I really haven't had enough time to look at your post thoroughly.
>> I'll spend some time over the weekend on it. =A0I also spent a little
>> time scanning the VIRTEX specs.
>>
>> I could suggest other things or different architectures (or make
>> better comments on your suggestions), but I'd need to have a clearer
>> idea of just exactly how that input data is coming in.
>>
>> Kevin McGee
>
>Meant to say: the counter outputs go to a 2 to 1 mux such that you can
>select either sequential output or bit reversed output.
>

Thank you. I really appreciate your time.

As for the inputs to the FPGA -- thats an open aspect too. It can be
designed to best suite the FFT implementation. But we will be using the
HighSpeed serial IOs. So it pretty much depends on the how the FFT is
implemented -- we could even use external fast demuxes if necessary.

Also, the input is complex.

As mentioned earlier I should have 64 samples output per clock cycle. As a
reminder I am doing 4096 pt. fft.

So this can be done using various options e.g.

one FFT made of 64-path radix-64 using 2 stages (full row and column
parallel implementation of radix-64)

OR

4 FFTs made of 16-path radix-16 using 3 stages (full row and column
parallel implementation of radix-16)

OR

8 FFTs made of 8-path radix-8 (full row and column parallel implementation
of radix-8) using 4 stages

Using smaller radices has following disadvantages ---
The number of stages and required number of FFTs (to achieve 64
samples/clock) increases. Reasonably assuming that the number arithmetic
units for any approach is same --- this means that the control overhead
(scheduling logic + commutators using logic not mem. etc.) is increased. I
am not sure how much percentage this is.

Advantage of smaller radices is that the routes will be shorter (???) and
higher frequencies can be achievable. Will the routes when using a single
64-path radix-64 solution be longer than when using 4 16-path radix-16 ffts
??

Somehow I am inclined towards going for the 16 path radix 16 solution, and
aim to achieve 250 to 300 MHz. I have seen a paper (credible) claim they
achieved 40 MHz for a fully parallel (row and column) 256 pt. fft using
1024 radix 2 butterflies on Virtex 5.

First | Prev | Next | Last
Pages: 1 2 3 4 5
Prev: Multipath channel and downsampling
Next: Mixing different length FFTs