From: onkars on 1 Jul 2010 17:54 >On 6/29/2010 6:22 PM, onkars wrote: >> I am looking for an FFT algorithm that can give me a 4096 pt. FFT using t= >wo >> 64 pt. FFTs. Is it the mixed radix FFT? >... plus second post of 7:28 PM > >4096=3D64x64 is not mixed radix. It's 2 stages of radix 64 butterflies, >so it's radix 64. Mixed radix would be 4096 =3D 128x32, or 32x128, or >256x16, or 16x256, or ... (it's a long list). > >Based on some of your previous posts, I presume that you're looking to >do this in hardware. In that case, a 64x64 isn't a bad choice (there >are better ones), because you can break the 64x64 down into a 4 stage >pipeline of 8x8x8x8. Radix 8 is quite efficient in that you only have >a couple of 'difficult' internal twiddles of the +/-.707.... type, and >they can be done efficiently with a special purpose multiplier. So >you'd have 4 radix 8 butterfly stages (each with a special purpose +/-. >707... multiplier), plus 3 full scale multipliers for the inter-stage >twiddles. > >There are a lot of additional concepts that could be applied, but I'm >not really sure exactly what you're trying to do, so I won't get into >them. > >To get an idea of the basic concepts, you may want to look at: > >L. R. Rabiner, B. Gold, Theory and Application of Digital Signal >Processing. Englewood Cliffs, NJ, Prentice-Hall, 1975. > >Chapters 6 and 10 in the above deal with FFT and some basic pipeline >architectures. Although the book is out of print, you may be able to >find it at a local university library, or on EBAY. > >As for the twiddles between the stages, they're fairly simple to >figure out. I could tell you what they are for a 64x64 graph, or an >8x8x8x8, but then I'd be doing all the work, and you wouldn't really >learn anything (here's a hint: note the pattern of the input/output >order and twiddle sequence of each butterfly for the 8x8 graph in Fig. >10.15 on p. 587 in Rabiner and Gold's book. Now imagine what the >graph would look like if you had 2 stages of 64x64 butterflies). > >If you get through all of the above and feel a little ambitious, you >might want to take a look at: > >K. J. McGee, =93Comments on =91A 64-Point Fourier Transform Chip for Video >Motion Compensation Using Phase Correlation,=94 IEEE J. Solid-State >Circuits, vol. 33, no. 6, June 1998, pp. 928-932. > >In particular, pay attention to Section III in the above: Equivalence >of Higher and Lower Radix Graphs. It has important implications for >shuffle exchange algorithms and pipeline architectures (and, as noted, >I originally developed the concept in 1981). > >Kevin McGee > Hi Kevin, Thanks for your detailed response. Actually I am working on building a 4096 pt. FFT processor (8 bit input with scaling) which is aimed at athroughtput of 20+ GigaSamples/s. I am targeting this design for the Virtex 6 FPGA. With some initial resource estimation and experiments with the Xilinx core FFT, I realized that the issue with an FPGA was not so much multipliers and memory -- it is the number of slices that are the bottleneck. For Virtex 6 I can use the DSP48s for complex multiplication (4 per complex mult) -- Virtex 6 has 2000 of these. And it also has abundant Block RAMs. So I have to efficiently use the slices -- saving adders, registers could help this. I figured out that I could realistically achieve 350 MHz on the FPGA -- which implies that to achieve 20 Gsamples/s I need approx. 64 samples output per clock cycle @350 MHz. So my goals sort of become: 1. Minimize slice usage -- even at the cost of extra memory usage or slightly more mults (implies reducing adders and registers). 2. Reduce long routes/wires which could bring the max clock freq. below 350 MHz. 3. one thing is for sure -- use MDC structures -- using SDF and SDC structures saves memory, which is not our goal. And the options that I consider: 1. Use 64x64 structure -- with reordering memory (RAMs) in between the 2 64 stages. Also, here I intend to use the full-blown parallel-pipelined (i.e. full row and full column parallelism) implementation of 64 stage (internally using maybe radix 4 i.e. 2-squared implemenation). The advantages of this arch is that control will be minimal and simple. Also, single structure computes the FFT @350 MHz hence the inputs will always arrive on the same pins -- which feed this FFT. The downside is that the routes will be long -- and may implact the ability to achieve 350 MHz. 2. Another approach is using X number FFTs (unable to figure out X and fft length) in parallel (each implemented as radix 4 MDC). And these X parallel FFTs feeding the 64 inputs of single 64 pt. FFT implemented full blown (full parallelism) -- which eventually gives 4096 pt. FFT. 3. Use 8x8x8x8 MDC structure FFT. Will need to use 8 of these to get 64 outputs per cycle(with input demux feeding the 8 FFTs). The radix 8 stages will be implemented using 2^3 (or 2-cube radix). Good point is shorter routes, but more commutators -- making it complex. Another issue is the input buffering will have to be managed to achieve 100% utilization. Also, there will have to be some kind of demux. between the input pins and 8 different FFTs OR use 8 different input pin sets and external demux. 4. similar to (2) but use 4x4x4x4x4x4 MDC. Use 16 of these. Advantage is even smaller routes. Disadvantages -- more complex commutators and control. Input pins to 16 FFT routing. Do you have any suggestions? Any comments or any other different architecture or some pointers/references I could use? Thank you. Onkar
From: kevin on 1 Jul 2010 23:43 On Jul 1, 5:54 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com> wrote: (snip) Some general comments first: There are a great many pipeline FFT architectures (>100, at least). You should thoroughly review the patent literature, journals and conference proceedings. Not all pipelines are created equal. Back in the 1980's, I did a comparative analysis of many architectures. Some use less memory than others. Some allow for much higher input/output data speeds (i.e.: given a certain technology speed, what's the highest clock rate you can use?). In some, the arithmetic units operate only 50% or 25% of the time. For one of the architectures, I had to view the original patent, rather than the published articles, because I determined that it couldn't possibly work unless you had some extra registers at the input/output of the arithmetic units (sure enough, they're right there in the patent, but none of the published articles show them). You might want to take a look at Fig. 3 in the previously mentioned 1998 journal article of mine. It shows a 3 stage 64 point 4x4x4. It's 100% efficient in that all arithmetic units operate with no dead time. And one nice thing about it is that the outputs are in line sequential order (see the figure in the paper for the sequence). You could either use that as your first 64 point, dump the outputs into a buffer, and then use the another 4x4x4 for the second 64 point. Or you could design a 16x16x16. Your arithmetic unit in that case will have to handle the +/- .707... twiddles, plus the couple of others that you get for a 16 point FFT. As for some of your specific comments: >> Reduce long routes/wires which could bring the max clock freq. below 350 MHz. A persistent problem. In the late 1980's, I took a course at LSI Logic for designing with their gate array. It was an R+D project, so we didn't have the money in the budget to pay for a mask charge (100K), but I was able to use their design tools to simulate a 64 point pipeline that I had designed. It took a lot of clever ideas to get it to operate at a 3 nanosecond clock (folding shift registers back on themselves so that data went in and out of them at spots near the arithmetic units; fully pipelining both the adders and multipliers so I could feed them at the maximum clock speed; using the pipelining delay through the adders/multiplier such that they effectively doubled as additional storage) . Although the pipeline could handle it, the I/ O pins would only allow for 5 nanosecond speeds. You're using an FPGA, so your problems could actually be worse in that you probably don't have access to the lower level design elements that I had. >> Use 64x64 structure -- with reordering memory (RAMs) in between the 2 64 >> stages. Also, here I intend to use the full-blown parallel-pipelined (i.e. >> full row and full column parallelism) The only time that I would consider using an array architecture like that is if there were no other way to meet the speed requirement. >> but more commutators In custom NMOS and CMOS, they're not too difficult. Take a look at the aforementioned 1998 paper and check out reference 4 (US patent 4396994): http://www.freepatentsonline.com/4396994.html The commutator operation can be seen and implemented as a barrel shifter (see Fig. 3 in the above paper). However, I'm not sure that you'll have access to the low level design elements that you'd need when using an FPGA. You may be forced into using mux or demux. >> similar to (2) but use 4x4x4x4x4x4 MDC. Use 16 of these. Advantage is >> even smaller routes. Disadvantages -- more complex commutators and >> control. Input pins to 16 FFT routing. Alright. I'd have to think about it. I'll look at your post again tonight. Kevin McGee
From: kevin on 3 Jul 2010 01:38 On Jul 1, 5:54 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com> wrote: >I figured out that I could realistically achieve 350 MHz on the FPGA -- >which implies that to achieve 20 Gsamples/s I need approx. 64 samples >output per clock cycle @350 MHz. OK. I get it now why you are probably going to have to use an array type arrangement to do the 64 point. But I'm not clear exactly how your data is coming into the VIRTEX slices - are you using the Ghz serial port? And how is the input data partitioned and/or sequenced? Sequential in, I suppose. You mentioned the need for 100% utilization for the input buffering. Quite possibly you may find the following to be helpful. First, things like bit reversal or matrix transpose are fairly easy out of a RAM. For instance, in several bit reverse in/sequential out architectures, the data is first bit reversed by reading/writing out of RAM. Consider the simple case of an 8 point FFT. Data points 0 to 7 are loaded into sequential RAM locations, with the addressing controlled by a counter. Now the counter has a 2 to 1 mux on every bit output. Things are arranged such that the counter output is either in sequential mode (i.e.: count from 0 to 7), or in bit reversed mode (count in bit reversed order). When the second set of 8 data points arrives, you do a read/write of the location. The data being read out will be the bit reversed sequence of your first 8 points. The second set of data will actually be stored in the bit reversed locations. When the third set of 8 points arrives, you read/ write again with the counter in sequential mode. The second set of data will also come out in bit reversed order. All you need do is start out correctly and switch the counter mode every 8 data samples. It actually works - I've done things that way for a very long time. Try it using the 8 point example, it's really not at all difficult to understand. Here's another thing about the above point: take a look at the shift cell / commutator arrangement that you see in a great many pipeline architectures. They're all bit reversers or matrix transposers. If you have a fast enough RAM that can tolerate the input speeds, you might consider doing things that way. All you have to do is set up the counter properly. Advantage: no more commutators. And just another thought: presuming your input data is real only, have you considered using the '2N real FFT with an N complex FFT' technique. That would cut your FFT size down to 2048. There's a little bit of extra processing involved, but it's roughly equivalent to adding a column of butterflies at the end. I really haven't had enough time to look at your post thoroughly. I'll spend some time over the weekend on it. I also spent a little time scanning the VIRTEX specs. I could suggest other things or different architectures (or make better comments on your suggestions), but I'd need to have a clearer idea of just exactly how that input data is coming in. Kevin McGee
From: kevin on 3 Jul 2010 01:43 On Jul 3, 1:38 am, kevin <kevinjmc...(a)netscape.net> wrote: > On Jul 1, 5:54 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com> > wrote: > > >I figured out that I could realistically achieve 350 MHz on the FPGA -- > >which implies that to achieve 20 Gsamples/s I need approx. 64 samples > >output per clock cycle @350 MHz. > > OK. I get it now why you are probably going to have to use an array > type arrangement to do the 64 point. > > But I'm not clear exactly how your data is coming into the VIRTEX > slices - are you using the Ghz serial port? And how is the input data > partitioned and/or sequenced? Sequential in, I suppose. > > You mentioned the need for 100% utilization for the input buffering. > Quite possibly you may find the following to be helpful. First, > things like bit reversal or matrix transpose are fairly easy out of a > RAM. For instance, in several bit reverse in/sequential out > architectures, the data is first bit reversed by reading/writing out > of RAM. Consider the simple case of an 8 point FFT. Data points 0 to > 7 are loaded into sequential RAM locations, with the addressing > controlled by a counter. Now the counter has a 2 to 1 mux on every > bit output. Things are arranged such that the counter output is > either in sequential mode (i.e.: count from 0 to 7), or in bit > reversed mode (count in bit reversed order). When the second set of 8 > data points arrives, you do a read/write of the location. The data > being read out will be the bit reversed sequence of your first 8 > points. The second set of data will actually be stored in the bit > reversed locations. When the third set of 8 points arrives, you read/ > write again with the counter in sequential mode. The second set of > data will also come out in bit reversed order. All you need do is > start out correctly and switch the counter mode every 8 data samples. > It actually works - I've done things that way for a very long time. > Try it using the 8 point example, it's really not at all difficult to > understand. > > Here's another thing about the above point: take a look at the shift > cell / commutator arrangement that you see in a great many pipeline > architectures. They're all bit reversers or matrix transposers. If > you have a fast enough RAM that can tolerate the input speeds, you > might consider doing things that way. All you have to do is set up > the counter properly. Advantage: no more commutators. > > And just another thought: presuming your input data is real only, have > you considered using the '2N real FFT with an N complex FFT' > technique. That would cut your FFT size down to 2048. There's a > little bit of extra processing involved, but it's roughly equivalent > to adding a column of butterflies at the end. > > I really haven't had enough time to look at your post thoroughly. > I'll spend some time over the weekend on it. I also spent a little > time scanning the VIRTEX specs. > > I could suggest other things or different architectures (or make > better comments on your suggestions), but I'd need to have a clearer > idea of just exactly how that input data is coming in. > > Kevin McGee Meant to say: the counter outputs go to a 2 to 1 mux such that you can select either sequential output or bit reversed output.
From: onkars on 3 Jul 2010 23:09 >On Jul 3, 1:38=A0am, kevin <kevinjmc...(a)netscape.net> wrote: >> On Jul 1, 5:54 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com> >> wrote: >> >> >I figured out that I could realistically achieve 350 MHz on the FPGA -- >> >which implies that to achieve 20 Gsamples/s I need approx. 64 samples >> >output per clock cycle @350 MHz. >> >> OK. =A0I get it now why you are probably going to have to use an array >> type arrangement to do the 64 point. >> >> But I'm not clear exactly how your data is coming into the VIRTEX >> slices - are you using the Ghz serial port? =A0And how is the input data >> partitioned and/or sequenced? =A0Sequential in, I suppose. >> >> You mentioned the need for 100% utilization for the input buffering. >> Quite possibly you may find the following to be helpful. =A0First, >> things like bit reversal or matrix transpose are fairly easy out of a >> RAM. =A0For instance, in several bit reverse in/sequential out >> architectures, the data is first bit reversed by reading/writing out >> of RAM. =A0Consider the simple case of an 8 point FFT. =A0Data points 0 t= >o >> 7 are loaded into sequential RAM locations, with the addressing >> controlled by a counter. =A0Now the counter has a 2 to 1 mux on every >> bit output. =A0Things are arranged such that the counter output is >> either in sequential mode (i.e.: count from 0 to 7), or in bit >> reversed mode (count in bit reversed order). =A0When the second set of 8 >> data points arrives, you do a read/write of the location. =A0The data >> being read out will be the bit reversed sequence of your first 8 >> points. =A0The second set of data will actually be stored in the bit >> reversed locations. =A0When the third set of 8 points arrives, you read/ >> write again with the counter in sequential mode. =A0The second set of >> data will also come out in bit reversed order. =A0All you need do is >> start out correctly and switch the counter mode every 8 data samples. >> It actually works - I've done things that way for a very long time. >> Try it using the 8 point example, it's really not at all difficult to >> understand. >> >> Here's another thing about the above point: =A0take a look at the shift >> cell / commutator arrangement that you see in a great many pipeline >> architectures. =A0They're all bit reversers or matrix transposers. If >> you have a fast enough RAM that can tolerate the input speeds, you >> might consider doing things that way. =A0All you have to do is set up >> the counter properly. =A0Advantage: no more commutators. >> >> And just another thought: presuming your input data is real only, have >> you considered using the '2N real FFT with an N complex FFT' >> technique. =A0That would cut your FFT size down to 2048. =A0There's a >> little bit of extra processing involved, but it's roughly equivalent >> to adding a column of butterflies at the end. >> >> I really haven't had enough time to look at your post thoroughly. >> I'll spend some time over the weekend on it. =A0I also spent a little >> time scanning the VIRTEX specs. >> >> I could suggest other things or different architectures (or make >> better comments on your suggestions), but I'd need to have a clearer >> idea of just exactly how that input data is coming in. >> >> Kevin McGee > >Meant to say: the counter outputs go to a 2 to 1 mux such that you can >select either sequential output or bit reversed output. > Thank you. I really appreciate your time. As for the inputs to the FPGA -- thats an open aspect too. It can be designed to best suite the FFT implementation. But we will be using the HighSpeed serial IOs. So it pretty much depends on the how the FFT is implemented -- we could even use external fast demuxes if necessary. Also, the input is complex. As mentioned earlier I should have 64 samples output per clock cycle. As a reminder I am doing 4096 pt. fft. So this can be done using various options e.g. one FFT made of 64-path radix-64 using 2 stages (full row and column parallel implementation of radix-64) OR 4 FFTs made of 16-path radix-16 using 3 stages (full row and column parallel implementation of radix-16) OR 8 FFTs made of 8-path radix-8 (full row and column parallel implementation of radix-8) using 4 stages Using smaller radices has following disadvantages --- The number of stages and required number of FFTs (to achieve 64 samples/clock) increases. Reasonably assuming that the number arithmetic units for any approach is same --- this means that the control overhead (scheduling logic + commutators using logic not mem. etc.) is increased. I am not sure how much percentage this is. Advantage of smaller radices is that the routes will be shorter (???) and higher frequencies can be achievable. Will the routes when using a single 64-path radix-64 solution be longer than when using 4 16-path radix-16 ffts ?? Somehow I am inclined towards going for the 16 path radix 16 solution, and aim to achieve 250 to 300 MHz. I have seen a paper (credible) claim they achieved 40 MHz for a fully parallel (row and column) 256 pt. fft using 1024 radix 2 butterflies on Virtex 5.
First
|
Prev
|
Next
|
Last
Pages: 1 2 3 4 5 Prev: Multipath channel and downsampling Next: Mixing different length FFTs |