From: kevin on 4 Jul 2010 03:33 On Jul 3, 11:09 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com> wrote: >As for the inputs to the FPGA -- thats an open aspect too. It can be >designed to best suite the FFT implementation. But we will be using the >HighSpeed serial IOs. So it pretty much depends on the how the FFT is >implemented -- we could even use external fast demuxes if necessary. >Also, the input is complex. >As mentioned earlier I should have 64 samples output per clock cycle. As a >reminder I am doing 4096 pt. fft. Very useful information. I'm a little concerned about the rest of your post because I'm presuming that your input data is sequential. That is, 8 bits each of points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then 128 ... 191, then 192 ... 255, all the way up to 4095. So you have 64 eight bit sequential inputs per clock tick at 375 Mhz (512 bits total). Here's the problem. With a 4096 point FFT, you've got 2 stages of 64 point butterflies with 1 column of inter-stage twiddles between them. No matter what the input/output order, geometry type or decimation, the top most butterfly in the first stage uses the following inputs: 0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is 4096-64 = 4032). The next butterfly down from the top in the first stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033). The next butterfly down uses a PLUS 2 (2, 66, 130, 194, ... 4034). Once again, samples separated by 64. If you need any clarification, I can send a 64 point 8x8 bitmap (or .pdf or whatever) via 'reply to author'. It's not at all difficult to explain what a 64x64 graph looks like after viewing the 8x8 (and I can send the explanation, too). Now you can break down the a 4096 many different ways, but you'll still have the same problem of how to feed the first stage of the pipeline with the appropriate grouping of inputs. I was concerned because you post kind of implied that you've already got your 4096 data coming in as if it were in bit-reversed order. Bit-reversing your 4096 inputs when they're coming in 64 at a time is not trivial and requires some careful thought. A double buffer would work, but it's extra hardware. Not to worry. I'm sure it can be done, but the devil's in the details. I've also got a couple of other suggestions, but I'll post them later, and think about your buffering problem in the meantime. You've got an interesting problem. Kevin McGee
From: onkars on 5 Jul 2010 21:14 > >On Jul 3, 11:09 pm, "onkars" <onkar.sarode(a)n_o_s_p_a_m.gmail.com> >wrote: > >>As for the inputs to the FPGA -- thats an open aspect too. It can be >>designed to best suite the FFT implementation. But we will be using the >>HighSpeed serial IOs. So it pretty much depends on the how the FFT is >>implemented -- we could even use external fast demuxes if necessary. > >>Also, the input is complex. > >>As mentioned earlier I should have 64 samples output per clock cycle. As a >>reminder I am doing 4096 pt. fft. > >Very useful information. > >I'm a little concerned about the rest of your post because I'm >presuming that your input data is sequential. That is, 8 bits each of >points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then >128 ... 191, then 192 ... 255, all the way up to 4095. So you have 64 >eight bit sequential inputs per clock tick at 375 Mhz (512 bits >total). > >Here's the problem. With a 4096 point FFT, you've got 2 stages of 64 >point butterflies with 1 column of inter-stage twiddles between them. >No matter what the input/output order, geometry type or decimation, >the top most butterfly in the first stage uses the following inputs: >0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is >4096-64 = 4032). The next butterfly down from the top in the first >stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033). >The next butterfly down uses a PLUS 2 (2, 66, 130, 194, ... 4034). >Once again, samples separated by 64. > >If you need any clarification, I can send a 64 point 8x8 bitmap >(or .pdf or whatever) via 'reply to author'. It's not at all >difficult to explain what a 64x64 graph looks like after viewing the >8x8 (and I can send the explanation, too). > >Now you can break down the a 4096 many different ways, but you'll >still have the same problem of how to feed the first stage of the >pipeline with the appropriate grouping of inputs. I was concerned >because you post kind of implied that you've already got your 4096 >data coming in as if it were in bit-reversed order. Bit-reversing >your 4096 inputs when they're coming in 64 at a time is not trivial >and requires some careful thought. A double buffer would work, but >it's extra hardware. > >Not to worry. I'm sure it can be done, but the devil's in the >details. > >I've also got a couple of other suggestions, but I'll post them later, >and think about your buffering problem in the meantime. > >You've got an interesting problem. > >Kevin McGee > > Actually for the ordering of the input -- I have always thought of using double buffers. Extra hardware -- which here is memory -- is abundant on the FPGA (38000 Kb). It will increase latency -- which is not an issue. Thanks a lot. Onkar
From: kevin on 6 Jul 2010 00:22 I looked at some of the FFT cores being offered by various vendors. Your requirements are quite steep (and I've always thought it's more fun to do your own). But you might want to Google "virtex and FFT and cores" and check them out. You can get an idea of how others are handling the I/O. And most seem to be doing radix 2, with a few radix 4 or mixed radix (and one NxM - arbitrary numbers). I have more to add after my previous post, but it's late here, so I'll post again later today. Kevin McGee
From: kevin on 7 Jul 2010 02:13 OK. I'm convinced that there are at least several ways of manipulating your inputs such that you can sequence them properly and process them in the first stage 64 point pipelines. The only question is which way is most efficient. As for the structure of the first stage 64, you were wondering if the higher radix versions would result in longer wiring routes. Well, you're going to have very long wiring routes even for a radix 2 pipeline. Consider that your input data (as you described) consists of 64 eight bit words arriving simultaneously (512 bits total). Each stage of a 64 point radix 2 pipeline contains 32 butterflies. Presumably, they're done in parallel to meet the required throughput speed. And since you're doing things with complex inputs, you actually have 512 + 512 = 1024 bits being processed per clock tick (or, 64 complex 8 bit words per tick). That's a lot of wires. And because of the nature of the FFT, some of those wires are going to be crossing other wires to get to the proper butterfly (it depends on the exact algorithm, but every single algorithm is going to require it). As an aside, you might consider taking a look at the correspondence I mentioned some time before: K. J. McGee, Comments on A 64-Point Fourier Transform Chip for Video Motion Compensation Using Phase Correlation, IEEE J. Solid-State Circuits, vol. 33, no. 6, June 1998, pp. 928-932. Note in the abstract where I state: lower radix algorithms can be modified to be computationally equivalent to higher radix algorithms. Then I describe why in section III of the above. It's a true statement, and I figured it out in 1981 after manipulating a lot of FFT graphs. For instance, any 64 point radix 2 graph can be modified such that it contains the EXACT SAME number of adds, subtracts and multiplies as in a 64 point radix 8 graph. So lower radix efficiency can be made = to higher radix efficiency. So, for me at least, the choice of radix has more to do with what hardware resources are available, and which arrangement is most efficient. So you could proceed by picking a particular 64 point graph. Start with radix 2, if you want, and see just how much in the way of resources you'll need to implement it in full parallel form. There's yet other important thing to note. Since you seem to be fairly flexible as to your inputs, you might seriously consider doing things in a serial fashion. Suppose your data were coming in as 512 bits at a time. Then, after 8 ticks of your 375 Mhz clock, you'd have 512 eight bit words of data, and you'd process them serially. Whether your data is coming in parallel or serial, you should be able to get the same high throughput that you require. Serial processing may help you decrease the amount of long wire routes. It's something that you should consider. Kevin McGee
From: onkars on 9 Jul 2010 12:39 Thanks again for your time. >There's yet other important thing to note. Since you seem to be >fairly flexible as to your inputs, you might seriously consider doing >things in a serial fashion. Suppose your data were coming in as 512 >bits at a time. Then, after 8 ticks of your 375 Mhz clock, you'd have >512 eight bit words of data, and you'd process them serially. > >Whether your data is coming in parallel or serial, you should be able >to get the same high throughput that you require. Serial processing >may help you decrease the amount of long wire routes. It's something >that you should consider. > I am not exactly sure what you mean by serial. Does it mean having multiple smaller pipeline FFT engines -- each doing a 4096 point FFT but fed serially -- instead of 1 big 64 path pipelined fft. If this is what you meant -- I actually mentioned this in my previous posts -- actually these were part of the options I am contemplating. See below (extract from previous post): "USE **4** FFTs made of 16-path radix-16 using 3 stages (full row and column parallel implementation of radix-16) OR USE **8** FFTs made of 8-path radix-8 (full row and column parallel implementation of radix-8) using 4 stages" Whenever I speak of smaller radix(r), I mean each FFT pipeline stage of the 4096 radix-r fft will have this 1 radix implemented as full parallel MDC structure (which will internally be built of radix-2 type structures e.g. radix-4 will internally have 4 radix-2s making a radix-4 have 8 complex additions and finally 3 complex multiplications). So, this implies that whenever I say I will use small radix-r it means I will have to use more number of radix-r FFT units. Hence I say: "Using smaller radices has following disadvantages --- The number of stages and required number of FFTs (to achieve 64 samples/clock) increases. Reasonably assuming that the number arithmetic units for any approach is same --- this means that the control overhead (scheduling logic + commutators using logic not mem. etc.) is increased. I am not sure how much percentage this is. Advantage of smaller radices is that the routes will be shorter and higher frequencies can be achievable. Will the routes when using a single 64-path radix-64 solution be longer than when using 4 different radix-16 FFTs each designed as having 16-paths? -- I think they will. ??
First
|
Prev
|
Next
|
Last
Pages: 1 2 3 4 5 Prev: Multipath channel and downsampling Next: Mixing different length FFTs |