From: Rob Gaddi on 25 May 2010 17:32

I've got a Spartan 6 design that I'm working with under ISE 11.5. A code block that I would expect to take up about 200 LUTs is taking 800 instead. 600 LUTs wouldn't be the end of the world, except I'm planning to replicate this block 32 times, which puts me well over the top.

So the question becomes where are all of the LUTs going? There's nothing in the XST status report for the section that would imply anywhere near this much utilization. I've tried looking over the RTL schematic; it's difficult to read and from what I could make out, there still wasn't anything to explain all those LUTs. Then I tried looking through the technology schematic instead. The viewer took forever to open the schematic, and when I finally got it open it took better than a minute any time I wanted to refresh the screen. Needless to say, this got me nowhere.

So, I'm out for advice. Any suggestions on figuring out just where all of those LUTs are going?

Thanks,
Rob

-- 
Rob Gaddi, Highland Technology
Email address is currently out of order

From: glen herrmannsfeldt on 25 May 2010 17:54

Rob Gaddi <rgaddi(a)technologyhighland.com> wrote:
> I've got a Spartan 6 design that I'm working with under ISE 11.5. A
> code block that I would expect to take up about 200 LUTs is taking 800
> instead. 600 LUTs wouldn't be the end of the world, except I'm planning
> to replicate this block 32 times, which puts me well over the top.

How full is the FPGA that you are targeting? If not so full, I believe that the tools don't try so hard. Well, actually the LUT count shouldn't be so far off, but the CLB count can change, as it doesn't fill each CLB.

Otherwise, without knowing about the design it is hard to say.

Can you say a little about the logic? How many counters, adders, RAMs.

Maybe it is using CLB for RAM, instead of BRAM?

-- glen

From: John_H on 25 May 2010 18:42

On May 25, 5:32 pm, Rob Gaddi <rga...(a)technologyhighland.com> wrote:
> I've got a Spartan 6 design that I'm working with under ISE 11.5. A
> code block that I would expect to take up about 200 LUTs is taking 800
> instead. 600 LUTs wouldn't be the end of the world, except I'm planning
> to replicate this block 32 times, which puts me well over the top.
>
> So the question becomes where are all of the LUTs going? There's
> nothing in the XST status report for the section that would imply
> anywhere near this much utilization. I've tried looking over the RTL
> schematic; it's difficult to read and from what I could make out, there
> still wasn't anything to explain all those LUTs. Then I tried looking
> through the technology schematic instead. The viewer took forever to
> open the schematic, and when I finally got it open it took better than a
> minute any time I wanted to refresh the screen. Needless to say, this
> got me nowhere.
>
> So, I'm out for advice. Any suggestions on figuring out just where all
> of those LUTs are going?
>
> Thanks,
> Rob
>
> --
> Rob Gaddi, Highland Technology
> Email address is currently out of order

A good technology view will make the world of difference. But it seems Xilinx isn't giving you that.

I used the Synplify synthesizer's HDL Analyst to get a superb technology view that allowed me to understand the occasional oddity the synthesizer would produce from my code. I found that technology viewer to be a truly top-notch product and sincerely helpful in keeping a design on track.

I've only glanced at the Xilinx technology viewer, seeing that it looked like a last-gen VW beetle compared to a modern day Lexus in the HDL Analyst. It may do the job but it won't be a comfortable job if it gets too involved.

From: Symon on 25 May 2010 18:55

On 5/25/2010 10:32 PM, Rob Gaddi wrote:
>
> So the question becomes where are all of the LUTs going?
>
> Thanks,
> Rob
>

Does ISE11.5 have FPGA editor?

Syms.

From: Rob Gaddi on 25 May 2010 19:39

On 5/25/2010 2:54 PM, glen herrmannsfeldt wrote:
> Rob Gaddi<rgaddi(a)technologyhighland.com> wrote:
>> I've got a Spartan 6 design that I'm working with under ISE 11.5. A
>> code block that I would expect to take up about 200 LUTs is taking 800
>> instead. 600 LUTs wouldn't be the end of the world, except I'm planning
>> to replicate this block 32 times, which puts me well over the top.
>
> How full is the FPGA that you are targeting? If not so full, I
> believe that the tools don't try so hard. Well, actually the LUT
> count shouldn't be so far off, but the CLB count can change, as
> it doesn't fill each CLB.
>
> Otherwise, without knowing about the design it is hard to say.
>
> Can you say a little about the logic? How many counters, adders, RAMs.
>
> Maybe it is using CLB for RAM, instead of BRAM?
>
> -- glen

Sure. The widget in question does 8 pole IIR filtering of 16 bit data using 48-bit internal data paths. The actual add/multiply/add math is taken care of by a subblock that uses a DSP48 slice and 222 LUTs that I'm not counting towards the 800. The block I'm looking at is the wrapper that sequences the math operations and holds the internal states.

The logic infers two 48 bit LUT RAMs, one dual port, and one quad port. There's a 24-bit LUT RAM and a 24 bit adder that I use to implement an FIR prefilter (the 8 zeros at z=-1 that you get from the bilinear transform of an 8 pole filter). There's an FSM with four states, and a couple of 3 bit counters. There are two 18 bit comparators, but most of the LSBs of them should optimize out.

I'll append the code here. I'm not bothering to include pkg_bus as well, but it just defines a simple WISHBONE bus and a few constants.

--
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;
use IEEE.STD_LOGIC_MISC.all;
use work.pkg_bus.all;

-- Xilinx specific macro library
-- library UNISIM;
-- use UNISIM.VComponents.all;

entity filter is
    port (
        -- Data path
        din     : in  signed(15 downto 0);
        nd      : in  boolean;
        dout    : out signed(15 downto 0);
        drdy    : out boolean;

        -- Coefficient path
        WB_IN   : in  t_wb_mosi;
        WB_OUT  : out t_wb_miso;
        WB_SYS  : in  t_wb_sys
    );
end entity filter;

architecture Behavioral of filter is

    alias clk : std_logic is WB_SYS.CLK_I;
    alias rst : std_logic is WB_SYS.RST_I;

    -- Component declaration of the "filter_math" unit defined in
    -- file: "./src/vhdl/filter_math.vhd"
    component filter_math
        port(
            data    : in  SIGNED(47 downto 0);
            pre     : in  SIGNED(47 downto 0);
            post    : in  SIGNED(47 downto 0);
            k       : in  SIGNED(47 downto 0);
            lsd_nd  : in  BOOLEAN;
            ichg    : out BOOLEAN;
            irdy    : out BOOLEAN;
            y       : out SIGNED(47 downto 0);
            lsd_rdy : out BOOLEAN;
            msd_rdy : out BOOLEAN;
            clk     : in  STD_LOGIC);
    end component;
    for all: filter_math use entity work.filter_math(Xilinx_DSP48A1);

    -- We're going to use a whole mess o' RAMs to store various
    -- and sundry.
    subtype t_data is signed(47 downto 0);
    constant POLES   : integer := 8;
    constant MAX_IDX : integer := POLES-1;

    -- Data memory is S3.45.
    subtype t_idx is integer range 0 to MAX_IDX;
    type t_ram is array(t_idx) of t_data;
    signal ram_dat : t_ram := (others => (others => '0'));

    subtype t_uns_idx is unsigned(2 downto 0);
    signal write_idx : t_uns_idx;
    signal read_idx  : t_uns_idx;

    -- Coefficient memory is also S3.45, but since we're
    -- writing it from a 16 bit data bus, we need to be
    -- able to access it a word at a time.
    --
    type t_coefram is array(t_idx) of t_wb_data;
    signal ram_k_hi : t_coefram := (others => (others => '0'));
    signal ram_k_md : t_coefram := (others => (others => '0'));
    signal ram_k_lo : t_coefram := (others => (others => '0'));

    -- As seen from the memory bus, the coefficients are
    -- 64 bits long. The uppermost word of this is shared
    -- between all coefficients, and is the filter control
    -- word.
    signal fcw : t_wb_data;

    -- Bits 2:0 are POLES_USED, which should be an odd number
    -- equal to the number of poles for this filter - 1. Any
    -- even number here, including zero, will code for no filter.
    alias poles_used : std_logic_vector is fcw(2 downto 0);

    -- Hook the data up to the math core
    signal data    : t_data;
    signal pre     : t_data;
    signal post    : t_data;
    signal k       : t_data;
    signal y       : t_data;

    signal go      : boolean;
    signal lsd_nd  : boolean;
    signal ichg    : boolean;
    signal irdy    : boolean;
    signal lsd_rdy : boolean;
    signal msd_rdy : boolean;

    -- Downstream of the math core we'll apply a cascade of 2 pole
    -- boxcar filters in order to put some zeros. One bit growth per
    -- stage brings us to S1.23 when we're done.
    subtype t_fir_data is signed(din'length + POLES - 1 downto 0);
    type t_firram is array(t_idx) of t_fir_data;
    signal fir_cascade : t_firram := (others => (others => '0'));
    signal fir_idx : t_uns_idx;
    signal fir_din : t_fir_data;

    -- Internal states of things
    signal fir_drdy     : boolean;
    signal use_fir_data : boolean;

    type t_state is (IDLE, FIR, IIR, RESET);
    signal state : t_state := RESET;

    -- LFSR noise generator. When we first extend the 16 bit data to 24
    -- bits for the FIR filter, adding this noise in below the LSB helps
    -- make sure the IIR filters don't get into long, drawn out settlings.
    signal lfsr : std_logic_vector(22 downto 1) := (others => '0');

begin

    -------------------------------------------------------------------------
    -- Make sure our constants are compiled correctly.
    -------------------------------------------------------------------------
    assert (2**write_idx'length = POLES)
        report "Length of RAM index does not correspond to number of poles."
        severity failure;

    -------------------------------------------------------------------------
    -- Connect up the asynchronous data paths.
    -------------------------------------------------------------------------

    -- FIR data is in S1.23, the math core is expecting S3.45
    data <= SHIFT_LEFT(RESIZE(fir_din, data'length), 45-23) when use_fir_data else y;
    lsd_nd <= fir_drdy when use_fir_data else lsd_rdy;

    -- Everything else comes out of the RAMs. ram_k has one r/w port and one
    -- read port, ram_dat has one write port and two read ports.
    --
    pre  <= ram_dat(TO_INTEGER(read_idx or "001"));
    post <= ram_dat(TO_INTEGER(read_idx));
    k    <= SIGNED(ram_k_hi(TO_INTEGER(read_idx))) &
            SIGNED(ram_k_md(TO_INTEGER(read_idx))) &
            SIGNED(ram_k_lo(TO_INTEGER(read_idx)));

    -- Instantiate our math core.
    MATH : filter_math
        port map(
            data    => data,
            pre     => pre,
            post    => post,
            k       => k,
            lsd_nd  => lsd_nd,
            ichg    => ichg,
            irdy    => irdy,
            y       => y,
            lsd_rdy => lsd_rdy,
            msd_rdy => msd_rdy,
            clk     => clk
        );

    -------------------------------------------------------------------------
    -- WISHBONE coefficient readback.
    -------------------------------------------------------------------------
    WB_READBACK: process(WB_IN, fcw, ram_k_hi, ram_k_md, ram_k_lo)
        variable read_addr : integer range 0 to MAX_IDX;
        variable word_addr : integer range 0 to 3;
    begin
        read_addr := TO_INTEGER(WB_IN.ADDR(1 + read_idx'length downto 2));
        word_addr := TO_INTEGER(WB_IN.ADDR(1 downto 0));
        WB_OUT <= WB_BADA_SLAVE;
        if read_addr <= MAX_IDX then
            case word_addr is
                when 0 => WB_OUT.DAT <= fcw;
                when 1 => WB_OUT.DAT <= ram_k_hi(read_addr);
                when 2 => WB_OUT.DAT <= ram_k_md(read_addr);
                when 3 => WB_OUT.DAT <= ram_k_lo(read_addr);
            end case;
        end if;
    end process WB_READBACK;

    -------------------------------------------------------------------------
    -- Wrangle the big state machine.
    -------------------------------------------------------------------------
    MACHINE: process
        variable write_addr : integer range 0 to 31;
        variable word_addr  : integer range 0 to 3;
        variable current    : t_data;
        variable unclamped  : signed(17 downto 0); -- S3.15 number
    begin
        wait until rising_edge(clk);
        drdy <= false;
        fir_drdy <= false;

        if nd then
            assert (state = IDLE)
                report "New data request before IDLE state."
                severity error;
        end if;

        case state is
            when IDLE =>
                -- Hold things in the start state.
                use_fir_data <= true;
                read_idx <= (others => '0');
                write_idx <= (others => '0');
                if nd then
                    if (poles_used(0) = '0') then
                        -- Allow for no filter at all
                        dout <= din;
                        drdy <= true;
                        state <= IDLE;
                    else
                        -- Start our FIR filter with din at the MSBs.
                        state <= FIR;
                        fir_idx <= UNSIGNED(poles_used);
                        fir_din <= SHIFT_LEFT(
                            RESIZE(din & lfsr(lfsr'high), fir_din'length),
                            fir_din'length - din'length - 1
                        );
                    end if;
                else
                    state <= IDLE;
                end if;

            when FIR =>
                -- Store the value, push the average forward.
                fir_cascade(TO_INTEGER(fir_idx)) <= fir_din;
                fir_din <= SHIFT_RIGHT(fir_din, 1) +
                           SHIFT_RIGHT(fir_cascade(TO_INTEGER(fir_idx)), 1);
                if (fir_idx = 0) then
                    -- Start the IIR filter. Repurpose the FIR index to count
                    -- down the number of poles to do.
                    state <= IIR;
                    fir_drdy <= true;
                    fir_idx <= UNSIGNED(poles_used);
                else
                    fir_idx <= fir_idx - 1;
                end if;

            when IIR =>
                -- The main responsibilities are updating
                -- the pointers and updating the stored data.
                if msd_rdy then
                    -- Update the stored data and advance the
                    -- write pointer. Also decrement the FIR index, which
                    -- we're just using to count IIR stages at this point.
                    ram_dat(TO_INTEGER(write_idx)) <= y;
                    write_idx <= write_idx + 1;
                    fir_idx <= fir_idx - 1;
                    if (fir_idx = 0) then
                        state <= IDLE;
                        write_idx <= (others => '0');
                        -- We've treated the data as S3.45 all the
                        -- way through. First, remap it to S3.15
                        unclamped := RESIZE(SHIFT_RIGHT(y, 45-15), 18);
                        -- Now clamp any excess.
                        if TO_INTEGER(unclamped) >= 2**15 then
                            dout <= x"7FFF";
                        elsif TO_INTEGER(unclamped) <= -(2**15) then
                            dout <= x"8001";
                        else
                            dout <= RESIZE(unclamped, 16);
                        end if;
                        drdy <= true;
                    end if;
                elsif ichg and not lsd_nd then
                    -- We can advance the read index ahead of
                    -- time.
                    read_idx <= write_idx + 1;
                    if (fir_idx = 0) then
                        use_fir_data <= true;
                    else
                        use_fir_data <= false;
                    end if;
                end if;

            when RESET =>
                -- Initialize the states for both filters
                ram_dat(TO_INTEGER(write_idx)) <= (others => '0');
                fir_cascade(TO_INTEGER(fir_idx)) <= (others => '0');
                if (fir_idx = 0) then
                    write_idx <= (others => '0');
                    state <= IDLE;
                else
                    write_idx <= write_idx + 1;
                    fir_idx <= fir_idx - 1;
                end if;
        end case;

        -- Allow bus writes to the coefficient RAM
        if is_write(WB_IN) then
            write_addr := TO_INTEGER(WB_IN.ADDR(6 downto 2));
            word_addr := TO_INTEGER(WB_IN.ADDR(1 downto 0));
            if write_addr <= MAX_IDX then
                case word_addr is
                    when 0 => fcw <= WB_IN.DAT;
                    when 1 => ram_k_hi(write_addr) <= WB_IN.DAT;
                    when 2 => ram_k_md(write_addr) <= WB_IN.DAT;
                    when 3 => ram_k_lo(write_addr) <= WB_IN.DAT;
                end case;
            end if;
        end if;

        -- Advance the LFSR
        lfsr <= lfsr(21 downto 1) & (lfsr(22) xnor lfsr(21));

        -- Handle the reset.
        if (rst = '1') then
            write_idx <= (others => '0');
            read_idx <= (others => '0');
            fir_idx <= (others => '1');
            fcw <= (others => '0');
            use_fir_data <= true;
            state <= RESET;
        end if;
    end process;

end architecture Behavioral;

-- 
Rob Gaddi, Highland Technology
Email address is currently out of order
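
Since the post attributes most of the storage to inferred LUT RAMs, one possible way to narrow down where the LUTs go is to synthesize each memory on its own and compare its reported LUT count against the full block. The following is only a minimal sketch of that idea for ram_dat, assuming the same 8 x 48 organisation and the same port pattern (one clocked write port, two asynchronous read ports) described in the post. The entity name ram_dat_only and the we port are hypothetical and do not appear in the original code.

```vhdl
-- Hypothetical stand-alone test entity, not part of the posted design.
-- It reproduces only the access pattern of ram_dat so that the
-- synthesis report for this one memory can be read in isolation.
library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;

entity ram_dat_only is
    port (
        clk       : in  std_logic;
        we        : in  boolean;                 -- assumed write enable
        write_idx : in  unsigned(2 downto 0);
        read_idx  : in  unsigned(2 downto 0);
        y         : in  signed(47 downto 0);
        pre       : out signed(47 downto 0);
        post      : out signed(47 downto 0)
    );
end entity ram_dat_only;

architecture rtl of ram_dat_only is
    subtype t_data is signed(47 downto 0);
    type t_ram is array(0 to 7) of t_data;
    signal ram_dat : t_ram := (others => (others => '0'));
begin
    -- Single clocked write port.
    WRITE_PORT: process(clk)
    begin
        if rising_edge(clk) then
            if we then
                ram_dat(TO_INTEGER(write_idx)) <= y;
            end if;
        end if;
    end process WRITE_PORT;

    -- Two asynchronous read ports, addressed as in the posted code.
    pre  <= ram_dat(TO_INTEGER(read_idx or "001"));
    post <= ram_dat(TO_INTEGER(read_idx));
end architecture rtl;
```

The same treatment could be applied to the coefficient RAMs together with their asynchronous readback process, and to the FIR cascade. If the isolated pieces add up to roughly the expected total, the excess is more likely to come from how they interact in the full state machine than from any single memory.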