From: Martin Schoeberl on 14 Aug 2006 12:56

>>>> You almost never want to have a fixed number of wait states but want to
>>>> simply have the Avalon slave provide a wait request output and tell
>>>> Avalon that by specifying that in the PTF file.
>>>
>>> Completely agree. When not writing and reading too many posts
>>> I'm working on that version of the SRAM interface. It was just
>>> a quick start as shown in the Quartus manual.
>>
>> BTW (to KJ): Do you have this type of Avalon slave
>> for an SRAM? Would save some time and errors for me ;-)
>>
> No, over the past several years my use of async SRAMs has gone to 0 even
> though I used to use them quite heavily. They've been replaced by internal
>
> I'm assuming that you've checked and that Altera didn't toss one in as a
> MegaCore? Too bad. Their core for the (older) NIOS boards just uses the
> tri-state bridge .PTF approach - no VHDL
>
> Oh well, I'll stop posting and let you get back to work.
>
> KJ

Ok, I now have two versions of the SRAM interface: the plain PTF
version and the VHDL version (with I/O registers at the FPGA pins
to access a 15ns SRAM in two cycles at 100 MHz, and the nwr with
the neg. clock to save one cycle on write).

Here are some performance numbers of this JOP/SRAM
interface on an embedded benchmark. It measures iterations/s
and therefore higher numbers are better. All versions are clocked
at 100 MHz, 4 KB instruction cache and 512 Byte stack cache.
FPGA is Cyclone EP1C6-6, Memory is 32-bit SRAM 15ns. The only
difference is the memory interface.

SimpCon:               16,633
Avalon (PTF version):  14,015
Avalon (VHDL version): 13,920

So for me, the additional latency cycle(s) and not having the
early ack information for the CPU pipeline degrade JOP's
performance. Perhaps some Avalon specialist can do better.

However, it is compensated by the many peripherals that
are now just a mouse click away ;-)

Martin

From: Martin Schoeberl on 14 Aug 2006 13:21

> Here are some performance numbers of this JOP/SRAM
> interface on an embedded benchmark. It measures iterations/s
> and therefore higher numbers are better. All versions are clocked
> at 100 MHz, 4 KB instruction cache and 512 Byte stack cache.
> FPGA is Cyclone EP1C6-6, Memory is 32-bit SRAM 15ns. The only
> difference is the memory interface.
>
> SimpCon: 16,633
> Avalon (PTF version): 14,015
> Avalon (VHDL version): 13,920
>
some additional numbers from the Altera DE2 board with
Cyclone II at 100 MHz with SDRAM and using on-chip
memory (the EP2C35 is big enough to run the benchmark
in on-chip memory).

Avalon SDRAM:          7,288
Avalon on-chip memory: 15,769

The performance issue with the SDRAM is clear. Just needs
some more caching to get a big (8 MB) memory with
acceptable performance ;-)

However, even the fast on-chip memory Avalon solution is
slightly slower than the two cycle SRAM connected via
SimpCon.

Martin

From: KJ on 14 Aug 2006 14:01

"Martin Schoeberl" <mschoebe(a)mail.tuwien.ac.at> wrote in message news:44e0b13c$0$11352
> However, even the fast on-chip memory Avalon solution is
> slightly slower than the two cycle SRAM connected via
> SimpCon.
>
Well now THAT is incredibly surprising since the on-chip memory should be
giving you 0 wait state, 0 latency performance (i.e. WaitRequest should
always be low when accessing memory). That would seem to point to either
some issue that comes up every now and then in your 'CPU to Avalon' bridge
master interface logic or something equally odd inside the Avalon fabric
itself connecting the CPU to the memory.

I'd be interested to hear what you find.

KJ

From: Martin Schoeberl on 14 Aug 2006 19:16

> Well now THAT is incredibly surprising since the on-chip memory should be
> giving you 0 wait state, 0 latency performance (i.e. WaitRequest should

That's not right anymore. You have at minimum one cycle latency
as addresses are registered in current on-chip RAMs. Probably
also the output is registered. However, I don't know - would have
to look into the VHDL code.

> always be low when accessing memory). That would seem to point to either

waitrequest always low would only be possible with pipelining
using datavalid. That helps on cache fill, but not on an ordinary
read.

Perhaps I should try to connect the on-chip RAM to my 'native'
SimpCon interface and check the performance. That should be
better than the 2 cycle SRAM. However, this is a more theoretical
experiment as Java programs usually will not fit into on-chip
RAMs ;-) C programs with NIOS are more code efficient.

> some issue that comes up every now and then in your 'CPU to Avalon' bridge
> master interface logic or something equally odd inside the Avalon fabric
> itself connecting the CPU to the memory.

One issue is that my CPU takes advantage of this 'counting down
ready' signal (the bsy_cnt in SimpCon). I can't do this with the
Avalon spec. Therefore, there is a performance penalty - inherent
due to the design.

> I'd be interested to hear what you find.

The CPU/Avalon bridge is probably sub-optimal. Will try to check this
out (first I have to get the Altera ModelSim version running - would
make it easier - still haven't compiled the missing SOPC libraries for
ModelSim).

Martin

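The argument about the 'counting down ready' signal can be illustrated with a toy cycle count. This is only a sketch under simplifying assumptions, not JOP's, SimpCon's or Avalon's actual timing: the function names, the `restart_cycles` parameter (cycles the CPU pipeline needs to resume after a read) and the `lookahead` parameter (how many cycles in advance the slave announces completion) are all made up for illustration.

```python
def cycles_wait_handshake(mem_latency: int, restart_cycles: int) -> int:
    """Master learns that data is valid only in the cycle it arrives,
    so the pipeline restart begins after the full memory latency."""
    return mem_latency + restart_cycles


def cycles_ready_countdown(mem_latency: int, restart_cycles: int,
                           lookahead: int) -> int:
    """Slave announces the remaining cycles up to `lookahead` cycles in
    advance, so that part of the restart can overlap the access."""
    overlap = min(restart_cycles, lookahead, mem_latency)
    return mem_latency + restart_cycles - overlap


if __name__ == "__main__":
    # Hypothetical parameters: one restart cycle, one cycle of early notice.
    for latency in (1, 2, 3):
        plain = cycles_wait_handshake(latency, restart_cycles=1)
        early = cycles_ready_countdown(latency, restart_cycles=1, lookahead=1)
        print(f"latency={latency}: handshake={plain} cycles, "
              f"countdown={early} cycles")
```

In this model the early notice saves at most `restart_cycles` per access, which is roughly the kind of fixed per-read penalty described in the post above.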
From: Martin Schoeberl on 18 Aug 2006 05:39

>> Here are some performance numbers of this JOP/SRAM
>> interface on an embedded benchmark. It measures iterations/s
>> and therefore higher numbers are better. All versions are clocked
>> at 100 MHz, 4 KB instruction cache and 512 Byte stack cache.
>> FPGA is Cyclone EP1C6-6, Memory is 32-bit SRAM 15ns. The only
>> difference is the memory interface.
>>
>> SimpCon: 16,633
>> Avalon (PTF version): 14,015
>> Avalon (VHDL version): 13,920
>>
> some additional numbers from the Altera DE2 board with
> Cyclone II at 100 MHz with SDRAM and using on-chip
> memory (the EP2C35 is big enough to run the benchmark
> in on-chip memory).
>
> Avalon SDRAM: 7,288
> Avalon on-chip memory: 15,769
>
and some more: JOP at 100 MHz on the Altera DE2 using the
16-bit SRAM:

Avalon:  11,322
SimpCon: 14,760

So for the SRAM interface SimpCon is a clear winner ;-)
The 16-bit SRAM SimpCon solution is even faster than
the 32-bit SRAM Avalon solution.

BTW: the embedded benchmark is a control application which
does not need high memory bandwidth. For a different
benchmark (a UDP/IP application with IP processing - a lot of
buffer access) the difference is larger. With the 16-bit SRAM:

Avalon:  4,302
SimpCon: 5,716

again - higher number is better

Martin
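
For reference, the relative gap between the two interfaces can be worked out from the figures posted in this thread. A minimal sketch: the numbers are copied from the posts above, while the dictionary layout, the setup labels and the helper name `speedup_percent` are just illustrative choices.

```python
# Benchmark results quoted in the thread (iterations/s, higher is better).
RESULTS = {
    # Avalon figure here is the PTF version, the faster of the two Avalon variants.
    "32-bit SRAM (EP1C6), control benchmark": {"SimpCon": 16633, "Avalon": 14015},
    "16-bit SRAM (DE2), control benchmark": {"SimpCon": 14760, "Avalon": 11322},
    "16-bit SRAM, UDP/IP benchmark": {"SimpCon": 5716, "Avalon": 4302},
}


def speedup_percent(fast: float, slow: float) -> float:
    """Return how much faster `fast` is than `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


for setup, r in RESULTS.items():
    gain = speedup_percent(r["SimpCon"], r["Avalon"])
    print(f"{setup}: SimpCon is {gain:.1f}% faster than Avalon")
```

The gap grows from the 32-bit to the 16-bit SRAM and again for the buffer-heavy UDP/IP benchmark, which matches the remark that the difference is larger when more memory accesses are involved.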