From: Tommy Thorn on 18 Aug 2006 14:07

Martin Schoeberl wrote:
> JOP at 100MHz on the Altera DE2 using the 16-bit SRAM:
>
> Avalon: 11,322
> SimpCon: 14,760
>
> So for the SRAM interface SimpCon is a clear winner ;-)
> The 16-bit SRAM SimpCon solution is even faster than
> the 32-bit SRAM Avalon solution.

I'm not sure what your point is. It's hardly surprising that a JOP works better with the interface it was codesigned with, rather than some other one crafted on top. It says nothing about the relative merits of Avalon and SimpCon. I could code up a counterexample quite easily.

Altera has an app note, "Using Nios II Tightly Coupled Memory Tutorial" (http://altera.com/literature/tt/tt_nios2_tightly_coupled_memory_tutorial.pdf), but as far as I understand you, this is already how you use the memory.

I noticed you didn't reply to how SimpCon doesn't scale. Does your silence mean that you see it now? :-)

Tommy
From: Martin Schoeberl on 22 Aug 2006 05:48

> Martin Schoeberl wrote:
>> JOP at 100MHz on the Altera DE2 using the 16-bit SRAM:
>>
>> Avalon: 11,322
>> SimpCon: 14,760
>>
>> So for the SRAM interface SimpCon is a clear winner ;-)
>> The 16-bit SRAM SimpCon solution is even faster than
>> the 32-bit SRAM Avalon solution.
>
> I'm not sure what your point is. It's hardly surprising that a JOP
> works better with the interface it was codesigned with, rather than
> some other one crafted on top. It says nothing about the relative
> merits of Avalon and SimpCon. I could code up a counterexample quite
> easily.

You're right from your point of view. I have only JOP to compare SimpCon and Avalon, and JOP takes advantage of the early acknowledge of SimpCon. However, it's still simpler with SimpCon to implement an SRAM interface with the input and output registers in the IO cells of the FPGA without adding a cycle of latency.

A small defense of the JOP/SimpCon version: SimpCon was added very late to JOP. Up to that time JOP used its own proprietary memory interface that was not shared with the IO subsystem. The IO devices also used a proprietary interface. Then I changed JOP to use Wishbone for memory and IO, but had to add a non-Wishbone-compliant early ack signal to get the performance I wanted. This resulted in the definition of SimpCon and another change to JOP's memory/IO system.

It would be interesting to take another CPU (not NIOS or JOP), implement both an Avalon and a SimpCon SRAM interface, and compare the performance. However, who has time to do this...

> Altera has an app note, "Using Nios II Tightly Coupled Memory
> Tutorial"
> (http://altera.com/literature/tt/tt_nios2_tightly_coupled_memory_tutorial.pdf),
> but as far as I understand you, this is already how you use the memory.

Very interesting, thanks for the link. No, this is not the way I used the on-chip memory with JOP - this looks NIOS specific. And it is stated there: 'The term tightly coupled memory interface refers to an *Avalon-like* interface...' That's interesting, as it is an indication that there are issues for low-latency connections with Avalon ;-)

> I noticed you didn't reply to how SimpCon doesn't scale. Does your
> silence mean that you see it now? :-)

It means I have not thought enough about it ;-)

Martin
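The early-acknowledge advantage Martin describes can be put in rough numbers with a behavioral sketch. This is a Python cycle-count model, not HDL; the function names, the `notice` parameter, and the per-read costs are illustrative assumptions of mine, not measured JOP figures.

```python
def cycles_without_early_ack(n_reads, latency):
    """Master stalls until the ack arrives together with the data,
    then spends one cycle consuming it before the next request."""
    return n_reads * (latency + 1)

def cycles_with_early_ack(n_reads, latency, notice=1):
    """With an early acknowledge (SimpCon's rdy_cnt counting down),
    the master knows 'notice' cycles in advance that data is coming,
    so the next request overlaps that much of the memory latency."""
    hidden = min(notice, latency)
    return (latency + 1) + (n_reads - 1) * (latency + 1 - hidden)
```

For 100 back-to-back reads from a 2-cycle memory this model gives 300 cycles without the early ack and 201 with one cycle of notice; the absolute numbers are made up, but the shape of the saving is the point of the thread.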
From: Martin Schoeberl on 23 Aug 2006 13:25

> Martin Schoeberl wrote:
>>>> Another point is, in my opinion, the wrong role of who has to hold
>>>> data for more than one cycle. This is true for several busses (e.g.
>>>> also Wishbone). For these busses the master has to hold address and
>>>> write data till the slave is ready. This is a result of backplane-bus
>>>> thinking. In an SoC the slave can easily register those signals when
>>>> they are needed longer, and the master can continue.
>>> What happens when you issue another request to a slave which hasn't
>>> finished processing the first? Any queue will be finite and
>>> eventually you'd have to deal with stalling anyway. An issue is that
>>> there are generally many more slaves than masters, so it makes sense
>>> to move the complication to the master.
>>
>> I disagree ;-)
>> How hard is it for a slave to hold the read data for more than one
>> cycle? Until the next read data is requested and available? That comes
>> almost for free. It's a single register, trivial logic. OK, it's a
>> little overhead for an on-chip peripheral. However, you usually need a
>> MUX in the peripheral to select the IO registers (now using 'register'
>> with a different meaning). Making this MUX registered is almost for
>> free.
>
> Focusing on the overhead for one slave supporting one outstanding
> command is missing the point.

However, holding the data in the slave until it is overwritten by data from a new request is still worth doing. It simplifies a single master, and probably also the interconnect logic for multiple masters.

> Non-trivial slaves can support multiple simultaneous outstanding
> requests (say N), so they would need at least a queue N deep. Not a
> problem. Now, I have multiple slaves and multiple masters on the
> interconnect. Each master must be able to have at least M outstanding
> requests. Any one slave can only accept one request per cycle, so the
> interconnect (the arbitration) needs to buffer the requests in lots of
> FIFOs and _they_ add significant latency, logic, and complication
> (pick two).

If you want them to be completely independent, you also need reordering of results (or some kind of transaction ID) in your interconnect. For me that's a completely different game. I think that's more a Network-on-Chip (NoC) topic. NoC is a big buzzword these days ;-)

> I'll need to study SimpCon more to understand what you mean by its
> support for multiple outstanding requests. Just to clarify, I'm talking
> about completely independent requests, not bursts. Different masters
> may issue multiple of these (up to some limit) while previously issued
> requests are still not complete. I do insist the requests complete in
> the order they were issued, mostly to simplify things (such as the
> arbitration). Really just a subset of Avalon.

You can issue completely independent requests with the plain SimpCon specification only to some extent. Only when a former request is 'yet to arrive' can you issue a new request to SimpCon (or the switch logic). That's a restriction. We could add an accept signal to allow the master to issue more requests. However, issuing multiple requests to different slaves and then delivering them in order is a pain for the switch logic. You have to remember your request order and handle results arriving in a different order. However, for this issue a slave that holds the data till it is used can simplify the switching a little bit...

Perhaps I should state how I see SimpCon: a *simple* SoC interconnect that allows for lower latency and pipelining to some extent. The main application I have in mind is a single master (CPU) with multiple slaves (memory and IO). The interconnect/address decoding should be simple - and it is - see an example at:

http://www.opencores.org/cvsweb.cgi/~checkout~/jop/vhdl/scio/scio_min.vhd

Besides the component declarations and IO signal routing, the interconnect is just 18 lines of VHDL. The read MUX is driven by a registered select, which helps the critical path when you have plenty of slaves.

Martin
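Martin's two points here - a slave that simply holds its read data until a later request overwrites it, and a read MUX driven by a registered select - can be sketched behaviorally. This is a toy Python model of the convention, not the SimpCon specification or the VHDL in scio_min.vhd; class names, signal names, and latencies are illustrative.

```python
class HoldingSlave:
    """Slave that registers its read data and then holds it stable
    until a later request overwrites it, so neither the master nor
    the interconnect ever has to latch the result."""
    def __init__(self, mem, latency=1):
        self.mem = mem
        self.latency = latency
        self.rd_data = 0       # held stable until overwritten
        self.rdy_cnt = 0       # cycles until rd_data is valid
        self._pending = None

    def read(self, addr):
        self._pending = addr
        self.rdy_cnt = self.latency

    def clock(self):
        if self.rdy_cnt > 0:
            self.rdy_cnt -= 1
            if self.rdy_cnt == 0:
                self.rd_data = self.mem[self._pending]
                self._pending = None

class SimpleInterconnect:
    """Single-master address decoding: the read-MUX select is
    registered when the command is issued, leaving only a short
    MUX delay in the data return path."""
    def __init__(self, slaves):
        self.slaves = slaves
        self.sel = 0

    def read(self, idx, addr):
        self.sel = idx         # registered select
        self.slaves[idx].read(addr)

    def clock(self):
        for s in self.slaves:
            s.clock()

    @property
    def rd_data(self):
        return self.slaves[self.sel].rd_data
```

Because each slave holds its own result, the interconnect is just the registered select plus a MUX - which is the reason the real VHDL version can stay so small.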
From: KJ on 23 Aug 2006 14:38

I think it all comes down to me maybe not totally getting what you're saying in the following paragraph, so I'll go slowly.

> My pipeline approach is just this little funny busy counter
> instead of a single ack, and that a slave has to declare its
> pipeline level (0 to 3). Level 1 is almost always possible.
> It's more or less for free in a slave. Level 1 means that
> the master can issue the next read/write command in the same
> cycle when the data is available (rdy_cnt=0). Level 2 means
> issue the next command one cycle earlier (rdy_cnt=1). Still
> not a big issue for a slave (especially for a memory slave,
> where you need a little state machine anyway).

I'm assuming that the master-side address and command signals enter the SimpCon bus and the Avalon bus on the same clock cycle. Maybe this assumption is where my hang-up is, and maybe JOP on SimpCon is getting a 'head start' over JOP on Avalon.

Given that assumption though, it's not clear to me why the address and command could not be designed to also end up at the actual memory device on the same clock cycle. Again, maybe this is where my hang-up is.

Given that address and command end up at the memory device on the same clock cycle whether SimpCon or Avalon, the resulting read data would then be valid and returned to the SimpCon/Avalon memory interface logic on the same clock cycle. Pretty sure this is correct, since this is just saying that the external memory performance is the same, which it should be since it does not depend on SimpCon or Avalon.

Given all of that, it's not clear to me why the actual returned data would show up on the SimpCon bus ahead of Avalon, or how it would be any slower getting back to the SimpCon or Avalon master. Again, this might be where my hang-up is, but if my assumptions have been correct up to this paragraph, then I think the real issue is not here but in the next paragraph.

If I got through this far, then it comes down to... You say "Level 1 means that the master can issue the next read/write command in the same cycle when the data is available (rdy_cnt=0). Level 2 means issue the next command one cycle earlier (rdy_cnt=1)." and presumably the 'rdy_cnt=1' is the reason for the better SimpCon numbers.

Where I'm pretty sure I'm hung up, then, is: why can't the Avalon slave drop the wait request output on the clock cycle that corresponds to rdy_cnt=1 (i.e. one before data is available at the master)? rdy_cnt=1 sounds like it is allowing JOP on SimpCon to start up the next transaction (read/write or twiddle thumbs) one clock cycle before the read data is actually available. But how is that different from the Avalon slave dropping wait request one clock cycle before the data is available and then asserting read data valid once the data actually is available? All of this on the assumption that the Avalon master and slaves both support readdatavalid, of course.

> Enjoy this discussion :-)
> Martin

Immensely. And I think I'll finally get the light bulb turned on in my head after your reply.

Kevin
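Kevin's question about what rdy_cnt actually buys can be framed with a small cycle-count model. Assuming back-to-back reads from a slave with a fixed read latency, and that a level-L slave lets the master issue the next command while rdy_cnt is still as high as L-1, each extra level hides one more cycle of latency per read. This is my illustrative reading of Martin's paragraph, not the SimpCon specification.

```python
def total_cycles(n_reads, latency, level):
    """Cycles for n back-to-back reads from a slave with the given
    read latency, when the master may issue the next command as soon
    as rdy_cnt <= level - 1 (level 1: issue when data is available;
    level 2: one cycle earlier; and so on)."""
    overlap = min(level - 1, latency)   # latency cycles hidden per read
    per_read = latency + 1 - overlap    # steady-state cost of one read
    return (latency + 1) + (n_reads - 1) * per_read
```

With a 2-cycle memory and 4 reads this gives 12 cycles at level 1, 9 at level 2, and 6 at level 3 - so the question reduces to whether an Avalon slave dropping waitrequest early can legitimately match level 2.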
From: Tommy Thorn on 23 Aug 2006 15:02
A quick answer for this one:

> rdy_cnt=1 sounds like it is allowing JOP on SimpCon to start up the
> next transaction (read/write or twiddle thumbs) one clock cycle before
> the read data is actually available. But how is that different from
> the Avalon slave dropping wait request one clock cycle before the data
> is available and then asserting read data valid once the data actually
> is available?

The signal waitrequest has nothing to do with the output; it is a property of the input. What you're suggesting is an "abuse" of Avalon and would only work for slaves that support only one outstanding transfer with a latency of exactly one. Clearly incompatible with existing Avalon components.

I'll have a longer reply for Martin later :-)

Tommy
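Tommy's distinction - waitrequest back-pressures the command (input) side while readdatavalid qualifies the data (output) side - can be illustrated with a toy latent-read slave. This is a behavioral sketch under simplifying assumptions of mine (fixed latency, in-order returns, an arbitrary queue depth), not Altera's Avalon specification.

```python
from collections import deque

class LatentReadSlave:
    """Toy model of a pipelined read slave with independent input-
    and output-side signaling."""
    def __init__(self, mem, latency=2, max_outstanding=4):
        self.mem = mem
        self.latency = latency
        self.max_outstanding = max_outstanding
        self.pipe = deque()        # in-flight reads: [cycles_left, addr]
        self.readdata = None
        self.readdatavalid = False

    @property
    def waitrequest(self):
        # Input-side property: asserted only when no further command
        # can be accepted -- it says nothing about pending read data.
        return len(self.pipe) >= self.max_outstanding

    def read(self, addr):
        assert not self.waitrequest, "master must honor waitrequest"
        self.pipe.append([self.latency, addr])

    def clock(self):
        self.readdatavalid = False
        for entry in self.pipe:
            entry[0] -= 1
        if self.pipe and self.pipe[0][0] <= 0:
            _, addr = self.pipe.popleft()
            self.readdata = self.mem[addr]
            self.readdatavalid = True   # output-side qualifier
```

The master can keep issuing reads while waitrequest is low, long before any readdatavalid arrives; using waitrequest as an early hint that data is coming only works for a single outstanding transfer of latency one, which is exactly the "abuse" Tommy objects to.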