From: Tommy Thorn on 18 Aug 2006 14:07

Martin Schoeberl wrote:
> JOP at 100MHz on the Altera DE2 using the 16-bit SRAM:
>
> Avalon: 11,322
> SimpCon: 14,760
>
> So for the SRAM interface SimpCon is a clear winner ;-)
> The 16-bit SRAM SimpCon solution is even faster than
> the 32-bit SRAM Avalon solution.

I'm not sure what your point is. It's hardly surprising that a JOP works better with the interface it was codesigned with, rather than some other one crafted on top. It says nothing about the relative merits of Avalon and SimpCon. I could code up a counterexample quite easily.

Altera has an app note, "Using Nios II Tightly Coupled Memory Tutorial" (http://altera.com/literature/tt/tt_nios2_tightly_coupled_memory_tutorial.pdf), but as far as I understand you, this is already how you use the memory.

I noticed you didn't reply to how SimpCon doesn't scale. Does your silence mean that you see it now? :-)

Tommy
From: Martin Schoeberl on 22 Aug 2006 05:48

> Martin Schoeberl wrote:
>> JOP at 100MHz on the Altera DE2 using the 16-bit SRAM:
>>
>> Avalon: 11,322
>> SimpCon: 14,760
>>
>> So for the SRAM interface SimpCon is a clear winner ;-)
>> The 16-bit SRAM SimpCon solution is even faster than
>> the 32-bit SRAM Avalon solution.
>
> I'm not sure what your point is. It's hardly surprising that a JOP
> works better with the interface it was codesigned with, rather than
> some other one crafted on top. It says nothing about the relative
> merits of Avalon and SimpCon. I could code up a counterexample quite
> easily.

You're right from your point of view. I have only JOP to compare SimpCon and Avalon, and JOP takes advantage of the early acknowledge of SimpCon. However, it's still simpler with SimpCon to implement an SRAM interface with the input and output registers in the IO cells of the FPGA without adding a cycle of latency.

A small defense of the JOP/SimpCon version: SimpCon was added very late to JOP. Up to that time JOP used its own proprietary memory interface that was not shared with the IO subsystem. The IO devices also used a proprietary interface. Then I changed JOP to use Wishbone for memory and IO, but had to add a non-Wishbone-compliant early ack signal to get the performance I wanted. This resulted in the definition of SimpCon and another change to JOP's memory/IO system.

It would be interesting to take another CPU (not NIOS or JOP), implement both an Avalon and a SimpCon SRAM interface, and compare the performance. However, who has time to do this...

> Altera has an app note, "Using Nios II Tightly Coupled Memory
> Tutorial"
> (http://altera.com/literature/tt/tt_nios2_tightly_coupled_memory_tutorial.pdf),
> but as far as I understand you, this is already how you use the memory.

Very interesting, thanks for the link. No, this is not the way I used the on-chip memory with JOP - this looks NIOS specific. And it is stated there: 'The term tightly coupled memory interface refers to an *Avalon-like* interface...' That's interesting, as it is an indication that there are issues for low-latency connections with Avalon ;-)

> I noticed you didn't reply to how SimpCon doesn't scale. Does your
> silence mean that you see it now? :-)

It means I have not thought enough about it ;-)

Martin
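The early-acknowledge advantage Martin describes can be put in rough numbers with a behavioral sketch. This is a Python cycle-count model, not HDL; the function names, the `notice` parameter, and the per-read costs are illustrative assumptions of mine, not measured JOP figures.

```python
def cycles_without_early_ack(n_reads, latency):
    """Master stalls until the ack arrives together with the data,
    then spends one cycle consuming it before the next request."""
    return n_reads * (latency + 1)

def cycles_with_early_ack(n_reads, latency, notice=1):
    """With an early acknowledge (SimpCon's rdy_cnt counting down),
    the master knows 'notice' cycles in advance that data is coming,
    so the next request overlaps that much of the memory latency."""
    hidden = min(notice, latency)
    return (latency + 1) + (n_reads - 1) * (latency + 1 - hidden)
```

For 100 back-to-back reads from a 2-cycle memory this model gives 300 cycles without the early ack and 201 with one cycle of notice; the absolute numbers are made up, but the shape of the saving is the point of the thread.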
From: Martin Schoeberl on 23 Aug 2006 13:25

> Martin Schoeberl wrote:
>>>> Another point is, in my opinion, the wrong role of who has to hold
>>>> data for more than one cycle. This is true for several busses (e.g.
>>>> also Wishbone). For these busses the master has to hold address and
>>>> write data till the slave is ready. This is a result of backplane-bus
>>>> thinking. In an SoC the slave can easily register those signals when
>>>> they are needed longer, and the master can continue.
>>> What happens when you issue another request to a slave which hasn't
>>> finished processing the first? Any queue will be finite and
>>> eventually you'd have to deal with stalling anyway. An issue is that
>>> there are generally many more slaves than masters, so it makes sense
>>> to move the complication to the master.
>>
>> I disagree ;-)
>> How hard is it for a slave to hold the read data for more than one
>> cycle? Until the next read data is requested and available? That comes
>> almost for free. It's a single register, trivial logic. OK, it's a
>> little overhead for an on-chip peripheral. However, you usually need a
>> MUX in the peripheral to select the IO registers (now using 'register'
>> with a different meaning). Making this MUX registered is almost for
>> free.
>
> Focusing on the overhead for one slave supporting one outstanding
> command is missing the point.

However, holding the data in the slave until it is overwritten by data from a new request is still worth doing. It simplifies a single master, and probably also the interconnect logic for multiple masters.

> Non-trivial slaves can support multiple simultaneous outstanding
> requests (say N), so they would need at least a queue N deep. Not a
> problem. Now, I have multiple slaves and multiple masters on the
> interconnect. Each master must be able to have at least M outstanding
> requests. Any one slave can only accept one request per cycle, so the
> interconnect (the arbitration) needs to buffer the requests in lots of
> FIFOs and _they_ add significant latency, logic, and complication
> (pick two).

If you want them to be completely independent, you also need reordering of results (or some kind of transaction ID) in your interconnect. For me that's a completely different game. I think that's more a Network-on-Chip (NoC) topic. NoC is a big buzzword these days ;-)

> I'll need to study SimpCon more to understand what you mean by its
> support for multiple outstanding requests. Just to clarify, I'm talking
> about completely independent requests, not bursts. Different masters
> may issue multiple of these (up to some limit) while previously issued
> requests are still not complete. I do insist the requests complete in
> the order they were issued, mostly to simplify things (such as the
> arbitration). Really just a subset of Avalon.

You can issue completely independent requests with the plain SimpCon specification only to some extent. Only when a former request is 'yet to arrive' can you issue a new request to SimpCon (or the switch logic). That's a restriction. We could add an accept signal to allow the master to issue more requests. However, issuing multiple requests to different slaves and then delivering them in order is a pain for the switch logic. You have to remember your request order and handle results arriving in a different order. However, for this issue a slave that holds the data till it is used can simplify the switching a little bit...

Perhaps I should state how I see SimpCon: a *simple* SoC interconnect that allows for lower latency and pipelining to some extent. The main application I have in mind is a single master (CPU) with multiple slaves (memory and IO). The interconnect/address decoding should be simple - and it is - see an example at:

http://www.opencores.org/cvsweb.cgi/~checkout~/jop/vhdl/scio/scio_min.vhd

Besides the component declarations and IO signal routing, the interconnect is just 18 lines of VHDL. The read MUX is driven by a registered select, which helps the critical path when you have plenty of slaves.

Martin
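Martin's two points here - a slave that simply holds its read data until a later request overwrites it, and a read MUX driven by a registered select - can be sketched behaviorally. This is a toy Python model of the convention, not the SimpCon specification or the VHDL in scio_min.vhd; class names, signal names, and latencies are illustrative.

```python
class HoldingSlave:
    """Slave that registers its read data and then holds it stable
    until a later request overwrites it, so neither the master nor
    the interconnect ever has to latch the result."""
    def __init__(self, mem, latency=1):
        self.mem = mem
        self.latency = latency
        self.rd_data = 0       # held stable until overwritten
        self.rdy_cnt = 0       # cycles until rd_data is valid
        self._pending = None

    def read(self, addr):
        self._pending = addr
        self.rdy_cnt = self.latency

    def clock(self):
        if self.rdy_cnt > 0:
            self.rdy_cnt -= 1
            if self.rdy_cnt == 0:
                self.rd_data = self.mem[self._pending]
                self._pending = None

class SimpleInterconnect:
    """Single-master address decoding: the read-MUX select is
    registered when the command is issued, leaving only a short
    MUX delay in the data return path."""
    def __init__(self, slaves):
        self.slaves = slaves
        self.sel = 0

    def read(self, idx, addr):
        self.sel = idx         # registered select
        self.slaves[idx].read(addr)

    def clock(self):
        for s in self.slaves:
            s.clock()

    @property
    def rd_data(self):
        return self.slaves[self.sel].rd_data
```

Because each slave holds its own result, the interconnect is just the registered select plus a MUX - which is the reason the real VHDL version can stay so small.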
From: KJ on 23 Aug 2006 14:38

I think it all comes down to me maybe not totally getting what you're saying in the following paragraph, so I'll go slowly.

> My pipeline approach is just this little funny busy counter
> instead of a single ack, and that a slave has to declare its
> pipeline level (0 to 3). Level 1 is almost always possible.
> It's more or less for free in a slave. Level 1 means that
> the master can issue the next read/write command in the same
> cycle when the data is available (rdy_cnt=0). Level 2 means
> issue the next command one cycle earlier (rdy_cnt=1). Still
> not a big issue for a slave (especially for a memory slave,
> where you need a little state machine anyway).

I'm assuming that the master-side address and command signals enter the SimpCon bus and the Avalon bus on the same clock cycle. Maybe this assumption is where my hang-up is, and maybe JOP on SimpCon is getting a 'head start' over JOP on Avalon.

Given that assumption though, it's not clear to me why the address and command could not be designed to also end up at the actual memory device on the same clock cycle. Again, maybe this is where my hang-up is.

Given that address and command end up at the memory device on the same clock cycle whether SimpCon or Avalon, the resulting read data would then be valid and returned to the SimpCon/Avalon memory interface logic on the same clock cycle. Pretty sure this is correct, since this is just saying that the external memory performance is the same, which it should be since it does not depend on SimpCon or Avalon.

Given all of that, it's not clear to me why the actual returned data would show up on the SimpCon bus ahead of Avalon, or how it would be any slower getting back to the SimpCon or Avalon master. Again, this might be where my hang-up is, but if my assumptions have been correct up to this paragraph, then I think the real issue is not here but in the next paragraph.

If I got through this far, then it comes down to... You say "Level 1 means that the master can issue the next read/write command in the same cycle when the data is available (rdy_cnt=0). Level 2 means issue the next command one cycle earlier (rdy_cnt=1)." and presumably the 'rdy_cnt=1' is the reason for the better SimpCon numbers.

Where I'm pretty sure I'm hung up, then, is: why can't the Avalon slave drop the wait request output on the clock cycle that corresponds to rdy_cnt=1 (i.e. one before data is available at the master)? rdy_cnt=1 sounds like it is allowing JOP on SimpCon to start up the next transaction (read/write or twiddle thumbs) one clock cycle before the read data is actually available. But how is that different from the Avalon slave dropping wait request one clock cycle before the data is available and then asserting read data valid once the data actually is available? All of this on the assumption that the Avalon master and slaves both support readdatavalid, of course.

> Enjoy this discussion :-)
> Martin

Immensely. And I think I'll finally get the light bulb turned on in my head after your reply.

Kevin
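Kevin's question about what rdy_cnt actually buys can be framed with a small cycle-count model. Assuming back-to-back reads from a slave with a fixed read latency, and that a level-L slave lets the master issue the next command while rdy_cnt is still as high as L-1, each extra level hides one more cycle of latency per read. This is my illustrative reading of Martin's paragraph, not the SimpCon specification.

```python
def total_cycles(n_reads, latency, level):
    """Cycles for n back-to-back reads from a slave with the given
    read latency, when the master may issue the next command as soon
    as rdy_cnt <= level - 1 (level 1: issue when data is available;
    level 2: one cycle earlier; and so on)."""
    overlap = min(level - 1, latency)   # latency cycles hidden per read
    per_read = latency + 1 - overlap    # steady-state cost of one read
    return (latency + 1) + (n_reads - 1) * per_read
```

With a 2-cycle memory and 4 reads this gives 12 cycles at level 1, 9 at level 2, and 6 at level 3 - so the question reduces to whether an Avalon slave dropping waitrequest early can legitimately match level 2.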
From: Tommy Thorn on 23 Aug 2006 15:02
A quick answer for this one:

> rdy_cnt=1 sounds like it is allowing JOP on SimpCon to start up the
> next transaction (read/write or twiddle thumbs) one clock cycle before
> the read data is actually available. But how is that different from
> the Avalon slave dropping wait request one clock cycle before the data
> is available and then asserting read data valid once the data actually
> is available?

The signal waitrequest has nothing to do with the output; it is a property of the input. What you're suggesting is an "abuse" of Avalon and would only work for slaves that support only one outstanding transfer with a latency of exactly one. Clearly incompatible with existing Avalon components.

I'll have a longer reply for Martin later :-)

Tommy
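Tommy's distinction - waitrequest back-pressures the command (input) side while readdatavalid qualifies the data (output) side - can be illustrated with a toy latent-read slave. This is a behavioral sketch under simplifying assumptions of mine (fixed latency, in-order returns, an arbitrary queue depth), not Altera's Avalon specification.

```python
from collections import deque

class LatentReadSlave:
    """Toy model of a pipelined read slave with independent input-
    and output-side signaling."""
    def __init__(self, mem, latency=2, max_outstanding=4):
        self.mem = mem
        self.latency = latency
        self.max_outstanding = max_outstanding
        self.pipe = deque()        # in-flight reads: [cycles_left, addr]
        self.readdata = None
        self.readdatavalid = False

    @property
    def waitrequest(self):
        # Input-side property: asserted only when no further command
        # can be accepted -- it says nothing about pending read data.
        return len(self.pipe) >= self.max_outstanding

    def read(self, addr):
        assert not self.waitrequest, "master must honor waitrequest"
        self.pipe.append([self.latency, addr])

    def clock(self):
        self.readdatavalid = False
        for entry in self.pipe:
            entry[0] -= 1
        if self.pipe and self.pipe[0][0] <= 0:
            _, addr = self.pipe.popleft()
            self.readdata = self.mem[addr]
            self.readdatavalid = True   # output-side qualifier
```

The master can keep issuing reads while waitrequest is low, long before any readdatavalid arrives; using waitrequest as an early hint that data is coming only works for a single outstanding transfer of latency one, which is exactly the "abuse" Tommy objects to.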