From: KJ on 24 Aug 2006 06:36

"Tommy Thorn" <tommy.thorn(a)gmail.com> wrote in message news:1156359725.967272.219130(a)74g2000cwt.googlegroups.com...
> A quick answer for this one:
>
>> rdy_cnt=1 sounds like it is allowing JOP on SimpCon to start up the
>> next transaction (read/write or twiddle thumbs) one clock cycle before
>> the read data is actually available. But how is that different than
>> the Avalon slave dropping wait request one clock cycle before the data
>> is available and then asserting read data valid once the data actually
>> is available?
>
> The signal waitrequest has nothing to do with the output, but is a
> property of the input. What you're suggesting is an "abuse" of Avalon
> and would only work for slaves that support only one outstanding
> transfer with a latency of exactly one. Clearly incompatible with
> existing Avalon components.

Not at all an abuse of Avalon. In fact it is the way waitrequest is intended to be used. I'm not quite sure what you're referring to by input and output when you say "nothing to do with the output, but is a property of the input", but what waitrequest is all about is signalling the end of the 'address phase' of the transaction, where the 'address phase' is the clock cycle(s) in which read and/or write are asserted along with address and writedata (if write is asserted).

Waitrequest is an output from a slave component that, when asserted, signals the Avalon fabric that the address and command inputs (and writedata if performing a write) need to be held for another clock cycle. Once the slave component no longer needs the address and command inputs it can drop waitrequest, even if it has not actually completed the transaction.
The Avalon fabric 'almost' passes the waitrequest signal right back to the master device, the only change being that the Avalon logic basically gates the slave's waitrequest output with the slave's chipselect input (which the Avalon fabric creates) to form the master's waitrequest input (assuming a simple single master/slave connection for simplicity here).

Per Avalon, when an Avalon master sees its waitrequest input asserted it simply must not change the state of the address, read, write or writedata outputs on that particular clock cycle. When the Avalon master is performing a read or write and sees waitrequest not asserted, it is free to start up another transaction on the next clock cycle. In particular, if the first transaction was a read, this means that the 'next' transaction can be started even though the data has not yet been returned from the first read.

For a slave device that has a readdatavalid output signal, Avalon does not define any min/max time for when readdatavalid must come back, just that for each read that has been accepted by the slave (i.e. one with read asserted, waitrequest not asserted) there must be exactly one cycle with readdatavalid asserted, flagging the readdata output as having valid data.

During a read, Avalon allows the delay between the clock cycle with "read and not(waitrequest)" and the eventual clock cycle with "readdatavalid" to be either fixed or variable. If fixed, then SOPC Builder allows the fixed latency number to be entered into the class.ptf file for the slave and no readdatavalid output from the slave is required. All that does, though, is cause SOPC Builder to synthesize the logic itself to generate readdatavalid as if it came from the slave code itself. If the readdatavalid output IS part of the component then SOPC Builder allows the latency delay to be variable; whether it actually is or not is up to the slave's VHDL/Verilog design code.
Bottom line is that Avalon does have a mechanism built right into the basic specification that allows a master device to start up another read or write cycle one clock cycle prior to readdata actually having been provided.

Given the description that Martin posted of how his SimpCon interface logic works, it 'appears' that he believes this ability to start up another cycle prior to completion (meaning the data from the read has actually been returned) is what is giving SimpCon the edge over Avalon. At least that's how it appears to me, which is why I asked him to walk me through the transaction to find where I'm missing something. My basic confusion is not understanding just exactly where in the read transaction SimpCon 'pulls ahead' of Avalon and gives 'JOP on SimpCon' the performance edge over 'JOP on Avalon'.

Anyway, hopefully that explains why it's not abusing Avalon in any way.

KJ
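The handshake KJ describes can be illustrated with a small cycle-based model. This is a hypothetical sketch in Python, not Altera's fabric logic: a slave that ends the address phase immediately (waitrequest low) and returns readdata a fixed number of cycles later via readdatavalid, so the master can issue a second read before the first read's data arrives.

```python
# Toy cycle-based model of an Avalon-style latent read (illustrative only;
# signal names follow the spec, the slave behaviour is invented for the example).

class LatentSlave:
    """Accepts a read every cycle (waitrequest stays low) and returns
    data `latency` cycles later, flagged by readdatavalid."""
    def __init__(self, latency):
        self.latency = latency
        self.pending = []  # list of (cycles_remaining, data)

    def cycle(self, read, address):
        # Age the in-flight reads; one matures when its counter hits zero.
        self.pending = [(c - 1, d) for c, d in self.pending]
        done = [d for c, d in self.pending if c == 0]
        self.pending = [(c, d) for c, d in self.pending if c > 0]
        waitrequest = False                  # address phase ends this cycle
        if read and not waitrequest:
            self.pending.append((self.latency, address * 10))  # fake readdata
        readdatavalid = bool(done)
        readdata = done[0] if done else None
        return waitrequest, readdatavalid, readdata

slave = LatentSlave(latency=2)
log = []
for t in range(6):
    read = t in (0, 1)   # master issues two back-to-back reads at t=0 and t=1
    wr, valid, data = slave.cycle(read, address=t)
    log.append((t, valid, data))
# The second read is accepted at t=1, before the first read's data
# arrives at t=2 -- the overlap KJ is pointing at.
```

Note the second address phase completes while the first data phase is still outstanding, which is exactly the "start another cycle before readdata is provided" behaviour the post attributes to waitrequest plus readdatavalid.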
From: Tommy Thorn on 24 Aug 2006 13:03

KJ wrote:
..... a (AFAICT) correct description of Avalon.

> During a read, Avalon allows the delay between the clock cycle with "read
> and not(waitrequest)" and the eventual clock cycle with "readdatavalid" to
> be either fixed or variable. [...] Bottom line is that Avalon does have a
> mechanism built right into the basic specification that allows a master
> device to start up another read or write cycle one clock cycle prior to
> readdata actually having been provided.

Ah, we only differ in perspective. Yes, Avalon _allows_ you to write slaves like that, and if your fabric consists only of such slaves, then yes, they are the same. But variable latency does _not_ work like that, thus you can't make such an assumption in general if you wish the fabric to be able to accommodate arbitrary Avalon slaves.

> Given the description that Martin posted on how his SimpCon interface logic
> works it 'appears' that he believes that this ability to start up another
> cycle prior to completion (meaning the data from the read has actually been
> returned) is what is giving SimpCon the edge over Avalon. [...] My basic
> confusion is not understanding just exactly where in the read transaction
> does SimpCon 'pull ahead' of Avalon and give 'JOP on SimpCon' the
> performance edge over 'JOP on Avalon'.
That was not my understanding. SimpCon allows Martin to get an "early warning" that a transaction is about to complete. As I mentioned, this is not an uncommon idea and it works great for point-to-point interfaces. My claim is that it doesn't scale if you wish to use SimpCon as a general-purpose fabric like Avalon. Being able to "start up another cycle prior to completion" is what I mean by multiple outstanding requests (known as "posted reads" in PCI lingo). It is definitely a feature of Avalon.

> Anyway, hopefully that explains why it's not abusing Avalon in any way.

My wording was poor. Another way to say it is "to use Avalon in a constrained way". Used this way you cannot hook up slaves with variable latency, so it's not really Avalon, it's a subset of Avalon.

Cheers,
Tommy
From: KJ on 24 Aug 2006 15:32

Tommy Thorn wrote:
> KJ wrote:
> .... a (AFAICT) correct description of Avalon.
>
> Ah, we only differ in perspective. Yes, Avalon _allows_ you to write
> slaves like that

Umm, yeah, it's defined up front in the spec and not off in some corner like Wishbone's tag method either.

> and if your fabric consists only of such slaves, then
> yes, they are the same.

What is the same as what? Also, there is no restriction about having latency-aware masters and slaves.

> But variable latency does _not_ work like that,

How do you think it works? I've been using the term 'variable latency' as it is used by Avalon, which is that there can be an arbitrary delay between the end of the address phase (i.e. when waitrequest is not asserted to the master) and the end of the data phase (i.e. when readdatavalid is asserted to the master).

> thus you can't make such an assumption in general if you wish the fabric
> to be able to accommodate arbitrary Avalon slaves.

What assumption do you think I'm making? The Avalon fabric can connect any mix of Avalon slaves, whether they are fixed latency, variable latency or no latency (i.e. controlled by waitrequest). Furthermore it can be connected to an Avalon master that is 'latency aware' (i.e. has a 'readdatavalid' input) or one that is not (i.e. does not have 'readdatavalid' as an input, so cycles are controlled only by 'waitrequest'). You get different performance based on which method is used, but that is a design choice on the master and slave side, not something that Avalon is doing anything to help or hinder.

> That was not my understanding. SimpCon allows Martin to get an "early
> warning" that a transaction is about to complete.

And what happens as a result of this 'early warning'? I *thought* it allowed the JOP Avalon master to start up another transaction of some sort. If so, then that can be accomplished with waitrequest and readdatavalid.
But maybe it's something on the data path side that gets the jump that I'm just not seeing yet.

>> Anyway, hopefully that explains why it's not abusing Avalon in any way.
>
> My wording was poor. Another way to say it is "to use Avalon in a
> constrained way".

I'm not clear on what constraint you're seeing in the usage.

> Used this way you cannot hook up slaves with variable
> latency, so it's not really Avalon, it's a subset of Avalon.

If anything, choosing not to use the readdatavalid signal in the master or slave design to allow completion of the address phase prior to the data phase is the subset, not the other way around.

KJ
From: Martin Schoeberl on 24 Aug 2006 16:44

Hi Kevin,

now I know more from your name than KJ ;-)

>> My pipeline approach is just this little funny busy counter
>> instead of a single ack, and a slave has to declare its
>> pipeline level (0 to 3). Level 1 is almost always possible.
>> It's more or less for free in a slave. Level 1 means that
>> the master can issue the next read/write command in the same
>> cycle when the data is available (rdy_cnt=0). Level 2 means
>> issue the next command one cycle earlier (rdy_cnt=1). Still
>> not a big issue for a slave (especially for a memory slave
>> where you need a little state machine anyway).
>
> I'm assuming that the master side address and command signals enter the
> 'Simpcon' bus and the 'Avalon' bus on the same clock cycle. Maybe this
> assumption is where my hang up is and maybe JOP on Simpcon is getting a
> 'head start' over JOP on Avalon.

This assumption is true. Address and command (+ write data) are issued in the same cycle - no magic there. In SimpCon this is a single-cycle thing and there is no ack or busy signal involved in this first cycle. That means no combinatorial generation of ack or busy, and no combinatorial reaction of the master in the first cycle. What I lose with SimpCon is a single-cycle latency access. However, I think this is not too much to give up for easier pipelining of the arbitration/data-in MUX.

> Given that assumption though, it's not clear to me why the address and
> command could not be designed to also end up at the actual memory
> device on the same clock cycle. Again, maybe this is where my hang up
> is.

The register that holds the address is probably an ALU result register (or in my case the top-of-stack). That one is usually buried deep in the design. Additionally, you have to generate your slave selection (chip select) from that address. This ends up with some logic and long routing paths to the pins. In a practical example with the Cyclone, 6-7 ns is not uncommon.
Almost one cycle at 100 MHz. Furthermore, this delay is not easy to control in your design - add another slave and the output delay changes. To avoid this unpredictability one will add a register at the IO pad for address and rd/wr/cs. If we agree on this additional register at the slave/memory interface, we can drop the requirement on the master to hold the address and control longer than one cycle. Furthermore, as we have this minimum one-cycle latency from master command till address/rd/wr/data on the pins, we do not need an ack/busy indication during this command cycle. We just say to the master: in the cycle that follows your command you will get the information about ready or wait.

> Given that address and command end up at the memory device on the same
> clock cycle whether SimpCon or Avalon, the resulting read data would
> then be valid and returned to the SimpCon/Avalon memory interface logic
> on the same clock cycle. Pretty sure this is correct since this is
> just saying that the external memory performance is the same, which it
> should be since it does not depend on SimpCon or Avalon.

In SimpCon it will definitely arrive one cycle later. With Avalon (and the generated memory interface) I 'assume' that there is also one cycle latency - I read this from the tco values of the output pins in the Quartus timing analyzer report. For the SRAM interface I did in VHDL I explicitly added registers at the address/rd/wr/data outputs. I don't know if the switch fabric adds another cycle. Probably not, if you do not check the pipelined checkbox in SOPC Builder.

> Given all of that, it's not clear to me why the actual returned data
> would show up on the SimpCon bus ahead of Avalon or how it would be any
> slower getting back to the SimpCon or Avalon master. Again, this might
> be where my hangup is, but if my assumptions have been correct up to
> this paragraph then I think the real issue is not here but in the next
> paragraph.

Completely agree.
The read data should arrive in the same cycle from Avalon or SimpCon to the master. Now that's the point where this rdy_cnt comes into play. In my master (JOP) I can take advantage of the early knowledge of when data will arrive. I can restart my waiting pipeline earlier with this information. This is probably the main performance difference.

Going through my VHDL code for the Avalon interface I found one more issue with the JOP/Avalon interface: in JOP I issue read/write commands and continue to execute microcode if possible. Only when the result is needed does the main pipeline wait for the slave result. However, the slave can deliver the result earlier than needed. In that case the slave has to hold the data for JOP. The Avalon specification guarantees the read data valid only for a single cycle. So I added a register to hold the data and got one cycle latency:

* one register at the input pins for the read data
* one register at the JOP/Avalon interface to hold the data longer than one cycle

As I see it, this can be enhanced in the same way I did the little Avalon specification violation on the master side: use a MUX to deliver the data from the input register in the first cycle and switch to the 'hold' register for the other cycles. I should change the interface for a fairer comparison. Thanks for pointing me to this :-)

> If I got through this far then it comes down to.... You say "Level 1
> means that the master can issue the next read/write command in the same
> cycle when the data is available (rdy_cnt=0). Level 2 means issue the
> next command one cycle earlier (rdy_cnt=1)." and presumably the
> 'rdy_cnt=1' is the reason for the better SimpCon numbers. Where I'm
> pretty sure I'm hung up then is why can't the Avalon slave drop the
> wait request output on the clock cycle that corresponds to rdy_cnt=1
> (i.e. one before data is available at the master)?

Because rdy_cnt has a different meaning than waitrequest. It is more like an early datavalid.
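The MUX enhancement described above (deliver readdata straight from the input register in the cycle it arrives, then replay it from the hold register) can be sketched as a tiny Python model. This is an illustrative guess at the behaviour, not JOP's actual VHDL; the names are made up for the example.

```python
# Hypothetical model of the input-register / hold-register MUX:
# in the cycle readdatavalid is high the data bypasses straight through
# (no extra latency), and is simultaneously captured so it can be
# replayed in later cycles until the master consumes it.

def make_interface():
    state = {"hold": None}
    def cycle(readdatavalid, readdata):
        if readdatavalid:
            state["hold"] = readdata   # capture for later cycles
            return readdata            # bypass path: saves one cycle
        return state["hold"]           # replay held data afterwards
    return cycle

iface = make_interface()
first = iface(readdatavalid=True, readdata=42)   # forwarded directly
later = iface(readdatavalid=False, readdata=0)   # replayed from hold register
```

The point of the bypass branch is exactly the one-cycle saving Martin mentions: without it, every read pays for the hold register even when the pipeline is ready to consume the data immediately.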
Dropping waitrequest does not help with my pipeline restart thing.

> rdy_cnt=1 sounds like it is allowing JOP on SimpCon to start up the
> next transaction (read/write or twiddle thumbs) one clock cycle before
> the read data is actually available. But how is that different than

As above: the main thing is to get the master pipeline started early to use the read data. Perhaps this is a special design feature of JOP and not usable in a di
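The rdy_cnt behaviour discussed in this post can be sketched as a toy countdown model. This is an assumption-laden illustration pieced together from the thread (rdy_cnt counts down toward 0, data is valid at rdy_cnt=0, and a pipeline-level-2 master may restart at rdy_cnt=1); the class and names are invented, not SimpCon's actual definition.

```python
# Toy model of a SimpCon-style slave's rdy_cnt (illustrative only).

class SimpConSlave:
    def __init__(self, read_latency=3):
        self.read_latency = read_latency
        self.rdy_cnt = 0

    def command(self):
        # Single-cycle address/command phase: no ack/busy involved here,
        # the countdown simply starts in the following cycle.
        self.rdy_cnt = self.read_latency

    def cycle(self):
        if self.rdy_cnt > 0:
            self.rdy_cnt -= 1
        return min(self.rdy_cnt, 3)   # rdy_cnt is reported saturated at 3

slave = SimpConSlave(read_latency=3)
slave.command()
trace = [slave.cycle() for _ in range(4)]
# A level-2 master restarts its pipeline when it sees rdy_cnt=1,
# one cycle before the data is valid at rdy_cnt=0.
restart_cycle = trace.index(1)
```

Note how the master gets a multi-cycle "early warning" (2, then 1) rather than a single ack, which is what lets JOP refill its pipeline before the data actually lands.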
From: Martin Schoeberl on 24 Aug 2006 17:13
> the input" but what waitrequest is all about is to signal the end of the
> 'address phase' of the transaction where 'address phase' are the clock
> cycle(s) where read and/or write are asserted along with address and
> writedata (if write is asserted).

If we could agree on slaves that don't need address/write data/commands for more than one cycle, we could completely eliminate the waitrequest ;-) Let's say the address/command phase is per definition one cycle. That definition frees the master to do whatever it wants in the next cycle. For another request to the same slave it has to watch the rdy_cnt in SimpCon. However, you can design a switch fabric with SimpCon where it is legal to issue a command to a different slave in the next cycle without paying attention to the first slave. You can just ignore the first slave's output until you want to use it.

> The Avalon fabric 'almost' passes the waitrequest signal right back to the
> master device, the only change being that the Avalon logic basically gates
> the slave's waitrequest output with the slave's chipselect input (which the
> Avalon fabric creates) to form the master's waitrequest input (assuming a
> simple single master/slave connection for simplicity here). Per Avalon,

I'm repeating myself ;-) That's the point I don't like in Avalon, Wishbone, OPB, ...: you have a combinatorial path from address register - decoding - slave decision - master decision (to hold address/command or not). With a few slaves this will not be an issue. With more slaves or a more complicated interconnect (multiple masters) this can be your critical path.

BTW: AMBA APB is an exception: it also delivers the ready decision (PREADY) in the following cycle. But AMBA APB still forces the master to hold address/command till PREADY. AMBA AHB is a little different: there is still an address and data phase, but they can overlap. On a wait request the address and data have to be held by the master (although in the basic transfer this is not necessary).
A little bit confusing...

> Given the description that Martin posted on how his SimpCon interface logic
> works it 'appears' that he believes that this ability to start up another
> cycle prior to completion (meaning the data from the read has actually been
> returned) is what is giving SimpCon the edge over Avalon. At least that's

No, that's not the difference. I agree that for fully pipelined transactions (e.g. a cache line read) both busses should give you absolutely the same performance.

> how it appears to me, which is why I asked him to walk me through the
> transaction to find where I'm missing something. My basic confusion is not
> understanding just exactly where in the read transaction does SimpCon 'pull
> ahead' of Avalon and give 'JOP on SimpCon' the performance edge over 'JOP on
> Avalon'.

As described in the other posting:

a.) the early pipeline restart
b.) the additional cycle in the Avalon interface for the register that holds the data for the master (should be enhanced)

Martin