JOP as SOPC component [FPGA]

From: Martin Schoeberl on 14 Aug 2006 12:56

>>>> You almost never want to have a fixed number of wait states but want to simply have the Avalon slave provide a wait request
>>>> output and tell Avalon that by specifying that in the PTF file.
>>>
>>> Completely agree. When not writing and reading too many posts
>>> I'm working on that version of the SRAM interface. It was just
>>> a quick start as shown in the Quartus manual.
>>
>> BTW (to KJ): Do you have this type of Avalon slave
>> for an SRAM? Would save some time and errors for me ;-)
>>
> No, over the past several years my use of async SRAMs has gone to 0 even though I used to use them quite heavily. They've been
> replaced by internal
> I'm assuming that you've checked and that Altera didn't toss one in as a MegaCore? Too bad.

Their core for the (older) NIOS boards just uses the tri-state
bridge .PTF approach - no VHDL

>
> Oh well, I'll stop posting and let you get back to work.
> KJ

Ok, I have now two versions of the SRAM interface: the plain
PTF version and the VHDL version (with I/O registers at
the FPGA pins to access a 15ns SRAM in two cycles at 100 MHz,
and the nwr with the neg. clock to save one cycle on write).
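
As a rough check of the two-cycle figure, the sketch below computes how many whole clock cycles are needed to cover an SRAM access time. It only divides access time by clock period and deliberately ignores pad delay, board delay and register setup margins. The function name min_access_cycles is made up for illustration and is not part of JOP or SOPC Builder.

```python
import math

def min_access_cycles(access_time_ns: float, clock_mhz: float) -> int:
    """Smallest whole number of clock cycles that covers the access time.

    Simplification (assumption): I/O pad delay, board delay and
    setup/hold margins are ignored.
    """
    period_ns = 1000.0 / clock_mhz  # clock period in nanoseconds
    return math.ceil(access_time_ns / period_ns)

# 15 ns SRAM at 100 MHz: 10 ns period, so one cycle is too short.
print(min_access_cycles(15, 100))  # -> 2
```

In a real design the registered I/O pins and the timing margins decide whether two cycles are actually enough; this is only the first-order arithmetic.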

Here are some performance numbers of this JOP/SRAM
interface on an embedded benchmark. It measures iterations/s
and therefore higher numbers are better. All versions are clocked
at 100 MHz, 4 KB instruction cache and 512 Byte stack cache.
FPGA is Cyclone EP1C6-6, Memory is 32-bit SRAM 15ns. The only
difference is the memory interface.

SimpCon: 16,633
Avalon (PTF version): 14,015
Avalon (VHDL version): 13,920

So for me, the additional latency cycle(s) and not having
the early ack information for the CPU pipeline degrades
JOP's performance. Perhaps some Avalon specialist can do better.

However, it is compensated by the many peripherals that are
now just a mouse click away ;-)

Martin

From: Martin Schoeberl on 14 Aug 2006 13:21

> Here are some performance numbers of this JOP/SRAM
> interface on an embedded benchmark. It measures iterations/s
> and therefore higher numbers are better. All versions are clocked
> at 100 MHz, 4 KB instruction cache and 512 Byte stack cache.
> FPGA is Cyclone EP1C6-6, Memory is 32-bit SRAM 15ns. The only
> difference is the memory interface.
>
> SimpCon: 16,633
> Avalon (PTF version): 14,015
> Avalon (VHDL version): 13,920
>
some additional numbers from the Altera DE2 board with
Cyclone II at 100 MHz with SDRAM and using on-chip
memory (the EP2C35 is big enough to run the benchmark
in on-chip memory).

Avalon SDRAM: 7,288
Avalon on-chip memory: 15,769

The performance issue with the SDRAM is clear. Just needs
some more caching to get a big (8 MB) memory with
acceptable performance ;-)
However, even the fast on-chip memory Avalon solution is
slightly slower than the two cycle SRAM connected via
SimpCon.

Martin

From: KJ on 14 Aug 2006 14:01

"Martin Schoeberl" <mschoebe(a)mail.tuwien.ac.at> wrote in message
news:44e0b13c$0$11352
> However, even the fast on-chip memory Avalon solution is
> slightly slower than the two cycle SRAM connected via
> SimpCon.
>

Well now THAT is incredibly surprising since the on-chip memory should be
giving you 0 wait state, 0 latency performance (i.e. WaitRequest should
always be low when accessing memory). That would seem to point to either
some issue that comes up every now and then in your 'CPU to Avalon' bridge
master interface logic or something equally odd inside the Avalon fabric
itself connecting the CPU to the memory.

I'd be interested to hear what you find.

KJ

From: Martin Schoeberl on 14 Aug 2006 19:16

> Well now THAT is incredibly surprising since the on-chip memory should be giving you 0 wait state, 0 latency performance (i.e.
> WaitRequest should

That's not right anymore. You have at minimum one cycle latency
as addresses are registered in current on-chip RAMs. Probably also
the output is registered. However, I don't know - would have to
look into the VHDL code.

> always be low when accessing memory). That would seem to point to either

waitrequest always low would only be possible with pipelining using
datavalid. That helps on a cache fill, but not on an ordinary read.

Perhaps I should try to connect the on-chip RAM to my
'native' SimpCon interface and check the performance.
That should be better than the 2 cycle SRAM. However,
this is a more theoretical experiment as Java programs
usually will not fit into on-chip RAMs ;-)
C programs with NIOS are more code efficient.

> some issue that comes up every now and then in your 'CPU to Avalon' bridge master interface logic or something equally odd inside
> the Avalon fabric itself connecting the CPU to the memory.

One issue is that my CPU takes advantage of this 'counting down
ready' signal (the bsy_cnt in SimpCon). I can't do this with the
Avalon spec. Therefore, there is a performance penalty that is
inherent in the design.
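
To illustrate why an early 'ready' indication can matter, here is a toy cycle count. It is not derived from JOP, SimpCon or the Avalon spec. It assumes a read that returns data after `latency` cycles and a CPU pipeline that needs `restart` cycles of preparation once it knows the data is coming. All names and numbers are hypothetical.

```python
def stall_cycles_waitrequest(latency: int, restart: int) -> int:
    """Master learns about completion only when the data arrives,
    so the pipeline preparation starts after the full latency."""
    return latency + restart

def stall_cycles_countdown(latency: int, restart: int, lookahead: int) -> int:
    """Master sees a countdown `lookahead` cycles before the data arrives,
    so up to that many preparation cycles overlap with the access."""
    overlap = min(restart, lookahead, latency)
    return latency + restart - overlap

# Hypothetical example: 2-cycle read, 1 preparation cycle, 1 cycle of early notice.
print(stall_cycles_waitrequest(2, 1))   # no overlap possible
print(stall_cycles_countdown(2, 1, 1))  # preparation hidden behind the access
```

In this model one saved cycle on a two-cycle read is a large fraction of every memory access, which points in the same direction as the measured gap, although the real pipeline behaviour is more involved.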

> I'd be interested to hear what you find.

The CPU/Avalon bridge is probably sub-optimal. I will
try to check this out (first I have to get the Altera ModelSim
version running, which would make it easier; I still haven't compiled
the missing SOPC libraries for ModelSim).

Martin

From: Martin Schoeberl on 18 Aug 2006 05:39

>> Here are some performance numbers of this JOP/SRAM
>> interface on an embedded benchmark. It measures iterations/s
>> and therefore higher numbers are better. All versions are clocked
>> at 100 MHz, 4 KB instruction cache and 512 Byte stack cache.
>> FPGA is Cyclone EP1C6-6, Memory is 32-bit SRAM 15ns. The only
>> difference is the memory interface.
>>
>> SimpCon: 16,633
>> Avalon (PTF version): 14,015
>> Avalon (VHDL version): 13,920
>>
> some additional numbers from the Altera DE2 board with
> Cyclone II at 100 MHz with SDRAM and using on-chip
> memory (the EP2C35 is big enough to run the benchmark
> in on-chip memory).
>
> Avalon SDRAM: 7,288
> Avalon on-chip memory: 15,769
>
and some more:

JOP at 100MHz on the Altera DE2 using the 16-bit SRAM:

Avalon: 11,322
SimpCon: 14,760

So for the SRAM interface SimpCon is a clear winner ;-)
The 16-bit SRAM SimpCon solution is even faster than
the 32-bit SRAM Avalon solution.

BTW: the embedded benchmark is a control application
which does not need high memory bandwidth. For a
different benchmark (a UDP/IP application with IP
processing - lots of buffer access) the difference is
larger. With the 16-bit SRAM:

Avalon: 4,302
SimpCon: 5,716

again - higher number is better
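
The relative differences can be worked out from the numbers posted in this thread. The sketch below does only that arithmetic. The helper name simpcon_gain_percent is made up, and the 32-bit Avalon figure used is the PTF version (the faster of the two Avalon variants).

```python
# Benchmark results quoted in this thread (iterations/s, higher is better).
results = {
    "32-bit SRAM, control benchmark": {"SimpCon": 16633, "Avalon": 14015},
    "16-bit SRAM, control benchmark": {"SimpCon": 14760, "Avalon": 11322},
    "16-bit SRAM, UDP/IP benchmark": {"SimpCon": 5716, "Avalon": 4302},
}

def simpcon_gain_percent(simpcon: int, avalon: int) -> float:
    """Throughput advantage of SimpCon over Avalon, in percent."""
    return (simpcon / avalon - 1.0) * 100.0

for name, r in results.items():
    gain = simpcon_gain_percent(r["SimpCon"], r["Avalon"])
    print(f"{name}: SimpCon is {gain:.1f}% faster")
```

The gain on the UDP/IP benchmark comes out a little higher than on the control benchmark with the same 16-bit SRAM, which matches the remark that the difference grows with memory traffic.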

Martin
