From: -jg on 3 Jun 2010 21:01

On Jun 4, 10:08 am, rickman <gnu...(a)gmail.com> wrote:
> Assuming you *need* the OS level portion. If I understand the XMOS
> device, they have pipelined their design, but instead of trying to use
> that to speed up a single processor, they treat it as a time-sliced
> multi-processor. Zero overhead other than the muxing of the multiple
> registers. The trade-off is that each of the N processors runs as if
> it were not pipelined, at 1/N of the clock rate. I guess there may be
> some complexity in the interrupt controller too. So for the cost of 1
> processor in terms of logic, they get N processors running
> concurrently.

Close. They can run up to 4 threads with no speed impact, but really
that's because they limit each thread to no more than C/4. Above 4
threads, that C/N scaling starts to show.

What XMOS forgot to do was allow good HW capture on the pins.
Perhaps, being SW-centric, they figured all problems can be solved
with code, but that always has a time ceiling.

> I may take a look at doing that in my processor. The
> code space could even be shared.

While you have the hood open, also look at making the interrupt
response times always the same. (The new M0 claims to do this.)
Often jitter is a far worse problem than absolute delays.

-jg
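As a concrete reading of the thread-rate claim above, here is a
minimal Python sketch of the scheduling arithmetic, assuming a 400MHz
core clock (the figure used later in this thread). It illustrates the
claim as stated; it is not taken from XMOS documentation.

    # Per-thread rate capped at core_clock/4; above four threads the
    # issue slots are shared, so the rate falls as C/N. Illustrative only.
    CORE_CLOCK_MHZ = 400  # assumed core clock

    def thread_rate_mhz(n_threads: int) -> float:
        """Per-thread instruction rate with n_threads active."""
        return CORE_CLOCK_MHZ / max(4, n_threads)

    for n in range(1, 9):
        print(f"{n} thread(s): {thread_rate_mhz(n):.1f} MHz each")
    # 1-4 threads: 100.0 MHz each; 5: 80.0; 6: 66.7; 7: 57.1; 8: 50.0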
From: rickman on 3 Jun 2010 23:32

On Jun 3, 9:01 pm, -jg <jim.granvi...(a)gmail.com> wrote:
> On Jun 4, 10:08 am, rickman <gnu...(a)gmail.com> wrote:
> > Assuming you *need* the OS level portion. If I understand the XMOS
> > device, they have pipelined their design, but instead of trying to use
> > that to speed up a single processor, they treat it as a time-sliced
> > multi-processor. Zero overhead other than the muxing of the multiple
> > registers. The trade-off is that each of the N processors runs as if
> > it were not pipelined, at 1/N of the clock rate. I guess there may be
> > some complexity in the interrupt controller too. So for the cost of 1
> > processor in terms of logic, they get N processors running
> > concurrently.
>
> Close. They can run up to 4 threads with no speed impact, but really
> that's because they limit each thread to no more than C/4. Above 4
> threads, that C/N scaling starts to show.

I don't follow. I thought a processor was 8-way interleaved. Why does
it slow with more than 4 threads? I see there are only four "hardware
locks", but I can't find any mention of what they are and what they
do. Is that what you are saying limits it somehow?

> What XMOS forgot to do was allow good HW capture on the pins.
> Perhaps, being SW-centric, they figured all problems can be solved
> with code, but that always has a time ceiling.
>
> > I may take a look at doing that in my processor. The
> > code space could even be shared.
>
> While you have the hood open, also look at making the interrupt
> response times always the same. (The new M0 claims to do this.)
> Often jitter is a far worse problem than absolute delays.

My processor is highly optimized for real-time work. Every
instruction is one clock cycle, and the interrupt latency is always
the same: one clock cycle. As soon as the interrupt winds its way
through the interrupt controller logic (less than a clock cycle if
the signal is already synchronous to the CPU clock), the next clock
cycle jams an interrupt in place of the current instruction, saves
the current address and the current PSW, and jumps to the interrupt
location, all in one clock cycle. That is one advantage of having two
stacks: you can save two things at once.

The only shortcoming is that the stacks *are* your working register
set, so they are still pointing just above the last operands of the
interrupted code. If you want to save interrupt context between
invocations, you either have to explicitly save the data in memory
somewhere on exit and restore it on reentry (very slow), or you need
to save and restore the data stack pointer, one word to be saved in
memory. I haven't tried writing the code for this yet. I'll be
interested to see how much work the CPU has to do to save and restore
the data stack pointer.

With round-robin scheduling of tasks to the time slices, letting the
interrupt use an uncommitted slice would mean a variable delay of 1
to 8 clocks, but those are system clocks, not instruction clocks. Or
an interrupt could just take the next time slice no matter what. I
would think the whole tasking/interrupt thing could get very complex
if priorities are involved. It would be much simpler to just assign
an interrupt to a time slice and let it be shared between the
interrupt and a lower-priority task.

Rick
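The trade-off in Rick's last paragraph (wait for the uncommitted
slice versus steal the next slice) can be modelled in a few lines of
Python. This is a sketch of the two policies as described, not Rick's
actual design; the slot assignment below is hypothetical.

    # An 8-slot round-robin wheel with one uncommitted slice (slot 7).
    # Policy A: an interrupt waits for the free slice to come around,
    # giving 1..8 system clocks of latency depending on arrival time.
    # Policy B: the interrupt steals the very next slice, giving a
    # constant 1-clock latency at the cost of stalling that slice's task.
    N_SLOTS = 8
    committed = [True] * 7 + [False]  # slots 0-6 run tasks; slot 7 is free

    def wait_for_free_slot(arrival_slot: int) -> int:
        """System clocks until the uncommitted slot comes around."""
        for delay in range(1, N_SLOTS + 1):
            if not committed[(arrival_slot + delay) % N_SLOTS]:
                return delay
        raise RuntimeError("no uncommitted slot in the wheel")

    print([wait_for_free_slot(s) for s in range(N_SLOTS)])
    # -> [7, 6, 5, 4, 3, 2, 1, 8]: the variable 1-to-8-clock delay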
From: -jg on 4 Jun 2010 05:52

On Jun 4, 3:32 pm, rickman <gnu...(a)gmail.com> wrote:
> > Close. They can run up to 4 threads with no speed impact, but really
> > that's because they limit each thread to no more than C/4. Above 4
> > threads, that C/N scaling starts to show.
>
> I don't follow. I thought a processor was 8-way interleaved. Why does
> it slow with more than 4 threads? I see there are only four "hardware
> locks", but I can't find any mention of what they are and what they
> do. Is that what you are saying limits it somehow?

My understanding is they chose to allow only 100MHz CPU rates at a
400MHz clock (maybe some I/O limits), so the max CPU thread speed is
100MHz (10ns), but you CAN launch up to 4 of these with no impact
(and so use up all the slack 400MHz clocks). Thereafter, adding
another thread has to lower the average per-thread CPU speed, until
you get to the 8 x 50MHz CPUs point.

An alternative approach would have been to allow 8 time slots, and
then map each slot to a thread (8 x 3-bit entries).

If they had done that, then 2 x 100MHz + 4 x 50MHz thread rates would
have been possible, with less interaction between thread loading.
Then those 50MHz threads could start/stop with no rate cross-effects,
and hopefully some power saving.

-jg
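For concreteness, -jg's 8-entry slot-to-thread map might look like
the following sketch; the assignment is hypothetical, chosen to show
how 2 x 100MHz + 4 x 50MHz falls out of a 400MHz clock. (An
illustration of the proposal, not an actual XMOS feature.)

    # Eight slots, each holding a 3-bit thread number; a thread's rate
    # is its share of the slots times the core clock.
    from collections import Counter

    CORE_CLOCK_MHZ = 400
    slot_map = [0, 1, 0, 1, 2, 3, 4, 5]  # threads 0 and 1 get two slots each

    for thread, n in sorted(Counter(slot_map).items()):
        rate = CORE_CLOCK_MHZ * n / len(slot_map)
        print(f"thread {thread}: {rate:.0f} MHz")
    # thread 0: 100 MHz; thread 1: 100 MHz; threads 2-5: 50 MHz each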
From: rickman on 4 Jun 2010 07:27

On Jun 4, 5:52 am, -jg <jim.granvi...(a)gmail.com> wrote:
> On Jun 4, 3:32 pm, rickman <gnu...(a)gmail.com> wrote:
> > > Close. They can run up to 4 threads with no speed impact, but really
> > > that's because they limit each thread to no more than C/4. Above 4
> > > threads, that C/N scaling starts to show.
> >
> > I don't follow. I thought a processor was 8-way interleaved. Why does
> > it slow with more than 4 threads? I see there are only four "hardware
> > locks", but I can't find any mention of what they are and what they
> > do. Is that what you are saying limits it somehow?
>
> My understanding is they chose to allow only 100MHz CPU rates at a
> 400MHz clock (maybe some I/O limits), so the max CPU thread speed is
> 100MHz (10ns), but you CAN launch up to 4 of these with no impact
> (and so use up all the slack 400MHz clocks). Thereafter, adding
> another thread has to lower the average per-thread CPU speed, until
> you get to the 8 x 50MHz CPUs point.
>
> An alternative approach would have been to allow 8 time slots, and
> then map each slot to a thread (8 x 3-bit entries).
>
> If they had done that, then 2 x 100MHz + 4 x 50MHz thread rates would
> have been possible, with less interaction between thread loading.
> Then those 50MHz threads could start/stop with no rate cross-effects,
> and hopefully some power saving.
>
> -jg

I can't say I understand this. The scheme I was thinking about would
have had 8 time slots, each at 1/8th of the system clock rate. There
would be no way for a single thread to use more than 1/8th of the
system clock rate, because then it would be pipelined and would
require special logic to manage the issues that creates. I guess the
devil is in the details, and I really don't know how they are doing
this.

Rick
From: Jaime Andres Aranguren C. on 18 Jun 2010 18:26

"rickman" <gnuarm(a)gmail.com> wrote in message
news:54356eb8-3f64-4df7-9997-e916daa71b7c(a)m21g2000vbr.googlegroups.com...

On Jun 2, 7:35 pm, -jg <jim.granvi...(a)gmail.com> wrote:
> On Jun 3, 10:05 am, rickman <gnu...(a)gmail.com> wrote:
> > > Rather than the uC+CPLD the marketing types are chasing, I would find
> > > a CPLD+RAM more useful, as there are LOTS of uC out there already, and
> > > if they can make 32KB SRAM for sub $1, they should be able to include
> > > it almost for free in a medium CPLD.
> > >
> > > -jg
> >
> > I won't argue with that for a moment. But deciding what to put in a
> > part and which flavors to offer in what packages is decided in the
> > land of marketing. As much as I whine and complain, I guess I have to
> > assume they know *something* about their jobs.
>
> The product managers are understandably blinkered by what has gone
> before and what they sell now, so in the CPLD market it is very rare
> to see a bold step.
>
> The CoolRunner was the last bold step I recall, and that was not made
> by a traditional vendor product manager, but by some new blood.
>
> Altera, Atmel, Lattice and Xilinx have slowed right down on CPLD
> releases, almost to 'run out' mode.

I've been busy with work the last few months, so I tend to forget
what I read about trends. I seem to recall that Xilinx has announced
something with an MCU in it, and not the PPC they used in the past.
Do I remember right? Is X coming out with an FPGA with an ARM?

Personally, I prefer something other than an ARM inside an FPGA. I
want a CPU that executes each instruction in a single clock cycle and
has very seriously low interrupt latency. That is why I designed my
own CPU at one point. ARM CPUs with FPGAs seem to be oriented to
people who want to use lots of memory and run a real-time OS. Not
that an ARM or a real-time OS is a bad thing; I just want something
closer to the metal.

If I could get a good MCU in an FPGA (which would certainly have some
adequate memory) in a "convenient" package, that would really make my
day. I don't have to have the analog stuff, but 5-volt tolerance
would certainly be useful. That alone would take two chips off my
board, and maybe more.

Rick

--

The ARM Cortex-M3 is in Actel SmartFusion FPGAs. Not in a small
package, though.

--
Jaime Andres Aranguren Cardona
SanJaaC Electronics
Soluciones en DSP
www.sanjaac.com