From: -jg on 3 Jun 2010 21:01

On Jun 4, 10:08 am, rickman <gnu...(a)gmail.com> wrote:
> Assuming you *need* the OS level portion. If I understand the XMOS
> device, they have pipelined their design, but instead of trying to use
> that to speed up a single processor, they treat it as a time-sliced
> multi-processor. Zero overhead other than the muxing of the multiple
> registers. The trade-off is that each of the N processors runs as if
> it were not pipelined, at 1/N of the clock rate. I guess there may be
> some complexity in the interrupt controller too. So for the cost of 1
> processor in terms of logic, they get N processors running
> concurrently.

Close. They can run up to 4 threads with no speed impact, but really
that's because they limit each thread to no more than C/4. Above 4
threads, that C/N scaling starts to show.

What XMOS forgot to do was allow good HW capture on the pins.
Perhaps, being SW-centric, they figured all problems can be solved
with code, but that always has a time ceiling.

> I may take a look at doing that in my processor. The
> code space could even be shared.

While you have the hood open, also look at making the interrupt
response times always the same. (The new M0 claims to do this.)
Often jitter is a far worse problem than absolute delays.

-jg
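As a concrete reading of the thread-rate claim above, here is a
minimal Python sketch of the scheduling arithmetic, assuming a 400MHz
core clock (the figure used later in this thread). It illustrates the
claim as stated; it is not taken from XMOS documentation.

    # Per-thread rate capped at core_clock/4; above four threads the
    # issue slots are shared, so the rate falls as C/N. Illustrative only.
    CORE_CLOCK_MHZ = 400  # assumed core clock

    def thread_rate_mhz(n_threads: int) -> float:
        """Per-thread instruction rate with n_threads active."""
        return CORE_CLOCK_MHZ / max(4, n_threads)

    for n in range(1, 9):
        print(f"{n} thread(s): {thread_rate_mhz(n):.1f} MHz each")
    # 1-4 threads: 100.0 MHz each; 5: 80.0; 6: 66.7; 7: 57.1; 8: 50.0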
From: rickman on 3 Jun 2010 23:32

On Jun 3, 9:01 pm, -jg <jim.granvi...(a)gmail.com> wrote:
> On Jun 4, 10:08 am, rickman <gnu...(a)gmail.com> wrote:
> > Assuming you *need* the OS level portion. If I understand the XMOS
> > device, they have pipelined their design, but instead of trying to use
> > that to speed up a single processor, they treat it as a time-sliced
> > multi-processor. Zero overhead other than the muxing of the multiple
> > registers. The trade-off is that each of the N processors runs as if
> > it were not pipelined, at 1/N of the clock rate. I guess there may be
> > some complexity in the interrupt controller too. So for the cost of 1
> > processor in terms of logic, they get N processors running
> > concurrently.
>
> Close. They can run up to 4 threads with no speed impact, but really
> that's because they limit each thread to no more than C/4. Above 4
> threads, that C/N scaling starts to show.

I don't follow. I thought a processor was 8-way interleaved. Why does
it slow with more than 4 threads? I see there are only four "hardware
locks", but I can't find any mention of what they are and what they
do. Is that what you are saying limits it somehow?

> What XMOS forgot to do was allow good HW capture on the pins.
> Perhaps, being SW-centric, they figured all problems can be solved
> with code, but that always has a time ceiling.
>
> > I may take a look at doing that in my processor. The
> > code space could even be shared.
>
> While you have the hood open, also look at making the interrupt
> response times always the same. (The new M0 claims to do this.)
> Often jitter is a far worse problem than absolute delays.

My processor is highly optimized for real-time work. Every
instruction is one clock cycle, and the interrupt latency is always
the same: one clock cycle. As soon as the interrupt winds its way
through the interrupt controller logic (less than a clock cycle if
the signal is already synchronous to the CPU clock), the next clock
cycle jams an interrupt in place of the current instruction, saves
the current address and the current PSW, and jumps to the interrupt
location, all in one clock cycle. That is one advantage of having two
stacks: you can save two things at once.

The only shortcoming is that the stacks *are* your working register
set, so they are still pointing just above the last operands of the
interrupted code. If you want to save interrupt context between
invocations, you either have to explicitly save the data in memory
somewhere on exit and restore it on reentry (very slow), or you need
to save and restore the data stack pointer, one word to be saved in
memory. I haven't tried writing the code for this yet. I'll be
interested to see how much work the CPU has to do to save and restore
the data stack pointer.

With round-robin scheduling of tasks to the time slices, letting the
interrupt use an uncommitted slice would mean a variable delay of 1
to 8 clocks, but those are system clocks, not instruction clocks. Or
an interrupt could just take the next time slice no matter what. I
would think the whole tasking/interrupt thing could get very complex
if priorities are involved. It would be much simpler to just assign
an interrupt to a time slice and let it be shared between the
interrupt and a lower-priority task.

Rick
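The trade-off in Rick's last paragraph (wait for the uncommitted
slice versus steal the next slice) can be modelled in a few lines of
Python. This is a sketch of the two policies as described, not Rick's
actual design; the slot assignment below is hypothetical.

    # An 8-slot round-robin wheel with one uncommitted slice (slot 7).
    # Policy A: an interrupt waits for the free slice to come around,
    # giving 1..8 system clocks of latency depending on arrival time.
    # Policy B: the interrupt steals the very next slice, giving a
    # constant 1-clock latency at the cost of stalling that slice's task.
    N_SLOTS = 8
    committed = [True] * 7 + [False]  # slots 0-6 run tasks; slot 7 is free

    def wait_for_free_slot(arrival_slot: int) -> int:
        """System clocks until the uncommitted slot comes around."""
        for delay in range(1, N_SLOTS + 1):
            if not committed[(arrival_slot + delay) % N_SLOTS]:
                return delay
        raise RuntimeError("no uncommitted slot in the wheel")

    print([wait_for_free_slot(s) for s in range(N_SLOTS)])
    # -> [7, 6, 5, 4, 3, 2, 1, 8]: the variable 1-to-8-clock delay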
From: -jg on 4 Jun 2010 05:52

On Jun 4, 3:32 pm, rickman <gnu...(a)gmail.com> wrote:
> > Close. They can run up to 4 threads with no speed impact, but really
> > that's because they limit each thread to no more than C/4. Above 4
> > threads, that C/N scaling starts to show.
>
> I don't follow. I thought a processor was 8-way interleaved. Why does
> it slow with more than 4 threads? I see there are only four "hardware
> locks", but I can't find any mention of what they are and what they
> do. Is that what you are saying limits it somehow?

My understanding is they chose to allow only 100MHz CPU rates at a
400MHz clock (maybe some I/O limits), so the max CPU thread speed is
100MHz (10ns), but you CAN launch up to 4 of these with no impact
(and so use up all the slack 400MHz clocks). Thereafter, adding
another thread has to lower the average per-thread CPU speed, until
you get to the 8 x 50MHz CPUs point.

An alternative approach would have been to allow 8 time slots, and
then map each slot to a thread (8 x 3-bit entries).

If they had done that, then 2 x 100MHz + 4 x 50MHz thread rates would
have been possible, with less interaction between thread loading.
Then those 50MHz threads could start/stop with no rate cross-effects,
and hopefully some power saving.

-jg
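For concreteness, -jg's 8-entry slot-to-thread map might look like
the following sketch; the assignment is hypothetical, chosen to show
how 2 x 100MHz + 4 x 50MHz falls out of a 400MHz clock. (An
illustration of the proposal, not an actual XMOS feature.)

    # Eight slots, each holding a 3-bit thread number; a thread's rate
    # is its share of the slots times the core clock.
    from collections import Counter

    CORE_CLOCK_MHZ = 400
    slot_map = [0, 1, 0, 1, 2, 3, 4, 5]  # threads 0 and 1 get two slots each

    for thread, n in sorted(Counter(slot_map).items()):
        rate = CORE_CLOCK_MHZ * n / len(slot_map)
        print(f"thread {thread}: {rate:.0f} MHz")
    # thread 0: 100 MHz; thread 1: 100 MHz; threads 2-5: 50 MHz each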
From: rickman on 4 Jun 2010 07:27

On Jun 4, 5:52 am, -jg <jim.granvi...(a)gmail.com> wrote:
> On Jun 4, 3:32 pm, rickman <gnu...(a)gmail.com> wrote:
> > > Close. They can run up to 4 threads with no speed impact, but really
> > > that's because they limit each thread to no more than C/4. Above 4
> > > threads, that C/N scaling starts to show.
> >
> > I don't follow. I thought a processor was 8-way interleaved. Why does
> > it slow with more than 4 threads? I see there are only four "hardware
> > locks", but I can't find any mention of what they are and what they
> > do. Is that what you are saying limits it somehow?
>
> My understanding is they chose to allow only 100MHz CPU rates at a
> 400MHz clock (maybe some I/O limits), so the max CPU thread speed is
> 100MHz (10ns), but you CAN launch up to 4 of these with no impact
> (and so use up all the slack 400MHz clocks). Thereafter, adding
> another thread has to lower the average per-thread CPU speed, until
> you get to the 8 x 50MHz CPUs point.
>
> An alternative approach would have been to allow 8 time slots, and
> then map each slot to a thread (8 x 3-bit entries).
>
> If they had done that, then 2 x 100MHz + 4 x 50MHz thread rates would
> have been possible, with less interaction between thread loading.
> Then those 50MHz threads could start/stop with no rate cross-effects,
> and hopefully some power saving.
>
> -jg

I can't say I understand this. The scheme I was thinking about would
have had 8 time slots, each at 1/8th of the system clock rate. There
would be no way for a single thread to use more than 1/8th of the
system clock rate, because then it would be pipelined and would
require special logic to manage the issues that creates. I guess the
devil is in the details, and I really don't know how they are doing
this.

Rick
From: Jaime Andres Aranguren C. on 18 Jun 2010 18:26

"rickman" <gnuarm(a)gmail.com> wrote in message
news:54356eb8-3f64-4df7-9997-e916daa71b7c(a)m21g2000vbr.googlegroups.com...

On Jun 2, 7:35 pm, -jg <jim.granvi...(a)gmail.com> wrote:
> On Jun 3, 10:05 am, rickman <gnu...(a)gmail.com> wrote:
> > > Rather than the uC+CPLD the marketing types are chasing, I would find
> > > a CPLD+RAM more useful, as there are LOTS of uC out there already, and
> > > if they can make 32KB SRAM for sub $1, they should be able to include
> > > it almost for free in a medium CPLD.
> > >
> > > -jg
> >
> > I won't argue with that for a moment. But deciding what to put in a
> > part and which flavors to offer in what packages is decided in the
> > land of marketing. As much as I whine and complain, I guess I have to
> > assume they know *something* about their jobs.
>
> The product managers are understandably blinkered by what has gone
> before and what they sell now, so in the CPLD market it is very rare
> to see a bold step.
>
> The CoolRunner was the last bold step I recall, and that was not made
> by a traditional vendor product manager, but by some new blood.
>
> Altera, Atmel, Lattice and Xilinx have slowed right down on CPLD
> releases, almost to 'run out' mode.

I've been busy with work the last few months, so I tend to forget
what I read about trends. I seem to recall that Xilinx has announced
something with an MCU in it, and not the PPC they used in the past.
Do I remember right? Is X coming out with an FPGA with an ARM?

Personally, I prefer something other than an ARM inside an FPGA. I
want a CPU that executes each instruction in a single clock cycle and
has very seriously low interrupt latency. That is why I designed my
own CPU at one point. ARM CPUs with FPGAs seem to be oriented to
people who want to use lots of memory and run a real-time OS. Not
that an ARM or a real-time OS is a bad thing; I just want something
closer to the metal.

If I could get a good MCU in an FPGA (which would certainly have some
adequate memory) in a "convenient" package, that would really make my
day. I don't have to have the analog stuff, but 5-volt tolerance
would certainly be useful. That alone would take two chips off my
board, and maybe more.

Rick

--

The ARM Cortex-M3 is in Actel SmartFusion FPGAs. Not in a small
package, though.

--
Jaime Andres Aranguren Cardona
SanJaaC Electronics
Soluciones en DSP
www.sanjaac.com