From: JJ on 7 May 2006 13:54

Piotr Wyderski wrote:
[snip]
> > If the 32 bit RISC was optimized for some specialized task, then it
> > might make sense to have it alongside a high-performance CPU.
>
> No, because in this case you are trying to outperform an out-of-order,
> highly parallel processor core able to complete ~6 simple instructions
> per cycle and clocked at 2+ GHz. Reasonable soft CPU cores run at
> about 200 MHz and complete only one instruction per cycle. It
> means that a cheap CPU you can buy anywhere has about 60 times
> higher performance in sequential processing. Even if you could provide
> the same performance (not to mention outperforming it, which
> is the key idea, anyway), it would mean that you are at least Harry Potter.
> :-)

I have fantastic disbelief about that 6 ops/clock figure except in very specific circumstances, perhaps in a video codec using MMX/SSE etc., where those units really do the equivalent of many tiny integer ops per cycle on 4 or more parallel 8 bit DSP values. That looks pretty much like what FPGA DSP can do trivially, except for the clock ratio, 2 GHz v. 150 MHz.

I look at my C code (compilers, GUI development, databases, simulators etc.) and some of the critical output assembler, and then time parts of it on huge 1M-iteration timed loops, making sure no iteration benefits from caching the previous run. I always see a tiny fraction of that ~6 ops/cycle. Since my code is most definitely not vector or media-codec code, but a mix of graph and tree traversal over large uncacheable spans, I often see average rates of exactly 1 op per clock, on an Athlon TB at 1 GHz and also an XP2400 at 2 GHz.

My conclusion is that the claims for wicked performance are mostly super hype that most punters accept all too easily. The truth of the matter is that an Athlon XP rated at, say, 2400 is not 2.4x faster than a TB at 1 GHz in the average case, except maybe on vector codecs.
When I compare Windows apps on different CPUs, I usually see the faster CPU performing closer to the square root of its claimed speedup. A while back, Tom's Hardware did a comparison of 3 GHz P4s v. the 100 MHz first Pentium and all the in-betweens, and the plot was basically linear, and that's on stupid benchmarks that don't reflect real-world code. One has to bear in mind that the P4 not only used 30x the clock to get 30x the benchmark performance, it also used perhaps 100x the transistor count, and that is all due to the Memory Wall and the necessity of avoiding DRAM accesses at all costs. Now if we did that on an FPGA benchmark we would be damned to hell; one should count the clock ratio and the gate or LUT ratio, but PCs have gotten away with using infinite transistor budgets to make their claims.

This makes sense to me, since the instruction rate is still bound by real memory accesses to DRAM some percentage of the time for cache misses, I figure around 2% typical or even more. DRAM has improved miserably over 20 years in true random access, about 2x, from 120 ns to 60 ns RAS-to-Dout time. If you assume cache miss rates close to 0.1% then you get the hyped numbers, but code doesn't work like that, at least mine doesn't.

Try running a random number generator, say R250, which can generate a new random number every 3 ns on an XP2400 (9 ops IIRC). Now use that number to address a table much larger than 4 MB. All of a sudden my 12 Gops Athlon is running at 3 MHz, i.e. every memory access takes 300 ns or so, since every part of the memory system is wrecked (deliberately in this case). Ironically, if that's all you wanted to do, an FPGA cpu without a complex MMU and TLBs could generate random numbers in 1 cycle and drive an SDRAM controller just as fast if not faster, since SDRAMs can cycle fully randomly closer to 60 ns. Now in packet switching and processing, where large tables are looked up with random-looking fields, they use RLDRAM to get SRAM-like performance.
So what does real code look like? Any old mixture of the two extremes: sometimes it's memory-crippled; sometimes, if everything is in L1 cache, it really does seem to do 2 ops/clock if array accesses are spread out, even with small forward branches. So all the complexity of these OoO machines is there to push the average rate up and keep it just above 1 for typical integer codes, more for specially tuned codes. Each FP op used, though, is equivalent to a large number of ops on an integer-only cpu, but then I rarely use FP except for reporting averages.

So on an FPGA cpu, without OoO, no branch prediction, and with tiny caches, I would expect to see only about 0.6 to 0.8 ops/cycle, and without caches, a fraction of that. That leaves the real speed difference much closer, maybe 10-20 to 1 for integer codes, but orders more for FP codes. For an integer-only problem where some of the code can be turned into specialized instructions, as in your applications list, the FPGA cpu is more transparent and possibly a more even match if replicated enough, but it is still difficult even to get parity, and writing HDL is much harder than plain C. I have no experience with the Opterons yet; I have heard they might be 10x faster than my old 1 GHz TB, but I remain skeptical based on past experience.

On the Harry Potter theme, I have suggested that an FPGA Transputer cpu that solves the Memory Wall by trading it for a Thread Wall, using latency-hiding MTA cpus and especially latency-hiding MTA RLDRAM, can be a more serious competitor to conventional OoO, BP, SS designs that continue to flog regular SDRAMs. In that sort of design, a 10 PE + 1 MMU Transputer node with RLDRAM can match 1000 MIPS, since each PE is only 100 MIPS, but you have to deal with 40 threads with almost no Memory Wall effect, i.e. a Thread Wall.
Since the PEs are quite cheap, the limit on FPGAs is really how many MMUs can be placed on an FPGA for maximum memory throughput, and that seems to be a pin and special-clocks limit rather than a core limit. Perhaps, using spare BlockRAMs as an L1 RLDRAM intermediate, one could get many more busy cpus inside the FPGA sharing the RLDRAM bandwidth on L1 misses.

> > More interested in prototyping some RISC centric soft-IP designs.
>
> You can do this using existing development boards.
>
> Best regards
> Piotr Wyderski

regards

John Jakson
transputer guy

(paper at wotug.org)
From: Andreas Ehliar on 8 May 2006 02:55

On 2006-05-07, JJ <johnjakson(a)gmail.com> wrote:
> I would say that if we were to see PCIe on chip, even if on a higher $
> part, we would quickly see alot more co pro board activity even just
> plain vanilla PC boards.

You might be interested to know that Lattice is doing just that in some of their LatticeSC parts. On the other hand, you are somewhat limited in the kinds of applications you can accelerate, since the LatticeSC does not have embedded multipliers IIRC. (With the LatticeSC, Lattice is targeting communication applications such as line cards, which rarely need high-performance multiplication.)

/Andreas
From: Andreas Ehliar on 8 May 2006 02:58

On 2006-05-06, Piotr Wyderski <wyderski(a)mothers.against.spam-ii.uni.wroc.pl> wrote:
> What could it accelerate? Modern PCs are quite fast beasts...
> If you couldn't speed things up by a factor of, say, 300%, your
> device would be useless. Modest improvements by several tens
> of percents can be neglected -- Moore's law constantly works
> for you. FPGAs are good for special-purpose tasks, but there
> are not many such tasks in the realm of PCs.

One interesting application for most of the people on this newsgroup would be synthesis, place & route and HDL simulation. My guess would be that these applications could be heavily accelerated by FPGAs. My second guess is that it is far from trivial to actually do this :)

/Andreas
From: fpga_toys on 8 May 2006 03:38

Piotr Wyderski wrote:
> AFAIR a nice 2 GHz Sempron chip costs about $70. No FPGA can
> beat its price/performance ratio if its tasks would be CPU-like. An FPGA
> device can easily win with any general-purpose CPU in the domain of
> DSP, advanced encryption and decryption, cipher breaking, true real-time
> control etc., but these are not typical applications of a PC computer, so
> there is not much to accelerate. And don't forget about easier alternative
> ways, like computing on a GPU present on your video card:

CPUs are heavily optimized memory/cache/register/ALU engines which, for serial algorithms, will always outperform an FPGA unless the algorithm isn't strictly serial in nature. In doing TMCC/FpgaC-based research for several years, it's surprising how many natively serial algorithms can be successfully rewritten with significant parallel gains in the FPGA domain. The dirty part of "fixing" traditional optimized C code is actually removing all the performance-specific coding "enhancements" meant to fine-tune that C code for a particular ISA. In fact, some rather counterintuitive coding styles (for those with an ISA-centric experience set) are necessary to give an FPGA C compiler the room to properly optimize the code for performance.

Consider variable reuse. It's very common to declare just a couple of variables to save memory (cache/registers) and heavily reuse them. For an FPGA C compiler, this means constructing a multiplexor for each write instance of the variable, and frequently committing a LUT/FF pair to it. If the instances are separated out, then frequently the operations for the individual instances will result in nothing more than a few wires and extra terms in the LUTs of other variables.
So the ISA-optimized C code can actually create more storage, logic, and clock cycles than a less optimized, direct coding style that isolates variables by actual function, where those costs may well be completely free and transparent. Generally a small 16-element array is nearly free, using LUT-based RAMs for those small arrays. Thus it becomes relatively easy to design FPGA code sequences around a large number of independent memories, by several means, including aggressive loop unrolling.

Because even LUT-based memories are inherently serial (due to addressing requirements), it's sometimes wise to rename array references to individual variables (i.e. V[0] becomes V0, V[1] becomes V1, etc.), which may easily unroll several consecutive loops into a single set of LUT terms in a single reasonable clock cycle. The choice here depends heavily on whether the arrays/variables are feedback terms in the outer C loops (FSM) and need to be registered anyway.

As I've noted in other discussions about FpgaC coding styles, pipelining is another counterintuitive strategy that is easily exploited in FPGA code: by reversing statement order and providing additional retiming variables, a deep combinatorial code block is broken up into faster, smaller blocks, with the reversed order of updating creating pipeline FFs in the resulting FSM for the C code loops.

VHDL/Verilog/C all have similar language and variable expression terms, and can all be compiled with nearly the same functional results. The difference is that coding FSMs is a natural part of loop construction in C, and frequently requires considerable care in VHDL/Verilog. When loops execute on a traditional ISA, all kinds of events occur (pipeline flushes, wasted prefetch cycles, branch prediction stalls, exception processing) which prevent fast ISA machines from reaching even a few percent of best-case performance.
These seemingly sequential loops that would intuitively seem highly suited to ISA execution can in fact turn into a very flat one-cycle FSM with some modest recoding, easily running at a few hundred MHz in an FPGA, versus dozens of cycles on an ISA at a much slower effective speed.

A large FPGA with roughly a thousand I/O pins can keep a dozen 32-bit quad-DDR memories at full-tilt bandwidth, a significantly higher memory bandwidth than you can typically get out of a traditional ISA CPU. Some applications can benefit from this, but it requires that the primary memory live on the FPGA accelerator card, not on the system bus. This means that the code which manipulates that memory must all be moved into the FPGA card, to avoid the bandwidth bottleneck. Likewise, some applications may well need the FPGA to have a direct connection to a couple dozen disk drives and network ports, to serve wirespeed network-to-storage applications, using the host CPU only for "exceptions" and "housekeeping".
From: Adam Megacz on 8 May 2006 04:39
Andreas Ehliar <ehliar(a)lysator.liu.se> writes:
> One interesting application for most of the people on this
> newsgroup would be synthesis, place & route and HDL simulation.
> My guess would be that these applications could be heavily
> accelerated by FPGA:s. My second guess that it is far from trivial
> to actually do this :)

Place: http://www.cs.caltech.edu/research/ic/pdf/hwassistsa_fpga2003.pdf
Route: http://www.cs.caltech.edu/research/ic/pdf/fastroute_fpga2003.pdf

- a

--
PGP/GPG: 5C9F F366 C9CF 2145 E770 B1B8 EFB1 462D A146 C380