Prev: systemc
Next: sqrt(a^2 + b^2) in synthesizable VHDL?
From: Wayne on 8 May 2006 10:21

HyperTransport offers 41 GB/s of bandwidth. Maybe it is the best way to move data between a PC and an FPGA.

Wayne
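(For context, a rough reading of that figure, assuming it refers to a full-width 32-bit HyperTransport 3.0 link: a 2.6 GHz link clock x 2 transfers/clock (DDR) x 4 bytes gives 20.8 GB/s per direction, or about 41.6 GB/s aggregate over both directions. Narrower or slower links, as found on most boards of the time, give proportionally less.)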
From: Jeremy Ralph on 8 May 2006 12:45

Yes, 41 GB/s would be a nice rate for moving data around. Am I correct in assuming this would require a motherboard with two or more AMD Socket 939 sockets? Any idea how much effort would be involved in programming the host to move data between the two? I expect there are some open libraries for this sort of thing. Also, how much work would it be to have the FPGA handshake the HyperTransport protocol? Hopefully the FPGA board vendor would have this covered.

Found this product, which looks interesting. Anyone know of other HT products of interest?
http://www.drccomputer.com/pages/modules.html

Seems the HT route could get expensive (more costly FPGA board + new motherboard & processor).

Thanks all for the great discussion!

---
PDTi [ http://www.productive-eda.com ]
SpectaReg -- Spec-down code and doc generation for register maps
From: Piotr Wyderski on 8 May 2006 13:13

JJ wrote:

> I have fantastic disbelief about that 6 ops/clock except in very
> specific circumstances, perhaps in a video codec using MMX/SSE etc. where
> those units really do the equivalent of many tiny integer operations per cycle on
> 4 or more parallel 8-bit DSP values.

John, of course it is about peak performance, reachable with great effort. But the existence of every accelerator is justified only when even that peak performance is not enough. Otherwise you could simply write better code at no additional hardware cost. I know that in most cases the CPU sleeps for lack of load or stalls because of a cache miss, but that is a completely different song...

> Now that's looking pretty much like what FPGA DSP can do pretty trivially,
> except for the clock ratio, 2GHz v 150MHz.

Yes, in my case a Cyclone @ 65MHz (130MHz internally + SDR interface, 260MHz at the critical path with timesharing) is enough. But it is a specialized waveforming device, not a general-purpose computer. As a processor it could reach 180MHz and then stabilize -- not an impressive value today, not to mention that it contains no cache, as BRAMs are too precious a resource to be wasted that way.

> A while back, Tom's Hardware did a comparison of 3GHz P4s v the P100 1st
> Pentium and all the in-betweens, and the plot was basically linear

Interesting. In fact I don't care about the P4, as its architecture is one big mistake, but a linear speedup would be a shame for a Pentium 3...

> benchmark performance, it also used perhaps 100x the transistor count

Northwood has 55 million, the old Pentium had 4.5 million.

> as well, and that is all due to the Memory Wall and the necessity to
> avoid at all costs accessing DRAM.

Yes, that is true. The 144 MiB of caches of a POWER5 does help. A 1.5GHz POWER5 is as fast as a 3.2GHz Pentium 4 (measured on a large memory-hungry application). But you can buy many P4s at the price of a single POWER5 MCM.

> Try running a random number generator, say R250, which can generate a new
> random number every 3ns on an XP2400 (9 ops IIRC). Now use that number to
> address a table >> 4MB. All of a sudden my 12Gops Athlon is running at
> 3MHz, i.e. every memory access takes 300ns.

Man, what 4MiB... ;-) Our application's working set is 200--600MiB. That's the PITA! :-/

> So on an FPGA cpu, without OoO, no branch prediction, and with tiny
> caches, I would expect to see only about .6 to .8 ops/cycle, and
> without caches

In a soft DSP processor it would be much less, as there is much vector processing, which omits (or at least should) the funny caches built of BRAMs.

> I have no experience with the Opterons yet, I have heard they might be
> 10x faster than my old 1GHz TB but I remain skeptical based on past
> experience.

I like the Cell approach -- no cache => no cache misses => tremendous performance. But there are only 256KiB of local memory, so it is restricted to specialized tasks.

Best regards
Piotr Wyderski
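(For anyone who wants to reproduce the effect being discussed, here is a minimal sketch of that kind of random-access test -- not anyone's actual benchmark. It assumes a cheap xorshift generator in place of R250, and the table sizes and iteration count are arbitrary; the point is only that the ns/access figure climbs steeply once the table outgrows the caches:)

/* Minimal sketch of a random-access memory test: generate indices with a
 * few ALU ops per number, then read a table whose size is swept from
 * cache-resident up toward DRAM-bound, timing the average cost per access. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static double ns_per_access(size_t table_bytes, size_t iterations)
{
    size_t n = table_bytes / sizeof(uint32_t);
    uint32_t *table = malloc(n * sizeof(uint32_t));
    if (!table)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        table[i] = (uint32_t)i;

    uint32_t x = 2463534242u;       /* PRNG state                               */
    volatile uint32_t sink = 0;     /* keep the loads from being optimized away */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iterations; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;   /* xorshift32 */
        sink += table[x % n];       /* essentially random address each time     */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(table);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iterations;
}

int main(void)
{
    const size_t iters = 10 * 1000 * 1000;
    /* sweep the working set from 32 KiB up to 512 MiB, doubling each time */
    for (size_t kib = 32; kib <= 512u * 1024; kib *= 2)
        printf("%8zu KiB: %6.1f ns/access\n", kib, ns_per_access(kib * 1024, iters));
    return 0;
}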
From: Piotr Wyderski on 8 May 2006 13:40

Andreas Ehliar wrote:

> One interesting application for most of the people on this
> newsgroup would be synthesis, place & route and HDL simulation.
> My guess would be that these applications could be heavily
> accelerated by FPGA:s.

A car is not the best tool for making other cars. It's not a bees & butterflies story. :-) Same with FPGAs.

> My second guess is that it is far from trivial to actually do this :)

And who would actually need that?

Best regards
Piotr Wyderski
From: JJ on 8 May 2006 14:20
Piotr Wyderski wrote:
> JJ wrote:
> > I have fantastic disbelief about that 6 ops/clock except in very
> > specific circumstances, perhaps in a video codec using MMX/SSE etc. where
> > those units really do the equivalent of many tiny integer operations per cycle on
> > 4 or more parallel 8-bit DSP values.
>
> John, of course it is about peak performance, reachable with great effort.

Of course, I don't think we differ much in opinion on the matter. But I prefer to stick to the average throughput available with C code. In summary, I think any HW acceleration is justified when it is pretty much busy all the time, embedded, or at least can shrink very significantly the time spent waiting for completion, but I fear few opportunities will get taken, since the software experts are far from having the know-how to do this in HW. For many apps where an FPGA might barely be considered, one might also look at the GPUs or the PhysX chip, or maybe wait for ClearSpeed to get on board (esp. for flops), so the FPGA will be the least visible option.

> But the existence of every accelerator is justified only when even that
> peak performance is not enough. Otherwise you could simply write better
> code at no additional hardware cost. I know that in most cases the CPU
> sleeps for lack of load or stalls because of a cache miss, but that is a
> completely different song...
>
> > Now that's looking pretty much like what FPGA DSP can do pretty trivially,
> > except for the clock ratio, 2GHz v 150MHz.
>
> Yes, in my case a Cyclone @ 65MHz (130MHz internally + SDR interface,
> 260MHz at the critical path with timesharing) is enough. But it is a
> specialized waveforming device, not a general-purpose computer. As a
> processor it could reach 180MHz and then stabilize -- not an impressive
> value today, not to mention that it contains no cache, as BRAMs are too
> precious a resource to be wasted that way.

The BRAMs are what define the opportunity: 500-odd BRAMs, all whacking data at say 300MHz and dual-ported, is orders of magnitude more bandwidth than any commodity cpu will ever see, so if they can be used independently, FPGAs win hands down. I suspect a lot of poorly executed software-to-hardware conversions combine too many BRAMs into a single large and relatively very expensive SRAM, which gives all the points back to cpus. That is also the problem with soft-core cpus: to be useful you want lots of cache, but merging BRAMs into useful-sized caches throws all their individual bandwidth away. That's why I propose using RLDRAM, as it allows FPGA cpus to use one BRAM each and share RLDRAM bandwidth over many threads, with full associativity of memory lines using a hashed MMU structure, an IPT of sorts.

> > A while back, Tom's Hardware did a comparison of 3GHz P4s v the P100 1st
> > Pentium and all the in-betweens, and the plot was basically linear
>
> Interesting. In fact I don't care about the P4, as its architecture is one
> big mistake, but a linear speedup would be a shame for a Pentium 3...

Tom's, IIRC, didn't have AMD in the lineup; it must have been 1-2 yrs ago. The P4 end of the curve was still linear, but the tests are IMO bogus, as they push linear memory tests rather than the random test I use. I hate it when people talk of bandwidth for blasting GBs of contiguous data around and completely ignore pushing millions of tiny blocks around.

> > benchmark performance, it also used perhaps 100x the transistor count
>
> Northwood has 55 million, the old Pentium had 4.5 million.
100x is overstating it a bit, I admit, but the turn to multi-core puts cpus back on the same path as FPGAs: Moore's law for quantity rather than raw clock speed, which keeps the arguments for & against relatively constant.

> > as well, and that is all due to the Memory Wall and the necessity to
> > avoid at all costs accessing DRAM.
>
> Yes, that is true. The 144 MiB of caches of a POWER5 does help.
> A 1.5GHz POWER5 is as fast as a 3.2GHz Pentium 4 (measured
> on a large memory-hungry application). But you can buy many P4s
> at the price of a single POWER5 MCM.
>
> > Try running a random number generator, say R250, which can generate a new
> > random number every 3ns on an XP2400 (9 ops IIRC). Now use that number to
> > address a table >> 4MB. All of a sudden my 12Gops Athlon is running at
> > 3MHz, i.e. every memory access takes 300ns.
>
> Man, what 4MiB... ;-) Our application's working set is 200--600MiB. That's
> the PITA! :-/

Actually I ran that test from 32k, doubling until I got to my RAM limit of 640MB (no swapping) on a 1GB system, and the speed reduction is a sort of staircase on a log scale. At 32K there is obviously no real slowdown; the step bumps indicate the memory system gradually failing -- L1, L2, TLB. After 16M the drop to 300ns can't get any worse, since the L2 and TLBs have long since failed, having so very little associativity. But then again it all depends on temporal locality: how much work gets done per cache-line refill, and whether all the effort of the cache transfer is thrown away every time (trees) or only some of the time (code).

In the RLDRAM approach I use, the Virtex-II Pro would effectively see 3ns raw memory issue rates for fully random accesses; the true latency of 20ns is well hidden, and the issue rate is reduced probably 2x to allow for rehashing and bank collisions. Still, a 6ns issue rate v 300ns for fully random access is something to crow about. Of course the technology would work even better on a full-custom cpu. The OS never really gets involved to fix up TLBs, since there aren't any; the MMU does the rehash work. The two big penalties are that tagging adds 20% to memory cost (1 tag every 32 bytes), and that with hashing the store should be left <80% full, but memory is cheap, bandwidth isn't.

> > So on an FPGA cpu, without OoO, no branch prediction, and with tiny
> > caches, I would expect to see only about .6 to .8 ops/cycle, and
> > without caches
>
> In a soft DSP processor it would be much less, as there is much vector
> processing, which omits (or at least should) the funny caches built of
> BRAMs.

DSP has highly predictable data structures and high locality, not much tree walking, so SDRAM bandwidth can be better used directly; still, code should be cached.

> > I have no experience with the Opterons yet, I have heard they might be
> > 10x faster than my old 1GHz TB but I remain skeptical based on past
> > experience.
>
> I like the Cell approach -- no cache => no cache misses => tremendous
> performance. But there are only 256KiB of local memory, so it is
> restricted to specialized tasks.

I suspect Cell will get used to accelerate as many apps as FPGAs, or more, but it is so manually cached. I can't say I like it myself: so much theoretical peak, but how to get at it? I much prefer the Niagara approach to cpu design, if only the memory were done the same way.

> Best regards
> Piotr Wyderski

regards
John Jakson
transputer guy
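(To make the hashed-MMU idea above concrete, here is a toy software model -- purely illustrative, not John's actual design; the line size, store size, probe limit and hash function are all assumptions. Each 32-byte line in the store carries a tag holding its virtual line number; a lookup hashes the address, compares the tag, and rehashes on a collision, so there is no TLB to miss, at the cost of the tag overhead and of keeping the store less than ~80% full:)

/* Toy model of a hashed, tagged memory-line lookup (illustrative only). */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define LINE_BYTES 32
#define NUM_LINES  (1u << 16)   /* model store: 64K lines; keep it < 80% full */
#define MAX_PROBES 8

typedef struct {
    uint64_t tag;               /* virtual line number; 0 means "empty" here  */
    uint8_t  data[LINE_BYTES];
} line_t;

static line_t store[NUM_LINES];

/* Mix the virtual line number with the probe count; any decent hash works. */
static uint32_t hash_line(uint64_t vline, uint32_t probe)
{
    uint64_t h = (vline + probe) * 0x9E3779B97F4A7C15ull;
    return (uint32_t)(h >> 32) & (NUM_LINES - 1);
}

/*
 * Return a pointer to the 32-byte line holding vaddr, claiming an empty slot
 * on first touch, or NULL after MAX_PROBES collisions (store too full).
 * A hit costs one tagged memory access; every rehash costs one more, which
 * models the ~2x issue-rate penalty mentioned in the post.
 */
static uint8_t *lookup(uint64_t vaddr)
{
    uint64_t vline = vaddr / LINE_BYTES;
    for (uint32_t probe = 0; probe < MAX_PROBES; probe++) {
        line_t *l = &store[hash_line(vline, probe)];
        if (l->tag == vline)
            return l->data;     /* tag match: the common, single-access case  */
        if (l->tag == 0) {      /* empty slot: map the line here              */
            l->tag = vline;
            return l->data;
        }
        /* slot occupied by another line: rehash and probe again */
    }
    return NULL;
}

int main(void)
{
    memset(store, 0, sizeof store);
    uint8_t *p = lookup(0x12345678ull);
    if (p) {
        p[0] = 42;
        printf("line mapped, byte 0 = %u\n", lookup(0x12345678ull)[0]);
    }
    return 0;
}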