From: JJ on 7 May 2006 13:54

Piotr Wyderski wrote:
[snip]
> > If the 32 bit RISC was optimized for some specialized task, then it
> > might make sense to have it alongside a high-performance CPU.
>
> No, because in this case you are trying to outperform an out-of-order,
> highly parallel processor core able to complete ~6 simple instructions
> per cycle and clocked at 2+ GHz. Reasonable soft CPU cores run at
> about 200 MHz and complete only one instruction per cycle. It
> means that a cheap CPU you can buy anywhere has about 60 times
> higher performance in sequential processing. Even if you could provide
> the same performance (not to mention outperforming it, which
> is the key idea, anyway), it would mean that you are at least Harry Potter.
> :-)

I have fantastic disbelief about that 6 ops/clock figure except in very specific circumstances, perhaps in a video codec using MMX/SSE etc., where those units really do the equivalent of many tiny integer ops per cycle on 4 or more parallel 8 bit DSP values. That looks pretty much like what FPGA DSP can do trivially, except for the clock ratio, 2 GHz v. 150 MHz.

I look at my C code (compilers, GUI development, databases, simulators etc.) and some of the critical output assembler, and then time parts of it on huge 1M-iteration timed loops, making sure no iteration benefits from caching the previous run. I always see a tiny fraction of that ~6 ops/cycle. Since my code is most definitely not vector or media-codec code, but a mix of graph and tree traversal over large uncacheable spans, I often see average rates of exactly 1 op per clock, on an Athlon TB at 1 GHz and also an XP2400 at 2 GHz.

My conclusion is that the claims for wicked performance are mostly super hype that most punters accept all too easily. The truth of the matter is that an Athlon XP rated at, say, 2400 is not 2.4x faster than a TB at 1 GHz in the average case, except maybe on vector codecs.
When I compare Windows apps on different CPUs, I usually see the faster CPU performing closer to the square root of its claimed speedup. A while back, Tom's Hardware did a comparison of 3 GHz P4s v. the 100 MHz first Pentium and all the in-betweens, and the plot was basically linear, and that's on stupid benchmarks that don't reflect real-world code. One has to bear in mind that the P4 not only used 30x the clock to get 30x the benchmark performance, it also used perhaps 100x the transistor count, and that is all due to the Memory Wall and the necessity of avoiding DRAM accesses at all costs. Now if we did that on an FPGA benchmark we would be damned to hell; one should count the clock ratio and the gate or LUT ratio, but PCs have gotten away with using infinite transistor budgets to make their claims.

This makes sense to me, since the instruction rate is still bound by real memory accesses to DRAM some percentage of the time for cache misses, I figure around 2% typical or even more. DRAM has improved miserably over 20 years in true random access, about 2x, from 120 ns to 60 ns RAS-to-Dout time. If you assume cache miss rates close to 0.1% then you get the hyped numbers, but code doesn't work like that, at least mine doesn't.

Try running a random number generator, say R250, which can generate a new random number every 3 ns on an XP2400 (9 ops IIRC). Now use that number to address a table much larger than 4 MB. All of a sudden my 12 Gops Athlon is running at 3 MHz, i.e. every memory access takes 300 ns or so, since every part of the memory system is wrecked (deliberately in this case). Ironically, if that's all you wanted to do, an FPGA cpu without a complex MMU and TLBs could generate random numbers in 1 cycle and drive an SDRAM controller just as fast if not faster, since SDRAMs can cycle fully randomly closer to 60 ns. Now in packet switching and processing, where large tables are looked up with random-looking fields, they use RLDRAM to get SRAM-like performance.
So what does real code look like? Any old mixture of the two extremes: sometimes it's memory-crippled; sometimes, if everything is in L1 cache, it really does seem to do 2 ops/clock if array accesses are spread out, even with small forward branches. So all the complexity of these OoO machines is there to push the average rate up and keep it just above 1 for typical integer codes, more for specially tuned codes. Each FP op used, though, is equivalent to a large number of ops on an integer-only cpu, but then I rarely use FP except for reporting averages.

So on an FPGA cpu, without OoO, no branch prediction, and with tiny caches, I would expect to see only about 0.6 to 0.8 ops/cycle, and without caches, a fraction of that. That leaves the real speed difference much closer, maybe 10-20 to 1 for integer codes, but orders more for FP codes. For an integer-only problem where some of the code can be turned into specialized instructions, as in your applications list, the FPGA cpu is more transparent and possibly a more even match if replicated enough, but it is still difficult even to get parity, and writing HDL is much harder than plain C. I have no experience with the Opterons yet; I have heard they might be 10x faster than my old 1 GHz TB, but I remain skeptical based on past experience.

On the Harry Potter theme, I have suggested that an FPGA Transputer cpu that solves the Memory Wall by trading it for a Thread Wall, using latency-hiding MTA cpus and especially latency-hiding MTA RLDRAM, can be a more serious competitor to conventional OoO, BP, SS designs that continue to flog regular SDRAMs. In that sort of design, a 10 PE + 1 MMU Transputer node with RLDRAM can match 1000 MIPS, since each PE is only 100 MIPS, but you have to deal with 40 threads with almost no Memory Wall effect, i.e. a Thread Wall.
Since the PEs are quite cheap, the limit on FPGAs is really how many MMUs can be placed on an FPGA for maximum memory throughput, and that seems to be a pin and special-clocks limit rather than a core limit. Perhaps, using spare BlockRAMs as an L1 RLDRAM intermediate, one could get many more busy cpus inside the FPGA sharing the RLDRAM bandwidth on L1 misses.

> > More interested in prototyping some RISC centric soft-IP designs.
>
> You can do this using existing development boards.
>
> Best regards
> Piotr Wyderski

regards

John Jakson
transputer guy

(paper at wotug.org)
From: Andreas Ehliar on 8 May 2006 02:55

On 2006-05-07, JJ <johnjakson(a)gmail.com> wrote:
> I would say that if we were to see PCIe on chip, even if on a higher $
> part, we would quickly see alot more co pro board activity even just
> plain vanilla PC boards.

You might be interested to know that Lattice is doing just that in some of their LatticeSC parts. On the other hand, you are somewhat limited in the kinds of applications you can accelerate, since the LatticeSC does not have embedded multipliers IIRC. (With the LatticeSC, Lattice is targeting communication applications such as line cards, which rarely need high-performance multiplication.)

/Andreas
From: Andreas Ehliar on 8 May 2006 02:58

On 2006-05-06, Piotr Wyderski <wyderski(a)mothers.against.spam-ii.uni.wroc.pl> wrote:
> What could it accelerate? Modern PCs are quite fast beasts...
> If you couldn't speed things up by a factor of, say, 300%, your
> device would be useless. Modest improvements by several tens
> of percents can be neglected -- Moore's law constantly works
> for you. FPGAs are good for special-purpose tasks, but there
> are not many such tasks in the realm of PCs.

One interesting application for most of the people on this newsgroup would be synthesis, place & route and HDL simulation. My guess would be that these applications could be heavily accelerated by FPGAs. My second guess is that it is far from trivial to actually do this :)

/Andreas
From: fpga_toys on 8 May 2006 03:38

Piotr Wyderski wrote:
> AFAIR a nice 2 GHz Sempron chip costs about $70. No FPGA can
> beat its price/performance ratio if its tasks would be CPU-like. An FPGA
> device can easily win with any general-purpose CPU in the domain of
> DSP, advanced encryption and decryption, cipher breaking, true real-time
> control etc., but these are not typical applications of a PC computer, so
> there is not much to accelerate. And don't forget about easier alternative
> ways, like computing on a GPU present on your video card:

CPUs are heavily optimized memory/cache/register/ALU engines which, for serial algorithms, will always outperform an FPGA unless the algorithm isn't strictly serial in nature. In doing TMCC/FpgaC-based research for several years, it's surprising how many natively serial algorithms can be successfully rewritten with significant parallel gains in the FPGA domain. The dirty part of "fixing" traditional optimized C code is actually removing all the performance-specific coding "enhancements" meant to fine-tune that C code for a particular ISA. In fact, some rather counterintuitive coding styles (for those with an ISA-centric experience set) are necessary to give an FPGA C compiler the room to properly optimize the code for performance.

Consider variable reuse. It's very common to declare just a couple of variables to save memory (cache/registers) and heavily reuse them. For an FPGA C compiler, this means constructing a multiplexor for each write instance of the variable, and frequently committing a LUT/FF pair to it. If the instances are separated out, then frequently the operations for the individual instances will result in nothing more than a few wires and extra terms in the LUTs of other variables.
So the ISA-optimized C code can actually create more storage, logic, and clock cycles than a less optimized, direct coding style that isolates variables by actual function, where those costs may well be completely free and transparent. Generally a small 16-element array is nearly free, using LUT-based RAMs for those small arrays. Thus it becomes relatively easy to design FPGA code sequences around a large number of independent memories, by several means, including aggressive loop unrolling.

Because even LUT-based memories are inherently serial (due to addressing requirements), it's sometimes wise to rename array references to individual variables (i.e. V[0] becomes V0, V[1] becomes V1, etc.), which may easily unroll several consecutive loops into a single set of LUT terms in a single reasonable clock cycle. The choice here depends heavily on whether the arrays/variables are feedback terms in the outer C loops (FSM) and need to be registered anyway.

As I've noted in other discussions about FpgaC coding styles, pipelining is another counterintuitive strategy that is easily exploited in FPGA code: by reversing statement order and providing additional retiming variables, a deep combinatorial code block is broken up into faster, smaller blocks, with the reversed order of updating creating pipeline FFs in the resulting FSM for the C code loops.

VHDL/Verilog/C all have similar language and variable expression terms, and can all be compiled with nearly the same functional results. The difference is that coding FSMs is a natural part of loop construction in C, and frequently requires considerable care in VHDL/Verilog. When loops execute on a traditional ISA, all kinds of events occur (pipeline flushes, wasted prefetch cycles, branch prediction stalls, exception processing) which prevent fast ISA machines from reaching even a few percent of best-case performance.
These seemingly sequential loops that would intuitively seem highly suited to ISA execution can in fact turn into a very flat one-cycle FSM with some modest recoding, easily running at a few hundred MHz in an FPGA, versus dozens of cycles on an ISA at a much slower effective speed.

A large FPGA with roughly a thousand I/O pins can keep a dozen 32-bit quad-DDR memories at full-tilt bandwidth, a significantly higher memory bandwidth than you can typically get out of a traditional ISA CPU. Some applications can benefit from this, but it requires that the primary memory live on the FPGA accelerator card, not on the system bus. This means that the code which manipulates that memory must all be moved into the FPGA card, to avoid the bandwidth bottleneck. Likewise, some applications may well need the FPGA to have a direct connection to a couple dozen disk drives and network ports, to serve wirespeed network-to-storage applications, using the host CPU only for "exceptions" and "housekeeping".
From: Adam Megacz on 8 May 2006 04:39
Andreas Ehliar <ehliar(a)lysator.liu.se> writes:
> One interesting application for most of the people on this
> newsgroup would be synthesis, place & route and HDL simulation.
> My guess would be that these applications could be heavily
> accelerated by FPGA:s. My second guess that it is far from trivial
> to actually do this :)

Place: http://www.cs.caltech.edu/research/ic/pdf/hwassistsa_fpga2003.pdf
Route: http://www.cs.caltech.edu/research/ic/pdf/fastroute_fpga2003.pdf

- a

--
PGP/GPG: 5C9F F366 C9CF 2145 E770 B1B8 EFB1 462D A146 C380