From: Thomas Womack on 26 Oct 2006 17:37

In article <1161854959.938202.172510(a)k70g2000cwa.googlegroups.com>,
BDH <bhauth(a)gmail.com> wrote:

>> Back in the late 1960s and all the 1970s, a few of us were saying that
>> we should NOT be measuring KFlops but KMaccs (memory accesses); even
>> now, that mindset is not universal. And 90% of the double-quoted
>> nonsense uses completely unrealistic and often inconsistent models.
>> If you do an analysis using a simple, but more realistic model[*], you
>> get VERY different complexities for most such operations. But such
>> approaches are VERY hard to get published, as the establishment is not
>> keen to have people point out that it has been bullshitting for many
>> decades ....
>
>You need some memory while gates wait for their input or a clock cycle,
>but why would you send your data to some sort of separate RAM and back?

Because all interesting problems are bigger than all practical arrays
of combinatorial logic -- if your operation is at all interesting, you
probably can't fit 1000^2 copies of it onto a chip even with fancy
lithography. So you have to get the data onto and off the chip, and
chips don't seem to find it easy to have more than about 2^7 I/O pins
clocked at more than about 1 Gbps each.

It's easy to build a chip with pipelined FADD and FMUL units explicitly
laid out so that it can do an FFT on a new set of 16 complex*16
variables every nanosecond, or with pipelined compare-and-swap units
laid out in a sorting network so that it sorts 32 int*8s each
nanosecond; probably not impossible to build a chip with
generalish-purpose units and wide routers so that it could quickly be
configured to do either; absolutely bloody impossible to get the data
to it at the 2 Tbit/s needed to keep up with the pipeline.
If your problem is small enough that the whole thing fits into memories
small enough to sit one beside each logic element, you can avoid that
issue; but who, save those whose governments oblige them to solve
medium-sized arrays of Boolean equations by successive enumeration of
possibilities, has that class of problem?

Tom
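Tom's bandwidth argument can be sanity-checked with a few lines of
arithmetic. The figures below are the post's own (roughly 2^7 I/O pins
at roughly 1 Gbps each, versus an FFT pipeline consuming 16 complex*16
values per nanosecond); the ~16x shortfall falls straight out of them.

```python
# Back-of-envelope check of the pin-bandwidth argument above.
# Figures from the post: ~2^7 I/O pins at ~1 Gbit/s each, versus an
# FFT pipeline consuming 16 complex*16 values every nanosecond.

PIN_COUNT = 2**7                  # ~128 I/O pins
PIN_RATE = 1e9                    # ~1 Gbit/s per pin
chip_io = PIN_COUNT * PIN_RATE    # total off-chip bandwidth, bit/s

values_per_ns = 16                # complex*16 inputs per nanosecond
bits_per_value = 16 * 8           # a complex*16 is 16 bytes
pipeline_demand = values_per_ns * bits_per_value * 1e9   # bit/s

print(f"chip I/O:        {chip_io / 1e12:.3f} Tbit/s")         # 0.128 Tbit/s
print(f"pipeline demand: {pipeline_demand / 1e12:.3f} Tbit/s") # 2.048 Tbit/s
print(f"shortfall:       {pipeline_demand / chip_io:.0f}x")    # 16x
```

The pipeline's 2.048 Tbit/s appetite is the "2 Tbit/s" in the post; the
pins deliver about a sixteenth of it.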
From: BDH on 26 Oct 2006 21:59

> >You need some memory while gates wait for their input or a clock cycle,
> >but why would you send your data to some sort of separate RAM and back?
>
> Because all interesting problems are bigger than all practical arrays
> of combinatorial logic -- if your operation is at all interesting, you
> probably can't fit 1000^2 of them onto a chip even with fancy
> lithography.

Very good! A logarithmic-time sort for a million ints would fill up a
modern CPU, and would only run a million times faster. Thus it's only
worthwhile for specialized applications now, but will eventually become
common.

> Easy to build a chip with the pipelined FADD and FMUL units explicitly
> laid out so that it can do an FFT on a new set of 16 complex*16
> variables every nanosecond, or the pipelined compare-and-swap units
> laid out in a sorting network to sorting 32 int*8s each nanosecond;
> probably not impossible to build a chip with generalish-purpose units
> and wide routers so that it could quickly be configured to do either;
> absolutely bloody impossible to get the data to it at the 2Tbit/s
> needed to keep up with the pipeline.

Fourier transforms? Probably end up being done optically.
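The hardware sorter under discussion is a sorting network such as
Batcher's bitonic sorter: a fixed, data-independent pattern of
compare-and-swap units whose layers can each be one pipeline stage.
A hypothetical software model for illustration (real hardware would be
fixed wiring; note the classic construction is O(log^2 n) layers deep,
not O(log n)):

```python
# Batcher's bitonic sorting network, modelled in software.  Each inner
# j-loop pass is one layer of comparators; all comparators within a
# layer are independent, so in silicon each layer is one pipeline
# stage and a new input vector can enter every cycle.

def bitonic_sort(a):
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "network size must be a power of two"
    a = list(a)
    k = 2
    while k <= n:                      # merge progressively longer runs
        j = k // 2
        while j > 0:                   # one comparator layer per value of j
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # direction alternates per k-sized block
                    if ((i & k) == 0 and a[i] > a[partner]) or \
                       ((i & k) != 0 and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([3, 1, 4, 1, 5, 9, 2, 6]))   # → [1, 1, 2, 3, 4, 5, 6, 9]
```

For n = 2^20 the network has 20*21/2 = 210 layers; at a 0.1 ns cycle
that is tens of nanoseconds of latency, with one sorted vector emerging
per cycle once the pipeline is full.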
From: Thomas Womack on 27 Oct 2006 12:18

In article <1161914364.030115.267180(a)i42g2000cwa.googlegroups.com>,
BDH <bhauth(a)gmail.com> wrote:

>> >You need some memory while gates wait for their input or a clock cycle,
>> >but why would you send your data to some sort of separate RAM and back?
>>
>> Because all interesting problems are bigger than all practical arrays
>> of combinatorial logic -- if your operation is at all interesting, you
>> probably can't fit 1000^2 of them onto a chip even with fancy
>> lithography.
>
>Very good! A logarithmic time sort for a million ints would fill up a
>modern CPU, and would only run a million times faster. Thus, it's only
>worthwhile for specialized applications now, but will eventually become
>common.

The whole point is that it wouldn't come close to running a million
times faster, because you can't possibly get the ints into the chip
quickly enough, and almost no application is specialised enough that
you can generate the ints on-chip at the rate the hardware sorting
network can sort them; I have trouble thinking of even ridiculous
applications where you want the whole sorted list rather than a few
order statistics.

Suppose you have a cycle time of 0.1 ns and feed the chip with a
thousand 10 Gbps input lines. It'll still take 3200 nanoseconds to get
the ints into the chip and 3200 nanoseconds to get the sorted outputs
out again, dwarfing the 2 ns to do the sort.

It would, I admit, be stunningly quick at computing, given N, the sum
of the (1024r+512)-th order statistics of the AES encryptions of the
numbers N ... N+1048575. If you find a use for this task, I will be
surprised.

FFTs done optically: don't make me laugh. Dynamic range is dreadful,
and you have to get the data into the light modulator and read it out
of the CCD at the end, which gets you straight into the bus bottleneck
again.
It might make sense if you're happy for your input to be of constant
amplitude and with a couple of bits of phase information, and if you've
got enough logic integrated on the detector to do (say) a thresholded
centroid per 16x16 block of pixels. But if you're in a context where
you're talking about number-theoretic transforms, the sort of bignum
maths where 23 bits of mantissa is nothing like enough, the idea of
inserting a physics stage is crazy.

Tom
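The 3200-nanosecond feed time quoted above falls directly out of the
stated numbers (a million 32-bit ints over a thousand 10 Gbps lines):

```python
# Reproduce the loading-time estimate from the post: a million 32-bit
# ints streamed through a thousand 10 Gbit/s input lines.

ints = 10**6
bits = ints * 32                         # 3.2e7 bits to move
lines = 1000
rate = 10e9                              # 10 Gbit/s per line

load_ns = bits / (lines * rate) * 1e9    # time to stream the data in
print(f"{load_ns:.0f} ns in, {load_ns:.0f} ns out")   # 3200 ns each way
```

So I/O is roughly three orders of magnitude more expensive than the
sort itself, even with an implausibly generous thousand-line interface.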
From: Eugene Miya on 27 Oct 2006 12:37

In article <1161776197.073940.210000(a)f16g2000cwb.googlegroups.com>,
BDH <bhauth(a)gmail.com> wrote:

>Arbitrary vector permutation by a permutation vector, vector reversal,
>insertion of a vector into a vector, and so forth in log(n) time and
>O(n) switches if you allow pipelining is pretty obvious.

Pipelining is your bottleneck. Don't do it.
"The only reason time exists is so that everything doesn't happen at once."

>Building on that, log(n) time sorting with O(n) elements with
>pipelining is pretty obvious.

Use better data structures. Don't sort.

>I just spent hours waiting on a suffix array because my
>computer is badly designed.

Why? The infrastructure it takes to make computers. You are working
with the wrong computers.

--
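"Don't sort" usually means select instead of sorting: if only a few
order statistics are wanted (as Tom observed upthread), a bounded heap
yields the k smallest of n items in O(n log k) time without ever
producing the full sorted list. A minimal stdlib sketch (the function
name is my own):

```python
# Selection instead of sorting: heapq.nsmallest streams through the
# input maintaining a k-element heap, O(n log k) instead of the
# O(n log n) of a full sort.

import heapq

def smallest_k(items, k):
    return heapq.nsmallest(k, items)

data = [29, 3, 44, 7, 18, 2, 51, 9]
print(smallest_k(data, 3))   # → [2, 3, 7]
```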
From: BDH on 27 Oct 2006 20:16
> >Very good! A logarithmic time sort for a million ints would fill up a
> >modern CPU, and would only run a million times faster.

On further thought, flash memory would be more representative of the
density a device of this form could achieve. Make that 4 million
floats.

> Suppose you have a cycle time of 0.1ns, and feed the chip with a
> thousand 10Gbps input lines. It'll still take 3200 nanoseconds to get
> the ints into the chip and 3200 nanoseconds to get the sorted outputs
> out again, dwarfing the 2ns to do the sort.

2 ns per sort? You overestimate what I'm pushing. More like 1
microsecond per sort, with a few pipelined at a time. Second, making
bigger chips is going to be easier than making smaller elements.

> FFTs done optically: don't make me laugh. Dynamic range is dreadful,

8 bits is enough. Turning an FFT into more FFTs of smaller numbers is
straightforward.

> and you have to get the data into the light modulator and read it out
> of the CCD at the end, which gets you straight into the bus bottleneck
> again.

Still much faster. Want to argue that busses are badly designed too?
I'll agree.

> But if you're in a
> context where you're talking about number-theoretic transforms, the
> sort of bignum maths where 23 bits of mantissa is nothing like enough,
> the idea of inserting a physics stage is crazy.

Not really.
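"Turning an FFT into more FFTs of smaller numbers" presumably relies on
linearity: split each 16-bit sample into its high and low bytes,
transform each 8-bit stream separately, and recombine as
FFT(256*hi + lo) = 256*FFT(hi) + FFT(lo). A sketch under that reading
(naive O(n^2) DFT purely for illustration; a real optical path has
noise sources this ignores):

```python
# Splitting a 16-bit transform into two 8-bit transforms by linearity:
# FFT(256*hi + lo) == 256*FFT(hi) + FFT(lo).
# Naive O(n^2) DFT here only to demonstrate the identity.

import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

samples = [513, 42, 7, 60000, 123, 999, 4, 31337]   # 16-bit values
hi = [s >> 8 for s in samples]                      # high bytes, 0..255
lo = [s & 0xFF for s in samples]                    # low bytes, 0..255

direct = dft(samples)
recombined = [256 * h + l for h, l in zip(dft(hi), dft(lo))]

assert all(abs(a - b) < 1e-6 for a, b in zip(direct, recombined))
```

The identity is exact mathematically; whether 8-bit optical channels
leave enough headroom after the analog noise is the part Tom disputes.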