From: Thomas Womack on 26 Oct 2006 17:37

In article <1161854959.938202.172510(a)k70g2000cwa.googlegroups.com>,
BDH <bhauth(a)gmail.com> wrote:

>> Back in the late 1960s and all the 1970s, a few of us were saying that
>> we should NOT be measuring KFlops but KMaccs (memory accesses); even
>> now, that mindset is not universal. And 90% of the double-quoted
>> nonsense uses completely unrealistic and often inconsistent models.
>> If you do an analysis using a simple, but more realistic model[*], you
>> get VERY different complexities for most such operations. But such
>> approaches are VERY hard to get published, as the establishment is not
>> keen to have people point out that it has been bullshitting for many
>> decades ....
>
>You need some memory while gates wait for their input or a clock cycle,
>but why would you send your data to some sort of separate RAM and back?

Because all interesting problems are bigger than all practical arrays
of combinatorial logic -- if your operation is at all interesting, you
probably can't fit 1000^2 copies of it onto a chip even with fancy
lithography. So you have to get the data onto and off the chip, and
chips don't seem to find it easy to have more than about 2^7 I/O pins
clocked at more than about 1 Gbps each.

It's easy to build a chip with pipelined FADD and FMUL units explicitly
laid out so that it can do an FFT on a new set of 16 complex*16
variables every nanosecond, or with pipelined compare-and-swap units
laid out in a sorting network so that it sorts 32 int*8s each
nanosecond; probably not impossible to build a chip with
generalish-purpose units and wide routers so that it could quickly be
configured to do either; absolutely bloody impossible to get the data
to it at the 2 Tbit/s needed to keep up with the pipeline.
If your problem is small enough that the whole thing fits into memories
small enough to sit one beside each logic element, you can avoid that
issue; but who, save those whose governments oblige them to solve
medium-sized arrays of Boolean equations by successive enumeration of
possibilities, has that class of problem?

Tom
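Tom's bandwidth argument can be sanity-checked with a few lines of
arithmetic. The figures below are the post's own (roughly 2^7 I/O pins
at roughly 1 Gbps each, versus an FFT pipeline consuming 16 complex*16
values per nanosecond); the ~16x shortfall falls straight out of them.

```python
# Back-of-envelope check of the pin-bandwidth argument above.
# Figures from the post: ~2^7 I/O pins at ~1 Gbit/s each, versus an
# FFT pipeline consuming 16 complex*16 values every nanosecond.

PIN_COUNT = 2**7                  # ~128 I/O pins
PIN_RATE = 1e9                    # ~1 Gbit/s per pin
chip_io = PIN_COUNT * PIN_RATE    # total off-chip bandwidth, bit/s

values_per_ns = 16                # complex*16 inputs per nanosecond
bits_per_value = 16 * 8           # a complex*16 is 16 bytes
pipeline_demand = values_per_ns * bits_per_value * 1e9   # bit/s

print(f"chip I/O:        {chip_io / 1e12:.3f} Tbit/s")         # 0.128 Tbit/s
print(f"pipeline demand: {pipeline_demand / 1e12:.3f} Tbit/s") # 2.048 Tbit/s
print(f"shortfall:       {pipeline_demand / chip_io:.0f}x")    # 16x
```

The pipeline's 2.048 Tbit/s appetite is the "2 Tbit/s" in the post; the
pins deliver about a sixteenth of it.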
From: BDH on 26 Oct 2006 21:59

> >You need some memory while gates wait for their input or a clock cycle,
> >but why would you send your data to some sort of separate RAM and back?
>
> Because all interesting problems are bigger than all practical arrays
> of combinatorial logic -- if your operation is at all interesting, you
> probably can't fit 1000^2 of them onto a chip even with fancy
> lithography.

Very good! A logarithmic-time sort for a million ints would fill up a
modern CPU, and would only run a million times faster. Thus it's only
worthwhile for specialized applications now, but will eventually become
common.

> Easy to build a chip with the pipelined FADD and FMUL units explicitly
> laid out so that it can do an FFT on a new set of 16 complex*16
> variables every nanosecond, or the pipelined compare-and-swap units
> laid out in a sorting network to sorting 32 int*8s each nanosecond;
> probably not impossible to build a chip with generalish-purpose units
> and wide routers so that it could quickly be configured to do either;
> absolutely bloody impossible to get the data to it at the 2Tbit/s
> needed to keep up with the pipeline.

Fourier transforms? Probably end up being done optically.
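The hardware sorter under discussion is a sorting network such as
Batcher's bitonic sorter: a fixed, data-independent pattern of
compare-and-swap units whose layers can each be one pipeline stage.
A hypothetical software model for illustration (real hardware would be
fixed wiring; note the classic construction is O(log^2 n) layers deep,
not O(log n)):

```python
# Batcher's bitonic sorting network, modelled in software.  Each inner
# j-loop pass is one layer of comparators; all comparators within a
# layer are independent, so in silicon each layer is one pipeline
# stage and a new input vector can enter every cycle.

def bitonic_sort(a):
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "network size must be a power of two"
    a = list(a)
    k = 2
    while k <= n:                      # merge progressively longer runs
        j = k // 2
        while j > 0:                   # one comparator layer per value of j
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # direction alternates per k-sized block
                    if ((i & k) == 0 and a[i] > a[partner]) or \
                       ((i & k) != 0 and a[i] < a[partner]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([3, 1, 4, 1, 5, 9, 2, 6]))   # → [1, 1, 2, 3, 4, 5, 6, 9]
```

For n = 2^20 the network has 20*21/2 = 210 layers; at a 0.1 ns cycle
that is tens of nanoseconds of latency, with one sorted vector emerging
per cycle once the pipeline is full.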
From: Thomas Womack on 27 Oct 2006 12:18

In article <1161914364.030115.267180(a)i42g2000cwa.googlegroups.com>,
BDH <bhauth(a)gmail.com> wrote:

>> >You need some memory while gates wait for their input or a clock cycle,
>> >but why would you send your data to some sort of separate RAM and back?
>>
>> Because all interesting problems are bigger than all practical arrays
>> of combinatorial logic -- if your operation is at all interesting, you
>> probably can't fit 1000^2 of them onto a chip even with fancy
>> lithography.
>
>Very good! A logarithmic time sort for a million ints would fill up a
>modern CPU, and would only run a million times faster. Thus, it's only
>worthwhile for specialized applications now, but will eventually become
>common.

The whole point is that it wouldn't come close to running a million
times faster, because you can't possibly get the ints into the chip
quickly enough, and almost no application is specialised enough that
you can generate the ints on-chip at the rate the hardware sorting
network can sort them; I have trouble thinking of even ridiculous
applications where you want the whole sorted list rather than a few
order statistics.

Suppose you have a cycle time of 0.1 ns and feed the chip with a
thousand 10 Gbps input lines. It'll still take 3200 nanoseconds to get
the ints into the chip and 3200 nanoseconds to get the sorted outputs
out again, dwarfing the 2 ns to do the sort.

It would, I admit, be stunningly quick at computing, given N, the sum
of the (1024r+512)-th order statistics of the AES encryptions of the
numbers N ... N+1048575. If you find a use for this task, I will be
surprised.

FFTs done optically: don't make me laugh. Dynamic range is dreadful,
and you have to get the data into the light modulator and read it out
of the CCD at the end, which gets you straight into the bus bottleneck
again.
It might make sense if you're happy for your input to be of constant
amplitude and with a couple of bits of phase information, and if you've
got enough logic integrated on the detector to do (say) a thresholded
centroid per 16x16 block of pixels. But if you're in a context where
you're talking about number-theoretic transforms, the sort of bignum
maths where 23 bits of mantissa is nothing like enough, the idea of
inserting a physics stage is crazy.

Tom
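The 3200-nanosecond feed time quoted above falls directly out of the
stated numbers (a million 32-bit ints over a thousand 10 Gbps lines):

```python
# Reproduce the loading-time estimate from the post: a million 32-bit
# ints streamed through a thousand 10 Gbit/s input lines.

ints = 10**6
bits = ints * 32                         # 3.2e7 bits to move
lines = 1000
rate = 10e9                              # 10 Gbit/s per line

load_ns = bits / (lines * rate) * 1e9    # time to stream the data in
print(f"{load_ns:.0f} ns in, {load_ns:.0f} ns out")   # 3200 ns each way
```

So I/O is roughly three orders of magnitude more expensive than the
sort itself, even with an implausibly generous thousand-line interface.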
From: Eugene Miya on 27 Oct 2006 12:37

In article <1161776197.073940.210000(a)f16g2000cwb.googlegroups.com>,
BDH <bhauth(a)gmail.com> wrote:

>Arbitrary vector permutation by a permutation vector, vector reversal,
>insertion of a vector into a vector, and so forth in log(n) time and
>O(n) switches if you allow pipelining is pretty obvious.

Pipelining is your bottleneck. Don't do it.
"The only reason time exists is so that everything doesn't happen at once."

>Building on that, log(n) time sorting with O(n) elements with
>pipelining is pretty obvious.

Use better data structures. Don't sort.

>I just spent hours waiting on a suffix array because my
>computer is badly designed.

Why? The infrastructure it takes to make computers. You are working
with the wrong computers.

--
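"Don't sort" usually means select instead of sorting: if only a few
order statistics are wanted (as Tom observed upthread), a bounded heap
yields the k smallest of n items in O(n log k) time without ever
producing the full sorted list. A minimal stdlib sketch (the function
name is my own):

```python
# Selection instead of sorting: heapq.nsmallest streams through the
# input maintaining a k-element heap, O(n log k) instead of the
# O(n log n) of a full sort.

import heapq

def smallest_k(items, k):
    return heapq.nsmallest(k, items)

data = [29, 3, 44, 7, 18, 2, 51, 9]
print(smallest_k(data, 3))   # → [2, 3, 7]
```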
From: BDH on 27 Oct 2006 20:16
> >Very good! A logarithmic time sort for a million ints would fill up a
> >modern CPU, and would only run a million times faster.

On further thought, flash memory would be more representative of the
density a device of this form could achieve. Make that 4 million
floats.

> Suppose you have a cycle time of 0.1ns, and feed the chip with a
> thousand 10Gbps input lines. It'll still take 3200 nanoseconds to get
> the ints into the chip and 3200 nanoseconds to get the sorted outputs
> out again, dwarfing the 2ns to do the sort.

2 ns per sort? You overestimate what I'm pushing. More like 1
microsecond per sort, with a few pipelined at a time. Second, making
bigger chips is going to be easier than making smaller elements.

> FFTs done optically: don't make me laugh. Dynamic range is dreadful,

8 bits is enough. Turning an FFT into more FFTs of smaller numbers is
straightforward.

> and you have to get the data into the light modulator and read it out
> of the CCD at the end, which gets you straight into the bus bottleneck
> again.

Still much faster. Want to argue that busses are badly designed too?
I'll agree.

> But if you're in a
> context where you're talking about number-theoretic transforms, the
> sort of bignum maths where 23 bits of mantissa is nothing like enough,
> the idea of inserting a physics stage is crazy.

Not really.
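"Turning an FFT into more FFTs of smaller numbers" presumably relies on
linearity: split each 16-bit sample into its high and low bytes,
transform each 8-bit stream separately, and recombine as
FFT(256*hi + lo) = 256*FFT(hi) + FFT(lo). A sketch under that reading
(naive O(n^2) DFT purely for illustration; a real optical path has
noise sources this ignores):

```python
# Splitting a 16-bit transform into two 8-bit transforms by linearity:
# FFT(256*hi + lo) == 256*FFT(hi) + FFT(lo).
# Naive O(n^2) DFT here only to demonstrate the identity.

import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

samples = [513, 42, 7, 60000, 123, 999, 4, 31337]   # 16-bit values
hi = [s >> 8 for s in samples]                      # high bytes, 0..255
lo = [s & 0xFF for s in samples]                    # low bytes, 0..255

direct = dft(samples)
recombined = [256 * h + l for h, l in zip(dft(hi), dft(lo))]

assert all(abs(a - b) < 1e-6 for a, b in zip(direct, recombined))
```

The identity is exact mathematically; whether 8-bit optical channels
leave enough headroom after the analog noise is the part Tom disputes.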