From: BDH on 25 Oct 2006 07:36 Arbitrary vector permutation by a permutation vector, vector reversal, insertion of a vector into a vector, and so forth in log(n) time and O(n) switches if you allow pipelining is pretty obvious. Building on that, log(n) time sorting with O(n) elements with pipelining is pretty obvious. I just spent hours waiting on a suffix array because my computer is badly designed. Why?
From: John Dallman on 25 Oct 2006 14:11 In article <1161776197.073940.210000(a)f16g2000cwb.googlegroups.com>, bhauth(a)gmail.com (BDH) wrote: > Arbitrary vector permutation by a permutation vector, vector reversal, > insertion of a vector into a vector, and so forth in log(n) time and > O(n) switches if you allow pipelining is pretty obvious. Building on > that, log(n) time sorting with O(n) elements with pipelining is pretty > obvious. I just spent hours waiting on a suffix array because my > computer is badly designed. Why? Err, are you suggesting that processors should have instructions for doing these operations? Because I'm afraid that viewpoint is now sadly out of date. That kind of approach was tried, but it was found that the instructions were often not usable, for a wide assortment of reasons. One always feels that one's own problem is the simple and obvious case, but this is not normally true. And trying to design processors with sufficiently varied instruction sets to handle everyone's needs with purpose-designed instructions doesn't seem to be very practical. Looking up the history of "RISC" should explain it tolerably well. The Wikipedia article at http://en.wikipedia.org/wiki/RISC seems to cover the basics. As of current technology, such algorithms as you describe will fit easily into the on-chip instruction caches of processors. The time taken for them to run is entirely dominated by the time taken to get stuff out of memory and put it back there. Caches can help with that, but are not perfect. If you have a very large amount of processing to do, it is worth getting a chip programmed at the hardware level to do it faster than software. That's an FPGA: http://en.wikipedia.org/wiki/FPGA However, "very large" usually translates to "will take months to run on conventional hardware". --- John Dallman jgd(a)cix.co.uk "Any sufficiently advanced technology is indistinguishable from a well-rigged demo"
From: Thomas Womack on 25 Oct 2006 14:20 In article <1161776197.073940.210000(a)f16g2000cwb.googlegroups.com>, BDH <bhauth(a)gmail.com> wrote: >Arbitrary vector permutation by a permutation vector, vector reversal, >insertion of a vector into a vector, and so forth in log(n) time and >O(n) switches if you allow pipelining is pretty obvious. Building on >that, log(n) time sorting with O(n) elements with pipelining is pretty >obvious. I just spent hours waiting on a suffix array because my >computer is badly designed. Why? a) What do you mean by 'element' -- sorting usually requires comparison of things more complicated than integers, and if you're having to indirect to do your comparison you're limited by memory speeds. b) O(n) time to get the data into the elements plus O(n) time to sort them plus O(n) time to get the data out again; very difficult to get buses wide enough to get the data in in time O(1), as you'd need if you want to pipeline the sorting-array. c) O(n log n) elements only practical for either uninterestingly small elements or uninterestingly small n. On the other hand, there may be useful factor-several speedups to be had with instructions like PORDERUSW xmm0, xmm1 The 16-bit packets of xmm1 contain the numbers 0 through 7, in such a way that xmm0[xmm1[i]], i=0 .. 7, are in non-decreasing order, and if xmm0[j] = xmm0[k] with j<k, the packet of xmm0 with value j will precede the packet with value k. PSHUFB, PMAX and PMIN let you implement a sorting network reasonably efficiently ... I'm sure I've read a surprisingly non-obvious article about fast median filters on SIMD architectures, but the obvious Google terms tend to pick up a patent on a way to do it with bit-slices. Tom
From: BDH on 25 Oct 2006 17:01 > As of current technology, such algorithms as you describe will fit > easily into the on-chip instruction caches of processors. The time taken > for them to run is entirely dominated by the time taken to get stuff out > of memory and put it back there. Caches can help with that, but are not > perfect. That's most of the problem. Moving data around is so slow - you need to be able to shuffle all of a vector at the same time. In fact, when you can do that, why bother with a standard bus system for most things?
From: BDH on 25 Oct 2006 17:07
> a) What do you mean by 'element' Logic gate or single pole double throw switch, of course. > b) O(n) time to get the data into the elements plus O(n) time to sort > them plus O(n) time to get the data out again; very difficult to > get buses wide enough to get the data in in time O(1), as you'd need > if you want to pipeline the sorting-array. Like I said, element count includes the "bus" architecture, and no it's not constant time but logarithmic time... |