Why so little parallelism? [Computer Architecture]

Prev: Historical question, what went wrong with bubble memory?
Next: Need SimpleScalar GCC Compiler

From: BDH on 25 Oct 2006 07:36

Arbitrary vector permutation by a permutation vector, vector reversal,
insertion of a vector into a vector, and so forth in log(n) time and
O(n) switches if you allow pipelining is pretty obvious. Building on
that, log(n) time sorting with O(n) elements with pipelining is pretty
obvious. I just spent hours waiting on a suffix array because my
computer is badly designed. Why?

From: John Dallman on 25 Oct 2006 14:11

In article <1161776197.073940.210000(a)f16g2000cwb.googlegroups.com>,
bhauth(a)gmail.com (BDH) wrote:

> Arbitrary vector permutation by a permutation vector, vector reversal,
> insertion of a vector into a vector, and so forth in log(n) time and
> O(n) switches if you allow pipelining is pretty obvious. Building on
> that, log(n) time sorting with O(n) elements with pipelining is pretty
> obvious. I just spent hours waiting on a suffix array because my
> computer is badly designed. Why?

Err, are you suggesting that processors should have instructions for
doing these operations? Because I'm afraid that viewpoint is now sadly
out of date. That kind of approach was tried, but it was found that the
instructions were often not usable, for a wide assortment of reasons.
One always feels that one's own problem is the simple and obvious case,
but this is not normally true. And trying to design processors with
sufficiently varied instruction sets to handle everyone's needs with
purpose-designed instructions doesn't seem to be very practical.

Looking up the history of "RISC" should explain it tolerably well. The
Wikipedia article at http://en.wikipedia.org/wiki/RISC seems to cover
the basics.

As of current technology, such algorithms as you describe will fit
easily into the on-chip instruction caches of processors. The time taken
for them to run is entirely dominated by the time taken to get stuff out
of memory and put it back there. Caches can help with that, but are not
perfect.

If you have a very large amount of processing to do, it is worth getting
a chip programmed at the hardware level to do it faster than software.
That's an FPGA: http://en.wikipedia.org/wiki/FPGA However, "very large"
usually translates to "will take months to run on conventional hardware".

---
John Dallman jgd(a)cix.co.uk
"Any sufficiently advanced technology is indistinguishable from a
well-rigged demo"

From: Thomas Womack on 25 Oct 2006 14:20

In article <1161776197.073940.210000(a)f16g2000cwb.googlegroups.com>,
BDH <bhauth(a)gmail.com> wrote:
>Arbitrary vector permutation by a permutation vector, vector reversal,
>insertion of a vector into a vector, and so forth in log(n) time and
>O(n) switches if you allow pipelining is pretty obvious. Building on
>that, log(n) time sorting with O(n) elements with pipelining is pretty
>obvious. I just spent hours waiting on a suffix array because my
>computer is badly designed. Why?

a) What do you mean by 'element' -- sorting usually requires
comparison of things more complicated than integers, and if you're
having to indirect to do your comparison you're limited by memory
speeds.

b) O(n) time to get the data into the elements plus O(n) time to sort
them plus O(n) time to get the data out again; very difficult to
get buses wide enough to get the data in in time O(1), as you'd need
if you want to pipeline the sorting-array.

c) O(n log n) elements only practical for either uninterestingly small
elements or uninterestingly small n.

On the other hand, there may be useful factor-several speedups to be
had with instructions like

PORDERUSW xmm0, xmm1

The 16-bit packets of xmm1 contain the numbers 0 through 7, in such a
way that xmm0[xmm1[i]], i=0 .. 7, are in non-decreasing order, and if
xmm0[j] = xmm0[k] with j<k, the packet of xmm0 with value j will
precede the packet with value k.

PSHUFB, PMAX and PMIN let you implement a sorting network reasonably
efficiently ... I'm sure I've read a surprisingly non-obvious article
about fast median filters on SIMD architectures, but the obvious
Google terms tend to pick up a patent on a way to do it with
bit-slices.

Tom

From: BDH on 25 Oct 2006 17:01

> As of current technology, such algorithms as you describe will fit
> easily into the on-chip instruction caches of processors. The time taken
> for them to run is entirely dominated by the time taken to get stuff out
> of memory and put it back there. Caches can help with that, but are not
> perfect.

That's most of the problem. Moving data around is so slow - you need to
be able to shuffle all of a vector at the same time. In fact, when you
can do that, why bother with a standard bus system for most things?

From: BDH on 25 Oct 2006 17:07

> a) What do you mean by 'element'

Logic gate or single pole double throw switch, of course.

> b) O(n) time to get the data into the elements plus O(n) time to sort
> them plus O(n) time to get the data out again; very difficult to
> get buses wide enough to get the data in in time O(1), as you'd need
> if you want to pipeline the sorting-array.

Like I said, element count includes the "bus" architecture, and no it's
not constant time but logarithmic time...

| Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11
Prev: Historical question, what went wrong with bubble memory?
Next: Need SimpleScalar GCC Compiler