From: MitchAlsup on 24 Jul 2010 19:52

On Jul 24, 3:24 pm, n...(a)cam.ac.uk wrote:
> In article <88d23585-d47c-47af-91a1-7bae764af...(a)q22g2000yqm.googlegroups.com>,
> Robert Myers <rbmyers...(a)gmail.com> wrote:
> >Today's computers are *not* designed around computation, but around
> >coherent cache. Now that the memory controller is on the die, the
> >takeover is complete. Nothing moves efficiently without notice and
> >often unnecessary involvement of the real Von Neumann bottleneck,
> >which is the cache.
>
> Yes and no. Their interfaces are still designed around computation,
> and the coherent cache is designed to give the impression that
> programmers need not concern themselves with programming memory
> access - it's all transparent.

If cache were transparent, then instructions such as PREFETCH<...>
would not exist! Streaming stores would not exist!
Compilers would not go to extraordinary pains to use these, or to find
out the configuration parameters of the cache hierarchy.
Memory Controllers would not be optimized around cache lines.
Memory Ordering would not be subject to Cache Coherence rules.

But <as usual> I digress...

I think what Robert is getting at is that lumping everything under a
coherent cache is running into a von Neumann wall.

Mitch
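As a concrete illustration of the streaming stores Mitch mentions, here
is a minimal sketch using the x86 SSE2 non-temporal store intrinsics;
the function name, element type, and alignment assumptions are purely
illustrative, not anything from the thread.

    #include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128 */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill a buffer with non-temporal ("streaming") stores, which bypass
     * the cache so a large write-only stream does not evict useful data.
     * Assumes dst is 16-byte aligned and n_ints is a multiple of 4. */
    void fill_nt(int32_t *dst, size_t n_ints, int32_t value) {
        __m128i v = _mm_set1_epi32(value);
        for (size_t i = 0; i < n_ints; i += 4)
            _mm_stream_si128((__m128i *)(dst + i), v); /* bypasses cache */
        _mm_sfence();  /* drain write-combining buffers, order the stores */
    }

The existence of such instructions is exactly Mitch's point: a truly
transparent cache would not need them.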
From: Brett Davis on 25 Jul 2010 00:31

In article <aba3a53e-09d5-4528-86a2-c2374ed4c4f1(a)q35g2000yqn.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:
> If cache were transparent, then instructions such as PREFETCH<...>
> would not exist! Streaming stores would not exist!
> Compilers would not go to extraordinary pains to use these

I have never in my life seen a compiler issue a PREFETCH instruction.

I have several times mocked the usefulness of PREFETCH as implemented
for CPUs in the embedded market. (Locking up one of the two read ports
makes good performance impossible without resorting to assembly.)

I would think that the fetch-ahead engine on high-end x86 and POWER
would make PREFETCH just about as useless, except to prime the pump at
the start of a new data set being streamed in.

How is PREFETCH used by which compilers today?
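Partially answering Brett's question: at least GCC and Clang expose
software prefetch as the __builtin_prefetch intrinsic, and GCC can also
insert prefetches into array loops itself under -fprefetch-loop-arrays.
A minimal hand-written sketch; the prefetch distance is an illustrative
guess and would need per-machine tuning.

    #include <stddef.h>

    /* __builtin_prefetch(addr, rw, locality): rw = 0 for read,
     * locality 0..3 (0 = streamed once, no temporal locality). */
    #define PFDIST 64  /* 64 doubles = 512 bytes, ~8 cache lines ahead */

    double sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PFDIST < n)
                __builtin_prefetch(&a[i + PFDIST], 0, 0);
            s += a[i];
        }
        return s;
    }

On hardware with a competent stride prefetcher a loop like this gains
little, which is Brett's point; the intrinsic earns its keep mainly on
irregular, pointer-chasing or gather-like access patterns.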
From: Andy Glew "newsgroup at on 25 Jul 2010 01:06

> In article <sdtk4654pheq6292135jd42oagr5ov7cg4(a)4ax.com>,
> George Neuner <gneuner2(a)comcast.net> wrote:
>> The problem most often cited for vector units is that they need to
>> support non-consecutive and non-uniform striding to be useful. I
>> agree that there *does* need to be support for those features, but I
>> believe it should be in the memory subsystem rather than in the
>> processor.

This is why, in a recent post, I have proposed creating a memory
subsystem and interconnect that is designed around scatter/gather.

I think this can be done fairly straightforwardly for the interconnect.

It is harder to do for the DRAMs themselves. Modern DRAMs are oriented
towards providing bursts of 4 to 8 cycles' worth of data. If you have
64- to 128-bit-wide interfaces, that means that you are always going to
be reading or writing 32 to 128 consecutive, contiguous bytes at a go.
Whether you need them or not.

--

I'm still happy to have processor support to give me the
scatter/gathers: either in the form of vector instructions, or
GPU-style SIMD/SIMT/CIMT, or in the form of out-of-order execution.
Lacking these, with in-order scalar processing you have to do a lot of
work to get to where you can start scatter/gather - circa 8 times more
processors being needed.

But once you have scatter/gather requests, howsoever generated, the
real action is in the interconnect. Or it could be.

--

Having said that all modern DRAMs are oriented towards 4- to 8-cycle
bursts... maybe we can build scatter/gather-friendly memory subsystems.
Instead of building 64-, 128-, or 256-bit-wide DRAM channels, maybe we
should be building 8- or 16-bit-wide DRAM channels. That can give us 64
or 128 bits in any transfer, over 4 to 8 clocks. It will add to
latency, but maybe the improved scatter/gather performance will be
worth it.

Such a narrow-DRAM-channel system would probably consume at least 50%
more power than a wide DRAM system. Would that be outweighed by wasting
less bandwidth on unnecessary parts of cache lines? It would also cost
a lot more, not using commodity DIMMs.
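Putting numbers on the bandwidth-waste argument: if a gather wants one
8-byte element per access but each DRAM access moves a whole burst, the
useful fraction is elem_bytes / (channel_bytes * burst_beats). A
throwaway sketch of that arithmetic, using only the ranges Glew states:

    #include <stdio.h>

    /* Useful fraction of a DRAM burst when a gather needs one 8-byte
     * element per access. A burst moves channel_bytes * beats bytes. */
    int main(void) {
        const int widths[] = {8, 16}; /* 64/128-bit channels, in bytes */
        const int bursts[] = {4, 8};  /* burst length in beats */
        const int elem = 8;           /* one 64-bit element per access */

        for (int w = 0; w < 2; w++)
            for (int b = 0; b < 2; b++) {
                int burst_bytes = widths[w] * bursts[b];
                printf("%3d-bit channel, BL%d: %3d bytes/burst, "
                       "%5.2f%% useful\n",
                       widths[w] * 8, bursts[b], burst_bytes,
                       100.0 * elem / burst_bytes);
            }
        return 0;
    }

By the same arithmetic, an 8-bit channel with an 8-beat burst moves
exactly the 8 bytes asked for - which is the appeal of the narrow
channels proposed above.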
From: Andy Glew "newsgroup at on 25 Jul 2010 01:34

On 7/23/2010 11:48 PM, Terje Mathisen wrote:
> Andy Glew wrote:
>> But also because, as I discussed in my Berkeley Parlab presentation
>> of Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
>> somewhat the deficiencies of coherent threading, specifically the
>> problem of divergence.

You want me to repeat what I already said in those slides? Sure, if
it's necessary. But the pictures in the slides say it better than I can
on comp.arch.

BTW, here is a link to the slides, on Google Docs using their
incomprehensible URLs.

---

I'll give a brief overview:

First off, a basic starter: coherent threading is better than
predication or masking for complicated IF structures. Even in a warp of
64 threads, coherent threading only ever executes paths that one of the
threads is taking, whereas predication executes all paths.

The big problem with GPU-style coherent threading, aka SIMD, aka SIMT,
maybe CIMT, is divergence. But when you get into the coherent threading
mindset, I can think of dozens of ways to ameliorate divergence.

So, here's a first "optimization": create a loop buffer at each lane,
or group of lanes. So it is not really SIMT, single instruction
multiple threads, any more - it is really single instruction to the
loop buffer, independent thereafter. Much divergence can thereby be
tolerated - as in, instruction issue slots that would be wasted because
of SIMD can be used. Trouble is, loop buffers per lane lose much of the
cost reduction of CIMT.

I know: what's a degenerate loop buffer? A vector instruction. By slide
14 I am showing how, if the individual instructions of CIMT are time
vectors, you can distribute instructions from divergent paths while
others are executing. I.e. you might lose an instruction issue cycle,
but if instruction execution takes 4 or more cycles, 50% divergence
need only lose 10% or less instruction execution bandwidth. This is a
use of vector instructions.

Slide 13 depicts an optimization independent of vector instructions.
Most GPUs distribute a warp or wavefront over multiple cycles,
typically 4. If you can rejigger threads within a wavefront, they can
be moved between cycles so that converged threads execute together.
This is not a use of vector instructions. But vector instructions make
things easier to rejigger - since threads themselves are already spread
over several cycles.

Related, not in the slide deck: spatial rejiggering - making a "lane"
of instructions map to one or multiple lanes of ALUs. E.g. if all
threads in a warp are converged, then assign each thread to one and
only one ALU lane. But if you have divergence, spread instruction
execution for a thread across several ALU lanes. E.g. the classic 50%
divergence would get swallowed up. As usual, vector instructions would
make it easy, although it probably could be done without.

I've already mentioned how you can think of adding coherent instruction
dispatch to multicluster multithreading.

Slides 34 and 35 talk about how time-pipelined vectors with VL vector
length control eliminate the vector-length wastage of parallel vector
architectures, such as Larrabee was reported to be in the SIGGRAPH
paper.

Slide 71 talks about how time-pipelined vectors operating planar, SOA
style, save power by reducing toggling.
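A toy model of the "basic starter" above - predication pays issue slots
for every arm of an IF, while coherent threading pays only for arms
that at least one thread in the warp actually takes. The warp size and
arm lengths here are arbitrary:

    #include <stdio.h>

    /* Issue slots for a warp executing an if/else whose arms are len
     * instructions each. Predication issues both arms regardless of
     * the masks; coherent threading skips an arm no thread takes. */
    int main(void) {
        const int len = 10;     /* instructions per arm (arbitrary) */
        const int any_then = 1; /* some thread takes the then-arm */
        const int any_else = 0; /* no thread takes the else-arm */

        int predicated = 2 * len;
        int coherent = (any_then ? len : 0) + (any_else ? len : 0);

        printf("predication: %d slots, coherent threading: %d slots\n",
               predicated, coherent);
        return 0;
    }

With divergence (both flags set) the two schemes cost the same; the
amelioration techniques in the slides are about clawing back those
doubled slots.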
From: nmm1 on 25 Jul 2010 04:16
In article <aba3a53e-09d5-4528-86a2-c2374ed4c4f1(a)q35g2000yqn.googlegroups.com>,
MitchAlsup <MitchAlsup(a)aol.com> wrote:
>> >Today's computers are *not* designed around computation, but around
>> >coherent cache. Now that the memory controller is on the die, the
>> >takeover is complete. Nothing moves efficiently without notice and
>> >often unnecessary involvement of the real Von Neumann bottleneck,
>> >which is the cache.
>>
>> Yes and no. Their interfaces are still designed around computation,
>> and the coherent cache is designed to give the impression that
>> programmers need not concern themselves with programming memory
>> access - it's all transparent.
>
>If cache were transparent, then instructions such as PREFETCH<...>
>would not exist! Streaming stores would not exist!

Few things in real life are absolute. Those are late and rather
unsatisfactory extras that have never been a great success.

>Compilers would not go to extraordinary pains to use these, or to find
>out the configuration parameters of the cache hierarchy.

Most don't. The few that do don't succeed very well.

>Memory Controllers would not be optimized around cache lines.
>Memory Ordering would not be subject to Cache Coherence rules.

No, those support my point. Their APIs to the programmer are intended
to be as transparent as possible - the visibility exists because full
transparency is not feasible.

>I think what Robert is getting at is that lumping everything under a
>coherent cache is running into a von Neumann wall.

Precisely. That's been clear for a long time. My point is that
desperate situations need desperate remedies, and it's about time that
we accepted that the wall is at the end of a long cul-de-sac. There
isn't any way round or through, so the only technical solution is to
back off a long way and try a different route.

No, I don't expect to live to see it happen.

Regards,
Nick Maclaren.