From: "Andy "Krazy" Glew" on 16 Nov 2009 16:05 Andy "Krazy" Glew wrote: > Of course, AMD has undoubtedly changed and evolved MCMT in many ways > since I first proposed it to them. For example, I called the set of an > integer scheduler, integer execution units, and an L1 data cache a > "cluster", and the whole thing, consisting of shared front end, shared > FP, and 2 or more clusters, a processor core. Apparently AMD is calling > my clusters their cores, and my core their cluster. It has been > suggested that this change of terminology is motivated by marketing, so > that they can say they have twice as many cores. > > My original motivation for MCMT was to work around some of the > limitations of Hyperthreading on Willamette. Now that y'all can see MCMT from a source other than me, it may be easier to explain some of the reasons what I am interested by SIMT / Coherent Threading in GPUs like Nvidia (and ATI, and Intel). The SIMT GPUs take a single instruction, takes it through a shared front-end, and distributes it to different "threads" running in different lanes. Essentially, replicated execution units. Memory may or may not be replicated; the GPUs seem often to decouple memory from the execution lanes, as is needed for non-stride-1 accesses. MCMT, as in Bulldozer, shares the front end, replicates in each cluster the scheduler, execution units, and L1 data cache. MCMT is typically superscalar, but if we examined the limit case of a non-superscalar MCMT, it would be taking one instruction, and distributing it only one of the clusters. Roughly speaking, MCMT clusters correspond to SIMD/SIMT vector lanes. But while SIMT can send the same instruction to multiple lane/clusters, MCMT does not. So, the logical question is, why not? Why not send the same instruction(s) to 2 (or more) clusters in an MCMT machine? If you can recognize that the clusters are executing the same code? To do this on an out-of-order processor you would probably need to replicate the renamer. (Or split it into two stages, one shared, a hopefully cheaper stage replicated.) But apart from this, it would work. The scheduler within the MCMT clusters would decouple, and allow the clusters to operate independently, and perhaps diverge. This might allow MCMT clustering to scale beyond 2-3. Downside is, Bulldozer shares the FP. But FP workloads are the workloads that benefit most from SIMD/SIMT. |