From: "Andy "Krazy" Glew" on 16 Nov 2009 16:05 Andy "Krazy" Glew wrote: > Of course, AMD has undoubtedly changed and evolved MCMT in many ways > since I first proposed it to them. For example, I called the set of an > integer scheduler, integer execution units, and an L1 data cache a > "cluster", and the whole thing, consisting of shared front end, shared > FP, and 2 or more clusters, a processor core. Apparently AMD is calling > my clusters their cores, and my core their cluster. It has been > suggested that this change of terminology is motivated by marketing, so > that they can say they have twice as many cores. > > My original motivation for MCMT was to work around some of the > limitations of Hyperthreading on Willamette. Now that y'all can see MCMT from a source other than me, it may be easier to explain some of the reasons what I am interested by SIMT / Coherent Threading in GPUs like Nvidia (and ATI, and Intel). The SIMT GPUs take a single instruction, takes it through a shared front-end, and distributes it to different "threads" running in different lanes. Essentially, replicated execution units. Memory may or may not be replicated; the GPUs seem often to decouple memory from the execution lanes, as is needed for non-stride-1 accesses. MCMT, as in Bulldozer, shares the front end, replicates in each cluster the scheduler, execution units, and L1 data cache. MCMT is typically superscalar, but if we examined the limit case of a non-superscalar MCMT, it would be taking one instruction, and distributing it only one of the clusters. Roughly speaking, MCMT clusters correspond to SIMD/SIMT vector lanes. But while SIMT can send the same instruction to multiple lane/clusters, MCMT does not. So, the logical question is, why not? Why not send the same instruction(s) to 2 (or more) clusters in an MCMT machine? If you can recognize that the clusters are executing the same code? To do this on an out-of-order processor you would probably need to replicate the renamer. (Or split it into two stages, one shared, a hopefully cheaper stage replicated.) But apart from this, it would work. The scheduler within the MCMT clusters would decouple, and allow the clusters to operate independently, and perhaps diverge. This might allow MCMT clustering to scale beyond 2-3. Downside is, Bulldozer shares the FP. But FP workloads are the workloads that benefit most from SIMD/SIMT. |