From: MitchAlsup on 22 Apr 2010 23:20

On Apr 22, 3:15 pm, Stephen Fuld <SF...(a)alumni.cmu.edu.invalid> wrote:
> I don't see how you get a multi-core design with the same transistor
> count as a multi-threaded one. I have seen numbers of 5% additional
> logic for a second thread. Mostly you duplicate the registers and add a
> little logic. But with two cores, clearly you get 100% overhead,
> duplicating the registers, the execution units, the L1 caches and all
> the other logic.

On (say) Pentium 4, once the pipeline was sufficiently "screwed up" that adding threading was easy (5%), the design team was in the position of having to do it. We looked at this for K9: it would have added several pipe stages, a bunch of instruction buffering, and some minor register state. It ended up closer to 9% than 5%. It would also have delivered similar throughput (+15%-ish), but it would have come with a monothreaded cost of some 7% off the top. So after you lost 7%, you could add 15% back in and look like a genius ((ahem, and with a deep-sounding voice drawing out the enunciation): RIGHT). I wonder if some Intel engineer/designer knows what was lost in P4 such that threading became easy. My guess is that we will not know for a very long time (two decades).

On the other hand, one could build a 1-wide in-order core that gives something like 37% of the K9 performance for half of the die additions needed to add threading. That is: an attached 1W IO core added to a GreatBig OoO core would add more performance and less die area than adding threading to the GB OoO core. {Caveat: this applies to a pipeline inherently slimmed for the highest possible frequency (say 5 GHz when Opterons were at 3 GHz).}

On the third hand, more medium-sized cores lose little in commercial applications, simply because they wait just as well as the GB OoO cores wait (and just as long). And by not being so big, they consume less power, area, design time, debugging time, ... It's a multidimensional optimization puzzle. Once you drop the need for maximal monothreaded performance, the GB OoO design point is no longer optimal by any metric you want to apply to the commercial space (and others). But since so much of today's benchmarks are SSSOOOOOO monothreaded, the market gets what the benchmarks convince the designers to build.

Also note: if you look at the volume of chips that go into servers and other big iron, it represents an afternoon in the FAB per year compared to the desktops and notebooks. A profitable afternoon, but not big enough for an Intel or AMD to alter design-team directions.

Mitch
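A back-of-the-envelope check of the arithmetic in the post above, written as a small C sketch. The 7%, 15%, and 37% figures are the estimates quoted in the post, not measurements, and the comparison ignores everything except raw aggregate throughput:

    /* Sketch only: net throughput of SMT on a big core vs. attaching a
     * small in-order core, using the figures quoted in the post above. */
    #include <stdio.h>

    int main(void)
    {
        double mono_cost  = 0.07;  /* monothreaded loss from adding threading */
        double smt_gain   = 0.15;  /* throughput added by the second thread   */
        double small_core = 0.37;  /* 1-wide in-order core vs. the GB OoO core */

        /* SMT: give up 7% single-thread, then add 15% back on top of that. */
        double smt_throughput = (1.0 - mono_cost) * (1.0 + smt_gain);

        /* Attached small core: full-speed big core plus 0.37 of a core. */
        double attached_throughput = 1.0 + small_core;

        printf("SMT on the GB OoO core:   %.2fx aggregate\n", smt_throughput);
        printf("GB OoO + 1W IO core:      %.2fx aggregate\n", attached_throughput);
        return 0;
    }

Run, this prints roughly 1.07x for the threaded case versus 1.37x for the attached small core, and (per the post) the small core costs about half the die additions that threading would.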
From: MitchAlsup on 22 Apr 2010 23:28

On Apr 22, 3:15 pm, Stephen Fuld <SF...(a)alumni.cmu.edu.invalid> wrote:
> Of course, comparing one design with nearly twice the number of
> transistors could outperform the single core design.

Counting transistors is a poor way to judge a CPU design, or to compare CPU utility functions. A small/medium core with 6X-8X the cache might fit in exactly the same die area as a GB OoO core.

An old example I used several years ago compared an Opteron core with a quad postage-stamp core with a 4-way interleaved 256KB shared L2: same die area, greater throughput, higher transistor count, more cache, greater ILP, greater MLP, smaller power dissipation. Knock off 3 of those PSPs, and one could have one small core and 512KB of cache in the same die footprint as the Opteron core (with no L2 or NB or memory/DRAM controller or pins).

Mitch
From: Quadibloc on 23 Apr 2010 03:50

On Apr 22, 9:20 pm, MitchAlsup <MitchAl...(a)aol.com> wrote:
> Once you drop the need for
> maximal monothreaded performance, the GB OoO design point is no longer
> optimal by any metric you want to apply to the commercial space (and
> others). But since so much of today's benchmarks are SSSOOOOOO
> monothreaded, the market gets what the benchmarks convince the
> designers to build.

Hmm. While I think that maximal monothreaded performance is what is generally needed - except in the relatively unusual application of OLTP, where throughput is king - out-of-order execution involves a great deal of complexity (although the note in this thread that 6600-style scoreboards require much less is food for thought).

Would a superscalar chip that uses multithreading to make full use of the computational resources that are there anyway, with generous cache, be a good design point? It would seem to me that one only needs to put multiple cores per chip, aside from satisfying strange things like Windows licensing requirements, if it's important to have the processors tightly coupled. Independent jobs from different users that don't share memory, and which could be running in different boxes connected by network cables, hardly need to share cache.

But the fact that very small caches, like that on the original five-volt Pentium, or the 360/85, were already enough to vastly improve performance means that cache size involves diminishing returns. That seems like a good reason to consider putting another core on the chip.

John Savard
From: Anne & Lynn Wheeler on 21 Apr 2010 16:03 Robert Myers <rbmyersusa(a)gmail.com> writes: > If Intel management read this report, and I assume it did, it would > have headed in the direction that Andy has lamented: lots of simple > cores without energy-consuming cleverness that doesn't help much, > anyway--at least for certain kinds of workloads. The only thing that > really helps is cache. in the time-frame we were doing cluster scaleup for both commercial and numerical intensive ... commercial reference to jan92 http://www.garlic.com/~lynn/95.html#13 oracle made a big issue that they had done extensive tracing and simulation work ... and major thruput factor at the time was having at least 2mbyte processor caches ... and they worked with major server vendors to have option for sufficient cache. recent posts on cluster scaleup http://www.garlic.com/~lynn/2010f.html#47 Nonlinear systems and nonlocal supercomputing http://www.garlic.com/~lynn/2010f.html#50 Handling multicore CPUs; what the competition is thinking http://www.garlic.com/~lynn/2010g.html#8 Handling multicore CPUs; what the competition is thinking http://www.garlic.com/~lynn/2010g.html#52 Handling multicore CPUs; what the competition is thinking the other issue was compare&swap had become widely used for large DBMS multi-threaded operation (whether running multi-threaded or not) ... and although rios/rs6000 did provide for smp operation ... it also didn't provide an atomic compare&swap primitive. as a result, dbms thruput suffered on rs/6000 platform because kernel calls were required to to have serialized operation. eventually aix provided a simulation of compare&swap semantics via a supervisor call (special fastpath in supervisor call interrupt routine that operated while disabled for interrupts ... works in a single processor environment). misc. past posts mentioning compare&swap (&/or smp): http://www.garlic.com/~lynn/subtopic.html#smp compare&swap was originally invented by charlie working on fine-grain multiprocessor cp67 kernel locking at the science center. an effort was then made to try and get it included in 370 architecture ... which was rebuffed by the favorite son operating system in pok (claiming test&set, from 360, was more than sufficient). 370 architecture then provided opening with challenge to come up uses for compare&swap that weren't multiprocessor specific; thus was born the descriptions of compare&swap for use by multithreaded applications. -- 42yrs virtualization experience (since Jan68), online at home since Mar1970
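A minimal sketch of the compare&swap idiom described above -- a multithreaded application updating a shared structure without taking a lock. C11 atomics stand in here for the 370 CS instruction (or for the AIX supervisor-call emulation of it); the free-list example and all the names in it are illustrative, not taken from the post:

    /* Lock-free push onto a shared free list using compare&swap. */
    #include <stdatomic.h>
    #include <stddef.h>

    struct node {
        struct node *next;
        /* ... payload ... */
    };

    static _Atomic(struct node *) free_list = NULL;

    /* Push n onto the shared list; retry if another thread got there first. */
    void push(struct node *n)
    {
        struct node *old = atomic_load(&free_list);
        do {
            n->next = old;
            /* compare&swap: store n only if free_list still equals old;
             * on failure, old is refreshed with the current head and we retry. */
        } while (!atomic_compare_exchange_weak(&free_list, &old, n));
    }

Without such a primitive (the rs/6000 situation described above), each update of this kind needs a kernel call or a lock, which is where the DBMS throughput loss came from.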
From: MitchAlsup on 21 Apr 2010 18:36

On Apr 21, 11:02 am, Robert Myers <rbmyers...(a)gmail.com> wrote:
> Even though the paper distinguishes between technical and commercial
> workloads, and draws its negative conclusion only for commercial
> workloads, it was interesting to me that, for instance, Blue Gene went
> the same direction--many simple processors--for a technical workload so
> as to achieve low power operation.

Reading between the lines, commercial and DB workloads are better served by slower processors accessing a thinner cache/memory hierarchy than by faster processors accessing a thicker cache/memory hierarchy. That is: a commercial machine is better served by a larger first-level cache backed up by a large second-level cache running at slower frequencies, while a technical machine would be better served by smaller first-level caches, a medium second-level cache, and a large third-level cache running at higher frequencies.

What this actually shows is that "one design point" cannot cover all the bases, and that one should configure a technical machine differently than a commercial machine, differently than a database machine. We saw this develop in the interplay between Alpha and HP, Alpha taking the speed-demon approach while HP took the brainiac approach. Alpha had more layers of cache with thinner slices at each level. HP tried to avoid even the second level of cache (7600 Snakes) and then tried to avoid bringing the first-level cache on die until the die area was sufficient. On certain applications Alpha wins, on others HP wins. We also witnessed this as the Alpha evolved: 8KB caches became 16KB caches, then went back to 8KB caches, as the cache hierarchy was continually rebalanced to the workloads (benchmarks) the designers cared about.

Since this paper was written slightly before the x86 crushed out the RISCs in their entirety, the modern reality is that technical, commercial, and database applications are being held hostage to PC-based thinking. It has become just too expensive to target (with more than lip service) application domains other than PCs (for non-mobile applications). Thus the high-end PC chips do not have the memory systems or interconnects that would better serve other workloads and larger-footprint server systems.

A shame, really.

Mitch
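To make the two hierarchy shapes above concrete, a small illustrative sketch in C; the capacities and clock rates below are invented for the sake of the example and do not come from the post:

    /* Illustrative only: "thinner" vs. "thicker" cache hierarchies. */
    #include <stdio.h>

    struct hierarchy {
        const char *workload;
        double core_ghz;
        int l1_kb, l2_kb, l3_kb;   /* 0 means that level is absent */
    };

    int main(void)
    {
        /* thinner hierarchy: big L1, big L2, slower clock */
        struct hierarchy commercial = { "commercial/DB", 2.0, 64, 2048, 0 };
        /* thicker hierarchy: small L1, medium L2, large L3, faster clock */
        struct hierarchy technical  = { "technical/HPC", 4.0, 16, 256, 8192 };

        struct hierarchy cfg[2] = { commercial, technical };
        for (int i = 0; i < 2; i++)
            printf("%-14s %.1f GHz  L1=%dKB L2=%dKB L3=%dKB\n",
                   cfg[i].workload, cfg[i].core_ghz,
                   cfg[i].l1_kb, cfg[i].l2_kb, cfg[i].l3_kb);
        return 0;
    }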