From: "Andy "Krazy" Glew" on 11 Mar 2010 09:42 Although I don't want to get too involved in the discussion about US supercomputer policy, not my bailiwick, I will continue to not purely lurk because I am interested in the question: "When is it worthwhile adding processors that will be left idle much of the time?" I'm interested in this not just in the context of this discussion, but in terms of past discussions: e.g. we have discussed multithreaded architectures like Tera, and, for that matter, Intel HT (SMT), AMD MCMT, etc. One of the basic ideas behind multithreading is to switch to a different thread when the supposedly expensive execution units would otherwise go idle. Yet such threading has a hardware cost: register files to support such threading get slower and more power hungry, the buffers to hold threads that are switched out waiting for a cache miss, etc. At some point it is not worthwhile adding more threads, and their associated hardware. At some point it is probably worth letting the processor go idle. I'm trying to get my head around what that point is, in a semi-theoretical sense, to help calibrate my understanding if I get to work on such issues again. E.g. with multicluster multithreading, as AMD is rumored to be doing, and as I envisaged it: add execution units that you know will be more idle than in an SMT. But the tradeoff that I want to consider is not the "stop adding more threading, we are in diminishing returns" area. Instead I want to consider the tradeoff between Adding another completely separate processor, with the associated interconnect overhead vs Adding more threading. e.g. let's assume that you have a base processor with flops X, hardware cost H, connected to a network I. If you add a duplicate core, you get flops 2X (best case), hardware cost 2H, and interconnect cost - well, not 2I. Perhas 2I in the local interconnect, but definitely not in the global. So let's assume interconnect I = L + G, and say interconnect is 3L + G. (Note that I am making the assumption that you have a simplistic tree interconnect - to double the processors, you add an extra layer of local stuff => 3L) If you double the multithreading, you get flops 2X, hardware cost H+M, and interconnect... well, let's just leave interconnect unaffected, although realistically you will have to add more hardware. But let's assume that interconnect can be completely pipelined and clocked faster, wave pipelined or whatever, and that doesn't cost you. Subtracting duplicate core versus doubling the multithreading, C2 - M2 = (2H + 3L+G) - (H+M + L+G) = H + 2L - M. Which means that you only stop multithreading when the size of the multithreading overhead, the RFs etc., is more than the cost of the CPU and local interconnect. Which is a long way out. (Or when the ALUs are almost fully used, which is closer). However, the above is too simplistic: when you double cores, you don't get 2X the flops. And when you double threads, you get even less flops. Perhaps it is better to compare in terms of total cost per flop: (2H+3L+G) / X-multicore vs (H+M+3L+G) / X-multithread But this means that the incremental value, the utilization of the extra flops that you are adding via multithreading, may be significantly less that the incremental value in multicore. The old econmist "Occasionally it is worth selling a product below cost, so long as it covers your marginal cost" issue. It also means that there is a sawtooth wave. 
It also means that there is a sawtooth wave. If you want to increase performance by 1.5X, it may be worth adding threads; but when you cross the threshold, you need to add more cores and back off on the threads. I.e. when you add cores you want to back off a few generations on the threading.

Robert Myers wrote:
> On Mar 10, 2:16 am, Terje Mathisen <"terje.mathisen at tmsw.no">
> wrote:
>> Andy "Krazy" Glew wrote:
>>> Robert Myers wrote:
>>>> A machine that achieves 5% efficiency doing a bog-standard problem has
>>>> little right to claim to be super in any respect. That's my mantra.
>>> I've been lurking and listening and learning, but, hold on: this is silly.
>>> Utilization has nothing to do with "super-ness". Cost effectiveness is
>>> what matters.
> I focus on the FFT because there I am certain of my footing. In a
> broader sense that I can't defend with such specificity, if all your
> computer designs force localized computation or favor it heavily, you
> will skew the kind of science that you can and will do.

I'm quite happy to agree with you. If FFTs are the most important application, and if you are not building the sort of machine that an FFT needs, in an ideal world you would change the machine design. I'm just pointing out that the most cost effective FFT machine may not be the "ideal" FFT machine from the point of view of processor utilization.

I also agree with you that there may be unfortunate side effects: if the most cost effective FFT machine is even more cost effective for non-FFT calculations, then the science may go in those non-FFT directions.

>>> But if the flops are cheap compared to the bandwidth, then it may very
>>> well make sense to add lots of flops. You might want to add more
>>> bandwidth, but if you can add a 1% utilized flop and eke a little bit
>>> more out of the expensive interconnect...
>> If having a teraflop available in a $200 chip makes it possible to get
>> 10% better use out of the $2000 (per board) interconnect, then that's a
>> good bargain.
>>
> Maybe it is and maybe it isn't.
> My argument is that the scalability of current machines is
> a PT Barnum hoax with very little scientific payoff.
>
> ... People may intuitively think that if you just pile
> up enough flops in one place, you can do any kind of math you want,
> but that intuition is dangerous.

I definitely agree with you.

> When I start down this road, I will inevitably start to sputter
> because I feel so outnumbered.

Join the club.

> If the limit on interconnects is fundamental, I'd sure like to understand
> why.

It's always fun to hear Burton Smith do his mental calculations of interconnect: "If I have such and such a volume of copper, then the best bisection bandwidth that I can achieve is ..."
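(A sketch of what that kind of mental calculation looks like, in Python. Every number here - the bisection cross-section, the wire pitch, the per-pair signalling rate - is an assumption I picked for illustration, not anything Burton actually quoted.)

# Back-of-the-envelope bisection bandwidth from physical wiring limits.
# All numbers are illustrative assumptions, not measurements.
cross_section_m2 = 1.0       # area of the plane that bisects the machine (1 m^2)
wire_pitch_m = 1.0e-3        # one differential pair per square millimetre
gbps_per_pair = 10.0         # signalling rate per pair

pairs_crossing = cross_section_m2 / (wire_pitch_m ** 2)   # ~1e6 pairs
bisection_gbps = pairs_crossing * gbps_per_pair

print("pairs crossing the bisection: %.0f" % pairs_crossing)
print("best-case bisection bandwidth: %.1f Tb/s" % (bisection_gbps / 1e3))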
From: "Andy "Krazy" Glew" on 11 Mar 2010 10:01 Robert Myers wrote: > On Mar 10, 9:33 pm, "Del Cecchi" <delcec...(a)gmail.com> wrote: >> As I said a few days ago, bandwidth costs money. >> Latency is with us always. >> > You keep saying that, and it makes me grind my teeth every time you > do. When Seymour said, "You can't fake it," he was talking about > *bandwidth*, not latency. You *can* fake latency, and IBM did some of > the most fundamental work in that area. You *cannot* fake bandwidth. We fake latency via caches and prefetchers. Which work pretty well, but for some workloads don't work at all. We fake bandwidth by replicating computation and compression. Instead of doing the smallest number of computations, you compress the data, and send the smallest amount of data between nodes, and then use local computation ton uncompress and/or replicate computations. Faking bandwidth doesn't seem to work quite as well. --- I'm thinking about this because it became obvious in my conversation with Ivan Sutherland that his head is totally in bandwidth space. As you might expect from graphics. Whereas I have spent most of my career in latency space. This is causing me to wonder: are there ay important computations that are still latency sensitive? Or is everything bandwidth sensitive from now on? I suspect latency still matters, but I want to understand how, in what workloads. Particularly, in what parallel workloads does latency still matter. By the way, though, the really latency sensitive supercomputer workloads tend to not be so incompatible architecture-wise with bandwidth, because they tend to not really benefit from caches, and complex multistage routing networks that emphasize local connectivity. They tend to want low latency access to global data. It's not latency vs. bandwidth. It's locality vs. globality, and in the latter maybe global latency vs global bandwidth. And here may be the crux of the problem: global latency and global bandwidth are not incompatible when the interconnect is shallow. But when you start adding multiple stages to the interconnect, you tend to hurt global latency; but you also provide a place that is just so very damned attractive to add a processor. Every switching node is a place you can add a processor. And when you add processors in the switching nodes, you create architectures that favor locality.
From: Robert Myers on 11 Mar 2010 12:45

On Mar 11, 10:01 am, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net> wrote:
> Robert Myers wrote:
> > On Mar 10, 9:33 pm, "Del Cecchi" <delcec...(a)gmail.com> wrote:
> >> As I said a few days ago, bandwidth costs money.
> >> Latency is with us always.
> >
> > You keep saying that, and it makes me grind my teeth every time you
> > do. When Seymour said, "You can't fake it," he was talking about
> > *bandwidth*, not latency. You *can* fake latency, and IBM did some of
> > the most fundamental work in that area. You *cannot* fake bandwidth.
>
> We fake latency via caches and prefetchers. Which work pretty well, but for some workloads don't work at all.
>
> We fake bandwidth by replicating computation and compression. Instead of doing the smallest number of computations, you
> compress the data, and send the smallest amount of data between nodes, and then use local computation to uncompress
> and/or replicate computations.
>
> Faking bandwidth doesn't seem to work quite as well.
>

In many instances, it doesn't matter if the calculation arrives even seconds later, so long as you don't lengthen the critical path significantly, because those seconds are small compared to the total compute time for problems put on really big computers. If you degraded the bandwidth by the same ratio (a hundred nanoseconds to a few seconds), you might as well skip the electronics and use humans with paper and pencil.

Sure, "you can't fake bandwidth" can be quibbled with, but the extent to which faking is potentially useful is very small, as compared to latency, where, if the access pattern is predictable, the latency to get the calculation started hardly matters at all. There will always be problems with unpredictable access patterns. If you really need to be doing petaflops with calculations like that, either the problem had best be embarrassingly parallel or you may as well give up.

> ---
>
> I'm thinking about this because it became obvious in my conversation with Ivan Sutherland that his head is totally in
> bandwidth space. As you might expect from graphics.
>
> Whereas I have spent most of my career in latency space.
>
> This is causing me to wonder: are there any important computations that are still latency sensitive? Or is everything
> bandwidth sensitive from now on?
>

Some operations research calculations are inherently serial and therefore latency sensitive. My argument has been that if such calculations were all *that* important, you'd see a big market for computers with heroic cooling. If even someone on Wall Street is doing it to gain a few milliseconds, I've not heard of it.

> I suspect latency still matters, but I want to understand how, in what workloads. Particularly, in what parallel
> workloads does latency still matter.
>
> By the way, though, the really latency sensitive supercomputer workloads tend to not be so incompatible
> architecture-wise with bandwidth, because they tend to not really benefit from caches, and complex multistage routing
> networks that emphasize local connectivity. They tend to want low latency access to global data.
> It's not latency vs. bandwidth. It's locality vs. globality, and in the latter maybe global latency vs global bandwidth.
>
> And here may be the crux of the problem: global latency and global bandwidth are not incompatible when the interconnect
> is shallow.
> But when you start adding multiple stages to the interconnect, you tend to hurt global latency; but you
> also provide a place that is just so very damned attractive to add a processor. Every switching node is a place you can
> add a processor. And when you add processors in the switching nodes, you create architectures that favor locality.

Which is basically where we are right now, and maybe the logic is so compelling that that's the end of the story.

Robert.
From: Andrew Reilly on 11 Mar 2010 16:38

On Thu, 11 Mar 2010 07:38:44 -0600, Del Cecchi` wrote:
> Apparently FFT doesn't let you fake bandwidth or latency.

It depends on how they're written, of course, but FFTs don't necessarily care about latency at all: the access/communications pattern might be total and intricate, but it is entirely deterministic.

Back in the 80's my boss made an FFT engine that the CSIRO (and later SETI) used for radio astronomy. It used DRAM for all storage, but the compute unit was 100% saturated, because the computation program and the memory access program were effectively pre-computed (unrolled) and scheduled around the DRAM latency, then stored in a ROM (and later that ROM was optimized/compressed into a state machine).

I dare say that the FFT routines that run on the big, distributed supers operate in much the same way, or at least they could.

Cheers,

--
Andrew
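(To illustrate the determinism being described: a small Python sketch in which the entire butterfly/memory-access schedule of an iterative radix-2 FFT is generated before any data is touched, so it could in principle be unrolled into a ROM or scheduled around DRAM latency. This is a generic textbook radix-2 FFT, not the actual CSIRO engine.)

import cmath

def fft_schedule(n):
    # Every butterfly of an n-point iterative radix-2 DIT FFT, as
    # (low index, high index, twiddle index) triples.  The schedule depends
    # only on n, never on the data, so all memory accesses are known up front.
    sched = []
    stages = n.bit_length() - 1          # n assumed to be a power of two
    for s in range(1, stages + 1):
        m = 1 << s
        for k in range(0, n, m):
            for j in range(m // 2):
                sched.append((k + j, k + j + m // 2, j * (n // m)))
    return sched

def fft_execute(x, sched):
    n = len(x)
    bits = n.bit_length() - 1
    # Bit-reversal permutation (also data-independent).
    a = [x[int(format(i, "0%db" % bits)[::-1], 2)] for i in range(n)]
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    for lo, hi, tw in sched:             # a pure straight-line butterfly stream
        t = w[tw] * a[hi]
        a[hi] = a[lo] - t
        a[lo] = a[lo] + t
    return a

if __name__ == "__main__":
    data = [complex(i) for i in range(8)]
    print(fft_execute(data, fft_schedule(8)))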
From: Paul A. Clayton on 11 Mar 2010 17:20
On Mar 11, 9:42 am, "Andy \"Krazy\" Glew" <ag-n...(a)patten-glew.net> wrote:
[snip]
> Adding another completely separate processor, with the associated interconnect overhead
> vs
> Adding more threading.
[snip]
> (2H + 3L + G) / X-multicore
>
> vs
>
> (H + M + L + G) / X-multithread
>
> But this means that the incremental value, the utilization of the extra flops that you are adding via multithreading,
> may be significantly less than the incremental value in multicore. The old economist's "Occasionally it is worth selling a
> product below cost, so long as it covers your marginal cost" issue.
>
> It also means that there is a sawtooth wave. If you want to increase performance by 1.5X, it may be worth adding
> threads; but when you cross the threshold, you need to add more cores and back off on the threads. I.e. when you add
> cores you want to back off a few generations on the threading.

Just a few quick (obvious) comments:

* Multithreading can increase locality of communication (potentially even more so than multiple cores sharing an L1 DCache).
* Multithreading encourages 'fat' cores that can exploit temporally local parallelism (in some sense a repeat of the previous point--locality of communication).
* Multicore provides greater thermal separation, potentially providing thermal headroom for higher frequency.
* Multicore fits better with 'Bubblewrap/Processor popping'.
* Multicore tends to reduce scheduling complexity.
* Multithreading allows finer-grained resource allocation at various levels of temporal granularity (a thread could, e.g., use less than the maximum number of registers, use fewer execution resources, et al.) without heterogeneity of cores and interconnect.

Obviously the divisions are not clear cut. Caches, instruction decode hardware, OoO support hardware, execution resources, result routing, registers, etc. can be shared (or not); hierarchies can have different sharing/latency. (I have not read any papers suggesting the potential dual use of storage space for 'dead but not committed' register values or SoEMT waiting-thread context. Nor have I read any suggestion that something like Larrabee's SIMD registers could be used to hold waiting thread contexts for server use with thread-rich applications that have limited data-level parallelism.)

Paul A. Clayton
still just a technophile