From: Brett Davis on 26 Jul 2010 02:49

In article <15ea973c-fef5-4cc0-82fb-d71e68a6425d(a)d37g2000yqm.googlegroups.com>, Robert Myers <rbmyersusa(a)gmail.com> wrote:
> On Jul 25, 1:27 am, Brett Davis <gg...(a)yahoo.com> wrote:
> > In article <57004bc1-201f-4cd1-8584-e59c01eb0...(a)e5g2000yqn.googlegroups.com>, Robert Myers <rbmyers...(a)gmail.com> wrote:
> > > I agree. The plots are splendid. They always are.
> >
> > Stanford online has a truly awesome related iTunes talk by Bill Dally:
> > http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.e...
> > iTunes store search term "The future of throughput".
> > Track 13 for the course "Programming Massively Parallel Processors with CUDA (CS193G)".
>
> I got stopped on Bill Dally's first slide, where he says "efficiency=locality."
>
> He claims that his slide tells you everything you need to know about the future of computing, AND THAT BELIEF AND PROSELYTIZING FOR THAT BELIEF IS WHY I HAVE USED UP SO MUCH SPACE HERE.
>
> THE MOST INTERESTING PROBLEMS ARE NOT LOCAL, BECAUSE THE MOST INTERESTING PROBLEMS ARE NOT LINEAR.
>
> A problem is embarrassingly parallel IFF it is local IFF it is linear.
>
> If a problem is linear, then there is a representation in which it is both local and embarrassingly parallel. If a problem is not linear, then there is, in general, no such representation. Does one need a PhD from Cal Tech to understand this?

A simple programmer will tell you: fine, but your performance is going to be horrible. Several things have been tried:

The company that bought Cray had a machine that was 1000-way interleaved to deal with 1000-cycle latencies from RAM. That design is dead, and buried?

Sun had its 8-way threaded multiprocessor as a more traditional approach to the same problem. Dead.

IBM sells 4-way multithreaded processors; it is too early to say if this is a fad and failure for IBM.
AMD would allow you to split the 128-bit memory bus into two independent 64-bit busses, for better transaction throughput. To the best of my knowledge no one turns this mode on, as one bus is faster for everyone? For Bulldozer with its 256-bit memory bus I could see one 128-bit bus for the OS and app image, and a pair of 64-bit buses for the database to use. But that is a stretch when AMD may have decided the whole idea is stupid.

Intel has two-way HypeThreading, which, while no longer a net negative that is turned off at boot by default like the NetBust design was, is still a wash at best. Bold hopes of 2x speed increases for high-latency code have returned no speed boost at all. You are hard pressed to find a 5% improvement in a non-rigged benchmark.

Now that we have more CPUs than threads of work to do on most desktop CPUs, the very idea of fine-grain threading is dead. (Outside of Intel marketing, where it is just Blue Crystals.) Fine-grain threading costs transistors for all those things you have to double, and otherwise increase the size of. That costs power, which costs clock, which costs performance. AMD is quite vocal that they will never do HypeThreading, saying that it is a net negative, and I believe them.

> I suppose that it is inevitable that EE's should believe that interesting things can happen in a linear universe, but, from a mathematical point of view, linear systems are as interesting as the aging of paintings hanging in the Louvre.
>
> Robert.

The one problem set I know of that really wants small random accesses is dictionary-based AI, the last remaining approach to emulating the human mind, as all the other approaches have failed. (Think ELIZA with a terabyte database.) But ultimately this is a kludge to get the same results that the human mind does; the human mind is massively parallel and soft-plugboard wired up between neurons. On the ASIC-DSP-CPU scale, the human mind is way off to the left of ASIC.
So what problem is it that you want this functionality for? Fringe designs and apps are interesting mental exercises, fun.

> > A related PDF: http://cva.stanford.edu/people/dally/ISSCC2005.pdf
> >
> > Brett
From: Jason Riedy on 26 Jul 2010 12:13

And Brett Davis writes:
> The company that bought Cray had a machine that was 1000 way
> interleaved to deal with 1000 cycle latencies from RAM.
> That design is dead, and buried?

Nope. The Cray XMT exists; I use the one at PNNL almost daily. The XMT2 is on its way.

And your numbers are off by a factor of 10. The Tera machine had ~140-cycle memory latency (iirc) and carried 128 threads per CPU. The XMT's latency is far worse, and you have fewer useful threads per processor (user code is limited to 100). Some of that is slated to change with the XMT2, but I don't know (and wouldn't be able to say yet) hard numbers.

I'm not making any claims about commercial viability, however. But it's far easier to obtain decent performance on massive-scale combinatorial graph analysis on the XMT than elsewhere... at the moment. Amdahl giggles a bit at practical potential...

Jason
From: Robert Myers on 26 Jul 2010 13:16

On Jul 26, 1:33 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> On 7/25/2010 8:12 PM, Robert Myers wrote:
> > I got stopped on Bill Dally's first slide, where he says "efficiency=locality."
> >
> > He claims that his slide tells you everything you need to know about the future of computing, AND THAT BELIEF AND PROSELYTIZING FOR THAT BELIEF IS WHY I HAVE USED UP SO MUCH SPACE HERE.
> >
> > THE MOST INTERESTING PROBLEMS ARE NOT LOCAL, BECAUSE THE MOST INTERESTING PROBLEMS ARE NOT LINEAR.
> >
> > A problem is embarrassingly parallel IFF it is local IFF it is linear.
> >
> > If a problem is linear, then there is a representation in which it is both local and embarrassingly parallel. If a problem is not linear, then there is, in general, no such representation. Does one need a PhD from Cal Tech to understand this?
>
> (1) Is there a proof for this? Not that there are non-linear systems that are not embarrassingly parallel, but that there are no interesting non-linear systems that are not amenable to parallel solutions.
>
> E.g. the N-body problem, gravitational dynamics?

First, you have to understand that what I think of as a huge problem the DoE (for example) does not even recognize as a problem at all, never mind as a huge problem.

Bill Dally later whisks through the logic, where he talks about calculating fluxes that never make it off the graphics card. If you think way back to the divergence theorem, you can see what they're doing. You reduce the nonlinear terms in the Navier-Stokes equations (or any other equation involving transport) to fluxes through the surface of a control volume. Voila! The non-linear term is local. Since you've got all these flops and not much bandwidth, you only ever transmit and receive fluxes across the boundary.
What's more, if you're the DoE, you can add all kinds of chemical reactions and constitutive equations to keep the flops occupied, EVEN THOUGH YOU ARE USING A TERRIBLE APPROXIMATION OF THE TRANSPORT EQUATIONS, even without the complication of all those chemical reactions (or whatever).

Control volumes, of course, do not appear in the Navier-Stokes equations, which contain differential operators. Any accurate approximation to the differential operators is non-local, and, in fact, the only one with quantifiable limitations in the face of non-linearity is global (a spectral representation of the operator).

This all sounds very arcane, I'm sure, and the DoE wants you to think it is arcane, but the physics are driven by the details of how the shortest scales interact with the longest scales, and when you misrepresent the differential operators with a crude approximation, you are changing the exact physics you want to look at. If you can't build computers that can actually do global FFT's, then you will never be able to examine what the numerics are actually doing, and I do truly believe that that's the way the supercomputer bandwagon wants things. That way, they can publish gorgeous pictures of the Rayleigh-Taylor instability, even while using methods of unknowable accuracy.

It's important enough that, in the middle of his sales pitch, Bill Dally feels compelled to whisk through an admission of what's really going on, even though I strongly suspect that he has no idea of the mathematical consequences of the methods he is using. This whole "supercomputer" shenanigan (and with it miraculous GPUs that will unravel the secrets of the universe with practically no bandwidth) is built on a crude fudge. The fudge is SO convenient, though, that the empire builders and software writers just want to get on with it, without worrying that they might have lost contact with reality before they even started.
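[The control-volume locality Myers is describing can be seen in a few lines. This is an editorial sketch, not code from the thread: a 1D finite-volume update in which each cell needs only the fluxes through its own two faces, i.e. nearest-neighbor communication.]

```python
# Illustrative 1D finite-volume step for du/dt = -d(flux)/dx.
# The point: each interior cell touches only f[i-1] and f[i],
# so the (possibly nonlinear) flux term is purely local.
def finite_volume_step(u, flux, dx, dt):
    """Advance cell averages u by one explicit step."""
    n = len(u)
    f = [flux(u[i]) for i in range(n)]  # upwind flux at each cell face
    u_new = u[:]
    for i in range(1, n - 1):
        # Only the two adjacent face fluxes are needed here.
        u_new[i] = u[i] - dt / dx * (f[i] - f[i - 1])
    return u_new

u = [0.0] * 10
u[4] = 1.0  # a unit spike in one cell
u2 = finite_volume_step(u, lambda v: v, dx=1.0, dt=0.5)
```

Whatever the flux function is, no data moves further than one cell per step; that is exactly the property Dally's GPUs exploit, and exactly what Myers argues misrepresents the differential operators.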
> (2) If what you and Dally say is true, Robert, then you may be tilting at windmills. There may be no computationally efficient way of doing what you want.
>
> I don't believe this, because I do not equate computationally efficient to embarrassingly parallel.
>
> Also: even though locality really matters, nevertheless we have used that as an excuse to pessimize non-local activities. At the very least we can influence the constant factor by removing these pessimizations. As I have attempted to do with my proposal for a scatter/gather based memory subsystem.

I may well be tilting at windmills. I've understood that all along. Even if you can build computers with sufficient global bandwidth (can afford the hardware), the energy costs might kill you. I don't really know.

The FFT isn't local, but it diagonalizes the derivative (DOES TRULY MAKE THE CALCULATION LOCAL), and it can be done with very great efficiency, IFF you have sufficient global bandwidth.

Personally, I think that if I'm tilting at windmills, it's not because of any actual physical limitation or any real financial limitation that can't be overcome, but because dealing with the foundational reality I keep harping on will interfere with the march of petaflops and the endless flow of pretty pictures.

Robert.
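[What "the FFT diagonalizes the derivative" means, as an editorial sketch: in Fourier space, d/dx becomes multiplication by ik per mode, so the otherwise non-local derivative is exact and pointwise local in spectral space, at the price of a global transform.]

```python
import numpy as np

# Spectral derivative on a periodic grid: FFT, multiply each mode by
# i*k (a purely local, per-mode operation), inverse FFT.
n = 64
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
u = np.sin(3 * x)

ik = 1j * np.fft.fftfreq(n, d=1.0 / n)  # i*k for each Fourier mode
du = np.real(np.fft.ifft(ik * np.fft.fft(u)))

# For a resolved mode the result matches the exact derivative 3*cos(3x)
# to machine precision -- no truncation error, unlike any finite
# difference stencil.
err = np.max(np.abs(du - 3 * np.cos(3 * x)))
```

The non-locality has not vanished; it has been moved into the FFT itself, which is all-to-all in the data, hence Myers's insistence on global bandwidth.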
From: Robert Myers on 26 Jul 2010 20:24

On Jul 26, 7:29 pm, Chris Gray <c...(a)GraySage.com> wrote:
> Robert Myers <rbmyers...(a)gmail.com> writes:
> > First, you have to understand that what I think of as a huge problem
> > the DoE (for example) does not even recognize as a problem at all,
> > never mind as a huge problem.
> ... etc ...
>
> I know virtually nothing about the math and coding of these things, and so should just stay out of this. But:
>
> 1) Surely someone reading this newsgroup has a simulation/whatever that can be fairly easily converted from the "bad" locality assumptions into one that runs globally. Can't you just hack some of the constants so that it only has one "cell" instead of many? Don't the environments support things like large arrays spanning multiple system nodes?
>
> 2) Surely someone reading this newsgroup has some kind of access to a system that can try to run the resulting simulation. I figure that back in the old days at Myrias we could have found a way to do it. We could likely have done a weekend run on 1024 nodes, longer on fewer nodes.
>
> It could even be in the interests of hardware vendors to do this. If the run proves Robert right, then there could be lots of new money coming to research systems able to run globally.
>
> Likely one run wouldn't be enough, but it would at least be a start.

It's not as if any of this just bubbled up out of the ground like methane around the Deepwater Horizon well. People who need to get the physics right know how to do it:

http://www.o3d.org/abracco/annual_rev_3dnumerical.pdf

It would be fascinating to see what would happen if you tried to reproduce these kinds of fundamental calculations using the kind of differencing that Dally apparently takes for granted.
It would be a significant undertaking, and I don't think anyone would be surprised that such calculations would yield results that are wrong for reasons one could easily predict ahead of time: the real physics of isotropic turbulence would be swamped by numerical diffusion.

In my admittedly cynical view of the world, when you need to get it right, and you know it will be obvious that you have gotten it wrong, no one uses the methods that the DoE apparently designs its computers around. If all you need is a calculation that isn't obviously wrong without some subtle checking, crude differencing schemes are the rule and highly accurate differencing schemes are the exception. At one time, fluid-mechanical calculations were so expensive that you were happy to get something that just looked right. We shouldn't have to live that way any more.

I can't guarantee that the kind of comparison you'd think would be obvious to do hasn't been done, but the incentive to do it is low and the risks very high. As someone who really does know has said to me in a friendly way: I'm likely to make a lot of enemies.

A careful comparison of local, low-order differencing results with global, high-order differencing results is an obvious thing to do, but if there is any such study with any computer even remotely state-of-the-art (and, remember, state-of-the-art "supercomputers" are *terrible* for FFT's), I'm not familiar with it.

Robert.
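[The "swamped by numerical diffusion" claim is easy to demonstrate. An editorial sketch, not from the thread: advect a sharp pulse once around a periodic grid with first-order upwind differencing. The exact solution returns the pulse unchanged; the scheme's truncation error acts like an artificial viscosity and flattens it.]

```python
import numpy as np

n, c = 200, 0.5        # grid points, Courant number (stable: c <= 1)
u = np.zeros(n)
u[90:110] = 1.0        # sharp square pulse
peak0 = u.max()

# First-order upwind update for u_t + a u_x = 0, one full period.
# Its modified equation carries a diffusive term ~ (a*dx/2)(1-c) u_xx,
# which is pure numerical artifact.
for _ in range(int(n / c)):
    u = u - c * (u - np.roll(u, 1))

loss = peak0 - u.max()  # how much the peak has eroded
```

After one period the pulse is visibly smeared even though the scheme is conservative; the smearing is indistinguishable, in the output, from physical diffusion. That is the sense in which crude differencing "changes the exact physics you want to look at."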
From: Andy Glew <"newsgroup at comp-arch.net"> on 26 Jul 2010 22:34
On 7/26/2010 5:24 PM, Robert Myers wrote:
> A careful comparison of local, low-order differencing results with
> global, high-order differencing results is an obvious thing to do, but
> if there is any such study with any computer even remotely state-of-the-art
> (and, remember, state of the art "supercomputers" are *terrible* for
> FFT's), I'm not familiar with it.

Can you remind us (okay, me, the lazy one) what the scaling of bandwidth requirements with problem size is for FFTs? I.e., for an NxNxN 3D FFT, DP coordinates, what is the bandwidth? Phrase it solely in terms of sending around 64-bit DP numbers. And phrase it in terms of a dancehall configuration: all memory on one side, all processors on the other. I.e., do NOT neglect bandwidth to processor-local memory.

Once you get to a certain level of the FFT, we have lots of nice cache-line-sized accesses. Unless they are being distributed to different processors. Which is where my scatter/gather interconnect would help.

We might want to create 3D tiles.

I'm staring at this thinking compression. Adjacent points are probably close in value. Now, compression of cache lines doesn't help that much, unless you have sequential access patterns even bigger than a cache line. But in my scatter/gather world, any compression might help, since it can be packed with other odd-sized packets.
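[An editorial back-of-envelope answer to Glew's question, under one common assumption he does not specify: a transpose-based (pencil) decomposition, where a complex-to-complex N^3 FFT moves the entire dataset across the machine once per transpose, and needs two transposes (three if the output must land back in the input layout).]

```python
# Global traffic for an N^3 complex double-precision FFT under a
# transpose-based decomposition. Each point is one complex DP number,
# i.e. two 64-bit doubles = 16 bytes; each transpose is an all-to-all
# of the whole dataset.
def fft3d_traffic_bytes(n, transposes=2, bytes_per_point=16):
    return transposes * n**3 * bytes_per_point

# Example: a 4096^3 FFT moves 2 * 1 TiB = 2 TiB through the network,
# while the arithmetic is only ~5 * N^3 * log2(N^3) flops -- so the
# compute-to-communication ratio grows only as log N.
traffic = fft3d_traffic_bytes(4096)
```

The scaling is the uncomfortable part: doubling N multiplies the global all-to-all volume by 8, while per-point compute grows only logarithmically, which is why bisection bandwidth, not flops, bounds large FFTs.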