From: Larry on 14 Mar 2010 18:56

I agree that the graph theory definition of bisection is not too useful as a figure of merit for computer system interconnects.

Roughly, it asks how many links you have to cut before one half of the machine can't talk to the other half, for an arbitrary assignment of nodes to halves.

First, in the case of things like Kautz and de Bruijn graphs, the known upper and lower bounds are far apart, making it less than useful; and second, the graph theory version of the problem doesn't take routing into account. The routes between nodes in the same partition may cross the bisection an even number of times, and the routes between nodes in opposite partitions may cross an odd number of times. The same problems can arise when using multiple paths even in simpler topologies like 3D meshes and tori like Cray XTn or BG/whatever.

However, better global communications helps with large scale applications, and it is worth trying to quantify. My personal favorite is an average of all-to-all bandwidth, with everyone talking to everyone. This is something like the communications pattern for HPCC Random Access. Random Access is latency insensitive, however. HPCC Random Ring is the next best choice, since both latency and bandwidth are measured. The "Random" part makes it fairly hard to game.

FFT and PTRANS are relatively poor choices. PTRANS is easy to game, and although it is hard to build a system that works well on FFT, it is not clear that such a system would help with other applications, or even work well on FFT unless you get the job layout exactly right.

This also suggests another issue: job layout, and what happens when you want to run more than one job at a time. How much degradation due to interference will there be?

Finally, I want to point out that good global communications can be done, it is just expensive. No magic is necessary. All you need to do is build a sufficiently large crossbar switch. These can have modular components, and it is "just" an engineering problem. Of course its cost goes as N**2 in the number of nodes.

Alternatively, you can build a fully configured fat tree, which only costs N log N. In either case, the cost of the communications is going to dominate the cost of the system at larger scale. You get log N latency with the fat tree, rather than constant, however.

This unfortunate tradeoff is why we keep getting 3D tori. They are easy to build, and they work for a range of scales, until they don't. SiCortex tried the Kautz graph, which scales better than the 3D torus, but is hard to wire.

So are there scalable solutions in which the communications costs only grow with the number of nodes? And which have low latency for all routes? One possibility is optical, where each node aims a laser at a mirror to hit the receiver on another node. There is collision avoidance work to be done, but there's no N**2 crossbar cost.

-L
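As a rough illustration of the crossbar-versus-fat-tree tradeoff described above, here is a small Python sketch. Only the scaling laws are taken from the post (N**2 crossbar cost with a constant number of switch traversals, roughly N log N fat tree cost with about log N hops each way); the cost constants and example sizes are arbitrary assumptions for illustration.

# Rough scaling sketch of the crossbar vs fat-tree tradeoff described above.
# The absolute constants are arbitrary (assumptions for illustration); only
# the scaling laws come from the post: crossbar cost ~ N**2 with constant
# hop count, fat tree cost ~ N*log2(N) with ~2*log2(N) hop latency.

import math

def crossbar_cost(n, cost_per_crosspoint=1.0):
    # One crosspoint per (input, output) pair.
    return cost_per_crosspoint * n * n

def fat_tree_cost(n, cost_per_link=1.0):
    # A fully provisioned (non-blocking) fat tree needs on the order of
    # N*log2(N) links/router ports.
    return cost_per_link * n * math.log2(n)

def crossbar_hops(n):
    return 1                            # one switch traversal, independent of N

def fat_tree_hops(n):
    return 2 * math.ceil(math.log2(n))  # up to the common ancestor and back down

if __name__ == "__main__":
    for n in (64, 1024, 16384, 262144):
        print(f"N={n:7d}  crossbar cost ~{crossbar_cost(n):.3e}  "
              f"fat tree cost ~{fat_tree_cost(n):.3e}  "
              f"hops: {crossbar_hops(n)} vs {fat_tree_hops(n)}")

Running it shows the familiar picture: the crossbar's cost pulls away from the fat tree's by orders of magnitude long before the fat tree's extra log N hops matter much.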
From: nmm1 on 15 Mar 2010 05:05

In article <826b10f7-be18-48f1-b1f2-42b0bbebd52f(a)t41g2000yqt.googlegroups.com>, Larry <lstewart2(a)gmail.com> wrote:

>I agree that the graph theory definition of bisection is not too
>useful as a figure of merit for computer system interconnects.
>
>Roughly, it asks how many links you have to cut before one half of the
>machine can't talk to the other half, for an arbitrary assignment of
>nodes to halves.

Yes, but even that has maximal, minimal and average forms, for as many meanings of average as you care to choose!

>In the case of things like Kautz and de Bruijn graphs, the known upper
>and lower bounds are far apart, making it less than useful, and
>second, the graph theory version of the problem doesn't take routing
>into account. The routes between nodes in the same partition may
>cross the bisection an even number of times, and the routes between
>nodes in opposite partitions may cross an odd number of times.

Yes, indeed. There are graph theoretic measures that do take account of such things, but they have been dropped by most computer scientists because they are mathematically harder. Network flow theory goes back a long way, and is very important in many fields.

>My personal favorite is an average of all-to-all bandwidth with
>everyone talking to everyone. ...

I started using that about 1997, and found it very reliable. What I used was (roughly):

  Point-to-point peak bandwidth and latency, to evaluate what a single pair of ports could do.
  All-to-all bandwidth and latency, to evaluate what a fully loaded network would deliver.

It was interesting how often the figures I got were a lot smaller than the ones quoted by the vendor, even for quite small networks. I could get their figures easily enough by using a bisection measurement, but could get even lower figures by using another bisection measurement (chosen with malice aforethought).

As you say, the great advantage of all-to-all is that it is a well-defined average value, typical of the use in many applications.

Regards,
Nick Maclaren.
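A minimal sketch of the two measurements Nick describes, assuming mpi4py and NumPy are available. The message size, repetition count, and reporting format are arbitrary choices of mine, not anything from the original benchmarks.

# Point-to-point (one pair of ports) and all-to-all (fully loaded network)
# bandwidth, in the spirit of the measurements described above.
# MSG_BYTES and REPS are arbitrary assumptions.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

MSG_BYTES = 1 << 20     # 1 MiB per destination (assumption)
REPS = 20

# --- point-to-point ping-pong between ranks 0 and 1 ---
buf = np.zeros(MSG_BYTES, dtype=np.uint8)
comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
t1 = MPI.Wtime()
if rank == 0:
    bw = 2 * REPS * MSG_BYTES / (t1 - t0)
    print(f"point-to-point: {bw / 1e9:.2f} GB/s")

# --- all-to-all: everyone talking to everyone ---
send = np.zeros(size * MSG_BYTES, dtype=np.uint8)
recv = np.empty_like(send)
comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    comm.Alltoall(send, recv)
t1 = MPI.Wtime()
if rank == 0:
    per_rank = REPS * size * MSG_BYTES / (t1 - t0)
    print(f"all-to-all: {per_rank / 1e9:.2f} GB/s injected per rank")

Launch with something like "mpirun -n 64 python bench.py" (the file name is just an example). As Nick notes, the interesting result is usually how far the loaded all-to-all figure falls below what the point-to-point number would suggest.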
From: Terje Mathisen "terje.mathisen at tmsw.no" on 15 Mar 2010 05:17

Larry wrote:
> Finally, I want to point out that good global communications can be
> done, it is just expensive. No magic is necessary. All you need to
> do is build a sufficiently large crossbar switch. These can have
> modular components, and it is "just" an engineering problem. Of
> course its cost goes as N**2 in the number of nodes.
>
> Alternatively, you can build a fully configured fat tree, which only
> costs N log N. In either case, the cost of the communications is going
> to dominate the cost of the system at larger scale. You get log N
> latency with the fat tree, rather than constant, however.

Hmmm?

I would assume that when you get big enough, even a very large crossbar could not scale much better than sqrt(N), since that is the increase in wire distance?

> This unfortunate tradeoff is why we keep getting 3D tori. They are
> easy to build, and they work for a range of scales, until they don't.
> SiCortex tried the Kautz graph, which scales better than the 3D torus,
> but is hard to wire.
>
> So are there scalable solutions in which the communications costs only
> grow with the number of nodes? And which have low latency for all
> routes? One possibility is optical, where each node aims a laser at a
> mirror to hit the receiver on another node. There is collision
> avoidance work to be done, but there's no N**2 crossbar cost.

The micro-mirror laser setup would seem to need a separate network to handle setup of the links. If you want to steer the mirrors between every exchange, then the mirror latency would probably dominate.

TANSTAAFL

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
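To put a number on the wire-distance argument: if N endpoints are packed into a plane, the physical span of the machine, and with it the worst-case wire through any central switch, grows roughly as sqrt(N). In the sketch below, the floor footprint per node and the signal velocity are assumptions chosen only to make the numbers concrete.

# Back-of-the-envelope latency floor from physical machine span.
# NODE_FOOTPRINT_M2 and SIGNAL_SPEED are assumptions, not measured values.

NODE_FOOTPRINT_M2 = 0.25        # floor space per node, in m^2 (assumption)
SIGNAL_SPEED = 2.0e8            # ~2/3 c in copper or fibre, in m/s (assumption)

def planar_span_m(n):
    # Side length of a square floor plan holding n nodes.
    return (n * NODE_FOOTPRINT_M2) ** 0.5

def wire_latency_ns(n):
    # One worst-case traversal across the floor plan.
    return planar_span_m(n) / SIGNAL_SPEED * 1e9

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"N={n:8d}  span ~{planar_span_m(n):7.1f} m  "
          f"wire latency ~{wire_latency_ns(n):7.1f} ns")

This is the sqrt(N) floor Terje is pointing at; packing in three dimensions, as Nick suggests in the next post, only softens it to roughly N**(1/3).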
From: nmm1 on 15 Mar 2010 05:54

In article <jha177-fiv2.ln1(a)ntp.tmsw.no>, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

>Larry wrote:
>> Finally, I want to point out that good global communications can be
>> done, it is just expensive. No magic is necessary. All you need to
>> do is build a sufficiently large crossbar switch. These can have
>> modular components, and it is "just" an engineering problem. Of
>> course its cost goes as N**2 in the number of nodes.
>>
>> Alternatively, you can build a fully configured fat tree, which only
>> costs N log N. In either case, the cost of the communications is going
>> to dominate the cost of the system at larger scale. You get log N
>> latency with the fat tree, rather than constant, however.
>
>Hmmm?
>
>I would assume that when you get big enough, even a very large crossbar
>could not scale much better than sqrt(N), since that is the increase in
>wire distance?

Perhaps N^(1/3) or similar. If the power requirements were cut to something reasonable, a lot could be done with layering. For example, a purely passive copper+insulator chip could be attached to deliver all long-distance wiring. It might even be feasible with no power reduction, and I assume that the construction people are considering it.

Regards,
Nick Maclaren.
From: MitchAlsup on 15 Mar 2010 12:35
On Mar 14, 1:25 pm, an...(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> MitchAlsup <MitchAl...(a)aol.com> writes:
> >Consider a 1/2 meter sq motherboard with "several" CPU nodes with 16
> >bidirectional (about) byte wide ports running at 6-10 GT/s. Now
> >consider a back plane that simply couples this 1/2 sq meter
> >motherboard to another 1/2 sq meter DRAM carrying board also with 16
> >bidirectional (almost) byte wide ports running at the same
> >frequencies. Except, this time, the DRAM boards are perpendicular to
> >the CPU boards. With this arrangement, we have 16 CPU containing
> >motherboards fully connected to 16 DRAM containing motherboards and
> >256 (almost) byte wide connections running at 6-10 GT/s. 1 cubic meter,
> >about the actual size of a refrigerator.
>
> I compute 1/2m x 1/2m x 1/2m = 1/8 m^3.
>
> Where have I misunderstood you?

Now that you point this out, I end up agreeing with you. But consider adding space around each side of the boards to hold the board in place (the rack), provide connections (the backplane) and other communications ports (front side connectors) on both the CPU boards and the DRAM boards; finally, add walls to isolate and control air flow, and space for the fans and things, and I think you might be getting close to a cubic meter. However, the way I explained it does lead to confusion. Sorry.

> But that size is the size of a small freezer around here (typical
> width 55-60cm, depth about the same, and the height of the small ones
> is around the same, with the normal-sized ones at about 1m height).
>
> Hmm, couldn't you have DRAM boards on both sides of the mainboard (if
> you find a way to mount the mainboard in the middle and make it strong
> enough). Then you can have a computer like a normal-size fridge:-).

{I believe} You need the DRAM boards to be perpendicular to the CPU boards in order to create the massive bandwidth between the CPUs and DRAMs. With 16 boards of each flavor, you have 256 independent FB-DIMM-like channels providing the BW. The interconnecting "backplane" only has wires connecting one port of one board with one port of another straight through {with maybe some power, gnd, clocks,...}. I don't see how to get this kind of BW by putting the DRAMs on the main boards.

Conceptually one could put the DRAMs on the mainboards, but in this case the wiring of those 256 interconnecting ports would then require actual wires (on the backplane) instead of simple through connections across a backplane as described above.

So, in essence, the difference is that the ports from CPU<->DRAM cross the backplane with local (short) wiring connecting one connector to another in a perpendicular direction, with essentially no wiring elsewhere on the backplane. Thus, the backplane could be implemented in (maybe) 3 routing layers.

Mitch
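For what it's worth, here is Mitch's arithmetic worked through in a short Python sketch, with my assumptions called out: 16 x 16 boards give 256 crossing channels, "(almost) byte wide" is taken as 1 byte of payload per transfer, and the 6-10 GT/s range is used as given.

# Worked numbers for the perpendicular CPU/DRAM board arrangement above.
# BYTES_PER_TRANSFER is my simplification of "(almost) byte wide".

CPU_BOARDS = 16
DRAM_BOARDS = 16
BYTES_PER_TRANSFER = 1               # assumption: 1 B of payload per transfer

links = CPU_BOARDS * DRAM_BOARDS     # 256 CPU<->DRAM channels

for gts in (6e9, 10e9):
    per_link = gts * BYTES_PER_TRANSFER          # B/s each way per channel
    total = links * per_link
    print(f"{gts/1e9:.0f} GT/s: {per_link/1e9:.0f} GB/s per link, "
          f"~{total/1e12:.1f} TB/s aggregate each way")

# Volume check from the Anton/Mitch exchange: bare 0.5 m boards alone.
side = 0.5                            # m
print(f"bare board cube: {side**3:.3f} m^3 (i.e. 1/8 m^3 before rack, "
      f"backplane, connectors, airflow)")

Under those assumptions the arrangement lands in the neighborhood of 1.5-2.6 TB/s of aggregate CPU<->DRAM bandwidth each way, which is the massive bandwidth the perpendicular board geometry is meant to buy.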