From: Robert Myers on 3 Jan 2010 14:21

On Jan 3, 6:44 am, n...(a)cam.ac.uk wrote:
> In article <7qacstFk...(a)mid.individual.net>,
> Del Cecchi <delcec...(a)gmail.com> wrote:

Well, actually, Robert Myers wrote:

> >Since I've argued that super-sized computers seem to me to be of
> >questionable value, maybe that's all the vast bulk of science really
> >needs. If I really need a computer with respectable bisection
> >bandwidth, I can skip waiting for a gigantic machine that runs at 5%
> >efficiency (or worse) and learn to live with whatever I can build
> >myself.
>
> Bisection bandwidth is the Linpack measurement of networking, but
> that doesn't affect your point.
>
> I favoured (and favour) the latency and bandwidth of all-to-all,
> both because it is more realistic and because it is invariant over
> the network topology and implementation.

Sure, and while we're at it, can I have a time-travel machine?

Rather than trying to fight every conceivable battle, I've chosen just
one where even the national labs and now even some of the endlessly-
reproduced bureaucrat charts have already conceded my point: limited
bisection bandwidth is a problem for FFTs.

If Vladimir Putin says he wants supercomputers and people think that
supercomputers are lots of Linpack flops, then that's what scientists
in Russia will be getting. It's what scientists in the USA are already
getting.

I've chosen a single number to focus on. Just like Linpack flops, it
simplifies a complicated story, but I'll be happy to take the story to
the level of complication I've been so persistent about.

> The only real point of the specialist HPC systems nowadays is when
> the problems are limited by communication and not processing, but
> it's more a situation of how you spend your money than differences
> in architecture. There isn't any major technical problem with using
> multiple InfiniBand ports on each node, and a suitable topology; the
> software can get interesting, though.

Yes, and if your goal is to claw your way to the top of the all-
important Top 500 list, you're not going to waste money on
communication that doesn't help with a Linpack benchmark.

> However, people who do that usually want SMALLER nodes, because that
> increases the network capacity relative to computation power. It's
> expensive to scale the former pro-rata to the latter, and is often
> not feasible with off-the-shelf components. I.e. they use single-
> socket nodes.

I'm not following your logic here. If I could have one big fat wire
running amongst huge nodes, rather than angel-hair spaghetti connecting
many more smaller nodes, why, exactly, would I prefer the latter? One
four- or eight-way Nehalem or successor, with or without GPGPU assist,
with a huge wire running to a massive switch. What better are you
proposing?

Robert.
From: Mayan Moudgill on 3 Jan 2010 14:51

nmm1(a)cam.ac.uk wrote:
>
> The CETEP/CASTEP/ONETEP series of ab initio quantum mechanics programs
> alternate between the solution of a large set of simultaneous equations
> and a very large 3-D FFT. The latter, like sorting, is limited by
> communication. It is (amongst other things) used for surface chemistry
> calculations, which are of immense industrial importance.

(I can't speak to 3D FFTs, so I'll restrict myself to the Cooley-Tukey
radix-2 1D FFT, which I do have experience with, and hopefully the
analysis will carry over.)

A 1-D FFT of size 2**lgN = N points over 2**lgM = M machines will do
lgN FFT stages. Of these, all but the last lgM are independent. The
last lgM stages can be done in different ways, but let's analyze the
case where every node exchanges information with its butterfly pair,
so that the entire array is copied for each of the last lgM stages.

Assume that it takes P sec per point for compute and U sec per point
for communication. Then, the time to compute a 1-D FFT with N points
serially will be:

    Ts = lgN*N*P

while the time to do it in parallel will be:

    Tp = lgN*N*P/M + lgM*U*N

What's U? Assume that the points are double-precision complex values
[i.e. 16 bytes], that the bandwidth per link is B bytes per second,
and that all nodes are connected to a central switch using 2 links
(one up, one down). Given the nature of 1-D FFT communication, all
links can be used simultaneously during the communication phase, for
a per-point time of:

    U = 16/(B*M)

Set B' = 16/B and substitute:

    Tp = lgN*N*P/M + lgM*N*B'/M

Communication can be overlapped with computation - as soon as some
number of results have been computed (128 complex doubles for normal
Ethernet, 512 for jumbo frames, YMMV based on OS syscall overhead and
actual network used), start sending them out. Ideally, this will
result in:

    Tp = (lgN-lgM)*N*P/M + lgM*N*B'/M
       = (1/M)*(lgN*N*P + lgM*N*(B'-P))

The reciprocal of the speed-up is:

    Tp/Ts = (1/M)*(1 + lgM*(B'-P)/(lgN*P))
          = 1/M + lgM*B'/(M*lgN*P) - lgM/(M*lgN)

Assume a 16-way machine: lgM=4, P=1e-9 s (1 ns per point), B=1e9 B/s
(assumes dual 10GbE, fairly tuned stacks).

For lgN = 30 (N ~ 1G points), we would end up with Ts = 32.2 sec and
Tp = 6.0 sec, of which 4.3 sec was communication: speedup = 5.33.
For a 64-way, assuming the same numbers, we end up with Tp = 2.0 sec,
of which 1.6 sec is communication: speedup = 16.
For 256-way, we would end up with Tp = 0.6 sec, of which 0.5 sec is
communication: speedup = 51.

Radix-2 1D-FFTs have the nice property that any machine needs to talk
to only lgM other machines, so the switch structure will have somewhat
better scaling properties than a general switch.

Anyway, Nick, you're right that things like FFTs are network-bandwidth
limited; however, it is still possible to get fairly good speedups.

Of course, I haven't analyzed the other approaches; e.g. do the last
lgM stages on one processor (so that there is only one communication
phase), or other points in the parallelism space.
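As a sanity check on the arithmetic, the timing model above fits in a few
lines of Python (a sketch using the same symbols as the post; the function
name fft_times and the default parameters are illustrative, not from the
original):

    # Sketch of the timing model in the post above.
    # P = compute time per point per stage (sec); B = per-link bandwidth
    # (bytes/sec); points are 16-byte complex doubles.
    def fft_times(lgN, lgM, P=1e-9, B=1e9, overlap=True):
        N, M = 2**lgN, 2**lgM
        Bp = 16.0 / B                        # B' = seconds per point per link
        Ts = lgN * N * P                     # serial time
        comm = lgM * N * Bp / M              # last lgM stages copy the array
        if overlap:                          # ideal overlap of comm with compute
            Tp = (lgN - lgM) * N * P / M + comm
        else:
            Tp = lgN * N * P / M + comm
        return Ts, Tp, comm, Ts / Tp

    for lgM in (4, 6, 8):
        Ts, Tp, comm, speedup = fft_times(30, lgM)
        print("M=2^%d: Ts=%.1fs Tp=%.2fs comm=%.2fs speedup=%.1f"
              % (lgM, Ts, Tp, comm, speedup))

Running it reproduces the figures quoted above (32.2 s serial; 6.0 s, 2.0 s
and 0.6 s parallel for 16, 64 and 256 nodes respectively).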
From: nmm1 on 3 Jan 2010 15:13

In article <u9mdnb0HtsTSaN3WnZ2dnUVZ_gKdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>(I can't speak to 3D FFTs, so I'll restrict myself to the Cooley-Tukey
>radix-2 1D FFT, which I do have experience with, and hopefully the
>analysis will carry over.)

In general, it doesn't, but it more-or-less does for the one you are
doing - which is NOT the way to do multi-dimensional FFTs! It is
almost always much faster to transpose, so you are doing vector
operations in the FFT at each stage.

>Communication can be overlapped with computation - ...

I am afraid not, when you are using 'commodity clusters'. Firstly,
even with the current offloading of the TCP/IP stack, there is still a
lot of CPU processing needed to manage the transfer, and obviously a
CPU can't be doing an FFT while doing that. Secondly, you have omitted
the cost of the scatter/gather, which again has to be done by the CPU.

>Assume a 16-way machine: lgM=4, P=1e-9 s, B=1e9 B/s (assumes dual
>10GbE, fairly tuned stacks).
>For lgN = 30 (N ~ 1G points), we would end up with Ts = 32.2 sec and
>Tp = 6.0 sec, of which 4.3 sec was communication: speedup = 5.33.
>For a 64-way, assuming the same numbers, we end up with Tp = 2.0 sec,
>of which 1.6 sec is communication: speedup = 16.
>For 256-way, we would end up with Tp = 0.6 sec, of which 0.5 sec is
>communication: speedup = 51.

Hmm. That's not the experience of the people I know who have tried it.
I haven't checked your calculations, so I am not saying whether or not
I agree with them.

>Anyway, Nick, you're right that things like FFTs are network-bandwidth
>limited; however, it is still possible to get fairly good speedups.

For multi-dimensional FFTs, certainly. I remain doubtful that you
would get results anywhere near that good for single-dimensional ones.
I certainly know that they are notorious SMP-system killers, and the
speedups obtained by vendors' libraries are not very good.

Regards,
Nick Maclaren.
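For readers unfamiliar with the transpose approach Nick refers to: each
node performs complete 1-D FFTs on locally held rows, so the butterfly
exchanges never cross the network, and all communication is concentrated
in a global transpose (an all-to-all). A minimal single-process sketch of
the idea for the 2-D case (assuming numpy; a distributed version would
replace the transpose with an MPI all-to-all, and the 3-D case adds one
more FFT/transpose round):

    import numpy as np

    def fft2_by_transpose(x):
        y = np.fft.fft(x, axis=1)   # 1-D FFTs along rows (all local work)
        y = y.T                     # global transpose (all-to-all when distributed)
        y = np.fft.fft(y, axis=1)   # 1-D FFTs along the new rows (local again)
        return y.T                  # transpose back to the original layout

    x = np.random.rand(256, 256) + 1j * np.random.rand(256, 256)
    assert np.allclose(fft2_by_transpose(x), np.fft.fft2(x))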
From: Mayan Moudgill on 3 Jan 2010 15:16

Robert Myers wrote:
> I assume that most who buy installations like Blue Gene would have RAS
> requirements that would be hard or impossible to meet with a Beowulf
> cluster. In the end, it's probably RAS that rules.

What kind of recovery are you looking for? Against node failures? I'd
suggest that distributed checkpointing with restart on failures could
be an adequate model for small-to-mid-size clusters.

BTW: does anyone have pointers to how IBM/Los Alamos/LLNL are planning
to handle failures on Roadrunner/BlueGene? An MTBF analysis, and a
discussion of failure-detection and recovery techniques would be nice.
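For concreteness, a minimal sketch of the application-level checkpoint-
and-restart idea (the file path and the init_state/step arguments are
made-up placeholders; a real distributed scheme would have every rank
write its local state at an agreed iteration and restart from the last
complete set of checkpoints):

    import os, pickle

    CKPT = "state.ckpt"   # illustrative checkpoint path

    def run(init_state, step, total_steps, ckpt_every=100):
        # Resume from the last checkpoint if one exists, else start fresh.
        if os.path.exists(CKPT):
            with open(CKPT, "rb") as f:
                i, state = pickle.load(f)
        else:
            i, state = 0, init_state()
        while i < total_steps:
            state = step(state)                       # one unit of work
            i += 1
            if i % ckpt_every == 0:                   # periodic checkpoint;
                with open(CKPT + ".tmp", "wb") as f:  # write-then-rename so a
                    pickle.dump((i, state), f)        # crash can't corrupt it
                os.replace(CKPT + ".tmp", CKPT)
        return state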
From: Mayan Moudgill on 3 Jan 2010 15:19
nmm1(a)cam.ac.uk wrote:
> In article <u9mdnb0HtsTSaN3WnZ2dnUVZ_gKdnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>(I can't speak to 3D FFTs, so I'll restrict myself to the Cooley-Tukey
>>radix-2 1D FFT, which I do have experience with, and hopefully the
>>analysis will carry over.)
>
> In general, it doesn't, but it more-or-less does for the one you are
> doing - which is NOT the way to do multi-dimensional FFTs! It is
> almost always much faster to transpose, so you are doing vector
> operations in the FFT at each stage.

The initial bit-reversal is trivial to parallelize (at least for 1D
FFTs). It will (at most) involve a single copy of the array over the
entire network.
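A sketch of why that single copy suffices, assuming the N points are
block-distributed over M nodes (function names are illustrative): the
bit-reversed destination of every point is a pure function of its index,
so each node can compute locally where each of its points goes, bin them
by target node, and do the whole permutation as one all-to-all exchange.

    def bit_reverse(i, lgN):
        # Reverse the low lgN bits of index i.
        r = 0
        for _ in range(lgN):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r

    def destination(i, lgN, lgM):
        j = bit_reverse(i, lgN)          # global index after the permutation
        node = j >> (lgN - lgM)          # high bits pick the owning node
        offset = j & ((1 << (lgN - lgM)) - 1)
        return node, offset

    # e.g. N = 2^4 points block-distributed over M = 2^2 nodes:
    print([destination(i, 4, 2) for i in range(16)])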