From: Robert Myers on 4 Jan 2010 13:33

On Jan 4, 11:08 am, Thomas Womack <twom...(a)chiark.greenend.org.uk> wrote:
>
> Can you give some idea of how large an FFT you want to do?
>

You want to do the biggest problem that you conceivably can, and, for the problems I know most about, one eighth of the total memory space would be a plausible goal. For the 64K-processor Blue Gene installation, you'd like to be able to make use of about 8K processors, not 512. For the bigger machines, you want to use more.

I answered under direct questioning here why I think doing such large transforms is an important goal for any machine that wants to advance fundamental science. To make a long story short, many interesting questions come down to this: for a strongly nonlinear system, how do the longest and shortest scales interact? Nature gives us enormous ranges of scales that we will never come close to computing. The best we can hope to do is to try to understand, from problems within reach, how that interaction goes.

Fourier-transform methods for representing differential operators are particularly attractive for such a question because they don't do artificial things to the smallest scales, as all non-global differencing schemes do. Important computations can be reduced to nothing but FFTs and some saxpy-type operations.

> > The advice of the Einstein of Cambridge (and the thousands of unnamed
> > others whom he will call to his witness) notwithstanding, the reason
> > is not far to find and could have been and was identified even from
> > the sketchiest of design documents, and it could have been fixed.
> > Even someone as obtuse as I am can follow that logic.
>
> OK, you are being gnomic rather than obtuse; please tell us the
> reason, and how you fix it without making the machine vastly more
> expensive or vastly less modular.

The bomb labs have known about this problem for a long time.
Machine bisection bandwidth is a reasonable predictor of performance on an FFT, and the massively powerful machines that are so breathlessly reported to the press have an embarrassing bisection bandwidth. The solution is straightforward, but not cheap: you need more network bandwidth. For machines that use a fat-tree architecture, the challenge is to build a switch with sufficient bandwidth. I assume that this issue has been discussed at length behind closed doors and that the national labs have decided they don't want to pay the price: they'd rather have more flops than bytes/second, even though bandwidth is the limiting factor in a huge range of problems. If you're scaling up the machine while letting bytes/flop drop to zero (which is what the "scalable" machines we now have do), the scalability is simply a fraud.

> > Maybe the world has changed so much that an ability to handle "naive"
> > but multidimensional data structures with great efficiency is no
> > longer so important. Nick *would* be in a much better position to
> > comment on that than I would. I know, just like you do, about a
> > narrow but important class of problems. Well, if physics is narrow.
>
> Naive, efficient and fast is a classic pick-two; insisting that
> physically-adjacent points live sixteen megabytes apart doesn't seem
> entirely naive to me.
>
> http://www.cse.scitech.ac.uk/disco/mew20/presentations/MFG.pdf is a
> huge bolus of undigestable information, but describes the performance
> profile on small-to-medium clusters of what seem to be a number of
> jobs that chemists (micro-scale physicists?) want to do; they have
> fairly big FFTs in and don't seem to be doing too badly.

If you build your own cluster, you can do the trades for yourself. If the nation is going to invest in a handful of ball-buster machines, you have to take what the committee decides to give you.
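[A back-of-the-envelope sketch of why bisection bandwidth dominates a distributed FFT: the transpose (all-to-all) step moves nearly the whole array across the machine, so roughly half of it crosses any bisection. The machine numbers below are illustrative assumptions, not the specs of any system discussed in the thread.]

```python
def fft_transpose_time(n_elements, bytes_per_element, bisection_bw_bytes_per_s):
    """Lower bound on one all-to-all (transpose) phase of a distributed FFT.

    In a transpose across P nodes, each node keeps only ~1/P of its data,
    so roughly half of the total array must cross the machine's bisection,
    whatever the topology.
    """
    total_bytes = n_elements * bytes_per_element
    return (total_bytes / 2) / bisection_bw_bytes_per_s

# Hypothetical machine: half a trillion complex-double elements
# (16 bytes each, 8 TB of data) and 1 TB/s of bisection bandwidth.
t = fft_transpose_time(0.5e12, 16, 1e12)  # seconds per transpose phase
```

On these assumed numbers a single transpose takes about four seconds, and a full 1D FFT needs more than one of them, which is why bytes/flop and not peak flops sets the wall-clock time.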
We're going ahead based on claims about global warming without understanding the basic science of an issue that pervades nearly every aspect of that problem, and we've been building machines that won't help.

Robert.
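[The "nothing but FFTs and some saxpy-type operations" reduction Myers describes can be sketched in a few lines: apply the differential operator by multiplication in Fourier space, so no local differencing stencil ever touches the smallest resolved scales. A minimal hypothetical 1D NumPy example, not any of the production codes in question:]

```python
import numpy as np

def spectral_derivative(u, L=2 * np.pi):
    """Differentiate a periodic sample u on [0, L) via the FFT.

    Exact for band-limited data: the operator acts identically on
    every resolved wavenumber, unlike a finite-difference stencil.
    """
    n = u.size
    k = 2j * np.pi * np.fft.fftfreq(n, d=L / n)  # spectral wavenumbers i*k
    return np.fft.ifft(k * np.fft.fft(u)).real

# Check against a known derivative: d/dx sin(x) = cos(x).
x = np.linspace(0, 2 * np.pi, 256, endpoint=False)
err = np.max(np.abs(spectral_derivative(np.sin(x)) - np.cos(x)))
```

A pseudo-spectral time step for a nonlinear PDE is then just a handful of these transforms plus axpy-style updates per step, which is why the whole computation inherits the FFT's communication pattern.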
From: Stephen Fuld on 4 Jan 2010 17:28

Andy "Krazy" Glew wrote:
>
> > You could use the provided hardware scatter-gather if you were astute
> > enough to use InfiniBand interconnect. :-)
> >
> > del
> >
> > you can lead a horse to water but you can't make him give up ethernet.
>
> Del:
>
> What's the story on InfiniBand?

Do you want to know the history of InfiniBand, or some details of what it was designed to do (and mostly does)?

--
- Stephen Fuld (e-mail address disguised to prevent spam)
From: Anne & Lynn Wheeler on 4 Jan 2010 17:40

Stephen Fuld <SFuld(a)alumni.cmu.edu.invalid> writes:
> Do you want to know the history of Infiniband or some details of what
> it was designed to do (and mostly does)?

A minor reference: SCI (an implementable subset of FutureBus)
http://en.wikipedia.org/wiki/Scalable_Coherent_Interface
eventually morphed into the current InfiniBand
http://en.wikipedia.org/wiki/InfiniBand

--
40+ yrs virtualization experience (since Jan 68), online at home since Mar 1970
From: Thomas Womack on 4 Jan 2010 18:54

In article <8a091340-7961-4a0a-baae-1265d2cc00f8(a)r24g2000yqd.googlegroups.com>, Robert Myers <rbmyersusa(a)gmail.com> wrote:

> On Jan 4, 11:08 am, Thomas Womack <twom...(a)chiark.greenend.org.uk> wrote:
>
> > Can you give some idea of how large an FFT you want to do?
>
> You want to do the biggest problem that you conceivably can, and, for
> the problems I know most about, one eighth of the total memory space
> would be a plausible goal. For the 64K processor Blue Gene
> installation, you'd like to be able to make use of about 8K
> processors, not 512. For the bigger machines, you want to use more.

The largest 1D FFT that I can find evidence of is implied by http://www.hpcs.is.tsukuba.ac.jp/~daisuke/pi.html - the size of the FFT isn't stated, but it will be either 25*2^35, storing three decimal digits per double-precision entry, or 15*2^35, storing five. The algorithm doubles the number of correct digits with each iteration, and each iteration involves about three full-length FFTs, so there are about a hundred FFTs, each on about half a trillion elements - each taking about twenty minutes.

Quoting Daisuke Takahashi (neither the footballer nor the figure-skater):

"Main program run:
Job start : 9th April 2009 07:37:32 (JST)
Job end : 10th April 2009 12:43:21 (JST)
Elapsed time : 29:05:49
Main memory : 13.5 TB
Algorithm : Gauss-Legendre algorithm

Programs were written by myself. The computer used was T2K Open Supercomputer (Appro Xtreme-X3 Server) at the Center for Computational Sciences, University of Tsukuba. 640 nodes of the total system (648 nodes, theoretical peak processing speed for the single node is 147.2 billion floating point operations per second. 95.4 trillion floating point operations per second for all nodes), were definitely used as single job and parallel processing for both of programs run."
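[The two candidate packings above can be checked with a few lines of arithmetic: both land on exactly the same digit count, about 2.58 trillion decimal digits, and a digit-doubling scheme at roughly three full-length FFTs per iteration gives "about a hundred" FFTs. A sketch of that check (the exact per-iteration FFT count is the thread's estimate, not a verified property of the record code):]

```python
import math

# Candidate packings for the implied transform length:
# 25 * 2**35 doubles at 3 decimal digits each, or
# 15 * 2**35 doubles at 5 decimal digits each.
size_a = 25 * 2**35            # 858_993_459_200 elements
size_b = 15 * 2**35            # 515_396_075_520 elements
digits_a = size_a * 3
digits_b = size_b * 5          # same total: 2_576_980_377_600 digits

# Gauss-Legendre doubles the correct digits each iteration,
# so ~log2(total digits) iterations, times ~3 full-length FFTs each.
iterations = math.ceil(math.log2(digits_a))   # 42
ffts = 3 * iterations                         # ~126, "about a hundred"
```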
The machine is (according to top500, and to the 147.2 = 2.3 GHz * 4 flops/cycle * 16 cores/node) a Myrinet-10G cluster of four-socket quad-core 2.3 GHz Opterons; I'd guess twenty to thirty million dollars, with a very fat switch in the middle, though the Myrinet documentation I can find suggests that their biggest routine switch is 512-way. 648 is 2^3 * 3^4; the coincidence of node counts makes me wonder whether the topology is something like the Kautz graphs that SiCortex used.

http://www.hpcs.is.tsukuba.ac.jp/~daisuke/pub.html indicates that Daisuke has also worked on 3D FFT implementations on this kind of hardware.

Tom
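[The peak-rate arithmetic quoted from the pi run and inferred above is internally consistent, as a quick check shows:]

```python
# Per-node peak: 2.3 GHz quad-socket quad-core Opterons, 4 flops/cycle/core.
ghz = 2.3
flops_per_cycle = 4
cores_per_node = 16
node_gflops = ghz * flops_per_cycle * cores_per_node   # 147.2 GFLOP/s/node

# Whole system: 648 nodes.
system_tflops = 648 * node_gflops / 1000               # ~95.4 TFLOP/s

# Node-count factorization behind the Kautz-graph speculation.
factored = 2**3 * 3**4                                 # 648
```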
From: Del Cecchi` on 4 Jan 2010 23:23
Robert Myers wrote:
(snip)
> If you build your own cluster, you can do the trades for yourself. If
> the nation is going to invest in a handful of ball-buster machines,
> you have to take what the committee decides to give you. We're going
> ahead based on claims about global warming without understanding the
> basic science of an issue that pervades nearly every aspect of that
> problem and we've been building machines that won't help.
>
> Robert.

How is your bisection bandwidth calculation affected by the reasonable amount of per-node memory on Blue Gene? As I understand it, the current BG/P node has 4 cores and 4 GB of memory.

del

Just noticed the bad reply-to address. Sorry, will fix in a minute for future use.
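[Del's question can be made concrete by combining the BG/P figure he cites (4 GB/node) with Myers's sizing rule from earlier in the thread (devote one eighth of total memory to the transform). A hypothetical sketch, with the function name and the complex-double element size as assumptions of this example:]

```python
def bgp_fft_budget(nodes, mem_per_node_gb=4.0, mem_fraction=1 / 8,
                   bytes_per_elem=16):
    """Apply the one-eighth-of-memory sizing rule to a BG/P-like machine.

    Returns the transform length in complex-double elements and the
    bytes each node must push through the network in one transpose
    (approximately its entire slice of the array).
    """
    total_mem = nodes * mem_per_node_gb * 2**30
    fft_bytes = total_mem * mem_fraction
    elements = fft_bytes / bytes_per_elem
    per_node_traffic = fft_bytes / nodes   # each node re-sends ~its whole slice
    return elements, per_node_traffic

# A 64K-node configuration: ~2.2e12 elements, ~512 MB of traffic per node
# per transpose, regardless of how many flops each node can do.
elements, traffic = bgp_fft_budget(65536)
```

The point the sketch makes is that per-node memory fixes how much data each node must ship, so adding cores without adding link bandwidth does not shorten the transpose at all.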