From: Robert Myers on 4 Jan 2010 10:00

On Jan 4, 4:56 am, Thomas Womack <twom...(a)chiark.greenend.org.uk> wrote:
> In article <b7dd84e0-0cbc-41c6-8a85-b51197c2d...(a)e27g2000yqd.googlegroups.com>,
> Robert Myers <rbmyers...(a)gmail.com> wrote:
>
> >On Jan 3, 11:25 pm, wclod...(a)lost-alamos.pet (William Clodius) wrote:
>
> >> I think Nick is saying that to improve locality it is essential to
> >> transpose the dimensions of the array as you cycle through each
> >> dimension in a multi-dimensional array in a multi-dimensional FFT.
>
> >But it sure wasn't necessary on a Cray-1--a side benefit of not having
> >cache and having easily worked-around limitations on an arbitrary
> >stride in memory. But, of course, none of this matters any more,
> >because no one has to bother with multi-dimensional FFT's. Or, at
> >least, they'd better not.
>
> I can't read this as other than obtuse; yes, you don't need to do
> these tricks if main memory is the same speed as your processor, and
> indeed if you're running on an i7 a job small enough to fit on a
> Cray-1 you probably don't have to do that many of the tricks - the
> level-3 cache is eight megabytes, the size of Cray-1 main memory,
> and the other caches are reasonably good at batching up accesses to
> L3.
>
> The place I work for does 3D FFTs from morning until night, but since
> nobody knows how to grow large flawless protein crystals, the _size_
> of the FFTs hasn't changed since it was difficult to fit them in a
> Cray-1; FFTs on data small enough to fit in the shared memory of a
> current cheap SMP are really quite fast and parallelise quite
> reasonably, and (much more importantly) Frigo & Johnson at MIT have
> done the parallelisation, and FFTW is not expensive even for
> commercial users.

So, your problems are different from mine. That doesn't justify your
use of an abusive term like "obtuse". Since you were so obtuse as not
to fill in the missing steps for yourself, let me fill them in for you.
We had a capability with the Cray-1 that we have lost. That that
capability was easy to supply because of the architecture of the
machine was never lost on me. Without an architecture like the Cray-1,
it is much, much harder, but it is a long way from impossible to do
much, much better than we now do.

For one thing, a modern desktop will accept problems much larger than
a Cray-1 would have. You *still* have to deal with the awkwardness of
cache lines. Why don't some people in HPC like cache (at least as it
is currently realized)? Well, there's one reason: a constant stride
through memory, at one time perfectly straightforward, is now a visit
to the memory controller/prefetch casino, and you *still* have to deal
with cache lines.

Blue Gene, by IBM's documents the last time I looked, could use only
about 512 processors effectively for multidimensional FFT's. That
means that, from the POV of a multidimensional FFT, 63,500 of your
processors are useless. The advice of the Einstein of Cambridge (and
the thousands of unnamed others whom he will call as his witnesses)
notwithstanding, the reason is not far to find; it could have been,
and was, identified even from the sketchiest of design documents, and
it could have been fixed. Even someone as obtuse as I am can follow
that logic.

Your advice is: if you have my problems and use the kinds of computers
that I use, your remark is stupid. My advice is: we've given up an
important capability, and, rather than trying to fix it, we've
redefined what's important.

Maybe the world has changed so much that an ability to handle "naive"
but multidimensional data structures with great efficiency is no
longer so important. Nick *would* be in a much better position to
comment on that than I would. I know, just like you do, about a narrow
but important class of problems. Well, if physics is narrow.

Robert.
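The locality trick Clodius attributes to Nick upthread - run the 1-D
FFTs along the contiguous axis, transpose, and repeat - can be sketched
in a few lines. This is an illustrative NumPy sketch (names and sizes
are mine, not anyone's production code): each 1-D pass is unit-stride,
and the explicit transpose between passes makes the next axis
contiguous in turn.

```python
import numpy as np

def fft2_via_transpose(a):
    """2-D FFT done as two unit-stride 1-D passes with a transpose between."""
    a = np.fft.fft(a, axis=-1)       # 1-D FFTs along the contiguous rows
    a = np.ascontiguousarray(a.T)    # transpose so columns become contiguous
    a = np.fft.fft(a, axis=-1)       # unit-stride 1-D FFTs again
    return a.T                       # undo the transpose

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
assert np.allclose(fft2_via_transpose(x), np.fft.fft2(x))
```

The same pattern extends to three dimensions with two transposes; the
point of contention in the thread is what those transposes cost once
the array no longer fits in cache (or on one node).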
From: Thomas Womack on 4 Jan 2010 11:08

In article <6fd837a7-52fb-40ac-b382-fa4417829549(a)m26g2000yqb.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:

>For one thing, a modern desktop will accept problems much larger than
>would have a Cray-1. You *still* have to deal with the awkwardness of
>cache lines. Why don't some people in HPC like cache (at least as it
>is currently realized)? Well, there's one reason. A constant stride
>through memory, at one time perfectly straightforward, is now a visit
>to the memory controller/prefetch casino, and you *still* have to deal
>with cache lines.

How fast are the computation fronts in your jobs, and how big are the
kernels? Bit-interleaving the addresses seems to be the standard trick
for dealing with two- and three-dimensional mainly-local jobs in
one-dimensional memories with caches; at least the cache lines then
contain data which is generally used together. I have no idea what the
equivalent trick is for jobs that divide space into non-uniform
polygons or polyhedra.

The use-17-instead-of-16 tricks still work, since you can often get
bank clashes inside the L2 cache; I ran various benchmarks of
[120..129]x[120..129]x[120..129] FFTs with FFTW, and was initially
slightly surprised to find that 128x128x128 was among the slowest. As
far as I know, FFTW doesn't let you specify the data layout, so you
can't tell it to do a 128x128x128 FFT on data stored in the top part
of a 129x129x128 box; I believe an early version of the manual said
that they had often found performance improvements by doing this but
couldn't figure out how to exploit them in a product with a
comprehensible interface.

>Blue Gene, by IBM's documents the last time I looked, could only use
>about 512 processors effectively for multidimensional FFT's. That
>means that, from the POV of a multidimensional FFT, 63,500 of your
>processors are useless.
I would say that it means you have to do 128 problems at a time - or
does that 512-way FFT eat up the whole of some resource that's
provided per-machine rather than per-rack? Can you give some idea of
how large an FFT you want to do?

>The advice of the Einstein of Cambridge (and the thousands of unnamed
>others whom he will call to his witness) notwithstanding, the reason
>is not far to find and could have been and was identified even from
>the sketchiest of design documents, and it could have been fixed.
>Even someone as obtuse as I am can follow that logic.

OK, you are being gnomic rather than obtuse; please tell us the
reason, and how you would fix it without making the machine vastly
more expensive or vastly less modular.

>Maybe the world has changed so much that an ability to handle "naive"
>but multidimensional data structures with great efficiency is no
>longer so important. Nick *would* be in a much better position to
>comment on that than I would. I know, just like you do, about a
>narrow but important class of problems. Well, if physics is narrow.

Naive, efficient and fast is a classic pick-two; insisting that
physically-adjacent points live sixteen megabytes apart doesn't seem
entirely naive to me.

http://www.cse.scitech.ac.uk/disco/mew20/presentations/MFG.pdf is a
huge bolus of indigestible information, but it describes the
performance profile, on small-to-medium clusters, of a number of jobs
that chemists (micro-scale physicists?) want to do; they have fairly
big FFTs in them and don't seem to be doing too badly.

Tom
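The use-17-instead-of-16 padding trick above is just a layout choice:
keep the logical 128-wide data inside a 129-wide allocation so that
consecutive rows are not a power-of-two stride apart and thus do not
all map to the same cache sets or banks. A minimal NumPy sketch of the
layout (the sizes mirror Thomas's example; nothing here measures the
actual performance effect):

```python
import numpy as np

N, PAD = 128, 129                 # logical width 128, allocated width 129
backing = np.zeros((N, PAD))      # padded allocation, one spare column
data = backing[:, :N]             # the logical 128x128 array, row stride 129

assert data.shape == (N, N)
# Row-to-row distance is 129 elements, not 128, so successive rows land
# in different cache sets instead of colliding at a power-of-two stride.
assert data.strides[0] == PAD * data.itemsize
```

As the post notes, the catch is that the FFT library has to accept this
strided layout; FFTW's basic interface (at the time) did not expose it.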
From: Terje Mathisen "terje.mathisen at tmsw.no" on 4 Jan 2010 11:43

Thomas Womack wrote:
> How fast are the computation fronts in your jobs, and how big the
> kernels? Bit-interleaving the addresses seems to be the standard
> trick for dealing with two- and three-dimensional mainly-local jobs in
> one-dimensional memories with caches; at least the cache lines now
> contain data which is generally used together.

This same trick is so useful that I have seen at least one modern core
which has dedicated instructions for 2D and 3D interleave. I'm
guessing that it wouldn't be too hard to use those operations to
support a few higher dimensions as well, with one or two more steps.

Many, many years ago it was used to tile geometry for a flight
simulator; it is probably still used like that.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
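For readers who haven't seen it, the 2D bit-interleave being discussed
(Morton or Z-order indexing) is commonly done in software with a short
shift-and-mask ladder. A sketch, assuming 16-bit coordinates packed
into a 32-bit index (the function names are mine):

```python
def part1by1(n):
    """Spread the low 16 bits of n out to the even bit positions."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2(x, y):
    """Interleave x into the even bits and y into the odd bits."""
    return part1by1(x) | (part1by1(y) << 1)

# Neighbouring (x, y) points get nearby indices, so a cache line holds
# a small square tile rather than a long thin row.
assert morton2(3, 5) == 0b100111
```

A 3D interleave is the same idea with a part-1-by-2 spreader and three
inputs; the dedicated instructions Terje mentions do the spreading in
one operation instead of the ladder above.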
From: Thomas Womack on 4 Jan 2010 12:31

In article <4fi917-n1q1.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

>Thomas Womack wrote:
>> How fast are the computation fronts in your jobs, and how big the
>> kernels? Bit-interleaving the addresses seems to be the standard
>> trick for dealing with two- and three-dimensional mainly-local jobs in
>> one-dimensional memories with caches; at least the cache lines now
>> contain data which is generally used together.
>
>This same trick is so useful that I have seen at least one modern core
>which has dedicated instructions for 2D and 3D interleave.

A previous draft mentioned Larrabee; obviously applying the 2D
interleave twice gets you 4D, and 2D+3D gets you 6D. I haven't figured
out how you get 5D.

There are obvious hakmem-like techniques for adding and subtracting
within a single dimension of interleaved numbers - adding propagates
carries across strings of 1s, subtracting across strings of 0s (mask,
insert 1s in the gaps in x leaving 0s in the gaps in y, add, mask
would do it). What I don't know is a clever process for doing the
conversion each way using 'standard' instructions; IIRC Larrabee
doesn't have an uninterleave operation - you interleave to get an
address for a texture sampler, and who would want to uninterleave
afterwards? You can't use a naive magic multiplier: 8N=2 and 64N=4 are
incompatible even mod 2^32.

>Many, many years ago it was used to tile geometry for a flight
>simulator, it is probably still used like that.

The ATI detailed documentation indicates that textures are stored in
an interleaved format, though I think it's not bitwise; it's something
like interleaved up to X[6], then Y[5..3] X[5..3] Y[2..0] X[2..0], but
I can't find the document at the moment; it's not in
http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

Tom
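The "insert 1s in the gaps, add, mask" recipe in the post can be made
concrete. A sketch for 32-bit 2D Morton codes (x in the even bits, y in
the odd bits); the increment must itself already be in interleaved
form, e.g. an x step of 1 is just the literal 1:

```python
X_MASK = 0x55555555  # even bit positions: the interleaved x coordinate
Y_MASK = 0xAAAAAAAA  # odd bit positions: the interleaved y coordinate

def morton_add_x(code, dx_interleaved):
    """Add an (already interleaved) x increment to a 2D Morton code."""
    # Fill y's bit positions with 1s so a carry out of any x bit ripples
    # straight across them to the next x bit; then mask the sum back
    # down to the x positions and restore the untouched y bits.
    x = ((code | Y_MASK) + dx_interleaved) & X_MASK
    return x | (code & Y_MASK)

# (x=3, y=0) is code 0b0101; stepping x to 4 gives 0b010000.
assert morton_add_x(0b0101, 1) == 0b010000
# (x=3, y=5) is code 39; stepping x to 4 gives 50, with y untouched.
assert morton_add_x(39, 1) == 50
```

Subtraction works the same way with the roles of the masks swapped, as
the post says: borrows propagate across the 0s you leave in the gaps.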
From: Terje Mathisen "terje.mathisen at tmsw.no" on 4 Jan 2010 13:25

Thomas Womack wrote:
> each way using 'standard' instructions; IIRC Larrabee doesn't have an
> uninterleave operation, you interleave to get an address for a texture
> sampler and who would want to uninterleave afterwards. Can't use a
> naive magic multiplier: 8N=2 and 64N=4 are incompatible even mod 2^32.

I think the proper response is simply "Don't do that!", i.e. make sure
you never need to uninterleave by keeping the normal coordinates
around.

>> Many, many years ago it was used to tile geometry for a flight
>> simulator, it is probably still used like that.
>
> The ATI detailed documentation indicates that textures are stored in
> an interleaved format, though I think it's not bitwise, it's something
> like interleaved up to X[6] then Y[5..3] X[5..3] Y[2..0] X[2..0], but

You would not want bitwise interleaving of the actual data; use the
bit-interleaved value as an index into the texture array instead
(i.e. loading 8-128 bits from each index). I think that flight
simulator used the interleaved index to load 256x256 pixel textures.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"