From: Robert Myers on 4 Jan 2010 10:00

On Jan 4, 4:56 am, Thomas Womack <twom...(a)chiark.greenend.org.uk> wrote:
> In article <b7dd84e0-0cbc-41c6-8a85-b51197c2d...(a)e27g2000yqd.googlegroups.com>,
> Robert Myers <rbmyers...(a)gmail.com> wrote:
>
> >On Jan 3, 11:25 pm, wclod...(a)lost-alamos.pet (William Clodius) wrote:
>
> >> I think Nick is saying that to improve locality it is essential to
> >> transpose the dimensions of the array as you cycle through each
> >> dimension in a multi-dimensional array in a multi-dimensional FFT.
>
> >But it sure wasn't necessary on a Cray-1--a side benefit of not having
> >cache and having easily worked-around limitations on an arbitrary
> >stride in memory. But, of course, none of this matters any more,
> >because no one has to bother with multi-dimensional FFT's. Or, at
> >least, they'd better not.
>
> I can't read this as other than obtuse; yes, you don't need to do
> these tricks if main memory is the same speed as your processor, and
> indeed if you're running on an i7 a job small enough to fit on a
> Cray-1 you probably don't have to do that many of the tricks - the
> level-3 cache is eight megabytes, the size of Cray-1 main memory,
> and the other caches are reasonably good at batching up accesses to
> L3.
>
> The place I work for does 3D FFTs from morning until night, but since
> nobody knows how to grow large flawless protein crystals, the _size_
> of the FFTs hasn't changed since it was difficult to fit them in a
> Cray-1; FFTs on data small enough to fit in the shared memory of a
> current cheap SMP are really quite fast and parallelise quite
> reasonably, and (much more importantly) Frigo & Johnson at MIT have
> done the parallelisation, and FFTW is not expensive even for
> commercial users.

So, your problems are different from mine. That doesn't justify your
use of an abusive term like "obtuse". Since you were so obtuse as not
to fill in the missing steps for yourself, let me fill them in for you.
We had a capability with the Cray-1 that we have lost. That that
capability was easy to supply because of the architecture of the
machine was never lost on me. Without an architecture like the Cray-1,
it is much, much harder, but it is a long way from impossible to do
much, much better than we now do.

For one thing, a modern desktop will accept problems much larger than
a Cray-1 would have. You *still* have to deal with the awkwardness of
cache lines. Why don't some people in HPC like cache (at least as it
is currently realized)? Well, there's one reason: a constant stride
through memory, at one time perfectly straightforward, is now a visit
to the memory controller/prefetch casino, and you *still* have to deal
with cache lines.

Blue Gene, by IBM's documents the last time I looked, could use only
about 512 processors effectively for multidimensional FFT's. That
means that, from the POV of a multidimensional FFT, 63,500 of your
processors are useless. The advice of the Einstein of Cambridge (and
the thousands of unnamed others whom he will call as his witnesses)
notwithstanding, the reason is not far to find; it could have been,
and was, identified even from the sketchiest of design documents, and
it could have been fixed. Even someone as obtuse as I am can follow
that logic.

Your advice is: if you have my problems and use the kinds of computers
that I use, your remark is stupid. My advice is: we've given up an
important capability, and, rather than trying to fix it, we've
redefined what's important.

Maybe the world has changed so much that an ability to handle "naive"
but multidimensional data structures with great efficiency is no
longer so important. Nick *would* be in a much better position to
comment on that than I would. I know, just like you do, about a narrow
but important class of problems. Well, if physics is narrow.

Robert.
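The locality trick Clodius attributes to Nick upthread - run the 1-D
FFTs along the contiguous axis, transpose, and repeat - can be sketched
in a few lines. This is an illustrative NumPy sketch (names and sizes
are mine, not anyone's production code): each 1-D pass is unit-stride,
and the explicit transpose between passes makes the next axis
contiguous in turn.

```python
import numpy as np

def fft2_via_transpose(a):
    """2-D FFT done as two unit-stride 1-D passes with a transpose between."""
    a = np.fft.fft(a, axis=-1)       # 1-D FFTs along the contiguous rows
    a = np.ascontiguousarray(a.T)    # transpose so columns become contiguous
    a = np.fft.fft(a, axis=-1)       # unit-stride 1-D FFTs again
    return a.T                       # undo the transpose

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
assert np.allclose(fft2_via_transpose(x), np.fft.fft2(x))
```

The same pattern extends to three dimensions with two transposes; the
point of contention in the thread is what those transposes cost once
the array no longer fits in cache (or on one node).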
From: Thomas Womack on 4 Jan 2010 11:08

In article <6fd837a7-52fb-40ac-b382-fa4417829549(a)m26g2000yqb.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:

>For one thing, a modern desktop will accept problems much larger than
>would have a Cray-1. You *still* have to deal with the awkwardness of
>cache lines. Why don't some people in HPC like cache (at least as it
>is currently realized)? Well, there's one reason. A constant stride
>through memory, at one time perfectly straightforward, is now a visit
>to the memory controller/prefetch casino, and you *still* have to deal
>with cache lines.

How fast are the computation fronts in your jobs, and how big are the
kernels? Bit-interleaving the addresses seems to be the standard trick
for dealing with two- and three-dimensional mainly-local jobs in
one-dimensional memories with caches; at least the cache lines then
contain data which is generally used together. I have no idea what the
equivalent trick is for jobs that divide space into non-uniform
polygons or polyhedra.

The use-17-instead-of-16 tricks still work, since you can often get
bank clashes inside the L2 cache; I ran various benchmarks of
[120..129]x[120..129]x[120..129] FFTs with FFTW, and was initially
slightly surprised to find that 128x128x128 was among the slowest. As
far as I know, FFTW doesn't let you specify the data layout, so you
can't tell it to do a 128x128x128 FFT on data stored in the top part
of a 129x129x128 box; I believe an early version of the manual said
that they had often found performance improvements by doing this but
couldn't figure out how to exploit them in a product with a
comprehensible interface.

>Blue Gene, by IBM's documents the last time I looked, could only use
>about 512 processors effectively for multidimensional FFT's. That
>means that, from the POV of a multidimensional FFT, 63,500 of your
>processors are useless.
I would say that it means you have to do 128 problems at a time - or
does that 512-way FFT eat up the whole of some resource that's
provided per-machine rather than per-rack? Can you give some idea of
how large an FFT you want to do?

>The advice of the Einstein of Cambridge (and the thousands of unnamed
>others whom he will call to his witness) notwithstanding, the reason
>is not far to find and could have been and was identified even from
>the sketchiest of design documents, and it could have been fixed.
>Even someone as obtuse as I am can follow that logic.

OK, you are being gnomic rather than obtuse; please tell us the
reason, and how you would fix it without making the machine vastly
more expensive or vastly less modular.

>Maybe the world has changed so much that an ability to handle "naive"
>but multidimensional data structures with great efficiency is no
>longer so important. Nick *would* be in a much better position to
>comment on that than I would. I know, just like you do, about a
>narrow but important class of problems. Well, if physics is narrow.

Naive, efficient and fast is a classic pick-two; insisting that
physically-adjacent points live sixteen megabytes apart doesn't seem
entirely naive to me.

http://www.cse.scitech.ac.uk/disco/mew20/presentations/MFG.pdf is a
huge bolus of indigestible information, but it describes the
performance profile, on small-to-medium clusters, of a number of jobs
that chemists (micro-scale physicists?) want to do; they have fairly
big FFTs in them and don't seem to be doing too badly.

Tom
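The use-17-instead-of-16 padding trick above is just a layout choice:
keep the logical 128-wide data inside a 129-wide allocation so that
consecutive rows are not a power-of-two stride apart and thus do not
all map to the same cache sets or banks. A minimal NumPy sketch of the
layout (the sizes mirror Thomas's example; nothing here measures the
actual performance effect):

```python
import numpy as np

N, PAD = 128, 129                 # logical width 128, allocated width 129
backing = np.zeros((N, PAD))      # padded allocation, one spare column
data = backing[:, :N]             # the logical 128x128 array, row stride 129

assert data.shape == (N, N)
# Row-to-row distance is 129 elements, not 128, so successive rows land
# in different cache sets instead of colliding at a power-of-two stride.
assert data.strides[0] == PAD * data.itemsize
```

As the post notes, the catch is that the FFT library has to accept this
strided layout; FFTW's basic interface (at the time) did not expose it.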
From: Terje Mathisen "terje.mathisen at tmsw.no" on 4 Jan 2010 11:43

Thomas Womack wrote:
> How fast are the computation fronts in your jobs, and how big the
> kernels? Bit-interleaving the addresses seems to be the standard
> trick for dealing with two- and three-dimensional mainly-local jobs in
> one-dimensional memories with caches; at least the cache lines now
> contain data which is generally used together.

This same trick is so useful that I have seen at least one modern core
which has dedicated instructions for 2D and 3D interleave. I'm
guessing that it wouldn't be too hard to use those operations to
support a few higher dimensions as well, with one or two more steps.

Many, many years ago it was used to tile geometry for a flight
simulator; it is probably still used like that.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
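For readers who haven't seen it, the 2D bit-interleave being discussed
(Morton or Z-order indexing) is commonly done in software with a short
shift-and-mask ladder. A sketch, assuming 16-bit coordinates packed
into a 32-bit index (the function names are mine):

```python
def part1by1(n):
    """Spread the low 16 bits of n out to the even bit positions."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2(x, y):
    """Interleave x into the even bits and y into the odd bits."""
    return part1by1(x) | (part1by1(y) << 1)

# Neighbouring (x, y) points get nearby indices, so a cache line holds
# a small square tile rather than a long thin row.
assert morton2(3, 5) == 0b100111
```

A 3D interleave is the same idea with a part-1-by-2 spreader and three
inputs; the dedicated instructions Terje mentions do the spreading in
one operation instead of the ladder above.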
From: Thomas Womack on 4 Jan 2010 12:31

In article <4fi917-n1q1.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

>Thomas Womack wrote:
>> How fast are the computation fronts in your jobs, and how big the
>> kernels? Bit-interleaving the addresses seems to be the standard
>> trick for dealing with two- and three-dimensional mainly-local jobs in
>> one-dimensional memories with caches; at least the cache lines now
>> contain data which is generally used together.
>
>This same trick is so useful that I have seen at least one modern core
>which has dedicated instructions for 2D and 3D interleave.

A previous draft mentioned Larrabee; obviously applying the 2D
interleave twice gets you 4D, and 2D+3D gets you 6D. I haven't figured
out how you get 5D.

There are obvious hakmem-like techniques for adding and subtracting
within a single dimension of interleaved numbers - adding propagates
carries across strings of 1s, subtracting across strings of 0s (mask,
insert 1s in the gaps in x leaving 0s in the gaps in y, add, mask
would do it). What I don't know is a clever process for doing the
conversion each way using 'standard' instructions; IIRC Larrabee
doesn't have an uninterleave operation - you interleave to get an
address for a texture sampler, and who would want to uninterleave
afterwards? You can't use a naive magic multiplier: 8N=2 and 64N=4 are
incompatible even mod 2^32.

>Many, many years ago it was used to tile geometry for a flight
>simulator, it is probably still used like that.

The ATI detailed documentation indicates that textures are stored in
an interleaved format, though I think it's not bitwise; it's something
like interleaved up to X[6], then Y[5..3] X[5..3] Y[2..0] X[2..0], but
I can't find the document at the moment; it's not in
http://developer.amd.com/gpu_assets/R700-Family_Instruction_Set_Architecture.pdf

Tom
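The "insert 1s in the gaps, add, mask" recipe in the post can be made
concrete. A sketch for 32-bit 2D Morton codes (x in the even bits, y in
the odd bits); the increment must itself already be in interleaved
form, e.g. an x step of 1 is just the literal 1:

```python
X_MASK = 0x55555555  # even bit positions: the interleaved x coordinate
Y_MASK = 0xAAAAAAAA  # odd bit positions: the interleaved y coordinate

def morton_add_x(code, dx_interleaved):
    """Add an (already interleaved) x increment to a 2D Morton code."""
    # Fill y's bit positions with 1s so a carry out of any x bit ripples
    # straight across them to the next x bit; then mask the sum back
    # down to the x positions and restore the untouched y bits.
    x = ((code | Y_MASK) + dx_interleaved) & X_MASK
    return x | (code & Y_MASK)

# (x=3, y=0) is code 0b0101; stepping x to 4 gives 0b010000.
assert morton_add_x(0b0101, 1) == 0b010000
# (x=3, y=5) is code 39; stepping x to 4 gives 50, with y untouched.
assert morton_add_x(39, 1) == 50
```

Subtraction works the same way with the roles of the masks swapped, as
the post says: borrows propagate across the 0s you leave in the gaps.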
From: Terje Mathisen "terje.mathisen at tmsw.no" on 4 Jan 2010 13:25

Thomas Womack wrote:
> each way using 'standard' instructions; IIRC Larrabee doesn't have an
> uninterleave operation, you interleave to get an address for a texture
> sampler and who would want to uninterleave afterwards. Can't use a
> naive magic multiplier: 8N=2 and 64N=4 are incompatible even mod 2^32.

I think the proper response is simply "Don't do that!", i.e. make sure
you never need to uninterleave by keeping the normal coordinates
around.

>> Many, many years ago it was used to tile geometry for a flight
>> simulator, it is probably still used like that.
>
> The ATI detailed documentation indicates that textures are stored in
> an interleaved format, though I think it's not bitwise, it's something
> like interleaved up to X[6] then Y[5..3] X[5..3] Y[2..0] X[2..0], but

You would not want bitwise interleaving of the actual data; use the
bit-interleaved value as an index into the texture array instead
(i.e. loading 8-128 bits from each index). I think that flight
simulator used the interleaved index to load 256x256 pixel textures.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"