From: Larry on 14 Mar 2010 18:56

I agree that the graph theory definition of bisection is not too useful as a figure of merit for computer system interconnects.

Roughly, it asks how many links you have to cut before one half of the machine can't talk to the other half, for an arbitrary assignment of nodes to halves.

First, in the case of things like Kautz and de Bruijn graphs, the known upper and lower bounds are far apart, making it less than useful; and second, the graph theory version of the problem doesn't take routing into account. The routes between nodes in the same partition may cross the bisection an even number of times, and the routes between nodes in opposite partitions may cross an odd number of times. The same problems can arise when using multiple paths even in simpler topologies like 3D meshes and tori like Cray XTn or BG/whatever.

However, better global communications helps with large scale applications, and it is worth trying to quantify. My personal favorite is an average of all-to-all bandwidth, with everyone talking to everyone. This is something like the communications pattern for HPCC Random Access. Random Access is latency insensitive, however. HPCC Random Ring is the next best choice, since both latency and bandwidth are measured. The "Random" part makes it fairly hard to game.

FFT and PTRANS are relatively poor choices. PTRANS is easy to game, and although it is hard to build a system that works well on FFT, it is not clear that such a system would help with other applications, or even work well on FFT unless you get the job layout exactly right.

This also suggests another issue: job layout, and what happens when you want to run more than one job at a time. How much degradation due to interference will there be?

Finally, I want to point out that good global communications can be done, it is just expensive. No magic is necessary. All you need to do is build a sufficiently large crossbar switch. These can have modular components, and it is "just" an engineering problem. Of course its cost goes as N**2 in the number of nodes.

Alternatively, you can build a fully configured fat tree, which only costs N log N. In either case, the cost of the communications is going to dominate the cost of the system at larger scale. You get log N latency with the fat tree, rather than constant, however.

This unfortunate tradeoff is why we keep getting 3D tori. They are easy to build, and they work for a range of scales, until they don't. SiCortex tried the Kautz graph, which scales better than the 3D torus, but is hard to wire.

So are there scalable solutions in which the communications costs only grow with the number of nodes? And which have low latency for all routes? One possibility is optical, where each node aims a laser at a mirror to hit the receiver on another node. There is collision avoidance work to be done, but there's no N**2 crossbar cost.

-L
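As a rough illustration of the crossbar-versus-fat-tree tradeoff described above, here is a small Python sketch. Only the scaling laws are taken from the post (N**2 crossbar cost with a constant number of switch traversals, roughly N log N fat tree cost with about log N hops each way); the cost constants and example sizes are arbitrary assumptions for illustration.

# Rough scaling sketch of the crossbar vs fat-tree tradeoff described above.
# The absolute constants are arbitrary (assumptions for illustration); only
# the scaling laws come from the post: crossbar cost ~ N**2 with constant
# hop count, fat tree cost ~ N*log2(N) with ~2*log2(N) hop latency.

import math

def crossbar_cost(n, cost_per_crosspoint=1.0):
    # One crosspoint per (input, output) pair.
    return cost_per_crosspoint * n * n

def fat_tree_cost(n, cost_per_link=1.0):
    # A fully provisioned (non-blocking) fat tree needs on the order of
    # N*log2(N) links/router ports.
    return cost_per_link * n * math.log2(n)

def crossbar_hops(n):
    return 1                            # one switch traversal, independent of N

def fat_tree_hops(n):
    return 2 * math.ceil(math.log2(n))  # up to the common ancestor and back down

if __name__ == "__main__":
    for n in (64, 1024, 16384, 262144):
        print(f"N={n:7d}  crossbar cost ~{crossbar_cost(n):.3e}  "
              f"fat tree cost ~{fat_tree_cost(n):.3e}  "
              f"hops: {crossbar_hops(n)} vs {fat_tree_hops(n)}")

Running it shows the familiar picture: the crossbar's cost pulls away from the fat tree's by orders of magnitude long before the fat tree's extra log N hops matter much.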
From: nmm1 on 15 Mar 2010 05:05

In article <826b10f7-be18-48f1-b1f2-42b0bbebd52f(a)t41g2000yqt.googlegroups.com>, Larry <lstewart2(a)gmail.com> wrote:

>I agree that the graph theory definition of bisection is not too
>useful as a figure of merit for computer system interconnects.
>
>Roughly, it asks how many links you have to cut before one half of the
>machine can't talk to the other half, for an arbitrary assignment of
>nodes to halves.

Yes, but even that has maximal, minimal and average forms, for as many meanings of average as you care to choose!

>In the case of things like Kautz and de Bruijn graphs, the known upper
>and lower bounds are far apart, making it less than useful, and
>second, the graph theory version of the problem doesn't take routing
>into account. The routes between nodes in the same partition may
>cross the bisection an even number of times, and the routes between
>nodes in opposite partitions may cross an odd number of times.

Yes, indeed. There are graph theoretic measures that do take account of such things, but they have been dropped by most computer scientists because they are mathematically harder. Network flow theory goes back a long way, and is very important in many fields.

>My personal favorite is an average of all-to-all bandwidth with
>everyone talking to everyone. ...

I started using that about 1997, and found it very reliable. What I used was (roughly):

  Point-to-point peak bandwidth and latency, to evaluate what a single pair of ports could do.
  All-to-all bandwidth and latency, to evaluate what a fully loaded network would deliver.

It was interesting how often the figures I got were a lot smaller than the ones quoted by the vendor, even for quite small networks. I could get their figures easily enough by using a bisection measurement, but could get even lower figures by using another bisection measurement (chosen with malice aforethought).

As you say, the great advantage of all-to-all is that it is a well-defined average value, typical of the use in many applications.

Regards,
Nick Maclaren.
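A minimal sketch of the two measurements Nick describes, assuming mpi4py and NumPy are available. The message size, repetition count, and reporting format are arbitrary choices of mine, not anything from the original benchmarks.

# Point-to-point (one pair of ports) and all-to-all (fully loaded network)
# bandwidth, in the spirit of the measurements described above.
# MSG_BYTES and REPS are arbitrary assumptions.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

MSG_BYTES = 1 << 20     # 1 MiB per destination (assumption)
REPS = 20

# --- point-to-point ping-pong between ranks 0 and 1 ---
buf = np.zeros(MSG_BYTES, dtype=np.uint8)
comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
t1 = MPI.Wtime()
if rank == 0:
    bw = 2 * REPS * MSG_BYTES / (t1 - t0)
    print(f"point-to-point: {bw / 1e9:.2f} GB/s")

# --- all-to-all: everyone talking to everyone ---
send = np.zeros(size * MSG_BYTES, dtype=np.uint8)
recv = np.empty_like(send)
comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    comm.Alltoall(send, recv)
t1 = MPI.Wtime()
if rank == 0:
    per_rank = REPS * size * MSG_BYTES / (t1 - t0)
    print(f"all-to-all: {per_rank / 1e9:.2f} GB/s injected per rank")

Launch with something like "mpirun -n 64 python bench.py" (the file name is just an example). As Nick notes, the interesting result is usually how far the loaded all-to-all figure falls below what the point-to-point number would suggest.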
From: Terje Mathisen "terje.mathisen at tmsw.no" on 15 Mar 2010 05:17

Larry wrote:
> Finally, I want to point out that good global communications can be
> done, it is just expensive. No magic is necessary. All you need to
> do is build a sufficiently large crossbar switch. These can have
> modular components, and it is "just" an engineering problem. Of
> course its cost goes as N**2 in the number of nodes.
>
> Alternatively, you can build a fully configured fat tree, which only
> costs N log N. In either case, the cost of the communications is going
> to dominate the cost of the system at larger scale. You get log N
> latency with the fat tree, rather than constant, however.

Hmmm?

I would assume that when you get big enough, even a very large crossbar could not scale much better than sqrt(N), since that is the increase in wire distance?

> This unfortunate tradeoff is why we keep getting 3D tori. They are
> easy to build, and they work for a range of scales, until they don't.
> SiCortex tried the Kautz graph, which scales better than the 3D torus,
> but is hard to wire.
>
> So are there scalable solutions in which the communications costs only
> grow with the number of nodes? And which have low latency for all
> routes? One possibility is optical, where each node aims a laser at a
> mirror to hit the receiver on another node. There is collision
> avoidance work to be done, but there's no N**2 crossbar cost.

The micro-mirror laser setup would seem to need a separate network to handle setup of the links. If you want to steer the mirrors between every exchange, then the mirror latency would probably dominate.

TANSTAAFL

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
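To put a number on the wire-distance argument: if N endpoints are packed into a plane, the physical span of the machine, and with it the worst-case wire through any central switch, grows roughly as sqrt(N). In the sketch below, the floor footprint per node and the signal velocity are assumptions chosen only to make the numbers concrete.

# Back-of-the-envelope latency floor from physical machine span.
# NODE_FOOTPRINT_M2 and SIGNAL_SPEED are assumptions, not measured values.

NODE_FOOTPRINT_M2 = 0.25        # floor space per node, in m^2 (assumption)
SIGNAL_SPEED = 2.0e8            # ~2/3 c in copper or fibre, in m/s (assumption)

def planar_span_m(n):
    # Side length of a square floor plan holding n nodes.
    return (n * NODE_FOOTPRINT_M2) ** 0.5

def wire_latency_ns(n):
    # One worst-case traversal across the floor plan.
    return planar_span_m(n) / SIGNAL_SPEED * 1e9

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"N={n:8d}  span ~{planar_span_m(n):7.1f} m  "
          f"wire latency ~{wire_latency_ns(n):7.1f} ns")

This is the sqrt(N) floor Terje is pointing at; packing in three dimensions, as Nick suggests in the next post, only softens it to roughly N**(1/3).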
From: nmm1 on 15 Mar 2010 05:54

In article <jha177-fiv2.ln1(a)ntp.tmsw.no>, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

>Larry wrote:
>> Finally, I want to point out that good global communications can be
>> done, it is just expensive. No magic is necessary. All you need to
>> do is build a sufficiently large crossbar switch. These can have
>> modular components, and it is "just" an engineering problem. Of
>> course its cost goes as N**2 in the number of nodes.
>>
>> Alternatively, you can build a fully configured fat tree, which only
>> costs N log N. In either case, the cost of the communications is going
>> to dominate the cost of the system at larger scale. You get log N
>> latency with the fat tree, rather than constant, however.
>
>Hmmm?
>
>I would assume that when you get big enough, even a very large crossbar
>could not scale much better than sqrt(N), since that is the increase in
>wire distance?

Perhaps N^(1/3) or similar. If the power requirements were cut to something reasonable, a lot could be done with layering. For example, a purely passive copper+insulator chip could be attached to deliver all long-distance wiring. It might even be feasible with no power reduction, and I assume that the construction people are considering it.

Regards,
Nick Maclaren.
From: MitchAlsup on 15 Mar 2010 12:35
On Mar 14, 1:25 pm, an...(a)mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> MitchAlsup <MitchAl...(a)aol.com> writes:
> >Consider a 1/2 meter sq motherboard with "several" CPU nodes with 16
> >bidirectional (about) byte wide ports running at 6-10 GT/s. Now
> >consider a back plane that simply couples this 1/2 sq meter
> >motherboard to another 1/2 sq meter DRAM carrying board also with 16
> >bidirectional (almost) byte wide ports running at the same
> >frequencies. Except, this time, the DRAM boards are perpendicular to
> >the CPU boards. With this arrangement, we have 16 CPU containing
> >motherboards fully connected to 16 DRAM containing motherboards and
> >256 (almost) byte wide connections running at 6-10 GT/s. 1 cubic meter,
> >about the actual size of a refrigerator.
>
> I compute 1/2m x 1/2m x 1/2m = 1/8 m^3.
>
> Where have I misunderstood you?

Now that you point this out, I end up agreeing with you. But consider adding space around each side of the boards to hold the board in place (the rack), provide connections (the backplane) and other communications ports (front side connectors) on both the CPU boards and the DRAM boards; finally, add walls to isolate and control air flow, and space for the fans and things, and I think you might be getting close to a cubic meter. However, the way I explained it does lead to confusion. Sorry.

> But that size is the size of a small freezer around here (typical
> width 55-60cm, depth about the same, and the height of the small ones
> is around the same, with the normal-sized ones at about 1m height).
>
> Hmm, couldn't you have DRAM boards on both sides of the mainboard (if
> you find a way to mount the mainboard in the middle and make it strong
> enough). Then you can have a computer like a normal-size fridge:-).

{I believe} You need the DRAM boards to be perpendicular to the CPU boards in order to create the massive bandwidth between the CPUs and DRAMs. With 16 boards of each flavor, you have 256 independent FB-DIMM-like channels providing the BW. The interconnecting "backplane" only has wires connecting one port of one board with one port of another straight through {with maybe some power, gnd, clocks,...}. I don't see how to get this kind of BW by putting the DRAMs on the main boards.

Conceptually one could put the DRAMs on the mainboards, but in this case the wiring of those 256 interconnecting ports would then require actual wires (on the backplane) instead of simple through connections across a backplane as described above.

So, in essence, the difference is that the ports from CPU<->DRAM cross the backplane with local (short) wiring connecting one connector to another in a perpendicular direction, with essentially no wiring elsewhere on the backplane. Thus, the backplane could be implemented in (maybe) 3 routing layers.

Mitch
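For what it's worth, here is Mitch's arithmetic worked through in a short Python sketch, with my assumptions called out: 16 x 16 boards give 256 crossing channels, "(almost) byte wide" is taken as 1 byte of payload per transfer, and the 6-10 GT/s range is used as given.

# Worked numbers for the perpendicular CPU/DRAM board arrangement above.
# BYTES_PER_TRANSFER is my simplification of "(almost) byte wide".

CPU_BOARDS = 16
DRAM_BOARDS = 16
BYTES_PER_TRANSFER = 1               # assumption: 1 B of payload per transfer

links = CPU_BOARDS * DRAM_BOARDS     # 256 CPU<->DRAM channels

for gts in (6e9, 10e9):
    per_link = gts * BYTES_PER_TRANSFER          # B/s each way per channel
    total = links * per_link
    print(f"{gts/1e9:.0f} GT/s: {per_link/1e9:.0f} GB/s per link, "
          f"~{total/1e12:.1f} TB/s aggregate each way")

# Volume check from the Anton/Mitch exchange: bare 0.5 m boards alone.
side = 0.5                            # m
print(f"bare board cube: {side**3:.3f} m^3 (i.e. 1/8 m^3 before rack, "
      f"backplane, connectors, airflow)")

Under those assumptions the arrangement lands in the neighborhood of 1.5-2.6 TB/s of aggregate CPU<->DRAM bandwidth each way, which is the massive bandwidth the perpendicular board geometry is meant to buy.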