From: Robert Myers on
On Mar 31, 1:04 pm, Del Cecchi <delcec...(a)gmail.com> wrote:
>
> There certainly seems to be a spectrum of applications that can use a
> spectrum of bandwidth, and it would be unnecessary to spend a lot of
> money on interconnect and packaging if the problem can be solved with
> PCs connected together with string.  Whether all protein folding and
> genomics software falls into this category is beyond me, but some seems
> to.
>
> There are other problems which require both lots of flops and lots of
> bandwidth.  Flops are cheap.  Bandwidth, especially bandwidth in a
> network with many nodes, is expensive.
>
> The various generations of Blue Gene seem to have increased bandwidth,
> although I'm not sure whether it has kept up with the flops.  More
> bandwidth can be had especially if the nodes are quite fat, like blue
> waters.  BTW thanks for the pointer to that, it was interesting.
>
> I tend to think the technical folks are doing the best they can, whilst
> you apparently think there is some kind of dominant hidden agenda in the
> form of sort of a farm program for supercomputers where the government
> spends a lot of money on useless boxes good only for publicity and IBM's
> bottom line.
>
> Custom hardware is pretty expensive.  So it seems that most supers are
> built out of pieces parts or technologies that were developed for
> something else.   For example, I think the PowerPC cores in BG/L were
> 440s left over from the embedded business.  The interconnect physical
> layer was also taken from some other project, but I forget which.
> Likewise the packaging, power, and cooling were derivatives of earlier
> stuff with some new ideas.

I'm not quite as simple-minded as my posts might make me out to be. I
am deliberately over-simplifying the situation to make my case clear.

Blue Gene and similar would be useful for a class of problems that
need lots of flops and very little global bandwidth and that are
relatively sensitive to inter-node latency. That would include some
kinds of fluid-mechanical modeling--the kind I personally do not trust
to be quantitatively correct.

I made the post mostly because you asked how else, other than
something like Blue Gene, you could do a petaflop. Like everyone
else, you mentioned no requirements other than flops. That is the
bureaucratic mantra at which I have been taking aim for years. You
need flops? The nation is already drowning in flops, so let's hear no
more about flops without qualification. A PS-3 is a pretty fat node.

Like it or not, the defense establishment does implement an industrial
policy, from which IBM has benefited greatly. Maybe, as a nation, we
need that, because, for example, no one else is going to be building 3-
D chips. I'm not saying the policy is wrong, I am merely pointing to
what seem to me to be obvious shortcomings and conflicts of interest.

Most people, technical or otherwise, are interested in their next
paycheck. If the government paid IBM to dump money into the Pacific
Ocean, it would salute smartly, do as told, and hope that no one
found out, so long as there was a profit in it. Technical people
follow orders or lose their jobs. It isn't a matter of competence.
It's a question of survival, and the game has been rigged by a
magister ludi whose business is to set goals that can be met and then
advertised to the executive and legislative branches as successfully
met through competent management and leadership, leading to career
continuation and higher paychecks.

If you get imaginative and try to build an F-35 and get behind
schedule and over budget (which is almost inevitable), someone with a
few stars on his epaulets might lose his job--a lesson lost on no
bureaucrat who is destined to rise in the ranks. It's not evil. It's
just life.

I just threw a new question out. How much of this couldn't be done on
computers we have already paid for? The purpose isn't to demean IBM
or Blue Gene. The purpose is to get people to talk about something
more meaningful than flops.

I found this document

http://www.sandia.gov/~rcmurph/doc/latency.pdf,

which comes with the take-away message, "This paper compares the memory
performance sensitivity of both traditional and emerging HPC
applications, and shows that the new codes are significantly
more sensitive to memory latency and bandwidth than their
traditional counterparts."

The paper also concludes that performance is more sensitive to latency
than to bandwidth, but nothing is said (so far as I can tell) about
the role of the interconnecting fabric. This gets boiled down to: we
need to continue hammering away on memory-to-processor latency, which
further gets boiled down to: latency is more important than
bandwidth. If that's all you have time to remember, you're not going
to pay attention to global network bandwidth, which is apparently very
expensive, even though there may be little point (from a scientific
perspective) in building computers with a wimpy global interconnect.
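
To make concrete what "more sensitive to latency than to bandwidth"
means at the single-node level, here is a toy probe of my own (not
from the paper; the 64 MB size is arbitrary, just well past any
cache). A dependent pointer chase pays a full memory round trip on
every load, while a sequential sum can be prefetched and so runs at
whatever bandwidth the machine can sustain:

/* Toy latency-vs-bandwidth probe -- illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(size_t))

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t *a = malloc(N * sizeof *a);
    size_t i, j, t;

    /* Sattolo's algorithm: a single-cycle random permutation, so
       every load depends on the one before it. */
    for (i = 0; i < N; i++) a[i] = i;
    for (i = N - 1; i > 0; i--) {
        j = rand() % i;
        t = a[i]; a[i] = a[j]; a[j] = t;
    }

    double t0 = seconds();
    for (i = 0, j = 0; i < N; i++) j = a[j];   /* serial: pure latency */
    double chase = seconds() - t0;

    t0 = seconds();
    size_t sum = 0;
    for (i = 0; i < N; i++) sum += a[i];       /* streamable: bandwidth */
    double stream = seconds() - t0;

    printf("chase  %.1f ns/access\nstream %.1f ns/access  (%zu %zu)\n",
           chase * 1e9 / N, stream * 1e9 / N, j, sum);
    return 0;
}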

Robert.

From: Morten Reistad on
In article <4ae8b20e-d931-4112-bc07-f406b25f082e(a)e7g2000yqf.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>On Mar 31, 1:04 pm, Del Cecchi <delcec...(a)gmail.com> wrote:
>>

>I found this document
>
>http://www.sandia.gov/~rcmurph/doc/latency.pdf,
>
>which comes with the take-away message, "This paper compares the memory
>performance sensitivity of both traditional and emerging HPC
>applications, and shows that the new codes are significantly
>more sensitive to memory latency and bandwidth than their
>traditional counterparts."

This fits our (very commercial) observations very well. The size
of the L2 cache is the defining load limiter in more than half
the system benchmarks we have done. HyperTransport proves a huge
win by interconnecting the L2 caches on different chips.
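
You can see that limiter with a one-page sweep (a sketch assuming
Linux and gcc; the sizes are invented): time one access while the
working set grows, and watch the cost jump once the set spills out
of L2. Hardware prefetch softens the step; a random walk within the
set makes it sharper.

/* Working-set sweep: cost per access vs. array size -- illustrative.
   Strided by one 64-byte line; accesses wrap around the set. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    for (size_t kb = 16; kb <= 32 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(size_t);
        size_t *a = malloc(n * sizeof *a);
        for (size_t i = 0; i < n; i++) a[i] = i;   /* touch every page */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        volatile size_t sum = 0;
        size_t accesses = 64 * 1024 * 1024;
        for (size_t i = 0, k = 0; i < accesses; i++, k = (k + 8) % n)
            sum += a[k];                           /* 8 size_t = one line */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec)) / accesses;
        printf("%6zu KB  %.2f ns/access\n", kb, ns);
        free(a);
    }
    return 0;
}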

>The paper also concludes that performance is more sensitive to latency
>than to bandwidth, but nothing is said (so far as I can tell) about
>the role of the interconnecting fabric. This gets boiled down to: we
>need to continue hammering away on memory-to-processor latency, which
>further gets boiled down to: latency is more important than
>bandwidth. If that's all you have time to remember, you're not going
>to pay attention to global network bandwidth, which is apparently very
>expensive, even though there may be little point (from a scientific
>perspective) in building computers with a wimpy global interconnect.

Now, can we attack this from a simpler perspective; can we make
the L2-memory interaction more intelligent? Like actually make
a paging system for it? Paging revolutionised the disk-memory
systems, remember?
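
Something in that spirit, sketched in plain C (the names and sizes
are invented for the example; real hardware would want tags and
TLB-like machinery rather than a linear scan): a handful of fast
frames front a big slow array, misses copy a page in, and LRU picks
the victim. It is the disk-paging playbook moved up one level.

/* Software-managed "L2 paging" sketch -- illustrative only. */
#include <stdio.h>
#include <string.h>

#define PAGE       4096                /* bytes per page */
#define NFRAMES    8                   /* "fast memory" = 8 pages */
#define SLOW_PAGES 256                 /* "slow memory" = 1 MB */

static char slow_mem[SLOW_PAGES][PAGE];    /* stand-in for DRAM */
static char fast_mem[NFRAMES][PAGE];       /* stand-in for on-chip store */
static int  frame_page[NFRAMES];           /* which slow page a frame holds */
static unsigned long frame_last_use[NFRAMES];
static unsigned long now;

static char *touch(int page, int offset)
{
    int f, victim = 0;
    now++;
    for (f = 0; f < NFRAMES; f++)
        if (frame_page[f] == page) {           /* hit in fast memory */
            frame_last_use[f] = now;
            return &fast_mem[f][offset];
        }
    for (f = 1; f < NFRAMES; f++)              /* miss: pick LRU victim */
        if (frame_last_use[f] < frame_last_use[victim]) victim = f;
    if (frame_page[victim] >= 0)               /* write back old page */
        memcpy(slow_mem[frame_page[victim]], fast_mem[victim], PAGE);
    memcpy(fast_mem[victim], slow_mem[page], PAGE);
    frame_page[victim] = page;
    frame_last_use[victim] = now;
    return &fast_mem[victim][offset];
}

int main(void)
{
    int i;
    for (i = 0; i < NFRAMES; i++) frame_page[i] = -1;
    for (i = 0; i < 1000; i++)                 /* scattered walk */
        *touch((i * 37) % SLOW_PAGES, 0) = (char)i;
    printf("done; page 37 holds %d\n", *touch(37, 0));
    return 0;
}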

In terms of network switches: the performance of 10G switches
is pretty impressive, but there is some latency reaching the
10G card over whatever I/O bus is used. Direct CPU-attached 10G
links, anyone?

16-way Xeon-style systems are becoming pretty standard shelfware,
and 10G switches are getting there. Combining these should be
able to fill a niche above the "P3-via-dsl" networks, and still
only cost a minor fortune, no?

-- mrr
From: Robert Myers on
Morten Reistad wrote:


> Now, can we attack this from a simpler perspective; can we make
> the L2-memory interaction more intelligent? Like actually make
> a paging system for it? Paging revolutionised the disk-memory
> systems, remember?

I think this suggestion is equivalent to putting "main memory" on the
chip and treating what once was main memory like a disk drive. One can
imagine inheriting all the wisdom and benefits of disk caching.

I'm guessing that stacked 3-D chips could play a big role here. That
technology will first appear only in uber-expensive computers, but, like
every other technology, it will eventually find its way into the hands
of mortals. Graphical virtual reality applications will drive it, if
nothing else will.

> In terms of network switches: the performance of 10G switches
> is pretty impressive, but there is some latency reaching the
> 10G card over whatever I/O bus is used. Direct CPU-attached 10G
> links, anyone?

This is the kind of thinking the taxpayers should be spending more of
their hard-earned dollars on--not just more of the same.

> 16-way Xeon-style systems are becoming pretty standard shelfware,
> and 10G switches are getting there. Combining these should be
> able to fill a niche above the "P3-via-dsl" networks, and still
> only cost a minor fortune, no?

Yes, and even more off-the-shelf muscle is on the way. Thus, my advice
about unimaginative big systems: better to wait a few years.

I don't know where you reach the point of diminishing returns for global
bandwidth (with acceptable latency) in making nodes fatter, but
off-the-shelf hardware can build really fat nodes.

Robert.


From: Tim McCaffrey on
In article <It7tn.4459$iL1.992(a)newsfe24.iad>, rbmyersusa(a)gmail.com says...
>
>Morten Reistad wrote:
>
>
>> Now, can we attack this from a simpler perspective; can we make
>> the L2-memory interaction more intelligent? Like actually make
>> a paging system for it? Paging revolutionised the disk-memory
>> systems, remember?
>
>I think this suggestion is equivalent to putting "main memory" on the
>chip and treating what once was main memory like a disk drive. One can
>imagine inheriting all the wisdom and benefits of disk caching.
>
>I'm guessing that stacked 3-D chips could play a big role here. That
>technology will first appear only in uber-expensive computers, but, like
>every other technology, it will eventually find its way into the hands
>of mortals. Graphical virtual reality applications will drive it, if
>nothing else will.
>
>> In terms of network switches: the performance of 10G switches
>> is pretty impressive, but there is some latency reaching the
>> 10G card over whatever I/O bus is used. Direct CPU-attached 10G
>> links, anyone?
>
>This is the kind of thinking the taxpayers should be spending more of
>their hard-earned dollars on--not just more of the same.
>

The PCIe 2.0 links on the Clarkdale chips run at 5 GT/s. You could
interconnect CPUs with a non-transparent bridge.
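
For what it's worth, the user-space end of that on Linux is roughly
the following (the PCI address and window size are made up, and the
NTB's address-translation and doorbell registers would have to be
set up first, which I am waving away): map the bridge's BAR and
store into it, and the writes land in the peer node's memory.

/* Minimal user-space peek at an NTB-style window -- a sketch, not a
   driver. The PCI address below is invented. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WIN_SIZE (1 << 20)   /* assume a 1 MB translated window */

int main(void)
{
    /* BAR of the (hypothetical) non-transparent bridge function */
    int fd = open("/sys/bus/pci/devices/0000:03:00.1/resource2", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint8_t *win = mmap(NULL, WIN_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) { perror("mmap"); return 1; }

    /* stores go out as PCIe writes and land in the peer's memory */
    memcpy((void *)win, "hello, other node", 18);

    munmap((void *)win, WIN_SIZE);
    close(fd);
    return 0;
}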

- Tim

From: MitchAlsup on
On Apr 1, 5:40 pm, timcaff...(a)aol.com (Tim McCaffrey) wrote:
> The PCIe 2.0 links on the Clarkdale chips run at 5 GT/s.

And how many dozen meters can these wires run?

Mitch