From: Robert Myers on 31 Mar 2010 14:57

On Mar 31, 1:04 pm, Del Cecchi <delcec...(a)gmail.com> wrote:
>
> There certainly seems to be a spectrum of applications that can use a
> spectrum of bandwidth, and it would be unnecessary to spend a lot of
> money on interconnect and packaging if the problem can be solved with
> PCs connected together with string. Whether all protein folding and
> genomics software falls into this category is beyond me, but some seems
> to.
>
> There are other problems which require both lots of flops and lots of
> bandwidth. Flops are cheap. Bandwidth, especially bandwidth in a
> network with many nodes, is expensive.
>
> The various generations of Blue Gene seem to have increased bandwidth,
> although I'm not sure whether it has kept up with the flops. More
> bandwidth can be had, especially if the nodes are quite fat, like Blue
> Waters. BTW, thanks for the pointer to that; it was interesting.
>
> I tend to think the technical folks are doing the best they can, whilst
> you apparently think there is some kind of dominant hidden agenda in the
> form of a sort of farm program for supercomputers, where the government
> spends a lot of money on useless boxes good only for publicity and IBM's
> bottom line.
>
> Custom hardware is pretty expensive. So it seems that most supers are
> built out of piece parts or technologies that were developed for
> something else. For example, I think the PowerPC cores in BG/L were
> 405's left over from the embedded business. The interconnect physical
> layer was also taken from some other project, but I forget which.
> Likewise, the packaging, power, and cooling were derivatives of earlier
> stuff with some new ideas.

I'm not quite as simple-minded as my posts might make me out to be. I
am deliberately over-simplifying the situation to make my case clear.

Blue Gene and similar machines would be useful for a class of problems
that need lots of flops and very little global bandwidth and that are
relatively sensitive to inter-node latency. That would include some
kinds of fluid-mechanical modeling--the kind I personally do not trust
to be quantitatively correct.

I made the post mostly because you asked how else, other than something
like Blue Gene, you could do a petaflop. Like everyone else, you
mentioned no requirements other than flops. That is the bureaucratic
mantra at which I have been taking aim for years. You need flops? The
nation is already drowning in flops, so let's hear no more about flops
without qualification. A PS3 is a pretty fat node.

Like it or not, the defense establishment does implement an industrial
policy, from which IBM has benefited greatly. Maybe, as a nation, we
need that, because, for example, no one else is going to be building
3-D chips. I'm not saying the policy is wrong; I am merely pointing to
what seem to me to be obvious shortcomings and conflicts of interest.

Most people, technical or otherwise, are interested in their next
paycheck. If the government would pay IBM to dump money into the
Pacific Ocean, they would salute smartly, do as told, and hope that no
one found out, so long as they could make a profit. Technical people
follow orders or lose their jobs. It isn't a matter of competence.
It's a question of survival, and the game has been rigged by a magister
ludi whose goal is to create goals that can be met and advertised to
the executive and legislative branches as successfully met through
their competent management and leadership, leading to career
continuation and higher paychecks.
If you get imaginative and try to build an F-35 and get behind schedule
and over budget (which is almost inevitable), someone with a few stars
on his epaulets might lose his job--a lesson lost on no bureaucrat who
is destined to rise in the ranks. It's not evil. It's just life.

I just threw out a new question: how much of this couldn't be done on
computers we have already paid for? The purpose isn't to demean IBM or
Blue Gene. The purpose is to get people to talk about something more
meaningful than flops.

I found this document

http://www.sandia.gov/~rcmurph/doc/latency.pdf

which comes with the take-away message, "This paper compares the memory
performance sensitivity of both traditional and emerging HPC
applications, and shows that the new codes are significantly more
sensitive to memory latency and bandwidth than their traditional
counterparts."

The paper also concludes that performance is more sensitive to latency
than to bandwidth, but nothing is said (so far as I can tell) about the
role of the interconnecting fabric. This gets boiled down to: we need
to continue hammering away at memory-to-processor latency, which
further gets boiled down to: latency is more important than bandwidth.
If that's all you have time to remember, you're not going to pay
attention to global network bandwidth, which is apparently very
expensive, even though there may be little point (from a scientific
perspective) in building computers with a wimpy global interconnect.

Robert.
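To make the latency-versus-bandwidth distinction concrete, here is a
minimal C sketch (not from the Sandia paper; the array size, constants,
and the use of clock() are all illustrative). A dependent pointer chase
exposes latency, because each load's result is the next load's address;
a sequential sweep over the same array exposes bandwidth, because the
prefetcher can stream it.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 24)   /* 16M pointers = 128 MB, far beyond any L2 */

int main(void)
{
    size_t *chain = malloc(N * sizeof *chain);
    if (!chain) return 1;

    /* Build a single random cycle (Sattolo's algorithm) so the chase
     * visits every slot once before repeating; rand() is crude, but any
     * choice of j in [0, i) still yields one big cycle. */
    for (size_t i = 0; i < N; i++)
        chain[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }

    /* Latency-bound: each load must finish before the next address is
     * known, so the loop runs at roughly one memory latency per step. */
    clock_t t0 = clock();
    size_t p = 0;
    for (size_t i = 0; i < N; i++)
        p = chain[p];
    double chase = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Bandwidth-bound: independent sequential loads the hardware can
     * stream, so the loop runs at roughly the memory bandwidth. */
    t0 = clock();
    size_t sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += chain[i];
    double stream = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Use p and sum so the compiler cannot discard the loops. */
    printf("chase: %.3f s   stream: %.3f s   (p=%zu, sum=%zu)\n",
           chase, stream, p, sum);
    free(chain);
    return 0;
}

On typical hardware the chase runs an order of magnitude or more slower
than the stream over the very same bytes, which is the effect the paper
is measuring in application codes.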
From: Morten Reistad on 1 Apr 2010 07:44

In article
<4ae8b20e-d931-4112-bc07-f406b25f082e(a)e7g2000yqf.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>On Mar 31, 1:04 pm, Del Cecchi <delcec...(a)gmail.com> wrote:
>>
>I found this document
>
>http://www.sandia.gov/~rcmurph/doc/latency.pdf,
>
>which comes with the take-away message, "This paper compares the memory
>performance sensitivity of both traditional and emerging HPC
>applications, and shows that the new codes are significantly
>more sensitive to memory latency and bandwidth than their
>traditional counterparts."

This fits our, very commercial, observations very well. The size of
the L2 cache is the defining load limiter in more than half the
systems benchmarks we have done. HyperTransport proves a huge win by
interconnecting the L2 caches on different chips.

>The paper also concludes that performance is more sensitive to latency
>than to bandwidth, but nothing is said (so far as I can tell) about
>the role of the interconnecting fabric. This gets boiled down to: we
>need to continue hammering away at memory-to-processor latency, which
>further gets boiled down to: latency is more important than
>bandwidth. If that's all you have time to remember, you're not going
>to pay attention to global network bandwidth, which is apparently very
>expensive, even though there may be little point (from a scientific
>perspective) in building computers with a wimpy global interconnect.

Now, can we attack this from a simpler perspective: can we make the
L2-memory interaction more intelligent? Like actually make a paging
system for it? Paging revolutionised the disk-memory systems,
remember?

In terms of network switches: the performance of 10G switches is
pretty impressive, but there is some latency reaching the 10G card
over whatever IO bus is used. Direct CPU-attached 10G links, anyone?

16-way Xeon-style systems are becoming pretty standard shelfware, and
10G switches are getting there. Combining these should be able to
fill a niche above the "P3-via-DSL" networks, and still only cost a
minor fortune, no?

-- mrr
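As an illustration of the paging idea, here is a toy C sketch of
explicit "paging" between a small fast store and a large slow one.
Everything here is hypothetical--the sizes, the direct-mapped
placement, and the touch() helper are invented for the example; a real
mechanism would live in hardware or a runtime, not in application code.

#include <stdio.h>
#include <string.h>

#define PAGE       64               /* "page" = one cache-line-ish unit */
#define SLOW_PAGES 1024             /* backing store: 64 KB             */
#define FAST_PAGES 16               /* fast store: 1 KB, direct-mapped  */

static char slow[SLOW_PAGES][PAGE]; /* stands in for DRAM               */
static char fast[FAST_PAGES][PAGE]; /* stands in for on-chip memory     */
static int  tag[FAST_PAGES];        /* which slow page each frame holds */
static int  dirty[FAST_PAGES];      /* modified since it was fetched?   */

/* Return a fast-store pointer for a slow page, faulting it in (and
 * writing back the evicted page) on a miss. */
static char *touch(int page, int write)
{
    int f = page % FAST_PAGES;              /* direct-mapped placement  */
    if (tag[f] != page) {                   /* miss                     */
        if (tag[f] >= 0 && dirty[f])
            memcpy(slow[tag[f]], fast[f], PAGE);   /* write back        */
        memcpy(fast[f], slow[page], PAGE);         /* fetch             */
        tag[f] = page;
        dirty[f] = 0;
    }
    dirty[f] |= write;
    return fast[f];
}

int main(void)
{
    for (int i = 0; i < FAST_PAGES; i++)
        tag[i] = -1;                        /* all frames start empty   */
    touch(3, 1)[0] = 'x';       /* dirties the fast copy of page 3      */
    touch(3 + FAST_PAGES, 0);   /* conflicting page evicts it: writeback */
    printf("slow[3][0] = %c\n", slow[3][0]);       /* prints 'x'        */
    return 0;
}

The point of the sketch is only the structure: the fast store absorbs
repeated touches, and the slow store sees whole-page transfers, exactly
the pattern that made disk paging pay off.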
From: Robert Myers on 1 Apr 2010 16:36

Morten Reistad wrote:

> Now, can we attack this from a simpler perspective: can we make
> the L2-memory interaction more intelligent? Like actually make
> a paging system for it? Paging revolutionised the disk-memory
> systems, remember?

I think this suggestion is equivalent to putting "main memory" on the
chip and treating what once was main memory like a disk drive. One can
imagine inheriting all the wisdom and benefits of disk caching.

I'm guessing that stacked 3-D chips could play a big role here. That
technology will first appear only in uber-expensive computers, but,
like every other technology, it will eventually find its way into the
hands of mortals. Graphical virtual-reality applications will drive
it, if nothing else will.

> In terms of network switches: the performance of 10G switches
> is pretty impressive, but there is some latency reaching the
> 10G card over whatever IO bus is used. Direct CPU-attached 10G
> links, anyone?

This is the kind of thinking the taxpayers should be spending more of
their hard-earned dollars on--not just more of the same.

> 16-way Xeon-style systems are becoming pretty standard shelfware,
> and 10G switches are getting there. Combining these should be
> able to fill a niche above the "P3-via-DSL" networks, and still
> only cost a minor fortune, no?

Yes, and even more off-the-shelf muscle is on the way. Thus my advice
about unimaginative big systems: better to wait a few years. I don't
know where you reach the point of diminishing returns for global
bandwidth (with acceptable latency) in making nodes fatter, but
off-the-shelf hardware can build really fat nodes.

Robert.
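To put rough numbers on where a "wimpy global interconnect" starts to
hurt, here is a back-of-envelope C sketch, taking a global 3-D FFT as a
stand-in for the bandwidth-hungry end of the spectrum. The flop count
(5 N log2 N for N points) is the standard FFT estimate; the petaflop of
compute and the 1 TB/s bisection are assumed figures for illustration,
not any real machine's.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double n     = 4096;                   /* grid points per side      */
    double pts   = n * n * n;              /* 4096^3 = 2^36 points      */
    double flops = 5.0 * pts * log2(pts);  /* standard FFT flop count   */
    double bytes = 16.0 * pts;             /* one full transpose of     */
                                           /* complex doubles           */

    double peak   = 1e15;                  /* "a petaflop" of compute   */
    double bisect = 1e12;                  /* assumed 1 TB/s bisection  */

    printf("compute: %.2f s   one transpose: %.2f s\n",
           flops / peak, bytes / bisect);
    return 0;
}

With these assumed numbers the arithmetic takes about 0.01 s and a
single transpose about 1.1 s--and a 3-D FFT needs more than one
transpose--so the machine is idle waiting on the fabric almost the
whole time. That is the sense in which flops without qualification
mean very little.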
From: Tim McCaffrey on 1 Apr 2010 18:40

In article <It7tn.4459$iL1.992(a)newsfe24.iad>, rbmyersusa(a)gmail.com
says...
>
>Morten Reistad wrote:
>
>> Now, can we attack this from a simpler perspective: can we make
>> the L2-memory interaction more intelligent? Like actually make
>> a paging system for it? Paging revolutionised the disk-memory
>> systems, remember?
>
>I think this suggestion is equivalent to putting "main memory" on the
>chip and treating what once was main memory like a disk drive. One can
>imagine inheriting all the wisdom and benefits of disk caching.
>
>I'm guessing that stacked 3-D chips could play a big role here. That
>technology will first appear only in uber-expensive computers, but,
>like every other technology, it will eventually find its way into the
>hands of mortals. Graphical virtual-reality applications will drive
>it, if nothing else will.
>
>> In terms of network switches: the performance of 10G switches
>> is pretty impressive, but there is some latency reaching the
>> 10G card over whatever IO bus is used. Direct CPU-attached 10G
>> links, anyone?
>
>This is the kind of thinking the taxpayers should be spending more of
>their hard-earned dollars on--not just more of the same.

The PCIe 2.0 links on the Clarkdale chips run at 5 GT/s. You could
interconnect CPUs with a non-transparent bridge.

- Tim
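For context on that 5 GT/s figure, a quick arithmetic sketch: PCIe 2.0
signals at 5 GT/s per lane with 8b/10b encoding, so only 80% of the raw
rate carries data, before packet overhead trims it further. The lane
counts shown are illustrative.

#include <stdio.h>

int main(void)
{
    double gts      = 5e9;        /* PCIe 2.0: 5 gigatransfers/s/lane */
    double encoding = 8.0 / 10.0; /* 8b/10b line code                 */
    double lane_Bps = gts * encoding / 8.0;   /* bits -> bytes        */

    for (int lanes = 1; lanes <= 16; lanes *= 2)
        printf("x%-2d link: %5.1f GB/s per direction\n",
               lanes, lanes * lane_Bps / 1e9);
    return 0;
}

That works out to 0.5 GB/s per lane, or 8 GB/s each way on a full x16
link--respectable next to a 10G Ethernet port, which is part of why the
non-transparent-bridge idea is tempting.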
From: MitchAlsup on 1 Apr 2010 22:49
On Apr 1, 5:40 pm, timcaff...(a)aol.com (Tim McCaffrey) wrote:
> The PCIe 2.0 links on the Clarkdale chips run at 5 GT/s.

And how many dozen meters can these wires run?

Mitch
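Whatever the electrical reach turns out to be, distance alone costs
latency. A small sketch of round-trip time of flight, assuming a
typical propagation speed of about 0.7c in copper (the cable lengths
are illustrative):

#include <stdio.h>

int main(void)
{
    double c = 3e8, v = 0.7 * c;           /* assumed signal speed, m/s */
    double lengths[] = { 0.5, 3, 12, 36 }; /* board, cable, "dozens"    */

    for (int i = 0; i < 4; i++)
        printf("%5.1f m: %6.1f ns round trip\n",
               lengths[i], 2 * lengths[i] / v * 1e9);
    return 0;
}

At roughly 5 ns per metre one way, a few dozen metres adds hundreds of
nanoseconds of round-trip delay before the electrical problems even
enter into it.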