From: Robert Myers on 11 Dec 2006 19:40

Edward Wolfgram wrote:
>
> What don't you like about Blue Gene?
>
Blue Gene has a bisection bandwidth in the range of millibytes per flop, depending on how it's configured (or you could have a nice bisection bandwidth if you settled for an uninterestingly small machine). As you continue to add nodes to the mesh, creating ever-higher Linpack scores, the bisection bandwidth in millibytes per flop just keeps falling, with the limit for this "scalable" machine being zero.

That's a problem for doing FFT's. It's a problem that was identified in the potential applications of Red Storm (25% projected efficiency for pseudospectral simulations, for example), and a problem that has appeared in IBM's own documents regarding FFT's on Blue Gene: the flops per processor fall apart at some uninterestingly low number of processors.

I gather that Blue Gene just won an award for doing FFT's. I haven't had a chance yet to look at it to see what it means. Nothing could have come to me as a greater surprise. It's been a while since I went through the details, all of which were discouraging. Maybe something has changed that I don't know about.

Robert.
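The scaling argument behind the "millibytes per flop, with limit zero" claim can be sketched numerically. For an n x n x n 3D torus like Blue Gene/L's, a bisection cuts roughly 2*n*n links (two cut planes, because of the torus wraparound), so bisection bandwidth grows as n^2 while aggregate flops grow as n^3: the ratio falls as 1/n. This is my own illustration, not from the post; the per-link bandwidth and per-node flop rate are parameters because the exact figures depend on configuration.

```c
/* Sketch: bytes of bisection bandwidth per flop for an n x n x n
 * 3D torus.  Bisection bandwidth grows as n^2 (2*n*n cut links),
 * total flops grow as n^3, so bytes/flop falls as 1/n with limit
 * zero as nodes are added -- Myers' point about "scalable" meshes. */
double bisection_bytes_per_flop(int n, double link_bytes_per_sec,
                                double node_flops_per_sec)
{
    double cut_links = 2.0 * (double)n * (double)n;  /* links crossing the cut */
    double bisection_bw = cut_links * link_bytes_per_sec;
    double total_flops = (double)n * (double)n * (double)n * node_flops_per_sec;
    return bisection_bw / total_flops;
}
```

With roughly Blue Gene/L-ish numbers (175 MB/s per torus link, 5.6 GFLOPS per node; both assumptions for illustration only), an 8x8x8 machine gets about 0.008 bytes/flop and a 64x64x64 machine about 0.001: millibytes per flop, falling steadily as the machine grows. All-to-all communication patterns like distributed FFT transposes stress exactly this number.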
From: Chris Thomasson on 15 Dec 2006 00:28

"Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message news:CbadnSpeDeLqweHYnZ2dnUVZ_segnZ2d(a)comcast.com...
> The cell processor appears to be the first commodity multiprocessor that
> breaks with the cached shared memory multi-processing model, if I
> got this right. So it's less applicable to shared memory multi-threading
> models and more applicable to models like MPI.
>
> In some respects, although there's no crossbar switch, it's like the
> old SP systems where an IBM mainframe served as a scheduling and
> control processor. The PPC appears to be the new mainframe. :)
>
> I wonder how the old shared memory strategy will work out. Will
> coherent cache shared memory scale up to 10's and 100's of processors
> and stay competitive?

YES!!! Here is the ultimate PDR + hardware solution:

http://groups.google.com/group/comp.arch/msg/2a0f4163f8e13f1e

Watch... A PATENT for this technique will mysteriously appear one day. ;^)
From: Chris Thomasson on 15 Dec 2006 00:35

"Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message news:CbadnSpeDeLqweHYnZ2dnUVZ_segnZ2d(a)comcast.com...
> The cell processor appears to be the first commodity multiprocessor that
> breaks with the cached shared memory multi-processing model, if I
> got this right. So it's less applicable to shared memory multi-threading
> models and more applicable to models like MPI.
>
> In some respects, although there's no crossbar switch, it's like the
> old SP systems where an IBM mainframe served as a scheduling and
> control processor. The PPC appears to be the new mainframe. :)
>
> I wonder how the old shared memory strategy will work out. Will
> coherent cache shared memory scale up to 10's and 100's of processors
> and stay competitive?

You can use the PowerPC for a lot of the shared memory work. The Cell simply forces you to stick to a strict distributed programming paradigm. Well, luckily, I have experience with distributed programming. However, I do like the fact that I can use the PPC on the Cell to do high-end shared memory multi-processing.
From: Chris Thomasson on 19 Dec 2006 01:59
"Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message news:I9WdneLUKrCi9R_YnZ2dnUVZ_vqpnZ2d(a)comcast.com... > Chris Thomasson wrote: >> "Joe Seigh" <jseigh_01(a)xemaps.com> wrote in message >> news:CbadnSpeDeLqweHYnZ2dnUVZ_segnZ2d(a)comcast.com... >> >>>The cell processor appears to be the first commondity multiprocessor that >>>breaks with the cached shared memory multi-processing model if I >>>got this right. So it's less applicable to shared memory multi-threading >>>models and more applicable to models like MPI. [...] Well, message passing is okay with me simply because I can personally implement it with virtually zero overheads. It comforts me to know that a message passing algorithm can be implemented in software in a way that renders all questions which deal with any possible overheads that may be attributed to its usage, virtually meaningless. I posted the algorithm over on c.p.t. if your interested; look for conversations I had with David Hopwood. In my "very humble" opinion, the posted algorithm proves that an efficient message passing algorithm can be accomplished and 100% implemented in software using existing ISA's. The Cell seems to trust the programmer a whole lot... Forcing us to come up with ultra-lean-and-mean message passing paradigm seems to be the trend... The trend that will make us some real $$$ that is... ;^) Any thoughts on this approach? Joe, if were are forced to use distributed programming, then we are forced to implement fast message passing patterns in software... We have to beat the hardware... I have a bad feeling that the hardware guys can render us software guys moot? Na... The Cell proves that software means something after all? :O >>>I wonder how the old shared memory strategy will work out. Will >>>coherent cache shared memory scale up to 10's and 100's of processors >>>and stay competitive? The cache coherency has to be weak. 
The software should always have the ability to use the ISA to force a particular memory model for certain algorithms. If the cache coherency mechanism a future processor uses is sufficiently weak, then it can allow software applications to tailor custom memory models to their specific data-usage patterns and overall throughput protocols.

>> You can use the PowerPC for a lot of the shared memory work. The Cell
>> simply forces you to stick to a strict distributed programming paradigm.

[...]

> AMD seems to be about to go that route
>
> http://techreport.com/onearticle.x/11438
>
> so distributed models might become more the norm.

Well, if you can't beat 'em... Join 'em? Humm... At least we can use our overall synchronization algorithm implementation design goal of "zero-overhead or die" in a message passing implementation...

Brief Ultra-Fast Message-Passing Outline
----------------------------------------

---- Multiplexing of multiple per-thread:

-- communication data-structures: "word-sized" lock-free anchors for linked data-structures
-- allocated with: http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855 (patent pending)

---- Synchronized with:

-- "per-thread Petersons Algorithm": http://groups.google.com/group/comp.programming.threads/browse_frm/thread/c49c0658e2607317
-- "per-thread unbounded virtually zero-overhead fifo": http://appcore.home.comcast.net/

---- Organized with:

-- dual per-thread/message version/time stamps

All may not be lost? ;^)

[...]

> Depending on how unique the processor is, the
> application might have to be written from scratch.

Yep. :O

Well, more work for us? Consultants anyone? Anyone need a consultant? ;^)
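The "per-thread unbounded virtually zero-overhead fifo" above is only linked, not shown, but the general shape of such a thing can be sketched: a single-producer/single-consumer queue whose fast path uses no locks and no atomic read-modify-write instructions, only one plain store with release ordering per operation. This sketch of mine is a bounded ring rather than the unbounded version, with invented names and sizes, so it stands in for the idea rather than the actual posted algorithm:

```c
/* Hedged sketch: SPSC ring buffer using C11 acquire/release atomics.
 * head is written only by the consumer, tail only by the producer,
 * so neither side ever needs a CAS or a lock -- the "virtually
 * zero-overhead" property.  QCAP and all names are illustrative. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 64  /* power of two, so indices can be masked, not modded */

struct spsc_queue {
    void *slots[QCAP];
    _Atomic size_t head;  /* next slot to pop; written only by consumer */
    _Atomic size_t tail;  /* next slot to fill; written only by producer */
};

/* Producer side: write the slot, then release-store tail to publish it. */
static bool spsc_push(struct spsc_queue *q, void *msg)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QCAP)
        return false;  /* full */
    q->slots[t & (QCAP - 1)] = msg;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

/* Consumer side: acquire-load tail so the producer's slot write is visible. */
static bool spsc_pop(struct spsc_queue *q, void **msg)
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t)
        return false;  /* empty */
    *msg = q->slots[h & (QCAP - 1)];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```

Multiplexing one such fifo per producer/consumer thread pair, as the outline suggests, keeps every queue strictly SPSC and so keeps the fast path free of interlocked instructions entirely.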