From: David L. Craig on 20 Jul 2010 13:49

On Jul 20, 11:31 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:

> We welcome new blood, and new ideas.

These are new ideas?  I hope not.

> I'm with you, David.  Maximizing what I call the MLP, the
> memory level parallelism, the number of DRAM accesses that
> can be concurrently in flight, is one of the things that
> we can do.
> Me, I'm just the MLP guy: give me a certain number of
> channels and bandwidth, I try to make the best use of
> them.  MLP is one of the ways of making more efficient
> use of whatever limited bandwidth you have.  I guess that's
> my mindset - making the most of what you have.  Not because
> I don't want to increase the overall memory bandwidth.
> But because I don't have any great ideas on how to do so,
> apart from
> a) More memory channels
> b) Wider memory channels
> c) Memory channels/DRAMs that handle short bursts/high
>    address bandwidth efficiently
> d) DRAMs with a high degree of internal banking
> e) aggressive DRAM scheduling
> Actually, c, d, e are really ways of making more efficient
> use of bandwidth, i.e. preventing pins from going idle
> because the burst length is giving you a lot of data you
> don't want.
> f) stacking DRAMs
> g) stacking DRAMs with an interface chip such as Tom
>    Pawlowski of Micron proposes, and a new abstract
>    DRAM interface, enabling all of the good stuff
>    above but keeping DRAM a commodity
> h) stacking DRAMs with an interface chip and a
>    processor chip (with however many processors you
>    care to build).

If we're talking about COTS design, FP bandwidth is probably not the
area in which to increase production costs for better performance.
As Mitch Alsup observed a little after the post I've been quoting
became available:

> We are at the point where, even when the L2 cache
> supplies data, there are too many latency cycles for
> the machine to be able to efficiently strip mine
> data.  {And in most cases the cache hierarchy is not
> designed to efficiently strip mine data, either.}

Have performance runs using various cache disablements indicated any
gains could be realized therein?  If so, I think that makes the case
for adding circuits to increase RAM parallelism as the cores fight it
out for timely data-in and data-out operations.

If we're talking about custom, never-mind-the-cost designs, then
that's the stuff that should make this a really fun group.
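As a rough illustration of the MLP idea Glew describes above (a sketch
of my own, not anything from his post): the number of DRAM accesses
that can be in flight at once is limited by data dependence in the
access stream.  The node layout, array arguments, and four-way split
below are illustrative assumptions.

    #include <stddef.h>

    struct node { struct node *next; double payload; };

    /* Dependent pointer chase: the next address is unknown until the
       current load returns, so MLP is ~1 no matter how many memory
       channels exist. */
    double chase(struct node *p, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            sum += p->payload;
            p = p->next;   /* serializes on each miss */
        }
        return sum;
    }

    /* Independent streams: the four arrays have no data dependence on
       each other, so an out-of-order core (or explicit prefetching)
       can keep several cache-line misses outstanding concurrently. */
    double stream4(const double *a, const double *b,
                   const double *c, const double *d, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (size_t i = 0; i < n; i++) {
            s0 += a[i]; s1 += b[i]; s2 += c[i]; s3 += d[i];
        }
        return s0 + s1 + s2 + s3;
    }

Both loops read the same number of bytes; the second simply exposes
more independent misses for the memory system to overlap.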
From: jacko on 20 Jul 2010 14:48

reality rnter, thr eniene pooj descn to lan dern turdil/

Sorry, I must STOP giving the mostest.  I'd love for it to mux long
data.  I can't see how it grows to the tend to stuff.  Chad'ict?  I do
know that not writing is good.
From: Robert Myers on 20 Jul 2010 14:49

On Jul 20, 1:49 pm, "David L. Craig" <dlc....(a)gmail.com> wrote:

> If we're talking about custom, never-mind-the-cost
> designs, then that's the stuff that should make this
> a really fun group.

If no one ever goes blue sky and asks what is even physically possible,
without worrying what may or may not be already in the works at Intel,
then we are forever limited, even in the imagination, to what a
marketdroid at Intel believes can be sold at Intel's customary margins.

There is always IBM, of course, and AMD seems willing to try anything
that isn't guaranteed to put it out of business, but, for the most
part, the dollars just aren't there, unless the government supplies
them.

As far as I'm concerned, the roots of the current climate for HPC can
be found in some DoD memos from the early nineties.  I'm pretty sure I
have already offered links to some of those documents here.  In all
fairness to those memos and to the semiconductor industry in the US,
the markets have delivered well beyond the limits I feared when those
memos first came out.  I doubt if mass-market x86 hypervisors ever
crossed the imagination at IBM, even as the barbarians were at the
gates.

Also, to be fair to markets, the cost-no-object exercises the
government undertook even after those early-90s memos delivered almost
nothing of any real use.  Lots of money has been squandered on some
really dumb ideas.  The national labs and others have tried the same
idea (glorified Beowulf) with practically every plausible processor
and interconnect on offer and pretty much the same result (90%+
efficiency for Linpack, 10% for anything even slightly more
interesting).

Moving the discussion to some place slightly less visible than
comp.arch might not produce more productive flights of fancy, but I,
for one, am interested in what is physically possible and not just
what can be built with the consent of Sen. Mikulski--a lady I have
always admired, to be sure, from her earliest days in politics, just
not the person I'd cite as intellectual backup for technical
decisions.

Robert.
From: jacko on 20 Jul 2010 14:55

On 20 July, 18:49, "David L. Craig" <dlc....(a)gmail.com> wrote:

[... quote of David L. Craig's post above snipped ...]

> If we're talking about custom, never-mind-the-cost
> designs, then that's the stuff that should make this
> a really fun group.

Why want in a explicit eans be< (short) all functors line up to align.
From: MitchAlsup on 20 Jul 2010 17:07
An example of the subtle microarchitectural optimization that is in
Robert's favor was tried in one of my previous designs.

The L1 cache was organized to cache the width of the bus returning
from the L2 on-die cache.  The L2 cache was organized at the width of
your typical multi-beat cache line returning from main memory.  Thus,
one L2 cache line would occupy 4 L1 cache sub-lines when fully 'in'
the L1.  Some horseplay at the cache coherence protocol prevented
incoherence.

With the L1-to-L2 interface suitably organized, one could strip mine
data from the L2 through the L1, through the computation units, and
back to the L1.  L1 victims were transferred back to the L2 as L2 data
arrived and was forwarded into execution.  Here, the execution window
had to absorb only the L2 transfer delay plus the floating-point
computation delay, and for that this execution window worked just
fine.  DAXPY and DGEMM on suitably sized vectors would strip mine data
footprints as big as the L2 cache at vector rates.

Mitch
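To put a software face on the strip mining Mitch describes (purely a
sketch of mine, not his design): a long DAXPY processed in strips
whose combined x and y footprint fits an assumed 512 KB L2 is the
"suitably sized vectors" case.  The L2 capacity and strip size below
are assumptions, not figures from his post.

    #include <stddef.h>

    #define L2_BYTES (512u * 1024u)                     /* assumed L2 capacity */
    #define STRIP    (L2_BYTES / (2 * sizeof(double)))  /* x and y share the L2 */

    void daxpy_strips(size_t n, double alpha, const double *x, double *y) {
        for (size_t base = 0; base < n; base += STRIP) {
            size_t end = (base + STRIP < n) ? base + STRIP : n;
            /* Within one strip, the hardware can stream data
               L2 -> L1 -> FPU -> L1, with L1 victims draining back to
               the L2 as new sub-lines arrive. */
            for (size_t i = base; i < end; i++)
                y[i] = alpha * x[i] + y[i];   /* one multiply-add per element */
        }
    }

The blocking does not change the arithmetic; it just makes explicit
the footprint that the cache hierarchy is being asked to stream.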