From: jacko on 19 Jul 2010 18:21

Why do memory channels have to be wired with inverter chains and relatively long track interconnect on the circuit board? Microwave pipework from chip top to chip top is perhaps possible; maintaining enough bandwidth over the microwave channel means many GHz, but the distance is short, so the radiated power can be low.

Flops or not? Let's generalize and call them nops, said he with a touch of sarcasm. Non-specific Operations, needing GB/s.

Cheers Jacko
From: Robert Myers on 19 Jul 2010 20:18

On Jul 19, 3:44 pm, nik Simpson <ni...(a)knology.net> wrote:
> On 7/19/2010 10:36 AM, MitchAlsup wrote:
>
> > d) high end PC processors can afford 2 memory channels
>
> Not quite as screwed as that: the top-end Xeon & Opteron parts have 4
> DDR3 memory channels, but still screwed. For the 2-socket space, it's 3
> DDR3 memory channels for typical server processors. Of course, the move
> to on-chip memory controllers means that the scope for additional memory
> channels is pretty much "zero", but that's the price you pay for
> commodity parts: they are designed to meet the majority of customers,
> and it's hard to justify the cost of additional memory channels at the
> processor and board-layout levels just to satisfy the needs of
> bandwidth-crazy HPC apps ;-)

Maybe the capabilities of high-end x86 are and will continue to be so compelling that, unless IBM is building the machine, that's what we're looking at for the foreseeable future. I don't understand the economics of less mass-market designs, but maybe the perfect chip would be some iteration of an "open" core: less heat-intensive, less expensive, and soldered down, with more attention to memory and I/O resources.

Or maybe you could dual-port or route memory, accepting whatever cost in latency there is, and at least allow some pure DMA device to perform I/O and gather/scatter chores so as to maximize what processor bandwidth there is.

I'd like some blue-sky thinking.

Robert.
From: Andrew Reilly on 20 Jul 2010 00:43

On Mon, 19 Jul 2010 08:36:18 -0700, MitchAlsup wrote:
> The memory system can supply only 1/3rd of what a single processor wants

If that's the case (and down-thread Nik Simpson suggests that the best case might even be twice as "good", or 2/3 of a single processor's worst-case demand), then that's amazingly better than has been available, at least in the commodity processor space, for quite a long time. I remember when I started moving DSP code onto PCs, and finding anything with better than 10 MB/s memory bandwidth was not easy. These days my problem set typically doesn't get out of the cache, so that's not something I personally worry about much any more. If your problem set is driven by stream-style vector ops, then you might as well switch to low-power critters like Atoms, match the flops to the available bandwidth, and save some power.

On the other hand, I have a lot of difficulty believing that even for large-scale vector-style code, a bit of loop fusion, blocking or code factoring can't bring value reuse up to a level where even (0.3/nProcs) of the available bandwidth is plenty.

That's single-threaded application-think. Where you *really* need that bandwidth, I suspect, is for the inter-processor communication between your hordes of cooperating (ha!) cores.

Cheers,

-- Andrew
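
[A minimal sketch of the blocking/loop-restructuring point Andrew makes above; this is an editorial illustration, not code from the thread, and the matrix size N and block size BS are assumptions chosen only to make the reuse argument concrete.]

/* A naive N x N matrix multiply streams on the order of 2*N^3 doubles
 * from memory; tiling the loops so that three BS x BS tiles fit in
 * cache means each element of A and B is fetched from DRAM only about
 * N/BS times, trading memory bandwidth for cache reuse.  C is assumed
 * to be zero-initialized by the caller. */
#include <stddef.h>

#define N   1024
#define BS  64   /* assumed block size: three 64x64 double tiles ~ 96 KB */

void matmul_blocked(const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                /* work on one tile: all three operand tiles stay cache-resident */
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        double a = A[i * N + k];   /* reused across the whole j loop */
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

The same idea, applied aggressively, is what lets a bandwidth-starved core approach its peak flops on dense kernels.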
From: jacko on 20 Jul 2010 01:44

On 20 July, 05:43, Andrew Reilly <areilly...(a)bigpond.net.au> wrote:
> On Mon, 19 Jul 2010 08:36:18 -0700, MitchAlsup wrote:
> > The memory system can supply only 1/3rd of what a single processor wants
>
> If that's the case (and down-thread Nik Simpson suggests that the best
> case might even be twice as "good", or 2/3 of a single processor's
> worst-case demand), then that's amazingly better than has been
> available, at least in the commodity processor space, for quite a long
> time. I remember when I started moving DSP code onto PCs, and finding
> anything with better than 10 MB/s memory bandwidth was not easy. These
> days my problem set typically doesn't get out of the cache, so that's
> not something I personally worry about much any more. If your problem
> set is driven by stream-style vector ops, then you might as well switch
> to low-power critters like Atoms, match the flops to the available
> bandwidth, and save some power.

Or run a bigger network off the same power.

> On the other hand, I have a lot of difficulty believing that even for
> large-scale vector-style code, a bit of loop fusion, blocking or code
> factoring can't bring value reuse up to a level where even (0.3/nProcs)
> of the available bandwidth is plenty.

Prob(able)ly - a sick, perverse hanging-on to the longer word in the post-quantum age.

> That's single-threaded application-think. Where you *really* need that
> bandwidth, I suspect, is for the inter-processor communication between
> your hordes of cooperating (ha!) cores.

Maybe. I think much of the problem is not vectors, since these usually have a single index; it's matrix and tensor problems, which have 2 or n indexes: T[a,b,c,d]. Many of the product sums run over different indexes, even with transpose-elimination coding (automatic switching between row and column order based on linear sequencing of the write target, or for the best read/write interleaving) in the prefetch context, and with only limited gather/scatter. Maybe even some multi-store (slightly wasteful of memory cells) with differing address-bit swappings: the high bits as an address-map translation selector, with a combined bank read-and-write 'union' operation (* or +)? Ummm.
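
[A rough editorial illustration of the "which index do you contract over?" issue raised above, assuming ordinary row-major C layout; the loop-interchange trick stands in for the "transpose elimination" jacko mentions, and N is an arbitrary illustrative size.]

/* Summing over the second (fast) index of a row-major matrix streams
 * memory in unit stride; summing over the first index with the obvious
 * loop order strides by N doubles per access and wastes most of each
 * cache line.  Interchanging the loops keeps the reads unit-stride
 * while the N-element result vector stays cache-resident. */
#include <stddef.h>

#define N 4096

/* r[i] = sum_j M[i][j]: inner loop walks M contiguously. */
void contract_fast_index(const double M[N][N], double r[N])
{
    for (size_t i = 0; i < N; i++) {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            s += M[i][j];
        r[i] = s;
    }
}

/* c[j] = sum_i M[i][j]: written with i as the OUTER loop so the inner
 * loop still walks M in unit stride; c[] (32 KB) lives in cache, which
 * is a miniature version of choosing the traversal order to suit the
 * write target. */
void contract_slow_index(const double M[N][N], double c[N])
{
    for (size_t j = 0; j < N; j++)
        c[j] = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            c[j] += M[i][j];
}

With more than two indexes the same choice recurs for every contraction, which is why tensor kernels lean so hard on either layout changes or gather/scatter bandwidth.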
From: George Neuner on 20 Jul 2010 10:33
On Mon, 19 Jul 2010 08:36:18 -0700 (PDT), MitchAlsup <MitchAlsup(a)aol.com> wrote:

>It seems to me that having less than 8 bytes of memory bandwidth per
>flop leads to an endless series of cache exercises.**
>
>It also seems to me that nobody is going to be able to put the
>required 100 GB/s/processor pin interface on the part.*
>
>Nor does it seem it would have the latency needed to strip-mine main
>memory continuously were the required BW made available.
>
>Thus, we are in essence screwed.
>
>* current bandwidths
>a) 3 GHz processors with 2 FP pipes running 128-bit (2 x DP) flops
>(a la SSE). This gives 12 GFlop/s per processor.
>b) 12 GFlop/s per processor demands 100 GByte/s per processor.
>c) DDR3 can achieve 17 GBytes/s per channel.
>d) High-end PC processors can afford 2 memory channels.
>e) Therefore we are screwed:
>e.1) The memory system can supply only 1/3rd of what a single processor
>wants.
>e.2) There are 4 (and growing) processors.
>e.3) Therefore the memory system can support less than 1/12th as much
>BW as required.
>
>Mitch
>
>** The ideal memBW/flop is 3 memory operations per flop, and back in
>the Cray-1 to X-MP transition much of the vectorization gain came from
>the added memBW and the better chaining.

ISTM bandwidth was the whole point behind pipelined vector processors in the older supercomputers. Yes, there was a lot of latency (and I know you [Mitch] and Robert Myers are dead set against latency too), but the staged data movement provided a lot of opportunity to overlap with real computation.

YMMV, but I think pipelined vector units need to make a comeback. I am not particularly happy at the thought of using them again, but I don't see a good way around it.

George
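
[Mitch's footnoted arithmetic, spelled out as a tiny program for reference. Every number is taken from his post (3 GHz, 2 SSE pipes, 2 DP flops per 128-bit op, 8 bytes/flop target, 17 GB/s per DDR3 channel, 2 channels, 4 cores); nothing here is measured or new.]

#include <stdio.h>

int main(void)
{
    double clock_ghz      = 3.0;   /* a) core clock                    */
    double fp_pipes       = 2.0;   /* a) FP/SSE pipes per core         */
    double flops_per_op   = 2.0;   /* a) DP lanes per 128-bit op       */
    double gflops         = clock_ghz * fp_pipes * flops_per_op;  /* 12 GFlop/s */

    double bytes_per_flop = 8.0;   /* the "8 bytes per flop" target    */
    double demand_gbs     = gflops * bytes_per_flop;              /* ~100 GB/s (b) */

    double chan_gbs       = 17.0;  /* c) DDR3 per channel              */
    double channels       = 2.0;   /* d) high-end PC part              */
    double supply_gbs     = chan_gbs * channels;                  /* 34 GB/s */

    double cores          = 4.0;   /* e.2) cores sharing the channels  */

    printf("demand per core : %5.1f GB/s\n", demand_gbs);
    printf("supply per chip : %5.1f GB/s  (%.2f of one core's demand, ~1/3: e.1)\n",
           supply_gbs, supply_gbs / demand_gbs);
    printf("supply per core : %5.1f GB/s  (%.2f of demand, ~1/12: e.3)\n",
           supply_gbs / cores, supply_gbs / (cores * demand_gbs));
    return 0;
}

Run as-is, it reproduces the 1/3 and roughly 1/12 ratios the thread keeps returning to.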