From: George Neuner on 24 Jul 2010 05:10

On Fri, 23 Jul 2010 11:30:32 -0700 (PDT), Robert Myers
<rbmyersusa(a)gmail.com> wrote:

>I don't think the hardware and software issues can be separated.

Thank goodness someone said that.

At least where HPC is concerned, I've been convinced for some time that
we are fighting hardware rather than leveraging it. I've spent a number
of years with DSPs and FPGAs and I've come to believe that we (or at
least compilers) need to be deliberately programming memory interfaces
as well as the ALU/FPU operations.

The problem most often cited for vector units is that they need to
support non-consecutive and non-uniform striding to be useful. I agree
that there *does* need to be support for those features, but I believe
it should be in the memory subsystem rather than in the processor.

I'm supposing that there are vector registers accessible to
scatter/gather DMA, and further supposing that there are several DMA
channels - ideally one channel per register. The programmer's indexing
loop code is compiled into instructions that program the DMA to
read/gather a block of operands into the source registers, execute the
vector operation(s), and finally write/scatter the results back to
memory.

I do understand that problems have to have "crystalline" access
patterns and enough long vector(izable) operations to absorb the
latency of data staging. I know there are plenty of problems that don't
fit that model.

The main issue would be having a main memory that could tolerate
concurrent DMA - but I know that lots of things are possible with
appropriate design: I once worked with a system which had a proprietary
FPGA-based memory controller that sustained 1400 MB/s - 700 in and 700
out - using banked 100 MHz SDRAM (the old kind, not DDR).

I used to have a 40 MHz ADI SHARC 21060 (120 MFLOPS sustained) on a
bus-mastering PCI board in a 450 MHz Pentium II desktop. I had a number
of programs that turned the DSP into a long vector processor (512- to
4096-element "registers") and used overlapped DMA to move data in and
out while processing. Given a large enough data set, that 40 MHz DSP
could handily outperform the host's 450 MHz CPU.

George
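A minimal C sketch of the overlapped-DMA scheme George describes, with
a hypothetical dma_read_async / dma_write_async / dma_wait API standing
in for the SHARC's real channel registers (the stubs here are
synchronous memcpy calls so the sketch runs as-is; real hardware would
start the transfer and return immediately). Block b+1 streams in while
block b is computed and block b-1 streams out, so with a long enough
VLEN the waits should rarely block:

  /* Double-buffered gather -> compute -> scatter pipeline.  The dma_*
   * API below is hypothetical, not the SHARC's actual interface. */
  #include <stddef.h>
  #include <string.h>

  #define VLEN 2048                    /* elements per on-chip "register" */

  typedef int dma_req;                 /* stand-in for a channel handle   */
  static dma_req done;                 /* single dummy completion token   */

  static dma_req *dma_read_async(float *dst, const float *src, size_t n)
  { memcpy(dst, src, n * sizeof *dst); return &done; }

  static dma_req *dma_write_async(float *dst, const float *src, size_t n)
  { memcpy(dst, src, n * sizeof *dst); return &done; }

  static void dma_wait(dma_req *r) { (void)r; }  /* stubs finish instantly */

  static float inbuf[2][VLEN], outbuf[2][VLEN];  /* double buffers */

  /* y[i] = a * x[i]; n is assumed to be a multiple of VLEN */
  void vscale(float a, const float *x, float *y, size_t n)
  {
      dma_req *rd[2] = { 0, 0 }, *wr[2] = { 0, 0 };
      size_t nblk = n / VLEN;

      if (nblk == 0) return;
      rd[0] = dma_read_async(inbuf[0], x, VLEN);           /* prime the pipe */

      for (size_t b = 0; b < nblk; b++) {
          size_t cur = b & 1, nxt = cur ^ 1;

          if (b + 1 < nblk)                                /* next gather     */
              rd[nxt] = dma_read_async(inbuf[nxt], x + (b + 1) * VLEN, VLEN);

          dma_wait(rd[cur]);                               /* operands in     */
          if (wr[cur]) dma_wait(wr[cur]);                  /* out buffer free */

          for (size_t i = 0; i < VLEN; i++)                /* the vector op   */
              outbuf[cur][i] = a * inbuf[cur][i];

          wr[cur] = dma_write_async(y + b * VLEN, outbuf[cur], VLEN);
      }
      if (wr[0]) dma_wait(wr[0]);                          /* drain scatters  */
      if (wr[1]) dma_wait(wr[1]);
  }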
From: nmm1 on 24 Jul 2010 06:01

In article <sdtk4654pheq6292135jd42oagr5ov7cg4(a)4ax.com>,
George Neuner <gneuner2(a)comcast.net> wrote:

>On Fri, 23 Jul 2010 11:30:32 -0700 (PDT), Robert Myers
><rbmyersusa(a)gmail.com> wrote:
>
>>I don't think the hardware and software issues can be separated.
>
>Thank goodness someone said that.
>
>At least where HPC is concerned, I've been convinced for some time
>that we are fighting hardware rather than leveraging it. I've spent a
>number of years with DSPs and FPGAs and I've come to believe that we
>(or at least compilers) need to be deliberately programming memory
>interfaces as well as the ALU/FPU operations.
>
>The problem most often cited for vector units is that they need to
>support non-consecutive and non-uniform striding to be useful. I
>agree that there *does* need to be support for those features, but I
>believe it should be in the memory subsystem rather than in the
>processor.

I believe that you have taken the first step on the path to True
Enlightenment, but need to have the courage of your convictions
and proceed further on :-)

I.e. I agree, and what we need is architectures which are designed
to provide data management first and foremost, and which attach
the computation onto that. I.e. turn the traditional approach on
its head. And I don't think that is limited to HPC, either.

I can't see any of the decent computer architects having any great
problem with this concept, but I doubt that the benchmarketers and
execudroids would swallow it. It would also need a comparable
revolution in programming languages and paradigms, though there have
been a lot of exploratory ones that show the concepts are viable.

Regards,
Nick Maclaren.
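To make the inversion concrete: a purely imaginary C sketch in which
the programmed memory streams (gather and scatter patterns) are the
primary objects and the arithmetic kernel is bolted on afterwards.
None of the stream_* names exist in any real API; the toy
implementation is only there so the sketch runs:

  #include <stddef.h>
  #include <stdlib.h>

  /* A "stream" is a programmed access pattern: base, count, stride.
   * Data movement is declared first; computation is attached later. */
  typedef struct stream {
      const double *src_base;              /* set by stream_gather       */
      double       *dst_base;              /* set by stream_scatter      */
      size_t        count;
      ptrdiff_t     stride;
      struct stream *input;                /* attached gather stream     */
      double      (*kernel)(double);       /* attached computation       */
  } stream;

  static stream *stream_gather(const double *base, size_t count, ptrdiff_t stride)
  { stream *s = calloc(1, sizeof *s); s->src_base = base; s->count = count; s->stride = stride; return s; }

  static stream *stream_scatter(double *base, size_t count, ptrdiff_t stride)
  { stream *s = calloc(1, sizeof *s); s->dst_base = base; s->count = count; s->stride = stride; return s; }

  static void stream_attach(stream *out, stream *in, double (*kernel)(double))
  { out->input = in; out->kernel = kernel; }

  /* "The memory system drives it": walk the programmed patterns and
   * apply the attached kernel along the way. */
  static void stream_run(stream *out)
  {
      const stream *in = out->input;
      for (size_t i = 0; i < out->count; i++)
          out->dst_base[i * out->stride] = out->kernel(in->src_base[i * in->stride]);
  }

  static double twice(double x) { return 2.0 * x; }

  /* b[i] = 2 * a[3*i]: the strides live in the streams, not the loop.
   * (a must hold at least 3*(n-1)+1 elements.) */
  void example(const double *a, double *b, size_t n)
  {
      stream *in  = stream_gather (a, n, 3);
      stream *out = stream_scatter(b, n, 1);
      stream_attach(out, in, twice);
      stream_run(out);
      free(in); free(out);
  }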
From: nmm1 on 24 Jul 2010 06:05

In article <cveqh7-ad2.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>If you check actual SIMD type code, you'll notice that various forms of
>permutations are _very_ common, i.e. you need to rearrange the order of
>data in one or more vector register:
>
>If vectors were processed in streaming mode, we would have the same
>situation as for the Pentium4 which did half a register in each half
>cycle in the fast core, but had to punt each time you did a right shift
>(or any other operations which could not be processed in LE order).
>
>I have seen once a reference to Altivec code that used the in-register
>permute operation more than any other opcode.
>
>Except that even scalar code needs prefix/mask type operations in order
>to get rid of some branches, right?
>
>All (most of?) the others seem to boil down to a need for a fast vector
>permute...

Yes. My limited investigations indicated that the viability of vector
systems usually boiled down to whether the hardware's ability to do
that was enough to meet the software's requirements. If not, it spent
most of its time in scalar code.

>> But also because, as I discussed in my Berkeley Parlab presentation of
>> Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
>> somewhat the deficiencies of coherent threading, specifically the
>> problem of divergence.
>
>Please tell!

Indeed, yes, please do!

Regards,
Nick Maclaren.
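Both idioms Terje mentions, written with x86 SSE2/SSSE3 intrinsics
purely as a stand-in for Altivec's vperm/vsel (the intrinsics are real;
the helper names are mine): an in-register permute that reverses the
four 32-bit lanes, and a mask/select that removes a per-element branch:

  #include <emmintrin.h>   /* SSE2 */
  #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

  /* 1. Permute: reverse the order of the four 32-bit lanes (byte order
   *    within each lane is preserved). */
  __m128i reverse_lanes(__m128i v)
  {
      const __m128i idx = _mm_set_epi8( 3,  2,  1,  0,   7,  6,  5,  4,
                                       11, 10,  9,  8,  15, 14, 13, 12);
      return _mm_shuffle_epi8(v, idx);     /* byte-granular permute */
  }

  /* 2. Branch removal: per lane, r = (a < 0) ? b : a, with no branch. */
  __m128i select_if_negative(__m128i a, __m128i b)
  {
      __m128i mask = _mm_cmpgt_epi32(_mm_setzero_si128(), a);  /* a < 0  */
      return _mm_or_si128(_mm_and_si128(mask, b),
                          _mm_andnot_si128(mask, a));          /* select */
  }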
From: Robert Myers on 24 Jul 2010 13:02

On Jul 24, 6:01 am, n...(a)cam.ac.uk wrote:
> In article <sdtk4654pheq6292135jd42oagr5ov7...(a)4ax.com>,
> George Neuner <gneun...(a)comcast.net> wrote:
>
> >At least where HPC is concerned, I've been convinced for some time
> >that we are fighting hardware rather than leveraging it. I've spent a
> >number of years with DSPs and FPGAs and I've come to believe that we
> >(or at least compilers) need to be deliberately programming memory
> >interfaces as well as the ALU/FPU operations.
>
> >The problem most often cited for vector units is that they need to
> >support non-consecutive and non-uniform striding to be useful. I
> >agree that there *does* need to be support for those features, but I
> >believe it should be in the memory subsystem rather than in the
> >processor.
>
> I believe that you have taken the first step on the path to True
> Enlightenment, but need to have the courage of your convictions
> and proceed further on :-)
>
> I.e. I agree, and what we need is architectures which are designed
> to provide data management first and foremost, and which attach
> the computation onto that. I.e. turn the traditional approach on
> its head. And I don't think that is limited to HPC, either.
> I can't see any of the decent computer architects having any great
> problem with this concept, but I doubt that the benchmarketers and
> execudroids would swallow it.

Ok. So here's a half-baked guess. The reason that doesn't happen isn't
to be found in the corner office, but in your thread about RDMA and
Andy's comments in that thread, in particular.

Today's computers are *not* designed around computation, but around
coherent cache. Now that the memory controller is on the die, the
takeover is complete. Nothing moves efficiently without notice and
often unnecessary involvement of the real Von Neumann bottleneck,
which is the cache. Cache snooping is the one ring that rules them all.

I doubt if an implausible journey through Middle Earth by fantastic
creatures would help, but probably some similarly wild exercise of the
imagination is called for.

Currently, you cluster processors when you can't conveniently jam them
all into a single coherence domain. The multiple coherence domains that
result are an annoyance to someone like me who would desperately like
to think in terms of one big, flat memory space, but they also allow
new possibilities, like moving data around without bothering other
processors and other coherence domains.

Maybe you want multiple coherence domains even when you aren't forced
into it by the size of a board or a rack or a mainframe. Maybe you want
more programmable control over coherence domains. If you're not going
to scrap cache and cache snooping, maybe you can wrestle some control
away from the hardware and give it to the software.

Robert.
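For what it's worth, the nearest thing current x86 hardware exposes to
that kind of software control is rather modest: non-temporal stores and
fences, which let a producer push a buffer out to memory without
dragging it through its own cache. It is nowhere near programmable
coherence domains, but a small sketch shows the flavour (the intrinsics
are real SSE2; copy_nontemporal is just an illustrative name):

  #include <emmintrin.h>
  #include <stddef.h>

  /* Copy a 16-byte-aligned buffer without polluting the producer's
   * cache: loads are normal, stores bypass the cache hierarchy. */
  void copy_nontemporal(void *dst, const void *src, size_t bytes)
  {
      __m128i       *d = (__m128i *)dst;
      const __m128i *s = (const __m128i *)src;

      for (size_t i = 0; i < bytes / 16; i++) {
          __m128i v = _mm_load_si128(&s[i]);  /* normal (cached) load    */
          _mm_stream_si128(&d[i], v);         /* non-temporal store:
                                                 write-combine to memory */
      }
      _mm_sfence();   /* make the streamed stores globally visible before,
                         say, signalling a consumer elsewhere */
  }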
From: nmm1 on 24 Jul 2010 16:24
In article <88d23585-d47c-47af-91a1-7bae764afaf8(a)q22g2000yqm.googlegroups.com>,
Robert Myers <rbmyersusa(a)gmail.com> wrote:
>
>Today's computers are *not* designed around computation, but around
>coherent cache. Now that the memory controller is on the die, the
>takeover is complete. Nothing moves efficiently without notice and
>often unnecessary involvement of the real Von Neumann bottleneck,
>which is the cache.

Yes and no. Their interfaces are still designed around computation,
and the coherent cache is designed to give the impression that
programmers need not concern themselves with programming memory
access - it's all transparent.

>Maybe you want more programmable control over coherence domains. If
>you're not going to scrap cache and cache snooping, maybe you can
>wrestle some control away from the hardware and give it to the
>software.

That is, indeed, part of what I do mean.

Regards,
Nick Maclaren.