From: nmm1 on 20 Jul 2010 10:41

In article <04cb46947eo6mur14842fqj45pvrqp61l1(a)4ax.com>,
George Neuner <gneuner2(a)comcast.net> wrote:
>
>ISTM bandwidth was the whole point behind pipelined vector processors
>in the older supercomputers. Yes there was a lot of latency (and I
>know you [Mitch] and Robert Myers are dead set against latency too)
>but the staging data movement provided a lot of opportunity to overlap
>with real computation.

Yes.

>YMMV, but I think pipeline vector units need to make a comeback. I am
>not particularly happy at the thought of using them again, but I don't
>see a good way around it.

NO chance! It's completely infeasible - they were dropped because the
vendors couldn't make them for affordable amounts of money any longer.

Regards,
Nick Maclaren.
From: jacko on 20 Jul 2010 10:54

On 20 July, 15:41, n...(a)cam.ac.uk wrote:
> In article <04cb46947eo6mur14842fqj45pvrqp6...(a)4ax.com>,
> George Neuner <gneun...(a)comcast.net> wrote:
>
> >ISTM bandwidth was the whole point behind pipelined vector processors
> >in the older supercomputers. Yes there was a lot of latency (and I
> >know you [Mitch] and Robert Myers are dead set against latency too)
> >but the staging data movement provided a lot of opportunity to overlap
> >with real computation.
>
> Yes.
>
> >YMMV, but I think pipeline vector units need to make a comeback. I am
> >not particularly happy at the thought of using them again, but I don't
> >see a good way around it.
>
> NO chance! It's completely infeasible - they were dropped because
> the vendors couldn't make them for affordable amounts of money any
> longer.
>
> Regards,
> Nick Maclaren.

Maybe he needs an FPGA card with many single-cycle Booth multipliers on
chip. It would be a bit slow due to routing delays, but highly parallel.
There really should be a way to queue multiply-accumulate (mulmac) pairs
with a reset to zero (or the nilpotent).
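[For concreteness, a minimal C model of the radix-2 Booth recoding jacko
alludes to. The function name and the sequential software framing are
mine; a real single-cycle FPGA multiplier would be a parallel hardware
structure, not a loop.]

    #include <stdint.h>
    #include <stdio.h>

    /* Radix-2 Booth multiply: scan multiplier bit pairs (b[i], b[i-1]).
       (0,1) -> add multiplicand << i; (1,0) -> subtract it;
       (0,0) and (1,1) -> do nothing. The recoding handles
       two's-complement operands with no correction step. */
    int64_t booth_mul(int32_t m, int32_t r)
    {
        uint64_t acc = 0;                       /* unsigned: wrap-around */
        uint64_t mm  = (uint64_t)(int64_t)m;    /* sign-extended operand */
        int prev = 0;                           /* implicit bit b[-1] = 0 */
        for (int i = 0; i < 32; i++) {
            int cur = (int)(((uint32_t)r >> i) & 1u);
            if (!cur && prev) acc += mm << i;   /* pair 01: +M * 2^i */
            if (cur && !prev) acc -= mm << i;   /* pair 10: -M * 2^i */
            prev = cur;
        }
        return (int64_t)acc;
    }

    int main(void)
    {
        printf("%lld\n", (long long)booth_mul(-1234, 5678)); /* -7006652 */
        return 0;
    }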
From: Andy Glew <"newsgroup at comp-arch.net"> on 20 Jul 2010 11:31

On 7/19/2010 11:59 AM, Robert Myers wrote:
> David L. Craig wrote:
>
>> I am new to comp.arch and so am unclear of the pertinent history of
>> this discussion

This is a bit of a tired discussion. Not because the solution is known,
but because the solutions that we think we know aren't commercially
feasible. We need to break out of the box. We welcome new blood, and
new ideas.

>> Also, why single out floating point bandwidth? For instance, what
>> about the maximum number of parallel RAM accesses architectures can
>> support, which has major impacts on balancing cores' use with I/O use?
>
> Computation is more or less a solved problem. Most of the challenges
> left have to do with moving data around, with latency and not bandwidth
> having gotten the lion's share of attention (for good reason). I
> believe that moving data around will ultimately be the limiting factor
> with regard to reducing power consumption.

I'm with you, David. Maximizing what I call the MLP, the memory-level
parallelism - the number of DRAM accesses that can be concurrently in
flight - is one of the things that we can do.

But Robert's comment is symptomatic of the discussion. Robert says most
work has been on latency, by which I think he means caches, and maybe
integrating the memory controller. I say MLP to Robert, but he glides
on by. Robert is interested in brute-force bandwidth.

Mitch points out that modern CPUs have 1-4 DRAM channels, which defines
the bandwidth that you get, assuming fairly standard JEDEC DRAM
interfaces. GPUs may have more channels (6 being a possibility), wider,
etc., so higher bandwidth is a possibility.

Me, I'm just the MLP guy: give me a certain number of channels and a
certain bandwidth, and I try to make the best use of them. MLP is one
of the ways of making more efficient use of whatever limited bandwidth
you have. I guess that's my mindset - making the most of what you have.
Not because I don't want to increase the overall memory bandwidth, but
because I don't have any great ideas on how to do so, apart from:

a) more memory channels
b) wider memory channels
c) memory channels/DRAMs that handle short bursts/high address
   bandwidth efficiently
d) DRAMs with a high degree of internal banking
e) aggressive DRAM scheduling

Actually, c, d, and e are really ways of making more efficient use of
bandwidth, i.e. preventing pins from going idle because the burst
length is giving you a lot of data you don't want.

f) stacking DRAMs
g) stacking DRAMs with an interface chip, such as Tom Pawlowski of
   Micron proposes, and a new abstract DRAM interface, enabling all of
   the good stuff above but keeping DRAM a commodity
h) stacking DRAMs with an interface chip and a processor chip (with
   however many processors you care to build)

Actually, I think it is inaccurate to say that Robert Myers just wants
brute-force memory bandwidth. I think he would be unhappy with a
machine that achieved brute-force memory bandwidth by having 4 KiB
burst transfers, because while that machine might be good for DAXPY,
it would not be good for most of the codes Robert wants. I think that
Robert does not want brute-force sequential bandwidth. I think that he
needs random-access-pattern bandwidth.

Q: is that so, Robert?

> Even leaving aside justifying why expensive bandwidth is not optional,
> there is little precedent here for in-depth explorations of blue-sky
> proposals. A fair fraction of the blue-sky propositions brought here
> can't be taken seriously, and my sense of this group is that it wants
> to keep the thinking mostly inside the box, not for want of
> imagination, but to avoid science fiction and rambling, uninformed
> discussion.

I'm game for blue-sky SCIENCE FICTION, i.e. imaginings based on science
that have some possibility of being true. I'm not so hot on science
FANTASY: imaginings based on wishful thinking.
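[A minimal sketch of the MLP Glew describes - my example, not code from
the thread. One dependent pointer chain serializes its cache misses,
while interleaving K independent chains lets the memory system keep up
to K misses in flight. By Little's law, sustained bandwidth is roughly
(outstanding misses x line size) / latency; with illustrative numbers
of 100 ns latency and 64-byte lines, a single chain sustains only about
0.64 GB/s no matter how fast the DRAM pins are.]

    #include <stddef.h>

    typedef struct node { struct node *next; long payload; } node;

    /* One chain: each miss must complete before the next address is
       known, so misses serialize: bandwidth ~ line_size / latency. */
    long walk_one(node *p)
    {
        long sum = 0;
        for (; p != NULL; p = p->next)
            sum += p->payload;
        return sum;
    }

    /* K independent chains, advanced round-robin: an OoO core (or a
       prefetch engine) can keep up to K misses outstanding at once,
       multiplying sustained bandwidth by up to K. */
    long walk_many(node *heads[], int k)
    {
        long sum = 0;
        for (int live = 1; live; ) {
            live = 0;
            for (int i = 0; i < k; i++) {
                if (heads[i] != NULL) {
                    sum += heads[i]->payload;
                    heads[i] = heads[i]->next;  /* k independent loads */
                    live = 1;
                }
            }
        }
        return sum;
    }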
From: MitchAlsup on 20 Jul 2010 12:35

On Jul 20, 10:31 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> Actually, I think that it is inaccurate to say that Robert Myers just
> wants brute force memory bandwidth.

Undoubtedly correct.

As to why vector machines fell out of fashion: vectors were architected
to absorb memory latency. Early Crays had 64-entry vectors and 20-ish
cycle main memory. Later, as the CPUs got faster and the memories got
larger and more interleaved, the latency, in cycles, to main memory
increased. And once the vector machines got to where main-memory
latency, in cycles, was greater than the vector length, their course
had been run.

Nor can OoO machines create vector performance rates unless the latency
to <whatever layer in the memory hierarchy supplies the data> can be
absorbed by the size of the execution window. Thus, the execution
window needs to be 2.5-3 times the number of flops being crunched per
loop iteration. We are at the point where, even when the L2 cache
supplies the data, there are too many latency cycles for the machine to
strip-mine data efficiently. {And in most cases the cache hierarchy is
not designed to strip-mine data efficiently, either.}

Neither
a) high latency with adequate bandwidth, nor
b) low latency with inadequate bandwidth
enables vector execution rates - that is, getting the most out of the
FP computation capabilities.

Mitch
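[To put rough numbers on Mitch's window argument - the modern-machine
figures below are illustrative assumptions, not from his post. A vector
register absorbs latency when one vector load keeps the pipe busy long
enough to hide the next load's latency, which a 64-entry vector easily
does against a 20-cycle memory. For an OoO window to do the same, it
needs roughly

    in-flight operations >= latency (cycles) x issue rate (ops/cycle)

so hiding an assumed 200-cycle DRAM latency on a core sustaining
4 flops/cycle takes a window on the order of 200 x 4 = 800 operations,
far beyond the reorder buffers of 2010-era cores. Even an assumed
~15-cycle L2 already demands roughly 15 x 4 = 60 operations in flight
just to strip-mine data from it.]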
From: Robert Myers on 20 Jul 2010 12:54
Andy Glew wrote:
> I think that Robert does not want brute force sequential bandwidth.
> I think that he needs random access pattern bandwidth.
>
> Q: is that so, Robert?

The problems that I personally am most interested in are fairly
"crystalline": very regular access patterns across very regular data
structures. So the data access patterns are neither random nor
sequential.

The fact that processors and memory controllers want to deal with cache
lines and not with 8-byte words is a serious constraint. No matter how
you arrange a multi-dimensional array, some kinds of access are going
to waste huge amounts of bandwidth, even though the access pattern is
far from random.

In the ideal world, you don't want scientists and engineers worrying
about where things are, and more and more problems involve access
patterns that are hard to anticipate. If you can't make random access
free (as fast as sequential access), then at least you can aim at
making hardware naiveté less costly (a factor of, say, two penalty for
having majored in physics rather than computer science, rather than a
factor of, say, ten or more). Problems that require truly random (or
hard-to-anticipate) access are (as I perceive things) far more frequent
than they were in the early Cray days, and the costs of dealing with
them are increasingly painful.

To attempt to be concise: I have no doubt that the needs of media
stream processors will be met without my worrying about it. Any kind of
more complicated access (I speculate) is now so far down the priority
list that, from the POV of COTS processor manufacturers, it is in the
noise. So I'm interested in talking about any kind of calculation that
can't feed a GPU without some degree of hard work or even magic.

If I seem a tad blasé about the parts of the problem you understand the
most about (or are most interested in), it's because my concerns extend
far beyond a standard rack-mounted board, and even beyond the rack to
the football-field-sized installations that get the most press in HPC.
There are so many pieces to this problem that even making a
comprehensive list is a challenge. At one time, you could enter a room
and see a Cray 2 (not including the supporting plumbing). Now you'd
have to take the roof off a building and rent a helicopter to get a
similar view of a state-of-the-art "supercomputer." There's a lot to
think about.

I'm also interested in what you can build that doesn't occupy
significant real estate and require a rewiring of the nation's electric
grid, so I'm interested in what you can jam onto a single board or into
a single rack. No shortage of things to talk about.

A final word about latency and bandwidth. I really want to keep my mind
as open as possible. The more latency you can tolerate, perhaps with
some of the kinds of exotic approaches (e.g., huge instruction windows)
that interest you, the more options you have for approaching the
problem of bandwidth. I know that most everyone here understands that.
I just want to make it clear that I understand it, too.

Robert
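[A small C sketch of the cache-line waste Robert describes - my
example; the array size and the 64-byte line size are assumptions
chosen for illustration. With 64-byte lines and 8-byte doubles, a
column-order walk of a row-major array uses one double per line
fetched, so it moves roughly eight times the data the arithmetic
needs.]

    #include <stdio.h>

    #define N 2048              /* 2048*2048 doubles = 32 MiB >> cache */
    static double a[N][N];

    /* Unit stride: all 8 doubles in each 64-byte line are used. */
    double sum_rows(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Stride of 8*N bytes: one double used per 64-byte line fetched,
       so ~8x the memory traffic for the same arithmetic. */
    double sum_cols(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void)
    {
        printf("%f %f\n", sum_rows(), sum_cols());
        return 0;
    }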