From: Robert Myers on 28 Jul 2010 14:01

On Jul 28, 12:10 pm, Jason Riedy <ja...(a)acm.org> wrote:
> And Andy Glew writes:
> > c) this is complex.  More complex than simply supporting sequential
> > bursts.
> >
> > But I'm not afraid of complexity.  I try to avoid complexity, when
> > there are simpler ways of solving a problem.  But it appears that this
> > random access problem is a problem that (a) is solvable (with a bit of
> > complexity), (b) has customers (Robert, and some other supercomputing
> > customers I have met, some very important), and (c) isn't getting
> > solved any other way.
>
> There are customers who evaluate systems using the GUPS benchmark[1],
> some vendors are trying to address it, and some contract RFPs require
> considering the issue (DARPA UHPC).  A dual-mode system supporting
> full-bandwidth streams (possibly along affine ("crystalline"?) patterns
> of limited dimension) and, say, half-bandwidth word access would permit
> balancing the better bandwidth and power efficiency of streams with
> scatter/gather/GUPS accesses that currently are bottlenecks.

Had I thought to use the word "affine," which I associate with
mind-bending texts involving Christoffel symbols, I would have made a
much less memorable post and not gotten my point across nearly as
clearly. ;-)

Affine might be too general, though, since it includes general
rotation.

Robert.
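For reference, GUPS (the HPC Challenge RandomAccess measure Jason mentions) boils down to a loop like the minimal sketch below. The table size and random stream here are placeholders (the real benchmark, as I recall, sizes the table to roughly half of memory and uses a GF(2)-polynomial generator), but the essential point survives: every iteration is a read-modify-write of an effectively random word, so neither caches nor sequential bursts help.

    #include <stdint.h>
    #include <stddef.h>

    #define TABLE_WORDS (1u << 20)  /* placeholder; the real run uses ~half of memory */
    static uint64_t table[TABLE_WORDS];

    /* Placeholder random stream (the benchmark proper uses a GF(2) generator). */
    static uint64_t next_rand(uint64_t *s)
    {
        *s = *s * 6364136223846793005ULL + 1442695040888963407ULL;
        return *s;
    }

    /* Core of a GUPS-style measurement: n read-modify-writes of random words.
       "Giga-updates per second" is n divided by the wall-clock time of this loop. */
    void random_updates(size_t n, uint64_t seed)
    {
        for (size_t i = 0; i < n; i++) {
            uint64_t r = next_rand(&seed);
            table[r & (TABLE_WORDS - 1)] ^= r;   /* random read + random write */
        }
    }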
From: Jason Riedy on 28 Jul 2010 15:46

And Robert Myers writes:
> Affine might be too general, though, since it includes general
> rotation.

You're right.  I was digging in my head for a term...  Oh!  The
polyhedral model[1] for loops.  The affine part comes from
transformations between polyhedra.

I'm imagining the memory actions following the final, transformed
polyhedron.  Each can be described by a base address and a vector of
strides, etc.  Some dimensions will cross node boundaries.  I suspect
(with no data) that supporting two dimensions covers most uses that can
be reduced to streaming-ish loads, and three dimensions would cover
almost all uses short of real random access.  IIRC, some prefetch
engines handle the 2d case for same-node memory.

The IPDPS presentation implied (to me) that Blue Waters folks have
thought about striding across nodes for PGAS support and have some
support in the little communication processors.  I don't know if anyone
has thought hard about system-wide coordination of the stride patterns
on modern (less synchronized) machines, though, and that could prove
critical for nontrivial streaming access in "exascale" memories.

The MTA/XMT has one "solution" for global coordination of random access
patterns: randomize (hash) all addresses.  The implications for
reliability are, um, open for interpretation.  Forget about direct I/O.
And you still must avoid hot-spots.  I'm dubious that the randomization
buys anything but pain.

Jason

Footnotes:
[1] http://en.wikipedia.org/wiki/Polytope_model
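To make the base-plus-strides idea concrete, here is a minimal sketch in C with hypothetical field names: a descriptor for a two- or three-dimensional affine pattern, and a walker that enumerates the word addresses a two-dimensional instance generates. A scatter/gather engine (or the little communication processors mentioned above) would presumably consume such a descriptor and burst the addresses instead of iterating over them one at a time.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical descriptor for an affine ("crystalline") access pattern:
       a base address plus a stride and trip count per dimension. */
    struct affine_desc {
        uintptr_t base;        /* starting byte address             */
        int       ndim;        /* number of dimensions (<= 3 here)  */
        intptr_t  stride[3];   /* byte stride in each dimension     */
        size_t    extent[3];   /* trip count in each dimension      */
    };

    /* Enumerate the addresses a 2-d descriptor generates; a gather engine
       would burst these to memory rather than touching them one by one. */
    static void walk2d(const struct affine_desc *d, void (*visit)(uintptr_t))
    {
        for (size_t i = 0; i < d->extent[1]; i++)
            for (size_t j = 0; j < d->extent[0]; j++)
                visit(d->base + (intptr_t)i * d->stride[1]
                              + (intptr_t)j * d->stride[0]);
    }

    static void print_addr(uintptr_t a) { printf("0x%lx\n", (unsigned long)a); }

    int main(void)
    {
        double A[8][16];                     /* row-major 8x16 matrix */
        struct affine_desc d = {
            .base   = (uintptr_t)&A[0][0],
            .ndim   = 2,
            .stride = { sizeof(double), 16 * sizeof(double) },
            .extent = { 8, 4 },    /* first 8 words of each of the first 4 rows */
        };
        walk2d(&d, print_addr);
        return 0;
    }

A dimension whose stride crosses node boundaries would be handled the same way, just with the high bits of each generated address selecting the target node.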
From: Niels Jørgen Kruse on 28 Jul 2010 15:58

Andy Glew <"newsgroup at comp-arch.net"> wrote:

> I have not been able to read the redbook yet (Google Chrome and Adobe
> Reader were conspiring to hang, and could not view/download the
> document; I had to fall back to Internet Explorer).

Perhaps it is easier to download from the parent page:
<http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg247833.html>

--
Mvh./Regards,
Niels Jørgen Kruse, Vanløse, Denmark
From: Andy Glew "newsgroup at comp-arch.net" on 28 Jul 2010 21:49

On 7/28/2010 12:46 PM, Jason Riedy wrote:
> And Robert Myers writes:
>> Affine might be too general, though, since it includes general
>> rotation.
>
> I don't know if anyone has thought hard about system-wide coordination
> of the stride patterns on modern (less synchronized) machines, though,
> and that could prove critical for nontrivial streaming access in
> "exascale" memories.
>
> The MTA/XMT has one "solution" for global coordination of random
> access patterns: randomize (hash) all addresses.  The implications for
> reliability are, um, open for interpretation.  Forget about direct
> I/O.  And you still must avoid hot-spots.  I'm dubious that the
> randomization buys anything but pain.

"Affine" is a good word.  I think including general rotations is good.
But "crystalline" sparkles. :-)

Thanks, Jason, for reminding me about GUPS.  Plus, if you look at some
of the recent GUPS champions, they have been doing in software what I
propose doing in hardware, along the lines of scatter/gather bursting.

I suspect that we will end up in a 3-way discussion:

1) sequential

2) crystalline or affine - regular access patterns, just not supported
well by block sequential

3) random access patterns

My user community is more in the random access patterns than the
crystalline or affine.  E.g. I am into pointers, linked data
structures, irregular sparse arrays.  Not regularity.  More AI than
FFT.

At the moment 2) and 3) are allied.  A good solution to 3) will also
help 2) a lot, but not vice versa.

I am a little bit afraid of proposals that seem to hardwire support for
certain access patterns into the hardware, at the expense of others not
anticipated by the anticipator of little foresight.

--

I grew up admiring the stunt-box scheduling used to coordinate stride
patterns on old machines.  But every time I looked closely, it was not
really all that fancy - not as fancy as people in this group suggested.
They weren't solving Diophantine equations for access pattern
optimizations in real time; they were applying simple heuristics,
usually variants of greedy, that delivered reasonable performance.
(Now, the *compiler* might be solving Diophantine equations - but
usually not for scheduling.)

--

I'm sympathetic to the randomization, but not certain about it.
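For a sense of what the software version of scatter/gather bursting looks like, here is a rough sketch with made-up sizes and a stub send_burst() transport: random single-word updates get bucketed by destination node and shipped as a few large bursts instead of as isolated messages, which is roughly what a hardware scatter/gather burst mechanism would do transparently.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define NODES      1024     /* made-up machine size                */
    #define BUCKET_CAP 256      /* updates accumulated per destination */

    struct update { uint64_t addr; uint64_t val; };

    static struct update bucket[NODES][BUCKET_CAP];
    static size_t        fill[NODES];

    /* Stub transport; a real implementation would be an RDMA put or an
       active message carrying the whole batch. */
    static void send_burst(int node, const struct update *u, size_t n)
    {
        (void)u;
        printf("node %d: burst of %zu updates\n", node, n);
    }

    static int home_node(uint64_t addr) { return (int)(addr % NODES); }

    /* Rather than issuing each random update as its own tiny message and
       single-word memory access, park it in the bucket for its home node
       and ship the bucket as one burst when it fills. */
    static void buffered_update(uint64_t addr, uint64_t val)
    {
        int n = home_node(addr);
        bucket[n][fill[n]++] = (struct update){ addr, val };
        if (fill[n] == BUCKET_CAP) {
            send_burst(n, bucket[n], fill[n]);
            fill[n] = 0;
        }
    }

    static void flush_buckets(void)    /* drain partial buckets at the end */
    {
        for (int n = 0; n < NODES; n++)
            if (fill[n]) { send_burst(n, bucket[n], fill[n]); fill[n] = 0; }
    }

    int main(void)
    {
        uint64_t x = 1;
        for (int i = 0; i < 100000; i++) {      /* toy random-update stream */
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
            buffered_update(x, x);
        }
        flush_buckets();
        return 0;
    }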
From: Andy Glew "newsgroup at comp-arch.net" on 29 Jul 2010 20:28

On 7/28/2010 9:10 AM, Jason Riedy wrote:
> There are customers who evaluate systems using the GUPS benchmark[1],

Jason, can you explain why GUPS is so update heavy?

Sure, a workload of random updates seems important.  But similarly, a
workload of random reads also seems important.

I think that we need two random read components:

(1) independent random reads

(2) dependent random reads, e.g. pointer chasing: choose a random hash
table, choose a random element in that hash table, and repeat.

Probably with the above combined with updates in some sort of weighted
mix.  Probably also with sequential accesses at the pointer
dereferences.
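For concreteness, a sketch of the two random-read components; the hash-table layout, sizes, and random stream below are placeholders, chosen only to make the dependence structure visible.

    #include <stdint.h>
    #include <stddef.h>

    /* Made-up layout: NTABLES separately chained hash tables. */
    #define NTABLES  64
    #define NBUCKETS (1u << 12)

    struct entry { uint64_t key; struct entry *next; };
    static struct entry *buckets[NTABLES][NBUCKETS];

    static uint64_t next_rand(uint64_t *s)     /* placeholder random stream */
    {
        *s = *s * 6364136223846793005ULL + 1442695040888963407ULL;
        return *s;
    }

    /* (1) Independent random reads: every address comes straight from the
           generator, so the loads carry no dependences between them and can
           all be in flight at once. */
    uint64_t independent_reads(const uint64_t *table, size_t tsize,
                               size_t n, uint64_t seed)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += table[next_rand(&seed) % tsize];
        return sum;
    }

    /* (2) Dependent random reads: choose a random hash table, choose a
           random bucket in it, walk the chain, repeat.  Within each probe
           the chain walk serializes the loads, since each pointer fetched
           supplies the address of the next. */
    uint64_t dependent_reads(size_t n, uint64_t seed)
    {
        uint64_t acc = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t r = next_rand(&seed);
            const struct entry *e = buckets[r % NTABLES][(r >> 8) % NBUCKETS];
            while (e) {
                acc += e->key;
                e = e->next;
            }
        }
        return acc;
    }

The second kernel is the harder one for any burst or banking scheme: no amount of reordering helps within a chain, because each load's address is not known until the previous load returns.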