From: David Kanter on 27 Jul 2010 17:29

On Jul 27, 9:37 am, nos...(a)ab-katrinedal.dk (Niels Jørgen Kruse) wrote:
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> > Andy Glew wrote:
> > > On 7/27/2010 6:16 AM, Niels Jørgen Kruse wrote:
> > >> 24 MB L3 per 4 cores
> > >> up to 768 MB L4
> > >> 256 byte line sizes at all levels.
> > >
> > > 256 *BYTE*?
> >
> > Yes, that one rather screamed at me as well.
>
> Another surprising thing I spotted browsing through the redbook, is the
> claim of single cycle L1D access. That must be array access only, so
> there are at least address generation and format cycles before and
> after. Still, 3 cycle loads from a 128 KB L1D at 5.2 GHz must show up on
> the power budget.

It's definitely array access only. Honestly, they've got an awful lot to get done in 2-3 cycles. TLB look up (should be in parallel), tag and parity check, array access, data format, send data somewhere. The data format and transfer might be pipelined, but even so... that's a lot of activity.

What is interesting is that they indicated the L1 and L2 are both write-thru/store-thru designs, while the L3 and L4 are write-back/store-in. Write thru should be a transitive quality, and if that's correct, then the latency for a store to retire is going to be pretty high, requiring an L1 write, an L2 write and an L3 write.

DK
From: Andy Glew "newsgroup at on 28 Jul 2010 01:59 On 7/27/2010 2:29 PM, David Kanter wrote: > On Jul 27, 9:37 am, nos...(a)ab-katrinedal.dk (Niels J�rgen Kruse) > wrote: > What is interesting is that they indicated the L1 and L2 are both > write-thru/store-thru designs, while the L3 and L4 are write-back/ > store-in. Write thru should be a transitive quality, and if that's > correct, then the latency for a store to retire is going to be pretty > high, requiring an L1 write, an L2 write and an L3 write. But the latency for a store to retire doesn't matter that much. So long as it can be pipelined. E.g. if you have stores queued up a) you can have "obtained ownership", i.e. ensured that all other copies of the line have been invalidated in all other peer store-thru caches, before the store starts to retire. (IBM does this; Intel did NOT do this, up until Nehalem and QPI. I.e. IBM does "invalidate before store-thru", whereas older Intel machines did "write-through-invalidates", because they had a different, less constrained, memory model and a more constrained system architecture.) b) assuming store1 and store2 are queued up: cycle 1: store1 L1 write cycle 2: store2 l1 write; store1 L2 write cycle 3: store2 L2 write; store1 L3 write cycle 4: store2 L3 write You have consistent state at all points. Also, you can store combine, at least into same line (and, aggressively, into different lines).
From: Andy Glew "newsgroup at on 28 Jul 2010 11:05 On 7/27/2010 8:08 AM, Terje Mathisen wrote: > Andy Glew wrote: >> On 7/27/2010 6:16 AM, Niels J�rgen Kruse wrote: >>> 24 MB L3 per 4 cores >>> up to 768 MB L4 >>> 256 byte line sizes at all levels. >> >> 256 *BYTE*? [cache line size on new IBM z-Series] > > Yes, that one rather screamed at me as well. >> >> 2048 bits? >> >> Line sizes 4X the typical 64B line size of x86? >> >> These aren't cache lines. They are disk blocks. > > Yes. So what? > > I (and Nick, and you afair) have talked for years about how current CPUs > are just like mainframes of old: > > new old > DISK -> TAPE : Sequential access only > RAM -> DISK : HW-controlled, block-based transfer > CACHE -> RAM : Actual random access, but blocks are still faster > >> >> Won't make Robert Myers happy. Yes, I know. Many of my responses to Robert Myers have been explanations of this, the state of the world. However, the reason that I am willing to cheer Robert on as he tilts at his windmill, and even to try to help out a bit, is that this trend is not a fundamental limit. I.e. there is no fundamental reason that we have to be hurting random accesses as memoy systems evolve. People seem to act as if there are only two design points: * low latency, small random accesses * long latency, burst accesses But it is possible to build a system that supports * small random accesses with long latencies By the way, it is more appropriate to say that the current trend is towards * long latency, random long sequential burst accesses. (Noting that you can have random non-sequential burst accesses, as I have recently posted about.) The things that seem to be driving the evolution towards long sequential bursts are a) tags in caches - the smaller the cache objects, the more area wasted on tags. But if you don't care about tags for your small random accesses... b) signalling overhead - long sequential bursts have a ratio of address bits to data bits of, say, 64:512 = 1:8 for Intel's 64 byte cache lines, and 64:2048 = 8:256 = 1:32 for IBM's 256 byte cache lines. Whereas scatter gather has a signalling ratio of more like 1:1. Signalling overhead manifests both in bandwidth and power. One can imagine an interconnect that handles both sequential bursts and scatter/gather random accesses - so that you don't pay a penalty for sequential access patterns, but you support small random access patterns with long latencies well. But... c) this is complex. More complex than simply supporting sequential bursts. But I'm not afraid of complexity. I try to avoid complexity, when there are simpler ways of solving a problem. But it appears that this random access problem is a problem that (a) is solvable (with a bit of complexity), (b) has customers (Robert, and some other supercomputing customers I have met, some very important), and (c) isn't getting solved any other way. For all that we talk about persuading programmers that DRAM is the new disk. > 768 MB of L4 means your problem size is limited to a little less than > that, otherwise random access is out. It may be worse than you think. I have not been able to read the redbook yet (Google Chrome and Adobe Reader were conspiring to hang, and could not view/download the document; I had to fall back to Internet Explorer). But I wonder what the cache line size is in the interior caches, the L1, L2, L3? With the IBM heritage, it may a small, sectored cache line. Say 64 bytes. 
But, I also recall seeing IBM machines that could transfer a full 2048 bits between cache and registers in a single cycle. Something which I conjecture is good for context switches on mainframe workloads.

If the 256B cache line is used in the inside caches, then it might be that only the L1 is really capable of random access.

Or, rather: there is no absolute "capable of random access". Instead, there are penalties for random access. I suggest that the main penalty should be measured as the ratio of

    the time to transfer N bytes by small random accesses

to

    the time to transfer N bytes by long sequential bursts.

Let us talk about 64-bit random accesses.

Inside the L1 cache at Intel, with 64 byte cache lines, this ratio is close to 1:1. Accessing data that fits in the L2, this ratio is circa 8:1 - i.e. long burst sequential is 8X faster, higher bandwidth, than 64b random accesses. From main memory the 8:1 ratio still approximately holds wire-wise, but buffering effects tend to crop up which inflate it.

With 256B cache lines, the wire contribution to this ratio is 32:1 - i.e. long burst sequential is 32X faster, higher bandwidth, than 64b random accesses. Probably with more slowdowns. (A crude user-level way to eyeball this ratio is sketched at the end of this post.)

---

What I am concerned about is that it may not be that "DRAM is the new disk". It may be that "L2 cache is the new disk". More likely "L4 cache is the new disk".

---

By the way, this is the first post I am making to Robert Myers' high-bandwidth-computing(a)googlegroups.com mailing list.

Robert: is this an appropriate topic?
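The burst-vs-random ratio above can be eyeballed from user code. The following is a rough sketch, not a calibrated benchmark: the 64 MB working set is an arbitrary assumption meant to exceed the last-level cache, the random indices come from a cheap xorshift generator (whose overhead lands in the random-access timing), and the same number of 64-bit loads is issued in each pass.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

/* Rough sketch: compare the time for sequential 64-bit reads vs.
 * pseudo-random 64-bit reads over the same working set.
 * Build e.g. with: cc -O2 ratio.c */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const size_t n = (64u << 20) / sizeof(uint64_t);  /* 64 MB working set */
    uint64_t *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < n; i++) a[i] = i;

    volatile uint64_t sink = 0;

    /* Sequential pass: the hardware sees long unit-stride bursts. */
    double t0 = now_sec();
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    double seq = now_sec() - t0;

    /* Random pass: same number of 64-bit loads, scattered indices. */
    t0 = now_sec();
    uint64_t r = 0, x = 88172645463325252ull;         /* xorshift64 state */
    for (size_t i = 0; i < n; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;      /* cheap PRNG */
        r += a[x % n];
    }
    double rnd = now_sec() - t0;

    sink = s + r;                     /* keep both loops from being elided */
    (void)sink;
    printf("sequential: %.3f s   random: %.3f s   ratio ~ %.1f:1\n",
           seq, rnd, rnd / seq);
    free(a);
    return 0;
}

Compiled with optimization and a working set bigger than the last-level cache, the printed ratio is a rough stand-in for the burst:random penalty; shrink the working set to L1 size and it falls back toward 1:1.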
From: Jason Riedy on 28 Jul 2010 12:10

And Andy Glew writes:
> c) this is complex. More complex than simply supporting sequential bursts.
>
> But I'm not afraid of complexity. I try to avoid complexity, when
> there are simpler ways of solving a problem. But it appears that this
> random access problem is a problem that (a) is solvable (with a bit of
> complexity), (b) has customers (Robert, and some other supercomputing
> customers I have met, some very important), and (c) isn't getting
> solved any other way.

There are customers who evaluate systems using the GUPS benchmark[1], some vendors are trying to address it, and some contract RFPs require considering the issue (DARPA UHPC).

A dual-mode system supporting full-bandwidth streams (possibly along affine ("crystalline"?) patterns of limited dimension) and, say, half-bandwidth word access would permit balancing the better bandwidth and power efficiency of streams with scatter/gather/GUPS accesses that currently are bottlenecks. Those bottlenecks also waste power, so having both could be a win from the system perspective even if a single component might draw more power.

The Blue Waters slides presented at IPDPS'10 make me believe IBM's going that route with a specialized interconnect controller per board, but I don't remember/know the details. Another vendor also understands this split and wants to support both access patterns. Again, I don't know the details, but I'm pretty sure they're going in this dual-mode direction.

Considering people have dropped things like networked file systems and IP routing protocols into FPGAs and silicon, I can't believe supporting two modes would be much more of a technical challenge. And it looks like there may finally be money attached to tackling that challenge.

Jason

Footnotes:
[1] http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/
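For readers unfamiliar with it, the heart of the GUPS/RandomAccess pattern referenced in [1] is a stream of read-modify-write updates to pseudo-random 64-bit words in a large table. The sketch below is a simplified, single-threaded rendition of that update loop, not the official HPCC code; it omits timing, verification, and the multi-stream details, and the table size and seed here are arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Simplified single-threaded sketch of the GUPS (RandomAccess) update
 * loop: each iteration XORs a pseudo-random value into a pseudo-random
 * 64-bit word of a large table. Almost every update misses in cache and
 * touches a different line, with essentially no spatial locality. */

#define POLY 0x0000000000000007ull   /* primitive polynomial used by HPCC */

static uint64_t next_random(uint64_t x) {
    return (x << 1) ^ ((int64_t)x < 0 ? POLY : 0);
}

int main(void) {
    const uint64_t table_size = 1ull << 23;     /* 8 M words = 64 MB; arbitrary */
    const uint64_t updates    = 4 * table_size; /* HPCC uses 4x the table size */

    uint64_t *table = malloc(table_size * sizeof *table);
    if (!table) return 1;
    for (uint64_t i = 0; i < table_size; i++) table[i] = i;

    uint64_t ran = 1;
    for (uint64_t i = 0; i < updates; i++) {
        ran = next_random(ran);
        table[ran & (table_size - 1)] ^= ran;   /* the random update */
    }

    /* Print a word so the loop isn't optimized away. */
    printf("table[42] = %llu\n", (unsigned long long)table[42]);
    free(table);
    return 0;
}

That word-granularity scatter/update traffic is exactly what the "half-bandwidth word access" mode of a dual-mode memory system would have to serve well.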
From: Robert Myers on 28 Jul 2010 13:52
On Jul 28, 11:05 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> By the way, this is the first post I am making to Robert Myers'
> high-bandwidth-computing(a)googlegroups.com mailing list.
>
> Robert: is this an appropriate topic?

I'm happy to let the discussion go whatever way it wants to.

Using available bandwidth more effectively is the same as having more bandwidth. You could say that reducing the issue to "more bandwidth!" is as bad as reducing all of HPC to "more flops!"

Robert.