From: "Andy "Krazy" Glew" on 24 Dec 2009 00:42 Andy "Krazy" Glew > Hetero doesn't impact this - unless you are tempted to do things like > track, say, only one outstanding transaction per small core, and not to > allocate memory controller buffers for small core requests. > > Just say no. Of course simple cores that block on cache misses or remote accesses are the ultimate MIMD, and may be the endpoint of computer architecture as transistors grow cheap and power expensive. In many ways, this is what Shekhar Borkhar, the mouthpiece of Intel's CRL (Circuit Research Labs) advocates. I've posted about my interest in coherers theaded GPUs that interleave and/ or switch to other threads. But threads waste power with their large register files. However, when I door the math, we aren't there yet. - _ - This I am writing on a plane; using my tablet PC. I have to use it rotated sideways, but at least I can use it, whereas I cannot type. Darn compressed seating! Also, I usually get an aisle,
From: "Andy "Krazy" Glew" on 24 Dec 2009 00:42 Terje Mathisen wrote: > Andy "Krazy" Glew wrote: > [interesting spectrum of distributed memory models snipped] >> I think SMB-WB-bitmask is more likely to be important than the weak >> models 0.7 and 0.9, >> in part because I am in love with new ideas >> but also because I think it scales better. >> >> It provides the performance of conventional PGAS, but supports cache >> locality when it is present. And poses none of the semantic challenges >> of software managed cache coherency, although it has all of the same >> performance issues. >> >> Of ourse, it needs roghly 64 bits per cache line. Which may be enough to >> kill it in its tracks. > > Isn't this _exactly_ the same as the current setup on some chips that > use 128-byte cache lines, split into two sectors of 64 bytes each. > > I.e. an effective cache line size that is smaller than the "real" line > size, taken to its logical end point. > > I would suggest that (as you note) register size words is the smallest > item you might need to care about and track, so 8 bits for a 64-bit > platform with 64-byte cache lines, but most likely you'll have to > support semi-atomic 32-bit operations, so 16 bits which is a 3% overhead. > > Terje Well, it's not * exactly * like sectored cache lines. You typically need the sector size to be a multiple of the dram burst transfer size, what Jim Goodman called the 'transfer block size' in his paper that I thought defined the only really good terminology. The 'cache line size', what Jim Goodman called the 'address block size', Is A multiple, usually the usual power of two aligned multiple of the transfer block size. Indeed, the address block may consist of several sub blocks that, forgetting Jim's notation, I will call a residency block, AKA sector. That in turn may consist of several transfer blocks. Whereas the write bitmasks, whether at byte or word granularity, are finer grain than the transfer block size. Byte granularity is motivated because it is the smallest granularity that you can usually write into some memories without having to do or read modify write. Almost nobody allows you to write at bit granularity. Sure, some systems do not allow you to write at byte granularity and they may even require you to write at word or cache line granularity. But byte granularity is very widespread. If you track this at word granularity but allow the user to write a byte granularity because that's what his instruction set has, then you run the risk of losing writes. Example: original memory location value equals ABCD. Two processors P1 and P2, both read the memory location. P1 writes X in the first location yielding XBCD. P2 writes Y in the last location yielding ABCY. Let's assume that both of these values are resident in their respective processors' caches but the caches are not cash coherent. If P1 evicts first then main memory and other processors will see XBCD, and if P2 then evicts then P1's write of X will disappear and memory will contain ABCY. Writes can be lost in this way whenever the bitmasks used to merge the evicted cache lines are of coarser granularity than the minimum write size in the instruction set. I * think * that this may be important. People complain about non cache coherent systems. But if you think about it, non cache coherent systems really have several different surprising behaviors - behaviors that a na�ve programmer might find surprising: A) different processors may see different values in the same memory location at the same time. 
I *think* that this may be important. People complain about non-cache-coherent systems. But if you think about it, non-cache-coherent systems really have several different surprising behaviors - behaviors that a naïve programmer might find surprising:

A) Different processors may see different values in the same memory location at the same time. Sure, this is confusing, but it is rather inherent in non-cache-coherent systems; avoiding it is the whole point of cache coherent systems.

B) Non-cache-coherent systems usually have weak memory ordering.

C) Writes get lost, as I describe above.

Writeback systems often solve all of these problems at the same time. Write-through cache protocols may solve them all, but often only solve A and C, leaving B, weak memory ordering. (The presenters of the memory tutorial at ISCA earlier this year defined it succinctly: on strongly ordered IBM systems with write-through caches, you must ensure that all other copies of the cache line are invalidated before the write-through is performed. To which I add: on a weakly ordered write-through system, you perform the write-through, and perform the invalidations as a side effect of snooping the write-through. I.e., strongly ordered write-through systems essentially perform a read-for-ownership before the write-through.)

Since the whole point of this exercise is to try to reduce the overhead of cache coherency, but people have demonstrated that they don't like the consequences semantically, I am trying a different combination: allow A, multiple values; allow B, weak ordering; but disallow C, losing writes. I posit that this may be more acceptable and lead to fewer bugs.

I.e. I am suspecting that full cache coherency is overkill, but that completely eliminating cache coherency is underkill.

- - -

*This* post, by the way, is composed almost exclusively by speech recognition, using the pen for certain trivial edits. It's nice to find a way that I can actually compose stuff on a plane again.
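The two write-through flavors described in the parenthetical above, rendered as a minimal C sketch. The function names are hypothetical stubs for the coherence fabric, not any real API; they are there only to pin down which operation happens before which.

    #include <stdint.h>

    typedef uintptr_t addr_t;
    typedef uint64_t  word_t;

    /* Hypothetical stubs standing in for the coherence fabric. */
    static void invalidate_other_copies(addr_t a)           { (void)a; }
    static void write_through_to_memory(addr_t a, word_t v) { (void)a; (void)v; }

    /* Strongly ordered write-through (the IBM style above): every other
     * copy of the line is invalidated BEFORE the write-through goes
     * out -- essentially a read-for-ownership first. */
    void strongly_ordered_store(addr_t a, word_t v) {
        invalidate_other_copies(a);
        write_through_to_memory(a, v);
    }

    /* Weakly ordered write-through: the write-through goes out at once;
     * other caches invalidate only as a side effect of snooping it.
     * Until each snoop lands, different processors can see different
     * values (behavior A), and ordering is weak (behavior B). */
    void weakly_ordered_store(addr_t a, word_t v) {
        write_through_to_memory(a, v);
        /* invalidations happen later, driven by snoops */
    }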
From: "Andy "Krazy" Glew" on 24 Dec 2009 00:49 Terje Mathisen wrote: > Andy "Krazy" Glew wrote: >> I think SMB-WB-bitmask is more likely... > Isn't this _exactly_ the same as the current setup on some chips that > use 128-byte cache lines, split into two sectors of 64 bytes each. > > I.e. an effective cache line size that is smaller than the "real" line > size, taken to its logical end point. > > I would suggest that (as you note) register size words is the smallest > item you might need to care about and track, so 8 bits for a 64-bit > platform with 64-byte cache lines, but most likely you'll have to > support semi-atomic 32-bit operations, so 16 bits which is a 3% overhead. > > Terje I know that I meant to reply to this on the airplane going to my parents' wedding anniversary. I just dug the post out of my drafts folder. Briefly: Sectors are usually the burst size of memory. Anything coarser grain than byte granularity gives rise to the possibility of losing writes. How bad is that? We already have the possibility of losing writes when we write to individual bits within word or byte. Maybe we can increase the granularity?
From: "Andy "Krazy" Glew" on 24 Dec 2009 01:08 nmm1(a)cam.ac.uk wrote: >> I think SMB-WB-bitmask is more likely to be important than the weak >> models 0.7 and 0.9, >> in part because I am in love with new ideas >> but also because I think it scales better. > > It also matches language specifications much better than most of the > others, which is not a minor advantage. That could well be the > factor that gets it accepted, if it is. This is my thinking. Language specifications, sure, but I think there mainly important because they indicate what the programmer expects: not losing writes. Note: languages that allow bit fields to be specified, such as int a:1, experience the lossage of losing writes for such sub-byte accesses even on cache coherent shared memory subsystems. Unless, that is, they generate interlocked RMWs such as LOCK BTS for all such bit accesses. Hmm... Here's an idea: in the bad old days you would never want to generate locked instructions if you could avoid them. Bus locks are really slow. But the trend is being to make cache locks really really cheap. They are on the verge of being as cheap as unlocked operations if they hit in the cache, or if they miss but are uncontended. Perhaps, if something like SMB-WB-bitmask is implemented at word granularity rather than byte granularity, we should implement the operations that allow bytes not be lost in much the same way that LOCK BTS prevents bits from being lost, with unlocked BTS as a possible optimization. Such an instruction is: LOCK write bytes under mask. I.e. LOCK mem := (mem & mask) | (stdata & ~mask) Or even lock right bits under mask. .... Or perhaps we would just want to use hardware that did this to implement byte writes.
From: "Andy "Krazy" Glew" on 24 Dec 2009 01:16
Andy "Krazy" Glew wrote: > Hmm... Here's an idea: in the bad old days you would never want to > generate locked instructions if you could avoid them. Bus locks are > really slow. But the trend is being to make cache locks really really > cheap. They are on the verge of being as cheap as unlocked operations > if they hit in the cache, or if they miss but are uncontended. > > .... Or perhaps we would just want to use hardware that did this to > implement byte writes. Urg. But of course, the way we get cheap LOCKs is cache coherency. And we are trying to avoid cache coherency. Cache coherency only for byte accesses? :>-( |