From: "Andy "Krazy" Glew" on 14 Dec 2009 23:12

nmm1(a)cam.ac.uk wrote:
>> Since integration is inevitable as well as obvious, inevitably we
>> will have more than one cache coherent domains on chip, which are PGAS
>> or MPI non-cache coherent between the domains.
>
> Extremely likely - nay, almost certain. Whether those domains will
> share an address space or not, it's hard to say. My suspicion is
> that they will, but there will be a SHMEM-like interface to them
> from their non-owning cores.

I'm using PGAS as my abbreviation for "shared memory, shared address space, but not cache coherent, and not memory ordered". I realize, though, that some people consider Cray SHMEM different from PGAS. Can you suggest a more generic term?

Hmmm... "shared memory, shared address space, but not cache coherent, and not memory ordered": SM-SAS-NCC-NMO? No, it needs a better name.

--

Let's see, if I have it right:

In strict PGAS (Private/Global Address Space) there are only two forms of memory access:
 1. local private memory, inaccessible to other processors
 2. global shared memory, accessible by all other processors, although implicitly accessible everywhere the same. Not local to anyone.

Whereas SHMEM allows more types of memory accesses, including
 a. local memory, that may be shared with other processors
 b. remote accesses to memory that is local to other processors
as well as remote access to memory that isn't local to anyone. And potentially other memory types.

--

Some people seem to assume that PGAS/SHMEM imply a special type of programmatic memory access. E.g. Kathy Yelick, in one of her SC09 talks, said "PGAS gives programmers access to DMA controllers." Maybe often so, but 'tain't necessarily so.

There are several different ways of "binding" such remote memory accesses to an instruction set so that a programmer can use them. The first two do not involve changes to the CPU microarchitecture:

 a) DMA-style
 b) Prefetch-style

The last involves making the CPU aware of remote memory:

 c) CPU-remote-aware

a) DMA-style - ideally user level, non-privileged, access to something like a DMA engine. The main question is, how do you give user level access to a DMA engine? Memory mapped command registers? (Virtualization issues.) Queues? (Notification issues. E.g. interrupt on completion? Not everyone has user level interrupts. And even though x86 does, they are not frequently used.)

b) Prefetch-style - have the programmer issue a prefetch, somehow. Later, allow the programmer to perform an access. If the prefetch is complete, allow it. (Notification issues.)

Could be a normal prefetch instruction that somehow bypasses the CPU cache prefetch logic (e.g. because of address range). Or, the prefetch could be something like an uncached, UC, store:

 UC-STORE to: magic-address
          data: packet containing the PGAS address Aremote you want to load from,
                plus maybe a few other things - length, stride, etc.
                Plus maybe the actual store data.

Later, you might do a load. Possibly a real load:

 UC-LOAD from: PGAS address Aremote

or possibly a fake load, with a transformed address:

 UC-LOAD from: hash(Aremote)

The load result may contain flags that indicate success/failure/not-yet-arrived.

Life would be particularly nice if your instruction set had operations that allowed you to write out a store address and a data packet, and then read from the same location, atomically. Yes, atomic RMWs. Like in PCIe. Like in the processor CMPXCHG type instructions.
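To make the prefetch-style binding concrete, here is a minimal software sketch of its semantics, in C. The command-slot structure, the field names (remote_addr, length, stride), and the three-way status result are illustrative assumptions, not a real ISA or device interface; on hardware the "issue" step would be the UC store to the magic address and the "poll" step the later UC load.

/* Sketch only: models the issue-then-poll protocol of the
 * prefetch-style binding.  All names and layouts are assumptions. */
#include <stdint.h>
#include <stdio.h>

/* The "store data packet" written to the magic UC address. */
struct remote_fetch_cmd {
    uint64_t remote_addr;   /* Aremote: PGAS address to load from */
    uint32_t length;        /* bytes to fetch                     */
    uint32_t stride;        /* element stride, 0 = contiguous     */
};

/* The "load result": status flags plus, if ready, the data. */
enum fetch_status { FETCH_NOT_YET_ARRIVED, FETCH_SUCCESS, FETCH_FAILURE };

struct remote_fetch_slot {
    struct remote_fetch_cmd cmd;
    enum fetch_status status;
    uint64_t data;
};

/* Stand-in for the UC store to the magic address: posts the command. */
static void issue_prefetch(struct remote_fetch_slot *slot,
                           uint64_t remote_addr, uint32_t length)
{
    slot->cmd.remote_addr = remote_addr;
    slot->cmd.length = length;
    slot->cmd.stride = 0;
    slot->status = FETCH_NOT_YET_ARRIVED;  /* engine completes it later */
}

/* Stand-in for the later UC load from hash(Aremote): polls the slot. */
static enum fetch_status try_load(struct remote_fetch_slot *slot, uint64_t *out)
{
    if (slot->status == FETCH_SUCCESS)
        *out = slot->data;
    return slot->status;
}

int main(void)
{
    struct remote_fetch_slot slot;
    uint64_t value;

    issue_prefetch(&slot, 0x100000, 8);   /* fake remote address, 8 bytes */

    /* ... do other work; some agent eventually completes the fetch ... */
    slot.data = 42;                  /* simulated arrival of remote data */
    slot.status = FETCH_SUCCESS;

    if (try_load(&slot, &value) == FETCH_SUCCESS)
        printf("remote value = %llu\n", (unsigned long long)value);
    return 0;
}

The notification question from the DMA-style discussion shows up here as the FETCH_NOT_YET_ARRIVED case: the programmer either polls or needs some completion signal.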
But, the big cost in all of this is that you probably need to make the operations involved be UC, uncached. And, because x86 has only one main UC memory type, used for legacy I/O, it is not optimized for the usage models that PGAS/SHMEM expect.

c) Finally, one could make the CPU aware of PGAS/SHMEM remote accesses. Possibly as new instructions. Or, possibly as a new memory type.

Now, it is a truism that x86 can't add new memory types. No more page table bits. We'd rather add new instructions. I think this is bogus.

However, I have always liked the idea of being able to specify the memory type on a per instruction basis. E.g. in x86, having a new prefix applicable to memory instructions that says "The type of this memory access is ...REMOTE-ordinary-memory...". Probably with combining rules for the page table and MTRR memory types. If you come from another instruction set, it might look like Sun's alternate address spaces. In either case, possibly with the new memory type as a literal field in the instruction, or possibly taken from a small set of registers.

If you allow normal memory instructions to access remote memory, and then just use a memory type, then you could use the same libraries for both local and remote: e.g. the same linked list routine could work in both - assuming it made no assumptions about memory ordering that would hold in local but not in remote memory. (A software analogy is sketched after this post.)

Is this worth doing?

I think that it is always a good idea to have the DMA-style or prefetch-style interfaces. Particularly on a RISC ISA that has no block instructions like REP MOVS. Also if one wants to add extra operations for remote access that do not already exist for local memory.

But the a) DMA-style and b) prefetch-style interfaces are probably slower, for small accesses, on many common implementations. We can more aggressively optimize c) CPU-remote-aware. Conversely, if you don't need it, you can always implement CPU-remote-aware in terms of the other two.
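As a software-level analogy for "the same linked list routine could work in both", here is a hedged sketch in which the memory type of an access is selected by a load callback rather than by an instruction prefix or page-table bits. The names and the memcpy stand-in for a remote transfer are assumptions for illustration only.

/* Sketch only: one traversal routine, parameterized by how a node is
 * loaded, standing in for "same code, different memory type". */
#include <stdio.h>
#include <string.h>

struct node { int value; struct node *next; };

/* The "memory type" of an access, expressed as a load function. */
typedef struct node (*load_fn)(const struct node *p);

/* Plain local load: just dereference. */
static struct node load_local(const struct node *p) { return *p; }

/* "Remote" load: stands in for a UC / SHMEM-style fetch of the node.
 * Here it only copies, but it is the one place a real remote transfer
 * (and any required fences) would live. */
static struct node load_remote(const struct node *p)
{
    struct node n;
    memcpy(&n, p, sizeof n);   /* placeholder for the remote transfer */
    return n;
}

/* Works with either memory type, as long as it assumes nothing about
 * ordering that only local memory provides. */
static int sum_list(const struct node *head, load_fn load)
{
    int sum = 0;
    while (head) {
        struct node n = load(head);
        sum += n.value;
        head = n.next;
    }
    return sum;
}

int main(void)
{
    struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
    printf("local:  %d\n", sum_list(&a, load_local));
    printf("remote: %d\n", sum_list(&a, load_remote));
    return 0;
}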
From: "Andy "Krazy" Glew" on 15 Dec 2009 00:11

nmm1(a)cam.ac.uk wrote:
>> Since integration is inevitable as well as obvious, inevitably we
>> will have more than one cache coherent domains on chip, which are PGAS
>> or MPI non-cache coherent between the domains.
>
> Extremely likely - nay, almost certain. Whether those domains will
> share an address space or not, it's hard to say. My suspicion is
> that they will, but there will be a SHMEM-like interface to them
> from their non-owning cores.

Actually, it's not an either/or choice. There aren't just two points on the spectrum. We have already mentioned three, including the MPI space. I like thinking about a few more:

1) SMP: shared memory, cache coherent, a relatively strong memory ordering model like SC or TSO or PC. Typically writeback cache.

0) MPI: no shared memory, message passing.

0.5) PGAS: shared memory, non-cache coherent. Typically UC, with DMA as described in other posts.

0.9) SMP-WC: shared memory, cache coherent, a relatively weak memory ordering model like RC or WC. Typically writeback cache.

0.8) ... with WT, writethrough, caches. Actually, it becomes a partial order: there's WT-PC, and WT-WC.

0.7) SMP-WB-SWCO: non-cache-coherent, WB (or WT), with software managed cache coherency via operations such as cache flushes.

I am particularly intrigued by the possibility of

0.6) SMP-WB-bitmask: non-cache-coherent, however "eventually coherent". Track which bytes have been written by a bitmask per cache line. When evicting a cache line, evict with the bitmask, and write back only the written bytes. (Or words, if you prefer.)

What I like about this is that it avoids one of the hardest aspects of non-cache-coherent systems: (a) the fact that writes can disappear - not just be observed in a different order, but actually disappear, and the old data reappear - (b) tied to cache line granularity. Tracking bitmasks in this way means that you will never lose writes. You may not know what order they get done in. There may be no global order. But you will never lose writes.

While we are at it:

1.1) SMP with update cache protocols.

===

Sorting these according to "strength" - although, as I say above, there are really some divergences, so it is a partial order or lattice:

1.1) SMP with update cache protocols.

**** 1) SMP: shared memory, cache coherent, a relatively strong memory ordering model like SC or TSO or PC. Typically writeback cache.

0.9) SMP-WB-weak: shared memory, cache coherent, a relatively weak memory ordering model like RC or WC. Typically writeback cache.

0.8) ... with WT, writethrough, caches.

0.7) SMP-WB-SWCO: non-cache-coherent, WB (or WT), with software managed cache coherency via operations such as cache flushes.

0.65) ... with WT

****??????? 0.6) SMP-WB-bitmask: non-cache-coherent, however "eventually coherent". Track which bytes have been written by a bitmask per cache line. When evicting a cache line, evict with the bitmask, and write back only the written bytes. (Or words, if you prefer.)

0.55) ... with WT

**** 0.5) PGAS: shared memory, non-cache coherent. Typically UC, with DMA as described in other posts.

**** 0) MPI: no shared memory, message passing.

I've marked the models that I think are likely to be most important.

I think SMP-WB-bitmask is more likely to be important than the weak models 0.7 and 0.9, in part because I am in love with new ideas, but also because I think it scales better.

It provides the performance of conventional PGAS, but supports cache locality when it is present.
And poses none of the semantic challenges of software managed cache coherency, although it has all of the same performance issues.

Of course, it needs roughly 64 bits per cache line. Which may be enough to kill it in its tracks.
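A minimal sketch of the bookkeeping the SMP-WB-bitmask idea implies, assuming a 64-byte line with one dirty bit per byte; the structure and the merge-on-eviction function are illustrative assumptions, not a real coherence protocol.

/* Sketch only: per-byte dirty bitmask, write back only written bytes. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64

struct cached_line {
    uint8_t  data[LINE_BYTES];
    uint64_t dirty;             /* bit i set => byte i was written here */
};

static void line_write(struct cached_line *l, unsigned off, uint8_t v)
{
    l->data[off] = v;
    l->dirty |= (uint64_t)1 << off;
}

/* Eviction: merge only the written bytes into memory.  Two cores that
 * wrote disjoint bytes of the same line can evict in either order and
 * neither one's writes are lost; only the order may be unknown. */
static void line_evict(const struct cached_line *l, uint8_t *mem)
{
    for (unsigned i = 0; i < LINE_BYTES; i++)
        if (l->dirty & ((uint64_t)1 << i))
            mem[i] = l->data[i];
}

int main(void)
{
    uint8_t mem[LINE_BYTES] = {0};
    struct cached_line core0 = {{0}, 0}, core1 = {{0}, 0};

    line_write(&core0, 0, 0xAA);    /* core 0 writes byte 0  */
    line_write(&core1, 63, 0xBB);   /* core 1 writes byte 63 */

    line_evict(&core1, mem);        /* eviction order does not matter: */
    line_evict(&core0, mem);        /* neither write disappears        */

    printf("mem[0]=%02x mem[63]=%02x\n", mem[0], mem[63]);
    return 0;
}

The uint64_t dirty field is exactly the "roughly 64 bits per cache line" cost noted above.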
From: Terje Mathisen on 15 Dec 2009 01:48

Andy "Krazy" Glew wrote:
[interesting spectrum of distributed memory models snipped]
> I think SMP-WB-bitmask is more likely to be important than the weak
> models 0.7 and 0.9, in part because I am in love with new ideas but
> also because I think it scales better.
>
> It provides the performance of conventional PGAS, but supports cache
> locality when it is present. And poses none of the semantic challenges
> of software managed cache coherency, although it has all of the same
> performance issues.
>
> Of course, it needs roughly 64 bits per cache line. Which may be
> enough to kill it in its tracks.

Isn't this _exactly_ the same as the current setup on some chips that use 128-byte cache lines, split into two sectors of 64 bytes each? I.e. an effective cache line size that is smaller than the "real" line size, taken to its logical end point.

I would suggest that (as you note) register-size words are the smallest items you might need to care about and track, so 8 bits for a 64-bit platform with 64-byte cache lines, but most likely you'll have to support semi-atomic 32-bit operations, so 16 bits, which is a 3% overhead.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
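A quick arithmetic check of the tracking overheads discussed above, for a 64-byte (512-bit) line: one tracking bit per byte, per 32-bit word, and per 64-bit word. The 16-bit case works out to 3.125%, matching the roughly 3% figure.

/* Sketch only: overhead of the per-line write mask at three granularities. */
#include <stdio.h>

int main(void)
{
    const double line_bits = 64 * 8;   /* 64-byte line = 512 bits */
    const struct { const char *gran; int bits; } g[] = {
        { "per byte   ", 64 },
        { "per 32 bits", 16 },
        { "per 64 bits",  8 },
    };
    for (int i = 0; i < 3; i++)
        printf("%s: %2d tracking bits = %.3f%% of the line\n",
               g[i].gran, g[i].bits, 100.0 * g[i].bits / line_bits);
    return 0;
}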
From: nmm1 on 15 Dec 2009 04:18

In article <4B270CA7.9060508(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
> I'm using PGAS as my abbreviation for "shared memory, shared address
> space, but not cache coherent, and not memory ordered". I realize,
> though, that some people consider Cray SHMEM different from PGAS. Can
> you suggest a more generic term?

No, but that isn't what PGAS normally means. However, no matter.

> Let's see, if I have it right,
>
> In strict PGAS (Private/Global Address Space) there are only two forms
> of memory access:
>  1. local private memory, inaccessible to other processors
>  2. global shared memory, accessible by all other processors,
> although implicitly accessible everywhere the same. Not local to anyone.

I wasn't aware of that meaning. Its most common meaning at present is Partitioned Global Address Space, with each processor owning some memory but others being able to access it, possibly by the use of special syntax. Very like some forms of SHMEM.

> Whereas SHMEM allows more types of memory accesses, including
>  a. local memory, that may be shared with other processors
>  b. remote accesses to memory that is local to other processors
> as well as remote access to memory that isn't local to anyone.
> And potentially other memory types.

Yes, and each use of SHMEM is different.

Regards,
Nick Maclaren.
From: Mayan Moudgill on 15 Dec 2009 05:07
Andy "Krazy" Glew wrote:
> In strict PGAS (Private/Global Address Space) there are only two forms
> of memory access:
>  1. local private memory, inaccessible to other processors
>  2. global shared memory, accessible by all other processors,
> although implicitly accessible everywhere the same. Not local to anyone.
>
> Whereas SHMEM allows more types of memory accesses, including
>  a. local memory, that may be shared with other processors
>  b. remote accesses to memory that is local to other processors
> as well as remote access to memory that isn't local to anyone.
> And potentially other memory types.

I can't see that there is any benefit to having strictly private memory (PGAS 1. above), at least on a high-performance MP system.

The CPUs are going to access memory via a cache. I doubt that there will be two separate kinds of caches, one for private memory and one for the rest of memory. So, as far as the CPUs are concerned, there is no distinction.

Since the CPUs are still going to have to talk to a shared memory (PGAS 2. above), there will still be a path/controller between the bottom of the cache hierarchy and the shared memory. This "controller" will have to implement whatever snooping/cache-coherence/transfer protocol is needed by the global memory.

The difference between shared local memory (SHMEM a) and strictly private local memory (PGAS 1) is whether the local memory sits below the memory controller or bypasses it. It's not obvious (to me at least) whether there are any benefits to be had by bypassing it. Can anyone come up with something?