From: nmm1 on 15 Dec 2009 05:23

In article <JfednUfEbp5qwrrWnZ2dnUVZ_hOdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>Andy "Krazy" Glew wrote:
>
>> In strict PGAS (Private/Global Address Space) there are only two forms
>> of memory access:
>> 1. local private memory, inaccessible to other processors
>> 2. global shared memory, accessible by all other processors,
>>    although implicitly accessible everywhere the same. Not local to anyone.
>>
>> Whereas SHMEM allows more types of memory accesses, including
>> a. local memory, that may be shared with other processors
>> b. remote accesses to memory that is local to other processors,
>>    as well as remote access to memory that isn't local to anyone.
>> And potentially other memory types.
>
>I can't see that there is any benefit to having strictly private
>memory (PGAS 1. above), at least on a high-performance MP system.
>
>The CPUs are going to access memory via a cache. I doubt that there
>will be 2 separate kinds of caches, one for private and one for the
>rest of the memory. So, as far as the CPUs are concerned, there is no
>distinction.
>
>Since the CPUs are still going to have to talk to a shared memory (PGAS
>2. above), there will still be a path/controller between the bottom of
>the cache hierarchy and the shared memory. This "controller" will have
>to implement whatever snooping/cache-coherence/transfer protocol is
>needed by the global memory.
>
>The difference between shared local memory (SHMEM a) and strictly
>private local memory (PGAS 1) is whether the local memory sits below
>the memory controller or bypasses it. It's not obvious (to me at
>least) whether there are any benefits to be had by bypassing it. Can
>anyone come up with something?

I don't think you realise how much cache coherence costs, once you
get beyond small core-counts. There are two main methods: snooping
is quadratic in the number of packets, and directories are quadratic
in the amount of logic (for constant-time accesses). As usual, there
are intermediates, e.g. directories that are (say) N*sqrt(N) in both
logic and number of packets.

The main advantage of truly private memory, rather than incoherent
sharing across domains, is reliability. You can guarantee that it
won't change because of a bug in the code being run on another
processor.

Regards,
Nick Maclaren.
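
As a rough illustration of the scaling Nick describes, here is a
back-of-the-envelope C sketch (not from the thread; every constant is
illustrative, nothing is measured): snooping broadcasts each of ~N
cores' requests to all N cores, a full directory keeps ~N bits of
sharer state per line across ~N nodes, and a sparse/hierarchical
scheme can land around N*sqrt(N) in both.

    /* Back-of-the-envelope comparison of coherence costs vs. core
     * count N. Illustrative scaling only:
     *   snoop traffic   ~ N requests x N recipients      = N^2
     *   directory logic ~ N nodes x N-bit sharer vectors = N^2
     *   hybrid          ~ N*sqrt(N) (e.g. sparse directory)        */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        for (int n = 4; n <= 1024; n *= 4) {
            double snoop  = (double)n * n;
            double dir    = (double)n * n;
            double hybrid = (double)n * sqrt((double)n);
            printf("N=%4d  snoop~%8.0f  dir-logic~%8.0f  hybrid~%6.0f\n",
                   n, snoop, dir, hybrid);
        }
        return 0;
    }
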
From: Mayan Moudgill on 15 Dec 2009 06:07

nmm1(a)cam.ac.uk wrote:
>
> I don't think you realise how much cache coherence costs, once you
> get beyond small core-counts.

That has nothing to do with truly private vs. shared-local memory:
that's in the cache-coherence protocol. One can (in theory) have the
cross product of {local,global} x {coherent,non-coherent}.

And you really need to stop assuming what other people do and don't
know about stuff...

>
> The main advantage of truly private memory, rather than incoherent
> sharing across domains, is reliability. You can guarantee that it
> won't change because of a bug in the code being run on another
> processor.
>

If I wanted to absolutely guarantee that, I would put the access
control in the memory controller. If I wanted to somewhat guarantee
that, I would use the VM access right bits.
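
A minimal user-level sketch of the weaker "VM access right bits"
option, using POSIX mmap()/mprotect(); the region size is an arbitrary
assumption. Note that this constrains only code running under this
mapping; as the next post points out, a privileged process or a DMA
master is not stopped.

    /* Sketch: guard "pseudo-private" memory with VM access rights.
     * Only code subject to this mapping is constrained; privileged
     * processes and DMA bypass it entirely.                        */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define REGION_SIZE (1 << 20)   /* 1 MiB, illustrative */

    int main(void) {
        unsigned char *region = mmap(NULL, REGION_SIZE,
                                     PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) { perror("mmap"); return 1; }

        memset(region, 0xAB, REGION_SIZE);   /* owner writes freely */

        /* Drop write permission before exposing the address: a stray
         * store now faults instead of silently corrupting the data. */
        if (mprotect(region, REGION_SIZE, PROT_READ) != 0) {
            perror("mprotect");
            return 1;
        }
        printf("region now read-only: first byte = 0x%02X\n", region[0]);
        return 0;
    }
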
From: nmm1 on 15 Dec 2009 07:08

In article <B7SdnYVDl8Wf87rWnZ2dnUVZ_sSdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>> I don't think you realise how much cache coherence costs, once you
>> get beyond small core-counts.
>
>That has nothing to do with truly private vs. shared-local memory:
>that's in the cache-coherence protocol. One can (in theory) have the
>cross product of {local,global} x {coherent,non-coherent}.

One can in theory do many things that have proved to be infeasible
in practice. It is true that I misunderstood what you were trying
to say, but I assert that your words (which I quote below) match my
understanding better than your intent does.

    I can't see that there is any benefit to having strictly private
    memory (PGAS 1. above), at least on a high-performance MP system.

    The CPUs are going to access memory via a cache. I doubt that
    there will be 2 separate kinds of caches, one for private and one
    for the rest of the memory. So, as far as the CPUs are concerned,
    there is no distinction.

    Since the CPUs are still going to have to talk to a shared memory
    (PGAS 2. above), there will still be a path/controller between
    the bottom of the cache hierarchy and the shared memory. This
    "controller" will have to implement whatever snooping/cache-
    coherence/transfer protocol is needed by the global memory.

>And you really need to stop assuming what other people do and don't
>know about stuff...

I suggest that you read what I post before responding like that.
I can judge what you know only from your postings, and this is not
the first time that you have posted assertions that fly in the face
of all HPC experience, without posting any explanation of why you
think that experience is mistaken, even after being queried.

In particular, using a common cache with different coherence
protocols for different parts of it has been done, but has never
been very successful. I have no idea why you think that the
previous experience of its unsatisfactoriness is misleading.

>> The main advantage of truly private memory, rather than incoherent
>> sharing across domains, is reliability. You can guarantee that it
>> won't change because of a bug in the code being run on another
>> processor.
>
>If I wanted to absolutely guarantee that, I would put the access
>control in the memory controller. If I wanted to somewhat guarantee
>that, I would use the VM access right bits.

Doubtless you would. And that is another example of what I said
earlier. That does not "absolutely guarantee" it - indeed, it
doesn't even guarantee it, because it still leaves the possibility
of a privileged process on another processor accessing the pseudo-
local memory. And, yes, I have seen that cause trouble. You might
claim that it is a bug, but you would be wrong if you did.

Consider the case when processor A performs some DMA-capable I/O on
its pseudo-local memory. You now have different consistency
semantics according to where the I/O process runs.

Regards,
Nick Maclaren.
From: nmm1 on 15 Dec 2009 07:42

In article <4B271A90.8060302(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>
>>> Since integration is inevitable as well as obvious, inevitably we
>>> will have more than one cache coherent domain on chip, which are
>>> PGAS or MPI non-cache coherent between the domains.
>>
>> Extremely likely - nay, almost certain. Whether those domains will
>> share an address space or not, it's hard to say. My suspicion is
>> that they will, but there will be a SHMEM-like interface to them
>> from their non-owning cores.
>
>Actually, it's not an either/or choice. There aren't just two points
>on the spectrum. We have already mentioned three, including the MPI
>space. I like thinking about a few more:

Gug. I need to print those out and study them! Yes, I agree that
it's not an either/or choice, but I hadn't thought out that many
possibilities.

>I am particularly intrigued by the possibility of
>
>0.6) SMP-WB-bitmask: non-cache-coherent. However, "eventually
>coherent". Track which bytes have been written by a bitmask per
>cache line. When evicting a cache line, evict with the bitmask, and
>write back only the written bytes. (Or words, if you prefer.)
>
>What I like about this is that it avoids one of the hardest aspects
>of non-cache-coherent systems: (a) the fact that writes can disappear
>- not just be observed in a different order, but actually disappear
>and the old data reappear, (b) tied to cache line granularity.
>
>Tracking bitmasks in this way means that you will never lose writes.
>
>You may not know what order they get done in. There may be no global
>order.
>
>But you will never lose writes.
>
>I think SMP-WB-bitmask is more likely to be important than the weak
>models 0.7 and 0.9, in part because I am in love with new ideas, but
>also because I think it scales better.

It also matches language specifications much better than most of the
others, which is not a minor advantage. That could well be the
factor that gets it accepted, if it is.

Regards,
Nick Maclaren.
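
A C sketch of one plausible reading of the SMP-WB-bitmask mechanism:
each line carries a per-byte written-mask, and writeback merges only
the written bytes into memory, so evictions from different cores never
clobber each other's stores to disjoint bytes of the same line. All
names and sizes below are illustrative, not from any real design.

    /* Per-byte written-mask per cache line; merge-on-writeback. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64

    struct cache_line {
        uint8_t  data[LINE_BYTES];
        uint64_t written;   /* bit i set => byte i was stored locally */
    };

    /* Local store: update the data, record exactly which bytes changed. */
    static void line_store(struct cache_line *l, unsigned off,
                           const uint8_t *src, unsigned len) {
        memcpy(&l->data[off], src, len);
        for (unsigned i = 0; i < len; i++)
            l->written |= (uint64_t)1 << (off + i);
    }

    /* Eviction: merge only written bytes into memory. Unwritten bytes
     * keep other processors' values - order is unconstrained, but no
     * write is ever silently lost.                                    */
    static void line_writeback(const struct cache_line *l, uint8_t *mem) {
        for (unsigned i = 0; i < LINE_BYTES; i++)
            if (l->written & ((uint64_t)1 << i))
                mem[i] = l->data[i];
    }

    int main(void) {
        uint8_t mem[LINE_BYTES] = {0};
        struct cache_line a = {{0}, 0}, b = {{0}, 0};
        uint8_t x = 1, y = 2;
        line_store(&a, 0,  &x, 1);   /* core A writes byte 0  */
        line_store(&b, 63, &y, 1);   /* core B writes byte 63 of same line */
        line_writeback(&a, mem);
        line_writeback(&b, mem);     /* neither eviction clobbers the other */
        printf("mem[0]=%u mem[63]=%u\n", mem[0], mem[63]);
        return 0;
    }
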
From: Del Cecchi on 15 Dec 2009 13:37
"Robert Myers" <rbmyersusa(a)gmail.com> wrote in message news:eb1c4904-abd3-4646-8e61-dd833f806776(a)b15g2000yqd.googlegroups.com... On Dec 14, 4:03 pm, j...(a)cix.compulink.co.uk wrote: > > I wasn't explaining enough. A single memory controller does not seem > to be enough for today's big OOO x86 cores. A Core 2 Duo has two > memory > controllers; a Core i7 has three. This is inevitably pushing up pin > count. If you add a bunch more small cores, you're going to need > even > more memory bandwidth, and thus presumably more memory controllers. > This > is do doubt achievable, but the price may be a problem. Bandwidth. Bandwidth. Bandwidth. It must be in scripture somewhere. It is, but no one reads the Gospel according to Seymour any more. Is an optical fat link out of the question? I know that optical on- chip will take a miracle and maybe a Nobel prize, but just one fat link. Is that too much to ask? Robert. ---------------------- Yes it is at the moment. On the other hand you can do 10Gb/sec/differential pair on copper if you don't want to go too far. So you don't really need optics. But all the fancy dancy interface stuff adds latency, if that's ok. del. |