From: "Andy "Krazy" Glew" on 15 Dec 2009 20:57

Mayan Moudgill wrote:
> I can't see that there is any benefit between having strictly private
> memory (PGAS 1. above), at least on a high-performance MP system.
>
> The CPUs are going to access memory via a cache. I doubt that there will
> be 2 separate kinds of caches, one for private and one for the rest of
> the memory. So, as far as the CPUs are concerned there is no distinction.
>
> Since the CPUs are still going to have to talk to a shared memory (PGAS
> 2. above), there will still be a path/controller between the bottom of
> the cache hierarchy and the shared memory. This "controller" will have
> to implement whatever snooping/cache-coherence/transfer protocol is
> needed by the global memory.
>
> The difference between shared local memory (SHMEM a) and strictly
> private local memory (PGAS 1) is whether the local memory sits below the
> memory controller or bypasses it. It's not obvious (to me at least)
> whether there are any benefits to be had by bypassing it. Can anyone
> come up with something?

Nick is right: the P in PGAS stands for partitioned, not private. For some
reason, I keep making this confusion. (Pictures such as slide 4 in
http://groups.google.com/group/scaling-to-petascale-workshop-2009/web/introduction-to-pgas-languages?pli=1
are, perhaps, one source of my confusion, since Snir definitely depicts
private/global, not partitioned.)

Mayan is right: the main motivation for having private memory is whether you
want to bypass any cache. Believe it or not, many HPC people do not want to
have any cache whatsoever. I agree with Mayan: we will definitely cache
local accesses rather than leave them uncached, and we probably don't want
to create special cases for remote memory. That being said, I will admit
that I have been thinking about special protocols for global memory, such as
described in the previous post.
I suppose that one of the reasons I have been thinking of private as opposed
to partitioned has been thinking about languages that have "private" and
"global" keywords. This is a smaller addition to the language than adding a
placement syntax. The question then is whether you can convert a pointer to
private T into a pointer to public T. UPC seems to disallow this.

Even if, in the hardware implementation, private and global memory locations
are cached in the same way, it may be desirable to distinguish them at the
language level: the compiler may be able to use more efficient
synchronization mechanisms for variables that are guaranteed to be local
private than it can use for global variables that might be local or might be
remote and might be shared with other processors.

Typically, on X86 the local variables may not require fencing because of the
X86's default strong memory ordering, whereas fences may be required for
global variables because the global interconnect may not provide the
snooping mechanisms that processors such as the P6 family use to enforce
strong memory ordering. Note that these fences may not be the standard
LFENCE, SFENCE, or MFENCE instructions, since those are typically not
externally visible. Instead they might have to be expensive UC memory
accesses, so that they are visible to the outside world. Of course it would
be wonderful to create new versions of the fence instructions that could be
visible to the external memory fabric. But if you go down that path you
might actually end up having to distinguish private and global memory.

- - -

(I am writing this in the Seattle to Portland van, bouncing on the rough
roads. It is quite remarkable how much slower the computer is when there is
this much vibration. I fear that my heads are crashing all the time. I
really need to save up the money to get myself a solid state disk.
Also, as I have noted before, speech recognition works better in this
high-vibration environment than keyboarding, with handwriting recognition
in between. This is the first time I've actually used speech recognition in
the van with somebody else present, except for Monday when I was next to a
person who was talking loudly on the cell phone. I hope that I'm not
disturbing the other passenger. I hope that she will tell me honestly if I
am, and not just be polite. I'm curious to find out if speech recognition
is socially acceptable in such relatively high noise environments as the
shuttle van or an airplane. I hope that it is less obnoxious than speaking
on a cell phone. Of course, the impoliteness of talking on a cell phone
does not stop many people doing it. I suspect that dictating text is better
than listening to a cell phone, because I dictate in full sentences; but
listening to me edit text is probably even more annoying than listening to
a cell phone. I am falling into an odd hybrid of using speech to dictate
and editing with the pen.)
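The private-versus-global fencing distinction Andy describes could be
sketched roughly as follows. This is a hypothetical illustration in C11, not
anything from a real compiler: `publish_private` relies on X86-style strong
ordering (a release store compiles to a plain store), while
`publish_global` stands in for the case where the compiler must emit
something externally visible; the seq_cst fence here is only a portable
proxy for the expensive UC access or hypothetical external fence discussed
above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch: handing a value off via a ready flag.
 * All names here are invented for illustration. */

static int private_data;            /* guaranteed-local variable */
static _Atomic bool private_ready;

void publish_private(int v) {
    private_data = v;
    /* On X86 (TSO), a release store needs no fence instruction at all:
     * the hardware already keeps stores in order for local observers. */
    atomic_store_explicit(&private_ready, true, memory_order_release);
}

static int global_data;             /* might be observed across the fabric */
static _Atomic bool global_ready;

void publish_global(int v) {
    global_data = v;
    /* Stand-in for an externally visible fence (e.g. a UC access);
     * a plain LFENCE/SFENCE/MFENCE would not be seen by the fabric. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&global_ready, true, memory_order_seq_cst);
}
```

The point of the sketch is the asymmetry: if the compiler cannot tell which
case it is in, it must conservatively emit the expensive form everywhere.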
From: Mayan Moudgill on 16 Dec 2009 07:04

nmm1(a)cam.ac.uk wrote:
>
> In particular, using a common cache with different coherence
> protocols for different parts of it has been done, but has never
> been very successful.

There is a distinction between choosing between two different coherence
protocols and choosing between a simpler coherent/not-coherent memory.

At the hardware level, this would be a choice between running MOESI (or
whatever MESI variant is being used) when running with coherence, and
immediately promoting a line from S/O to M for the purposes of writes (for
non-coherence); you'd use instruction control (cache flush, e.g.) or
write-through to guarantee its visibility to the outside world. Following
tradition, this would probably be controlled by bits in the page table.

So, it's demonstrably simple to *implement* coherence/non-coherence. If the
lack of success is because it is difficult to use in an MP context, that is
a different issue.

>>> The main advantage of truly private memory, rather than incoherent
>>> sharing across domains, is reliability. You can guarantee that it
>>> won't change because of a bug in the code being run on another
>>> processor.
>>
>> If I wanted to absolutely guarantee that, I would put the access control
>> in the memory controller. If I wanted to somewhat guarantee that, I
>> would use the VM access right bits.
>
> Doubtless you would. And that is another example of what I said
> earlier. That does not "absolutely guarantee" that - indeed, it
> doesn't even guarantee it, because it still leaves the possibility
> of a privileged process on another processor accessing the pseudo-
> local memory. And, yes, I have seen that cause trouble.

"Absolutely guarantee" would imply a control register in the memory
controller with a bit that, if set, ensures that the only write (or write
and read) requests the memory controller allows through are those from its
"owning" processor. That is why the absolute guarantee is part of the
controller.
As you correctly pointed out, a VM-based scheme fails in the presence of
bugs. Which is why I called it a "somewhat guarantee" exclusivity model.
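The two mechanisms Mayan contrasts — a page-table bit selecting coherent
versus non-coherent caching, and a memory-controller register that
hard-locks a region to one owner — could be sketched as below. The bit
positions, struct layouts, and names are all invented for illustration; no
real part is being described.

```c
#include <stdint.h>
#include <stdbool.h>

/* 1) Hypothetical page-table bit: does this page use the coherence
 *    protocol (MOESI etc.), or the promote-to-M non-coherent mode? */
#define PTE_COHERENT (1u << 9)   /* invented software-available PTE bit */

bool page_is_coherent(uint64_t pte) {
    return (pte & PTE_COHERENT) != 0;
}

/* 2) The "absolute guarantee": a memory-controller config register that,
 *    when the lock bit is set, drops any write request whose source id
 *    is not the configured owner.  Even a privileged process on another
 *    processor cannot get past this, unlike VM access-right bits. */
typedef struct {
    bool     owner_lock_enabled;
    uint16_t owner_id;
} mem_ctrl_cfg;

bool controller_allows_write(const mem_ctrl_cfg *cfg, uint16_t requester_id) {
    if (!cfg->owner_lock_enabled)
        return true;             /* region is ordinary shared memory */
    return requester_id == cfg->owner_id;
}
```

The difference in strength is visible in who enforces the check: the VM
scheme depends on every processor's page tables being correct, whereas the
controller check sits in the one place all requests must pass through.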
From: Bernd Paysan on 18 Dec 2009 16:51

Andy "Krazy" Glew wrote:
> 1) SMP: shared memory, cache coherent, a relatively strong memory
> ordering model like SC or TSO or PC. Typically writeback cache.
>
> 0) MPI: no shared memory, message passing

You can also have shared "write-only" memory. That's close to the MPI side
of the tradeoffs. Each CPU can read and write its own memory, but can only
write remote memories. The pro side is that all you need is an
infrastructure similar to MPI (send data packets around), and thus it
scales well; also, there are no blocking latencies.

The programming model can be closer to data flow than pure MPI, since when
you only pass data, writing the data to the target destination is
completely sufficient. A "this data is now valid" message might be
necessary (or some log in the memory controller from which each CPU can
extract what regions were written to).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
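Bernd's write-only model with a "regions written" log might look roughly
like this. This is a speculative sketch, not a real API: a remote node may
write into another node's window and append a (base, length) record to a
log, and the owning node polls that log to learn which regions have become
valid before reading them.

```c
#include <stdint.h>
#include <string.h>

#define LOG_SLOTS 16

typedef struct { uint32_t base, len; } region;

/* One node's memory window plus the validity log, as seen by both the
 * remote writer and the local reader.  Invented for illustration. */
typedef struct {
    uint8_t  mem[4096];          /* memory remote nodes may write, never read */
    region   log[LOG_SLOTS];     /* "these regions are now valid" records */
    uint32_t log_head;
} node_window;

/* Remote writer: push the data, then publish a validity record.
 * No read of remote memory ever happens, so no blocking latency. */
void remote_write(node_window *w, uint32_t base, const void *src, uint32_t len) {
    memcpy(w->mem + base, src, len);
    w->log[w->log_head % LOG_SLOTS] = (region){ base, len };
    w->log_head++;
}

/* Local reader: drain the log; returns 1 if a newly valid region was
 * reported, 0 if the cursor has caught up with the writers. */
int poll_valid(node_window *w, uint32_t *cursor, region *out) {
    if (*cursor == w->log_head)
        return 0;
    *out = w->log[*cursor % LOG_SLOTS];
    (*cursor)++;
    return 1;
}
```

A real implementation would of course need the log itself to be written
with the same one-way mechanism, and would have to handle log overflow;
both are elided here.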
From: "Andy "Krazy" Glew" on 16 Dec 2009 17:13

Andy "Krazy" Glew wrote:
> Mayan Moudgill wrote:
>> I can't see that there is any benefit between having strictly private
>> memory (PGAS 1. above), at least on a high-performance MP system.
>>
>> The CPUs are going to access memory via a cache. I doubt that there
>> will be 2 separate kinds of caches, one for private and one for the
>> rest of the memory. So, as far as the CPUs are concerned there is no
>> distinction.
>>
>> Since the CPUs are still going to have to talk to a shared memory
>> (PGAS 2. above), there will still be a path/controller between the
>> bottom of the cache hierarchy and the shared memory. This "controller"
>> will have to implement whatever snooping/cache-coherence/transfer
>> protocol is needed by the global memory.
>
> Even if in the implementation in hardware private and global memory
> locations are cached in the same way, it may be desirable to distinguish
> them at the language level: the compiler may be able to use more
> efficient synchronization mechanisms for variables that are guaranteed
> to be local private than it can use for global variables that might be
> local or might be remote and might be shared with other processors.

I mentioned the possibility of fencing being different for local/private
memory and for global memory.

I forgot to mention the possibility of software-controlled cache coherence.
If the compiler has to emit cache flush directives around accesses to
global memory that is cached, and if these directives are as slow as on
present X86, then the compiler definitely wants to know what is private and
what is not.

IMHO this is a good reason to use the DMA model. If flushing cache is slow,
then you may want to distinguish private memory that can be cached, e.g. in
your 2MB/core L3 cache, from remote cacheable memory, caching the latter in
a smaller, cheaper-to-flush structure.
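The software-controlled coherence cost Andy mentions can be made concrete
with a small sketch. `cache_flush_line()` here is an invented stand-in for
a real flush primitive such as X86 CLFLUSH; it just counts invocations so
the per-access overhead the compiler would be forced to emit is visible.

```c
#include <stdint.h>

/* Counter standing in for the cost of a real line flush/invalidate. */
static unsigned long flush_count;

void cache_flush_line(const void *addr) {
    (void)addr;
    flush_count++;   /* real hardware: write the line back and invalidate it */
}

/* Memory the compiler knows is private: cached normally, no flush traffic. */
int read_private(const int *p) {
    return *p;
}

/* Possibly-shared global memory under software coherence: the compiler
 * must invalidate before the read so a remote writer's update is seen. */
int read_global(const int *p) {
    cache_flush_line(p);
    return *p;
}
```

If the compiler cannot distinguish the two cases, every access pays the
`read_global` cost — which is exactly why it "definitely wants to know
what is private and what is not".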
From: nmm1 on 22 Dec 2009 15:56
In article <4B2FEF58.4040907(a)patten-glew.net>,
Andy "Krazy" Glew <ag-news(a)patten-glew.net> wrote:
>Bernd Paysan wrote:
>>
>> You can also have shared "write-only" memory. That's close to the MPI
>> side of the tradeoffs. Each CPU can read and write its own memory, but
>> can only write remote memories. The pro side is that all you need is an
>> infrastructure similar to MPI (send data packets around), and thus it
>> scales well; also, there are no blocking latencies.
>>
>> The programming model can be closer to data flow than pure MPI, since
>> when you only pass data, writing the data to the target destination is
>> completely sufficient. A "this data is now valid" message might be
>> necessary (or some log in the memory controller from which each CPU can
>> extract what regions were written to).
>
>At first I liked this, and then I realized what I liked was the idea of
>being able to create linked data structures, readable by anyone, but
>only manipulated by the local node - except for the minimal operations
>necessary to link new nodes into the data structure.

Yes, that's a model I have liked for some time. I should be very
interested to know why Bernd regards the other way round as better; I
can't see it, myself, but can't convince myself that it isn't.

>I don't think that ordinary read/write semantics are acceptable. I
>think that you need the ability to "atomically" (for some definition of
>atomic - all atomicity is relative) read a large block of data. Used by
>a node A to read a data node in node B's memory.

I agree, but the problem has been solved for file systems, where snapshots
are implemented in such a way as to appear to give such atomic read
semantics. Actually, what I like is the database/BSP semantics. Updates
are purely local, until the owner says "commit", when all other nodes will
see the new structure when they next say "accept". Before that, they see
the old structure.
Details of whether commit and accept should be directed or global are
topics for research .... I think that it could be done fairly easily at the
page level, using virtual memory primitives, but not below unless the
cache-line ones were extended.

Regards,
Nick Maclaren.
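The commit/accept semantics Nick describes can be sketched with simple
double-buffering: the owner edits a private working copy, "commit"
atomically publishes it, and a reader's "accept" samples the published
pointer, continuing to see the old version until then. This is a
hypothetical C11 illustration; the pointer swap stands in for the
page-level remapping via virtual memory primitives suggested above.

```c
#include <stdatomic.h>

typedef struct { int value; } structure;

static structure buffers[2];
static _Atomic(structure *) published = &buffers[0]; /* what readers see */
static structure *working = &buffers[1];             /* owner's private copy */

/* Updates are purely local: nothing a reader can observe changes. */
void owner_update(int v) {
    working->value = v;
}

/* "Commit": atomically publish the working copy; the previous published
 * buffer is recycled as the next working copy.  (A real implementation
 * would copy the committed state into it before further edits.) */
void owner_commit(void) {
    structure *old = atomic_exchange(&published, working);
    working = old;
}

/* "Accept": sample the current published version.  A reader that has not
 * yet re-accepted keeps using whatever pointer it sampled earlier. */
const structure *reader_accept(void) {
    return atomic_load(&published);
}
```

Whether commit and accept are directed (per reader) or global is, as the
post says, a research question; this sketch is the global form.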