From: "Andy "Krazy" Glew" on 30 Dec 2009 20:27 nmm1(a)cam.ac.uk wrote: > In article <4B399C3B.2050207(a)patten-glew.net>, > Andy \"Krazy\" Glew <ag-news(a)patten-glew.net> wrote: >>> That doesn't help significantly. For the real benefit, you need to >>> be able to ensure that objects that are accessed in separate threads >>> can be separated by the implementation to any degree that it needs. >>> That's a problem. >> How can you do this with linked datastructures? >> >> E.g. the flavor of sparse array that has cells >> (row,col,data,ptr_to_next_in_row,ptr_to_next_in_col) >> with linked lists on both dimensions. >> Sometimes parallelized by row, sometimes by columns; >> sometimes both at the same time. > > Just as for arrays. You copy in and copy back, when and as necessary. > Yes, it costs, which is why most Fortran compilers use copy-in/copy-out > for arrays only when essential. > > My point is that the language must be such as to ALLOW the compiler > to do that. Fortran, Python etc. by and large do. C, C++ and Java, > by and large don't. Copy in/out helps, but generalize: Some funky linked data structure where you can't tell a priori what subsections there are. Or, heck, an N-dimensional array, where you sometimes want row access, sometimes columns, sometimes any of the diagonals, sometimes sub-matrices - where, in general, you can't tell what subset of the data structure is being manipulated. And where there is not some hierarchy that is natural for locking or other mutex. Copy in/out helps. Particularly for dense arrays - because the parallel threads are not deallocating structure nodes, but are just changing values. You may get inconsistent values when you copy back in - values that are not consistent with their surrounding values - but the dense array isn't broken. Whereas with a subset of linked datastructure that does not have a single dominator, you can't necessarily copy back in a changed datastructure, with nodes that have been added or deleted. Herlihy's non-blocking lock free algorithms using compare-and-swap depend on there being a single pointer that dominates the data structure you are changing. Failing that, you need either transaction - K-way compare and swap, or full transactions. Or you need locks. If you are operating on something like an irregular but static mesh, okay, you may be able to copy in/out. But if the mesh is evolving as objects move. Anyway, yes, copy in/out helps. But it is not the final answer. The real problem in so much of this is that we have memory. Data structures that are operated on in place. And, no matter what advocates of data flow, logic programming, applicative or functional languages say, I don't think that we are going to do away with them.
From: "Andy "Krazy" Glew" on 30 Dec 2009 20:35 Terje Mathisen wrote: > nmm1(a)cam.ac.uk wrote: >> In article<MMOdnSSpXI_wwaTWnZ2dnUVZ_tGdnZ2d(a)bestweb.net>, >> Mayan Moudgill<mayan(a)bestweb.net> wrote: >>> Sorry; don't see how this suffers from any kind of degradation. Could >>> you elaborate? >> >> Because, if simultaneous updates of the same cache line are not >> coherent (in whatever sense required by the language), that code is >> incorrect. This has nothing to do with performance, and a great >> deal to do with programs giving occasional, non-repeatable wrong >> answers. > > So then you need the compiler to know that all accesses within a cache > line (or whatever the coherency size is) of both edges of the range are > special, and must be handled with totally separate instructions, right? > > Sounds like something it should be trivial to add to any given C > compiler! :-) > > Actually it seems to me that a compiler which supports OpenMP or similar > must do quite a bit of this when automatically partitioning a loop... void par_func( char* a, char* b ) { if( in_different_cache_lines(a,b) ) { par { operate_in_processor_1(func,a); operate_in_processor_2(func,b); } } else { // because a and b are in the same cache line // and because I am in a non-cache-coherent-system // cannot run in parallel // so just run locally. func(a); func(b); } } Sure, some compilers do stuff like this. Although they hate doing it, because it leads to combinatoric explosion. But that's PERFORMANCE tuning. Whereas the code above is necessary for CORRECTNESS. And what about when a and b are not char*, but are pointers to some possibly intersecting linked data structures?
From: "Andy "Krazy" Glew" on 30 Dec 2009 20:45 Mayan Moudgill wrote: > nmm1(a)cam.ac.uk wrote: >> In article <MMOdnSSpXI_wwaTWnZ2dnUVZ_tGdnZ2d(a)bestweb.net>, >> Mayan Moudgill <mayan(a)bestweb.net> wrote: >> >>>>> Sorry, don't understand. Can you give a more concrete example? What >>>>> do you mean by "byte-separation accesses"? >>>> >> >> Because, if simultaneous updates of the same cache line are not >> coherent (in whatever sense required by the language), that code is >> incorrect. > > I see. You are worried that, given code of the form: > > for p in processors > x[p] = foo(p,...) > > in the case of *non-coherent* memory, the writeback of x[M] may > overwrite the already written value of x[M-1] (or any other value of x > sharing a cache line with x[M]). > > This is exacerbated if x[] is a byte array, since it increases the > chances of collisions, and because one hardware workaround (write masks) > would require more bits (you need to keep track of dirty bytes as > opposed to dirty words). > > If I have understood your problem properly, why would making the entire > array write-through not be an adequate solution? Making the data structure write-through, with byte enables on the write throughs, would be adequate. Except: I'm trying to come up with a policy that allows read caching, potentially of stale data. That does not suffer all of the performance problems of write-through. That does not mandate the immediate invalidations of cache coherency protocols. Note, by the way, that there are at least two approaches to write through. (1) The write-through that I grew up with did not obtain ownership of the line being written to: you wrote, a transaction with byte enables went out, and that transaction invalidated all other copies as it went around. This is correct for certain memory ordering models, and for certain interconnection networks. E.g. bus; e.g. tree-hierarchical with PC, but not SC. (2) IBM and, lately, QPI, insist on doing the invalidation of all other copies of write through lines before the write is done. Call them invalidate-before-write-through versus invalidate-as-writing-through. These write through schemes have cache coherency overhead: you've got to be able to find the lines that you want to invalidate. Plus, they invalidate other copies of the line, even if the user does not depend on the bytes being written. I am trying to come up with a scheme that does not have the cache coherence protocol - software has to do explicit invalidations and flushes - and allows other copies to be kept around, not invalidated on the first write to any part.
From: "Andy "Krazy" Glew" on 30 Dec 2009 21:16 nmm1(a)cam.ac.uk wrote: > THE memory model? That's my point - there isn't one, there are at > least two - the one provided by the hardware and the one assumed by > the programming language. > > ... Currently, almost all shared-memory parallel > models being introduced by programming languages are MORE consistent > than what the hardware currently provides, rather than less (which > would allow freedom to change caching technology). In my experience somewhat the reverse. The languages have a weaker memory ordering model than the hardware. (Admittedly, at Intel and AMD I worked on the processors that were put into large systems - the system vendor did not need to implement the memory model we were implementing.) E.g. the Java standard's sequential atomics. The language allows arbitrary reordering, except at well identified places. Whereas the hardware enforces ordering on all memory operations. Actually, for historical reasons the old Intel official memory ordering model was PC, and the semantics of fences and synchronizing atomic RMWs were not well defined. But the actual hardware always implemented something stricter, except for bugs. With the publication in the last few years of the reformed Intel memory ordering model (hmm, I may have forgotten the plaque for that when I left Intel), we just published and made official the stronger memory ordering model that the hardware had been doing since time immemorial. I.e. at the start, say 1991 Programming languages specified a really weak memory ordering model, say MO-0 (almost no model at all). Hardware built a fairly strong memory ordering model, say MO-0.8. (TSO, almost.) But we published a slightly weaker memory ordering model, say MO-0.7. (PC) Over the years, programming languages got stricter and stricter. Say MO-0.5, Sarita Adve's sequential atomics. Somebody realized that there were incompatibilities with the published documents: it wasn't that MO-0.7 was strictly stronger, it was a partial order. MO-0.7 PC ?>? MO-0.5 SA But the hardware model actually implemented was strictly stronger, MO-0.8 TSO-like > MO-0.5 SA So the published model was updated. === But I agree with Nick's point about implementation freedom. Nick just has a different world of experience than I do. The supercomputer people keep saying "We don't want cache coherency." and "We don't want memory ordering." "We want to manage both in software, if you give us the primitives." Which is fine... except that caches really do work pretty well in many cases. Some people think that we will have domains that are cache coherent, in combination with domains that share memory, but are not cache coherent. If you think about it, the really hard aspects of cache consistency protocols all have to do with memory ordering: making sure that there is only one writer to a cache line at any time. But the programmers say they are okay on handling that themselves. Caching per se is easy enough to implement. There are some interesting issues in how to give software cache control primitives, but let's assume that you can do that. So, let's see: * I want caching * Weak ordering * I do not want to have to invalidate other copies of a cache line, etc., etc. This is easy to implement. Heck, I can change cache line size whenever I want. If I want to build 512 byte cache lines, I can. If I want to build vector strided cache lines, I can. (Yes, vector strided caches have been proposed, that store not contiguous cache lines but, say, 4 bytes every N. 
They are quite implementable in a non-MP sense But in MP, the coherency overhead gets ugly.) But I can't change cache line size... not because of cache coherency, but because software that was allocating stuff at 64B cache lines will break if I have a different cache line size, larger, or strided, or ... So that's why I am thinking about tyhe bitmasks. Whether byte or word. I'm wlling to suspend disbelief and believe software when they say they can manage memory ordering and cache consistency in software. That reduces a lot of overhead; ad if we make the cache line granularity invisible to software correctness wise, it opens up a whol;ew slew of possibilities for hardware. --- Unfortunately, I suspect that software really can't manage cache consistency or weak ordering anywhere near as well as it thinks. I suspect they may have hidden assumptions, like "I can manage weak ordering as long as the inconsistencies of the eventual consistency model are only short lived; but if the inconistencies can last arbitrarily long times, not so good..." (e.g. you mean "I *have* to flush? I can't just wait in a spin loop?) But it is fun to explore new possibilities.
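[An aside on the "ordering only at well-identified places" point: today's C++ atomics express exactly that language-level view, regardless of how strong the underlying hardware is. A small illustration (my example, not Glew's):

    #include <atomic>

    int payload = 0;                    // plain data, freely reorderable
    std::atomic<bool> ready{false};     // the one "well identified place"

    void producer() {
        payload = 42;                                   // ordinary store
        ready.store(true, std::memory_order_release);   // ordering imposed here only
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { }  // and here
        int v = payload;   // guaranteed visible after the acquire
        (void)v;
    }

On an x86-TSO machine the hardware orders far more than this program requires; the language model is the weaker of the two, which is Glew's point about languages versus hardware.]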
From: "Andy "Krazy" Glew" on 30 Dec 2009 22:03
Robert Myers wrote:
> This exchange, while impressive for its depth of experience and
> understanding, is a bit like a discussion of how to save the Ptolemaic
> universe.
>
> Keeping track of things by memory locations that can be written to and
> accessed by multiple processes and that can have multiple meanings
> based on context is a losing enterprise.
>
> No matter what you propose, you'll wind up with specifications that
> read like the health care bill now before Congress. And, as I think
> Nick is pointing out with current solutions, the bugs won't always be
> the result of programmer incompetence.
>
> You've both got so much invested in thinking about these things that
> you'll beat it to death forever, rather than admitting that the entire
> model of stored variables simply doesn't work.
>
> Other models have been proposed. Time to get on with it.

I agree with you that memory and stored variables are the fundamental problem. I am sympathetic to languages and programming models that seek to ameliorate this fundamental problem. I think that we must encourage programming models such as functional, applicative, etc.

But... there's just so much memory-oriented stuff.

Here's a challenge: rewrite all of Knuth's Art of Computer Programming, all of the volumes on datastructures, etc., in terms of your favorite non-memory programming model. Better yet, show an automatic transform. Show that no efficiency is lost via the transform. Even better, if you can completely automate this, then accept ordinary programs written for memory datastructures as input, and output your favorite non-memory paradigm.

---

It's easy to do this for arrays. It's harder to do it for linked data structures. Oftentimes, mechanisms such as futures use linked datastructures behind the scenes.

Maybe linked datastructures are the problem? Maybe LDSes are the datastructure equivalent of assembly code? Maybe only programming gods should use them to implement the highly parallel systems, while we just use the highly parallel stuff that the gods have produced?

---

I think that part of the problem is that it is just so much easier to write some forms of memory / stored-variable code. E.g. compare the following code snippets for finding the sum of an array:

    sum = 0;
    for(int i = 0; i < n; i++) {
        sum += a[i];
    }

versus

    sum = 0 +
      FORC(int i = 0; i < n; i++) DOC
        a[i]
      ENDC

(the latter being a trick I created in iHDL. Particularly fun with IFs.)

Oh, sure, you can use some sort of map_reduce function. Now, which of the umpteen flavors of map_reduce do you want to use? In C++, do you write for(T::iterator i = ds.begin(); i != ds.end(); i++)? Or do you know the STL's map/reduce templates by heart? And do you create functors on a whim? (I do... but I still find the for-iterator code easier to read and write.)

Sure, a good compiler will handle some of the above examples. But it probably won't handle fancier examples.

It's just a simple matter of retraining all of the programmers in the world, and rewriting most of the textbooks. We need to do much of this. But it won't happen overnight. And it probably won't happen completely. We'll probably have stored variables for a very long time, even if non-memory programming models increase from <<0.1% to a reasonable fraction of the code in the world.

---

By the way, one thing I think will help is developing good hybrid notations. E.g. a lambda with an annotation that says "in this scope, I am using sequential notation."

    sum = {seq:
        sum_so_far = 0;
        for(int i = 0; i < n; i++) {
            sum_so_far += a[i];
        }
        return sum_so_far;
    }

Allow single assignment to be called from non-single-assignment code, and vice versa. Arbitrarily layered.
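[In today's C++ the closest analogue to that {seq: ...} sketch is probably an immediately invoked lambda: ordinary sequential, stored-variable code inside, a single-assignment binding outside. A minimal sketch of the idea, mine rather than Glew's:

    #include <vector>
    #include <cstddef>

    int sum_of(const std::vector<int>& a) {
        // Outside: a single-assignment style binding.
        // Inside the lambda: plain sequential code over a local variable.
        const int sum = [&] {
            int sum_so_far = 0;
            for (std::size_t i = 0; i < a.size(); ++i) {
                sum_so_far += a[i];
            }
            return sum_so_far;
        }();
        return sum;
    }

The mutation stays scoped to the lambda body, so the surrounding code can be written in a single-assignment style - the "arbitrarily layered" hybrid the post is asking for.]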