From: Mike on 28 Dec 2009 13:31

<nmm1(a)cam.ac.uk> wrote in message news:hhapct$4vr$1(a)smaug.linux.pwf.cam.ac.uk...
| In article <7qSdna5HxYemf6XWnZ2dnUVZ_tudnZ2d(a)bestweb.net>,
| Mayan Moudgill <mayan(a)bestweb.net> wrote:
| > >
| > > Consider a program that has a large array of one-byte flags, and needs
| > > to update that array in parallel. No single flag will be accessed by
| > > two different threads without synchronisation, but the adjacency
| > > properties are such that it cannot practically be separated into one
| > > array for each thread. You can get the same issue by splitting up a
| > > very large string (think DNA or whatever).
| >
| >This kind of computation will have performance issues on single threaded
| >code as well - for large enough array sizes, you'll get horrible cache
| >hit rates. Potentially *EVERY* update will cause a cache miss. In that
| >case, going multi-core *COULD* give you super-linear _speed-up_ due to
| >increased aggregate cache-size.
|
| No, no, not at all. SOME such code has the properties you describe;
| other code has quite good caching properties, but relies on a few
| byte-separation accesses. It's the latter that are the killer. Not
| merely are they currently legal (almost recommended) and efficient,
| but tracking down a small number of 'aliasing' problems is often a
| nightmare. MTBF times of hours, with no two successive failures
| showing up in the same way :-(
|
| If the failure were trapped by hardware, there wouldn't be a problem,
| but all it does is to give wrong answers, non-repeatably.

Here is an idea. What if there were some mechanism to be sure that all
threads that update contiguous bytes that form a single cache line must
run on a single CPU? However, when a large array or structure spans
multiple cache lines, additional threads and CPUs are allocated.

Mike
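To make the scenario concrete, here is a minimal sketch (not from the
thread; the array size and split point are invented): two threads, each
updating only its own one-byte flags with no synchronisation and no shared
flag, but with a split that is not a cache-line boundary, so bytes owned by
different threads share a line. With byte-granular coherence that is "only"
false sharing; on an implementation that read-modify-writes a wider word it
can clobber the neighbour's byte, which seems to be the non-repeatable
wrong-answer failure Nick is describing.

    /* Minimal sketch of the scenario (sizes and the split are invented).
     * Each thread owns a disjoint range of one-byte flags and updates it
     * with no locking; the split at byte 1000 is not a cache-line
     * boundary, so flags[999] and flags[1000] - owned by different
     * threads - share a cache line. */
    #include <pthread.h>

    enum { NFLAGS = 2000, SPLIT = 1000 };

    static unsigned char flags[NFLAGS];

    static void *lower(void *unused)
    {
        (void)unused;
        for (int i = 0; i < SPLIT; i++)
            flags[i] = 1;               /* this thread's own bytes only */
        return NULL;
    }

    static void *upper(void *unused)
    {
        (void)unused;
        for (int i = SPLIT; i < NFLAGS; i++)
            flags[i] = 1;               /* this thread's own bytes only */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, lower, NULL);
        pthread_create(&b, NULL, upper, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }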
From: nmm1 on 28 Dec 2009 13:50

In article <fLOdnQLzHcrvZKXWnZ2dnUVZ_qOdnZ2d(a)earthlink.com>,
Mike <mike(a)mike.net> wrote:
>
>What if there were some mechanism to be sure that all threads that
>update contiguous bytes that form a single cache line must run on a
>single CPU? However, when a large array or structure spans multiple
>cache lines, additional threads and CPUs are allocated.

It doesn't help. This is a language problem, and not a hardware one.
A compiler could not make use of any such mechanism unless the language
provides the information for it to do so.


Regards,
Nick Maclaren.
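For concreteness, here is roughly what Mike's suggestion looks like when
done by hand on Linux (a sketch only: pthread_setaffinity_np is a GNU
extension, and the 64-byte line size, thread count and boundaries are
assumptions). Adjacent threads whose boundary falls inside a cache line are
merged into one group, and each group is pinned to one CPU, so no flag line
is written from two different cores. Nick's objection still applies: the
language gives a compiler or runtime no way to know which threads share a
line, so the grouping has to be coded explicitly, and with block
partitioning every unaligned boundary merges two neighbours, which shows
how quickly the scheme collapses toward a single CPU.

    /* Hand-coded sketch of Mike's proposal (Linux-specific; line size,
     * thread count and boundaries are illustrative assumptions).  Adjacent
     * threads whose boundary falls inside a cache line are put in the same
     * group, and each group is pinned to one CPU, so no flag cache line is
     * written from two different cores. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    #define CACHE_LINE 64
    #define NFLAGS     4096
    #define NTHREADS   4

    static unsigned char flags[NFLAGS];
    static size_t cut[NTHREADS + 1] = { 0, 1000, 2048, 3000, NFLAGS };
    static int group[NTHREADS];         /* CPU group for each thread */

    static void *worker(void *arg)
    {
        int t = (int)(long)arg;
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        if (ncpus < 1)
            ncpus = 1;

        /* Pin this thread: all sharers of a boundary line are in the same
         * group, hence end up on the same CPU. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)(group[t] % ncpus), &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (size_t i = cut[t]; i < cut[t + 1]; i++)
            flags[i] ^= 1;              /* each byte owned by one thread */
        return NULL;
    }

    int main(void)
    {
        /* Merge neighbours whose boundary is not cache-line aligned. */
        group[0] = 0;
        for (int t = 1; t < NTHREADS; t++)
            group[t] = (cut[t] % CACHE_LINE != 0) ? group[t - 1] : t;

        pthread_t tid[NTHREADS];
        for (int t = 0; t < NTHREADS; t++) {
            printf("thread %d -> group/cpu %d\n", t, group[t]);
            pthread_create(&tid[t], NULL, worker, (void *)(long)t);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }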
From: Mayan Moudgill on 28 Dec 2009 14:00

nmm1(a)cam.ac.uk wrote:

> In article <7qSdna5HxYemf6XWnZ2dnUVZ_tudnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>>Consider a program that has a large array of one-byte flags, and needs
>>>to update that array in parallel. No single flag will be accessed by
>>>two different threads without synchronisation, but the adjacency
>>>properties are such that it cannot practically be separated into one
>>>array for each thread. You can get the same issue by splitting up a
>>>very large string (think DNA or whatever).
>>
>>This kind of computation will have performance issues on single threaded
>>code as well - for large enough array sizes, you'll get horrible cache
>>hit rates. Potentially *EVERY* update will cause a cache miss. In that
>>case, going multi-core *COULD* give you super-linear _speed-up_ due to
>>increased aggregate cache-size.
>
> No, no, not at all. SOME such code has the properties you describe;
> other code has quite good caching properties, but relies on a few
> byte-separation accesses.

Sorry, don't understand. Can you give a more concrete example? What do
you mean by "byte-separation accesses"?

I was assuming that you were trying to parallelize:

   for( i = 0; i < N; i++ ) {
      B[f(i)] = g(i, ...)
   }

where B is the byte array, f(i) is some arbitrary scatter vector or
computation, and g() is some function.

If g() is dependent on some values of B[], then I would have thought
that, for irregular f(), parallelization would not pay off; g(M, ...)
would be dependent on all i, 0 <= i < M.

For g() not dependent on B[], or not dependent on the loop-computed
values of B[], there is, of course, no need for synchronization inside
the loop; a barrier after (and possibly before) the loop suffices. The
major performance inhibitor is the ping-ponging of cache lines. But, as
I said before, this will probably be correlated with the behavior on the
cache of the non-parallelized code; if there aren't too many misses in
the serial code, then block parallelization of the form:

   forall p
      for( i = p*N/P; i < (p+1)*N/P; i++ )
         B[f(i)] = g(i, ....)

SHOULD result in equivalent behavior on the parallel code.

The last case of interest is g() dependent on B[], but the dependencies
& f() are sufficiently regular that we can figure out a parallel
iteration mapping with a small enough number of synchronization points
to make it practical to actually do the loop in parallel. If I read your
statement correctly, your claim is that in that case, we will end up
with line ping-ponging because of byte flag sharing.

I am somewhat skeptical of this; given that there is sufficient
regularity in the various index expressions to admit profitable
parallelization, I suspect that there will be enough regularity to
permit restructuring (some variant of tiling, striding, skewing etc.) of
the iteration order to decrease the ping-pong behavior. A concrete
example would be nice to have here.
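For what it's worth, here is one way the "forall p" block parallelization
above might be written out, sketched with OpenMP in C. The array size and
the bodies of f() and g() are invented stand-ins (f() is just an arbitrary
byte permutation); the point is only the shape of the decomposition. This
is the case where g() does not read B[], so no synchronisation is needed
inside the loop, and the implicit barrier at the end of the parallel region
plays the role of the barrier after the loop.

    /* Sketch of the block parallelization above, with invented stand-ins
     * for f() and g().  Each thread p handles the contiguous iteration
     * block [p*N/P, (p+1)*N/P), exactly as in the pseudocode. */
    #include <omp.h>
    #include <stddef.h>

    #define N ((size_t)1 << 24)

    static unsigned char B[N];

    /* Arbitrary scatter (an invertible hash, so no two iterations write
     * the same flag) and an arbitrary update function. */
    static size_t f(size_t i) { return (i * 2654435761u) % N; }
    static unsigned char g(size_t i) { return (unsigned char)(i & 0xff); }

    static void update_flags(void)
    {
        #pragma omp parallel
        {
            size_t P = (size_t)omp_get_num_threads();
            size_t p = (size_t)omp_get_thread_num();

            for (size_t i = p * N / P; i < (p + 1) * N / P; i++)
                B[f(i)] = g(i);
        }                               /* implicit barrier here */
    }

    int main(void)
    {
        update_flags();
        return 0;
    }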
From: nmm1 on 28 Dec 2009 14:28

In article <ubGdnTyFcsjdnaTWnZ2dnUVZ_sGdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>Sorry, don't understand. Can you give a more concrete example? What do
>you mean by "byte-separation accesses"?

The example you give is one such, but consider this:

    unsigned char flags[1000000000];
    int offsets[num_threads];
    /* This passes different sections of the flags to different threads,
       for updating. */
    fred(&flags[offsets[thread]],offsets[thread+1]-offsets[thread]);

in the case where the offsets are determined by the algorithm, and
cannot conveniently be made multiples of the cache line size. Even
when they can be, it is at best disgusting to have to build in such
a system-specific value just to ensure that a program is correct.
That is the absolute antithesis of portability!

>I am somewhat skeptical of this; given that there is sufficient
>regularity in the various index expressions to admit profitable
>parallelization, I suspect that there will be enough regularity to
>permit restructuring (some variant of tiling, striding, skewing etc.) of
>the iteration order to decrease the ping-pong behavior.

Another, and VERY important, example is adaptive grids. You start off a
grid as nicely and cleanly separated, but points near the boundaries
drift into other sections over time. As restructuring is far more
expensive than a normal iteration, you do it only when the number of
drifted points reaches a threshold (say 1-5%). The actual number of
cross-section accesses is small enough that it doesn't impact the cache
hit ratio significantly.

In case you aren't into that area, think finite element analysis of
moving structures - say a rotating jet engine, or the way that a tsunami
breaks over a protective structure (when the grid is of the water, not
the ground).

Regards,
Nick Maclaren.
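To fill out Nick's fragment: below is a compile-clean version of the same
shape (thread creation boilerplate omitted; the sections are run serially
here, one per thread in the real thing), together with the system-specific
workaround he calls disgusting, i.e. rounding each algorithm-chosen
boundary up to a cache-line multiple so that adjacent threads never write
the same line. Only the flags/offsets/fred() outline comes from the post;
fred()'s body, the 64-byte line size, the thread count and the example
boundaries are assumptions.

    /* Nick's fragment filled out so that it compiles, plus the
     * system-specific workaround: rounding every boundary up to a multiple
     * of the cache line size.  fred()'s body, the line size and the thread
     * count are assumptions; only the overall shape comes from the post. */
    #include <stddef.h>

    #define CACHE_LINE   64
    #define NUM_THREADS  8

    static unsigned char flags[1000000000];
    static size_t offsets[NUM_THREADS + 1];  /* filled in by the algorithm */

    /* Each worker updates its own section of the flags. */
    static void fred(unsigned char *section, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            section[i] = 1;
    }

    /* The workaround Nick objects to: program correctness now depends on a
     * system-specific constant.  (Assumes each section is much larger than
     * a cache line, so rounding cannot cross the next boundary.) */
    static void align_offsets(void)
    {
        for (int t = 1; t < NUM_THREADS; t++)
            offsets[t] = (offsets[t] + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    }

    /* Body run by worker 'thread'; thread creation is omitted here. */
    static void run_section(int thread)
    {
        fred(&flags[offsets[thread]], offsets[thread + 1] - offsets[thread]);
    }

    int main(void)
    {
        /* Example boundaries "from the algorithm" - not line-aligned. */
        offsets[0] = 0;
        offsets[NUM_THREADS] = sizeof flags;
        for (int t = 1; t < NUM_THREADS; t++)
            offsets[t] = (size_t)t * (sizeof flags / NUM_THREADS) + (size_t)(t % 3);

        align_offsets();
        for (int t = 0; t < NUM_THREADS; t++)   /* serial stand-in for the workers */
            run_section(t);
        return 0;
    }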
From: Mayan Moudgill on 28 Dec 2009 20:33
nmm1(a)cam.ac.uk wrote:

> In article <ubGdnTyFcsjdnaTWnZ2dnUVZ_sGdnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>>Sorry, don't understand. Can you give a more concrete example? What do
>>you mean by "byte-separation accesses"?
>
> The example you give is one such, but consider this:
>
>     unsigned char flags[1000000000];
>     int offsets[num_threads];
>     /* This passes different sections of the flags to different threads,
>        for updating. */
>     fred(&flags[offsets[thread]],offsets[thread+1]-offsets[thread]);
>
> in the case where the offsets are determined by the algorithm, and
> cannot conveniently be made multiples of the cache line size. Even
> when they can be, it is at best disgusting to have to build in such
> a system-specific value just to ensure that a program is correct.
> That is the absolute antithesis of portability!

Sorry; don't see how this suffers from any kind of degradation. Could
you elaborate?

So far this looks like the case where thread T handles the subarray of
flags flags[offset[T]:offset[T+1]-1], which is pretty straightforward.
Is the degradation because the subarrays for T are unbalanced?