From: Terje Mathisen on 14 Sep 2009 12:52

nmm1(a)cam.ac.uk wrote:
> I am talking about a good but not unrealistically perfect case.
> The O(N) analysis isn't useful when one is talking about constant
> factors, of course.
>
> Consider a large number M of locks with usage rate P, per lock.
> Actual clashes are K.M.P^2, for some K. With your new scheme,
> there are sqrt(M) higher level locks, with a usage rate of sqrt(M).P,

No, no!

There are still M top-level locks, each with sqrt(P) subsidiary locks
protecting them. (I take P as the number of users of each of the M locks.)

I.e. I am using O(M*sqrt(P)) additional memory in order to reduce the
amount of lock contention.

With 64, 128 or (in some possible situations) even 4K per lock, this is
a significant overhead of course. :-(

> per lock. Actual clashes are K.sqrt^3(M).P^2.

Not really, see above.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
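For a sense of scale, the following small, self-contained C snippet simply plugs
illustrative numbers into the O(M*sqrt(P)) extra-lock count and the 64/128/4K-byte
per-lock footprints Terje mentions; the values of M and P are arbitrary examples,
not figures from the thread.

  /* Rough memory cost of the extra-gate scheme with made-up numbers. */
  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      unsigned long M = 1024;                 /* top-level locks (example)        */
      unsigned long P = 64;                   /* users per top-level lock (example) */
      unsigned long gates = (unsigned long)sqrt((double)P); /* gates per lock      */
      unsigned long extra = M * gates;        /* extra locks, O(M*sqrt(P))         */
      unsigned long sizes[] = { 64, 128, 4096 }; /* per-lock footprints from the post */

      for (int i = 0; i < 3; i++)
          printf("%lu extra locks at %lu bytes each: %lu KB\n",
                 extra, sizes[i], extra * sizes[i] / 1024);
      return 0;
  }

With these example numbers the overhead ranges from about half a megabyte (64-byte
locks) up to 32 MB (one page per lock), which is the "significant overhead" Terje
is grimacing about.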
From: nmm1 on 14 Sep 2009 13:05

In article <S_qdnZENgIVf6TPXnZ2dnUVZ8kWdnZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
>> I am talking about a good but not unrealistically perfect case.
>> The O(N) analysis isn't useful when one is talking about constant
>> factors, of course.
>>
>> Consider a large number M of locks with usage rate P, per lock.
>> Actual clashes are K.M.P^2, for some K. With your new scheme,
>> there are sqrt(M) higher level locks, with a usage rate of sqrt(M).P,
>
> No, no!
>
> There are still M top-level locks, each with sqrt(P) subsidiary locks
> protecting them. (I take P as the number of users of each of the M locks.)
>
> I.e. I am using O(M*sqrt(P)) additional memory in order to reduce the
> amount of lock contention.

Ah! Sorry. But I still don't understand exactly what you are
suggesting. If the subsidiary locks have a wider scope, how do
you deal with two different threads getting separate subsidiary
locks for the same top-level one?


Regards,
Nick Maclaren.
From: Tim McCaffrey on 14 Sep 2009 15:30

In article
<da524b6d-bc4d-4ad7-9786-3672f7e9e52c(a)j19g2000yqk.googlegroups.com>,
MitchAlsup(a)aol.com says...
>
> On Sep 10, 10:04 pm, Mayan Moudgill <ma...(a)bestweb.net> wrote:
>> Well, synchronization can be pretty easy to implement - depends on what
>> you are trying to accomplish with it (barriers, exclusion, queues,
>> etc.).
>
> If it is so easy to implement then why are (almost) all
> synchronization models at least BigO( n**2 ) in time, per unit of
> observation? That is, it takes a minimum of n**2 memory accesses for 1
> processor to recognize that it is the processor that can attempt to
> make forward progress amongst n contending processors/threads.
>
>> But think of the OS/runtime changes that are required to guarantee that
>> we can use spin-locks efficiently.
>
> Don't want spin-locks--I want a means whereby everyone trying to
> synchronize recognizes their place in line in BigO( ln(x) ) time and
> then another means whereby they can use their place in line to access
> the queue (or whatever) without (/with minimal) interfering with the
> other threads doing likewise.
>
>> Sorry if I'm a little incoherent. Point is, in the appropriate
>> environment, with the appropriate (cheap) H/W assistance, we can
>> implement synchronization efficiently.
>
> No, it does not and cannot, and it's a problem of the primitives and
> the memory-order and cache-coherence models these primitives have to
> live within.
>
> To adequately solve the problem you have to make the BigO term smaller
> than n**2.
>

I have been working, of late, with device drivers that have multiple threads
(either worker threads or running on user threads). In the case where it was
talking to a real device, spinlock/sync issues became a bad enough problem that
I got negative scaling going from 2 to 4 CPUs. Analysis indicated that the
spin locks themselves were not the problem; it was any cache line that
could be shared between CPUs (for instance, I implemented a "lock-less"
circular queue that had no spin locks. It still spent a lot of time there due to
cache thrashing).

I think the trick is to 1) not share cache lines if you can help it, and 2)
push information instead of pulling it. The 1st will probably require deep OS
changes, because it requires running on the "correct" CPU (this is
especially difficult for a device driver), and #2 will require hardware (and
therefore OS) support.

 - Tim
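The cache-line sharing Tim describes is the classic false-sharing problem. A minimal
C sketch of the usual mitigation, giving each CPU its own cache-line-aligned slot so
one CPU's updates never invalidate a line another CPU is reading; the 64-byte line
size, the GCC alignment attribute, and the struct layout are illustrative assumptions,
not details from his driver.

  #include <stdint.h>

  #define CACHE_LINE 64   /* assumed line size; real value is platform-dependent */

  /* One counter slot per CPU, each padded and aligned to its own cache line. */
  struct percpu_counter {
      uint64_t value;
      char     pad[CACHE_LINE - sizeof(uint64_t)];
  } __attribute__((aligned(CACHE_LINE)));   /* GCC/Clang syntax */

  static struct percpu_counter counters[16];  /* one per CPU, illustrative size */

  /* Each CPU touches only its own line; a reader that sums all slots pays the
   * cross-CPU coherence traffic once per read instead of on every update. */
  static inline void counter_add(unsigned cpu, uint64_t n)
  {
      counters[cpu].value += n;
  }

The same layout trick applies to the slots of a lock-free circular queue: keeping the
producer index, the consumer index and each entry on separate lines avoids the
thrashing even when no lock is ever taken.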
From: Terje Mathisen on 14 Sep 2009 16:14

nmm1(a)cam.ac.uk wrote:
> In article <S_qdnZENgIVf6TPXnZ2dnUVZ8kWdnZ2d(a)lyse.net>,
> Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>> There are still M top-level locks, each with sqrt(P) subsidiary locks
>> protecting them. (I take P as the number of users of each of the M locks.)
>>
>> I.e. I am using O(M*sqrt(P)) additional memory in order to reduce the
>> amount of lock contention.
>
> Ah! Sorry. But I still don't understand exactly what you are
> suggesting. If the subsidiary locks have a wider scope, how do
> you deal with two different threads getting separate subsidiary
> locks for the same top-level one?

They are supposed to be different!

The first-level lock is simply a gate to allow access to the top-level
(real) lock. I suggested using a thread ID hash to select which of the
first-level gates to pass through, i.e. the code would be similar to:

  unsigned first_index = hash(thread.ID) % lock->secondary_count;
  ...
  acquire(lock->secondary_locks[first_index]);  /* gate shared by ~sqrt(P) threads */
  acquire(lock->main_lock);                     /* the real, top-level lock */
  ... Do my stuff
  release(lock->main_lock);
  release(lock->secondary_locks[first_index]);

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
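For concreteness, here is one way Terje's gate-then-main-lock pattern could be fleshed
out with POSIX mutexes. The struct layout, the function names (hier_lock_*), and the
caller-supplied thread_id used for hashing are my own illustrative choices, not his
code; the only idea taken from the post is acquiring a hashed first-level gate before
contending for the single real lock.

  #include <pthread.h>

  /* Each "real" lock carries ~sqrt(P) gate mutexes.  A thread must pass
   * through the gate its ID hashes to before it may contend for the main
   * lock, so only about one thread per gate is ever bouncing the main
   * lock's cache line at a time. */
  struct hier_lock {
      pthread_mutex_t  main_lock;
      unsigned         secondary_count;    /* ~sqrt(P) gates */
      pthread_mutex_t *secondary_locks;
  };

  void hier_lock_acquire(struct hier_lock *lock, unsigned thread_id,
                         unsigned *gate_out)
  {
      unsigned gate = thread_id % lock->secondary_count;
      pthread_mutex_lock(&lock->secondary_locks[gate]);  /* first-level gate   */
      pthread_mutex_lock(&lock->main_lock);              /* top-level real lock */
      *gate_out = gate;
  }

  void hier_lock_release(struct hier_lock *lock, unsigned gate)
  {
      pthread_mutex_unlock(&lock->main_lock);
      pthread_mutex_unlock(&lock->secondary_locks[gate]);
  }

Two threads that hash to different gates can still only reach the main lock one at a
time, which answers Nick's question: the gates do not widen the critical section, they
only thin out the crowd queued on the main lock's cache line.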
From: Mayan Moudgill on 14 Sep 2009 16:25
Tim McCaffrey wrote:
>
> I have been working, of late, with device drivers that have multiple threads
> (either worker threads or running on user threads). In the case where it was
> talking to a real device, spinlock/sync issues became a bad enough problem that
> I got negative scaling going from 2 to 4 CPUs. Analysis indicated that the
> spin locks themselves were not the problem; it was any cache line that
> could be shared between CPUs (for instance, I implemented a "lock-less"
> circular queue that had no spin locks. It still spent a lot of time there due to
> cache thrashing).
>
> I think the trick is to 1) not share cache lines if you can help it, and 2)
> push information instead of pulling it. The 1st will probably require deep OS
> changes, because it requires running on the "correct" CPU (this is
> especially difficult for a device driver), and #2 will require hardware (and
> therefore OS) support.
>
> - Tim
>

Have you tried doing the following (assuming, of course, that your h/w has
appropriate support):

1. Allocate the line without reading (dcbz in PowerPC)
2. Fill the line (normal stores)
3. Force the line to memory (dcbf in PowerPC)

You'll also have to sprinkle in some sync's, no doubt.

It looks like on the x86 architectures you may be able to get the same effect
by using write-combining (WC memory) and WBINVD (or maybe CLFLUSH).
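A rough sketch of what the CLFLUSH end of this might look like from C with compiler
intrinsics; mapping Mayan's PowerPC sequence onto sfence + clflush is my reading, not
something he spelled out, and WBINVD itself is a privileged instruction that only a
driver or kernel could issue, so it is not shown. There is no ordinary-memory x86
analogue of dcbz (the line is allocated by the write miss); write-combining or
non-temporal stores would be the closer match for step 1.

  #include <emmintrin.h>   /* SSE2: _mm_clflush; _mm_sfence via xmmintrin.h */
  #include <string.h>

  /* Push a freshly built 64-byte record toward memory instead of leaving it
   * dirty in this core's cache, so the consumer CPU (or device) does not have
   * to steal the line.  The 64-byte line size and alignment are assumptions. */
  void push_record(void *line, const void *src)
  {
      memcpy(line, src, 64);   /* 2. fill the line (normal stores)            */
      _mm_sfence();            /* the "sprinkle in some sync's" part          */
      _mm_clflush(line);       /* 3. force the line out of cache, like dcbf   */
  }

Whether this actually helps depends on the part: on many cores an explicit flush just
trades a coherence miss for a memory round trip, which is why the push-vs-pull question
Tim raises really wants hardware support rather than flush instructions.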