From: Terje Mathisen on 14 Sep 2009 03:07

Mayan Moudgill wrote:
> I guess you're implementing a mutex, not a barrier, right? If so, here's
> what we do in the case of the SB3500 (somewhat simplified).
> 1. There are a set of memory regions (I believe it's 1K addresses in the
> current implementation) that return a 32-bit value and auto-increment,
> initialized to return 0 on the first read, 1 on the second and so on.
> Call these locations the ticket-window. Each mutex has a ticket-window
> location.

So this is a hardware-based XADD implementation, avoiding all the cache
acquire/update/release traffic.

> 2. For each mutex, for each thread that can access the mutex there is a
> memory location containing the current value of the serving ticket,
> initially 0. [Note that this per-thread serving-ticket location is
> intended to be in close-to-thread memory.]
> 3. To implement a mutex:
> - a thread goes and reads the ticket-window, getting a ticket.
> - if the value of the serving ticket is the same as its ticket, it
> enters the mutex; otherwise it spin-waits for its ticket to become the
> serving ticket.
> - when it exits the mutex, it updates all threads' serving tickets to
> the next higher number. This is a remote write into the other threads'
> local memory.

Which means that each lock must have an associated list of registered
users, right?

I presume updates to the lock user list are _very_ infrequent? :-)

This also means that you have a hardware limit on the number of locks
you can support; this is (probably) OK for an embedded solution, not so
good for a general-purpose OS.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
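[For readers who want the shape of this in software: a minimal sketch of
the ticket-lock idea, where C11 atomic_fetch_add stands in for the
SB3500's auto-incrementing ticket-window and a single shared serving
counter stands in for the per-thread copies. All names are illustrative,
not from the hardware.]

    /* Ticket lock sketch in C11. Zero-initialize the struct so the
       first ticket handed out is 0 and ticket 0 is being served. */
    #include <stdatomic.h>

    typedef struct {
        atomic_uint ticket_window;  /* next ticket to hand out */
        atomic_uint now_serving;    /* ticket currently allowed in */
    } ticket_lock;

    void ticket_lock_acquire(ticket_lock *l)
    {
        /* "Read the ticket-window": one atomic XADD, no retry loop. */
        unsigned my_ticket = atomic_fetch_add(&l->ticket_window, 1);

        /* Spin until our ticket is the serving ticket. */
        while (atomic_load_explicit(&l->now_serving,
                                    memory_order_acquire) != my_ticket)
            ;  /* spin-wait */
    }

    void ticket_lock_release(ticket_lock *l)
    {
        /* Only the holder writes now_serving, so a plain +1 is safe. */
        unsigned next = atomic_load_explicit(&l->now_serving,
                                             memory_order_relaxed) + 1;
        atomic_store_explicit(&l->now_serving, next,
                              memory_order_release);
    }

[The hardware version wins precisely because the atomic_fetch_add here
still costs an exclusive cache-line acquisition per waiter, which the
memory-mapped ticket-window avoids.]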
From: Mayan Moudgill on 14 Sep 2009 08:54

Terje Mathisen wrote:
> Mayan Moudgill wrote:
>
>> I guess you're implementing a mutex, not a barrier, right? If so, here's
>> what we do in the case of the SB3500 (somewhat simplified).
>> 1. There are a set of memory regions (I believe it's 1K addresses in the
>> current implementation) that return a 32-bit value and auto-increment,
>> initialized to return 0 on the first read, 1 on the second and so on.
>> Call these locations the ticket-window. Each mutex has a ticket-window
>> location.
>
> So this is a hardware-based XADD implementation, avoiding all the cache
> acquire/update/release traffic.

Yeah; in a cached system, this would be a non-cacheable load.

>> 2. For each mutex, for each thread that can access the mutex there is a
>> memory location containing the current value of the serving ticket,
>> initially 0. [Note that this per-thread serving-ticket location is
>> intended to be in close-to-thread memory.]
>> 3. To implement a mutex:
>> - a thread goes and reads the ticket-window, getting a ticket.
>> - if the value of the serving ticket is the same as its ticket, it
>> enters the mutex; otherwise it spin-waits for its ticket to become the
>> serving ticket.
>> - when it exits the mutex, it updates all threads' serving tickets to
>> the next higher number. This is a remote write into the other threads'
>> local memory.
>
> Which means that each lock must have an associated list of registered
> users, right?

In the default implementation, all participants are known a priori. The
default case is 1:1 with the number of _hardware_threads_; whether or
not that hardware thread is actually using the lock is irrelevant. Note
that this is a function of using spin-waits.

> I presume updates to the lock user list are _very_ infrequent? :-)

Well, I'm guessing that if you allow hot-upgrades, it might happen :)

> This also means that you have a hardware limit on the number of locks
> you can support; this is (probably) OK for an embedded solution, not so
> good for a general-purpose OS.

Not really; it depends on the memory space/controller. Basically, in the
general case, there are two possible solutions, depending on the memory
controller:
- a region of memory can be given two address ranges mapping to the same
storage, one of which auto-increments the location on a read while the
other treats it as normal memory; or
- certain pages can be marked as auto-increment.

For instance, if you have an L3 controller, that controller can
implement the read-modify-write for the auto-increment (note that if you
use a cache for auto-increment locations, you don't actually have to do
the writeback immediately).
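[And a sketch of the per-thread serving-ticket variant Mayan describes,
again in C11 as a stand-in for the real close-to-thread memories and
remote writes; NTHREADS, the flat array layout, and all names are
assumptions for illustration.]

    /* Distributed ticket lock: each registered hardware thread spins
       on its own serving-ticket slot, and the releaser broadcasts the
       next serving number with one store per participant. */
    #include <stdatomic.h>

    #define NTHREADS 8  /* fixed set of h/w threads, known a priori */

    typedef struct {
        atomic_uint ticket_window;      /* emulated auto-increment */
        atomic_uint serving[NTHREADS];  /* one copy per h/w thread;
                                           on the SB3500 each would sit
                                           in that thread's local RAM */
    } dist_ticket_lock;

    void dtl_acquire(dist_ticket_lock *l, unsigned self)
    {
        unsigned my_ticket = atomic_fetch_add(&l->ticket_window, 1);

        /* Spin on our own slot only: no ping-pong between waiters. */
        while (atomic_load_explicit(&l->serving[self],
                                    memory_order_acquire) != my_ticket)
            ;  /* spin-wait */
    }

    void dtl_release(dist_ticket_lock *l, unsigned self)
    {
        unsigned next = atomic_load_explicit(&l->serving[self],
                                             memory_order_relaxed) + 1;

        /* "Remote write into other threads' local memory": update
           every registered participant's serving ticket. */
        for (unsigned t = 0; t < NTHREADS; t++)
            atomic_store_explicit(&l->serving[t], next,
                                  memory_order_release);
    }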
From: nmm1 on 14 Sep 2009 09:30

In article <ZJ6dnTKJq8hXdzDXnZ2dnUVZ8sednZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
> This is the crux of the matter:
>
> It is _easy_ to provide sync when contention is (very close to) zero; it
> is _hard_ to do the same when lots of cores/threads try to do it
> simultaneously.
>
> The usual answer to this kind of situation is "Don't do that!".

Yes. Don't try to solve the problem you actually have; give it up and
choose one that is easier to solve. Fine for ivory-tower computer
scientists; impractical for everyone else.

> To me, when I see an O(n*n) situation, I try to look for O(n*log(n))
> or O(n):
>
> One obvious approach would be to have a hierarchy of locks:
>
> Just having two levels, with sqrt(n) first-level barriers all protecting
> access to the single "real" lock, would turn the problem into something
> that takes twice as long in the zero-contention situation, but where the
> worst case becomes O(sqrt(n)*sqrt(n)) * 2 => O(n), right?

Well, yes, but .... You are also increasing the contention for the
higher-level lock in the case where there are a lot of low-contention
locks. So it could be worse than twice as slow even in that case.

Regards,
Nick Maclaren.
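[A minimal sketch of the two-level scheme under discussion, using plain
test-and-set spinlocks as a stand-in; the bucket count and all names are
illustrative assumptions.]

    /* Two-level lock: n threads are partitioned across ~sqrt(n)
       first-level locks, so the single "real" lock never sees more
       than FIRST_LEVEL simultaneous contenders. */
    #include <stdatomic.h>

    #define FIRST_LEVEL 16  /* ~sqrt(n) for up to n = 256 threads */

    typedef struct {
        atomic_uint first[FIRST_LEVEL]; /* first-level filter locks */
        atomic_uint real;               /* the single "real" lock */
    } two_level_lock;                   /* zero-init = all unlocked */

    static void spin_lock(atomic_uint *l)
    {
        while (atomic_exchange_explicit(l, 1, memory_order_acquire))
            while (atomic_load_explicit(l, memory_order_relaxed))
                ;  /* spin on a plain load to limit coherence traffic */
    }

    static void spin_unlock(atomic_uint *l)
    {
        atomic_store_explicit(l, 0, memory_order_release);
    }

    void tll_acquire(two_level_lock *l, unsigned thread_id)
    {
        /* The bucket lock is held while taking the real lock, so at
           most one thread per bucket contends on the real lock. */
        spin_lock(&l->first[thread_id % FIRST_LEVEL]);
        spin_lock(&l->real);
    }

    void tll_release(two_level_lock *l, unsigned thread_id)
    {
        spin_unlock(&l->real);
        spin_unlock(&l->first[thread_id % FIRST_LEVEL]);
    }

[The uncontended path is exactly two lock/unlock pairs instead of one,
which is the "twice as slow" cost both posters accept.]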
From: Terje Mathisen on 14 Sep 2009 10:19

nmm1(a)cam.ac.uk wrote:
> In article <ZJ6dnTKJq8hXdzDXnZ2dnUVZ8sednZ2d(a)lyse.net>,
> Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>> One obvious approach would be to have a hierarchy of locks:
>>
>> Just having two levels, with sqrt(n) first-level barriers all protecting
>> access to the single "real" lock, would turn the problem into something
>> that takes twice as long in the zero-contention situation, but where the
>> worst case becomes O(sqrt(n)*sqrt(n)) * 2 => O(n), right?
>
> Well, yes, but .... You are also increasing the contention for the
> higher-level lock in the case where there are a lot of low-contention
> locks. So it could be worse than twice as slow even in that case.

I can't see that:

In the best case, i.e. no contention at all, it is of course twice as
slow, but (except for the test & branch on the result of the first-level
lock) not worse than that.

At the other end of the spectrum, where every single thread is spinning
on the same resource, we've gotten that O(n*n) -> O(2n) == O(n)
improvement.

The worst-case intermediate position would be a single thread in each of
the first-level locks arriving simultaneously, getting it, and then
hitting the second-level (real/original) lock at the same time: in this
situation we have sqrt(n) threads accessing the same lock, taking O(n)
time.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
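[To put illustrative numbers on this (the figures are assumptions, not
from the thread), take Terje's cost model, in which draining q spinners
on one lock costs on the order of q^2 coherence operations. With
n = 1024 threads, naive spinning on a single lock costs ~1024^2, about a
million operations. With sqrt(n) = 32 first-level locks, each bucket
drains its 32 threads for ~32^2 = 1024 operations, the buckets
proceeding in parallel, and the real lock never sees more than 32
simultaneous waiters, for another ~1024: roughly 2 * 1024 operations on
the critical path, which is the O(n*n) -> O(2n) improvement he cites.]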
From: nmm1 on 14 Sep 2009 10:39
In article <JPmdnSLOCMhlzTPXnZ2dnUVZ8t6dnZ2d(a)lyse.net>,
Terje Mathisen <Terje.Mathisen(a)tmsw.no> wrote:
>
>>> One obvious approach would be to have a hierarchy of locks:
>>>
>>> Just having two levels, with sqrt(n) first-level barriers all protecting
>>> access to the single "real" lock, would turn the problem into something
>>> that takes twice as long in the zero-contention situation, but where the
>>> worst case becomes O(sqrt(n)*sqrt(n)) * 2 => O(n), right?
>>
>> Well, yes, but .... You are also increasing the contention for the
>> higher-level lock in the case where there are a lot of low-contention
>> locks. So it could be worse than twice as slow even in that case.
>
> I can't see that:
>
> In the best case, i.e. no contention at all, it is of course twice as
> slow, but (except for the test & branch on the result of the first-level
> lock) not worse than that.

I am talking about a good but not unrealistically perfect case. The O(N)
analysis isn't useful when one is talking about constant factors, of
course.

Consider a large number M of locks, each with usage rate P. Actual
clashes are K.M.P^2, for some constant K. With your new scheme, there
are sqrt(M) higher-level locks, each with a usage rate of sqrt(M).P.
Actual clashes are then K.sqrt(M).(sqrt(M).P)^2 = K.M^(3/2).P^2.

Regards,
Nick Maclaren.
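[To make the constant-factor point concrete with illustrative numbers
(the figures are assumptions, not Nick's): take M = 10^4 locks, each
with usage rate P = 10^-3. Direct locking gives K.M.P^2 =
K.10^4.10^-6 = 0.01K expected clashes. Under the hierarchy there are
sqrt(M) = 100 higher-level locks, each with usage rate sqrt(M).P = 0.1,
so clashes become K.100.(0.1)^2 = K, i.e. sqrt(M) = 100 times as many,
which is exactly the M^(3/2)/M factor in his formula.]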