From: Terje Mathisen "terje.mathisen at tmsw.no" on 27 Dec 2009 05:43

Mayan Moudgill wrote:
> Let me ask the following question: assume that we have a N-process
> barrier, how do you avoid using N system calls (use any hardware
> synchronization you want)?

OK, so you want all N threads/cores/cpus to wait here until the last one arrives?

> The fundamental problem becomes that, in the general case, the first N-1
> processes that arrive at the barrier will need to (have the kernel) put
> themselves on a waiting queue, and the last process to arrive will need
> to (have the kernel) put the N-1 waiting processes back on the ready queue.
>
> I doubt that software transactional memory solves this problem.

First of all, I assume dedicated cores for each thread, i.e. gang scheduling; otherwise the OS is involved in thread switches anyway, right?

It seems to me that you could use a single LOCK XADD(counter,-1) from every thread, followed by a spin loop that contains a MONITOR instruction on 'counter' and a counter re-read to determine if it has made it all the way down to zero by now.

This would seem to give forward progress for each bus transaction: the others involved in a contention will do a hardware retry, leading to at most N updates in total, as long as the bus protocol is such that, when multiple cores try to acquire the same cache line for update, one of them will win and go on to decrement the counter and flush the line.

Alternatively, if N is _very_ large, use a multi-stage counter, where sqrt(N) cores use each of the sqrt(N) first-level counters, and the "winners" of each of those update the top-level counter and report back to the other first-level cores.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
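A minimal sketch of the flat-counter idea in C, assuming GCC-style atomic builtins on x86 (where __atomic_fetch_sub compiles to LOCK XADD); the MONITOR-based sleep is left out in favor of a plain PAUSE spin, and the barrier_wait name and thread count are illustrative, not from the post:

  #include <immintrin.h>   /* _mm_pause */

  #define N 8              /* number of participating threads (illustrative) */

  static volatile int counter = N;   /* one shared cache line */

  void barrier_wait(void)
  {
      /* LOCK XADD(counter,-1): at most N contended bus transactions in total. */
      if (__atomic_fetch_sub(&counter, 1, __ATOMIC_ACQ_REL) == 1)
          return;          /* last arrival: everyone is here */

      /* Spin until the last thread has brought the counter to zero. */
      while (__atomic_load_n(&counter, __ATOMIC_ACQUIRE) != 0)
          _mm_pause();     /* stand-in for the MONITOR-based sleep */
  }

As written this is a one-shot barrier (the counter stays at zero); reuse needs something like the two-counter scheme EricP gives below. The sqrt(N) variant would replicate the same structure one level up, with each first-level "winner" decrementing a top-level counter.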
From: Mayan Moudgill on 27 Dec 2009 13:23

Terje Mathisen wrote:
> Mayan Moudgill wrote:
>> Let me ask the following question: assume that we have a N-process
>> barrier, how do you avoid using N system calls (use any hardware
>> synchronization you want)?
>
> OK, so you want all N threads/cores/cpus to wait here until the last
> one arrives?
>
> First of all, I assume dedicated cores for each thread, i.e. gang
> scheduling; otherwise the OS is involved in thread switches anyway, right?
>
> It seems to me that you could use a single LOCK XADD(counter,-1) from
> every thread, followed by a spin loop that contains a MONITOR
> instruction on 'counter' and a counter re-read to determine if it has
> made it all the way down to zero by now.

This solution is (as stated) not general; in the restricted case being considered, it is .... ummm ... baroque? Cleaner solutions are definitely possible.

One particular red flag is the use of the MONITOR instruction. It's only available at privilege level 0. How did you get there? With a system call?
From: EricP on 27 Dec 2009 14:20

Mayan Moudgill wrote:
> Terje Mathisen wrote:
>> Mayan Moudgill wrote:
>>> Let me ask the following question: assume that we have a N-process
>>> barrier, how do you avoid using N system calls (use any hardware
>>> synchronization you want)?
>>
>> OK, so you want all N threads/cores/cpus to wait here until the last
>> one arrives?
>>
>> First of all, I assume dedicated cores for each thread, i.e. gang
>> scheduling; otherwise the OS is involved in thread switches anyway,
>> right?
>>
>> It seems to me that you could use a single LOCK XADD(counter,-1) from
>> every thread, followed by a spin loop that contains a MONITOR
>> instruction on 'counter' and a counter re-read to determine if it has
>> made it all the way down to zero by now.
>
> This solution is (as stated) not general; in the restricted case being
> considered, it is .... ummm ... baroque? Cleaner solutions are
> definitely possible.
>
> One particular red flag is the use of the MONITOR instruction. It's
> only available at privilege level 0. How did you get there? With a
> system call?

This barrier question is quite different from the atomic multiple update issue.

Monitor is available in user mode, and Mwait can be if the OS enables it.

The general solution for a barrier spinwait requires 2 counters, a WaitCount and a RunCount, ideally on separate cache lines. Each thread increments WaitCount and, if the result is less than N, spinwaits while RunCount == 0. When WaitCount hits N, that thread resets it to zero, sets RunCount = N-1 to release the peers, and proceeds. Each peer decrements RunCount and proceeds.

It may not be a great solution, because when the pack is released they all immediately try to decrement RunCount. Depending on the system that might cause a lot of contention, but it can't be avoided: the last thread leaving the barrier must shut the door behind it.

  volatile int gWaitCount;
  volatile int gRunCount;

  if (AtomicFetchAdd (&gWaitCount, 1) == THREADCOUNT-1)
  {   // Release the peers
      AtomicWrite (&gWaitCount, 0);
      AtomicWrite (&gRunCount, THREADCOUNT-1);
  }
  else
  {   // Wait for quorum
      while (AtomicRead (&gRunCount) == 0)
      {   CpuPause();
      }
      AtomicFetchAdd (&gRunCount, -1);
  }

Eric
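Eric's fragment leaves AtomicFetchAdd, AtomicWrite, AtomicRead and CpuPause undefined; one plausible way to make it compile and run, keeping his names but mapping them onto C11 atomics and the x86 PAUSE hint (this mapping is my assumption, not from his post):

  #include <stdatomic.h>
  #include <immintrin.h>                 /* _mm_pause */

  #define THREADCOUNT 8                  /* illustrative */

  /* _Alignas(64) keeps the counters on separate cache lines, per Eric's note. */
  static _Alignas(64) _Atomic int gWaitCount;
  static _Alignas(64) _Atomic int gRunCount;

  static int  AtomicFetchAdd(_Atomic int *p, int v) { return atomic_fetch_add(p, v); }
  static void AtomicWrite(_Atomic int *p, int v)    { atomic_store(p, v); }
  static int  AtomicRead(_Atomic int *p)            { return atomic_load(p); }
  static void CpuPause(void)                        { _mm_pause(); }

  void barrier(void)
  {
      if (AtomicFetchAdd(&gWaitCount, 1) == THREADCOUNT-1)
      {   // Release the peers
          AtomicWrite(&gWaitCount, 0);
          AtomicWrite(&gRunCount, THREADCOUNT-1);
      }
      else
      {   // Wait for quorum
          while (AtomicRead(&gRunCount) == 0)
          {   CpuPause();
          }
          AtomicFetchAdd(&gRunCount, -1);
      }
  }

Each thread would call barrier() at the synchronization point; as Eric notes, the released pack immediately contends to decrement gRunCount.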
From: Terje Mathisen "terje.mathisen at tmsw.no" on 27 Dec 2009 15:03

Mayan Moudgill wrote:
> Terje Mathisen wrote:
>> It seems to me that you could use a single LOCK XADD(counter,-1) from
>> every thread, followed by a spin loop that contains a MONITOR
>> instruction on 'counter' and a counter re-read to determine if it has
>> made it all the way down to zero by now.
>
> This solution is (as stated) not general; in the restricted case being
> considered, it is .... ummm ... baroque? Cleaner solutions are
> definitely possible.

If so, then I really don't see what the problem is?

> One particular red flag is the use of the MONITOR instruction. It's
> only available at privilege level 0. How did you get there? With a
> system call?

I didn't know that; until now I had assumed that MONITOR+MWAIT was a user-level way to sleep a core, without involving the OS at all.

Skipping the MONITOR part leaves me with a stock spin loop where every core re-reads the counter until it has reached zero, which also seems OK, since there's nothing else useful they can do.

What's important is _only_ the latency from when the last core executes its XADD() operation until all the waiting cores have been able to reload the counter and seen that they are ready to go on.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: "Andy "Krazy" Glew" on 27 Dec 2009 15:13
Mayan Moudgill wrote:
> Let me ask the following question: assume that we have a N-process
> barrier, how do you avoid using N system calls (use any hardware
> synchronization you want)?
>
> The fundamental problem becomes that, in the general case, the first N-1
> processes that arrive at the barrier will need to (have the kernel) put
> themselves on a waiting queue, and the last process to arrive will need
> to (have the kernel) put the N-1 waiting processes back on the ready queue.

Since I have my head in hardware multithreading land - GPUs, various flavours of CPUs with a small number of actively running threads and a larger number of threads that are not actively running in a thread pool - this is straightforward.

The first N-1 threads to arrive at the barrier switch to not actively running. They stay in the thread pool, but other threads take their place on the actively running hardware thread contexts. When the last thread arrives at the barrier, they are all marked runnable again (a toy model of this state machine is sketched at the end of this post). If the OS wants to context switch other threads/processes in/out, it can.

---

Yes, I am talking about a hardware or microcode thread scheduler.

I got into this space because of my work on speculative multithreading. In an SpMT system you typically need many more potential threads than you need actively running threads. If you don't want the SpMT to be software visible, then you need to do such scheduling in hardware or microcode, underneath the OS or VMM.

Some folks want to get the OS involved in such scheduling. Sure, maybe if you work at Microsoft. Not if you work at Intel or AMD. If you work at a hardware company, you don't want your project schedule to depend on a different company.

However, my attitude has always been to reduce the need for software support, but to allow software access. So, if you have a hardware or microcode thread scheduler for SpMT, why not, eventually, expose it to software? If it has any advantages. The sorts of advantages it might expose: things like power management, load balancing, QOS.
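Purely as illustration of the mechanism Andy describes, a sequential toy model in C (all names are invented for the sketch, and the real mechanism lives in the hardware scheduler, where the arrival count and state changes would be updated atomically):

  enum thread_state { RUNNABLE, ACTIVE, PARKED_AT_BARRIER };

  struct hw_thread
  {   enum thread_state state;
  };

  /* Toy model: early arrivals park themselves, freeing their hardware
     context for another pool thread; the last arrival wakes everyone. */
  void hw_barrier(struct hw_thread pool[], int nthreads, int *arrived, int self)
  {
      if (++*arrived < nthreads)
      {   /* Park: the context is now free for another thread in the pool. */
          pool[self].state = PARKED_AT_BARRIER;
      }
      else
      {   *arrived = 0;                       /* reset for reuse */
          for (int i = 0; i < nthreads; i++)  /* release all parked peers */
              if (pool[i].state == PARKED_AT_BARRIER)
                  pool[i].state = RUNNABLE;
      }
  }

No kernel is involved at any point: the "wait queue" here is just the PARKED_AT_BARRIER state in the pool, which is the essence of Andy's answer to the N-system-calls question.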