From: Andy 'Krazy' Glew on 4 May 2010 02:25

> On Mon, 26 Apr 2010 19:38:01 -0700, "Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote:
>> I'll do a separate followup post, with the (vain) attempt to change the topic. I don't think I have ever described my
>> SpMT ideas to this newsgroup. Nothing proprietary, just the by now really old stuff that I worked on at Wisconsin.

I will try to briefly describe my SpMT (Speculative Multithreading) ideas, as I formulated them in 1996-2000 at the University of Wisconsin. I will omit anything done 2000-2009, while at Intel and AMD, with the possible exception of June and July 2004, when I was free of corporate obligations (and when I worked on Multistar).

First, though, it may be good to summarize Haitham Akkary's DMT work, which he did and published as part of his PhD at Portland State University, although he was also an Intel employee at the time. DMT stands for Dynamic Multithreading.

Akkary's DMT of this era ran on a substrate of a multithreaded out-of-order CPU, with 2, 4, or 8 threads - pretty much what you would expect if out-of-order microarchitecture had evolved along the lines of Alpha EV8. Akkary's DMT used fork-loop-iterations, fork-after-loop, and fork-on-call. Speculative state was stored in a ROB-like structure, with store forwarding from a MOB, and in a speculative cache in some variations. If the speculative state got full, you either stopped or abandoned speculation.

Akkary's fork policy was to preserve the N least speculative threads. E.g. if you had 5 nested calls, F1 calling F2 ... calling F5, and could only support 4 threads, you would first fork-on-call at F2 inside F1 (i.e. you fork a new thread at the place inside F1 that F2 would return to), giving you two threads, the original non-speculative thread and the speculative thread; then at F3, giving you 3 threads; then at F4, giving you 4 threads; and then, if you forked at F5, you would kill the most speculative thread, the continuation forked inside F1.

Haitham got reasonable speedups - see his papers. He reported that the speedups were about half due to prefetch, and half due to speculative execution.

I adopted the term "SpMT", Speculative Multithreading, because it seems quite descriptive. "DMT", Dynamic Multithreading, was aligned with the official name for P6 out-of-order execution, "Dynamic Execution", but I never really liked that somewhat meaningless term.

I set out to do the following:

a) remove the potentially expensive hardware datastructures, like the ROB and MOB, used to hold speculative state - particularly the CAMs in the MOB. (I have spent much of my career removing CAMs.)

b) support fork policies other than "N least speculative", and/or avoid throwing work away.

Neither Haitham nor I ever felt that DMT was dependent on an SMT multithreaded substrate. That was just the substrate that looked most likely when Haitham was working on DMT in, what, 1994? By 1996 Willamette-style microarchitectures were in vogue, with smaller L0 caches which tended to get thrashed by what was eventually called HyperThreading, so I necessarily had to tweak the substrate. That tweaking led to my proposing MCMT, Multicluster Multithreading: multithreading with a shared front end, but with the scheduler, execution units, and L0 or L1 cache replicated in what I called a cluster - typically with 2 or 4 such clusters per shared front end. (AMD has inverted my original terminology, swapping cluster and core.)

Again, both Haitham and I felt that SpMT might also execute on non-threaded multicore systems.
But, at the time, multithreading looked likely enough that I gave myself permission to use a few tricks related to multithreading on an out-of-order machine.

Finally, let me say: SpMT was *not* my research at UWisc. I had originally planned to make it so, but when I learned that Haitham Akkary had scooped me wrt DMT, I decided to try to work on microarchitecture ideas past DMT/SpMT. Not much came of that, except MLP. (And unpublished work on multilevel branch predictors; on instruction "refinement", on-the-fly optimizations at the renamer and in execution; and on symbolic memory disambiguation (store forwarding).) Excessively optimistic, I somewhat expected DMT to be well under way getting built at Intel by the time I finished grad school. SpMT was my attempt to figure out what a second-generation, cleaned-up DMT might look like - AND THEN I was trying to build my research on top of that. But as we have seen, DMT did not happen; Willamette and the race for frequency did. I was wrong about my finishing grad school, too.

Okay, so let's describe my SpMT ideas:

BRIEF:

Log-based SpMT, storing speculative state in a log. Verify/re-execution to merge speculative state. Optionally: (1) forget the speculative state (prefetch-only), and/or (2) non-log speculative state mechanisms, such as a speculative versioning cache and store buffer.

Any parallel substrate - e.g. multicore with 2 threads per core.

Inter-processor communication through cache.

Fork-on-call primarily studied.

N/M: N executing threads (typically circa 4) out of a larger pool of M potential threads (16, to many more).

SUBSTRATE:

By "substrate" I mean the basic microarchitecture, without SpMT. This substrate needs to be credible as a product in its own right. As explained above, I am reasonably agnostic with respect to substrate. It could be multiple non-threaded CPUs; it could be an SMT CPU; it could be a multicore of multithreaded CPUs. I had no interest in running on a single 2-way threaded CPU, although I was interested in running on 2 or 4 2-way threaded cores. It could be MCMT, e.g. 2 clusters each running 2 threads, or a multicore thereof. I tended to want to have at least 4 threads runnable, since I expected that 2 threads did not have enough headroom. I suppose MCMT was my "reference design", although multiple 2-way threaded cores was the easiest thing to simulate that allowed the renamer fork tricks.

CACHE STRUCTURE:

MCMT has a private L1 (or L0) cache per cluster, tightly coupled with the scheduler and execution units. Note that this is NOT necessarily a private cache per thread: I usually assume that you can support two threads in the same cluster, although you might want to have a single thread per cluster as your normal operating mode, to avoid cache thrashing. I tend to assume shared L2 caches - or an L2 shared between the 2 or 4 clusters in an MCMT core, and another cache level shared between cores in a multicore system. I'll confess that my simulations were mostly private L1 and globally shared L2.

INTER-PROCESSOR COMMUNICATION:

I only ever assumed communication through the shared cache. I refuse to grant SpMT the luxury of having a dedicated set of wires between processors to make forking run faster - not unless that same set of wires is available for non-SpMT parallel programming. Communication through the shared cache might be through a hidden memory region, or might be through buffers in the shared cache that are not mapped to memory - cache-like. This was a well-established technique, one used in P6.
FORK TYPES:

I have only ever put fork-on-call (actually, forking the code that comes after the call, i.e. after the forked function returns) into any of my simulators. I neglected fork-loop-iteration, fork-after-loop, fork-to-control-convergence, etc., not because I think they are bad ideas, but mainly because others - Akkary, Gonzales - have evaluated them. Well, also a little bit because with loop forking it is easy to fall into temptation, forking loop threads that a good compiler would have parallelized. Also because I am lazy: first, to properly evaluate SpMT with loop forking, you really need to compare it to a parallelizing compiler - and I did not want to have to write one. Second, to evaluate any of the other varieties of forking you have to build an infrastructure that, for example, can find the right place to fork a loop iteration, and/or the right control independence point. Fork-on-call, on the other hand, makes it trivial to detect where to fork.

Anyway, I am not opposed to other forking techniques. I've just left them up to others, and I argue that their results might "port" to my SpMT microarchitecture. Gonzales argues that loop forking is essential. I'm willing to concede that. Except... I have seen pretty good limit study speedups just with fork-on-call. So I'm quite willing to start with that, and move on to other fork types as time permits. I did write up an amusing paper called "A whole family of mother forkers", although I am afraid that it has disappeared, along with much else, as part of the lobotomy exacerbated by FUD when I changed employer.

Also, I am most interested in challenging code. I surprised another reader of this group, who has also worked on SpMT, when I told him that I mainly concentrated on speeding up GCC - which, in my experience, is one of the most challenging benchmarks. I think that the MPEG and JPEG and other multimedia codes are too easy to speed up in other ways.

LOG-BASED VERIFY/RE-EXECUTION:

Speculative state is stored in a "log". The simplest possible log might be a record of every instruction executed, with its inputs and outputs.

E.g. consider user code

  INC r1++
  INC r1++
  INC r1++
  INC r1++
  r2 := LOAD(r1)
  STORE(r2,r4)
  etc.

This might get represented in the log in a stylized manner:

  ASSERT( r1 = 0 )
  r1 := 1
  ASSERT( r1 = 1 )
  r1 := 2
  ASSERT( r1 = 2 )
  r1 := 3
  ASSERT( r1 = 3 )
  r1 := 4
  ASSERT( r1 = 4 )
  ASSERT( LOAD M(4) == 66 )
  r2 := 66
  ASSERT( r4 = 55 )
  STORE( 66, 55 )

Although the code above is clunky, note that it is much more parallel than the original code. All of the register-to-register operations have been replaced by operations that move a constant to a register destination. Only the loads and stores remain, and these have addresses that are themselves constant.

Note that it is trivial to optimize this log, so that the above yields

  ASSERT( r1 = 0 )
  r1 := 4
  ASSERT( LOAD M(4) == 66 )
  r2 := 66
  ASSERT( r4 = 55 )
  STORE( 66, 55 )

or

  ASSERT( r1 = 0 )
  ASSERT( r4 = 55 )
  ASSERT( LOAD M(4) == 66 )
  r1 := 4
  r2 := 66
  STORE( 66, 55 )

This sort of optimization can be done at many levels, at many granularities. In any block of N instructions, there can be at most R assertions verifying the values of live-in registers, as well as checks for live-in memory locations. Certain memory references can be elided: read and write combining, etc. Plus, if you are willing to have the programmer or compiler inform the hardware as to what locations are thread private, you can elide certain store-to-load operations, such as register spill/fill.
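To make the log compression a little more concrete, here is a minimal C sketch of the optimization just illustrated. It is an illustration only, not Glew's hardware: the entry format and names such as compress_log are made up. The pass keeps just the first live-in assert and the last write per register, and leaves loads and stores alone.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NREGS    8      /* toy register file           */
#define MAX_LOG  64     /* toy limit on raw log length */

typedef enum { ASSERT_REG, WRITE_REG, ASSERT_LOAD, STORE_MEM } op_t;

typedef struct {
    op_t op;
    int  reg;           /* register number (ASSERT_REG / WRITE_REG)  */
    long val;           /* asserted, written, loaded or stored value */
    long addr;          /* memory address (ASSERT_LOAD / STORE_MEM)  */
} log_entry;

/* Keep only the first live-in assert and the last write per register;
 * loads and stores are kept untouched.  Returns the new log length. */
static int compress_log(log_entry *log, int n)
{
    bool kept_assert[NREGS] = { false };
    int  last_write[NREGS];
    bool keep[MAX_LOG];

    assert(n <= MAX_LOG);
    for (int r = 0; r < NREGS; r++)
        last_write[r] = -1;

    for (int i = 0; i < n; i++) {
        keep[i] = true;
        if (log[i].op == ASSERT_REG) {
            /* an assert is a live-in check only if the log itself has
             * not already produced that register */
            if (kept_assert[log[i].reg] || last_write[log[i].reg] >= 0)
                keep[i] = false;
            else
                kept_assert[log[i].reg] = true;
        } else if (log[i].op == WRITE_REG) {
            if (last_write[log[i].reg] >= 0)
                keep[last_write[log[i].reg]] = false;   /* dead write */
            last_write[log[i].reg] = i;
        }
    }

    int m = 0;
    for (int i = 0; i < n; i++)
        if (keep[i])
            log[m++] = log[i];
    return m;
}

int main(void)
{
    /* the r1++ / LOAD / STORE example from the post, as a raw log */
    log_entry raw[] = {
        { ASSERT_REG, 1, 0, 0 },  { WRITE_REG, 1, 1, 0 },
        { ASSERT_REG, 1, 1, 0 },  { WRITE_REG, 1, 2, 0 },
        { ASSERT_REG, 1, 2, 0 },  { WRITE_REG, 1, 3, 0 },
        { ASSERT_REG, 1, 3, 0 },  { WRITE_REG, 1, 4, 0 },
        { ASSERT_REG, 1, 4, 0 },
        { ASSERT_LOAD, 0, 66, 4 }, { WRITE_REG, 2, 66, 0 },
        { ASSERT_REG, 4, 55, 0 },  { STORE_MEM, 0, 66, 55 },
    };
    int n = compress_log(raw, (int)(sizeof raw / sizeof raw[0]));
    printf("compressed log has %d entries\n", n);   /* prints 6 */
    return 0;
}

Run on the r1++ example, it shrinks the thirteen raw entries down to the six shown in the optimized log above.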
(I arrived at the log structure largely as a solution to how to maintain familiar memory ordering models such as x86 processor ordering.)

The main piece of hardware required for log-based execution is the hardware to record the log, optimize it, and, almost undoubtedly, compress it. The log could be stored in dedicated hardware buffers. However, in my dreams the log is stored in memory - in memory that is accessible only to the microarchitecture. Indeed, many other researchers have expressed interest in accessing such a log for software debug or multiprocessor purposes.

The log is stored (in memory) in a segmented sequential manner. A chunk of log - say 4KiB - is stored sequentially. However, since threads do not fork in a manner conducive to circular buffers, non-contiguous 4KiB chunks are linked together to form the log for a single thread. There need be no limit on a speculative thread's log size, so long as more segments can be located, perhaps recycling log segments from other threads.

When a thread is to be merged - e.g. when a less speculative thread has arrived at the RETurn instruction after which the thread was forked - the processor switches into verify/re-execution mode. Essentially, it fetches entries from the log, uncompresses them, and executes them. In effect, the log is an alternate instruction set. Because the log is so much more parallel than normal execution, merging occurs much faster than normal execution - often 4X faster.

During merging / verify/re-execution the log can diverge from true execution in either data or control flow. Naively, as soon as an assertion fails, one might switch back to normal execution; thereafter the only benefit of SpMT would be prefetch. However, I have observed that there are often small divergences that it is advantageous to tolerate. For example, if control flow is converged but a data value has diverged, one can execute just that instruction, and continue. So long as the data value divergences are less than, say, 1/4 of the log stream, performance can be won.

Similarly there are often small control flow divergences. One pattern I have often seen is

  ret = foo(...)
  if( ret == -1 ) fixup;
  ...

i.e. often one tests the result of a function, and performs some local action. It is a challenge to tolerate this, but there is copious work on branch hammocks.

So far as I know, I invented this log-based verify/re-execution concept - although it certainly was current in the halls of UWisc, e.g. with Craig Zilles and my office mate Abhishek. I invented this concept not, originally, for SpMT, but as an alternative to runahead execution. Runahead caused me great intellectual pain, because in theory, asymptotically, it is as good as out-of-order. However, asymptotically log-based verify/re-execution beats runahead, because (a) it is more parallel, and (b) in terms of memory it is perfectly prefetchable and resistant to cache thrashing.

OTHER SPECULATIVE STATE:

If you want to have a speculative cache, or a big set of CAMs, you can. But I like having the log as a fallback, one that always works, one that scales well. Note that the log carries speculative results to commit. However, the log doesn't forward speculative stores to speculative loads without other mechanisms.

FORGETFULNESS:

Finally, one always has the option of just plain forgetting the speculative state. E.g. one can have a speculative L1, and just throw away the speculative data on eviction. This will lead to incorrect speculation, but the log will always catch this for correctness. I.e. forgetfulness becomes a performance problem, not a correctness problem.
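As a rough sketch of the segmented-sequential log storage described above (fixed-size chunks filled in order and linked together, so a speculative thread's log has no hard size limit), something like the following C fragment captures the idea. The structure and names are assumptions for illustration; a real implementation would keep the segments in microarchitecture-private memory and, as noted above, could recycle segments from other threads when allocation fails.

#include <stdlib.h>
#include <string.h>

#define LOG_SEG_BYTES 4096      /* payload per segment; header kept alongside for simplicity */

typedef struct log_segment {
    struct log_segment *next;   /* next (non-contiguous) chunk of this thread's log */
    size_t              used;   /* bytes filled in data[]                           */
    unsigned char       data[LOG_SEG_BYTES];
} log_segment;

typedef struct {
    log_segment *head;          /* oldest segment: where verify/re-execution starts */
    log_segment *tail;          /* segment currently being appended to              */
} thread_log;

/* Append one (already compressed) log record; allocates and links a
 * fresh segment when the current one is full. */
static int log_append(thread_log *log, const void *rec, size_t len)
{
    if (len == 0 || len > LOG_SEG_BYTES)
        return -1;
    if (!log->tail || log->tail->used + len > LOG_SEG_BYTES) {
        /* in hardware: grab a free segment, perhaps recycled from
         * another thread's log; failing that, stop speculating */
        log_segment *seg = calloc(1, sizeof *seg);
        if (!seg)
            return -1;
        if (log->tail)
            log->tail->next = seg;
        else
            log->head = seg;
        log->tail = seg;
    }
    memcpy(log->tail->data + log->tail->used, rec, len);
    log->tail->used += len;
    return 0;
}

int main(void)
{
    thread_log log = { NULL, NULL };
    unsigned char rec[32] = { 0 };      /* stand-in for one compressed record */

    for (int i = 0; i < 1000; i++)      /* forces several segments to be linked */
        if (log_append(&log, rec, sizeof rec) != 0)
            return 1;
    return 0;                           /* segments deliberately not freed: it's a sketch */
}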
THREAD POOL:

E.g. N=4 threads may be running, but there is a larger pool of potential threads, kept in a memory datastructure with register state and IP.

HOW TO FORK:

Forking requires cloning register state and transferring it to the new processor. It also requires arranging memory state - although, as mentioned above, one might simply predict no memory-carried dependencies, and rely on log-based verify/re-execution to detect errors.

Because I want to penalize the non-speculative thread as little as possible, I like using a renamer mechanism: clone the registers by cloning the renamer map tables. Thereafter, run a little runt thread that cycle-steals from the main thread, storing register state in memory (or a memory-like part of the shared cache), while signalling the recipient to pick it up if it is waiting.

Enough for now. I'm tired and I wanna go to bed.
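A toy model of the thread pool and fork hand-off described in the post above, with hypothetical names throughout: in the actual proposal the register snapshot comes from cloning the renamer map tables and is trickled out by a runt thread into memory or a memory-like part of the shared cache, whereas here it is simply copied, and a READY flag stands in for the signal to a waiting recipient.

#include <stdint.h>
#include <string.h>

#define NREGS     16
#define POOL_SIZE 16            /* potential threads; only ~4 would execute at once */

typedef enum { SLOT_FREE, SLOT_READY, SLOT_RUNNING } slot_state;

typedef struct {
    slot_state state;
    uint64_t   ip;              /* where the forked continuation starts */
    uint64_t   regs[NREGS];     /* architectural register snapshot      */
} spmt_thread;

static spmt_thread pool[POOL_SIZE];     /* lives in (hidden) memory / shared cache */

/* Called at a fork point (the return address of a call, for fork-on-call):
 * snapshot the forker's registers and continuation IP into a free pool
 * slot and mark it ready for any idle core.  In hardware this store and
 * the READY flag would go through the shared cache, not plain memory. */
static int fork_continuation(const uint64_t regs[NREGS], uint64_t return_ip)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].state == SLOT_FREE) {
            memcpy(pool[i].regs, regs, sizeof pool[i].regs);
            pool[i].ip = return_ip;
            pool[i].state = SLOT_READY;     /* "signal the recipient to pick it up" */
            return i;
        }
    }
    return -1;                              /* pool full: simply don't fork */
}

/* Called by an idle core/cluster looking for speculative work. */
static int claim_thread(uint64_t regs_out[NREGS], uint64_t *ip_out)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].state == SLOT_READY) {
            memcpy(regs_out, pool[i].regs, sizeof pool[i].regs);
            *ip_out = pool[i].ip;
            pool[i].state = SLOT_RUNNING;
            return i;
        }
    }
    return -1;                              /* nothing to run */
}

int main(void)
{
    uint64_t regs[NREGS] = { 0 };
    uint64_t ip = 0;
    int slot = fork_continuation(regs, 0x401000);   /* 0x401000: made-up return address */
    return (slot >= 0 && claim_thread(regs, &ip) == slot) ? 0 : 1;
}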
From: Terje Mathisen on 4 May 2010 02:48

Andy 'Krazy' Glew wrote:
[huge snip]
> HOW TO FORK:
>
> Forking requires cloning register state and transferring it to the new
> processor. It also requires arranging memory state - although, as
> mentioned above, one might simply predict no memory-carried
> dependencies, and rely on log-based verify/re-execution to detect errors.
>
> Because I want to penalize the non-speculative thread as little as
> possible, I like using a renamer mechanism: clone the registers by
> cloning the renamer map tables. Thereafter, run a little runt thread
> that cycle-steals from the main thread, storing register state in
> memory (or a memory-like part of the shared cache), while signalling
> the recipient to pick it up if it is waiting.
>
> Enough for now. I'm tired and I wanna go to bed.

Thanks for taking the time to write one of the more information-filled c.arch posts in quite a while!

Terje

PS. We had a pretty bad return flight from SFO: the travel agent had forgotten to book seats, so we ended up against the bulkhead, with zero room to recline. :-)

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Torben Ægidius Mogensen on 4 May 2010 04:23

Andy 'Krazy' Glew <ag-news(a)patten-glew.net> writes:

> Okay, so let's describe my SpMT ideas:
>
> BRIEF:
>
> Log-based SpMT, storing speculative state in a log. Verify/re-execution
> to merge speculative state. Optionally: (1) forget the speculative
> state (prefetch-only), and/or (2) non-log speculative state mechanisms,
> such as a speculative versioning cache and store buffer.
> Any parallel substrate - e.g. multicore with 2 threads per core.
> Inter-processor communication through cache.
> Fork-on-call primarily studied.
> N/M: N executing threads (typically circa 4) out of a larger pool of M potential threads (16, to many more).
>
> I have only ever put fork-on-call (actually, forking the code that
> comes after the call, i.e. after the forked function returns) into any
> of my simulators.
>
> Speculative state is stored in a "log". The simplest possible log
> might be a record of every instruction executed, with its inputs and
> outputs.
>
> E.g. consider user code
>
>   INC r1++
>   INC r1++
>   INC r1++
>   INC r1++
>   r2 := LOAD(r1)
>   STORE(r2,r4)
>   etc.
>
> This might get represented in the log in a stylized manner:
>
>   ASSERT( r1 = 0 )
>   r1 := 1
>   ASSERT( r1 = 1 )
>   r1 := 2
>   ASSERT( r1 = 2 )
>   r1 := 3
>   ASSERT( r1 = 3 )
>   r1 := 4
>   ASSERT( r1 = 4 )
>   ASSERT( LOAD M(4) == 66 )
>   r2 := 66
>   ASSERT( r4 = 55 )
>   STORE( 66, 55 )

When you make a function call, you will normally preserve all live registers through the calling convention, so you can assume these are unchanged except for the register(s) that contain the function result. Hence, I don't see any need to log register values. Since the function result is hard to predict, it would make sense to just block the thread when it needs the function result and, hence, only log memory accesses and register writes. The latter can, as you mentioned, be compressed so that only the last value written to a register is kept in the log.

Obviously, this assumes that you can rely on the calling convention being observed, but unless you are trying to run legacy code that is not an unreasonable assumption.

If you can somehow guarantee that a memory location will not be changed by the function call, you don't need to log reads from this location. If the compiler can guarantee this, it could mark a load instruction as safe (meaning no log is required). The compiler can use static analysis to determine safety or rely on language invariants. For example, in a functional language, most memory reads would automatically be safe.

For a functional language, fork-on-call is very effective and is, indeed, often used when parallelising functional languages. Normally, no log is kept (except the most recent values of registers) and you simply block a thread when it reads or writes a location that is not guaranteed to be safe. Newly allocated memory is, for example, safe to write to.

Torben
From: Andy 'Krazy' Glew on 4 May 2010 10:01

On 5/4/2010 1:23 AM, Torben Ægidius Mogensen wrote:
> Andy 'Krazy' Glew <ag-news(a)patten-glew.net> writes:
>> BRIEF:
>>
>> Log-based SpMT, storing speculative state in a log. Verify/re-execution
>> to merge speculative state. Optionally: (1) forget the speculative
>> state (prefetch-only), and/or (2) non-log speculative state mechanisms,
>> such as a speculative versioning cache and store buffer.
>> Any parallel substrate - e.g. multicore with 2 threads per core.
>> Inter-processor communication through cache.
>> Fork-on-call primarily studied.
>> N/M: N executing threads (typically circa 4) out of a larger pool of M potential threads (16, to many more).

> When you make a function call, you will normally preserve all live
> registers through the calling convention, so you can assume these are
> unchanged except for the register(s) that contain the function result.
> Hence, I don't see any need to log register values.

Haitham did this in DMT. He stored a bitmask indicating which registers had changed across a function. IIRC he started assuming all unchanged, and corrected them.

> Since the function result is hard to predict,

Some function results are easy to predict: e.g. syscall library functions that return -1 on error. malloc returning 0 (null).

By the way, that reminds me: malloc was often the limiter on performance. Needlessly so. Many malloc implementations are serialized, incrementing pointers, etc. Yan Solihin et al have fixed this for semi-explicitly parallel code, http://www.informationweek.com/news/security/app-security/showArticle.jhtml?articleID=224201364. You can largely fix it for implicitly parallel code like SpMT by

> it would make sense to just block the thread
> when it needs the function result

I have done this. It cuts the limit study performance down quite a bit.

Instead I try to skip around the often quite small amounts of code that depend on the function value.

> Obviously, this assumes that you can rely on the calling convention
> being observed, but unless you are trying to run legacy code that is not
> an unreasonable assumption.

Agreed. However, with my Intel hat on I wanted to apply this to legacy code.

I think that it would be quite promising to explore compiler-cooperative SpMT.
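As an aside on "easy to predict" function results: a simple last-value return-value predictor, of the kind studied in the SpMT and value-prediction literature, might look roughly like the sketch below. This is an illustration, not a mechanism from the thread; for malloc- or syscall-style calls the useful prediction is really "not the error value" - enough to resolve the error-check branch - rather than the exact return value.

#include <stdint.h>

#define RVP_ENTRIES 1024

typedef struct {
    uint64_t tag;           /* call-site PC             */
    uint64_t last_value;    /* value returned last time */
    uint8_t  confidence;    /* saturating counter, 0..3 */
} rvp_entry;

static rvp_entry rvp[RVP_ENTRIES];

static rvp_entry *rvp_lookup(uint64_t call_pc)
{
    return &rvp[(call_pc >> 2) % RVP_ENTRIES];
}

/* Returns 1 and fills *pred if the predictor is confident enough for a
 * speculative thread to consume the predicted return value. */
static int rvp_predict(uint64_t call_pc, uint64_t *pred)
{
    rvp_entry *e = rvp_lookup(call_pc);
    if (e->tag == call_pc && e->confidence >= 2) {
        *pred = e->last_value;
        return 1;
    }
    return 0;               /* no confident prediction: block, or skip past uses of the value */
}

/* Called at verify/merge time, once the real return value is known. */
static void rvp_update(uint64_t call_pc, uint64_t actual)
{
    rvp_entry *e = rvp_lookup(call_pc);
    if (e->tag != call_pc) {
        e->tag = call_pc;               /* new (or evicted) call site */
        e->last_value = actual;
        e->confidence = 0;
    } else if (e->last_value == actual) {
        if (e->confidence < 3)
            e->confidence++;
    } else {
        e->last_value = actual;         /* misprediction: retrain */
        e->confidence = 0;
    }
}

int main(void)
{
    uint64_t pred = 0;
    for (int i = 0; i < 4; i++)
        rvp_update(0x400123, 7);        /* the call at (made-up) PC 0x400123 keeps returning 7 */
    return (rvp_predict(0x400123, &pred) && pred == 7) ? 0 : 1;
}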
From: Torben Ægidius Mogensen on 5 May 2010 08:41
Andy 'Krazy' Glew <ag-news(a)patten-glew.net> writes:

> On 5/4/2010 1:23 AM, Torben Ægidius Mogensen wrote:
>> Since the function result is hard to predict,
>
> Some function results are easy to predict: e.g. syscall library
> functions that return -1 on error. malloc returning 0 (null).

True. Error values can be predicted as rare.

> By the way, that reminds me: malloc was often the limiter on
> performance. Needlessly so. Many malloc implementations are
> serialized, incrementing pointers, etc.

Many malloc implementations are needlessly costly, period. If you allocate and deallocate a lot of small objects, GC is nearly always faster. And if you don't, the cost of either GC or malloc/free is unimportant.

>> it would make sense to just block the thread
>> when it needs the function result
>
> I have done this. It cuts the limit study performance down quite a bit.
>
> Instead I try to skip around the often quite small amounts of code that depend on the function value.

It should be possible to statically schedule the code so that code that depends on the return value is pushed as far back as possible. This way, you can keep the hardware simpler. This doesn't solve the error-value problem, though.

>> Obviously, this assumes that you can rely on the calling convention
>> being observed, but unless you are trying to run legacy code that is not
>> an unreasonable assumption.
>
> Agreed. However, with my Intel hat on I wanted to apply this to legacy code.

Quite. Legacy code must be a huge problem for Intel.

> I think that it would be quite promising to explore compiler-cooperative SpMT.

I think so too. In many cases, the compiler will know things that are almost impossible to detect at run-time, such as memory locations that never change value. Even more potential exists in co-designing languages and hardware so the language is designed to preserve invariants/properties that the hardware relies on to parallelise code. But this immediately gets you into the legacy problem: you cannot run old code efficiently, so it limits your customer base.

Torben
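A small made-up example of the static scheduling Torben suggests: both functions below compute the same result, but in the second everything independent of foo()'s return value runs first, so a thread forked at the call has a longer run before it must block on (or verify) the result. The function names and bodies are stand-ins chosen only so the example compiles.

#include <stdio.h>

/* stand-ins so the example compiles and runs */
static int foo(int x) { return (x < 0) ? -1 : 2 * x; }
static void consume(int a, int b, int c) { printf("%d %d %d\n", a, b, c); }

/* Before: the return value of foo() is consumed immediately, so a
 * thread forked at the call stalls (or must predict) right away. */
static void caller_before(int x, const int *buf, int n)
{
    int r = foo(x);
    int checked = (r == -1) ? 0 : r;    /* depends on r at once */
    int sum = 0;
    for (int i = 0; i < n; i++)         /* independent of r */
        sum += buf[i];
    consume(checked, sum, n);
}

/* After: everything independent of r is scheduled first, pushing the
 * first use of the return value as far back as possible. */
static void caller_after(int x, const int *buf, int n)
{
    int r = foo(x);                     /* fork-on-call point */
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i];
    int checked = (r == -1) ? 0 : r;    /* first use of r, now late */
    consume(checked, sum, n);
}

int main(void)
{
    int buf[] = { 1, 2, 3, 4 };
    caller_before(5, buf, 4);
    caller_after(5, buf, 4);
    return 0;
}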