From: Andy 'Krazy' Glew on 4 May 2010 02:25

> On Mon, 26 Apr 2010 19:38:01 -0700, "Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote:
>> I'll do a separate followup post, with the (vain) attempt to change the topic. I don't think I have ever described my
>> SpMT ideas to this newsgroup. Nothing proprietary, just the by now really old stuff that I worked on at Wisconsin.

I will try to briefly describe my SpMT (Speculative Multithreading) ideas, as I formulated them in 1996-2000 at the University of Wisconsin. I will omit anything done 2000-2009, while at Intel and AMD, with the possible exception of June and July 2004, when I was free of corporate obligations (and when I worked on Multistar).

First, though, it may be good to summarize Haitham Akkary's DMT work, which he did and published as part of his PhD at Portland State University, although he was also an Intel employee at the time. DMT stands for Dynamic Multithreading.

Akkary's DMT of this era ran on a substrate of a multithreaded out-of-order CPU, with 2, 4, or 8 threads - pretty much what you would expect if out-of-order microarchitecture had evolved along the lines of Alpha EV8. Akkary's DMT used fork-loop-iterations, fork-after-loop, and fork-on-call. Speculative state was stored in a ROB-like structure, with store forwarding from a MOB, and in a speculative cache in some variations. If the speculative state got full, you either stopped or abandoned speculation.

Akkary's fork policy was to preserve the N least speculative threads. E.g. if you had 5 nested calls, F1 calling F2 ... calling F5, and could only support 4 threads, you would first fork-on-call at F2 inside F1 (i.e. you fork a new thread at the place inside F1 that F2 would return to), giving you two threads, the original non-speculative thread and the speculative thread; then at F3, giving you 3 threads; then at F4, giving you 4 threads; and then, if you forked at F5, you would kill the most speculative thread, the continuation forked inside F1.

Haitham got reasonable speedups - see his papers. He reported that the speedups were about half due to prefetch, and half due to speculative execution.

I adopted the term "SpMT", Speculative Multithreading, because it seems quite descriptive. "DMT", Dynamic Multithreading, was aligned with the official name for P6 out-of-order execution, "Dynamic Execution", but I never really liked that somewhat meaningless term.

I set out to do the following:

a) remove the potentially expensive hardware datastructures, like the ROB and MOB, used to hold speculative state - particularly the CAMs in the MOB. (I have spent much of my career removing CAMs.)

b) support fork policies other than "N least speculative", and/or avoid throwing work away.

Neither Haitham nor I ever felt that DMT was dependent on an SMT multithreaded substrate. That was just the substrate that looked most likely when Haitham was working on DMT in, what, 1994? By 1996 Willamette-style microarchitectures were in vogue, with smaller L0 caches which tended to get thrashed by what was eventually called HyperThreading, so I necessarily had to tweak the substrate. That tweaking led to my proposing MCMT, Multicluster Multithreading: multithreading with a shared front end, but with the scheduler, execution units, and L0 or L1 cache replicated in what I called a cluster - typically with 2 or 4 such clusters per shared front end. (AMD has inverted my original terminology, swapping cluster and core.)

Again, both Haitham and I felt that SpMT might also execute on non-threaded multicore systems.
But, at the time, multithreading looked likely enough that I gave myself permission to use a few tricks related to multithreading on an out-of-order machine.

Finally, let me say: SpMT was *not* my research at UWisc. I had originally planned to make it so, but when I learned that Haitham Akkary had scooped me wrt DMT, I decided to try to work on microarchitecture ideas past DMT/SpMT. Not much came of that, except MLP. (And unpublished work on multilevel branch predictors; on instruction "refinement", on-the-fly optimizations at the renamer and in execution; and on symbolic memory disambiguation (store forwarding).) Excessively optimistic, I somewhat expected DMT to be well under way getting built at Intel by the time I finished grad school. SpMT was my attempt to figure out what a second-generation, cleaned-up DMT might look like - AND THEN I was trying to build my research on top of that. But as we have seen, DMT did not happen; Willamette and the race for frequency did. I was wrong about my finishing grad school, too.

Okay, so let's describe my SpMT ideas:

BRIEF:

Log-based SpMT, storing speculative state in a log. Verify/re-execution to merge speculative state. Optionally: (1) forget the speculative state (prefetch-only), and/or (2) non-log speculative state mechanisms, such as a speculative versioning cache and store buffer.

Any parallel substrate - e.g. multicore with 2 threads per core.

Inter-processor communication through cache.

Fork-on-call primarily studied.

N/M: N executing threads (typically circa 4) out of a larger pool of M potential threads (16, to many more).

SUBSTRATE:

By "substrate" I mean the basic microarchitecture, without SpMT. This substrate needs to be credible as a product in its own right. As explained above, I am reasonably agnostic with respect to substrate. It could be multiple non-threaded CPUs; it could be an SMT CPU; it could be a multicore of multithreaded CPUs. I had no interest in running on a single 2-way threaded CPU, although I was interested in running on 2 or 4 2-way threaded cores. It could be MCMT, e.g. 2 clusters each running 2 threads, or a multicore thereof. I tended to want to have at least 4 threads runnable, since I expected that 2 threads did not have enough headroom. I suppose MCMT was my "reference design", although multiple 2-way threaded cores was the easiest thing to simulate that allowed the renamer fork tricks.

CACHE STRUCTURE:

MCMT has a private L1 (or L0) cache per cluster, tightly coupled with the scheduler and execution units. Note that this is NOT necessarily a private cache per thread: I usually assume that you can support two threads in the same cluster, although you might want to have a single thread per cluster as your normal operating mode, to avoid cache thrashing. I tend to assume shared L2 caches - or an L2 shared between the 2 or 4 clusters in an MCMT core, and another cache level shared between cores in a multicore system. I'll confess that my simulations were mostly private L1 and globally shared L2.

INTER-PROCESSOR COMMUNICATION:

I only ever assumed communication through the shared cache. I refuse to grant SpMT the luxury of having a dedicated set of wires between processors to make forking run faster - not unless that same set of wires is available for non-SpMT parallel programming. Communication through the shared cache might be through a hidden memory region, or might be through buffers in the shared cache that are not mapped to memory - cache-like. This was a well-established technique, one used in P6.
FORK TYPES:

I have only ever put fork-on-call (actually, forking the code that comes after the call, i.e. after the forked function returns) into any of my simulators. I neglected fork-loop-iteration, fork-after-loop, fork-to-control-convergence, etc., not because I think they are bad ideas, but mainly because others - Akkary, Gonzales - have evaluated them. Well, also a little bit because with loop forking it is easy to fall into temptation, forking loop threads that a good compiler would have parallelized. Also because I am lazy: first, to properly evaluate SpMT with loop forking, you really need to compare it to a parallelizing compiler - and I did not want to have to write one. Second, to evaluate any of the other varieties of forking you have to build an infrastructure that, for example, can find the right place to fork a loop iteration, and/or the right control independence point. Fork-on-call, on the other hand, makes it trivial to detect where to fork.

Anyway, I am not opposed to other forking techniques. I've just left them up to others, and I argue that their results might "port" to my SpMT microarchitecture. Gonzales argues that loop forking is essential. I'm willing to concede that. Except... I have seen pretty good limit study speedups just with fork-on-call. So I'm quite willing to start with that, and move on to other fork types as time permits. I did write up an amusing paper called "A whole family of mother forkers", although I am afraid that it has disappeared, along with much else, as part of the lobotomy exacerbated by FUD when I changed employer.

Also, I am most interested in challenging code. I surprised another reader of this group, who has also worked on SpMT, when I told him that I mainly concentrated on speeding up GCC - which, in my experience, is one of the most challenging benchmarks. I think that the MPEG and JPEG and other multimedia codes are too easy to speed up in other ways.

LOG-BASED VERIFY/RE-EXECUTION:

Speculative state is stored in a "log". The simplest possible log might be a record of every instruction executed, with its inputs and outputs.

E.g. consider user code

  INC r1++
  INC r1++
  INC r1++
  INC r1++
  r2 := LOAD(r1)
  STORE(r2,r4)
  etc.

This might get represented in the log in a stylized manner:

  ASSERT( r1 = 0 )
  r1 := 1
  ASSERT( r1 = 1 )
  r1 := 2
  ASSERT( r1 = 2 )
  r1 := 3
  ASSERT( r1 = 3 )
  r1 := 4
  ASSERT( r1 = 4 )
  ASSERT( LOAD M(4) == 66 )
  r2 := 66
  ASSERT( r4 = 55 )
  STORE( 66, 55 )

Although the code above is clunky, note that it is much more parallel than the original code. All of the register-to-register operations have been replaced by operations that move a constant to a register destination. Only the loads and stores remain, and these have addresses that are themselves constant.

Note that it is trivial to optimize this log, so that the above yields

  ASSERT( r1 = 0 )
  r1 := 4
  ASSERT( LOAD M(4) == 66 )
  r2 := 66
  ASSERT( r4 = 55 )
  STORE( 66, 55 )

or

  ASSERT( r1 = 0 )
  ASSERT( r4 = 55 )
  ASSERT( LOAD M(4) == 66 )
  r1 := 4
  r2 := 66
  STORE( 66, 55 )

This sort of optimization can be done at many levels, at many granularities. In any block of N instructions, there can be at most R assertions verifying the values of live-in registers, as well as checks for live-in memory locations. Certain memory references can be elided: read and write combining, etc. Plus, if you are willing to have the programmer or compiler inform the hardware as to what locations are thread private, you can elide certain store-to-load operations, such as register spill/fill.
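To make the log compression a little more concrete, here is a minimal C sketch of the optimization just illustrated. It is an illustration only, not Glew's hardware: the entry format and names such as compress_log are made up. The pass keeps just the first live-in assert and the last write per register, and leaves loads and stores alone.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NREGS    8      /* toy register file           */
#define MAX_LOG  64     /* toy limit on raw log length */

typedef enum { ASSERT_REG, WRITE_REG, ASSERT_LOAD, STORE_MEM } op_t;

typedef struct {
    op_t op;
    int  reg;           /* register number (ASSERT_REG / WRITE_REG)  */
    long val;           /* asserted, written, loaded or stored value */
    long addr;          /* memory address (ASSERT_LOAD / STORE_MEM)  */
} log_entry;

/* Keep only the first live-in assert and the last write per register;
 * loads and stores are kept untouched.  Returns the new log length. */
static int compress_log(log_entry *log, int n)
{
    bool kept_assert[NREGS] = { false };
    int  last_write[NREGS];
    bool keep[MAX_LOG];

    assert(n <= MAX_LOG);
    for (int r = 0; r < NREGS; r++)
        last_write[r] = -1;

    for (int i = 0; i < n; i++) {
        keep[i] = true;
        if (log[i].op == ASSERT_REG) {
            /* an assert is a live-in check only if the log itself has
             * not already produced that register */
            if (kept_assert[log[i].reg] || last_write[log[i].reg] >= 0)
                keep[i] = false;
            else
                kept_assert[log[i].reg] = true;
        } else if (log[i].op == WRITE_REG) {
            if (last_write[log[i].reg] >= 0)
                keep[last_write[log[i].reg]] = false;   /* dead write */
            last_write[log[i].reg] = i;
        }
    }

    int m = 0;
    for (int i = 0; i < n; i++)
        if (keep[i])
            log[m++] = log[i];
    return m;
}

int main(void)
{
    /* the r1++ / LOAD / STORE example from the post, as a raw log */
    log_entry raw[] = {
        { ASSERT_REG, 1, 0, 0 },  { WRITE_REG, 1, 1, 0 },
        { ASSERT_REG, 1, 1, 0 },  { WRITE_REG, 1, 2, 0 },
        { ASSERT_REG, 1, 2, 0 },  { WRITE_REG, 1, 3, 0 },
        { ASSERT_REG, 1, 3, 0 },  { WRITE_REG, 1, 4, 0 },
        { ASSERT_REG, 1, 4, 0 },
        { ASSERT_LOAD, 0, 66, 4 }, { WRITE_REG, 2, 66, 0 },
        { ASSERT_REG, 4, 55, 0 },  { STORE_MEM, 0, 66, 55 },
    };
    int n = compress_log(raw, (int)(sizeof raw / sizeof raw[0]));
    printf("compressed log has %d entries\n", n);   /* prints 6 */
    return 0;
}

Run on the r1++ example, it shrinks the thirteen raw entries down to the six shown in the optimized log above.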
(I arrived at the log structure largely as a solution to how to maintain familiar memory ordering models such as x86 processor ordering.)

The main piece of hardware required for log-based execution is the hardware to record the log, optimize it, and, almost undoubtedly, compress it. The log could be stored in dedicated hardware buffers. However, in my dreams the log is stored in memory - in memory that is accessible only to the microarchitecture. Indeed, many other researchers have expressed interest in accessing such a log for software debug or multiprocessor purposes.

The log is stored (in memory) in a segmented sequential manner. A chunk of log - say 4KiB - is stored sequentially. However, since threads do not fork in a manner conducive to circular buffers, non-contiguous 4KiB chunks are linked together to form the log for a single thread. There need be no limit on a speculative thread's log size, so long as more segments can be located, perhaps recycling log segments from other threads.

When a thread is to be merged - e.g. when a less speculative thread has arrived at the RETurn instruction after which the thread was forked - the processor switches into verify/re-execution mode. Essentially, it fetches entries from the log, uncompresses them, and executes them. In effect, the log is an alternate instruction set. Because the log is so much more parallel than normal execution, merging occurs much faster than normal execution - often 4X faster.

During merging / verify/re-execution the log can diverge from true execution in either data or control flow. Naively, as soon as an assertion fails, one might switch back to normal execution; thereafter the only benefit of SpMT would be prefetch. However, I have observed that there are often small divergences that it is advantageous to tolerate. For example, if control flow is converged but a data value has diverged, one can execute just that instruction, and continue. So long as the data value divergences are less than, say, 1/4 of the log stream, performance can be won.

Similarly there are often small control flow divergences. One pattern I have often seen is

  ret = foo(...)
  if( ret == -1 ) fixup;
  ...

i.e. often one tests the result of a function, and performs some local action. It is a challenge to tolerate this, but there is copious work on branch hammocks.

So far as I know, I invented this log-based verify/re-execution concept - although it certainly was current in the halls of UWisc, e.g. with Craig Zilles and my office mate Abhishek. I invented this concept not, originally, for SpMT, but as an alternative to runahead execution. Runahead caused me great intellectual pain, because in theory, asymptotically, it is as good as out-of-order. However, asymptotically log-based verify/re-execution beats runahead, because (a) it is more parallel, and (b) in terms of memory it is perfectly prefetchable and resistant to cache thrashing.

OTHER SPECULATIVE STATE:

If you want to have a speculative cache, or a big set of CAMs, you can. But I like having the log as a fallback, one that always works, one that scales well. Note that the log carries speculative results to commit. However, the log doesn't forward speculative stores to speculative loads without other mechanisms.

FORGETFULNESS:

Finally, one always has the option of just plain forgetting the speculative state. E.g. one can have a speculative L1, and just throw away the speculative data on eviction. This will lead to incorrect speculation, but the log will always catch this for correctness. I.e. forgetfulness becomes a performance problem, not a correctness problem.
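As a rough sketch of the segmented-sequential log storage described above (fixed-size chunks filled in order and linked together, so a speculative thread's log has no hard size limit), something like the following C fragment captures the idea. The structure and names are assumptions for illustration; a real implementation would keep the segments in microarchitecture-private memory and, as noted above, could recycle segments from other threads when allocation fails.

#include <stdlib.h>
#include <string.h>

#define LOG_SEG_BYTES 4096      /* payload per segment; header kept alongside for simplicity */

typedef struct log_segment {
    struct log_segment *next;   /* next (non-contiguous) chunk of this thread's log */
    size_t              used;   /* bytes filled in data[]                           */
    unsigned char       data[LOG_SEG_BYTES];
} log_segment;

typedef struct {
    log_segment *head;          /* oldest segment: where verify/re-execution starts */
    log_segment *tail;          /* segment currently being appended to              */
} thread_log;

/* Append one (already compressed) log record; allocates and links a
 * fresh segment when the current one is full. */
static int log_append(thread_log *log, const void *rec, size_t len)
{
    if (len == 0 || len > LOG_SEG_BYTES)
        return -1;
    if (!log->tail || log->tail->used + len > LOG_SEG_BYTES) {
        /* in hardware: grab a free segment, perhaps recycled from
         * another thread's log; failing that, stop speculating */
        log_segment *seg = calloc(1, sizeof *seg);
        if (!seg)
            return -1;
        if (log->tail)
            log->tail->next = seg;
        else
            log->head = seg;
        log->tail = seg;
    }
    memcpy(log->tail->data + log->tail->used, rec, len);
    log->tail->used += len;
    return 0;
}

int main(void)
{
    thread_log log = { NULL, NULL };
    unsigned char rec[32] = { 0 };      /* stand-in for one compressed record */

    for (int i = 0; i < 1000; i++)      /* forces several segments to be linked */
        if (log_append(&log, rec, sizeof rec) != 0)
            return 1;
    return 0;                           /* segments deliberately not freed: it's a sketch */
}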
THREAD POOL:

E.g. N=4 threads may be running, but there is a larger pool of potential threads, kept in a memory datastructure with register state and IP.

HOW TO FORK:

Forking requires cloning register state and transferring it to the new processor. It also requires arranging memory state - although, as mentioned above, one might simply predict no memory-carried dependencies, and rely on log-based verify/re-execution to detect errors.

Because I want to penalize the non-speculative thread as little as possible, I like using a renamer mechanism: clone the registers by cloning the renamer map tables. Thereafter, run a little runt thread that cycle-steals from the main thread, storing register state in memory (or a memory-like part of the shared cache), while signalling the recipient to pick it up if it is waiting.

Enough for now. I'm tired and I wanna go to bed.
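A toy model of the thread pool and fork hand-off described in the post above, with hypothetical names throughout: in the actual proposal the register snapshot comes from cloning the renamer map tables and is trickled out by a runt thread into memory or a memory-like part of the shared cache, whereas here it is simply copied, and a READY flag stands in for the signal to a waiting recipient.

#include <stdint.h>
#include <string.h>

#define NREGS     16
#define POOL_SIZE 16            /* potential threads; only ~4 would execute at once */

typedef enum { SLOT_FREE, SLOT_READY, SLOT_RUNNING } slot_state;

typedef struct {
    slot_state state;
    uint64_t   ip;              /* where the forked continuation starts */
    uint64_t   regs[NREGS];     /* architectural register snapshot      */
} spmt_thread;

static spmt_thread pool[POOL_SIZE];     /* lives in (hidden) memory / shared cache */

/* Called at a fork point (the return address of a call, for fork-on-call):
 * snapshot the forker's registers and continuation IP into a free pool
 * slot and mark it ready for any idle core.  In hardware this store and
 * the READY flag would go through the shared cache, not plain memory. */
static int fork_continuation(const uint64_t regs[NREGS], uint64_t return_ip)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].state == SLOT_FREE) {
            memcpy(pool[i].regs, regs, sizeof pool[i].regs);
            pool[i].ip = return_ip;
            pool[i].state = SLOT_READY;     /* "signal the recipient to pick it up" */
            return i;
        }
    }
    return -1;                              /* pool full: simply don't fork */
}

/* Called by an idle core/cluster looking for speculative work. */
static int claim_thread(uint64_t regs_out[NREGS], uint64_t *ip_out)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (pool[i].state == SLOT_READY) {
            memcpy(regs_out, pool[i].regs, sizeof pool[i].regs);
            *ip_out = pool[i].ip;
            pool[i].state = SLOT_RUNNING;
            return i;
        }
    }
    return -1;                              /* nothing to run */
}

int main(void)
{
    uint64_t regs[NREGS] = { 0 };
    uint64_t ip = 0;
    int slot = fork_continuation(regs, 0x401000);   /* 0x401000: made-up return address */
    return (slot >= 0 && claim_thread(regs, &ip) == slot) ? 0 : 1;
}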
From: Terje Mathisen on 4 May 2010 02:48

Andy 'Krazy' Glew wrote:
[huge snip]
> HOW TO FORK:
>
> Forking requires cloning register state and transferring it to the new
> processor. It also requires arranging memory state - although, as
> mentioned above, one might simply predict no memory-carried
> dependencies, and rely on log-based verify/re-execution to detect errors.
>
> Because I want to penalize the non-speculative thread as little as
> possible, I like using a renamer mechanism: clone the registers by
> cloning the renamer map tables. Thereafter, run a little runt thread
> that cycle-steals from the main thread, storing register state in
> memory (or a memory-like part of the shared cache), while signalling
> the recipient to pick it up if it is waiting.
>
> Enough for now. I'm tired and I wanna go to bed.

Thanks for taking the time to write one of the more information-filled c.arch posts in quite a while!

Terje

PS. We had a pretty bad return flight from SFO: the travel agent had forgotten to book seats, so we ended up against the bulkhead, with zero room to recline. :-)

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Torben Ægidius Mogensen on 4 May 2010 04:23

Andy 'Krazy' Glew <ag-news(a)patten-glew.net> writes:

> Okay, so let's describe my SpMT ideas:
>
> BRIEF:
>
> Log-based SpMT, storing speculative state in a log. Verify/re-execution
> to merge speculative state. Optionally: (1) forget the speculative
> state (prefetch-only), and/or (2) non-log speculative state mechanisms,
> such as a speculative versioning cache and store buffer.
> Any parallel substrate - e.g. multicore with 2 threads per core.
> Inter-processor communication through cache.
> Fork-on-call primarily studied.
> N/M: N executing threads (typically circa 4) out of a larger pool of M potential threads (16, to many more).
>
> I have only ever put fork-on-call (actually, forking the code that
> comes after the call, i.e. after the forked function returns) into any
> of my simulators.
>
> Speculative state is stored in a "log". The simplest possible log
> might be a record of every instruction executed, with its inputs and
> outputs.
>
> E.g. consider user code
>
>   INC r1++
>   INC r1++
>   INC r1++
>   INC r1++
>   r2 := LOAD(r1)
>   STORE(r2,r4)
>   etc.
>
> This might get represented in the log in a stylized manner:
>
>   ASSERT( r1 = 0 )
>   r1 := 1
>   ASSERT( r1 = 1 )
>   r1 := 2
>   ASSERT( r1 = 2 )
>   r1 := 3
>   ASSERT( r1 = 3 )
>   r1 := 4
>   ASSERT( r1 = 4 )
>   ASSERT( LOAD M(4) == 66 )
>   r2 := 66
>   ASSERT( r4 = 55 )
>   STORE( 66, 55 )

When you make a function call, you will normally preserve all live registers through the calling convention, so you can assume these are unchanged except for the register(s) that contain the function result. Hence, I don't see any need to log register values. Since the function result is hard to predict, it would make sense to just block the thread when it needs the function result and, hence, only log memory accesses and register writes. The latter can, as you mentioned, be compressed so that only the last value written to a register is kept in the log.

Obviously, this assumes that you can rely on the calling convention being observed, but unless you are trying to run legacy code that is not an unreasonable assumption.

If you can somehow guarantee that a memory location will not be changed by the function call, you don't need to log reads from this location. If the compiler can guarantee this, it could mark a load instruction as safe (meaning no log is required). The compiler can use static analysis to determine safety or rely on language invariants. For example, in a functional language, most memory reads would automatically be safe.

For a functional language, fork-on-call is very effective and is, indeed, often used when parallelising functional languages. Normally, no log is kept (except the most recent values of registers) and you simply block a thread when it reads or writes a location that is not guaranteed to be safe. Newly allocated memory is, for example, safe to write to.

Torben
From: Andy 'Krazy' Glew on 4 May 2010 10:01

On 5/4/2010 1:23 AM, Torben Ægidius Mogensen wrote:
> Andy 'Krazy' Glew <ag-news(a)patten-glew.net> writes:
>> BRIEF:
>>
>> Log-based SpMT, storing speculative state in a log. Verify/re-execution
>> to merge speculative state. Optionally: (1) forget the speculative
>> state (prefetch-only), and/or (2) non-log speculative state mechanisms,
>> such as a speculative versioning cache and store buffer.
>> Any parallel substrate - e.g. multicore with 2 threads per core.
>> Inter-processor communication through cache.
>> Fork-on-call primarily studied.
>> N/M: N executing threads (typically circa 4) out of a larger pool of M potential threads (16, to many more).

> When you make a function call, you will normally preserve all live
> registers through the calling convention, so you can assume these are
> unchanged except for the register(s) that contain the function result.
> Hence, I don't see any need to log register values.

Haitham did this in DMT. He stored a bitmask indicating which registers had changed across a function. IIRC he started assuming all unchanged, and corrected them.

> Since the function result is hard to predict,

Some function results are easy to predict: e.g. syscall library functions that return -1 on error. malloc returning 0 (null).

By the way, that reminds me: malloc was often the limiter on performance. Needlessly so. Many malloc implementations are serialized, incrementing pointers, etc. Yan Solihin et al have fixed this for semi-explicitly parallel code, http://www.informationweek.com/news/security/app-security/showArticle.jhtml?articleID=224201364. You can largely fix it for implicitly parallel code like SpMT by

> it would make sense to just block the thread
> when it needs the function result

I have done this. It cuts the limit study performance down quite a bit.

Instead I try to skip around the often quite small amounts of code that depend on the function value.

> Obviously, this assumes that you can rely on the calling convention
> being observed, but unless you are trying to run legacy code that is not
> an unreasonable assumption.

Agreed. However, with my Intel hat on I wanted to apply this to legacy code.

I think that it would be quite promising to explore compiler-cooperative SpMT.
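As an aside on "easy to predict" function results: a simple last-value return-value predictor, of the kind studied in the SpMT and value-prediction literature, might look roughly like the sketch below. This is an illustration, not a mechanism from the thread; for malloc- or syscall-style calls the useful prediction is really "not the error value" - enough to resolve the error-check branch - rather than the exact return value.

#include <stdint.h>

#define RVP_ENTRIES 1024

typedef struct {
    uint64_t tag;           /* call-site PC             */
    uint64_t last_value;    /* value returned last time */
    uint8_t  confidence;    /* saturating counter, 0..3 */
} rvp_entry;

static rvp_entry rvp[RVP_ENTRIES];

static rvp_entry *rvp_lookup(uint64_t call_pc)
{
    return &rvp[(call_pc >> 2) % RVP_ENTRIES];
}

/* Returns 1 and fills *pred if the predictor is confident enough for a
 * speculative thread to consume the predicted return value. */
static int rvp_predict(uint64_t call_pc, uint64_t *pred)
{
    rvp_entry *e = rvp_lookup(call_pc);
    if (e->tag == call_pc && e->confidence >= 2) {
        *pred = e->last_value;
        return 1;
    }
    return 0;               /* no confident prediction: block, or skip past uses of the value */
}

/* Called at verify/merge time, once the real return value is known. */
static void rvp_update(uint64_t call_pc, uint64_t actual)
{
    rvp_entry *e = rvp_lookup(call_pc);
    if (e->tag != call_pc) {
        e->tag = call_pc;               /* new (or evicted) call site */
        e->last_value = actual;
        e->confidence = 0;
    } else if (e->last_value == actual) {
        if (e->confidence < 3)
            e->confidence++;
    } else {
        e->last_value = actual;         /* misprediction: retrain */
        e->confidence = 0;
    }
}

int main(void)
{
    uint64_t pred = 0;
    for (int i = 0; i < 4; i++)
        rvp_update(0x400123, 7);        /* the call at (made-up) PC 0x400123 keeps returning 7 */
    return (rvp_predict(0x400123, &pred) && pred == 7) ? 0 : 1;
}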
From: Torben Ægidius Mogensen on 5 May 2010 08:41
Andy 'Krazy' Glew <ag-news(a)patten-glew.net> writes:

> On 5/4/2010 1:23 AM, Torben Ægidius Mogensen wrote:
>> Since the function result is hard to predict,
>
> Some function results are easy to predict: e.g. syscall library
> functions that return -1 on error. malloc returning 0 (null).

True. Error values can be predicted as rare.

> By the way, that reminds me: malloc was often the limiter on
> performance. Needlessly so. Many malloc implementations are
> serialized, incrementing pointers, etc.

Many malloc implementations are needlessly costly, period. If you allocate and deallocate a lot of small objects, GC is nearly always faster. And if you don't, the cost of either GC or malloc/free is unimportant.

>> it would make sense to just block the thread
>> when it needs the function result
>
> I have done this. It cuts the limit study performance down quite a bit.
>
> Instead I try to skip around the often quite small amounts of code that depend on the function value.

It should be possible to statically schedule the code so that code that depends on the return value is pushed as far back as possible. This way, you can keep the hardware simpler. This doesn't solve the error-value problem, though.

>> Obviously, this assumes that you can rely on the calling convention
>> being observed, but unless you are trying to run legacy code that is not
>> an unreasonable assumption.
>
> Agreed. However, with my Intel hat on I wanted to apply this to legacy code.

Quite. Legacy code must be a huge problem for Intel.

> I think that it would be quite promising to explore compiler-cooperative SpMT.

I think so too. In many cases, the compiler will know things that are almost impossible to detect at run-time, such as memory locations that never change value. Even more potential exists in co-designing languages and hardware so the language is designed to preserve invariants/properties that the hardware relies on to parallelise code. But this immediately gets you into the legacy problem: you cannot run old code efficiently, so it limits your customer base.

Torben
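A small made-up example of the static scheduling Torben suggests: both functions below compute the same result, but in the second everything independent of foo()'s return value runs first, so a thread forked at the call has a longer run before it must block on (or verify) the result. The function names and bodies are stand-ins chosen only so the example compiles.

#include <stdio.h>

/* stand-ins so the example compiles and runs */
static int foo(int x) { return (x < 0) ? -1 : 2 * x; }
static void consume(int a, int b, int c) { printf("%d %d %d\n", a, b, c); }

/* Before: the return value of foo() is consumed immediately, so a
 * thread forked at the call stalls (or must predict) right away. */
static void caller_before(int x, const int *buf, int n)
{
    int r = foo(x);
    int checked = (r == -1) ? 0 : r;    /* depends on r at once */
    int sum = 0;
    for (int i = 0; i < n; i++)         /* independent of r */
        sum += buf[i];
    consume(checked, sum, n);
}

/* After: everything independent of r is scheduled first, pushing the
 * first use of the return value as far back as possible. */
static void caller_after(int x, const int *buf, int n)
{
    int r = foo(x);                     /* fork-on-call point */
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i];
    int checked = (r == -1) ? 0 : r;    /* first use of r, now late */
    consume(checked, sum, n);
}

int main(void)
{
    int buf[] = { 1, 2, 3, 4 };
    caller_before(5, buf, 4);
    caller_after(5, buf, 4);
    return 0;
}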