From: nedbrek on 1 Jul 2010 22:40

Hello all,

"Andy 'Krazy' Glew" <ag-news(a)patten-glew.net> wrote in message
news:4C2B6F29.3060908(a)patten-glew.net...
> Second, RUU is microarchitecture. A uarch feature that I never considered
> implementable myself. By the way, I can't remember timestamps in it - I
> suspect that you may be thinking of the SimpleScalar simulator
> implementation, which may have had timestamps.
>
> Whereas what I was talking about is simulator. Not microarchitecture.

Definitely talking simulator. I've never seen a RUU proposed for hardware.

> Look: there are queues, or at least latches, between your pipestages no
> matter what. All I am suggesting is that you configure the SIMULATOR
> queues so that you can do experiments such as saying "what if I combine
> instruction decode and in-order schedule into a single pipestage."
> Whereas, if you have a reverse pipeline with a cycle per loop iteration,
> you cannot even run that experiment. Glew observation: oftentimes people
> say "That's not practical", when actually something is practical in
> hardware, just not in their simulator infrastructure.

In my experience, stages from IPGEN to SCHED cost 1% per stage. Stages from
READY to READY (ignoring 1 for EXE) are 5-10%. Stages in the backend are 1%
or less.

These sorts of trends are some of the first runs you do. If your model can
only sweep for 10+ pipestages (due to the stage configuration), is 3 or 5
really going to be significantly different?

> By the way, let me roughly describe what such a queue looks like:
> * hardware granularity and alignment
>     - i.e. does hardware think of it as a fixed alignment queue, blocks
>       of 4 always moving, or does hardware think of it as a misaligned
>       queue, where HW can read 4 entries at any alignment
>     - By the way - this should be parameterizable, since the decision to
>       use an aligned or a misaligned queue is one of the basic studies
>       you will always do.
> * hardware size (minimum)
> * cycles to get across - again, good to parameterize IN THE SIMULATOR so
>   that you can easily simulate different chip layouts, with different
>   delays for the wires

Definitely. The pipestage code we had in IPFsim had width & depth knobs,
plus a "serpentine" knob (serpentine pipes flowed freely, non-serpentine
could only advance by the full width).

> ===
> However, I also believe that there is a place for cycle accurate
> simulators that are - well, maybe not less detailed than SimpleScalar,
> but more agile. Simulators that you can run experiments such as saying
> "What if the latency through this pipestage was 0". Where you can
> quickly dial in multiple cycles of delay.
>
> In the detailed simulators, you might not allow yourself to use the
> timestamped queues that I talk about above. Whereas in the more agile
> simulator you might use such timestamped queues to give yourself the
> agility.
>
> I also believe that there is a place for simulators that are not cycle
> accurate. Like DFA.

Agility is mostly a measure of what assumptions went into the original
code ("I'm writing a P6 model for Itanium"). As long as you don't stray
too far from that (adding a new branch predictor), you can add a lot of
features and details. After tacking on a lot of stuff, it gets harder to
change. Some stuff never really fits right (P4).

No high level simulator is ever going to (easily) be cycle accurate with
hardware. Leave that to RTL. What you want is something that trends
correctly ("these uarch changes are worth 20% over the baseline").
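To make the timestamped queue idea concrete, here is a rough C++ sketch of
the kind of width/depth/latency/alignment-parameterized queue Andy
describes above. All names are invented for illustration; this is a sketch
of the idea, not code from IPFsim or any real model.

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <optional>

    template <typename Uop>
    class PipeQueue {
    public:
        PipeQueue(int width, int depth, int latency, bool aligned)
            : width_(width), depth_(depth), latency_(latency),
              aligned_(aligned) {}

        // Entries carry a simulator-only timestamp instead of being
        // marched through per-stage latches; dialing latency_ to 0 is
        // the "what if this pipestage were free" experiment.
        bool push(const Uop& u, std::uint64_t now) {
            if (static_cast<int>(q_.size()) >= depth_)
                return false;  // structural stall: queue is full
            q_.push_back({u, now + static_cast<std::uint64_t>(latency_)});
            return true;
        }

        // Caller pops at most width_ times per cycle. In aligned
        // ("non-serpentine") mode, the whole head block of width_
        // entries must be ready before any of them may advance.
        std::optional<Uop> pop(std::uint64_t now) {
            if (q_.empty() || q_.front().ready_at > now)
                return std::nullopt;
            if (aligned_) {
                int blk = std::min(width_, static_cast<int>(q_.size()));
                for (int i = 0; i < blk; ++i)
                    if (q_[i].ready_at > now) return std::nullopt;
            }
            Uop u = q_.front().payload;
            q_.pop_front();
            return u;
        }

    private:
        struct Entry {
            Uop payload;
            std::uint64_t ready_at;  // simulator artifact, not hardware state
        };
        std::deque<Entry> q_;
        int width_, depth_, latency_;
        bool aligned_;
    };

The aligned flag is the serpentine/non-serpentine (alignment, blocking
factor) distinction, and every knob is a one-line experiment.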
Also, if you can't model everything, at least understand where the model
breaks down. I had a simulator show huge speedup from a prefetching idea.
Turns out, the model's page table walker was effectively pipelined. Once
you factored that out, the idea was useless.

> ===
> By the way: you should, of course, try to make it impossible, difficult,
> or at least obvious to access simulator artifact data from true hardware
> simulation. C++ encapsulation. Much harder to do in C; or, rather, much
> easier to violate encapsulation by accident.

Definitely. Like I said, I would like my next model to be in D. Pin uses
C++ linkage, so my DFA stuff will need to be C++, but I probably won't
write a lot there.

Ned
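A minimal illustration of the C++ encapsulation point above (hypothetical
names; funneling artifact access through one small, greppable class is one
way to make it impossible-or-obvious):

    #include <cstdint>

    class Uop {
    public:
        // Architectural/microarchitectural state: freely visible to
        // "hardware" model code.
        std::uint64_t pc() const { return pc_; }

    private:
        std::uint64_t pc_ = 0;
        std::uint64_t sim_ready_at_ = 0;  // simulator artifact: timing bookkeeping

        friend class SimTiming;  // the ONLY sanctioned way in
    };

    // Any model code that touches timestamps must name SimTiming, so a
    // review (or a grep) catches simulator artifacts leaking into the
    // simulated hardware.
    class SimTiming {
    public:
        static std::uint64_t ready_at(const Uop& u) { return u.sim_ready_at_; }
    };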
From: Andy 'Krazy' Glew on 1 Jul 2010 22:39

On 7/1/2010 7:40 PM, nedbrek wrote:
> In my experience, stages from IPGEN to SCHED cost 1% per stage. Stages
> from READY to READY (ignoring 1 for EXE) are 5-10%. Stages in the
> backend are 1% or less.
>
> These sorts of trends are some of the first runs you do. If your model
> can only sweep for 10+ pipestages (due to the stage configuration), is 3
> or 5 really going to be significantly different?

I'm old enough to remember when it was 5% for frontend pipestages, and 20%
for execution loop pipestages. I'm sure that the reduced importance of
adding pipestages is due to (a) better predictors, (b) relatively slower
memory, and (c) the fact that adding 1 pipestage on top of 5 is a big
deal, but on top of 10 is not so big.

We may well be stuck in the double digits for pipestages. But I wonder if
the pendulum may not want to swing the other way:

1) because of device variation. You get better yield (and performance, in
terms of average latency per transistor or gate) if you have 20 gates per
pipestage rather than Cray-like 8, and even better with 40.

2) seeking to minimize overheads such as setup and skew allowances, which
helps both power and perf.

3) if you start using asynchronous design styles (in my current analysis,
asynchronous design styles for bandwidth may have fewer gates per "cycle"
(or whatever the equivalent term is for asynch); whereas if you are
designing for minimum latency of certain critical computations, asynch
wants fat pipestages).

4) and because I can see an asymptote where it is better to have less
pipelined logic go idle, than it is to have more pipelined logic get
blocked with stuff in the pipeline that must be maintained.

> Definitely. The pipestage code we had in IPFsim had width & depth knobs,
> plus a "serpentine" knob (serpentine pipes flowed freely, non-serpentine
> could only advance by the full width).

Cool. I earlier described such "Alignment issues for queues, buffers, and
pipelines" - i.e. I used the term "alignment" or "blocking factor" for
what you describe as "non-serpentine". I can even see where the term comes
from.

Adding this to
https://semipublic.comp-arch.net/wiki/Alignment_issues_for_queues,_buffers,_and_pipelines
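Point (c) can be illustrated with a deliberately crude toy model: if the
only cost of front-end depth were branch-mispredict refill, each extra
stage costs on the order of 1%, and a little less as the pipe deepens. The
workload numbers here (5 mispredicts per 1000 instructions, base CPI of
0.5) are assumptions for illustration only:

    #include <cstdio>

    int main() {
        const double mpki = 5.0;      // assumed mispredicts per 1000 instructions
        const double base_cpi = 0.5;  // assumed CPI with a zero-depth front end
        double prev = 0.0;
        for (int depth = 1; depth <= 20; ++depth) {
            // Each mispredict costs a refill equal to front-end depth.
            double ipc = 1.0 / (base_cpi + (mpki / 1000.0) * depth);
            if (depth > 1)
                std::printf("stage %2d costs %.2f%%\n",
                            depth, 100.0 * (prev - ipc) / prev);
            prev = ipc;
        }
    }

With these inputs the marginal stage costs about 1.0% early on, drifting
below 0.9% past depth 10 - the same ballpark as nedbrek's figures.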
From: Andy 'Krazy' Glew on 1 Jul 2010 22:45

On 7/1/2010 5:42 AM, nedbrek wrote:
> Sure, we wanted an execute-at-execute model. That is what we were
> driving for.
>
> But, you have to cut us some slack! There were two of us (plus an intern
> for the summer). We came in with a blank slate for modelling
> out-of-order. We were basically trying to reproduce P6, for Itanium.

Slack given. Moreover, I don't know the history; or, rather, I know only
the early history, before you were at Intel. I don't think we overlapped
much.

> We used the same mechanism as P6, loads wait for oldest STA.

Fair enough for you, but I feel obliged to mention for the record that the
P6 simulators circa 1991 were not that limited. We chose to implement only
"loads don't pass stores whose address is unknown". But we evaluated other
policies. We knew the speedups with an oracle store-to-load dependency
predictor, perfect; and also with random prediction accuracies.

I know that we had proposed various STLF predictors, such as history
based. I suspect those were in branches off the main version control
trunk, if they were implemented in the simulator.

(By the way, although randomized predictors with a dialable accuracy are
easy to do, and provide some insight, they are misleading. Real predictors
are not uniform random; and if you knew the real predictor stats...)
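The "dialable accuracy" trick is easy to sketch: consult the oracle, then
corrupt its answer uniformly at random - which is exactly why it misleads,
since real predictors' errors cluster rather than being uniform. A
hypothetical C++ sketch, with invented names:

    #include <random>

    class DialedPredictor {
    public:
        explicit DialedPredictor(double accuracy, unsigned seed = 1)
            : acc_(accuracy), rng_(seed) {}

        // 'oracle' is whatever perfect answer the simulator can compute,
        // e.g. "this load truly depends on an older store". With
        // probability acc_ we return it; otherwise we flip it.
        bool predict(bool oracle) {
            std::bernoulli_distribution correct(acc_);
            return correct(rng_) ? oracle : !oracle;
        }

    private:
        double acc_;
        std::mt19937 rng_;
    };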
From: MitchAlsup on 2 Jul 2010 09:14

On Jul 1, 9:39 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> On 7/1/2010 7:40 PM, nedbrek wrote:
> > In my experience, stages from IPGEN to SCHED cost 1% per stage. Stages
> > from READY to READY (ignoring 1 for EXE) are 5-10%. Stages in the
> > backend are 1% or less.
> >
> > These sorts of trends are some of the first runs you do. If your model
> > can only sweep for 10+ pipestages (due to the stage configuration), is
> > 3 or 5 really going to be significantly different?
>
> I'm old enough to remember when it was 5% for frontend pipestages, and
> 20% for execution loop pipestages. I'm sure that the reduced importance
> of adding pipestages is due to (a) better predictors, (b) relatively
> slower memory, and (c) the fact that adding 1 pipestage on top of 5 is a
> big deal, but on top of 10 is not so big.

For great big OoO machines, 1% per front end pipestage is pretty standard.
We saw 9%-12%-ish for not being able to do back to back integer
instructions, and a 33%-50% increase in frequency by <basically> doubling
the number of pipe stages.

> We may well be stuck in the double digits for pipestages. But I wonder
> if the pendulum may not want to swing the other way

The pendulum definitely wants to swing that direction. But pure market
momentum, and a bit of FUD, are slowing the release of said pendulum.

> 1) because of device variation. You get better yield (and performance,
> in terms of average latency per transistor or gate) if you have 20 gates
> per pipestage rather than Cray-like 8, and even better with 40.

Opteron is at 16 logic gates per pipe stage, or 20-21 gates if you include
flop, jitter, and skew.

CDC 6600 was 15 gates including the clear-set flop (Thornton)
CDC 7600 was 12 gates including the flop
Cray 1 was 10 gates including the latch (not a flop)
Cray 2 was 5 gates including the latch

The Cray 1 was slowed so as to avoid the noise in the FM radio band (80
MHz); the Y-MP hopped to the other end of the FM band (105 MHz).

Based on projects I have done in the past, going from 16 gates per pipe
stage to 20 gates per pipe stage results in a 20% improvement in
architectural figure of merit. That is, the frequency loss/gain is a
complete wash. Since power has reared its ugly head, doing more per cycle
and having fewer cycles will be a win.

Not only does the pipeline have fewer stages at 20 logic gates per cycle,
one can bang on the SRAMs and register ports twice per cycle and make
other activities of instruction processing more efficient. 16 gates per
cycle is about where designers want the architects to quit using the SRAMs
twice per cycle, but by 20 gates per cycle, nobody really cares if you use
the SRAMs twice per cycle. Thus, one gains cache bandwidth by slowing down
just a bit, and this makes the pipeline shorter, especially in the stages
nobody sees (post retire).

One can design/build a 6-7 pipestage x86 that cycles as fast as an Opteron
(given access to a FAB with the same transistors and metal). This will end
up being a 1-wide monoScalar machine--think 486 with the modern
instruction set extensions and floating point latencies and cache
hierarchy. My simulations show this minuscule machine can get roughly 50%
the performance of an Opteron for 10% of the die area and less than 5% of
the power.

> 2) seeking to minimize overheads such as setup and skew allowances,
> which helps both power and perf

The biggest lever left in power is speculation. That is: only do
activities for those instructions that will retire or have very high
probabilities of retiring.

Mitch
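Reading Mitch's numbers back: treating flop + jitter + skew as roughly 5
gate delays (an assumption inferred from his "16 logic, 20-21 total"),
moving from 16 to 20 logic gates per stage costs about 16% in clock but
buys 25% more logic per cycle - roughly a wash before counting the shorter
pipe and the double-pumped SRAMs:

    #include <cstdio>

    int main() {
        const double overhead = 5.0;           // flop + jitter + skew, in gate delays (assumed)
        double f16 = 1.0 / (16.0 + overhead);  // relative clock at 16 logic gates/stage
        double f20 = 1.0 / (20.0 + overhead);  // relative clock at 20 logic gates/stage
        std::printf("clock ratio: %.2f  logic per cycle: %.2f  product: %.2f\n",
                    f20 / f16, 20.0 / 16.0, (f20 / f16) * (20.0 / 16.0));
        // Prints: clock ratio 0.84, logic per cycle 1.25, product 1.05.
    }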
From: MitchAlsup on 2 Jul 2010 09:19
On Jul 1, 9:45 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> (By the way, although randomized predictors with a dialable accuracy are
> easy to do, and provide some insight, they are misleading. Real
> predictors are not uniform random; and if you knew the real predictor
> stats...)

Which is why I have never been a fan of semi-accurate simulations. High
level architectural models have their place, but what I want is a low
level architectural model that contains (basically) everything but the
scan path!

I want the architects to build the control machine as a data path and run
it through a trillion simulation cycles (without failing). This control
machine being cycle accurate to the Verilog model.

Mitch