From: nedbrek on 22 Jan 2010 20:18

Hello all,
I am reading through the document at:
http://docs.google.com/fileview?id=0B5qTWL2s3LcQZDIyZDVmN2EtYjY4MC00YjU2LWE4ZGMtYzk2MmU4M2U2NDQ5&hl=en

Some pretty interesting stuff.

I am most surprised by Section 1.9.2 Scheduler (page 29):
"The S1 has 4 16 entry pickers, " and "The S2 scheduler contains 16
partitions, each of 64 entries. Each entry consists of 4 microoperations."

That's 64 uops in L1, and 4K uops in the L2!

====
When we were evaluating HSW, we assumed a ROB of 256 entries, mostly as a
nod to multithreading. Similar machines at the time (P4 and Core 2) used
~128 entry ROBs. Our L2 scheduler was pretty closely coupled to the ROB,
and our L1 was 18 entries (limited by the operand capture file).

I see you discovered the key to HSW (pages 36 and 37), combining the P4 and
P3 register file styles :)

More later!
Ned
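P.S. For anyone checking the arithmetic, a quick back-of-envelope in Python
(the scheduler dimensions are straight from the document; the rest is just
multiplication):

    # Capacities of the strawman two-level scheduler (Section 1.9.2).
    s1_uops = 4 * 16        # 4 pickers x 16 entries              -> 64 uops in S1
    s2_uops = 16 * 64 * 4   # 16 partitions x 64 entries x 4 uops -> 4096 in S2
    print(s1_uops, s2_uops) # 64 4096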
From: "Andy "Krazy" Glew" on 23 Jan 2010 14:27 nedbrek wrote: > I see you discovered the key to HSW (pages 36 and 37), combining the P4 and > P3 register file styles :) What's HSW?
From: "Andy "Krazy" Glew" on 23 Jan 2010 15:59 nedbrek wrote: > Hello all, > I am reading through the document at: > http://docs.google.com/fileview?id=0B5qTWL2s3LcQZDIyZDVmN2EtYjY4MC00YjU2LWE4ZGMtYzk2MmU4M2U2NDQ5&hl=en > > Some pretty interesting stuff. > > I am most surprised by Section 1.9.2 Scheduler (page 29): > "The S1 has 4 16 entry pickers, " and "The S2 scheduler contains 16 > partitions, each of 64 entries. Each entry consists of 4 microoperations." > > That's 64 uops in L1, and 4K uops in the L2! > > ==== > When we were evaluating HSW, we assumed a ROB of 256 entries, mostly as a > nod to multithreading. Similar machines at the time (P4 and Core 2) used > ~128 entry ROBs. Our L2 scheduler was pretty closely coupled to the ROB, > and our L1 was 18 entries (limited by the operand capture file). > > I see you discovered the key to HSW (pages 36 and 37), combining the P4 and > P3 register file styles :) > > More later! > Ned > > Ah, Ned, I see that HSW = Hierarchical Scheduling Windows, as in Hierarchical Scheduling Windows by E Brekelbaum - 2002 http://portal.acm.org/ft_gateway.cfm?id=774865&type=pdf&coll=GUIDE&dl=GUIDE&CFID=72794697&CFTOKEN=44981784 I must admit that I never paid your HSW papers much attention. I was at AMD at the time, so didn't have the opportunity to talk to you directly. I don't like the combination of slow scheduler and fast scheduler feeding the same execution units, seems to me to lead to a mess of wires, violates modularity. And replicating the execution units just leads to a bypass mess. But your point about latency tolerance is well taken. I also had bad experience with the EDF (Explicit Data Forwardng) system you use in your slow scheduler. I have encountered this now in two different companies, and neither time did it pan out. Heck, I might as well say, such a structure was the original P6 microarchitecture proposal, before I came in and showed that some CAMs are bad and some CAMs are not so bad. However, my aversion to this is not so strong as my concerns of the previous paragraph: I have spent my whole career eliminating CAMs, and I think that the EDF direction is the right one, if tweaked appropriately. I say more on this below. I've just seen other people's EDF-like designs crash and burn twice now over 20 years (as well as innumerable less detailed EDF-like proposals, like EDF itself), so I tend to shy away. > I see you discovered the key to HSW (pages 36 and 37), combining the P4 and > P3 register file styles :) I got all excited when I read what you said above, and printed out and read the above HSW paper by you and Chris and Bryan and Jeff. But I am afraid I don't see where you describe this. Is it in another HSW paper, or am I just not reading carefully enough? Perhaps that is the "extensive analysis (not documented in this paper)" you refer to on page 31 in the reference above. However, I am probably looking at the wrong HSW paper, since the one I am looking at ends at page36, and dos not have a page 37. Ah, wait... I see on page 31 you say "register read is performed before instructions are inserted into the fast window making it a data capture device". Cool. Similar lines of thought. Now, I think the key thing that I realized in Multi-star wrt register files has nothing to do with clusters or two level schedulers. It is that you can combine the P6 and P4 ROB/PRF/register file/operand capture approaches, which I call "read RF before schedule" and "read RF after schedule", even with a single level of scheduler. 
You can have a big PRF (typically renamed into from the logical register file), a scheduler (your favorite scheduler - P6-style RS, a bit matrix, whatever), and a smaller operand structure. This smaller structure, OC, can be arranged as a P6-style CAM, or an RF. Or a hybrid.

I always thought of the P6-style RS CAMs as controlling wordlines effecting capture of the data - the scheduler directly controlling those wordlines. Put another way, scheduler entries either contained or were in 1:1 direct correspondence with operand capture entries (actually, 1:2 correspondence, since a P6 uop always had 2 src operands).

A level of indirection here is not necessarily a bad thing. Although it probably costs an extra cycle or so coming out of the scheduler and going to the execution units. (For a long time I was thinking unclearly, assuming that the advantage of capture structures was their lower latency loop. That is the advantage, but the source of it is mainly reduced #ports * #bits_per_entry * #entries, not necessarily the elimination of the level of indirection - or, rather, avoiding the introduction of the level of indirection we are about to describe.)

I.e. instead of the scheduler saying "I am scheduling uop #i" and the operand capture structure looking up entry #i, i.e. where #i is used to index both scheduler and OC, we might have the scheduler say "I am executing uop #i", and then we might look up uop #i in the uop table, and find out that it needs none or one or more than one of the operands #p1, #p2, #p3... and then we would look up these operands in the operand capture structure. This allows us to do things like a) sharing operands between uops in the scheduler, and b) using 0 or 1 or 2 or 3 operands, using just what is needed, instead of always using a fixed number of 2 or 3 - although it does make the scheduler a little bit harder, since it must schedule RF ports.

This level of indirection basically makes the operand capture structure more like an RF. Hence I sometimes say that you have an RF2 before the scheduler, and an RF1 after the scheduler, even though it is a single-level instruction window machine.

A related key point is the realization that the ports you use to access the operand capture array after coming out of the scheduler
a) do not need to be 1:1 with the scheduler instruction entries
b) do not need to have the same indexing principle.

So, you can have three ports on the operand capture OC/RF1:
1) the port you write data into initially - RAM indexed
2) the port you use to obtain operands after scheduling:
   a) could be RAM indexed
   b) could be 1:1 implicit, direct
3) the port you use to write back results from the execution units:
   a) could be CAM
   b) could be RAM

P6 was 1) 2b) 3a). I only realized (back when I was working on Multistar) that 1) 2a) 3a) was a good possibility - which I made the Multistar strawman. While 1) 2a) 3b) is a good possibility, given RF2/RF1 mapping. Finally, 1) 2b) 3b) is, I suppose, a possibility - but I really don't know what it would be good for.

CAMs are convenient. If you don't use a CAM on the writeback, you need to have a mapping table to remap RF2/PRF numbers into RF1/OC operand entry numbers. It's easier than the logical-to-PRF remapping, since it doesn't need as much management, but everything you add is a pain. The CAM just wraps that up into one place. This seems to be a general principle: you can often use CAMs on pipestages, or implement a mapping table.
The mapping table typically adds an extra pipestage, and often has management complexity such as checkpoint recovery (although not always). So, it is just a tradeoff - an extra pipestage + some complexity versus CAMs.

---

Hey, this is fun. Haven't thought about this in years in such detail.

---

Getting back to the scheduler: while I think that a non-CAM approach, such as you say EDF is like (I'm not sure if I have read the EDF papers; I have encountered what sound like similar approaches), *should* be a good idea, my experience when I have seen these worked out in detail was not so good. Multistar has a placeholder for such a scheduler.

But another of the new ideas in Multistar was that it is not so bad to build a CAM for the big second-level scheduler, S2, because you don't need to CAM all inputs. E.g. if the second-level scheduler contains entries of 4 or 8 or 16 instructions, you don't need to CAM all possible inputs. You don't even need to CAM all live-ins. You only need to CAM one input reasonably close to the last of the inputs to arrive. You can use a scoreboard-like timing estimator at the RAT to choose the input to CAM on.

Thus, in the strawman you quote above

> "The S2 scheduler contains 16 partitions, each of 64 entries. Each entry consists of 4 microoperations."

if each S2 entry has 4 uops, and is only CAMming 1 input, that's "only" 1K CAMs for 4K uops. Not the 8K CAMs that would have been required if you had 2 CAMs per uop, or the 12K if you had 3 CAMs per uop to handle FMAC.

More and more my thinking has been leaning towards S2 scheduler entries of 8 or 16 uops - or possibly threaded S2 scheduler entries, say 4-uop chunks, chained together into, say, 12-uop groups. If you can do that, the S2 scheduler might have "only" 512 or 256 CAMs - comparable to existing schedulers, but holding 16 times more uops. (The arithmetic is spelled out in a sketch at the end of this post.)

By the way, this input reduction technique - performing imprecise scheduling on the S2 uop group entries - also simplifies the EDF-like hardware that eliminates CAMs. Not sure if it is enough simplification to make EDF worthwhile - the problem with EDF is handling the overflow cases, i.e. it is design complexity, not size.

Now, the traditional problems with having scheduler entries hold uop groups are
a) what happens if they have too many inputs for the hardware to handle
b) what happens if there are inter-group dependencies
c) fragmentation - it's damned hard to get 16 dependence-related instructions in a group.

Again, that's a key realization in Multi-star: it doesn't matter!!!!!!

If you take the approach that all of the instruction groups coming out of the S2 are going to the S1, where they are scheduled as single uops (or, at least, smaller instruction groups, more tightly coupled), then the S1 handles all the work, scheduling instructions when they are ready.

a) Too many inputs in S2 (which, as we have said above, may have only 1 or 2 inputs per 8-instruction group)? Well, let the S1 handle all the inputs.

b) Inter-group dependencies? Again, the S1 handles them. All we need to do is ensure that we don't get a deadlock - that we don't get into a situation where an instruction in the S1 is waiting for an instruction from the S2, which can't get into the S1 because the S1 is draining. There are many techniques to handle this.

c) Fragmentation? Doesn't matter. I.e. it doesn't matter if we put unrelated instructions into the same group (as long as we don't get the above-mentioned deadlocks).
This is another reason why I don't like the HSW approach of sending supposedly latency-insensitive instructions straight from the S2 to the EUs: you can't slough off this work to the S1.

However, I think that I did note in the Multistar writeup that you could execute straight from the S2, even though not necessarily all inputs are ready. E.g. by using Wmt-style replay, detecting non-readiness, and sending the non-ready instructions back to either the S1 or S2. (Actually, I guess that is not Wmt-style replay - they did not replay through the scheduler. Silly people.)

---

Hey, this was fun. Those were the days.
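P.S. As promised, the CAM back-of-envelope, in Python (the only inputs are
the strawman numbers quoted above; "1 CAM per entry" is the imprecise-wakeup
assumption):

    # CAM counts for the 4K-uop S2 strawman under different designs.
    S2_UOPS = 16 * 64 * 4                  # 4096 uops total

    print(S2_UOPS * 2)                     # 8192:  2 CAMs per uop
    print(S2_UOPS * 3)                     # 12288: 3 CAMs per uop (FMAC)

    for uops_per_entry in (4, 8, 16):      # 1 CAMmed input per entry
        print(uops_per_entry, S2_UOPS // uops_per_entry)
    # -> 1024, 512, 256 CAMs: the "1K" and "512 or 256" figures above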
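P.P.S. And a toy Python sketch of the S2 -> S1 flow (my own toy model, not
the Multistar writeup; the group size, the single wake tag per group, and
the S1 capacity check are the assumptions named above):

    # S2 wakes whole uop GROUPS, imprecisely, on one CAMmed tag; S1 then
    # schedules individual uops precisely.  Extra inputs, inter-group
    # dependencies, and fragmentation are all S1's problem - by design.

    class Uop:
        def __init__(self, name, srcs):
            self.name = name          # result tag this uop produces
            self.srcs = set(srcs)     # tags it consumes

    class S2Group:
        def __init__(self, uops, wake_tag):
            self.uops = uops
            self.wake_tag = wake_tag  # one input, picked at rename by a
                                      # scoreboard-like timing estimate

    S1_CAPACITY = 64
    ready = set()                     # tags whose values exist

    def step(s2, s1, issue_width=4):
        # S2: imprecise wakeup - move a group when its ONE tag is ready.
        for grp in list(s2):
            if grp.wake_tag is None or grp.wake_tag in ready:
                # Don't overflow S1; a real design needs a policy here
                # to avoid the S1-drain deadlock discussed above.
                if len(s1) + len(grp.uops) <= S1_CAPACITY:
                    s2.remove(grp)
                    s1.extend(grp.uops)
        # S1: precise per-uop scheduling of whatever is truly ready.
        issued = [u for u in s1 if u.srcs <= ready][:issue_width]
        for u in issued:
            s1.remove(u)
            ready.add(u.name)
        return issued

Note the S2 wakeup is allowed to be wrong in only one direction: a group may
reach S1 before all its inputs are ready (S1 just holds it), but a woken
group is never lost.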
From: nedbrek on 23 Jan 2010 18:14

"Andy "Krazy" Glew" <ag-news(a)patten-glew.net> wrote in message
news:4B5B6334.4080201(a)patten-glew.net...
> nedbrek wrote:
>
> I don't like the combination of a slow scheduler and a fast scheduler feeding
> the same execution units; it seems to me to lead to a mess of wires, and it
> violates modularity. And replicating the execution units just leads to a bypass
> mess. But your point about latency tolerance is well taken.

Two clusters worked for the paper; no one complained about it. When we were looking into actually going to product, axing the slow cluster was one of the first ideas we looked at... we could delay the results from slow to fast, but it still meant wires somewhere.

> I also had bad experience with the EDF (Explicit Data Forwarding) system
> you use in your slow scheduler.

EDF was another "good for a paper" thing. We actually had some reviewers say how much they liked integrating previous work. The circuit guys we talked to seemed sure that slow CAMs aren't too bad (i.e. the problems with CAMs mostly apply only to fast CAMs).

> > I see you discovered the key to HSW (pages 36 and 37), combining the P4
> > and
> > P3 register file styles :)
>
> I got all excited when I read what you said above
>
> Ah, wait... I see on page 31 you say "register read is performed before
> instructions are inserted into the fast window making it a data capture
> device".

I was thinking of the ppt from the Micro presentation; it is made explicit there (slide 5 is P3 vs. P4, slide 7 is "Stop! You're both right!").

> A level of indirection here is not necessarily a bad thing. Although it
> probably costs an extra cycle or so coming out of the scheduler and going
> to the execution units. (For a long time I was thinking unclearly,
> assuming that the advantage of capture structures was their lower latency
> loop. That is the advantage, but the source of it is mainly reduced
> #ports * #bits_per_entry * #entries, not necessarily the elimination of the
> level of indirection - or, rather, avoiding the introduction of the level
> of indirection we are about to describe.)

Yes, P3 scheduling is all about fewer read ports (1 wide, vs 2 split per uop). Half the ports makes a smaller structure, which gives you the lower latency loop :)

> More and more my thinking has been leaning towards S2 scheduler entries of
> 8 or 16 uops - or possibly threaded S2 scheduler entries, say 4-uop
> chunks, chained together into, say, 12-uop groups. If you can do that, the
> S2 scheduler might have "only" 512 or 256 CAMs - comparable to existing
> schedulers, but holding 16 times more uops.

That's where my mind starts to boggle. I would need to see branch predictor and serialization data showing a window this big would deliver significant performance gains. We were looking at Itanium runs of Spec2k built by Electron (i.e. super optimized). We were assuming a very heavyweight implementation (few serializing conditions). We were unable to scale this far.

> By the way, this input reduction technique - performing imprecise
> scheduling on the S2 uop group entries - also simplifies the EDF-like
> hardware that eliminates CAMs. Not sure if it is enough simplification to
> make EDF worthwhile - the problem with EDF is handling the overflow cases,
> i.e. it is design complexity, not size.

One of the nifty things we showed about EDF in the face of a two-level scheduler: you can drop the updates on overflow. The mover will eventually shift the stuff down to L1. We filed a patent on that; it's out there somewhere.
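In toy form, the trick is just this (a Python reconstruction of the idea as
stated, not the mechanism from the patent; the table shape and sizes are
made up):

    # Explicit data forwarding: each producer keeps a short, fixed list of
    # consumers to wake - no CAM broadcast.  On overflow, drop the update:
    # the mover will migrate the uop down to L1, which schedules it by
    # normal means, so correctness never depends on the table.
    WAITERS_PER_TAG = 2                 # assumed tiny, hence the overflow

    forward_table = {}                  # producer tag -> waiting uop ids

    def subscribe(tag, uop_id):
        waiters = forward_table.setdefault(tag, [])
        if len(waiters) < WAITERS_PER_TAG:
            waiters.append(uop_id)      # will get an explicit wakeup
        # else: dropped - uop_id reaches L1 via the mover instead

    def complete(tag):
        # Producer finished: wake exactly the recorded consumers.
        return forward_table.pop(tag, [])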
> Now, the traditional problems with having scheduler entries hold uop
> groups are
> a) what happens if they have too many inputs for the hardware to handle
> b) what happens if there are inter-group dependencies
> c) fragmentation - it's damned hard to get 16 dependence-related
> instructions in a group.
>
> Again, that's a key realization in Multi-star: it doesn't matter!!!!!!
>
> If you take the approach that all of the instruction groups coming out of
> the S2 are going to the S1, where they are scheduled as single uops (or,
> at least, smaller instruction groups, more tightly coupled), then the S1
> handles all the work, scheduling instructions when they are ready.

Yea, that's what the "mover" is all about. It makes it possible for unschedulable stuff to make it to L1.

> However, I think that I did note in the Multistar writeup that you could
> execute straight from the S2, even though not necessarily all inputs are
> ready. E.g. by using Wmt-style replay, detecting non-readiness, and sending
> the non-ready instructions back to either the S1 or S2. (Actually, I guess
> that is not Wmt-style replay - they did not replay through the scheduler.
> Silly people.)

That's probably our biggest area of disagreement.

We used to sit next to the Tejas guys. We heard so many horror stories about replay; we were dead set against it. HSW was very much oriented towards making it possible to have little or no replay (the L1 scheduler was "snatch-back", while the L2 would eat the full L1 latency on dependent ops).

I actually had a slide in the Micro presentation showing relative replay rates. It didn't go well for the P4 :)

> Hey, this was fun.
>
> Those were the days.

Yea, it's been over three years since I left Intel. My new boss tries to humor the microarchitect in me, but it's just not relevant in my new job.

Enjoy,
Ned
From: "Andy "Krazy" Glew" on 24 Jan 2010 14:03
nedbrek wrote:
>> Me (Andy Glew)
>> More and more my thinking has been leaning towards S2 scheduler entries of
>> 8 or 16 uops - or possibly threaded S2 scheduler entries, say 4-uop
>> chunks, chained together into, say, 12-uop groups. If you can do that, the
>> S2 scheduler might have "only" 512 or 256 CAMs - comparable to existing
>> schedulers, but holding 16 times more uops.
>
> That's where my mind starts to boggle. I would need to see branch predictor
> and serialization data showing a window this big would deliver significant
> performance gains. We were looking at Itanium runs of Spec2k built by
> Electron (i.e. super optimized). We were assuming a very heavyweight
> implementation (few serializing conditions). We were unable to scale this far.

That's almost exactly what Jim Smith told me at Wisconsin - except then we were talking about 256-entry instruction windows.

Although I will freely admit that even then I was thinking in terms of instruction windows in the thousands and larger. It has just taken years to invent the pieces, the mechanisms, that make it buildable.

In terms of academic research, look at... well, I wanted to refer to the kilo-instruction processor papers, but I don't have access to the ACM Digital Library. But, anyway, if memory serves they show some benefit for large instruction windows, which they justify by mechanisms to reduce window cost. I just go further.

As for stuff that you can do on your own, here are some very simple limit studies (a sketch of study (a) appears at the end of this post):

a) First, do the perfect branch prediction study with unlimited instruction fetch bandwidth, and see how performance changes as window size changes.

b) Then, go and have realistic branch prediction. But still keep ifetch bandwidth infinite. And see how many instructions can be reached without a branch misprediction. Take advantage of control independence - e.g. you don't need to cross past branch mispredictions to fetch instructions after a call returns - you are guaranteed to be able to execute those. Track how many of these are dependent on earlier instructions, and how many are not.

c) Now go and add in finite ifetch. But on potentially many threads - the non-speculative threads, and speculative threads (if that is what you want to call them), e.g. at control independence points.

d) Somewhere along the way, take into account branch mispredictions that are dependent on cache misses versus those that are not.

You get the picture. Do the work: you will see that there are lots of potential instructions out there. Even without SpMT/SkMT; but with SpMT/SkMT there are even more.

Having demonstrated that there is potential in keeping a large number of instructions in flight, you then have to figure out how to build the hardware cheaply enough. That's what stuff like Multistar is about.

Note that when you have an S2 scheduler such as I am talking about, you are not necessarily keeping all instructions around, nor allocating PRF entries for their results. A scheduler entry might be just an IP and a tag used to trigger its dispatch.

>> However, I think that I did note in the Multistar writeup that you could
>> execute straight from the S2, even though not necessarily all inputs are
>> ready. E.g. by using Wmt-style replay, detecting non-readiness, and sending
>> the non-ready instructions back to either the S1 or S2. (Actually, I guess
>> that is not Wmt-style replay - they did not replay through the scheduler.
>> Silly people.)
>
> That's probably our biggest area of disagreement.
>
> We used to sit next to the Tejas guys.
> We heard so many horror stories about replay; we were dead set against it.
> HSW was very much oriented towards making it possible to have little or no
> replay (the L1 scheduler was "snatch-back", while the L2 would eat the full
> L1 latency on dependent ops).
>
> I actually had a slide in the Micro presentation showing relative replay
> rates. It didn't go well for the P4 :)
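As promised, here is a minimal Python sketch of limit study (a). The trace
format is hypothetical - (dest, srcs) register pairs from whatever simulator
you have - and a real study would model memory and per-uop latencies too:

    # Limit study (a): perfect branch prediction, infinite fetch, ideal
    # execution.  IPC is limited only by true dependences and window size:
    # instruction i cannot start until instruction i-window has retired.
    def dataflow_ipc(trace, window, latency=1):
        ready = {}                  # reg -> cycle its value is available
        finish = []                 # finish[i] = completion cycle of instr i
        for i, (dest, srcs) in enumerate(trace):
            start = max([ready.get(s, 0) for s in srcs] + [0])
            if i >= window:         # window slot frees on retirement
                start = max(start, finish[i - window])
            finish.append(start + latency)
            ready[dest] = finish[-1]
        return len(trace) / max(finish)

    # Sweep the window and watch where the curve flattens:
    # for w in (64, 256, 1024, 4096):
    #     print(w, dataflow_ipc(trace, w))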