From: Andy 'Krazy' Glew on 31 May 2010 11:26

On 5/30/2010 1:36 PM, Robert Myers wrote:
> On May 28, 2:08 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>
>> Memory carried dependencies do not have such a convenient upper limit. Potentially every store instruction in an
>> earlier batch/cluster might need to be forwarded to a load instruction in a later batch/cluster. ... But, we have to
>> do this anyway. Most practical proposals for building large instruction windows have large store-to-load conflict
>> detection windows. Only in the best of times can you elide many such memory dependencies.
>
> I am reminded of the famous cartoon with "and then a miracle happens"
> in the middle of a complex proof.
>
> One advantage that I can think of for your proposal is that it imposes
> a natural order on memory references. The usual parallelism problem
> of who got there first has a definite answer imposed by the ordering
> of a serial instruction stream.
>
> You can't retire any write instructions until all ambiguous pending
> read instructions in preceding instruction blocks are complete. You
> can't count on any read instructions until all ambiguous pending write
> instructions in preceding instruction blocks are complete. You can
> ignore the latter constraint and fix up, if necessary, but the thought
> of doing it makes my head hurt. It's the same as hoisting a load
> above a store. How often does that work?
>
> I'll save you the trouble. "Aaargh!" But maybe others are as puzzled
> about the details as I am.
>
> Robert.

But that's exactly how all modern OOO machines with store-to-load forwarding buffers work.

See, e.g., store disambiguation, http://download.intel.com/technology/architecture/sma.pdf

It got into product fairly late, 2006 in the above publication. But it is much older, mature (in the R&D sense) technology. E.g. it was one of the issues in the University of Wisconsin / Intel settlement, http://www.technologyinnovative.com/industry/research/intel-settles-wisconsin-universitys-patent-claim-over-chips.html. The technique was discussed by us in the P6 development, and earlier by me at UWisc.

When I am talking about batching, or any large instruction window, the only thing that is new is scaling up existing mechanisms for store-to-load forwarding. Back in 2002 I thought this might be a challenge. But in 2003-2005 there was a flurry of papers on how to do this effectively.

If this causes you to say "Aargh", Robert, then, umm, you really aren't qualified to be discussing advanced computer architecture. This is the easy stuff.
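For readers following along, here is a toy software sketch of the store-to-load forwarding search being discussed (illustrative names and structure are mine; real hardware uses CAMs, not Python lists): a load searches the older stores youngest-first, forwards from the first address match, and, under the conservative policy of blocking on unresolved store addresses, stalls if it reaches an older store whose address is still unknown before finding a match.

```python
# Toy sketch of a store-to-load forwarding (STLF) buffer search.
# Not any real design; just the search order the post describes.

class StoreQueue:
    def __init__(self):
        self.stores = []  # entries in program order: [addr or None, value]

    def append_store(self, addr, value):
        # addr may be None while the store's address is still unresolved
        self.stores.append([addr, value])

    def resolve(self, index, addr):
        # a store's address becomes known
        self.stores[index][0] = addr

    def lookup(self, load_addr):
        # Search the older stores youngest-first.
        # ("forward", v): youngest matching store supplies the value.
        # ("stall", None): an older store's address is unknown, and no
        #   younger store matched, so we cannot disambiguate yet.
        # ("memory", None): no older store conflicts; go to the cache.
        for addr, value in reversed(self.stores):
            if addr == load_addr:
                return ("forward", value)
            if addr is None:
                return ("stall", None)
        return ("memory", None)
```

Note the ordering of the two checks: a match younger than any unresolved store may forward safely, because the younger store's value supersedes whatever the unresolved older store turns out to write.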
From: Andy 'Krazy' Glew on 31 May 2010 12:04

On 5/30/2010 1:36 PM, Robert Myers wrote:
> On May 28, 2:08 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>
>> Memory carried dependencies do not have such a convenient upper limit. Potentially every store instruction in an
>> earlier batch/cluster might need to be forwarded to a load instruction in a later batch/cluster. ... But, we have to
>> do this anyway. Most practical proposals for building large instruction windows have large store-to-load conflict
>> detection windows. Only in the best of times can you elide many such memory dependencies.
>
> I am reminded of the famous cartoon with "and then a miracle happens"
> in the middle of a complex proof.
>
> One advantage that I can think of for your proposal is that it imposes
> a natural order on memory references. The usual parallelism problem
> of who got there first has a definite answer imposed by the ordering
> of a serial instruction stream.
>
> You can't retire any write instructions until all ambiguous pending
> read instructions in preceding instruction blocks are complete. You
> can't count on any read instructions until all ambiguous pending write
> instructions in preceding instruction blocks are complete. You can
> ignore the latter constraint and fix up, if necessary, but the thought
> of doing it makes my head hurt. It's the same as hoisting a load
> above a store. How often does that work?
>
> I'll save you the trouble. "Aaargh!" But maybe others are as puzzled
> about the details as I am.
>
> Robert.

But that's exactly how all modern OOO machines with store-to-load forwarding buffers work.

See, e.g., store disambiguation, http://download.intel.com/technology/architecture/sma.pdf

It got into product fairly late, 2006 in the above publication. But it is much older, mature (in the R&D sense) technology. E.g. it was one of the issues in the University of Wisconsin / Intel settlement, http://www.technologyinnovative.com/industry/research/intel-settles-wisconsin-universitys-patent-claim-over-chips.html. The technique was discussed by us in the P6 development, and earlier by me at UWisc.

When I am talking about batching, or any large instruction window, the only thing that is new is scaling up existing mechanisms for store-to-load forwarding. Back in 2002 I thought this might be a challenge. But in 2003-2005 there was a flurry of papers on how to do this effectively.

If this causes you to say "Aargh", Robert, then, umm, you really aren't qualified to be discussing advanced computer architecture. This is the easy stuff.

I apologize for what seems to be an ad-hominem remark, but I really don't know how to say it otherwise. Store-to-load forwarding is old hat; predictors to allow loads to get hoisted above a store are old hat.

It's fine for you to say you don't think existing mechanisms thereof will scale. Then we can talk about the 2005-era papers on the topic, and other stuff. But if your head hurts just from the thought of it... I don't really know how to address that.
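Since the post leans on them, here is the shape of such a hoisting predictor, reduced to a toy. The set-of-violators scheme below is a deliberate simplification for illustration; real designs (e.g. store sets) use tagged hardware tables and track store/load PC pairs.

```python
# Toy memory-dependence predictor: hoist a load above earlier stores
# speculatively, unless that load PC has previously been caught reading
# stale data. Illustrative only, not a description of any real design.

class DependencePredictor:
    def __init__(self):
        self.violators = set()  # PCs of loads caught mis-speculating

    def may_hoist(self, load_pc):
        # First encounter: predict independent, allow the hoist.
        return load_pc not in self.violators

    def record_violation(self, load_pc):
        # Called when verification finds the hoisted load was wrong;
        # from now on this load waits for earlier store addresses.
        self.violators.add(load_pc)
```

The design point this captures: the common case (no conflict) pays nothing, and the predictor only learns from the rare violations that force a flush.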
From: Robert Myers on 31 May 2010 14:06

On May 31, 11:26 am, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> On 5/30/2010 1:36 PM, Robert Myers wrote:
>
> > On May 28, 2:08 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>
> >> Memory carried dependencies do not have such a convenient upper limit. Potentially every store instruction in an
> >> earlier batch/cluster might need to be forwarded to a load instruction in a later batch/cluster. ... But, we have to
> >> do this anyway. Most practical proposals for building large instruction windows have large store-to-load conflict
> >> detection windows. Only in the best of times can you elide many such memory dependencies.
>
> > I am reminded of the famous cartoon with "and then a miracle happens"
> > in the middle of a complex proof.
>
> > One advantage that I can think of for your proposal is that it imposes
> > a natural order on memory references. The usual parallelism problem
> > of who got there first has a definite answer imposed by the ordering
> > of a serial instruction stream.
>
> > You can't retire any write instructions until all ambiguous pending
> > read instructions in preceding instruction blocks are complete. You
> > can't count on any read instructions until all ambiguous pending write
> > instructions in preceding instruction blocks are complete. You can
> > ignore the latter constraint and fix up, if necessary, but the thought
> > of doing it makes my head hurt. It's the same as hoisting a load
> > above a store. How often does that work?
>
> > I'll save you the trouble. "Aaargh!" But maybe others are as puzzled
> > about the details as I am.
>
> > Robert.
>
> But that's exactly how all modern OOO machines with store-to-load forwarding buffers work.
>
> See, e.g., store disambiguation, http://download.intel.com/technology/architecture/sma.pdf
>
> It got into product fairly late, 2006 in the above publication.
>
> But it is much older, mature (in the R&D sense) technology. E.g. it was one of the issues in the University of
> Wisconsin / Intel settlement, http://www.technologyinnovative.com/industry/research/intel-settles-w....
> The technique was discussed by us in the P6 development, and earlier by me at UWisc.
>
> When I am talking about batching, or any large instruction window, the only thing that is new is scaling up existing
> mechanisms for store-to-load forwarding. Back in 2002 I thought this might be a challenge. But in 2003-2005 there
> was a flurry of papers on how to do this effectively.
>
> If this causes you to say "Aargh", Robert, then, umm, you really aren't qualified to be discussing advanced computer
> architecture. This is the easy stuff.

"Aargh" is your expletive.

I anticipated such a response. I'm not wondering how a single processor works with a single thread or a large instruction window. I think I understand that well enough to follow the discussion. What's got me stopped is: that doesn't seem to be what you're talking about.

Rather than having a single instruction window juggling instructions, you have separate instruction windows, potentially (or even ideally) with instructions from far apart in the stream. If there is parallelism, that's great, but that's sort of like saying, "Suppose all the problems that (say) CSP was invented to address didn't exist." I'm visualizing instruction windows clogged with instructions waiting to complete because some far away memory reference might change everything. If I don't understand something, it's why you think it will work well enough to be useful, not how it would work in a single instance.

Let me put it another way. It feels like you've magically split the N^2 out-of-order problem into M*(N/M)^2 problems, or N^2/M, where M is the number of separate instruction windows. How did you do that?

Robert.
From: Robert Myers on 31 May 2010 14:47

On May 31, 12:04 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> On 5/30/2010 1:36 PM, Robert Myers wrote:
>
> > On May 28, 2:08 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>
> >> Memory carried dependencies do not have such a convenient upper limit. Potentially every store instruction in an
> >> earlier batch/cluster might need to be forwarded to a load instruction in a later batch/cluster. ... But, we have to
> >> do this anyway. Most practical proposals for building large instruction windows have large store-to-load conflict
> >> detection windows. Only in the best of times can you elide many such memory dependencies.
>
> > I am reminded of the famous cartoon with "and then a miracle happens"
> > in the middle of a complex proof.
>
> > One advantage that I can think of for your proposal is that it imposes
> > a natural order on memory references. The usual parallelism problem
> > of who got there first has a definite answer imposed by the ordering
> > of a serial instruction stream.
>
> > You can't retire any write instructions until all ambiguous pending
> > read instructions in preceding instruction blocks are complete. You
> > can't count on any read instructions until all ambiguous pending write
> > instructions in preceding instruction blocks are complete. You can
> > ignore the latter constraint and fix up, if necessary, but the thought
> > of doing it makes my head hurt. It's the same as hoisting a load
> > above a store. How often does that work?
>
> > I'll save you the trouble. "Aaargh!" But maybe others are as puzzled
> > about the details as I am.
>
> > Robert.
>
> But that's exactly how all modern OOO machines with store-to-load forwarding buffers work.
>
> See, e.g., store disambiguation, http://download.intel.com/technology/architecture/sma.pdf
>
> It got into product fairly late, 2006 in the above publication.
>
> But it is much older, mature (in the R&D sense) technology. E.g. it was one of the issues in the University of
> Wisconsin / Intel settlement, http://www.technologyinnovative.com/industry/research/intel-settles-w....
> The technique was discussed by us in the P6 development, and earlier by me at UWisc.
>
> When I am talking about batching, or any large instruction window, the only thing that is new is scaling up existing
> mechanisms for store-to-load forwarding. Back in 2002 I thought this might be a challenge. But in 2003-2005 there
> was a flurry of papers on how to do this effectively.
>
> If this causes you to say "Aargh", Robert, then, umm, you really aren't qualified to be discussing advanced computer
> architecture. This is the easy stuff.
>
> I apologize for what seems to be an ad-hominem remark, but I really don't know how to say it otherwise. Store-to-load
> forwarding is old hat; predictors to allow loads to get hoisted above a store are old hat.

I actually do know that. In the case where a speculation fails and the calculation must be redone, you have to flush the pipeline and redo the calculation. That doesn't make my head hurt. What makes my head hurt is when you are juggling multiple instruction windows and multiple pipelines.

One possible emotionally neutral response to my rhetorical question, "How often does that work?" is "More often than you think, apparently."

Robert.
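The "flush and redo" recovery mentioned above, in miniature (my framing, purely illustrative): on a detected ordering violation, results older than the offending load are kept, and the load plus everything younger is squashed and re-executed in order.

```python
# Toy illustration of squash-and-replay recovery after a memory
# ordering violation. "window" stands for the in-flight instructions
# in program order; nothing here models a real pipeline.

def squash_and_replay(window, violating_index):
    # Instructions older than the mis-speculated load keep their
    # results; the load and all younger instructions are thrown back
    # to be re-executed.
    kept = window[:violating_index]
    replay = window[violating_index:]
    return kept, replay
```

The cost model this implies is the crux of the thread: recovery is cheap only if violations are rare, which is exactly what the dependence predictors are for.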
From: Andy 'Krazy' Glew on 31 May 2010 16:12
On 5/31/2010 11:06 AM, Robert Myers wrote:
> On May 31, 11:26 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>> On 5/30/2010 1:36 PM, Robert Myers wrote:
>>> On May 28, 2:08 am, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>>
>>>> Memory carried dependencies do not have such a convenient upper limit. Potentially every store instruction in an
>>>> earlier batch/cluster might need to be forwarded to a load instruction in a later batch/cluster. ... But, we have to
>>>> do this anyway. Most practical proposals for building large instruction windows have large store-to-load conflict
>>>> detection windows. Only in the best of times can you elide many such memory dependencies.
>>
>>> I am reminded of the famous cartoon with "and then a miracle happens"
>>> in the middle of a complex proof.
>>
>>> You can't retire any write instructions until all ambiguous pending
>>> read instructions in preceding instruction blocks are complete. You
>>> can't count on any read instructions until all ambiguous pending write
>>> instructions in preceding instruction blocks are complete. You can
>>> ignore the latter constraint and fix up, if necessary, but the thought
>>> of doing it makes my head hurt. It's the same as hoisting a load
>>> above a store. How often does that work?
>>
>>> I'll save you the trouble. "Aaargh!" But maybe others are as puzzled
>>> about the details as I am.
>>
>> But that's exactly how all modern OOO machines with store-to-load forwarding buffers work.
>
> I anticipated such a response. I'm not wondering how a single
> processor works with a single thread or a large instruction window. I
> think I understand that well enough to follow the discussion. What's
> got me stopped is: that doesn't seem to be what you're talking about.
>
> Rather than having a single instruction window juggling instructions,
> you have separate instruction windows potentially (or even ideally)
> with instructions from far apart in the stream. If there is
> parallelism, that's great,

This is a better form of discussion.

First, I apparently was not explicit enough in my earlier post about batch scheduling for instruction windows: it is *not* expandable split windows. It is a single, contiguous instruction window that happens to be split across multiple clusters of execution units, or multiple processor cores. It takes advantage of the OOO instruction window support that each cluster has - the scheduler or RS equivalent, etc. But it stitches them together to form a single large instruction window.

In my earlier post I discussed how this stitching is not so hard for instruction sets with a small number of registers. Worst case, you save/restore all of the registers at batch boundaries. That's not too hard for the historical 6 registers of x86. It's harder as you add more and more registers. Eventually, you want to switch to a pull strategy, where you only transfer registers across clusters when the new batch needs a value live-out of the old.

You are correct to say that memory, store-to-load forwarding, is a greater challenge.

You could use the existing STLF buffers that most OOO systems have within the clusters. But that leaves the problem of stitching together store buffers between batches, between clusters.

The simplest strategy would be to make the 2 (or more) cluster STLFs look like a single one. E.g. send a signal from the older cluster to the younger cluster (if the batch occupies the whole cluster) indicating that there is a store of address unknown. That is sufficient to implement the P6-era strategy of blocking a load until all earlier stores have become address-ready. But it is not sufficient to implement store-to-load forwarding.

A load that becomes address-ready is first compared to all stores that are address-ready in its local cluster STLF buffer. If matched, great, no traffic to the outside world.
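The probe-local-first policy just described, as a sketch (names mine; unresolved stores within the local cluster are abstracted into the local STLF dict, and a local match may safely forward even when the older cluster signals an unknown store address, because every local store is younger than every older-cluster store):

```python
# Toy sketch of a younger cluster's load probe in a stitched,
# two-cluster STLF arrangement. Illustrative only.

def cluster_load_lookup(load_addr, local_stlf, older_has_unresolved_store):
    # local_stlf: dict addr -> value for address-ready stores in this
    #   (younger) cluster, standing in for its STLF buffer.
    # older_has_unresolved_store: the cross-cluster "store of address
    #   unknown" signal from the older cluster.
    if load_addr in local_stlf:
        # Local hit: forward, no cross-cluster traffic needed.
        return ("forward-local", local_stlf[load_addr])
    if older_has_unresolved_store:
        # P6-era conservative policy: wait until all earlier store
        # addresses are known before completing the load.
        return ("block", None)
    # Local miss, older addresses all known: the load may need to probe
    # the older cluster's STLF buffer (or fall through to the cache).
    return ("probe-older-cluster", None)
```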
If not matched, then you need to decide if you want to send the load to the earlier cluster's STLF buffer. ...

At this point, I probably need to shut up, and write up an invention disclosure. It doesn't seem all that hard to me, I daresay it seems obvious to me, but I've been told many times that what seems obvious to me isn't necessarily obvious to all. Witness this discussion.

Anyway, stitching together these store buffers - a pain, but I can see several, many, ways of doing so.

If this stitching isn't good enough, there are several proposals for multilevel store buffers extant, both mine and those published in academia that I referred to in the previous post. As I have said earlier, I like an L1 STLF associated with the cluster, and an L2 STLF shared between clusters.

Finally, my log-based verify re-execution suffices to serve as a catch-all mechanism to ensure correctness. There are several other similar proposals that use verification to ensure correctness.

We therefore have several different mechanisms to "stitch together" STLF buffers. So, it looks buildable. Is it of any value?

---

Now, yes, I have in other posts discussed SpMT, expandable split windows. And, yes, the original post of this thread started off with batched, contiguous instruction windows, but segued into split windows. If that's what we want to discuss, yes, let's do so.

However, in my log-based SpMT, I drain instructions out of the window, and keep them in the log. Which really amounts to an L2 instruction window, just one that is very cheap.

> but that's sort of like saying, "Suppose
> all the problems that (say) CSP was invented to address didn't
> exist." I'm visualizing instruction windows clogged with instructions
> waiting to complete because some far away memory reference might
> change everything. If I don't understand something, it's why you
> think it will work well enough to be useful, not how it would work in
> a single instance.
So, first, it is well-known that there is some potential in simply increasing instruction window size. The problem is how costly the hardware is, especially power-wise, to do so. E.g. the kilo-instruction papers report 3.5X for FP and 1.5X for integer, going from instruction windows of 128 to 4K instructions.

Cristal, A., Santana, O. J., Valero, M., and Martínez, J. F. 2004. Toward kilo-instruction processors. ACM Trans. Archit. Code Optim. 1, 4 (Dec. 2004), 389-417. DOI= http://doi.acm.org/10.1145/1044823.1044825

Now, that is not a very good payoff. Definitely sublinear. But the whole point is not that we are trying to build processors that are 4096/128 = 32X bigger. First, the OOO hardware is only a fraction of processor area, especially if you have big vector units like some recent proposals. Second, more importantly, to build big effective instruction windows you do not need to build all that hardware, as the kilo-instruction papers, and my own work, show.

Now, the batched instruction window proposal is on the wimpy side of this. If it keeps the instruction window contiguous, it is paying the full cost for every instruction in the window. Other proposals optimize that; batched doesn't. But, the batched proposal is easy to build.

I think really what I am saying is that there are a number of techniques that help improve single-thread performance. Batching is one of the simplest ones. SpMT is more aggressive. I'm collecting those techniques, and hope to pick and choose from such a toolkit if ever I am building an aggressive single-thread microarchitecture again.

But, also, I am not pushing this as an alternative to improving parallel performance, such as via message-passing techniques like CSP (I half expect Nick to say "CSP isn't message passing"), or shared-memory parallelism. I'm open to all such techniques to improve performance, and I don't denigrate any of them.
What I do denigrate is people saying "The only thing that anyone nowadays should be building is the simplest possible in-order processors, many, many of them, to save power and ..." - bet the farm on parallel programming. IMHO that does not make sense for general-purpose workloads. It might make sense for some supercomputers and other HPC, and for some graphics.

And I'll turn your point around: if you are saying the above, then you are saying "And a miracle happens..." with respect to parallel programming. However, in truth, I don't think you are saying that.

> Let me put it another way. It feels like you've magically split the
> N^2 out of order problem into M*(N/M)^2 problems or N^2/M, where M is
> the number of separate instruction windows. How did you do that?

First, out-of-order is only N^2 in simplistic algorithms. I.e. OOO is N^2 the same way sorting is O(N^2) - i.e. it isn't. See, for example, the Ultrascalar papers, which show logarithmic gate delays and sqrt wire length, essentially by showing that OOO is a parallel prefix problem.

Henry, D. S., Kuszmaul, B. C., and Viswanath, V. 1999. The Ultrascalar Processor - An Asymptotically Scalable Superscalar Microarchitecture. In Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI (March 21-24, 1999). ARVLSI. IEEE Computer Society, Washington, DC, 256.

Arguably their results apply only to renaming of registers and memory, but I think they can extend to all OOO execution.

Now, I'm not talking Ultrascalar - I don't think I could justify building a new microarchitecture - so it is fair to ask how I am ducking the O(N^2) of simplistic OOO. But it is important to note the theoretical limit: we aren't violating theory, just using O(N^2) algorithms up until it gets painful, and then switching to more scalable approaches.

Wrt the batched instruction window, above I started discussing how to stitch instruction windows together. Sorry, I started discussing, but stopped.
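To make the parallel-prefix point concrete, a small sketch (the dict-based framing is mine; Ultrascalar itself is wired logic): each instruction contributes a "registers I write" map, two maps combine by letting the younger side win, and because that combine is associative, the per-instruction "last writer before me" maps form a prefix scan that hardware can evaluate in logarithmic combining depth rather than serially.

```python
# Register renaming as a prefix-scan, in miniature.

def combine(older, younger):
    # Merge two "last writer of each register" maps; the younger
    # segment's writes override the older's. This operation is
    # associative, which is what makes a log-depth parallel scan legal.
    merged = dict(older)
    merged.update(younger)
    return merged

def rename_prefix(per_insn_writes):
    # per_insn_writes[i]: {reg_name: i} for registers instruction i writes.
    # Returns, for each instruction, the last-writer map *before* it,
    # i.e. where each of its source registers should be read from.
    before = []
    acc = {}
    for w in per_insn_writes:      # serial scan here for clarity; the
        before.append(dict(acc))   # associativity of `combine` is what
        acc = combine(acc, w)      # permits a log-depth scan in hardware
    return before
```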
But the basic idea is that not all traffic needs to cross the cluster boundary. You can filter out a lot that doesn't need to cross, and make predictions for some of the rest. And you are left with a lot less.
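One concrete way to do that filtering (my illustration; the post doesn't commit to a mechanism): the older cluster summarizes its store addresses in a small Bloom-style bit vector, and a younger cluster's load crosses the boundary only when the filter answers "maybe".

```python
# Toy Bloom-style summary of an older cluster's store addresses.
# A "no" answer is definite, so the load skips the cross-cluster probe;
# a "yes" may be a false positive and just costs one probe.

class StoreAddressFilter:
    def __init__(self, bits=64):
        self.bits = bits
        self.vector = 0

    def _hashes(self, addr):
        # two cheap hash indices; real hardware would XOR-fold address bits
        return (addr % self.bits, (addr * 2654435761 >> 8) % self.bits)

    def insert(self, addr):
        # record a store address in the summary
        for h in self._hashes(addr):
            self.vector |= 1 << h

    def may_contain(self, addr):
        # False => definitely no older store to addr crossed this filter.
        # True  => possibly present: send the load across the boundary.
        return all(self.vector >> h & 1 for h in self._hashes(addr))
```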