From: Andy 'Krazy' Glew on 9 May 2010 17:29

On 1/25/2010 3:43 AM, nedbrek wrote:
> Hello all,
>
> "Andy "Krazy" Glew"<ag-news(a)patten-glew.net> wrote in message
> news:4B5C999C.9060301(a)patten-glew.net...
>> nedbrek wrote:
>>> That's where my mind starts to boggle. I would need to see branch
>>> predictor and serialization data showing a window this big would
>>> deliver significant performance gains. We were looking at Itanium
>>> runs of Spec2k built by Electron (i.e. super optimized). We were
>>> assuming a very heavy implementation (few serializing conditions).
>>> We were unable to scale this far.

By the way, Ned's comment about needing to see branch prediction data
indicates a fundamental misunderstanding of speculative multithreading.
This was almost exactly Jim Smith's misunderstanding of nigh on ten
years ago.

(Plus, Jim was not aware, at that time, of the significant advances in
branch prediction made during the Willamette era. Jim's misunderstanding
somewhat inspired my work on multilevel branch predictors, as a way of
getting the accuracy of a larger predictor with the latency, the short
bubble on predicted-taken branches, of a smaller branch predictor; a toy
model of that idea is sketched at the end of this post. Since I was
under NDA for the P4 branch predictor, I had to invent something almost
as good to make my work relevant.)

If you have a single thread feeding your instruction window, then any
branch misprediction might conceivably invalidate all subsequent
instructions. However, in SpMT you have multiple instruction fetch
threads, from the same logical thread of execution, feeding the
instruction window. Typically these threads are control independent of
each other. E.g. stuff from after the return of a function is control
independent of any branch inside the function itself, except for
branches that cause the function not to return, e.g. to throw an
exception. Ditto loops. (The second sketch at the end of this post
illustrates this at source level.)

I.e. exploitation of control independence removes branch mispredictions
as an impediment to large instruction windows. Of course, one then has
to worry about data value dependence.

Similarly, Ned mentions serialization. Now, admittedly, much work on
stuff like SpMT is still vulnerable to the serialization bottleneck,
particularly versions of SpMT or DMT or whatever that leave speculative
state in the cache waiting to be committed, snooping to determine if it
is still correct.

However, my log-based SpMT is NOT vulnerable to the serialization
bottleneck. You never have to stop speculation because of serialization,
because serialization constraints are completely satisfied during verify
re-execution. The verify re-execution engine may serialize itself, but
it never needs to invalidate the speculative log that it is verifying,
because the process of verification avoids that need. (The third sketch
at the end of this post shows the shape of that verification loop.)

If you were to build a dedicated verify re-execution engine, it would be
a fairly simple in-order machine, with low serialization costs. However,
my preference is to minimize the amount of dedicated logic, and to reuse
the normal processor, which may be OOO. Nevertheless, serialization
there will be less costly in verify re-execution mode than otherwise,
because verify re-execution mode is so parallel.

Pretty much the only time you really need to serialize a log-based SpMT
machine is when you change the control register bit that says "Never run
in SpMT mode."

Now, of course, what we have really done is convert serialization into a
prediction problem. If the places where we have speculated past what
would have been a serialization point on an old machine lead to many
mis-speculations...
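Appendix-style sketches, since a few of the ideas above are concrete
enough to show in code. First, the overriding multilevel predictor: a
small single-cycle predictor gives an immediate guess, and a larger,
slower predictor can override it for the cost of a short refetch bubble
rather than a full flush. A minimal toy model in Python; the table
sizes, the gshare-style indexing, and the 2-cycle/20-cycle penalties are
made-up illustrative numbers, not the P4's.

FAST_BITS = 8      # small, single-cycle predictor: 256 counters
SLOW_BITS = 14     # large, slower predictor: 16K counters

fast = [2] * (1 << FAST_BITS)   # 2-bit saturating counters, weakly taken
slow = [2] * (1 << SLOW_BITS)
ghist = 0                       # global history for the slow predictor

def predict_and_update(pc, taken):
    """Predict one branch, train both levels, return bubble cycles."""
    global ghist
    fi = pc & ((1 << FAST_BITS) - 1)
    si = (pc ^ ghist) & ((1 << SLOW_BITS) - 1)   # gshare-style index

    fast_pred = fast[fi] >= 2
    slow_pred = slow[si] >= 2

    # Disagreement costs only a short refetch bubble, not a flush.
    bubble = 2 if slow_pred != fast_pred else 0
    if slow_pred != taken:
        bubble = 20          # real misprediction: full pipeline flush

    # Train both levels toward the actual outcome.
    for tbl, idx in ((fast, fi), (slow, si)):
        tbl[idx] = min(3, tbl[idx] + 1) if taken else max(0, tbl[idx] - 1)
    ghist = ((ghist << 1) | int(taken)) & ((1 << SLOW_BITS) - 1)
    return bubble

# Usage: a loop branch taken 9 times, then falling through.
print(sum(predict_and_update(0x40, t) for t in [True]*9 + [False]))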
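Second, control independence at source level. A minimal sketch with
hypothetical names: whichever way the branches inside helper() resolve,
execution reaches the line after the call, so a speculative thread
spawned at that point fetches useful instructions even if every branch
inside helper() is mispredicted. It depends on helper() only through the
data value it returns, which is the data value dependence caveat above.

def helper(x):
    if x % 2:               # hard-to-predict branch...
        return x * 3 + 1
    return x // 2           # ...but either way helper() returns

total = 0
for i in range(100):        # loop iterations are likewise control
    total += helper(i)      # independent: a thread spawned at the
                            # "total +=" point is valid regardless of
                            # which branches execute inside helper()
print(total)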
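Third, the shape of log-based verify re-execution, under one reading of
the description above, with all names hypothetical: the speculative
thread logs the input and output values of each instruction; the
verifier re-executes the same instructions in order and
non-speculatively, so serializing instructions simply execute normally
there, and a value mismatch, not serialization, is the only thing that
ends a speculative run.

from dataclasses import dataclass

@dataclass
class LogEntry:
    op: str          # e.g. "add", "mul"
    inputs: tuple    # input values the speculative thread observed
    output: int      # value the speculative thread produced

def execute(op, inputs):
    """Architectural (non-speculative) execution of one instruction."""
    if op == "add":
        return inputs[0] + inputs[1]
    if op == "mul":
        return inputs[0] * inputs[1]
    raise ValueError(op)

def verify(log):
    """Re-execute the log in order; entries verified before the first
    mismatch commit, and speculation is redone from the mismatch."""
    verified = []
    for entry in log:
        if execute(entry.op, entry.inputs) != entry.output:
            break                 # mis-speculated value: redo from here
        verified.append(entry)    # earlier work commits as verified
    return verified

# Usage: the second entry was mis-speculated (3 * 4 is not 11).
log = [LogEntry("add", (1, 2), 3), LogEntry("mul", (3, 4), 11)]
print(len(verify(log)))           # -> 1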