From: Anton Ertl on 4 Nov 2009 13:10

"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>Anton Ertl wrote:
>> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>>> b) Most branch predictors in my experience use virtual addresses,
>>> although using physical addresses can shave a cycle in front end.
>>
>> I would expect physical addresses to cost extra cycles, because there
>> is additional translation.
>
>Other way around.
>
>If you have a physically addressed I$,

Some years ago the usual way was virtually-indexed physically-tagged
L1 caches.  Has this changed?

> but a virtual branch predictor,

Ah, you mean the addresses coming out of the branch predictor, right?

I was thinking about the addresses going in; that's because
conditional branch predictors only predict taken/not-taken, and
because the question being discussed was the aliasing in the branch
predictor from merging the histories of different threads.

For the addresses going in, using physical addresses would increase
the latency (or at least the hardware required), and the benefit is
probably small.

>you have to translate the, e.g., virtual branch target addresses into
>physical, giving you latency on a predicted taken branch.  On the
>other hand, it is I-fetch, where latency can often be tolerated.

For the BTB, storing physical addresses may be a good idea (if it
gives any advantage over virtually-indexed physically-tagged access).

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
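To make the virtually-indexed, physically-tagged idea concrete, here is
a minimal C sketch, assuming 4 KB pages and a direct-mapped 4 KB array;
tlb_translate and all other names are hypothetical, chosen only for
illustration.  Because the index bits fall entirely within the page
offset, the array read can overlap the TLB lookup, which is why VIPT
avoids the translation latency discussed above.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12            /* 4 KB pages                        */
#define LINE_BITS 6             /* 64-byte lines                     */
#define SETS      64            /* index uses bits 6..11 only        */

struct line { uint64_t ptag; bool valid; };
static struct line cache[SETS];

extern uint64_t tlb_translate(uint64_t vaddr);  /* hypothetical TLB */

bool vipt_hit(uint64_t vaddr)
{
    /* Index bits lie entirely within the page offset, so they are
       identical in the virtual and physical address: the array can
       be read before (or while) the TLB translates.               */
    unsigned set = (vaddr >> LINE_BITS) & (SETS - 1);

    uint64_t paddr = tlb_translate(vaddr);   /* overlapped in HW    */
    uint64_t ptag  = paddr >> PAGE_BITS;

    return cache[set].valid && cache[set].ptag == ptag;
}

The same constraint shows why such arrays are usually limited to
page size times associativity: once the index needs translated bits,
the overlap is lost.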
From: "Andy "Krazy" Glew" on 5 Nov 2009 00:47 Anton Ertl wrote: > "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes: >> If you have a physically addressed I$, > > Some years ago the usual way was virtually-indexed physically-tagged > L1 caches. Has this changed? Although there have been a number of systems with virtually indexed physically tagged D$ and I$, including IIRC the Willamette L0 D$, array lookup, most Intel x86s of the P6 family have physically indexed, physically tagged, caches. IMHO virtual indexing has gotten a bit of a bad rap. But, it certainly had a bad reputation in quite a few design groups. >> but a virtual branch predictor, > > Ah, you mean the addresses coming out of the branch predictor, right? Could be the address coming out Could be the address going in. It is convenient if they are of the same type, so that you can feed the predictor output right back to the input. > > I was thinking about the addresses going in; that's because > conditional branch predictors only predict taken/not-taken, and > because the question being discussed was the aliasing in the branch > predictor from merging the histories of different threads. > > For the addresses going in using physical addresses would increase the > latency (or at least the hardware required), and the benefit is > probably small. Why would physical addresses going in increase the latency? They would not increase latency of the array lookup or tag match. They add complexity. And they require the target to be translated when it is put into the array, typically on a misprediction when you are doing an ifetch anyway. > For the BTB, storing physical addresses may be a good idea (if it > gives any advantage over virtually-indexed physically-tagged access). Like I said, unclear if it is a complexity win. Definitely costs devices.
From: Quadibloc on 8 Nov 2009 11:52

On Nov 2, 3:08 am, "Ken Hagan" <K.Ha...(a)thermoteknix.com> wrote:
> On Fri, 30 Oct 2009 19:21:28 -0000, Quadibloc <jsav...(a)ecn.ab.ca> wrote:
>> If one has a multithreaded core, then that core should have separate
>> branch predictor states for each thread as well.
>
> Isn't that the same as "For a multithreaded core, the space available
> for storing branch predictor state should be divided exactly 1/N to
> each context."?  That's fair to each thread, but not necessarily the
> best use of a presumably limited resource.

It's only the same if one has a branch predictor that is capable of
working that way.  I certainly do agree that if one can optimally
allocate branch predictor state without incurring inordinate costs
for that capability, one should do so.

However, I was trying to get at something much simpler, and I think
less controversial:

If one has a multithreaded core, branch predictor information should
be labelled by thread, so that information gathered about the
branches in one thread isn't used to control how branches in another
thread are handled.  The branch predictor should not simply ignore
the fact that multiple different threads are being executed in the
core.

In other words, I was assuming that the branch predictor would be
crude and simple in design; a handful of gates, not a computer in its
own right, which is why I failed to be sufficiently explicit.

John Savard
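A toy C sketch of the labelling John Savard describes: each predictor
entry carries the ID of the thread that trained it, so one thread's
history is never used for another thread's branches.  Table size and
all names are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

#define PHT_ENTRIES 1024

struct pht_entry {
    uint8_t tid;      /* which hardware thread trained this entry   */
    uint8_t counter;  /* 2-bit saturating counter, 0..3             */
};
static struct pht_entry pht[PHT_ENTRIES];

bool predict_taken(uint8_t tid, uint64_t pc)
{
    struct pht_entry *e = &pht[(pc >> 2) % PHT_ENTRIES];
    if (e->tid != tid)
        return false;            /* foreign state: fall back to a
                                    static not-taken prediction     */
    return e->counter >= 2;
}

void train(uint8_t tid, uint64_t pc, bool taken)
{
    struct pht_entry *e = &pht[(pc >> 2) % PHT_ENTRIES];
    if (e->tid != tid) {         /* reclaim the entry for this thread */
        e->tid = tid;
        e->counter = taken ? 2 : 1;   /* weakly biased start        */
        return;
    }
    if (taken  && e->counter < 3) e->counter++;
    if (!taken && e->counter > 0) e->counter--;
}

This is deliberately crude: when threads collide on an entry, the
newcomer simply reclaims it, which is the handful-of-gates behaviour
the post asks for rather than any optimal allocation.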
From: Anton Ertl on 8 Nov 2009 13:20

"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>Anton Ertl wrote:
>> "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>
>>> If you have a physically addressed I$,
>>
>> Some years ago the usual way was virtually-indexed physically-tagged
>> L1 caches.  Has this changed?
>
>Although there have been a number of systems with virtually indexed,
>physically tagged D$ and I$ array lookups, including IIRC the
>Willamette L0 D$, most Intel x86s of the P6 family have physically
>indexed, physically tagged caches.
>
>IMHO virtual indexing has gotten a bit of a bad rap.  But it certainly
>had a bad reputation in quite a few design groups.

Why is that?  I never heard about it before.

>> For the addresses going in, using physical addresses would increase
>> the latency (or at least the hardware required), and the benefit is
>> probably small.
>
>Why would physical addresses going in increase the latency?

My thoughts were along the following lines (but see below): Either
the CPU produces the physical address by translating from the virtual
address, and then there is latency.  Or it maintains the physical PC
as well, and then there is additional hardware required (plus latency
in rare cases, e.g. page-crossing).

>They would not increase latency of the array lookup or tag match.

Ok, using the common part for indexing, and delaying the tag match
until after the translation, as in virtually-indexed physically-tagged
caches.  Yes, that may be possible without extra latency.

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
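A short sketch of the second option Anton mentions, maintaining a
physical PC alongside the virtual one: translation is only needed in
the rare page-crossing case, at the cost of extra state.  All names
here are hypothetical.

#include <stdint.h>

#define PAGE_SIZE 4096u

struct fetch_pc { uint64_t vpc, ppc; };

extern uint64_t tlb_translate(uint64_t vaddr);  /* hypothetical TLB */

void advance_pc(struct fetch_pc *pc, uint64_t bytes)
{
    uint64_t old_vpc = pc->vpc;
    pc->vpc += bytes;
    if ((pc->vpc / PAGE_SIZE) == (old_vpc / PAGE_SIZE))
        pc->ppc += bytes;                  /* common case: same page */
    else
        pc->ppc = tlb_translate(pc->vpc);  /* rare case: extra latency */
}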
From: Terje Mathisen on 8 Nov 2009 16:00
Quadibloc wrote:
> On Nov 2, 3:08 am, "Ken Hagan" <K.Ha...(a)thermoteknix.com> wrote:
>> On Fri, 30 Oct 2009 19:21:28 -0000, Quadibloc <jsav...(a)ecn.ab.ca> wrote:
>
>>> If one has a multithreaded core, then that core should have separate
>>> branch predictor states for each thread as well.
>>
>> Isn't that the same as "For a multithreaded core, the space available
>> for storing branch predictor state should be divided exactly 1/N to
>> each context."?  That's fair to each thread, but not necessarily the
>> best use of a presumably limited resource.
>
> It's only the same if one has a branch predictor that is capable of
> working that way.  I certainly do agree that if one can optimally
> allocate branch predictor state without incurring inordinate costs
> for that capability, one should do so.
>
> However, I was trying to get at something much simpler, and I think
> less controversial:
>
> If one has a multithreaded core, branch predictor information should
> be labelled by thread, so that information gathered about the
> branches in one thread isn't used to control how branches in another
> thread are handled.  The branch predictor should not simply ignore
> the fact that multiple different threads are being executed in the
> core.

In a multicore CPU, this is very probably exactly the wrong thing to
do: the usual programming paradigm for such a system is to have many
threads running the same algorithm, which means that training
information from one thread is likely to be useful for another, or at
least not detrimental.

Cores that run different functions will have a separate set of
branches to consider, and again each set running the same code can
share branch info.

The main reason for keeping them separate is simply that the branch
predictor needs to be very close to the instruction fetch and
execution units, something which is hard to achieve if a single large
global branch table is many cycles away.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
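A sketch of the sharing Terje describes: if the predictor index
depends only on the code address, with no thread ID in the hash, then
threads running the same algorithm train and reuse the same entries.
Sizes and names are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

#define PHT_ENTRIES 4096

static uint8_t pht[PHT_ENTRIES];   /* 2-bit counters, no thread tag */

bool predict_taken(uint64_t pc)
{
    /* Deliberately no thread ID in the hash: thread 0's training
       on this branch benefits threads 1..N running the same loop.  */
    return pht[(pc >> 2) % PHT_ENTRIES] >= 2;
}

void train(uint64_t pc, bool taken)
{
    uint8_t *c = &pht[(pc >> 2) % PHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}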