From: Ken Hagan on 2 Nov 2009 05:08

On Fri, 30 Oct 2009 19:21:28 -0000, Quadibloc <jsavard(a)ecn.ab.ca> wrote:

> If one has a multithreaded core, then that core should have separate
> branch predictor states for each thread as well.

Isn't that the same as "For a multithreaded core, the space available
for storing branch predictor state should be divided exactly 1/N to
each context"? That's fair to each thread, but not necessarily the best
use of a presumably limited resource.
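To make the contrast concrete, here is a toy sketch in Python of the
two indexing policies (the table size, thread count, and hash are all
invented for illustration, not any real design):

PHT_BITS = 12            # 4096 predictor entries total (made up)
N_THREADS = 4
THREAD_BITS = 2          # log2(N_THREADS)

def shared_index(pc):
    # One big table: threads compete for all entries, so a
    # branch-heavy thread can grab more than its 1/N share,
    # at the cost of possible cross-thread aliasing.
    return pc & ((1 << PHT_BITS) - 1)

def partitioned_index(pc, thread_id):
    # Hard 1/N split: each thread owns a fixed slice - fair,
    # but a slice can sit idle while another thread thrashes
    # its own.
    slice_bits = PHT_BITS - THREAD_BITS
    return (thread_id << slice_bits) | (pc & ((1 << slice_bits) - 1))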
From: Sid Touati on 2 Nov 2009 05:15

Mayan Moudgill wrote:
> Sid Touati wrote:
>> Andy "Krazy" Glew wrote:
>>
>>> f) I've long wanted to have the option of loading/unloading predictor
>>> state like other context. Trouble is, it is often faster to
>>> recompute than reload.
>
> You can save the branch predictor tables and restore them on a context
> switch. Or you can zero out the tables on a context switch. Or you can
> just leave them alone, and let them correct themselves as the
> switched-in program runs.

Yes, we can imagine a lot of games inside a chip. My question was about
what has actually been done and experimented with. All we see in papers
is good performance numbers for branch predictors and prefetchers that
nobody is able to reproduce, simply because few people use a machine in
batch mode. Most real usage involves multitasking, multithreading, etc.

> Turns out that there is not much point to doing either of the first two
> approaches; the branch predictor will correct itself pretty quickly -
> quickly enough that the extra cycles spent unloading and reloading the
> predictor tables on a context switch overwhelm the actual performance
> gain.

The term "learning" that is usually used to describe dynamic mechanisms
is a subliminal description of what is going on inside speculative
mechanisms: threads, predictors and prefetchers do not "learn" anything
at execution time; they just play against randomness. Learning implies
"understanding", and a simple automaton with a table cannot learn
anything :)

Anyway, if someone has an exact reference to a serious experimental
study of branch predictors and data prefetchers in the context of
multitasking and multithreading, could you please point to it?

Best regards
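Mayan's retraining point is easy to reproduce with a toy two-bit
saturating-counter predictor; this is only a sketch with an invented
workload and flush interval, not a claim about any real machine:

def simulate(outcomes, flush_every=None):
    counter = 0                          # start cold (strongly not-taken)
    mispredicts = 0
    for i, taken in enumerate(outcomes):
        if flush_every and i % flush_every == 0:
            counter = 0                  # "context switch": zero the state
        predicted = counter >= 2
        if predicted != taken:
            mispredicts += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return mispredicts

# A loop branch: taken 99 times, then falls through; 100 iterations.
trace = ([True] * 99 + [False]) * 100

print(simulate(trace))                    # ~102 mispredicts
print(simulate(trace, flush_every=1000))  # ~120: each flush costs only
                                          # a couple of extra misses

The counter re-warms within a handful of branches after each flush,
which is why saving and restoring the tables rarely pays off.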
From: Anton Ertl on 29 Oct 2009 13:05

"Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes:
>b) Most branch predictors in my experience use virtual addresses,
>although using physical addresses can shave a cycle in front end.

I would expect physical addresses to cost extra cycles, because there
is additional translation.

Is there much aliasing from using virtual addresses without address
space numbers or similar? I wouldn't expect it.

>c) P6 anecdote, circa 1991: the IFU (I-cache) designer wanted to flush
>the BTB on all context switches. Because we cross checked, we did not
>need to do so for correctness, and not flushing turned out to be a
>slight performance win.

That seems obvious. With flushing, you have no chance of a hit; without,
you have one (even though it may be small). Am I overlooking something?

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
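To picture the aliasing being asked about: with a purely virtual index
and no address-space number, two processes whose hot branch sits at the
same virtual PC (common when binaries share a base load address)
collide in the predictor. A hypothetical sketch, with invented sizes:

BTB_BITS = 10

def btb_index(virtual_pc):
    return (virtual_pc >> 2) & ((1 << BTB_BITS) - 1)

def tagged_index(virtual_pc, asid):
    # A cheap mitigation: fold an address-space number into the hash.
    return btb_index(virtual_pc ^ (asid << 2))

pc_a = 0x400120   # hot branch in process A
pc_b = 0x400120   # same virtual address in process B

assert btb_index(pc_a) == btb_index(pc_b)              # they alias
assert tagged_index(pc_a, 1) != tagged_index(pc_b, 2)  # they no longer do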
From: "Andy "Krazy" Glew" on 3 Nov 2009 23:09 Anton Ertl wrote: > "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes: >> b) Most branch predictors in my experience use virtual addresses, >> although using physical addresses can shave a cycle in front end. > > I would expect physical addresses to cost extra cycles, because there > is additional translation. Other way around. If you have a physically addressed I$, but a virtual branch predictor, you have to translate the, e.g., virtual branch target addresses into physical, giving you latency on a predicted taken branch. On the other hand, it is I-fetch, where latency can often be tolerated. Whereas you could use physical addresses for I-fetch: e.g. have a current I-fetch PC (Intel parlance, PFIP, physical fetch instruction pointer (I made that up)), increment it to the next I$ line. Have the BTB have physical addresses. Trouble is, you have to do extra work, like translating when sequential instruction fetch crosses a page boundary, or remembering such crossings. You pretty much have to maintain the virtual or linear, VFIP or VLIP, instruction pointers as well, although maybe not as fast as the main PFIP.
From: "Andy "Krazy" Glew" on 3 Nov 2009 23:10
Anton Ertl wrote: > "Andy \"Krazy\" Glew" <ag-news(a)patten-glew.net> writes: >> b) Most branch predictors in my experience use virtual addresses, >> although using physical addresses can shave a cycle in front end. > > I would expect physical addresses to cost extra cycles, because there > is additional translation. Other way around. If you have a physically addressed I$, but a virtual branch predictor, you have to translate the, e.g., virtual branch target addresses into physical, giving you latency on a predicted taken branch. On the other hand, it is I-fetch, where latency can often be tolerated. Whereas you could use physical addresses for I-fetch: e.g. have a current I-fetch PC (Intel parlance, PFIP, physical fetch instruction pointer (I made that up)), increment it to the next I$ line. Have the BTB have physical addresses. Trouble is, you have to do extra work, like translating when sequential instruction fetch crosses a page boundary, or remembering such crossings. You pretty much have to maintain the virtual or linear, VFIP or VLIP, instruction pointers as well, although maybe not as fast as the main PFIP. |