From: Sid Touati on 28 Oct 2009 14:08

Hi all,

Branch predictors and data prefetchers are usually evaluated by considering a single task: a fixed benchmark is the unique program executed to evaluate the efficiency of the hardware data prefetcher or the branch predictor.

With the multicore era, desktops and servers will execute more and more multi-threaded applications, or multiple distinct applications from distinct users. When executing multiple threads from multiple applications, branch predictors and data prefetchers are disturbed, and their "learning" becomes erroneous (especially when they use physical addresses as tags).

Does anyone know of a serious experimental study of the performance of hardware data prefetchers and branch predictors in such a context?

Thanks

S, waiting for the next generation of branch predictors and data prefetchers for multicore processors
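[To make the interference concrete, here is a minimal sketch in C of a PC-indexed table of 2-bit saturating counters, as in a simple bimodal predictor. Sizes and addresses are invented for illustration; no real design is implied. Two contexts whose branches alias to the same untagged entry train the same counter, which is the "erroneous learning" described above.]

#include <stdio.h>
#include <stdint.h>

#define TABLE_BITS 12
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t counters[TABLE_SIZE];   /* 2-bit states: 0..3 */

static int predict_taken(uint32_t pc) {
    return counters[(pc >> 2) & (TABLE_SIZE - 1)] >= 2;
}

static void update(uint32_t pc, int taken) {
    uint8_t *c = &counters[(pc >> 2) & (TABLE_SIZE - 1)];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}

int main(void) {
    /* Two branches from different contexts that alias to the same
     * entry: same low PC bits, and no tag to tell them apart. */
    uint32_t pc_a = 0x00001000;   /* context A: always taken */
    uint32_t pc_b = 0x40001000;   /* context B: never taken  */
    int misses = 0;

    for (int i = 0; i < 1000; i++) {
        if (!predict_taken(pc_a))
            misses++;
        update(pc_a, 1);
        update(pc_b, 0);   /* the "other application" interferes */
    }
    printf("context A mispredicted %d of 1000 times\n", misses);
    return 0;
}

[Alone, context A's always-taken branch would saturate the counter after two updates and predict correctly thereafter; with context B's interleaved not-taken updates dragging the shared counter back down, A mispredicts on every iteration.]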
From: "Andy "Krazy" Glew" on 28 Oct 2009 22:31 Sid Touati wrote: > Hi all, > branch predictors and data prefetchors are usually evaluated by > considering a single task: the fixed benchmark is the unique program > executed for evaluating the efficiency of the hardware data prefetchor > or the branch predictor. > > With the multicore era, desktop and servers will execute more and more > multi-threaded applications, or multiple distinct applications, from > distinct users. When executing multiple threads from multiple > applications, branch predictors and data prefetchor are disturbed, and > their "learning" becomes erroneous (especially when they use physical > address as tags). > > Does anyone know about a serious experimental study of the performance > of hardware data prefetchor and branch predictor in such context ? > > Thanks > > S, waiting for the next generation of branch predictors and data > prefetchors for multicore processors a) There have been studies in academia, published I believe, on the effects of context switching on branch predictors. As you might expect, the more context switching, the worse. b) Most branch predictors in my experience use virtual addresses, although using physical addresses can shave a cycle in front end. c) P6 anecdote, circa 1991: the IFU (I-cache) designer wanted to flush the BTB on all context switches. Because we cross checked, we did not need to do so for correctness, and not flushing turned out to be a slight performance win. d) Multicore in some ways *reduces* the frequency of context switches (compared to the same workload running timesliced), so predictors may improve. It's all a question of what you measure with respect to. e) Since many multicore and GP-GPU workloads run the same code on multiple processors, one might hope for possible IMPROVEMENTS in branch predictors. Especially if learning from one thread can help another. E.g. shared BIdB (Branch Identification Buffer) and BTB - basically, shared big expensive tagged structures. Private histories. Problem: nobody wants to have shared structures. It's nicer if the cores are independent. But if your units start becoming clusters of 2,4 processor, then such sharing is reasonable. Similarly, SIMT/CT (Choherent Threading) warps or clusters may easily emply a shared branch predictor. There should also be optimizations related to the mainly shared history. f) I've long wanted to have the option of loading/unloading predictor state like other context. Trouble is, it is often faster to recompute than reload.
From: Sid Touati on 30 Oct 2009 07:30

Andy "Krazy" Glew wrote:
> a) There have been studies in academia, published I believe, on the
> effects of context switching on branch predictors. As you might expect,
> the more context switching, the worse.

Do you have exact references for such academic studies? Of course I was talking about real experiments, not simulations. Simulating the performance of multicore systems is tricky.

> b) Most branch predictors in my experience use virtual addresses,
> although using physical addresses can shave a cycle in the front end.

Fine, but how do they distinguish between the PCs of two separate applications running in parallel on the same multicore processor?

> c) P6 anecdote, circa 1991: the IFU (I-cache) designer wanted to flush
> the BTB on all context switches. Because we cross-checked, we did not
> need to do so for correctness, and not flushing turned out to be a
> slight performance win.

It depends on the workload, and on the application.

> e) Since many multicore and GP-GPU workloads run the same code on
> multiple processors, one might hope for possible IMPROVEMENTS in branch
> predictors, especially if learning from one thread can help another.

You are right when we talk about executing multiple OpenMP threads of the same application. In practice, multiple distinct applications are run in parallel, and this is how we usually use computers (batch mode is reserved for special situations only).

> f) I've long wanted to have the option of loading/unloading predictor
> state like other context. Trouble is, it is often faster to recompute
> than reload.

I am missing your point here.

Regards
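[On the question of distinguishing PCs: untagged counter tables simply don't - entries from different address spaces alias and self-correct, as in the earlier sketch. For tagged structures such as a BTB, one textbook option (an assumption for illustration here, not a claim about any particular product) is to widen the tag with an address-space identifier, as TLBs do.]

#include <stdint.h>

#define BTB_SIZE 1024

struct btb_entry {
    uint32_t tag;      /* high bits of the branch's virtual PC    */
    uint16_t asid;     /* address-space ID of the owning process  */
    uint32_t target;   /* predicted branch target                 */
    uint8_t  valid;
};

static struct btb_entry btb[BTB_SIZE];

/* Hit only if both the PC tag and the current ASID match, so
 * entries from another address space neither hit falsely nor
 * need to be flushed on a context switch. */
static int btb_lookup(uint32_t pc, uint16_t cur_asid, uint32_t *target) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
    if (e->valid && e->tag == (pc >> 12) && e->asid == cur_asid) {
        *target = e->target;
        return 1;
    }
    return 0;
}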
From: Mayan Moudgill on 30 Oct 2009 13:01

Sid Touati wrote:
> Andy "Krazy" Glew wrote:
>> f) I've long wanted to have the option of loading/unloading predictor
>> state like other context. Trouble is, it is often faster to recompute
>> than reload.

You can save the branch predictor tables and restore them on a context switch. Or you can zero out the tables on a context switch. Or you can just leave them alone and let them correct themselves as the switched-in program runs. It turns out that there is not much point to either of the first two approaches; the branch predictor will correct itself pretty quickly - quickly enough that the extra cycles spent unloading and reloading the predictor tables on a context switch overwhelm the actual performance gain.
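[A back-of-envelope comparison of the two costs. Every number below is an illustrative assumption, not a measurement.]

#include <stdio.h>

int main(void) {
    /* Assumed predictor state: a 4K-entry BTB at 8 bytes per entry,
     * moved at 8 bytes per cycle, once on save and once on restore. */
    double state_bytes   = 4096 * 8;
    double bytes_per_cyc = 8;
    double save_restore  = 2 * state_bytes / bytes_per_cyc;

    /* Assumed warm-up after a switch with tables left alone: a few
     * hundred extra mispredicts at a ~15-cycle flush penalty each. */
    double extra_misses  = 300;
    double miss_penalty  = 15;
    double warm_up       = extra_misses * miss_penalty;

    printf("save+restore per switch: %.0f cycles\n", save_restore);  /* 8192 */
    printf("self-correcting warm-up: %.0f cycles\n", warm_up);       /* 4500 */
    return 0;
}

[Under these assumptions, simply letting the predictor retrain is already cheaper than moving its state, consistent with the observation above - and both costs are noise against a timeslice of millions of cycles.]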
From: Quadibloc on 30 Oct 2009 15:21
On Oct 28, 12:08 pm, Sid Touati <SidnospamTou...(a)inria.fr> wrote:
> With the multicore era, desktops and servers will execute more and more
> multi-threaded applications, or multiple distinct applications from
> distinct users. When executing multiple threads from multiple
> applications, branch predictors and data prefetchers are disturbed, and
> their "learning" becomes erroneous (especially when they use physical
> addresses as tags).

Multicore processors help rather than hinder, as someone else already noted, since threads running on other processors are irrelevant; the branch predictor is part of each core, so if there are other processors, fewer of the total threads have to be handled by each core.

If one has a multithreaded core, then that core should have separate branch predictor states for each thread as well.

John Savard
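[A sketch of the fully private layout suggested above, in the same spirit as the earlier fragments (sizes invented): each hardware thread of a multithreaded core carries its own complete predictor context, so threads cannot disturb one another, at the cost of replicated storage - the opposite design point from the shared-structure idea earlier in the thread.]

#include <stdint.h>

#define HW_THREADS 2          /* e.g. a 2-way multithreaded core */
#define PHT_SIZE   2048

struct predictor_state {
    uint16_t ghr;             /* this thread's branch history */
    uint8_t  pht[PHT_SIZE];   /* this thread's 2-bit counters */
};

/* One complete, isolated predictor context per hardware thread;
 * lookups and updates touch only per_thread[tid]. */
static struct predictor_state per_thread[HW_THREADS];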