From: George Neuner on 10 Jul 2010 21:01

On Fri, 09 Jul 2010 22:03:31 +0200, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:

>MitchAlsup wrote:
>> But beyond this, the predictors are really swimming upstream against
>> the current of random placement of semi-random sized objects allocated
>> on a heap. Especially when subject to garbage collection.
>
>I think this is a key issue:
>
>Garbage collectors seem to me to be _the_ proper spot to implement data
>location optimizations:
>
>Since most garbage-collectable languages allow the collector to have
>full knowledge of all pointer data items, it should also be capable of
>detecting left/right pointers in a tree structure, right?
>
>If so, one possible approach would be to try to minimize the total
>distance covered by said pointers, measured as the offset between the
>address of the pointer itself and the object it points to.

Coalescing tree data structures during GC has been studied quite a bit. There seems to be consensus that, where possible, the best thing to do is to maintain the address order of object allocations, because that tends to be a good predictor of object lifetimes.

But maintaining object address ordering is relatively complex due to the need to sort addresses, so where it is infeasible, regrouping according to depth-first traversal has been shown to yield slightly better locality than breadth-first (at least for the test cases that have been studied).

Most copying collectors already do regroup data during collection. It is easier to maintain object allocation order in mark-compact collectors than in Cheney-style copiers, but Cheney's - which are implicitly breadth-first - can relatively easily incorporate secondary depth-first structure traversals.

>If we could also have hit counters of some form, it would be possible to
>cluster heavily-used objects together, but that would probably require
>either HW or virtual machine support.
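[Not part of the original thread: a minimal sketch of why a Cheney-style copier is "implicitly breadth-first", as George says above. The scan pointer chasing the free pointer through to-space acts as a FIFO queue. All names here are illustrative; real collectors work on raw addresses, not Python dicts.]

```python
def cheney_copy(roots):
    """Sketch of a Cheney-style semispace copy.

    Objects are dicts whose "children" entry lists other objects.
    Returns (new roots, to-space list in Cheney copy order).
    """
    tospace = []   # to-space; list index stands in for the address
    forward = {}   # id(old object) -> its copy (forwarding pointers)

    def evacuate(obj):
        # Copy an object into to-space the first time it is reached.
        if id(obj) not in forward:
            copy = dict(obj)
            forward[id(obj)] = copy
            tospace.append(copy)          # advances the "free" pointer
        return forward[id(obj)]

    new_roots = [evacuate(r) for r in roots]

    scan = 0                              # "scan" pointer chases "free"
    while scan < len(tospace):            # the implicit FIFO queue
        obj = tospace[scan]
        obj["children"] = [evacuate(c) for c in obj.get("children", [])]
        scan += 1
    return new_roots, tospace
```

Copying a small tree rooted at `a` with children `b, c` (and `b` holding `d, e`) lands the objects in to-space level by level - `a, b, c, d, e` - which is the breadth-first grouping the post contrasts with depth-first regrouping.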
>Marking pages as not present just in order to detect accesses is
>probably too heavy duty, right?
>
>It seems to me that this could tend to cluster useful data?

There have been a few studies on regrouping data dynamically according to mutator access patterns, but AFAIHS the extra complexity vs. just maintaining some simple traversal order isn't worthwhile.

This is the same problem as VMM page replacement. The ideal page policy is L)east L)ikely T)o B)e N)eeded - which is, in general, impossible to implement. L)east R)ecently U)sed is an achievable compromise, but LRU's approximation of LLTBN is highly dependent on the system's dynamic program mix. Similarly, you'd want to group clusters of data according to LLTBN ... but you can't, and LRU may not be a good approximation.

Grouping program objects by age is possible, though, and since the VMM already tracks resident page accesses, these can be taken into account. One problem with using VMM is that non-resident pages cannot be differentiated, because their age data is not retained. They can uniformly be considered "ancient", but this would be wrong for pages pushed out in thrashing due to an overloaded system.

Another issue is that many GC implementations already do not (and do not want to) use VMM mechanisms, due to high OS trap latencies, the coarse granularity of processing (relatively large) VM pages, and the difficulty of dealing with page-spanning objects incrementally. The result is that a lot of collectors are pure software implementations which don't rely much at all on VMM hardware.

Not to mention that GC itself, regardless of algorithm(s), is a fairly bad fit to LRU paging, because the collection process potentially may touch and drag in large amounts of non-resident data - displacing data actively in use.
There have been proposals for various extensions to VMM APIs to give processes more direct control over their own paging, but, of course, these may be at odds with the global policies the VMM system may be trying to implement.

>Terje

George
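[Not part of the original thread: the LRU policy George contrasts with LLTBN fits in a few lines. This toy page table is illustrative only, not any real VMM's implementation.]

```python
from collections import OrderedDict

class LRUPager:
    """Toy LRU page-replacement policy with a fixed number of frames."""

    def __init__(self, frames):
        self.frames = frames
        self.resident = OrderedDict()   # page -> None, oldest first
        self.faults = 0

    def access(self, page):
        """Touch a page; return the evicted page on a fault, else None."""
        if page in self.resident:
            self.resident.move_to_end(page)   # refresh recency on a hit
            return None
        self.faults += 1
        evicted = None
        if len(self.resident) >= self.frames:
            # Evict the least-recently-used page (front of the dict).
            evicted, _ = self.resident.popitem(last=False)
        self.resident[page] = None
        return evicted
```

LRU only sees the past, which is George's point: a program that cyclically scans one more page than there are frames faults on every single access under LRU, even though LLTBN would keep the working set resident.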
From: jacko on 11 Jul 2010 17:46

Yes, a reference-count GC that frees on a count of zero, decrementing the counts of everything the freed object referred to and looping back to the zero test on those. This will not chase as many pointers as a mark-release GC does, but it makes the list operations slightly more complicated. The question is what GC list-node turnover has to happen to make it the more efficient choice.
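[Not part of the original thread: jacko's description is terse, so this is one reading of it, with made-up names - free on zero, then decrement each referent and loop back to the zero test, so frees cascade without a mark phase.]

```python
class RCObject:
    """A heap object with a reference count and outgoing references."""
    def __init__(self):
        self.count = 0
        self.refs = []      # objects this object refers to
        self.freed = False

def add_ref(obj):
    obj.count += 1

def release(obj):
    # "on zero free and count-1 ... loop back to zero test":
    # an iterative cascade instead of recursion.
    worklist = [obj]
    while worklist:
        o = worklist.pop()
        o.count -= 1
        if o.count == 0 and not o.freed:
            o.freed = True
            worklist.extend(o.refs)   # decrement everything it referred to
    # Note: a cycle of objects referring to each other never reaches
    # count zero this way - the classic weakness vs. a tracing GC.
```

Releasing the last reference to the root of a small object graph frees the whole graph in one cascade, touching only the dead objects - unlike a mark phase, which must chase every live pointer.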
From: Andy Glew on 12 Jul 2010 00:25

On 7/10/2010 12:12 PM, Robert Myers wrote:
> nmm1(a)cam.ac.uk wrote:
>> The same is NOT true of branch and preload prediction. If you can
>> reduce the misprediction by only 5% (relatively), it isn't exciting
>> enough to change to an experimental design.
>>
>
> To quote the learned Andrew Glew, from this same thread in this same
> august forum:
>
> <quote>
>
> I warned you: at UWisc, I wanted to work on big instruction windows.
> But Jim Smith said "big instruction windows cannot be justified,
> because branch predictors are not good enough". Now, I knew something
> that Jim didn't: I knew that Willamette's branch predictors were
> significantly better than any that had been published in academia (I
> don't know the exact numbers off my head, but: if you go from 94% to
> 97%, you have doubled the instructions between mispredicted branches
> on average. Often more so.) Also, I knew that main memory latencies of
> 300 cycles were coming.
>
> </quote>
>
> I did that not to prove you wrong, Nick, but to show Andy that I am
> paying attention. ;-)

I think that Nick may have meant a 5% relative improvement in branch prediction: going, e.g., from 94%, improving by 5% of ((100%-94%) = 6%) = 0.30%, to a net branch prediction rate of 94.30%.

But in case anyone cares, even a 0.30% increase in performance is the sort of thing Intel and AMD would kill for now.

When I started as a computer architect, my boss said "don't waste time on anything that doesn't get a 2X improvement". Then 20%. Now, we are scraping the bottom of the barrel looking for 1%ers. That's why multicore is so attractive.

As y'all know, I think there are big improvements possible in single-thread CPU performance - 10%ers at least, and NXers in some areas.
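[Not part of the original thread: the arithmetic behind the two readings of "5%" above, following Andy's own numbers. With misprediction rate r, the average number of branches between mispredicts is 1/r, so halving r (94% -> 97% accuracy) doubles the run length, while a 5% *relative* reduction of r only nudges accuracy from 94% to 94.3%.]

```python
def branches_between_mispredicts(accuracy):
    """Mean number of branches between mispredictions at a given accuracy."""
    return 1.0 / (1.0 - accuracy)

# Willamette's jump: halving the miss rate doubles the run length.
run_94 = branches_between_mispredicts(0.94)   # ~16.7 branches
run_97 = branches_between_mispredicts(0.97)   # ~33.3 branches: doubled

# Nick's reading: a 5% relative cut of the 6% miss rate.
relative_5pct = 0.94 + 0.05 * (1.0 - 0.94)    # 94.3% accuracy
```

The same 1/r formula shows why each further point of accuracy gets harder to buy: going from 97% to 98.5% doubles the run length again, for half the absolute improvement.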
From: nmm1 on 12 Jul 2010 03:35
In article <4C3A995D.7040503(a)andy.glew.ca>, Andy Glew <giganews(a)andy.glew.ca> wrote:
>
>I think that Nick may have meant a 5% relative improvement in branch
>prediction, going, e.g., from 94%, improving 5% of ((100%-94%)=6%) =
>0.30%, to a net branch prediction rate of 94.30%.

I did.

>But in case anyone cares, even a 0.30% increase in performance is the
>sort of thing Intel and AMD would kill for now.
>
>When I started as a computer architect, my boss said "don't waste
>time on anything that doesn't get a 2X improvement". Then 20%. Now, we
>are scraping the bottom of the barrel looking for 1%ers.

That's bonkers. But you know that. What really baffles me is that I know there is a market for LESS performance - provided that it comes in a lot less power-hungry. Not a small market, either. And it could be done.

>That's why multicore is so attractive.
>
>As y'all know, I think there are big improvements possible in single
>thread CPU performance - 10%ers at least, and NXers in some areas.

And it's those areas that are interesting.

Regards,
Nick Maclaren.