From: Andy Glew on 13 Jul 2010 22:13

On 7/13/2010 2:46 PM, Morten Reistad wrote:
> In article <4f8hg7-mcg2.ln1(a)ntp.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>> Brett Davis wrote:
>>> In article <3l9eg7-j98.ln1(a)laptop.reistad.name>,
>>> Morten Reistad <first(a)last.name> wrote:
>>>> Seems to give large speedups to some fundamental algorithms.
>>>
>>> "You're Doing It Wrong"
>>> http://queue.acm.org/detail.cfm?id=1814327
>>
> The trick is to be choosy about which locality is important. Suddenly
> a vertical vs. horizontal organisation of B-trees becomes very
> important for performance.
>
>>> Ten times faster than a B-tree, with the B-heap.
>>
>> The problem here is that phk had two sorts of cached data: web
>> articles with images, and the index, and they were both allowed to
>> fight for the same limited resource (real DRAM pages).
>
> This isn't so much about RAM, it is about cache. We see this effect
> creep in all over the place now.

I could not help but wonder if the B-heap speedups might also be in
some part due to TLB locality.
From: Terje Mathisen <"terje.mathisen at tmsw.no"> on 14 Jul 2010 03:41

Morten Reistad wrote:
> In article <4f8hg7-mcg2.ln1(a)ntp.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>> It still makes sense to pack the index so that all the higher levels
>> will fit in the lower cache levels.
>
> This isn't so much about RAM, it is about cache. We see this effect
> creep in all over the place now.

Isn't that what I wrote in the previous paragraph?

Anyway, CPU cache is very important (witness my .sig), but in this
particular case phk was specifically targeting RAM used as a disk
cache: the same kind of problem, ~3 orders of magnitude slower.

> And the actual cache effects can be pretty surprising at times.
>
> Watch an 8-processor (@2.4 GHz) HP Xeon with 4 GB RAM outperform a
> 36-processor (@2.6 GHz) (*) Sun with 40 GB RAM by a factor of 6.
>
> It was all in the L2/L3 cache design.

Fun stuff!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Piotr Wyderski on 14 Jul 2010 06:37

Morten Reistad wrote:
> Watch an 8-processor (@2.4 GHz) HP Xeon with 4 GB RAM outperform a
> 36-processor (@2.6 GHz) (*) Sun with 40 GB RAM by a factor of 6.

I can confirm your observations, as we have encountered exactly the
same issue.

> It was all in the L2/L3 cache design.

Not only that: SPARCs also have much worse memory disambiguation
mechanisms. Even a loop as simple as this shows stunning differences:

    for (volatile unsigned int i = 0; i != 1000000000; ++i);

Investigation at the assembler level has shown that the SPARC is good
(i.e. Xeon-like) at loads and stores, but when there is a long
store-load dependence, the performance loss is a total disaster, to
put it mildly. Xeons do not exhibit this kind of problem.

Best regards
Piotr Wyderski
From: Morten Reistad on 14 Jul 2010 07:04

In article <4C3D1CB0.8000704(a)andy.glew.ca>,
Andy Glew <giganews(a)andy.glew.ca> wrote:
>On 7/13/2010 2:46 PM, Morten Reistad wrote:
>> In article <4f8hg7-mcg2.ln1(a)ntp.tmsw.no>,
>> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>> Brett Davis wrote:
>>>> In article <3l9eg7-j98.ln1(a)laptop.reistad.name>,
>>>> Morten Reistad <first(a)last.name> wrote:
>>>>> Seems to give large speedups to some fundamental algorithms.
>>>>
>>>> "You're Doing It Wrong"
>>>> http://queue.acm.org/detail.cfm?id=1814327
>>>
>> The trick is to be choosy about which locality is important. Suddenly
>> a vertical vs. horizontal organisation of B-trees becomes very
>> important for performance.
>>
>>>> Ten times faster than a B-tree, with the B-heap.
>>>
>>> The problem here is that phk had two sorts of cached data: web
>>> articles with images, and the index, and they were both allowed to
>>> fight for the same limited resource (real DRAM pages).
>>
>> This isn't so much about RAM, it is about cache. We see this effect
>> creep in all over the place now.
>
>I could not help but wonder if the B-heap speedups might also be in
>some part due to TLB locality.

That too. But the cache seems to be the really important bit, including
cache snoops into the caches of other processors.

The cache counters are accessible through a loadable module starting in
Linux 2.6.31, so you can have a look at what the MMU is actually doing.
It is educational to see the cache activity.

-- mrr
From: MitchAlsup on 14 Jul 2010 12:59
On Jul 14, 5:37 am, "Piotr Wyderski"
<piotr.wyder...(a)mothers.against.spam.gmail.com> wrote:
> Not only that: SPARCs also have much worse memory disambiguation
> mechanisms. Even a loop as simple as this shows stunning differences:
>
> for (volatile unsigned int i = 0; i != 1000000000; ++i);
>
> Investigation at the assembler level has shown that the SPARC is good
> (i.e. Xeon-like) at loads and stores, but when there is a long
> store-load dependence, the performance loss is a total disaster, to
> put it mildly. Xeons do not exhibit this kind of problem.

The x86 grew up in an environment where stores were reloaded in a
rather short amount of time (arguments get pushed, then accessed a
couple of cycles later in the called subroutine). Thus, there are
mechanisms to forward store data to loads when the addresses match,
even when the store instruction has not yet retired. This normally goes
under the moniker of store-to-load forwarding (STLF).

Mitch