From: Zhang, Yanmin on 2 Apr 2010 04:10

On Thu, 2010-04-01 at 10:53 -0500, Christoph Lameter wrote:
> On Thu, 1 Apr 2010, Zhang, Yanmin wrote:
>
> > I suspect the moving of cpu_slab's place in kmem_cache causes the new
> > cache misses. But when I move it to the tail of the structure, the kernel
> > always panics when booting. Perhaps there is another potential bug?
>
> Why would that cause an additional cache miss?
>
> The node array follows at the end of the structure. If you want to
> move it down then it needs to be placed before the node field.

Thanks. Moving cpu_slab to the tail doesn't improve it. I used perf to
collect statistics. Only the data cache misses show a small difference.

My testing command on my 2-socket machine:
#hackbench 100 process 20000

With 2.6.33 it takes about 96 seconds, while 2.6.34-rc2 (or the latest tip
tree) takes about 101 seconds.

perf shows some functions around SLUB have more cpu utilization, while some
other SLUB functions have less cpu utilization.
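[For reference, the layout under discussion looks roughly like the sketch
below. This is an abbreviated approximation of the 2.6.34-era struct
kmem_cache, not a verbatim copy of slub_def.h, and the allocation-size
detail in the comment is recalled from memory rather than quoted; it would
explain the boot panic when a field is moved past node[].]

/*
 * Abbreviated, approximate sketch of the 2.6.34-era struct kmem_cache
 * (many fields omitted, names from memory).
 */
struct kmem_cache {
	/*
	 * New in 2.6.34: a single dynamic percpu pointer replaces the old
	 * per-cache cpu_slab[NR_CPUS] array, and it moved to the front.
	 */
	struct kmem_cache_cpu *cpu_slab;

	unsigned long flags;
	int size;		/* Object size including metadata */
	int objsize;		/* Object size without metadata */
	int offset;		/* Free pointer offset */
	/* ... partial list handling, name, list_head, sysfs kobject ... */

#ifdef CONFIG_NUMA
	/*
	 * Must stay last: the cache structure is allocated only out to the
	 * populated portion of this array, roughly
	 *   offsetof(struct kmem_cache, node) +
	 *   nr_node_ids * sizeof(struct kmem_cache_node *),
	 * so a field placed after node[] would fall outside the allocation --
	 * a plausible cause for the boot-time panic mentioned above.
	 */
	struct kmem_cache_node *node[MAX_NUMNODES];
#endif
};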
From: Christoph Lameter on 5 Apr 2010 10:00

On Fri, 2 Apr 2010, Zhang, Yanmin wrote:
> My testing command on my 2-socket machine:
> #hackbench 100 process 20000
>
> With 2.6.33 it takes about 96 seconds, while 2.6.34-rc2 (or the latest tip
> tree) takes about 101 seconds.
>
> perf shows some functions around SLUB have more cpu utilization, while some
> other SLUB functions have less cpu utilization.

Hmmm... The dynamic percpu areas use page tables and that data is used in
the fast path. Maybe the high thread count causes TLB thrashing?
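[The speculation refers to how the fast path now reaches its per-CPU state.
The sketch below contrasts the two schemes; it is a simplification from
memory, not verbatim kernel code. The _v33/_v34 struct names are hypothetical
labels for the two kernel versions, and the mapping/TLB remarks in the
comments are the assumption being discussed, not an established finding.]

/* 2.6.33-style: an array of per-CPU pointers embedded in kmem_cache itself.
 * The kmem_cache_cpu structures live in ordinary kernel memory (linear
 * mapping, typically covered by large-page TLB entries). */
struct kmem_cache_v33 {
	/* ... */
	struct kmem_cache_cpu *cpu_slab[NR_CPUS];	/* embedded pointer array */
};

static struct kmem_cache_cpu *get_cpu_slab_v33(struct kmem_cache_v33 *s, int cpu)
{
	return s->cpu_slab[cpu];
}

/* 2.6.34-style: a single dynamically allocated percpu pointer.
 * __this_cpu_ptr() adds the running CPU's percpu offset to the stored
 * "cookie".  Dynamic percpu chunks beyond the embedded first chunk are
 * vmalloc-style mappings backed by 4K page-table entries, so the fast path
 * may need extra dTLB entries -- the TLB-thrashing concern raised above. */
struct kmem_cache_v34 {
	struct kmem_cache_cpu *cpu_slab;		/* dynamic percpu pointer */
	/* ... */
};

static struct kmem_cache_cpu *get_cpu_slab_v34(struct kmem_cache_v34 *s)
{
	return __this_cpu_ptr(s->cpu_slab);
}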
From: Pekka Enberg on 5 Apr 2010 13:40

(I'm CC'ing Tejun)

On Mon, Apr 5, 2010 at 4:54 PM, Christoph Lameter <cl(a)linux-foundation.org> wrote:
> On Fri, 2 Apr 2010, Zhang, Yanmin wrote:
>
>> My testing command on my 2-socket machine:
>> #hackbench 100 process 20000
>>
>> With 2.6.33 it takes about 96 seconds, while 2.6.34-rc2 (or the latest tip
>> tree) takes about 101 seconds.
>>
>> perf shows some functions around SLUB have more cpu utilization, while some
>> other SLUB functions have less cpu utilization.
>
> Hmmm... The dynamic percpu areas use page tables and that data is used
> in the fast path. Maybe the high thread count causes TLB thrashing?

Hmm, indeed. I don't see anything particularly funny in the SLUB percpu
conversion, so maybe this is more of an issue with the new percpu allocator?
From: Christoph Lameter on 6 Apr 2010 11:50

On Tue, 6 Apr 2010, Zhang, Yanmin wrote:
> Thanks. I tried 2 and 4 times and didn't see much improvement.
> I checked /proc/vmallocinfo and it has no pcpu_get_vm_areas entry
> when I use 4 times the PERCPU_DYNAMIC_RESERVE.
> I used perf to collect dtlb misses and LLC misses. The dtlb miss data is
> not stable. Sometimes we have more dtlb misses but get a better result.
>
> The LLC miss data is more stable. Only LLC-load-misses is a clear signal
> now. LLC-store-misses shows no big difference.

What condition exactly does LLC-load-miss correspond to?

The cacheline environment in the hot path should only include the following
cache lines (without debugging and counters):

1. The first cacheline of the kmem_cache structure.

   (This is different from the situation before the 2.6.34 changes. Earlier,
   some critical values (object length etc.) were available from the
   kmem_cache_cpu structure. The cacheline containing the percpu structure
   array was needed to determine the kmem_cache_cpu address!)

2. The first cacheline of kmem_cache_cpu.

3. The first cacheline of the data object (free pointer).

And in the case of a kfree/kmem_cache_free:

4. The cacheline that contains the page struct of the page the object
   resides in.

Can you post the .config you are using and the bootup messages?
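[For readers without the tree at hand, the fast path being enumerated looks
roughly like the sketch below. It is paraphrased from memory of the
2.6.34-era mm/slub.c -- irq handling, debug hooks and statistics are
stripped, and helper names such as __this_cpu_ptr(), node_match() and
get_freepointer() are era-approximate -- so treat it as an illustration,
not the literal code.]

/*
 * Approximate shape of the 2.6.34 SLUB allocation fast path, annotated with
 * the cache lines listed above.
 */
static __always_inline void *slab_alloc(struct kmem_cache *s, gfp_t gfpflags,
					int node, unsigned long addr)
{
	void **object;
	struct kmem_cache_cpu *c;

	/* (1) first cacheline of kmem_cache: read the percpu cookie cpu_slab
	 *     (and later s->offset via get_freepointer()) */
	c = __this_cpu_ptr(s->cpu_slab);

	/* (2) first cacheline of kmem_cache_cpu: the per-CPU freelist */
	object = c->freelist;

	if (unlikely(!object || !node_match(c, node)))
		object = __slab_alloc(s, gfpflags, node, addr, c);  /* slow path */
	else
		/* (3) first cacheline of the object: follow its free pointer */
		c->freelist = get_freepointer(s, object);

	return object;
}

/*
 * kfree()/kmem_cache_free() additionally does virt_to_head_page(object),
 * i.e. (4) it touches the cacheline holding the object's struct page.
 */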
From: Christoph Lameter on 6 Apr 2010 17:00

We cannot reproduce the issue here. Our tests here (dual quad-core Dell)
show a performance increase in hackbench instead.

Linux 2.6.33.2 #2 SMP Mon Apr 5 11:30:56 CDT 2010 x86_64 GNU/Linux

../hackbench 100 process 200000
Running with 100*40 (== 4000) tasks.
Time: 3102.142
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 308.731
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 311.591
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 310.200
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 38.048
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 44.711
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 39.407
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 9.411
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.765
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.822

Linux 2.6.34-rc3 #1 SMP Tue Apr 6 13:30:34 CDT 2010 x86_64 GNU/Linux

../hackbench 100 process 200000
Running with 100*40 (== 4000) tasks.
Time: 3003.578
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 300.289
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 301.462
../hackbench 100 process 20000
Running with 100*40 (== 4000) tasks.
Time: 301.173
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 41.191
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 41.964
../hackbench 10 process 20000
Running with 10*40 (== 400) tasks.
Time: 41.470
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.829
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 9.166
../hackbench 1 process 20000
Running with 1*40 (== 40) tasks.
Time: 8.681