From: Eric Dumazet on 8 Apr 2010 03:10

On Thursday, 08 April 2010 at 07:39 +0200, Eric Dumazet wrote:
> I suspect NUMA is completely out of order on current kernel, or my
> Nehalem machine NUMA support is a joke
>
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 size: 3071 MB
> node 0 free: 2637 MB
> node 1 size: 3062 MB
> node 1 free: 2909 MB
>
> # cat try.sh
> hackbench 50 process 5000
> numactl --cpubind=0 --membind=0 hackbench 25 process 5000 >RES0 &
> numactl --cpubind=1 --membind=1 hackbench 25 process 5000 >RES1 &
> wait
> echo node0 results
> cat RES0
> echo node1 results
> cat RES1
>
> numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
> numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
> wait
> echo node0 on mem1 results
> cat RES0_1
> echo node1 on mem0 results
> cat RES1_0
>
> # ./try.sh
> Running with 50*40 (== 2000) tasks.
> Time: 16.865
> node0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.767
> node1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.564
> node0 on mem1 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.814
> node1 on mem0 results
> Running with 25*40 (== 1000) tasks.
> Time: 16.896

If run individually, the test results are more what we would expect
(slow), but if the machine runs the two sets of processes concurrently,
each group runs much faster...

# numactl --cpubind=0 --membind=1 hackbench 25 process 5000
Running with 25*40 (== 1000) tasks.
Time: 21.810
# numactl --cpubind=1 --membind=0 hackbench 25 process 5000
Running with 25*40 (== 1000) tasks.
Time: 20.679

# numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1 &
[1] 9177
# numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0 &
[2] 9196
# wait
[1]- Done numactl --cpubind=0 --membind=1 hackbench 25 process 5000 >RES0_1
[2]+ Done numactl --cpubind=1 --membind=0 hackbench 25 process 5000 >RES1_0
# echo node0 on mem1 results
node0 on mem1 results
# cat RES0_1
Running with 25*40 (== 1000) tasks.
Time: 13.818
# echo node1 on mem0 results
node1 on mem0 results
# cat RES1_0
Running with 25*40 (== 1000) tasks.
Time: 11.633

Oh well...
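Eric's script drives the placement through the numactl CLI; the same
cpubind/membind pairing can also be expressed directly against libnuma.
The following is a minimal sketch (not from the thread) of the
cross-node case, assuming libnuma and its headers are installed; the
64 MB buffer is an arbitrary illustrative size. Build with:
gcc -O2 pin.c -lnuma

/* Pin execution to node 0 and memory to node 1, mirroring
 * "numactl --cpubind=0 --membind=1". Illustrative sketch only. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct bitmask *mem;
	size_t len = 64UL << 20;	/* 64 MB, arbitrary */
	char *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not available\n");
		return 1;
	}
	if (numa_run_on_node(0) < 0)	/* run on CPUs of node 0 only */
		perror("numa_run_on_node");

	mem = numa_allocate_nodemask();
	numa_bitmask_setbit(mem, 1);
	numa_set_membind(mem);		/* allocate from node 1 only */
	numa_bitmask_free(mem);

	buf = numa_alloc_onnode(len, 1);
	if (!buf)
		return 1;
	memset(buf, 0, len);		/* fault pages in on the remote node */
	printf("running on node 0, buffer resident on node 1\n");
	numa_free(buf, len);
	return 0;
}

Running numastat before and after such a binding is a quick way to
confirm the memory actually landed on the intended node.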
From: David Miller on 8 Apr 2010 03:10

From: Eric Dumazet <eric.dumazet(a)gmail.com>
Date: Thu, 08 Apr 2010 09:00:19 +0200

> If run individually, the test results are more what we would expect
> (slow), but if the machine runs the two sets of processes concurrently,
> each group runs much faster...

BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
that loopback TCP packets get fully checksum validated on receive.

I'm trying to figure out why skb->ip_summed ends up being
CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
CHECKSUM_PARTIAL in tcp_sendmsg().

I wonder how much this accounts for some of the hackbench
oddities... and other regressions in loopback tests we've seen. :-)

Just FYI...
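For context on the states David names: CHECKSUM_PARTIAL marks a packet
whose checksum the sender deferred (normally to hardware),
CHECKSUM_UNNECESSARY marks one the receive path may trust without
verification, and CHECKSUM_NONE buys a full software verification pass.
The constants below mirror the 2.6.33-era skbuff.h values, but this is
a simplified userspace model of the receive-side test, not kernel
source:

#include <stdio.h>

/* Stand-ins for the kernel's skb->ip_summed values (2.6.33 era). */
enum ip_summed {
	CHECKSUM_NONE = 0,	/* no prior validation: verify in software */
	CHECKSUM_UNNECESSARY = 1, /* already known good: skip the pass */
	CHECKSUM_COMPLETE = 2,	/* device supplied a full checksum */
	CHECKSUM_PARTIAL = 3,	/* sender deferred the csum (tx-side state) */
};

/* Models skb_csum_unnecessary(): the UNNECESSARY bit is also set in
 * PARTIAL, so a loopback packet that kept PARTIAL all the way to
 * tcp_v4_rcv() would skip validation; demoted to NONE, it cannot. */
static int csum_unnecessary(enum ip_summed s)
{
	return s & CHECKSUM_UNNECESSARY;
}

int main(void)
{
	printf("PARTIAL at rcv -> full verify: %s\n",
	       csum_unnecessary(CHECKSUM_PARTIAL) ? "no" : "yes");
	printf("NONE at rcv    -> full verify: %s\n",
	       csum_unnecessary(CHECKSUM_NONE) ? "no" : "yes");
	return 0;
}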
From: Zhang, Yanmin on 8 Apr 2010 03:20

On Wed, 2010-04-07 at 11:43 -0500, Christoph Lameter wrote:
> On Wed, 7 Apr 2010, Zhang, Yanmin wrote:
>
> > I collected retired instruction, dtlb miss and LLC miss.
> > Below is data of LLC miss.
> >
> > Kernel 2.6.33:
> > 20.94% hackbench [kernel.kallsyms] [k] copy_user_generic_string
> > 14.56% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> > 12.88% hackbench [kernel.kallsyms] [k] kfree
> > 7.37% hackbench [kernel.kallsyms] [k] kmem_cache_free
> > 7.18% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> > 6.78% hackbench [kernel.kallsyms] [k] kfree_skb
> > 6.27% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_caller
> > 2.73% hackbench [kernel.kallsyms] [k] __slab_free
> > 2.21% hackbench [kernel.kallsyms] [k] get_partial_node
> > 2.01% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> > 1.59% hackbench [kernel.kallsyms] [k] schedule
> > 1.27% hackbench hackbench [.] receiver
> > 0.99% hackbench libpthread-2.9.so [.] __read
> > 0.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg
> >
> > Kernel 2.6.34-rc3:
> > 18.55% hackbench [kernel.kallsyms] [k] copy_user_generic_string
> > 13.19% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg
> > 11.62% hackbench [kernel.kallsyms] [k] kfree
> > 8.54% hackbench [kernel.kallsyms] [k] kmem_cache_free
> > 7.88% hackbench [kernel.kallsyms] [k] __kmalloc_node_track_caller
>
> Seems that the overhead of __kmalloc_node_track_caller was increased. The
> function inlines slab_alloc().
>
> > 6.54% hackbench [kernel.kallsyms] [k] kmem_cache_alloc_node
> > 5.94% hackbench [kernel.kallsyms] [k] kfree_skb
> > 3.48% hackbench [kernel.kallsyms] [k] __slab_free
> > 2.15% hackbench [kernel.kallsyms] [k] _raw_spin_lock
> > 1.83% hackbench [kernel.kallsyms] [k] schedule
> > 1.82% hackbench [kernel.kallsyms] [k] get_partial_node
> > 1.59% hackbench hackbench [.] receiver
> > 1.37% hackbench libpthread-2.9.so [.] __read
>
> I wonder if this is not related to the kmem_cache_cpu structure straddling
> cache line boundaries under some conditions. On 2.6.33 the kmem_cache_cpu
> structure was larger and therefore tight packing resulted in different
> alignment.
>
> Could you see how the following patch affects the results. It attempts to
> increase the size of kmem_cache_cpu to a power of 2 bytes. There is also
> the potential that other per cpu fetches to neighboring objects affect the
> situation. We could cacheline align the whole thing.

I tested the patch against 2.6.33+9dfc6e68bfe6e and it seems it doesn't
help. I dumped percpu allocation info when booting the kernel and didn't
find any clear sign.
>
> ---
> include/linux/slub_def.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> Index: linux-2.6/include/linux/slub_def.h
> ===================================================================
> --- linux-2.6.orig/include/linux/slub_def.h	2010-04-07 11:33:50.000000000 -0500
> +++ linux-2.6/include/linux/slub_def.h	2010-04-07 11:35:18.000000000 -0500
> @@ -38,6 +38,11 @@ struct kmem_cache_cpu {
>  	void **freelist;	/* Pointer to first free per cpu object */
>  	struct page *page;	/* The slab from which we are allocating */
>  	int node;		/* The node of the page (or -1 for debug) */
> +#ifndef CONFIG_64BIT
> +	int dummy1;
> +#endif
> +	unsigned long dummy2;
> +
>  #ifdef CONFIG_SLUB_STATS
>  	unsigned stat[NR_SLUB_STAT_ITEMS];
>  #endif
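The idea behind the padding can be checked outside the kernel: with the
dummy fields, sizeof(struct kmem_cache_cpu) on a 64-bit build lands on
32 bytes, a power of two, so tightly packed per-cpu copies never
straddle a cache line. A userspace sketch follows; the field types are
stand-ins (void * replaces struct page *), so this is an approximation
of the layout, not the kernel build:

#include <stdio.h>

/* Stand-in for the patched struct kmem_cache_cpu,
 * !CONFIG_SLUB_STATS case. */
struct kmem_cache_cpu_padded {
	void **freelist;	/* pointer to first free per-cpu object */
	void *page;		/* the slab we are allocating from */
	int node;		/* node of the page (or -1 for debug) */
#ifndef __LP64__
	int dummy1;		/* the patch's extra filler on 32-bit */
#endif
	unsigned long dummy2;	/* pads the total toward a power of two */
};

int main(void)
{
	unsigned long sz = sizeof(struct kmem_cache_cpu_padded);

	/* On LP64 this prints 32 bytes: 8 + 8 + 4 (+4 padding) + 8. */
	printf("sizeof = %lu bytes, power of two: %s\n",
	       sz, sz && (sz & (sz - 1)) == 0 ? "yes" : "no");
	return 0;
}

Christoph's alternative, cacheline-aligning the whole thing, would mean
annotating the struct (the kernel's ____cacheline_aligned_in_smp)
rather than hand-padding it to a power-of-two size.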
From: Eric Dumazet on 8 Apr 2010 03:30

On Thursday, 08 April 2010 at 00:05 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet(a)gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
>
> > If run individually, the test results are more what we would expect
> > (slow), but if the machine runs the two sets of processes concurrently,
> > each group runs much faster...
>
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
>
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().
>
> I wonder how much this accounts for some of the hackbench
> oddities... and other regressions in loopback tests we've seen.
> :-)
>
> Just FYI...

Thanks! But hackbench is an af_unix benchmark, so loopback stuff is not
used that much :)
From: David Miller on 8 Apr 2010 03:30
From: David Miller <davem(a)davemloft.net>
Date: Thu, 08 Apr 2010 00:05:57 -0700 (PDT)

> From: Eric Dumazet <eric.dumazet(a)gmail.com>
> Date: Thu, 08 Apr 2010 09:00:19 +0200
>
>> If run individually, the test results are more what we would expect
>> (slow), but if the machine runs the two sets of processes concurrently,
>> each group runs much faster...
>
> BTW, I just discovered (thanks to the function graph tracer, woo hoo!)
> that loopback TCP packets get fully checksum validated on receive.
>
> I'm trying to figure out why skb->ip_summed ends up being
> CHECKSUM_NONE in tcp_v4_rcv() even though it gets set to
> CHECKSUM_PARTIAL in tcp_sendmsg().

Ok, it looks like it's only ACK packets that have this problem, but
still :-)

It's weird that we have a special ip_dev_loopback_xmit() for
ip_mc_output() NF_HOOK()s, which forces skb->ip_summed to
CHECKSUM_UNNECESSARY, but the actual normal loopback xmit doesn't
do that...
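The asymmetry David points out can be sketched the same way: the
multicast loopback helper stamps the skb as already validated before
re-injecting it, while the regular loopback transmit leaves ip_summed
untouched. A toy model of the two paths — illustrative only, not the
kernel's code:

#include <stdio.h>

enum ip_summed { CS_NONE = 0, CS_UNNECESSARY = 1, CS_PARTIAL = 3 };

struct toy_skb { enum ip_summed ip_summed; };

/* Models ip_dev_loopback_xmit(): the multicast loopback path marks
 * the packet as already validated before handing it back to rx. */
static void mc_loopback_xmit(struct toy_skb *skb)
{
	skb->ip_summed = CS_UNNECESSARY;
}

/* Models plain loopback xmit: ip_summed passes through untouched, so
 * whatever the stack demoted it to is what tcp_v4_rcv() sees. */
static void plain_loopback_xmit(struct toy_skb *skb)
{
	(void)skb;
}

int main(void)
{
	struct toy_skb a = { CS_NONE }, b = { CS_NONE };

	mc_loopback_xmit(&a);
	plain_loopback_xmit(&b);
	printf("mc path csum on rcv:    %s\n",
	       (a.ip_summed & CS_UNNECESSARY) ? "skipped" : "full pass");
	printf("plain path csum on rcv: %s\n",
	       (b.ip_summed & CS_UNNECESSARY) ? "skipped" : "full pass");
	return 0;
}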