From: Michael Buesch on 7 Sep 2009 15:00

On Monday 07 September 2009 20:26:29 Ingo Molnar wrote:
> Could you profile it please? Also, what's the context-switch rate?

As far as I can tell, the broadcom mips architecture does not have
profiling support. It only has some proprietary profiling registers
that nobody has written kernel support for yet.

--
Greetings, Michael.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Daniel Walker on 7 Sep 2009 14:50

On Mon, 2009-09-07 at 20:26 +0200, Ingo Molnar wrote:
> That's interesting. I tried to reproduce it on x86, but the profile
> does not show any scheduler overhead at all on the server:

If the scheduler isn't running the task that causes the lower
throughput, would that even show up in the profiling output?

Daniel
From: Jens Axboe on 7 Sep 2009 14:50

On Mon, Sep 07 2009, Avi Kivity wrote:
> On 09/07/2009 12:49 PM, Jens Axboe wrote:
>>
>> I ran a simple test as well, since I was curious to see how it performed
>> wrt interactiveness. One of my pet peeves with the current scheduler is
>> that I have to nice compile jobs, or my X experience is just awful while
>> the compile is running.
>
> I think the problem is that CFS is optimizing for the wrong thing. It's
> trying to be fair to tasks, but these are meaningless building blocks of
> jobs, which is what the user sees and measures. Your make -j128
> dominates your interactive task by two orders of magnitude. If the
> scheduler attempts to bridge this gap using heuristics, it will fail
> badly when it misdetects, since it will starve the really important
> 100-thread job for a task that was misdetected as interactive.

Agreed. I was actually looking into doing joint latency for X number of
tasks for the test app. I'll try to do that and see if we can detect
something from that.

--
Jens Axboe
From: Ingo Molnar on 7 Sep 2009 14:30

* Michael Buesch <mb(a)bu3sch.de> wrote:

> Here's a very simple test setup on an embedded singlecore bcm47xx
> machine (WL500GPv2). It uses iperf for performance testing. The
> iperf server is run on the embedded device. The device is so slow
> that the iperf test is completely CPU bound. The network
> connection is a 100MBit on the device connected via patch cable to
> a 1000MBit machine.
>
> The kernel is openwrt-2.6.30.5.
>
> Here are the results:
>
> Mainline CFS scheduler:
>
> mb(a)homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 192.168.1.99 port 35793 connected with 192.168.1.1 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  27.4 MBytes  23.0 Mbits/sec
> mb(a)homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 192.168.1.99 port 35794 connected with 192.168.1.1 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  27.3 MBytes  22.9 Mbits/sec
> mb(a)homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 192.168.1.99 port 56147 connected with 192.168.1.1 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  27.3 MBytes  22.9 Mbits/sec
>
> BFS scheduler:
>
> mb(a)homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 192.168.1.99 port 52489 connected with 192.168.1.1 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  38.2 MBytes  32.0 Mbits/sec
> mb(a)homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 192.168.1.99 port 52490 connected with 192.168.1.1 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  38.1 MBytes  31.9 Mbits/sec
> mb(a)homer:~$ iperf -c 192.168.1.1
> ------------------------------------------------------------
> Client connecting to 192.168.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 192.168.1.99 port 52491 connected with 192.168.1.1 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  38.1 MBytes  31.9 Mbits/sec

That's interesting. I tried to reproduce it on x86, but the profile
does not show any scheduler overhead at all on the server:

$ perf report
#
# Samples: 8369
#
# Overhead  Symbol
# ........  ......
#
     9.20%  [k] copy_user_generic_string
     3.80%  [k] e1000_clean
     3.58%  [k] ipt_do_table
     2.72%  [k] mwait_idle
     2.68%  [k] nf_iterate
     2.28%  [k] e1000_intr
     2.15%  [k] tcp_packet
     2.10%  [k] __hash_conntrack
     1.59%  [k] read_tsc
     1.52%  [k] _local_bh_enable_ip
     1.34%  [k] eth_type_trans
     1.29%  [k] __alloc_skb
     1.19%  [k] tcp_recvmsg
     1.19%  [k] ip_rcv
     1.17%  [k] e1000_clean_rx_irq
     1.12%  [k] apic_timer_interrupt
     0.99%  [k] vsnprintf
     0.96%  [k] nf_conntrack_in
     0.96%  [k] kmem_cache_free
     0.93%  [k] __kmalloc_track_caller

Could you profile it please? Also, what's the context-switch rate?

Below is the call-graph profile as well - all the overhead is in
networking and SLAB.

	Ingo

$ perf report --call-graph fractal,5
#
# Samples: 8947
#
# Overhead  Command  Shared Object  Symbol
# ........
.............. ............................. ...... # 9.06% iperf [kernel] [k] copy_user_generic_string | |--98.89%-- skb_copy_datagram_iovec | | | |--77.18%-- tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --22.82%-- tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --1.11%-- system_call_fastpath __GI___libc_nanosleep 3.62% [init] [kernel] [k] e1000_clean 2.96% [init] [kernel] [k] ipt_do_table 2.79% [init] [kernel] [k] mwait_idle 2.22% [init] [kernel] [k] e1000_intr 1.93% [init] [kernel] [k] nf_iterate 1.65% [init] [kernel] [k] __hash_conntrack 1.52% [init] [kernel] [k] tcp_packet 1.29% [init] [kernel] [k] ip_rcv 1.18% [init] [kernel] [k] __alloc_skb 1.15% iperf [kernel] [k] tcp_recvmsg 1.04% [init] [kernel] [k] _local_bh_enable_ip 1.02% [init] [kernel] [k] apic_timer_interrupt 1.02% [init] [kernel] [k] eth_type_trans 1.01% [init] [kernel] [k] tcp_v4_rcv 0.96% iperf [kernel] [k] kfree | |--95.35%-- skb_release_data | __kfree_skb | | | |--79.27%-- tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --20.73%-- tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --4.65%-- __kfree_skb | |--75.00%-- tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --25.00%-- tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv 0.96% [init] [kernel] [k] read_tsc 0.92% iperf [kernel] [k] tcp_v4_do_rcv | |--95.12%-- tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | 
sys_recvfrom | system_call_fastpath | __recv | --4.88%-- tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv 0.92% [init] [kernel] [k] e1000_clean_rx_irq 0.86% iperf [kernel] [k] tcp_rcv_established | |--96.10%-- tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --3.90%-- tcp_prequeue_process tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv 0.84% iperf [kernel] [k] kmem_cache_free | |--93.33%-- __kfree_skb | | | |--71.43%-- tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --28.57%-- tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | |--4.00%-- tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --2.67%-- tcp_rcv_established tcp_v4_do_rcv tcp_prequeue_process tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv 0.80% [init] [kernel] [k] netif_receive_skb 0.79% iperf [kernel] [k] tcp_event_data_recv | |--83.10%-- tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | |--12.68%-- tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --4.23%-- tcp_data_queue tcp_rcv_established tcp_v4_do_rcv tcp_prequeue_process tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv 0.67% perf [kernel] [k] format_decode | |--91.67%-- vsnprintf | seq_printf | | | |--67.27%-- show_map_vma | | show_map | | 
seq_read | | vfs_read | | sys_read | | system_call_fastpath | | __GI_read | | | |--23.64%-- render_sigset_t | | proc_pid_status | | proc_single_show | | seq_read | | vfs_read | | sys_read | | system_call_fastpath | | __GI_read | | | |--7.27%-- proc_pid_status | | proc_single_show | | seq_read | | vfs_read | | sys_read | | system_call_fastpath | | __GI_read | | | --1.82%-- cpuset_task_status_allowed | proc_pid_status | proc_single_show | seq_read | vfs_read | sys_read | system_call_fastpath | __GI_read | --8.33%-- seq_printf | |--60.00%-- proc_pid_status | proc_single_show | seq_read | vfs_read | sys_read | system_call_fastpath | __GI_read | --40.00%-- show_map_vma show_map seq_read vfs_read sys_read system_call_fastpath __GI_read 0.65% [init] [kernel] [k] __kmalloc_track_caller 0.63% [init] [kernel] [k] nf_conntrack_in 0.63% [init] [kernel] [k] ip_route_input 0.58% perf [kernel] [k] vsnprintf | |--98.08%-- seq_printf | | | |--60.78%-- show_map_vma | | show_map | | seq_read | | vfs_read | | sys_read | | system_call_fastpath | | __GI_read | | | |--19.61%-- render_sigset_t | | proc_pid_status | | proc_single_show | | seq_read | | vfs_read | | sys_read | | system_call_fastpath | | __GI_read | | | |--9.80%-- proc_pid_status | | proc_single_show | | seq_read | | vfs_read | | sys_read | | system_call_fastpath | | __GI_read | | | |--3.92%-- task_mem | | proc_pid_status | | proc_single_show | | seq_read | | vfs_read | | sys_read | | system_call_fastpath | | __GI_read | | | |--3.92%-- cpuset_task_status_allowed | | proc_pid_status | | proc_single_show | | seq_read | | vfs_read | | sys_read | | system_call_fastpath | | __GI_read | | | --1.96%-- render_cap_t | proc_pid_status | proc_single_show | seq_read | vfs_read | sys_read | system_call_fastpath | __GI_read | --1.92%-- snprintf proc_task_readdir vfs_readdir sys_getdents system_call_fastpath __getdents64 0x69706565000a3430 0.57% [init] [kernel] [k] ktime_get 0.57% [init] [kernel] [k] nf_nat_fn 0.56% iperf [kernel] [k] 
tcp_packet | |--68.00%-- __tcp_ack_snd_check | tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --32.00%-- tcp_cleanup_rbuf tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv 0.56% iperf /usr/bin/iperf [.] 0x000000000059f8 | |--8.00%-- 0x4059f8 | |--8.00%-- 0x405a16 | |--8.00%-- 0x4059fd | |--4.00%-- 0x409d22 | |--4.00%-- 0x405871 | |--4.00%-- 0x406ee1 | |--4.00%-- 0x405726 | |--4.00%-- 0x4058db | |--4.00%-- 0x406ee8 | |--2.00%-- 0x405b60 | |--2.00%-- 0x4058fd | |--2.00%-- 0x4058d5 | |--2.00%-- 0x405490 | |--2.00%-- 0x4058bb | |--2.00%-- 0x405b93 | |--2.00%-- 0x405b8e | |--2.00%-- 0x405903 | |--2.00%-- 0x405ba8 | |--2.00%-- 0x406eae | |--2.00%-- 0x405545 | |--2.00%-- 0x405870 | |--2.00%-- 0x405b67 | |--2.00%-- 0x4058ce | |--2.00%-- 0x40570e | |--2.00%-- 0x406ee4 | |--2.00%-- 0x405a02 | |--2.00%-- 0x406eec | |--2.00%-- 0x405b82 | |--2.00%-- 0x40556a | |--2.00%-- 0x405755 | |--2.00%-- 0x405a0a | |--2.00%-- 0x405498 | |--2.00%-- 0x409d20 | |--2.00%-- 0x405b21 | --2.00%-- 0x405a2c 0.56% [init] [kernel] [k] kmem_cache_alloc 0.56% [init] [kernel] [k] __inet_lookup_established 0.55% perf [kernel] [k] number | |--95.92%-- vsnprintf | | | |--97.87%-- seq_printf | | | | | |--56.52%-- show_map_vma | | | show_map | | | seq_read | | | vfs_read | | | sys_read | | | system_call_fastpath | | | __GI_read | | | | | |--28.26%-- render_sigset_t | | | proc_pid_status | | | proc_single_show | | | seq_read | | | vfs_read | | | sys_read | | | system_call_fastpath | | | __GI_read | | | | | |--6.52%-- proc_pid_status | | | proc_single_show | | | seq_read | | | vfs_read | | | sys_read | | | system_call_fastpath | | | __GI_read | | | | | |--4.35%-- render_cap_t | | | proc_pid_status | | | proc_single_show | | | seq_read | | | vfs_read | | | sys_read | | | system_call_fastpath | | | __GI_read | | | | | --4.35%-- 
task_mem | | proc_pid_status | | proc_single_show | | seq_read | | vfs_read | | sys_read | | system_call_fastpath | | __GI_read | | | --2.13%-- scnprintf | bitmap_scnlistprintf | seq_bitmap_list | cpuset_task_status_allowed | proc_pid_status | proc_single_show | seq_read | vfs_read | sys_read | system_call_fastpath | __GI_read | --4.08%-- seq_printf | |--50.00%-- show_map_vma | show_map | seq_read | vfs_read | sys_read | system_call_fastpath | __GI_read | --50.00%-- render_sigset_t proc_pid_status proc_single_show seq_read vfs_read sys_read system_call_fastpath __GI_read 0.55% [init] [kernel] [k] native_sched_clock 0.50% iperf [kernel] [k] e1000_xmit_frame | |--71.11%-- __tcp_ack_snd_check | tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --28.89%-- tcp_cleanup_rbuf tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv 0.50% iperf [kernel] [k] ipt_do_table | |--37.78%-- ipt_local_hook | nf_iterate | nf_hook_slow | __ip_local_out | ip_local_out | ip_queue_xmit | tcp_transmit_skb | tcp_send_ack | | | |--58.82%-- __tcp_ack_snd_check | | tcp_rcv_established | | tcp_v4_do_rcv | | tcp_prequeue_process | | tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --41.18%-- tcp_cleanup_rbuf | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | |--31.11%-- ipt_post_routing_hook | nf_iterate | nf_hook_slow | ip_output | ip_local_out | ip_queue_xmit | tcp_transmit_skb | tcp_send_ack | | | |--64.29%-- __tcp_ack_snd_check | | tcp_rcv_established | | tcp_v4_do_rcv | | tcp_prequeue_process | | tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --35.71%-- tcp_cleanup_rbuf | tcp_recvmsg | 
sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | |--20.00%-- ipt_local_out_hook | nf_iterate | nf_hook_slow | __ip_local_out | ip_local_out | ip_queue_xmit | tcp_transmit_skb | tcp_send_ack | | | |--88.89%-- __tcp_ack_snd_check | | tcp_rcv_established | | tcp_v4_do_rcv | | tcp_prequeue_process | | tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --11.11%-- tcp_cleanup_rbuf | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | |--6.67%-- nf_iterate | nf_hook_slow | | | |--66.67%-- ip_output | | ip_local_out | | ip_queue_xmit | | tcp_transmit_skb | | tcp_send_ack | | tcp_cleanup_rbuf | | tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --33.33%-- __ip_local_out | ip_local_out | ip_queue_xmit | tcp_transmit_skb | tcp_send_ack | __tcp_ack_snd_check | tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | |--2.22%-- ipt_local_in_hook | nf_iterate | nf_hook_slow | ip_local_deliver | ip_rcv_finish | ip_rcv | netif_receive_skb | napi_skb_finish | napi_gro_receive | e1000_receive_skb | e1000_clean_rx_irq | e1000_clean | net_rx_action | __do_softirq | call_softirq | do_softirq | irq_exit | do_IRQ | ret_from_intr | vgettimeofday | --2.22%-- ipt_pre_routing_hook nf_iterate nf_hook_slow ip_rcv netif_receive_skb napi_skb_finish napi_gro_receive e1000_receive_skb e1000_clean_rx_irq e1000_clean net_rx_action __do_softirq call_softirq do_softirq irq_exit do_IRQ ret_from_intr __GI___libc_nanosleep 0.50% iperf [kernel] [k] schedule | |--57.78%-- do_nanosleep | hrtimer_nanosleep | sys_nanosleep | system_call_fastpath | __GI___libc_nanosleep | |--33.33%-- schedule_timeout | sk_wait_data | 
tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | |--6.67%-- hrtimer_nanosleep | sys_nanosleep | system_call_fastpath | __GI___libc_nanosleep | --2.22%-- sk_wait_data tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv 0.49% iperf [kernel] [k] tcp_transmit_skb | |--97.73%-- tcp_send_ack | | | |--83.72%-- __tcp_ack_snd_check | | tcp_rcv_established | | tcp_v4_do_rcv | | | | | |--97.22%-- tcp_prequeue_process | | | tcp_recvmsg | | | sock_common_recvmsg | | | __sock_recvmsg | | | sock_recvmsg | | | sys_recvfrom | | | system_call_fastpath | | | __recv | | | | | --2.78%-- release_sock | | tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --16.28%-- tcp_cleanup_rbuf | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --2.27%-- __tcp_ack_snd_check tcp_rcv_established tcp_v4_do_rcv tcp_prequeue_process tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv 0.49% [init] [kernel] [k] nf_hook_slow 0.48% iperf [kernel] [k] virt_to_head_page | |--53.49%-- kfree | skb_release_data | __kfree_skb | | | |--65.22%-- tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --34.78%-- tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | |--18.60%-- skb_release_data | __kfree_skb | | | |--62.50%-- tcp_rcv_established | | tcp_v4_do_rcv | | tcp_prequeue_process | | tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --37.50%-- tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | 
system_call_fastpath | __recv | |--18.60%-- kmem_cache_free | __kfree_skb | | | |--62.50%-- tcp_rcv_established | | tcp_v4_do_rcv | | tcp_prequeue_process | | tcp_recvmsg | | sock_common_recvmsg | | __sock_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call_fastpath | | __recv | | | --37.50%-- tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --9.30%-- __kfree_skb | |--75.00%-- tcp_rcv_established | tcp_v4_do_rcv | tcp_prequeue_process | tcp_recvmsg | sock_common_recvmsg | __sock_recvmsg | sock_recvmsg | sys_recvfrom | system_call_fastpath | __recv | --25.00%-- tcp_recvmsg sock_common_recvmsg __sock_recvmsg sock_recvmsg sys_recvfrom system_call_fastpath __recv ...
From: Ingo Molnar on 6 Sep 2009 17:10
hi Con,

I've read your BFS announcement/FAQ with great interest:

  http://ck.kolivas.org/patches/bfs/bfs-faq.txt

First and foremost, let me say that i'm happy that you are hacking the
Linux scheduler again. It's perhaps proof that hacking the scheduler is
one of the most addictive things on the planet ;-)

I understand that BFS is still early code and that you are not targeting
BFS for mainline inclusion - but BFS is an interesting and bold new
approach, cutting a _lot_ of code out of kernel/sched*.c, so it raised
my curiosity and interest :-)

In the announcement and on your webpage you have compared BFS to the
mainline scheduler in various workloads - showing various improvements
over it. I have tried and tested BFS and ran a set of benchmarks - this
mail contains the results and my (quick) findings.

So ... to get to the numbers - i've tested both BFS and the tip of the
latest upstream scheduler tree on a testbox of mine. I intentionally
didn't test BFS on any really large box - because you described its
upper limit like this in the announcement:

-----------------------
|
| How scalable is it?
|
| I don't own the sort of hardware that is likely to suffer from
| using it, so I can't find the upper limit. Based on first
| principles about the overhead of locking, and the way lookups
| occur, I'd guess that a machine with more than 16 CPUS would
| start to have less performance. BIG NUMA machines will probably
| suck a lot with this because it pays no deference to locality of
| the NUMA nodes when deciding what cpu to use. It just keeps them
| all busy. The so-called "light NUMA" that constitutes commodity
| hardware these days seems to really like BFS.
|
-----------------------

I generally agree with you that "light NUMA" is what a Linux scheduler
needs to concentrate on (at most) in terms of scalability. Big NUMA,
4096-CPU systems are not very common, and we tune the Linux scheduler
for desktop and small-server workloads mostly.
So the testbox i picked fits into the upper portion of what i consider
a sane range of systems to tune for - and should still fit into BFS's
design bracket as well, according to your description: it's a dual quad
core system with hyperthreading. It has twice as many cores as the quad
you tested on, but it's not excessive and certainly does not have 4096
CPUs ;-)

Here are the benchmark results:

  kernel build performance:
    http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

  pipe performance:
    http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

  messaging performance (hackbench):
    http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg

  OLTP performance (postgresql + sysbench):
    http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

Alas, as can be seen in the graphs, i can not see any BFS performance
improvements on this box. Here's a more detailed description of the
results:

| Kernel build performance
---------------------------

  http://redhat.com/~mingo/misc/bfs-vs-tip-kbuild.jpg

In the kbuild test BFS is showing significant weaknesses up to 16 CPUs.
On 8 CPUs utilized (half load) it's 27.6% slower. All results (-j1,
-j2... -j15) are slower. The peak at 100% utilization at -j16 is
slightly stronger under BFS, by 1.5%. The 'absolute best' result is
sched-devel at -j64 with 46.65 seconds - the best BFS result is 47.38
seconds (at -j64) - 1.5% slower.

| Pipe performance
-------------------

  http://redhat.com/~mingo/misc/bfs-vs-tip-pipe.jpg

Pipe performance is a very simple test: two tasks message each other
via pipes. I measured 1 million such messages:

  http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test-1m.c

The pipe test ran a number of them in parallel:

  for ((i=0;i<$NR;i++)); do ~/sched-tests/pipe-test-1m & done; wait

and measured elapsed time. This tests two things: basic scheduler
performance and also scheduler fairness. (If one of these parallel jobs
is delayed unfairly then the test will finish later.)

[ See further below for a simpler pipe latency benchmark as well. ]

As can be seen in the graph, BFS performed very poorly in this test: at
8 pairs of tasks it had a runtime of 45.42 seconds - while sched-devel
finished them in 3.8 seconds.

I saw really bad interactivity in the BFS test here - the system was
starved for as long as the test ran. I stopped the tests at 8 loops -
the system was unusable and i was getting IO timeouts due to the
scheduling lag:

  sd 0:0:0:0: [sda] Unhandled error code
  sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
  end_request: I/O error, dev sda, sector 81949243
  Aborting journal on device sda2.
  ext3_abort called.
  EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
  Remounting filesystem read-only

I measured interactivity during this test:

  $ time ssh aldebaran /bin/true

  real	2m17.968s
  user	0m0.009s
  sys	0m0.003s

A single command took more than 2 minutes.

| Messaging performance
------------------------

  http://redhat.com/~mingo/misc/bfs-vs-tip-messaging.jpg

Hackbench ran better - but mainline sched-devel is significantly faster
for smaller and larger loads as well. With 20 groups mainline ran 61.5%
faster.

| OLTP performance
--------------------

  http://redhat.com/~mingo/misc/bfs-vs-tip-oltp.jpg

As can be seen in the graph for sysbench OLTP performance, sched-devel
outperforms BFS at each of the main stages:

  single client load  (   1 client  -   6.3% faster )
  half load           (   8 clients -  57.6% faster )
  peak performance    (  16 clients - 117.6% faster )
  overload            ( 512 clients - 288.3% faster )

| Other tests
--------------

I also tested a couple of other things, such as lat_tcp:

  BFS:         TCP latency using localhost: 16.5608 microseconds
  sched-devel: TCP latency using localhost: 13.5528 microseconds [22.1% faster]

lat_pipe:

  BFS:         Pipe latency: 4.9703 microseconds
  sched-devel: Pipe latency: 2.6137 microseconds [90.1% faster]

General interactivity of BFS seemed good to me - except for the pipe
test, where there was significant lag of over a minute.
I think it's some starvation bug, not an inherent design property of
BFS, so i'm looking forward to re-testing it with the fix.

Test environment: i used the latest BFS (205, then re-ran under 208 -
the numbers are all from 208), and the latest mainline scheduler
development tree from:

  http://people.redhat.com/mingo/tip.git/README

Commit 840a065 in particular. It's on a .31-rc8 base while BFS is on a
.30 base - i will be able to test BFS on a .31 base as well once you
release it. (But it doesn't matter much to the results - there weren't
any heavy core kernel changes impacting these workloads.)

The system had enough RAM to have the workloads cached, and i repeated
all tests to make sure the results are representative. Nevertheless i'd
like to encourage others to repeat these (or other) tests - the more
testing the better.

I also tried to configure the kernel in a BFS-friendly way: i used
HZ=1000 as recommended, turned off all debug options, etc. The kernel
config i used can be found here:

  http://redhat.com/~mingo/misc/config

( Let me know if you need any more info about any of the tests i
  conducted. )

Also, i'd like to outline that i agree with the general goals described
by you in the BFS announcement - small desktop systems matter more than
large systems. We find it critically important that the mainline Linux
scheduler performs well on those systems too - and if you (or anyone
else) can reproduce suboptimal behavior, please let the scheduler folks
know so that we can fix/improve it.

I hope to be able to work with you on this, so please don't hesitate to
send patches if you wish - and we'll also be following BFS for good
ideas and code to adopt to mainline.

	Thanks,

	Ingo