From: Cong Wang on 31 Mar 2010 22:50

Oleg Nesterov wrote:
> On 03/31, Amerigo Wang wrote:
>> This fixes a lockdep warning when invoking destroy_workqueue(),
>> because the lockdep annotations are invoked under cpu_add_remove_lock.
>
> Confused. Why does lockdep complain?

Because the lockdep annotations are there to catch other locks being
taken before these functions are called.

>
>> So, move the lockdep annotations before taking cpu_add_remove_lock
>> in destroy_workqueue(), this will not affect the original purpose
>> of adding them for destroy_workqueue() etc.
>>
>> However, it will affect another caller of cleanup_workqueue_thread(),
>> that is, workqueue_cpu_callback(). This should be fine, because
>> nothing other than cpu hotplug can call it.
>
> OK, but nobody should take cpu_maps_update_begin() under wq->lockdep_map,
> in particular work->func() must not.
>
> I must have missed something, but it seems to me this patch tries to
> suppress the valid warning.
>
> Could you please clarify?
>

Sure, below is the whole warning. Please teach me how this is valid. Thanks.

bonding: bond0: released all slaves

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.34-rc3 #90
-------------------------------------------------------
modprobe/5264 is trying to acquire lock:
 ((bond_dev->name)){+.+...}, at: [<ffffffff8108524a>] cleanup_workqueue_thread+0x2b/0x10b

but task is already holding lock:
 (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810631d1>] cpu_maps_update_begin+0x1e/0x27

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (cpu_add_remove_lock){+.+.+.}:
       [<ffffffff810a6bc1>] validate_chain+0x1019/0x1540
       [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
       [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
       [<ffffffff815523f8>] mutex_lock_nested+0x64/0x4e9
       [<ffffffff810631d1>] cpu_maps_update_begin+0x1e/0x27
       [<ffffffff810853cd>] destroy_workqueue+0x41/0x107
       [<ffffffffa0839d32>] bond_uninit+0x524/0x58a [bonding]
       [<ffffffff8146967b>] rollback_registered_many+0x205/0x2e3
       [<ffffffff81469783>] unregister_netdevice_many+0x2a/0x75
       [<ffffffff8147ada3>] __rtnl_kill_links+0x8b/0x9d
       [<ffffffff8147adea>] __rtnl_link_unregister+0x35/0x72
       [<ffffffff8147b293>] rtnl_link_unregister+0x2c/0x43
       [<ffffffffa0845ca6>] bonding_exit+0x5a/0x76 [bonding]
       [<ffffffff810b7749>] sys_delete_module+0x306/0x3b1
       [<ffffffff81003a5b>] system_call_fastpath+0x16/0x1b

-> #2 (rtnl_mutex){+.+.+.}:
       [<ffffffff810a6bc1>] validate_chain+0x1019/0x1540
       [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
       [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
       [<ffffffff815523f8>] mutex_lock_nested+0x64/0x4e9
       [<ffffffff8147af16>] rtnl_lock+0x1e/0x27
       [<ffffffffa0836779>] bond_mii_monitor+0x39f/0x74b [bonding]
       [<ffffffff8108654f>] worker_thread+0x2da/0x46c
       [<ffffffff8108b1ea>] kthread+0xdd/0xec
       [<ffffffff81004894>] kernel_thread_helper+0x4/0x10

-> #1 ((&(&bond->mii_work)->work)){+.+...}:
       [<ffffffff810a6bc1>] validate_chain+0x1019/0x1540
       [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
       [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
       [<ffffffff81086542>] worker_thread+0x2cd/0x46c
       [<ffffffff8108b1ea>] kthread+0xdd/0xec
       [<ffffffff81004894>] kernel_thread_helper+0x4/0x10

-> #0 ((bond_dev->name)){+.+...}:
       [<ffffffff810a6696>] validate_chain+0xaee/0x1540
       [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
       [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
       [<ffffffff81085278>] cleanup_workqueue_thread+0x59/0x10b
       [<ffffffff81085428>] destroy_workqueue+0x9c/0x107
       [<ffffffffa0839d32>] bond_uninit+0x524/0x58a [bonding]
       [<ffffffff8146967b>] rollback_registered_many+0x205/0x2e3
       [<ffffffff81469783>] unregister_netdevice_many+0x2a/0x75
       [<ffffffff8147ada3>] __rtnl_kill_links+0x8b/0x9d
       [<ffffffff8147adea>] __rtnl_link_unregister+0x35/0x72
       [<ffffffff8147b293>] rtnl_link_unregister+0x2c/0x43
       [<ffffffffa0845ca6>] bonding_exit+0x5a/0x76 [bonding]
       [<ffffffff810b7749>] sys_delete_module+0x306/0x3b1
       [<ffffffff81003a5b>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

2 locks held by modprobe/5264:
 #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8147af16>] rtnl_lock+0x1e/0x27
 #1:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810631d1>] cpu_maps_update_begin+0x1e/0x27

stack backtrace:
Pid: 5264, comm: modprobe Not tainted 2.6.34-rc3 #90
Call Trace:
 [<ffffffff810a50c1>] print_circular_bug+0x169/0x17e
 [<ffffffff810a6696>] validate_chain+0xaee/0x1540
 [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
 [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
 [<ffffffff8108524a>] ? cleanup_workqueue_thread+0x2b/0x10b
 [<ffffffff81085278>] cleanup_workqueue_thread+0x59/0x10b
 [<ffffffff8108524a>] ? cleanup_workqueue_thread+0x2b/0x10b
 [<ffffffff81085428>] destroy_workqueue+0x9c/0x107
 [<ffffffffa0833a19>] ? bond_remove_proc_entry+0x0/0x109 [bonding]
 [<ffffffffa0839d32>] bond_uninit+0x524/0x58a [bonding]
 [<ffffffff8146967b>] rollback_registered_many+0x205/0x2e3
 [<ffffffff81469783>] unregister_netdevice_many+0x2a/0x75
 [<ffffffff8147ada3>] __rtnl_kill_links+0x8b/0x9d
 [<ffffffff8147adea>] __rtnl_link_unregister+0x35/0x72
 [<ffffffff8147b293>] rtnl_link_unregister+0x2c/0x43
 [<ffffffffa0845ca6>] bonding_exit+0x5a/0x76 [bonding]
 [<ffffffff810b7749>] sys_delete_module+0x306/0x3b1
 [<ffffffff81003a8c>] ? sysret_check+0x27/0x62
 [<ffffffff810a3268>] ? trace_hardirqs_on_caller+0x264/0x29f
 [<ffffffff81003a5b>] system_call_fastpath+0x16/0x1b
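For context, the "lockdep annotations" being discussed here are the lock_map_acquire()/lock_map_release() pair that the workqueue code uses to tell lockdep that flushing (or destroying) a workqueue behaves like taking a pseudo-lock, and that running a work item behaves like holding it. A paraphrased sketch of that pattern in kernel/workqueue.c around 2.6.34 (not a verbatim copy):

void flush_workqueue(struct workqueue_struct *wq)
{
	might_sleep();
	/*
	 * Record "flushing acquires wq->lockdep_map": any lock held here
	 * must never be taken from inside a work item on this workqueue.
	 */
	lock_map_acquire(&wq->lockdep_map);
	lock_map_release(&wq->lockdep_map);
	/* ... wait for all pending work items on every CPU ... */
}

/* The worker side does the mirror annotation around each work item: */
static void run_workqueue(struct cpu_workqueue_struct *cwq)
{
	/* ... dequeue a work item, then: */
	lock_map_acquire(&cwq->wq->lockdep_map);
	lock_map_acquire(&work->lockdep_map);	/* the work item's own map */
	f(work);				/* work->func() */
	lock_map_release(&work->lockdep_map);
	lock_map_release(&cwq->wq->lockdep_map);
}

So any lock that is both held around a flush/destroy and taken inside a work item shows up to lockdep as a lock-order cycle.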
From: Tejun Heo on 1 Apr 2010 00:00

Hello, guys.

On 04/01/2010 11:45 AM, Cong Wang wrote:
>> OK, but nobody should take cpu_maps_update_begin() under wq->lockdep_map,
>> in particular work->func() must not.
>>
>> I must have missed something, but it seems to me this patch tries to
>> suppress the valid warning.
>>
>> Could you please clarify?
>
> Sure, below is the whole warning. Please teach me how this is valid.

I still have some trouble interpreting lockdep warnings. Please
correct me if I get something wrong.

> modprobe/5264 is trying to acquire lock:
> ((bond_dev->name)){+.+...}, at: [<ffffffff8108524a>] cleanup_workqueue_thread+0x2b/0x10b
>
> but task is already holding lock:
> (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810631d1>] cpu_maps_update_begin+0x1e/0x27

This (cpu hotplug -> wq) is the expected sequence. Plug cpu
hotplugging and then flush cpu workqueues.

> which lock already depends on the new lock.

But lockdep says the other way around has already happened.

> the existing dependency chain (in reverse order) is:
>
> -> #3 (cpu_add_remove_lock){+.+.+.}:
> [<ffffffff810a6bc1>] validate_chain+0x1019/0x1540
> [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
> [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
> [<ffffffff815523f8>] mutex_lock_nested+0x64/0x4e9
> [<ffffffff810631d1>] cpu_maps_update_begin+0x1e/0x27
> [<ffffffff810853cd>] destroy_workqueue+0x41/0x107
> [<ffffffffa0839d32>] bond_uninit+0x524/0x58a [bonding]
> [<ffffffff8146967b>] rollback_registered_many+0x205/0x2e3
> [<ffffffff81469783>] unregister_netdevice_many+0x2a/0x75
> [<ffffffff8147ada3>] __rtnl_kill_links+0x8b/0x9d
> [<ffffffff8147adea>] __rtnl_link_unregister+0x35/0x72
> [<ffffffff8147b293>] rtnl_link_unregister+0x2c/0x43
> [<ffffffffa0845ca6>] bonding_exit+0x5a/0x76 [bonding]
> [<ffffffff810b7749>] sys_delete_module+0x306/0x3b1
> [<ffffffff81003a5b>] system_call_fastpath+0x16/0x1b

This is bond_uninit() calling destroy_workqueue() but I don't get how
this thread would be already holding wq lock.

> -> #2 (rtnl_mutex){+.+.+.}:
> [<ffffffff810a6bc1>] validate_chain+0x1019/0x1540
> [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
> [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
> [<ffffffff815523f8>] mutex_lock_nested+0x64/0x4e9
> [<ffffffff8147af16>] rtnl_lock+0x1e/0x27
> [<ffffffffa0836779>] bond_mii_monitor+0x39f/0x74b [bonding]
> [<ffffffff8108654f>] worker_thread+0x2da/0x46c
> [<ffffffff8108b1ea>] kthread+0xdd/0xec
> [<ffffffff81004894>] kernel_thread_helper+0x4/0x10
>
> -> #1 ((&(&bond->mii_work)->work)){+.+...}:
> [<ffffffff810a6bc1>] validate_chain+0x1019/0x1540
> [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
> [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
> [<ffffffff81086542>] worker_thread+0x2cd/0x46c
> [<ffffffff8108b1ea>] kthread+0xdd/0xec
> [<ffffffff81004894>] kernel_thread_helper+0x4/0x10

These two are from a workqueue worker thread and I don't quite
understand why they are here.

> -> #0 ((bond_dev->name)){+.+...}:
> [<ffffffff810a6696>] validate_chain+0xaee/0x1540
> [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
> [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
> [<ffffffff81085278>] cleanup_workqueue_thread+0x59/0x10b
> [<ffffffff81085428>] destroy_workqueue+0x9c/0x107
> [<ffffffffa0839d32>] bond_uninit+0x524/0x58a [bonding]
> [<ffffffff8146967b>] rollback_registered_many+0x205/0x2e3
> [<ffffffff81469783>] unregister_netdevice_many+0x2a/0x75
> [<ffffffff8147ada3>] __rtnl_kill_links+0x8b/0x9d
> [<ffffffff8147adea>] __rtnl_link_unregister+0x35/0x72
> [<ffffffff8147b293>] rtnl_link_unregister+0x2c/0x43
> [<ffffffffa0845ca6>] bonding_exit+0x5a/0x76 [bonding]
> [<ffffffff810b7749>] sys_delete_module+0x306/0x3b1
> [<ffffffff81003a5b>] system_call_fastpath+0x16/0x1b

This seems to be from the original thread of frame#3. It's grabbing
wq lock here but the problem is that the lock will be released
immediately, so bond_dev->name (the wq) can't be held by the time it
reaches frame#3. How is this dependency chain completed? Is it
somehow transitive through rtnl_mutex?

> other info that might help us debug this:
>
> 2 locks held by modprobe/5264:
> #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff8147af16>] rtnl_lock+0x1e/0x27
> #1: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810631d1>] cpu_maps_update_begin+0x1e/0x27

Isn't there a circular dependency here? bonding_exit() calls
destroy_workqueue() under rtnl_mutex but destroy_workqueue() should
flush works which could be trying to grab rtnl_lock. Or am I
completely misunderstanding locking here?

Thanks.

-- 
tejun
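For readers following the report, the cycle Tejun is asking about can be sketched from the two call paths it shows, writing the lockdep pseudo-locks as if they were real locks (a simplified sketch; indentation means "while holding the locks above"):

/* modprobe -r bonding: frames #3 and #0 of the report */
rtnl_lock();                                        /* rtnl_mutex          */
    bond_uninit();
        destroy_workqueue(bond->wq);
            cpu_maps_update_begin();                /* cpu_add_remove_lock */
                cleanup_workqueue_thread();
                    lock_map_acquire(&wq->lockdep_map);    /* "wq lock"    */

/* bonding's workqueue worker: frames #1 and #2 of the report */
worker_thread();
    lock_map_acquire(&wq->lockdep_map);             /* "wq lock"           */
        bond_mii_monitor();
            rtnl_lock();                            /* rtnl_mutex          */

Read together, these record rtnl_mutex -> cpu_add_remove_lock -> wq on one side and wq -> rtnl_mutex on the other, so the chain does close through rtnl_mutex.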
From: Cong Wang on 1 Apr 2010 00:10

Tejun Heo wrote:
> Hello, guys.
>
> On 04/01/2010 11:45 AM, Cong Wang wrote:
>>> OK, but nobody should take cpu_maps_update_begin() under wq->lockdep_map,
>>> in particular work->func() must not.
>>>
>>> I must have missed something, but it seems to me this patch tries to
>>> suppress the valid warning.
>>>
>>> Could you please clarify?
>> Sure, below is the whole warning. Please teach me how this is valid.
>
> I still have some trouble interpreting lockdep warnings. Please
> correct me if I get something wrong.
>
>> modprobe/5264 is trying to acquire lock:
>> ((bond_dev->name)){+.+...}, at: [<ffffffff8108524a>] cleanup_workqueue_thread+0x2b/0x10b
>>
>> but task is already holding lock:
>> (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810631d1>] cpu_maps_update_begin+0x1e/0x27
>
> This (cpu hotplug -> wq) is the expected sequence. Plug cpu
> hotplugging and then flush cpu workqueues.
>
>> which lock already depends on the new lock.
>
> But lockdep says the other way around has already happened.
>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #3 (cpu_add_remove_lock){+.+.+.}:
>> [<ffffffff810a6bc1>] validate_chain+0x1019/0x1540
>> [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
>> [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
>> [<ffffffff815523f8>] mutex_lock_nested+0x64/0x4e9
>> [<ffffffff810631d1>] cpu_maps_update_begin+0x1e/0x27
>> [<ffffffff810853cd>] destroy_workqueue+0x41/0x107
>> [<ffffffffa0839d32>] bond_uninit+0x524/0x58a [bonding]
>> [<ffffffff8146967b>] rollback_registered_many+0x205/0x2e3
>> [<ffffffff81469783>] unregister_netdevice_many+0x2a/0x75
>> [<ffffffff8147ada3>] __rtnl_kill_links+0x8b/0x9d
>> [<ffffffff8147adea>] __rtnl_link_unregister+0x35/0x72
>> [<ffffffff8147b293>] rtnl_link_unregister+0x2c/0x43
>> [<ffffffffa0845ca6>] bonding_exit+0x5a/0x76 [bonding]
>> [<ffffffff810b7749>] sys_delete_module+0x306/0x3b1
>> [<ffffffff81003a5b>] system_call_fastpath+0x16/0x1b
>
> This is bond_uninit() calling destroy_workqueue() but I don't get how
> this thread would be already holding wq lock.

destroy_workqueue() does hold wq lock and then releases it.

>
>> -> #2 (rtnl_mutex){+.+.+.}:
>> [<ffffffff810a6bc1>] validate_chain+0x1019/0x1540
>> [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
>> [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
>> [<ffffffff815523f8>] mutex_lock_nested+0x64/0x4e9
>> [<ffffffff8147af16>] rtnl_lock+0x1e/0x27
>> [<ffffffffa0836779>] bond_mii_monitor+0x39f/0x74b [bonding]
>> [<ffffffff8108654f>] worker_thread+0x2da/0x46c
>> [<ffffffff8108b1ea>] kthread+0xdd/0xec
>> [<ffffffff81004894>] kernel_thread_helper+0x4/0x10
>>
>> -> #1 ((&(&bond->mii_work)->work)){+.+...}:
>> [<ffffffff810a6bc1>] validate_chain+0x1019/0x1540
>> [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
>> [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
>> [<ffffffff81086542>] worker_thread+0x2cd/0x46c
>> [<ffffffff8108b1ea>] kthread+0xdd/0xec
>> [<ffffffff81004894>] kernel_thread_helper+0x4/0x10
>
> These two are from a workqueue worker thread and I don't quite
> understand why they are here.

>
>> -> #0 ((bond_dev->name)){+.+...}:
>> [<ffffffff810a6696>] validate_chain+0xaee/0x1540
>> [<ffffffff810a7e75>] __lock_acquire+0xd8d/0xe55
>> [<ffffffff810aa3a4>] lock_acquire+0x160/0x1af
>> [<ffffffff81085278>] cleanup_workqueue_thread+0x59/0x10b
>> [<ffffffff81085428>] destroy_workqueue+0x9c/0x107
>> [<ffffffffa0839d32>] bond_uninit+0x524/0x58a [bonding]
>> [<ffffffff8146967b>] rollback_registered_many+0x205/0x2e3
>> [<ffffffff81469783>] unregister_netdevice_many+0x2a/0x75
>> [<ffffffff8147ada3>] __rtnl_kill_links+0x8b/0x9d
>> [<ffffffff8147adea>] __rtnl_link_unregister+0x35/0x72
>> [<ffffffff8147b293>] rtnl_link_unregister+0x2c/0x43
>> [<ffffffffa0845ca6>] bonding_exit+0x5a/0x76 [bonding]
>> [<ffffffff810b7749>] sys_delete_module+0x306/0x3b1
>> [<ffffffff81003a5b>] system_call_fastpath+0x16/0x1b
>
> This seems to be from the original thread of frame#3. It's grabbing
> wq lock here but the problem is that the lock will be released
> immediately, so bond_dev->name (the wq) can't be held by the time it
> reaches frame#3. How is this dependency chain completed? Is it
> somehow transitive through rtnl_mutex?

wq lock is held *after* cpu_add_remove_lock, lockdep also said this,
the process is trying to hold wq lock while having cpu_add_remove_lock.

>
>> other info that might help us debug this:
>>
>> 2 locks held by modprobe/5264:
>> #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff8147af16>] rtnl_lock+0x1e/0x27
>> #1: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff810631d1>] cpu_maps_update_begin+0x1e/0x27
>
> Isn't there a circular dependency here? bonding_exit() calls
> destroy_workqueue() under rtnl_mutex but destroy_workqueue() should
> flush works which could be trying to grab rtnl_lock. Or am I
> completely misunderstanding locking here?

Sure, that is why I sent another patch for bonding. :)

After this patch, another lockdep warning appears, it is exactly what
you expect.

Thanks.
From: Tejun Heo on 1 Apr 2010 00:20

Hello,

On 04/01/2010 01:09 PM, Cong Wang wrote:
>> This seems to be from the original thread of frame#3. It's grabbing
>> wq lock here but the problem is that the lock will be released
>> immediately, so bond_dev->name (the wq) can't be held by the time it
>> reaches frame#3. How is this dependency chain completed? Is it
>> somehow transitive through rtnl_mutex?
>
> wq lock is held *after* cpu_add_remove_lock, lockdep also said this,
> the process is trying to hold wq lock while having cpu_add_remove_lock.

Yeah yeah, I'm just failing to see how the other direction is
completed. ie. where does the kernel try to grab cpu_add_remove_lock
*after* grabbing wq lock?

>> Isn't there a circular dependency here? bonding_exit() calls
>> destroy_workqueue() under rtnl_mutex but destroy_workqueue() should
>> flush works which could be trying to grab rtnl_lock. Or am I
>> completely misunderstanding locking here?
>
> Sure, that is why I sent another patch for bonding. :)

Ah... great. :-)

> After this patch, another lockdep warning appears, it is exactly what
> you expect.

Hmmm... can you please try to see whether this circular locking
warning involving wq->lockdep_map is reproducible w/ the bonding
locking fixed? I still can't see where wq -> cpu_add_remove_lock
dependency is created.

Thanks.

-- 
tejun
From: Cong Wang on 1 Apr 2010 00:30
Tejun Heo wrote:
> Hello,
>
> On 04/01/2010 01:09 PM, Cong Wang wrote:
>>> This seems to be from the original thread of frame#3. It's grabbing
>>> wq lock here but the problem is that the lock will be released
>>> immediately, so bond_dev->name (the wq) can't be held by the time it
>>> reaches frame#3. How is this dependency chain completed? Is it
>>> somehow transitive through rtnl_mutex?
>> wq lock is held *after* cpu_add_remove_lock, lockdep also said this,
>> the process is trying to hold wq lock while having cpu_add_remove_lock.
>
> Yeah yeah, I'm just failing to see how the other direction is
> completed. ie. where does the kernel try to grab cpu_add_remove_lock
> *after* grabbing wq lock?
>
>>> Isn't there a circular dependency here? bonding_exit() calls
>>> destroy_workqueue() under rtnl_mutex but destroy_workqueue() should
>>> flush works which could be trying to grab rtnl_lock. Or am I
>>> completely misunderstanding locking here?
>> Sure, that is why I sent another patch for bonding. :)
>
> Ah... great. :-)
>
>> After this patch, another lockdep warning appears, it is exactly what
>> you expect.
>
> Hmmm... can you please try to see whether this circular locking
> warning involving wq->lockdep_map is reproducible w/ the bonding
> locking fixed? I still can't see where wq -> cpu_add_remove_lock
> dependency is created.
>

I thought this was obvious. Here it is:

void destroy_workqueue(struct workqueue_struct *wq)
{
	const struct cpumask *cpu_map = wq_cpu_map(wq);
	int cpu;

	cpu_maps_update_begin();        <----------------- Hold cpu_add_remove_lock here

	spin_lock(&workqueue_lock);
	list_del(&wq->list);
	spin_unlock(&workqueue_lock);

	for_each_cpu(cpu, cpu_map)
		cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));   <------ See below
	cpu_maps_update_done();         <----------------- Release cpu_add_remove_lock here
	....

static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
{
	/*
	 * Our caller is either destroy_workqueue() or CPU_POST_DEAD,
	 * cpu_add_remove_lock protects cwq->thread.
	 */
	if (cwq->thread == NULL)
		return;

	lock_map_acquire(&cwq->wq->lockdep_map);   <-------------- Lockdep complains here
	lock_map_release(&cwq->wq->lockdep_map);
	....

Am I missing something?

Thanks.
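For reference, the change described at the top of this thread ("move the lockdep annotations before taking cpu_add_remove_lock in destroy_workqueue()") would look roughly like the sketch below. This is only an illustration of the idea, not the posted diff, and it assumes the acquire/release pair is correspondingly dropped from cleanup_workqueue_thread(), which, as the changelog notes, also affects its other caller, workqueue_cpu_callback():

void destroy_workqueue(struct workqueue_struct *wq)
{
	const struct cpumask *cpu_map = wq_cpu_map(wq);
	int cpu;

	/*
	 * Do the lockdep acquire/release of the wq pseudo-lock up front,
	 * before cpu_add_remove_lock is taken, so the wq map is never
	 * recorded as nested inside cpu_add_remove_lock.
	 */
	lock_map_acquire(&wq->lockdep_map);
	lock_map_release(&wq->lockdep_map);

	cpu_maps_update_begin();

	spin_lock(&workqueue_lock);
	list_del(&wq->list);
	spin_unlock(&workqueue_lock);

	for_each_cpu(cpu, cpu_map)
		cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));

	cpu_maps_update_done();

	/* ... free the per-cpu data and the workqueue itself ... */
}

This only changes where the wq pseudo-lock is recorded relative to cpu_add_remove_lock; whether the original report should be suppressed at all is the question Oleg and Tejun raise above.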