From: Srivatsa Vaddagiri on 3 Jun 2010 00:30

On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
>
> There are two separate problems: the more general problem is that
> the hypervisor can put a vcpu to sleep while holding a lock, causing
> other vcpus to spin until the end of their time slice. This can
> only be addressed with hypervisor help.

FYI - I have an early patch ready to address this issue. Basically I am using
host-kernel memory (mmap'ed into the guest as io-memory via the ivshmem driver)
to hint to the host whenever the guest is in a spinlocked section; the hint is
read by the host scheduler to defer preemption.

Guest side:

static inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
+	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
}

static inline void spin_unlock(spinlock_t *lock)
{
+	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
	raw_spin_unlock(&lock->rlock);
}

[similar changes to other spinlock variants]

Host side:

@@ -860,6 +866,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq
	ideal_runtime = sched_slice(cfs_rq, curr);
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	if (delta_exec > ideal_runtime) {
+		if (sched_feat(DEFER_PREEMPT) && rq_of(cfs_rq)->curr->ghptr) {
+			int defer_preempt = rq_of(cfs_rq)->curr->ghptr->defer_preempt;
+			if ((defer_preempt & 0xFFFF0000) == 0xfeed0000 &&
+			    (defer_preempt & 0x0000FFFF) != 0) {
+				if (rq_of(cfs_rq)->curr->grace_defer++ <
+				    sysctl_sched_preempt_defer_count) {
+					rq_of(cfs_rq)->defer_preempt++;
+					return;
+				} else
+					rq_of(cfs_rq)->force_preempt++;
+			}
+		}
		resched_task(rq_of(cfs_rq)->curr);
		/*
		 * The current task ran long enough, ensure it doesn't get

[similar changes introduced at other preemption points in sched_fair.c]

Note that the guest can only request that preemption be deferred (not disabled)
via this mechanism. I have seen a good improvement (~15%) in a kernel-compile
benchmark with sysctl_sched_preempt_defer_count set to a low value of just 2
(i.e. we can defer preemption by a maximum of two ticks).

I intend to clean up and post the patches pretty soon for comments.

One pathological case where this may actually hurt is routines in the guest
like flush_tlb_others_ipi(), which take a spinlock and then enter a while()
loop waiting for other cpus to ack something. In this case, deferring
preemption just because the guest is in a critical section actually hurts!
Hopefully the upper bound on deferring preemption, and the fact that such
routines may not be frequently hit, should help alleviate such situations.

- vatsa

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Eric Dumazet on 3 Jun 2010 01:00

On Thursday, 3 June 2010 at 09:50 +0530, Srivatsa Vaddagiri wrote:
> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
> >
> > There are two separate problems: the more general problem is that
> > the hypervisor can put a vcpu to sleep while holding a lock, causing
> > other vcpus to spin until the end of their time slice. This can
> > only be addressed with hypervisor help.
>
> FYI - I have an early patch ready to address this issue. Basically I am using
> host-kernel memory (mmap'ed into the guest as io-memory via the ivshmem
> driver) to hint to the host whenever the guest is in a spinlocked section,
> which is read by the host scheduler to defer preemption.
>
> Guest side:
>
> static inline void spin_lock(spinlock_t *lock)
> {
> 	raw_spin_lock(&lock->rlock);
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;

1) __this_cpu_inc() should be faster

2) Isn't it a bit late to do this increment _after_
raw_spin_lock(&lock->rlock)?

> }
>
> static inline void spin_unlock(spinlock_t *lock)
> {
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
> 	raw_spin_unlock(&lock->rlock);
> }
From: Srivatsa Vaddagiri on 3 Jun 2010 01:40

On Thu, Jun 03, 2010 at 06:51:51AM +0200, Eric Dumazet wrote:
> > Guest side:
> >
> > static inline void spin_lock(spinlock_t *lock)
> > {
> > 	raw_spin_lock(&lock->rlock);
> > +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
>
> 1) __this_cpu_inc() should be faster

OK - thanks for that tip.

> 2) Isn't it a bit late to do this increment _after_
> raw_spin_lock(&lock->rlock)?

I think so. My worry about doing it earlier is that we may set the
defer_preempt hint for the wrong vcpu (if, let's say, the guest application
thread is preempted by the guest kernel and later migrated to another vcpu
after it sets the hint and before it acquires the lock).

- vatsa
From: Andi Kleen on 3 Jun 2010 05:00

On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
> >
> > There are two separate problems: the more general problem is that
> > the hypervisor can put a vcpu to sleep while holding a lock, causing
> > other vcpus to spin until the end of their time slice. This can
> > only be addressed with hypervisor help.
>
> FYI - I have an early patch ready to address this issue. Basically I am using
> host-kernel memory (mmap'ed into the guest as io-memory via the ivshmem
> driver) to hint to the host whenever the guest is in a spinlocked section,
> which is read by the host scheduler to defer preemption.

Looks like a nice simple way to handle this for the kernel.

However I suspect user space will hit the same issue sooner or later.
I assume your way is not easily extensible to futexes?

> One pathological case where this may actually hurt is routines in the guest
> like flush_tlb_others_ipi(), which take a spinlock and then enter a while()
> loop waiting for other cpus to ack something. In this case, deferring
> preemption just because the guest is in a critical section actually hurts!
> Hopefully the upper bound on deferring preemption and the fact that such
> routines may not be frequently hit should help alleviate such situations.

So do you defer during the whole spinlock region or just during the spin?

I assume the first?

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
From: Srivatsa Vaddagiri on 3 Jun 2010 05:30
On Thu, Jun 03, 2010 at 10:52:51AM +0200, Andi Kleen wrote:
> > FYI - I have an early patch ready to address this issue. Basically I am
> > using host-kernel memory (mmap'ed into the guest as io-memory via the
> > ivshmem driver) to hint to the host whenever the guest is in a spinlocked
> > section, which is read by the host scheduler to defer preemption.
>
> Looks like a nice simple way to handle this for the kernel.

The idea is not new. It has been discussed, for example, at [1].

> However I suspect user space will hit the same issue sooner
> or later. I assume your way is not easily extensible to futexes?

I had thought that most userspace lock implementations avoid spinning for long
periods - i.e. they spin for a short while and sleep beyond a threshold? If
that is the case, we shouldn't be burning a lot of cycles unnecessarily
spinning in userspace.

> So do you defer during the whole spinlock region or just during the spin?
>
> I assume the first?

My current implementation just blindly defers by a tick and checks whether it
is safe to preempt at the next tick - otherwise it grants more grace ticks
until the threshold is crossed (after which we forcibly preempt). In future, I
was thinking that the host scheduler could hint back to the guest that it was
given some "grace" time, which the guest could use to yield when it comes out
of the locked section.

- vatsa

1. http://l4ka.org/publications/2004/Towards-Scalable-Multiprocessor-Virtual-Machines-VM04.pdf