From: Peter Zijlstra on 19 May 2010 05:40

On Wed, 2010-05-19 at 14:35 +0530, Amit K. Arora wrote:
> Problem : In a stress test where some heavy tests were running along with
> regular CPU offlining and onlining, a hang was observed. The system seems to
> be hung at a point where migration_call() tries to kill the migration_thread
> of the dying CPU, which just got moved to the current CPU. This migration
> thread does not get a chance to run (and die) since rt_throttled is set to 1
> on current, and it doesn't get cleared as the hrtimer which is supposed to
> reset the rt bandwidth (sched_rt_period_timer) is tied to the CPU being
> offlined.
>
> Solution : This patch pushes the killing of the migration thread to the
> "CPU_POST_DEAD" event. By then all the timers (including
> sched_rt_period_timer) should have been migrated (along with other
> callbacks).
>
> Alternate Solution considered : Another option considered was to
> increase the priority of the hrtimer cpu offline notifier, such that it
> gets to run before the scheduler's migration cpu offline notifier. In this
> way we are sure that the timers will get migrated before migration_call
> tries to kill migration_thread. But, as Srivatsa suggested, this can have
> some non-obvious implications.
>
> Testing : Without the patch the stress tests didn't last for even 12
> hours. And yes, the problem was reproducible. With the patch applied the
> tests ran successfully for more than 48 hours.
>
> Thanks!
> --
> Regards,
> Amit Arora
>
> Signed-off-by: Amit Arora <aarora(a)in.ibm.com>
> Signed-off-by: Gautham R Shenoy <ego(a)in.ibm.com>
> --
> diff -Nuarp linux-2.6.34.org/kernel/sched.c linux-2.6.34/kernel/sched.c
> --- linux-2.6.34.org/kernel/sched.c	2010-05-18 22:56:21.000000000 -0700
> +++ linux-2.6.34/kernel/sched.c	2010-05-18 22:58:31.000000000 -0700
> @@ -5942,14 +5942,26 @@ migration_call(struct notifier_block *nf
>  		cpu_rq(cpu)->migration_thread = NULL;
>  		break;
>
> +	case CPU_POST_DEAD:
> +		/*
> +		   Bring the migration thread down in CPU_POST_DEAD event,
> +		   since the timers should have got migrated by now and thus
> +		   we should not see a deadlock between trying to kill the
> +		   migration thread and the sched_rt_period_timer.
> +		*/

That's faulty comment style; please use:

/*
 * text
 * goes
 * here
 */

> +		cpuset_lock();
> +		rq = cpu_rq(cpu);
> +		kthread_stop(rq->migration_thread);
> +		put_task_struct(rq->migration_thread);
> +		rq->migration_thread = NULL;
> +		cpuset_unlock();
> +		break;
> +

The other problem is more urgent though: CPU_POST_DEAD runs outside of
the hotplug lock, and thus the above becomes a race where we could
possibly kill off the migration thread of a newly brought up cpu:

 cpu0 - down 2
 cpu1 - up 2 (allocs a new migration thread, and leaks the old one)
 cpu0 - post_down 2 - frees the migration thread -- oops!

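[Editorial context: the race Peter describes follows from where CPU_POST_DEAD
sits in the offline path. Below is a simplified sketch of _cpu_down() from
kernel/cpu.c circa 2.6.34; it is abridged from memory, with error handling and
the exact notifier-chain plumbing trimmed, so read it as an illustration of
the lock scopes rather than as verbatim kernel source.]

/*
 * Simplified sketch of _cpu_down(), kernel/cpu.c circa 2.6.34.
 * The key point: CPU_POST_DEAD is delivered only after the hotplug
 * lock has been dropped.
 */
static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
{
        void *hcpu = (void *)(long)cpu;
        struct take_cpu_down_param tcd_param = { .hcpu = hcpu };
        int err;

        cpu_hotplug_begin();                    /* hotplug lock taken */

        raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE, hcpu);
        err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
        raw_notifier_call_chain(&cpu_chain, CPU_DEAD, hcpu);  /* lock held */

        cpu_hotplug_done();                     /* hotplug lock dropped */

        /*
         * CPU_POST_DEAD: delivered with the hotplug lock already
         * released, which is what lets sched_rt_period_timer migrate
         * first -- and also what opens the window against a concurrent
         * _cpu_up() that Peter worries about above.
         */
        raw_notifier_call_chain(&cpu_chain, CPU_POST_DEAD, hcpu);
        return err;
}
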
From: Peter Zijlstra on 20 May 2010 03:30

On Wed, 2010-05-19 at 17:43 +0530, Amit K. Arora wrote:
> Alternate Solution considered : Another option considered was to
> increase the priority of the hrtimer cpu offline notifier, such that it
> gets to run before the scheduler's migration cpu offline notifier. In this
> way we are sure that the timers will get migrated before migration_call
> tries to kill migration_thread. But, as Srivatsa suggested, this can have
> some non-obvious implications.
>
> On Wed, May 19, 2010 at 11:31:55AM +0200, Peter Zijlstra wrote:
> > The other problem is more urgent though: CPU_POST_DEAD runs outside of
> > the hotplug lock, and thus the above becomes a race where we could
> > possibly kill off the migration thread of a newly brought up cpu:
> >
> >  cpu0 - down 2
> >  cpu1 - up 2 (allocs a new migration thread, and leaks the old one)
> >  cpu0 - post_down 2 - frees the migration thread -- oops!
>
> Ok. So, how about adding a check in the CPU_UP_PREPARE event handling too?
> The cpuset_lock will synchronize, and thus avoid a race between the killing
> of migration_thread in the up_prepare and post_dead events.
>
> Here is the updated patch. If you don't like this one either, do you mind
> suggesting an alternate approach to tackle the problem? Thanks!

Right, so this isn't pretty at all..

Ingo, the comment near the migration_notifier says that migration_call
should happen before all else, but can you see anything that would break
if we let the timer migration happen first? Thomas?

> Signed-off-by: Amit Arora <aarora(a)in.ibm.com>
> Signed-off-by: Gautham R Shenoy <ego(a)in.ibm.com>
> --

For everybody who reads this: _please_ use three (3) dashes '-' to
separate the changelog from the patch, and left-align the changelog
(including all tags). I seem to get more and more people sending patches
with 2 dashes and daft changelogs with whitespace stuffing, which break
my scripts.

> diff -Nuarp linux-2.6.34.org/kernel/sched.c linux-2.6.34/kernel/sched.c
> --- linux-2.6.34.org/kernel/sched.c	2010-05-18 22:56:21.000000000 -0700
> +++ linux-2.6.34/kernel/sched.c	2010-05-19 04:47:49.000000000 -0700
> @@ -5900,6 +5900,19 @@ migration_call(struct notifier_block *nf
>
>  	case CPU_UP_PREPARE:
>  	case CPU_UP_PREPARE_FROZEN:
> +		cpuset_lock();
> +		rq = cpu_rq(cpu);
> +		/*
> +		 * Since we now kill migration_thread in CPU_POST_DEAD event,
> +		 * there may be a race here. So, let's clean up the old
> +		 * migration_thread on the rq, if any.
> +		 */
> +		if (unlikely(rq->migration_thread)) {
> +			kthread_stop(rq->migration_thread);
> +			put_task_struct(rq->migration_thread);
> +			rq->migration_thread = NULL;
> +		}
> +		cpuset_unlock();
>  		p = kthread_create(migration_thread, hcpu, "migration/%d", cpu);
>  		if (IS_ERR(p))
>  			return NOTIFY_BAD;
> @@ -5942,14 +5955,34 @@ migration_call(struct notifier_block *nf
>  		cpu_rq(cpu)->migration_thread = NULL;
>  		break;
>
> +	case CPU_POST_DEAD:
> +		/*
> +		 * Bring the migration thread down in CPU_POST_DEAD event,
> +		 * since the timers should have got migrated by now and thus
> +		 * we should not see a deadlock between trying to kill the
> +		 * migration thread and the sched_rt_period_timer.
> +		 */
> +		cpuset_lock();
> +		rq = cpu_rq(cpu);
> +		if (likely(rq->migration_thread)) {
> +			/*
> +			 * It's possible that this CPU was onlined (from a
> +			 * different CPU) before we reached here and
> +			 * migration_thread was cleaned up in the
> +			 * CPU_UP_PREPARE event handling.
> +			 */
> +			kthread_stop(rq->migration_thread);
> +			put_task_struct(rq->migration_thread);
> +			rq->migration_thread = NULL;
> +		}
> +		cpuset_unlock();
> +		break;
> +
>  	case CPU_DEAD:
>  	case CPU_DEAD_FROZEN:
>  		cpuset_lock(); /* around calls to cpuset_cpus_allowed_lock() */
>  		migrate_live_tasks(cpu);
>  		rq = cpu_rq(cpu);
> -		kthread_stop(rq->migration_thread);
> -		put_task_struct(rq->migration_thread);
> -		rq->migration_thread = NULL;
>  		/* Idle task back to normal (off runqueue, low prio) */
>  		raw_spin_lock_irq(&rq->lock);
>  		update_rq_clock(rq);

From: Mike Galbraith on 23 May 2010 05:10

On Thu, 2010-05-20 at 09:28 +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-19 at 17:43 +0530, Amit K. Arora wrote:
> > Alternate Solution considered : Another option considered was to
> > increase the priority of the hrtimer cpu offline notifier, such that it
> > gets to run before the scheduler's migration cpu offline notifier. In this
> > way we are sure that the timers will get migrated before migration_call
> > tries to kill migration_thread. But, as Srivatsa suggested, this can have
> > some non-obvious implications.
> >
> > On Wed, May 19, 2010 at 11:31:55AM +0200, Peter Zijlstra wrote:
> > > The other problem is more urgent though: CPU_POST_DEAD runs outside of
> > > the hotplug lock, and thus the above becomes a race where we could
> > > possibly kill off the migration thread of a newly brought up cpu:
> > >
> > >  cpu0 - down 2
> > >  cpu1 - up 2 (allocs a new migration thread, and leaks the old one)
> > >  cpu0 - post_down 2 - frees the migration thread -- oops!
> >
> > Ok. So, how about adding a check in the CPU_UP_PREPARE event handling too?
> > The cpuset_lock will synchronize, and thus avoid a race between the killing
> > of migration_thread in the up_prepare and post_dead events.
> >
> > Here is the updated patch. If you don't like this one either, do you mind
> > suggesting an alternate approach to tackle the problem? Thanks!
>
> Right, so this isn't pretty at all..

Since the problem seems to stem from interfering with a critical thread,
how about creating a SCHED_SYSTEM_CRITICAL flag ala SCHED_RESET_ON_FORK?

Not particularly beautiful, and completely untested (well, it compiles).

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6cc43e0..23da2cf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -40,6 +40,8 @@
 #define SCHED_IDLE		5
 /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
 #define SCHED_RESET_ON_FORK	0x40000000
+/* Can be ORed in to flag a thread as being system critical. Not inherited. */
+#define SCHED_SYSTEM_CRITICAL	0x20000000

 #ifdef __KERNEL__

@@ -1240,6 +1242,9 @@ struct task_struct {
 	/* Revert to default priority/policy when forking */
 	unsigned sched_reset_on_fork:1;

+	/* System critical thread. Cleared on fork. */
+	unsigned sched_system_critical:1;
+
 	pid_t pid;
 	pid_t tgid;

diff --git a/kernel/sched.c b/kernel/sched.c
index 2d17e3b..caf8b95 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2511,6 +2511,11 @@ void sched_fork(struct task_struct *p, int clone_flags)
 	 */
 	p->prio = current->normal_prio;

+	/*
+	 * System critical policy flag is not inherited.
+	 */
+	p->sched_system_critical = 0;
+
 	if (!rt_prio(p->prio))
 		p->sched_class = &fair_sched_class;

@@ -4486,7 +4491,7 @@ static int __sched_setscheduler(struct task_struct *p, int policy,
 	unsigned long flags;
 	const struct sched_class *prev_class;
 	struct rq *rq;
-	int reset_on_fork;
+	int reset_on_fork, system_critical;

 	/* may grab non-irq protected spin_locks */
 	BUG_ON(in_interrupt());
@@ -4494,10 +4499,12 @@ recheck:
 	/* double check policy once rq lock held */
 	if (policy < 0) {
 		reset_on_fork = p->sched_reset_on_fork;
+		system_critical = p->sched_system_critical;
 		policy = oldpolicy = p->policy;
 	} else {
 		reset_on_fork = !!(policy & SCHED_RESET_ON_FORK);
-		policy &= ~SCHED_RESET_ON_FORK;
+		system_critical = !!(policy & SCHED_SYSTEM_CRITICAL);
+		policy &= ~(SCHED_RESET_ON_FORK|SCHED_SYSTEM_CRITICAL);

 		if (policy != SCHED_FIFO && policy != SCHED_RR &&
 				policy != SCHED_NORMAL && policy != SCHED_BATCH &&
@@ -4552,6 +4559,10 @@ recheck:
 		/* Normal users shall not reset the sched_reset_on_fork flag */
 		if (p->sched_reset_on_fork && !reset_on_fork)
 			return -EPERM;
+
+		/* Normal users shall not reset the sched_system_critical flag */
+		if (p->sched_system_critical && !system_critical)
+			return -EPERM;
 	}

 	if (user) {
@@ -4560,7 +4571,7 @@ recheck:
 		 * Do not allow realtime tasks into groups that have no runtime
 		 * assigned.
 		 */
-		if (rt_bandwidth_enabled() && rt_policy(policy) &&
+		if (rt_bandwidth_enabled() && rt_policy(policy) && !system_critical &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0)
 			return -EPERM;
 #endif
@@ -4595,6 +4606,7 @@ recheck:
 		p->sched_class->put_prev_task(rq, p);

 	p->sched_reset_on_fork = reset_on_fork;
+	p->sched_system_critical = system_critical;

 	oldprio = p->prio;
 	prev_class = p->sched_class;
@@ -4712,9 +4724,13 @@ SYSCALL_DEFINE1(sched_getscheduler, pid_t, pid)
 	p = find_process_by_pid(pid);
 	if (p) {
 		retval = security_task_getscheduler(p);
-		if (!retval)
-			retval = p->policy
-				| (p->sched_reset_on_fork ? SCHED_RESET_ON_FORK : 0);
+		if (!retval) {
+			retval = p->policy;
+			if (p->sched_reset_on_fork)
+				retval |= SCHED_RESET_ON_FORK;
+			if (p->sched_system_critical)
+				retval |= SCHED_SYSTEM_CRITICAL;
+		}
 	}
 	rcu_read_unlock();
 	return retval;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 8afb953..4fb5ced 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -605,6 +605,7 @@ static void update_curr_rt(struct rq *rq)
 	struct sched_rt_entity *rt_se = &curr->rt;
 	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
 	u64 delta_exec;
+	int system_critical = curr->sched_system_critical;

 	if (!task_has_rt_policy(curr))
 		return;
@@ -621,9 +622,13 @@ static void update_curr_rt(struct rq *rq)
 	curr->se.exec_start = rq->clock;
 	cpuacct_charge(curr, delta_exec);

-	sched_rt_avg_update(rq, delta_exec);
+	/*
+	 * System critical tasks do not contribute to bandwidth consumption,
+	 * nor are they evicted when runtime is exceeded.
+	 */
+	sched_rt_avg_update(rq, system_critical ? 0 : delta_exec);

-	if (!rt_bandwidth_enabled())
+	if (!rt_bandwidth_enabled() || system_critical)
 		return;

 	for_each_sched_rt_entity(rt_se) {
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index b4e7431..6cbea9a 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -304,7 +304,7 @@ static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
 					   cpu);
 		if (IS_ERR(p))
 			return NOTIFY_BAD;
-		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+		sched_setscheduler_nocheck(p, SCHED_FIFO|SCHED_SYSTEM_CRITICAL, &param);
 		get_task_struct(p);
 		stopper->thread = p;
 		break;

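[Editorial context: Mike's flag reuses the convention that SCHED_RESET_ON_FORK
introduced in 2.6.32 -- a high bit ORed into the sched_setscheduler() policy
argument and masked off again inside the kernel. A minimal userspace example
of that existing convention follows; note that SCHED_SYSTEM_CRITICAL itself
never hit mainline, so only the SCHED_RESET_ON_FORK bit works on a real
kernel.]

#include <sched.h>
#include <stdio.h>

/* glibc headers of the era did not expose the flag; its value comes
 * from include/linux/sched.h. */
#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK	0x40000000
#endif

int main(void)
{
        struct sched_param sp = { .sched_priority = 10 };

        /*
         * OR the flag into the policy; the kernel strips it back out in
         * __sched_setscheduler().  This is exactly the pattern the
         * proposed SCHED_SYSTEM_CRITICAL bit (0x20000000) would reuse.
         * Needs CAP_SYS_NICE, so expect EPERM as an ordinary user.
         */
        if (sched_setscheduler(0, SCHED_FIFO | SCHED_RESET_ON_FORK, &sp))
                perror("sched_setscheduler");

        return 0;
}
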
From: Peter Zijlstra on 23 May 2010 05:30

On Sun, 2010-05-23 at 11:07 +0200, Mike Galbraith wrote:
> On Thu, 2010-05-20 at 09:28 +0200, Peter Zijlstra wrote:
> > On Wed, 2010-05-19 at 17:43 +0530, Amit K. Arora wrote:
> > > Alternate Solution considered : Another option considered was to
> > > increase the priority of the hrtimer cpu offline notifier, such that it
> > > gets to run before the scheduler's migration cpu offline notifier. In this
> > > way we are sure that the timers will get migrated before migration_call
> > > tries to kill migration_thread. But, as Srivatsa suggested, this can have
> > > some non-obvious implications.
> > >
> > > On Wed, May 19, 2010 at 11:31:55AM +0200, Peter Zijlstra wrote:
> > > > The other problem is more urgent though: CPU_POST_DEAD runs outside of
> > > > the hotplug lock, and thus the above becomes a race where we could
> > > > possibly kill off the migration thread of a newly brought up cpu:
> > > >
> > > >  cpu0 - down 2
> > > >  cpu1 - up 2 (allocs a new migration thread, and leaks the old one)
> > > >  cpu0 - post_down 2 - frees the migration thread -- oops!
> > >
> > > Ok. So, how about adding a check in the CPU_UP_PREPARE event handling too?
> > > The cpuset_lock will synchronize, and thus avoid a race between the killing
> > > of migration_thread in the up_prepare and post_dead events.
> > >
> > > Here is the updated patch. If you don't like this one either, do you mind
> > > suggesting an alternate approach to tackle the problem? Thanks!
> >
> > Right, so this isn't pretty at all..
>
> Since the problem seems to stem from interfering with a critical thread,
> how about creating a SCHED_SYSTEM_CRITICAL flag ala SCHED_RESET_ON_FORK?
>
> Not particularly beautiful, and completely untested (well, it compiles).

Nah, I'd rather we pull the migration thread out of SCHED_FIFO and
either schedule it explicitly or add a sched_class for it. We need to do
that anyway once we go play with SCHED_DEADLINE.

But it would be very nice if we could simply order the timer and task
migration bits so that the whole problem doesn't exist in the first
place.

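[Editorial context: mainline later went the way Peter sketches here -- 2.6.37
gained a dedicated stop_sched_class ("sched: Create special class for
stop/migrate work") that ranks above SCHED_FIFO and is never subject to rt
bandwidth. A heavily trimmed sketch of the concept follows; the names mirror
the later mainline code, but most of the mandatory sched_class callbacks are
omitted, so treat this as an illustration, not the complete class.]

/*
 * Concept sketch: a scheduling class that always wins.  If the per-rq
 * stop task is runnable, pick it ahead of everything else; since it is
 * not an rt task, rt bandwidth throttling cannot touch it.
 */
static struct task_struct *pick_next_task_stop(struct rq *rq)
{
        struct task_struct *stop = rq->stop;

        if (stop && stop->se.on_rq)
                return stop;

        return NULL;
}

static const struct sched_class stop_sched_class = {
        .next           = &rt_sched_class,      /* ranks above SCHED_FIFO */
        .pick_next_task = pick_next_task_stop,
        /* enqueue_task, dequeue_task, task_tick, etc. omitted here */
};
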
From: Peter Zijlstra on 24 May 2010 09:30
On Mon, 2010-05-24 at 15:29 +0530, Amit K. Arora wrote:
> since _cpu_up() and _cpu_down() can never run in
> parallel, because of cpu_add_remove_lock.

Ah indeed. I guess your initial patch works then.
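[Editorial context: this closing observation is what dissolves the race from
earlier in the thread. CPU_POST_DEAD runs outside the cpu_hotplug lock, but
the whole of cpu_down() -- including the POST_DEAD notification issued inside
_cpu_down() -- still runs under cpu_add_remove_lock, and cpu_up() takes the
same mutex. A simplified sketch from kernel/cpu.c circa 2.6.34, with error
paths and the cpu_hotplug_disabled check omitted:]

static DEFINE_MUTEX(cpu_add_remove_lock);

/* Both the online and offline paths serialize on this mutex;
 * cpu_up() wraps _cpu_up() the same way cpu_down() is shown here. */
void cpu_maps_update_begin(void)
{
        mutex_lock(&cpu_add_remove_lock);
}

void cpu_maps_update_done(void)
{
        mutex_unlock(&cpu_add_remove_lock);
}

int __ref cpu_down(unsigned int cpu)
{
        int err;

        cpu_maps_update_begin();
        err = _cpu_down(cpu, 0);        /* CPU_POST_DEAD fires in here */
        cpu_maps_update_done();         /* only now may a cpu_up() proceed */
        return err;
}

So CPU_UP_PREPARE for a cpu can never interleave with CPU_POST_DEAD, and the
migration thread freed in POST_DEAD is always the dead cpu's old one.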