From: Peter Zijlstra on 14 Feb 2010 05:20

On Sun, 2010-02-14 at 02:06 +0530, Vaidyanathan Srinivasan wrote:
> > > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > >  			continue;
> > > >
> > > >  		rq = cpu_rq(i);
> > > > -		wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > > -		wl /= power;
> > > > +		wl = weighted_cpuload(i);
> > > >
> > > > +		/*
> > > > +		 * When comparing with imbalance, use weighted_cpuload()
> > > > +		 * which is not scaled with the cpu power.
> > > > +		 */
> > > >  		if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > >  			continue;
> > > >
> > > > +		/*
> > > > +		 * For the load comparisons with the other cpu's, consider
> > > > +		 * the weighted_cpuload() scaled with the cpu power, so that
> > > > +		 * the load can be moved away from the cpu that is potentially
> > > > +		 * running at a lower capacity.
> > > > +		 */
> > > > +		wl = (wl * SCHED_LOAD_SCALE) / power;
> > > > +
> > > >  		if (wl > max_load) {
> > > >  			max_load = wl;
> > > >  			busiest = rq;
> > > >
> > >
>
> In addition to the above fix, for sched_smt_powersavings to work, the
> group capacity of the core (MC level) should be made 2 in
> update_sg_lb_stats() by changing the DIV_ROUND_CLOSEST to
> DIV_ROUND_UP():
>
> 	sgs->group_capacity =
> 		DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);
>
> Ideally we can change this to DIV_ROUND_UP and let the SD_PREFER_SIBLING
> flag force capacity to 1.  Need to see if there are any side
> effects of setting SD_PREFER_SIBLING at the SIBLING-level sched domain
> based on the sched_smt_powersavings flag.

OK, so while I think that Suresh's patch can make sense (I haven't had time
to think it through), the above really sounds wrong.  Things should not
rely on the cpu_power value like that.
From: Vaidyanathan Srinivasan on 15 Feb 2010 07:40

* Peter Zijlstra <peterz(a)infradead.org> [2010-02-14 11:11:58]:

> On Sun, 2010-02-14 at 02:06 +0530, Vaidyanathan Srinivasan wrote:
> > > > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > > >  			continue;
> > > > >
> > > > >  		rq = cpu_rq(i);
> > > > > -		wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > > > -		wl /= power;
> > > > > +		wl = weighted_cpuload(i);
> > > > >
> > > > > +		/*
> > > > > +		 * When comparing with imbalance, use weighted_cpuload()
> > > > > +		 * which is not scaled with the cpu power.
> > > > > +		 */
> > > > >  		if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > > >  			continue;
> > > > >
> > > > > +		/*
> > > > > +		 * For the load comparisons with the other cpu's, consider
> > > > > +		 * the weighted_cpuload() scaled with the cpu power, so that
> > > > > +		 * the load can be moved away from the cpu that is potentially
> > > > > +		 * running at a lower capacity.
> > > > > +		 */
> > > > > +		wl = (wl * SCHED_LOAD_SCALE) / power;
> > > > > +
> > > > >  		if (wl > max_load) {
> > > > >  			max_load = wl;
> > > > >  			busiest = rq;
> > > > >
> > > >
> >
> > In addition to the above fix, for sched_smt_powersavings to work, the
> > group capacity of the core (MC level) should be made 2 in
> > update_sg_lb_stats() by changing the DIV_ROUND_CLOSEST to
> > DIV_ROUND_UP():
> >
> > 	sgs->group_capacity =
> > 		DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);
> >
> > Ideally we can change this to DIV_ROUND_UP and let the SD_PREFER_SIBLING
> > flag force capacity to 1.  Need to see if there are any side
> > effects of setting SD_PREFER_SIBLING at the SIBLING-level sched domain
> > based on the sched_smt_powersavings flag.
>
> OK, so while I think that Suresh's patch can make sense (I haven't had time
> to think it through), the above really sounds wrong.  Things should not
> rely on the cpu_power value like that.

Hi Peter,

The reason rounding is a problem is because threads have fractional
cpu_power and we lose some power in DIV_ROUND_CLOSEST().  At the MC level
a group has 2*589 = 1178 and group_capacity will always be 1 if
DIV_ROUND_CLOSEST() is used, irrespective of the SD_PREFER_SIBLING flag.

We are reducing group capacity here to 1 even though we have 2 sibling
threads in the group.  In the sched_smt_powersavings > 0 case, the
group_capacity should be 2 to allow task consolidation to this group
while leaving other groups completely idle.

DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE) will ensure any spare
capacity is rounded up and counted.

Meanwhile, if SD_PREFER_SIBLING is set, update_sd_lb_stats() does:

	if (prefer_sibling)
		sgs.group_capacity = min(sgs.group_capacity, 1UL);

which ensures the group_capacity is 1 and allows spreading of tasks.

--Vaidy
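[The rounding difference being argued about can be checked with a small
standalone sketch. This is not kernel code: the two macros below are
simplified stand-ins for the kernel helpers of the same name, and the
589-per-thread cpu_power figure is simply the number quoted in the thread.]

#include <stdio.h>

/* Simplified stand-ins for the kernel's DIV_ROUND_CLOSEST()/DIV_ROUND_UP(). */
#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))
#define DIV_ROUND_UP(x, d)	(((x) + (d) - 1) / (d))

#define SCHED_LOAD_SCALE	1024UL

int main(void)
{
	unsigned long thread_power = 589;		/* fractional cpu_power of one SMT thread */
	unsigned long group_power = 2 * thread_power;	/* MC-level group: 1178 */

	/* DIV_ROUND_CLOSEST(1178, 1024) = 1: the second sibling's capacity is lost. */
	printf("closest: %lu\n", DIV_ROUND_CLOSEST(group_power, SCHED_LOAD_SCALE));

	/* DIV_ROUND_UP(1178, 1024) = 2: spare capacity is rounded up and counted. */
	printf("up:      %lu\n", DIV_ROUND_UP(group_power, SCHED_LOAD_SCALE));

	return 0;
}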
From: Peter Zijlstra on 15 Feb 2010 08:10

On Mon, 2010-02-15 at 18:05 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra <peterz(a)infradead.org> [2010-02-14 11:11:58]:
>
> > On Sun, 2010-02-14 at 02:06 +0530, Vaidyanathan Srinivasan wrote:
> > > > > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > > > >  			continue;
> > > > > >
> > > > > >  		rq = cpu_rq(i);
> > > > > > -		wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > > > > -		wl /= power;
> > > > > > +		wl = weighted_cpuload(i);
> > > > > >
> > > > > > +		/*
> > > > > > +		 * When comparing with imbalance, use weighted_cpuload()
> > > > > > +		 * which is not scaled with the cpu power.
> > > > > > +		 */
> > > > > >  		if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > > > >  			continue;
> > > > > >
> > > > > > +		/*
> > > > > > +		 * For the load comparisons with the other cpu's, consider
> > > > > > +		 * the weighted_cpuload() scaled with the cpu power, so that
> > > > > > +		 * the load can be moved away from the cpu that is potentially
> > > > > > +		 * running at a lower capacity.
> > > > > > +		 */
> > > > > > +		wl = (wl * SCHED_LOAD_SCALE) / power;
> > > > > > +
> > > > > >  		if (wl > max_load) {
> > > > > >  			max_load = wl;
> > > > > >  			busiest = rq;
> > > > > >
> > > > >
> > >
> > > In addition to the above fix, for sched_smt_powersavings to work, the
> > > group capacity of the core (MC level) should be made 2 in
> > > update_sg_lb_stats() by changing the DIV_ROUND_CLOSEST to
> > > DIV_ROUND_UP():
> > >
> > > 	sgs->group_capacity =
> > > 		DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);
> > >
> > > Ideally we can change this to DIV_ROUND_UP and let the SD_PREFER_SIBLING
> > > flag force capacity to 1.  Need to see if there are any side
> > > effects of setting SD_PREFER_SIBLING at the SIBLING-level sched domain
> > > based on the sched_smt_powersavings flag.
> >
> > OK, so while I think that Suresh's patch can make sense (I haven't had time
> > to think it through), the above really sounds wrong.  Things should not
> > rely on the cpu_power value like that.
>
> Hi Peter,
>
> The reason rounding is a problem is because threads have fractional
> cpu_power and we lose some power in DIV_ROUND_CLOSEST().  At the MC level
> a group has 2*589 = 1178 and group_capacity will always be 1 if
> DIV_ROUND_CLOSEST() is used, irrespective of the SD_PREFER_SIBLING flag.
>
> We are reducing group capacity here to 1 even though we have 2 sibling
> threads in the group.  In the sched_smt_powersavings > 0 case, the
> group_capacity should be 2 to allow task consolidation to this group
> while leaving other groups completely idle.
>
> DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE) will ensure any spare
> capacity is rounded up and counted.
>
> Meanwhile, if SD_PREFER_SIBLING is set, update_sd_lb_stats() does:
>
> 	if (prefer_sibling)
> 		sgs.group_capacity = min(sgs.group_capacity, 1UL);
>
> which ensures the group_capacity is 1 and allows spreading of tasks.

We should be weakening this link between cpu_power and capacity, not
strengthening it.  What I think you want is to use
cpumask_weight(sched_group_cpus(group)) or something like that as the capacity.

The setup for cpu_power is that it can reflect the actual capacity for
work; esp. with today's asymmetric cpus, where a socket can run at a
different frequency, we need to make sure this is so.

So no, that DIV_ROUND_UP is utterly broken, as there might be many ways
for the cpu_power of multiple threads/cpus to be less than the number of
cpus.

Furthermore, for powersavings it makes sense to make the capacity a
function of an overload argument/tunable, so that you can specify the
threshold of packing.

So really, cpu_power is a normalization factor to equally distribute
load across cpus that have asymmetric work capacity; if you need any
placement constraints outside of that, do _NOT_ touch cpu_power.
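[The normalization role of cpu_power described above can be sketched with a
few made-up numbers. SCHED_LOAD_SCALE and the 589/1024 cpu_power values
match the figures discussed in this thread; the load value is illustrative
only, not taken from a real machine.]

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

/* Scale a raw weighted load by a cpu's capacity for work, as the balancer does. */
static unsigned long scale_load(unsigned long wl, unsigned long power)
{
	return wl * SCHED_LOAD_SCALE / power;
}

int main(void)
{
	unsigned long wl = 2048;	/* the same raw load on both cpus, e.g. two nice-0 tasks */

	/*
	 * A full core vs. an SMT thread with fractional cpu_power: after
	 * normalization the lower-capacity cpu looks more loaded, so the
	 * balancer moves load away from it.  Equalizing that distribution
	 * is all cpu_power is meant to do.
	 */
	printf("power 1024: scaled load = %lu\n", scale_load(wl, 1024));	/* 2048 */
	printf("power  589: scaled load = %lu\n", scale_load(wl, 589));	/* ~3560 */

	return 0;
}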
From: Peter Zijlstra on 15 Feb 2010 17:40

On Fri, 2010-02-12 at 17:14 -0800, Suresh Siddha wrote:
> From: Suresh Siddha <suresh.b.siddha(a)intel.com>
> Subject: sched: fix SMT scheduler regression in find_busiest_queue()
>
> Fix an SMT scheduler performance regression that leads to a scenario
> where the SMT threads in one core are completely idle while both SMT
> threads in another core (on the same socket) are busy.
>
> This is caused by this commit (with the problematic code highlighted):
>
> 	commit bdb94aa5dbd8b55e75f5a50b61312fe589e2c2d1
> 	Author: Peter Zijlstra <a.p.zijlstra(a)chello.nl>
> 	Date:   Tue Sep 1 10:34:38 2009 +0200
>
> 	sched: Try to deal with low capacity
>
> 	@@ -4203,15 +4223,18 @@ find_busiest_queue()
> 	...
> 		for_each_cpu(i, sched_group_cpus(group)) {
> 	+		unsigned long power = power_of(i);
>
> 	...
>
> 	-		wl = weighted_cpuload(i);
> 	+		wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> 	+		wl /= power;
>
> 	-		if (rq->nr_running == 1 && wl > imbalance)
> 	+		if (capacity && rq->nr_running == 1 && wl > imbalance)
> 			continue;
>
> On an SMT system, the power of an HT logical cpu will be 589 and
> the scheduler load imbalance (for scenarios like the one mentioned above)
> can be approximately 1024 (SCHED_LOAD_SCALE).  The above change of scaling
> the weighted load with the power will result in "wl > imbalance" and
> ultimately in find_busiest_queue() returning NULL, causing
> load_balance() to think that the load is well balanced.  But in fact
> one of the tasks can be moved to the idle core for optimal performance.
>
> We don't need to use the weighted load (wl) scaled by the cpu power to
> compare with imbalance.  In that condition, we already know there is only a
> single task ("rq->nr_running == 1") and the comparison between imbalance and
> wl is to make sure that we select the correct priority thread which matches
> imbalance.  So we really need to compare the imbalance with the original
> weighted load of the cpu and not the scaled load.
>
> But in the other conditions, where we want the most hammered (busiest) cpu,
> we can use the scaled load to ensure that we consider the cpu power in
> addition to the actual load on that cpu, so that we can move the load away
> from the cpu that is getting most hammered with respect to its actual
> capacity, as compared with the rest of the cpus in that busiest group.
>
> Fix it.
>
> Reported-by: Ma Ling <ling.ma(a)intel.com>
> Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang(a)linux.intel.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha(a)intel.com>
> Cc: stable(a)kernel.org [2.6.32.x]

A reproduction case would have been nice; I've been playing with busy loops
and plotting the cpus on paper, but I didn't manage to reproduce it.

Still, I went through the logic and it seems to make sense, so:

Acked-by: Peter Zijlstra <a.p.zijlstra(a)chello.nl>

Ingo, sed -e 's/sched\.c/sched_fair.c/g' makes it apply to tip/master and
should provide a means of solving the rebase/merge conflict.

> ---
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..bef5369 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
>  			continue;
>
>  		rq = cpu_rq(i);
> -		wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> -		wl /= power;
> +		wl = weighted_cpuload(i);
>
> +		/*
> +		 * When comparing with imbalance, use weighted_cpuload()
> +		 * which is not scaled with the cpu power.
> +		 */
>  		if (capacity && rq->nr_running == 1 && wl > imbalance)
>  			continue;
>
> +		/*
> +		 * For the load comparisons with the other cpu's, consider
> +		 * the weighted_cpuload() scaled with the cpu power, so that
> +		 * the load can be moved away from the cpu that is potentially
> +		 * running at a lower capacity.
> +		 */
> +		wl = (wl * SCHED_LOAD_SCALE) / power;
> +
>  		if (wl > max_load) {
>  			max_load = wl;
>  			busiest = rq;
>
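[The failure mode described in the changelog reduces to a few lines of
arithmetic. This is a standalone sketch, not kernel code: the 589 and 1024
figures are the ones quoted above, and the single-task load and imbalance
values are illustrative assumptions.]

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

int main(void)
{
	unsigned long wl = 1024;	/* weighted_cpuload(): one nice-0 task on the thread */
	unsigned long power = 589;	/* power_of() for an HT logical cpu */
	unsigned long imbalance = 1024;	/* roughly SCHED_LOAD_SCALE in this scenario */

	/*
	 * Pre-fix: the scaled load is compared against imbalance, the cpu is
	 * skipped, and find_busiest_queue() ends up returning NULL even
	 * though another core sits idle.
	 */
	unsigned long wl_scaled = wl * SCHED_LOAD_SCALE / power;	/* ~1780 */
	printf("scaled:   %lu > %lu -> %s\n", wl_scaled, imbalance,
	       wl_scaled > imbalance ? "skip cpu (looks balanced)" : "keep cpu");

	/*
	 * Post-fix: the raw load is used for this check, so the cpu remains a
	 * candidate and its single task can be pulled to the idle core.
	 */
	printf("unscaled: %lu > %lu -> %s\n", wl, imbalance,
	       wl > imbalance ? "skip cpu" : "keep cpu (task can be moved)");

	return 0;
}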
From: Vaidyanathan Srinivasan on 16 Feb 2010 11:00

* Peter Zijlstra <peterz(a)infradead.org> [2010-02-15 14:00:43]:

> On Mon, 2010-02-15 at 18:05 +0530, Vaidyanathan Srinivasan wrote:
> > * Peter Zijlstra <peterz(a)infradead.org> [2010-02-14 11:11:58]:
> >
> > > On Sun, 2010-02-14 at 02:06 +0530, Vaidyanathan Srinivasan wrote:
> > > > > > > @@ -4119,12 +4119,23 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
> > > > > > >  			continue;
> > > > > > >
> > > > > > >  		rq = cpu_rq(i);
> > > > > > > -		wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
> > > > > > > -		wl /= power;
> > > > > > > +		wl = weighted_cpuload(i);
> > > > > > >
> > > > > > > +		/*
> > > > > > > +		 * When comparing with imbalance, use weighted_cpuload()
> > > > > > > +		 * which is not scaled with the cpu power.
> > > > > > > +		 */
> > > > > > >  		if (capacity && rq->nr_running == 1 && wl > imbalance)
> > > > > > >  			continue;
> > > > > > >
> > > > > > > +		/*
> > > > > > > +		 * For the load comparisons with the other cpu's, consider
> > > > > > > +		 * the weighted_cpuload() scaled with the cpu power, so that
> > > > > > > +		 * the load can be moved away from the cpu that is potentially
> > > > > > > +		 * running at a lower capacity.
> > > > > > > +		 */
> > > > > > > +		wl = (wl * SCHED_LOAD_SCALE) / power;
> > > > > > > +
> > > > > > >  		if (wl > max_load) {
> > > > > > >  			max_load = wl;
> > > > > > >  			busiest = rq;
> > > > > > >
> > > > > >
> > > >
> > > > In addition to the above fix, for sched_smt_powersavings to work, the
> > > > group capacity of the core (MC level) should be made 2 in
> > > > update_sg_lb_stats() by changing the DIV_ROUND_CLOSEST to
> > > > DIV_ROUND_UP():
> > > >
> > > > 	sgs->group_capacity =
> > > > 		DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE);
> > > >
> > > > Ideally we can change this to DIV_ROUND_UP and let the SD_PREFER_SIBLING
> > > > flag force capacity to 1.  Need to see if there are any side
> > > > effects of setting SD_PREFER_SIBLING at the SIBLING-level sched domain
> > > > based on the sched_smt_powersavings flag.
> > >
> > > OK, so while I think that Suresh's patch can make sense (I haven't had time
> > > to think it through), the above really sounds wrong.  Things should not
> > > rely on the cpu_power value like that.
> >
> > Hi Peter,
> >
> > The reason rounding is a problem is because threads have fractional
> > cpu_power and we lose some power in DIV_ROUND_CLOSEST().  At the MC level
> > a group has 2*589 = 1178 and group_capacity will always be 1 if
> > DIV_ROUND_CLOSEST() is used, irrespective of the SD_PREFER_SIBLING flag.
> >
> > We are reducing group capacity here to 1 even though we have 2 sibling
> > threads in the group.  In the sched_smt_powersavings > 0 case, the
> > group_capacity should be 2 to allow task consolidation to this group
> > while leaving other groups completely idle.
> >
> > DIV_ROUND_UP(group->cpu_power, SCHED_LOAD_SCALE) will ensure any spare
> > capacity is rounded up and counted.
> >
> > Meanwhile, if SD_PREFER_SIBLING is set, update_sd_lb_stats() does:
> >
> > 	if (prefer_sibling)
> > 		sgs.group_capacity = min(sgs.group_capacity, 1UL);
> >
> > which ensures the group_capacity is 1 and allows spreading of tasks.
>
> We should be weakening this link between cpu_power and capacity, not
> strengthening it.  What I think you want is to use
> cpumask_weight(sched_group_cpus(group)) or something like that as the capacity.

Yes, this is a good suggestion and an elegant solution to the problem.
I have a patch attached using this approach.

> The setup for cpu_power is that it can reflect the actual capacity for
> work; esp. with today's asymmetric cpus, where a socket can run at a
> different frequency, we need to make sure this is so.
>
> So no, that DIV_ROUND_UP is utterly broken, as there might be many ways
> for the cpu_power of multiple threads/cpus to be less than the number of
> cpus.
>
> Furthermore, for powersavings it makes sense to make the capacity a
> function of an overload argument/tunable, so that you can specify the
> threshold of packing.
>
> So really, cpu_power is a normalization factor to equally distribute
> load across cpus that have asymmetric work capacity; if you need any
> placement constraints outside of that, do _NOT_ touch cpu_power.

Agreed.  Placement control should be handled by the SD_PREFER_SIBLING and
SD_POWERSAVINGS_BALANCE flags.

--Vaidy

---

sched_smt_powersavings for threaded systems needs this fix for
consolidation to sibling threads to work.  Since threads have fractional
capacity, group_capacity will always turn out to be one and will not
accommodate another task in the sibling thread.

This fix makes group_capacity a function of cpumask_weight, which will
enable the power-saving load balancer to pack tasks among sibling threads
and keep more cores idle.

Signed-off-by: Vaidyanathan Srinivasan <svaidy(a)linux.vnet.ibm.com>

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 522cf0e..ec3a5c5 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2538,9 +2538,17 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
 		 * In case the child domain prefers tasks go to siblings
 		 * first, lower the group capacity to one so that we'll try
 		 * and move all the excess tasks away.
+		 * If power savings balance is set at this domain, then
+		 * make capacity equal to number of hardware threads to
+		 * accommodate more tasks until capacity is reached.  The
+		 * default is fractional capacity for sibling hardware
+		 * threads for fair use of available hardware resources.
 		 */
 		if (prefer_sibling)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
+		else if (sd->flags & SD_POWERSAVINGS_BALANCE)
+			sgs.group_capacity =
+				cpumask_weight(sched_group_cpus(group));
 
 		if (local_group) {
 			sds->this_load = sgs.avg_load;
@@ -2855,7 +2863,8 @@ static int need_active_balance(struct sched_domain *sd, int sd_idle, int idle)
 	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
 		return 0;
 
-	if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+	if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP &&
+	    sched_smt_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
 		return 0;
 	}