From: Paul Turner on 30 Jul 2010 15:10

On Fri, Jul 30, 2010 at 6:32 AM, Mike Galbraith <efault(a)gmx.de> wrote:
> On Thu, 2010-07-29 at 22:19 -0700, Nikhil Rao wrote:
>> Hi all,
>>
>> We have observed that a large weight differential between tasks on a runqueue
>> leads to sub-optimal machine utilization and poor load balancing. For example,
>> if you have lots of SCHED_IDLE tasks (sufficient number to keep the machine 100%
>> busy) and a few SCHED_NORMAL soaker tasks, we see that the machine has
>> significant idle time.
>>
>> The data below highlights this problem. The test machine is a 4 socket quad-core
>> box (16 cpus). These experiments were done with v2.6.35-rc6. We spawn 16
>> SCHED_IDLE soaker threads (one per-cpu) to completely fill up the machine. CPU
>> utilization numbers gathered from mpstat for 10s are:
>>
>> 03:30:24 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:30:25 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16234.65
>> 03:30:26 PM  all   99.88    0.06    0.06    0.00    0.00    0.00    0.00    0.00  16374.00
>> 03:30:27 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16392.00
>> 03:30:28 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16612.12
>> 03:30:29 PM  all   99.88    0.00    0.12    0.00    0.00    0.00    0.00    0.00  16375.00
>> 03:30:30 PM  all   99.94    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16440.00
>> 03:30:31 PM  all   99.81    0.00    0.19    0.00    0.00    0.00    0.00    0.00  16237.62
>> 03:30:32 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16360.00
>> 03:30:33 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16405.00
>> 03:30:34 PM  all   99.38    0.06    0.50    0.00    0.00    0.00    0.00    0.06  18881.82
>> Average:     all   99.86    0.02    0.12    0.00    0.00    0.00    0.00    0.01  16628.20
>>
>> We then spawn one SCHED_NORMAL while-1 task (the absolute number does not matter
>> so long as we introduce some large weight differential).
>>
>> 03:40:57 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:40:58 PM  all   83.06    0.00    0.06    0.00    0.00    0.00    0.00   16.88  14555.00
>> 03:40:59 PM  all   78.25    0.00    0.06    0.00    0.00    0.00    0.00   21.69  14527.00
>> 03:41:00 PM  all   82.71    0.06    0.06    0.00    0.00    0.00    0.00   17.17  14879.00
>> 03:41:01 PM  all   87.34    0.00    0.06    0.00    0.00    0.00    0.00   12.59  15466.00
>> 03:41:02 PM  all   80.80    0.06    0.19    0.00    0.00    0.00    0.00   18.95  14584.00
>> 03:41:03 PM  all   82.90    0.00    0.06    0.00    0.00    0.00    0.00   17.04  14570.00
>> 03:41:04 PM  all   79.45    0.00    0.06    0.00    0.00    0.00    0.00   20.49  14536.00
>> 03:41:05 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.46  14577.00
>> 03:41:06 PM  all   76.73    0.06    0.06    0.00    0.00    0.06    0.00   23.10  14594.00
>> 03:41:07 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.45  14703.03
>> Average:     all   82.31    0.02    0.08    0.00    0.00    0.01    0.00   17.59  14699.10
>
> What happens with s/SCHED_IDLE/nice 19?
>
>        -Mike

So this is also a concern of mine that I wanted to raise.

The problem observed above is a function of balancing entities which can't
meaningfully contribute to improving the observed imbalance. While this does
occur with {nice 19, SCHED_IDLE} threads, it is a more general problem, and it
is actually much more prevalent in the group scheduling case, where the group
entity weight is more arbitrary. This is compounded by the fact that we
iterate over the task_groups in load balance and don't have a good way of
differentiating in-group high/low weight entities.

Something I have been investigating is how we can fix this problem more
generally: e.g. how can we effectively load-balance a low-weight entity in the
presence of high-weight competition? This could just as easily be a {1000, 2}
share split as it could be a {64000, 2000} share split -- the latter won't
benefit from IDLE_SCHEDULING optimizations but is a realistic test case if
someone is trying to provide minimal latencies for the high-weight task.

One avenue I have been considering: instead of load-balancing the low-weight
entities when their h_load relative to the imbalance is 'not meaningful' (by
some fitness function), tag them for a separate load-balancing pass. This
secondary pass would only need to be periodic in nature, and would consider
the 'tagged' task_groups that the generic load-balancing operations skipped.
At that point it can be re-evaluated whether the load is still 'not
meaningful', and distribution towards the most idle cpu (by utilization) can
be considered.

The one converse here is that while something like the above is more general
for the low-weight group entity case, it does not extend particularly cleanly
to the non-group-scheduling case. I haven't yet got a particularly good answer
for this, since without a group entity representing the low-weight threads,
tracking them for a second pass becomes more cumbersome.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nikhil Rao on 30 Jul 2010 15:10

On Fri, Jul 30, 2010 at 6:32 AM, Mike Galbraith <efault(a)gmx.de> wrote:
> On Thu, 2010-07-29 at 22:19 -0700, Nikhil Rao wrote:
>> Hi all,
>>
>> We have observed that a large weight differential between tasks on a runqueue
>> leads to sub-optimal machine utilization and poor load balancing. For example,
>> if you have lots of SCHED_IDLE tasks (sufficient number to keep the machine 100%
>> busy) and a few SCHED_NORMAL soaker tasks, we see that the machine has
>> significant idle time.
>>
>> The data below highlights this problem. The test machine is a 4 socket quad-core
>> box (16 cpus). These experiments were done with v2.6.35-rc6. We spawn 16
>> SCHED_IDLE soaker threads (one per-cpu) to completely fill up the machine. CPU
>> utilization numbers gathered from mpstat for 10s are:
>>
>> 03:30:24 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:30:25 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16234.65
>> 03:30:26 PM  all   99.88    0.06    0.06    0.00    0.00    0.00    0.00    0.00  16374.00
>> 03:30:27 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16392.00
>> 03:30:28 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16612.12
>> 03:30:29 PM  all   99.88    0.00    0.12    0.00    0.00    0.00    0.00    0.00  16375.00
>> 03:30:30 PM  all   99.94    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16440.00
>> 03:30:31 PM  all   99.81    0.00    0.19    0.00    0.00    0.00    0.00    0.00  16237.62
>> 03:30:32 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16360.00
>> 03:30:33 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16405.00
>> 03:30:34 PM  all   99.38    0.06    0.50    0.00    0.00    0.00    0.00    0.06  18881.82
>> Average:     all   99.86    0.02    0.12    0.00    0.00    0.00    0.00    0.01  16628.20
>>
>> We then spawn one SCHED_NORMAL while-1 task (the absolute number does not matter
>> so long as we introduce some large weight differential).
>>
>> 03:40:57 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 03:40:58 PM  all   83.06    0.00    0.06    0.00    0.00    0.00    0.00   16.88  14555.00
>> 03:40:59 PM  all   78.25    0.00    0.06    0.00    0.00    0.00    0.00   21.69  14527.00
>> 03:41:00 PM  all   82.71    0.06    0.06    0.00    0.00    0.00    0.00   17.17  14879.00
>> 03:41:01 PM  all   87.34    0.00    0.06    0.00    0.00    0.00    0.00   12.59  15466.00
>> 03:41:02 PM  all   80.80    0.06    0.19    0.00    0.00    0.00    0.00   18.95  14584.00
>> 03:41:03 PM  all   82.90    0.00    0.06    0.00    0.00    0.00    0.00   17.04  14570.00
>> 03:41:04 PM  all   79.45    0.00    0.06    0.00    0.00    0.00    0.00   20.49  14536.00
>> 03:41:05 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.46  14577.00
>> 03:41:06 PM  all   76.73    0.06    0.06    0.00    0.00    0.06    0.00   23.10  14594.00
>> 03:41:07 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.45  14703.03
>> Average:     all   82.31    0.02    0.08    0.00    0.00    0.01    0.00   17.59  14699.10
>
> What happens with s/SCHED_IDLE/nice 19?
>
>        -Mike

We see the same result with nice 19 as well.

w/ 16 nice-19 soakers:

10:15:16 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:15:17 AM  all    0.06   99.94    0.00    0.00    0.00    0.00    0.00    0.00  16296.04
10:15:18 AM  all    0.00   99.94    0.06    0.00    0.00    0.00    0.00    0.00  16379.00
10:15:19 AM  all    0.00   99.94    0.06    0.00    0.00    0.00    0.00    0.00  16414.00
10:15:20 AM  all    0.00   99.94    0.06    0.00    0.00    0.00    0.00    0.00  16413.00
10:15:21 AM  all    0.00  100.00    0.00    0.00    0.00    0.00    0.00    0.00  16402.00
10:15:22 AM  all    0.00   99.88    0.06    0.00    0.00    0.06    0.00    0.00  16419.00
10:15:23 AM  all    0.00   99.94    0.06    0.00    0.00    0.00    0.00    0.00  16406.00
10:15:24 AM  all    0.19   99.69    0.12    0.00    0.00    0.00    0.00    0.00  16613.13
10:15:25 AM  all    0.38   99.31    0.31    0.00    0.00    0.00    0.00    0.00  16313.86
10:15:26 AM  all    0.50   99.31    0.19    0.00    0.00    0.00    0.00    0.00  16623.23
Average:     all    0.11   99.79    0.09    0.00    0.00    0.01    0.00    0.00  16427.30

w/ adding a SCHED_NORMAL soaker to the mix:

10:17:44 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:17:45 AM  all    6.20   74.38    0.06    0.00    0.00    0.00    0.00   19.35  14419.80
10:17:46 AM  all    6.25   74.89    0.06    0.00    0.00    0.00    0.00   18.80  14619.00
10:17:47 AM  all    6.30   74.84    0.06    0.00    0.00    0.00    0.00   18.79  14590.00
10:17:48 AM  all    6.25   80.57    0.06    0.00    0.00    0.00    0.00   13.12  15511.00
10:17:49 AM  all    6.51   80.33    0.07    0.00    0.00    0.00    0.00   13.09  14904.00
10:17:50 AM  all    6.06   72.62    0.06    0.00    0.00    0.00    0.00   21.26  14564.00
10:17:51 AM  all    6.21   74.47    0.06    0.00    0.00    0.00    0.00   19.25  14584.00
10:17:52 AM  all    6.47   77.67    0.12    0.00    0.00    0.00    0.00   15.73  15295.96
10:17:53 AM  all    6.27   79.39    0.06    0.00    0.00    0.00    0.00   14.29  15251.00
10:17:54 AM  all    6.32   75.85    0.00    0.00    0.00    0.00    0.00   17.83  14537.00
Average:     all    6.28   76.47    0.06    0.00    0.00    0.00    0.00   17.18  14826.70

The problem is the large weight differential between nice 19/SCHED_IDLE and
SCHED_NORMAL. I ran a quick experiment with the soaker tasks at different nice
levels. Data is in the table below. The first column is the nice level, the
second is the idle% on the machine (mpstat 10s average), and the third is the
ratio of the nice weight to 1024.

 0   0.00   1
 1   0.00   0.800781
 2   0.00   0.639648
 3   0.00   0.513672
 4   0.00   0.413086
 5   0.17   0.327148
 6   1.06   0.265625
 7   7.62   0.209961
 8   4.47   0.167969
 9  11.78   0.133789
10  13.52   0.107422
11  14.92   0.0849609
12  14.33   0.0683594
13  17.47   0.0546875
14  15.89   0.0439453
15  18.69   0.0351562
16  16.63   0.0283203
17  17.04   0.0224609
18  17.86   0.0175781
19  18.13   0.0146484

It looks like we start seeing sub-optimal performance once the weight ratio
drops below about 0.3.

-Thanks,
Nikhil
From: Nikhil Rao on 3 Aug 2010 17:30
Peter, thanks for the feedback!

On Mon, Aug 2, 2010 at 4:39 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
> The thing is that from a fairness point of view 1 nice-0 (weight=1024)
> on one CPU and 512 SCHED_IDLE (weight=2) tasks on another CPU and all
> other CPUs idle is correct.
>
> It just doesn't seem to be the thing that most people expect.
>
> Special casing things like you've done is utterly the wrong thing to do.

I see your point here, and yes, I agree that having 1 nice-0 task on one cpu,
512 SCHED_IDLE tasks on another cpu, and all other cpus idle is correct if we
only considered fairness. However, we would also like to maximize machine
utilization. The fitness function we would ideally like to optimize for is a
combination of both fairness and utilization.

I'm going to take a step back here and explain the bigger problem we are
trying to solve. We have two types of workloads -- high priority tasks that
need to run for short periods of time and can consume as much cpu as required
when they run, and low priority tasks that ideally consume slack cpu. Let's
say we had a machine with enough tasks to soak up all the cpus. There are two
ways to implement the priority scheme.

The first is to run all tasks as SCHED_NORMAL. The machine is fully utilized,
but this has other side-effects: low priority tasks consume much more cpu than
we would like, they get to preempt high priority tasks, and this results in
poor performance of the high priority tasks.

The second is to run high priority tasks as SCHED_NORMAL and low priority
tasks as SCHED_IDLE. We choose SCHED_IDLE to minimize interference with high
priority tasks -- we don't want low priority tasks to preempt high priority
tasks or to be favored over them in any way. When there are high priority
tasks, we want low priority tasks to get as little cpu as possible; thus
giving them the minimum possible weight works out well.
In this case, we are able to isolate high prio tasks from low prio tasks
reasonably well. However, sub-optimal machine utilization defeats the purpose
of packing a machine with lots of low priority tasks, as we are not able to
consume the slack cpu.

This RFC is really meant to explain the problem that we are facing. We
presented one possible solution to fix this, but we are also open to other
suggestions and ideas.

> This problem comes in two forms and its name is infeasible weight
> distribution. The load-balancer tries to ensure W_k ~= W_l, k,l elem_of
> CPUs, where W_k = \Sum_i w_i^k, where w_i^k is the i-th task on CPU k.
>
> The two cases are statically infeasible, and dynamically infeasible.
>
> We say the task-set is statically infeasible if for a task set of n
> tasks there is no way to statically distribute them on N <= n CPUs such
> that each task gets equal service (assuming the scheduling on each CPU
> is fair).
>
> We say the task-set is dynamically infeasible if for the given scenario
> there is no way to rotate the tasks to obtain equality.
>
> Let's assume 2 CPUs.
>
> Ex.1: 2 tasks of different weight.
>
> Ex.2: 3 tasks of equal weight.
>
> The first example is both statically and dynamically infeasible as there
> is no way to occupy both CPUs such that each task gets the proportionally
> correct service.
>
> The second example is statically infeasible, but dynamically feasible,
> for if we rotate one task, such that we alternate between 2:1 and 1:2 in
> equal measures, each task will receive its correct 2/3rd CPU service.
>
> The current load-balancer isn't particularly skilled at either issue.
>
> The proper solution is to 'fix' find_busiest_group() so that it will:
>  - pick the heaviest cpu with more than 1 task on it
>  - slowly over-balance things
>
> The first thing will solve your issue.

Thanks for your suggestions; I explored the first one a bit, and I added a
check into find_busiest_queue() (instead of find_busiest_group()) to skip a
cpu if it has only 1 task on it (patch attached below -- did you have
something else in mind?). This fixes the example I posted in the RFC, but it
doesn't work as well when the SCHED_NORMAL tasks have a sleep/wakeup pattern.
I have some data below where the load balancer fails to fully utilize a
machine. In these examples, I ran with the upstream kernel and with a kernel
compiled with the check in fbq().

Setup: we run 16 SCHED_IDLE soakers and about half as many SCHED_NORMAL tasks
which have 100ms / 100ms sleep/busy cycles. This is actually a very common
use case that we run into.

2.6.35-rc6:

12:41:45 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:41:46 PM  all   97.50    0.00    0.00    0.00    0.00    0.00    0.00    2.50  16405.00
12:41:47 PM  all   89.72    0.00    0.06    0.00    0.00    0.00    0.00   10.22  15736.00
12:41:48 PM  all   90.54    0.06    0.06    0.00    0.00    0.00    0.00    9.34  15791.09
12:41:49 PM  all   93.00    0.00    0.06    0.00    0.00    0.00    0.00    6.93  15816.83
12:41:50 PM  all   96.44    0.00    0.06    0.00    0.00    0.06    0.00    3.44  16362.00
12:41:51 PM  all   97.62    0.00    0.06    0.00    0.00    0.00    0.00    2.31  16326.00
12:41:52 PM  all   99.56    0.00    0.06    0.00    0.00    0.00    0.00    0.38  16512.12
12:41:53 PM  all   99.06    0.00    0.06    0.00    0.00    0.00    0.00    0.88  16289.00
12:41:54 PM  all   99.50    0.00    0.06    0.00    0.00    0.00    0.00    0.44  16149.50
12:41:55 PM  all   98.06    0.00    0.06    0.00    0.00    0.00    0.00    1.88  16405.05
Average:     all   96.10    0.01    0.06    0.00    0.00    0.01    0.00    3.83  16177.92

2.6.35-rc6 + fbq-fix:

12:42:48 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:42:49 PM  all   98.07    0.00    0.06    0.00    0.00    0.00    0.00    1.87  16346.00
12:42:50 PM  all   98.75    0.00    0.12    0.00    0.00    0.00    0.00    1.12  16236.63
12:42:51 PM  all   99.56    0.06    0.19    0.00    0.00    0.00    0.00    0.19  16616.16
12:42:52 PM  all   97.94    0.00    0.06    0.00    0.00    0.06    0.00    1.94  16348.00
12:42:53 PM  all   96.94    0.00    0.25    0.00    0.00    0.00    0.00    2.81  16234.65
12:42:54 PM  all   98.56    0.06    0.06    0.00    0.00    0.00    0.00    1.31  16339.00
12:42:55 PM  all   97.56    0.00    0.19    0.00    0.00    0.00    0.00    2.25  16570.71
12:42:56 PM  all   99.50    0.00    0.06    0.00    0.00    0.00    0.00    0.44  16400.00
12:42:57 PM  all   95.25    0.00    0.37    0.00    0.00    0.00    0.00    4.37  16367.00
12:42:58 PM  all   97.75    0.00    0.06    0.00    0.00    0.00    0.00    2.19  16409.00
Average:     all   97.99    0.01    0.14    0.00    0.00    0.01    0.00    1.85  16386.00

-Thanks,
Nikhil

---
fbq-fix patch:

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index a878b53..e05c61f 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2742,13 +2742,8 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
 			continue;
 
 		rq = cpu_rq(i);
 
-		wl = weighted_cpuload(i);
-		/*
-		 * When comparing with imbalance, use weighted_cpuload()
-		 * which is not scaled with the cpu power.
-		 */
-		if (capacity && rq->nr_running == 1 && wl > imbalance)
+		if (capacity && rq->nr_running == 1)
 			continue;
 
 		/*
@@ -2757,6 +2752,7 @@ find_busiest_queue(struct sched_group *group, enum cpu_idle_type idle,
 		 * the load can be moved away from the cpu that is potentially
 		 * running at a lower capacity.
 		 */
+		wl = weighted_cpuload(i);
 		wl = (wl * SCHED_LOAD_SCALE) / power;
 
 		if (wl > max_load) {