From: Nikhil Rao on 30 Jul 2010 01:30

Hi all,

We have observed that a large weight differential between tasks on a runqueue leads to sub-optimal machine utilization and poor load balancing. For example, if you have lots of SCHED_IDLE tasks (enough to keep the machine 100% busy) and a few SCHED_NORMAL soaker tasks, we see significant idle time on the machine. The data below highlights this problem.

The test machine is a 4-socket quad-core box (16 cpus). These experiments were done with v2.6.35-rc6. We spawn 16 SCHED_IDLE soaker threads (one per cpu) to completely fill up the machine. CPU utilization numbers gathered from mpstat over 10s are:

03:30:24 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:30:25 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16234.65
03:30:26 PM  all   99.88    0.06    0.06    0.00    0.00    0.00    0.00    0.00  16374.00
03:30:27 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16392.00
03:30:28 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16612.12
03:30:29 PM  all   99.88    0.00    0.12    0.00    0.00    0.00    0.00    0.00  16375.00
03:30:30 PM  all   99.94    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16440.00
03:30:31 PM  all   99.81    0.00    0.19    0.00    0.00    0.00    0.00    0.00  16237.62
03:30:32 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16360.00
03:30:33 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16405.00
03:30:34 PM  all   99.38    0.06    0.50    0.00    0.00    0.00    0.00    0.06  18881.82
Average:     all   99.86    0.02    0.12    0.00    0.00    0.00    0.00    0.01  16628.20

We then spawn one SCHED_NORMAL while-1 task (the absolute number does not matter, so long as we introduce some large weight differential):

03:40:57 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
03:40:58 PM  all   83.06    0.00    0.06    0.00    0.00    0.00    0.00   16.88  14555.00
03:40:59 PM  all   78.25    0.00    0.06    0.00    0.00    0.00    0.00   21.69  14527.00
03:41:00 PM  all   82.71    0.06    0.06    0.00    0.00    0.00    0.00   17.17  14879.00
03:41:01 PM  all   87.34    0.00    0.06    0.00    0.00    0.00    0.00   12.59  15466.00
03:41:02 PM  all   80.80    0.06    0.19    0.00    0.00    0.00    0.00   18.95  14584.00
03:41:03 PM  all   82.90    0.00    0.06    0.00    0.00    0.00    0.00   17.04  14570.00
03:41:04 PM  all   79.45    0.00    0.06    0.00    0.00    0.00    0.00   20.49  14536.00
03:41:05 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.46  14577.00
03:41:06 PM  all   76.73    0.06    0.06    0.00    0.00    0.06    0.00   23.10  14594.00
03:41:07 PM  all   86.48    0.00    0.07    0.00    0.00    0.00    0.00   13.45  14703.03
Average:     all   82.31    0.02    0.08    0.00    0.00    0.01    0.00   17.59  14699.10

The machine utilization is sub-optimal, and this doesn't look right. On investigating this further, we noticed a couple of things.

In load balancing operations at any sched domain where the group weight is greater than one (i.e. groups span more than one cpu), the group containing the SCHED_NORMAL task is always selected as the busiest group. While this is expected (SCHED_NORMAL at nice 0 has a factor of 512 times the weight of SCHED_IDLE), this weight differential leads to sub-optimal machine utilization. CPUs outside this group can only pull SCHED_IDLE tasks, because pulling the SCHED_NORMAL task would be far in excess of the imbalance. The load balancer pulls the queued SCHED_IDLE tasks away from this group and soon starts failing balances, because it is unable to move the currently running tasks. Eventually active migration kicks in and pushes the remaining SCHED_IDLE tasks away as well, leaving idle cpus in the busiest sched group. These idle cpus remain idle for long periods of time even though other cpus have large runqueue depths (albeit filled with SCHED_IDLE tasks).
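To make the weight differential concrete, here is a small illustrative calculation. This is plain user-space C, not part of the attached patches; the nice-0 weight of 1024 is the usual value, and the SCHED_IDLE weight of 2 is assumed here to match the factor-of-512 figure above.

/*
 * Illustrative arithmetic only -- not kernel code and not part of the
 * attached patches.
 */
#include <stdio.h>

int main(void)
{
        const unsigned long nice0_weight = 1024;  /* SCHED_NORMAL, nice 0 */
        const unsigned long idle_weight  = 2;     /* WEIGHT_IDLEPRIO (assumed) */

        /* 4-cpu group holding the SCHED_NORMAL task plus 3 SCHED_IDLE tasks */
        unsigned long busy_group  = nice0_weight + 3 * idle_weight;
        /* 4-cpu group holding only 4 SCHED_IDLE tasks */
        unsigned long other_group = 4 * idle_weight;

        printf("group with SCHED_NORMAL task: load %lu\n", busy_group);   /* 1030 */
        printf("group with only SCHED_IDLE:   load %lu\n", other_group);  /*    8 */

        /*
         * find_busiest_group() keeps picking the first group, but the only
         * task heavy enough to correct the imbalance is the running
         * SCHED_NORMAL task, so only the tiny SCHED_IDLE tasks get pulled.
         */
        return 0;
}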
Idle balancing on these cpus always selects the sched group with the SCHED_NORMAL task to pull tasks from, and always fails because that group has only one runnable task.

We see the same problem with group scheduling as well: if we have two groups with a large weight differential, machine utilization is sub-optimal. While this RFC focuses on SCHED_IDLE and SCHED_NORMAL, the same arguments extend to SCHED_IDLE-like group entities (i.e. groups with cpu.shares set to 2).

In the remainder of this RFC, we present a solution to this problem and attach a prototype patchset for feedback and comments. Please let me know if you have alternate ideas; I would be happy to explore other solutions.

This patchset changes the way the kernel balances SCHED_IDLE tasks. Since SCHED_IDLE tasks contribute very little weight to the runqueues, we do not balance them together with SCHED_NORMAL/SCHED_BATCH tasks. Instead, we do a second pass to balance SCHED_IDLE tasks using a different metric. We iterate over each cpu in the sched domain span and calculate its idle load as follows:

    load = (idle_nr_running * WEIGHT_IDLEPRIO) / idle_power

The metric is the ratio of the load contributed by SCHED_IDLE tasks to the power available for running SCHED_IDLE tasks. We determine the available idle power similarly to the RT power scaling calculations, i.e. we scale a CPU's available idle power based on the average SCHED_NORMAL/SCHED_BATCH activity over a given period. (A rough illustrative sketch of this metric appears at the end of this mail, before the shortlog.)

Here are some results for a 2.6.35-rc6 kernel + patchset, again with 16 SCHED_IDLE soaker threads and one SCHED_NORMAL task:

06:55:16 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
06:55:17 PM  all   99.88    0.00    0.12    0.00    0.00    0.00    0.00    0.00  16228.71
06:55:18 PM  all   99.81    0.00    0.19    0.00    0.00    0.00    0.00    0.00  16516.16
06:55:19 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16197.03
06:55:20 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16376.00
06:55:21 PM  all   99.88    0.06    0.06    0.00    0.00    0.00    0.00    0.00  16406.00
06:55:22 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16349.00
06:55:23 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16343.00
06:55:24 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16354.00
06:55:25 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16521.21
06:55:26 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00  16189.11
Average:     all   99.92    0.01    0.08    0.00    0.00    0.00    0.00    0.00  16347.25

Please find attached patches against v2.6.35-rc6. Note that this patchset is a prototype and has known limitations:
- not well integrated with the existing LB paths
- does not support group scheduling
- needs to be extended to support preempt kernels
- needs to be power-savings aware
- needs schedstat support
- written/tested for x86_64, SMP

If folks think this is a good direction and the idea has some merit, I will be more than happy to continue working with the community to improve this patchset. If there is another way to fix this problem, please do tell; I would be more than happy to help out.
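For illustration, here is the rough user-space sketch of the idle-load metric referred to above. It is not the attached kernel code: the struct fields and helper names (cpu_stats, scale_idle_power, idle_load) are invented for this sketch, the WEIGHT_IDLEPRIO value is assumed, and the power scaling simply mirrors the idea behind the existing RT power scaling.

/*
 * Rough user-space sketch of the proposed SCHED_IDLE balance metric.
 * Not the attached kernel patches; names and constants are illustrative.
 */
#include <stdio.h>

#define WEIGHT_IDLEPRIO  2      /* weight of a SCHED_IDLE task (assumed) */
#define POWER_SCALE      1024   /* nominal capacity of one cpu */

struct cpu_stats {
        unsigned int idle_nr_running;   /* queued SCHED_IDLE tasks */
        unsigned long long period_ns;   /* averaging window */
        unsigned long long normal_ns;   /* time spent on SCHED_NORMAL/
                                         * SCHED_BATCH tasks in the window */
};

/*
 * Capacity left over for SCHED_IDLE tasks, in the spirit of the RT power
 * scaling: the busier a cpu is with SCHED_NORMAL/SCHED_BATCH work, the
 * less "idle power" it advertises.
 */
static unsigned long scale_idle_power(const struct cpu_stats *cs)
{
        unsigned long long avail = cs->period_ns - cs->normal_ns;
        unsigned long power = (unsigned long)(avail * POWER_SCALE / cs->period_ns);

        return power ? power : 1;       /* avoid division by zero */
}

/* load = (idle_nr_running * WEIGHT_IDLEPRIO) / idle_power */
static unsigned long idle_load(const struct cpu_stats *cs)
{
        return cs->idle_nr_running * WEIGHT_IDLEPRIO * POWER_SCALE /
               scale_idle_power(cs);
}

int main(void)
{
        /* cpu running the SCHED_NORMAL hog plus one queued SCHED_IDLE task */
        struct cpu_stats busy = { 1, 10000000ULL, 9900000ULL };
        /* cpu running only SCHED_IDLE soakers */
        struct cpu_stats idle = { 4, 10000000ULL, 0ULL };

        /*
         * The cpu dominated by SCHED_NORMAL work reports the higher idle
         * load, so the second balancing pass pulls SCHED_IDLE tasks away
         * from it and spreads them over cpus with spare idle power.
         */
        printf("busy cpu idle load: %lu\n", idle_load(&busy));
        printf("idle cpu idle load: %lu\n", idle_load(&idle));
        return 0;
}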
-Thanks
Nikhil

---

Nikhil Rao (6):
  sched: account SCHED_IDLE tasks on rq->idle_nr_running
  sched: add SD_IDLE_LOAD_BALANCE to sched domain flags
  sched: add moving average of time spent servicing SCHED_NORMAL tasks
  sched: add sched_idle_balance argument to lb functions
  sched: add SCHED_IDLE load balancer
  sched: enable SD_IDLE_LOAD_BALANCE on MC, CPU and NUMA (x86) domains

 arch/x86/include/asm/topology.h |    1 +
 include/linux/sched.h           |    1 +
 include/linux/topology.h       |    4 +
 kernel/sched.c                  |   27 +++++++
 kernel/sched_fair.c             |  161 +++++++++++++++++++++++++++++++++++++--
 5 files changed, 186 insertions(+), 8 deletions(-)