From ff634002acea31b24fc306745e59156e373d7044 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Thu, 8 Dec 2016 17:56:53 +0100 Subject: [PATCH 01/25] UPSTREAM: sched: fix find_idlest_group for fork During fork, the utilization of a task is init once the rq has been selected because the current utilization level of the rq is used to set the utilization of the fork task. As the task's utilization is still null at this step of the fork sequence, it doesn't make sense to look for some spare capacity that can fit the task's utilization. Furthermore, I can see perf regressions for the test "hackbench -P -g 1" because the least loaded policy is always bypassed and tasks are not spread during fork. With this patch and the fix below, we are back to same performances as for v4.8. The fix below is only a temporary one used for the test until a smarter solution is found because we can't simply remove the test which is useful for others benchmarks @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t avg_cost = this_sd->avg_scan_cost; - /* - * Due to large variance we need a large fuzz factor; hackbench in - * particularly is sensitive here. - */ - if ((avg_idle / 512) < avg_cost) - return -1; - time = local_clock(); for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) { Change-Id: If7dd1c2a81fd74419733b8ecff519ae8cb8eb32d Signed-off-by: Vincent Guittot Acked-by: Morten Rasmussen Tested-by: Matt Fleming Reviewed-by: Matt Fleming --- kernel/sched/fair.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index febc9abab23d..eb24450fc0a5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6206,13 +6206,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, * utilized systems if we require spare_capacity > task_util(p), * so we allow for some task stuffing by using * spare_capacity > task_util(p)/2. + * spare capacity can't be used for fork because the utilization has + * not been set yet as it need to get a rq to init the utilization */ + if (sd_flag & SD_BALANCE_FORK) + goto skip_spare; + if (this_spare > task_util(p) / 2 && imbalance*this_spare > 100*most_spare) return NULL; else if (most_spare > task_util(p) / 2) return most_spare_sg; +skip_spare: if (!idlest || 100*this_load < imbalance*min_load) return NULL; return idlest; From 08c53a651c24ec6350d5ef55c21bee81cd4ca055 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Thu, 8 Dec 2016 17:56:54 +0100 Subject: [PATCH 02/25] UPSTREAM: sched: use load_avg for selecting idlest group find_idlest_group() only compares the runnable_load_avg when looking for the least loaded group. But on fork intensive use case like hackbench where tasks blocked quickly after the fork, this can lead to selecting the same CPU instead of other CPUs, which have similar runnable load but a lower load_avg. When the runnable_load_avg of 2 CPUs are close, we now take into account the amount of blocked load as a 2nd selection factor. There is now 3 zones for the runnable_load of the rq: -[0 .. (runnable_load - imbalance)] : Select the new rq which has significantly less runnable_load -](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The runnable loads are close so we use load_avg to chose between the 2 rq -[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which has significantly less runnable_load The scale factor that is currently used for comparing runnable_load, doesn't work well with small value. As an example, the use of a scaling factor fails as soon as this_runnable_load == 0 because we always select local rq even if min_runnable_load is only 1, which doesn't really make sense because they are just the same. So instead of scaling factor, we use an absolute margin for runnable_load to detect CPUs with similar runnable_load and we keep using scaling factor for blocked load. For use case like hackbench, this enable the scheduler to select different CPUs during the fork sequence and to spread tasks across the system. Tests have been done on a Hikey board (ARM based octo cores) for several kernel. The result below gives min, max, avg and stdev values of 18 runs with each configuration. The v4.8+patches configuration also includes the changes below which is part of the proposal made by Peter to ensure that the clock will be up to date when the fork task will be attached to the rq. @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p) __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif rq = __task_rq_lock(p, &rf); + update_rq_clock(rq); post_init_entity_util_avg(&p->se); activate_task(rq, p, 0); hackbench -P -g 1 ea86cb4b7621 7dc603c9028e v4.8 v4.8+patches min 0.049 0.050 0.051 0,048 avg 0.057 0.057(0%) 0.057(0%) 0,055(+5%) max 0.066 0.068 0.070 0,063 stdev +/-9% +/-9% +/-8% +/-9% Change-Id: I8fe40bbd106454282b1963db825705f3cf377b9e Signed-off-by: Vincent Guittot Tested-by: Matt Fleming Reviewed-by: Matt Fleming --- kernel/sched/fair.c | 48 +++++++++++++++++++++++++++++++++++---------- 1 file changed, 38 insertions(+), 10 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index eb24450fc0a5..e0090a8bdb24 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6138,16 +6138,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, { struct sched_group *idlest = NULL, *group = sd->groups; struct sched_group *most_spare_sg = NULL; - unsigned long min_load = ULONG_MAX, this_load = 0; + unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0; + unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0; unsigned long most_spare = 0, this_spare = 0; int load_idx = sd->forkexec_idx; - int imbalance = 100 + (sd->imbalance_pct-100)/2; + int imbalance_scale = 100 + (sd->imbalance_pct-100)/2; + unsigned long imbalance = scale_load_down(NICE_0_LOAD) * + (sd->imbalance_pct-100) / 100; if (sd_flag & SD_BALANCE_WAKE) load_idx = sd->wake_idx; do { - unsigned long load, avg_load, spare_cap, max_spare_cap; + unsigned long load, avg_load, runnable_load; + unsigned long spare_cap, max_spare_cap; int local_group; int i; @@ -6164,6 +6168,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, * the group containing the CPU with most spare capacity. */ avg_load = 0; + runnable_load = 0; max_spare_cap = 0; for_each_cpu(i, sched_group_cpus(group)) { @@ -6173,7 +6178,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, else load = target_load(i, load_idx); - avg_load += load; + runnable_load += load; + + avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs); spare_cap = capacity_spare_wake(i, p); @@ -6182,14 +6189,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, } /* Adjust by relative CPU capacity of the group */ - avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity; + avg_load = (avg_load * SCHED_CAPACITY_SCALE) / + group->sgc->capacity; + runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / + group->sgc->capacity; if (local_group) { - this_load = avg_load; + this_runnable_load = runnable_load; + this_avg_load = avg_load; this_spare = max_spare_cap; } else { - if (avg_load < min_load) { - min_load = avg_load; + if (min_runnable_load > (runnable_load + imbalance)) { + /* + * The runnable load is significantly smaller + * so we can pick this new cpu + */ + min_runnable_load = runnable_load; + min_avg_load = avg_load; + idlest = group; + } else if ((runnable_load < (min_runnable_load + imbalance)) && + (100*min_avg_load > imbalance_scale*avg_load)) { + /* + * The runnable loads are close so we take + * into account blocked load through avg_load + * which is blocked + runnable load + */ + min_avg_load = avg_load; idlest = group; } @@ -6213,13 +6238,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, goto skip_spare; if (this_spare > task_util(p) / 2 && - imbalance*this_spare > 100*most_spare) + imbalance_scale*this_spare > 100*most_spare) return NULL; else if (most_spare > task_util(p) / 2) return most_spare_sg; skip_spare: - if (!idlest || 100*this_load < imbalance*min_load) + if (!idlest || + (min_runnable_load > (this_runnable_load + imbalance)) || + ((this_runnable_load < (min_runnable_load + imbalance)) && + (100*this_avg_load < imbalance_scale*min_avg_load))) return NULL; return idlest; } From e019d8f5b53cddc4f2067231240f8acaebfe44d4 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Thu, 5 Oct 2017 11:58:54 +0100 Subject: [PATCH 03/25] UPSTREAM: sched/fair: Force balancing on NOHZ balance if local group has capacity The "goto force_balance" here is intended to mitigate the fact that avg_load calculations can result in bad placement decisions when priority is asymmetrical. The original commit that adds it: fab476228ba3 ("sched: Force balancing on newidle balance if local group has capacity") explains: Under certain situations, such as a niced down task (i.e. nice = -15) in the presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and kicks away other tasks because of its large weight. This leads to sub-optimal utilization of the machine. Even though the sched group has capacity, it does not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL. A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU capacity) systems - consider 8 always-running, same-priority tasks on a system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on the "big" CPUs (which will be represented by one sched_group in the DIE sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving one CPU unused. Because the "big" group has a higher group_capacity its avg_load may not present an imbalance that would cause migrating a task to the idle "little". The force_balance case here solves the problem but currently only for CPU_NEWLY_IDLE balances, which in theory might never happen on the unused CPU. Including CPU_IDLE in the force_balance case means there's an upper bound on the time before we can attempt to solve the underutilization: after DIE's sd->balance_interval has passed the next nohz balance kick will help us out. Change-Id: I4cbc8ced11e6a56f98ec67633e67e09cbb515268 Signed-off-by: Brendan Jackman Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Paul Turner Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com Signed-off-by: Ingo Molnar (cherry picked from commit 583ffd99d7657755736d831bbc182612d1d2697d) Signed-off-by: Brendan Jackman --- kernel/sched/fair.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e0090a8bdb24..61b86fdbc91b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9078,8 +9078,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env) if (busiest->group_type == group_imbalanced) goto force_balance; - /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) && + /* + * When dst_cpu is idle, prevent SMP nice and/or asymmetric group + * capacities from resulting in underutilization due to avg_load. + */ + if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) && busiest->group_no_capacity) goto force_balance; From 6cc75c9bd23036959e22aba456d0eac8e69d3692 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Thu, 5 Oct 2017 12:45:12 +0100 Subject: [PATCH 04/25] UPSTREAM: sched/fair: Move select_task_rq_fair() slow-path into its own function In preparation for changes that would otherwise require adding a new level of indentation to the while(sd) loop, create a new function find_idlest_cpu() which contains this loop, and rename the existing find_idlest_cpu() to find_idlest_group_cpu(). Code inside the while(sd) loop is unchanged. @new_cpu is added as a variable in the new function, with the same initial value as the @new_cpu in select_task_rq_fair(). Change-Id: Ic08cc71eb4d0408b8987b2b15fcb7326956d7e81 Suggested-by: Peter Zijlstra Signed-off-by: Brendan Jackman Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Josef Bacik Reviewed-by: Vincent Guittot Cc: Dietmar Eggemann Cc: Josef Bacik Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20171005114516.18617-2-brendan.jackman@arm.com Signed-off-by: Ingo Molnar (cherry picked from commit 18bd1b4bd53aba81d76d55e91a68310a227dc187) Signed-off-by: Brendan Jackman --- kernel/sched/fair.c | 116 ++++++++++++++++++++++++-------------------- 1 file changed, 63 insertions(+), 53 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 61b86fdbc91b..81d73d554cb2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6253,10 +6253,10 @@ skip_spare: } /* - * find_idlest_cpu - find the idlest cpu among the cpus in group. + * find_idlest_group_cpu - find the idlest cpu among the cpus in group. */ static int -find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) { unsigned long load, min_load = ULONG_MAX; unsigned int min_exit_latency = UINT_MAX; @@ -6305,6 +6305,66 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu; } +static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, + int cpu, int prev_cpu, int sd_flag) +{ + int wu = sd_flag & SD_BALANCE_WAKE; + int cas_cpu = -1; + int new_cpu = prev_cpu; + + if (wu) { + schedstat_inc(p->se.statistics.nr_wakeups_cas_attempts); + schedstat_inc(this_rq()->eas_stats.cas_attempts); + } + + + while (sd) { + struct sched_group *group; + struct sched_domain *tmp; + int weight; + + if (wu) + schedstat_inc(sd->eas_stats.cas_attempts); + + if (!(sd->flags & sd_flag)) { + sd = sd->child; + continue; + } + + group = find_idlest_group(sd, p, cpu, sd_flag); + if (!group) { + sd = sd->child; + continue; + } + + new_cpu = find_idlest_group_cpu(group, p, cpu); + if (new_cpu == -1 || new_cpu == cpu) { + /* Now try balancing at a lower domain level of cpu */ + sd = sd->child; + continue; + } + + /* Now try balancing at a lower domain level of new_cpu */ + cpu = cas_cpu = new_cpu; + weight = sd->span_weight; + sd = NULL; + for_each_domain(cpu, tmp) { + if (weight <= tmp->span_weight) + break; + if (tmp->flags & sd_flag) + sd = tmp; + } + /* while loop will break here if sd == NULL */ + } + + if (wu && (cas_cpu >= 0)) { + schedstat_inc(p->se.statistics.nr_wakeups_cas_count); + schedstat_inc(this_rq()->eas_stats.cas_count); + } + + return new_cpu; +} + #ifdef CONFIG_SCHED_SMT static inline void set_idle_cores(int cpu, int val) @@ -7053,57 +7113,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); } else { - int wu = sd_flag & SD_BALANCE_WAKE; - int cas_cpu = -1; - - if (wu) { - schedstat_inc(p->se.statistics.nr_wakeups_cas_attempts); - schedstat_inc(this_rq()->eas_stats.cas_attempts); - } - - - while (sd) { - struct sched_group *group; - int weight; - - if (wu) - schedstat_inc(sd->eas_stats.cas_attempts); - - if (!(sd->flags & sd_flag)) { - sd = sd->child; - continue; - } - - group = find_idlest_group(sd, p, cpu, sd_flag); - if (!group) { - sd = sd->child; - continue; - } - - new_cpu = find_idlest_cpu(group, p, cpu); - if (new_cpu == -1 || new_cpu == cpu) { - /* Now try balancing at a lower domain level of cpu */ - sd = sd->child; - continue; - } - - /* Now try balancing at a lower domain level of new_cpu */ - cpu = cas_cpu = new_cpu; - weight = sd->span_weight; - sd = NULL; - for_each_domain(cpu, tmp) { - if (weight <= tmp->span_weight) - break; - if (tmp->flags & sd_flag) - sd = tmp; - } - /* while loop will break here if sd == NULL */ - } - - if (wu && (cas_cpu >= 0)) { - schedstat_inc(p->se.statistics.nr_wakeups_cas_count); - schedstat_inc(this_rq()->eas_stats.cas_count); - } + new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag); } rcu_read_unlock(); From da6485cee9a29f6f022200c3e2745e8e40f6568a Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Thu, 5 Oct 2017 12:45:13 +0100 Subject: [PATCH 05/25] UPSTREAM: sched/fair: Remove unnecessary comparison with -1 Since commit: 83a0a96a5f26 ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu") find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1, so we can simplify the checking of the return value in find_idlest_cpu(). Change-Id: I43828722863cf286971c92fea1f3019f00b8e627 Signed-off-by: Brendan Jackman Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Josef Bacik Reviewed-by: Vincent Guittot Cc: Dietmar Eggemann Cc: Josef Bacik Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com Signed-off-by: Ingo Molnar (cherry picked from commit e90381eaecf6d59c60fe396838e0e99789531419) Signed-off-by: Brendan Jackman --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 81d73d554cb2..c287647c4500 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6338,7 +6338,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p } new_cpu = find_idlest_group_cpu(group, p, cpu); - if (new_cpu == -1 || new_cpu == cpu) { + if (new_cpu == cpu) { /* Now try balancing at a lower domain level of cpu */ sd = sd->child; continue; From 13c4d4fdcae2d228df2b99dd3f7f921e581e2bc6 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Thu, 5 Oct 2017 12:45:14 +0100 Subject: [PATCH 06/25] UPSTREAM: sched/fair: Fix find_idlest_group() when local group is not allowed When the local group is not allowed we do not modify this_*_load from their initial value of 0. That means that the load checks at the end of find_idlest_group cause us to incorrectly return NULL. Fixing the initial values to ULONG_MAX means we will instead return the idlest remote group in that case. Change-Id: I4cb6269a653205b1d1ba6dce1ca42359faf85362 Signed-off-by: Brendan Jackman Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Vincent Guittot Reviewed-by: Josef Bacik Cc: Dietmar Eggemann Cc: Josef Bacik Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com Signed-off-by: Ingo Molnar (cherry picked from commit 0d10ab952e99f3e9f374898e93f45452b81e5711) Signed-off-by: Brendan Jackman --- kernel/sched/fair.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c287647c4500..ecef0796ddd7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6138,8 +6138,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, { struct sched_group *idlest = NULL, *group = sd->groups; struct sched_group *most_spare_sg = NULL; - unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0; - unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0; + unsigned long min_runnable_load = ULONG_MAX; + unsigned long this_runnable_load = ULONG_MAX; + unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX; unsigned long most_spare = 0, this_spare = 0; int load_idx = sd->forkexec_idx; int imbalance_scale = 100 + (sd->imbalance_pct-100)/2; From a8c3c11c0e045c8b62b23b3f68a593699273a312 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Thu, 5 Oct 2017 12:45:15 +0100 Subject: [PATCH 07/25] UPSTREAM: sched/fair: Fix usage of find_idlest_group() when no groups are allowed When 'p' is not allowed on any of the CPUs in the sched_domain, we currently return NULL from find_idlest_group(), and pointlessly continue the search on lower sched_domain levels (where 'p' is also not allowed) before returning prev_cpu regardless (as we have not updated new_cpu). Add an explicit check for this case, and add a comment to find_idlest_group(). Now when find_idlest_group() returns NULL, it always means that the local group is allowed and idlest. Change-Id: I900538644e62dab68f1cfddb59d41846d991ce09 Signed-off-by: Brendan Jackman Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Vincent Guittot Reviewed-by: Josef Bacik Cc: Dietmar Eggemann Cc: Josef Bacik Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com Signed-off-by: Ingo Molnar (cherry picked from commit 6fee85ccbc76e8aeba43dc120c5fa3c5409a4e2c) Signed-off-by: Brendan Jackman --- kernel/sched/fair.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ecef0796ddd7..e1ef26684d08 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6131,6 +6131,8 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p) /* * find_idlest_group finds and returns the least busy CPU group within the * domain. + * + * Assumes p is allowed on at least one CPU in sd. */ static struct sched_group * find_idlest_group(struct sched_domain *sd, struct task_struct *p, @@ -6318,6 +6320,8 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p schedstat_inc(this_rq()->eas_stats.cas_attempts); } + if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed)) + return prev_cpu; while (sd) { struct sched_group *group; From 55f81c1c6832be81f127a2672b7c7bf41f3bd104 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Thu, 5 Oct 2017 12:45:16 +0100 Subject: [PATCH 08/25] UPSTREAM: sched/fair: Fix usage of find_idlest_group() when the local group is idlest find_idlest_group() returns NULL when the local group is idlest. The caller then continues the find_idlest_group() search at a lower level of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is not consulted and, crucially, @new_cpu is not updated. This means the search is pointless and we return @prev_cpu from select_task_rq_fair(). This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu. Change-Id: I8644a35179b00d4833befe237aa443ec840046ec Signed-off-by: Brendan Jackman Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Josef Bacik Reviewed-by: Vincent Guittot Cc: Dietmar Eggemann Cc: Josef Bacik Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com Signed-off-by: Ingo Molnar (cherry picked from commit 93f50f90247e3e926bbe9830df089c64a5cec236) Signed-off-by: Brendan Jackman --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e1ef26684d08..52d0da646c47 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6313,7 +6313,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p { int wu = sd_flag & SD_BALANCE_WAKE; int cas_cpu = -1; - int new_cpu = prev_cpu; + int new_cpu = cpu; if (wu) { schedstat_inc(p->se.statistics.nr_wakeups_cas_attempts); From 185507e31ec7f3c18648864da1ef13eb7e21d4de Mon Sep 17 00:00:00 2001 From: Chris Redpath Date: Wed, 25 Oct 2017 17:25:20 +0100 Subject: [PATCH 09/25] ANDROID: sched/fair: Select correct capacity state for energy_diff The util returned from group_max_util is not capped at the max util present in the group, so it can be larger than the capacity stored in the array. Ensure that when this happens, we always use the last entry in the array to fetch energy from. Tested with synthetics on Juno board. Change-Id: I89fb52fb7e68fa3e682e308acc232596672d03f7 Signed-off-by: Chris Redpath --- kernel/sched/fair.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 52d0da646c47..0731623be1ee 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5543,17 +5543,20 @@ long group_norm_util(struct energy_env *eenv, struct sched_group *sg) static int find_new_capacity(struct energy_env *eenv, const struct sched_group_energy * const sge) { - int idx; + int idx, max_idx = sge->nr_cap_states - 1; unsigned long util = group_max_util(eenv); + /* default is max_cap if we don't find a match */ + eenv->cap_idx = max_idx; + for (idx = 0; idx < sge->nr_cap_states; idx++) { - if (sge->cap_states[idx].cap >= util) + if (sge->cap_states[idx].cap >= util) { + eenv->cap_idx = idx; break; + } } - eenv->cap_idx = idx; - - return idx; + return eenv->cap_idx; } static int group_idle_state(struct energy_env *eenv, struct sched_group *sg) From dad887ffaf76c244cf4c24897b660e6bd6bb9333 Mon Sep 17 00:00:00 2001 From: Dietmar Eggemann Date: Thu, 29 Jun 2017 12:22:54 +0100 Subject: [PATCH 10/25] sched/fair: remove erroneous RCU_LOCKDEP_WARN from start_cpu() Fixes: https://bugs.linaro.org/show_bug.cgi?id=3075 Change-Id: I62d714fc4b9366a9b2535649aa92d1edc840cf94 Reported-by: Naresh Kamboju Signed-off-by: Dietmar Eggemann Signed-off-by: Brendan Jackman Signed-off-by: Chris Redpath (cherry picked from commit 2617b1391f69dfbd6ffc340c235059107745aac2) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d10a6d56cccb..b3aa6279f161 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6707,9 +6707,6 @@ static int start_cpu(bool boosted) { struct root_domain *rd = cpu_rq(smp_processor_id())->rd; - RCU_LOCKDEP_WARN(rcu_read_lock_sched_held(), - "sched RCU must be held"); - return boosted ? rd->max_cap_orig_cpu : rd->min_cap_orig_cpu; } From 22f112785057ea45aa25026e6bcc790efdc5048f Mon Sep 17 00:00:00 2001 From: Chris Redpath Date: Tue, 20 Jun 2017 12:12:49 +0100 Subject: [PATCH 11/25] cpufreq/sched: Use cpu max freq rather than policy max When we convert capacity into frequency, we used policy->max to get the max freq of the cpu. Since this can be changed by userspace policy or thermal events, we are potentially asking for a lower frequency than the utilization demands. Change over to using cpuinfo.max which is the max freq supported by that cpu rather than the currently-chosen max. Frequency granted still honours the max policy. Tested by setting a userspace policy and observing the relevant vars in a trace. In this instance, we ask for around 1ghz instead of 620MHz. freq_new=1013512 unfixed_freq_new=624487 capacity=546 cpuinfo_max=1900800 policy_max=1171200 Change-Id: I8c5694db42243c6fb78bb9be9046b06ac81295e7 Signed-off-by: Chris Redpath (cherry picked from commit 114c6ceb2e5b0e30442e2ba5f78cfaedf76bf1f9) [trivial cherry-pick issue] Signed-off-by: Quentin Perret --- kernel/sched/cpufreq_sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c index 1b19f2643f48..b0ef17317846 100644 --- a/kernel/sched/cpufreq_sched.c +++ b/kernel/sched/cpufreq_sched.c @@ -202,7 +202,7 @@ static void update_fdomain_capacity_request(int cpu) } /* Convert the new maximum capacity request into a cpu frequency */ - freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT; + freq_new = capacity * policy->cpuinfo.max_freq >> SCHED_CAPACITY_SHIFT; index_new = cpufreq_frequency_table_target(policy, freq_new, CPUFREQ_RELATION_L); freq_new = policy->freq_table[index_new].frequency; From d870d26fb3ff361243bdeb185ae68f43e60fdd62 Mon Sep 17 00:00:00 2001 From: Chris Redpath Date: Tue, 25 Apr 2017 10:37:58 +0100 Subject: [PATCH 12/25] cpufreq/sched: Consider max cpu capacity when choosing frequencies When using schedfreq on cpus with max capacity significantly smaller than 1024, the tick update uses non-normalised capacities - this leads to selecting an incorrect OPP as we were scaling the frequency as if the max capacity achievable was 1024 rather than the max for that particular cpu or group. This could result in a cpu being stuck at the lowest OPP and unable to generate enough utilisation to climb out if the max capacity is significantly smaller than 1024. Instead, normalize the capacity to be in the range 0-1024 in the tick so that when we later select a frequency, we get the correct one. Also comments updated to be clearer about what is needed. Change-Id: Id84391c7ac015311002ada21813a353ee13bee60 Signed-off-by: Chris Redpath (cherry picked from commit bc3b0db024f3d0b323e7d93994655e9e3d9f8d68) Signed-off-by: Quentin Perret --- kernel/sched/core.c | 4 ++++ kernel/sched/fair.c | 4 ++-- kernel/sched/sched.h | 4 ++++ 3 files changed, 10 insertions(+), 2 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c7a317543b62..fd10180d3072 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3175,7 +3175,9 @@ static void sched_freq_tick_pelt(int cpu) * utilization and to harm its performance the least, request * a jump to a higher OPP as soon as the margin of free capacity * is impacted (specified by capacity_margin). + * Remember CPU utilization in sched_capacity_reqs should be normalised. */ + cpu_utilization = cpu_utilization * SCHED_CAPACITY_SCALE / capacity_orig_of(cpu); set_cfs_cpu_capacity(cpu, true, cpu_utilization); } @@ -3202,7 +3204,9 @@ static void sched_freq_tick_walt(int cpu) * It is likely that the load is growing so we * keep the added margin in our request as an * extra boost. + * Remember CPU utilization in sched_capacity_reqs should be normalised. */ + cpu_utilization = cpu_utilization * SCHED_CAPACITY_SCALE / capacity_orig_of(cpu); set_cfs_cpu_capacity(cpu, true, cpu_utilization); } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b3aa6279f161..b2c5d0ced614 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4803,7 +4803,7 @@ static void update_capacity_of(int cpu) if (!sched_freq()) return; - /* Convert scale-invariant capacity to cpu. */ + /* Normalize scale-invariant capacity to cpu. */ req_cap = boosted_cpu_util(cpu); req_cap = req_cap * SCHED_CAPACITY_SCALE / capacity_orig_of(cpu); set_cfs_cpu_capacity(cpu, true, req_cap); @@ -4996,7 +4996,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) if (rq->cfs.nr_running) update_capacity_of(cpu_of(rq)); else if (sched_freq()) - set_cfs_cpu_capacity(cpu_of(rq), false, 0); + set_cfs_cpu_capacity(cpu_of(rq), false, 0); /* no normalization required for 0 */ } } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 552e6c551dc3..c1884b6f9993 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1667,6 +1667,10 @@ static inline bool sched_freq(void) return static_key_false(&__sched_freq); } +/* + * sched_capacity_reqs expects capacity requests to be normalised. + * All capacities should sum to the range of 0-1024. + */ DECLARE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs); void update_cpu_capacity_request(int cpu, bool request); From 460433155331ee728c5d1c2663ab9571569fdaef Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Tue, 8 Nov 2016 10:53:47 +0100 Subject: [PATCH 13/25] UPSTREAM: sched/fair: Fix task group initialization The moves of tasks are now propagated down to root and the utilization of cfs_rq reflects reality so it doesn't need to be estimated at init. Change-Id: I496f26d22cbbe38d78d683f7eeb389b997eb9035 Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Acked-by: Dietmar Eggemann Cc: Linus Torvalds Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: bsegall@google.com Cc: kernellwp@gmail.com Cc: pjt@google.com Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1478598827-32372-7-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar (cherry picked from commit d03266910a533d874c01ef2ca8dc73009f2925fa) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b2c5d0ced614..bd4bfdc7cd95 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10636,7 +10636,7 @@ void online_fair_sched_group(struct task_group *tg) se = tg->se[i]; raw_spin_lock_irq(&rq->lock); - post_init_entity_util_avg(se); + attach_entity_cfs_rq(se); sync_throttle(tg, i); raw_spin_unlock_irq(&rq->lock); } From 4b9300bf83c2490ca9d2bfbd467bbdbc8809fc94 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 3 Oct 2016 16:20:59 +0200 Subject: [PATCH 14/25] UPSTREAM: sched/core: Add missing update_rq_clock() in post_init_entity_util_avg() Address this rq-clock update bug: WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg() rq->clock_update_flags < RQCF_ACT_SKIP Call Trace: __warn() post_init_entity_util_avg() wake_up_new_task() _do_fork() kernel_thread() rest_init() start_kernel() Change-Id: I0d8b0c83b19447dc53ebd397bb14f756cb97d8a3 Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar (cherry picked from commit 4126bad6717336abe5d666440ae15555563ca53f) Signed-off-by: Quentin Perret --- kernel/sched/core.c | 1 + kernel/sched/fair.c | 1 + 2 files changed, 2 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fd10180d3072..dc4b5bd92916 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2618,6 +2618,7 @@ void wake_up_new_task(struct task_struct *p) __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0)); #endif rq = __task_rq_lock(p, &rf); + update_rq_clock(rq); post_init_entity_util_avg(&p->se); walt_mark_task_starting(p); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bd4bfdc7cd95..278378eaa3c9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10636,6 +10636,7 @@ void online_fair_sched_group(struct task_group *tg) se = tg->se[i]; raw_spin_lock_irq(&rq->lock); + update_rq_clock(rq); attach_entity_cfs_rq(se); sync_throttle(tg, i); raw_spin_unlock_irq(&rq->lock); From 7c4e0f0832a348bc134eda07638f85b6d4f6b143 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 3 Oct 2016 16:28:37 +0200 Subject: [PATCH 15/25] UPSTREAM: sched/core: Add missing update_rq_clock() in detach_task_cfs_rq() Instead of adding the update_rq_clock() all the way at the bottom of the callstack, add one at the top, this to aid later effort to minimize update_rq_lock() calls. WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq() rq->clock_update_flags < RQCF_ACT_SKIP Call Trace: dump_stack() __warn() warn_slowpath_fmt() detach_task_cfs_rq() switched_from_fair() __sched_setscheduler() _sched_setscheduler() sched_set_stop_task() cpu_stop_create() __smpboot_create_thread.part.2() smpboot_register_percpu_thread_cpumask() cpu_stop_init() do_one_initcall() ? print_cpu_info() kernel_init_freeable() ? rest_init() kernel_init() ret_from_fork() Change-Id: Iee08c2ed3303ae8f0c527658f13646b02a412cad Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar (cherry picked from commit 80f5c1b84baa8180c3c27b7e227429712cd967b6) Signed-off-by: Quentin Perret --- kernel/sched/core.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dc4b5bd92916..9459ddc94952 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3824,6 +3824,7 @@ void rt_mutex_setprio(struct task_struct *p, int prio) BUG_ON(prio > MAX_PRIO); rq = __task_rq_lock(p, &rf); + update_rq_clock(rq); /* * Idle task boosting is a nono in general. There is one @@ -4352,6 +4353,7 @@ recheck: * runqueue lock must be held. */ rq = task_rq_lock(p, &rf); + update_rq_clock(rq); /* * Changing the policy of the stop threads its a very bad idea @@ -8702,6 +8704,7 @@ static void cpu_cgroup_fork(struct task_struct *task) rq = task_rq_lock(task, &rf); + update_rq_clock(rq); sched_change_group(task, TASK_SET_GROUP); task_rq_unlock(rq, task, &rf); From 0aed57e79e0843ba3e8036ddb47746793f82bbbc Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 3 Oct 2016 16:35:32 +0200 Subject: [PATCH 16/25] UPSTREAM: sched/core: Add missing update_rq_clock() call for task_hot() Add the update_rq_clock() call at the top of the callstack instead of at the bottom where we find it missing, this to aid later effort to minimize the number of update_rq_lock() calls. WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated() rq->clock_update_flags < RQCF_ACT_SKIP Call Trace: dump_stack() __warn() warn_slowpath_fmt() assert_clock_updated.isra.63.part.64() can_migrate_task() load_balance() pick_next_task_fair() __schedule() schedule() worker_thread() kthread() Change-Id: I17716141789f9fac709495951cd2e079cf49d6d8 Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar (cherry picked from commit 3bed5e2166a5e433bf62162f3cd3c5174d335934) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 278378eaa3c9..1f834acc9715 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9416,6 +9416,7 @@ redo: more_balance: raw_spin_lock_irqsave(&busiest->lock, flags); + update_rq_clock(busiest); /* * cur_ld_moved - load moved in current iteration @@ -9808,6 +9809,7 @@ static int active_load_balance_cpu_stop(void *data) }; schedstat_inc(sd->alb_count); + update_rq_clock(busiest_rq); p = detach_one_task(&env); if (p) { From 350e127daec15015f989e77bcb5dae855d889b38 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 3 Oct 2016 16:44:25 +0200 Subject: [PATCH 17/25] UPSTREAM: sched/core: Add missing update_rq_clock() call in set_user_nice() Address this rq-clock update bug: WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity() rq->clock_update_flags < RQCF_ACT_SKIP Call Trace: dump_stack() __warn() warn_slowpath_fmt() set_next_entity() ? _raw_spin_lock() set_curr_task_fair() set_user_nice.part.85() set_user_nice() create_worker() worker_thread() kthread() ret_from_fork() Change-Id: I8fb2653b2d9cb3bbc1637c1bcbf1c0645752ef12 Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Mike Galbraith Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar (cherry picked from commit 2fb8d36787affe26f3536c3d8ec094995a48037d) Signed-off-by: Quentin Perret --- kernel/sched/core.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9459ddc94952..b1939abf23b4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3921,6 +3921,8 @@ void set_user_nice(struct task_struct *p, long nice) * the task might be in the middle of scheduling on another CPU. */ rq = task_rq_lock(p, &rf); + update_rq_clock(rq); + /* * The RT priorities are set via sched_setscheduler(), but we still * allow the 'normal' nice value to be set - but as expected From 32ea7750821e24b1452c28fd55ac4c70a32e7886 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Thu, 5 Oct 2017 11:55:51 +0100 Subject: [PATCH 18/25] UPSTREAM: sched/fair: Sync task util before slow-path wakeup We use task_util() in find_idlest_group() via capacity_spare_wake(). This task_util() updated in wake_cap(). However wake_cap() is not the only reason for ending up in find_idlest_group() - we could have been sent there by wake_wide(). So explicitly sync the task util with prev_cpu when we are about to head to find_idlest_group(). We could simply do this at the beginning of select_task_rq_fair() (i.e. irrespective of whether we're heading to select_idle_sibling() or find_idlest_group() & co), but I didn't want to slow down the select_idle_sibling() path more than necessary. Don't do this during fork balancing, we won't need the task_util and we'd just clobber the last_update_time, which is supposed to be 0. Change-Id: I56113d8d67cf338f3fb1a692422289cf659399a3 Signed-off-by: Brendan Jackman Signed-off-by: Peter Zijlstra (Intel) Cc: Andres Oportus Cc: Dietmar Eggemann Cc: Joel Fernandes Cc: Josef Bacik Cc: Linus Torvalds Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vincent Guittot Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com Signed-off-by: Ingo Molnar (cherry picked from commit ea16f0ea6c3d in tip:sched/core) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1f834acc9715..d9edbda8210b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7143,6 +7143,15 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f new_cpu = cpu; } + if (sd && !(sd_flag & SD_BALANCE_FORK)) { + /* + * We're going to need the task's util for capacity_spare_wake + * in find_idlest_group. Sync it up to prev_cpu's + * last_update_time. + */ + sync_entity_load_avg(&p->se); + } + if (!sd) { if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); From 6abf18bda9e01b4c07b427f6d369f256e6871dcb Mon Sep 17 00:00:00 2001 From: Patrick Bellasi Date: Thu, 7 Sep 2017 12:24:45 +0100 Subject: [PATCH 19/25] sched/fair: trace energy_diff for non boosted tasks In systems where SchedTune is enabled, we do not report energy diff for non boosted tasks. Let's fix this by always genereting an energy_diff event where however: nrg.delta = 0, since we skip energy normalization payoff = nrg.diff, since the payoff is defined just by the energy difference Change-Id: I9a11ec19b6f56da04147f5ae5b47daf1dd180445 Signed-off-by: Patrick Bellasi Signed-off-by: Chris Redpath (cherry picked from commit 13e2e3c7f7d09f619444631a0962fd8020660b8d) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d9edbda8210b..aa3d7bdecb5b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5890,8 +5890,14 @@ energy_diff(struct energy_env *eenv) __energy_diff(eenv); /* Return energy diff when boost margin is 0 */ - if (boost == 0) + if (boost == 0) { + trace_sched_energy_diff(eenv->task, + eenv->src_cpu, eenv->dst_cpu, eenv->util_delta, + eenv->nrg.before, eenv->nrg.after, eenv->nrg.diff, + eenv->cap.before, eenv->cap.after, eenv->cap.delta, + 0, -eenv->nrg.diff); return eenv->nrg.diff; + } /* Compute normalized energy diff */ nrg_delta = normalize_energy(eenv->nrg.diff); From 78ff98b3aa5dd2a501b07bbdf40251670d23e439 Mon Sep 17 00:00:00 2001 From: Patrick Bellasi Date: Thu, 7 Sep 2017 12:27:56 +0100 Subject: [PATCH 20/25] sched/fair: ignore backup CPU when not valid The find_best_target can sometimes not return a valid backup CPU, either because it cannot find one or just becasue it returns prev_cpu as a backup. In these cases we should skip the energy_diff evaluation for the backup CPU. Change-Id: I3787dbdfe74122348dd7a7485b88c4679051bd32 Signed-off-by: Patrick Bellasi Signed-off-by: Chris Redpath (cherry picked from commit d9bcf5b88594a0225b51878236e49305f272eadc) [trivial cherry-pick issue] Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index aa3d7bdecb5b..d9a2a0208773 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7071,7 +7071,9 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync /* No energy saving for target_cpu, try backup */ target_cpu = tmp_backup; eenv.dst_cpu = target_cpu; - if (tmp_backup < 0 || energy_diff(&eenv) >= 0) { + if (tmp_backup < 0 || + tmp_backup == prev_cpu || + energy_diff(&eenv) >= 0) { schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav); schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav); target_cpu = prev_cpu; From 1e58674375467499d40721d5acddd9063fd0fe2a Mon Sep 17 00:00:00 2001 From: Patrick Bellasi Date: Mon, 17 Jul 2017 15:54:39 +0100 Subject: [PATCH 21/25] sched/fair: enforce EAS mode For non latency sensitive tasks the goal is to optimize for energy efficiency. Thus, we should try our best to avoid moving a task on a CPU which is then going to be marked as overutilized. Let's use the capacity_margin metric to verify if a candidate target CPU should be considered without risking to bail out of EAS mode. Change-Id: Ib3697106f4073aedf4a6c6ce42bd5d000fa8c007 Signed-off-by: Patrick Bellasi Signed-off-by: Chris Redpath (cherry picked from commit f95753da4ba7e1c1eee0500ffe41b3e5fa68b347) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d9a2a0208773..8caba42b7039 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6862,6 +6862,19 @@ static inline int find_best_target(struct task_struct *p, int *backup_cpu, continue; } + /* + * Enforce EAS mode + * + * For non latency sensitive tasks, skip CPUs that + * will be overutilized by moving the task there. + * + * The goal here is to remain in EAS mode as long as + * possible at least for !prefer_idle tasks. + */ + if ((new_util * capacity_margin) > + (capacity_orig * SCHED_CAPACITY_SCALE)) + continue; + /* * Case B) Non latency sensitive tasks on IDLE CPUs. * From 4530ed9a46da249bda779ca9846da73ca6e1ed70 Mon Sep 17 00:00:00 2001 From: Chris Redpath Date: Tue, 12 Sep 2017 14:44:24 +0100 Subject: [PATCH 22/25] sched/fair: consider task utilization in group_norm_util() The group_norm_util() function is used to compute the normalized utilization of a SG given a certain energy_env configuration. The main client of this function is the energy_diff function when it comes to compute the SG energy for one of the before/after scheduling candidates. Currently, the energy_diff function sets util_delta = 0 when it wants to compute the energy corresponding to the scheduling candidate where the task runs in the previous CPU. This implies that, for the task waking up in the previous CPU we consider only its blocked load tracked by the CPU RQ. However, in case of a medium-big task which is waking up on a long time idle CPU, this blocked load can be already completely decayed. More in general, the current approach is biased towards under-estimating the energy consumption for the "before" scheduling candidate. This patch fixes this by: - always use the cpu_util_wake() to properly get the utilization of a CPU without any (partially decayed) contribution of the waking up task - adding the task utilization to the cpu_util_wake just for the target cpu The "target CPU" is defined by the energy_env to be either the src_cpu or the dst_cpu, depending on which scheduling candidate we are considering. This patch update also the definition of __cpu_norm_util(), which is currently called just by the group_norm_util() function. This allows to simplify the code by using this function just to normalize a specified utilization with respect to a given capacity. This update allows to completely remove any dependency of group_norm_util() from calc_util_delta(). Change-Id: I3b6ec50ce8decb1521faae660e326ab3319d3c82 Signed-off-by: Patrick Bellasi Signed-off-by: Chris Redpath (cherry picked from commit ef34ea830347ca175ba8a1baf05357ed98d5728c) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 58 ++++++++++++++++++++++++++++----------------- 1 file changed, 36 insertions(+), 22 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8caba42b7039..69819ca616d8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5481,6 +5481,7 @@ struct energy_env { int util_delta; int src_cpu; int dst_cpu; + int trg_cpu; int energy; int payoff; struct task_struct *task; @@ -5497,11 +5498,14 @@ struct energy_env { } cap; }; +static int cpu_util_wake(int cpu, struct task_struct *p); + /* * __cpu_norm_util() returns the cpu util relative to a specific capacity, - * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for - * energy calculations. Using the scale-invariant util returned by - * cpu_util() and approximating scale-invariant util by: + * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE], which is useful for + * energy calculations. + * + * Since util is a scale-invariant utilization defined as: * * util ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time * @@ -5511,10 +5515,8 @@ struct energy_env { * * norm_util = running_time/time ~ util/capacity */ -static unsigned long __cpu_norm_util(int cpu, unsigned long capacity, int delta) +static unsigned long __cpu_norm_util(unsigned long util, unsigned long capacity) { - int util = __cpu_util(cpu, delta); - if (util >= capacity) return SCHED_CAPACITY_SCALE; @@ -5546,28 +5548,37 @@ unsigned long group_max_util(struct energy_env *eenv) /* * group_norm_util() returns the approximated group util relative to it's - * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in - * energy calculations. Since task executions may or may not overlap in time in - * the group the true normalized util is between max(cpu_norm_util(i)) and - * sum(cpu_norm_util(i)) when iterating over all cpus in the group, i. The - * latter is used as the estimate as it leads to a more pessimistic energy + * current capacity (busy ratio), in the range [0..SCHED_LOAD_SCALE], for use + * in energy calculations. + * + * Since task executions may or may not overlap in time in the group the true + * normalized util is between MAX(cpu_norm_util(i)) and SUM(cpu_norm_util(i)) + * when iterating over all CPUs in the group. + * The latter estimate is used as it leads to a more pessimistic energy * estimate (more busy). */ static unsigned long group_norm_util(struct energy_env *eenv, struct sched_group *sg) { - int i, delta; - unsigned long util_sum = 0; unsigned long capacity = sg->sge->cap_states[eenv->cap_idx].cap; + unsigned long util, util_sum = 0; + int cpu; - for_each_cpu(i, sched_group_cpus(sg)) { - delta = calc_util_delta(eenv, i); - util_sum += __cpu_norm_util(i, capacity, delta); + for_each_cpu(cpu, sched_group_cpus(sg)) { + util = cpu_util_wake(cpu, eenv->task); + + /* + * If we are looking at the target CPU specified by the eenv, + * then we should add the (estimated) utilization of the task + * assuming we will wake it up on that CPU. + */ + if (unlikely(cpu == eenv->trg_cpu)) + util += eenv->util_delta; + + util_sum += __cpu_norm_util(util, capacity); } - if (util_sum > SCHED_CAPACITY_SCALE) - return SCHED_CAPACITY_SCALE; - return util_sum; + return min_t(unsigned long, util_sum, SCHED_CAPACITY_SCALE); } static int find_new_capacity(struct energy_env *eenv, @@ -5763,6 +5774,8 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu) return cpu != -1 && cpumask_test_cpu(cpu, sched_group_cpus(sg)); } +static inline unsigned long task_util(struct task_struct *p); + /* * energy_diff(): Estimate the energy impact of changing the utilization * distribution. eenv specifies the change: utilisation amount, source, and @@ -5778,11 +5791,13 @@ static inline int __energy_diff(struct energy_env *eenv) int diff, margin; struct energy_env eenv_before = { - .util_delta = 0, + .util_delta = task_util(eenv->task), .src_cpu = eenv->src_cpu, .dst_cpu = eenv->dst_cpu, + .trg_cpu = eenv->src_cpu, .nrg = { 0, 0, 0, 0}, .cap = { 0, 0, 0 }, + .task = eenv->task, }; if (eenv->src_cpu == eenv->dst_cpu) @@ -6160,8 +6175,6 @@ boosted_task_util(struct task_struct *task) return util + margin; } -static int cpu_util_wake(int cpu, struct task_struct *p); - static unsigned long capacity_spare_wake(int cpu, struct task_struct *p) { return capacity_orig_of(cpu) - cpu_util_wake(cpu, p); @@ -7071,6 +7084,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync .src_cpu = prev_cpu, .dst_cpu = target_cpu, .task = p, + .trg_cpu = target_cpu, }; /* Not enough spare capacity on previous cpu */ From 06d637c9f9917874f82ede3970903f114c12c672 Mon Sep 17 00:00:00 2001 From: Patrick Bellasi Date: Thu, 1 Jun 2017 16:40:22 +0100 Subject: [PATCH 23/25] sched/fair: consider task utilization in group_max_util() The group_max_util() function is used to compute the maximum utilization across the CPUs of a certain energy_env configuration. Its main client is the energy_diff function when it needs to compute the SG capacity for one of the before/after scheduling candidates. Currently, the energy_diff function sets util_delta = 0 when it wants to compute the energy corresponding to the scheduling candidate where the task runs in the previous CPU. This implies that, for the task waking up in the previous CPU we consider only its blocked load tracked by the CPU RQ. However, in case of a medium-big task which is waking up on a long time idle CPU, this blocked load can be already completely decayed. More in general, the current approach is biased towards under-estimating the capacity requirements for the "before" scheduling candidate. This patch fixes this by: - always use the cpu_util_wake() to properly get the utilization of a CPU without any (partially decayed) contribution of the waking up task - adding the task utilization to the cpu_util_wake just for the target cpu The "target CPU" is defined by the energy_env to be either the src_cpu or the dst_cpu, depending on which scheduling candidate we are considering. Finally, since this update removes the last usage of calc_util_delta() this function is now safely removed. Change-Id: I20ee1bcf40cee6bf6e265fb2d32ef79061ad6ced Signed-off-by: Patrick Bellasi Signed-off-by: Chris Redpath (cherry picked from commit 52d70152fade678e304683bd1a5842af95e83558) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 69819ca616d8..4cb7b1ec437e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5523,24 +5523,24 @@ static unsigned long __cpu_norm_util(unsigned long util, unsigned long capacity) return (util << SCHED_CAPACITY_SHIFT)/capacity; } -static int calc_util_delta(struct energy_env *eenv, int cpu) +static unsigned long group_max_util(struct energy_env *eenv) { - if (cpu == eenv->src_cpu) - return -eenv->util_delta; - if (cpu == eenv->dst_cpu) - return eenv->util_delta; - return 0; -} - -static -unsigned long group_max_util(struct energy_env *eenv) -{ - int i, delta; unsigned long max_util = 0; + unsigned long util; + int cpu; - for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) { - delta = calc_util_delta(eenv, i); - max_util = max(max_util, __cpu_util(i, delta)); + for_each_cpu(cpu, sched_group_cpus(eenv->sg_cap)) { + util = cpu_util_wake(cpu, eenv->task); + + /* + * If we are looking at the target CPU specified by the eenv, + * then we should add the (estimated) utilization of the task + * assuming we will wake it up on that CPU. + */ + if (unlikely(cpu == eenv->trg_cpu)) + util += eenv->util_delta; + + max_util = max(max_util, util); } return max_util; From ece6d3b76e64eb7a98ae23b62961f9f4fe344b37 Mon Sep 17 00:00:00 2001 From: Ke Wang Date: Mon, 30 Oct 2017 17:38:16 +0800 Subject: [PATCH 24/25] sched: EAS: update trg_cpu to backup_cpu if no energy saving for target_cpu If no energy saving for target_cpu in the calculation of energy_diff(), backup_cpu will be set as the new dst_cpu for the next calculation. At this point, we also need update the new trg_cpu as backup_cpu to make sure the subsequent calculation of energy_diff() is correct. Change-Id: If3b35b6dc54865f1cb4b1603134102d4422227d5 Signed-off-by: Ke Wang (cherry picked from commit b1923e22f4eca3e537e015d6ea3dce1187edea37) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4cb7b1ec437e..9ca2153450b7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7098,6 +7098,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync /* No energy saving for target_cpu, try backup */ target_cpu = tmp_backup; eenv.dst_cpu = target_cpu; + eenv.trg_cpu = target_cpu; if (tmp_backup < 0 || tmp_backup == prev_cpu || energy_diff(&eenv) >= 0) { From c409b2024056bfba68ab8653dcc24d7a4e448cc1 Mon Sep 17 00:00:00 2001 From: Ke Wang Date: Wed, 1 Nov 2017 14:11:06 +0800 Subject: [PATCH 25/25] sched: EAS: Fix the condition to distinguish energy before/after Before commit 5f8b3a757d65 ("sched/fair: consider task utilization in group_norm_util()"), eenv->util_delta is used to distinguish energy before and energy after in sched_group_energy(). After that commit, eenv->util_delta can not do that any more. In this commit, use trg_cpu to distinguish energy before/after in sched_group_energy(). Before apply this commit, cap_before/cap_delta is not correct: -0 [001] 147504.608920: sched_energy_diff: pid=7 comm=rcu_preempt src_cpu=1 dst_cpu=3 usage_delta=7 nrg_before=250 nrg_after=250 nrg_diff=0 cap_before=0 cap_after=528 cap_delta=1056 nrg_delta=0 nrg_payoff=0 After apply this commit, cap_before/cap_delta retrun to normal: -0 [001] 220.494011: sched_energy_diff: pid=7 comm=rcu_preempt src_cpu=1 dst_cpu=2 usage_delta=3 nrg_before=248 nrg_after=248 nrg_diff=0 cap_before=528 cap_after=528 cap_delta=0 nrg_delta=0 nrg_payoff=0 Change-Id: I7b5f7ccce56e93af7ea4e87d8e0ea6e2405f9c27 Signed-off-by: Ke Wang (cherry picked from commit 0da783a605cd20d5a37c2a840e8a1fa641c09768) Signed-off-by: Quentin Perret --- kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9ca2153450b7..78eef4f12be9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5720,13 +5720,13 @@ static int sched_group_energy(struct energy_env *eenv) if (sg->group_weight == 1) { /* Remove capacity of src CPU (before task move) */ - if (eenv->util_delta == 0 && + if (eenv->trg_cpu == eenv->src_cpu && cpumask_test_cpu(eenv->src_cpu, sched_group_cpus(sg))) { eenv->cap.before = sg->sge->cap_states[cap_idx].cap; eenv->cap.delta -= eenv->cap.before; } /* Add capacity of dst CPU (after task move) */ - if (eenv->util_delta != 0 && + if (eenv->trg_cpu == eenv->dst_cpu && cpumask_test_cpu(eenv->dst_cpu, sched_group_cpus(sg))) { eenv->cap.after = sg->sge->cap_states[cap_idx].cap; eenv->cap.delta += eenv->cap.after;