Commit Graph

783782 Commits

Author SHA1 Message Date
Dietmar Eggemann
9ac9491eb2 ANDROID: arm64: enable max frequency capping
Defines arch_scale_max_freq_capacity() to use the topology driver
scale function.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: If7565747ec862e42ac55196240522ef8d22ca67d
2018-10-26 12:25:29 +01:00
Dietmar Eggemann
c4ed8aa44d ANDROID: implement max frequency capping
Implements the Max Frequency Capping Engine (MFCE) getter function
topology_get_max_freq_scale() to provide the scheduler with a
maximum frequency scaling correction factor for more accurate cpu
capacity handling by being able to deal with max frequency capping.

This scaling factor describes the influence of running a cpu with a
current maximum frequency (policy) lower than the maximum possible
frequency (cpuinfo).

The factor is:

  policy_max_freq(cpu) << SCHED_CAPACITY_SHIFT / cpuinfo_max_freq(cpu)

It also implements the MFCE setter function arch_set_max_freq_scale()
which is called from cpufreq_set_policy().

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[Trivial cherry-pick issue in cpufreq.c]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I59e52861ee260755ab0518fe1f7183a2e4e3d0fc
2018-10-26 12:25:24 +01:00
Dietmar Eggemann
6c401bf9b0 ANDROID: sched/fair: add arch scaling function for max frequency capping
To be able to scale the cpu capacity by this factor introduce a call to
the new arch scaling function arch_scale_max_freq_capacity() in
update_cpu_capacity() and provide a default implementation which returns
SCHED_CAPACITY_SCALE.

Another subsystem (e.g. cpufreq) or architectural or platform specific
code can overwrite this default implementation, exactly as for frequency
and cpu invariance. It has to be enabled by the arch by defining
arch_scale_max_freq_capacity to the actual implementation.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
( Fixed conflict with scaling against the PELT-based scale_rt_capacity )
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I770a8b1f4f7340e9e314f71c64a765bf880f4b4d
2018-10-26 12:25:19 +01:00
Quentin Perret
c06013206c ANDROID: sched/fair: Bypass energy computation for prefer_idle tasks
If the only pre-selected candidate CPU in find_energy_efficient_cpu()
happens to be prev_cpu, there is not point in computing the system
energy since we have nothing to compare it against, so we currently bail
out early. The same logic can be extended when prefer_idle tasks are
routed in the energy-aware wake-up path: if the only candidate is idle
for a prefer_idle task, just select it no matter what the energy impact
is. That should help speeding-up wake-ups of prefer_idle tasks, at least
when find_best_target() is used for them.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Idd0e387e4a766061cc05d2584df3a31e4dabfd09
2018-10-26 12:25:00 +01:00
Chris Redpath
1d00c33d8b ANDROID: sched: fair: Bypass energy-aware wakeup for prefer-idle tasks
Use the upstream slow path to find an idle cpu for prefer-idle tasks.
This slow-path is actually faster than the EAS path we are currently
going through (compute_energy()) which is really slow.

No performance degradation is seen with this and it reduces the delta
quite a bit between upstream and out of tree code.

It's not clear yet if using the mainline slow path task placement when
a task has the schedtune attribute prefer_idle=1 is the right thing to
do for products. Put the option to disable this behind a sched feature
so we can try out both options.

Signed-off-by: Joel Fernandes <joelaf@google.com>
(refactored for 4.14 version)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit c0ff131c88f68e4985793663144b6f9cf77be9d3)
[ - Refactored for 4.17 version
  - Adjusted the commit header to the new function names ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Icf762a101c92c0e3f9e61df0370247fa15455581
2018-10-26 12:20:50 +01:00
Brendan Jackman
88d05f9257 FROMLIST: sched/fair: Use wake_q length as a hint for wake_wide
This patch adds a parameter to select_task_rq, sibling_count_hint
allowing the caller, where it has this information, to inform the
sched_class the number of tasks that are being woken up as part of
the same event.

The wake_q mechanism is one case where this information is available.

select_task_rq_fair can then use the information to detect that it
needs to widen the search space for task placement in order to avoid
overloading the last-level cache domain's CPUs.

                               * * *

The reason I am investigating this change is the following use case
on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
all repeatedly do X amount of work then
pthread_barrier_wait (i.e. sleep until the last task finishes its X
and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
finish faster, and then those CPUs pull over the tasks that are still
running:

     v CPU v           ->time->

                    -------------
   0  (big)         11111  /333
                    -------------
   1  (big)         22222   /444|
                    -------------
   2  (LITTLE)      333333/
                    -------------
   3  (LITTLE)      444444/
                    -------------

Now when task 4 hits the barrier (at |) and wakes the others up,
there are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. want_affine therefore means that we'll only look
in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
on the bigs until the next load balance, something like this:

     v CPU v           ->time->

                    ------------------------
   0  (big)         11111  /333  31313\33333
                    ------------------------
   1  (big)         22222   /444|424\4444444
                    ------------------------
   2  (LITTLE)      333333/          \222222
                    ------------------------
   3  (LITTLE)      444444/            \1111
                    ------------------------
                                 ^^^
                           underutilization

So, I'm trying to get want_affine = 0 for these tasks.

I don't _think_ any incarnation of the wakee_flips mechanism can help
us here because which task is waker and which tasks are wakees
generally changes with each iteration.

However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
nice property that we know exactly how many tasks are being woken, so
we can cheat.

It might be a disadvantage that we "widen" _every_ task that's woken in
an event, while select_idle_sibling would work fine for the first
sd_llc_size - 1 tasks.

IIUC, if wake_affine() behaves correctly this trick wouldn't be
necessary on SMP systems, so it might be best guarded by the presence
of SD_ASYM_CPUCAPACITY?

                               * * *

Final note..

In order to observe "perfect" behaviour for this use case, I also had
to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
above we are working through the work queue and have placed tasks 3
and 2, and are about to place task 1:

     v CPU v           ->time->

                    --------------
   0  (big)         11111  /333  3
                    --------------
   1  (big)         22222   /444|4
                    --------------
   2  (LITTLE)      333333/      2
                    --------------
   3  (LITTLE)      444444/          <- Task 1 should go here
                    --------------

If TTWU_QUEUE is enabled, we will not yet have enqueued task
2 (having instead sent a reschedule IPI) or attached its load to CPU
2. So we are likely to also place task 1 on cpu 2. Disabling
TTWU_QUEUE means that we enqueue task 2 before placing task 1,
solving this issue. TTWU_QUEUE is there to minimise rq lock
contention, and I guess that this contention is less of an issue on
big.LITTLE systems since they have relatively few CPUs, which
suggests the trade-off makes sense here.

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
( - Applied from https://patchwork.kernel.org/patch/9895261/
  - Fixed trivial conflict in kernel/sched/core.c
  - Fixed select_task_rq_idle, now in kernel/sched/idle.c
  - Fixed trivial conflict in select_task_rq_fair )
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I3cfc4bf48c3d7feef969db4d22449f4fbb4f795d
2018-10-26 12:15:52 +01:00
Chris Redpath
af362094df ANDROID: sched: Unconditionally honor sync flag for energy-aware wakeups
Since we don't do energy-aware wakeups when we are overutilized, always
honoring sync wakeups in this state does not prevent wake-wide mechanics
overruling the flag as normal.

This patch is based upon previous work to build EAS for android products.

sync-hint code taken from commit 4a5e890ec6
"sched/fair: add tunable to force selection at cpu granularity" written
by Juri Lelli <juri.lelli@arm.com>

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry-picked from commit f1ec666a62dec1083ed52fe1ddef093b84373aaf)
[ Moved the feature to find_energy_efficient_cpu() ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I4b3d79141fc8e53dc51cd63ac11096c2e3cb10f5
2018-10-26 12:14:32 +01:00
Chris Redpath
c27c56105d ANDROID: Add find_best_target to minimise energy calculation overhead
find_best_target started life as an optimised energy cpu selection
algorithm designed to be efficient and high performance and integrate
well with schedtune and userspace cpuset configuration. It has had many
many small tweaks over time, and the version added here is forward
ported from the version used in android-4.4 and android-4.9. The linkage
to the rest of EAS is slightly different, however.

This version is split into the three main use-cases and addresses
them in priority order:
 A) latency sensitive tasks
 B) non latency sensitive tasks on IDLE CPUs
 C) non latency sensitive tasks on ACTIVE CPUs

Case A) Latency sensitive tasks

Unconditionally favoring tasks that prefer idle CPU to improve latency.

When we do enter here, we are looking for:
 - an idle CPU, whatever its idle_state is, since the first CPUs we
   explore are more likely to be reserved for latency sensitive tasks.
 - a non idle CPU where the task fits in its current capacity and has
   the maximum spare capacity.
 - a non idle CPU with lower contention from other tasks and running at
   the lowest possible OPP.

The last two goals try to favor a non idle CPU where the task can run
as if it is "almost alone". A maximum spare capacity CPU is favored
since the task already fits into that CPU's capacity without waiting for
an OPP change.

For any case other than case A, we avoid CPUs which would become
overutilized if we placed the task there.

Case B) Non latency sensitive tasks on IDLE CPUs.

Find an optimal backup IDLE CPU for non latency sensitive tasks.

Here we are looking for:
 - minimizing the capacity_orig,
   i.e. preferring LITTLE CPUs

If IDLE cpus are available, we prefer to choose one in order to spread
tasks and improve performance.

Case C) Non latency sensitive tasks on ACTIVE CPUs.

Pack tasks in the most energy efficient capacities.

This task packing strategy prefers more energy efficient CPUs (i.e.
pack on smaller maximum capacity CPUs) while also trying to spread
tasks to run them all at the lower OPP.

This assumes for example that it's more energy efficient to run two tasks
on two CPUs at a lower OPP than packing both on a single CPU but running
that CPU at an higher OPP.

This code has had many other contributors over the development listed
here as Cc.

Cc: Ke Wang <ke.wang@spreadtrum.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Srinath Sridharan <srinathsr@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit f240e44406558b17ff7765f252b0bcdcbc15126f)
[ - Removed tracepoints
  - Took capacity_curr_of() from "7f6fb82 ANDROID: sched: EAS: take
    cstate into account when selecting idle core"
  - Re-use sd_ea from find_energy_efficient_cpu() / removed start_cpu()
  - Mark candidates with a cpumask given by feec()
  - Squashed Ionela's tri-gear fbt fixes from android-4.14
  - Squashed Patrick's changes related to util_est from android-4.14 ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I9500d308c879dd53b08adeda8f988238e39cc392
2018-10-26 12:12:05 +01:00
Quentin Perret
9378697a9f ANDROID: sched/fair: Factor out CPU selection from find_energy_efficient_cpu
find_energy_efficient_cpu() is composed of two steps; we first look for
the CPU with the max spare capacity in each frequency domain, and then
the impact on energy of each candidate is estimated. In order to make it
easier to implement other CPU selection policies, let's factor the
candidate selection algorithm out of find_energy_efficient_cpu(), and
mark the candidates in a mask. This should result in no functional
difference.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I85a28880f01fcd11d7af28f9fbc1fe0cf4f197cf
2018-10-26 12:06:24 +01:00
Quentin Perret
2e88529cf8 ANDROID: sched: Introduce sysctl_sched_cstate_aware
Introduce a new sysctl for this option, 'sched_cstate_aware'.
When this is enabled, the scheduler can make use of the idle state
indexes in order to break the tie between potential CPU candidates.

This patch is based on 7f6fb825d6bc ("ANDROID: sched: EAS: take cstate
into account when selecting idle core") from android-4.14. All the
credits goes to the authors.

Change-Id: Ia076cf32faff91e90905291fa6f7924dc3dd6458
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:58:00 +01:00
Morten Rasmussen
3e63fb2c91 ANDROID: sched, cpuidle: Track cpuidle state index in the scheduler
The idle-state of each cpu is currently pointed to by rq->idle_state but
there isn't any information in the struct cpuidle_state that can used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose is necessary to store the idle
state index as well. Ideally, the idle-state data should be unified.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: Ib3d1178512735b0e314881f73fb8ccff5a69319f
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit a732c97420e109956c20f34c70b91e6d06f5df31)
[ Fixed trivial cherry-pick conflict in kernel/sched/sched.h ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:57:55 +01:00
Patrick Bellasi
68dbff9ce9 ANDROID: sched: fair/tune: Add schedtune with cgroups interface
Schedtune is the framework we use in Android to allow userspace
task classification and provides a CGroup controller which has two
attributes per group.

 * schedtune.boost
 * schedtune.prefer_idle

Schedtune itself provides task and CPU utilization boosting. EAS in
the fair scheduler uses boosted utilization and prefer_idle status to
control the algorithm used for wakeup task placement.

Boosting:

The task utilization signal, which is derived from PELT signals and
properly scaled to be architecture and frequency invariant, is used by
EAS as an estimation of the task requirements in terms of CPU bandwidth.

Schedtune allows userspace to assign a percentage boost to each group
and this boost is used to calculate an additional utilization margin.
The margin added to the original utilization is:
 1. computed based on the "boosting strategy" in use
 2. proportional to boost value defined by the "taskgroup" value

The boosted signal is used by EAS for task placement, and boosted CPU
utilization (if boosted tasks are running) is given when schedutil
requests utilization.

Prefer_idle:

When this attribute is 1 for a group, this is used as a signal from
userspace that tasks in this group need to be serviced with the
minimum latency possible.

Previous versions of schedtune had much more functionality around
allowing a more tuneable tradeoff between performand and energy,
however this has not been used a lot up until now. If necessary,
we can easily resurrect it based upon old code.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry-picked from commit 159c14f0397790405b9e8435184366b31f2ed15b)
[ - Removed tracepoints (to be added in a separate patch)
  - Integrated boosted_cpu_util() with cpu_util_cfs()
  - Backported Patrick's util_est related fixes from android-4.14 ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: Ie2fd63d82f604f34bcbc7e1ca9b5af1bdcc037e0
2018-10-26 11:57:46 +01:00
Quentin Perret
99f3cc6e05 ANDROID: drivers: Introduce a legacy Energy Model loading driver
The Energy Aware Scheduler (EAS) used to rely on statically defined
Energy Models (EMs) in the device tree. Now that EAS uses the EM
framework, the old-style EMs are not usable by default.

To address this issue, introduce a driver able to read DT-based EMs and
to load them in the EM framework, hence making them available to EAS.
Since EAS now uses only the active costs of CPUs, the idle cost and
cluster cost of the old EM are ignored. The driver can be compiled in
using the CONFIG_LEGACY_ENERGY_MODEL_DT Kconfig option (off by default).

The implementation of the driver is highly inspired by the EM loading
code from android-4.14 and before (written by Robin Randhawa
<robin.randhawa@arm.com>), and the arch_topology driver (Juri Lelli
<juri.lelli@redhat.com>).

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I4f525dfb45113ba63f01aaf8e1e809ae6b34dd52
2018-10-26 11:54:46 +01:00
Quentin Perret
a0912a31bd ANDROID: cpufreq: scmi: Register an Energy Model
The Energy Model framework provides an API to register the active power
of CPUs. This commit calls this API from the scmi-cpufreq driver which uses
the power costs provided by the firmware.

Change-Id: I2e6036acbf004d41f921e1396983b07e022a5399
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:54:45 +01:00
Quentin Perret
b085f79aee UPSTREAM: firmware: arm_scmi: add a getter for power of performance states
The SCMI protocol can be used to get power estimates from firmware
corresponding to each performance state of a device. Although these power
costs are already managed by the SCMI firmware driver, they are not
exposed to any external subsystem yet.

Fix this by adding a new get_power() interface to the exisiting perf_ops
defined for the SCMI protocol.

Change-Id: Iae7b1b60c1955f6590764f3de459a32320eba448
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
(cherry picked from commit 1a63fe9a2b in
git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux.git)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:54:00 +01:00
Dietmar Eggemann
354f0b0c8d ANDROID: cpufreq: arm_big_little: Register an Energy Model
The Energy Model framework provides an API to register the active power
of CPUs. This commit calls this API from the scpi-cpufreq driver which
uses the power estimation helper from PM_OPP.

Todo: Check if driver can handle -EPROBE_DEFER and if the call to
dev_pm_opp_get_opp_count() id realy necessary.

Change-Id: Ia808262ef6c9f2cc7819a83e8eb2f602454edfa3
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:53:59 +01:00
Quentin Perret
61dcd89053 ANDROID: cpufreq: scpi: Register an Energy Model
The Energy Model framework provides an API to register the active power
of CPUs. This commit calls this API from the scpi-cpufreq driver which
uses the power estimation helper from PM_OPP.

Change-Id: I113fa2edf8201c1272c9fb5a0c6c39622ae53f94
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:53:59 +01:00
Quentin Perret
d326aa36e6 ANDROID: PM / OPP: cpufreq-dt: Move power estimation function
The cpufreq-dt driver has a function estimating the power of CPUs using
the dynamic-power-coefficient DT binding and the parameters of PM_OPP.
Since this function can be useful to other drivers, relocate it to
PM_OPP which is already a dependency anyway.

Change-Id: I34f8f9cd9433c622c82f23f32ae9968a096a4390
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:53:59 +01:00
Quentin Perret
23977789b5 FROMLIST: cpufreq: dt: Register an Energy Model
The Energy Model framework provides an API to register the active power
of CPUs. Call this API from the cpufreq-dt driver with an estimation
of the power as P = C * V^2 * f with C, V, and f respectively the
capacitance of the CPU and the voltage and frequency of the OPP.

The CPU capacitance is read from the "dynamic-power-coefficient" DT
binding (originally introduced for thermal/IPA), and the voltage and
frequency values from PM_OPP.

Change-Id: Id7b79ae3aafcc53574b850cb91a25240ebffbdd4
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
(Removed "OPTIONAL" label from title. Applied from
https://lore.kernel.org/lkml/20180912091309.7551-15-quentin.perret@arm.com/)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:53:58 +01:00
Dietmar Eggemann
6ed388bcd6 ANDROID: arm: dts: vexpress-v2p-ca15_a7: Add dynamic-power-coefficient properties
The values are computed from measuring energy over a 10 secs sysbench
workload running at each frequency and affine to 1 or 2 A15's as well as
1 or 2 or 3 A7's. The Power values for individual cpus are calculated
from the Energy values divided by workload runtime by taking the
difference of the energy values between n+1 and n cpus.

P [mW] = C * freq [Mhz] * voltage [mV] * voltage [mV] / 1000000000

C = P [mW] / (freq [Mhz] * voltage [mV] * voltage [mV]) * 1000000000

The actual C (dynamic-power-coefficient) value is the mean value out of
all the C values of the OPP's.

A15:
     freq       power  voltage  dyn_pwr_coef
     [MhZ]      [mW]      [mV]
0   500.0   534.57550      900   1319.939506
1   600.0   547.15468      900   1125.832675
2   700.0   572.22060      900   1009.207407
3   800.0   607.76592      900    937.910370
4   900.0   648.50552      900    889.582332
5  1000.0   693.86776      900    856.626864
6  1100.0   916.51314      975    876.469442
7  1200.0  1198.57566     1050    905.952880

                            mean: 990

A7:

     freq      power  voltage  dyn_pwr_coef
     [MhZ]      [mW]     [mV]
0   350.0   40.17430      900    141.708289
1   400.0   42.68700      900    131.750000
2   500.0   54.25716      900    133.968296
3   600.0   64.09914      900    131.891235
4   700.0   74.09736      900    130.683175
5   800.0   82.69694      900    127.618735
6   900.0  113.71386      975    132.911225
7  1000.0  144.94124     1050    131.465977

                           mean: 133

The ratio between A15 and A7 is 990/113 = 7.44

This value (7.44) is very close to mean ratio between the power value of
A15 an A7 of the per sched-domain Energy Model (7.96):

  mV  Mhz  MhZ old EM      ratio
       A7  A15 core power
 900  350  500 6997/1024   6.83
 900  400  600 5177/761    6.80
 900  500  700 3846/549    7.01
 900  600  800 3524/447    7.88
 900  700  900 3125/407    7.68
 900  800 1000 2756/334    8.25
 975  900 1100 2312/275    8.40
1050 1000 1200 2021/187   10.80

                    mean:  7.96

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I93ef375d05ff769481a07f2c74f061e307cb14d4
2018-10-26 11:53:41 +01:00
Dietmar Eggemann
9b249ed13d ANDROID: arm64: dts: juno-r2: Add dynamic-power-coefficient properties
Taken from commit cadf54148974 "arm64: dts: Add IPA parameters to soc
thermal zone" wich also sets up SoC thermal zones and bind them to
cpufreq cooling devices. We don't want this functionality right now.

The commit is for example part of:

git.linaro.org/landing-teams/working/arm/kernel-release.git
lt_arm/ack-4.9-armlt-18.01

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Id1a44fb7d222d59f7d44b5f55797d407513eb7e7
2018-10-26 11:53:41 +01:00
Dietmar Eggemann
e105a95917 ANDROID: arm64: dts: juno: Add dynamic-power-coefficient properties
Taken from commit cadf54148974 "arm64: dts: Add IPA parameters to soc
thermal zone" wich also sets up SoC thermal zones and bind them to
cpufreq cooling devices. We don't want this functionality right now.

The commit is for example part of:

git.linaro.org/landing-teams/working/arm/kernel-release.git
lt_arm/ack-4.9-armlt-18.01

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: I7c23a58fa49b281ed5df2f60db0514a9b3b50c7b
2018-10-26 11:53:41 +01:00
Dietmar Eggemann
a3f0c668be ANDROID: arm, arm64: Enable kernel config options required for EAS
arm and arm64:

  Add    Cgroups support
  Add    Energy Model
  Add    CpuFreq governors and make schedutil default

for arm:

  Add    Cpuset support
  Add    Scheduler autogroups
  Add    DIE sched domain level

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Ib52a0bd27702c3f2c3d692e49c9c8e2fbbea2cf7
2018-10-26 11:53:41 +01:00
Dietmar Eggemann
af31057250 ANDROID: sched: Enable idle balance to pull single task towards cpu with higher capacity
We do not want to miss out on the ability to pull a single remaining
task from a potential source cpu towards an idle destination cpu. Add an
extra criteria to need_active_balance() to kick off active load balance
if the source cpu is over-utilized and has lower capacity than the
destination cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Change-Id: Iea3b42b2a0f8d8a4252e42ba67cc33381a4a1075
2018-10-26 11:53:40 +01:00
Morten Rasmussen
97f2533053 ANDROID: sched: Prevent unnecessary active balance of single task in sched group
Scenarios with the busiest group having just one task and the local
being idle on topologies with sched groups with different numbers of
cpus manage to dodge all load-balance bailout conditions resulting the
nr_balance_failed counter to be incremented. This eventually causes a
pointless active migration of the task. This patch prevents this by not
incrementing the counter when the busiest group only has one task.
ASYM_PACKING migrations and migrations due to reduced capacity should
still take place as these are explicitly captured by
need_active_balance().

A better solution would be to not attempt the load-balance in the first
place, but that requires significant changes to the order of bailout
conditions and statistics gathering.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Change-Id: I28f69c72febe0211decbe77b7bc3e48839d3d7b3
2018-10-26 11:53:40 +01:00
Quentin Perret
54655c9f85 FROMLIST: sched/fair: Select an energy-efficient CPU on task wake-up
If an Energy Model (EM) is available and if the system isn't
overutilized, re-route waking tasks into an energy-aware placement
algorithm. The selection of an energy-efficient CPU for a task
is achieved by estimating the impact on system-level active energy
resulting from the placement of the task on the CPU with the highest
spare capacity in each performance domain. This strategy spreads tasks
in a performance domain and avoids overly aggressive task packing. The
best CPU energy-wise is then selected if it saves a large enough amount
of energy with respect to prev_cpu.

Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This is an attempt to do something useful as writing a fast heuristic
that performs reasonably well on a broad spectrum of architectures isn't
an easy task. As such, the scope of usability of the energy-aware
wake-up path is restricted to systems with the SD_ASYM_CPUCAPACITY flag
set, and where the EM isn't too complex.

Change-Id: I8c6384af904668f405319ed4e05054a7fa449192
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-15-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:12 +01:00
Quentin Perret
0d548be9f8 FROMLIST: sched/fair: Introduce an energy estimation helper function
In preparation for the definition of an energy-aware wakeup path,
introduce a helper function to estimate the consequence on system energy
when a specific task wakes-up on a specific CPU. compute_energy()
estimates the capacity state to be reached by all performance domains
and estimates the consumption of each online CPU according to its Energy
Model and its percentage of busy time.

Change-Id: Ia291deb16ec9a75f3c0252abbb5d864e3300562d
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-14-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:12 +01:00
Morten Rasmussen
4f37ec6ac1 FROMLIST: sched: Add over-utilization/tipping point indicator
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criteria for when we
make the switch.

The util_avg for each cpu converges towards 100% regardless of how many
additional tasks we may put on it. If we define over-utilized as:

sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own as we 1.5*n_cpus tasks in total and 55%+55% is less
over-utilized than 55%+60% for those cpus that have to be shared. The
system utilization is only 85% of the system capacity, but we are
breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency instead:

cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.

With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.

For systems where some cpus might have reduced capacity on some cpus
(RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
soon a just a single cpu is fully utilized as it might one of those with
reduced capacity and in that case we want to migrate it.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[ Added a comment explaining why new tasks are not accounted during
  overutilization detection ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-13-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: I19f816054adfd2dfa9a69fa92c1589f62794a218
2018-10-26 11:47:12 +01:00
Quentin Perret
e201c56b96 FROMLIST: sched/fair: Clean-up update_sg_lb_stats parameters
In preparation for the introduction of a new root domain flag which can
be set during load balance (the 'overutilized' flag), clean-up the set
of parameters passed to update_sg_lb_stats(). More specifically, the
'local_group' and 'local_idx' parameters can be removed since they can
easily be reconstructed from within the function.

While at it, transform the 'overload' parameter into a flag stored in
the 'sg_status' parameter hence facilitating the definition of new flags
when needed.

Change-Id: Ic2ccb51fdc08d7da0f8cc0442ef97cbcb4a52c86
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-12-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:12 +01:00
Quentin Perret
b78eec5fb7 FROMLIST: sched: Introduce a sysctl for Energy Aware Scheduling
In its current state, Energy Aware Scheduling (EAS) starts automatically
on asymmetric platforms having an Energy Model (EM). However, there are
users who want to have an EM (for thermal management for example), but
don't want EAS with it.

In order to let users disable EAS explicitly, introduce a new sysctl
called 'sched_energy_aware'. It is enabled by default so that EAS can
start automatically on platforms where it makes sense. Flipping it to 0
rebuilds the scheduling domains and disables EAS.

Change-Id: I55764e70bf5e90795d2269ec9135ae6e82794a2b
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-11-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:12 +01:00
Quentin Perret
78814285a9 FROMLIST: sched: Introduce sched_energy_present static key
In order to make sure Energy Aware Scheduling (EAS) will not impact
systems where no Energy Model is available, introduce a static key
guarding the access to EAS code. Since EAS is enabled on a
per-root-domain basis, the static key is enabled when at least one root
domain meets all conditions for EAS.

Change-Id: Ifa3e490e023d3f57b2f1b1272d5ea58d6ae726ab
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-10-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:11 +01:00
Quentin Perret
1e6b1214f1 FROMLIST: sched/topology: Make Energy Aware Scheduling depend on schedutil
Energy Aware Scheduling (EAS) is designed with the assumption that
frequencies of CPUs follow their utilization value. When using a CPUFreq
governor other than schedutil, the chances of this assumption being true
are small, if any. When schedutil is being used, EAS' predictions are at
least consistent with the frequency requests. Although those requests
have no guarantees to be honored by the hardware, they should at least
guide DVFS in the right direction and provide some hope in regards to the
EAS model being accurate.

To make sure EAS is only used in a sane configuration, create a strong
dependency on schedutil being used. Since having sugov compiled-in does
not provide that guarantee, make CPUFreq call a scheduler function on
governor changes hence letting it rebuild the scheduling domains, check
the governors of the online CPUs, and enable/disable EAS accordingly.

Change-Id: I872949134f97d2772fc681b7393eaed7f0e224f2
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-9-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:11 +01:00
Quentin Perret
7aeb1244ec FROMLIST: sched/topology: Disable EAS on inappropriate platforms
Energy Aware Scheduling (EAS) in its current form is most relevant on
platforms with asymmetric CPU topologies (e.g. Arm big.LITTLE) since
this is where there is a lot of potential for saving energy through
scheduling. This is particularly true since the Energy Model only
includes the active power costs of CPUs, hence not providing enough data
to compare packing-vs-spreading strategies.

As such, disable EAS on root domains where the SD_ASYM_CPUCAPACITY flag
is not set. While at it, disable EAS on systems where the complexity of
the Energy Model is too high since that could lead to unacceptable
scheduling overhead.

All in all, EAS can be used on a root domain if and only if:
  1. an Energy Model is available;
  2. the root domain has an asymmetric CPU capacity topology;
  3. the complexity of the root domain's EM is low enough to keep
     scheduling overheads low.

Change-Id: Ia557fbb226be44ed40d7d22661773326276bf9c8
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-8-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:11 +01:00
Quentin Perret
a4dffa6a32 FROMLIST: sched/topology: Lowest CPU asymmetry sched_domain level pointer
Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_asym_cpucapacity, points to the lowest level
at which the SD_ASYM_CPUCAPACITY flag is set. While at it, rename the
sd_asym shortcut to sd_asym_packing to avoid confusions.

Generally speaking, the largest opportunity to save energy via
scheduling comes from a smarter exploitation of heterogeneous platforms
(i.e. big.LITTLE). Consequently, the sd_asym_cpucapacity shortcut will
be used at first as the lowest domain where Energy-Aware Scheduling
(EAS) should be applied. For example, it is possible to apply EAS within
a socket on a multi-socket system, as long as each socket has an
asymmetric topology. Energy-aware cross-sockets wake-up balancing can
only happen if this_cpu and prev_cpu are in different sockets.

Change-Id: Ie777a1733991d40ce063b318e915199ba3c5416a
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-7-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:11 +01:00
Quentin Perret
2b73358bce FROMLIST: sched/topology: Reference the Energy Model of CPUs when available
The existing scheduling domain hierarchy is defined to map to the cache
topology of the system. However, Energy Aware Scheduling (EAS) requires
more knowledge about the platform, and specifically needs to know about
the span of Performance Domains (PD), which do not always align with
caches.

To address this issue, use the Energy Model (EM) of the system to extend
the scheduler topology code with a representation of the PDs, alongside
the scheduling domains. More specifically, a linked list of PDs is
attached to each root domain. When multiple root domains are in use,
each list contains only the PDs covering the CPUs of its root domain. If
a PD spans over CPUs of multiple different root domains, it will be
duplicated in all lists.

The lists are fully maintained by the scheduler from
partition_sched_domains() in order to cope with hotplug and cpuset
changes. As for scheduling domains, the list are protected by RCU to
ensure safe concurrent updates.

Change-Id: I27195ab35072210bdef91e78944d1407ff61f644
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-6-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:11 +01:00
Quentin Perret
b8c80a0f42 FROMLIST: PM / EM: Expose the Energy Model in sysfs
Expose the Energy Model (read-only) of all performance domains in sysfs
for convenience. To do so, add a kobject to the CPU subsystem under the
umbrella of which a kobject for each performance domain is attached.

The resulting hierarchy is as follows for a platform with two
performance domains for example:

   /sys/devices/system/cpu/energy_model
   ├── pd0
   │   ├── cost
   │   ├── cpus
   │   ├── frequency
   │   └── power
   └── pd4
       ├── cost
       ├── cpus
       ├── frequency
       └── power

In this implementation, the kobject abstraction is only used as a
convenient way of exposing data to sysfs. However, it could also be
used in the future to allocate and release performance domains in a more
dynamic way using reference counting.

Change-Id: Ia98bcae21c3578e385be9c6b030c9adff8210909
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-5-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:11 +01:00
Quentin Perret
9cbf2b68ca FROMLIST: PM: Introduce an Energy Model management framework
Several subsystems in the kernel (task scheduler and/or thermal at the
time of writing) can benefit from knowing about the energy consumed by
CPUs. Yet, this information can come from different sources (DT or
firmware for example), in different formats, hence making it hard to
exploit without a standard API.

As an attempt to address this, introduce a centralized Energy Model
(EM) management framework which aggregates the power values provided
by drivers into a table for each performance domain in the system. The
power cost tables are made available to interested clients (e.g. task
scheduler or thermal) via platform-agnostic APIs. The overall design
is represented by the diagram below (focused on Arm-related drivers as
an example, but applicable to any architecture):

     +---------------+  +-----------------+  +-------------+
     | Thermal (IPA) |  | Scheduler (EAS) |  |    Other    |
     +---------------+  +-----------------+  +-------------+
             |                   | em_pd_energy()   |
             |                   | em_cpu_get()     |
             +-----------+       |         +--------+
                         |       |         |
                         v       v         v
                      +---------------------+
                      |                     |
                      |    Energy Model     |
                      |                     |
                      |     Framework       |
                      |                     |
                      +---------------------+
                         ^       ^       ^
                         |       |       | em_register_perf_domain()
              +----------+       |       +---------+
              |                  |                 |
      +---------------+  +---------------+  +--------------+
      |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
      +---------------+  +---------------+  +--------------+
              ^                  ^                 ^
              |                  |                 |
      +--------------+   +---------------+  +--------------+
      | Device Tree  |   |   Firmware    |  |      ?       |
      +--------------+   +---------------+  +--------------+

Drivers (typically, but not limited to, CPUFreq drivers) can register
data in the EM framework using the em_register_perf_domain() API. The
calling driver must provide a callback function with a standardized
signature that will be used by the EM framework to build the power
cost tables of the performance domain. This design should offer a lot of
flexibility to calling drivers which are free of reading information
from any location and to use any technique to compute power costs.
Moreover, the capacity states registered by drivers in the EM framework
are not required to match real performance states of the target. This
is particularly important on targets where the performance states are
not known by the OS.

The power cost coefficients managed by the EM framework are specified in
milli-watts. Although the two potential users of those coefficients (IPA
and EAS) only need relative correctness, IPA specifically needs to
compare the power of CPUs with the power of other components (GPUs, for
example), which are still expressed in absolute terms in their
respective subsystems. Hence, specifying the power of CPUs in
milli-watts should help transitioning IPA to using the EM framework
without introducing new problems by keeping units comparable across
sub-systems.
On the longer term, the EM of other devices than CPUs could also be
managed by the EM framework, which would enable to remove the absolute
unit. However, this is not absolutely required as a first step, so this
extension of the EM framework is left for later.

On the client side, the EM framework offers APIs to access the power
cost tables of a CPU (em_cpu_get()), and to estimate the energy
consumed by the CPUs of a performance domain (em_pd_energy()). Clients
such as the task scheduler can then use these APIs to access the shared
data structures holding the Energy Model of CPUs.

Change-Id: I384cb3d28f37fe82c2943d7208a4cf5dcca2b6bd
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-4-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:11 +01:00
Quentin Perret
f3e45428b7 FROMLIST: sched/cpufreq: Prepare schedutil for Energy Aware Scheduling
Schedutil requests frequency by aggregating utilization signals from
the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of
them. Since Energy Aware Scheduling (EAS) needs to be able to predict
the frequency requests, it needs to forecast the decisions made by the
governor.

In order to prepare the introduction of EAS, introduce
schedutil_freq_util() to centralize the aforementioned signal
aggregation and make it available to both schedutil and EAS. Since
frequency selection and energy estimation still need to deal with RT and
DL signals slightly differently, schedutil_freq_util() is called with a
different 'type' parameter in those two contexts, and returns an
aggregated utilization signal accordingly. While at it, introduce the
map_util_freq() function which is designed to make schedutil's 25%
margin usable easily for both sugov and EAS.

As EAS will be able to predict schedutil's frequency requests more
accurately than any other governor by design, it'd be sensible to make
sure EAS cannot be used without schedutil. This will be done later, once
EAS has actually been introduced.

Change-Id: Idbeeb00926045507b73f9cba37630b38ae0816c0
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-3-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:10 +01:00
Quentin Perret
30f6e1b5c9 FROMLIST: sched: Relocate arch_scale_cpu_capacity
By default, arch_scale_cpu_capacity() is only visible from within the
kernel/sched folder. Relocate it to include/linux/sched/topology.h to
make it visible to other clients needing to know about the capacity of
CPUs, such as the Energy Model framework.

Change-Id: I144c7299e122201dbcadc431d55d0a6d24d90005
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Message-Id: <20181016101513.26919-2-quentin.perret@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-10-26 11:47:10 +01:00
Morten Rasmussen
d9634ab459 UPSTREAM: sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains
The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domain to take advantage of more caches
and core for SMT systems. It has recently been changed to be on all
non-NUMA topology level. However, spreading across domains with CPU
capacity asymmetry isn't desirable, e.g. spreading from high capacity to
low capacity CPUs even if high capacity CPUs aren't overutilized might
give access to more cache but the CPU will be slower and possibly lead
to worse overall throughput.

To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-13-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9c63e84db2)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: Iad5af7e95aedf555f4a8ffffaed7bc430fb8a169
2018-10-26 11:44:49 +01:00
Chris Redpath
cf203a200b UPSTREAM: sched/fair: Don't move tasks to lower capacity CPUs unless necessary
When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
CPU with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4ad3831a9d)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: I49e78e10920364ad94d3bc4f1407208cd1e60aba
2018-10-26 11:44:49 +01:00
Valentin Schneider
9551c952a2 UPSTREAM: sched/fair: Set rq->rd->overload when misfit
Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.

A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.

They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.

This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:

a,b,c,d are our CPU-hogging tasks _ signifies idling

  LITTLE_0 | a a a a _ _
  LITTLE_1 | b b b b _ _
  ---------|-------------
    big_0  | c c c c a a
    big_1  | d d d d b b
		    ^
		    ^
      Tasks end on the big CPUs, idle balance happens
      and the misfit tasks are pulled straight away

This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.

As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-10-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 757ffdd705)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: I608765f3dd4238202bce5e6996c2711b28cbf4ea
2018-10-26 11:44:48 +01:00
Valentin Schneider
6497830fe9 UPSTREAM: sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()
This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong
can happen because of the compiler.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-9-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit e90c8fe15a)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: Id46db398099d15a051566806f7ca0092d3973788
2018-10-26 11:44:48 +01:00
Valentin Schneider
f68d8febf6 UPSTREAM: sched/core: Change root_domain->overload type to int
sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.

The local 'overload' variable used in update_sd_lb_stats can remain
bool, as it won't impact any struct layout and can be assigned to the
root_domain field.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-8-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 575638d104)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: I05f2778c80b0c47a9f336b9d2baf1c9daae1f8f6
2018-10-26 11:44:47 +01:00
Valentin Schneider
e45affc455 UPSTREAM: sched/fair: Change 'prefer_sibling' type to bool
This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit dbbad71944)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: Ida430fc3f4c5acbc996da93f886b821d488281e2
2018-10-26 11:44:47 +01:00
Valentin Schneider
d66ab02a3c UPSTREAM: sched/fair: Kick nohz balance if rq->misfit_task_load
There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:

 - rq->nr_running >=2: Not relevant here because we are running a
   misfit task, it needs to be migrated regardless and potentially through
   active balance.

 - sds->nr_busy_cpus > 1: If there is only the misfit task being run
   on a group of low capacity CPUs, this will be evaluated to False.

 - rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
   misfit task needs to be migrated regardless of rt/IRQ pressure

As such, this commit adds an rq->misfit_task_load condition to trigger a
nohz kick.

The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for
the Android Common Kernel - see:

  https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-6-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 5fbdfae522)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: I1ad94a5688987950355b4fe0190cd5965aec035e
2018-10-26 11:44:47 +01:00
Morten Rasmussen
fa4d08f31d UPSTREAM: sched/fair: Consider misfit tasks when load-balancing
On asymmetric CPU capacity systems load intensive tasks can end up on
CPUs that don't suit their compute demand.  In this scenarios 'misfit'
tasks should be migrated to CPUs with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.

Misfit balancing only makes sense between a source group of lower
per-CPU capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.

The modifications are:

1. Only pick a group containing misfit tasks as the busiest group if the
   destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
   load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
   to be pulled ignoring average load.
4. Pick the CPU with the highest misfit load as the source CPU.
5. If the misfit task is alone on the source CPU, go for active
   balancing.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit cad68e552e)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: If41e108f3200ade85b4a130e21eaee62bb3860fd
2018-10-26 11:44:46 +01:00
Morten Rasmussen
92819a68c9 UPSTREAM: sched/fair: Add sched_group per-CPU max capacity
The current sg->min_capacity tracks the lowest per-CPU compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple CPUs if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either but a better choice for
some situations as it indicates that the sched_group definitely compute
capacity constrained either due to rt/irq pressure on all CPUs or
asymmetric CPU capacities (e.g. big.LITTLE).

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit e3d6d0cb66)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: I56324d1543d8d0d052a66cd94c7f8e5083bd9c28
2018-10-26 11:44:46 +01:00
Morten Rasmussen
bca5744aee UPSTREAM: sched/fair: Add 'group_misfit_task' load-balance type
To maximize throughput in systems with asymmetric CPU capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and CPU utilization
as well as per-CPU compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity CPU need to be
identified and migrated to a higher capacity CPU if possible to maximize
throughput.

To implement this additional policy an additional group_type
(load-balance scenario) is added: 'group_misfit_task'. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-CPU capacity. 'group_misfit_task' is only considered
if the system is not overloaded or imbalanced ('group_imbalanced' or
'group_overloaded').

Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each CPU is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3b1baa6496)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: If859d4e275a25e20ca004ed980c6e563e68d74ca
2018-10-26 11:44:45 +01:00
Morten Rasmussen
b1c506a014 UPSTREAM: sched/topology: Add static_key for asymmetric CPU capacity optimizations
The existing asymmetric CPU capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit df054e8445)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>

Change-Id: I6f8de25a9bb593a0032ad0ad7977c269b180ec51
2018-10-26 11:44:45 +01:00