linux

mirror of https://github.com/hardkernel/linux.git synced 2026-04-02 19:23:01 +09:00

Author	SHA1	Message	Date
Patrick Bellasi	8eb64d5f73	ANDROID: sched/events: Introduce util_est trace events Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Change-Id: I65e294c454369cbc15a29370d8a13ce358a95c39	2018-10-26 12:44:04 +01:00
Dietmar Eggemann	915679307f	ANDROID: sched/events: Introduce task_group load tracking trace event The trace event key load is mapped to: (1) load : cfs_rq->tg->load_avg The cfs_rq owned by the task_group is used as the only parameter for the trace event because it has a reference to the taskgroup and the cpu. Using the taskgroup as a parameter instead would require the cpu as a second parameter. A task_group is global and not per-cpu data. The cpu key only tells on which cpu the value was gathered. The following list shows examples of the key=value pairs for: (1) a task group: cpu=1 path=/tg1/tg11/tg111 load=517 (2) an autogroup: cpu=1 path=/autogroup-10 load=1050 We don't maintain a load signal for a root task group. The trace event is only defined if cfs group scheduling support (CONFIG_FAIR_GROUP_SCHED) is enabled. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Change-Id: I7de38e6b30a99d7c9887c94c707ded26b383b5f8	2018-10-26 12:44:04 +01:00
Dietmar Eggemann	4290369491	ANDROID: sched/events: Introduce sched_entity load tracking trace event The following trace event keys are mapped to: (1) load : se->avg.load_avg (2) rbl_load : se->avg.runnable_load_avg (3) util : se->avg.util_avg To let this trace event work for configurations w/ and w/o group scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following special handling is necessary for non-existent key=value pairs: path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED or the sched_entity represents a task. comm = "(null)" : In case sched_entity represents a task_group. pid = -1 : In case sched_entity represents a task_group. The following list shows examples of the key=value pairs in different configurations for: (1) a task: cpu=0 path=(null) comm=sshd pid=2206 load=102 rbl_load=102 util=102 (2) a taskgroup: cpu=1 path=/tg1/tg11/tg111 comm=(null) pid=-1 load=882 rbl_load=882 util=510 (3) an autogroup: cpu=0 path=/autogroup-13 comm=(null) pid=-1 load=49 rbl_load=49 util=48 (4) w/o CONFIG_FAIR_GROUP_SCHED: cpu=0 path=(null) comm=sshd pid=2211 load=301 rbl_load=301 util=265 The trace event is only defined for CONFIG_SMP. The helper functions __trace_sched_cpu(), __trace_sched_path() and __trace_sched_id() are extended to deal with sched_entities as well. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> [ Fixed issues related to the new pelt.c file ] Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: Id2e4d1ddb79c13412c80e4fa4147b9df3b1e212a	2018-10-26 12:44:04 +01:00
Dietmar Eggemann	b540360a9c	ANDROID: sched/events: Introduce cfs_rq load tracking trace event The following trace event keys are mapped to: (1) load : cfs_rq->avg.load_avg (2) rbl_load : cfs_rq->avg.runnable_load_avg (3) util : cfs_rq->avg.util_avg To let this trace event work for configurations w/ and w/o group scheduling support for cfs (CONFIG_FAIR_GROUP_SCHED) the following special handling is necessary for a non-existent key=value pair: path = "(null)" : In case of !CONFIG_FAIR_GROUP_SCHED. The following list shows examples of the key=value pairs in different configurations for: (1) a root task_group: cpu=4 path=/ load=6 rbl_load=6 util=331 (2) a task_group: cpu=1 path=/tg1/tg11/tg111 load=538 rbl_load=538 util=522 (3) an autogroup: cpu=3 path=/autogroup-18 load=997 rbl_load=997 util=517 (4) w/o CONFIG_FAIR_GROUP_SCHED: cpu=0 path=(null) load=314 rbl_load=314 util=289 The trace event is only defined for CONFIG_SMP. The helper function __trace_sched_path() can be used to get the length parameter of the dynamic array (path == NULL) and to copy the path into it (path != NULL). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> [ Fixed issues related to the new pelt.c file ] Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I1107044c52b74ecb3df69f3a45c1e530f0e59b1b	2018-10-26 12:44:03 +01:00
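The key=value payloads quoted in the load tracking trace event commits above are straightforward to post-process. The following is a minimal sketch in Python, assuming the payload is exactly the space-separated text shown in those commit messages. The function name `parse_load_event` is hypothetical, and a real trace line also carries a task and timestamp prefix (and a `comm` value may contain spaces), which this sketch does not handle.

```python
def parse_load_event(payload):
    """Split a 'key=value key=value ...' payload into a dict.

    Integer values are converted to int. The "(null)" placeholder, which the
    commit messages describe for non-existent keys, is mapped to None.
    """
    fields = {}
    for token in payload.split():
        key, _, value = token.partition("=")
        if value == "(null)":
            fields[key] = None
        elif value.lstrip("-").isdigit():
            fields[key] = int(value)
        else:
            fields[key] = value
    return fields


# Payload copied from the sched_entity example for a task group.
event = parse_load_event(
    "cpu=1 path=/tg1/tg11/tg111 comm=(null) pid=-1 load=882 rbl_load=882 util=510"
)
```

Here `event["comm"]` is `None` and `event["pid"]` is `-1`, which is how a consumer could tell a task_group entry apart from a task entry.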
Dietmar Eggemann	31d01ca5d4	ANDROID: sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG Define autogroup_path() even in the !CONFIG_SCHED_DEBUG case. If CONFIG_SCHED_AUTOGROUP is enabled the path of an autogroup has to be available to be printed in the load tracking trace events provided by this patch-stack regardless whether CONFIG_SCHED_DEBUG is set or not. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Change-Id: Iecf5466fc837f428ee545ddabe70024c152ff38d	2018-10-26 12:44:03 +01:00
Chris Redpath	df6cd51a54	ANDROID: sched/fair: Also do misfit in overloaded groups If we can classify the group as overloaded, that overrides any classification as misfit but we may still have misfit tasks present. Check the rq we're looking at to see if this is the case. Signed-off-by: Chris Redpath <chris.redpath@arm.com> [Removed stray reference to rq_has_misfit] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Change-Id: Ida8eb66aa625e34de3fe2ee1b0dd8a78926273d8	2018-10-26 12:26:04 +01:00
Chris Redpath	de526a3400	ANDROID: sched/fair: Don't balance misfits if it would overload local group When load balancing in a system with misfit tasks present, if we always pull a misfit task to the local group this can lead to pulling a running task from a smaller capacity CPU to a bigger CPU which is busy. In this situation, the pulled task is likely not to get a chance to run before an idle balance on another small CPU pulls it back. This penalises the pulled task as it is stopped for a short amount of time and then likely relocated to a different CPU (since the original CPU just did a NEWLY_IDLE balance and reset the periodic interval). If we only do this unconditionally for NEWLY_IDLE balance, we can be sure that any tasks and load which are present on the local group are related to short-running tasks which we are happy to displace for a longer running task in a system with misfit tasks present. However, other balance types should only pull a task if we think that the local group is underutilized - checking the number of tasks gives us a conservative estimate here since if they were short tasks we would have been doing NEWLY_IDLE balances instead. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Change-Id: I710add1ab1139482620b6addc8370ad194791beb	2018-10-26 12:25:58 +01:00
Chris Redpath	a5ece57c96	ANDROID: sched/fair: Attempt to improve throughput for asym cap systems In some systems the capacity and group weights line up to defeat all the small imbalance correction conditions in fix_small_imbalance, which can cause bad task placement. Add a new condition if the existing code can't see anything to fix: If we have asymmetric capacity, and there are more tasks than CPUs in the busiest group and there are fewer tasks than CPUs in the local group then we try to pull something. There could be transient small tasks which prevent this from working, but on the whole it is beneficial for those systems with inconvenient capacity/cluster size relationships. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Change-Id: Icf81cde215c082a61f816534b7990ccb70aee409	2018-10-26 12:25:51 +01:00
Steve Muckle	61ac701960	ANDROID: cpufreq/schedutil: add up/down frequency transition rate limits The rate-limit tunable in the schedutil governor applies to transitions to both lower and higher frequencies. On several platforms it is not the ideal tunable though, as it is difficult to get best power/performance figures using the same limit in both directions. It is common on mobile platforms with demanding user interfaces to want to increase frequency rapidly for example but decrease slowly. One example is a case where we have short busy periods followed by similar or longer idle periods. If we keep the rate-limit high enough, we will not go to higher frequencies soon enough. On the other hand, if we keep it too low, we will have too many frequency transitions, as we will always reduce the frequency after the busy period. It would be very useful if we can set low rate-limit while increasing the frequency (so that we can respond to the short busy periods quickly) and high rate-limit while decreasing frequency (so that we don't reduce the frequency immediately after the short busy period and that may avoid frequency transitions before the next busy period). Implement separate up/down transition rate limits. Note that the governor avoids frequency recalculations for a period equal to the minimum of up and down rate-limit. A global mutex is also defined to protect updates to min_rate_limit_us via two separate sysfs files. Note that this wouldn't change behavior of the schedutil governor for the platforms which wish to keep same values for both up and down rate limits. This is tested with the rt-app [1] on ARM Exynos, dual A15 processor platform. Testcase: Run a SCHED_OTHER thread on CPU0 which will emulate work-load for X ms of busy period out of the total period of Y ms, i.e. Y - X ms of idle period. The values of X/Y taken were: 20/40, 20/50, 20/70, i.e. idle periods of 20, 30 and 50 ms respectively. 
These were tested against values of up/down rate limits as: 10/10 ms and 10/40 ms. For every test we noticed a performance increase of 5-10% with the schedutil governor, which was very much expected. [Viresh]: Simplified user interface and introduced min_rate_limit_us + mutex, rewrote commit log and included test results. [1] https://github.com/scheduler-tools/rt-app/ Signed-off-by: Steve Muckle <smuckle.linux@gmail.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> (applied from https://marc.info/?l=linux-kernel&m=147936011103832&w=2) [trivial adaptations] Signed-off-by: Juri Lelli <juri.lelli@arm.com> [updated rate limiting & fixed conflicts] Signed-off-by: Chris Redpath <chris.redpath@arm.com> (cherry picked from commit 50c26fdb74563ec0cb4a83373d42667f4e83a23e) [Trivial cherry-pick conflicts] Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I18720a83855b196b8e21dcdc8deae79131635b84	2018-10-26 12:25:46 +01:00
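The up/down rate-limit behaviour described in the schedutil commit above can be modelled in a few lines. This is a minimal sketch in Python under stated assumptions, not the governor's actual code: the class `RateLimiter`, its field names, and the use of microseconds and kHz are all hypothetical choices made for illustration. It only encodes what the commit message says: recalculation is skipped for the minimum of the two limits, and increases and decreases are then gated by their own limit.

```python
from dataclasses import dataclass


@dataclass
class RateLimiter:
    up_delay_us: int          # minimum time between frequency increases
    down_delay_us: int        # minimum time between frequency decreases
    last_update_us: int = 0   # time of the last accepted frequency change
    cur_freq: int = 0         # currently requested frequency (kHz, assumed)

    @property
    def min_delay_us(self):
        # The commit message states that recalculation is avoided for a
        # period equal to the minimum of the up and down rate limits.
        return min(self.up_delay_us, self.down_delay_us)

    def request(self, now_us, next_freq):
        """Return True if the change to next_freq is accepted at now_us."""
        delta = now_us - self.last_update_us
        if delta < self.min_delay_us:
            return False      # too early to even re-evaluate
        if next_freq == self.cur_freq:
            return False      # nothing to change
        if next_freq > self.cur_freq and delta < self.up_delay_us:
            return False      # increase is still rate limited
        if next_freq < self.cur_freq and delta < self.down_delay_us:
            return False      # decrease is still rate limited
        self.cur_freq = next_freq
        self.last_update_us = now_us
        return True


# 10 ms up / 40 ms down, one of the configurations named in the test notes.
gov = RateLimiter(up_delay_us=10_000, down_delay_us=40_000, cur_freq=1_000_000)
```

With these values an increase requested 10 ms after the last change is accepted, while a decrease requested 15 ms after that increase is rejected and only goes through once 40 ms have passed.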
Quentin Perret	ad271f2d0d	ANDROID: cpufreq/schedutil: Select frequency using util_avg for RT Schedutil always requests max frequency whenever a RT task is running. Now that we have a better estimate of the utilization of RT runqueues, it is possible to make a less conservative decision and scale frequency according to the needs of the RT tasks. To do so, protect the RT-go-to-max code with a new sched_feature. The sched_feature is disabled by default, hence favoring energy savings as required in mobile environments. Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: Ic9f01c8703d4f843addaa0d684012a422fe9f3b8	2018-10-26 12:25:42 +01:00
Dietmar Eggemann	9c83d8435a	ANDROID: sched: Update max cpu capacity in case of max frequency constraints Wakeup balancing uses cpu capacity awareness and needs to know the system-wide maximum cpu capacity. Patch "sched: Store system-wide maximum cpu capacity in root domain" finds the system-wide maximum cpu capacity during scheduler domain hierarchy setup. This is sufficient as long as maximum frequency invariance is not enabled. If it is enabled, the system-wide maximum cpu capacity can change between scheduler domain hierarchy setups due to frequency capping. The cpu capacity is changed in update_cpu_capacity() which is called in load balance on the lowest scheduler domain hierarchy level. To be able to know if a change in cpu capacity for a certain cpu also has an effect on the system-wide maximum cpu capacity it is normally necessary to iterate over all cpus. This would be way too costly. That's why this patch follows a different approach. The unsigned long max_cpu_capacity value in struct root_domain is replaced with a struct max_cpu_capacity, containing value (the max_cpu_capacity) and cpu (the cpu index of the cpu providing the maximum cpu_capacity). Changes to the system-wide maximum cpu capacity and the cpu index are made if: 1 System-wide maximum cpu capacity < cpu capacity 2 System-wide maximum cpu capacity > cpu capacity and cpu index == cpu There are no changes to the system-wide maximum cpu capacity in all other cases. Atomic read and write access to the pair (max_cpu_capacity.val, max_cpu_capacity.cpu) is enforced by max_cpu_capacity.lock. The access to max_cpu_capacity.val in task_fits_max() is still performed without taking the max_cpu_capacity.lock. The code to set max cpu capacity in build_sched_domains() has been removed because the whole functionality is now provided by update_cpu_capacity() instead. This approach can introduce errors temporarily, e.g. 
in case the cpu currently providing the max cpu capacity has its cpu capacity lowered due to frequency capping and calls update_cpu_capacity() before any cpu which might provide the max cpu now. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>* Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> ( Fixed cherry-pick issues ) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: Idaa7a16723001e222e476de34df332558e48dd13	2018-10-26 12:25:37 +01:00
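The two update rules for the (value, cpu) pair described in the max cpu capacity commit above can be sketched as follows. This is a minimal Python model, assuming a hypothetical class `MaxCpuCapacity` with fields `val`, `cpu` and `lock` named after the ones in the commit message. It is not the kernel implementation, and a `threading.Lock` merely stands in for the lock that keeps the pair consistent.

```python
import threading


class MaxCpuCapacity:
    def __init__(self):
        self.val = 0      # system-wide maximum cpu capacity seen so far
        self.cpu = -1     # index of the cpu providing that maximum
        self.lock = threading.Lock()

    def update(self, cpu, capacity):
        """Apply the two rules from the commit message; return True on change."""
        with self.lock:
            # Rule 1: this cpu now exceeds the recorded maximum.
            raised = self.val < capacity
            # Rule 2: the cpu that provided the maximum has been lowered.
            lowered = self.val > capacity and self.cpu == cpu
            if raised or lowered:
                self.val = capacity
                self.cpu = cpu
                return True
            return False


m = MaxCpuCapacity()
m.update(0, 430)     # a small cpu reports first
m.update(4, 1024)    # a big cpu raises the maximum
m.update(4, 800)     # the big cpu gets frequency capped
capped = (m.val, m.cpu)
m.update(5, 1024)    # another big cpu later restores the maximum
```

The `capped` snapshot illustrates the temporary error the commit message mentions: after cpu 4 is capped the recorded maximum is 800 until an uncapped cpu (cpu 5 here) calls the update path.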
Dietmar Eggemann	6c401bf9b0	ANDROID: sched/fair: add arch scaling function for max frequency capping To be able to scale the cpu capacity by this factor introduce a call to the new arch scaling function arch_scale_max_freq_capacity() in update_cpu_capacity() and provide a default implementation which returns SCHED_CAPACITY_SCALE. Another subsystem (e.g. cpufreq) or architectural or platform specific code can overwrite this default implementation, exactly as for frequency and cpu invariance. It has to be enabled by the arch by defining arch_scale_max_freq_capacity to the actual implementation. Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> ( Fixed conflict with scaling against the PELT-based scale_rt_capacity ) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I770a8b1f4f7340e9e314f71c64a765bf880f4b4d	2018-10-26 12:25:19 +01:00
Quentin Perret	c06013206c	ANDROID: sched/fair: Bypass energy computation for prefer_idle tasks If the only pre-selected candidate CPU in find_energy_efficient_cpu() happens to be prev_cpu, there is no point in computing the system energy since we have nothing to compare it against, so we currently bail out early. The same logic can be extended when prefer_idle tasks are routed in the energy-aware wake-up path: if the only candidate is idle for a prefer_idle task, just select it no matter what the energy impact is. That should help speed up wake-ups of prefer_idle tasks, at least when find_best_target() is used for them. Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: Idd0e387e4a766061cc05d2584df3a31e4dabfd09	2018-10-26 12:25:00 +01:00
Chris Redpath	1d00c33d8b	ANDROID: sched: fair: Bypass energy-aware wakeup for prefer-idle tasks Use the upstream slow path to find an idle cpu for prefer-idle tasks. This slow-path is actually faster than the EAS path we are currently going through (compute_energy()) which is really slow. No performance degradation is seen with this and it reduces the delta quite a bit between upstream and out of tree code. It's not clear yet if using the mainline slow path task placement when a task has the schedtune attribute prefer_idle=1 is the right thing to do for products. Put the option to disable this behind a sched feature so we can try out both options. Signed-off-by: Joel Fernandes <joelaf@google.com> (refactored for 4.14 version) Signed-off-by: Chris Redpath <chris.redpath@arm.com> (cherry picked from commit c0ff131c88f68e4985793663144b6f9cf77be9d3) [ - Refactored for 4.17 version - Adjusted the commit header to the new function names ] Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: Icf762a101c92c0e3f9e61df0370247fa15455581	2018-10-26 12:20:50 +01:00
Brendan Jackman	88d05f9257	FROMLIST: sched/fair: Use wake_q length as a hint for wake_wide This patch adds a parameter to select_task_rq, sibling_count_hint allowing the caller, where it has this information, to inform the sched_class the number of tasks that are being woken up as part of the same event. The wake_q mechanism is one case where this information is available. select_task_rq_fair can then use the information to detect that it needs to widen the search space for task placement in order to avoid overloading the last-level cache domain's CPUs. * * * The reason I am investigating this change is the following use case on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which all repeatedly do X amount of work then pthread_barrier_wait (i.e. sleep until the last task finishes its X and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU finish faster, and then those CPUs pull over the tasks that are still running: v CPU v ->time-> ------------- 0 (big) 11111 /333 ------------- 1 (big) 22222 /444\| ------------- 2 (LITTLE) 333333/ ------------- 3 (LITTLE) 444444/ ------------- Now when task 4 hits the barrier (at \|) and wakes the others up, there are 4 tasks with prev_cpu=<big> and 0 tasks with prev_cpu=<little>. want_affine therefore means that we'll only look in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled on the bigs until the next load balance, something like this: v CPU v ->time-> ------------------------ 0 (big) 11111 /333 31313\33333 ------------------------ 1 (big) 22222 /444\|424\4444444 ------------------------ 2 (LITTLE) 333333/ \222222 ------------------------ 3 (LITTLE) 444444/ \1111 ------------------------ ^^^ underutilization So, I'm trying to get want_affine = 0 for these tasks. I don't _think_ any incarnation of the wakee_flips mechanism can help us here because which task is waker and which tasks are wakees generally changes with each iteration. 
However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the nice property that we know exactly how many tasks are being woken, so we can cheat. It might be a disadvantage that we "widen" _every_ task that's woken in an event, while select_idle_sibling would work fine for the first sd_llc_size - 1 tasks. IIUC, if wake_affine() behaves correctly this trick wouldn't be necessary on SMP systems, so it might be best guarded by the presence of SD_ASYM_CPUCAPACITY? * * * Final note.. In order to observe "perfect" behaviour for this use case, I also had to disable the TTWU_QUEUE sched feature. Suppose during the wakeup above we are working through the work queue and have placed tasks 3 and 2, and are about to place task 1: v CPU v ->time-> -------------- 0 (big) 11111 /333 3 -------------- 1 (big) 22222 /444\|4 -------------- 2 (LITTLE) 333333/ 2 -------------- 3 (LITTLE) 444444/ <- Task 1 should go here -------------- If TTWU_QUEUE is enabled, we will not yet have enqueued task 2 (having instead sent a reschedule IPI) or attached its load to CPU 2. So we are likely to also place task 1 on cpu 2. Disabling TTWU_QUEUE means that we enqueue task 2 before placing task 1, solving this issue. TTWU_QUEUE is there to minimise rq lock contention, and I guess that this contention is less of an issue on big.LITTLE systems since they have relatively few CPUs, which suggests the trade-off makes sense here. 
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Matt Fleming <matt@codeblueprint.co.uk> ( - Applied from https://patchwork.kernel.org/patch/9895261/ - Fixed trivial conflict in kernel/sched/core.c - Fixed select_task_rq_idle, now in kernel/sched/idle.c - Fixed trivial conflict in select_task_rq_fair ) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I3cfc4bf48c3d7feef969db4d22449f4fbb4f795d	2018-10-26 12:15:52 +01:00
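The idea in the wake_q commit above, using the number of tasks woken by one event to decide whether to search beyond the last-level cache domain, can be sketched as a small predicate. This is a minimal Python model under stated assumptions: the comparison of `sibling_count_hint` against the LLC size is an assumption about how the hint is consumed (the commit message only says the hint is used to detect that the search space should be widened), and the wakee_flips check is a simplified rendering of the existing heuristic the message refers to. All function and parameter names here are illustrative.

```python
def wake_wide(waker_flips, wakee_flips, llc_size, sibling_count_hint=0):
    """Return True if task placement should look beyond the LLC domain."""
    # Assumed use of the hint: if one event wakes at least as many tasks as
    # there are CPUs sharing the last-level cache, packing them all into
    # that domain would overload it, so widen the search.
    if sibling_count_hint >= llc_size:
        return True

    # Simplified wakee_flips heuristic: go wide only when both sides switch
    # wakeup partners often relative to the LLC size.
    master = max(waker_flips, wakee_flips)
    slave = min(waker_flips, wakee_flips)
    if slave < llc_size or master < slave * llc_size:
        return False
    return True


# Barrier example from the commit message: four tasks woken together on a
# system where two big CPUs share the last-level cache.
barrier_wakeup = wake_wide(0, 0, llc_size=2, sibling_count_hint=4)
single_wakeup = wake_wide(0, 0, llc_size=2, sibling_count_hint=1)
```

In this sketch the barrier wakeup goes wide even though the wakee_flips counters are zero, which is the case the commit message says the wakee_flips mechanism alone cannot cover.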
Chris Redpath	af362094df	ANDROID: sched: Unconditionally honor sync flag for energy-aware wakeups Since we don't do energy-aware wakeups when we are overutilized, always honoring sync wakeups in this state does not prevent wake-wide mechanics overruling the flag as normal. This patch is based upon previous work to build EAS for android products. sync-hint code taken from commit `4a5e890ec6` "sched/fair: add tunable to force selection at cpu granularity" written by Juri Lelli <juri.lelli@arm.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> (cherry-picked from commit f1ec666a62dec1083ed52fe1ddef093b84373aaf) [ Moved the feature to find_energy_efficient_cpu() ] Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I4b3d79141fc8e53dc51cd63ac11096c2e3cb10f5	2018-10-26 12:14:32 +01:00
Chris Redpath	c27c56105d	ANDROID: Add find_best_target to minimise energy calculation overhead find_best_target started life as an optimised energy cpu selection algorithm designed to be efficient and high performance and integrate well with schedtune and userspace cpuset configuration. It has had many many small tweaks over time, and the version added here is forward ported from the version used in android-4.4 and android-4.9. The linkage to the rest of EAS is slightly different, however. This version is split into the three main use-cases and addresses them in priority order: A) latency sensitive tasks B) non latency sensitive tasks on IDLE CPUs C) non latency sensitive tasks on ACTIVE CPUs Case A) Latency sensitive tasks Unconditionally favoring tasks that prefer idle CPU to improve latency. When we do enter here, we are looking for: - an idle CPU, whatever its idle_state is, since the first CPUs we explore are more likely to be reserved for latency sensitive tasks. - a non idle CPU where the task fits in its current capacity and has the maximum spare capacity. - a non idle CPU with lower contention from other tasks and running at the lowest possible OPP. The last two goals try to favor a non idle CPU where the task can run as if it is "almost alone". A maximum spare capacity CPU is favored since the task already fits into that CPU's capacity without waiting for an OPP change. For any case other than case A, we avoid CPUs which would become overutilized if we placed the task there. Case B) Non latency sensitive tasks on IDLE CPUs. Find an optimal backup IDLE CPU for non latency sensitive tasks. Here we are looking for: - minimizing the capacity_orig, i.e. preferring LITTLE CPUs If IDLE cpus are available, we prefer to choose one in order to spread tasks and improve performance. Case C) Non latency sensitive tasks on ACTIVE CPUs. Pack tasks in the most energy efficient capacities. This task packing strategy prefers more energy efficient CPUs (i.e. 
pack on smaller maximum capacity CPUs) while also trying to spread tasks to run them all at the lower OPP. This assumes for example that it's more energy efficient to run two tasks on two CPUs at a lower OPP than packing both on a single CPU but running that CPU at a higher OPP. This code has had many other contributors over the development listed here as Cc. Cc: Ke Wang <ke.wang@spreadtrum.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Srinath Sridharan <srinathsr@google.com> Cc: Todd Kjos <tkjos@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> (cherry picked from commit f240e44406558b17ff7765f252b0bcdcbc15126f) [ - Removed tracepoints - Took capacity_curr_of() from "7f6fb82 ANDROID: sched: EAS: take cstate into account when selecting idle core" - Re-use sd_ea from find_energy_efficient_cpu() / removed start_cpu() - Mark candidates with a cpumask given by feec() - Squashed Ionela's tri-gear fbt fixes from android-4.14 - Squashed Patrick's changes related to util_est from android-4.14 ] Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I9500d308c879dd53b08adeda8f988238e39cc392	2018-10-26 12:12:05 +01:00
Quentin Perret	9378697a9f	ANDROID: sched/fair: Factor out CPU selection from find_energy_efficient_cpu find_energy_efficient_cpu() is composed of two steps; we first look for the CPU with the max spare capacity in each frequency domain, and then the impact on energy of each candidate is estimated. In order to make it easier to implement other CPU selection policies, let's factor the candidate selection algorithm out of find_energy_efficient_cpu(), and mark the candidates in a mask. This should result in no functional difference. Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I85a28880f01fcd11d7af28f9fbc1fe0cf4f197cf	2018-10-26 12:06:24 +01:00
Quentin Perret	2e88529cf8	ANDROID: sched: Introduce sysctl_sched_cstate_aware Introduce a new sysctl for this option, 'sched_cstate_aware'. When this is enabled, the scheduler can make use of the idle state indexes in order to break the tie between potential CPU candidates. This patch is based on 7f6fb825d6bc ("ANDROID: sched: EAS: take cstate into account when selecting idle core") from android-4.14. All credit goes to the authors. Change-Id: Ia076cf32faff91e90905291fa6f7924dc3dd6458 Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:58:00 +01:00
Morten Rasmussen	3e63fb2c91	ANDROID: sched, cpuidle: Track cpuidle state index in the scheduler The idle-state of each cpu is currently pointed to by rq->idle_state but there isn't any information in the struct cpuidle_state that can be used to look up the idle-state energy model data stored in struct sched_group_energy. For this purpose it is necessary to store the idle state index as well. Ideally, the idle-state data should be unified. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Change-Id: Ib3d1178512735b0e314881f73fb8ccff5a69319f Signed-off-by: Chris Redpath <chris.redpath@arm.com> (cherry picked from commit a732c97420e109956c20f34c70b91e6d06f5df31) [ Fixed trivial cherry-pick conflict in kernel/sched/sched.h ] Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:57:55 +01:00
Patrick Bellasi	68dbff9ce9	ANDROID: sched: fair/tune: Add schedtune with cgroups interface Schedtune is the framework we use in Android to allow userspace task classification and provides a CGroup controller which has two attributes per group. * schedtune.boost * schedtune.prefer_idle Schedtune itself provides task and CPU utilization boosting. EAS in the fair scheduler uses boosted utilization and prefer_idle status to control the algorithm used for wakeup task placement. Boosting: The task utilization signal, which is derived from PELT signals and properly scaled to be architecture and frequency invariant, is used by EAS as an estimation of the task requirements in terms of CPU bandwidth. Schedtune allows userspace to assign a percentage boost to each group and this boost is used to calculate an additional utilization margin. The margin added to the original utilization is: 1. computed based on the "boosting strategy" in use 2. proportional to boost value defined by the "taskgroup" value The boosted signal is used by EAS for task placement, and boosted CPU utilization (if boosted tasks are running) is given when schedutil requests utilization. Prefer_idle: When this attribute is 1 for a group, this is used as a signal from userspace that tasks in this group need to be serviced with the minimum latency possible. Previous versions of schedtune had much more functionality around allowing a more tuneable tradeoff between performance and energy, however this has not been used a lot up until now. If necessary, we can easily resurrect it based upon old code. 
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Chris Redpath <chris.redpath@arm.com> (cherry-picked from commit 159c14f0397790405b9e8435184366b31f2ed15b) [ - Removed tracepoints (to be added in a separate patch) - Integrated boosted_cpu_util() with cpu_util_cfs() - Backported Patrick's util_est related fixes from android-4.14 ] Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: Ie2fd63d82f604f34bcbc7e1ca9b5af1bdcc037e0	2018-10-26 11:57:46 +01:00
Dietmar Eggemann	af31057250	ANDROID: sched: Enable idle balance to pull single task towards cpu with higher capacity We do not want to miss out on the ability to pull a single remaining task from a potential source cpu towards an idle destination cpu. Add an extra criterion to need_active_balance() to kick off active load balance if the source cpu is over-utilized and has lower capacity than the destination cpu. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Change-Id: Iea3b42b2a0f8d8a4252e42ba67cc33381a4a1075	2018-10-26 11:53:40 +01:00
Morten Rasmussen	97f2533053	ANDROID: sched: Prevent unnecessary active balance of single task in sched group Scenarios with the busiest group having just one task and the local being idle on topologies with sched groups with different numbers of cpus manage to dodge all load-balance bailout conditions, resulting in the nr_balance_failed counter being incremented. This eventually causes a pointless active migration of the task. This patch prevents this by not incrementing the counter when the busiest group only has one task. ASYM_PACKING migrations and migrations due to reduced capacity should still take place as these are explicitly captured by need_active_balance(). A better solution would be to not attempt the load-balance in the first place, but that requires significant changes to the order of bailout conditions and statistics gathering. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Change-Id: I28f69c72febe0211decbe77b7bc3e48839d3d7b3	2018-10-26 11:53:40 +01:00
Quentin Perret	54655c9f85	FROMLIST: sched/fair: Select an energy-efficient CPU on task wake-up If an Energy Model (EM) is available and if the system isn't overutilized, re-route waking tasks into an energy-aware placement algorithm. The selection of an energy-efficient CPU for a task is achieved by estimating the impact on system-level active energy resulting from the placement of the task on the CPU with the highest spare capacity in each performance domain. This strategy spreads tasks in a performance domain and avoids overly aggressive task packing. The best CPU energy-wise is then selected if it saves a large enough amount of energy with respect to prev_cpu. Although it has already shown significant benefits on some existing targets, this approach cannot scale to platforms with numerous CPUs. This is an attempt to do something useful as writing a fast heuristic that performs reasonably well on a broad spectrum of architectures isn't an easy task. As such, the scope of usability of the energy-aware wake-up path is restricted to systems with the SD_ASYM_CPUCAPACITY flag set, and where the EM isn't too complex. Change-Id: I8c6384af904668f405319ed4e05054a7fa449192 Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-15-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:12 +01:00
Quentin Perret	0d548be9f8	FROMLIST: sched/fair: Introduce an energy estimation helper function In preparation for the definition of an energy-aware wakeup path, introduce a helper function to estimate the consequence on system energy when a specific task wakes-up on a specific CPU. compute_energy() estimates the capacity state to be reached by all performance domains and estimates the consumption of each online CPU according to its Energy Model and its percentage of busy time. Change-Id: Ia291deb16ec9a75f3c0252abbb5d864e3300562d Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-14-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:12 +01:00
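A rough sketch of the kind of estimate the commit message above describes: pick the capacity state a performance domain would reach, then scale that state's power by how busy the CPUs are. This is not the kernel's compute_energy(). The struct layouts, the names `cap_state_sketch`, `perf_domain_sketch` and `pd_energy_sketch`, and the rule "lowest state that covers the busiest CPU" are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical capacity state: a capacity level and the power drawn at it. */
struct cap_state_sketch {
	unsigned long capacity; /* same scale as the utilization values */
	unsigned long power;    /* power cost at this state */
};

/* Hypothetical performance domain: a table sorted by ascending capacity. */
struct perf_domain_sketch {
	const struct cap_state_sketch *table;
	size_t nr_states; /* assumed to be at least 1 */
};

/*
 * Estimate the consumption of one performance domain.
 * cpu_util[i] is the utilization of the i-th CPU of the domain.
 */
static unsigned long pd_energy_sketch(const struct perf_domain_sketch *pd,
				      const unsigned long *cpu_util,
				      size_t nr_cpus)
{
	unsigned long max_util = 0, sum_util = 0;
	const struct cap_state_sketch *cs;
	size_t i;

	for (i = 0; i < nr_cpus; i++) {
		sum_util += cpu_util[i];
		if (cpu_util[i] > max_util)
			max_util = cpu_util[i];
	}

	/* Assumption: the domain runs at the lowest state that fits the
	 * busiest CPU, falling back to the highest state otherwise. */
	cs = &pd->table[pd->nr_states - 1];
	for (i = 0; i < pd->nr_states; i++) {
		if (pd->table[i].capacity >= max_util) {
			cs = &pd->table[i];
			break;
		}
	}

	/* Each CPU is busy for util / capacity of the time at that state,
	 * so the domain estimate is power * sum_util / capacity. */
	return cs->power * sum_util / cs->capacity;
}
```

Summing this value over every performance domain, once with the task placed on a candidate CPU and once without, would give the kind of before/after comparison a wake-up path needs.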
Morten Rasmussen	4f37ec6ac1	FROMLIST: sched: Add over-utilization/tipping point indicator Energy-aware scheduling is only meant to be active while the system is _not_ over-utilized. That is, there are spare cycles available to shift tasks around based on their actual utilization to get a more energy-efficient task distribution without depriving any tasks. When above the tipping point task placement is done the traditional way based on load_avg, spreading the tasks across as many cpus as possible based on priority scaled load to preserve smp_nice. Below the tipping point we want to use util_avg instead. We need to define a criterion for when we make the switch. The util_avg for each cpu converges towards 100% regardless of how many additional tasks we may put on it. If we define over-utilized as: sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity) some individual cpus may be over-utilized running multiple tasks even when the above condition is false. That should be okay as long as we try to spread the tasks out to avoid per-cpu over-utilization as much as possible and if all tasks have the _same_ priority. If the latter isn't true, we have to consider priority to preserve smp_nice. For example, we could have n_cpus nice=-10 util_avg=55% tasks and n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks getting their own as we have 1.5*n_cpus tasks in total and 55%+55% is less over-utilized than 55%+60% for those cpus that have to be shared. The system utilization is only 85% of the system capacity, but we are breaking smp_nice. To be sure not to break smp_nice, we have defined over-utilization conservatively as when any cpu in the system is fully utilized at its highest frequency instead: cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg to factor in priority to preserve smp_nice.
With this definition, we can skip periodic load-balance as no cpu has an always-running task when the system is not over-utilized. All tasks will be periodic and we can balance them at wake-up. This conservative condition does however mean that some scenarios that could benefit from energy-aware decisions even if one cpu is fully utilized would not get those benefits. For systems where some cpus might have reduced capacity (RT-pressure and/or big.LITTLE), we want periodic load-balance checks as soon as just a single cpu is fully utilized, as it might be one of those with reduced capacity and in that case we want to migrate it. cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> [ Added a comment explaining why new tasks are not accounted during overutilization detection ] Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-13-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I19f816054adfd2dfa9a69fa92c1589f62794a218	2018-10-26 11:47:12 +01:00
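The conservative per-cpu condition from the commit message above, util_avg + margin > capacity for any cpu, can be written out as a small standalone sketch. The struct, the function names and the choice of passing the margin as an absolute value are assumptions made for illustration; the commit message does not say how the margin is represented.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cpu snapshot; field names are illustrative only. */
struct cpu_sample {
	unsigned long util_avg; /* utilization of the cpu */
	unsigned long capacity; /* capacity of the cpu, same scale as util_avg */
};

/* A single cpu is over-utilized when util_avg + margin > capacity. */
static bool cpu_overutilized_sketch(const struct cpu_sample *c,
				    unsigned long margin)
{
	return c->util_avg + margin > c->capacity;
}

/* The system counts as over-utilized as soon as any one cpu is. */
static bool system_overutilized_sketch(const struct cpu_sample *cpus,
				       size_t nr_cpus, unsigned long margin)
{
	size_t i;

	for (i = 0; i < nr_cpus; i++) {
		if (cpu_overutilized_sketch(&cpus[i], margin))
			return true;
	}
	return false;
}
```

Checking each cpu separately, rather than comparing summed utilization against summed capacity, is what makes the condition trip as soon as one cpu is nearly full.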
Quentin Perret	e201c56b96	FROMLIST: sched/fair: Clean-up update_sg_lb_stats parameters In preparation for the introduction of a new root domain flag which can be set during load balance (the 'overutilized' flag), clean-up the set of parameters passed to update_sg_lb_stats(). More specifically, the 'local_group' and 'local_idx' parameters can be removed since they can easily be reconstructed from within the function. While at it, transform the 'overload' parameter into a flag stored in the 'sg_status' parameter hence facilitating the definition of new flags when needed. Change-Id: Ic2ccb51fdc08d7da0f8cc0442ef97cbcb4a52c86 Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-12-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:12 +01:00
Quentin Perret	b78eec5fb7	FROMLIST: sched: Introduce a sysctl for Energy Aware Scheduling In its current state, Energy Aware Scheduling (EAS) starts automatically on asymmetric platforms having an Energy Model (EM). However, there are users who want to have an EM (for thermal management for example), but don't want EAS with it. In order to let users disable EAS explicitly, introduce a new sysctl called 'sched_energy_aware'. It is enabled by default so that EAS can start automatically on platforms where it makes sense. Flipping it to 0 rebuilds the scheduling domains and disables EAS. Change-Id: I55764e70bf5e90795d2269ec9135ae6e82794a2b Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-11-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:12 +01:00
Quentin Perret	78814285a9	FROMLIST: sched: Introduce sched_energy_present static key In order to make sure Energy Aware Scheduling (EAS) will not impact systems where no Energy Model is available, introduce a static key guarding the access to EAS code. Since EAS is enabled on a per-root-domain basis, the static key is enabled when at least one root domain meets all conditions for EAS. Change-Id: Ifa3e490e023d3f57b2f1b1272d5ea58d6ae726ab Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-10-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:11 +01:00
Quentin Perret	1e6b1214f1	FROMLIST: sched/topology: Make Energy Aware Scheduling depend on schedutil Energy Aware Scheduling (EAS) is designed with the assumption that frequencies of CPUs follow their utilization value. When using a CPUFreq governor other than schedutil, the chances of this assumption being true are small, if any. When schedutil is being used, EAS' predictions are at least consistent with the frequency requests. Although those requests have no guarantees to be honored by the hardware, they should at least guide DVFS in the right direction and provide some hope in regards to the EAS model being accurate. To make sure EAS is only used in a sane configuration, create a strong dependency on schedutil being used. Since having sugov compiled-in does not provide that guarantee, make CPUFreq call a scheduler function on governor changes hence letting it rebuild the scheduling domains, check the governors of the online CPUs, and enable/disable EAS accordingly. Change-Id: I872949134f97d2772fc681b7393eaed7f0e224f2 Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-9-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:11 +01:00
Quentin Perret	7aeb1244ec	FROMLIST: sched/topology: Disable EAS on inappropriate platforms Energy Aware Scheduling (EAS) in its current form is most relevant on platforms with asymmetric CPU topologies (e.g. Arm big.LITTLE) since this is where there is a lot of potential for saving energy through scheduling. This is particularly true since the Energy Model only includes the active power costs of CPUs, hence not providing enough data to compare packing-vs-spreading strategies. As such, disable EAS on root domains where the SD_ASYM_CPUCAPACITY flag is not set. While at it, disable EAS on systems where the complexity of the Energy Model is too high since that could lead to unacceptable scheduling overhead. All in all, EAS can be used on a root domain if and only if: 1. an Energy Model is available; 2. the root domain has an asymmetric CPU capacity topology; 3. the complexity of the root domain's EM is low enough to keep scheduling overheads low. Change-Id: Ia557fbb226be44ed40d7d22661773326276bf9c8 cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-8-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:11 +01:00
Quentin Perret	a4dffa6a32	FROMLIST: sched/topology: Lowest CPU asymmetry sched_domain level pointer Add another member to the family of per-cpu sched_domain shortcut pointers. This one, sd_asym_cpucapacity, points to the lowest level at which the SD_ASYM_CPUCAPACITY flag is set. While at it, rename the sd_asym shortcut to sd_asym_packing to avoid confusions. Generally speaking, the largest opportunity to save energy via scheduling comes from a smarter exploitation of heterogeneous platforms (i.e. big.LITTLE). Consequently, the sd_asym_cpucapacity shortcut will be used at first as the lowest domain where Energy-Aware Scheduling (EAS) should be applied. For example, it is possible to apply EAS within a socket on a multi-socket system, as long as each socket has an asymmetric topology. Energy-aware cross-sockets wake-up balancing can only happen if this_cpu and prev_cpu are in different sockets. Change-Id: Ie777a1733991d40ce063b318e915199ba3c5416a cc: Ingo Molnar <mingo@redhat.com> cc: Peter Zijlstra <peterz@infradead.org> Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-7-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:11 +01:00
Quentin Perret	2b73358bce	FROMLIST: sched/topology: Reference the Energy Model of CPUs when available The existing scheduling domain hierarchy is defined to map to the cache topology of the system. However, Energy Aware Scheduling (EAS) requires more knowledge about the platform, and specifically needs to know about the span of Performance Domains (PD), which do not always align with caches. To address this issue, use the Energy Model (EM) of the system to extend the scheduler topology code with a representation of the PDs, alongside the scheduling domains. More specifically, a linked list of PDs is attached to each root domain. When multiple root domains are in use, each list contains only the PDs covering the CPUs of its root domain. If a PD spans over CPUs of multiple different root domains, it will be duplicated in all lists. The lists are fully maintained by the scheduler from partition_sched_domains() in order to cope with hotplug and cpuset changes. As for scheduling domains, the lists are protected by RCU to ensure safe concurrent updates. Change-Id: I27195ab35072210bdef91e78944d1407ff61f644 Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-6-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:11 +01:00
Quentin Perret	b8c80a0f42	FROMLIST: PM / EM: Expose the Energy Model in sysfs Expose the Energy Model (read-only) of all performance domains in sysfs for convenience. To do so, add a kobject to the CPU subsystem under the umbrella of which a kobject for each performance domain is attached. The resulting hierarchy is as follows for a platform with two performance domains for example:
    /sys/devices/system/cpu/energy_model
    ├── pd0
    │   ├── cost
    │   ├── cpus
    │   ├── frequency
    │   └── power
    └── pd4
        ├── cost
        ├── cpus
        ├── frequency
        └── power
In this implementation, the kobject abstraction is only used as a convenient way of exposing data to sysfs. However, it could also be used in the future to allocate and release performance domains in a more dynamic way using reference counting. Change-Id: Ia98bcae21c3578e385be9c6b030c9adff8210909 Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-5-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:11 +01:00
Quentin Perret	9cbf2b68ca	FROMLIST: PM: Introduce an Energy Model management framework Several subsystems in the kernel (task scheduler and/or thermal at the time of writing) can benefit from knowing about the energy consumed by CPUs. Yet, this information can come from different sources (DT or firmware for example), in different formats, hence making it hard to exploit without a standard API. As an attempt to address this, introduce a centralized Energy Model (EM) management framework which aggregates the power values provided by drivers into a table for each performance domain in the system. The power cost tables are made available to interested clients (e.g. task scheduler or thermal) via platform-agnostic APIs. The overall design is represented by the diagram below (focused on Arm-related drivers as an example, but applicable to any architecture):
    +---------------+  +-----------------+  +-------------+
    | Thermal (IPA) |  | Scheduler (EAS) |  |    Other    |
    +---------------+  +-----------------+  +-------------+
            |                   | em_pd_energy()   |
            |                   | em_cpu_get()     |
            +-----------+       |         +--------+
                        |       |         |
                        v       v         v
                     +---------------------+
                     |                     |
                     |    Energy Model     |
                     |                     |
                     |     Framework       |
                     |                     |
                     +---------------------+
                       ^       ^        ^
                       |       |        | em_register_perf_domain()
            +----------+       |        +---------+
            |                  |                  |
    +---------------+  +---------------+  +--------------+
    |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
    +---------------+  +---------------+  +--------------+
            ^                  ^                  ^
            |                  |                  |
    +--------------+   +---------------+  +--------------+
    | Device Tree  |   |   Firmware    |  |      ?       |
    +--------------+   +---------------+  +--------------+
Drivers (typically, but not limited to, CPUFreq drivers) can register data in the EM framework using the em_register_perf_domain() API. The calling driver must provide a callback function with a standardized signature that will be used by the EM framework to build the power cost tables of the performance domain.
This design should offer a lot of flexibility to calling drivers, which are free to read information from any location and to use any technique to compute power costs. Moreover, the capacity states registered by drivers in the EM framework are not required to match real performance states of the target. This is particularly important on targets where the performance states are not known by the OS. The power cost coefficients managed by the EM framework are specified in milli-watts. Although the two potential users of those coefficients (IPA and EAS) only need relative correctness, IPA specifically needs to compare the power of CPUs with the power of other components (GPUs, for example), which are still expressed in absolute terms in their respective subsystems. Hence, specifying the power of CPUs in milli-watts should help transitioning IPA to using the EM framework without introducing new problems by keeping units comparable across sub-systems. In the longer term, the EM of devices other than CPUs could also be managed by the EM framework, which would make it possible to remove the absolute unit. However, this is not absolutely required as a first step, so this extension of the EM framework is left for later. On the client side, the EM framework offers APIs to access the power cost tables of a CPU (em_cpu_get()), and to estimate the energy consumed by the CPUs of a performance domain (em_pd_energy()). Clients such as the task scheduler can then use these APIs to access the shared data structures holding the Energy Model of CPUs. Change-Id: I384cb3d28f37fe82c2943d7208a4cf5dcca2b6bd Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-4-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:11 +01:00
Quentin Perret	f3e45428b7	FROMLIST: sched/cpufreq: Prepare schedutil for Energy Aware Scheduling Schedutil requests frequency by aggregating utilization signals from the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of them. Since Energy Aware Scheduling (EAS) needs to be able to predict the frequency requests, it needs to forecast the decisions made by the governor. In order to prepare the introduction of EAS, introduce schedutil_freq_util() to centralize the aforementioned signal aggregation and make it available to both schedutil and EAS. Since frequency selection and energy estimation still need to deal with RT and DL signals slightly differently, schedutil_freq_util() is called with a different 'type' parameter in those two contexts, and returns an aggregated utilization signal accordingly. While at it, introduce the map_util_freq() function which is designed to make schedutil's 25% margin usable easily for both sugov and EAS. As EAS will be able to predict schedutil's frequency requests more accurately than any other governor by design, it'd be sensible to make sure EAS cannot be used without schedutil. This will be done later, once EAS has actually been introduced. Change-Id: Idbeeb00926045507b73f9cba37630b38ae0816c0 Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-3-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:10 +01:00
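To make the 25% margin from the commit message above concrete, here is a minimal sketch of a utilization-to-frequency mapping. The function name `map_util_freq_sketch`, its parameter names and the choice of integer arithmetic are assumptions; the commit message only says that map_util_freq() makes the 25% margin usable by both sugov and EAS.

```c
/*
 * Map a utilization value to a frequency request with a 25% margin:
 *   freq = 1.25 * max_freq * util / max_capacity
 * max_freq + (max_freq >> 2) is max_freq plus a quarter of it.
 */
static unsigned long map_util_freq_sketch(unsigned long util,
					  unsigned long max_freq,
					  unsigned long max_capacity)
{
	return (max_freq + (max_freq >> 2)) * util / max_capacity;
}
```

With a capacity scale of 1024 and a maximum frequency of 2000000 kHz, a CPU at half utilization (512) maps to a request of 1250000 kHz, a quarter above the 1000000 kHz that a plain proportional mapping would give.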
Quentin Perret	30f6e1b5c9	FROMLIST: sched: Relocate arch_scale_cpu_capacity By default, arch_scale_cpu_capacity() is only visible from within the kernel/sched folder. Relocate it to include/linux/sched/topology.h to make it visible to other clients needing to know about the capacity of CPUs, such as the Energy Model framework. Change-Id: I144c7299e122201dbcadc431d55d0a6d24d90005 Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Quentin Perret <quentin.perret@arm.com> Message-Id: <20181016101513.26919-2-quentin.perret@arm.com> Signed-off-by: Quentin Perret <quentin.perret@arm.com>	2018-10-26 11:47:10 +01:00
Morten Rasmussen	d9634ab459	UPSTREAM: sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains The 'prefer sibling' sched_domain flag is intended to encourage spreading tasks to sibling sched_domains to take advantage of more caches and cores for SMT systems. It has recently been changed to be on all non-NUMA topology levels. However, spreading across domains with CPU capacity asymmetry isn't desirable, e.g. spreading from high capacity to low capacity CPUs even if high capacity CPUs aren't overutilized might give access to more cache but the CPU will be slower and possibly lead to worse overall throughput. To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain level immediately below SD_ASYM_CPUCAPACITY. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-13-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `9c63e84db2`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: Iad5af7e95aedf555f4a8ffffaed7bc430fb8a169	2018-10-26 11:44:49 +01:00
Chris Redpath	cf203a200b	UPSTREAM: sched/fair: Don't move tasks to lower capacity CPUs unless necessary When lower capacity CPUs are load balancing and considering to pull something from a higher capacity group, we should not pull tasks from a CPU with only one task running as this is guaranteed to impede progress for that task. If there is more than one task running, load balance in the higher capacity group would have already made any possible moves to resolve imbalance and we should make better use of system compute capacity by moving a task if we still have more than one running. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `4ad3831a9d`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I49e78e10920364ad94d3bc4f1407208cd1e60aba	2018-10-26 11:44:49 +01:00
Valentin Schneider	9551c952a2	UPSTREAM: sched/fair: Set rq->rd->overload when misfit Idle balance is a great opportunity to pull a misfit task. However, there are scenarios where misfit tasks are present but idle balance is prevented by the overload flag. A good example of this is a workload of n identical tasks. Let's suppose we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly CPU-intensive tasks - for the sake of simplicity let's say they are just CPU hogs, even when running on big CPUs. They are identical tasks, so on an SMP system they should all end at (roughly) the same time. However, in our case the LITTLE CPUs are less performing than the big CPUs, so tasks running on the LITTLEs will have a longer completion time. This means that the big CPUs will complete their work earlier, at which point they should pull the tasks from the LITTLEs. What we want to happen is summarized as follows:
    a,b,c,d are our CPU-hogging tasks
    _ signifies idling

    LITTLE_0 | a a a a _ _
    LITTLE_1 | b b b b _ _
    ---------|-------------
      big_0  | c c c c a a
      big_1  | d d d d b b
                      ^
                      ^
        Tasks end on the big CPUs, idle balance happens
        and the misfit tasks are pulled straight away
This however won't happen, because currently the overload flag is only set when there is any CPU that has more than one runnable task - which may very well not be the case here if our CPU-hogging workload is all there is to run. As such, this commit sets the overload flag in update_sg_lb_stats when a group is flagged as having a misfit task.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-10-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `757ffdd705`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I608765f3dd4238202bce5e6996c2711b28cbf4ea	2018-10-26 11:44:48 +01:00
Valentin Schneider	6497830fe9	UPSTREAM: sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE() This variable can be read and set locklessly within update_sd_lb_stats(). As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong can happen because of the compiler. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-9-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `e90c8fe15a`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: Id46db398099d15a051566806f7ca0092d3973788	2018-10-26 11:44:48 +01:00
Valentin Schneider	f68d8febf6	UPSTREAM: sched/core: Change root_domain->overload type to int sizeof(_Bool) is implementation defined, so let's just go with 'int' as is done for other structures e.g. sched_domain_shared->has_idle_cores. The local 'overload' variable used in update_sd_lb_stats can remain bool, as it won't impact any struct layout and can be assigned to the root_domain field. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-8-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `575638d104`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I05f2778c80b0c47a9f336b9d2baf1c9daae1f8f6	2018-10-26 11:44:47 +01:00
Valentin Schneider	e45affc455	UPSTREAM: sched/fair: Change 'prefer_sibling' type to bool This variable is entirely local to update_sd_lb_stats, so we can safely change its type and slightly clean up its initialisation. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-7-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `dbbad71944`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: Ida430fc3f4c5acbc996da93f886b821d488281e2	2018-10-26 11:44:47 +01:00
Valentin Schneider	d66ab02a3c	UPSTREAM: sched/fair: Kick nohz balance if rq->misfit_task_load There already are a few conditions in nohz_kick_needed() to ensure a nohz kick is triggered, but they are not enough for some misfit task scenarios. Excluding asym packing, those are: - rq->nr_running >=2: Not relevant here because we are running a misfit task, it needs to be migrated regardless and potentially through active balance. - sds->nr_busy_cpus > 1: If there is only the misfit task being run on a group of low capacity CPUs, this will be evaluated to False. - rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here, misfit task needs to be migrated regardless of rt/IRQ pressure As such, this commit adds an rq->misfit_task_load condition to trigger a nohz kick. The idea to kick a nohz balance for misfit tasks originally came from Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for the Android Common Kernel - see: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-6-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `5fbdfae522`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I1ad94a5688987950355b4fe0190cd5965aec035e	2018-10-26 11:44:47 +01:00
Morten Rasmussen	fa4d08f31d	UPSTREAM: sched/fair: Consider misfit tasks when load-balancing On asymmetric CPU capacity systems load intensive tasks can end up on CPUs that don't suit their compute demand. In these scenarios 'misfit' tasks should be migrated to CPUs with higher compute capacity to ensure better throughput. group_misfit_task indicates this scenario, but tweaks to the load-balance code are needed to make the migrations happen. Misfit balancing only makes sense between a source group of lower per-CPU capacity and destination group of higher compute capacity. Otherwise, misfit balancing is ignored. group_misfit_task has lowest priority so any imbalance due to overload is dealt with first. The modifications are: 1. Only pick a group containing misfit tasks as the busiest group if the destination group has higher capacity and has spare capacity. 2. When the busiest group is a 'misfit' group, skip the usual average load and group capacity checks. 3. Set the imbalance for 'misfit' balancing sufficiently high for a task to be pulled ignoring average load. 4. Pick the CPU with the highest misfit load as the source CPU. 5. If the misfit task is alone on the source CPU, go for active balancing. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `cad68e552e`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: If41e108f3200ade85b4a130e21eaee62bb3860fd	2018-10-26 11:44:46 +01:00
Morten Rasmussen	92819a68c9	UPSTREAM: sched/fair: Add sched_group per-CPU max capacity The current sg->min_capacity tracks the lowest per-CPU compute capacity available in the sched_group when rt/irq pressure is taken into account. Minimum capacity isn't the ideal metric for tracking if a sched_group needs offloading to another sched_group for some scenarios, e.g. a sched_group with multiple CPUs if only one is under heavy pressure. Tracking maximum capacity isn't perfect either but a better choice for some situations as it indicates that the sched_group is definitely compute capacity constrained either due to rt/irq pressure on all CPUs or asymmetric CPU capacities (e.g. big.LITTLE). Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `e3d6d0cb66`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I56324d1543d8d0d052a66cd94c7f8e5083bd9c28	2018-10-26 11:44:46 +01:00
Morten Rasmussen	bca5744aee	UPSTREAM: sched/fair: Add 'group_misfit_task' load-balance type To maximize throughput in systems with asymmetric CPU capacities (e.g. ARM big.LITTLE) load-balancing has to consider task and CPU utilization as well as per-CPU compute capacity when load-balancing in addition to the current average load based load-balancing policy. Tasks with high utilization that are scheduled on a lower capacity CPU need to be identified and migrated to a higher capacity CPU if possible to maximize throughput. To implement this additional policy an additional group_type (load-balance scenario) is added: 'group_misfit_task'. This represents scenarios where a sched_group has one or more tasks that are not suitable for its per-CPU capacity. 'group_misfit_task' is only considered if the system is not overloaded or imbalanced ('group_imbalanced' or 'group_overloaded'). Identifying misfit tasks requires the rq lock to be held. To avoid taking remote rq locks to examine source sched_groups for misfit tasks, each CPU is responsible for tracking misfit tasks itself and updating the rq->misfit_task flag. This means checking task utilization when tasks are scheduled and on sched_tick. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `3b1baa6496`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: If859d4e275a25e20ca004ed980c6e563e68d74ca	2018-10-26 11:44:45 +01:00
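The per-CPU check described in the commit above (comparing task utilization against the CPU's capacity when a task is scheduled and on the tick) can be pictured with a small Python model. This is a sketch under stated assumptions, not the kernel's implementation: the function name and the `margin_pct` headroom parameter are hypothetical, and the default of 125 is an illustrative value that the commit message does not specify.

```python
def task_is_misfit(task_util: int, cpu_capacity: int, margin_pct: int = 125) -> bool:
    """Return True if the task does not fit the CPU it runs on.

    margin_pct is an assumed headroom factor in percent (hypothetical):
    the task fits only if its utilization, scaled by the margin, stays
    within the CPU's capacity. Integer arithmetic avoids rounding issues.
    """
    return task_util * margin_pct > cpu_capacity * 100
```

In this model each CPU would evaluate `task_is_misfit()` for its own running task and record the result locally, so that a remote CPU can read the flag without taking the runqueue lock, which is the motivation the commit message gives.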
Morten Rasmussen	b1c506a014	UPSTREAM: sched/topology: Add static_key for asymmetric CPU capacity optimizations The existing asymmetric CPU capacity code should cause minimal overhead for others. Putting it behind a static_key, as has been done for SMT optimizations, would make it easier to extend and improve without causing harm to others moving forward. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `df054e8445`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I6f8de25a9bb593a0032ad0ad7977c269b180ec51	2018-10-26 11:44:45 +01:00
Morten Rasmussen	f2d1d0fea1	UPSTREAM: sched/topology: Add SD_ASYM_CPUCAPACITY flag detection The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the sched_domain in the hierarchy where all CPU capacities are visible from any CPU's point of view on asymmetric CPU capacity systems. The scheduler can then take capacity asymmetry into account when balancing at this level. It also serves as an indicator for how wide task placement heuristics have to search to consider all available CPU capacities as asymmetric systems might often appear symmetric at smallest level(s) of the sched_domain hierarchy. The flag has been around for a while but so far has only been set by out-of-tree code in Android kernels. One solution is to let each architecture provide the flag through a custom sched_domain topology array and associated mask and flag functions. However, SD_ASYM_CPUCAPACITY is special in the sense that it depends on the capacity and presence of all CPUs in the system, i.e. when hotplugging all CPUs out except those with one particular CPU capacity the flag should disappear even if the sched_domains don't collapse. Similarly, the flag is affected by cpusets where load-balancing is turned off. Detecting when the flags should be set therefore depends not only on topology information but also the cpuset configuration and hotplug state. The arch code doesn't have easy access to the cpuset configuration. Instead, this patch implements the flag detection in generic code where cpusets and hotplug state are already taken care of. All the arch is responsible for is to implement arch_scale_cpu_capacity() and force a full rebuild of the sched_domain hierarchy if capacities are updated, e.g. later in the boot process when cpufreq has initialized. 
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmussen@arm.com [ Fixed 'CPU' capitalization. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> (cherry picked from commit `05484e0984`) Signed-off-by: Quentin Perret <quentin.perret@arm.com> Change-Id: I1d5f695a95f8d023f1ecf14ecb71a558ceb67ed6	2018-10-26 11:44:43 +01:00
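The hotplug dependence described in the commit above can be shown with a deliberately simplified Python model. It only answers whether the CPUs that remain online expose more than one capacity value; it does not model the sched_domain hierarchy, the level at which the flag is placed, or cpusets. The function name, the `capacities` mapping, and the capacity numbers in the usage note are hypothetical.

```python
def has_asym_cpu_capacity(capacities: dict, online_cpus: set) -> bool:
    """Return True if the online CPUs have more than one distinct capacity.

    capacities maps a CPU id to its capacity (illustrative stand-in for
    what arch_scale_cpu_capacity() would report); online_cpus is the set
    of CPU ids still present after hotplug.
    """
    distinct = {capacities[cpu] for cpu in online_cpus}
    return len(distinct) > 1
```

For a hypothetical four-CPU system with `capacities = {0: 446, 1: 446, 2: 1024, 3: 1024}`, the check holds while all CPUs are online and stops holding once only CPUs 0 and 1 remain, mirroring the statement that the flag should disappear when all CPUs except those of one capacity are hotplugged out.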
Greg Kroah-Hartman	14dbc56aa2	Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Ingo writes: "scheduler fixes: Two fixes: a CFS-throttling bug fix, and an interactivity fix." * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix the min_vruntime update logic in dequeue_entity() sched/fair: Fix throttle_list starvation with low CFS quota	2018-10-20 15:03:45 +02:00

28480 Commits