Commit Graph

23809 Commits

Author SHA1 Message Date
Chris Redpath
db00ec15cb ANDROID: sched/fair: ensure utilization signals are synchronized before use
wake_cap performs task and cpu utilization synchronization which is
what allows us to subtract current task util from prev_cpu util and
have a sensible number to work with.

It looks as though if wake_wide returns 0, we could potentially not
execute wake_cap, which would result in unsynced signals we then use
for energy calculations.

This is not necessarily an issue we've seen in traces, but it looks
as though it should be changed.

Change-Id: Ic54a3cba2a10d946ea20113a04371dea04115e82
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[Remove _wake_cap variable to match commit cf6ed9a668]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:33 -07:00
Chris Redpath
7de1b8372f ANDROID: sched/fair: remove task util from own cpu when placing waking task
When we place a waking task with find_best_target, we calculate the
existing and new utilisation of each candidate cpu. However, we do
not remove any blocked load resulting from the waking task on the
previous cpu which might cause unnecessary migrations.

Switch to using cpu_util_wake which does this for us, which requires
moving cpu_util_wake a few functions earlier.

Also, we have multiple potential cpu utilization signals here, so
update the necessary bits to allow WALT to work properly (including
not subtracting task util for WALT).

When WALT is in use, cpu utilization is the utilization
in the previous completed window, whilst the task utilization
ignores fully idle windows. There seems to be no way to have a
decently accurate estimate of how much (if any) utilization from
this task remains on the prev cpu.

Instead, just return cpu_util when we're using WALT.

Change-Id: I448203ab98ffb5c020dfb6b218581eef1f5601f7
Reported-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:32 -07:00
Chris Redpath
8d40f58e90 ANDROID: trace:sched: Make util_avg in load_avg trace reflect PELT/WALT as used
With the ability to choose between WALT and PELT for utilisation tracking
we can have the situation where we're using WALT to make all the
decisions and reporting PELT figures in the sched_load_avg_(cpu|task)
trace points. This is not too much of an issue, but when analysing trace
it is nice to see numbers representing what the scheduler is using rather
than needing to add in additional sched_walt_* traces to figure it out.

Add reporting for both types, and make the util_avg member reflect what
will be seen from cpu or task_util functions in the scheduler.

Change-Id: I2abbd2c5fa70822096d0f3372b4c12b1c6af1590
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[renamed macros according to 172895e6b5]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:31 -07:00
Dietmar Eggemann
af88a165ec ANDROID: sched/fair: Add eas (& cas) specific rq, sd and task stats
The statistic counter are placed in the eas (& cas) wakeup path. Each
of them has one representation for the runqueue (rq), the sched_domain
(sd) and the task.
A task counter is always incremented. A rq counter is always
incremented for the rq the scheduler is currently running on. A sd
counter is only incremented if a relation to a sd exists.

The counters are exposed:

(1) In /proc/schedstat for rq's and sd's:

$ cat /proc/schedstat
...
cpu0 71422 0 2321254 ...
eas  44144 0 0 19446 0 24698 568435 51621 156932 133 222011 17459 120279 516814 83 0 156962 359235 176439 139981
  <- runqueue for cpu0
...
domain0 3 42430 42331 ...
eas 0 0 0 14200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 66355 0  <- MC sched domain for cpu0
...

The per-cpu eas vector has the following elements:

sis_attempts  sis_idle   sis_cache_affine sis_suff_cap    sis_idle_cpu    sis_count               ||
secb_attempts secb_sync  secb_idle_bt     secb_insuff_cap secb_no_nrg_sav secb_nrg_sav secb_count ||
fbt_attempts  fbt_no_cpu fbt_no_sd        fbt_pref_idle   fbt_count                               ||
cas_attempts  cas_count

The following relations exist between these counters (from cpu0 eas
vector above):

sis_attempts = sis_idle + sis_cache_affine + sis_suff_cap + sis_idle_cpu + sis_count

44144        = 0        + 0                + 19446        + 0            + 24698

secb_attempts = secb_sync + secb_idle_bt + secb_insuff_cap + secb_no_nrg_sav + secb_nrg_sav + secb_count

568435        = 51621     + 156932       + 133             + 222011          + 17459        + 120279

fbt_attempts = fbt_no_cpu + fbt_no_sd + fbt_pref_idle + fbt_count + (return -1)

516814       = 83         + 0         + 156962        + 359235    + (534)

cas_attempts = cas_count + (return -1 or smp_processor_id())

176439       = 139981    + (36458)

(2) In /proc/$PROCESS_PID/task/$TASK_PID/sched for a task.

example: main thread of system_server

$ cat /proc/1083/task/1083/sched

...
se.statistics.nr_wakeups_sis_attempts        :                  945
se.statistics.nr_wakeups_sis_idle            :                    0
se.statistics.nr_wakeups_sis_cache_affine    :                    0
se.statistics.nr_wakeups_sis_suff_cap        :                  219
se.statistics.nr_wakeups_sis_idle_cpu        :                    0
se.statistics.nr_wakeups_sis_count           :                  726
se.statistics.nr_wakeups_secb_attempts       :                10376
se.statistics.nr_wakeups_secb_sync           :                 1462
se.statistics.nr_wakeups_secb_idle_bt        :                 6984
se.statistics.nr_wakeups_secb_insuff_cap     :                    3
se.statistics.nr_wakeups_secb_no_nrg_sav     :                  927
se.statistics.nr_wakeups_secb_nrg_sav        :                  206
se.statistics.nr_wakeups_secb_count          :                  794
se.statistics.nr_wakeups_fbt_attempts        :                 8914
se.statistics.nr_wakeups_fbt_no_cpu          :                    0
se.statistics.nr_wakeups_fbt_no_sd           :                    0
se.statistics.nr_wakeups_fbt_pref_idle       :                 6987
se.statistics.nr_wakeups_fbt_count           :                 1554
se.statistics.nr_wakeups_cas_attempts        :                 3107
se.statistics.nr_wakeups_cas_count           :                 1195
...

The same relation between the counters as in the per-cpu case apply.

Change-Id: Ie7d01267c78a3f41f60a3ef52917d5a5d463f195
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[fixed schedstat macros calls to match modifications
made in commit ae92882e56]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:31 -07:00
Andres Oportus
8990dc7f3c ANDROID: sched/core: Fix PELT jump to max OPP upon util increase
Change-Id: Ic80b588ec466ef707f658dcea039fd0d6b384b63
Signed-off-by: Andres Oportus <andresoportus@google.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:30 -07:00
Dietmar Eggemann
06654998ed ANDROID: sched: EAS & 'single cpu per cluster'/cpu hotplug interoperability
For Energy-Aware Scheduling (EAS) to work properly, even in the
case that there is only one cpu per cluster or that cpus are hot-plugged
out, the Energy Model (EM) data on all energy-aware sched domains (sd)
has to be present for all online cpus.

Mainline sd hierarchy setup code will remove sd's which are not useful
for task scheduling e.g. in the following situations:

1. Only 1 cpu is/remains in one cluster of a multi cluster system.

   This remaining cpu only has DIE and no MC sd.

2. A complete cluster in a two cluster system is hot-plugged out.

   The cpus of the remaining cluster only have MC and no DIE sd.

To make sure that all online cpus keep all their energy-aware sd's,
the sd degenerate functionality has been changed to not free a sd if
its first sched group (sg) contains EM data in case:

1. There is only 1 cpu left in the sd.

2. There have to be at least 2 sg's if certain sd flags are set.

Instead of freeing such a sd it now clears only its SD_LOAD_BALANCE
flag. This will make sure that the EAS functionality will always see
all energy-aware sd's for all online cpus.

It will introduce a tiny performance degradation for operations on
affected cpus since the hot-path macro for_each_domain() has to deal
with sd's not contributing to task scheduling at all now.

In most cases the exisiting code makes sure that task scheduling is not
invoked on a sd with !SD_LOAD_BALANCE.

However, a small change is necessary in update_sd_lb_stats() to make
sure that sd->parent is only initialized to !NULL in case the parent sd
contains more than 1 sg.

The handling of newidle decay values before the SD_LOAD_BALANCE check in
rebalance_domains() stays unchanged.

Test (w/ CONFIG_SCHED_DEBUG):

JUNO r0 default system:

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd07
CPU part        : 0xd07
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f
832f
102f

Test 1: Hotplug-out one A57 (CPU part 0xd07) cpu:

$ echo 0 > /sys/devices/system/cpu/cpu1/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd07
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags for remaining A57 (cpu2) cpu:

$ cat /proc/sys/kernel/sched_domain/cpu2/domain*/name
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu2/domain*/flags`
832e <-- MC SD with !SD_LOAD_BALANCE
102f

Test 2: Hotplug-out the entire A57 cluster:

$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 0 > /sys/devices/system/cpu/cpu2/online

$ cat /proc/cpuinfo | grep "^CPU part"
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03
CPU part        : 0xd03

SD names and flags for the remaining A53 (CPU part 0xd03) cluster:

$ cat /proc/sys/kernel/sched_domain/cpu*/domain*/name
MC
DIE
MC
DIE
MC
DIE
MC
DIE

$ printf "%x\n" `cat /proc/sys/kernel/sched_domain/cpu*/domain*/flags`
832f
102e <-- DIE SD with !SD_LOAD_BALANCE
832f
102e
832f
102e
832f
102e

Change-Id: If24aa2b2628f334abbf0207d39e2a86168d9d673
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:29 -07:00
Vincent Guittot
6960f771e9 UPSTREAM: sched/core: Fix group_entity's share update
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.

Let take the example of a task a that is dequeued of its task group A:
   root
  (cfs_rq)
    \
    (se)
     A
    (cfs_rq)
      \
      (se)
       a

Task "a" was the only task in task group A which becomes idle when a is
dequeued.

We have the sequence:

- dequeue_entity a->se
    - update_load_avg(a->se)
    - dequeue_entity_load_avg(A->cfs_rq, a->se)
    - update_cfs_shares(A->cfs_rq)
	A->cfs_rq->load.weight == 0
        A->se->load.weight is updated with the new share (0 in this case)
- dequeue_entity A->se
    - update_load_avg(A->se) but its weight is now null so the last time
      slot (up to a tick) will be accounted with a weight of 0 instead of
      its real weight during the time slot. The last time slot will be
      accounted as an idle one whereas it was a running one.

If the running time of task a is short enough that no tick happens when it
runs, all running time of group entity A->se will be accounted as idle
time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as a parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with current time.

Change-Id: Id3f651220251d3ae3802c84fa59a68cc241db2c3
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/1482335426-7664-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 89ee048f3c)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:29 -07:00
Vincent Guittot
3a34bf5b2d UPSTREAM: sched/fair: Propagate asynchrous detach
A task can be asynchronously detached from cfs_rq when migrating
between CPUs. The load of the migrated task is then removed from
source cfs_rq during its next update. We use this event to set
propagation flag.

During the load balance, we take advantage of the update of blocked
load to propagate any pending changes.

The propagation relies on patch:

  "sched: Fix hierarchical order in rq->leaf_cfs_rq_list"

... which orders children and parents, to ensure that it's done in one pass.

Change-Id: I5d1a9f273111c0bb7503ca3c9ad2406d12867cb5
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4e5160766f)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:28 -07:00
Vincent Guittot
96956e2684 UPSTREAM: sched/fair: Propagate load during synchronous attach/detach
When a task moves from/to a cfs_rq, we set a flag which is then used to
propagate the change at parent level (sched_entity and cfs_rq) during
next update. If the cfs_rq is throttled, the flag will stay pending until
the cfs_rq is unthrottled.

For propagating the utilization, we copy the utilization of group cfs_rq to
the sched_entity.

For propagating the load, we have to take into account the load of the
whole task group in order to evaluate the load of the sched_entity.
Similarly to what was done before the rewrite of PELT, we add a correction
factor in case the task group's load is greater than its share so it will
contribute the same load of a task of equal weight.

Change-Id: I0aeaed29bb880c91d10df82c38ac9fc681de4f76
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 09a43ace1f)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:27 -07:00
Vincent Guittot
793cfff02e UPSTREAM: sched/fair: Factorize attach/detach entity
Factorize post_init_entity_util_avg() and part of attach_task_cfs_rq()
in one function attach_entity_cfs_rq().

Create symmetric detach_entity_cfs_rq() function.

Change-Id: Id258d6055039ecfe7a3a8752dfff0257fc9f6508
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: kernellwp@gmail.com
Cc: pjt@google.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1478598827-32372-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit df217913e7)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:26 -07:00
Dietmar Eggemann
56ffdd61dc ANDROID: sched/fair: Simplify idle_idx handling in select_idle_sibling()
Rename best_idle to best_idle_cpu so the same name is used like in
find_best_target().

Fix if (best_idle > 0) since best_idle_cpu = 0 is a valid target.

Use 'unsigned long' data type for best_idle_capacity.

Since we're looking for the shallowest best_idle_cstate initialize
best_idle_cstate = INT_MAX. For cpus which are not idle (idle_idx = -1)
the condition 'if (idle_idx < best_idle_cstate && ...)' is never
executed.

Change-Id: Ic5b63d58478696b3d1ec6253cf739a69a574cf99
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 8bff5e9c0968108d465e1f2a4624fc5ec2f00849)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:25 -07:00
Dietmar Eggemann
2dfb172df5 ANDROID: sched/fair: refactor find_best_target() for simplicity
Simplify backup_capacity handling and use 'unsigned long'
data type for cpu capacity, simplify target_util handling,
simplify idle_idx handling & refactor min_util, new_util.

Also return first idle cpu for prefer_idle task immediately.

Change-Id: Ic89e140f7b369f3965703fdc8463013d16e9b94a
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:25 -07:00
Dietmar Eggemann
3bfde3b4f8 ANDROID: sched/fair: Change cpu iteration order in find_best_target()
The schedtune task parameter 'boosted' is mapped into the cpu iteration
order. Currently for 'boosted' equal true the iteration starts at the
last cpu (NR_CPUS-1) whereas for 'boosted' equal false it starts at the
first cpu (0).

This only has the desired effect if the cpu topology oerdering matches
the underlying assumption. This e.g. is the case for the
Qc snapdragon 821 with its [L0 L1 b0 b1] cpu topology layout
(L=lower max freq, b=higher max freq). This results in cpus with higher
maximum capacity being given the highest logical cpu ids. However not
all big.LITTLE systems enumerate their cpus in the same way. For example,
the ARM Versatile Express Juno board has 6 cpus for which the default
configuration has topology [L0 b0 b1 L1 L2 L3].

To make this approach independent from the cpu topology layout it now
iterates over the cpus in the order of the sched_groups of the EAS
sched_domain (sd_ea). The order of cpu iteration is different for the
different cpu types in case the cpu is used to dereference sd_ea.

Considering the Qc snapdragon 821 again, for cpu L0 and L1 the order is
'b0->b1->L0->L1' whereas for b0 and b1 the order is 'b0->b1->L0->L1'.

This approach does not allow the exact same iteration order as with the
currently used flat iteration over [0 .. NR_CPUS-1] but the cpus
are ordered by the original cpu capacity.

The cpu iteration is now done in the sd_ea sched_group order required by
the 'boosted' value ['L0->L1->b0->b1'/'b0->b1->L0->L1'] rather than
forward/backward over the flat cpu space ['L0->L1->b0->b1'/
'b1->b0->L1->L0'].

Change-Id: I8fbe2073dedd2ecb1c750620c6000c11a5ff4358
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit a0c6a4272c3968c0ff50d3fed65f5865b72d777b)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:24 -07:00
Dietmar Eggemann
14774e7609 ANDROID: sched/core: Add first cpu w/ max/min orig capacity to root domain
This will allow to start iterating from a cpu with max or min original
capacity in the wakeup path regardless on which cpu the scheduler is
currently running (smp_processor_id()) or the previous cpu of the task
(task_cpu(p)). This iteration has to happen on a sched_domain spanning
all cpus in the order of the sched_groups of this sched_domain seen by
the starting cpu.

In case of an SMP system the first cpu with max orig capacity and the
the one with min orig capacity is the same. This can temporally happen
on a big.LITTLE system with hotplug as well.

E.g. the different order of cpu iteration can be used to map schedtune
task parameter 'boosted' into the cpu iteration order in
find_best_target().

Use of READ_ONCE()/WRITE_ONCE() to avoid load/store tearing.

Change-Id: I812fbd9c7e5f506617e456c0eec3edcd2c016e92
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit fd6e9543c1fd8971a5e2e68e39b2f6e591d46114)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:23 -07:00
Dietmar Eggemann
52a8450d07 ANDROID: sched/core: Remove remnants of commit fd5c98da1a42
Commit fd5c98da1a42 "WIP: sched: Store system-wide maximum cpu capacity
in root domain" was repalced by commit 8148bdfff4f5 "WIP: sched: Update
max cpu capacity in case of max frequency constraints" which didn't
remove all the now unused bits.

Change-Id: I067f6366431f43337cffa7a2a8e0de32dd33d2f9
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 6d284a607cec51bcafca313bc396bc3103b1e876)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:23 -07:00
Dietmar Eggemann
ea5a7f2df0 ANDROID: sched: Remove sysctl_sched_is_big_little
With the new wakeup approach this sysctl is not necessary any more.

Change-Id: I52114b3c918791f6a4f9f30f50002919ccbc1a9c
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 885c0d503bcdf0ef4e9b46822496f16b20aa3bbd)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:22 -07:00
Dietmar Eggemann
8c2e3d8850 ANDROID: sched/fair: Code !is_big_little path into select_energy_cpu_brute()
This patch replaces the existing EAS upstream implementation of
select_energy_cpu_brute() with the one of find_best_target() used
in Android previously.

It also removes the cpumask 'and' from select_energy_cpu_brute,
see the existing use of 'cpu = smp_processor_id()' in
select_task_rq_fair().

Change-Id: If678c002efaa87d1ba3ec9989a4e9f8df98b83ec
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ added guarding for non-schedtune builds ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:21 -07:00
Dietmar Eggemann
52b09b1fde ANDROID: EAS: sched/fair: Re-integrate 'honor sync wakeups' into wakeup path
This patch re-integrates the part which was initially provided by
3b9d7554aeec ("EAS: sched/fair: tunable to honor sync wakeups") into
energy_aware_wake_cpu() into select_energy_cpu_brute().

Change-Id: I748fde3ecdeb44651179bce0a5bb8dd82d1903f6
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit b75b7286cb068d5761621ea134c23dd131db953f)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:21 -07:00
Dietmar Eggemann
8e862e1e31 ANDROID: Fixup!: sched/fair.c: Set SchedTune specific struct energy_env.task
This has to be done in the caller function of energy_diff() version of
SchedTune to avoid Null pointer dereference in energy_diff().

Change-Id: I3f0f68dbd11efb15bbb3b1832f8294419ed85241
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 14531d4e245d063f713ee5ed835df958e6c7838f)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:20 -07:00
Morten Rasmussen
9e31218f23 ANDROID: sched/fair: Energy-aware wake-up task placement
When the systems is not overutilized, place waking tasks on the most
energy efficient cpu. Previous attempts reduced the search space by
matching task utilization to cpu capacity before consulting the energy
model as this is an expensive operation. The search heuristics didn't
work very well and lacking any better alternatives this patch takes the
brute-force route and tries all potential targets.

This approach doesn't scale, but it might be sufficient for many
embedded applications while work is continuing on a heuristic that can
minimize the necessary computations. The heuristic must be derrived from
the platform energy model rather than make additional assumptions, such
lower capacity implies better energy efficiency. PeterZ mentioned in the
past that we might be able to derrive some simpler deciding functions
using mathematical (modal?) analysis.

Change-Id: I772bacb4c8fd599f8006fa422f842e66377a9c6c
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[rebase: on top of msm-google/android-msm-marlin-3.18]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit a894422dbdb7b77ea2acfe7ff909ccb5ded23514)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:19 -07:00
Morten Rasmussen
5383892305 ANDROID: sched/fair: Add energy_diff dead-zone margin
It is not worth the overhead to migrate tasks for tiny insignificant
energy savings. To prevent this, an energy margin is introduced in
energy_diff() which effectively adds a dead-zone that rounds tiny energy
differences to zero. Since no scale is enforced for energy model data
the margin can't be absolute. Instead it is defined as +/-1.56% energy
saving compared to the current total estimated energy consumption.

Change-Id: I6be069c752c701fb825430896b3b768a7ab2fee4
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[rebase: on top of msm-google/android-msm-marlin-3.18,
         massage original patch which changes code in energy_diff()
	 into __energy_diff() introduced by SchedTune]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 780cb5a5fa47adf13d4fc2b77e8e94448cd56098)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:18 -07:00
Morten Rasmussen
5dbcddecc5 UPSTREAM: sched/fair: Fix incorrect comment for capacity_margin
The comment for capacity_margin introduced in:

  3273163c67 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

... got its usage the wrong way round - fix it.

Change-Id: I0668ea25e3d95a4ee625a101991610dddcdc2230
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 893c5d2279)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:17 -07:00
Morten Rasmussen
942295e902 UPSTREAM: sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.

This patch rejects higher CPU capacity groups with one or less task per
CPU as potential busiest group which could otherwise lead to a series of
failing load-balancing attempts leading to a force-migration.

Change-Id: I6c668db8efb23c98cba6a0358a5eae550b247c27
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9e0994c0a1)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:16 -07:00
Morten Rasmussen
3d8cb903db UPSTREAM: sched/fair: Add per-CPU min capacity to sched_group_capacity
struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.

Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).

But even the average may not be sufficient if the group covers CPUs of
different capacities.

Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).

Change-Id: I833725e2b70794384c2b8efac5dc8107f1dbb622
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit bf475ce0a3)
[Fixed cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:30:16 -07:00
Morten Rasmussen
544443501e UPSTREAM: sched/fair: Consider spare capacity in find_idlest_group()
In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the most optimum choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.

In addition to existing load based search an alternative spare capacity
based candidate sched_group is found and selected instead if sufficient
spare capacity exists. If not, existing behaviour is preserved.

Change-Id: I165d49fa8f20b6c8bfab5324f3549aa449e81a17
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6a0b19c0f3)
[Fixed cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 15:28:30 -07:00
Morten Rasmussen
3557724318 UPSTREAM: sched/fair: Compute task/cpu utilization at wake-up correctly
At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().

Change-Id: I61a79f7f68742026d1bdfd828df13a6165fdae5d
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 104cb16d9e)
[Fixed cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:58:58 -07:00
Quentin Perret
7fb3e0c96a ANDROID: Partial Revert: "ANDROID: sched: Add cpu capacity awareness to wakeup balancing"
Revert most of changes of commit b9ac009489.
Keep infrastructure bits used in following EAS patches. Fix task_util
return type.

Change-Id: Ic545ac745d00f19f7b6567d4b4b8d4bde32a95ca
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:58:51 -07:00
Dietmar Eggemann
0df28983b8 ANDROID: sched/fair: Decommission energy_aware_wake_cpu()
The EAS functionality in the wakeup path will be brought back by the
following patch ("sched/fair: Energy-aware wake-up task placement")
providing the function select_energy_cpu_brute().

Change-Id: I927fb9e8261cfacfe404695f853941c7959aa146
[ Trivial merge conflicts resolved. ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 80aee424fb7765a777267e144037642625a71304)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[removed remaining energy_aware() call from select_task_rq_fair()]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:58:38 -07:00
Dietmar Eggemann
e5eb12acc7 ANDROID: Revert "WIP: sched: Consider spare cpu capacity at task wake-up"
This reverts commit 75a9695b619741019363f889c99c97c7bb823797.

Change-Id: I846b21f2bdeb0b0ca30ad65683564ed07a429428
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ minor merge changes ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:58:23 -07:00
Viresh Kumar
8c29c1a327 FROM-LIST: cpufreq: schedutil: Redefine the rate_limit_us tunable
The rate_limit_us tunable is intended to reduce the possible overhead
from running the schedutil governor.  However, that overhead can be
divided into two separate parts: the governor computations and the
invocation of the scaling driver to set the CPU frequency.  The latter
is where the real overhead comes from.  The former is much less
expensive in terms of execution time and running it every time the
governor callback is invoked by the scheduler, after rate_limit_us
interval has passed since the last frequency update, would not be a
problem.

For this reason, redefine the rate_limit_us tunable so that it means the
minimum time that has to pass between two consecutive invocations of the
scaling driver by the schedutil governor (to set the CPU frequency).

Change-Id: Iced64116b826c25441ef537c27a3dabfcf81919e
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[pulled from linux-pm linux-next https://patchwork.kernel.org/patch/9583949/ ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:58:18 -07:00
Steve Muckle
4152c22c4f ANDROID: cpufreq: schedutil: add up/down frequency transition rate limits
The rate-limit tunable in the schedutil governor applies to transitions
to both lower and higher frequencies. On several platforms it is not the
ideal tunable though, as it is difficult to get best power/performance
figures using the same limit in both directions.

It is common on mobile platforms with demanding user interfaces to want
to increase frequency rapidly for example but decrease slowly.

One of the example can be a case where we have short busy periods
followed by similar or longer idle periods. If we keep the rate-limit
high enough, we will not go to higher frequencies soon enough. On the
other hand, if we keep it too low, we will have too many frequency
transitions, as we will always reduce the frequency after the busy
period.

It would be very useful if we can set low rate-limit while increasing
the frequency (so that we can respond to the short busy periods quickly)
and high rate-limit while decreasing frequency (so that we don't reduce
the frequency immediately after the short busy period and that may avoid
frequency transitions before the next busy period).

Implement separate up/down transition rate limits. Note that the
governor avoids frequency recalculations for a period equal to minimum
of up and down rate-limit. A global mutex is also defined to protect
updates to min_rate_limit_us via two separate sysfs files.

Note that this wouldn't change behavior of the schedutil governor for
the platforms which wish to keep same values for both up and down rate
limits.

This is tested with the rt-app [1] on ARM Exynos, dual A15 processor
platform.

Testcase: Run a SCHED_OTHER thread on CPU0 which will emulate work-load
for X ms of busy period out of the total period of Y ms, i.e. Y - X ms
of idle period. The values of X/Y taken were: 20/40, 20/50, 20/70, i.e
idle periods of 20, 30 and 50 ms respectively. These were tested against
values of up/down rate limits as: 10/10 ms and 10/40 ms.

For every test we noticed a performance increase of 5-10% with the
schedutil governor, which was very much expected.

[Viresh]: Simplified user interface and introduced min_rate_limit_us +
	      mutex, rewrote commit log and included test results.

[1] https://github.com/scheduler-tools/rt-app/

Change-Id: I18720a83855b196b8e21dcdc8deae79131635b84
Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
(applied from https://marc.info/?l=linux-kernel&m=147936011103832&w=2)
[trivial adaptations]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:58:12 -07:00
Juri Lelli
c6e9438301 ANDROID: sched/cpufreq: make schedutil use WALT signal
If WALT is available and enabled, make schedutil governor use its
utilization signal.

Change-Id: I92bc37989447a76616e9bcc4e9e8616774fb9925
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[we need to use boosted_cpu_util for schedutil, so make it
 not static]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:57:47 -07:00
Steve Muckle
8d408128c7 ANDROID: sched: cpufreq: use rt_avg as estimate of required RT CPU capacity
A policy of going to fmax on any RT activity will be detrimental
for power on many platforms. Often RT accounts for only a small amount
of CPU activity so sending the CPU frequency to fmax is overkill. Worse
still, some platforms may not be able to even complete the CPU frequency
change before the RT activity has already completed.

Cpufreq governors have not treated RT activity this way in the past so
it is not part of the expected semantics of the RT scheduling class. The
DL class offers guarantees about task completion and could be used for
this purpose.

Modify the schedutil algorithm to instead use rt_avg as an estimate of
RT utilization of the CPU.

Based on previous work by Vincent Guittot <vincent.guittot@linaro.org>.

Change-Id: I1ed605a3e2512a94d34217a8e57c3fd97cca60be
Signed-off-by: Steve Muckle <smuckle@linaro.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:57:00 -07:00
Viresh Kumar
29d892d7f5 UPSTREAM: cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
If slow path frequency changes are conducted in a SCHED_OTHER context
then they may be delayed for some amount of time, including
indefinitely, when real time or deadline activity is taking place.

Move the slow path to a real time kernel thread. In the future the
thread should be made SCHED_DEADLINE. The RT priority is arbitrarily set
to 50 for now.

Hackbench results on ARM Exynos, dual core A15 platform for 10
iterations:

$ hackbench -s 100 -l 100 -g 10 -f 20

Before			After
---------------------------------
1.808			1.603
1.847			1.251
2.229			1.590
1.952			1.600
1.947			1.257
1.925			1.627
2.694			1.620
1.258			1.621
1.919			1.632
1.250			1.240

Average:

1.8829			1.5041

Based on initial work by Steve Muckle.

Change-Id: I8bb111435b0871e961a6f26db1151c91624817b8
Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry-picked from commit 02a7b1ee3b)
[fixed cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:54:36 -07:00
Steve Muckle
4e685a28e7 ANDROID: sched/cpufreq: fix tunables for schedfreq governor
The schedfreq governor does not currently handle cpufreq drivers which
use a global set of tunables (!have_governor_per_policy).

For example on x86 and using the acpi cpufreq driver, doing this

  cat /sys/devices/system/cpu/cpufreq/sched/up_throttle_nsec

will result in a bad pointer access.

Update the tunable code using the upstream schedutil tunable code by
Rafael Wysocki as a guide.

Includes a partial backport of the reorganized cpufreq tunable
infrastructure.

Change-Id: I7e6f8de1dac297077ad43f37dd2f6ddbfe921c98
Signed-off-by: Steve Muckle <smuckle@linaro.org>
[fixed cherry-pick issue]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 12:53:07 -07:00
Brendan Jackman
b224647f87 DEBUG: sched/fair: Fix sched_load_avg_cpu events for task_groups
The current sched_load_avg_cpu event traces the load for any cfs_rq that is
updated. This is not representative of the CPU load - instead we should only
trace this event when the cfs_rq being updated is in the root_task_group.

Change-Id: I345c2f13f6b5718cb4a89beb247f7887ce97ed6b
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
(cherry picked from commit 1cb392e103)
(cherry picked from commit 835f5ebd2f7170c87dc1387976ed1b4e46215251)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 19:09:52 +01:00
Brendan Jackman
0f493a73ed DEBUG: sched/fair: Fix missing sched_load_avg_cpu events
update_cfs_rq_load_avg is called from update_blocked_averages without triggering
the sched_load_avg_cpu event. Move the event trigger to inside
update_cfs_rq_load_avg to avoid this missing event.

Change-Id: I6c4f66f687a644e4e7f798db122d28a8f5919b7b
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
(cherry picked from commit 7f18f0963d)
(cherry picked from commit 998d092b3385f1e62fa5cb5f491c7c633a4bbc3e)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 19:09:52 +01:00
Morten Rasmussen
b19cdb902d sched: Consider misfit tasks when load-balancing
With the new group_misfit_task load-balancing scenario additional policy
conditions are needed when load-balancing. Misfit task balancing only
makes sense between source group with lower capacity than the target
group. If capacities are the same, fallback to normal group_other
balancing. The aim is to balance tasks such that no task has its
throughput hindered by compute capacity if a cpu with more capacity is
available. Load-balancing is generally based on average load in the
sched_groups, but for misfitting tasks it is necessary to introduce
exceptions to migrate tasks against usual metrics and optimize
throughput.

This patch ensures the following load-balance for mixed capacity systems
(e.g. ARM big.LITTLE) for always-running tasks:

1. Place a task on each cpu starting in order from cpus with highest
capacity to lowest until all cpus are in use (i.e. one task on each
cpu).

2. Once all cpus are in use balance according to compute capacity such
that load per capacity is approximately the same regardless of the
compute capacity (i.e. big cpus get more tasks than little cpus).

Necessary changes are introduced in find_busiest_group(),
calculate_imbalance(), and find_busiest_queue(). This includes passing
the group_type on to find_busiest_queue() through struct lb_env, which
is currently only considers imbalance and not the imbalance situation
(group_type).

To avoid taking remote rq locks to examine source sched_groups for
misfit tasks, each cpu is responsible for tracking misfit tasks
themselves and update the rq->misfit_task flag. This means checking task
utilization when tasks are scheduled and on sched_tick.

Change-Id: I29b35783765b0e51ae3ede74a8160b07f885bfde
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
(cherry picked from commit 0d2b1cdfa1)
[Fixed cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-08-03 19:09:51 +01:00
Greg Kroah-Hartman
9ae2c670d8 Merge 4.9.40 into android-4.9
Changes in 4.9.40
	disable new gcc-7.1.1 warnings for now
	ir-core: fix gcc-7 warning on bool arithmetic
	dm mpath: cleanup -Wbool-operation warning in choose_pgpath()
	s5p-jpeg: don't return a random width/height
	thermal: max77620: fix device-node reference imbalance
	thermal: cpu_cooling: Avoid accessing potentially freed structures
	ath9k: fix tx99 use after free
	ath9k: fix tx99 bus error
	ath9k: fix an invalid pointer dereference in ath9k_rng_stop()
	NFC: fix broken device allocation
	NFC: nfcmrvl_uart: add missing tty-device sanity check
	NFC: nfcmrvl: do not use device-managed resources
	NFC: nfcmrvl: use nfc-device for firmware download
	NFC: nfcmrvl: fix firmware-management initialisation
	nfc: Ensure presence of required attributes in the activate_target handler
	nfc: Fix the sockaddr length sanitization in llcp_sock_connect
	NFC: Add sockaddr length checks before accessing sa_family in bind handlers
	perf intel-pt: Move decoder error setting into one condition
	perf intel-pt: Improve sample timestamp
	perf intel-pt: Fix missing stack clear
	perf intel-pt: Ensure IP is zero when state is INTEL_PT_STATE_NO_IP
	perf intel-pt: Fix last_ip usage
	perf intel-pt: Ensure never to set 'last_ip' when packet 'count' is zero
	perf intel-pt: Use FUP always when scanning for an IP
	perf intel-pt: Clear FUP flag on error
	Bluetooth: use constant time memory comparison for secret values
	wlcore: fix 64K page support
	btrfs: Don't clear SGID when inheriting ACLs
	igb: Explicitly select page 0 at initialization
	ASoC: compress: Derive substream from stream based on direction
	PM / Domains: Fix unsafe iteration over modified list of device links
	PM / Domains: Fix unsafe iteration over modified list of domain providers
	PM / Domains: Fix unsafe iteration over modified list of domains
	scsi: ses: do not add a device to an enclosure if enclosure_add_links() fails.
	scsi: Add STARGET_CREATED_REMOVE state to scsi_target_state
	iscsi-target: Add login_keys_workaround attribute for non RFC initiators
	xen/scsiback: Fix a TMR related use-after-free
	powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
	powerpc/64: Fix atomic64_inc_not_zero() to return an int
	powerpc: Fix emulation of mcrf in emulate_step()
	powerpc: Fix emulation of mfocrf in emulate_step()
	powerpc/asm: Mark cr0 as clobbered in mftb()
	powerpc/mm/radix: Properly clear process table entry
	af_key: Fix sadb_x_ipsecrequest parsing
	PCI: Work around poweroff & suspend-to-RAM issue on Macbook Pro 11
	PCI: rockchip: Use normal register bank for config accessors
	PCI/PM: Restore the status of PCI devices across hibernation
	ipvs: SNAT packet replies only for NATed connections
	xhci: fix 20000ms port resume timeout
	xhci: Fix NULL pointer dereference when cleaning up streams for removed host
	xhci: Bad Ethernet performance plugged in ASM1042A host
	mxl111sf: Fix driver to use heap allocate buffers for USB messages
	usb: storage: return on error to avoid a null pointer dereference
	USB: cdc-acm: add device-id for quirky printer
	usb: renesas_usbhs: fix usbhsc_resume() for !USBHSF_RUNTIME_PWCTRL
	usb: renesas_usbhs: gadget: disable all eps when the driver stops
	md: don't use flush_signals in userspace processes
	x86/xen: allow userspace access during hypercalls
	cx88: Fix regression in initial video standard setting
	libnvdimm, btt: fix btt_rw_page not returning errors
	libnvdimm: fix badblock range handling of ARS range
	ext2: Don't clear SGID when inheriting ACLs
	Raid5 should update rdev->sectors after reshape
	s390/syscalls: Fix out of bounds arguments access
	drm/amd/amdgpu: Return error if initiating read out of range on vram
	drm/radeon/ci: disable mclk switching for high refresh rates (v2)
	drm/radeon: Fix eDP for single-display iMac10,1 (v2)
	ipmi: use rcu lock around call to intf->handlers->sender()
	ipmi:ssif: Add missing unlock in error branch
	xfs: Don't clear SGID when inheriting ACLs
	f2fs: sanity check size of nat and sit cache
	f2fs: Don't clear SGID when inheriting ACLs
	drm/ttm: Fix use-after-free in ttm_bo_clean_mm
	ovl: drop CAP_SYS_RESOURCE from saved mounter's credentials
	vfio: Fix group release deadlock
	vfio: New external user group/file match
	nvme-rdma: remove race conditions from IB signalling
	ftrace: Fix uninitialized variable in match_records()
	MIPS: Fix mips_atomic_set() retry condition
	MIPS: Fix mips_atomic_set() with EVA
	MIPS: Negate error syscall return in trace
	ubifs: Don't leak kernel memory to the MTD
	ACPI / EC: Drop EC noirq hooks to fix a regression
	Revert "ACPI / EC: Enable event freeze mode..." to fix a regression
	x86/acpi: Prevent out of bound access caused by broken ACPI tables
	x86/ioapic: Pass the correct data to unmask_ioapic_irq()
	MIPS: Fix MIPS I ISA /proc/cpuinfo reporting
	MIPS: Save static registers before sysmips
	MIPS: Actually decode JALX in `__compute_return_epc_for_insn'
	MIPS: Fix unaligned PC interpretation in `compute_return_epc'
	MIPS: math-emu: Prevent wrong ISA mode instruction emulation
	MIPS: Send SIGILL for BPOSGE32 in `__compute_return_epc_for_insn'
	MIPS: Rename `sigill_r6' to `sigill_r2r6' in `__compute_return_epc_for_insn'
	MIPS: Send SIGILL for linked branches in `__compute_return_epc_for_insn'
	MIPS: Send SIGILL for R6 branches in `__compute_return_epc_for_insn'
	MIPS: Fix a typo: s/preset/present/ in r2-to-r6 emulation error message
	Input: i8042 - fix crash at boot time
	IB/iser: Fix connection teardown race condition
	IB/core: Namespace is mandatory input for address resolution
	sunrpc: use constant time memory comparison for mac
	NFS: only invalidate dentrys that are clearly invalid.
	udf: Fix deadlock between writeback and udf_setsize()
	target: Fix COMPARE_AND_WRITE caw_sem leak during se_cmd quiesce
	iser-target: Avoid isert_conn->cm_id dereference in isert_login_recv_done
	perf annotate: Fix broken arrow at row 0 connecting jmp instruction to its target
	Revert "perf/core: Drop kernel samples even though :u is specified"
	staging: rtl8188eu: add TL-WN722N v2 support
	staging: comedi: ni_mio_common: fix AO timer off-by-one regression
	staging: sm750fb: avoid conflicting vesafb
	staging: lustre: ko2iblnd: check copy_from_iter/copy_to_iter return code
	ceph: fix race in concurrent readdir
	RDMA/core: Initialize port_num in qp_attr
	drm/mst: Fix error handling during MST sideband message reception
	drm/mst: Avoid dereferencing a NULL mstb in drm_dp_mst_handle_up_req()
	drm/mst: Avoid processing partially received up/down message transactions
	mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
	hfsplus: Don't clear SGID when inheriting ACLs
	ovl: fix random return value on mount
	acpi/nfit: Fix memory corruption/Unregister mce decoder on failure
	of: device: Export of_device_{get_modalias, uvent_modalias} to modules
	spmi: Include OF based modalias in device uevent
	reiserfs: Don't clear SGID when inheriting ACLs
	PM / Domains: defer dev_pm_domain_set() until genpd->attach_dev succeeds if present
	tracing: Fix kmemleak in instance_rmdir
	alarmtimer: don't rate limit one-shot timers
	Linux 4.9.40

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-07-27 15:24:43 -07:00
Greg Hackmann
91af5f04cd alarmtimer: don't rate limit one-shot timers
Commit ff86bf0c65 ("alarmtimer: Rate limit periodic intervals") sets a
minimum bound on the alarm timer interval.  This minimum bound shouldn't
be applied if the interval is 0.  Otherwise, one-shot timers will be
converted into periodic ones.

Fixes: ff86bf0c65 ("alarmtimer: Rate limit periodic intervals")
Reported-by: Ben Fennema <fennema@google.com>
Signed-off-by: Greg Hackmann <ghackmann@google.com>
Cc: stable@vger.kernel.org
Cc: John Stultz <john.stultz@linaro.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:08:08 -07:00
Chunyu Hu
919e481152 tracing: Fix kmemleak in instance_rmdir
commit db9108e054 upstream.

Hit the kmemleak when executing instance_rmdir, it forgot releasing
mem of tracing_cpumask. With this fix, the warn does not appear any
more.

unreferenced object 0xffff93a8dfaa7c18 (size 8):
  comm "mkdir", pid 1436, jiffies 4294763622 (age 9134.308s)
  hex dump (first 8 bytes):
    ff ff ff ff ff ff ff ff                          ........
  backtrace:
    [<ffffffff88b6567a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff8861ea41>] __kmalloc_node+0xf1/0x280
    [<ffffffff88b505d3>] alloc_cpumask_var_node+0x23/0x30
    [<ffffffff88b5060e>] alloc_cpumask_var+0xe/0x10
    [<ffffffff88571ab0>] instance_mkdir+0x90/0x240
    [<ffffffff886e5100>] tracefs_syscall_mkdir+0x40/0x70
    [<ffffffff886565c9>] vfs_mkdir+0x109/0x1b0
    [<ffffffff8865b1d0>] SyS_mkdir+0xd0/0x100
    [<ffffffff88403857>] do_syscall_64+0x67/0x150
    [<ffffffff88b710e7>] return_from_SYSCALL_64+0x0/0x6a
    [<ffffffffffffffff>] 0xffffffffffffffff

Link: http://lkml.kernel.org/r/1500546969-12594-1-git-send-email-chuhu@redhat.com

Fixes: ccfe9e42e4 ("tracing: Make tracing_cpumask available for all instances")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:08:08 -07:00
Ingo Molnar
a76a032300 Revert "perf/core: Drop kernel samples even though :u is specified"
commit 6a8a75f323 upstream.

This reverts commit cc1582c231.

This commit introduced a regression that broke rr-project, which uses sampling
events to receive a signal on overflow (but does not care about the contents
of the sample). These signals are critical to the correct operation of rr.

There's been some back and forth about how to fix it - but to not keep
applications in limbo queue up a revert.

Reported-by: Kyle Huey <me@kylehuey.com>
Acked-by: Kyle Huey <me@kylehuey.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jin Yao <yao.jin@linux.intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20170628105600.GC5981@leverpostej
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:08:06 -07:00
Dan Carpenter
198bd494ce ftrace: Fix uninitialized variable in match_records()
commit 2e028c4fe1 upstream.

My static checker complains that if "func" is NULL then "clear_filter"
is uninitialized.  This seems like it could be true, although it's
possible something subtle is happening that I haven't seen.

    kernel/trace/ftrace.c:3844 match_records()
    error: uninitialized symbol 'clear_filter'.

Link: http://lkml.kernel.org/r/20170712073556.h6tkpjcdzjaozozs@mwanda

Fixes: f0a3b154bd ("ftrace: Clarify code for mod command")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-27 15:08:03 -07:00
Daniel Mentz
5b07c2d252 Revert "ANDROID: proc: smaps: Allow smaps access for CAP_SYS_RESOURCE"
This reverts commit ff8b80819c.

This fixes CVE-2017-0710.

SELinux allows more fine grained control: We grant processes that need
access to smaps CAP_SYS_PTRACE but prohibit them from using ptrace
attach().

Bug: 34951864
Bug: 36468447
Change-Id: I00a513188245a30bc63dcbdafbb9746bc6d9d6ff
Signed-off-by: Daniel Mentz <danielmentz@google.com>
2017-07-21 11:10:17 -07:00
Greg Kroah-Hartman
14accea70e Merge 4.9.39 into android-4.9
Changes in 4.9.39
	xen-netfront: Rework the fix for Rx stall during OOM and network stress
	net_sched: fix error recovery at qdisc creation
	net: sched: Fix one possible panic when no destroy callback
	net/phy: micrel: configure intterupts after autoneg workaround
	ipv6: avoid unregistering inet6_dev for loopback
	net: dp83640: Avoid NULL pointer dereference.
	tcp: reset sk_rx_dst in tcp_disconnect()
	net: prevent sign extension in dev_get_stats()
	bridge: mdb: fix leak on complete_info ptr on fail path
	rocker: move dereference before free
	bpf: prevent leaking pointer via xadd on unpriviledged
	net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
	net/mlx5: Cancel delayed recovery work when unloading the driver
	liquidio: fix bug in soft reset failure detection
	net/mlx5e: Fix TX carrier errors report in get stats ndo
	ipv6: dad: don't remove dynamic addresses if link is down
	vxlan: fix hlist corruption
	net: core: Fix slab-out-of-bounds in netdev_stats_to_stats64
	net: ipv6: Compare lwstate in detecting duplicate nexthops
	vrf: fix bug_on triggered by rx when destroying a vrf
	rds: tcp: use sock_create_lite() to create the accept socket
	brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx()
	brcmfmac: Fix a memory leak in error handling path in 'brcmf_cfg80211_attach'
	brcmfmac: Fix glom_skb leak in brcmf_sdiod_recv_chain
	sfc: don't read beyond unicast address list
	cfg80211: Define nla_policy for NL80211_ATTR_LOCAL_MESH_POWER_MODE
	cfg80211: Validate frequencies nested in NL80211_ATTR_SCAN_FREQUENCIES
	cfg80211: Check if PMKID attribute is of expected size
	cfg80211: Check if NAN service ID is of expected size
	irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
	parisc: Report SIGSEGV instead of SIGBUS when running out of stack
	parisc: use compat_sys_keyctl()
	parisc: DMA API: return error instead of BUG_ON for dma ops on non dma devs
	parisc/mm: Ensure IRQs are off in switch_mm()
	tools/lib/lockdep: Reduce MAX_LOCK_DEPTH to avoid overflowing lock_chain/: Depth
	thp, mm: fix crash due race in MADV_FREE handling
	kernel/extable.c: mark core_kernel_text notrace
	mm/list_lru.c: fix list_lru_count_node() to be race free
	fs/dcache.c: fix spin lockup issue on nlru->lock
	checkpatch: silence perl 5.26.0 unescaped left brace warnings
	binfmt_elf: use ELF_ET_DYN_BASE only for PIE
	arm: move ELF_ET_DYN_BASE to 4MB
	arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
	powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
	s390: reduce ELF_ET_DYN_BASE
	exec: Limit arg stack to at most 75% of _STK_LIM
	ARM64: dts: marvell: armada37xx: Fix timer interrupt specifiers
	vt: fix unchecked __put_user() in tioclinux ioctls
	rcu: Add memory barriers for NOCB leader wakeup
	nvmem: core: fix leaks on registration errors
	mnt: In umount propagation reparent in a separate pass
	mnt: In propgate_umount handle visiting mounts in any order
	mnt: Make propagate_umount less slow for overlapping mount propagation trees
	selftests/capabilities: Fix the test_execve test
	mm: fix overflow check in expand_upwards()
	crypto: talitos - Extend max key length for SHA384/512-HMAC and AEAD
	crypto: atmel - only treat EBUSY as transient if backlog
	crypto: sha1-ssse3 - Disable avx2
	crypto: caam - properly set IV after {en,de}crypt
	crypto: caam - fix signals handling
	Revert "sched/core: Optimize SCHED_SMT"
	sched/fair, cpumask: Export for_each_cpu_wrap()
	sched/topology: Fix building of overlapping sched-groups
	sched/topology: Optimize build_group_mask()
	sched/topology: Fix overlapping sched_group_mask
	PM / wakeirq: Convert to SRCU
	PM / QoS: return -EINVAL for bogus strings
	tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
	kvm: vmx: Do not disable intercepts for BNDCFGS
	kvm: x86: Guest BNDCFGS requires guest MPX support
	kvm: vmx: Check value written to IA32_BNDCFGS
	kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
	4.9.39

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-07-21 08:55:50 +02:00
Pavankumar Kondeti
04e002a5f6 tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
commit c59f29cb14 upstream.

The 's' flag is supposed to indicate that a softirq is running. This
can be detected by testing the preempt_count with SOFTIRQ_OFFSET.

The current code tests the preempt_count with SOFTIRQ_MASK, which
would be true even when softirqs are disabled but not serving a
softirq.

Link: http://lkml.kernel.org/r/1481300417-3564-1-git-send-email-pkondeti@codeaurora.org

Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:42:24 +02:00
Peter Zijlstra
758dc6a8da sched/topology: Fix overlapping sched_group_mask
commit 73bb059f9b upstream.

The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.

The current code gets it wrong, as can be readily demonstrated with a
topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.

The fixed topology looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

(note: this relies on OVERLAP domains to always have children, this is
 true because the regular topology domains are still here -- this is
 before degenerate trimming)

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:42:24 +02:00
Lauro Ramos Venancio
3e165b2322 sched/topology: Optimize build_group_mask()
commit f32d782e31 upstream.

The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:42:23 +02:00
Peter Zijlstra
7c3f08eadc sched/topology: Fix building of overlapping sched-groups
commit 0372dd2736 upstream.

When building the overlapping groups, we very obviously should start
with the previous domain of _this_ @cpu, not CPU-0.

This can be readily demonstrated with a topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) CPU1 ends up generating the following nonsensical groups:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 2 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)

Where the fact that domain 1 doesn't include a group with span 0-2 is
the obvious fail.

With patch this looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 0 2
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:42:23 +02:00
Peter Zijlstra
542ebc96c2 sched/fair, cpumask: Export for_each_cpu_wrap()
commit c6508a39640b9a27fc2bc10cb708152672c82045 upstream.

commit c743f0a5c5 upstream.

More users for for_each_cpu_wrap() have appeared. Promote the construct
to generic cpumask interface.

The implementation is slightly modified to reduce arguments.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Lauro Ramos Venancio <lvenanci@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:42:23 +02:00