Commit Graph

562408 Commits

Author SHA1 Message Date
Dietmar Eggemann
b03f1ba3d5 cpufreq: Max freq invariant scheduler load-tracking and cpu capacity support
Implements cpufreq_scale_max_freq_capacity() to provide the scheduler
with a maximum frequency scaling correction factor for more accurate
load-tracking and cpu capacity handling by being able to deal with
frequency capping.

This scaling factor describes the influence of running a cpu with a
current maximum frequency lower than the absolute possible maximum
frequency on load tracking and cpu capacity.

The factor is:

	current_max_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)

In fact, max_freq_scale should be a struct cpufreq_policy data member.
But this would require that the scheduler hot path (__update_load_avg())
would have to grab the cpufreq lock. This can be avoided by using per-cpu
data initialized to SCHED_CAPACITY_SCALE for max_freq_scale.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:54 +08:00
Robin Randhawa
8296e78e5e arm64, topology: Updates to use DT bindings for EAS costing data
With the bindings and the associated accessors to extract data from the
bindings in place, remove the static hard-coded data from topology.c and
use the accesors instead.

Signed-off-by: Robin Randhawa <robin.randhawa@arm.com>
2016-05-10 16:49:53 +08:00
Robin Randhawa
d2b3db032e sched: Support for extracting EAS energy costs from DT
This patch implements support for extracting energy cost data from DT.
The data should conform to the DT bindings for energy cost data needed
by EAS (energy aware scheduling).

Signed-off-by: Robin Randhawa <robin.randhawa@arm.com>
2016-05-10 16:49:53 +08:00
Robin Randhawa
1b35232614 Documentation: DT bindings for energy model cost data required by EAS
EAS (energy aware scheduling) provides the scheduler with an alternative
objective - energy efficiency - as opposed to it's current performance
oriented objectives. EAS relies on a simple platform energy cost model
to guide scheduling decisions. The model only considers the CPU
subsystem.

This patch adds documentation describing DT bindings that should be used to
supply the scheduler with an energy cost model.

Signed-off-by: Robin Randhawa <robin.randhawa@arm.com>
2016-05-10 16:49:53 +08:00
Morten Rasmussen
b71188bd2c sched: Disable energy-unfriendly nohz kicks
With energy-aware scheduling enabled nohz_kick_needed() generates many
nohz idle-balance kicks which lead to nothing when multiple tasks get
packed on a single cpu to save energy. This causes unnecessary wake-ups
and hence wastes energy. Make these conditions depend on !energy_aware()
for now until the energy-aware nohz story gets sorted out.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:52 +08:00
Dietmar Eggemann
52b7b8a37c sched: Consider a not over-utilized energy-aware system as balanced
In case the system operates below the tipping point indicator,
introduced in ("sched: Add over-utilization/tipping point
indicator"), bail out in find_busiest_group after the dst and src
group statistics have been checked.

There is simply no need to move usage around because all involved
cpus still have spare cycles available.

For an energy-aware system below its tipping point,  we rely on the
task placement of the wakeup path. This works well for short running
tasks.

The existence of long running tasks on one of the involved cpus lets
the system operate over its tipping point. To be able to move such
a task (whose load can't be used to average the load among the cpus)
from a src cpu with lower capacity than the dst_cpu, an additional
rule has to be implemented in need_active_balance.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:52 +08:00
Morten Rasmussen
e38982ba32 sched: Energy-aware wake-up task placement
Let available compute capacity and estimated energy impact select
wake-up target cpu when energy-aware scheduling is enabled and the
system in not over-utilized (above the tipping point).

energy_aware_wake_cpu() attempts to find group of cpus with sufficient
compute capacity to accommodate the task and find a cpu with enough spare
capacity to handle the task within that group. Preference is given to
cpus with enough spare capacity at the current OPP. Finally, the energy
impact of the new target and the previous task cpu is compared to select
the wake-up target cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:52 +08:00
Dietmar Eggemann
ec055b978e sched: Determine the current sched_group idle-state
To estimate the energy consumption of a sched_group in
sched_group_energy() it is necessary to know which idle-state the group
is in when it is idle. For now, it is assumed that this is the current
idle-state (though it might be wrong). Based on the individual cpu
idle-states group_idle_state() finds the group idle-state.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:52 +08:00
Morten Rasmussen
19a5ebe13d sched, cpuidle: Track cpuidle state index in the scheduler
The idle-state of each cpu is currently pointed to by rq->idle_state but
there isn't any information in the struct cpuidle_state that can used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose is necessary to store the idle
state index as well. Ideally, the idle-state data should be unified.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:51 +08:00
Morten Rasmussen
1b5ec5d8ab sched: Add over-utilization/tipping point indicator
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criteria for when we
make the switch.

The util_avg for each cpu converges towards 100% (1024) regardless of
how many task additional task we may put on it. If we define
over-utilized as:

sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own as we 1.5*n_cpus tasks in total and 55%+55% is less
over-utilized than 55%+60% for those cpus that have to be shared. The
system utilization is only 85% of the system capacity, but we are
breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at it's
highest frequency instead:

cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.

With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.

For system where some cpus might have reduced capacity on some cpus
(RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
soon a just a single cpu is fully utilized as it might one of those with
reduced capacity and in that case we want to migrate it.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:51 +08:00
Morten Rasmussen
2c6a8a48a7 sched: Estimate energy impact of scheduling decisions
Adds a generic energy-aware helper function, energy_diff(), that
calculates energy impact of adding, removing, and migrating utilization
in the system.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:51 +08:00
Morten Rasmussen
df2030c841 sched: Extend sched_group_energy to test load-balancing decisions
Extended sched_group_energy() to support energy prediction with usage
(tasks) added/removed from a specific cpu or migrated between a pair of
cpus. Useful for load-balancing decision making.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:51 +08:00
Morten Rasmussen
c1770a5213 sched: Calculate energy consumption of sched_group
For energy-aware load-balancing decisions it is necessary to know the
energy consumption estimates of groups of cpus. This patch introduces a
basic function, sched_group_energy(), which estimates the energy
consumption of the cpus in the group and any resources shared by the
members of the group.

NOTE: The function has five levels of identation and breaks the 80
character limit. Refactoring is necessary.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:51 +08:00
Morten Rasmussen
5ec8ccabfe sched: Highest energy aware balancing sched_domain level pointer
Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the highest level at which energy
model is provided. At this level and all levels below all sched_groups
have energy model data attached.

Partial energy model information is possible but restricted to providing
energy model data for lower level sched_domains (sd_ea and below) and
leaving load-balancing on levels above to non-energy-aware
load-balancing. For example, it is possible to apply energy-aware
scheduling within each socket on a multi-socket system and let normal
scheduling handle load-balancing between sockets.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:51 +08:00
Morten Rasmussen
b6c0399e0f sched: Relocated cpu_util() and change return type
Move cpu_util() to an earlier position in fair.c and change return
type to unsigned long as negative usage doesn't make much sense. All
other load and capacity related functions use unsigned long including
the caller of cpu_util().

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:51 +08:00
Morten Rasmussen
3e55d2f2fc sched: Compute cpu capacity available at current frequency
capacity_orig_of() returns the max available compute capacity of a cpu.
For scale-invariant utilization tracking and energy-aware scheduling
decisions it is useful to know the compute capacity available at the
current OPP of a cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:50 +08:00
Juri Lelli
7c791bf110 arm64: Cpu invariant scheduler load-tracking and capacity support
Provides the scheduler with a cpu scaling correction factor for more
accurate load-tracking and cpu capacity handling.

The Energy Model (EM) (in fact the capacity value of the last element
of the capacity states vector of the core (MC) level sched_group_energy
structure) is used as the source for this cpu scaling factor.

The cpu capacity value depends on the micro-architecture and the
maximum frequency of the cpu.

The maximum frequency part should not be confused with the frequency
invariant scheduler load-tracking support which deals with frequency
related scaling due to DFVS functionality.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:50 +08:00
Dietmar Eggemann
c7aeeb88c7 arm: Cpu invariant scheduler load-tracking and capacity support
Provides the scheduler with a cpu scaling correction factor for more
accurate load-tracking and cpu capacity handling.

The Energy Model (EM) (in fact the capacity value of the last element
of the capacity states vector of the core (MC) level sched_group_energy
structure) is used instead of the arm arch specific cpu_efficiency and
dtb property 'clock-frequency' values as the source for this cpu
scaling factor.

The cpu capacity value depends on the micro-architecture and the
maximum frequency of the cpu.

The maximum frequency part should not be confused with the frequency
invariant scheduler load-tracking support which deals with frequency
related scaling due to DFVS functionality.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:50 +08:00
Morten Rasmussen
f0f739d887 sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
cpufreq is currently keeping it a secret which cpus are sharing
clock source. The scheduler needs to know about clock domains as well
to become more energy aware. The SD_SHARE_CAP_STATES domain flag
indicates whether cpus belonging to the sched_domain share capacity
states (P-states).

There is no connection with cpufreq (yet). The flag must be set by
the arch specific topology code.

cc: Russell King <linux@arm.linux.org.uk>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:50 +08:00
Dietmar Eggemann
4ce990ec65 sched: Initialize energy data structures
The sched_group_energy (sge) pointer of the first sched_group (sg) in
the sched_domain (sd) is initialized to point to the appropriate (in
terms of sd level and cpu) sge data defined in the arch and so to the
correct part of the Energy Model (EM).

Energy-aware scheduling allows that a system has only EM data up to a
certain sd level (so called highest energy aware balancing sd level).
A check in init_sched_energy() enforces that all sd's below this sd
level contain EM data.

The 'int cpu' parameter of sched_domain_energy_f requires that
check_sched_energy_data() makes sure that all cpus spanned by a sg
are provisioned with the same EM data.

This patch has also been tested with feature FORCE_SD_OVERLAP enabled.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:50 +08:00
Dietmar Eggemann
0b3bda54d2 sched: Introduce energy data structures
The struct sched_group_energy represents the per sched_group related
data which is needed for energy aware scheduling. It contains:

  (1) number of elements of the idle state array
  (2) pointer to the idle state array which comprises 'power consumption'
      for each idle state
  (3) number of elements of the capacity state array
  (4) pointer to the capacity state array which comprises 'compute
      capacity and power consumption' tuples for each capacity state

The struct sched_group obtains a pointer to a struct sched_group_energy.

The function pointer sched_domain_energy_f is introduced into struct
sched_domain_topology_level which will allow the arch to pass a particular
struct sched_group_energy from the topology shim layer into the scheduler
core.

The function pointer sched_domain_energy_f has an 'int cpu' parameter
since the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels. I.e. it is not possible for example to use this feature
to provide per-cpu energy in sd level DIE on ARM's TC2 platform.

It was discussed that the folding of sd levels approach is preferable
over the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce less errors. But since
it is not working, the 'int cpu' parameter is the only way out. It's
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for the
sched_domain_energy_f at the same time though. With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a sched group are provisioned with the same
energy data.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:50 +08:00
Morten Rasmussen
e496f328c8 sched: Make energy awareness a sched feature
This patch introduces the ENERGY_AWARE sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set false when SCHED_DEBUG is not defined. Hence this doesn't
allow energy awareness to be enabled without SCHED_DEBUG. This
sched_feature knob will be replaced later with a more appropriate
control knob when things have matured a bit.

ENERGY_AWARE is based on per-entity load-tracking hence FAIR_GROUP_SCHED
must be enable. This dependency isn't checked at compile time yet.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:50 +08:00
Morten Rasmussen
d8c8088f41 sched: Documentation for scheduler energy cost model
This documentation patch provides an overview of the experimental
scheduler energy costing model, associated data structures, and a
reference recipe on how platforms can be characterized to derive energy
models.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:49 +08:00
Morten Rasmussen
681fa1413b sched: Prevent unnecessary active balance of single task in sched group
Scenarios with the busiest group having just one task and the local
being idle on topologies with sched groups with different numbers of
cpus manage to dodge all load-balance bailout conditions resulting the
nr_balance_failed counter to be incremented. This eventually causes a
pointless active migration of the task. This patch prevents this by not
incrementing the counter when the busiest group only has one task.
ASYM_PACKING migrations and migrations due to reduced capacity should
still take place as these are explicitly captured by
need_active_balance().

A better solution would be to not attempt the load-balance in the first
place, but that requires significant changes to the order of bailout
conditions and statistics gathering.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:49 +08:00
Dietmar Eggemann
cda2bd333b sched: Enable idle balance to pull single task towards cpu with higher capacity
We do not want to miss out on the ability to pull a single remaining
task from a potential source cpu towards an idle destination cpu. Add an
extra criteria to need_active_balance() to kick off active load balance
if the source cpu is over-utilized and has lower capacity than the
destination cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:49 +08:00
Morten Rasmussen
8a5c0339bb sched: Consider spare cpu capacity at task wake-up
find_idlest_group() selects the wake-up target group purely
based on group load which leads to suboptimal choices in low load
scenarios. An idle group with reduced capacity (due to RT tasks or
different cpu type) isn't necessarily a better target than a lightly
loaded group with higher capacity.

The patch adds spare capacity as an additional group selection
parameter. The target group is now selected based on the following
criteria:

1. Return the group with the cpu with most spare capacity and this
capacity is significant if such group exists. Significant spare capacity
is currently at least 20% to spare.

2. Return the group with the lowest load, unless it is the local group
in which case NULL is returned and the search is continued at the next
(lower) level.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:49 +08:00
Morten Rasmussen
e8bcb272bc sched: Add cpu capacity awareness to wakeup balancing
Wakeup balancing is completely unaware of cpu capacity, cpu utilization
and task utilization. The task is preferably placed on a cpu which is
idle in the instant the wakeup happens. New tasks
(SD_BALANCE_{FORK,EXEC} are placed on an idle cpu in the idlest group if
such can be found, otherwise it goes on the least loaded one. Existing
tasks (SD_BALANCE_WAKE) are placed on the previous cpu or an idle cpu
sharing the same last level cache unless the wakee_flips heuristic in
wake_wide() decides to fallback to considering cpus outside SD_LLC.
Hence existing tasks are not guaranteed to get a chance to migrate to a
different group at wakeup in case the current one has reduced cpu
capacity (due RT/IRQ pressure or different uarch e.g. ARM big.LITTLE).
They may eventually get pulled by other cpus doing
periodic/idle/nohz_idle balance, but it may take quite a while before it
happens.

This patch adds capacity awareness to find_idlest_{group,queue} (used by
SD_BALANCE_{FORK,EXEC} and SD_BALANCE_WAKE under certain circumstances)
such that groups/cpus that can accommodate the waking task based on task
utilization are preferred. In addition, wakeup of existing tasks
(SD_BALANCE_WAKE) is sent through find_idlest_{group,queue} also if the
task doesn't fit the capacity of the previous cpu to allow it to escape
(override wake_affine) when necessary instead of relying on
periodic/idle/nohz_idle balance to eventually sort it out.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:49 +08:00
Dietmar Eggemann
1eb2b8aa45 sched: Store system-wide maximum cpu capacity in root domain
To be able to compare the capacity of the target cpu with the highest
cpu capacity of the system in the wakeup path, store the system-wide
maximum cpu capacity in the root domain.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:49 +08:00
Morten Rasmussen
0f8edc8274 arm: Update arch_scale_cpu_capacity() to reflect change to define
arch_scale_cpu_capacity() is no longer a weak function but a #define
instead. Include the #define in topology.h.

cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
2016-05-10 16:49:49 +08:00
Dietmar Eggemann
8411396e9c arm64: Enable frequency invariant scheduler load-tracking support
Defines arch_scale_freq_capacity() to use cpufreq implementation.

Including <linux/cpufreq.h> in topology.h like for the arm arch doesn't
work because of CONFIG_COMPAT=y (Kernel support for 32-bit EL0).
That's why cpufreq_scale_freq_capacity() has to be declared extern in
topology.h.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:48 +08:00
Dietmar Eggemann
372b62344c arm: Enable frequency invariant scheduler load-tracking support
Defines arch_scale_freq_capacity() to use cpufreq implementation.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:48 +08:00
Dietmar Eggemann
259bd4cd8a cpufreq: Frequency invariant scheduler load-tracking support
Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking.

The factor is:

	current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)

In fact, freq_scale should be a struct cpufreq_policy data member. But
this would require that the scheduler hot path (__update_load_avg()) would
have to grab the cpufreq lock. This can be avoided by using per-cpu data
initialized to SCHED_CAPACITY_SCALE for freq_scale.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2016-05-10 16:49:48 +08:00
Yuyang Du
0ecbd590a7 sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task()
If a newly created task is selected to go to a different CPU in fork
balance when it wakes up the first time, its load averages should
not be removed from the source CPU since they are never added to
it before. The same is also applicable to a never used group entity.

Fix it in remove_entity_load_avg(): when entity's last_update_time
is 0, simply return. This should precisely identify the case in
question, because in other migrations, the last_update_time is set
to 0 after remove_entity_load_avg().

Reported-by: Steve Muckle <steve.muckle@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
[peterz: cfs_rq_last_update_time]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20151216233427.GJ28098@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-10 16:48:38 +08:00
Linus Torvalds
afd2ff9b7e Linux 4.4 2016-01-10 15:01:32 -08:00
Linus Torvalds
eac6f76ac7 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
 "A single fix for machines with pages > 4k (PPC mostly).

  There's a bug in our optimal transfer size code where we don't account
  for pages > 4k and can set the transfer size to be less than the page
  size causing nasty failures"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  sd: Reject optimal transfer length smaller than page size
2016-01-09 14:53:48 -08:00
Linus Torvalds
c0cb139345 Merge tag 'pci-v4.4-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixlet from Bjorn Helgaas:
 "This marks the TI DRA7xx host bridge driver as broken.  Apparently it
  has never worked without some additional out-of-tree code, so I'm
  going to mark it broken now and remove it completely next cycle unless
  it's fixed"

* tag 'pci-v4.4-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  PCI: dra7xx: Mark driver as broken
2016-01-09 14:44:44 -08:00
Michal Hocko
751e5f5c75 vmstat: allocate vmstat_wq before it is used
kernel test robot has reported the following crash:

  BUG: unable to handle kernel NULL pointer dereference at 00000100
  IP: [<c1074df6>] __queue_work+0x26/0x390
  *pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
  Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
  CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe #1
  Workqueue: events vmstat_shepherd
  task: cb684600 ti: cb7ba000 task.ti: cb7ba000
  EIP: 0060:[<c1074df6>] EFLAGS: 00010046 CPU: 0
  EIP is at __queue_work+0x26/0x390
  EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
  ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
   DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
  CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
  Stack:
  Call Trace:
    __queue_delayed_work+0xa1/0x160
    queue_delayed_work_on+0x36/0x60
    vmstat_shepherd+0xad/0xf0
    process_one_work+0x1aa/0x4c0
    worker_thread+0x41/0x440
    kthread+0xb0/0xd0
    ret_from_kernel_thread+0x21/0x40

The reason is that start_shepherd_timer schedules the shepherd work item
which uses vmstat_wq (vmstat_shepherd) before setup_vmstat allocates
that workqueue so if the further initialization takes more than HZ we
might end up scheduling on a NULL vmstat_wq.  This is really unlikely
but not impossible.

Fixes: 373ccbe592 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-08 23:47:54 -08:00
Linus Torvalds
44d8a7d5c1 Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
 "This is the final small set of ARM SoC bug fixes for linux-4.4, almost
  all regressions:

  OMAP:
   - data corruption on the Nokia N900 flash

  Allwinner:
   - Two defconfig change to get USB working again

  ARM Versatile:
   - Interrupt numbers gone bad after an older bug fix

  Nomadik:
   - Crashes from incorrect L2 cache settings

  VIA vt8500:
   - SD/MMC support on WM8650 never worked"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  dts: vt8500: Add SDHC node to DTS file for WM8650
  ARM: Fix broken USB support in multi_v7_defconfig for sunxi devices
  ARM: versatile: fix MMC/SD interrupt assignment
  ARM: nomadik: set latencies to 8 cycles
  ARM: OMAP2+: Fix onenand rate detection to avoid filesystem corruption
  ARM: Fix broken USB support in sunxi_defconfig
2016-01-08 16:11:05 -08:00
Linus Torvalds
516c50cde6 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fix from Paolo Bonzini:
 "A simple fix.  I'm sending it before the merge window, because it
  refines a patch found in your master branch but not yet in the
  kvm/next branch that is destined for 4.5"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: x86: only channel 0 of the i8254 is linked to the HPET
2016-01-08 15:58:14 -08:00
Linus Torvalds
496b0b57c0 Merge tag 'pm+acpi-4.4-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
 "Just one obvious fix that adds a missing function argument in ACPI
  code introduced recently (Kees Cook)"

* tag 'pm+acpi-4.4-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / property: avoid leaking format string into kobject name
2016-01-08 15:50:59 -08:00
Linus Torvalds
650e5455d8 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "A handful of x86 fixes:

   - a syscall ABI fix, fixing an Android breakage
   - a Xen PV guest fix relating to the RTC device, causing a
     non-working console
   - a Xen guest syscall stack frame fix
   - an MCE hotplug CPU crash fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/numachip: Fix NumaConnect2 MMCFG PCI access
  x86/entry: Restore traditional SYSENTER calling convention
  x86/entry: Fix some comments
  x86/paravirt: Prevent rtc_cmos platform device init on PV guests
  x86/xen: Avoid fast syscall path for Xen PV guests
  x86/mce: Ensure offline CPUs don't participate in rendezvous process
2016-01-08 15:21:48 -08:00
Linus Torvalds
de03017958 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc scheduler fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Reset task's lockless wake-queues on fork()
  sched/core: Fix unserialized r-m-w scribbling stuff
  sched/core: Check tgid in is_global_init()
  sched/fair: Fix multiplication overflow on 32-bit systems
2016-01-08 13:57:13 -08:00
Linus Torvalds
3ab6d1ebd5 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Two core subsystem fixes, plus a handful of tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix race in swevent hash
  perf: Fix race in perf_event_exec()
  perf list: Robustify event printing routine
  perf list: Add support for PERF_COUNT_SW_BPF_OUT
  perf hists browser: Fix segfault if use symbol filter in cmdline
  perf hists browser: Reset selection when refresh
  perf hists browser: Add NULL pointer check to prevent crash
  perf buildid-list: Fix return value of perf buildid-list -k
  perf buildid-list: Show running kernel build id fix
2016-01-08 13:52:59 -08:00
Linus Torvalds
ea83ae2fb3 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
 "Fixes a core IRQ subsystem deadlock"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Prevent chip buslock deadlock
2016-01-08 13:46:59 -08:00
Linus Torvalds
a6a7358e49 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block revert from Jens Axboe:
 "The previous pull request had a split fix for NVMe, however there are
  corner cases where that ends up blowing up.

  So let's revert it for 4.4.  The regression isn't introduced in this
  cycle, and it's "just" a performance regression, not a
  stability/integrity issue"

* 'for-linus' of git://git.kernel.dk/linux-block:
  Revert "block: Split bios on chunk boundaries"
2016-01-08 13:39:09 -08:00
Linus Torvalds
212c7f66ec Merge tag 'dmaengine-fix-4.4' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
 "Late fixes for 4.4 are three fixes for drivers which include a revert
  of mic-x100 fix which is causing regression, xgene fix for double IRQ
  and async_tx fix to use GFP_NOWAIT"

* tag 'dmaengine-fix-4.4' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: xgene-dma: Fix double IRQ issue by setting IRQ_DISABLE_UNLAZY flag
  async_tx: use GFP_NOWAIT rather than GFP_IO
  dmaengine: Revert "dmaengine: mic_x100: add missing spin_unlock"
2016-01-08 12:23:00 -08:00
Linus Torvalds
436950a65d Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
Pull dmi fix from Jean Delvare.

* 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  firmware: dmi_scan: Fix UUID endianness for SMBIOS >= 2.6
2016-01-08 12:18:45 -08:00
Linus Torvalds
4054f64c93 Merge tag 'sound-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
 "A slightly higher volume than a new year's wish, but not too
  worrisome: a large LOC is only for HD-audio device-specific quirks, so
  fairly safe to apply.  The rest ASoC fixes are all trivial and small;
  a simple replacement of mutex call with nested lock version, a few
  Arizona and Realtek codec fixes, and a regression fix for Skylake
  firmware handling"

* tag 'sound-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ASoC: Intel: Skylake: Fix the memory leak
  ASoC: Intel: Skylake: Revert previous broken fix memory leak fix
  ASoC: Use nested lock for snd_soc_dapm_mutex_lock
  ASoC: rt5645: add sys clk detection
  ALSA: hda - Add keycode map for alc input device
  ALSA: hda - Add mic mute hotkey quirk for Lenovo ThinkCentre AIO
  ASoC: arizona: Fix bclk for sample rates that are multiple of 4kHz
2016-01-08 11:52:18 -08:00
Arnd Bergmann
841bcd2e50 Merge tag 'omap-for-v4.4/onenand-corruption' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes
Pull "urgent onenand file system corruption fix for n900" from Tony Lindgren:

Last minute urgent pull request to prevent file system corruption
on Nokia N900.

Looks like we have a GPMC bus timing bug that has gone unnoticed
because of bootloader configured registers until few days ago. We
are not detecting the onenand clock rate properly unless we have
CONFIG_OMAP_GPMC_DEBUG set and this causes onenand corruption
that can be easily be reproduced.

There seems to be also an additional bug still lurking around for
onenand corruption. But that is still being investigated and
it does not seem to be GPMC timings related.

Meanwhile, it would be good to get this fix into v4.4 to prevent
wrong timings from corrupting onenand.

* tag 'omap-for-v4.4/onenand-corruption' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: OMAP2+: Fix onenand rate detection to avoid filesystem corruption
2016-01-08 17:46:45 +01:00
Jens Axboe
6126eb2483 Revert "block: Split bios on chunk boundaries"
This reverts commit d380561113.

If we end up splitting on the first segment, we don't adjust
the sector count. That results in hitting a BUG() with attempting
to split 0 sectors.

As this is just a performance issue and not a regression since
4.3 release, let's just rever this change. That gives us more
time to test a real fix for 4.5, which would be marked for
stable anyway.
2016-01-08 09:00:29 -07:00