Commit Graph

763 Commits

Author SHA1 Message Date
Odin Ugedal
60f0b82677 sched/fair: Fix CFS bandwidth hrtimer expiry type
[ Upstream commit 72d0ad7cb5 ]

The time remaining until expiry of the refresh_timer can be negative.
Casting the type to an unsigned 64-bit value will cause integer
underflow, making the runtime_refresh_within return false instead of
true. These situations are rare, but they do happen.

This does not cause user-facing issues or errors; other than
possibly unthrottling cfs_rq's using runtime from the previous period(s),
making the CFS bandwidth enforcement less strict in those (special)
situations.

Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 11:33:40 +09:00
Vincent Guittot
1ec0012772 sched/fair: handle case of task_h_load() returning 0
commit 01cfcde9c2 upstream.

task_h_load() can return 0 in some situations like running stress-ng
mmapfork, which forks thousands of threads, in a sched group on a 224 cores
system. The load balance doesn't handle this correctly because
env->imbalance never decreases and it will stop pulling tasks only after
reaching loop_max, which can be equal to the number of running tasks of
the cfs. Make sure that imbalance will be decreased by at least 1.

misfit task is the other feature that doesn't handle correctly such
situation although it's probably more difficult to face the problem
because of the smaller number of CPUs and running tasks on heterogenous
system.

We can't simply ensure that task_h_load() returns at least one because it
would imply to handle underflow in other places.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: <stable@vger.kernel.org> # v4.4+
Link: https://lkml.kernel.org/r/20200710152426.16981-1-vincent.guittot@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-16 08:31:08 +09:00
Jens Axboe
3a641a476e sched/fair: Don't NUMA balance for kthreads
[ Upstream commit 18f855e574 ]

Stefano reported a crash with using SQPOLL with io_uring:

  BUG: kernel NULL pointer dereference, address: 00000000000003b0
  CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
  RIP: 0010:task_numa_work+0x4f/0x2c0
  Call Trace:
   task_work_run+0x68/0xa0
   io_sq_thread+0x252/0x3d0
   kthread+0xf9/0x130
   ret_from_fork+0x35/0x40

which is task_numa_work() oopsing on current->mm being NULL.

The task work is queued by task_tick_numa(), which checks if current->mm is
NULL at the time of the call. But this state isn't necessarily persistent,
if the kthread is using use_mm() to temporarily adopt the mm of a task.

Change the task_tick_numa() check to exclude kernel threads in general,
as it doesn't make sense to attempt ot balance for kthreads anyway.

Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 17:32:04 +09:00
Xuewei Zhang
9ae6c47d3c sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
commit 4929a4e6fa upstream.

The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:

  normalized_cfs_quota() = [(quota_us << 20) / period_us]

If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.

See below example:

A userspace container manager (kubelet) does three operations:

 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
 2) Create a few children cgroups.
 3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:

  new_quota: 1148437ns,   1148us
 new_period: 11484375ns, 11484us

And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.

Scaling them by a factor of 2 will fix the problem.

Tested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Xuewei Zhang <xueweiz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: 2e8e192263 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 16:10:08 +09:00
Valentin Schneider
f18fe54d12 sched/fair: Don't increase sd->balance_interval on newidle balance
[ Upstream commit 3f130a37c4 ]

When load_balance() fails to move some load because of task affinity,
we end up increasing sd->balance_interval to delay the next periodic
balance in the hopes that next time we look, that annoying pinned
task(s) will be gone.

However, idle_balance() pays no attention to sd->balance_interval, yet
it will still lead to an increase in balance_interval in case of
pinned tasks.

If we're going through several newidle balances (e.g. we have a
periodic task), this can lead to a huge increase of the
balance_interval in a very small amount of time.

To prevent that, don't increase the balance interval when going
through a newidle balance.

This is a similar approach to what is done in commit 58b26c4c02
("sched: Increment cache_nice_tries only on periodic lb"), where we
disregard newidle balance and rely on periodic balance for more stable
results.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: patrick.bellasi@arm.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1537974727-30788-2-git-send-email-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 15:18:39 +09:00
Vincent Guittot
9fb9eaa490 sched/fair: Fix imbalance due to CPU affinity
[ Upstream commit f6cad8df6b ]

The load_balance() has a dedicated mecanism to detect when an imbalance
is due to CPU affinity and must be handled at parent level. In this case,
the imbalance field of the parent's sched_group is set.

The description of sg_imbalanced() gives a typical example of two groups
of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first
group and 3 CPUs of the second group. Something like:

	{ 0 1 2 3 } { 4 5 6 7 }
	        *     * * *

But the load_balance fails to fix this UC on my octo cores system
made of 2 clusters of quad cores.

Whereas the load_balance is able to detect that the imbalanced is due to
CPU affinity, it fails to fix it because the imbalance field is cleared
before letting parent level a chance to run. In fact, when the imbalance is
detected, the load_balance reruns without the CPU with pinned tasks. But
there is no other running tasks in the situation described above and
everything looks balanced this time so the imbalance field is immediately
cleared.

The imbalance field should not be cleared if there is no other task to move
when the imbalance is detected.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 14:32:59 +09:00
Liangyan
a199ce2af3 sched/fair: Don't assign runtime for throttled cfs_rq
commit 5e2d2cc258 upstream.

do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime to unthrottle cfs_rq, sometimes cfs_b->runtime
will allocate all quota to one cfs_rq incorrectly, then other cfs_rqs
attached to this cfs_b can't get runtime and will be throttled.

We find that one throttled cfs_rq has non-negative
cfs_rq->runtime_remaining and cause an unexpetced cast from s64 to u64
in snippet:

  distribute_cfs_runtime() {
    runtime = -cfs_rq->runtime_remaining + 1;
  }

The runtime here will change to a large number and consume all
cfs_b->runtime in this cfs_b period.

According to Ben Segall, the throttled cfs_rq can have
account_cfs_rq_runtime called on it because it is throttled before
idle_balance, and the idle_balance calls update_rq_clock to add time
that is accounted to the task.

This commit prevents cfs_rq to be assgined new runtime if it has been
throttled until that distribute_cfs_runtime is called.

Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: shanpeic@linux.alibaba.com
Cc: stable@vger.kernel.org
Cc: xlpang@linux.alibaba.com
Fixes: d3d9dc3302 ("sched: Throttle entities exceeding their allowed bandwidth")
Link: https://lkml.kernel.org/r/20190826121633.6538-1-liangyan.peng@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 14:29:00 +09:00
Jann Horn
9ebc8cd5f7 sched/fair: Don't free p->numa_faults with concurrent readers
commit 16d51a590a upstream.

When going through execve(), zero out the NUMA fault statistics instead of
freeing them.

During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.

Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 14:08:49 +09:00
Xie XiuQi
a67712d343 sched/numa: Fix a possible divide-by-zero
commit a860fa7b96 upstream.

sched_clock_cpu() may not be consistent between CPUs. If a task
migrates to another CPU, then se.exec_start is set to that CPU's
rq_clock_task() by update_stats_curr_start(). Specifically, the new
value might be before the old value due to clock skew.

So then if in numa_get_avg_runtime() the expression:

  'now - p->last_task_numa_placement'

ends up as -1, then the divider '*period + 1' in task_numa_placement()
is 0 and things go bang. Similar to update_curr(), check if time goes
backwards to avoid this.

[ peterz: Wrote new changelog. ]
[ mingo: Tweaked the code comment. ]

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cj.chengjian@huawei.com
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20190425080016.GX11158@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 12:35:05 +09:00
Phil Auld
8023469412 sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
[ Upstream commit 2e8e192263 ]

With extremely short cfs_period_us setting on a parent task group with a large
number of children the for loop in sched_cfs_period_timer() can run until the
watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
will ever return 0.  The large number of children can make
do_sched_cfs_period_timer() take longer than the period.

 NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
 RIP: 0010:tg_nop+0x0/0x10
  <IRQ>
  walk_tg_tree_from+0x29/0xb0
  unthrottle_cfs_rq+0xe0/0x1a0
  distribute_cfs_runtime+0xd3/0xf0
  sched_cfs_period_timer+0xcb/0x160
  ? sched_cfs_slack_timer+0xd0/0xd0
  __hrtimer_run_queues+0xfb/0x270
  hrtimer_interrupt+0x122/0x270
  smp_apic_timer_interrupt+0x6a/0x140
  apic_timer_interrupt+0xf/0x20
  </IRQ>

To prevent this we add protection to the loop that detects when the loop has run
too many times and scales the period and quota up, proportionally, so that the timer
can complete before then next period expires.  This preserves the relative runtime
quota while preventing the hard lockup.

A warning is issued reporting this state and the new values.

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-15 12:28:05 +09:00
Mel Gorman
fd57ab642c sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
commit 0e9f02450d upstream.

A NULL pointer dereference bug was reported on a distribution kernel but
the same issue should be present on mainline kernel. It occured on s390
but should not be arch-specific.  A partial oops looks like:

  Unable to handle kernel pointer dereference in virtual kernel address space
  ...
  Call Trace:
    ...
    try_to_wake_up+0xfc/0x450
    vhost_poll_wakeup+0x3a/0x50 [vhost]
    __wake_up_common+0xbc/0x178
    __wake_up_common_lock+0x9e/0x160
    __wake_up_sync_key+0x4e/0x60
    sock_def_readable+0x5e/0x98

The bug hits any time between 1 hour to 3 days. The dereference occurs
in update_cfs_rq_h_load when accumulating h_load. The problem is that
cfq_rq->h_load_next is not protected by any locking and can be updated
by parallel calls to task_h_load. Depending on the compiler, code may be
generated that re-reads cfq_rq->h_load_next after the check for NULL and
then oops when reading se->avg.load_avg. The dissassembly showed that it
was possible to reread h_load_next after the check for NULL.

While this does not appear to be an issue for later compilers, it's still
an accident if the correct code is generated. Full locking in this path
would have high overhead so this patch uses READ_ONCE to read h_load_next
only once and check for NULL before dereferencing. It was confirmed that
there were no further oops after 10 days of testing.

As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
potential problems with store tearing.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Fixes: 685207963b ("sched: Move h_load calculation to task_h_load()")
Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 12:22:49 +09:00
Song Muchun
b9cc54e80d sched/fair: Fix the min_vruntime update logic in dequeue_entity()
[ Upstream commit 9845c49cc9 ]

The comment and the code around the update_min_vruntime() call in
dequeue_entity() are not in agreement.

>From commit:

  b60205c7c5 ("sched/fair: Fix min_vruntime tracking")

I think that we want to update min_vruntime when a task is sleeping/migrating.
So, the check is inverted there - fix it.

Signed-off-by: Song Muchun <smuchun@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b60205c7c5 ("sched/fair: Fix min_vruntime tracking")
Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 09:19:12 +09:00
Phil Auld
d296f53c35 sched/fair: Fix throttle_list starvation with low CFS quota
commit baa9be4ffb upstream.

With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change from
c06f04c704 to put throttled entries at the head of the list, later entries
on the list will starve.  Essentially, the same X processes will get pulled
off the list, given CPU time and then, when expired, get put back on the
head of the list where distribute_cfs_runtime will give runtime to the same
set of processes leaving the rest.

Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list.  The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.

This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:

  crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
    1     ffff90b56cb2d200  -976050
    2     ffff90b56cb2cc00  -484925
    3     ffff90b56cb2bc00  -658814
    4     ffff90b56cb2ba00  -275365
    5     ffff90b166a45600  -135138
    6     ffff90b56cb2da00  -282505
    7     ffff90b56cb2e000  -148065
    8     ffff90b56cb2fa00  -872591
    9     ffff90b56cb2c000  -84687
   10     ffff90b56cb2f000  -87237
   11     ffff90b166a40a00  -164582

  crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
    1     ffff90b56cb2d200  -994147
    2     ffff90b56cb2cc00  -306051
    3     ffff90b56cb2bc00  -961321
    4     ffff90b56cb2ba00  -24490
    5     ffff90b166a45600  -135138
    6     ffff90b56cb2da00  -282505
    7     ffff90b56cb2e000  -148065
    8     ffff90b56cb2fa00  -872591
    9     ffff90b56cb2c000  -84687
   10     ffff90b56cb2f000  -87237
   11     ffff90b166a40a00  -164582

Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:

  crash> task ffff8eb765994500 sched_info
  PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
    sched_info = {
      pcount = 8,
      run_delay = 697094208,
      last_arrival = 240260125039,
      last_queued = 240260327513
    },
  crash> task ffff8eb765994500 sched_info
  PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
    sched_info = {
      pcount = 8,
      run_delay = 697094208,
      last_arrival = 240260125039,
      last_queued = 240260327513
    },

Signed-off-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: c06f04c704 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 09:15:24 +09:00
Steve Muckle
5282278601 sched/fair: Fix vruntime_normalized() for remote non-migration wakeup
commit d0cdb3ce88 upstream.

When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an error in the task's
vruntime if it is switched from the fair class during this time.

For example if it is boosted to RT priority via rt_mutex_setprio(),
rq->min_vruntime will not be subtracted from the task's vruntime but
it will be added again when the task returns to the fair class. The
task's vruntime will have been erroneously doubled and the effective
priority of the task will be reduced.

Note this will also lead to inflation of all vruntimes since the doubled
vruntime value will become the rq's min_vruntime when other tasks leave
the rq. This leads to repeated doubling of the vruntime and priority
penalty.

Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().

Based on a similar patch from John Dias <joaodias@google.com>.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Steve Muckle <smuckle@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Redpath <Chris.Redpath@arm.com>
Cc: John Dias <joaodias@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miguel de Dios <migueldedios@google.com>
Cc: Morten Rasmussen <Morten.Rasmussen@arm.com>
Cc: Patrick Bellasi <Patrick.Bellasi@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: kernel-team@android.com
Fixes: b5179ac70d ("sched/fair: Prepare to fix fairness problems on migration")
Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-15 08:25:26 +09:00
Victor Wan
cc7b1eac54 Merge branch 'android-4.9' into amlogic-4.9-dev
Signed-off-by: Victor Wan <victor.wan@amlogic.com>

 Conflicts:
	drivers/md/dm-bufio.c
	drivers/media/dvb-core/dvb_frontend.c
	drivers/usb/dwc3/core.c
	drivers/usb/gadget/function/f_fs.c
2018-08-07 14:43:24 +08:00
Patrick Bellasi
c964a2ba92 ANDROID: sched/fair: cosmetics
Change-Id: Ibbefdf1cdb4d72c7075c7b3edf6789630475a8e2
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
2018-06-07 13:44:47 +00:00
Joel Fernandes
7fd40752c3 UPSTREAM: sched/fair: Consider RT/IRQ pressure in capacity_spare_wake
capacity_spare_wake in the slow path influences choice of idlest groups,
as we search for groups with maximum spare capacity. In scenarios where
RT pressure is high, a sub optimal group can be chosen and hurt
performance of the task being woken up.

Several tests with results are included below to show improvements with
this change.

1) Hackbench on Pixel 2 Android device (4x4 ARM64 Octa core)
------------------------------------------------------------
Here we have RT activity running on big CPU cluster induced with rt-app,
and running hackbench in parallel. The RT tasks are bound to 4 CPUs on
the big cluster (cpu 4,5,6,7) and have 100ms periodicity with
runtime=20ms sleep=80ms.

Hackbench shows big benefit (30%) improvement when number of tasks is 8
and 32: Note: data is completion time in seconds (lower is better).
Number of loops for 8 and 16 tasks is 50000, and for 32 tasks its 20000.
+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks | Without Patch     | With Patch                |
+--------+-----+-------+---------+---------+-----------------+---------+
|        |     |       | Mean    | Stdev   | Mean            | Stdev   |
|        |     |       +-------------------+-----------------+---------+
|      1 |   8 |     8 | 1.0534  | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
|      2 |   8 |    16 | 1.6219  | 0.16631 | 1.6391 (-1%)    | 0.24001 |
|      4 |   8 |    32 | 1.2538  | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+

2) Rohit ran barrier.c test (details below) with following improvements:
------------------------------------------------------------------------
This was Rohit's original use case for a patch he posted at [1] however
from his recent tests he showed my patch can replace his slow path
changes [1] and there's no need to selectively scan/skip CPUs in
find_idlest_group_cpu in the slow path to get the improvement he sees.

barrier.c (open_mp code) as a micro-benchmark. It does a number of
iterations and barrier sync at the end of each for loop.

Here barrier,c is running in along with ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

barrier.c can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

Following are the results for the iterations per second with this
micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads
Intel x86 machine:
+--------+------------------+---------------------------+
|Threads | Without patch    | With patch                |
|        |                  |                           |
+--------+--------+---------+-----------------+---------+
|        | Mean   | Std Dev | Mean            | Std Dev |
+--------+--------+---------+-----------------+---------+
|1       | 539.36 | 60.16   | 572.54 (+6.15%) | 40.95   |
|2       | 481.01 | 19.32   | 530.64 (+10.32%)| 56.16   |
|4       | 474.78 | 22.28   | 479.46 (+0.99%) | 18.89   |
|8       | 450.06 | 24.91   | 447.82 (-0.50%) | 12.36   |
|16      | 436.99 | 22.57   | 441.88 (+1.12%) | 7.39    |
|32      | 388.28 | 55.59   | 429.4  (+10.59%)| 31.14   |
|64      | 314.62 | 6.33    | 311.81 (-0.89%) | 11.99   |
+--------+--------+---------+-----------------+---------+

3) ping+hackbench test on bare-metal sever (Rohit ran this test)
----------------------------------------------------------------
Here hackbench is running in threaded mode along
with, running ping on CPU 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'

This test is running on 2 socket, 20 core and 40 threads Intel x86
machine:
Number of loops is 10000 and runtime is in seconds (Lower is better).

+--------------+-----------------+--------------------------+
|Task Groups   | Without patch   |  With patch              |
|              +-------+---------+----------------+---------+
|(Groups of 40)| Mean  | Std Dev |  Mean          | Std Dev |
+--------------+-------+---------+----------------+---------+
|1             | 0.851 | 0.007   |  0.828 (+2.77%)| 0.032   |
|2             | 1.083 | 0.203   |  1.087 (-0.37%)| 0.246   |
|4             | 1.601 | 0.051   |  1.611 (-0.62%)| 0.055   |
|8             | 2.837 | 0.060   |  2.827 (+0.35%)| 0.031   |
|16            | 5.139 | 0.133   |  5.107 (+0.63%)| 0.085   |
|25            | 7.569 | 0.142   |  7.503 (+0.88%)| 0.143   |
+--------------+-------+---------+----------------+---------+

[1] https://patchwork.kernel.org/patch/9991635/

Matt Fleming also ran cyclictest and several different hackbench tests
on his test machines to santiy-check that the patch doesn't harm any
of his usecases.

Change-Id: I1de1e40a3a92a54a9adc1303ccc98104704c1e35
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Morten Ramussen <morten.rasmussen@arm.com>
Cc: Brendan Jackman <brendan.jackman@arm.com>
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2018-05-29 09:00:30 -07:00
Victor Wan
324524de04 Merge branch 'android-4.9' into amlogic-4.9-dev 2018-05-22 10:48:42 +08:00
Victor Wan
810c6dd972 Merge branch 'android-4.9' into amlogic-4.9-dev
Signed-off-by: Victor Wan <victor.wan@amlogic.com>

Conflicts:
	arch/arm/configs/bcm2835_defconfig
	arch/arm/configs/sunxi_defconfig
	include/linux/cpufreq.h
	init/main.c
2018-04-24 17:43:19 +08:00
Greg Kroah-Hartman
8683408f8e Merge 4.9.94 into android-4.9
Changes in 4.9.94
	qed: Fix overriding of supported autoneg value.
	cfg80211: make RATE_INFO_BW_20 the default
	md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock
	rtc: snvs: fix an incorrect check of return value
	x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()
	x86/mm/kaslr: Use the _ASM_MUL macro for multiplication to work around Clang incompatibility
	ovl: persistent inode numbers for upper hardlinks
	NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION
	x86/boot: Declare error() as noreturn
	IB/srpt: Fix abort handling
	IB/srpt: Avoid that aborting a command triggers a kernel warning
	af_key: Fix slab-out-of-bounds in pfkey_compile_policy.
	mac80211: bail out from prep_connection() if a reconfig is ongoing
	bna: Avoid reading past end of buffer
	qlge: Avoid reading past end of buffer
	ubi: fastmap: Fix slab corruption
	ipmi_ssif: unlock on allocation failure
	net: cdc_ncm: Fix TX zero padding
	net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control
	lockd: fix lockd shutdown race
	drivers/misc/vmw_vmci/vmci_queue_pair.c: fix a couple integer overflow tests
	pidns: disable pid allocation if pid_ns_prepare_proc() is failed in alloc_pid()
	s390: move _text symbol to address higher than zero
	net/mlx4_en: Avoid adding steering rules with invalid ring
	qed: Correct doorbell configuration for !4Kb pages
	NFSv4.1: Work around a Linux server bug...
	CIFS: silence lockdep splat in cifs_relock_file()
	perf/callchain: Force USER_DS when invoking perf_callchain_user()
	blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op
	net: qca_spi: Fix alignment issues in rx path
	netxen_nic: set rcode to the return status from the call to netxen_issue_cmd
	mdio: mux: Correct mdio_mux_init error path issues
	Input: elan_i2c - check if device is there before really probing
	Input: elantech - force relative mode on a certain module
	KVM: PPC: Book3S PR: Check copy_to/from_user return values
	irqchip/mbigen: Fix the clear register offset calculation
	vmxnet3: ensure that adapter is in proper state during force_close
	mm, vmstat: Remove spurious WARN() during zoneinfo print
	SMB2: Fix share type handling
	bus: brcmstb_gisb: Use register offsets with writes too
	bus: brcmstb_gisb: correct support for 64-bit address output
	PowerCap: Fix an error code in powercap_register_zone()
	iio: pressure: zpa2326: report interrupted case as failure
	ARM: dts: imx53-qsrb: Pulldown PMIC IRQ pin
	staging: wlan-ng: prism2mgmt.c: fixed a double endian conversion before calling hfa384x_drvr_setconfig16, also fixes relative sparse warning
	clk: renesas: rcar-gen2: Fix PLL0 on R-Car V2H and E2
	x86/tsc: Provide 'tsc=unstable' boot parameter
	powerpc/modules: If mprofile-kernel is enabled add it to vermagic
	ARM: dts: imx6qdl-wandboard: Fix audio channel swap
	i2c: mux: reg: put away the parent i2c adapter on probe failure
	arm64: perf: Ignore exclude_hv when kernel is running in HYP
	mdio: mux: fix device_node_continue.cocci warnings
	ipv6: avoid dad-failures for addresses with NODAD
	async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
	KVM: arm: Restore banked registers and physical timer access on hyp_panic()
	KVM: arm64: Restore host physical timer access on hyp_panic()
	usb: dwc3: keystone: check return value
	btrfs: fix incorrect error return ret being passed to mapping_set_error
	ata: libahci: properly propagate return value of platform_get_irq()
	ipmr: vrf: Find VIFs using the actual device
	uio: fix incorrect memory leak cleanup
	neighbour: update neigh timestamps iff update is effective
	arp: honour gratuitous ARP _replies_
	ARM: dts: rockchip: fix rk322x i2s1 pinctrl error
	usb: chipidea: properly handle host or gadget initialization failure
	pxa_camera: fix module remove codepath for v4l2 clock
	USB: ene_usb6250: fix first command execution
	net: x25: fix one potential use-after-free issue
	USB: ene_usb6250: fix SCSI residue overwriting
	serial: 8250: omap: Disable DMA for console UART
	serial: sh-sci: Fix race condition causing garbage during shutdown
	net/wan/fsl_ucc_hdlc: fix unitialized variable warnings
	net/wan/fsl_ucc_hdlc: fix incorrect memory allocation
	fsl/qe: add bit description for SYNL register for GUMR
	sh_eth: Use platform device for printing before register_netdev()
	mlxsw: spectrum: Avoid possible NULL pointer dereference
	scsi: csiostor: fix use after free in csio_hw_use_fwconfig()
	powerpc/mm: Fix virt_addr_valid() etc. on 64-bit hash
	ath5k: fix memory leak on buf on failed eeprom read
	selftests/powerpc: Fix TM resched DSCR test with some compilers
	xfrm: fix state migration copy replay sequence numbers
	ASoC: simple-card: fix mic jack initialization
	iio: hi8435: avoid garbage event at first enable
	iio: hi8435: cleanup reset gpio
	iio: light: rpr0521 poweroff for probe fails
	ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors
	md-cluster: fix potential lock issue in add_new_disk
	ARM: davinci: da8xx: Create DSP device only when assigned memory
	ray_cs: Avoid reading past end of buffer
	net/wan/fsl_ucc_hdlc: fix muram allocation error
	leds: pca955x: Correct I2C Functionality
	perf/core: Fix error handling in perf_event_alloc()
	sched/numa: Use down_read_trylock() for the mmap_sem
	gpio: crystalcove: Do not write regular gpio registers for virtual GPIOs
	net/mlx5: Tolerate irq_set_affinity_hint() failures
	selinux: do not check open permission on sockets
	block: fix an error code in add_partition()
	mlx5: fix bug reading rss_hash_type from CQE
	net: ieee802154: fix net_device reference release too early
	libceph: NULL deref on crush_decode() error path
	perf report: Fix off-by-one for non-activation frames
	netfilter: ctnetlink: fix incorrect nf_ct_put during hash resize
	pNFS/flexfiles: missing error code in ff_layout_alloc_lseg()
	ASoC: rsnd: SSI PIO adjust to 24bit mode
	scsi: bnx2fc: fix race condition in bnx2fc_get_host_stats()
	fix race in drivers/char/random.c:get_reg()
	ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
	ARM64: PCI: Fix struct acpi_pci_root_ops allocation failure path
	tcp: better validation of received ack sequences
	net: move somaxconn init from sysctl code
	Input: elan_i2c - clear INT before resetting controller
	bonding: Don't update slave->link until ready to commit
	cpuhotplug: Link lock stacks for hotplug callbacks
	PCI/msi: fix the pci_alloc_irq_vectors_affinity stub
	KVM: X86: Fix preempt the preemption timer cancel
	KVM: nVMX: Fix handling of lmsw instruction
	net: llc: add lock_sock in llc_ui_bind to avoid a race condition
	drm/msm: Take the mutex before calling msm_gem_new_impl
	i40iw: Fix sequence number for the first partial FPDU
	i40iw: Correct Q1/XF object count equation
	ARM: dts: ls1021a: add "fsl,ls1021a-esdhc" compatible string to esdhc node
	thermal: power_allocator: fix one race condition issue for thermal_instances list
	perf probe: Add warning message if there is unexpected event name
	l2tp: fix missing print session offset info
	rds; Reset rs->rs_bound_addr in rds_add_bound() failure path
	ACPI / video: Default lcd_only to true on Win8-ready and newer machines
	net/mlx4_en: Change default QoS settings
	VFS: close race between getcwd() and d_move()
	PM / devfreq: Fix potential NULL pointer dereference in governor_store
	hwmon: (ina2xx) Make calibration register value fixed
	media: videobuf2-core: don't go out of the buffer range
	ASoC: Intel: Skylake: Disable clock gating during firmware and library download
	ASoC: Intel: cht_bsw_rt5645: Analog Mic support
	scsi: libiscsi: Allow sd_shutdown on bad transport
	scsi: mpt3sas: Proper handling of set/clear of "ATA command pending" flag.
	irqchip/gic-v3: Fix the driver probe() fail due to disabled GICC entry
	ACPI: EC: Fix debugfs_create_*() usage
	mac80211: Fix setting TX power on monitor interfaces
	vfb: fix video mode and line_length being set when loaded
	gpio: label descriptors using the device name
	IB/rdmavt: Allocate CQ memory on the correct node
	blk-mq: fix race between updating nr_hw_queues and switching io sched
	backlight: tdo24m: Fix the SPI CS between transfers
	pinctrl: baytrail: Enable glitch filter for GPIOs used as interrupts
	ASoC: Intel: sst: Fix the return value of 'sst_send_byte_stream_mrfld()'
	rt2x00: do not pause queue unconditionally on error path
	wl1251: check return from call to wl1251_acx_arp_ip_filter
	hdlcdrv: Fix divide by zero in hdlcdrv_ioctl
	x86/efi: Disable runtime services on kexec kernel if booted with efi=old_map
	netfilter: conntrack: don't call iter for non-confirmed conntracks
	HID: i2c: Call acpi_device_fix_up_power for ACPI-enumerated devices
	ovl: filter trusted xattr for non-admin
	powerpc/[booke|4xx]: Don't clobber TCR[WP] when setting TCR[DIE]
	dmaengine: imx-sdma: Handle return value of clk_prepare_enable
	backlight: Report error on failure
	arm64: futex: Fix undefined behaviour with FUTEX_OP_OPARG_SHIFT usage
	net/mlx5: avoid build warning for uniprocessor
	cxgb4: FW upgrade fixes
	cxgb4: Fix netdev_features flag
	rtc: m41t80: fix SQW dividers override when setting a date
	i40evf: fix merge error in older patch
	rtc: opal: Handle disabled TPO in opal_get_tpo_time()
	rtc: interface: Validate alarm-time before handling rollover
	SUNRPC: ensure correct error is reported by xs_tcp_setup_socket()
	net: freescale: fix potential null pointer dereference
	clk: at91: fix clk-generated parenting
	drm/sun4i: Ignore the generic connectors for components
	dt-bindings: display: sun4i: Add allwinner,tcon-channel property
	mtd: nand: gpmi: Fix gpmi_nand_init() error path
	mtd: nand: check ecc->total sanity in nand_scan_tail
	KVM: SVM: do not zero out segment attributes if segment is unusable or not present
	clk: scpi: fix return type of __scpi_dvfs_round_rate
	clk: Fix __set_clk_rates error print-string
	powerpc/spufs: Fix coredump of SPU contexts
	drm/amdkfd: NULL dereference involving create_process()
	ath10k: add BMI parameters to fix calibration from DT/pre-cal
	perf trace: Add mmap alias for s390
	qlcnic: Fix a sleep-in-atomic bug in qlcnic_82xx_hw_write_wx_2M and qlcnic_82xx_hw_read_wx_2M
	arm64: kernel: restrict /dev/mem read() calls to linear region
	mISDN: Fix a sleep-in-atomic bug
	net: phy: micrel: Restore led_mode and clk_sel on resume
	RDMA/iw_cxgb4: Avoid touch after free error in ARP failure handlers
	RDMA/hfi1: fix array termination by appending NULL to attr array
	drm/omap: fix tiled buffer stride calculations
	powerpc/8xx: fix mpc8xx_get_irq() return on no irq
	cxgb4: fix incorrect cim_la output for T6
	Fix serial console on SNI RM400 machines
	bio-integrity: Do not allocate integrity context for bio w/o data
	ip6_tunnel: fix traffic class routing for tunnels
	skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
	macsec: check return value of skb_to_sgvec always
	sit: reload iphdr in ipip6_rcv
	net/mlx4: Fix the check in attaching steering rules
	net/mlx4: Check if Granular QoS per VF has been enabled before updating QP qos_vport
	perf header: Set proper module name when build-id event found
	perf report: Ensure the perf DSO mapping matches what libdw sees
	iwlwifi: mvm: fix firmware debug restart recording
	watchdog: f71808e_wdt: Add F71868 support
	iwlwifi: mvm: Fix command queue number on d0i3 flow
	iwlwifi: tt: move ucode_loaded check under mutex
	iwlwifi: pcie: only use d0i3 in suspend/resume if system_pm is set to d0i3
	iwlwifi: fix min API version for 7265D, 3168, 8000 and 8265
	tags: honor COMPILED_SOURCE with apart output directory
	ARM: dts: qcom: ipq4019: fix i2c_0 node
	e1000e: fix race condition around skb_tstamp_tx()
	igb: fix race condition with PTP_TX_IN_PROGRESS bits
	cxl: Unlock on error in probe
	cx25840: fix unchecked return values
	mceusb: sporadic RX truncation corruption fix
	net: phy: avoid genphy_aneg_done() for PHYs without clause 22 support
	ARM: imx: Add MXC_CPU_IMX6ULL and cpu_is_imx6ull
	nvme-pci: fix multiple ctrl removal scheduling
	nvme: fix hang in remove path
	KVM: nVMX: Update vmcs12->guest_linear_address on nested VM-exit
	e1000e: Undo e1000e_pm_freeze if __e1000_shutdown fails
	perf/core: Correct event creation with PERF_FORMAT_GROUP
	sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks
	MIPS: mm: fixed mappings: correct initialisation
	MIPS: mm: adjust PKMAP location
	MIPS: kprobes: flush_insn_slot should flush only if probe initialised
	ARM: dts: armadillo800eva: Split LCD mux and gpio
	Fix loop device flush before configure v3
	net: emac: fix reset timeout with AR8035 phy
	perf tools: Decompress kernel module when reading DSO data
	perf tests: Decompress kernel module before objdump
	skbuff: only inherit relevant tx_flags
	xen: avoid type warning in xchg_xen_ulong
	X.509: Fix error code in x509_cert_parse()
	pinctrl: meson-gxbb: remove non-existing pin GPIOX_22
	coresight: Fix reference count for software sources
	coresight: tmc: Configure DMA mask appropriately
	stmmac: fix ptp header for GMAC3 hw timestamp
	geneve: add missing rx stats accounting
	crypto: omap-sham - buffer handling fixes for hashing later
	crypto: omap-sham - fix closing of hash with separate finalize call
	bnx2x: Allow vfs to disable txvlan offload
	sctp: fix recursive locking warning in sctp_do_peeloff
	net: fec: Add a fec_enet_clear_ethtool_stats() stub for CONFIG_M5272
	sparc64: ldc abort during vds iso boot
	iio: magnetometer: st_magn_spi: fix spi_device_id table
	net: ena: fix rare uncompleted admin command false alarm
	net: ena: fix race condition between submit and completion admin command
	net: ena: add missing return when ena_com_get_io_handlers() fails
	net: ena: add missing unmap bars on device removal
	net: ena: disable admin msix while working in polling mode
	clk: meson: meson8b: add compatibles for Meson8 and Meson8m2
	Bluetooth: Send HCI Set Event Mask Page 2 command only when needed
	cpuidle: dt: Add missing 'of_node_put()'
	ACPICA: OSL: Add support to exclude stdarg.h
	ACPICA: Events: Add runtime stub support for event APIs
	ACPICA: Disassembler: Abort on an invalid/unknown AML opcode
	s390/dasd: fix hanging safe offline
	vxlan: dont migrate permanent fdb entries during learn
	hsr: fix incorrect warning
	selftests: kselftest_harness: Fix compile warning
	drm/vc4: Fix resource leak in 'vc4_get_hang_state_ioctl()' in error handling path
	bcache: stop writeback thread after detaching
	bcache: segregate flash only volume write streams
	scsi: libsas: fix memory leak in sas_smp_get_phy_events()
	scsi: libsas: fix error when getting phy events
	scsi: libsas: initialize sas_phy status according to response of DISCOVER
	blk-mq: fix kernel oops in blk_mq_tag_idle()
	tty: n_gsm: Allow ADM response in addition to UA for control dlci
	EDAC, mv64x60: Fix an error handling path
	cxgb4vf: Fix SGE FL buffer initialization logic for 64K pages
	sdhci: Advertise 2.0v supply on SDIO host controller
	Input: goodix - disable IRQs while suspended
	mtd: mtd_oobtest: Handle bitflips during reads
	perf tools: Fix copyfile_offset update of output offset
	ipsec: check return value of skb_to_sgvec always
	rxrpc: check return value of skb_to_sgvec always
	virtio_net: check return value of skb_to_sgvec always
	virtio_net: check return value of skb_to_sgvec in one more location
	random: use lockless method of accessing and updating f->reg_idx
	clk: at91: fix clk-generated compilation
	arp: fix arp_filter on l3slave devices
	ipv6: the entire IPv6 header chain must fit the first fragment
	net: fix possible out-of-bound read in skb_network_protocol()
	net/ipv6: Fix route leaking between VRFs
	net/ipv6: Increment OUTxxx counters after netfilter hook
	netlink: make sure nladdr has correct size in netlink_connect()
	net/sched: fix NULL dereference in the error path of tcf_bpf_init()
	pptp: remove a buggy dst release in pptp_connect()
	r8169: fix setting driver_data after register_netdev
	sctp: do not leak kernel memory to user space
	sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
	sky2: Increase D3 delay to sky2 stops working after suspend
	vhost: correctly remove wait queue during poll failure
	vlan: also check phy_driver ts_info for vlan's real device
	bonding: fix the err path for dev hwaddr sync in bond_enslave
	bonding: move dev_mc_sync after master_upper_dev_link in bond_enslave
	bonding: process the err returned by dev_set_allmulti properly in bond_enslave
	net: fool proof dev_valid_name()
	ip_tunnel: better validate user provided tunnel names
	ipv6: sit: better validate user provided tunnel names
	ip6_gre: better validate user provided tunnel names
	ip6_tunnel: better validate user provided tunnel names
	vti6: better validate user provided tunnel names
	net/mlx5e: Sync netdev vxlan ports at open
	net/sched: fix NULL dereference in the error path of tunnel_key_init()
	net/sched: fix NULL dereference on the error path of tcf_skbmod_init()
	net/mlx4_en: Fix mixed PFC and Global pause user control requests
	vhost: validate log when IOTLB is enabled
	route: check sysctl_fib_multipath_use_neigh earlier than hash
	team: move dev_mc_sync after master_upper_dev_link in team_port_add
	vhost_net: add missing lock nesting notation
	net/mlx4_core: Fix memory leak while delete slave's resources
	strparser: Fix sign of err codes
	net sched actions: fix dumping which requires several messages to user space
	vrf: Fix use after free and double free in vrf_finish_output
	Revert "xhci: plat: Register shutdown for xhci_plat"
	Linux 4.9.94

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-04-14 15:40:56 +02:00
Vlastimil Babka
a1e7a9e2e3 sched/numa: Use down_read_trylock() for the mmap_sem
[ Upstream commit 8655d54977 ]

A customer has reported a soft-lockup when running an intensive
memory stress test, where the trace on multiple CPU's looks like this:

 RIP: 0010:[<ffffffff810c53fe>]
  [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
...
 Call Trace:
  [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
  [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
  [<ffffffff81098322>] task_work_run+0x72/0x90

Further investigation showed that the lock contention here is pmd_lock().

The task_numa_work() function makes sure that only one thread is let to perform
the work in a single scan period (via cmpxchg), but if there's a thread with
mmap_sem locked for writing for several periods, multiple threads in
task_numa_work() can build up a convoy waiting for mmap_sem for read and then
all get unblocked at once.

This patch changes the down_read() to the trylock version, which prevents the
build up. For a workload experiencing mmap_sem contention, it's probably better
to postpone the NUMA balancing work anyway. This seems to have fixed the soft
lockups involving pmd_lock(), which is in line with the convoy theory.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13 19:48:04 +02:00
Quentin Perret
21bd85cd95 Merge remote-tracking branch 'android-4.9' into 'android-4.9-eas-dev'
Fixed merge conflicts caused by 8a174b4749 ("sched/fair: prevent
possible infinite loop in sched_group_energy)

Change-Id: Iba35631a213b88f6361838f28573af3407568919
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-02-22 18:04:18 +00:00
Quentin Perret
b5ba569f6f sched/fair: select the most energy-efficient CPU candidate on wake-up
The current implementation of the energy-aware wake-up path relies on
find_best_target() to select an ordered list of CPU candidates for task
placement. The first candidate of the list saving energy is then chosen,
disregarding all the others to avoid the overhead of an expensive
energy_diff.

With the recent refactoring of select_energy_cpu_idx(), the cost of
exploring multiple CPUs has been reduced, hence offering the opportunity
to select the most energy-efficient candidate at a lower cost. This commit
seizes this opportunity by allowing to change select_energy_cpu_idx()'s
behaviour as to ignore the order of CPUs returned by find_best_target()
and to pick the best candidate energy-wise.

As this functionality is still considered as experimental, it is hidden
behind a sched_feature named FBT_STRICT_ORDER (like the equivalent
feature in Android 4.14) which defaults to true, hence keeping the
current behaviour by default.

Change-Id: I0cb833bfec1a4a053eddaff1652c0b6cad554f97
Suggested-by: Patrick Bellasi <patrick.bellasi@arm.com>
Suggested-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-02-14 16:21:10 +00:00
Chris Redpath
8a174b4749 sched/fair: prevent possible infinite loop in sched_group_energy
There is a race between hotplug and energy_diff which might result
in endless loop in sched_group_energy. When this happens, the end
condition cannot be detected.

We can store how many CPUs we need to visit at the beginning, and
bail out of the energy calculation if we visit more cpus than expected.

Bug: 72311797 72202633
Change-Id: I8dda75468ee1570da4071cd8165ef5131a8205d8
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2018-02-10 00:45:21 +00:00
Pavankumar Kondeti
de22a05ed0 sched/fair: fix array out of bounds access in select_energy_cpu_idx()
We are using an incorrect index while initializing the nrg_delta for
the previous CPU in select_energy_cpu_idx(). This initialization
it self is not needed as the nrg_delta for the previous CPU is already
initialized to 0 while preparing the energ_env struct.

Change-Id: Iee4e2c62f904050d2680a0a1df646d4d515c62cc
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2018-02-03 05:29:57 +05:30
Ionela Voinescu
f964120739 sched/fair: use min capacity when evaluating active cpus
When we are calculating what the impact of placing a task on a specific
cpu is, we should include the information that there might be a minimum
capacity imposed upon that cpu which could change the performance and/or
energy cost decisions.

When choosing an active target CPU, favour CPUs that won't end up
running at a high OPP due to a min capacity cap imposed by external
actors.

Change-Id: Ibc3302304345b63107f172b1fc3ffdabc19aa9d4
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2018-02-01 18:37:35 +00:00
Ionela Voinescu
88a968ca63 sched/fair: use min capacity when evaluating idle backup cpus
When we are calculating what the impact of placing a task on a specific
cpu is, we should include the information that there might be a minimum
capacity imposed upon that cpu which could change the performance and/or
energy cost decisions.

When choosing an idle backup CPU, favour CPUs that won't end up
running at a high OPP due to a min capacity cap imposed by external
actors.

Change-Id: I566623ffb3a7c5b61a23242dcce1cb4147ef8a4a
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2018-02-01 18:37:32 +00:00
Ionela Voinescu
9e1c648c71 sched/fair: use min capacity when evaluating placement energy costs
Add the ability to track minimim capacity forced onto a sched_group
by some external actor.

group_max_util returns the highest utilisation inside a sched_group
and is used when we are trying to calculate an energy cost estimate
for a specific scheduling scenario. Minimum capacities imposed from
elsewhere will influence this energy cost so we should reflect it
here.

Change-Id: Ibd537a6dbe6d67b11cc9e9be18f40fcb2c0f13de
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2018-02-01 18:37:27 +00:00
Ionela Voinescu
58b761f4c0 sched/fair: introduce minimum capacity capping sched feature
We have the ability to track minimum capacity forced onto a CPU by
userspace or external actors. This is provided though a minimum
frequency scale factor exposed by arch_scale_min_freq_capacity.

The use of this information is enabled through the MIN_CAPACITY_CAPPING
feature. If not enabled, the minimum frequency scale factor will
remain 0 and it will not impact energy estimation or scheduling
decisions.

Change-Id: Ibc61f2bf4fddf186695b72b262e602a6e8bfde37
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2018-02-01 18:37:23 +00:00
Dietmar Eggemann
da6833cff7 sched/fair: introduce an arch scaling function for max frequency capping
The max frequency scaling factor is defined as:

  max_freq_scale = policy_max_freq / cpuinfo_max_freq

To be able to scale the cpu capacity by this factor introduce a call to
the new arch scaling function arch_scale_max_freq_capacity() in
update_cpu_capacity() and provide a default implementation which returns
SCHED_CAPACITY_SCALE.

Another subsystem (e.g. cpufreq) can overwrite this default implementation,
exactly as for frequency and cpu invariance. It has to be enabled by the
arch by defining arch_scale_max_freq_capacity to the actual
implementation.

Change-Id: I266cd1f4c1c82f54b80063c36aa5f7662599dd28
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2018-02-01 15:59:24 +00:00
Ke Wang
7be1985454 ANDROID: sched: EAS: check energy_aware() before calling select_energy_cpu_brute() in up-migrate path
In up-migrate path, select_energy_cpu_brute() was called directly
without checking energy_aware(). This will make select_energy_cpu_brute()
always worked even disabling energy_aware() on the asymmetric cpu
capacity system.

Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
2018-01-29 15:31:47 +00:00
Victor Wan
20946741c8 Merge branch 'android-4.9' into amlogic-4.9-dev
Conflicts:
	Makefile
	init/main.c
2018-01-22 20:17:25 +08:00
Patrick Bellasi
3b9305de32 sched/fair: reduce rounding errors in energy computations
The SG's energy is obtained by adding busy and idle contributions which
are computed by considering a proper fraction of the
SCHED_CAPACITY_SCALE defined by the SG's utilizations.

By scaling each and every contribution conputed we risk to accumulate
rounding errors which can results into a non null energy_delta also in
cases when the same total accomulated utilization is differently
distributed among different CPUs.

To reduce rouding errors, this patch accumulated non-scaled busy/idle
energy contributions for each visited SG, and scale each of them just
one time at the end.

Change-Id: Idf8367fee0ac11938c6436096f0c1b2d630210d2
Suggested-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-01-08 15:55:12 +00:00
Patrick Bellasi
cf28cf03a3 sched/fair: re-factor energy_diff to use a single (extensible) energy_env
The energy_env data structure is used to cache values required by multiple
different functions involved in energy_diff computation. Some of these
functions require additional parameters which can be easily embedded
into the energy_env itself. The current implementation of energy_diff
hardcodes the usage of two different energy_env structures to estimate and
compare the energy consumption related to a "before" and an "after" CPU.
Moreover, it does this energy estimation by walking multiple times the
SDs/SGs data structures.

A better design can be envisioned by better using the energy_env structure
to support a more efficient and concurrent evaluation of multiple schedule
candidates. To this purpose, this patch provides a complete re-factoring
of the energy_diff implementation to:

1. use a single energy_env structure for the evaluation of all the
   candidate CPUs

2. walk just one time the SDs/SGs, thus improving the overall performance
   to compute the energy estimation for each CPU candidate specified by
   the single used energy_env

3. simplify the code (at least if you look at the new version and not at
   this re-factoring patch) thus providing a more clean code to maintain
   and extend for additional features

This patch updated all the clients of energy_env to use only the data
provided by this structure and an index for one of its CPUs candidates.
Embedding everything within the energy env will make it simple to add
tracepoints for this new version, which can easily provide an holistic
view on how energy_diff evaluated the proposed CPU candidates.

The new proposed structure, for both "struct energy_env" and the functions
using it, is designed in such a way to easily accommodate additional
further extensions (e.g. SchedTune filtering) without requiring an
additional big re-factoring of these core functions.

Finally, the usage of a CPUs candidate array, embedded into the
energy_diff structure, allows also to seamless extend the exploration of
multiple candidate CPUs, for example to support the comparison of a
spread-vs-packing strategy.

Change-Id: Ic04ffb6848b2c763cf1788767f22c6872eb12bee
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[reworked find_new_capacity() and enforced the respect of
find_best_target() selection order]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-01-08 15:55:08 +00:00
Patrick Bellasi
142ce32a63 sched/fair: cleanup select_energy_cpu_brute to be more consistent
The current definition of select_energy_cpu_brute is a bit confusing in
the definition of the value for the target_cpu to be returned as wakeup
CPU for the specified task.

This cleanup the code by ensuring that we always set target_cpu right
before returning it. rcu_read_lock and check on *sd!=NULL are also moved
around to be exactly where they are required.

Change-Id: I70a4b558b3624a13395da1a87ddc0776fd1d6641
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-01-08 15:55:03 +00:00
Patrick Bellasi
7f44e92d1c sched/fair: remove capacity tracking from energy_diff
In preparation for the energy_diff refactoring, let's remove all the
SchedTune specific bits which are used to keep track of the capacity
variations requited by the PESpace filtering.

This removes also the energy_normalization function and the wrapper of
energy_diff which is used to trigger a PESpace filtering by
schedtune_accept_deltas().

The remaining code is the "original" energy_diff function which
looks just at the energy variations to compare prev_cpu vs next_cpu.

Change-Id: I4fb1d1c5ba45a364e6db9ab8044969349aba0307
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-01-08 15:54:59 +00:00
Patrick Bellasi
4905932b05 sched/fair: remove energy_diff tracepoint in preparation to re-factoring
The format of the energy_diff tracepoint is going to be changed by the
following energ_diff refactoring patches. Let's remove it now to start from
a clean slate.

Change-Id: Id4f537ed60d90a7ddcca0a29a49944bfacb85c8c
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-01-08 15:54:55 +00:00
Chris Redpath
e997bf0fde sched/fair: use *p to reference task_structs
This is a simple renaming patch which just align to the most common code
convention used in fair.c, task_structs pointers are usually named *p.

Change-Id: Id0769e52b6a271014d89353fdb4be9bb721b5b2f
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-01-08 15:54:50 +00:00
Ke Wang
225006a4f7 sched: EAS: Fix the calculation of group util in group_idle_state()
util_delta becomes not zero in eenv_before, which will affect the
calculation of grp_util in group_idle_state(). Fix it under the
new condition.

Change-Id: Ic3853bb45876a8e388afcbe4e72d25fc42b1d7b0
Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
(cherry picked from commit 47c87b2654)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2018-01-08 15:54:46 +00:00
Victor Wan
2c95ea743b Merge branch 'android-4.9' into amlogic-4.9-dev
Conflicts:
	arch/arm/configs/omap2plus_defconfig
	drivers/Makefile
	drivers/android/binder.c
2018-01-08 18:44:19 +08:00
Greg Kroah-Hartman
3f1d77ca5f Merge 4.9.69 into android-4.9
Changes in 4.9.69
	usb: gadget: udc: renesas_usb3: fix number of the pipes
	can: ti_hecc: Fix napi poll return value for repoll
	can: kvaser_usb: free buf in error paths
	can: kvaser_usb: Fix comparison bug in kvaser_usb_read_bulk_callback()
	can: kvaser_usb: ratelimit errors if incomplete messages are received
	can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
	can: ems_usb: cancel urb on -EPIPE and -EPROTO
	can: esd_usb2: cancel urb on -EPIPE and -EPROTO
	can: usb_8dev: cancel urb on -EPIPE and -EPROTO
	virtio: release virtio index when fail to device_register
	hv: kvp: Avoid reading past allocated blocks from KVP file
	isa: Prevent NULL dereference in isa_bus driver callbacks
	scsi: dma-mapping: always provide dma_get_cache_alignment
	scsi: use dma_get_cache_alignment() as minimum DMA alignment
	scsi: libsas: align sata_device's rps_resp on a cacheline
	efi: Move some sysfs files to be read-only by root
	efi/esrt: Use memunmap() instead of kfree() to free the remapping
	ASN.1: fix out-of-bounds read when parsing indefinite length item
	ASN.1: check for error from ASN1_OP_END__ACT actions
	KEYS: add missing permission check for request_key() destination
	X.509: reject invalid BIT STRING for subjectPublicKey
	X.509: fix comparisons of ->pkey_algo
	x86/PCI: Make broadcom_postcore_init() check acpi_disabled
	KVM: x86: fix APIC page invalidation
	btrfs: fix missing error return in btrfs_drop_snapshot
	ALSA: pcm: prevent UAF in snd_pcm_info
	ALSA: seq: Remove spurious WARN_ON() at timer check
	ALSA: usb-audio: Fix out-of-bound error
	ALSA: usb-audio: Add check return value for usb_string()
	iommu/vt-d: Fix scatterlist offset handling
	smp/hotplug: Move step CPUHP_AP_SMPCFD_DYING to the correct place
	s390: fix compat system call table
	KVM: s390: Fix skey emulation permission check
	powerpc/64s: Initialize ISAv3 MMU registers before setting partition table
	brcmfmac: change driver unbind order of the sdio function devices
	kdb: Fix handling of kallsyms_symbol_next() return value
	drm/exynos: gem: Drop NONCONTIG flag for buffers allocated without IOMMU
	media: dvb: i2c transfers over usb cannot be done from stack
	arm64: KVM: fix VTTBR_BADDR_MASK BUG_ON off-by-one
	arm: KVM: Fix VTTBR_BADDR_MASK BUG_ON off-by-one
	KVM: VMX: remove I/O port 0x80 bypass on Intel hosts
	KVM: arm/arm64: Fix broken GICH_ELRSR big endian conversion
	KVM: arm/arm64: vgic-irqfd: Fix MSI entry allocation
	KVM: arm/arm64: vgic-its: Check result of allocation before use
	arm64: fpsimd: Prevent registers leaking from dead tasks
	bus: arm-cci: Fix use of smp_processor_id() in preemptible context
	bus: arm-ccn: Check memory allocation failure
	bus: arm-ccn: Fix use of smp_processor_id() in preemptible context
	bus: arm-ccn: fix module unloading Error: Removing state 147 which has instances left.
	crypto: talitos - fix AEAD test failures
	crypto: talitos - fix memory corruption on SEC2
	crypto: talitos - fix setkey to check key weakness
	crypto: talitos - fix AEAD for sha224 on non sha224 capable chips
	crypto: talitos - fix use of sg_link_tbl_len
	crypto: talitos - fix ctr-aes-talitos
	usb: f_fs: Force Reserved1=1 in OS_DESC_EXT_COMPAT
	ARM: BUG if jumping to usermode address in kernel mode
	ARM: avoid faulting on qemu
	thp: reduce indentation level in change_huge_pmd()
	thp: fix MADV_DONTNEED vs. numa balancing race
	mm: drop unused pmdp_huge_get_and_clear_notify()
	Revert "drm/armada: Fix compile fail"
	Revert "spi: SPI_FSL_DSPI should depend on HAS_DMA"
	ARM: 8657/1: uaccess: consistently check object sizes
	vti6: Don't report path MTU below IPV6_MIN_MTU.
	ARM: OMAP2+: gpmc-onenand: propagate error on initialization failure
	x86/selftests: Add clobbers for int80 on x86_64
	x86/platform/uv/BAU: Fix HUB errors by remove initial write to sw-ack register
	sched/fair: Make select_idle_cpu() more aggressive
	x86/hpet: Prevent might sleep splat on resume
	powerpc/64: Invalidate process table caching after setting process table
	selftest/powerpc: Fix false failures for skipped tests
	powerpc: Fix compiling a BE kernel with a powerpc64le toolchain
	lirc: fix dead lock between open and wakeup_filter
	module: set __jump_table alignment to 8
	powerpc/64: Fix checksum folding in csum_add()
	ARM: OMAP2+: Fix device node reference counts
	ARM: OMAP2+: Release device node after it is no longer needed.
	ASoC: rcar: avoid SSI_MODEx settings for SSI8
	gpio: altera: Use handle_level_irq when configured as a level_high
	HID: chicony: Add support for another ASUS Zen AiO keyboard
	usb: gadget: configs: plug memory leak
	USB: gadgetfs: Fix a potential memory leak in 'dev_config()'
	usb: dwc3: gadget: Fix system suspend/resume on TI platforms
	usb: gadget: pxa27x: Test for a valid argument pointer
	usb: gadget: udc: net2280: Fix tmp reusage in net2280 driver
	kvm: nVMX: VMCLEAR should not cause the vCPU to shut down
	libata: drop WARN from protocol error in ata_sff_qc_issue()
	workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq
	scsi: qla2xxx: Fix ql_dump_buffer
	scsi: lpfc: Fix crash during Hardware error recovery on SLI3 adapters
	irqchip/crossbar: Fix incorrect type of register size
	KVM: nVMX: reset nested_run_pending if the vCPU is going to be reset
	arm: KVM: Survive unknown traps from guests
	arm64: KVM: Survive unknown traps from guests
	KVM: arm/arm64: VGIC: Fix command handling while ITS being disabled
	spi_ks8995: fix "BUG: key accdaa28 not in .data!"
	spi_ks8995: regs_size incorrect for some devices
	bnx2x: prevent crash when accessing PTP with interface down
	bnx2x: fix possible overrun of VFPF multicast addresses array
	bnx2x: fix detection of VLAN filtering feature for VF
	bnx2x: do not rollback VF MAC/VLAN filters we did not configure
	rds: tcp: Sequence teardown of listen and acceptor sockets to avoid races
	ibmvnic: Fix overflowing firmware/hardware TX queue
	ibmvnic: Allocate number of rx/tx buffers agreed on by firmware
	ipv6: reorder icmpv6_init() and ip6_mr_init()
	crypto: s5p-sss - Fix completing crypto request in IRQ handler
	i2c: riic: fix restart condition
	blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue()
	zram: set physical queue limits to avoid array out of bounds accesses
	netfilter: don't track fragmented packets
	axonram: Fix gendisk handling
	drm/amd/amdgpu: fix console deadlock if late init failed
	powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
	EDAC, i5000, i5400: Fix use of MTR_DRAM_WIDTH macro
	EDAC, i5000, i5400: Fix definition of NRECMEMB register
	kbuild: pkg: use --transform option to prefix paths in tar
	coccinelle: fix parallel build with CHECK=scripts/coccicheck
	x86/mpx/selftests: Fix up weird arrays
	mac80211_hwsim: Fix memory leak in hwsim_new_radio_nl()
	gre6: use log_ecn_error module parameter in ip6_tnl_rcv()
	route: also update fnhe_genid when updating a route cache
	route: update fnhe_expires for redirect when the fnhe exists
	drivers/rapidio/devices/rio_mport_cdev.c: fix resource leak in error handling path in 'rio_dma_transfer()'
	lib/genalloc.c: make the avail variable an atomic_long_t
	dynamic-debug-howto: fix optional/omitted ending line number to be LARGE instead of 0
	NFS: Fix a typo in nfs_rename()
	sunrpc: Fix rpc_task_begin trace point
	xfs: fix forgotten rcu read unlock when skipping inode reclaim
	dt-bindings: usb: fix reg-property port-number range
	block: wake up all tasks blocked in get_request()
	sparc64/mm: set fields in deferred pages
	zsmalloc: calling zs_map_object() from irq is a bug
	sctp: do not free asoc when it is already dead in sctp_sendmsg
	sctp: use the right sk after waking up from wait_buf sleep
	bpf: fix lockdep splat
	clk: uniphier: fix DAPLL2 clock rate of Pro5
	atm: horizon: Fix irq release error
	jump_label: Invoke jump_label_test() via early_initcall()
	xfrm: Copy policy family in clone_policy
	IB/mlx4: Increase maximal message size under UD QP
	IB/mlx5: Assign send CQ and recv CQ of UMR QP
	afs: Connect up the CB.ProbeUuid
	Linux 4.9.69

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-12-14 09:58:43 +01:00
Peter Zijlstra
4e4a9ebe33 sched/fair: Make select_idle_cpu() more aggressive
[ Upstream commit 4c77b18cf8 ]

Kitsunyan reported desktop latency issues on his Celeron 887 because
of commit:

  1b568f0aab ("sched/core: Optimize SCHED_SMT")

... even though his CPU doesn't do SMT.

The effect of running the SMT code on a !SMT part is basically a more
aggressive select_idle_cpu(). Removing the avg condition fixed things
for him.

I also know FB likes this test gone, even though other workloads like
having it.

For now, take it out by default, until we get a better idea.

Reported-by: kitsunyan <kitsunyan@inbox.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:28:17 +01:00
Victor Wan
1e547d2935 Merge branch 'android-4.9' into amlogic-4.9-dev 2017-12-02 16:52:23 +08:00
Ke Wang
b763480947 sched: EAS/WALT: Don't take into account of running task's util
For upmigrating misfit running task case, the currently running
task's util has been counted into cpu_util(). Thus currently
__cpu_overutilized() which add task's uitl twice is overestimated.

Change-Id: I5326f4c736a55679009d2e7293f8792311c04294
Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
2017-12-01 16:14:01 +00:00
Joonwoo Park
b41e1ca689 sched: EAS/WALT: take into account of waking task's load
WALT's function cpu_util(cpu) reports CPU's load without taking into
account of waking task's load.  Thus currently cpu_overutilized()
underestimates load on the previous CPU of waking task.

Take into account of task's load to determine whether previous CPU is
overutilzed to bail out early without running energy_diff() which is
expensive.

Change-Id: I30f146984a880ad2cc1b8a4ce35bd239a8c9a607
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(minor rebase conflicts)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 94e5c96507)
[trivial cherry-pick issues]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-12-01 16:11:57 +00:00
Joonwoo Park
4f0693ad08 sched: EAS: upmigrate misfit current task
Upmigrate misfit current task upon scheduler tick with stopper.

We can kick an random (not necessarily big CPU) NOHZ idle CPU when a
CPU bound task is in need of upmigration.  But it's not efficient as that
way needs following unnecessary wakeups:

  1. Busy little CPU A to kick idle B
  2. B runs idle balancer and enqueue migration/A
  3. B goes idle
  4. A runs migration/A, enqueues busy task on B.
  5. B wakes up again.

This change makes active upmigration more efficiently by doing:

  1. Busy little CPU A find target CPU B upon tick.
  2. CPU A enqueues migration/A.

Change-Id: Ie865738054ea3296f28e6ba01710635efa7193c0
[joonwoop: The original version had logic to reserve CPU.  The logic is
 omitted in this version.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(cherry picked from commit 9e293db052)
[trivial cherry-pick issues]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-12-01 16:11:45 +00:00
Prasad Sodagudi
d345372a29 sched: avoid pushing tasks to an offline CPU
Currently active_load_balance_cpu_stop is run by cpu stopper and it
pushes running tasks off the busiest CPU onto idle target CPU. But
there is no check to see whether target cpu is offline or not before
pushing the tasks.  With the introduction of active migration in the
scheduler tick path (see check_for_migration()) there have been
instances of attempts to migrate tasks to offline CPUs.

Add a check as to whether the target cpu is online or not to prevent
scheduling on offline CPUs.

Change-Id: Ib8ac7f8aeabd3ca7365f3eae977075952dab4f21
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(cherry picked from commit dc626b28ee)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-12-01 16:11:35 +00:00
Srivatsa Vaddagiri
70e14af60c sched: Extend active balance to accept 'push_task' argument
Active balance currently picks one task to migrate from busy cpu to
a chosen cpu (push_cpu). This patch extends active load balance to
recognize a particular task ('push_task') that needs to be migrated to
'push_cpu'. This capability will be leveraged by HMP-aware task
placement in a subsequent patch.

Change-Id: If31320111e6cc7044e617b5c3fd6d8e0c0e16952
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
(cherry picked from commit 2da014c0d8)
[ trivial cherry-pick issues, fix "may be used uninitialized" warning
for push_task ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-12-01 16:11:25 +00:00
Joonwoo Park
a8611936de sched/fair: prevent meaningless active migration
At present need_active_balance() determines whether an active
upmigration is needed by using capacity_of(). A CPU's capacity
may be reduced by RT pressure, and therefore distinguishing
capability differences with capacity_of() may lead to suboptimal
active migrations to less capable CPUs. Use capacity_orig_of
to distinguish differently capable CPUs in addition to
capacity_of(), thus avoiding placing tasks on less capable CPUs
due to instantaneous RT pressure.

Change-Id: I3e1435246a8edc3ad618ef98a34866cfbd8c16a5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
[markivx: Reworked the commit text a bit]
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(cherry picked from commit 7ab48e4c8d)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-12-01 16:10:47 +00:00
Joonwoo Park
cd76b21535 sched: EAS: kill incorrect nohz idle cpu kick
EAS won't allow NOHZ idle balancer until CPU's over utilized.  However
nohz_kick_needed() can return true.  This causes idle CPU wake up for
nothing.

Change-Id: I6e548442e29e4f85cda695e4c7101dd591b12fe6
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit 3989a247e2)
[trivial cherry-pick issues]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-11-14 16:38:38 +00:00