Conflicts:
kernel/sched/sched.h
Refactor the changes from LTS commit 62208707b4
("sched/cputime: Fix prev steal time accouting during CPU hotplug")
to align with the changes from AOSP commit dee8fa1552
("sched: backport cpufreq hooks from 4.9-rc4").
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
commit 51d7b12041 upstream.
In commit c4004b02f8 ("x86: remove the kernel code/data/bss resources
from /proc/iomem") I was hoping to remove the phyiscal kernel address
data from /proc/iomem entirely, but that had to be reverted because some
system programs actually use it.
This limits all the detailed resource information to properly
credentialed users instead.
[sumits: this is used in Ubuntu as a fix for CVE-2015-8944]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 96b777452d upstream.
Commit:
2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")
.. moved sched_online_group() from css_online() to css_alloc().
It exposes half-baked task group into global lists before initializing
generic cgroup stuff.
LTP testcase (third in cgroup_regression_test) written for testing
similar race in kernels 2.6.26-2.6.28 easily triggers this oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: kernfs_path_from_node_locked+0x260/0x320
CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
Call Trace:
? kernfs_path_from_node+0x4f/0x60
kernfs_path_from_node+0x3e/0x60
print_rt_rq+0x44/0x2b0
print_rt_stats+0x7a/0xd0
print_cpu+0x2fc/0xe80
? __might_sleep+0x4a/0x80
sched_debug_show+0x17/0x30
seq_read+0xf2/0x3b0
proc_reg_read+0x42/0x70
__vfs_read+0x28/0x130
? security_file_permission+0x9b/0xc0
? rw_verify_area+0x4e/0xb0
vfs_read+0xa5/0x170
SyS_read+0x46/0xa0
entry_SYSCALL_64_fastpath+0x1e/0xad
Here the task group is already linked into the global RCU-protected 'task_groups'
list, but the css->cgroup pointer is still NULL.
This patch reverts this chunk and moves online back to css_online().
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")
Link: http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit db9108e054 upstream.
Hit the kmemleak when executing instance_rmdir, it forgot releasing
mem of tracing_cpumask. With this fix, the warn does not appear any
more.
unreferenced object 0xffff93a8dfaa7c18 (size 8):
comm "mkdir", pid 1436, jiffies 4294763622 (age 9134.308s)
hex dump (first 8 bytes):
ff ff ff ff ff ff ff ff ........
backtrace:
[<ffffffff88b6567a>] kmemleak_alloc+0x4a/0xa0
[<ffffffff8861ea41>] __kmalloc_node+0xf1/0x280
[<ffffffff88b505d3>] alloc_cpumask_var_node+0x23/0x30
[<ffffffff88b5060e>] alloc_cpumask_var+0xe/0x10
[<ffffffff88571ab0>] instance_mkdir+0x90/0x240
[<ffffffff886e5100>] tracefs_syscall_mkdir+0x40/0x70
[<ffffffff886565c9>] vfs_mkdir+0x109/0x1b0
[<ffffffff8865b1d0>] SyS_mkdir+0xd0/0x100
[<ffffffff88403857>] do_syscall_64+0x67/0x150
[<ffffffff88b710e7>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
Link: http://lkml.kernel.org/r/1500546969-12594-1-git-send-email-chuhu@redhat.com
Fixes: ccfe9e42e4 ("tracing: Make tracing_cpumask available for all instances")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2e028c4fe1 upstream.
My static checker complains that if "func" is NULL then "clear_filter"
is uninitialized. This seems like it could be true, although it's
possible something subtle is happening that I haven't seen.
kernel/trace/ftrace.c:3844 match_records()
error: uninitialized symbol 'clear_filter'.
Link: http://lkml.kernel.org/r/20170712073556.h6tkpjcdzjaozozs@mwanda
Fixes: f0a3b154bd ("ftrace: Clarify code for mod command")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 73bb059f9b upstream.
The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.
The current code gets it wrong, as can be readily demonstrated with a
topology like:
node 0 1 2 3
0: 10 20 30 20
1: 20 10 20 30
2: 30 20 10 20
3: 20 30 20 10
Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 (mask: 1), 2, 0
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.
The fixed topology looks like:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 (mask: 1), 2, 0
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
(note: this relies on OVERLAP domains to always have children, this is
true because the regular topology domains are still here -- this is
before degenerate trimming)
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6bdf6abc56 upstream.
Leaking kernel addresses on unpriviledged is generally disallowed,
for example, verifier rejects the following:
0: (b7) r0 = 0
1: (18) r2 = 0xffff897e82304400
3: (7b) *(u64 *)(r1 +48) = r2
R2 leaks addr into ctx
Doing pointer arithmetic on them is also forbidden, so that they
don't turn into unknown value and then get leaked out. However,
there's xadd as a special case, where we don't check the src reg
for being a pointer register, e.g. the following will pass:
0: (b7) r0 = 0
1: (7b) *(u64 *)(r1 +48) = r0
2: (18) r2 = 0xffff897e82304400 ; map
4: (db) lock *(u64 *)(r1 +48) += r2
5: (95) exit
We could store the pointer into skb->cb, loose the type context,
and then read it out from there again to leak it eventually out
of a map value. Or more easily in a different variant, too:
0: (bf) r6 = r1
1: (7a) *(u64 *)(r10 -8) = 0
2: (bf) r2 = r10
3: (07) r2 += -8
4: (18) r1 = 0x0
6: (85) call bpf_map_lookup_elem#1
7: (15) if r0 == 0x0 goto pc+3
R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
8: (b7) r3 = 0
9: (7b) *(u64 *)(r0 +0) = r3
10: (db) lock *(u64 *)(r0 +0) += r6
11: (b7) r0 = 0
12: (95) exit
from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
11: (b7) r0 = 0
12: (95) exit
Prevent this by checking xadd src reg for pointer types. Also
add a couple of test cases related to this.
Fixes: 1be7f75d16 ("bpf: enable non-root eBPF programs")
Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 425fffd886 upstream.
Currently, inputting the following command will succeed but actually the
value will be truncated:
# echo 0x12ffffffff > /proc/sys/net/ipv4/tcp_notsent_lowat
This is not friendly to the user, so instead, we should report error
when the value is larger than UINT_MAX.
Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5380e5644a upstream.
I saw some very confusing sysctl output on my system:
# cat /proc/sys/net/core/xfrm_aevent_rseqth
-2
# cat /proc/sys/net/core/xfrm_aevent_etime
-10
# cat /proc/sys/net/ipv4/tcp_notsent_lowat
-4294967295
Because we forget to set the *negp flag in proc_douintvec, so it will
become a garbage value.
Since the value related to proc_douintvec is always an unsigned integer,
so we can set *negp to false explictily to fix this issue.
Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0d0e57697f ]
The patch fixes two things at once:
1) It checks the env->allow_ptr_leaks and only prints the map address to
the log if we have the privileges to do so, otherwise it just dumps 0
as we would when kptr_restrict is enabled on %pK. Given the latter is
off by default and not every distro sets it, I don't want to rely on
this, hence the 0 by default for unprivileged.
2) Printing of ldimm64 in the verifier log is currently broken in that
we don't print the full immediate, but only the 32 bit part of the
first insn part for ldimm64. Thus, fix this up as well; it's okay to
access, since we verified all ldimm64 earlier already (including just
constants) through replace_map_fd_with_map_ptr().
Fixes: 1be7f75d16 ("bpf: enable non-root eBPF programs")
Fixes: cbd3570086 ("bpf: verifier (add ability to receive verification log)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 62199770
Change-Id: I62ee47d06ddc669ba2863e8cf24f8f3e7683a461
commit 6e5f32f7a4 upstream.
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
the pending sample window time on exit, setting the next update not
one window into the future, but two.
This situation on exiting NO_HZ is described by:
this_rq->calc_load_update < jiffies < calc_load_update
In this scenario, what we should be doing is:
this_rq->calc_load_update = calc_load_update [ next window ]
But what we actually do is:
this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]
This has the effect of delaying load average updates for potentially
up to ~9seconds.
This can result in huge spikes in the load average values due to
per-cpu uninterruptible task counts being out of sync when accumulated
across all CPUs.
It's safe to update the per-cpu active count if we wake between sample
windows because any load that we left in 'calc_load_idle' will have
been zero'd when the idle load was folded in calc_global_load().
This issue is easy to reproduce before,
commit 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
just by forking short-lived process pipelines built from ps(1) and
grep(1) in a loop. I'm unable to reproduce the spikes after that
commit, but the bug still seems to be present from code review.
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ceea5e3771 upstream.
In tests, which excercise switching of clocksources, a NULL
pointer dereference can be observed on AMR64 platforms in the
clocksource read() function:
u64 clocksource_mmio_readl_down(struct clocksource *c)
{
return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
}
This is called from the core timekeeping code via:
cycle_now = tkr->read(tkr->clock);
tkr->read is the cached tkr->clock->read() function pointer.
When the clocksource is changed then tkr->clock and tkr->read
are updated sequentially. The code above results in a sequential
load operation of tkr->read and tkr->clock as well.
If the store to tkr->clock hits between the loads of tkr->read
and tkr->clock, then the old read() function is called with the
new clock pointer. As a consequence the read() function
dereferences a different data structure and the resulting 'reg'
pointer can point anywhere including NULL.
This problem was introduced when the timekeeping code was
switched over to use struct tk_read_base. Before that, it was
theoretically possible as well when the compiler decided to
reload clock in the code sequence:
now = tk->clock->read(tk->clock);
Add a helper function which avoids the issue by reading
tk_read_base->clock once into a local variable clk and then issue
the read function via clk->read(clk). This guarantees that the
read() function always gets the proper clocksource pointer handed
in.
Since there is now no use for the tkr.read pointer, this patch
also removes it, and to address stopping the fast timekeeper
during suspend/resume, it introduces a dummy clocksource to use
rather then just a dummy read function.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Daniel Mentz <danielmentz@google.com>
Link: http://lkml.kernel.org/r/1496965462-20003-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 57db7e4a2d upstream.
Thomas Gleixner wrote:
> The CRIU support added a 'feature' which allows a user space task to send
> arbitrary (kernel) signals to itself. The changelog says:
>
> The kernel prevents sending of siginfo with positive si_code, because
> these codes are reserved for kernel. I think we can allow a task to
> send such a siginfo to itself. This operation should not be dangerous.
>
> Quite contrary to that claim, it turns out that it is outright dangerous
> for signals with info->si_code == SI_TIMER. The following code sequence in
> a user space task allows to crash the kernel:
>
> id = timer_create(CLOCK_XXX, ..... signo = SIGX);
> timer_set(id, ....);
> info->si_signo = SIGX;
> info->si_code = SI_TIMER:
> info->_sifields._timer._tid = id;
> info->_sifields._timer._sys_private = 2;
> rt_[tg]sigqueueinfo(..., SIGX, info);
> sigemptyset(&sigset);
> sigaddset(&sigset, SIGX);
> rt_sigtimedwait(sigset, info);
>
> For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this
> results in a kernel crash because sigwait() dequeues the signal and the
> dequeue code observes:
>
> info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0
>
> which triggers the following callchain:
>
> do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer()
>
> arm_timer() executes a list_add() on the timer, which is already armed via
> the timer_set() syscall. That's a double list add which corrupts the posix
> cpu timer list. As a consequence the kernel crashes on the next operation
> touching the posix cpu timer list.
>
> Posix clocks which are internally implemented based on hrtimers are not
> affected by this because hrtimer_start() can handle already armed timers
> nicely, but it's a reliable way to trigger the WARN_ON() in
> hrtimer_forward(), which complains about calling that function on an
> already armed timer.
This problem has existed since the posix timer code was merged into
2.5.63. A few releases earlier in 2.5.60 ptrace gained the ability to
inject not just a signal (which linux has supported since 1.0) but the
full siginfo of a signal.
The core problem is that the code will reschedule in response to
signals getting dequeued not just for signals the timers sent but
for other signals that happen to a si_code of SI_TIMER.
Avoid this confusion by testing to see if the queued signal was
preallocated as all timer signals are preallocated, and so far
only the timer code preallocates signals.
Move the check for if a timer needs to be rescheduled up into
collect_signal where the preallocation check must be performed,
and pass the result back to dequeue_signal where the code reschedules
timers. This makes it clear why the code cares about preallocated
timers.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reference: 66dd34ad31 ("signal: allow to send any siginfo to itself")
Reference: 1669ce53e2ff ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO")
Fixes: db8b50ba75f2 ("[PATCH] POSIX clocks & timers")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ff86bf0c65 upstream.
The alarmtimer code has another source of potentially rearming itself too
fast. Interval timers with a very samll interval have a similar CPU hog
effect as the previously fixed overflow issue.
The reason is that alarmtimers do not implement the normal protection
against this kind of problem which the other posix timer use:
timer expires -> queue signal -> deliver signal -> rearm timer
This scheme brings the rearming under scheduler control and prevents
permanently firing timers which hog the CPU.
Bringing this scheme to the alarm timer code is a major overhaul because it
lacks all the necessary mechanisms completely.
So for a quick fix limit the interval to one jiffie. This is not
problematic in practice as alarmtimers are usually backed by an RTC for
suspend which have 1 second resolution. It could be therefor argued that
the resolution of this clock should be set to 1 second in general, but
that's outside the scope of this fix.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kostya Serebryany <kcc@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/20170530211655.896767100@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f4781e76f9 upstream.
Andrey reported a alartimer related RCU stall while fuzzing the kernel with
syzkaller.
The reason for this is an overflow in ktime_add() which brings the
resulting time into negative space and causes immediate expiry of the
timer. The following rearm with a small interval does not bring the timer
back into positive space due to the same issue.
This results in a permanent firing alarmtimer which hogs the CPU.
Use ktime_add_safe() instead which detects the overflow and clamps the
result to KTIME_SEC_MAX.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kostya Serebryany <kcc@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/20170530211655.802921648@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add appropriate #ifdef guards to ensure the smp-only easstats structs
are not used when smp is not enabled. Arnd got a report from buildbot,
analysed it, and pointed out exactly what the issue was.
Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Suggested-by: "Arnd Bergmann" <arnd@arndb.de>
Fixes: 4b85765a3d ("sched/fair: Add eas (& cas)
specific rq, sd and task stats")
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc
When EAS is enabled during boot, we have to be careful not to use
schedtune from fair.c before it is ready or it will warn us and we'll
get a traceback in the console.
Change-Id: I1a5cf29b18af626545c636c51219f9ed497c19fa
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
sched_energy_diff tracepoint is in a place where it can never trace
payoff or nrg.delta. If CONFIG_SCHED_TUNE is enabled, put it in
a place where those values exist. If it is not enabled, trace from
the current location
Change-Id: Id5442f2b34ec76625491d27c0f4285433ca12699
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
We use 5 groups everywhere else, this should default to the same.
Change-Id: I05a20bdcf8046ea90a2e36979940cef11246e735
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
When using WALT we always used boosted cpu util for OPP selection.
This is the primary purpose for boosted cpu util, but we hadn't
changed the PELT utilization check to do the same thing.
Fix that here.
Change-Id: Id5ffb26eac23b25fe754255221f6d21b8cededfd
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
sched_group_energy() was supposed to support per-cpu capacity states
(DVFS), however, while fixing a hotplug issue this was broken as we bail
out if there is no SD_SHARE_CAP_STATES flag set.
This patch implements the hotplug race check differently and should
therefore reinstate support for per-cpu capacity states.
Change-Id: I5b865666c9ce833dcfa6514c574580d75aa0a195
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
In some cases, the new_util of a task can be the same on several
CPUs. This causes an issue because the target_util is only updated
if the current new_util is strictly smaller than target_util.
To fix that, the cpu_util_wake() return value is used alongside the
new_util value. If two CPUs compute the same new_util value,
we'll now also look at their cpu_util_wake() return value. In this
case, the CPU that last ran the task will be chosen in priority.
Change-Id: Ia1ea2c4b3ec39621372c2f748862317d5b497723
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
wake_cap performs task and cpu utilization synchronization which is
what allows us to subtract current task util from prev_cpu util and
have a sensible number to work with.
It looks as though if wake_wide returns 0, we could potentially not
execute wake_cap, which would result in unsynced signals we then use
for energy calculations.
This is not necessarily an issue we've seen in traces, but it looks
as though it should be changed.
Change-Id: Ic54a3cba2a10d946ea20113a04371dea04115e82
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
When we place a waking task with find_best_target, we calculate the
existing and new utilisation of each candidate cpu. However, we do
not remove any blocked load resulting from the waking task on the
previous cpu which might cause unnecessary migrations.
Switch to using cpu_util_wake which does this for us, which requires
moving cpu_util_wake a few functions earlier.
Also, we have multiple potential cpu utilization signals here, so
update the necessary bits to allow WALT to work properly (including
not subtracting task util for WALT).
When WALT is in use, cpu utilization is the utilization
in the previous completed window, whilst the task utilization
ignores fully idle windows. There seems to be no way to have a
decently accurate estimate of how much (if any) utilization from
this task remains on the prev cpu.
Instead, just return cpu_util when we're using WALT.
Change-Id: I448203ab98ffb5c020dfb6b218581eef1f5601f7
Reported-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
With the ability to choose between WALT and PELT for utilisation tracking
we can have the situation where we're using WALT to make all the
decisions and reporting PELT figures in the sched_load_avg_(cpu|task)
trace points. This is not too much of an issue, but when analysing trace
it is nice to see numbers representing what the scheduler is using rather
than needing to add in additional sched_walt_* traces to figure it out.
Add reporting for both types, and make the util_avg member reflect what
will be seen from cpu or task_util functions in the scheduler.
Change-Id: I2abbd2c5fa70822096d0f3372b4c12b1c6af1590
Signed-off-by: Chris Redpath <chris.redpath@arm.com>