All of the sched domains will be destroied and then rebuilded when a cpu
online/offline, if a softirq comes after the shced domains are destroied,
this cpu will be stucked in the infinite loop in sched_group_energy(),
because of it can not get the shced donmain in for_each_domain(cpu, sd).
Change-Id: I154cf560e4e1af4a7a2547154ad321e936196ce3
Signed-off-by: Chen Liang <cl@rock-chips.com>
It will cause deadlock and while(1) if call printk while schedule
is in progress. The block state like as below:
cpu0(hold the console sem):
printk->console_unlock->up_sem->spin_lock(&sem->lock)->wake_up_process(cpu1)
->try_to_wake_up(cpu1)->while(p->on_cpu).
cpu1(request console sem):
console_lock->down_sem->schedule->idle_banlance->update_cpu_capacity->
printk->console_trylock->spin_lock(&sem->lock).
p->on_cpu will be 1 forever, because the task is still running on cpu1,
so cpu0 is blocked in while(p->on_cpu), but cpu1 could not get
spin_lock(&sem->lock), it is blocked too, it means the task will running
on cpu1 forever.
Change-Id: I60d02d8c957273872f97939632bdd235accdad4e
Signed-off-by: Chen Liang <cl@rock-chips.com>
policy->governor_data will be use in cpufreq_sched_thread, but it is init
after wake thread, it will cause NULL point access.
Change-Id: I320a3da34560e49f293211be92cb8310d8e395d7
Signed-off-by: Chen Liang <cl@rock-chips.com>
If we do not limit the freqency immediately when the cpu is overheat,
thermal driver will lost the control of temperature. So implement event
CPUFREQ_GOV_LIMIT for governor to limit the freqency immediately.
Change-Id: Id709edd377226417ead92ead2ae3d3d19b3eeabf
Signed-off-by: Chen Liang <cl@rock-chips.com>
LSK 16.06 v4.4-android
* tag 'lsk-v4.4-16.06-android': (447 commits)
Linux 4.4.14
netfilter: x_tables: introduce and use xt_copy_counters_from_user
netfilter: x_tables: do compat validation via translate_table
netfilter: x_tables: xt_compat_match_from_user doesn't need a retval
netfilter: ip6_tables: simplify translate_compat_table args
netfilter: ip_tables: simplify translate_compat_table args
netfilter: arp_tables: simplify translate_compat_table args
netfilter: x_tables: don't reject valid target size on some architectures
netfilter: x_tables: validate all offsets and sizes in a rule
netfilter: x_tables: check for bogus target offset
netfilter: x_tables: check standard target size too
netfilter: x_tables: add compat version of xt_check_entry_offsets
netfilter: x_tables: assert minimum target size
netfilter: x_tables: kill check_entry helper
netfilter: x_tables: add and use xt_check_entry_offsets
netfilter: x_tables: validate targets of jumps
netfilter: x_tables: don't move to non-existent next rule
drm/core: Do not preserve framebuffer on rmfb, v4.
crypto: qat - fix adf_ctl_drv.c:undefined reference to adf_init_pf_wq
netfilter: x_tables: fix unconditional helper
...
When iterating over the irq domain list, we try to match a domain
either by calling a match() function or by comparing a number
of fields passed as parameters.
Both approaches are a bit restrictive:
- match() is DT specific and only takes a device node
- the fallback case only deals with the fwnode_handle
It would be useful if we had a per-domain function that would
actually perform the matching check on the whole of the
irq_fwspec structure. This would allow for a domain to triage
matching attempts that need to extend beyond the fwnode.
Let's introduce irq_find_matching_fwspec(), which takes a full
blown irq_fwspec structure, and call into a select() function
implemented by the irqdomain. irq_find_matching_fwnode() is
made a wrapper around irq_find_matching_fwspec in order to
preserve compatibility.
Change-Id: I07df9af068d114c80cd97b9cb987a70c0e24afda
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: devicetree@vger.kernel.org
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Link: http://lkml.kernel.org/r/1460365075-7316-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Caesar Wang <wxt@rock-chips.com>
(cherry picked from commit 651e8b54ab)
Let's take the (outlandish) example of an interrupt controller
capable of handling both wired interrupts and PCI MSIs.
With the current code, the PCI MSI domain is going to be tagged
with DOMAIN_BUS_PCI_MSI, and the wired domain with DOMAIN_BUS_ANY.
Things get hairy when we start looking up the domain for a wired
interrupt (typically when creating it based on some firmware
information - DT or ACPI).
In irq_create_fwspec_mapping(), we perform the lookup using
DOMAIN_BUS_ANY, which is actually used as a wildcard. This gives
us one chance out of two to end up with the wrong domain, and
we try to configure a wired interrupt with the MSI domain.
Everything grinds to a halt pretty quickly.
What we really need to do is to start looking for a domain that
would uniquely identify a wired interrupt domain, and only use
DOMAIN_BUS_ANY as a fallback.
In order to solve this, let's introduce a new DOMAIN_BUS_WIRED
token, which is going to be used exactly as described above.
Of course, this depends on the irqchip to setup the domain
bus_token, and nobody had to implement this so far.
Only so far.
Change-Id: Ia71c7475354eb38ab9b15423560aa3d28ae16381
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/1453816347-32720-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Caesar Wang <wxt@rock-chips.com>
(cherry picked from commit 530cbe100e)
Since there will be several places checking if fwnode.type
is equal FWNODE_IRQCHIP, this patch adds a convenient function
for this purpose.
Change-Id: I65ab9e1350428de18864ba493256b959efc01f45
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Caesar Wang <wxt@rock-chips.com>
(cherry picked from commit 75aba7b0e9)
commit 29d6455178 upstream.
Until now, hitting this BUG_ON caused a recursive oops (because oops
handling involves do_exit(), which calls into the scheduler, which in
turn raises an oops), which caused stuff below the stack to be
overwritten until a panic happened (e.g. via an oops in interrupt
context, caused by the overwritten CPU index in the thread_info).
Just panic directly.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 612bacad78 ]
Follow-up to commit e27f4a942a ("bpf: Use mount_nodev not mount_ns
to mount the bpf filesystem"), which removes the FS_USERNS_MOUNT flag.
The original idea was to have a per mountns instance instead of a
single global fs instance, but that didn't work out and we had to
switch to mount_nodev() model. The intent of that middle ground was
that we avoid users who don't play nice to create endless instances
of bpf fs which are difficult to control and discover from an admin
point of view, but at the same time it would have allowed us to be
more flexible with regard to namespaces.
Therefore, since we now did the switch to mount_nodev() as a fix
where individual instances are created, we also need to remove userns
mount flag along with it to avoid running into mentioned situation.
I don't expect any breakage at this early point in time with removing
the flag and we can revisit this later should the requirement for
this come up with future users. This and commit e27f4a942a have
been split to facilitate tracking should any of them run into the
unlikely case of causing a regression.
Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e27f4a942a ]
While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
bpf filesystem. Looking at the code I saw a broken usage of mount_ns
with current->nsproxy->mnt_ns. As the code does not acquire a
reference to the mount namespace it can not possibly be correct to
store the mount namespace on the superblock as it does.
Replace mount_ns with mount_nodev so that each mount of the bpf
filesystem returns a distinct instance, and the code is not buggy.
In discussion with Hannes Frederic Sowa it was reported that the use
of mount_ns was an attempt to have one bpf instance per mount
namespace, in an attempt to keep resources that pin resources from
hiding. That intent simply does not work, the vfs is not built to
allow that kind of behavior. Which means that the bpf filesystem
really is buggy both semantically and in it's implemenation as it does
not nor can it implement the original intent.
This change is userspace visible, but my experience with similar
filesystems leads me to believe nothing will break with a model of each
mount of the bpf filesystem is distinct from all others.
Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some information, like driver specific configuration, is found
in the perf event structure. As such pass a 'struct perf_event'
to function setup_aux() rather than just the CPU number so that
individual drivers can make the right configuration when setting
up a session.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
It is entirely possible that some PMUs need specific configuration
that is currently not found in the perf options before a session
can be setup.
It is the case for the CoreSight PMU where a sink needs to be
provided. That sink doesn't fall in any of the current perf
options.
As such this patch adds the capability to receive driver
specific configuration using the existing ioctl() mechanism.
Once the configuration has been pushed down the kernel PMU
callbacks are used to deal with the information sent from user
space.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
When kernel.perf_event_open is set to 3 (or greater), disallow all
access to performance events by users without CAP_SYS_ADMIN.
Add a Kconfig symbol CONFIG_SECURITY_PERF_EVENTS_RESTRICT that
makes this value the default.
This is based on a similar feature in grsecurity
(CONFIG_GRKERNSEC_PERF_HARDEN). This version doesn't include making
the variable read-only. It also allows enabling further restriction
at run-time regardless of whether the default is changed.
https://lkml.org/lkml/2016/1/11/587
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Bug: 29054680
Change-Id: Iff5bff4fc1042e85866df9faa01bce8d04335ab8
* lsk-v4.4-eas-v5.2:
DEBUG: schedtune: add tracepoint for schedtune_tasks_update() values
DEBUG: schedtune: add tracepoint for CPU boost signal
DEBUG: schedtune: add tracepoint for SchedTune configuration update
DEBUG: sched: add energy procfs interface
DEBUG: sched,cpufreq: add cpu_capacity change tracepoint
DEBUG: sched: add tracepoint for CPU load/util signals
DEBUG: sched: add tracepoint for task load/util signals
DEBUG: sched: add tracepoint for cpu/freq scale invariance
sched/fair: filter energy_diff() based on energy_payoff value
sched/tune: add support to compute normalized energy
sched/fair: keep track of energy/capacity variations
sched/fair: add boosted task utilization
sched/{fair,tune}: track RUNNABLE tasks impact on per CPU boost value
sched/tune: compute and keep track of per CPU boost value
sched/tune: add initial support for CGroups based boosting
sched/fair: add boosted CPU usage
sched/fair: add function to convert boost value into "margin"
sched/tune: add sysctl interface to define a boost value
sched/tune: add detailed documentation
fixup! sched/fair: jump to max OPP when crossing UP threshold
fixup! sched: scheduler-driven cpu frequency selection
sched: rt scheduler sets capacity requirement
sched: deadline: use deadline bandwidth in scale_rt_capacity
sched: remove call of sched_avg_update from sched_rt_avg_update
sched/cpufreq_sched: add trace events
sched/fair: jump to max OPP when crossing UP threshold
sched/fair: cpufreq_sched triggers for load balancing
sched/{core,fair}: trigger OPP change request on fork()
sched/fair: add triggers for OPP change requests
sched: scheduler-driven cpu frequency selection
cpufreq: introduce cpufreq_driver_is_slow
sched: Consider misfit tasks when load-balancing
sched: Add group_misfit_task load-balance type
sched: Add per-cpu max capacity to sched_group_capacity
sched: Do eas idle balance regardless of the rq avg idle value
arm64: Enable max freq invariant scheduler load-tracking and capacity support
arm: Enable max freq invariant scheduler load-tracking and capacity support
sched: Update max cpu capacity in case of max frequency constraints
cpufreq: Max freq invariant scheduler load-tracking and cpu capacity support
arm64, topology: Updates to use DT bindings for EAS costing data
sched: Support for extracting EAS energy costs from DT
Documentation: DT bindings for energy model cost data required by EAS
sched: Disable energy-unfriendly nohz kicks
sched: Consider a not over-utilized energy-aware system as balanced
sched: Energy-aware wake-up task placement
sched: Determine the current sched_group idle-state
sched, cpuidle: Track cpuidle state index in the scheduler
sched: Add over-utilization/tipping point indicator
sched: Estimate energy impact of scheduling decisions
sched: Extend sched_group_energy to test load-balancing decisions
sched: Calculate energy consumption of sched_group
sched: Highest energy aware balancing sched_domain level pointer
sched: Relocated cpu_util() and change return type
sched: Compute cpu capacity available at current frequency
arm64: Cpu invariant scheduler load-tracking and capacity support
arm: Cpu invariant scheduler load-tracking and capacity support
sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
sched: Initialize energy data structures
sched: Introduce energy data structures
sched: Make energy awareness a sched feature
sched: Documentation for scheduler energy cost model
sched: Prevent unnecessary active balance of single task in sched group
sched: Enable idle balance to pull single task towards cpu with higher capacity
sched: Consider spare cpu capacity at task wake-up
sched: Add cpu capacity awareness to wakeup balancing
sched: Store system-wide maximum cpu capacity in root domain
arm: Update arch_scale_cpu_capacity() to reflect change to define
arm64: Enable frequency invariant scheduler load-tracking support
arm: Enable frequency invariant scheduler load-tracking support
cpufreq: Frequency invariant scheduler load-tracking support
sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task()
Conflicts:
drivers/cpufreq/Kconfig
include/linux/cpufreq.h
include/trace/events/power.h
Change-Id: I0efa846911ea6b8d3f458115529cf67be73858e3
Signed-off-by: Huang, Tao <huangtao@rock-chips.com>
commit 759c01142a upstream.
On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.
This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.
The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).
Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Moritz Muehlenhoff <moritz@wikimedia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bf959931dd upstream.
The following program (simplified version of generated by syzkaller)
#include <pthread.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <stdio.h>
#include <signal.h>
void *thread_func(void *arg)
{
ptrace(PTRACE_TRACEME, 0,0,0);
return 0;
}
int main(void)
{
pthread_t thread;
if (fork())
return 0;
while (getppid() != 1)
;
pthread_create(&thread, NULL, thread_func, NULL);
pthread_join(thread, NULL);
return 0;
}
creates an unreapable zombie if /sbin/init doesn't use __WALL.
This is not a kernel bug, at least in a sense that everything works as
expected: debugger should reap a traced sub-thread before it can reap the
leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
Unfortunately, it seems that /sbin/init in most (all?) distributions
doesn't use it and we have to change the kernel to avoid the problem.
Note also that most init's use sys_waitid() which doesn't allow __WALL, so
the necessary user-space fix is not that trivial.
This patch just adds the "ptrace" check into eligible_child(). To some
degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal is
mostly ignored when the tracee reports to debugger. Or WSTOPPED, the
tracer doesn't need to set this flag to wait for the stopped tracee.
This obviously means the user-visible change: __WCLONE and __WALL no
longer have any meaning for debugger. And I can only hope that this won't
break something, but at least strace/gdb won't suffer.
We could make a more conservative change. Say, we can take __WCLONE into
account, or !thread_group_leader(). But it would be nice to not
complicate these historical/confusing checks.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We are currently using asynchronous deallocation in the error path in
AUX mmap code, which is unnecessary and also presents a problem for users
that wish to probe for the biggest possible buffer size they can get:
they'll get -EINVAL on all subsequent attemts to allocate a smaller
buffer before the asynchronous deallocation callback frees up the pages
from the previous unsuccessful attempt.
Currently, gdb does that for allocating AUX buffers for Intel PT traces.
More specifically, overwrite mode of AUX pmus that don't support hardware
sg (some implementations of Intel PT, for instance) is limited to only
one contiguous high order allocation for its buffer and there is no way
of knowing its size without trying.
This patch changes error path freeing to be synchronous as there won't
be any contenders for the AUX pages at that point.
Reported-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1453216469-9509-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 45c815f06b)
commit 20878232c5 upstream.
Systems show a minimal load average of 0.00, 0.01, 0.05 even when they
have no load at all.
Uptime and /proc/loadavg on all systems with kernels released during the
last five years up until kernel version 4.6-rc5, show a 5- and 15-minute
minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on
idle systems, but the way the kernel calculates this value prevents it
from getting lower than the mentioned values.
Likewise but not as obviously noticeable, a fully loaded system with no
processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95
(multiplied by number of cores).
Once the (old) load becomes 93 or higher, it mathematically can never
get lower than 93, even when the active (load) remains 0 forever.
This results in the strange 0.00, 0.01, 0.05 uptime values on idle
systems. Note: 93/2048 = 0.0454..., which rounds up to 0.05.
It is not correct to add a 0.5 rounding (=1024/2048) here, since the
result from this function is fed back into the next iteration again,
so the result of that +0.5 rounding value then gets multiplied by
(2048-2037), and then rounded again, so there is a virtual "ghost"
load created, next to the old and active load terms.
By changing the way the internally kept value is rounded, that internal
value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon
increasing load, the internally kept load value is rounded up, when the
load is decreasing, the load value is rounded down.
The modified code was tested on nohz=off and nohz kernels. It was tested
on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was
tested on single, dual, and octal cores system. It was tested on virtual
hosts and bare hardware. No unwanted effects have been observed, and the
problems that the patch intended to fix were indeed gone.
Tested-by: Damien Wyart <damien.wyart@free.fr>
Signed-off-by: Vik Heyndrickx <vik.heyndrickx@veribox.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Doug Smythies <dsmythies@telus.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0f004f5a69 ("sched: Cure more NO_HZ load average woes")
Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 59643d1535 upstream.
If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE
then the DIV_ROUND_UP() will return zero.
Here's the details:
# echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb
tracing_entries_write() processes this and converts kb to bytes.
18014398509481980 << 10 = 18446744073709547520
and this is passed to ring_buffer_resize() as unsigned long size.
size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
Where DIV_ROUND_UP(a, b) is (a + b - 1)/b
BUF_PAGE_SIZE is 4080 and here
18446744073709547520 + 4080 - 1 = 18446744073709551599
where 18446744073709551599 is still smaller than 2^64
2^64 - 18446744073709551599 = 17
But now 18446744073709551599 / 4080 = 4521260802379792
and size = size * 4080 = 18446744073709551360
This is checked to make sure its still greater than 2 * 4080,
which it is.
Then we convert to the number of buffer pages needed.
nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE)
but this time size is 18446744073709551360 and
2^64 - (18446744073709551360 + 4080 - 1) = -3823
Thus it overflows and the resulting number is less than 4080, which makes
3823 / 4080 = 0
an nr_pages is set to this. As we already checked against the minimum that
nr_pages may be, this causes the logic to fail as well, and we crash the
kernel.
There's no reason to have the two DIV_ROUND_UP() (that's just result of
historical code changes), clean up the code and fix this bug.
Fixes: 83f40318da ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9b94a8fba5 upstream.
The size variable to change the ring buffer in ftrace is a long. The
nr_pages used to update the ring buffer based on the size is int. On 64 bit
machines this can cause an overflow problem.
For example, the following will cause the ring buffer to crash:
# cd /sys/kernel/debug/tracing
# echo 10 > buffer_size_kb
# echo 8556384240 > buffer_size_kb
Then you get the warning of:
WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260
Which is:
RB_WARN_ON(cpu_buffer, nr_removed);
Note each ring buffer page holds 4080 bytes.
This is because:
1) 10 causes the ring buffer to have 3 pages.
(10kb requires 3 * 4080 pages to hold)
2) (2^31 / 2^10 + 1) * 4080 = 8556384240
The value written into buffer_size_kb is shifted by 10 and then passed
to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760
3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE
which is 4080. 8761737461760 / 4080 = 2147484672
4) nr_pages is subtracted from the current nr_pages (3) and we get:
2147484669. This value is saved in a signed integer nr_pages_to_update
5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int
turns into the value of -2147482627
6) As the value is a negative number, in update_pages_handler() it is
negated and passed to rb_remove_pages() and 2147482627 pages will
be removed, which is much larger than 3 and it causes the warning
because not all the pages asked to be removed were removed.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001
Fixes: 7a8e76a382 ("tracing: unified trace buffer")
Reported-by: Hao Qin <QEver.cn@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 79c9ce57eb upstream.
Jann reported that the ptrace_may_access() check in
find_lively_task_by_vpid() is racy against exec().
Specifically:
perf_event_open() execve()
ptrace_may_access()
commit_creds()
... if (get_dumpable() != SUID_DUMP_USER)
perf_event_exit_task();
perf_install_in_context()
would result in installing a counter across the creds boundary.
Fix this by wrapping lots of perf_event_open() in cred_guard_mutex.
This should be fine as perf_event_exit_task() is already called with
cred_guard_mutex held, so all perf locks already nest inside it.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: He Kuang <hekuang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f7c17d26f4 upstream.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
Modules linked in:
CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
Call Trace:
dump_stack+0x89/0xd4
__warn+0xfd/0x120
warn_slowpath_null+0x1d/0x20
rebind_workers+0x1c0/0x1d0
workqueue_cpu_up_callback+0xf5/0x1d0
notifier_call_chain+0x64/0x90
? trace_hardirqs_on_caller+0xf2/0x220
? notify_prepare+0x80/0x80
__raw_notifier_call_chain+0xe/0x10
__cpu_notify+0x35/0x50
notify_down_prepare+0x5e/0x80
? notify_prepare+0x80/0x80
cpuhp_invoke_callback+0x73/0x330
? __schedule+0x33e/0x8a0
cpuhp_down_callbacks+0x51/0xc0
cpuhp_thread_fun+0xc1/0xf0
smpboot_thread_fn+0x159/0x2a0
? smpboot_create_threads+0x80/0x80
kthread+0xef/0x110
? wait_for_completion+0xf0/0x120
? schedule_tail+0x35/0xf0
ret_from_fork+0x22/0x50
? __init_kthread_worker+0x70/0x70
---[ end trace eb12ae47d2382d8f ]---
notify_down_prepare: attempt to take down CPU 0 failed
This bug can be reproduced by below config w/ nohz_full= all cpus:
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
CONFIG_DEBUG_HOTPLUG_CPU0=y
CONFIG_NO_HZ_FULL=y
As Thomas pointed out:
| If a down prepare callback fails, then DOWN_FAILED is invoked for all
| callbacks which have successfully executed DOWN_PREPARE.
|
| But, workqueue has actually two notifiers. One which handles
| UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
|
| Now look at the priorities of those callbacks:
|
| CPU_PRI_WORKQUEUE_UP = 5
| CPU_PRI_WORKQUEUE_DOWN = -5
|
| So the call order on DOWN_PREPARE is:
|
| CB 1
| CB ...
| CB workqueue_up() -> Ignores DOWN_PREPARE
| CB ...
| CB X ---> Fails
|
| So we call up to CB X with DOWN_FAILED
|
| CB 1
| CB ...
| CB workqueue_up() -> Handles DOWN_FAILED
| CB ...
| CB X-1
|
| So the problem is that the workqueue stuff handles DOWN_FAILED in the up
| callback, while it should do it in the down callback. Which is not a good idea
| either because it wants to be called early on rollback...
|
| Brilliant stuff, isn't it? The hotplug rework will solve this problem because
| the callbacks become symetric, but for the existing mess, we need some
| workaround in the workqueue code.
The boot CPU handles housekeeping duty(unbound timers, workqueues,
timekeeping, ...) on behalf of full dynticks CPUs. It must remain
online when nohz full is enabled. There is a priority set to every
notifier_blocks:
workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
So tick_nohz_cpu_down callback failed when down prepare cpu 0, and
notifier_blocks behind tick_nohz_cpu_down will not be called any
more, which leads to workers are actually not unbound. Then hotplug
state machine will fallback to undo and online cpu 0 again. Workers
will be rebound unconditionally even if they are not unbound and
trigger the warning in this progress.
This patch fix it by catching !DISASSOCIATED to avoid rebind bound
workers.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6aff67c85c ]
The commit 35578d7984 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
introduced clever way to check bpf_helper<->map_type compatibility.
Later on commit a43eec3042 ("bpf: introduce bpf_perf_event_output() helper") adjusted
the logic and inadvertently broke it.
Get rid of the clever bool compare and go back to two-way check
from map and from helper perspective.
Fixes: a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 92117d8443 ]
On a system with >32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK,
the malicious application may overflow 32-bit bpf program refcnt.
It's also possible to overflow map refcnt on 1Tb system.
Impose 32k hard limit which means that the same bpf program or
map cannot be shared by more than 32k processes.
Fixes: 1be7f75d16 ("bpf: enable non-root eBPF programs")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8358b02bf6 ]
When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
references a non-map file descriptor as a map file descriptor, the error
handling code called fdput() twice instead of once (in __bpf_map_get() and
in replace_map_fd_with_map_ptr()). If the file descriptor table of the
current task is shared, this causes f_count to be decremented too much,
allowing the struct file to be freed while it is still in use
(use-after-free). This can be exploited to gain root privileges by an
unprivileged user.
This bug was introduced in
commit 0246e64d9a ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
exploitable since
commit 1be7f75d16 ("bpf: enable non-root eBPF programs") because
previously, CAP_SYS_ADMIN was required to reach the vulnerable code.
(posted publicly according to request by maintainer)
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d82bccc690 ]
verifier must check for reserved size bits in instruction opcode and
reject BPF_LD | BPF_ABS | BPF_DW and BPF_LD | BPF_IND | BPF_DW instructions,
otherwise interpreter will WARN_RATELIMIT on them during execution.
Fixes: ddd872bc30 ("bpf: verifier: add checks for BPF_ABS | BPF_IND instructions")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* linux-linaro-lsk-v4.4-android: (797 commits)
parisc: Use generic extable search and sort routines
arm64: kasan: Use actual memory node when populating the kernel image shadow
arm64: mm: treat memstart_addr as a signed quantity
arm64: lse: deal with clobbered IP registers after branch via PLT
arm64: mm: check at build time that PAGE_OFFSET divides the VA space evenly
arm64: kasan: Fix zero shadow mapping overriding kernel image shadow
arm64: consistently use p?d_set_huge
arm64: fix KASLR boot-time I-cache maintenance
arm64: hugetlb: partial revert of 66b3923a1a
arm64: make irq_stack_ptr more robust
arm64: efi: invoke EFI_RNG_PROTOCOL to supply KASLR randomness
efi: stub: use high allocation for converted command line
efi: stub: add implementation of efi_random_alloc()
efi: stub: implement efi_get_random_bytes() based on EFI_RNG_PROTOCOL
arm64: kaslr: randomize the linear region
arm64: add support for kernel ASLR
arm64: add support for building vmlinux as a relocatable PIE binary
arm64: switch to relative exception tables
extable: add support for relative extables to search and sort routines
scripts/sortextable: add support for ET_DYN binaries
...
Conflicts:
arch/arm64/mm/dma-mapping.c
drivers/clk/rockchip/clk-rk3368.c
drivers/mmc/core/core.c
drivers/mmc/core/sdio.c
include/linux/dcache.h
Change-Id: Ibaa1e90ac735db8d9f5e542c266ef27b91616ef4