Commit Graph

26039 Commits

Author SHA1 Message Date
Steven Rostedt (VMware)
234bc12669 tracing: Remove extra zeroing out of the ring buffer page
commit 6b7e633fe9 upstream.

The ring_buffer_read_page() takes care of zeroing out any extra data in the
page that it returns. There's no need to zero it out again from the
consumer. It was removed from one consumer of this function, but
read_buffers_splice_read() did not remove it, and worse, it contained a
nasty bug because of it.

Fixes: 2711ca237a ("ring-buffer: Move zeroing out excess in page to ring buffer code")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02 20:30:54 +01:00
Thomas Gleixner
ee8e8b2df6 arch, mm: Allow arch_dup_mmap() to fail
commit c10e83f598 upstream.

In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-29 17:53:43 +01:00
Daniel Borkmann
de31796c05 bpf: fix integer overflows
From: Alexei Starovoitov <ast@kernel.org>

[ Upstream commit bb7f0f989c ]

There were various issues related to the limited size of integers used in
the verifier:
 - `off + size` overflow in __check_map_access()
 - `off + reg->off` overflow in check_mem_access()
 - `off + reg->var_off.value` overflow or 32-bit truncation of
   `reg->var_off.value` in check_mem_access()
 - 32-bit truncation in check_stack_boundary()

Make sure that any integer math cannot overflow by not allowing
pointer math with large values.

Also reduce the scope of "scalar op scalar" tracking.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:33 +01:00
Daniel Borkmann
cb56cc1b29 bpf: don't prune branches when a scalar is replaced with a pointer
From: Jann Horn <jannh@google.com>

[ Upstream commit 179d1c5602 ]

This could be made safe by passing through a reference to env and checking
for env->allow_ptr_leaks, but it would only work one way and is probably
not worth the hassle - not doing it will not directly lead to program
rejection.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:32 +01:00
Daniel Borkmann
c90268f7cb bpf: force strict alignment checks for stack pointers
From: Jann Horn <jannh@google.com>

[ Upstream commit a5ec6ae161 ]

Force strict alignment checks for stack pointers because the tracking of
stack spills relies on it; unaligned stack accesses can lead to corruption
of spilled registers, which is exploitable.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:32 +01:00
Daniel Borkmann
2120fca0ec bpf: fix missing error return in check_stack_boundary()
From: Jann Horn <jannh@google.com>

Prevent indirect stack accesses at non-constant addresses, which would
permit reading and corrupting spilled pointers.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:32 +01:00
Daniel Borkmann
6c8e098d03 bpf: fix 32-bit ALU op verification
From: Jann Horn <jannh@google.com>

[ Upstream commit 468f6eafa6 ]

32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
Adjust the verifier accordingly.

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:32 +01:00
Daniel Borkmann
bf5ee24e87 bpf: fix incorrect tracking of register size truncation
From: Jann Horn <jannh@google.com>

[ Upstream commit 0c17d1d2c6 ]

Properly handle register truncation to a smaller size.

The old code first mirrors the clearing of the high 32 bits in the bitwise
tristate representation, which is correct. But then, it computes the new
arithmetic bounds as the intersection between the old arithmetic bounds and
the bounds resulting from the bitwise tristate representation. Therefore,
when coerce_reg_to_32() is called on a number with bounds
[0xffff'fff8, 0x1'0000'0007], the verifier computes
[0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
This is incorrect: The truncated number could also be in the range [0, 7],
and no meaningful arithmetic bounds can be computed in that case apart from
the obvious [0, 0xffff'ffff].

Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.

Debian assigned CVE-2017-16996 for this issue.

v2:
 - flip the mask during arithmetic bounds calculation (Ben Hutchings)
v3:
 - add CVE number (Ben Hutchings)

Fixes: b03c9f9fdc ("bpf/verifier: track signed and unsigned min/max values")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:32 +01:00
Daniel Borkmann
6e12ea4fb4 bpf: fix incorrect sign extension in check_alu_op()
From: Jann Horn <jannh@google.com>

[ Upstream commit 95a762e2c8 ]

Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.

Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.

Debian assigned CVE-2017-16995 for this issue.

v3:
 - add CVE number (Ben Hutchings)

Fixes: 484611357c ("bpf: allow access into map value arrays")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:32 +01:00
Daniel Borkmann
4d54f7df51 bpf/verifier: fix bounds calculation on BPF_RSH
From: Edward Cree <ecree@solarflare.com>

[ Upstream commit 4374f256ce ]

Incorrect signed bounds were being computed.
If the old upper signed bound was positive and the old lower signed bound was
negative, this could cause the new upper signed bound to be too low,
leading to security issues.

Fixes: b03c9f9fdc ("bpf/verifier: track signed and unsigned min/max values")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
[jannh@google.com: changed description to reflect bug impact]
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:32 +01:00
Daniel Borkmann
a23244e884 bpf: fix corruption on concurrent perf_event_output calls
[ Upstream commit 283ca526a9 ]

When tracing and networking programs are both attached in the
system and both use event-output helpers that eventually call
into perf_event_output(), then we could end up in a situation
where the tracing attached program runs in user context while
a cls_bpf program is triggered on that same CPU out of softirq
context.

Since both rely on the same per-cpu perf_sample_data, we could
potentially corrupt it. This can only ever happen in a combination
of the two types; all tracing programs use a bpf_prog_active
counter to bail out in case a program is already running on
that CPU out of a different context. XDP and cls_bpf programs
by themselves don't have this issue as they run in the same
context only. Therefore, split both perf_sample_data so they
cannot be accessed from each other.

Fixes: 20b9d7ac48 ("bpf: avoid excessive stack usage for perf_sample_data")
Reported-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:31 +01:00
Daniel Borkmann
2b3ea8ceb2 bpf: fix branch pruning logic
From: Alexei Starovoitov <ast@fb.com>

[ Upstream commit c131187db2 ]

when the verifier detects that register contains a runtime constant
and it's compared with another constant it will prune exploration
of the branch that is guaranteed not to be taken at runtime.
This is all correct, but malicious program may be constructed
in such a way that it always has a constant comparison and
the other branch is never taken under any conditions.
In this case such path through the program will not be explored
by the verifier. It won't be taken at run-time either, but since
all instructions are JITed the malicious program may cause JITs
to complain about using reserved fields, etc.
To fix the issue we have to track the instructions explored by
the verifier and sanitize instructions that are dead at run time
with NOPs. We cannot reject such dead code, since llvm generates
it for valid C code, since it doesn't do as much data flow
analysis as the verifier does.

Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:31 +01:00
Tom Zanussi
16e1626e54 tracing: Exclude 'generic fields' from histograms
[ Upstream commit a15f7fc203 ]

There are a small number of 'generic fields' (comm/COMM/cpu/CPU) that
are found by trace_find_event_field() but are only meant for
filtering.  Specifically, they unlike normal fields, they have a size
of 0 and thus wreak havoc when used as a histogram key.

Exclude these (return -EINVAL) when used as histogram keys.

Link: http://lkml.kernel.org/r/956154cbc3e8a4f0633d619b886c97f0f0edf7b4.1506105045.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:27 +01:00
Will Deacon
5383f45db3 locking/barriers: Convert users of lockless_dereference() to READ_ONCE()
commit 3382290ed2 upstream.

[ Note, this is a Git cherry-pick of the following commit:

    506458efaf ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

  ... for easier x86 PTI code testing and back-porting. ]

READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:21 +01:00
Rajat Jain
32f733a8f4 PM / s2idle: Clear the events_check_enabled flag
[ Upstream commit 95b982b451 ]

Problem: This flag does not get cleared currently in the suspend or
resume path in the following cases:

 * In case some driver's suspend routine returns an error.
 * Successful s2idle case
 * etc?

Why is this a problem: What happens is that the next suspend attempt
could fail even though the user did not enable the flag by writing to
/sys/power/wakeup_count. This is 1 use case how the issue can be seen
(but similar use case with driver suspend failure can be thought of):

 1. Read /sys/power/wakeup_count
 2. echo count > /sys/power/wakeup_count
 3. echo freeze > /sys/power/wakeup_count
 4. Let the system suspend, and wakeup the system using some wake source
    that calls pm_wakeup_event() e.g. power button or something.
 5. Note that the combined wakeup count would be incremented due
    to the pm_wakeup_event() in the resume path.
 6. After resuming the events_check_enabled flag is still set.

At this point if the user attempts to freeze again (without writing to
/sys/power/wakeup_count), the suspend would fail even though there has
been no wake event since the past resume.

Address that by clearing the flag just before a resume is completed,
so that it is always cleared for the corner cases mentioned above.

Signed-off-by: Rajat Jain <rajatja@google.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:23 +01:00
Thomas Gleixner
3df23f7ce7 posix-timer: Properly check sigevent->sigev_notify
commit cef31d9af9 upstream.

timer_create() specifies via sigevent->sigev_notify the signal delivery for
the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD
and (SIGEV_SIGNAL | SIGEV_THREAD_ID).

The sanity check in good_sigevent() is only checking the valid combination
for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is
not set it accepts any random value.

This has no real effects on the posix timer and signal delivery code, but
it affects show_timer() which handles the output of /proc/$PID/timers. That
function uses a string array to pretty print sigev_notify. The access to
that array has no bound checks, so random sigev_notify cause access beyond
the array bounds.

Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID
masking from various code pathes as SIGEV_NONE can never be set in
combination with SIGEV_THREAD_ID.

Reported-by: Eric Biggers <ebiggers3@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:21 +01:00
Steven Rostedt
282e4b259d sched/rt: Do not pull from current CPU if only one CPU to pull
commit f73c52a5bc upstream.

Daniel Wagner reported a crash on the BeagleBone Black SoC.

This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.

As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.

Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.

Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.

Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.

Reported-by: Daniel Wagner <wagi@monom.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:21 +01:00
Changbin Du
b8582c0f79 tracing: Allocate mask_str buffer dynamically
commit 90e406f96f upstream.

The default NR_CPUS can be very large, but actual possible nr_cpu_ids
usually is very small. For my x86 distribution, the NR_CPUS is 8192 and
nr_cpu_ids is 4. About 2 pages are wasted.

Most machines don't have so many CPUs, so define a array with NR_CPUS
just wastes memory. So let's allocate the buffer dynamically when need.

With this change, the mutext tracing_cpumask_update_lock also can be
removed now, which was used to protect mask_str.

Link: http://lkml.kernel.org/r/1512013183-19107-1-git-send-email-changbin.du@intel.com

Fixes: 36dfe9252b ("ftrace: make use of tracing_cpumask")
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:18 +01:00
Thiago Rafael Becker
c5d9b78d53 kernel: make groups_sort calling a responsibility group_info allocators
commit bdcf0a423e upstream.

In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.

This patch:
 - Make groups_sort globally visible.
 - Move the call to groups_sort to the modifiers of group_info
 - Remove the call to groups_sort from set_groups

Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:18 +01:00
Paul Moore
0ad0bb6016 audit: ensure that 'audit=1' actually enables audit for PID 1
[ Upstream commit 173743dd99 ]

Prior to this patch we enabled audit in audit_init(), which is too
late for PID 1 as the standard initcalls are run after the PID 1 task
is forked.  This means that we never allocate an audit_context (see
audit_alloc()) for PID 1 and therefore miss a lot of audit events
generated by PID 1.

This patch enables audit as early as possible to help ensure that when
PID 1 is forked it can allocate an audit_context if required.

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-17 15:08:00 +01:00
Steve Grubb
4086f7cf0c audit: Allow auditd to set pid to 0 to end auditing
[ Upstream commit 33e8a90780 ]

The API to end auditing has historically been for auditd to set the
pid to 0. This patch restores that functionality.

See: https://github.com/linux-audit/audit-kernel/issues/69

Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-17 15:08:00 +01:00
Jason Baron
2de359062f jump_label: Invoke jump_label_test() via early_initcall()
[ Upstream commit 92ee46efeb ]

Fengguang Wu reported that running the rcuperf test during boot can cause
the jump_label_test() to hit a WARN_ON(). The issue is that the core jump
label code relies on kernel_text_address() to detect when it can no longer
update branches that may be contained in __init sections. The
kernel_text_address() in turn assumes that if the system_state variable is
greter than or equal to SYSTEM_RUNNING then __init sections are no longer
valid (since the assumption is that they have been freed). However, when
rcuperf is setup to run in early boot it can call kernel_power_off() which
sets the system_state to SYSTEM_POWER_OFF.

Since rcuperf initialization is invoked via a module_init(), we can make
the dependency of jump_label_test() needing to complete before rcuperf
explicit by calling it via early_initcall().

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1510609727-2238-1-git-send-email-jbaron@akamai.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:53:13 +01:00
Eric Dumazet
b316280c81 bpf: fix lockdep splat
[ Upstream commit 89ad2fa3f0 ]

pcpu_freelist_pop() needs the same lockdep awareness than
pcpu_freelist_populate() to avoid a false positive.

 [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]

 switchto-defaul/12508 [HC0[0]:SC0[6]:HE0:SE0] is trying to acquire:
  (&htab->buckets[i].lock){......}, at: [<ffffffff9dc099cb>] __htab_percpu_map_update_elem+0x1cb/0x300

 and this task is already holding:
  (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}, at: [<ffffffff9e135848>] __dev_queue_xmit+0
x868/0x1240
 which would create a new lock dependency:
  (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} -> (&htab->buckets[i].lock){......}

 but this new dependency connects a SOFTIRQ-irq-safe lock:
  (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}
 ... which became SOFTIRQ-irq-safe at:
   [<ffffffff9db5931b>] __lock_acquire+0x42b/0x1f10
   [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
   [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
   [<ffffffff9e135848>] __dev_queue_xmit+0x868/0x1240
   [<ffffffff9e136240>] dev_queue_xmit+0x10/0x20
   [<ffffffff9e1965d9>] ip_finish_output2+0x439/0x590
   [<ffffffff9e197410>] ip_finish_output+0x150/0x2f0
   [<ffffffff9e19886d>] ip_output+0x7d/0x260
   [<ffffffff9e19789e>] ip_local_out+0x5e/0xe0
   [<ffffffff9e197b25>] ip_queue_xmit+0x205/0x620
   [<ffffffff9e1b8398>] tcp_transmit_skb+0x5a8/0xcb0
   [<ffffffff9e1ba152>] tcp_write_xmit+0x242/0x1070
   [<ffffffff9e1baffc>] __tcp_push_pending_frames+0x3c/0xf0
   [<ffffffff9e1b3472>] tcp_rcv_established+0x312/0x700
   [<ffffffff9e1c1acc>] tcp_v4_do_rcv+0x11c/0x200
   [<ffffffff9e1c3dc2>] tcp_v4_rcv+0xaa2/0xc30
   [<ffffffff9e191107>] ip_local_deliver_finish+0xa7/0x240
   [<ffffffff9e191a36>] ip_local_deliver+0x66/0x200
   [<ffffffff9e19137d>] ip_rcv_finish+0xdd/0x560
   [<ffffffff9e191e65>] ip_rcv+0x295/0x510
   [<ffffffff9e12ff88>] __netif_receive_skb_core+0x988/0x1020
   [<ffffffff9e130641>] __netif_receive_skb+0x21/0x70
   [<ffffffff9e1306ff>] process_backlog+0x6f/0x230
   [<ffffffff9e132129>] net_rx_action+0x229/0x420
   [<ffffffff9da07ee8>] __do_softirq+0xd8/0x43d
   [<ffffffff9e282bcc>] do_softirq_own_stack+0x1c/0x30
   [<ffffffff9dafc2f5>] do_softirq+0x55/0x60
   [<ffffffff9dafc3a8>] __local_bh_enable_ip+0xa8/0xb0
   [<ffffffff9db4c727>] cpu_startup_entry+0x1c7/0x500
   [<ffffffff9daab333>] start_secondary+0x113/0x140

 to a SOFTIRQ-irq-unsafe lock:
  (&head->lock){+.+...}
 ... which became SOFTIRQ-irq-unsafe at:
 ...  [<ffffffff9db5971f>] __lock_acquire+0x82f/0x1f10
   [<ffffffff9db5b32c>] lock_acquire+0xbc/0x1b0
   [<ffffffff9da05e38>] _raw_spin_lock+0x38/0x50
   [<ffffffff9dc0b7fa>] pcpu_freelist_pop+0x7a/0xb0
   [<ffffffff9dc08b2c>] htab_map_alloc+0x50c/0x5f0
   [<ffffffff9dc00dc5>] SyS_bpf+0x265/0x1200
   [<ffffffff9e28195f>] entry_SYSCALL_64_fastpath+0x12/0x17

 other info that might help us debug this:

 Chain exists of:
   dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2 --> &htab->buckets[i].lock --> &head->lock

  Possible interrupt unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&head->lock);
                                local_irq_disable();
                                lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
                                lock(&htab->buckets[i].lock);
   <Interrupt>
     lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);

  *** DEADLOCK ***

Fixes: e19494edab ("bpf: introduce percpu_freelist")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:53:11 +01:00
Joe Lawrence
30c2f774e1 pipe: match pipe_max_size data type with procfs
[ Upstream commit 98159d977f ]

Patch series "A few round_pipe_size() and pipe-max-size fixups", v3.

While backporting Michael's "pipe: fix limit handling" patchset to a
distro-kernel, Mikulas noticed that current upstream pipe limit handling
contains a few problems:

  1 - procfs signed wrap: echo'ing a large number into
      /proc/sys/fs/pipe-max-size and then cat'ing it back out shows a
      negative value.

  2 - round_pipe_size() nr_pages overflow on 32bit:  this would
      subsequently try roundup_pow_of_two(0), which is undefined.

  3 - visible non-rounded pipe-max-size value: there is no mutual
      exclusion or protection between the time pipe_max_size is assigned
      a raw value from proc_dointvec_minmax() and when it is rounded.

  4 - unsigned long -> unsigned int conversion makes for potential odd
      return errors from do_proc_douintvec_minmax_conv() and
      do_proc_dopipe_max_size_conv().

This version underwent the same testing as v1:
https://marc.info/?l=linux-kernel&m=150643571406022&w=2

This patch (of 4):

pipe_max_size is defined as an unsigned int:

  unsigned int pipe_max_size = 1048576;

but its procfs/sysctl representation is an integer:

  static struct ctl_table fs_table[] = {
          ...
          {
                  .procname       = "pipe-max-size",
                  .data           = &pipe_max_size,
                  .maxlen         = sizeof(int),
                  .mode           = 0644,
                  .proc_handler   = &pipe_proc_fn,
                  .extra1         = &pipe_min_size,
          },
          ...

that is signed:

  int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
                   size_t *lenp, loff_t *ppos)
  {
          ...
          ret = proc_dointvec_minmax(table, write, buf, lenp, ppos)

This leads to signed results via procfs for large values of pipe_max_size:

  % echo 2147483647 >/proc/sys/fs/pipe-max-size
  % cat /proc/sys/fs/pipe-max-size
  -2147483648

Use unsigned operations on this variable to avoid such negative values.

Link: http://lkml.kernel.org/r/1507658689-11669-2-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:53:08 +01:00
Daniel Thompson
425704be09 kdb: Fix handling of kallsyms_symbol_next() return value
commit c07d353380 upstream.

kallsyms_symbol_next() returns a boolean (true on success). Currently
kdb_read() tests the return value with an inequality that
unconditionally evaluates to true.

This is fixed in the obvious way and, since the conditional branch is
supposed to be unreachable, we also add a WARN_ON().

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:52:58 +01:00
Lai Jiangshan
b68df97ec8 smp/hotplug: Move step CPUHP_AP_SMPCFD_DYING to the correct place
commit 46febd37f9 upstream.

Commit 31487f8328 ("smp/cfd: Convert core to hotplug state machine")
accidently put this step on the wrong place. The step should be at the
cpuhp_ap_states[] rather than the cpuhp_bp_states[].

grep smpcfd /sys/devices/system/cpu/hotplug/states
 40: smpcfd:prepare
129: smpcfd:dying

"smpcfd:dying" was missing before.
So was the invocation of the function smpcfd_dying_cpu().

Fixes: 31487f8328 ("smp/cfd: Convert core to hotplug state machine")
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lkml.kernel.org/r/20171128131954.81229-1-jiangshanlai@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:52:56 +01:00
Masami Hiramatsu
5715de464a kprobes: Use synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y
[ Upstream commit a30b85df7d ]

We want to wait for all potentially preempted kprobes trampoline
execution to have completed. This guarantees that any freed
trampoline memory is not in use by any task in the system anymore.
synchronize_rcu_tasks() gives such a guarantee, so use it.

Also, this guarantees to wait for all potentially preempted tasks
on the instructions which will be replaced with a jump.

Since this becomes a problem only when CONFIG_PREEMPT=y, enable
CONFIG_TASKS_RCU=y for synchronize_rcu_tasks() in that case.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N . Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/150845661962.5443.17724352636247312231.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-10 13:40:40 +01:00
Peter Zijlstra
4ee9572b10 perf/core: Fix __perf_read_group_add() locking
[ Upstream commit a9cd8194e1 ]

Event timestamps are serialized using ctx->lock, make sure to hold it
over reading all values.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-10 13:40:40 +01:00
Marc Zyngier
c01dd3addb genirq: Track whether the trigger type has been set
commit 4f8413a3a7 upstream.

When requesting a shared interrupt, we assume that the firmware
support code (DT or ACPI) has called irqd_set_trigger_type
already, so that we can retrieve it and check that the requester
is being reasonnable.

Unfortunately, we still have non-DT, non-ACPI systems around,
and these guys won't call irqd_set_trigger_type before requesting
the interrupt. The consequence is that we fail the request that
would have worked before.

We can either chase all these use cases (boring), or address it
in core code (easier). Let's have a per-irq_desc flag that
indicates whether irqd_set_trigger_type has been called, and
let's just check it when checking for a shared interrupt.
If it hasn't been set, just take whatever the interrupt
requester asks.

Fixes: 382bd4de61 ("genirq: Use irqd_get_trigger_type to compare the trigger type for shared IRQs")
Reported-and-tested-by: Petr Cvek <petrcvekcz@gmail.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:52 +00:00
Steven Rostedt (Red Hat)
f17c786b28 sched/rt: Simplify the IPI based RT balancing logic
commit 4bdced5c9a upstream.

When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.

When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:

  b6366f048e ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:

  rto_push_work	 - the irq work to process each CPU set in rto_mask
  rto_lock	 - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock
		    the first CPU scheduling in a lower priority task
		    is the one to kick off the process.
  rto_loop_next	 - an atomic that gets incremented for each CPU that
		    schedules in a lower priority task.
  rto_loop	 - a variable protected by rto_lock that is used to
		    compare against rto_loop_next
  rto_cpu	 - The cpu to send the next IPI to, also protected by
		    the rto_lock.

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.

This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:43 +00:00
Paul E. McKenney
9f088f6a67 sched: Make resched_cpu() unconditional
commit 7c2102e56a upstream.

The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not.  This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):

o	CPU1 is waiting for expedited wait to complete:

	sync_rcu_exp_select_cpus
	     rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
	     IPI sent to CPU5

	synchronize_sched_expedited_wait
		 ret = swait_event_timeout(rsp->expedited_wq,
					   sync_rcu_preempt_exp_done(rnp_root),
					   jiffies_stall);

	expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

o	CPU5 handles IPI and fails to acquire rq lock.

	Handles IPI
	     sync_sched_exp_handler
		 resched_cpu
		     returns while failing to try lock acquire rq->lock
		 need_resched is not set

o	CPU5 calls  rcu_idle_enter() and as need_resched is not set, goes to
	idle (schedule() is not called).

o	CPU 1 reports RCU stall.

Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.

Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:39 +00:00
Viresh Kumar
b79974945e cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq
commit 07458f6a51 upstream.

'cached_raw_freq' is used to get the next frequency quickly but should
always be in sync with sg_policy->next_freq. There is a case where it is
not and in such cases it should be reset to avoid switching to incorrect
frequencies.

Consider this case for example:

 - policy->cur is 1.2 GHz (Max)
 - New request comes for 780 MHz and we store that in cached_raw_freq.
 - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
 - We then see the CPU wasn't idle recently and choose to keep the next
   freq as 1.2 GHz.
 - Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
   1.2 GHz.
 - Now if the utilization doesn't change in then next request, then the
   next target frequency will still be 780 MHz and it will match with
   cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.

Fixes: b7eaf1aab9 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:40:39 +00:00
Neeraj Upadhyay
3594216fc6 rcu: Fix up pending cbs check in rcu_prepare_for_idle
commit 135bd1a230 upstream.

The pending-callbacks check in rcu_prepare_for_idle() is backwards.
It should accelerate if there are pending callbacks, but the check
rather uselessly accelerates only if there are no callbacks.  This commit
therefore inverts this check.

Fixes: 15fecf89e4 ("srcu: Abstract multi-tail callback list handling")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-24 08:37:04 +01:00
Linus Torvalds
3fefc31843 Merge tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull final power management fixes from Rafael Wysocki:
 "These fix a regression in the schedutil cpufreq governor introduced by
  a recent change and blacklist Dell XPS13 9360 from using the Low Power
  S0 Idle _DSM interface which triggers serious problems on one of these
  machines.

  Specifics:

   - Prevent the schedutil cpufreq governor from using the utilization
     of a wrong CPU in some cases which started to happen after one of
     the recent changes in it (Chris Redpath).

   - Blacklist Dell XPS13 9360 from using the Low Power S0 Idle _DSM
     interface as that causes serious issue (related to NVMe) to appear
     on one of these machines, even though the other Dells XPS13 9360 in
     somewhat different HW configurations behave correctly (Rafael
     Wysocki)"

* tag 'pm-final-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360
  cpufreq: schedutil: Examine the correct CPU when we update util
2017-11-09 11:16:28 -08:00
Rafael J. Wysocki
e029b9bf12 Merge branch 'pm-cpufreq-sched'
* pm-cpufreq-sched:
  cpufreq: schedutil: Examine the correct CPU when we update util
2017-11-09 00:07:56 +01:00
Linus Torvalds
e4880bc5df Merge branch 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
 "Another fix for a really old bug.

  It only affects drain_workqueue() which isn't used often and even then
  triggers only during a pretty small race window, so it isn't too
  surprising that it stayed hidden for so long.

  The fix is straight-forward and low-risk. Kudos to Li Bin for
  reporting and fixing the bug"

* 'for-4.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Fix NULL pointer dereference
2017-11-06 12:26:49 -08:00
Linus Torvalds
9d9cc4aa00 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Various fixes:

   - synchronize kernel and tooling headers

   - cgroup support fix

   - two tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tools/headers: Synchronize kernel ABI headers
  perf/cgroup: Fix perf cgroup hierarchy support
  perf tools: Unwind properly location after REJECT
  perf symbols: Fix memory corruption because of zero length symbols
2017-11-05 11:44:39 -08:00
Chris Redpath
d62d813c0d cpufreq: schedutil: Examine the correct CPU when we update util
After commit 674e75411f (sched: cpufreq: Allow remote cpufreq
callbacks) we stopped to always read the utilization for the CPU we
are running the governor on, and instead we read it for the CPU
which we've been told has updated utilization.  This is stored in
sugov_cpu->cpu.

The value is set in sugov_register() but we clear it in sugov_start()
which leads to always looking at the utilization of CPU0 instead of
the correct one.

Fix this by consolidating the initialization code into sugov_start().

Fixes: 674e75411f (sched: cpufreq: Allow remote cpufreq callbacks)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-11-04 17:44:28 +01:00
Ingo Molnar
649e441f49 Merge branch 'linus' into core/urgent, to pick up dependent commits
We want to fix an objtool build warning that got introduced in the latest upstream kernel.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-04 08:53:04 +01:00
Ingo Molnar
294cbd05e3 Merge branch 'linus' into perf/urgent, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-03 12:30:12 +01:00
Linus Torvalds
ead751507d Merge tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull initial SPDX identifiers from Greg KH:
 "License cleanup: add SPDX license identifiers to some files

  Many source files in the tree are missing licensing information, which
  makes it harder for compliance tools to determine the correct license.

  By default all files without license information are under the default
  license of the kernel, which is GPL version 2.

  Update the files which contain no license information with the
  'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
  binding shorthand, which can be used instead of the full boiler plate
  text.

  This patch is based on work done by Thomas Gleixner and Kate Stewart
  and Philippe Ombredanne.

  How this work was done:

  Patches were generated and checked against linux-4.14-rc6 for a subset
  of the use cases:

   - file had no licensing information it it.

   - file was a */uapi/* one with no licensing information in it,

   - file was a */uapi/* one with existing licensing information,

  Further patches will be generated in subsequent months to fix up cases
  where non-standard license headers were used, and references to
  license had to be inferred by heuristics based on keywords.

  The analysis to determine which SPDX License Identifier to be applied
  to a file was done in a spreadsheet of side by side results from of
  the output of two independent scanners (ScanCode & Windriver)
  producing SPDX tag:value files created by Philippe Ombredanne.
  Philippe prepared the base worksheet, and did an initial spot review
  of a few 1000 files.

  The 4.13 kernel was the starting point of the analysis with 60,537
  files assessed. Kate Stewart did a file by file comparison of the
  scanner results in the spreadsheet to determine which SPDX license
  identifier(s) to be applied to the file. She confirmed any
  determination that was not immediately clear with lawyers working with
  the Linux Foundation.

  Criteria used to select files for SPDX license identifier tagging was:

   - Files considered eligible had to be source code files.

   - Make and config files were included as candidates if they contained
     >5 lines of source

   - File already had some variant of a license header in it (even if <5
     lines).

  All documentation files were explicitly excluded.

  The following heuristics were used to determine which SPDX license
  identifiers to apply.

   - when both scanners couldn't find any license traces, file was
     considered to have no license information in it, and the top level
     COPYING file license applied.

     For non */uapi/* files that summary was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0                                              11139

     and resulted in the first patch in this series.

     If that file was a */uapi/* path one, it was "GPL-2.0 WITH
     Linux-syscall-note" otherwise it was "GPL-2.0". Results of that
     was:

       SPDX license identifier                            # files
       ---------------------------------------------------|-------
       GPL-2.0 WITH Linux-syscall-note                        930

     and resulted in the second patch in this series.

   - if a file had some form of licensing information in it, and was one
     of the */uapi/* ones, it was denoted with the Linux-syscall-note if
     any GPL family license was found in the file or had no licensing in
     it (per prior point). Results summary:

       SPDX license identifier                            # files
       ---------------------------------------------------|------
       GPL-2.0 WITH Linux-syscall-note                       270
       GPL-2.0+ WITH Linux-syscall-note                      169
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
       ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
       LGPL-2.1+ WITH Linux-syscall-note                      15
       GPL-1.0+ WITH Linux-syscall-note                       14
       ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
       LGPL-2.0+ WITH Linux-syscall-note                       4
       LGPL-2.1 WITH Linux-syscall-note                        3
       ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
       ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

     and that resulted in the third patch in this series.

   - when the two scanners agreed on the detected license(s), that
     became the concluded license(s).

   - when there was disagreement between the two scanners (one detected
     a license but the other didn't, or they both detected different
     licenses) a manual inspection of the file occurred.

   - In most cases a manual inspection of the information in the file
     resulted in a clear resolution of the license that should apply
     (and which scanner probably needed to revisit its heuristics).

   - When it was not immediately clear, the license identifier was
     confirmed with lawyers working with the Linux Foundation.

   - If there was any question as to the appropriate license identifier,
     the file was flagged for further research and to be revisited later
     in time.

  In total, over 70 hours of logged manual review was done on the
  spreadsheet to determine the SPDX license identifiers to apply to the
  source files by Kate, Philippe, Thomas and, in some cases,
  confirmation by lawyers working with the Linux Foundation.

  Kate also obtained a third independent scan of the 4.13 code base from
  FOSSology, and compared selected files where the other two scanners
  disagreed against that SPDX file, to see if there was new insights.
  The Windriver scanner is based on an older version of FOSSology in
  part, so they are related.

  Thomas did random spot checks in about 500 files from the spreadsheets
  for the uapi headers and agreed with SPDX license identifier in the
  files he inspected. For the non-uapi files Thomas did random spot
  checks in about 15000 files.

  In initial set of patches against 4.14-rc6, 3 files were found to have
  copy/paste license identifier errors, and have been fixed to reflect
  the correct identifier.

  Additionally Philippe spent 10 hours this week doing a detailed manual
  inspection and review of the 12,461 patched files from the initial
  patch version early this week with:

   - a full scancode scan run, collecting the matched texts, detected
     license ids and scores

   - reviewing anything where there was a license detected (about 500+
     files) to ensure that the applied SPDX license was correct

   - reviewing anything where there was no detection but the patch
     license was not GPL-2.0 WITH Linux-syscall-note to ensure that the
     applied SPDX license was correct

  This produced a worksheet with 20 files needing minor correction. This
  worksheet was then exported into 3 different .csv files for the
  different types of files to be modified.

  These .csv files were then reviewed by Greg. Thomas wrote a script to
  parse the csv files and add the proper SPDX tag to the file, in the
  format that the file expected. This script was further refined by Greg
  based on the output to detect more types of files automatically and to
  distinguish between header and source .c files (which need different
  comment types.) Finally Greg ran the script using the .csv files to
  generate the patches.

  Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
  Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
  Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  License cleanup: add SPDX license identifier to uapi header files with a license
  License cleanup: add SPDX license identifier to uapi header files with no license
  License cleanup: add SPDX GPL-2.0 license identifier to files with no license
2017-11-02 10:04:46 -07:00
Jiri Slaby
e78c38f6bd futex: futex_wake_op, do not fail on invalid op
In commit 30d6e0a419 ("futex: Remove duplicated code and fix undefined
behaviour"), I let FUTEX_WAKE_OP to fail on invalid op.  Namely when op
should be considered as shift and the shift is out of range (< 0 or > 31).

But strace's test suite does this madness:

  futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee);
  futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xbadfaced);
  futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xffffffff);

When I pick the first 0xa0caffee, it decodes as:

  0x80000000 & 0xa0caffee: oparg is shift
  0x70000000 & 0xa0caffee: op is FUTEX_OP_OR
  0x0f000000 & 0xa0caffee: cmp is FUTEX_OP_CMP_EQ
  0x00fff000 & 0xa0caffee: oparg is sign-extended 0xcaf = -849
  0x00000fff & 0xa0caffee: cmparg is sign-extended 0xfee = -18

That means the op tries to do this:

  (futex |= (1 << (-849))) == -18

which is completely bogus. The new check of op in the code is:

        if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
                if (oparg < 0 || oparg > 31)
                        return -EINVAL;
                oparg = 1 << oparg;
        }

which results obviously in the "Invalid argument" errno:

  FAIL: futex
  ===========

  futex(0x7fabd78bcffc, 0x5, 0xfacefeed, 0xb, 0x7fabd78bcffc, 0xa0caffee) = -1: Invalid argument
  futex.test: failed test: ../futex failed with code 1

So let us soften the failure to print only a (ratelimited) message, crop
the value and continue as if it were right.  When userspace keeps up, we
can switch this to return -EINVAL again.

[v2] Do not return 0 immediatelly, proceed with the cropped value.

Fixes: 30d6e0a419 ("futex: Remove duplicated code and fix undefined behaviour")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-02 07:41:50 -07:00
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Linus Torvalds
3a99df9a3d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull signal bugfix from Eric Biederman:
 "When making the generic support for SIGEMT conditional on the presence
  of SIGEMT I made a typo that causes it to fail to activate. It was
  noticed comparatively quickly but the bug report just made it to me
  today"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Fix name of SIGEMT in #if defined() check
2017-11-01 16:04:27 -07:00
Andrew Clayton
c3aff086ea signal: Fix name of SIGEMT in #if defined() check
Commit cc731525f2 ("signal: Remove kernel interal si_code magic")
added a check for SIGMET and NSIGEMT being defined. That SIGMET should
in fact be SIGEMT, with SIGEMT being defined in
arch/{alpha,mips,sparc}/include/uapi/asm/signal.h

This was actually pointed out by BenHutchings in a lwn.net comment
here https://lwn.net/Comments/734608/

Fixes: cc731525f2 ("signal: Remove kernel interal si_code magic")
Signed-off-by: Andrew Clayton <andrew@digital-domain.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-11-01 17:04:57 -05:00
Don Zickus
42f930da7f watchdog/hardlockup/perf: Use atomics to track in-use cpu counter
Guenter reported:
  There is still a problem. When running 
    echo 6 > /proc/sys/kernel/watchdog_thresh
    echo 5 > /proc/sys/kernel/watchdog_thresh
  repeatedly, the message
 
   NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
 
  stops after a while (after ~10-30 iterations, with fluctuations).
  Maybe watchdog_cpus needs to be atomic ?

That's correct as this again is affected by the asynchronous nature of the
smpboot thread unpark mechanism.

CPU 0				CPU1			CPU2
write(watchdog_thresh, 6)	
  stop()
    park()
  update()
  start()
    unpark()
				thread->unpark()
				  cnt++;
write(watchdog_thresh, 5)				thread->unpark()
  stop()
    park()			thread->park()
				   cnt--;		  cnt++;
  update()
  start()
    unpark()

That's not a functional problem, it just affects the informational message.

Convert watchdog_cpus to atomic_t to prevent the problem

Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20171101181126.j727fqjmdthjz4xk@redhat.com
2017-11-01 21:18:40 +01:00
Thomas Gleixner
9c388a5ed1 watchdog/harclockup/perf: Revert a33d44843d ("watchdog/hardlockup/perf: Simplify deferred event destroy")
Guenter reported a crash in the watchdog/perf code, which is caused by
cleanup() and enable() running concurrently. The reason for this is:

The watchdog functions are serialized via the watchdog_mutex and cpu
hotplug locking, but the enable of the perf based watchdog happens in
context of the unpark callback of the smpboot thread. But that unpark
function is not synchronous inside the locking. The unparking of the thread
just wakes it up and leaves so there is no guarantee when the thread is
executing.

If it starts running _before_ the cleanup happened then it will create a
event and overwrite the dead event pointer. The new event is then cleaned
up because the event is marked dead.

    lock(watchdog_mutex);
    lockup_detector_reconfigure();
        cpus_read_lock();
	stop();
	   park()
	update();
	start();
	   unpark()
	cpus_read_unlock();		thread runs()
					  overwrite dead event ptr
	cleanup();
	  free new event, which is active inside perf....
    unlock(watchdog_mutex);

The park side is safe as that actually waits for the thread to reach
parked state.

Commit a33d44843d removed the protection against this kind of scenario
under the stupid assumption that the hotplug serialization and the
watchdog_mutex cover everything. 

Bring it back.

Reverts: a33d44843d ("watchdog/hardlockup/perf: Simplify deferred event destroy")
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Feels-stupid Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710312145190.1942@nanos
2017-11-01 21:18:39 +01:00
Peter Zijlstra
153fbd1226 futex: Fix more put_pi_state() vs. exit_pi_state_list() races
Dmitry (through syzbot) reported being able to trigger the WARN in
get_pi_state() and a use-after-free on:

	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);

Both are due to this race:

  exit_pi_state_list()				put_pi_state()

  lock(&curr->pi_lock)
  while() {
	pi_state = list_first_entry(head);
	hb = hash_futex(&pi_state->key);
	unlock(&curr->pi_lock);

						dec_and_test(&pi_state->refcount);

	lock(&hb->lock)
	lock(&pi_state->pi_mutex.wait_lock)	// uaf if pi_state free'd
	lock(&curr->pi_lock);

	....

	unlock(&curr->pi_lock);
	get_pi_state();				// WARN; refcount==0

The problem is we take the reference count too late, and don't allow it
being 0. Fix it by using inc_not_zero() and simply retrying the loop
when we fail to get a refcount. In that case put_pi_state() should
remove the entry from the list.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Gratian Crisan <gratian.crisan@ni.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: dvhart@infradead.org
Cc: syzbot <bot+2af19c9e1ffe4d4ee1d16c56ae7580feaee75765@syzkaller.appspotmail.com>
Cc: syzkaller-bugs@googlegroups.com
Cc: <stable@vger.kernel.org>
Fixes: c74aef2d06 ("futex: Fix pi_state->owner serialization")
Link: http://lkml.kernel.org/r/20171031101853.xpfh72y643kdfhjs@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-01 09:05:00 +01:00
John Fastabend
04686ef299 bpf: remove SK_REDIRECT from UAPI
Now that SK_REDIRECT is no longer a valid return code. Remove it
from the UAPI completely. Then do a namespace remapping internal
to sockmap so SK_REDIRECT is no longer externally visible.

Patchs primary change is to do a namechange from SK_REDIRECT to
__SK_REDIRECT

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 11:43:50 +09:00
Li Bin
cef572ad9b workqueue: Fix NULL pointer dereference
When queue_work() is used in irq (not in task context), there is
a potential case that trigger NULL pointer dereference.
----------------------------------------------------------------
worker_thread()
|-spin_lock_irq()
|-process_one_work()
	|-worker->current_pwq = pwq
	|-spin_unlock_irq()
	|-worker->current_func(work)
	|-spin_lock_irq()
 	|-worker->current_pwq = NULL
|-spin_unlock_irq()

				//interrupt here
				|-irq_handler
					|-__queue_work()
						//assuming that the wq is draining
						|-is_chained_work(wq)
							|-current_wq_worker()
							//Here, 'current' is the interrupted worker!
								|-current->current_pwq is NULL here!
|-schedule()
----------------------------------------------------------------

Avoid it by checking for task context in current_wq_worker(), and
if not in task context, we shouldn't use the 'current' to check the
condition.

Reported-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Li Bin <huawei.libin@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8d03ecfe47 ("workqueue: reimplement is_chained_work() using current_wq_worker()")
Cc: stable@vger.kernel.org # v3.9+
2017-10-30 07:56:01 -07:00