commit d049f74f2d upstream.
The get_dumpable() return value is not boolean. Most users of the
function actually want to be testing for non-SUID_DUMP_USER(1) rather than
SUID_DUMP_DISABLE(0). The SUID_DUMP_ROOT(2) is also considered a
protected state. Almost all places did this correctly, excepting the two
places fixed in this patch.
Wrong logic:
if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
or
if (dumpable == 0) { /* be protective */ }
or
if (!dumpable) { /* be protective */ }
Correct logic:
if (dumpable != SUID_DUMP_USER) { /* be protective */ }
or
if (dumpable != 1) { /* be protective */ }
Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a
user was able to ptrace attach to processes that had dropped privileges to
that user. (This may have been partially mitigated if Yama was enabled.)
The macros have been moved into the file that declares get/set_dumpable(),
which means things like the ia64 code can see them too.
CVE-2013-2929
Reported-by: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 12ae030d54 upstream.
The current default perf paranoid level is "1" which has
"perf_paranoid_kernel()" return false, and giving any operations that
use it, access to normal users. Unfortunately, this includes function
tracing and normal users should not be allowed to enable function
tracing by default.
The proper level is defined at "-1" (full perf access), which
"perf_paranoid_tracepoint_raw()" will only give access to. Use that
check instead for enabling function tracing.
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
CVE: CVE-2013-2930
Fixes: ced39002f5 ("ftrace, perf: Add support to use function tracepoint in perf")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ea81174789 upstream.
Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
regressed several workloads and caused excessive reschedule
interrupts.
The patch in question failed to notice that the x86 code had an
inverted sense of the polling state versus the new generic code (x86:
default polling, generic: default !polling).
Fix the two prominent x86 mwait based idle drivers and introduce a few
new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
usage).
Also switch the idle routines to using tif_need_resched() which is an
immediate TIF_NEED_RESCHED test as opposed to need_resched which will
end up being slightly different.
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: lenb@kernel.org
Cc: tglx@linutronix.de
Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
hmp_variable_scale_convert was used without guards in
__update_entity_runnable_avg. Guard it.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Mark Brown <broonie@linaro.org>
In order to allow userspace to restrict known low-load tasks to
little CPUs, we must export this knowledge from the kernel or
expect userspace to make their own attempts at figuring it out.
Since we now have a userspace requirement for an HMP implementation
to always have at least some sysfs files, change the integration
so that it only depends upon CONFIG_SCHED_HMP rather than
CONFIG_HMP_VARIABLE_SCALE. Fix Kconfig text to match.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
When migrating a runnable task, we use the CPU stopper on
the source CPU to ensure that the task to be moved is not
currently running. Before this patch, all forced migrations
(up, offload, idle pull) use the stopper for every migration.
Using the CPU stopper is mandatory only when a task is currently
running on a CPU. Otherwise tasks can be moved by locking the
source and destination run queues.
This patch checks to see if the task to be moved are currently
running. If not the task is moved directly without using the
stopper thread.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
commit 057db8488b upstream.
Andrey reported the following report:
ERROR: AddressSanitizer: heap-buffer-overflow on address ffff8800359c99f3
ffff8800359c99f3 is located 0 bytes to the right of 243-byte region [ffff8800359c9900, ffff8800359c99f3)
Accessed by thread T13003:
#0 ffffffff810dd2da (asan_report_error+0x32a/0x440)
#1 ffffffff810dc6b0 (asan_check_region+0x30/0x40)
#2 ffffffff810dd4d3 (__tsan_write1+0x13/0x20)
#3 ffffffff811cd19e (ftrace_regex_release+0x1be/0x260)
#4 ffffffff812a1065 (__fput+0x155/0x360)
#5 ffffffff812a12de (____fput+0x1e/0x30)
#6 ffffffff8111708d (task_work_run+0x10d/0x140)
#7 ffffffff810ea043 (do_exit+0x433/0x11f0)
#8 ffffffff810eaee4 (do_group_exit+0x84/0x130)
#9 ffffffff810eafb1 (SyS_exit_group+0x21/0x30)
#10 ffffffff81928782 (system_call_fastpath+0x16/0x1b)
Allocated by thread T5167:
#0 ffffffff810dc778 (asan_slab_alloc+0x48/0xc0)
#1 ffffffff8128337c (__kmalloc+0xbc/0x500)
#2 ffffffff811d9d54 (trace_parser_get_init+0x34/0x90)
#3 ffffffff811cd7b3 (ftrace_regex_open+0x83/0x2e0)
#4 ffffffff811cda7d (ftrace_filter_open+0x2d/0x40)
#5 ffffffff8129b4ff (do_dentry_open+0x32f/0x430)
#6 ffffffff8129b668 (finish_open+0x68/0xa0)
#7 ffffffff812b66ac (do_last+0xb8c/0x1710)
#8 ffffffff812b7350 (path_openat+0x120/0xb50)
#9 ffffffff812b8884 (do_filp_open+0x54/0xb0)
#10 ffffffff8129d36c (do_sys_open+0x1ac/0x2c0)
#11 ffffffff8129d4b7 (SyS_open+0x37/0x50)
#12 ffffffff81928782 (system_call_fastpath+0x16/0x1b)
Shadow bytes around the buggy address:
ffff8800359c9700: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
ffff8800359c9780: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
ffff8800359c9800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>ffff8800359c9980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[03]fb
ffff8800359c9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9b00: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
ffff8800359c9b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8800359c9c00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap redzone: fa
Heap kmalloc redzone: fb
Freed heap region: fd
Shadow gap: fe
The out-of-bounds access happens on 'parser->buffer[parser->idx] = 0;'
Although the crash happened in ftrace_regex_open() the real bug
occurred in trace_get_user() where there's an incrementation to
parser->idx without a check against the size. The way it is triggered
is if userspace sends in 128 characters (EVENT_BUF_SIZE + 1), the loop
that reads the last character stores it and then breaks out because
there is no more characters. Then the last character is read to determine
what to do next, and the index is incremented without checking size.
Then the caller of trace_get_user() usually nulls out the last character
with a zero, but since the index is equal to the size, it writes a nul
character after the allocated space, which can corrupt memory.
Luckily, only root user has write access to this file.
Link: http://lkml.kernel.org/r/20131009222323.04fd1a0d@gandalf.local.home
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 97b9410643 upstream.
Marc Kleine-Budde pointed out, that commit 77cc982 "clocksource: use
clockevents_config_and_register() where possible" caused a regression
for some of the converted subarchs.
The reason is, that the clockevents core code converts the minimal
hardware tick delta to a nanosecond value for core internal
usage. This conversion is affected by integer math rounding loss, so
the backwards conversion to hardware ticks will likely result in a
value which is less than the configured hardware limitation. The
affected subarchs used their own workaround (SIGH!) which got lost in
the conversion.
The solution for the issue at hand is simple: adding evt->mult - 1 to
the shifted value before the integer divison in the core conversion
function takes care of it. But this only works for the case where for
the scaled math mult/shift pair "mult <= 1 << shift" is true. For the
case where "mult > 1 << shift" we can apply the rounding add only for
the minimum delta value to make sure that the backward conversion is
not less than the given hardware limit. For the upper bound we need to
omit the rounding add, because the backwards conversion is always
larger than the original latch value. That would violate the upper
bound of the hardware device.
Though looking closer at the details of that function reveals another
bogosity: The upper bounds check is broken as well. Checking for a
resulting "clc" value greater than KTIME_MAX after the conversion is
pointless. The conversion does:
u64 clc = (latch << evt->shift) / evt->mult;
So there is no sanity check for (latch << evt->shift) exceeding the
64bit boundary. The latch argument is "unsigned long", so on a 64bit
arch the handed in argument could easily lead to an unnoticed shift
overflow. With the above rounding fix applied the calculation before
the divison is:
u64 clc = (latch << evt->shift) + evt->mult - 1;
So we need to make sure, that neither the shift nor the rounding add
is overflowing the u64 boundary.
[ukl: move assignment to rnd after eventually changing mult, fix build
issue and correct comment with the right math]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: nicolas.ferre@atmel.com
Cc: Marc Pignat <marc.pignat@hevs.ch>
Cc: john.stultz@linaro.org
Cc: kernel@pengutronix.de
Cc: Ronald Wahl <ronald.wahl@raritan.com>
Cc: LAK <linux-arm-kernel@lists.infradead.org>
Cc: Ludovic Desroches <ludovic.desroches@atmel.com>
Link: http://lkml.kernel.org/r/1380052223-24139-1-git-send-email-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ea84753c98 upstream.
Both Anjana and Eunki reported a stall in the while_each_thread loop
in cgroup_attach_task().
It's because, when we attach a single thread to a cgroup, if the cgroup
is exiting or is already in that cgroup, we won't break the loop.
If the task is already in the cgroup, the bug can lead to another thread
being attached to the cgroup unexpectedly:
# echo 5207 > tasks
# cat tasks
5207
# echo 5207 > tasks
# cat tasks
5207
5215
What's worse, if the task to be attached isn't the leader of the thread
group, we might never exit the loop, hence cpu stall. Thanks for Oleg's
analysis.
This bug was introduced by commit 081aa458c3
("cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()")
[ lizf: - fixed the first continue, pointed out by Oleg,
- rewrote changelog. ]
Reported-by: Eunki Kim <eunki_kim@samsung.com>
Reported-by: Anjana V Kumar <anjanavk12@gmail.com>
Signed-off-by: Anjana V Kumar <anjanavk12@gmail.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix two bugs caused by merging anon vma_naming, a typo in
mempolicy.c and a bad merge in sys.c.
Change-Id: Ia4ced447d50573e68195e95ea2f2b4d9456b8a90
Signed-off-by: Colin Cross <ccross@android.com>
commit ded7975475 upstream.
The commit facd8b80c6
("irq: Sanitize invoke_softirq") converted irq exit
calls of do_softirq() to __do_softirq() on all architectures,
assuming it was only used there for its irq disablement
properties.
But as a side effect, the softirqs processed in the end
of the hardirq are always called on the inline current
stack that is used by irq_exit() instead of the softirq
stack provided by the archs that override do_softirq().
The result is mostly safe if the architecture runs irq_exit()
on a separate irq stack because then softirqs are processed
on that same stack that is near empty at this stage (assuming
hardirq aren't nesting).
Otherwise irq_exit() runs in the task stack and so does the softirq
too. The interrupted call stack can be randomly deep already and
the softirq can dig through it even further. To add insult to the
injury, this softirq can be interrupted by a new hardirq, maximizing
the chances for a stack overrun as reported in powerpc for example:
do_IRQ: stack overflow: 1920
CPU: 0 PID: 1602 Comm: qemu-system-ppc Not tainted 3.10.4-300.1.fc19.ppc64p7 #1
Call Trace:
[c0000000050a8740] .show_stack+0x130/0x200 (unreliable)
[c0000000050a8810] .dump_stack+0x28/0x3c
[c0000000050a8880] .do_IRQ+0x2b8/0x2c0
[c0000000050a8930] hardware_interrupt_common+0x154/0x180
--- Exception: 501 at .cp_start_xmit+0x3a4/0x820 [8139cp]
LR = .cp_start_xmit+0x390/0x820 [8139cp]
[c0000000050a8d40] .dev_hard_start_xmit+0x394/0x640
[c0000000050a8e00] .sch_direct_xmit+0x110/0x260
[c0000000050a8ea0] .dev_queue_xmit+0x260/0x630
[c0000000050a8f40] .br_dev_queue_push_xmit+0xc4/0x130 [bridge]
[c0000000050a8fc0] .br_dev_xmit+0x198/0x270 [bridge]
[c0000000050a9070] .dev_hard_start_xmit+0x394/0x640
[c0000000050a9130] .dev_queue_xmit+0x428/0x630
[c0000000050a91d0] .ip_finish_output+0x2a4/0x550
[c0000000050a9290] .ip_local_out+0x50/0x70
[c0000000050a9310] .ip_queue_xmit+0x148/0x420
[c0000000050a93b0] .tcp_transmit_skb+0x4e4/0xaf0
[c0000000050a94a0] .__tcp_ack_snd_check+0x7c/0xf0
[c0000000050a9520] .tcp_rcv_established+0x1e8/0x930
[c0000000050a95f0] .tcp_v4_do_rcv+0x21c/0x570
[c0000000050a96c0] .tcp_v4_rcv+0x734/0x930
[c0000000050a97a0] .ip_local_deliver_finish+0x184/0x360
[c0000000050a9840] .ip_rcv_finish+0x148/0x400
[c0000000050a98d0] .__netif_receive_skb_core+0x4f8/0xb00
[c0000000050a99d0] .netif_receive_skb+0x44/0x110
[c0000000050a9a70] .br_handle_frame_finish+0x2bc/0x3f0 [bridge]
[c0000000050a9b20] .br_nf_pre_routing_finish+0x2ac/0x420 [bridge]
[c0000000050a9bd0] .br_nf_pre_routing+0x4dc/0x7d0 [bridge]
[c0000000050a9c70] .nf_iterate+0x114/0x130
[c0000000050a9d30] .nf_hook_slow+0xb4/0x1e0
[c0000000050a9e00] .br_handle_frame+0x290/0x330 [bridge]
[c0000000050a9ea0] .__netif_receive_skb_core+0x34c/0xb00
[c0000000050a9fa0] .netif_receive_skb+0x44/0x110
[c0000000050aa040] .napi_gro_receive+0xe8/0x120
[c0000000050aa0c0] .cp_rx_poll+0x31c/0x590 [8139cp]
[c0000000050aa1d0] .net_rx_action+0x1dc/0x310
[c0000000050aa2b0] .__do_softirq+0x158/0x330
[c0000000050aa3b0] .irq_exit+0xc8/0x110
[c0000000050aa430] .do_IRQ+0xdc/0x2c0
[c0000000050aa4e0] hardware_interrupt_common+0x154/0x180
--- Exception: 501 at .bad_range+0x1c/0x110
LR = .get_page_from_freelist+0x908/0xbb0
[c0000000050aa7d0] .list_del+0x18/0x50 (unreliable)
[c0000000050aa850] .get_page_from_freelist+0x908/0xbb0
[c0000000050aa9e0] .__alloc_pages_nodemask+0x21c/0xae0
[c0000000050aaba0] .alloc_pages_vma+0xd0/0x210
[c0000000050aac60] .handle_pte_fault+0x814/0xb70
[c0000000050aad50] .__get_user_pages+0x1a4/0x640
[c0000000050aae60] .get_user_pages_fast+0xec/0x160
[c0000000050aaf10] .__gfn_to_pfn_memslot+0x3b0/0x430 [kvm]
[c0000000050aafd0] .kvmppc_gfn_to_pfn+0x64/0x130 [kvm]
[c0000000050ab070] .kvmppc_mmu_map_page+0x94/0x530 [kvm]
[c0000000050ab190] .kvmppc_handle_pagefault+0x174/0x610 [kvm]
[c0000000050ab270] .kvmppc_handle_exit_pr+0x464/0x9b0 [kvm]
[c0000000050ab320] kvm_start_lightweight+0x1ec/0x1fc [kvm]
[c0000000050ab4f0] .kvmppc_vcpu_run_pr+0x168/0x3b0 [kvm]
[c0000000050ab9c0] .kvmppc_vcpu_run+0xc8/0xf0 [kvm]
[c0000000050aba50] .kvm_arch_vcpu_ioctl_run+0x5c/0x1a0 [kvm]
[c0000000050abae0] .kvm_vcpu_ioctl+0x478/0x730 [kvm]
[c0000000050abc90] .do_vfs_ioctl+0x4ec/0x7c0
[c0000000050abd80] .SyS_ioctl+0xd4/0xf0
[c0000000050abe30] syscall_exit+0x0/0x98
Since this is a regression, this patch proposes a minimalistic
and low-risk solution by blindly forcing the hardirq exit processing of
softirqs on the softirq stack. This way we should reduce significantly
the opportunities for task stack overflow dug by softirqs.
Longer term solutions may involve extending the hardirq stack coverage to
irq_exit(), etc...
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we wake up a task on a little CPU, fill CPUs rather than
spread. Adds 2 new files to sys/kernel/hmp to control packing
behaviour.
packing_enable: task packing enabled (1) or disabled (0)
packing_limit: Runqueues will be filled up to this load ratio.
This functionality is disabled by default on TC2 as it lacks per-cpu
power gating so packing small tasks there doesn't make sense.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
Accessing the task_struct can be racy in certain conditions, so
we need to only acquire the data when needed.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
The previous API for hmp_up_migration reset the destination
CPU every time, regardless of if a migration was desired. The code
using it assumed that the value would not be changed unless
a migration was required. In one rare circumstance, this could
have lead to a task migrating to a little CPU at the wrong time.
Fixing that lead to a slight logical tweak to make the surrounding
APIs operate a bit more obviously.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Robin Randhawa <robin.randhawa@arm.com>
Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
1. Replace magic numbers in code for migration trace.
Trace points still emit a number as force=<n> field:
force=0 : wakeup migration
force=1 : forced migration
force=2 : offload migration
force=3 : idle pull migration
2. Add trace to expose offload decision-making.
Also adds tracing rq->nr_running so that you can
look back to see what state the RQ was in at the time.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
When the up-threshold is at 512 on TC2, behaviour looks OK since
the graphic-related tasks are very heavy due to lack of a GPU.
Increasing the up-threshold does not reduce power consumption.
When a GPU is present, graphic tasks are much less CPU-heavy and
so additional power may be saved by having a higher threshold.
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
commit 6c9a27f5da upstream.
There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.
parent doing fork() | someone moving the parent to another cgroup
-------------------------------+---------------------------------------------
copy_process()
+ dup_task_struct()
-> parent->se is copied to child->se.
se.parent,cfs_rq of them point to old ones.
cgroup_attach_task()
+ cgroup_task_migrate()
-> parent->cgroup is updated.
+ cpu_cgroup_attach()
+ sched_move_task()
+ task_move_group_fair()
+- set_task_rq()
-> se.parent,cfs_rq of parent
are updated.
+ cgroup_fork()
-> parent->cgroup is copied to child->cgroup. (*1)
+ sched_fork()
+ task_fork_fair()
-> se.parent,cfs_rq of child are accessed
while they point to old ones. (*2)
In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it's new cgroup's refcount that is incremented at (*1),
so the old cgroup(and related data) can be freed before (*2).
In fact, a panic caused by this bug was originally caught in RHEL6.4.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
[...]
Call Trace:
[<ffffffff81051f25>] place_entity+0x75/0xa0
[<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
[<ffffffff81063c0b>] sched_fork+0x6b/0x140
[<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
[<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
[<ffffffff8106d2f4>] do_fork+0x94/0x460
[<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
[<ffffffff81009598>] sys_clone+0x28/0x30
[<ffffffff8100b393>] stub_clone+0x13/0x20
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5a8e01f8fa upstream.
scale_stime() silently assumes that stime < rtime, otherwise
when stime == rtime and both values are big enough (operations
on them do not fit in 32 bits), the resulting scaling stime can
be bigger than rtime. In consequence utime = rtime - stime
results in negative value.
User space visible symptoms of the bug are overflowed TIME
values on ps/top, for example:
$ ps aux | grep rcu
root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0]
root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0]
root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]
root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0]
root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1]
root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]
or overflowed utime values read directly from /proc/$PID/stat
Reference:
https://lkml.org/lkml/2013/8/20/259
Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: stable@vger.kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7bd3601446 upstream.
Gerlando Falauto reported that when HRTICK is enabled, it is
possible to trigger system deadlocks. These were hard to
reproduce, as HRTICK has been broken in the past, but seemed
to be connected to the timekeeping_seq lock.
Since seqlock/seqcount's aren't supported w/ lockdep, I added
some extra spinlock based locking and triggered the following
lockdep output:
[ 15.849182] ntpd/4062 is trying to acquire lock:
[ 15.849765] (&(&pool->lock)->rlock){..-...}, at: [<ffffffff810aa9b5>] __queue_work+0x145/0x480
[ 15.850051]
[ 15.850051] but task is already holding lock:
[ 15.850051] (timekeeper_lock){-.-.-.}, at: [<ffffffff810df6df>] do_adjtimex+0x7f/0x100
<snip>
[ 15.850051] Chain exists of: &(&pool->lock)->rlock --> &p->pi_lock --> timekeeper_lock
[ 15.850051] Possible unsafe locking scenario:
[ 15.850051]
[ 15.850051] CPU0 CPU1
[ 15.850051] ---- ----
[ 15.850051] lock(timekeeper_lock);
[ 15.850051] lock(&p->pi_lock);
[ 15.850051] lock(timekeeper_lock);
[ 15.850051] lock(&(&pool->lock)->rlock);
[ 15.850051]
[ 15.850051] *** DEADLOCK ***
The deadlock was introduced by 06c017fdd4 ("timekeeping:
Hold timekeepering locks in do_adjtimex and hardpps") in 3.10
This patch avoids this deadlock, by moving the call to
schedule_delayed_work() outside of the timekeeper lock
critical section.
Reported-by: Gerlando Falauto <gerlando.falauto@keymile.com>
Tested-by: Lin Ming <minggr@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: http://lkml.kernel.org/r/1378943457-27314-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e79f525e99 upstream.
Commit 8382fcac1b ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
pid_ns, this obviously breaks vfork:
int main(void)
{
assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
assert(vfork() >= 0);
_exit(0);
return 0;
}
fails without this patch.
Change this check to use CLONE_SIGHAND instead. This also forbids
CLONE_THREAD automatically, and this is what the comment implies.
We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
would be safer to not do this. The current check denies CLONE_SIGHAND
implicitely and there is no reason to change this.
Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better.
Having shared signal handling between two different pid namespaces is
the case that we are fundamentally guarding against."
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Colin Walters <walters@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>