Commit Graph

15891 Commits

Author SHA1 Message Date
Lorenzo Colitti
4fd02636be Make suspend abort reason logging depend on CONFIG_PM_SLEEP
This unbreaks the build on architectures such as um that do not
support CONFIG_PM_SLEEP.

Change-Id: Ia846ed0a7fca1d762ececad20748d23610e8544f
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
2014-12-02 21:58:47 +00:00
Rom Lemarchand
57114e95e8 cgroup: refactor allow_attach function into common code
move cpu_cgroup_allow_attach to a common subsys_cgroup_allow_attach.
This allows any process with CAP_SYS_NICE to move tasks across cgroups if
they use this function as their allow_attach handler.

Bug: 18260435
Change-Id: I6bb4933d07e889d0dc39e33b4e71320c34a2c90f
Signed-off-by: Rom Lemarchand <romlem@android.com>
2014-11-07 13:47:35 -08:00
Dmitry Shmidt
f141171d7d power: Add check_wakeup_reason() to verify wakeup source irq
Wakeup reason is set before driver resume handlers are called.
It is cleared before driver suspend handlers are called, on
PM_SUSPEND_PREPARE.

Change-Id: I04218c9b0c115a7877e8029c73e6679ff82e0aa4
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2014-11-04 10:47:40 -08:00
Ruchi Kandoi
7af7a7d021 power: Adds functionality to log the last suspend abort reason.
Extends the last_resume_reason to log suspend abort reason. The abort
reasons will have "Abort:" appended at the start to distinguish itself
from the resume reason.

Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
Change-Id: I3207f1844e3d87c706dfc298fb10e1c648814c5f
2014-10-29 10:36:27 -07:00
Ruchi Kandoi
5ffb57932e power: Avoids bogus error messages for the suspend aborts.
Avoids printing bogus error message "tasks refusing to freeze", in cases
where pending wakeup source caused the suspend abort.

Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
Change-Id: I913ad290f501b31cd536d039834c8d24c6f16928
2014-10-16 09:16:24 -07:00
Guenter Roeck
9ac860041d seccomp: Replace BUG(!spin_is_locked()) with assert_spin_lock
Current upstream kernel hangs with mips and powerpc targets in
uniprocessor mode if SECCOMP is configured.

Bisect points to commit dbd952127d ("seccomp: introduce writer locking").
Turns out that code such as
	BUG_ON(!spin_is_locked(&list_lock));
can not be used in uniprocessor mode because spin_is_locked() always
returns false in this configuration, and that assert_spin_locked()
exists for that very purpose and must be used instead.

Fixes: dbd952127d ("seccomp: introduce writer locking")
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
2014-10-08 14:35:33 -07:00
Kees Cook
f14a5db239 seccomp: implement SECCOMP_FILTER_FLAG_TSYNC
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.

The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.

Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.

Changes across threads to the filter pointer includes a barrier.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-10-07 16:42:34 -07:00
Kees Cook
c852ef7782 seccomp: allow mode setting across threads
This changes the mode setting helper to allow threads to change the
seccomp mode from another thread. We must maintain barriers to keep
TIF_SECCOMP synchronized with the rest of the seccomp state.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/seccomp.c
2014-10-07 16:42:34 -07:00
Kees Cook
61b6b882a0 seccomp: introduce writer locking
Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to the
seccomp fields are now protected by the sighand spinlock (which is shared
by all threads in the thread group). Read access remains lockless because
pointer updates themselves are atomic.  However, writes (or cloning)
often entail additional checking (like maximum instruction counts)
which require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the sighand lock. Then parent and child are certain have
the same seccomp state when they exit the lock.

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/fork.c
2014-10-07 16:42:33 -07:00
Kees Cook
b6a12bf4dd seccomp: split filter prep from check and apply
In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.

For extreme defensiveness, I've added a BUG_ON check for the calculated
size of the buffer allocation in case BPF_MAXINSN ever changes, which
shouldn't ever happen. The compiler should actually optimize out this
check since the test above it makes it impossible.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/seccomp.c
2014-10-07 16:42:33 -07:00
Kees Cook
9d0ff694bc sched: move no_new_privs into new atomic flags
Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/sys.c
2014-10-07 16:42:32 -07:00
Kees Cook
e985fd474d seccomp: add "seccomp" syscall
This adds the new "seccomp" syscall with both an "operation" and "flags"
parameter for future expansion. The third argument is a pointer value,
used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

In addition to the TSYNC flag later in this patch series, there is a
non-zero chance that this syscall could be used for configuring a fixed
argument area for seccomp-tracer-aware processes to pass syscall arguments
in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
for this syscall. Additionally, this syscall uses operation, flags,
and user pointer for arguments because strictly passing arguments via
a user pointer would mean seccomp itself would be unable to trivially
filter the seccomp syscall itself.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	arch/x86/syscalls/syscall_32.tbl
	arch/x86/syscalls/syscall_64.tbl
	include/uapi/asm-generic/unistd.h
	kernel/seccomp.c

And fixup of unistd32.h to truly enable sys_secomp.

Change-Id: I95bea02382c52007d22e5e9dc563c7d055c2c83f
2014-10-07 16:42:32 -07:00
Kees Cook
8908dde5a7 seccomp: split mode setting routines
Separates the two mode setting paths to make things more readable with
fewer #ifdefs within function bodies.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-10-07 16:42:31 -07:00
Kees Cook
b8a9cff6db seccomp: extract check/assign mode helpers
To support splitting mode 1 from mode 2, extract the mode checking and
assignment logic into common functions.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-10-07 16:42:31 -07:00
Kees Cook
2a30a4386e seccomp: create internal mode-setting function
In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-10-07 16:42:30 -07:00
Oleg Nesterov
987a0f1102 introduce for_each_thread() to replace the buggy while_each_thread()
while_each_thread() and next_thread() should die, almost every lockless
usage is wrong.

1. Unless g == current, the lockless while_each_thread() is not safe.

   while_each_thread(g, t) can loop forever if g exits, next_thread()
   can't reach the unhashed thread in this case. Note that this can
   happen even if g is the group leader, it can exec.

2. Even if while_each_thread() itself was correct, people often use
   it wrongly.

   It was never safe to just take rcu_read_lock() and loop unless
   you verify that pid_alive(g) == T, even the first next_thread()
   can point to the already freed/reused memory.

This patch adds signal_struct->thread_head and task->thread_node to
create the normal rcu-safe list with the stable head.  The new
for_each_thread(g, t) helper is always safe under rcu_read_lock() as
long as this task_struct can't go away.

Note: of course it is ugly to have both task_struct->thread_node and the
old task_struct->thread_group, we will kill it later, after we change
the users of while_each_thread() to use for_each_thread().

Perhaps we can kill it even before we convert all users, we can
reimplement next_thread(t) using the new thread_head/thread_node.  But
we can't do this right now because this will lead to subtle behavioural
changes.  For example, do/while_each_thread() always sees at least one
task, while for_each_thread() can do nothing if the whole thread group
has died.  Or thread_group_empty(), currently its semantics is not clear
unless thread_group_leader(p) and we need to audit the callers before we
can change it.

So this patch adds the new interface which has to coexist with the old
one for some time, hopefully the next changes will be more or less
straightforward and the old one will go away soon.

Bug 200004307

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Sergey Dyasly <dserrg@gmail.com>
Tested-by: Sergey Dyasly <dserrg@gmail.com>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: "Ma, Xindong" <xindong.ma@intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0c740d0afc)
Signed-off-by: Sri Krishna chowdary <schowdary@nvidia.com>
Change-Id: Id689cb1383ceba2561b66188d88258619b68f5c6
Reviewed-on: http://git-master/r/419041
Reviewed-by: Bharat Nihalani <bnihalani@nvidia.com>
2014-10-07 16:42:30 -07:00
Eric Paris
9499cd23f9 syscall_get_arch: remove useless function arguments
Every caller of syscall_get_arch() uses current for the task and no
implementors of the function need args.  So just get rid of both of
those things.  Admittedly, since these are inline functions we aren't
wasting stack space, but it just makes the prototypes better.

Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@linux-mips.org
Cc: linux390@de.ibm.com
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-arch@vger.kernel.org

Conflicts:
	arch/mips/include/asm/syscall.h
	arch/mips/kernel/ptrace.c
2014-10-07 15:36:24 -07:00
JP Abgrall
4149e0de6d seccomp: revert previous patches in prep for updated ones
This reverts the seccomp related patches committed around 2014-08-27.
This allows for a cleaner cherry-pick of newly landed upstream patches.

 f56b1aa arm: fixup NR_syscalls to accommodate the new seccomp syscall
 81ff7fa seccomp: implement SECCOMP_FILTER_FLAG_TSYNC
 d924727 seccomp: allow mode setting across threads
 743266a seccomp: introduce writer locking
 3497a88 seccomp: split filter prep from check and apply
 2c6d7de MIPS: add seccomp syscall
 83f1ccba ARM: add seccomp syscall
 a75a29b seccomp: add "seccomp" syscall
 1a63bce seccomp: split mode setting routines
 c208e4e seccomp: extract check/assign mode helpers
 6862b01 seccomp: create internal mode-setting function
 1ba2ccb MAINTAINERS: create seccomp entry
 c2da3eb seccomp: fix memory leak on filter attach
 945a225 ARM: 7888/1: seccomp: not compatible with ARM OABI

Change-Id: I3f129263d68a7b3c206d79f84f7f9908d13064f6
Signed-off-by: JP Abgrall <jpa@google.com>
2014-09-17 16:56:33 -07:00
Kees Cook
81ff7fa232 seccomp: implement SECCOMP_FILTER_FLAG_TSYNC
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.

The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.

Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.

Changes across threads to the filter pointer includes a barrier.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-08-28 01:54:06 +00:00
Kees Cook
d924727911 seccomp: allow mode setting across threads
This changes the mode setting helper to allow threads to change the
seccomp mode from another thread. We must maintain barriers to keep
TIF_SECCOMP synchronized with the rest of the seccomp state.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/seccomp.c

Change-Id: I091ffa55d8f4e83ff02558a55e2b4dc76ac26905
2014-08-28 01:53:49 +00:00
Kees Cook
743266ae25 seccomp: introduce writer locking
Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.

Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to the
seccomp fields are now protected by the sighand spinlock (which is shared
by all threads in the thread group). Read access remains lockless because
pointer updates themselves are atomic.  However, writes (or cloning)
often entail additional checking (like maximum instruction counts)
which require locking to perform safely.

In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the sighand lock. Then parent and child are certain have
the same seccomp state when they exit the lock.

Based on patches by Will Drewry and David Drysdale.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/fork.c

Change-Id: Ie01ece43b610867013f7d0e0a2a7be0b9077630f
2014-08-28 01:53:30 +00:00
Kees Cook
3497a88f55 seccomp: split filter prep from check and apply
In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.

For extreme defensiveness, I've added a BUG_ON check for the calculated
size of the buffer allocation in case BPF_MAXINSN ever changes, which
shouldn't ever happen. The compiler should actually optimize out this
check since the test above it makes it impossible.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	kernel/seccomp.c

Change-Id: I8d89f80a5b4f2826d90474dcea441c41f0af6594
2014-08-28 01:53:09 +00:00
Kees Cook
a75a29b16e seccomp: add "seccomp" syscall
This adds the new "seccomp" syscall with both an "operation" and "flags"
parameter for future expansion. The third argument is a pointer value,
used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

In addition to the TSYNC flag later in this patch series, there is a
non-zero chance that this syscall could be used for configuring a fixed
argument area for seccomp-tracer-aware processes to pass syscall arguments
in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
for this syscall. Additionally, this syscall uses operation, flags,
and user pointer for arguments because strictly passing arguments via
a user pointer would mean seccomp itself would be unable to trivially
filter the seccomp syscall itself.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Conflicts:
	arch/x86/syscalls/syscall_32.tbl
	arch/x86/syscalls/syscall_64.tbl
	include/uapi/asm-generic/unistd.h
	kernel/seccomp.c

Change-Id: Id7a365079829fd9164315dec75d6ee415c29b176
2014-08-28 01:51:54 +00:00
Kees Cook
1a63bcec4f seccomp: split mode setting routines
Separates the two mode setting paths to make things more readable with
fewer #ifdefs within function bodies.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-08-28 01:51:22 +00:00
Kees Cook
c208e4e9f1 seccomp: extract check/assign mode helpers
To support splitting mode 1 from mode 2, extract the mode checking and
assignment logic into common functions.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-08-28 01:50:55 +00:00
Kees Cook
6862b01436 seccomp: create internal mode-setting function
In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-08-28 01:50:03 +00:00
Kees Cook
c2da3eba6a seccomp: fix memory leak on filter attach
This sets the correct error code when final filter memory is unavailable,
and frees the raw filter no matter what.

unreferenced object 0xffff8800d6ea4000 (size 512):
  comm "sshd", pid 278, jiffies 4294898315 (age 46.653s)
  hex dump (first 32 bytes):
    21 00 00 00 04 00 00 00 15 00 01 00 3e 00 00 c0  !...........>...
    06 00 00 00 00 00 00 00 21 00 00 00 00 00 00 00  ........!.......
  backtrace:
    [<ffffffff8151414e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811a3a40>] __kmalloc+0x280/0x320
    [<ffffffff8110842e>] prctl_set_seccomp+0x11e/0x3b0
    [<ffffffff8107bb6b>] SyS_prctl+0x3bb/0x4a0
    [<ffffffff8152ef2d>] system_call_fastpath+0x1a/0x1f
    [<ffffffffffffffff>] 0xffffffffffffffff

Reported-by: Masami Ichikawa <masami256@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Masami Ichikawa <masami256@gmail.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Conflicts:
	kernel/seccomp.c

Change-Id: Ide3c27bf378397f8faf4218e75c31e4b8bc43c4c
2014-08-28 01:48:29 +00:00
Colin Cross
9bc0c15675 mm: fix prctl_set_vma_anon_name
prctl_set_vma_anon_name could attempt to set the name across
two vmas at the same time due to a typo, which might corrupt
the vma list.  Fix it to use tmp instead of end to limit
the name setting to a single vma at a time.

Change-Id: Ie32d8ddb0fd547efbeedd6528acdab5ca5b308b4
Reported-by: Jed Davis <jld@mozilla.com>
Signed-off-by: Colin Cross <ccross@android.com>
2014-08-22 23:47:15 +00:00
Ruchi Kandoi
8ae872f1d5 prctl: adds the capable(CAP_SYS_NICE) check to PR_SET_TIMERSLACK_PID.
Adds a capable() check to make sure that arbitary apps do not change the
timer slack for other apps.

Bug: 15000427
Change-Id: I558a2551a0e3579c7f7e7aae54b28aa9d982b209
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-06-16 17:49:10 +00:00
Thomas Gleixner
b50939f89d futex: Make lookup_pi_state more robust
The current implementation of lookup_pi_state has ambigous handling of
the TID value 0 in the user space futex. We can get into the kernel
even if the TID value is 0, because either there is a stale waiters
bit or the owner died bit is set or we are called from the requeue_pi
path or from user space just for fun.

The current code avoids an explicit sanity check for pid = 0 in case
that kernel internal state (waiters) are found for the user space
address. This can lead to state leakage and worse under some
circumstances.

Handle the cases explicit:

     Waiter | pi_state | pi->owner | uTID      | uODIED | ?

[1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
[2]  NULL   | ---      | ---       | >0        | 0/1    | Valid

[3]  Found  | NULL     | --        | Any       | 0/1    | Invalid

[4]  Found  | Found    | NULL      | 0         | 1      | Valid
[5]  Found  | Found    | NULL      | >0        | 1      | Invalid

[6]  Found  | Found    | task      | 0         | 1      | Valid

[7]  Found  | Found    | NULL      | Any       | 0      | Invalid

[8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
[9]  Found  | Found    | task      | 0         | 0      | Invalid
[10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid

[1]  Indicates that the kernel can acquire the futex atomically. We
     came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.

[2]  Valid, if TID does not belong to a kernel thread. If no matching
     thread is found then it indicates that the owner TID has died.

[3]  Invalid. The waiter is queued on a non PI futex

[4]  Valid state after exit_robust_list(), which sets the user space
     value to FUTEX_WAITERS | FUTEX_OWNER_DIED.

[5]  The user space value got manipulated between exit_robust_list()
     and exit_pi_state_list()

[6]  Valid state after exit_pi_state_list() which sets the new owner in
     the pi_state but cannot access the user space value.

[7]  pi_state->owner can only be NULL when the OWNER_DIED bit is set.

[8]  Owner and user space value match

[9]  There is no transient state which sets the user space TID to 0
     except exit_robust_list(), but this is indicated by the
     FUTEX_OWNER_DIED bit. See [4]

[10] There is no transient state which leaves owner and user space
     TID out of sync.

Backport to 3.13
  conflicts: kernel/futex.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
2014-06-06 14:53:48 -07:00
Thomas Gleixner
a2ec8e3dcd futex: Always cleanup owner tid in unlock_pi
If the owner died bit is set at futex_unlock_pi, we currently do not
cleanup the user space futex. So the owner TID of the current owner
(the unlocker) persists. That's observable inconsistant state,
especially when the ownership of the pi state got transferred.

Clean it up unconditionally.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
2014-06-06 14:53:47 -07:00
Thomas Gleixner
550c7910f0 futex: Validate atomic acquisition in futex_lock_pi_atomic()
We need to protect the atomic acquisition in the kernel against rogue
user space which sets the user space futex to 0, so the kernel side
acquisition succeeds while there is existing state in the kernel
associated to the real owner.

Verify whether the futex has waiters associated with kernel state. If
it has, return -EINVAL. The state is corrupted already, so no point in
cleaning it up. Subsequent calls will fail as well. Not our problem.

[ tglx: Use futex_top_waiter() and explain why we do not need to try
  	restoring the already corrupted user space state. ]

Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-06-06 14:53:47 -07:00
Thomas Gleixner
5b3bbe42f5 futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
If uaddr == uaddr2, then we have broken the rule of only requeueing
from a non-pi futex to a pi futex with this call. If we attempt this,
then dangling pointers may be left for rt_waiter resulting in an
exploitable condition.

This change brings futex_requeue() into line with
futex_wait_requeue_pi() which performs the same check as per commit
6f7b0a2a5 (futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi())

[ tglx: Compare the resulting keys as well, as uaddrs might be
  	different depending on the mapping ]

Fixes CVE-2014-3153.

Reported-by: Pinkie Pie
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-06-06 14:53:47 -07:00
Ruchi Kandoi
70dc3ed48f Power: Changes the permission to read only for sysfs file
/sys/kernel/wakeup_reasons/last_resume_reason

Change-Id: I8ac568a7cb58c31decd379195de517ff3c6f9c65
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-04-24 22:14:30 +00:00
Ruchi Kandoi
37a591d407 prctl: adds PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread.
Second argument is similar to PR_SET_TIMERSLACK, if non-zero then the
slack is set to that value otherwise sets it to the default for the thread.

Takes PID of the thread as the third argument.

This allows power/performance management software to set timer slack for
other threads according to its policy for the thread (such as when the
thread is designated foreground vs. background activity)

Change-Id: I744d451ff4e60dae69f38f53948ff36c51c14a3f
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-04-22 17:31:53 -07:00
Greg Hackmann
c9331cabfd power: wakeup_reason: rename irq_count to irqcount
On x86, irq_count conflicts with a declaration in
arch/x86/include/asm/processor.h

Change-Id: I3e4fde0ff64ef59ff5ed2adc0ea3a644641ee0b7
Signed-off-by: Greg Hackmann <ghackmann@google.com>
2014-03-10 14:26:18 -07:00
Ruchi Kandoi
8b531976d5 Power: Add guard condition for maximum wakeup reasons
Ensure the array for the wakeup reason IRQs does not overflow.

Change-Id: Iddc57a3aeb1888f39d4e7b004164611803a4d37c
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
(cherry picked from commit b5ea40cdfcf38296535f931a7e5e7bf47b6fad7f)
2014-03-10 18:17:35 +00:00
Ruchi Kandoi
39129eff0f POWER: fix compile warnings in log_wakeup_reason
Change I81addaf420f1338255c5d0638b0d244a99d777d1 introduced compile
warnings, fix these.

Change-Id: I05482a5335599ab96c0a088a7d175c8d4cf1cf69
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-02-21 04:16:10 +00:00
Ruchi Kandoi
520bc0723d Power: add an API to log wakeup reasons
Add API log_wakeup_reason() and expose it to userspace via sysfs path
/sys/kernel/wakeup_reasons/last_resume_reason

Change-Id: I81addaf420f1338255c5d0638b0d244a99d777d1
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-02-20 02:43:15 +00:00
Arve Hjønnevåg
47e0f4d1dc ARM: Fix "Make low-level printk work" to use a separate config option
Signed-off-by: Arve Hjønnevåg <arve@android.com>
2013-11-13 17:34:12 -08:00
Colin Cross
d67a07b6ec anonymous vma names: fix build with !MMU
Disable PR_SET_VMA when building with !MMU

Change-Id: I896b6979b99aa61df85caf4c3ec22eb8a8204e64
Signed-off-by: Colin Cross <ccross@android.com>
2013-11-07 16:25:12 -08:00
Colin Cross
527aae5291 mm: fix anon vma naming
Fix two bugs caused by merging anon vma_naming, a typo in
mempolicy.c and a bad merge in sys.c.

Change-Id: Ia4ced447d50573e68195e95ea2f2b4d9456b8a90
Signed-off-by: Colin Cross <ccross@android.com>
2013-11-07 16:25:11 -08:00
Colin Cross
6ebfe5864a mm: add a field to store names for private anonymous memory
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory.  When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.

This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma.  vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.

Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.

The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas.  If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".

The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen.  This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging.  The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.

Change-Id: Ie2ffc0967d4ffe7ee4c70781313c7b00cf7e3092
Signed-off-by: Colin Cross <ccross@android.com>
2013-09-19 14:14:28 -05:00
Rik van Riel
2f42fa9141 add extra free kbytes tunable
Add a userspace visible knob to tell the VM to keep an extra amount
of memory free, by increasing the gap between each zone's min and
low watermarks.

This is useful for realtime applications that call system
calls and have a bound on the number of allocations that happen
in any short time period.  In this application, extra_free_kbytes
would be left at an amount equal to or larger than than the
maximum number of allocations that happen in any burst.

It may also be useful to reduce the memory use of virtual
machines (temporarily?), in a way that does not cause memory
fragmentation like ballooning does.

[ccross]
Revived for use on old kernels where no other solution exists.
The tunable will be removed on kernels that do better at avoiding
direct reclaim.

Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
Signed-off-by: Rik van Riel<riel@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
2013-09-19 13:53:19 -05:00
Colin Cross
e85cba35b3 sigtimedwait: use freezable blocking call
Avoid waking up every thread sleeping in a sigtimedwait call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Change-Id: Ic27469b60a67d50cdc0d0c78975951a99c25adcd
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-01 15:46:24 -07:00
Colin Cross
98009ee43a nanosleep: use freezable blocking call
Avoid waking up every thread sleeping in a nanosleep call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Change-Id: I93383201d4dd62130cd9a9153842d303fc2e2986
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-01 15:46:23 -07:00
Colin Cross
5ccf7dfeea futex: use freezable blocking call
Avoid waking up every thread sleeping in a futex_wait call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Change-Id: I9ccab9c2d201adb66c85432801cdcf43fc91e94f
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-01 15:46:16 -07:00
Colin Cross
44e77f70f6 freezer: skip waking up tasks with PF_FREEZER_SKIP set
Android goes through suspend/resume very often (every few seconds when
on a busy wifi network with the screen off), and a significant portion
of the energy used to go in and out of suspend is spent in the
freezer.  If a task has called freezer_do_not_count(), don't bother
waking it up.  If it happens to wake up later it will call
freezer_count() and immediately enter the refrigerator.

Combined with patches to convert freezable helpers to use
freezer_do_not_count() and convert common sites where idle userspace
tasks are blocked to use the freezable helpers, this reduces the
time and energy required to suspend and resume.

Change-Id: I6ba019d24273619849af757a413271da3261d7db
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-01 15:40:39 -07:00
Colin Cross
a57d997d09 freezer: shorten freezer sleep time using exponential backoff
All tasks can easily be frozen in under 10 ms, switch to using
an initial 1 ms sleep followed by exponential backoff until
8 ms.  Also convert the printed time to ms instead of centiseconds.

Change-Id: I7b198b16eefb623c2b0fc45dce50d9bca320afdc
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-01 15:40:31 -07:00
Colin Cross
0f7aff8620 lockdep: remove task argument from debug_check_no_locks_held
The only existing caller to debug_check_no_locks_held calls it
with 'current' as the task, and the freezer needs to call
debug_check_no_locks_held but doesn't already have a current
task pointer, so remove the argument.  It is already assuming
that the current task is relevant by dumping the current stack
trace as part of the warning.

This was originally part of 6aa9707099 (lockdep: check that
no locks held at freeze time) which was reverted in
dbf520a9d7.

Change-Id: Idbaf1332ce6c80dc49c1d31c324c7fbf210657c5
Original-author: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-01 15:38:03 -07:00