This patch introduces a new binder_fd_array object,
that allows us to support one or more file descriptors
embedded in a buffer that is scatter-gathered.
Change-Id: I647a53cf0d905c7be0dfd9333806982def68dd74
Signed-off-by: Martijn Coenen <maco@google.com>
Previously all data passed over binder needed
to be serialized, with the exception of Binder
objects and file descriptors.
This patchs adds support for scatter-gathering raw
memory buffers into a binder transaction, avoiding
the need to first serialize them into a Parcel.
To remain backwards compatibile with existing
binder clients, it introduces two new command
ioctls for this purpose - BC_TRANSACTION_SG and
BC_REPLY_SG. These commands may only be used with
the new binder_transaction_data_sg structure,
which adds a field for the total size of the
buffers we are scatter-gathering.
Because memory buffers may contain pointers to
other buffers, we allow callers to specify
a parent buffer and an offset into it, to indicate
this is a location pointing to the buffer that
we are fixing up. The kernel will then take care
of fixing up the pointer to that buffer as well.
Change-Id: I02417f28cff14688f2e1d6fcb959438fd96566cc
Signed-off-by: Martijn Coenen <maco@google.com>
flat_binder_object is used for both handling
binder objects and file descriptors, even though
the two are mostly independent. Since we'll
have more fixup objects in binder in the future,
instead of extending flat_binder_object again,
split out file descriptors to their own object
while retaining backwards compatibility to
existing user-space clients. All binder objects
just share a header.
Change-Id: If3c55f27a2aa8f21815383e0e807be47895e4786
Signed-off-by: Martijn Coenen <maco@google.com>
The implementation is utterly broken, resulting in all processes being
allows to move tasks between sets (as long as they have access to the
"tasks" attribute), and upstream is heading towards checking only
capability anyway, so let's get rid of this code.
BUG=b:31790445,chromium:647994
TEST=Boot android container, examine logcat
Change-Id: I2f780a5992c34e52a8f2d0b3557fc9d490da2779
Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/394967
Reviewed-by: Ricky Zhou <rickyz@chromium.org>
Reviewed-by: John Stultz <john.stultz@linaro.org>
Fix a build breakage where MMC is enabled, but BLOCK is not.
Change-Id: I0eb422d12264f0371f3368ae7c37342ef9efabaa
Signed-off-by: Mohan Srinivasan <srmohan@google.com>
This introduces the MEMBLOCK_NOMAP attribute and the required plumbing
to make it usable as an indicator that some parts of normal memory
should not be covered by the kernel direct mapping. It is up to the
arch to actually honor the attribute when laying out this mapping,
but the memblock code itself is modified to disregard these regions
for allocations and other general use.
Cc: linux-mm@kvack.org
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Change-Id: I55cd3abdf514ac54c071fa0037d8dac73bda798d
(cherry picked from commit bf3d3cc580)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
commit c74ba8b348 ("arch: Introduce post-init read-only memory")
introduced the __ro_after_init attribute which allows to add variables
to the ro_after_init data section.
This new section was added to rodata, even though it contains writable
data. This in turn causes problems on architectures which mark the
page table entries read-only that point to rodata very early.
This patch allows architectures to implement an own handling of the
.data..ro_after_init section.
Usually that would be:
- mark the rodata section read-only very early
- mark the ro_after_init section read-only within mark_rodata_ro
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Bug: 31660652
Change-Id: If68cb4d86f88678c9bac8c47072775ab85ef5770
(cherry picked from commit 32fb2fc5c3)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
This exposes the firmware's implementation of EFI_RNG_PROTOCOL via a new
function efi_get_random_bytes().
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Bug: 30369029
Patchset: kaslr-arm64-4.4
(cherry picked from commit e4fbf47674)
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Change-Id: Id46036b78c2efd223b6cd5488e512fd93e8f597d
Adds tracepoints in ext4/f2fs/mpage to track readpages/buffered
write()s. This allows us to track files that are being read/written
to PIDs.
Change-Id: I26bd36f933108927d6903da04d8cb42fd9c3ef3d
Signed-off-by: Mohan Srinivasan <srmohan@google.com>
Currently __set_fixmap_offset is a macro function which has a local
variable called 'addr'. If a caller passes a 'phys' parameter which is
derived from a variable also called 'addr', the local variable will
shadow this, and the compiler will complain about the use of an
uninitialized variable. To avoid the issue with namespace clashes,
'addr' is prefixed with a liberal sprinkling of underscores.
Turning __set_fixmap_offset into a static inline breaks the build for
several architectures. Fixing this properly requires updates to a number
of architectures to make them agree on the prototype of __set_fixmap (it
could be done as a subsequent patch series).
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
[catalin.marinas@arm.com: squashed the original function patch and macro fixup]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Bug: 30369029
Patchset: rework-pagetable
(cherry picked from commit 3694bd7678)
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Change-Id: Iec27cb36dfca39e333de9f7318e76da0670d0156
The arm64 MMU supports a Contiguous bit which is a hint that the TTE
is one of a set of contiguous entries which can be cached in a single
TLB entry. Supporting this bit adds new intermediate huge page sizes.
The set of huge page sizes available depends on the base page size.
Without using contiguous pages the huge page sizes are as follows.
4KB: 2MB 1GB
64KB: 512MB
With a 4KB granule, the contiguous bit groups together sets of 16 pages
and with a 64KB granule it groups sets of 32 pages. This enables two new
huge page sizes in each case, so that the full set of available sizes
is as follows.
4KB: 64KB 2MB 32MB 1GB
64KB: 2MB 512MB 16GB
If a 16KB granule is used then the contiguous bit groups 128 pages
at the PTE level and 32 pages at the PMD level.
If the base page size is set to 64KB then 2MB pages are enabled by
default. It is possible in the future to make 2MB the default huge
page size for both 4KB and 64KB granules.
Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Reviewed-by: Steve Capper <steve.capper@linaro.org>
Signed-off-by: David Woods <dwoods@ezchip.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Bug: 30369029
Patchset: rework-pagetable
(cherry picked from commit 66b3923a1a)
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Change-Id: I5e99c5165bc5eb966adf4d4523632fd9eedd9602
This adds the capability for a process that has CAP_NET_ADMIN on
a socket to see the socket mark in socket dumps.
Commit a52e95abf7 ("net: diag: allow socket bytecode filters to
match socket marks") recently gave privileged processes the
ability to filter socket dumps based on mark. This patch is
complementary: it ensures that the mark is also passed to
userspace in the socket's netlink attributes. It is useful for
tools like ss which display information about sockets.
[backport of net-next d545caca82]
Change-Id: I33336ed9c3ee3fb78fe05c4c47b7fd18c6e33ef1
Tested: https://android-review.googlesource.com/270210
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements SOCK_DESTROY for UDP sockets similar to what was done
for TCP with commit c1e64e298b ("net: diag: Support destroying TCP
sockets.") A process with a UDP socket targeted for destroy is awakened
and recvmsg fails with ECONNABORTED.
[cherry-pick of 5d77dca828]
Change-Id: I4b4862548e6e3c05dde27781e7daa0b18b93bd81
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows a privileged process to filter by socket mark when
dumping sockets via INET_DIAG_BY_FAMILY. This is useful on
systems that use mark-based routing such as Android.
The ability to filter socket marks requires CAP_NET_ADMIN, which
is consistent with other privileged operations allowed by the
SOCK_DIAG interface such as the ability to destroy sockets and
the ability to inspect BPF filters attached to packet sockets.
[cherry-pick of a52e95abf7]
Change-Id: I8b90b814264d9808bda050cdba8f104943bdb9a8
Tested: https://android-review.googlesource.com/261350
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to inet_diag facility to filter sockets based on device
index. If an interface index is in the filter only sockets bound
to that index (sk_bound_dev_if) are returned.
[cherry-pick of net-next 637c841dd7]
Change-Id: I6b6bcdcf15d3142003f1ee53b4d82f2fabbb8250
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use do_div() instead of "/" operator to fix undefined references to
"__aeabi_uldivmod" build error for ARCH=arm.
Also in TP_fast_assign(), along with do_div() usage, replace "," with
";" which would have resulted in a syntax error (!), because
'#define TP_fast_assign(args...) args' would have stripped off the ","
and left white space between these two assignments after CPP phase.
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[jstultz: Cherry-picked from common/android-3.18]
Signed-off-by: John Stultz <john.stultz@linaro.org>
(from https://lkml.org/lkml/2016/9/1/428)
(cherry pick from android-3.10 commit b58133100b38f2bf83cad2d7097417a3a196ed0b)
Removing a bounce buffer copy operation in the pmsg driver path is
always better. We also gain in overall performance by not requesting
a vmalloc on every write as this can cause precious RT tasks, such
as user facing media operation, to stall while memory is being
reclaimed. Added a write_buf_user to the pstore functions, a backup
platform write_buf_user that uses the small buffer that is part of
the instance, and implemented a ramoops write_buf_user that only
supports PSTORE_TYPE_PMSG.
Signed-off-by: Mark Salyzyn <salyzyn@google.com>
Bug: 31057326
Change-Id: I4cdee1cd31467aa3e6c605bce2fbd4de5b0f8caa
Just for good measure, make sure that check_object_size() is always
inlined too, as already done for copy_*_user() and __copy_*_user().
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: Ibfdf4790d03fe426e68d9a864c55a0d1bbfb7d61
(cherry picked from commit a85d6b8242)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Instead of having each caller of check_object_size() need to remember to
check for a const size parameter, move the check into check_object_size()
itself. This actually matches the original implementation in PaX, though
this commit cleans up the now-redundant builtin_const() calls in the
various architectures.
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: I348809399c10ffa051251866063be674d064b9ff
(cherry picked from 81409e9e28)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
The IO latency histogram change broke allmodconfig and
allnoconfig builds. This fixes those breakages.
Change-Id: I9cdae655b40ed155468f3cef25cdb74bb56c4d3e
Signed-off-by: Mohan Srinivasan <srmohan@google.com>
SLUB already has a redzone debugging feature. But it is only positioned
at the end of object (aka right redzone) so it cannot catch left oob.
Although current object's right redzone acts as left redzone of next
object, first object in a slab cannot take advantage of this effect.
This patch explicitly adds a left red zone to each object to detect left
oob more precisely.
Background:
Someone complained to me that left OOB doesn't catch even if KASAN is
enabled which does page allocation debugging. That page is out of our
control so it would be allocated when left OOB happens and, in this
case, we can't find OOB. Moreover, SLUB debugging feature can be
enabled without page allocator debugging and, in this case, we will miss
that OOB.
Before trying to implement, I expected that changes would be too
complex, but, it doesn't look that complex to me now. Almost changes
are applied to debug specific functions so I feel okay.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: Ib893a17ecabd692e6c402e864196bf89cd6781a5
(cherry picked from commit d86bd1bece)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
This patch adds a new sysfs node (latency_hist) and reports
IO (svc time) latency histograms. Disabled by default, can be
enabled by echoing 0 into latency_hist, stats can be cleared
by writing 2 into latency_hist. This commit fixes the 32 bit
build breakage in the previous commit. Tested on both 32 bit
and 64 bit arm devices.
Bug: 30677035
Change-Id: I9a615a16616d80f87e75676ac4d078a5c429dcf9
Signed-off-by: Mohan Srinivasan <srmohan@google.com>
This is the start of porting PAX_USERCOPY into the mainline kernel. This
is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
work is based on code by PaX Team and Brad Spengler, and an earlier port
from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.
This patch contains the logic for validating several conditions when
performing copy_to_user() and copy_from_user() on the kernel object
being copied to/from:
- address range doesn't wrap around
- address range isn't NULL or zero-allocated (with a non-zero copy size)
- if on the slab allocator:
- object size must be less than or equal to copy size (when check is
implemented in the allocator, which appear in subsequent patches)
- otherwise, object must not span page allocations (excepting Reserved
and CMA ranges)
- if on the stack
- object must not extend before/after the current process stack
- object must be contained by a valid stack frame (when there is
arch/build support for identifying stack frames)
- object must not overlap with kernel text
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Change-Id: Iff3b5f1ddb04acd99ccf9a9046c7797363962b2a
(cherry picked from commit f5509cc18d)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
This creates per-architecture function arch_within_stack_frames() that
should validate if a given object is contained by a kernel stack frame.
Initial implementation is on x86.
This is based on code from PaX.
Signed-off-by: Kees Cook <keescook@chromium.org>
Change-Id: I1f3b299bb8991d65dcdac6af85d633d4b7776df1
(cherry picked from commit 0f60a8efe4)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
When I initially added the unsafe_[get|put]_user() helpers in commit
5b24a7a2aa ("Add 'unsafe' user access functions for batched
accesses"), I made the mistake of modeling the interface on our
traditional __[get|put]_user() functions, which return zero on success,
or -EFAULT on failure.
That interface is fairly easy to use, but it's actually fairly nasty for
good code generation, since it essentially forces the caller to check
the error value for each access.
In particular, since the error handling is already internally
implemented with an exception handler, and we already use "asm goto" for
various other things, we could fairly easily make the error cases just
jump directly to an error label instead, and avoid the need for explicit
checking after each operation.
So switch the interface to pass in an error label, rather than checking
the error value in the caller. Best do it now before we start growing
more users (the signal handling code in particular would be a good place
to use the new interface).
So rather than
if (unsafe_get_user(x, ptr))
... handle error ..
the interface is now
unsafe_get_user(x, ptr, label);
where an error during the user mode fetch will now just cause a jump to
'label' in the caller.
Right now the actual _implementation_ of this all still ends up being a
"if (err) goto label", and does not take advantage of any exception
label tricks, but for "unsafe_put_user()" in particular it should be
fairly straightforward to convert to using the exception table model.
Note that "unsafe_get_user()" is much harder to convert to a clever
exception table model, because current versions of gcc do not allow the
use of "asm goto" (for the exception) with output values (for the actual
value to be fetched). But that is hopefully not a limitation in the
long term.
[ Also note that it might be a good idea to switch unsafe_get_user() to
actually _return_ the value it fetches from user space, but this
commit only changes the error handling semantics ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: Ib905a84a04d46984320f6fd1056da4d72f3d6b53
(cherry picked from commit 1bd4403d86)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
The naming is meant to discourage random use: the helper functions are
not really any more "unsafe" than the traditional double-underscore
functions (which need the address range checking), but they do need even
more infrastructure around them, and should not be used willy-nilly.
In addition to checking the access range, these user access functions
require that you wrap the user access with a "user_acess_{begin,end}()"
around it.
That allows architectures that implement kernel user access control
(x86: SMAP, arm64: PAN) to do the user access control in the wrapping
user_access_begin/end part, and then batch up the actual user space
accesses using the new interfaces.
The main (and hopefully only) use for these are for core generic access
helpers, initially just the generic user string functions
(strnlen_user() and strncpy_from_user()).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: Ic64efea41f97171bdbdabe3e531489aebd9b6fac
(cherry picked from commit 5b24a7a2aa)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
(cherry picked from commit bb1fceca22)
When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail()
Then it attempts to copy user data into this fresh skb.
If the copy fails, we undo the work and remove the fresh skb.
Unfortunately, this undo lacks the change done to tp->highest_sack and
we can leave a dangling pointer (to a freed skb)
Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.
This bug was found by Marco Grassi thanks to syzkaller.
Fixes: 6859d49475 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change-Id: I58bb02d6e4e399612e8580b9e02d11e661df82f5
Bug: 31183296
The current percpu-rwsem read side is entirely free of serializing insns
at the cost of having a synchronize_sched() in the write path.
The latency of the synchronize_sched() is too high for cgroups. The
commit 1ed1328792 talks about the write path being a fairly cold path
but this is not the case for Android which moves task to the foreground
cgroup and back around binder IPC calls from foreground processes to
background processes, so it is significantly hotter than human initiated
operations.
Switch cgroup_threadgroup_rwsem into the slow mode for now to avoid the
problem, hopefully it should not be that slow after another commit
80127a3968 ("locking/percpu-rwsem: Optimize readers and reduce global
impact").
We could just add rcu_sync_enter() into cgroup_init() but we do not want
another synchronize_sched() at boot time, so this patch adds the new helper
which doesn't block but currently can only be called before the first use.
Cc: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[jstultz: backported to 4.4]
Change-Id: I34aa9c394d3052779b56976693e96d861bd255f2
Mailing-list-URL: https://lkml.org/lkml/2016/8/11/557
Signed-off-by: John Stultz <john.stultz@linaro.org>
Currently the percpu-rwsem switches to (global) atomic ops while a
writer is waiting; which could be quite a while and slows down
releasing the readers.
This patch cures this problem by ordering the reader-state vs
reader-count (see the comments in __percpu_down_read() and
percpu_down_write()). This changes a global atomic op into a full
memory barrier, which doesn't have the global cacheline contention.
This also enables using the percpu-rwsem with rcu_sync disabled in order
to bias the implementation differently, reducing the writer latency by
adding some cost to readers.
Mailing-list-URL: https://lkml.org/lkml/2016/8/9/181
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[jstultz: Backported to 4.4]
Change-Id: I8ea04b4dca2ec36f1c2469eccafde1423490572f
Signed-off-by: John Stultz <john.stultz@linaro.org>
This patch adds a new sysfs node (latency_hist) and reports IO
(svc time) latency histograms. Disabled by default, can be enabled
by echoing 0 into latency_hist, stats can be cleared by writing 2
into latency_hist.
Bug: 30677035
Change-Id: I625938135ea33e6e87cf6af1fc7edc136d8b4b32
Signed-off-by: Mohan Srinivasan <srmohan@google.com>
Fix RTA_UID enum to match it with the Android userspace code which
assumes RTA_UID=18.
With this patch all Android kernel networking unit tests mentioned here
https://source.android.com/devices/tech/config/kernel_network_tests.html
are success.
Without this patch multinetwork_test.py unit test fails.
Change-Id: I3ff36670f7d4e5bf5f01dce584ae9d53deabb3ed
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca ("vmstat: do not use deferrable delayed work for
vmstat_update"). This in turn can cause multiple interruptions of the
applications because the vmstat updater may run at
Make vmstate_update deferrable again and provide a function that folds
the differentials when the processor is going to idle mode thus
addressing the issue of the above commit in a clean way.
Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.
Change-Id: Idf256cfacb40b4dc8dbb6795cf06b34e8fec7a06
Fixes: ba4877b9ca ("vmstat: do not use deferrable delayed work for vmstat_update")
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 0eb77e9880
[shashim@codeaurora.org: resolve minor merge conflicts]
Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
[jstultz: fwdport to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
use a window based view of time in order to track task
demand and CPU utilization in the scheduler.
Window Assisted Load Tracking (WALT) implementation credits:
Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
Pavan Kumar Kondeti, Olav Haugan
2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
and Todd Kjos
Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708
Includes fixes for issues:
eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a
race resulting in watchdog resets
BUG: 29353986
Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb
Handle walt accounting anomoly during resume
During resume, there is a corner case where on wakeup, a task's
prev_runnable_sum can go negative. This is a workaround that
fixes the condition and warns (instead of crashing).
BUG: 29464099
Change-Id: I173e7874324b31a3584435530281708145773508
Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Change-Id: I164ee04ba98c3a776605f18cb65ee61b3e917939
Contains also:
eas/stune: schedtune cpu boost_max must be non-negative.
This is to avoid under-accounting cpu capacity which may
cause task stacking and frequency spikes.
Change-Id: Ie1c1cbd52a6edb77b4c15a830030aa748dff6f29
The choice of initial task load upon fork has a large influence
on CPU and OPP selection when scheduler-driven DVFS is in use.
Make this tuneable by adding a new sysctl "sched_initial_task_util".
If the sched governor is not used, the default remains at SCHED_LOAD_SCALE
Otherwise, the value from the sysctl is used. This defaults to 0.
Signed-off-by: "Todd Kjos <tkjos@google.com>"
EAS assumes that clusters with smaller capacity cores are more
energy-efficient. This may not be true on non-big-little devices,
so EAS can make incorrect cluster selections when finding a CPU
to wake. The "sched_is_big_little" hint can be used to cause a
cpu-based selection instead of cluster-based selection.
This change incorporates the addition of the sync hint enable patch
EAS did not honour synchronous wakeup hints, a new sysctl is
created to ask EAS to use this information when selecting a CPU.
The control is called "sched_sync_hint_enable".
Also contains:
EAS: sched/fair: for SMP bias toward idle core with capacity
For SMP devices, on wakeup bias towards idle cores that have capacity
vs busy devices that need a higher OPP
eas: favor idle cpus for boosted tasks
BUG: 29533997
BUG: 29512132
Change-Id: I0cc9a1b1b88fb52916f18bf2d25715bdc3634f9c
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
eas/sched/fair: Favoring busy cpus with low OPPs
BUG: 29533997
BUG: 29512132
Change-Id: I9305b3239698d64278db715a2e277ea0bb4ece79
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Introduce a new sysctl for this option, 'sched_cstate_aware'.
When this is enabled, select_idle_sibling in CFS is modified to
choose the idle CPU in the sibling group which has the lowest
idle state index - idle state indexes are assumed to increase
as sleep depth and hence wakeup latency increase. In this way,
we attempt to minimise wakeup latency when an idle CPU is
required.
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Includes:
sched: EAS: fix select_idle_sibling
when sysctl_sched_cstate_aware is enabled, best_idle cpu will not be chosen
in the original flow because it will goto done directly
Bug: 30107557
Change-Id: Ie09c2e3960cafbb976f8d472747faefab3b4d6ac
Signed-off-by: martin_liu <martin_liu@htc.com>
Currently the build for a single-core (e.g. user-mode) Linux is broken
and this configuration is required (at least) to run some network tests.
The main issues for the current code support on single-core systems are:
1. {se,rq}::sched_avg is not available nor maintained for !SMP systems
This means that load and utilisation signals are NOT available in single
core systems. All the EAS code depends on these signals.
2. sched_group_energy is also SMP dependant. Again this means that all the
EAS setup and preparation code (energyn model initialization) has to be
properly guarded/disabled for !SMP systems.
3. SchedFreq depends on utilization signal, which is not available on
!SMP systems.
4. SchedTune is useless on unicore systems if SchedFreq is not available.
5. WALT machinery is not required on single-core systems.
This patch addresses all these issues by enforcing some constraints for
single-core systems:
a) WALT, SchedTune and SchedTune are now dependant on SMP
b) The default governor for !SMP systems is INTERACTIVE
c) The energy model initialisation/build functions are
d) Other minor code re-arrangements and CONFIG_SMP guarding to enable
single core builds.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Doing a Exponential moving average per nr_running++/-- does not
guarantee a fixed sample rate which induces errors if there are lots of
threads being enqueued/dequeued from the rq (Linpack mt). Instead of
keeping track of the avg, the scheduler now keeps track of the integral
of nr_running and allows the readers to perform filtering on top.
Original-author: Sai Charan Gurrappadi <sgurrappadi@nvidia.com>
Change-Id: Id946654f32fa8be0eaf9d8fa7c9a8039b5ef9fab
Signed-off-by: Joseph Lo <josephl@nvidia.com>
Signed-off-by: Andrew Bresticker <abrestic@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/174694
Reviewed-on: https://chromium-review.googlesource.com/272853
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>