The patch to add anonymous vma names causes a build failure in some
configurations:
include/linux/mm_types.h: In function 'is_same_vma_anon_name':
include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
924 | return name && vma_name && !strcmp(name, vma_name);
| ^~~~~~
include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
This should not really be part of linux/mm_types.h in the first place,
as that header is meant to only contain structure definitions and needs a
minimum set of indirect includes itself.
While the header clearly includes more than it should at this point,
let's not make it worse by including string.h as well, which would pull
in the expensive (compile-speed wise) fortify-string logic.
Move the new functions into a separate header that only needs to be
included in a couple of locations.
Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org
Fixes: "mm: add a field to store names for private anonymous memory"
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Colin Cross <ccross@google.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 17fca131ce)
Bug: 120441514
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I54719d7ea27d3cf53ef7245b2af88d2a2bc9bafe
While forking a process with a high number (64K) of named anonymous vmas
the overhead caused by strdup() is noticeable. Experiments with ARM64
Android device show up to 40% performance regression when forking a
process with 64k unpopulated anonymous vmas using the max name lengths
vs the same process with the same number of anonymous vmas having no
name.
Introduce anon_vma_name refcounted structure to avoid the overhead of
copying vma names during fork() and when splitting named anonymous vmas.
When a vma is duplicated, instead of copying the name we increment the
refcount of this structure. Multiple vmas can point to the same
anon_vma_name as long as they increment the refcount. The name member
of anon_vma_name structure is assigned at structure allocation time and
is never changed. If vma name changes then the refcount of the original
structure is dropped, a new anon_vma_name structure is allocated to hold
the new name and the vma pointer is updated to point to the new
structure.
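The refcounting scheme described above can be illustrated with a minimal
userspace sketch. The structure layout and the helper names
(anon_name_alloc, anon_name_get, anon_name_put, anon_name_replace) are
assumptions made for illustration and are not the kernel's definitions;
a plain int also stands in for a proper atomic refcount.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the refcounted name; not the kernel definition. */
struct anon_name {
	int refcount;	/* plain int here; real code needs an atomic refcount */
	char name[];	/* assigned at allocation time and never changed */
};

/* Allocate a new name object holding a single reference. */
static struct anon_name *anon_name_alloc(const char *name)
{
	size_t len = strlen(name) + 1;
	struct anon_name *an = malloc(sizeof(*an) + len);

	if (!an)
		return NULL;
	an->refcount = 1;
	memcpy(an->name, name, len);
	return an;
}

/* Duplicating a vma (fork, split) takes a reference instead of copying. */
static struct anon_name *anon_name_get(struct anon_name *an)
{
	if (an)
		an->refcount++;
	return an;
}

/* Drop one reference; returns 1 if the object was freed. */
static int anon_name_put(struct anon_name *an)
{
	if (!an || --an->refcount > 0)
		return 0;
	free(an);
	return 1;
}

/* Renaming: allocate a new object, then drop the reference to the old one. */
static int anon_name_replace(struct anon_name **slot, const char *new_name)
{
	struct anon_name *an = anon_name_alloc(new_name);

	if (!an)
		return -1;
	anon_name_put(*slot);
	*slot = an;
	return 0;
}
```

In this sketch a fork-like duplication costs one increment regardless of
the name length, which is the property the patch relies on.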
With this approach the fork() performance regression is reduced 3-4x,
and with usecases using a more reasonable number of VMAs (a few
thousand) the regression is not measurable.
Link: https://lkml.kernel.org/r/20211019215511.3771969-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 78db341283)
Bug: 120441514
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I4b6d63b1aced3813ebb91479f4bcfd0d89e8fa29
In many userspace applications, and especially in VM based applications
such as those Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
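The name rules above can be sketched as a small validity check. This is
an illustration of the stated rules only, not the kernel's code: the
function names and the ANON_NAME_MAX_LEN macro are assumptions, and the
treatment of an empty string is not specified by this text.

```c
#include <stdbool.h>
#include <stddef.h>

#define ANON_NAME_MAX_LEN 80	/* limit from the text, including the NUL */

/* Printable ASCII (space through '~') minus the five excluded characters. */
static bool anon_name_char_ok(unsigned char c)
{
	if (c < 0x20 || c > 0x7e)
		return false;
	return c != '[' && c != ']' && c != '\\' && c != '$' && c != '`';
}

/* Check both the length limit and the character set of a candidate name. */
static bool anon_name_ok(const char *name)
{
	size_t i;

	for (i = 0; name[i] != '\0'; i++) {
		/* i + 1 characters so far; the terminator must still fit. */
		if (i + 1 >= ANON_NAME_MAX_LEN)
			return false;
		if (!anon_name_char_ok((unsigned char)name[i]))
			return false;
	}
	return true;	/* an empty string passes here; that is an assumption */
}
```

With this check, a 79-character name is accepted and an 80-character
name is rejected, because the terminator counts against the limit.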
Ascii strings are being used to have descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas that have the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 9a10064f56)
Bug: 120441514
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I53d56d551a7d62f75341304751814294b447c04e
Patch series "mm: rearrange madvise code to allow for reuse", v11.
Avoid performance regression of the new anon vma name field by refcounting it.
I checked the image sizes with allnoconfig builds:
unpatched Linus' ToT
   text    data     bss     dec     hex filename
1324759      32   73928 1398719  1557bf vmlinux
After the first patch is applied (madvise refactoring)
   text    data     bss     dec     hex filename
1322346      32   73928 1396306  154e52 vmlinux
>>> 2413 bytes decrease vs ToT <<<
After all patches applied with CONFIG_ANON_VMA_NAME=n
   text    data     bss     dec     hex filename
1322337      32   73928 1396297  154e49 vmlinux
>>> 2422 bytes decrease vs ToT <<<
After all patches applied with CONFIG_ANON_VMA_NAME=y
   text    data     bss     dec     hex filename
1325228      32   73928 1399188  155994 vmlinux
>>> 469 bytes increase vs ToT <<<
This patch (of 3):
Refactor the madvise syscall to allow for parts of it to be reused by a
prctl syscall that affects vmas.
Move the code that walks vmas in a virtual address range into a function
that takes a function pointer as a parameter. The only caller for now
is sys_madvise, which uses it to call madvise_vma_behavior on each vma,
but the next patch will add an additional caller.
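The shape of that refactoring, a range walker that takes a function
pointer, can be sketched in userspace as follows. struct range,
walk_ranges() and count_bytes() are hypothetical names for illustration;
the real walker operates on vmas and also has to deal with unmapped
holes, which this sketch simply skips.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a vma: half-open [start, end). */
struct range {
	unsigned long start;
	unsigned long end;
};

typedef int (*range_visit_fn)(struct range *r, unsigned long start,
			      unsigned long end, void *arg);

/*
 * Call visit() on every range overlapping [start, end), clamped to that
 * window. The array must be sorted and non-overlapping. Stops at the
 * first non-zero return value and passes it back to the caller.
 */
static int walk_ranges(struct range *ranges, size_t n,
		       unsigned long start, unsigned long end,
		       range_visit_fn visit, void *arg)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct range *r = &ranges[i];
		unsigned long s, e;
		int ret;

		if (r->end <= start)
			continue;	/* entirely before the window */
		if (r->start >= end)
			break;		/* entirely after the window */
		s = r->start > start ? r->start : start;
		e = r->end < end ? r->end : end;
		ret = visit(r, s, e, arg);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example visitor: add up the number of bytes visited. */
static int count_bytes(struct range *r, unsigned long start,
		       unsigned long end, void *arg)
{
	(void)r;
	*(unsigned long *)arg += end - start;
	return 0;
}
```

A second caller only needs to supply a different visitor, which is the
reuse the refactoring is meant to enable.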
Move handling all vma behaviors inside madvise_behavior, and rename it
to madvise_vma_behavior.
Move the code that updates the flags on a vma, including splitting or
merging the vma as necessary, into a new function called
madvise_update_vma. The next patch will add support for updating a new
anon_name field as well.
Link: https://lkml.kernel.org/r/20211019215511.3771969-1-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Rob Landley <rob@landley.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ac1e9acc5a)
Bug: 120441514
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If96c14ca3acc3795de373d658ba0a940dda68e1c
Build BTF type info into the kernel to enable use of BPF-based tools
such as BCC's libbpf-tools.
Bug: 203823368
Test: build
Signed-off-by: Connor O'Brien <connoro@google.com>
Change-Id: Ice20d6bbf83b3a2407a553a37a9befff6c6bb66d
resolve_btfids is built using $(HOSTCC) and $(HOSTLD) but does not
pick up the corresponding flags. As a result, host-specific settings
(such as a sysroot specified via HOSTCFLAGS=--sysroot=..., or a linker
specified via HOSTLDFLAGS=-fuse-ld=...) will not be respected.
Fix this by setting CFLAGS to KBUILD_HOSTCFLAGS and LDFLAGS to
KBUILD_HOSTLDFLAGS.
Also pass the cflags through to libbpf via EXTRA_CFLAGS to ensure that
the host libbpf is built with flags consistent with resolve_btfids.
Signed-off-by: Connor O'Brien <connoro@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20220112002503.115968-1-connoro@google.com
(cherry picked from commit 0e3a1c902f
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master)
Bug: 203823368
Test: build with CONFIG_DEBUG_INFO_BTF=y
Signed-off-by: Connor O'Brien <connoro@google.com>
Change-Id: I09ee10b29b57933653eb1cdd4249bac2d9cebf22
When a sense code is present we should not override the SAM status; the
driver already sets it based on the response from the hypervisor.
In addition we should only copy the sense buffer if one is actually
provided by the hypervisor.
Link: https://lore.kernel.org/r/20210622091153.29231-1-hare@suse.de
Fixes: 464a00c9e0 ("scsi: core: Kill DRIVER_SENSE")
Tested-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit c43ddbf97f)
Bug: 187129171
Signed-off-by: Connor O'Brien <connoro@google.com>
Change-Id: I6a42c80e2cbd6786f2e08ebe4226f2cddfbb8e97
Setattr implementation was mixing up some flags, and missing some of
them.
Test: atest android.appsecurity.cts.ExternalStorageHostTest
Bug: 202785178
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Id41fa30881766faad5858b658f5b6871c0ae46b3
By default, with SELinux enabled, the behavior for unsigned
module loading is the same as sig_enforce=1. This causes
loading of unsigned modules to fail. All modules in Android
GKI are unsigned except GKI modules. Do not prevent
module loading in case of CONFIG_SIG_MODULE_PROTECT, which
was introduced to change the behavior of sig_enforce to allow
unsigned modules but not access to protected symbols.
Bug: 200082547
Bug: 214445388
Fixes: 9ab6a24225 ("ANDROID: GKI: Add module load time protected symbol lookup")
Test: TreeHugger
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
Change-Id: Iab3113d706cbd7db7a5684897bcafd5671a6d424
CONFIG_MODULE_SIG_ALL needs to be set for gki_defconfig,
but will require an override via device fragments
to avoid signing the vendor modules at build-time.
It defaults to 'y' so no need to explicitly set for
gki_defconfig.
Bug: 200082547
Bug: 214445388
Fixes: 9ab6a24225 ("ANDROID: GKI: Add module load time protected symbol lookup")
Test: TH, manual builds including P21 mainline
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
Change-Id: Iafc0936b5e7bfb781b28642d1ec233a7fcf85f09
Open Profile for DICE is an open protocol for measured boot compatible
with the Trusted Computing Group's Device Identifier Composition
Engine (DICE) specification. The generated Compound Device Identifier
(CDI) certificates represent the hardware/software combination measured
by DICE, and can be used for remote attestation and sealing.
Add a driver that exposes reserved memory regions populated by firmware
with DICE CDIs and exposes them to userspace via a character device.
Userspace obtains the memory region's size from read() and calls mmap()
to create a mapping of the memory region in its address space. The
mapping is not allowed to be write+shared, giving userspace a guarantee
that the data were not overwritten by another process.
Userspace can also call write(), which triggers a wipe of the DICE data
by the driver. Because both the kernel and userspace mappings use
write-combine semantics, all clients observe the memory as zeroed after
the syscall has returned.
Acked-by: Rob Herring <robh@kernel.org>
Cc: Andrew Scull <ascull@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: David Brazdil <dbrazdil@google.com>
Link: https://lore.kernel.org/r/20220104100645.1810028-3-dbrazdil@google.com
Bug: 198197082
[willdeacon@: Fixed context conflicts in reserved_mem_matches[] and Makefile]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 209580772
Change-Id: If1160c8cc3a39ea822e089d1b80c837aec8075fa
Signed-off-by: Will Deacon <willdeacon@google.com>
Add DeviceTree bindings for Open Profile for DICE, an open protocol for
measured boot. Firmware uses DICE to measure the hardware/software
combination and generates Compound Device Identifier (CDI) certificates.
These are stored in memory and the buffer is described in the DT as
a reserved memory region compatible with 'google,open-dice'.
Signed-off-by: David Brazdil <dbrazdil@google.com>
Link: https://lore.kernel.org/r/20220104100645.1810028-2-dbrazdil@google.com
Bug: 198197082
Bug: 209580772
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: If318ad91ef1ae26ff639f99a4349e8c737d286b6
Signed-off-by: Will Deacon <willdeacon@google.com>
As pKVM does not trust the host, it should not be involved in the
handling of, or be able to observe the response to, entropy requests
issued by protected guests.
When an SMC-based implementation of the ARM SMCCC TRNG interface is
present, pass any HVC-based requests directly on to the secure firmware.
Co-developed-by: Ard Biesheuvel <ardb@google.com>
Signed-off-by: Ard Biesheuvel <ardb@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Bug: 209580772
Change-Id: Ica492ce49fd059a62ecc31bb7ac13c9adb773a08
Signed-off-by: Will Deacon <willdeacon@google.com>
Using an alias of the host's `__icache_flags` variable at EL2 for pKVM
is risky, as it provides the host with a mechanism to elide cache
maintenance of guest pages by causing functions such as icache_is_vpipt()
to erroneously return false.
Create a private copy of the __icache_flags variable at EL2 and
initialise it using the host's version during pKVM init.
Signed-off-by: Will Deacon <will@kernel.org>
Bug: 209580772
Change-Id: I595f11d1e336dadae0eb82222e4da79a1069012a
Signed-off-by: Will Deacon <willdeacon@google.com>
On initialising the MMIO guard infrastructure, register the
earlycon mapping if present.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I379387253d08e2414fa386a3360a45391da7d90d
Signed-off-by: Will Deacon <willdeacon@google.com>
In order to transfer the early mapping state into KVM's MMIO
guard infrastructure, provide a small helper that will retrieve
the associated PTE.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: Iefc1c57d5e9476b718a8a68f60e562a57b09fb6a
Signed-off-by: Will Deacon <willdeacon@google.com>
Should a guest desire to enroll into the MMIO guard, allow it to
do so with a command-line option.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: Ia9a77f693531740500739693c52b4959abacafd4
Signed-off-by: Will Deacon <willdeacon@google.com>
Implement the previously defined ioremap/iounmap hooks for arm64,
calling into KVM's MMIO guard if available.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I86a78f8941fb60078fb873a34c5eb32830a00259
Signed-off-by: Will Deacon <willdeacon@google.com>
Add a pair of hooks (ioremap_phys_range_hook/iounmap_phys_range_hook)
that can be implemented by an architecture. Contrary to the existing
arch_sync_kernel_mappings(), this one tracks things at the physical
address level.
This is especially useful in virtualised environments where
the guest has to tell the host whether (and how) it intends to use
an MMIO device.
Signed-off-by: Marc Zyngier <maz@kernel.org>
[willdeacon@: Hook ioremap_page_range() in mm/ioremap.c]
Bug: 209580772
Change-Id: I970c2e632cb2b01060d5e66e4194fa9248188f43
Signed-off-by: Will Deacon <willdeacon@google.com>
Document the hypercalls used for the MMIO guard infrastructure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I927bcd6c5e3ef932265d817288ff2b46b0e0db66
Signed-off-by: Will Deacon <willdeacon@google.com>
Plumb the MMIO checking code into the MMIO fault handling code.
Any fault hitting outside of an MMIO region will now report
an invalid syndrome, and won't leak any data from the guest.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I68bef2d0211a804aa1e598aeaa0c85dc4098f61e
Signed-off-by: Will Deacon <willdeacon@google.com>
Plumb in the hypercall interface to allow a guest to discover,
enroll, map and unmap MMIO regions.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I0390456ffde8ceca351d3d8e82fd1dddeb747fac
Signed-off-by: Will Deacon <willdeacon@google.com>
Introduce the infrastructure required to identify an IPA region
that is expected to be used as an MMIO window.
This includes mapping, unmapping and checking the regions. Nothing
calls into it yet, so no expected functional change.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I227eaa28b98e067e3daae4f9e1071eb37a6761cc
Signed-off-by: Will Deacon <willdeacon@google.com>
Add a per-VM flag indicating that the guest has bought into the
MMIO guard enforcement framework.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: If60b2b38a419a9f44ebe9029f55dd016fd2444b5
Signed-off-by: Will Deacon <willdeacon@google.com>
In order to simplify the implementation of an EL2-only version of
MMIO guard, expose topup_hyp_memcache() and simplify its usage
by only requiring a vcpu.
While we're at it, make free_hyp_memcache() visible in kvm_host.h
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I4f54c57a9693cf7a3450f99fedc15ae32af09a31
Signed-off-by: Will Deacon <willdeacon@google.com>
Define the handful of hypercalls that MMIO guard will require.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: Iac312b2327c31a1532fdb38e8fa8066291d9f611
Signed-off-by: Will Deacon <willdeacon@google.com>
Don't blindly assume that the PTE is valid when checking whether
it describes an executable or cacheable mapping.
This makes sure that we don't issue CMOs for invalid mappings.
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I5b271c91aa6ceb23f7b1e6a571e30d080866d5c9
Signed-off-by: Will Deacon <willdeacon@google.com>
We currently deal with a set of booleans for VM features,
while they could be better represented as a set of flags
contained in an unsigned long, similarly to what we are
doing on the CPU side.
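As a rough userspace illustration of packing feature booleans into one
unsigned long: the enum values, the struct vm_arch type and the helper
names below are hypothetical, and the plain bit operations here are not
atomic, unlike the bitops the kernel would normally use.

```c
#include <stdbool.h>

/* Hypothetical feature bits; the names are illustrative only. */
enum vm_feature {
	VM_FEAT_A = 0,
	VM_FEAT_B = 1,
};

struct vm_arch {
	unsigned long flags;	/* one bit per feature instead of one bool each */
};

/* Turn a feature on. */
static void vm_set_feature(struct vm_arch *vm, enum vm_feature f)
{
	vm->flags |= 1UL << f;
}

/* Turn a feature off. */
static void vm_clear_feature(struct vm_arch *vm, enum vm_feature f)
{
	vm->flags &= ~(1UL << f);
}

/* Query a feature. */
static bool vm_has_feature(const struct vm_arch *vm, enum vm_feature f)
{
	return (vm->flags & (1UL << f)) != 0;
}
```

Adding a feature then means adding an enum value rather than a new
structure member.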
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I86be6bab12287c3eb21bbe03f255e2899edbdffb
Signed-off-by: Will Deacon <willdeacon@google.com>
Since we must still support the dreaded set/way CMOs for non-protected
VMs (as well as the equivalent operation when vcpus switch their MMU
on), perform an invalidation that will iterate over all the pages
that have been donated to the guest, one after the other.
This requires a minor change to the locking used for donation so
that all donated pages can be seen by a concurrent invalidation.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I1780127722bda7bdc884bb4e68db6ae47d042822
There is no difference between protected and non-protected guests
when it comes to shadow structures, and we want these shadow
structures to have the same life cycle.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I7e9bf366aae6bd0542d0038d24e2350a9dd23cd0
Signed-off-by: Will Deacon <willdeacon@google.com>
We want the host to handle everything as usual.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: Icf8ee146917e886bca258815cf948a1b12540353
Signed-off-by: Will Deacon <willdeacon@google.com>
Instead of donating memory to non-pVMs, share the memory, which
gives us a good enough approximation of the usual behaviour.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I47213754613110a6fb8157806eb96ddf92ead346
Signed-off-by: Will Deacon <willdeacon@google.com>
In order to deal with state synchronisation between EL1 and EL2,
we use the following setup:
- On exit from EL2, the state is forcefully marked clean.
- Should a trap be handled, the state is synchronised and immediately
marked dirty
- On vcpu_put(), the state is also marked dirty, since it can be
modified by userspace
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I47a889ca5432566f236de4630d81753348632f8a
Signed-off-by: Will Deacon <willdeacon@google.com>
In order for a non-protected guest to be functional, userspace
has to be able to query its state, which means that the host view
of the vcpu has to be kept up to date.
In order to achieve this, we establish the following scheme for EL2:
- On entering vcpu_run(), we check for the KVM_ARM64_PKVM_STATE_DIRTY
flag in the host vcpu. If set, we sync the state *from* the host
to the shadow version.
- On exiting vcpu_run(), we don't do anything, but let the host
issue a sync hypercall if required.
- On vcpu_put(), we force a synchronisation *to* the host.
The EL1 host will have a complementary approach in the following
patches.
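The EL2 scheme above can be modelled with a toy sketch in which a single
integer stands in for the whole vcpu state. struct vcpu_model and the
model_* helpers are hypothetical; the dirty member only models the
KVM_ARM64_PKVM_STATE_DIRTY flag, and clearing it is left to the host
side and is not modelled here.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of one vcpu. */
struct vcpu_model {
	int host_state;		/* the host's view of the vcpu */
	int shadow_state;	/* the hypervisor's shadow copy */
	bool dirty;		/* models KVM_ARM64_PKVM_STATE_DIRTY */
};

/* Entering vcpu_run(): sync *from* the host only if flagged dirty. */
static void model_run_enter(struct vcpu_model *v)
{
	if (v->dirty)
		v->shadow_state = v->host_state;
}

/* Exiting vcpu_run(): nothing is copied at this point. */
static void model_run_exit(struct vcpu_model *v)
{
	(void)v;
}

/* Sync hypercall issued by the host when it needs current state. */
static void model_sync(struct vcpu_model *v)
{
	v->host_state = v->shadow_state;
}

/* vcpu_put(): always sync *to* the host. */
static void model_put(struct vcpu_model *v)
{
	v->host_state = v->shadow_state;
}
```

The point of the flag in this model is that the copy on entry is
skipped whenever the host has not touched the state.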
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I42811a25d2e176d6c7d9a66ade6e9149a96e9256
Signed-off-by: Will Deacon <willdeacon@google.com>
A non-protected guest requires a lot less handling than a protected
one when dealing with entries/exits from/to EL2.
Since we already indirect those, introduce new entry/exit tables
for non-pVMs.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I66602bc491a4a87d6482b12e4eaf7aa53a7dbfd9
Signed-off-by: Will Deacon <willdeacon@google.com>
As we're about to need to copy some state back and forth for
non-protected guests, pass the full loaded state to the flush/sync
functions.
No functional change.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I7ad6a00a7500e91237fcc0981261c819b2224ee0
Signed-off-by: Will Deacon <willdeacon@google.com>
When pKVM is enabled, all the vcpus must have a shadow structure
managed by the hypervisor, irrespective of their protection status.
This field thus represents the wrong abstraction. Replace it with
'pkvm_loaded_state.is_protected', which tracks whether a vcpu is
part of a protected VM.
pkvm_loaded_state also gets moved around for convenience with the
following patches.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: Ic9876fde543abb350fe8969d5b4661e30092f553
Signed-off-by: Will Deacon <willdeacon@google.com>
A number of KVM definitions are keyed on __KVM_NVHE_HYPERVISOR__
being defined or not. Make sure we advertise this #define when
compiling hyp-constants.o, so that we get the right stuff.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: Ied191c0a18274258cffede72b06b0fb5bba5604e
Signed-off-by: Will Deacon <willdeacon@google.com>
Instead of poking into the internals of the host KVM structure,
stick to the shadow structures when trying to work out whether
a vcpu is part of a protected VM or not.
Take this opportunity to sprinkle a couple of unlikely(), just
because.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I22a096e1e3cfe34cd2658684b02d8bac486416c4
Signed-off-by: Will Deacon <willdeacon@google.com>
As we can't really rely on the host side for the protection status,
snapshot the expected status at VM creation time.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 209580772
Change-Id: I0943eadba25e6c9fe718f29e749b9fcc8fbb79ba
Signed-off-by: Will Deacon <willdeacon@google.com>