According to the PSCI specification, ARM DEN 0022D, 5.1.4 "CPU_ON", the
CPU_ON function takes a target_cpu argument that is bit-compatible with
the affinity fields in MPIDR_EL1. All other bits in the argument are
RES0. Note that the same constraints apply to the target_affinity
argument for the AFFINITY_INFO call.
Enforce the spec by returning INVALID_PARAMS if a guest incorrectly sets
a RES0 bit.
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210818202133.1106786-4-oupton@google.com
(cherry picked from commit e10ecb4d6c)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Ib07086826c08d7a7b1018b89d445cd33d387d01b
The CPU_ON PSCI call takes a payload that KVM uses to configure a
destination vCPU to run. This payload is non-architectural state and not
exposed through any existing UAPI. Effectively, we have a race between
CPU_ON and userspace saving/restoring a guest: if the target vCPU isn't
ran again before the VMM saves its state, the requested PC and context
ID are lost. When restored, the target vCPU will be runnable and start
executing at its old PC.
We can avoid this race by making sure the reset payload is serviced
before userspace can access a vCPU's state.
Fixes: 358b28f09f ("arm/arm64: KVM: Allow a VCPU to fully reset itself")
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210818202133.1106786-3-oupton@google.com
(cherry picked from commit 6826c6849b)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Ic163ea6f91941a161913fa495e9a677d5e3589df
KVM correctly serializes writes to a vCPU's reset state, however since
we do not take the KVM lock on the read side it is entirely possible to
read state from two different reset requests.
Cure the race for now by taking the KVM lock when reading the
reset_state structure.
Fixes: 358b28f09f ("arm/arm64: KVM: Allow a VCPU to fully reset itself")
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210818202133.1106786-2-oupton@google.com
(cherry picked from commit 6654f9dfcb)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I8b563f223a77b8bd6ec486c6243f02d3d65d6cd2
When protected mode is enabled, the host is unable to access most parts
of the EL2 hypervisor image, including 'hyp_physvirt_offset' and the
contents of the hypervisor's '.rodata.str' section. Unfortunately,
nvhe_hyp_panic_handler() tries to read from both of these locations when
handling a BUG() triggered at EL2; the former for converting the ELR to
a physical address and the latter for displaying the name of the source
file where the BUG() occurred.
Hack the EL2 panic asm to pass both physical and virtual ELR values to
the host and utilise the newly introduced CONFIG_NVHE_EL2_DEBUG so that
we disable stage-2 protection for the host before returning to the EL1
panic handler. If the debug option is not enabled, display the address
instead of the source file:line information.
Cc: Andrew Scull <ascull@google.com>
Cc: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210813130336.8139-1-will@kernel.org
(cherry picked from commit ccac969772)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I1fa87b3fea827b03ca9c8711c70680c26dba1e4c
KVM/arm64 is the sole user of perf_num_counters(), and really
could do without it. Stop using the obsolete API by relying on
the existing probing code.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210414134409.1266357-2-maz@kernel.org
(cherry picked from commit 5421db1be3)
[willdeacon@: Resolve 'is_protected_kvm_enabled()' context conflict]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Iea7438030314455002f76eefaa6cb7d945dd8e08
The host kernel is currently able to change EL2 stage-1 mappings without
restrictions thanks to the __pkvm_create_mappings() hypercall. But in a
world where the host is no longer part of the TCB, this clearly poses a
problem.
To fix this, introduce a new hypercall to allow the host to share a
physical memory page with the hypervisor, and remove the
__pkvm_create_mappings() variant. The new hypercall implements
ownership and permission checks before allowing the sharing operation,
and it annotates the shared page in the hypervisor stage-1 and host
stage-2 page-tables.
Signed-off-by: Quentin Perret <qperret@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-21-qperret@google.com
(cherry picked from commit 66c57edd3b)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Iee84b10f898f00e2f82169a7a31c748e38b20f33
Refactor the hypervisor stage-1 locking in nVHE protected mode to expose
a new pkvm_create_mappings_locked() function. This will be used in later
patches to allow walking and changing the hypervisor stage-1 without
releasing the lock.
Signed-off-by: Quentin Perret <qperret@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-20-qperret@google.com
(cherry picked from commit f9370010e9)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Ifd3fffbed53356f33bf1fdb4e4a0ac5b98f5b11f
Now that we mark memory owned by the hypervisor in the host stage-2
during __pkvm_init(), we no longer need to rely on the host to
explicitly mark the hyp sections later on.
Remove the __pkvm_mark_hyp() hypercall altogether.
Signed-off-by: Quentin Perret <qperret@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-19-qperret@google.com
(cherry picked from commit ad0e0139a8)
[willdeacon@: Resolve conflict with kmemleak fix]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I046c26613dea9406e07bd0e2a83aff1715a56991
As the hypervisor maps the host's .bss and .rodata sections in its
stage-1, make sure to tag them as shared in hyp and host page-tables.
But since the hypervisor relies on the presence of these mappings, we
cannot let the host in complete control of the memory regions -- it
must not unshare or donate them to another entity for example. To
prevent this, let's transfer the ownership of those ranges to the
hypervisor itself, and share the pages back with the host.
Signed-off-by: Quentin Perret <qperret@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-18-qperret@google.com
(cherry picked from commit 2c50166c62)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Ie89c6656848e4056e75071be57dcd9aad885b11d
We will soon start annotating shared pages in page-tables in nVHE
protected mode. Define all the states in which a page can be (owned,
shared and owned, shared and borrowed), and provide helpers allowing to
convert this into SW bits annotations using the matching prot
attributes.
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-13-qperret@google.com
(cherry picked from commit ec250a67ea)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Iffc5738d7c8cd8731a976a8bb74c029cc09f9975
Much of the stage-2 manipulation logic relies on being able to destroy
block mappings if e.g. installing a smaller mapping in the range. The
rationale for this behaviour is that stage-2 mappings can always be
re-created lazily. However, this gets more complicated when the stage-2
page-table is used to store metadata about the underlying pages. In such
cases, destroying a block mapping may lead to losing part of the state,
and confuse the user of those metadata (such as the hypervisor in nVHE
protected mode).
To avoid this, introduce a callback function in the pgtable struct which
is called during all map operations to determine whether the mappings
can use blocks, or should be forced to page granularity. This is used by
the hypervisor when creating the host stage-2 to force page-level
mappings when using non-default protection attributes.
Signed-off-by: Quentin Perret <qperret@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-11-qperret@google.com
(cherry picked from commit 5651311941)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I16a35b501d73ae1c58604a60fd271fc79361eb49
We will soon start annotating page-tables with new flags to track shared
pages and such, and we will do so in valid mappings using software bits
in the PTEs, as provided by the architecture. However, it is possible
that we will need to use those flags to annotate invalid mappings as
well in the future, similar to what we do to track page ownership in the
host stage-2.
In order to facilitate the annotation of invalid mappings with such
flags, it would be preferable to re-use the same bits as for valid
mappings (bits [58-55]), but these are currently used for ownership
encoding. Since we have plenty of bits left to use in invalid
mappings, move the ownership bits further down the PTE to avoid the
conflict.
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-9-qperret@google.com
(cherry picked from commit 8a0282c681)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I14553bde06e4919a913daa6e798e5d9e101d0c14
The ignored bits for both stage-1 and stage-2 page and block
descriptors are in [55:58], so rename KVM_PTE_LEAF_ATTR_S2_IGNORED to
make it applicable to both. And while at it, since these bits are more
commonly known as 'software' bits, rename accordingly.
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-8-qperret@google.com
(cherry picked from commit 178cac08d5)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Idf7138edc6ab973babd356704d6eb46e901ea5ad
The kvm_pgtable_stage2_find_range() function is used in the host memory
abort path to try and look for the largest block mapping that can be
used to map the faulting address. In order to do so, the function
currently walks the stage-2 page-table and looks for existing
incompatible mappings within the range of the largest possible block.
If incompatible mappings are found, it tries the same procedure again,
but using a smaller block range, and repeats until a matching range is
found (potentially up to page granularity). While this approach has
benefits (mostly in the fact that it proactively coalesces host stage-2
mappings), it can be slow if the ranges are fragmented, and it isn't
optimized to deal with CPUs faulting on the same IPA as all of them will
do all the work every time.
To avoid these issues, remove kvm_pgtable_stage2_find_range(), and walk
the page-table only once in the host_mem_abort() path to find the
closest leaf to the input address. With this, use the corresponding
range if it is invalid and not owned by another entity. If a valid leaf
is found, return -EAGAIN similar to what is done in the
kvm_pgtable_stage2_map() path to optimize concurrent faults.
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-7-qperret@google.com
(cherry picked from commit c4f0935e4d)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I37f864cbe58f8bebed6961ed02dbdc4aa5c1bf0f
The KVM pgtable API exposes the kvm_pgtable_walk() function to allow
the definition of walkers outside of pgtable.c. However, it is not easy
to implement any of those walkers without some of the low-level helpers.
Move some of them to the header file to allow re-use from other places.
Signed-off-by: Quentin Perret <qperret@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-6-qperret@google.com
(cherry picked from commit 51add45773)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Ie04c943a17c09cf99e81eda8ae8c59670b10e3b6
We currently unmap all MMIO mappings from the host stage-2 to recycle
the pages whenever we run out. In order to make this pattern easy to
re-use from other places, factor the logic out into a dedicated macro.
While at it, apply the macro for the kvm_pgtable_stage2_set_owner()
calls. They're currently only called early on and are guaranteed to
succeed, but making them robust to the -ENOMEM case doesn't hurt and
will avoid painful debugging sessions later on.
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-4-qperret@google.com
(cherry picked from commit 1bac49d490)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I178f7461e7498d1dac73209e42f24ae049808150
Introduce a poor man's lockdep implementation at EL2 which allows to
BUG() whenever a hyp spinlock is not held when it should. Hide this
feature behind a new Kconfig option that targets the EL2 object
specifically, instead of piggy backing on the existing CONFIG_LOCKDEP.
EL2 cannot WARN() cleanly to report locking issues, hence BUG() is the
only option and it is not clear whether we want this widely enabled.
This is most likely going to be useful for local testing until the EL2
WARN() situation has improved.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210809152448.1810400-3-qperret@google.com
(cherry picked from commit 8e049e0daf)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Ib83aaa51d92dda2618f8294991953b6a3ea6ceb8
Booting a KVM host in protected mode with kmemleak quickly results
in a pretty bad crash, as kmemleak doesn't know that the HYP sections
have been taken away. This is specially true for the BSS section,
which is part of the kernel BSS section and registered at boot time
by kmemleak itself.
Unregister the HYP part of the BSS before making that section
HYP-private. The rest of the HYP-specific data is obtained via
the page allocator or lives in other sections, none of which is
subjected to kmemleak.
Fixes: 90134ac9ca ("KVM: arm64: Protect the .hyp sections from the host")
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org # 5.13
Link: https://lore.kernel.org/r/20210802123830.2195174-3-maz@kernel.org
(cherry picked from commit 47e6223c84)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I1c2f670e130784b00b89f93ea71a52db901077be
KVM/ARM has an architecture-specific implementation of
kvm_flush_remote_tlbs; however, unlike the generic one,
it does not count the flushes in kvm->stat.remote_tlb_flush,
so that it inexorably remained stuck to zero.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210727103251.16561-1-pbonzini@redhat.com
(cherry picked from commit 38f703663d)
[willdeacon@: Drop reference to 'generic' member of 'struct kvm_vm_stat']
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I09953759935ab362d5c9419df8ea8d99c584154b
We keep an entry for the PMSWINC_EL0 register in the vcpu structure,
while *never* writing anything there outside of reset.
Given that the register is defined as write-only, that we always
trap when this register is accessed, there is little point in saving
anything anyway.
Get rid of the entry, and save a mighty 8 bytes per vcpu structure.
We still need to keep it exposed to userspace in order to preserve
backward compatibility with previously saved VMs. Since userspace
cannot expect any effect of writing to PMSWINC_EL0, treat the
register as RAZ/WI for the purpose of userspace access.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210719123902.1493805-5-maz@kernel.org
(cherry picked from commit 7a3ba3095a)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I7c3a7ffd90d9e8c2665afeb2894fcf4714a56018
In a KVM guest on arm64, performance counters interrupts have an
unnecessary overhead which slows down execution when using the "perf
record" command and limits the "perf record" sampling period.
The problem is that when a guest VM disables counters by clearing the
PMCR_EL0.E bit (bit 0), KVM will disable all counters defined in
PMCR_EL0 even if they are not enabled in PMCNTENSET_EL0.
KVM disables a counter by calling into the perf framework, in particular
by calling perf_event_create_kernel_counter() which is a time consuming
operation. So, for example, with a Neoverse N1 CPU core which has 6 event
counters and one cycle counter, KVM will always disable all 7 counters
even if only one is enabled.
This typically happens when using the "perf record" command in a guest
VM: perf will disable all event counters with PMCNTENTSET_EL0 and only
uses the cycle counter. And when using the "perf record" -F option with
a high profiling frequency, the overhead of KVM disabling all counters
instead of one on every counter interrupt becomes very noticeable.
The problem is fixed by having KVM disable only counters which are
enabled in PMCNTENSET_EL0. If a counter is not enabled in PMCNTENSET_EL0
then KVM will not enable it when setting PMCR_EL0.E and it will remain
disabled as long as it is not enabled in PMCNTENSET_EL0. So there is
effectively no need to disable a counter when clearing PMCR_EL0.E if it
is not enabled PMCNTENSET_EL0.
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
[maz: moved 'mask' close to the actual user, simplifying the patch]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210712170345.660272-1-alexandre.chartre@oracle.com
Link: https://lore.kernel.org/r/20210719123902.1493805-4-maz@kernel.org
(cherry picked from commit ca4f202d08)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I9ab70b39a58fcd0f50841912b7cad507ea5a7dcc
A number of the PMU sysregs expose reset values that are not
compliant with the architecture (set bits in the RES0 ranges,
for example).
This in turn has the effect that we need to pointlessly mask
some register fields when using them.
Let's start by making sure we don't have illegal values in the
shadow registers at reset time. This affects all the registers
that dedicate one bit per counter, the counters themselves,
PMEVTYPERn_EL0 and PMSELR_EL0.
Reported-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210719123902.1493805-2-maz@kernel.org
(cherry picked from commit 0ab410a93d)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Ia08515affbec50257c088dedad80e4db0dc67dfb
When mapping a THP, we are guaranteed that the page isn't reserved,
and we can safely avoid the kvm_is_reserved_pfn() call.
Replace kvm_get_pfn() with get_page(pfn_to_page()).
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210726153552.1535838-6-maz@kernel.org
(cherry picked from commit 0fe4963010)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I4f52a2cc0994514f8243712afab546e5ee311f5b
Since we only support PMD-sized mappings for THP, getting
a permission fault on a level that results in a mapping
being larger than PAGE_SIZE is a sure indication that we have
already upgraded our mapping to a PMD.
In this case, there is no need to try and parse userspace page
tables, as the fault information already tells us everything.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210726153552.1535838-4-maz@kernel.org
(cherry picked from commit f2cc327303)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: Ice71a7f1ae7c84d0a408f293dc3f15fee83140c2
We currently rely on the kvm_is_transparent_hugepage() helper to
discover whether a given page has the potential to be mapped as
a block mapping.
However, this API doesn't really give un everything we want:
- we don't get the size: this is not crucial today as we only
support PMD-sized THPs, but we'd like to have larger sizes
in the future
- we're the only user left of the API, and there is a will
to remove it altogether
To address the above, implement a simple walker using the existing
page table infrastructure, and plumb it into transparent_hugepage_adjust().
No new page sizes are supported in the process.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://lore.kernel.org/r/20210726153552.1535838-3-maz@kernel.org
(cherry picked from commit 6011cf68c8)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 198418208
Change-Id: I21c25ea6c61e21925fa73bf59f78b58d9922af9b