The MODULE_IMPORT_NS() macro does not allow defined strings to work
properly with it, so add a layer of indirection to allow this to happen.
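As an illustration of why the extra layer helps, here is a minimal standalone
sketch of the preprocessor behaviour involved. The names NS_STRING_DIRECT,
NS_STRING_INNER, NS_STRING and MY_NAMESPACE are hypothetical stand-ins, not
the kernel's definitions. The `#` operator stringifies its argument without
expanding it, so a defined name is only expanded if it passes through one
more macro first.

```c
#include <string.h>

/* Hypothetical stand-ins; these are not the kernel's macros. */
#define NS_STRING_DIRECT(ns)  #ns                  /* stringifies the raw token */
#define NS_STRING_INNER(ns)   #ns
#define NS_STRING(ns)         NS_STRING_INNER(ns)  /* expands ns, then stringifies */

/* A namespace supplied through a define, as a driver might do. */
#define MY_NAMESPACE USB_STORAGE

const char *direct_name(void)
{
    return NS_STRING_DIRECT(MY_NAMESPACE);  /* yields "MY_NAMESPACE" */
}

const char *indirect_name(void)
{
    return NS_STRING(MY_NAMESPACE);         /* yields "USB_STORAGE" */
}
```

Only the indirect form produces the namespace the define was meant to carry.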
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Matthias Maennich <maennich@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Matthias Maennich <maennich@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
(cherry picked from commit ca321ec743)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ibd64ba139912ea10e81ac22490831129b23a31e1
Add code to head.S's el2_setup to detect MPAM and disable any EL2 traps.
This register resets to an unknown value; setting it to the default
partitions/pmg before we enable the MMU is the best thing to do.
Kexec/kdump will depend on this if the previous kernel left the CPU
configured with a restrictive configuration.
If Linux is booted at the highest implemented exception level, el2_setup
will clear the enable bit, disabling MPAM.
Signed-off-by: James Morse <james.morse@arm.com>
Bug: 221768437
(cherry picked from commit fa0ff38f06b397d8a92d88eb8083c2c5a20ac87f
git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git mpam/snapshot/v5.16)
Change-Id: I2758f7f7b236d09a207e13d1165efb6887e8611a
Signed-off-by: Valentin Schneider <Valentin.Schneider@arm.com>
[bm: amended commit msg, dropped config option and switched to named labels]
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
This is a partial cherry-pick of commit:
7fe77616f156 ("arm64: cpufeature: discover CPU support for MPAM")
from git://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git
Bug: 221768437
Change-Id: I77101abb07f9b73dbc7cc2a53ac44fbf772f0b1d
Signed-off-by: Valentin Schneider <Valentin.Schneider@arm.com>
Signed-off-by: Beata Michalska <beata.michalska@arm.com>
virtio pci config structures may in future have non-standard bar
values in the bar field. We should anticipate this by skipping any
structures containing such a reserved value.
The bar value should never change: check for harmful modified values
when we re-read it from the config space in vp_modern_map_capability().
Also clean up an existing check to consistently use PCI_STD_NUM_BARS.
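The skip-and-recheck idea can be sketched in plain C as follows. This is a
simplified illustration, not the driver code: struct cap_sketch and all
function names are hypothetical, and SKETCH_STD_NUM_BARS only mirrors the
value of PCI_STD_NUM_BARS (six standard BARs). How the driver validates the
re-read value in detail is not shown here; the comparison below is an
assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the value of the kernel's PCI_STD_NUM_BARS. */
#define SKETCH_STD_NUM_BARS 6

/* Hypothetical, simplified view of one capability structure. */
struct cap_sketch {
    uint8_t bar;
    uint32_t offset;
    uint32_t length;
};

/* A bar index outside the standard range is treated as reserved. */
bool cap_bar_is_standard(uint8_t bar)
{
    return bar < SKETCH_STD_NUM_BARS;
}

/* Return the first capability with a standard bar, skipping the others. */
const struct cap_sketch *find_usable_cap(const struct cap_sketch *caps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (cap_bar_is_standard(caps[i].bar))
            return &caps[i];
    }
    return NULL;
}

/* Assumed re-read check: the bar must be unchanged and still standard. */
bool reread_bar_is_safe(uint8_t first_read, uint8_t reread)
{
    return reread == first_read && cap_bar_is_standard(reread);
}
```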
Signed-off-by: Keir Fraser <keirf@google.com>
Link: https://lore.kernel.org/r/20220323140727.3499235-1-keirf@google.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 3f63a1d7f6)
Bug: 222232623
[keirf@: Pass virtio_pci_device to map_capability. Move everything
into virtio_pci_modern.c]
Signed-off-by: Keir Fraser <keirf@google.com>
Change-Id: Idbba48154a051cf173b9cb0bd40c77fcf02902a4
This reverts commit 9e35276a53. Issues
were reported for drivers that use affinity-managed IRQs,
where manually toggling IRQ status is not expected. We also forgot to
enable the interrupts in the restore path.
In the future, we will rework the interrupt hardening.
Fixes: 9e35276a53 ("virtio_pci: harden MSI-X interrupts")
Reported-by: Marc Zyngier <maz@kernel.org>
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20220323031524.6555-2-jasowang@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit eb4cecb453)
Bug: 196772804
Signed-off-by: Keir Fraser <keirf@google.com>
Change-Id: I05264d9e61d558522a8a20cf87399aa3578b3a6e
The synchronous wakeup interface is available only for the
interruptible wakeup. Add it for normal wakeup and use this
synchronous wakeup interface to wake up the userspace daemon.
The scheduler can make use of this hint to find a better CPU for
the waker task.
With this change, the performance numbers for the compress, decompress
and copy use-cases on the /sdcard path have improved by ~30%.
Use-case details:
1. copy 10000 files of 4k size each into /sdcard path
2. use any File explorer application that has compress/decompress
support
3. start compress/decompress and capture the time.
-------------------------------------------------
| Default | wakeup support | Improvement/Diff |
-------------------------------------------------
| 13.8 sec | 9.9 sec | 3.9 sec (28.26%) |
-------------------------------------------------
Co-developed-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Signed-off-by: Pradeep P V K <quic_pragalla@quicinc.com>
Bug: 216261533
Link: https://lore.kernel.org/lkml/1638780405-38026-1-git-send-email-quic_pragalla@quicinc.com/
Change-Id: I9ac89064e34b1e0605064bf4d2d3a310679cb605
Signed-off-by: Pradeep P V K <quic_pragalla@quicinc.com>
Signed-off-by: Alessio Balsini <balsini@google.com>
We no longer need to map the host's .rodata and .bss sections in the
pkvm hypervisor, so let's remove those mappings. This will avoid
creating dependencies at EL2 on host-controlled data-structures.
Signed-off-by: Quentin Perret <qperret@google.com>
Bug: 225169428
Change-Id: I0fcb0e1b34d3c7c0c226b3fd30cdec0e8d7bfb44
The pkvm hypervisor may need to read the kvm_vgic_global_state variable
at EL2. Make sure to explicitly map it in its stage-1 page-table
rather than relying on mapping all of .rodata.
Signed-off-by: Quentin Perret <qperret@google.com>
Bug: 225169428
Change-Id: I72d1eba78fb6b7593d236539cd81269480856fdf
In pKVM mode, we can't trust the host not to mess with the hypervisor
per-cpu offsets, so let's move the array containing them to the nVHE
code.
Signed-off-by: Quentin Perret <qperret@google.com>
Bug: 225169428
Change-Id: I9ef4175ce9cf00d6ff1c0e358551a565358f2408
The host KVM PMU code can currently index kvm_arm_hyp_percpu_base[]
through this_cpu_ptr_hyp_sym(), but will not actually dereference that
pointer when protected KVM is enabled. In preparation for making
kvm_arm_hyp_percpu_base[] inaccessible to the host, let's make sure the
indexing in hyp per-cpu pages is also done after the static key check to
avoid spurious accesses to EL2-private data from EL1.
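The reordering can be pictured with a small standalone sketch. Everything
here is a hypothetical stand-in: sketch_protected_mode plays the role of the
static key, sketch_this_cpu_ptr() the role of the per-cpu indexing, and the
counter exists only to make the touched/not-touched difference observable.

```c
#include <stdbool.h>
#include <stddef.h>

bool sketch_protected_mode;   /* stand-in for the static key */
int sketch_index_count;       /* counts per-cpu indexing operations */
int sketch_hyp_events[4];     /* stand-in for hyp per-cpu data */

/* Stand-in for indexing into the hyp per-cpu area. */
int *sketch_this_cpu_ptr(int cpu)
{
    sketch_index_count++;
    return &sketch_hyp_events[cpu];
}

/* Old ordering: index first, then bail out in protected mode. */
int *get_events_index_first(int cpu)
{
    int *events = sketch_this_cpu_ptr(cpu);

    if (sketch_protected_mode)
        return NULL;
    return events;
}

/* New ordering: check first, so protected mode never touches the array. */
int *get_events_check_first(int cpu)
{
    if (sketch_protected_mode)
        return NULL;
    return sketch_this_cpu_ptr(cpu);
}
```

Both variants return the same result; only the second avoids the indexing
step when protected mode is on.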
Signed-off-by: Quentin Perret <qperret@google.com>
Bug: 225169428
Change-Id: I3f4e3f7ee789c31a1ae1f67e07edf8fb34f520b9
Vendors may need to track rt util.
Bug: 201261299
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: I2f4e5142c6bc8574ee3558042e1fb0dae13b702d
Add two new symbols to aarch64 kernel ABI:
* pkvm_iommu_sysmmu_sync_register
* pkvm_iommu_finalize
The former allows vendor modules to register a SYSMMU_SYNC device with
the hypervisor, and the latter tells the hypervisor to stop accepting
new device registrations.
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I6c6948d94cb6494f07d52b4e2b7e91db40e2fcd6
Add a new hypercall that the host can use to inform the hypervisor that
all hypervisor-controlled IOMMUs have been registered and no new
registrations should be allowed. This will typically be called at the
end of the kernel module initialization phase.
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I8c175310d5b262a67947443c5a0154056a8ebf3e
The IOMMU DABT handler currently checks if the device is considered
powered by hyp before resolving the request. If the power tracking does
not reflect reality, the IOMMU may trigger issues in the host but the
incorrect state prevents it from diagnosing the issue.
Drop the powered check from the generic IOMMU code. The host accessing
the device's SFR means that it assumes it is powered, and individual
drivers can choose to reject that DABT request.
Bug: 224891559
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I1c132c4030a61a90be4675867c9658e3bc696118
SysMMU_SYNC devices expose an interface to start a sync counter and
poll its SFR until the device signals that all memory transactions in
flight at the start have drained. This gives the hypervisor a reliable
indicator that S2MPU invalidation has fully completed and all new
transactions will use the new MPTs.
Add a new pKVM IOMMU driver that the host can use to register
SysMMU_SYNCs. Each device is expected to be a supplier to exactly one
S2MPU (parent), but multiple SYNCs can supply a single S2MPU.
To keep things simple, the SYNCs do not implement suspend/resume and are
assumed to follow the power transitions of their parent.
Following an invalidation, the S2MPU driver iterates over its children
and waits for each SYNC to signal that its transactions have drained.
The algorithm currently waits on each SYNC in turn. If latency proves to
be an issue, this could be optimized to initiate a SYNC on all powered
devices before starting to poll.
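A minimal sketch of the "wait on each SYNC in turn" loop, under simplifying
assumptions: struct sync_sketch, sync_wait() and wait_for_children() are
hypothetical names, and the device is simulated by a counter of polls
remaining rather than by reading a real SFR.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simulated SYNC device: polls_left reaches 0 when transactions drained. */
struct sync_sketch {
    int polls_left;
};

/* Poll one device up to max_polls times; returns false on timeout. */
bool sync_wait(struct sync_sketch *s, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (s->polls_left == 0)
            return true;
        s->polls_left--;  /* one simulated SFR read */
    }
    return s->polls_left == 0;
}

/* Wait on each child in turn; returns how many children timed out. */
size_t wait_for_children(struct sync_sketch *kids, size_t n, int max_polls)
{
    size_t timeouts = 0;

    for (size_t i = 0; i < n; i++) {
        if (!sync_wait(&kids[i], max_polls))
            timeouts++;
    }
    return timeouts;
}
```

Because each child is polled to completion before the next one starts, the
total wait is the sum of the individual waits, which is the latency concern
mentioned above.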
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I45b832fd11d76b65987935c8548e2a214ee2fa2a
In preparation for adding new IOMMU devices that act as suppliers to
others, add the notion of a parent IOMMU device. Such a device must be
registered after its parent and the driver of the parent device must
validate the addition.
The relation has no generic implications; it is up to drivers to make
use of it.
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I4ee3675e5529bb73ad4546fa32380f237f054177
In preparation for needing to validate more aspects of a device that is
about to be registered, change the callback to accept the to-be-added
'struct pkvm_iommu' rather than individual inputs.
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I3fb911e4280c220ddd779cf6a5fc9c302a5617f7
Private EL2 mappings currently cannot be removed. Move the creation of
IOMMU device mappings to the end of the registration function so that
other errors do not result in unnecessary mappings.
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I3139e9af3345f157295eb72441a7cf3cc055116d
Memory for IOMMU device entries gets allocated from a pool donated by
the host. It is possible for pkvm_iommu_register() to allocate the
memory and then fail, in which case the memory remains unused but not
freed.
Refactor the code such that the host lock covers the entire section
where the memory is allocated. This way we can return the memory back to
the linear allocator if an error is returned.
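The allocate-then-roll-back pattern can be sketched with a toy linear
allocator. All names (struct pool_sketch, pool_alloc(), pool_undo_last(),
register_sketch(), DEV_ENTRY_SIZE) are hypothetical, and the lock is only
indicated by comments. Undoing is safe here only because nothing else can
allocate between the allocation and the rollback, which is what holding the
lock across the whole section provides.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy linear (bump) allocator over a donated buffer. */
struct pool_sketch {
    unsigned char *base;
    size_t size;
    size_t used;
};

void *pool_alloc(struct pool_sketch *p, size_t n)
{
    void *ptr;

    if (n > p->size - p->used)
        return NULL;
    ptr = p->base + p->used;
    p->used += n;
    return ptr;
}

/* Valid only for the most recent allocation, hence the lock requirement. */
void pool_undo_last(struct pool_sketch *p, size_t n)
{
    p->used -= n;
}

#define DEV_ENTRY_SIZE 64  /* made-up entry size for the sketch */

int register_sketch(struct pool_sketch *p, bool validation_ok)
{
    int ret = 0;
    void *entry;

    /* lock taken here in a real implementation */
    entry = pool_alloc(p, DEV_ENTRY_SIZE);
    if (!entry) {
        ret = -1;
    } else if (!validation_ok) {
        pool_undo_last(p, DEV_ENTRY_SIZE);  /* give the memory back */
        ret = -2;
    }
    /* lock released here */
    return ret;
}
```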
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I8c1650ba3e545741144d793de506e93c4066896f
Currently __pkvm_iommu_pm_notify always changes the value of
dev->powered following a suspend/resume attempt. This could potentially
be abused to force the hypervisor to stop issuing updates to an S2MPU
and preserve an old/invalid state.
Modify it to only update the power state if suspend/resume was successful.
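In sketch form, the change amounts to making the state update conditional on
the result of the driver callback. The types and functions below
(struct iommu_sketch, pm_op_t, pm_notify_sketch() and the two sample
callbacks) are hypothetical and only illustrate the control flow.

```c
#include <stdbool.h>

struct iommu_sketch {
    bool powered;
};

typedef int (*pm_op_t)(struct iommu_sketch *dev);

/* Sample callbacks standing in for a driver's suspend/resume handler. */
int op_succeeds(struct iommu_sketch *dev) { (void)dev; return 0; }
int op_fails(struct iommu_sketch *dev)    { (void)dev; return -1; }

/* Update the tracked power state only if the operation succeeded. */
int pm_notify_sketch(struct iommu_sketch *dev, bool resume, pm_op_t op)
{
    int ret = op(dev);

    if (ret == 0)
        dev->powered = resume;
    return ret;
}
```

A failed suspend therefore leaves the device marked as powered, so updates
to it keep being issued.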
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I285fc822e9fc926c49b9b5e69446790e1edccafb
Passing FOLL_FORCE when pinning guest memory pages was intended to allow
the VMM to map guest memory as PROT_NONE without prohibiting access from
the guest. As it turns out, crosvm doesn't implement this, and since
the host kernel will inject a signal into the VMM on a bad access
irrespective of the stage-1 permissions, we can drop the FOLL_FORCE flag
altogether.
Bug: 226564150
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: If21091b6adf3dbe4155c5c840753c912d283b159
This reverts commit e853c3b172.
This capability is unused, so remove it to avoid UAPI divergence from
upstream.
Bug: 226564150
[willdeacon@: Also removed additional instance in arch/arm64/kvm/arm.c]
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Ib3e929a5fc81dc5c9c1ff8512d48f63bdda5c404
This reverts commit 7f19cf521f.
These notifications are unused by crosvm and are no longer required now
that the host takes care of injecting a SEGV on an illegal memory access
from userspace.
Bug: 226564150
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I22c3e49b4aa5f023961c8849b79e2e0a21ebf0c1
Vendors may need to implement their own util tracking.
Bug: 201260585
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: I973902e6ff82a85ecd029ac5a78692d629df1ebe
The pKVM hypervisor will currently panic if the host tries to access
memory that it doesn't own (e.g. protected guest memory). Sadly, as
guest memory can still be mapped into the VMM's address space, userspace
can trivially crash the kernel/hypervisor by poking into guest memory.
To prevent this, inject the abort back into the host with S1PTW set in the
ESR, hence allowing the host to differentiate this abort from normal
userspace faults and inject a SIGSEGV cleanly.
Signed-off-by: Quentin Perret <qperret@google.com>
Bug: 215520143
Change-Id: I9636e71e2fe3eb49d2d7cddaab7774cd672cfcae
In order to simplify the injection of exceptions in the host in pkvm
context, let's factor out of enter_exception64() the code calculating
the exception offset from VBAR_EL1 and the cpsr.
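For reference, the offset part of that calculation follows the AArch64
vector table layout: four groups of 0x200 bytes selected by where the
exception was taken from, each holding four 0x80-byte entries selected by
exception type. The sketch below uses hypothetical enum and function names
and does not cover the cpsr computation.

```c
#include <stdint.h>

/* Where the exception was taken from, in vector table order. */
enum src_ctx {
    SRC_SAME_EL_SP0  = 0,  /* current EL, using SP_EL0 */
    SRC_SAME_EL_SPX  = 1,  /* current EL, using SP_ELx */
    SRC_LOWER_EL_A64 = 2,  /* lower EL, AArch64 */
    SRC_LOWER_EL_A32 = 3   /* lower EL, AArch32 */
};

/* Offset of each exception type within a group. */
enum exc_type {
    EXC_SYNC   = 0x000,
    EXC_IRQ    = 0x080,
    EXC_FIQ    = 0x100,
    EXC_SERROR = 0x180
};

/* Offset of the vector from the base address register. */
uint64_t vector_offset(enum src_ctx src, enum exc_type type)
{
    return (uint64_t)src * 0x200 + (uint64_t)type;
}

/* Address execution resumes at when the exception is injected. */
uint64_t exception_entry_pc(uint64_t vbar, enum src_ctx src, enum exc_type type)
{
    return vbar + vector_offset(src, type);
}
```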
Signed-off-by: Quentin Perret <qperret@google.com>
Bug: 215520143
Change-Id: I97b2431a79fdec87c95c2d1f691bd3a11635c29b
Add a helper to check whether the pkvm static key is enabled, to
ease the introduction of pkvm hooks in other parts of the code.
Signed-off-by: Quentin Perret <qperret@google.com>
Bug: 215520143
Change-Id: Iae065b09bb33d42d73a408365c803727269d0de0
If a malicious/compromised host issues a PSCI SYSTEM_RESET call in the
presence of guest-owned pages then the contents of those pages may be
susceptible to cold-reboot attacks.
Use the PSCI MEM_PROTECT call to ensure that volatile memory is wiped by
the firmware if a SYSTEM_RESET occurs while unpoisoned guest pages exist
in the system. Since this call does not offer protection for a "warm"
reset initiated by SYSTEM_RESET2, detect this case in the PSCI relay and
repaint the call to a standard SYSTEM_RESET instead.
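The repainting step might look roughly like the sketch below. The function
IDs are the PSCI SMCCC values for SYSTEM_RESET and SYSTEM_RESET2 as far as
I know them, but treat them as assumptions; relay_reset_func_id() and its
guest_pages_exist flag are hypothetical, and whether the real relay inspects
the reset type or any other state is not stated here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed PSCI function IDs (SMC32 and SMC64 variants). */
#define SK_PSCI_SYSTEM_RESET      0x84000009u
#define SK_PSCI_SYSTEM_RESET2_32  0x84000012u
#define SK_PSCI_SYSTEM_RESET2_64  0xC4000012u

/*
 * Return the function ID to forward to firmware. A SYSTEM_RESET2 request
 * is replaced by a plain SYSTEM_RESET while guest pages exist, so that the
 * MEM_PROTECT wipe applies to the reset that actually happens.
 */
uint32_t relay_reset_func_id(uint32_t func_id, bool guest_pages_exist)
{
    bool is_reset2 = func_id == SK_PSCI_SYSTEM_RESET2_32 ||
                     func_id == SK_PSCI_SYSTEM_RESET2_64;

    if (is_reset2 && guest_pages_exist)
        return SK_PSCI_SYSTEM_RESET;
    return func_id;
}
```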
Bug: 196204410
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I5c3dd93bc83ebcd0b6cea2ec734f6e3a77f0064e
Let's check the return value of pin_user_pages() before blindly
dereferencing the struct page pointer as it may very well be NULL.
Bug: 223678931
Reported-by: Keir Fraser <keirf@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I49eb0eb14b88429cfeed3e7cc8a2a72404cfea97
A protected VM accessing ID_AA64ISAR2_EL1 gets punished with an UNDEF,
while it really should only get a zero back if the register is not
handled by the hypervisor emulation (as mandated by the architecture).
Introduce all the missing ID registers (including the unallocated ones),
and have them return 0.
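A rough sketch of the intended behaviour: reads anywhere in the ID register
encoding space succeed, and anything not explicitly emulated reads as zero.
The struct and function names are hypothetical, the encoding range
(op0=3, op1=0, CRn=0, CRm=1..7) is my reading of the architecture, and the
single emulated register with its value 0x11 is made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

struct sysreg_sketch {
    uint8_t op0, op1, crn, crm, op2;
};

/* Assumed bounds of the ID register encoding space. */
bool in_id_reg_space(const struct sysreg_sketch *r)
{
    return r->op0 == 3 && r->op1 == 0 && r->crn == 0 &&
           r->crm >= 1 && r->crm <= 7;
}

/*
 * Hypothetical emulation hook, called only for registers already known to
 * be in the ID space. Pretend only CRm=4, op2=0 is emulated.
 */
bool lookup_emulated(const struct sysreg_sketch *r, uint64_t *val)
{
    if (r->crm == 4 && r->op2 == 0) {
        *val = 0x11;  /* made-up value */
        return true;
    }
    return false;
}

/* Returns false when the access should UNDEF. */
bool emulate_id_read(const struct sysreg_sketch *r, uint64_t *val)
{
    if (!in_id_reg_space(r))
        return false;
    if (!lookup_emulated(r, val))
        *val = 0;  /* unhandled or unallocated ID register reads as zero */
    return true;
}
```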
Bug: 226913064
Reported-by: Will Deacon <willdeacon@google.com>
Signed-off-by: Marc Zyngier <mzyngier@google.com>
Change-Id: I1f8de324af8a47974e6ab6b0bf68c8e1b01c4baf
On a 32-bit Android system, the following CTS Verifier test case failed:
manualTests#com.android.cts.verifier.usb.accessory.UsbAccessoryTestActivity
The reason is that compat_ioctl() needs to be called.
So let's add compat_ioctl() for 32-bit applications to solve this issue.
Bug: 223101878
Change-Id: I6e1f797d919494d293184411041955c33ad08aef
Signed-off-by: Aran Dalton <arda@allwinnertech.com>
(cherry picked from commit 77bf53b486)
In a deep process chain with many vmas, the anon_vma_name refcount could
grow really high. With default sysctl_max_map_count (64k) and default
pid_max (32k) the max
number of vmas in the system is 2147450880 and the refcounter has
headroom of 1073774592 before it reaches REFCOUNT_SATURATED
(3221225472).
Therefore it's unlikely that an anonymous name refcounter will overflow
with these defaults. Currently the max for pid_max is PID_MAX_LIMIT
(4194304) and for sysctl_max_map_count it's INT_MAX (2147483647). In
this configuration anon_vma_name refcount overflow becomes theoretically
possible (though that still requires heavy sharing of that anon_vma_name between
processes).
The kref refcounting interface used in the anon_vma_name structure will
detect a counter overflow when it reaches the REFCOUNT_SATURATED value but
will only generate a warning and freeze the ref counter. This would lead to the
refcounted object never being freed. A determined attacker could leak
memory like that but it would be a rather expensive and inefficient way to
do so.
To ensure anon_vma_name refcount does not overflow, stop anon_vma_name
sharing when the refcount reaches REFCOUNT_MAX (2147483647), which still
leaves INT_MAX/2 (1073741823) values before the counter reaches
REFCOUNT_SATURATED. This should provide enough headroom for raising the
refcounts temporarily.
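The figures above can be checked with a few lines of C. The SK_ constants
copy the numeric values quoted in this message rather than including kernel
headers, and the function names are hypothetical. The inputs 65535 and 32768
are inferred from the quoted product 2147450880, not stated explicitly.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_REFCOUNT_MAX        ((uint64_t)INT_MAX)      /* 2147483647 */
#define SK_REFCOUNT_SATURATED  ((uint64_t)0xC0000000u)  /* 3221225472 */

/* Upper bound on vmas in the system: per-process limit times process count. */
uint64_t max_system_vmas(uint64_t max_map_count, uint64_t pid_max)
{
    return max_map_count * pid_max;
}

/* References still available before the counter saturates. */
uint64_t headroom_before_saturation(uint64_t refs)
{
    return SK_REFCOUNT_SATURATED - refs;
}

/* Sharing policy described above: stop sharing at REFCOUNT_MAX. */
bool may_share_name(uint64_t refs)
{
    return refs < SK_REFCOUNT_MAX;
}
```

With the default limits the bound stays below the saturation point, while
with both limits at their maximum the product exceeds it by many orders of
magnitude, which is why the sharing cutoff is needed.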
Link: https://lkml.kernel.org/r/20220223153613.835563-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 96403e1128)
Bug: 218352794
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ieaab58f6300d9aff3139eed1c1d3417237d81955
Alistair reports an ext4 splat when running a non-protected guest under
pKVM using Cuttlefish on a rockpi board:
| WARNING: CPU: 4 PID: 3125 at fs/ext4/inode.c:3592 ext4_set_page_dirty+0x6c/0x90
| sp : ffffffc00e1a39b0
| x29: ffffffc00e1a39b0 x28: ffffffc009ac3c18 x27: ffffffc009a80968
| x26: ffffff80c2753a00 x25: 0000000200000000 x24: ffffffc00a6dc000
| x23: 0000000000000000 x22: 0000000000000001 x21: fffffffe0314f640
| x20: ffffff8063a99890 x19: fffffffe0314f640 x18: ffffffc00dbf5090
| x17: 0000000000000020 x16: ffffffc00ab73080 x15: 0000000000000040
| x14: 0000000000000040 x13: 0000000000000040 x12: 0000000080200000
| x11: 0000000000000000 x10: fffffffe0314f640 x9 : 0000000000000016
| x8 : 0000000000000015 x7 : 0000000000000062 x6 : 0000000000000068
| x5 : 0000000080200015 x4 : ffffff80067c7500 x3 : 0000000080200016
| x2 : 0000000000000001 x1 : 0000000000000001 x0 : fffffffe0314f640
| Call trace:
| ext4_set_page_dirty+0x6c/0x90
| set_page_dirty+0xf0/0x264
| set_page_dirty_lock+0x94/0x164
| unpin_user_pages_dirty_lock+0xa0/0x15c
| kvm_shadow_destroy+0xd4/0x150
| kvm_arch_destroy_vm+0xa0/0xa4
| kvm_destroy_vm+0x634/0xa0c
| kvm_vcpu_release+0x44/0xc0
| __fput+0xf8/0x43c
| ____fput+0x14/0x24
| task_work_run+0x140/0x204
| do_exit+0x450/0x12b0
| do_group_exit+0xc8/0x17c
| get_signal+0x85c/0xa10
| do_signal+0x9c/0x268
| do_notify_resume+0x98/0x220
| el0_svc+0x5c/0x84
| el0t_64_sync_handler+0x88/0xec
| el0t_64_sync+0x1b4/0x1b8
This appears to be due to virtio-pmem mapping a host page-cache page
directly into the guest and pinning it with GUP. A later attempt to
wrprotect the page using page_mkclean() on the writeback path will not
find the guest mapping and consequently the filesystem becomes confused
when we later dirty the page without any page buffers having been
allocated.
Since the host cannot generally access the memory of protected VMs,
restrict ourselves to swap-backed pages for now and avoid attempting
writeback altogether, with the GUP pin preventing swapout.
Bug: 223678931
Reported-by: Alistair Delva <adelva@google.com>
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Id8da126aac220df6eff44177a911dc4627e68c02
When a shadow VM is torn down, its VMID can be reallocated as soon as
the shadow table entry is cleared to NULL. Since tearing down the
stage-2 page-table does not imply TLB invalidation, the TLB could still
contain stale entries from the old VM and the new user of the VMID could
end up seeing erroneous translations.
Invalidate the TLB for the VMID of the VM being torn down prior to
clearing its entry in the shadow table.
Bug: 226312378
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Ice44d030bf01a1b7612413ee32440f3f38cb3e4e