The pkvm hypervisor at EL2 may need to read the 'kvm_vgic_global_state'
variable from the host, for example when saving and restoring the state
of the virtual GIC.
Explicitly map 'kvm_vgic_global_state' in the stage-1 page-table of the
pKVM hypervisor rather than relying on mapping all of the host '.rodata'
section.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-24-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I13a190d4164a4e0fd68dd5ec88ab5647dd4e73fc
Signed-off-by: Quentin Perret <qperret@google.com>
Sharing 'kvm_arm_vmid_bits' between EL1 and EL2 allows the host to
modify the variable arbitrarily, potentially leading to all sorts of
shenanians as this is used to configure the VTTBR register for the
guest stage-2.
In preparation for unmapping host sections entirely from EL2, maintain
a copy of 'kvm_arm_vmid_bits' in the pKVM hypervisor and initialise it
from the host value while it is still trusted.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-23-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I613e7c0ef747324e73cf4d4f543354a3641c7505
Signed-off-by: Quentin Perret <qperret@google.com>
When pKVM is enabled, the hypervisor at EL2 does not trust the host at
EL1 and must therefore prevent it from having unrestricted access to
internal hypervisor state.
The 'kvm_arm_hyp_percpu_base' array holds the offsets for hypervisor
per-cpu allocations, so move this this into the nVHE code where it
cannot be modified by the untrusted host at EL1.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-22-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I8d67b2905ac97e15f4252d45c36b97e53d3072ce
Signed-off-by: Quentin Perret <qperret@google.com>
Rather than relying on the host to free the previously-donated pKVM
hypervisor VM pages explicitly on teardown, introduce a dedicated
teardown memcache which allows the host to reclaim guest memory
resources without having to keep track of all of the allocations made by
the pKVM hypervisor at EL2.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-21-will@kernel.org
[willdeacon@: Fix GCC compat error due to variable declaration in for loop
initializer prior to C99]
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: Ib43b68f36fdaf5aac578f177ab8260c72acc6ed5
Signed-off-by: Quentin Perret <qperret@google.com>
Extend the initialisation of guest data structures within the pKVM
hypervisor at EL2 so that we instantiate a memory pool and a full
'struct kvm_s2_mmu' structure for each VM, with a stage-2 page-table
entirely independent from the one managed by the host at EL1.
The 'struct kvm_pgtable_mm_ops' used by the page-table code is populated
with a set of callbacks that can manage guest pages in the hypervisor
without any direct intervention from the host, allocating page-table
pages from the provided pool and returning these to the host on VM
teardown. To keep things simple, the stage-2 MMU for the guest is
configured identically to the host stage-2 in the VTCR register and so
the IPA size of the guest must match the PA size of the host.
For now, the new page-table is unused as there is no way for the host
to map anything into it. Yet.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-20-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I2fc61f4f16d16b4786a50f4e2e4c5b06ed357b22
Signed-off-by: Quentin Perret <qperret@google.com>
The initialisation of guest stage-2 page-tables is currently split
across two functions: kvm_init_stage2_mmu() and kvm_arm_setup_stage2().
That is presumably for historical reasons as kvm_arm_setup_stage2()
originates from the (now defunct) KVM port for 32-bit Arm.
Simplify this code path by merging both functions into one, taking care
to map the 'struct kvm' into the hypervisor stage-1 early on in order to
simplify the failure path.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-19-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I47433ed294fb5329b17b5e8b290e41b8e6d5b4a9
Signed-off-by: Quentin Perret <qperret@google.com>
The host at EL1 and the pKVM hypervisor at EL2 will soon need to
exchange memory pages dynamically for creating and destroying VM state.
Indeed, the hypervisor will rely on the host to donate memory pages it
can use to create guest stage-2 page-tables and to store VM and vCPU
metadata. In order to ease this process, introduce a
'struct hyp_memcache' which is essentially a linked list of available
pages, indexed by physical addresses so that it can be passed
meaningfully between the different virtual address spaces configured at
EL1 and EL2.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-18-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I69996a4018e61473f25c5079f9ebcd82754914f4
Signed-off-by: Quentin Perret <qperret@google.com>
In preparation for handling cache maintenance of guest pages from within
the pKVM hypervisor at EL2, introduce an EL2 copy of icache_inval_pou()
which will later be plumbed into the stage-2 page-table cache
maintenance callbacks, ensuring that the initial contents of pages
mapped as executable into the guest stage-2 page-table is visible to the
instruction fetcher.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-17-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I0a1fb40d49bbc19eb53d59805acb93ebd7702b8b
Signed-off-by: Quentin Perret <qperret@google.com>
The nVHE object at EL2 maintains its own copies of some host variables
so that, when pKVM is enabled, the host cannot directly modify the
hypervisor state. When running in normal nVHE mode, however, these
variables are still mirrored at EL2 but are not initialised.
Initialise the hypervisor symbols from the host copies regardless of
pKVM, ensuring that any reference to this data at EL2 with normal nVHE
will return a sensibly initialised value.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-16-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I1e490d403e85ae5dbc984c1f0b8447c93cdf8700
Signed-off-by: Quentin Perret <qperret@google.com>
Mapping pages in a guest page-table from within the pKVM hypervisor at
EL2 may require cache maintenance to ensure that the initialised page
contents is visible even to non-cacheable (e.g. MMU-off) accesses from
the guest.
In preparation for performing this maintenance at EL2, introduce a
per-vCPU fixmap which allows the pKVM hypervisor to map guest pages
temporarily into its stage-1 page-table for the purposes of cache
maintenance and, in future, poisoning on the reclaim path. The use of a
fixmap avoids the need for memory allocation or locking on the map()
path.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Co-developed-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-15-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I8054dac551f4fd5c4afb5b497816589f1fae5285
Signed-off-by: Quentin Perret <qperret@google.com>
With the pKVM hypervisor at EL2 now offering hypercalls to the host for
creating and destroying VM and vCPU structures, plumb these in to the
existing arm64 KVM backend to ensure that the hypervisor data structures
are allocated and initialised on first vCPU run for a pKVM guest.
In the host, 'struct kvm_protected_vm' is introduced to hold the handle
of the pKVM VM instance as well as to track references to the memory
donated to the hypervisor so that it can be freed back to the host
allocator following VM teardown. The stage-2 page-table, hypervisor VM
and vCPU structures are allocated separately so as to avoid the need for
a large physically-contiguous allocation in the host at run-time.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-14-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I2823f936b69f756d867bf584c645b29357c6abe9
Signed-off-by: Quentin Perret <qperret@google.com>
Introduce a global table (and lock) to track pKVM instances at EL2, and
provide hypercalls that can be used by the untrusted host to create and
destroy pKVM VMs and their vCPUs. pKVM VM/vCPU state is directly
accessible only by the trusted hypervisor (EL2).
Each pKVM VM is directly associated with an untrusted host KVM instance,
and is referenced by the host using an opaque handle. Future patches
will provide hypercalls to allow the host to initialize/set/get pKVM
VM/vCPU state using the opaque handle.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Co-developed-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-13-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: Ibe946e86f9a773b6024bf74faeccbb6929548a1f
Signed-off-by: Quentin Perret <qperret@google.com>
In preparation for introducing VM and vCPU state at EL2, rename the
existing 'struct host_kvm' and its singleton 'host_kvm' instance to
'host_mmu' so as to avoid confusion between the structure tracking the
host stage-2 MMU state and the host instance of a 'struct kvm' for a
protected guest.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-12-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: Iaf93fa2f37f1e87e56182b8494be808ce0dfcf95
Signed-off-by: Quentin Perret <qperret@google.com>
nvhe/mem_protect.h refers to __load_stage2() in the definition of
__load_host_stage2() but doesn't include the relevant header.
Include asm/kvm_mmu.h in nvhe/mem_protect.h so that users of the latter
don't have to do this themselves.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-10-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I9f1eea7b57d6ca5cc5ac172ce5175272f166164a
Signed-off-by: Quentin Perret <qperret@google.com>
Add helpers allowing the hypervisor to check whether a range of pages
are currently shared by the host, and 'pin' them if so by blocking host
unshare operations until the memory has been unpinned.
This will allow the hypervisor to take references on host-provided
data-structures (e.g. 'struct kvm') with the guarantee that these pages
will remain in a stable state until the hypervisor decides to release
them, for example during guest teardown.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-9-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I2918af9b4986b3d5d19ed1c6168b20b59b6a563e
Signed-off-by: Quentin Perret <qperret@google.com>
Memory regions marked as "no-map" in the host device-tree routinely
include TrustZone carev-outs and DMA pools. Although donating such pages
to the hypervisor may not breach confidentiality, it could be used to
corrupt its state in uncontrollable ways. To prevent this, let's block
host-initiated memory transitions targeting "no-map" pages altogether in
nVHE protected mode as there should be no valid reason to do this in
current operation.
Thankfully, the pKVM EL2 hypervisor has a full copy of the host's list
of memblock regions, so we can easily check for the presence of the
MEMBLOCK_NOMAP flag on a region containing pages being donated from the
host.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-8-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I21198f93a6d9c727b70c504ddd31345329eabb8f
Signed-off-by: Quentin Perret <qperret@google.com>
Transferring ownership information of a memory region from one component
to another can be achieved using a "donate" operation, which results
in the previous owner losing access to the underlying pages entirely
and the new owner having exclusive access to the page.
Implement a do_donate() helper, along the same lines as do_{un,}share,
and provide this functionality for the host-{to,from}-hyp cases as this
will later be used to donate/reclaim memory pages to store VM metadata
at EL2.
In a similar manner to the sharing transitions, permission checks are
performed by the hypervisor to ensure that the component initiating the
transition really is the owner of the page and also that the completer
does not currently have a page mapped at the target address.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Co-developed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-7-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I19c349d40f63c1bd5eae0baf9acd21697e1e8762
Signed-off-by: Quentin Perret <qperret@google.com>
The 'pkvm_component_id' enum type provides constants to refer to the
host and the hypervisor, yet this information is duplicated by the
'pkvm_hyp_id' constant.
Remove the definition of 'pkvm_hyp_id' and move the 'pkvm_component_id'
type definition to 'mem_protect.h' so that it can be used outside of
the memory protection code, for example when initialising the owner for
hypervisor-owned pages.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-6-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I78454dc315a0993808a741ebd7fabf9dbcaf09bc
Signed-off-by: Quentin Perret <qperret@google.com>
In order to allow unmapping arbitrary memory pages from the hypervisor
stage-1 page-table, fix-up the initial refcount for pages that have been
mapped before the 'vmemmap' array was up and running so that it
accurately accounts for all existing hypervisor mappings.
This is achieved by traversing the entire hypervisor stage-1 page-table
during initialisation of EL2 and updating the corresponding
'struct hyp_page' for each valid mapping.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-5-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: I10df07bab97198ac4fd9477091b2e9340cc441b0
Signed-off-by: Quentin Perret <qperret@google.com>
The EL2 'vmemmap' array in nVHE Protected mode is currently very sparse:
only memory pages owned by the hypervisor itself have a matching 'struct
hyp_page'. However, as the size of this struct has been reduced
significantly since its introduction, it appears that we can now afford
to back the vmemmap for all of memory.
Having an easily accessible 'struct hyp_page' for every physical page in
memory provides the hypervisor with a simple mechanism to store metadata
(e.g. a refcount) that wouldn't otherwise fit in the very limited number
of software bits available in the host stage-2 page-table entries. This
will be used in subsequent patches when pinning host memory pages for
use by the hypervisor at EL2.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-4-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: Ifced0fb12faf2af6b757d68e18ba7cb74fdd0871
Signed-off-by: Quentin Perret <qperret@google.com>
All the contiguous pages used to initialize a 'struct hyp_pool' are
considered coalescable, which means that the hyp page allocator will
actively try to merge them with their buddies on the hyp_put_page() path.
However, using hyp_put_page() on a page that is not part of the inital
memory range given to a hyp_pool() is currently unsupported.
In order to allow dynamically extending hyp pools at run-time, add a
check to __hyp_attach_page() to allow inserting 'external' pages into
the free-list of order 0. This will be necessary to allow lazy donation
of pages from the host to the hypervisor when allocating guest stage-2
page-table pages at EL2.
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20221020133827.5541-3-will@kernel.org
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 233587962
Change-Id: Ie417f839ae6d30705847ab3e27e2b3c3ac6ee8dc
Signed-off-by: Quentin Perret <qperret@google.com>
On initialising the MMIO guard infrastructure, register the
earlycon mapping if present.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 233587962
Change-Id: I379387253d08e2414fa386a3360a45391da7d90d
Signed-off-by: Will Deacon <willdeacon@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
In order to transfer the early mapping state into KVM's MMIO
guard infrastructure, provide a small helper that will retrieve
the associated PTE.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 233587962
Change-Id: Iefc1c57d5e9476b718a8a68f60e562a57b09fb6a
Signed-off-by: Will Deacon <willdeacon@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Should a guest desire to enroll into the MMIO guard, allow it to
do so with a command-line option.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 233587962
Change-Id: Ia9a77f693531740500739693c52b4959abacafd4
[willdeacon@: Add hypercall IDs]
Signed-off-by: Will Deacon <willdeacon@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Add a pair of hooks (ioremap_phys_range_hook/iounmap_phys_range_hook)
that can be implemented by an architecture. Contrary to the existing
arch_sync_kernel_mappings(), this one tracks things at the physical
address level.
This is specially useful in these virtualised environments where
the guest has to tell the host whether (and how) it intends to use
a MMIO device.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Bug: 233587962
Change-Id: I970c2e632cb2b01060d5e66e4194fa9248188f43
Signed-off-by: Will Deacon <willdeacon@google.com>
[ qperret: Fixed conflict in vmalloc.c due to call to
kmsan_ioremap_page_range ]
Signed-off-by: Quentin Perret <qperret@google.com>
When running as a protected guest, the KVM host does not have access to
any pages mapped into the guest. Consequently, KVM exposes hypercalls to
the guest so that pages can be shared back with the host for the purposes
of shared memory communication such as virtio.
Detect the presence of these hypercalls when running as a guest and use
them to implement the memory encryption interfaces gated by
CONFIG_ARCH_HAS_MEM_ENCRYPT which are called from the DMA layer to share
SWIOTLB bounce buffers for virtio.
Although no encryption is actually performed, "sharing" a page is akin
to decryption, whereas "unsharing" a page maps to encryption, albeit
without destruction of the underlying page contents.
Signed-off-by: Will Deacon <will@kernel.org>
[willdeacon@: Use asm/mem_encrypt.h instead of asm/set_memory.h;
Implement mem_encrypt_active(); Add hypercall IDs;
Drop unneeded GIC change]
[qperret@: Export set_memory_{en,de}crypted() to fix allmodconfig
modpost failures]
Bug: 233587962
Change-Id: I5955ff0dca65561183f9a60e94be87f28fbf14ec
Signed-off-by: Will Deacon <willdeacon@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
In order to prepare the ground for providing set_mem_{en,de}crypted()
for arm64, make sure to include the right header to avoid allmodconfig
build failures.
Bug: 233587962
Change-Id: I6fb5a9e469a6fe8496dff81a8cda50b6c6c1ed7f
Signed-off-by: Quentin Perret <qperret@google.com>
Pull iommu fix from Joerg Roedel:
- Fix device mask to catch all affected devices in the recently added
quirk for QAT devices in the Intel VT-d driver.
* tag 'iommu-fix-v6.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Fix buggy QAT device mask
Pull misc fixes from Andrew Morton:
"Nine hotfixes.
Six for MM, three for other areas. Four of these patches address
post-6.0 issues"
* tag 'mm-hotfixes-stable-2022-12-10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
memcg: fix possible use-after-free in memcg_write_event_control()
MAINTAINERS: update Muchun Song's email
mm/gup: fix gup_pud_range() for dax
mmap: fix do_brk_flags() modifying obviously incorrect VMAs
mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit
tmpfs: fix data loss from failed fallocate
kselftests: cgroup: update kmem test precision tolerance
mm: do not BUG_ON missing brk mapping, because userspace can unmap it
mailmap: update Matti Vaittinen's email address
Pull ARM fix from Russell King:
"One further ARM fix for 6.1 from Wang Kefeng, fixing up the handling
for kfence faults"
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9278/1: kfence: only handle translation faults
memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call. As a cgroup interface file can't be
renamed, it's safe to access d_name as long as the specified file is a
regular cgroup file. Also, as these cgroup interface files can't be
removed before the directory, it's safe to access the parent too.
Prior to 347c4a8747 ("memcg: remove cgroup_event->cft"), there was a
call to __file_cft() which verified that the specified file is a regular
cgroupfs file before further accesses. The cftype pointer returned from
__file_cft() was no longer necessary and the commit inadvertently dropped
the file type check with it allowing any file to slip through. With the
invarients broken, the d_name and parent accesses can now race against
renames and removals of arbitrary files and cause use-after-free's.
Fix the bug by resurrecting the file type check in __file_cft(). Now that
cgroupfs is implemented through kernfs, checking the file operations needs
to go through a layer of indirection. Instead, let's check the superblock
and dentry type.
Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org
Fixes: 347c4a8747 ("memcg: remove cgroup_event->cft")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org> [3.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We use "unsigned long" to store a PFN in the kernel and phys_addr_t to
store a physical address.
On a 64bit system, both are 64bit wide. However, on a 32bit system, the
latter might be 64bit wide. This is, for example, the case on x86 with
PAE: phys_addr_t and PTEs are 64bit wide, while "unsigned long" only spans
32bit.
The current definition of SWP_PFN_BITS without MAX_PHYSMEM_BITS misses
that case, and assumes that the maximum PFN is limited by an 32bit
phys_addr_t. This implies, that SWP_PFN_BITS will currently only be able
to cover 4 GiB - 1 on any 32bit system with 4k page size, which is wrong.
Let's rely on the number of bits in phys_addr_t instead, but make sure to
not exceed the maximum swap offset, to not make the BUILD_BUG_ON() in
is_pfn_swap_entry() unhappy. Note that swp_entry_t is effectively an
unsigned long and the maximum swap offset shares that value with the swap
type.
For example, on an 8 GiB x86 PAE system with a kernel config based on
Debian 11.5 (-> CONFIG_FLATMEM=y, CONFIG_X86_PAE=y), we will currently
fail removing migration entries (remove_migration_ptes()), because
mm/page_vma_mapped.c:check_pte() will fail to identify a PFN match as
swp_offset_pfn() wrongly masks off PFN bits. For example,
split_huge_page_to_list()->...->remap_page() will leave migration entries
in place and continue to unlock the page.
Later, when we stumble over these migration entries (e.g., via
/proc/self/pagemap), pfn_swap_entry_to_page() will BUG_ON() because these
migration entries shouldn't exist anymore and the page was unlocked.
[ 33.067591] kernel BUG at include/linux/swapops.h:497!
[ 33.067597] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 33.067602] CPU: 3 PID: 742 Comm: cow Tainted: G E 6.1.0-rc8+ #16
[ 33.067605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 33.067606] EIP: pagemap_pmd_range+0x644/0x650
[ 33.067612] Code: 00 00 00 00 66 90 89 ce b9 00 f0 ff ff e9 ff fb ff ff 89 d8 31 db e8 48 c6 52 00 e9 23 fb ff ff e8 61 83 56 00 e9 b6 fe ff ff <0f> 0b bf 00 f0 ff ff e9 38 fa ff ff 3e 8d 74 26 00 55 89 e5 57 31
[ 33.067615] EAX: ee394000 EBX: 00000002 ECX: ee394000 EDX: 00000000
[ 33.067617] ESI: c1b0ded4 EDI: 00024a00 EBP: c1b0ddb4 ESP: c1b0dd68
[ 33.067619] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00010246
[ 33.067624] CR0: 80050033 CR2: b7a00000 CR3: 01bbbd20 CR4: 00350ef0
[ 33.067625] Call Trace:
[ 33.067628] ? madvise_free_pte_range+0x720/0x720
[ 33.067632] ? smaps_pte_range+0x4b0/0x4b0
[ 33.067634] walk_pgd_range+0x325/0x720
[ 33.067637] ? mt_find+0x1d6/0x3a0
[ 33.067641] ? mt_find+0x1d6/0x3a0
[ 33.067643] __walk_page_range+0x164/0x170
[ 33.067646] walk_page_range+0xf9/0x170
[ 33.067648] ? __kmem_cache_alloc_node+0x2a8/0x340
[ 33.067653] pagemap_read+0x124/0x280
[ 33.067658] ? default_llseek+0x101/0x160
[ 33.067662] ? smaps_account+0x1d0/0x1d0
[ 33.067664] vfs_read+0x90/0x290
[ 33.067667] ? do_madvise.part.0+0x24b/0x390
[ 33.067669] ? debug_smp_processor_id+0x12/0x20
[ 33.067673] ksys_pread64+0x58/0x90
[ 33.067675] __ia32_sys_ia32_pread64+0x1b/0x20
[ 33.067680] __do_fast_syscall_32+0x4c/0xc0
[ 33.067683] do_fast_syscall_32+0x29/0x60
[ 33.067686] do_SYSENTER_32+0x15/0x20
[ 33.067689] entry_SYSENTER_32+0x98/0xf1
Decrease the indentation level of SWP_PFN_BITS and SWP_PFN_MASK to keep it
readable and consistent.
[david@redhat.com: rely on sizeof(phys_addr_t) and min_t() instead]
Link: https://lkml.kernel.org/r/20221206105737.69478-1-david@redhat.com
[david@redhat.com: use "int" for comparison, as we're only comparing numbers < 64]
Link: https://lkml.kernel.org/r/1f157500-2676-7cef-a84e-9224ed64e540@redhat.com
Link: https://lkml.kernel.org/r/20221205150857.167583-1-david@redhat.com
Fixes: 0d206b5d2e ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull media fix from Mauro Carvalho Chehab:
"A v4l-core fix related to validating DV timings related to video
blanking values"
* tag 'media/v6.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: v4l2-dv-timings.c: fix too strict blanking sanity checks
Pull ARM SoC fix from Arnd Bergmann:
"One more last minute revert for a boot regression that was found on
the popular colibri-imx7"
* tag 'soc-fixes-6.1-6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
Revert "ARM: dts: imx7: Fix NAND controller size-cells"
Pull drm fixes from Dave Airlie:
"Last set of fixes for final, scattered bunch of fixes, two amdgpu, one
vmwgfx, and some misc others.
amdgpu:
- S0ix fix
- DCN 3.2 array out of bounds fix
shmem:
- Fixes to shmem-helper error paths
bridge:
- Fix polarity bug in bridge/ti-sn65dsi86
dw-hdmi:
- Prefer 8-bit RGB fallback before any YUV mode in dw-hdmi, since
some panels lie about YUV support
vmwgfx:
- Stop using screen objects when SEV is active"
* tag 'drm-fixes-2022-12-09' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: fix array index out of bound error in DCN32 DML
drm/amdgpu/sdma_v4_0: turn off SDMA ring buffer in the s2idle suspend
drm/vmwgfx: Don't use screen objects when SEV is active
drm/shmem-helper: Avoid vm_open error paths
drm/shmem-helper: Remove errant put in error path
drm: bridge: dw_hdmi: fix preference of RGB modes over YUV420
drm/bridge: ti-sn65dsi86: Fix output polarity setting bug
drm/vmwgfx: Fix race issue calling pin_user_pages