Add a speculative field to the vm_operations_struct, which indicates if
the associated file type supports speculative faults.
Initially this is set for files that implement fault() with filemap_fault().
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-30-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic92efdf13283c45e7da7bf703f4f85f8b392ba69
In the speculative case, we know the page table already exists, and it
must be locked with pte_map_lock(). In the case where no page is found
for the given address, return VM_FAULT_RETRY which will abort the
fault before we get into the vm_ops->fault() callback. This is fine
because if filemap_map_pages does not find the page in page cache,
vm_ops->fault() will not either.
Initialize addr and last_pgoff to correspond to the pte at the original
fault address (which was mapped with pte_map_lock()), rather than the
pte at start_pgoff. The choice of initial values doesn't matter as
they will all be adjusted together before use, so they just need to be
consistent with each other, and using the original fault address and
pte allows us to reuse pte_map_lock() without any changes to it.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-29-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0acf4f9626ec0126cdc9a95a7ff1cd735c1af2ca
Call the vm_ops->map_pages method within an rcu read locked section.
In the speculative case, verify the mmap sequence lock at the start of
the section. A match guarantees that the original vma is still valid
at that time, and that the associated vma->vm_file stays valid while
the vm_ops->map_pages() method is running.
Do not test vmf->pmd in the speculative case - we only speculate when
a page table already exists, and this saves us from having to handle
synchronization around the vmf->pmd read.
Change xfs_filemap_map_pages() to account for the fact that it can not
block anymore, as it is now running within an rcu read lock.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id771c1e6fa9b883595a48d4df63f448a05916eda
In the speculative case, we want to avoid direct pmd checks (which
would require some extra synchronization to be safe), and rely on
pte_map_lock which will both lock the page table and verify that the
pmd has not changed from its initial value.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-27-michel@lespinasse.org/
Conflicts:
mm/memory.c
1. Merge conflict due to new vmf->prealloc_pte usage in finish_fault.
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If6046592083eaf12caf5c51c3fbb287a4dfa1ace
Extend filemap_fault() to handle speculative faults.
In the speculative case, we will only be fishing existing pages out of
the page cache. The logic we use mirrors what is done in the
non-speculative case, assuming that pages are found in the page cache,
are up to date and not already locked, and that readahead is not
necessary at this time. In all other cases, the fault is aborted to be
handled non-speculatively.
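The abort conditions listed above can be pictured as a single predicate. The following is only a userspace sketch under stated assumptions: struct page_state_model, its fields, sketch_filemap_fault_speculative() and SKETCH_VM_FAULT_RETRY are hypothetical names for illustration, not kernel symbols, and the real filemap_fault() inspects the page cache rather than a flat struct.

```c
#include <stdbool.h>

/* Hypothetical stand-in for VM_FAULT_RETRY; the value is arbitrary. */
#define SKETCH_VM_FAULT_RETRY 0x1u

/* Hypothetical summary of what the speculative path needs to know. */
struct page_state_model {
	bool in_page_cache;	/* page was found in the page cache */
	bool uptodate;		/* page contents are valid */
	bool locked;		/* someone else holds the page lock */
	bool needs_readahead;	/* readahead would have to be started */
};

/*
 * Return 0 when the speculative path may use the page directly,
 * or SKETCH_VM_FAULT_RETRY to hand the fault to the regular path.
 */
static unsigned int
sketch_filemap_fault_speculative(const struct page_state_model *p)
{
	if (!p->in_page_cache || !p->uptodate || p->locked ||
	    p->needs_readahead)
		return SKETCH_VM_FAULT_RETRY;
	return 0;
}
```

Only the case where every condition is favorable proceeds; anything else is retried non-speculatively, which keeps the speculative path free of blocking operations.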
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-26-michel@lespinasse.org/
Conflicts:
mm/filemap.c
1. Added back file_ra_state variable used by SPF path.
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I82eba7fcfc81876245c2e65bc5ae3d33ddfcc368
In the speculative case, call the vm_ops->fault() method from within
an rcu read locked section, and verify the mmap sequence lock at the
start of the section. A match guarantees that the original vma is still
valid at that time, and that the associated vma->vm_file stays valid
while the vm_ops->fault() method is running.
Note that this implies that speculative faults can not sleep within
the vm_ops->fault method. We will only attempt to fetch existing pages
from the page cache during speculative faults; any miss (or prefetch)
will be handled by falling back to non-speculative fault handling.
The speculative handling case also does not preallocate page tables,
as it is always called with a pre-existing page table.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20210407014502.24091-25-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I995ba94d8e96014ef83ac93fe5a4669afcde34b9
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.
This follows the lines of the x86 speculative fault handling code,
but with some minor arch differences such as the way that the
access_pkey_error case is handled.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-36-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic12bc3d5070d1502fc5df182a19c92b4a8d59723
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.
This follows the lines of the x86 speculative fault handling code,
but with some minor arch differences such as the way that the
VM_FAULT_BADACCESS case is handled.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-34-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Iccd87036b15eebf2ff28fbb8022b07c9f91d7353
Split off the definitions necessary to update event counters from vmstat.h
into a new vm_event.h file.
The rationale is to allow header files included from mm.h to update
counter events. vmstat.h can not be included from such header files,
because it refers to page_pgdat() which is only defined later down
in mm.h, and thus results in compile errors. vm_event.h does not refer
to page_pgdat() and thus does not result in such errors.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-31-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ie70dd435b3dcbad80a4a9bfc294b78a9107c1ac2
Performance tuning: as single threaded userspace does not use
speculative page faults, it does not require rcu safe vma freeing.
Turn this off to avoid the related (small) extra overheads.
For multi threaded userspace, we often see a performance benefit from
the rcu safe vma freeing - even in tests that do not have any frequent
concurrent page faults! This is because rcu safe vma freeing prevents
recently released vmas from being immediately reused in a new thread.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-30-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I81ef7ab43e2757f268c567d5bfe6ab02f1e43a1c
In handle_pte_fault(), allow speculative execution to proceed.
Use pte_spinlock() to validate the mmap sequence count when locking
the page table.
If speculative execution proceeds through do_wp_page(), ensure that we
end up in the wp_page_reuse() or wp_page_copy() paths, rather than
wp_pfn_shared() or wp_page_shared() (both unreachable as we only
handle anon vmas so far) or handle_userfault() (needs an explicit
abort to handle non-speculatively).
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-28-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ia45d095ec7b8e23f1c5d68b7a7f572a3f6f6df97
Change wp_page_copy() to handle the speculative case. This involves
aborting speculative faults if they have to allocate an anon_vma,
read-locking the mmu_notifier_lock to avoid races with
mmu_notifier_register(), and using pte_map_lock() instead of
pte_offset_map_lock() to complete the page fault.
Also change call sites to clear vmf->pte after unmapping the page table,
in order to satisfy pte_map_lock()'s preconditions.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-27-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Icd2188e9facf5a7fea42000a2808bcda1ad6f0fc
Introduce mmu_notifier_lock as a per-mm percpu_rw_semaphore,
as well as the code to initialize and destroy it together with the mm.
This lock will be used to prevent races between mmu_notifier_register()
and speculative fault handlers that need to fire MMU notifications
without holding any of the mmap or rmap locks.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-24-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I453ebe979c8b9dcc6159b41c5ec7a1ea17d85ee2
Change handle_pte_fault() to allow speculative fault execution to proceed
through do_numa_page().
do_swap_page() does not implement speculative execution yet, so it
needs to abort with VM_FAULT_RETRY in that case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-22-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0390331facc9ecd37534012abdd9f255ab5bbb12
In the x86 fault handler, only attempt speculative page faults if the vma is anonymous.
In do_handle_mm_fault(), let speculative page faults proceed as long
as they fall into anonymous vmas. This enables the speculative
handling code in __handle_mm_fault() and do_anonymous_page().
In handle_pte_fault(), if vmf->pte is set (the original pte was not
pte_none), catch speculative faults and return VM_FAULT_RETRY as
those cases are not implemented yet. Also assert that do_fault()
is not reached in the speculative case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-20-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I875106fcfa1084f570c2bf8f24a129bdce55316b
Change do_anonymous_page() to handle the speculative case.
This involves aborting speculative faults if they have to allocate a new
anon_vma, and using pte_map_lock() instead of pte_offset_map_lock()
to complete the page fault.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-19-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I5ad955323faabc142c21f62415db039ac889066a
pte_map_lock() and pte_spinlock() are used by fault handlers to ensure
the pte is mapped and locked before they commit the faulted page to the
mm's address space at the end of the fault.
The functions differ in their preconditions; pte_map_lock() expects
the pte to be unmapped prior to the call, while pte_spinlock() expects
it to be already mapped.
In the speculative fault case, the functions verify, after locking the pte,
that the mmap sequence count has not changed since the start of the fault,
and thus that no mmap lock writers have been running concurrently with
the fault. After that point the page table lock serializes any further
races with concurrent mmap lock writers.
If the mmap sequence count check fails, both functions will return false
with the pte being left unmapped and unlocked.
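The failure contract described above (return false and leave nothing locked) can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: struct pt_model, struct fault_model and the model_* helpers are hypothetical, the page table lock is modelled as a simple atomic flag, and mapping the pte is not modelled at all.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page table state: one lock plus the mmap sequence count. */
struct pt_model {
	atomic_bool ptl;		/* true while the "page table lock" is held */
	atomic_ulong mmap_seq;
};

/* Hypothetical fault descriptor. */
struct fault_model {
	struct pt_model *pt;
	bool speculative;
	unsigned long seq;		/* sampled at the start of the fault */
	bool pte_locked;
};

static void model_pt_init(struct pt_model *pt)
{
	atomic_init(&pt->ptl, false);
	atomic_init(&pt->mmap_seq, 0);
}

/* Lock first, then validate; on mismatch undo the lock and fail. */
static bool model_pte_map_lock(struct fault_model *vmf)
{
	while (atomic_exchange(&vmf->pt->ptl, true))
		;	/* spin until the lock is free */

	if (vmf->speculative &&
	    atomic_load(&vmf->pt->mmap_seq) != vmf->seq) {
		atomic_store(&vmf->pt->ptl, false);
		return false;	/* caller falls back; nothing is held */
	}
	vmf->pte_locked = true;
	return true;
}

static void model_pte_unlock(struct fault_model *vmf)
{
	vmf->pte_locked = false;
	atomic_store(&vmf->pt->ptl, false);
}
```

Checking the sequence count only after the lock is taken is the point of the design: once the check passes, any later writer has to take the same lock before touching the page table.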
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-18-michel@lespinasse.org/
Conflicts:
include/linux/mm.h
1. Fixed pte_map_lock and pte_spinlock macros not to fail when
CONFIG_SPECULATIVE_PAGE_FAULT=n
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ibd7ccc2ead4fdf29f28c7657b312b2f677ac8836
The speculative path calls speculative_page_walk_begin() before walking
the page table tree to prevent page table reclamation. The logic is
otherwise similar to the non-speculative path, but with additional
restrictions: in the speculative path, we do not handle huge pages or
wiring new page tables.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-17-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: If099534da8b0ac105bbaa5ea4714a6654032592a
Move the code that initializes vmf->pte and vmf->orig_pte from
handle_pte_fault() to its single call site in __handle_mm_fault().
This ensures vmf->pte is now initialized together with the higher levels
of the page table hierarchy. This also prepares for speculative page fault
handling, where the entire page table walk (higher levels down to ptes)
needs special care in the speculative case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-16-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Id550086fe568331aa71c91468f8314faad993b20
Speculative page faults will use these to protect against races with
page table reclamation.
This could always be handled by disabling local IRQs as the fast GUP
code does; however speculative page faults do not need to protect
against races with THP page splitting, so a weaker rcu read lock is
sufficient in the MMU_GATHER_RCU_TABLE_FREE case.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-15-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I3efe5fc6a5a49d537cf33e8093daeea42550077a
Attempt speculative mm fault handling first, and fall back to the
existing (non-speculative) code if that fails.
The speculative handling closely mirrors the non-speculative logic.
This includes some x86 specific bits such as the access_error() call.
This is why we chose to implement the speculative handling in arch/x86
rather than in common code.
The vma is first looked up and copied, under protection of the rcu
read lock. The mmap lock sequence count is used to verify the
integrity of the copied vma, and passed to do_handle_mm_fault() to
allow checking against races with mmap writers when finalizing the fault.
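The overall "try speculatively, then fall back" flow can be sketched as follows. This is a single-threaded userspace model with hypothetical names (struct spf_vma, struct spf_mm, spf_try_speculative(), spf_handle_fault()); the real handler copies the vma under rcu_read_lock() and performs arch-specific checks such as access_error(), neither of which is modelled here.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the vma and the mm. */
struct spf_vma {
	unsigned long start;
	unsigned long end;
};

struct spf_mm {
	unsigned long seq;	/* models the mmap lock sequence count */
	struct spf_vma vma;	/* models the looked-up vma */
};

enum spf_path {
	SPF_PATH_SPECULATIVE,	/* handled without the mmap lock */
	SPF_PATH_LOCKED,	/* fell back to the mmap read lock */
};

/* Snapshot the vma, then confirm the snapshot is still trustworthy. */
static bool spf_try_speculative(const struct spf_mm *mm, unsigned long addr)
{
	unsigned long seq = mm->seq;
	struct spf_vma copy;

	if (seq & 1)		/* a writer is active right now */
		return false;
	copy = mm->vma;		/* kernel: done under the rcu read lock */
	if (mm->seq != seq)	/* a writer ran while we were copying */
		return false;
	return addr >= copy.start && addr < copy.end;
}

static enum spf_path spf_handle_fault(const struct spf_mm *mm,
				      unsigned long addr)
{
	if (spf_try_speculative(mm, addr))
		return SPF_PATH_SPECULATIVE;
	return SPF_PATH_LOCKED;	/* existing non-speculative code */
}
```

Because the model is single-threaded, the second sequence check can never fail here; it is kept to show where the copied vma would be validated.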
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-14-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I2c078a173ee39f35af16daeee8c6a1466d10c3e8
This prepares for speculative page faults looking up and copying vmas
under protection of an rcu read lock, instead of the usual mmap read lock.
Note - it might also be feasible to just use SLAB_TYPESAFE_BY_RCU when
creating the vm_area_cachep, but that's probably too subtle to consider here.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-12-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I992fddb7c32c61bb4ab10b387f91c4e54c2250ef
The counter's write side is hooked into the existing mmap locking API:
mmap_write_lock() increments the counter to the next (odd) value, and
mmap_write_unlock() increments it again to the next (even) value.
The counter's speculative read side is supposed to be used as follows:
    seq = mmap_seq_read_start(mm);
    if (seq & 1)
        goto fail;
    .... speculative handling here ....
    if (!mmap_seq_read_check(mm, seq))
        goto fail;
This API guarantees that, if none of the "fail" tests abort
speculative execution, the speculative code section did not run
concurrently with any mmap writer.
This is very similar to a seqlock, but both the writer and speculative
readers are allowed to block. In the fail case, the speculative reader
does not spin on the sequence counter; instead it should fall back to
a different mechanism such as grabbing the mmap lock read side.
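As a rough illustration of the odd/even protocol, here is a userspace model built on C11 atomics. It is a sketch under stated assumptions: struct mm_model and every model_* function are hypothetical, default sequentially consistent atomics are used instead of the kernel's barriers, and the model assumes at most one writer at a time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-mm sequence counter. */
struct mm_model {
	atomic_ulong mmap_seq;	/* even: no writer, odd: writer active */
};

static void model_mm_init(struct mm_model *mm)
{
	atomic_init(&mm->mmap_seq, 0);
}

/* Write side: move to the next odd value. */
static void model_write_lock(struct mm_model *mm)
{
	unsigned long old = atomic_fetch_add(&mm->mmap_seq, 1);

	assert(!(old & 1));	/* no other writer in this model */
}

/* Write side: move to the next even value. */
static void model_write_unlock(struct mm_model *mm)
{
	unsigned long old = atomic_fetch_add(&mm->mmap_seq, 1);

	assert(old & 1);	/* must pair with model_write_lock() */
}

static unsigned long model_seq_read_start(struct mm_model *mm)
{
	return atomic_load(&mm->mmap_seq);
}

static bool model_seq_read_check(struct mm_model *mm, unsigned long seq)
{
	return atomic_load(&mm->mmap_seq) == seq;
}

/* Example "work" callback that behaves like a concurrent writer. */
static void model_writer_pass(void *arg)
{
	struct mm_model *mm = arg;

	model_write_lock(mm);
	model_write_unlock(mm);
}

/* True only if the work ran without overlapping any writer. */
static bool model_speculate(struct mm_model *mm,
			    void (*work)(void *), void *arg)
{
	unsigned long seq = model_seq_read_start(mm);

	if (seq & 1)
		return false;
	if (work)
		work(arg);
	return model_seq_read_check(mm, seq);
}
```

A caller that gets false from model_speculate() does not loop on the counter; it would switch to the locked path, which is what distinguishes this scheme from a spinning seqlock reader.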
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-11-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I60ba909e789371217cd77c39a562a66e156b68bb
Add a new do_handle_mm_fault function, which extends the existing
handle_mm_fault() API by adding an mmap sequence count, to be used
in the FAULT_FLAG_SPECULATIVE case.
In the initial implementation, FAULT_FLAG_SPECULATIVE always fails
(by returning VM_FAULT_RETRY).
The existing handle_mm_fault() API is kept as a wrapper around
do_handle_mm_fault() so that we do not have to immediately update
every handle_mm_fault() call site.
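The wrapper arrangement can be sketched in a few lines. The names and values below (model_fault_t, MODEL_FAULT_FLAG_SPECULATIVE, MODEL_VM_FAULT_RETRY, the model_* functions, and passing 0 as the sequence count from the wrapper) are assumptions made for illustration; the real functions also take a vma and pt_regs.

```c
/* Hypothetical flag and return values; the numbers are arbitrary. */
#define MODEL_FAULT_FLAG_SPECULATIVE	0x1u
#define MODEL_VM_FAULT_RETRY		0x2u

typedef unsigned int model_fault_t;

/* Extended entry point: takes the mmap sequence count as well. */
static model_fault_t model_do_handle_mm_fault(unsigned long addr,
					      unsigned int flags,
					      unsigned long seq)
{
	(void)addr;
	(void)seq;	/* unused until speculative handling is implemented */

	/* Initial behaviour: speculative faults always ask for a retry. */
	if (flags & MODEL_FAULT_FLAG_SPECULATIVE)
		return MODEL_VM_FAULT_RETRY;
	return 0;	/* stands in for the regular fault handling */
}

/* Existing entry point, kept so current call sites need no change. */
static model_fault_t model_handle_mm_fault(unsigned long addr,
					   unsigned int flags)
{
	return model_do_handle_mm_fault(addr, flags, 0);
}
```

Keeping the old signature as a thin wrapper lets each architecture opt in to the extended call when it gains speculative support.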
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Conflicts:
mm/memory.c
1. Trivial merge conflict due to folios.
Link: https://lore.kernel.org/all/20220128131006.67712-10-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic07b6d84af3e5d1fcc856e0968f1a6dd1544fa88
Define the new FAULT_FLAG_SPECULATIVE flag, which indicates when we are
attempting speculative fault handling (without holding the mmap lock).
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Conflicts:
include/linux/mm_types.h
1. Merge conflict due to enum fault_flag being defined in mm.h instead of
mm_types.h
Link: https://lore.kernel.org/all/20220128131006.67712-9-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I48ab427dfa4d7bdbe9932588bec7ae99e9e80ae9
This configuration variable will be used to build the code needed to
handle speculative page faults.
This is enabled by default on supported architectures with SMP and MMU set.
The architecture support is needed since the speculative page fault handler
is called from the architecture's page faulting code, and some code has to
be added there to try speculative fault handling first.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-7-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ie1dc3af30bf3949173b126e6469f372c4505ec8e
In do_anonymous_page(), we have separate cases for the zero page vs
allocating new anonymous pages. However, once the pte entry has been
computed, the rest of the handling (mapping and locking the page table,
checking that we didn't lose a race with another page fault handler, etc)
is identical between the two cases.
This change reduces the code duplication between the two cases.
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Conflicts:
mm/memory.c
1. Trivial merge conflict caused by folios in mem_cgroup_charge call.
Link: https://lore.kernel.org/all/20220128131006.67712-6-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic19579571925878d632e43aa40b9f50cdf473ee6
In the mmap locking API, the *_killable() functions return an error
(or 0 on success), and the *_trylock() functions return a boolean
(true on success).
Rename the return values "int error" and "bool ok", respectively,
rather than using "ret" for both cases which I find less readable.
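The naming convention reads roughly like this in practice. The sketch below uses a hypothetical struct lock_model and model_* functions rather than the real mmap lock, and the fatal_signal parameter is an assumption standing in for an interrupted sleep.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical lock with a single "held" bit. */
struct lock_model {
	bool held;
};

/* *_killable() style: 0 on success, a negative errno on failure. */
static int model_lock_killable(struct lock_model *lock, bool fatal_signal)
{
	int error = fatal_signal ? -EINTR : 0;

	if (!error)
		lock->held = true;
	return error;
}

/* *_trylock() style: true on success, false if the lock was busy. */
static bool model_trylock(struct lock_model *lock)
{
	bool ok = !lock->held;

	if (ok)
		lock->held = true;
	return ok;
}
```

With distinct names, a reader can tell at a glance that "if (error)" and "if (!ok)" are both failure checks even though their polarity is opposite.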
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-4-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I19473932c2692833dca89db5b805dbb46970dc66
Change mmap_lock_is_contended to return a bool value, rather than an
int which the callers are then supposed to interpret as a bool. This
is to ensure consistency with other mmap lock API functions (such as
the trylock functions).
Signed-off-by: Michel Lespinasse <michel@lespinasse.org>
Link: https://lore.kernel.org/all/20220128131006.67712-3-michel@lespinasse.org/
Bug: 161210518
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I7a11ff25a493adc58480b1fe8e3f14e44ad46fb3
When a shadow VM is torn down, its VMID can be reallocated as soon as
the shadow table entry is cleared to NULL. Since tearing down the
stage-2 page-table does not imply TLB invalidation, the TLB could still
contain stale entries from the old VM and the new user of the VMID could
end up seeing erroneous translations.
Invalidate the TLB for the VMID of the VM being torn down prior to
clearing its entry in the shadow table.
Bug: 226312378
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Ice44d030bf01a1b7612413ee32440f3f38cb3e4e
The old and new mount user namespaces need to be populated
before calling vfs_rename(). Otherwise vfs_rename() will
dereference a NULL pointer and crash.
Bug: 211066171
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Change-Id: I3656073581218107fc3b1a52ebe7bcfd81a10fc2
The FROMLIST patches merged in aosp/1974918 that add vmalloc support to
KASAN now have a few fixes staged in linux-next/akpm. Sync the changes.
Bug: 217222520
Bug: 222221793
Change-Id: I33dd30e3834a4d1bb8eac611b350004afdb08a74
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
By default, thermal power throttling is always enabled, but sometimes it
needs to be disabled for a period of time, so add support for disabling
it to meet platform thermal requirements.
Bug: 209386157
Signed-off-by: Jeson Gao <jeson.gao@unisoc.com>
Change-Id: If9c53a9669eec8e2821d837cfa3c660a9cfbf934
(cherry picked from commit 64999249d5)
Changes in 5.15.30
Revert "xfrm: state and policy should fail if XFRMA_IF_ID 0"
arm64: dts: rockchip: fix rk3399-puma-haikou USB OTG mode
xfrm: Check if_id in xfrm_migrate
xfrm: Fix xfrm migrate issues when address family changes
arm64: dts: rockchip: fix rk3399-puma eMMC HS400 signal integrity
arm64: dts: rockchip: align pl330 node name with dtschema
arm64: dts: rockchip: reorder rk3399 hdmi clocks
arm64: dts: agilex: use the compatible "intel,socfpga-agilex-hsotg"
ARM: dts: rockchip: reorder rk322x hmdi clocks
ARM: dts: rockchip: fix a typo on rk3288 crypto-controller
mac80211: refuse aggregations sessions before authorized
MIPS: smp: fill in sibling and core maps earlier
ARM: 9178/1: fix unmet dependency on BITREVERSE for HAVE_ARCH_BITREVERSE
Bluetooth: hci_core: Fix leaking sent_cmd skb
can: rcar_canfd: rcar_canfd_channel_probe(): register the CAN device when fully ready
atm: firestream: check the return value of ioremap() in fs_init()
iwlwifi: don't advertise TWT support
drm/vrr: Set VRR capable prop only if it is attached to connector
nl80211: Update bss channel on channel switch for P2P_CLIENT
tcp: make tcp_read_sock() more robust
sfc: extend the locking on mcdi->seqno
bnx2: Fix an error message
kselftest/vm: fix tests build with old libc
x86/module: Fix the paravirt vs alternative order
ice: Fix race condition during interface enslave
Linux 5.15.30
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Icf3c6ca9fb4bb75435d3964e12c0fcb42397b50b