If you have trouble reading this new file format, please refresh your
prebuilt version of STG with repo sync.
Bug: 294213765
Change-Id: I4d7ee716231956c5f4da1343cc0db5170aaaa3b1
Signed-off-by: Giuliano Procida <gprocida@google.com>
When handling deduplicated compressed data, there can be multiple
decompressed extents pointing to the same compressed data in one shot.
In such cases, the bvecs which belong to the longest extent will be
selected as the primary bvecs for real decompressors to decode and the
other duplicated bvecs will be directly copied from the primary bvecs.
Previously, only relative offsets of the longest extent were checked to
decompress the primary bvecs. On rare occasions, it can be incorrect
if there are several extents with the same start relative offset.
As a result, some short bvecs could be selected for decompression and
then cause data corruption.
For example, as Shijie Sun reported off-list, considering the following
extents of a file:
117: 903345.. 915250 | 11905 : 385024.. 389120 | 4096
...
119: 919729.. 930323 | 10594 : 385024.. 389120 | 4096
...
124: 968881.. 980786 | 11905 : 385024.. 389120 | 4096
The start relative offset is the same: 2225, but extent 119 (919729..
930323) is shorter than the others.
Let's restrict the bvec length in addition to the start offset if bvecs
are not full.
Reported-by: Shijie Sun <sunshijie@xiaomi.com>
Fixes: 5c2a64252c ("erofs: introduce partial-referenced pclusters")
Tested-by Shijie Sun <sunshijie@xiaomi.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230719065459.60083-1-hsiangkao@linux.alibaba.com
(cherry picked from commit 7d15c91a75aae55767f368e8abbabd7cedf4ec94
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git dev)
Bug: 293245292
Change-Id: Ic8ded9b2d3592ffd0863f4f0d2ac4ae6a1821a1b
Signed-off-by: sunshijie <sunshijie@xiaomi.corp-partner.google.com>
ABGR64_12 is a reversed RGB format with alpha channel last,
12 bits per component like ABGR32,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
Bug: 293213303
Change-Id: Idc4e1100c9e2134a48b594151e3398f6436b010d
(cherry picked from commit 302b988ca0)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
BGR48_12 is a reversed RGB format with 12 bits per component like BGR24,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
Bug: 293213303
Change-Id: I27d14a33c8e2b4847a63ea05b285786766949ebf
(cherry picked from commit da0b7a400e)
[Jindong: Fixed conflicts in .rst file and v4l2-ioctl.c]
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
YUV48_12 is a YUV format with 12-bits per component like YUV24,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
[hverkuil: replaced a . by ,]
Bug: 293213303
Change-Id: I12e6f02b99918a429224320da2127d6b4d777584
(cherry picked from commit 99c9549677)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Y212 is a YUV format with 12-bits per component like YUYV,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
Add the missing v4l2 foramt info of Y212
Bug: 293213303
Change-Id: Ibdf9bb3a3f1eb895da9eca52d115e08b656b5153
(cherry picked from commit a178dd3bbe)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Y012 is a luma-only formats with 12-bits per pixel,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
Bug: 293213303
Change-Id: I1a8f73162932e0760aabbe44525d7c74ace9f7bd
(cherry picked from commit a490ea6844)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
P012 is a YUV format with 12-bits per component with interleaved UV,
like NV12, expanded to 16 bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
And P012M has two non contiguous planes.
Bug: 293213303
Change-Id: I1fbfa7c445bc682766f479cca07eb8cb16cbb44f
(cherry picked from commit aa10804042)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Create input symbol files to generate GKI modules header
under include/config. By placing files in this generated
directory, the default filters that ignore certain files
will work without any special handling required, and they
will also be available to inspect after the build to inspect
for the debugging purposes.
abi_gki_protected_exports: Input for gki_module_protected_exports.h
From :- ${objtree}/abi_gki_protected_exports
To :- include/config/abi_gki_protected_exports
all_kmi_symbols: Input for gki_module_unprotected.h
- Rename to abi_gki_kmi_symbols
From :- all_kmi_symbols
To :- include/config/abi_gki_kmi_symbols
Bug: 286529877
Test: TH
Test: Manual verification of the generated files
Change-Id: Iafa10631e7712a8e1e87a2f56cfd614de6b1053a
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
create_open would always take its parent directory's bpf for the created
object. Modify to use the bpf stored in fuse_dentry which is set by
lookup.
Bug: 291705489
Test: fuse_test passes, adb push file /sdcard/Android/data works
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Change-Id: I0a1ea2a291a8fdf67923f1827176b2ea96bd4c2d
Store the results of a negative lookup in the fuse_dentry so later
opcodes can use them to create files
Bug: 291705489
Test: fuse_test passes
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Change-Id: I725e714a1d6ce43f24431d07c24e96349ef1a55c
fuse_iget_backing returns an inode or null, not a ERR_PTR. So check it's
not NULL
Also make sure we put the inode if d_splice_alias fails
Bug: 293349757
Test: fuse_test runs
Signed_off_by: Paul Lawrence <paullawrence@google.com>
Change-Id: I1eadad32f80bab6730e461412b4b7ab4d6c56bf2
This adds passthrough only support for ioctls with fuse-bpf.
compat_ioctls will return -ENOTTY.
Bug: 279519292
Test: F2fsMiscTest#testAtomicWrite
Change-Id: Ia3052e465d87dc1d15ae13955fba8a7f93bc387b
Signed-off-by: Daniel Rosenberg <drosen@google.com>
mbind() calls down into vma_replace_policy() without taking the per-VMA
locks, replaces the VMA's vma->vm_policy pointer, and frees the old
policy. That's bad; a concurrent page fault might still be using the
old policy (in vma_alloc_folio()), resulting in use-after-free.
Normally this will manifest as a use-after-free read first, but it can
result in memory corruption, including because vma_alloc_folio() can
call mpol_cond_put() on the freed policy, which conditionally changes
the policy's refcount member.
This bug is specific to CONFIG_NUMA, but it does also affect non-NUMA
systems as long as the kernel was built with CONFIG_NUMA.
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Fixes: 5e31275cc9 ("mm: add per-VMA lock and helper functions to control it")
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 293665307
(cherry picked from commit 6c21e066f9)
Change-Id: I2e3a4ee8bad97457ee3e127694f0609e7a240a2f
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
lock_vma_under_rcu() tries to guarantee that __anon_vma_prepare() can't
be called in the VMA-locked page fault path by ensuring that
vma->anon_vma is set.
However, this check happens before the VMA is locked, which means a
concurrent move_vma() can concurrently call unlink_anon_vmas(), which
disassociates the VMA's anon_vma.
This means we can get UAF in the following scenario:
THREAD 1 THREAD 2
======== ========
<page fault>
lock_vma_under_rcu()
rcu_read_lock()
mas_walk()
check vma->anon_vma
mremap() syscall
move_vma()
vma_start_write()
unlink_anon_vmas()
<syscall end>
handle_mm_fault()
__handle_mm_fault()
handle_pte_fault()
do_pte_missing()
do_anonymous_page()
anon_vma_prepare()
__anon_vma_prepare()
find_mergeable_anon_vma()
mas_walk() [looks up VMA X]
munmap() syscall (deletes VMA X)
reusable_anon_vma() [called on freed VMA X]
This is a security bug if you can hit it, although an attacker would
have to win two races at once where the first race window is only a few
instructions wide.
This patch is based on some previous discussion with Linus Torvalds on
the security list.
Cc: stable@vger.kernel.org
Fixes: 5e31275cc9 ("mm: add per-VMA lock and helper functions to control it")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 293665307
(cherry picked from commit 657b514695)
[surenb: removed vma_is_tcp() call not present in 6.1]
Change-Id: I4bd91e1db337ff35eb7c1d436f4372944556dd7d
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
GIC700 erratum 2941627 may cause GIC-700 missing SPIs wake
requests when SPIs are deactivated while targeting a
sleeping CPU - ie a CPU for which the redistributor:
GICR_WAKER.ProcessorSleep == 1
This runtime situation can happen if an SPI that has been
activated on a core is retargeted to a different core, it
becomes pending and the target core subsequently enters a
power state quiescing the respective redistributor.
When this situation is hit, the de-activation carried out
on the core that activated the SPI (through either ICC_EOIR1_EL1
or ICC_DIR_EL1 register writes) does not trigger a wake
requests for the sleeping GIC redistributor even if the SPI
is pending.
Work around the erratum by de-activating the SPI using the
redistributor GICD_ICACTIVER register if the runtime
conditions require it (ie the IRQ was retargeted between
activation and de-activation).
Bug: 292459437
Change-Id: Ide915b8c925a631a7fc9ccebca19d9175def162e
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230704155034.148262-1-lpieralisi@kernel.org
(cherry picked from commit 6fe5c68ee6https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git irq/irqchip-fixes)
Signed-off-by: Carlos Galo <carlosgalo@google.com>
Lockdep is certainly right to complain about
(&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
but task is already holding lock:
(&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
Invert those to the usual ordering.
Fixes: 33313a747e ("mm: lock newly mapped VMA which can be modified after it becomes visible")
Cc: stable@vger.kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1c7873e336)
Change-Id: I85f9cfb6ee8f3d9fefda5518c5637a7dff64bac3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
When forking a child process, the parent write-protects anonymous pages
and COW-shares them with the child being forked using copy_present_pte().
We must not take any concurrent page faults on the source vma's as they
are being processed, as we expect both the vma and the pte's behind it
to be stable. For example, the anon_vma_fork() expects the parents
vma->anon_vma to not change during the vma copy.
A concurrent page fault on a page newly marked read-only by the page
copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
source vma, defeating the anon_vma_clone() that wasn't done because the
parent vma originally didn't have an anon_vma, but we now might end up
copying a pte entry for a page that has one.
Before the per-vma lock based changes, the mmap_lock guaranteed
exclusion with concurrent page faults. But now we need to do a
vma_start_write() to make sure no concurrent faults happen on this vma
while it is being processed.
This fix can potentially regress some fork-heavy workloads. Kernel
build time did not show noticeable regression on a 56-core machine while
a stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea0 ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit fb49c45532)
Change-Id: Ic5aa9dc51a888b5b0319ec4ec6d2941424573ca0
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
mmap_region adds a newly created VMA into VMA tree and might modify it
afterwards before dropping the mmap_lock. This poses a problem for page
faults handled under per-VMA locks because they don't take the mmap_lock
and can stumble on this VMA while it's still being modified. Currently
this does not pose a problem since post-addition modifications are done
only for file-backed VMAs, which are not handled under per-VMA lock.
However, once support for handling file-backed page faults with per-VMA
locks is added, this will become a race.
Fix this by write-locking the VMA before inserting it into the VMA tree.
Other places where a new VMA is added into VMA tree do not modify it
after the insertion, so do not need the same locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 33313a747e)
Change-Id: I3bb6a7bc8dd579e11f9c18cbc8e4a6e7279bbfb2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
With recent changes necessitating mmap_lock to be held for write while
expanding a stack, per-VMA locks should follow the same rules and be
write-locked to prevent page faults into the VMA being expanded. Add
the necessary locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c137381f71)
Change-Id: I3e6a8c89c1fb7c0669e1232176bb04ea6b09bc0a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
In commit 8d7071af89 ("mm: always expand the stack with the mmap write
lock held"), find_extend_vma() was no longer being used in the tree, so
it was removed. Unfortunately some GKI external module is using this,
so bring it back to allow things to continue to work.
Bug: 161946584
Fixes: 8d7071af89 ("mm: always expand the stack with the mmap write lock held")
Change-Id: I6f1fb1fd8193625fe3dac0bbc5b0aff653b3d879
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8d7071af89 upstream
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[6.1: Patch drivers/iommu/io-pgfault.c instead]
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[surenb: change in io-pgfault.c was done in iommu-sva.c]
Change-Id: Icdcdded08d7ad4eda8fae1120a3c8b3d957516c1
(cherry picked from commit 8d7071af89)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit f313c51d26 upstream.
This is a small step towards a model where GUP itself would not expand
the stack, and any user that needs GUP to not look up existing mappings,
but actually expand on them, would have to do so manually before-hand,
and with the mm lock held for writing.
It turns out that execve() already did almost exactly that, except it
didn't take the mm lock at all (it's single-threaded so no locking
technically needed, but it could cause lockdep errors). And it only did
it for the CONFIG_STACK_GROWSUP case, since in that case GUP has
obviously never expanded the stack downwards.
So just make that CONFIG_STACK_GROWSUP case do the right thing with
locking, and enable it generally. This will eventually help GUP, and in
the meantime avoids a special case and the lockdep issue.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[6.1 Minor context from still having FOLL_FORCE flags set]
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I24c652740dcfc674b0aef8e09ef72f09ad61254c
(cherry picked from commit f313c51d26)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit f440fa1ac9 upstream.
Make calls to extend_vma() and find_extend_vma() fail if the write lock
is required.
To avoid making this a flag-day event, this still allows the old
read-locking case for the trivial situations, and passes in a flag to
say "is it write-locked". That way write-lockers can say "yes, I'm
being careful", and legacy users will continue to work in all the common
cases until they have been fully converted to the new world order.
Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: If12d2d68429b6d71393f02d5ed7e6939c3cd5405
(cherry picked from commit f440fa1ac9)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 2cd76c50d0 upstream.
This is one of the simple cases, except there's no pt_regs pointer.
Which is fine, as lock_mm_and_find_vma() is set up to work fine with a
NULL pt_regs.
Powerpc already enabled LOCK_MM_AND_FIND_VMA for the main CPU faulting,
so we can just use the helper without any extra work.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I5736f498b2f45625e46554520d3aeb679e680907
(cherry picked from commit 2cd76c50d0)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit a050ba1e74 upstream.
This does the simple pattern conversion of alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa to the lock_mm_and_find_vma()
helper. They all have the regular fault handling pattern without odd
special cases.
The remaining architectures all have something that keeps us from a
straightforward conversion: ia64 and parisc have stacks that can grow
both up as well as down (and ia64 has special address region checks).
And m68k, microblaze, openrisc, sparc64, and um end up having extra
rules about only expanding the stack down a limited amount below the
user space stack pointer. That is something that x86 used to do too
(long long ago), and it probably could just be skipped, but it still
makes the conversion less than trivial.
Note that this conversion was done manually and with the exception of
alpha without any build testing, because I have a fairly limited cross-
building environment. The cases are all simple, and I went through the
changes several times, but...
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I93e4ce3cb077329e202699a16db576be3a40285b
(cherry picked from commit a050ba1e74)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8b35ca3e45 upstream.
arm has an additional check for address < FIRST_USER_ADDRESS before
expanding the stack. Since FIRST_USER_ADDRESS is defined everywhere
(generally as 0), move that check to the generic expand_downwards().
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: Ie1090f587090ef16de4bce224bbc52334bfe78fa
(cherry picked from commit 8b35ca3e45)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 24be4d0b46 upstream.
Commit ae870a68b5 ("arm64/mm: Convert to using
lock_mm_and_find_vma()") made do_page_fault() to use 'vma' even if
CONFIG_PER_VMA_LOCK is not defined, but the declaration is still in the
ifdef.
As a result, building kernel without the config fails with undeclared
variable error as below:
arch/arm64/mm/fault.c: In function 'do_page_fault':
arch/arm64/mm/fault.c:624:2: error: 'vma' undeclared (first use in this function); did you mean 'vmap'?
624 | vma = lock_mm_and_find_vma(mm, addr, regs);
| ^~~
| vmap
Fix it by moving the declaration out of the ifdef.
Fixes: ae870a68b5 ("arm64/mm: Convert to using lock_mm_and_find_vma()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[surenb: this one is taken from 6.4.y stable branch]
Change-Id: Iba3153aa67f2dab347e4bc04a09c566b47cf4f63
(cherry picked from commit 24be4d0b46)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit ae870a68b5 upstream.
This converts arm64 to use the new page fault helper. It was very
straightforward, but still needed a fix for the "obvious" conversion I
initially did. Thanks to Suren for the fix and testing.
Fixed-and-tested-by: Suren Baghdasaryan <surenb@google.com>
Unnecessary-code-removal-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[surenb: this one is taken from 6.4.y stable branch]
Change-Id: Ibda94ca9b3893b8961e1d6536c854c0aee559a6b
(cherry picked from commit ae870a68b5)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit eda0047296 upstream.
This is done as a separate patch from introducing the new
lock_mm_and_find_vma() helper, because while it's an obvious change,
it's not what x86 used to do in this area.
We already abort the page fault on fatal signals anyway, so why should
we wait for the mmap lock only to then abort later? With the new helper
function that returns without the lock held on failure anyway, this is
particularly easy and straightforward.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I9730b4543265a20253cbfc02de135cc77927f821
(cherry picked from commit eda0047296)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Since upstream commit 715f7f9ece ("locking/rtmutex: Squash !RT
tasks to DEFAULT_PRIO"), non-rt tasks do not inherit the
nice-priority values across rt_mutexes. This removes the minor
(and indirect) priority-inheritance that rt-mutexes provided for
CFS tasks.
Though without priority inheritance, time-bounded priority
inversion can occur between CFS tasks of different nice
priorities / cgroup limitations. The proxy-execution efforts
are a work-in-progress to resolve this upstream, but in the
meantime it is left to vendor hooks to provide a near term
solution to avoid priority inversion between CFS tasks.
In our oem scheduler, if a CFS thread has an "user-aware
property", we will always pick it even if it's vruntime is
bigger than the smallest one in runqueue. That's why the
trace_android_rvh_replace_next_task_fair vendorhook was added
previously in commit 53e8099784 ("ANDROID: vendor_hooks: Add
hooks for scheduler").
Thus for our oem scheduler, important CFS tasks(like
RenderThread) are marked with the "user-aware property" in their
struct task_struct. If those tasks are blocked on an rtmutex, we
want to allow the "user-aware property" to be inherited to lock
owner, so it will be selected to run immediately to release the
lock.
To support this, we need new hooks to map "user-aware property"
into different rtmutex_waiter prio and update the owner's
"user-aware property" if needed. Thus these additional vendor
hooks are needed.
In the future, once an generalized upstream solution for CFS
priority inheritance is in place, this will no longer be needed.
Bug: 290585456
Change-Id: I6521ed2086b147400a54da6b84a324baf16bc649
Signed-off-by: xieliujie <xieliujie@oppo.com>
When a device-mapper device is passing through the inline encryption
support of an underlying device, calls to blk_crypto_evict_key() take
the blk_crypto_profile::lock of the device-mapper device, then take the
blk_crypto_profile::lock of the underlying device (nested). This isn't
a real deadlock, but it causes a lockdep report because there is only
one lock class for all instances of this lock.
Lockdep subclasses don't really work here because the hierarchy of block
devices is dynamic and could have more than 2 levels.
Instead, register a dynamic lock class for each blk_crypto_profile, and
associate that with the lock.
This avoids false-positive lockdep reports like the following:
============================================
WARNING: possible recursive locking detected
6.4.0-rc5 #2 Not tainted
--------------------------------------------
fscryptctl/1421 is trying to acquire lock:
ffffff80829ca418 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0x44/0x1c0
but task is already holding lock:
ffffff8086b68ca8 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0xc8/0x1c0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&profile->lock);
lock(&profile->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
Fixes: 1b26283970 ("block: Keyslot Manager for Inline Encryption")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230610061139.212085-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 286427075
(cherry picked from commit 2fb48d88e7)
(added '#ifdef CONFIG_LOCKDEP' to keep the KMI tooling happy)
Change-Id: I21c0f941a36663c956a5c89324813bbaac0633ef
Signed-off-by: Eric Biggers <ebiggers@google.com>
Xclipse GPU driver depends on TTM for graphics buffer allocation and
management. It is required by customers to add graphics memory swap
to improve overall memory efficiency. However TTM's swap feature can't
be used since it selects victim buffer by LRU and we can't choose a
specific buffer to swap.
Xclipse GPU driver implements its own swap feature by means of APIs of
TTM. But the problem is TTM's buffer allocations statistics in ttm_tt.c
which are local to that file. Whenever a graphic buffer is swapped out,
the size of total page allocation should be decreased but it is not
possible from the outside of ttm_tt.c.
If the statistics is not maintained well, TTM ends up swapping out TTM
buffers globally which is unexpected.
Bug: 291101811
Change-Id: I143c705834bcc196432c3ef59b49c9ec31f2e971
Signed-off-by: Kyongho Cho <pullip.cho@samsung.com>
Select hidden Kconfig: NET_DEVLINK.
Required by device drivers to provide unified interface to expose
device info, capture coredump and perform device flash.
Bug: 283707518
Change-Id: I1cc5b7dce36c79549cd7f1d9b755f7bab3973f0e
Signed-off-by: michael cai <michael.cai@mediatek.com>
Signed-off-by: lambert wang <lambert.wang@mediatek.com>
Registration of module callbacks for the pKVM hypervisor is lockless
thanks to the use of a cmpxchg.
Problem, a CPU can speculatively execute an indirect branch and
speculatively read variables used in that branch. We then need to order
the memory access between variables potentially set in the driver init
(before the callback registration happen) and the call to that
registered callback.
e.g. in the case of the serial.
CPU0: CPU1:
driver_init(): hyp_serial_enabled()
base_addr = 0xdeadbeef; enabled = __hyp_putc
barrier(); barrier();
ops->register_serial_driver(putc); if (enabled)
__hyp_putc(); /* read base_addr */
This is the same for the SMC and PSCI handler callbacks. The abort and
fault callbacks are not impacted: the driver init can only happen before
the kernel is deprivileged i.e. before the host stage-2 is in place and
then before any of those callbacks can be triggered.
Instead of a full barrier, we can use the acquire/release semantics:
relaxing cmpxchg to cmpxchg_release in the registration path and use a
load_acquire in hyp_serial_enabled().
Bug: 292470326
Change-Id: I4b5fe3713fe40cc5ab42ea0e9cdf54e8315dfb44
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
commit c2508ec5a5 upstream.
.. and make x86 use it.
This basically extracts the existing x86 "find and expand faulting vma"
code, but extends it to also take the mmap lock for writing in case we
actually do need to expand the vma.
We've historically short-circuited that case, and have some rather ugly
special logic to serialize the stack segment expansion (since we only
hold the mmap lock for reading) that doesn't match the normal VM
locking.
That slight violation of locking worked well, right up until it didn't:
the maple tree code really does want proper locking even for simple
extension of an existing vma.
So extract the code for "look up the vma of the fault" from x86, fix it
up to do the necessary write locking, and make it available as a helper
function for other architectures that can use the common helper.
Note: I say "common helper", but it really only handles the normal
stack-grows-down case. Which is all architectures except for PA-RISC
and IA64. So some rare architectures can't use the helper, but if they
care they'll just need to open-code this logic.
It's also worth pointing out that this code really would like to have an
optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
read-lock (for the common case) to taking the write lock (for having to
extend the vma) in the normal single-threaded situation where there is
no other locking activity.
But that _is_ all the very uncommon special case, so while it would be
nice to have such an operation, it probably doesn't matter in reality.
I did put in the skeleton code for such a possible future expansion,
even if it only acts as pseudo-documentation for what we're doing.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[surenb: this one is taken from 6.4.y stable branch]
Change-Id: I6e16e6751245ac24adcbe78114bc57c726463acb
(cherry-picked from commit d6a5c7a1a6)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
commit cd00dd2585 upstream.
Check the write offset end bounds before using it as the offset into the
pivot array. This avoids a possible out-of-bounds access on the pivot
array if the write extends to the last slot in the node, in which case the
node maximum should be used as the end pivot.
akpm: this doesn't affect any current callers, but new users of mapletree
may encounter this problem if backported into earlier kernels, so let's
fix it in -stable kernels in case of this.
Link: https://lkml.kernel.org/r/20230506024752.2550-1-zhangpeng.00@bytedance.com
Fixes: 54a611b605 ("Maple Tree: add new data structure")
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I992549af25fa9c22f587893d004002d2e004d317
(cherry-picked from commit 4e2ad53aba)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
commit d7893093a7 upstream.
TLDR: It's a mess.
When kexec() is executed on a system with offline CPUs, which are parked in
mwait_play_dead() it can end up in a triple fault during the bootup of the
kexec kernel or cause hard to diagnose data corruption.
The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernels data.
Cure this by bringing the offlined CPUs out of MWAIT into HLT.
Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offlined CPUs to issue
HLT, which does not have the MWAIT problem.
That does not help, if a stray NMI, MCE or SMI hits the offlined CPUs as
those make it come out of HLT.
A follow up change will put them into INIT, which protects at least against
NMI and SMI.
Fixes: ea53069231 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.492257119@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I80035e671b55732ac3d56c71dc53364e82238fe2
(cherry-picked from commit 0af4750eaa)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>