Commit Graph

1149989 Commits

Author SHA1 Message Date
Matthew Wilcox (Oracle)
66cbbe6b31 FROMGIT: mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
Handle a little more of the page fault path outside the mmap sem.  The
hugetlb path doesn't need to check whether the VMA is anonymous; the
VM_HUGETLB flag is only set on hugetlbfs VMAs.  There should be no
performance change from the previous commit; this is simply a step to ease
bisection of any problems.

Link: https://lkml.kernel.org/r/20230724185410.1124082-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 51db5e8974cafee10b2252efa78f89af7d60cd11
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)

Bug: 293665307
Change-Id: I300c7105fa3530e8eb05862cb3f66b7adac99420
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 09:59:51 -07:00
Matthew Wilcox (Oracle)
e26044769f BACKPORT: FROMGIT: mm: allow per-VMA locks on file-backed VMAs
Remove the TCP layering violation by allowing per-VMA locks on all VMAs.
The fault path will immediately fail in handle_mm_fault().  There may be a
small performance reduction from this patch as a little unnecessary work
will be done on each page fault.  See later patches for the improvement.

Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 698dcd77360a3ce15dfc6fe55f9b5572ad4c4291
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
[surenb: skip tcp-related changes]

Bug: 293665307
Change-Id: I73d9d1e4f96419d4723a920fc5960e806749c368
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 09:59:46 -07:00
Matthew Wilcox (Oracle)
4cb518a06f FROMGIT: mm: remove CONFIG_PER_VMA_LOCK ifdefs
Patch series "Handle most file-backed faults under the VMA lock", v3.

This patchset adds the ability to handle page faults on parts of files
which are already in the page cache without taking the mmap lock.

This patch (of 10):

Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined to
eliminate ifdefs in the users.

Link: https://lkml.kernel.org/r/20230724185410.1124082-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230724185410.1124082-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit a457f3e92ccba03be36e8c04c77992d87004806c
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)

Bug: 293665307
Change-Id: I8cb7d9536b8c54925d04945566b75d4ea2ff042c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 09:59:44 -07:00
Suren Baghdasaryan
f4b32b7f15 FROMGIT: mm: fix a lockdep issue in vma_assert_write_locked
__is_vma_write_locked() can be used only when mmap_lock is write-locked to
guarantee vm_lock_seq and mm_lock_seq stability during the check.
Therefore it asserts this condition before further checks.  Because of
that it can't be used unless the user expects the mmap_lock to be
write-locked.  vma_assert_locked() can't assume this before ensuring that
VMA is not read-locked.

Change the order of the checks in vma_assert_locked() to check if the VMA
is read-locked first and only then assert if it's not write-locked.

Link: https://lkml.kernel.org/r/20230712195652.969194-1-surenb@google.com
Fixes: 50b88b63e3e4 ("mm: handle userfaults under VMA lock")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Liam R. Howlett <liam.howlett@oracle.com>
Closes: https://lore.kernel.org/all/20230712022620.3yytbdh24b7i4zrn@revolver/
Reported-by: syzbot+339b02f826caafd5f7a8@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000002db68f05ffb791bc@google.com/
Cc: Christian Brauner <brauner@kernel.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 781537884e9905f2df812c8b754165dd606ae300
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 161210518
Change-Id: Ida7831576918000bb73850e639aae0d82f0c9fca
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 09:59:42 -07:00
Suren Baghdasaryan
250f19771f FROMGIT: mm: handle userfaults under VMA lock
Enable handle_userfault to operate under VMA lock by releasing VMA lock
instead of mmap_lock and retrying.  Note that FAULT_FLAG_RETRY_NOWAIT
should never be used when handling faults under per-VMA lock protection
because that would break the assumption that lock is dropped on retry.

Link: https://lkml.kernel.org/r/20230630211957.1341547-7-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit c3c986f59c814edecc096a049d67e5791083388b
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 161210518
Change-Id: I9df667dae39024e5473252d7347ec7929f7f999e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 09:59:38 -07:00
Suren Baghdasaryan
e704d0e4f9 FROMGIT: mm: handle swap page faults under per-VMA lock
When page fault is handled under per-VMA lock protection, all swap page
faults are retried with mmap_lock because folio_lock_or_retry has to drop
and reacquire mmap_lock if folio could not be immediately locked.  Follow
the same pattern as mmap_lock to drop per-VMA lock when waiting for folio
and retrying once folio is available.

With this obstacle removed, enable do_swap_page to operate under per-VMA
lock protection.  Drivers implementing ops->migrate_to_ram might still
rely on mmap_lock, therefore we have to fall back to mmap_lock in that
particular case.

Note that the only time do_swap_page calls synchronous swap_readpage is
when SWP_SYNCHRONOUS_IO is set, which is only set for
QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and pmem).
Therefore we don't sleep in this path, and there's no need to drop the
mmap or per-VMA lock.

Link: https://lkml.kernel.org/r/20230630211957.1341547-6-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit cc989adb5544594d8c12893eda3c6df8682de11b
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 161210518
Change-Id: I5d80f435b2dbdc3f3d02be056e893f6fedbc7a98
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 09:59:37 -07:00
Suren Baghdasaryan
f8a65b694b FROMGIT: mm: change folio_lock_or_retry to use vm_fault directly
Change folio_lock_or_retry to accept vm_fault struct and return the
vm_fault_t directly.

Link: https://lkml.kernel.org/r/20230630211957.1341547-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit af27bb856a0a29a0673aabe163e4774df67a8bcd
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 161210518
Change-Id: I9d203e801f0d5517fba8430f9ab82d4063b517f3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 09:59:35 -07:00
Suren Baghdasaryan
693d905ec0 BACKPORT: FROMGIT: mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED
handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means
mmap_lock has been released.  However with per-VMA locks behavior is
different and the caller should still release it.  To make the rules
consistent for the caller, drop the per-VMA lock when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED.  Currently the only path returning
VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns
VM_FAULT_COMPLETED for now.

Link: https://lkml.kernel.org/r/20230630211957.1341547-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 5197d920745dd42eae023986dbf053107ac238db
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
[surenb: add the code from missing sanitize_fault_flags directly into
handle_mm_fault, add the fix for riscv]
Bug: 161210518
Change-Id: Iefd4e49bda940c457a70ecf40d074ad532959759
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 09:59:34 -07:00
Suren Baghdasaryan
939d4b1ccc BACKPORT: FROMGIT: mm: move vma locking out of vma_prepare and dup_anon_vma
vma_prepare() is currently the central place where vmas are being locked
before vma_complete() applies changes to them. While this is convenient,
it also obscures vma locking and makes it harder to follow the locking
rules. Move vma locking out of vma_prepare() and take vma locks
explicitly at the locations where vmas are being modified. Move vma
locking and replace it with an assertion inside dup_anon_vma() to further
clarify the locking pattern inside vma_merge().

Link: https://lkml.kernel.org/r/20230804152724.3090321-7-surenb@google.com
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit b1985ca5e7e6464d205a98a78cca229224346c21
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
[surenb: skip changes in vma_prepare() which does not exist, skip
changes in vma_merge() since required locks are already in __vma_adjust(),
skip change in dup_anon_vma() since required locks are already in place,
skip unnecessary lock in do_brk_flags()]

Bug: 293665307
Change-Id: I99261aa1db3bec73795e63c333768bc68da8045c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Suren Baghdasaryan
0f0b09c02c BACKPORT: FROMGIT: mm: always lock new vma before inserting into vma tree
While it's not strictly necessary to lock a newly created vma before
adding it into the vma tree (as long as no further changes are performed
to it), it seems like a good policy to lock it and prevent accidental
changes after it becomes visible to the page faults. Lock the vma before
adding it into the vma tree.

Link: https://lkml.kernel.org/r/20230804152724.3090321-6-surenb@google.com
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit c3249c06c48dda30f93e62b57773d5ed409d4f77
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
[surenb: resolved conflicts due to changes in vma_merge() and
__vma_adjust()]

Bug: 293665307
Change-Id: I4ee0d2abcc8a3f45545f470f1bf7f0be728d6f44
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Suren Baghdasaryan
a8a479ed96 FROMGIT: mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once
Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is
not obvious and makes it hard to understand where vma locking is happening.
Also in some cases (like in dup_userfaultfd()) vma should be locked earlier
than vma_flags modification. To make locking more visible, change these
functions to assert that the vma write lock is taken and explicitly lock
the vma beforehand. Fix userfaultfd functions which should lock the vma
earlier.

Link: https://lkml.kernel.org/r/20230804152724.3090321-5-surenb@google.com
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit f26ee2701ab3ecd771084b44f262bd010accab72
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)

Bug: 293665307
Change-Id: I62f0f25c883588c3ba7a322b3a4929df01413591
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Suren Baghdasaryan
ad18923856 FROMGIT: mm: replace mmap with vma write lock assertions when operating on a vma
Vma write lock assertion always includes mmap write lock assertion and
additional vma lock checks when per-VMA locks are enabled. Replace
weaker mmap_assert_write_locked() assertions with stronger
vma_assert_write_locked() ones when we are operating on a vma which
is expected to be locked.

Link: https://lkml.kernel.org/r/20230804152724.3090321-4-surenb@google.com
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 928a31b91cf64aa99a8999dcd66bec0ad02f64ef
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)

Bug: 293665307
Change-Id: I861db0510612f571f2ca44e0a9d7e01274d4eb36
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Suren Baghdasaryan
5f0ca924aa FROMGIT: mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap
When CONFIG_PER_VMA_LOCK=n, vma_assert_write_locked() should be equivalent
to mmap_assert_write_locked().

Link: https://lkml.kernel.org/r/20230804152724.3090321-3-surenb@google.com
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit f0cdd55d6dd8c7a1b333049e4f83eb25fef312ad
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)

Bug: 293665307
Change-Id: Ie20ff6c35fb2f561f2061c5d0135238b7b03afa5
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Jann Horn
abb0f2767e FROMGIT: mm: don't drop VMA locks in mm_drop_all_locks()
Despite its name, mm_drop_all_locks() does not drop _all_ locks; the mmap
lock is held write-locked by the caller, and the caller is responsible for
dropping the mmap lock at a later point (which will also release the VMA
locks).

Calling vma_end_write_all() here is dangerous because the caller might
have write-locked a VMA with the expectation that it will stay
write-locked until the mmap_lock is released, as usual.

This _almost_ becomes a problem in the following scenario:

An anonymous VMA A and an SGX VMA B are mapped adjacent to each other.
Userspace calls munmap() on a range starting at the start address of A and
ending in the middle of B.

Hypothetical call graph with additional notes in brackets:

do_vmi_align_munmap
  [begin first for_each_vma_range loop]
  vma_start_write [on VMA A]
  vma_mark_detached [on VMA A]
  __split_vma [on VMA B]
    sgx_vma_open [== new->vm_ops->open]
      sgx_encl_mm_add
        __mmu_notifier_register [luckily THIS CAN'T ACTUALLY HAPPEN]
          mm_take_all_locks
          mm_drop_all_locks
            vma_end_write_all [drops VMA lock taken on VMA A before]
  vma_start_write [on VMA B]
  vma_mark_detached [on VMA B]
  [end first for_each_vma_range loop]
  vma_iter_clear_gfp [removes VMAs from maple tree]
  mmap_write_downgrade
  unmap_region
  mmap_read_unlock

In this hypothetical scenario, while do_vmi_align_munmap() thinks it still
holds a VMA write lock on VMA A, the VMA write lock has actually been
invalidated inside __split_vma().

The call from sgx_encl_mm_add() to __mmu_notifier_register() can't
actually happen here, as far as I understand, because we are duplicating
an existing SGX VMA, but sgx_encl_mm_add() only calls
__mmu_notifier_register() for the first SGX VMA created in a given
process.  So this could only happen in fork(), not on munmap().  But in my
view it is just pure luck that this can't happen.

Also, we wouldn't actually have any bad consequences from this in
do_vmi_align_munmap(), because by the time the bug drops the lock on VMA
A, we've already marked VMA A as detached, which makes it completely
ineligible for any VMA-locked page faults.  But again, that's just pure
luck.

So remove the vma_end_write_all(), so that VMA write locks are only ever
released on mmap_write_unlock() or mmap_write_downgrade().

Also add comments to document the locking rules established by this patch.

Link: https://lkml.kernel.org/r/20230720193436.454247-1-jannh@google.com
Fixes: eeff9a5d47 ("mm/mmap: prevent pagefault handler from racing with mmu_notifier registration")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 28ed252b44fb2f1efaef1287eea267d54e79f7d5
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)

Bug: 293665307
Change-Id: Ic0b28229d175e3125de1ef274282fbf43b556db7
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Jisheng Zhang
365af746f5 BACKPORT: riscv: mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

A simple running the ebizzy benchmark on Lichee Pi 4A shows that
PER_VMA_LOCK can improve the ebizzy benchmark by about 32.68%. In
theory, the more CPUs, the bigger improvement, but I don't have any
HW platform which has more than 4 CPUs.

This is the riscv variant of "x86/mm: try VMA lock-based page fault
handling first".

Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Reviewed-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/20230523165942.2630-1-jszhang@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>

(cherry picked from commit 648321fa0d)

Bug: 293665307
Change-Id: I59b63add96645d2483f87c2b680d4a7afa86f7b6
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Suren Baghdasaryan
3c187b4a12 BACKPORT: FROMGIT: mm: enable page walking API to lock vmas during the walk
walk_page_range() and friends often operate under write-locked mmap_lock.
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas.  Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.

The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks.  With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them.  The change ensures vmas are properly locked during such
walks.

A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration.  Without this change a concurrent page
can be faulted into the area and be left out of migration.

Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 2ebc368f59eedcef0de7c832fe1d62935cd3a7ff
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
[surenb: changed locking in break_ksm since it's done differently,
skipped the change in the missing __ksm_del_vma(),  skipped the change in
the missing walk_page_range_vma(), removed unused local variables]

Bug: 293665307
Change-Id: Iede9eaa950ea59a268a2e74a8d3022162f0bbd80
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Jann Horn
b6093c47fe BACKPORT: mm: lock VMA in dup_anon_vma() before setting ->anon_vma
When VMAs are merged, dup_anon_vma() is called with `dst` pointing to the
VMA that is being expanded to cover the area previously occupied by
another VMA.  This currently happens while `dst` is not write-locked.

This means that, in the `src->anon_vma && !dst->anon_vma` case, as soon as
the assignment `dst->anon_vma = src->anon_vma` has happened, concurrent
page faults can happen on `dst` under the per-VMA lock.  This is already
icky in itself, since such page faults can now install pages into `dst`
that are attached to an `anon_vma` that is not yet tied back to the
`anon_vma` with an `anon_vma_chain`.  But if `anon_vma_clone()` fails due
to an out-of-memory error, things get much worse: `anon_vma_clone()` then
reverts `dst->anon_vma` back to NULL, and `dst` remains completely
unconnected to the `anon_vma`, even though we can have pages in the area
covered by `dst` that point to the `anon_vma`.

This means the `anon_vma` of such pages can be freed while the pages are
still mapped into userspace, which leads to UAF when a helper like
folio_lock_anon_vma_read() tries to look up the anon_vma of such a page.

This theoretically is a security bug, but I believe it is really hard to
actually trigger as an unprivileged user because it requires that you can
make an order-0 GFP_KERNEL allocation fail, and the page allocator tries
pretty hard to prevent that.

I think doing the vma_start_write() call inside dup_anon_vma() is the most
straightforward fix for now.

For a kernel-assisted reproducer, see the notes section of the patch mail.

Link: https://lkml.kernel.org/r/20230721034643.616851-1-jannh@google.com
Fixes: 5e31275cc9 ("mm: add per-VMA lock and helper functions to control it")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit d8ab9f7b64)
[surenb: since dup_anon_vma() is missing, add vma_start_write() directly
before anon_vma is assigned]

Bug: 293665307
Change-Id: I1b44e6278e464157e666cc5dbdb0fcc29bcf665e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Jann Horn
0ee0062c94 UPSTREAM: mm: fix memory ordering for mm_lock_seq and vm_lock_seq
mm->mm_lock_seq effectively functions as a read/write lock; therefore it
must be used with acquire/release semantics.

A specific example is the interaction between userfaultfd_register() and
lock_vma_under_rcu().

userfaultfd_register() does the following from the point where it changes
a VMA's flags to the point where concurrent readers are permitted again
(in a simple scenario where only a single private VMA is accessed and no
merging/splitting is involved):

userfaultfd_register
  userfaultfd_set_vm_flags
    vm_flags_reset
      vma_start_write
        down_write(&vma->vm_lock->lock)
        vma->vm_lock_seq = mm_lock_seq [marks VMA as busy]
        up_write(&vma->vm_lock->lock)
      vm_flags_init
        [sets VM_UFFD_* in __vm_flags]
  vma->vm_userfaultfd_ctx.ctx = ctx
  mmap_write_unlock
    vma_end_write_all
      WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1) [unlocks VMA]

There are no memory barriers in between the __vm_flags update and the
mm->mm_lock_seq update that unlocks the VMA, so the unlock can be
reordered to above the `vm_flags_init()` call, which means from the
perspective of a concurrent reader, a VMA can be marked as a userfaultfd
VMA while it is not VMA-locked.  That's bad, we definitely need a
store-release for the unlock operation.

The non-atomic write to vma->vm_lock_seq in vma_start_write() is mostly
fine because all accesses to vma->vm_lock_seq that matter are always
protected by the VMA lock.  There is a racy read in vma_start_read()
though that can tolerate false-positives, so we should be using
WRITE_ONCE() to keep things tidy and data-race-free (including for KCSAN).

On the other side, lock_vma_under_rcu() works as follows in the relevant
region for locking and userfaultfd check:

lock_vma_under_rcu
  vma_start_read
    vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [early bailout]
    down_read_trylock(&vma->vm_lock->lock)
    vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq) [main check]
  userfaultfd_armed
    checks vma->vm_flags & __VM_UFFD_FLAGS

Here, the interesting aspect is how far down the mm->mm_lock_seq read can
be reordered - if this read is reordered down below the vma->vm_flags
access, this could cause lock_vma_under_rcu() to partly operate on
information that was read while the VMA was supposed to be locked.  To
prevent this kind of downwards bleeding of the mm->mm_lock_seq read, we
need to read it with a load-acquire.

Some of the comment wording is based on suggestions by Suren.

BACKPORT WARNING: One of the functions changed by this patch (which I've
written against Linus' tree) is vma_try_start_write(), but this function
no longer exists in mm/mm-everything.  I don't know whether the merged
version of this patch will be ordered before or after the patch that
removes vma_try_start_write().  If you're backporting this patch to a tree
with vma_try_start_write(), make sure this patch changes that function.

Link: https://lkml.kernel.org/r/20230721225107.942336-1-jannh@google.com
Fixes: 5e31275cc9 ("mm: add per-VMA lock and helper functions to control it")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit b1f02b9575)

Bug: 293665307
Change-Id: Ifbf30a8ee7211f9c7fe26b923ca33ffde68b6a7b
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Xu Yang
3378cbd264 FROMGIT: usb: host: ehci-sched: try to turn on io watchdog as long as periodic_count > 0
If initially isoc_count = 0, periodic_count > 0 and the io watchdog is
not started (e.g. just timed out), then the io watchdog may not run after
submitting isoc urbs and enable_periodic(). The isoc urbs may not complete
forever if the controller had already stopped periodic schedule.

This will try to call turn_on_io_watchdog() for each enable_periodic() to
ensure the io watchdog functions properly.

Bug: 295046582
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Link: https://lore.kernel.org/r/20230809065327.952368-1-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit c272dabf2d
    https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git usb-next)
Change-Id: I0f10ec8bcf0e14269b2a9693617dd83327c26a20
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
2023-08-16 16:51:21 +00:00
Xu Yang
2d3351bd5e FROMGIT: BACKPORT: usb: ehci: add workaround for chipidea PORTSC.PEC bug
Some NXP processor using chipidea IP has a bug when frame babble is
detected.

As per 4.15.1.1.1 Serial Bus Babble:
  A babble condition also exists if IN transaction is in progress at
High-speed SOF2 point. This is called frame babble. The host controller
must disable the port to which the frame babble is detected.

The USB controller has disabled the port (PE cleared) and has asserted
USBERRINT when frame babble is detected, but PEC is not asserted.
Therefore, the SW isn't aware that port has been disabled. Then the
SW keeps sending packets to this port, but all of the transfers will
fail.

This workaround will firstly assert PCD by SW when USBERRINT is detected
and then judge whether port change has really occurred or not by polling
roothub status. Because the PEC doesn't get asserted in our case, this
patch will also assert it by SW when specific conditions are satisfied.

Bug: 295046582
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Acked-by: Peter Chen <peter.chen@kernel.org>
Link: https://lore.kernel.org/r/20230809024432.535160-1-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit dda4b60ed7
    https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git usb-next)
[JD: replaced has_ci_pec_bug with existing has_fsl_port_bug to avoid abi breakage]
Change-Id: I7d36cf656efda2dd46c0ddcca252b3de6ea434ee
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
2023-08-16 16:51:21 +00:00
Chaoyuan Peng
7fa8861130 UPSTREAM: tty: n_gsm: fix UAF in gsm_cleanup_mux
commit 9b9c8195f3 upstream.

In gsm_cleanup_mux() the 'gsm->dlci' pointer was not cleaned properly,
leaving it a dangling pointer after gsm_dlci_release.
This leads to use-after-free where 'gsm->dlci[0]' are freed and accessed
by the subsequent gsm_cleanup_mux().

Such is the case in the following call trace:

 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1e3/0x2cb lib/dump_stack.c:106
 print_address_description+0x63/0x3b0 mm/kasan/report.c:248
 __kasan_report mm/kasan/report.c:434 [inline]
 kasan_report+0x16b/0x1c0 mm/kasan/report.c:451
 gsm_cleanup_mux+0x76a/0x850 drivers/tty/n_gsm.c:2397
 gsm_config drivers/tty/n_gsm.c:2653 [inline]
 gsmld_ioctl+0xaae/0x15b0 drivers/tty/n_gsm.c:2986
 tty_ioctl+0x8ff/0xc50 drivers/tty/tty_io.c:2816
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:874 [inline]
 __se_sys_ioctl+0xf1/0x160 fs/ioctl.c:860
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
 </TASK>

Allocated by task 3501:
 kasan_save_stack mm/kasan/common.c:38 [inline]
 kasan_set_track mm/kasan/common.c:46 [inline]
 set_alloc_info mm/kasan/common.c:434 [inline]
 ____kasan_kmalloc+0xba/0xf0 mm/kasan/common.c:513
 kasan_kmalloc include/linux/kasan.h:264 [inline]
 kmem_cache_alloc_trace+0x143/0x290 mm/slub.c:3247
 kmalloc include/linux/slab.h:591 [inline]
 kzalloc include/linux/slab.h:721 [inline]
 gsm_dlci_alloc+0x53/0x3a0 drivers/tty/n_gsm.c:1932
 gsm_activate_mux+0x1c/0x330 drivers/tty/n_gsm.c:2438
 gsm_config drivers/tty/n_gsm.c:2677 [inline]
 gsmld_ioctl+0xd46/0x15b0 drivers/tty/n_gsm.c:2986
 tty_ioctl+0x8ff/0xc50 drivers/tty/tty_io.c:2816
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:874 [inline]
 __se_sys_ioctl+0xf1/0x160 fs/ioctl.c:860
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x61/0xcb

Freed by task 3501:
 kasan_save_stack mm/kasan/common.c:38 [inline]
 kasan_set_track+0x4b/0x80 mm/kasan/common.c:46
 kasan_set_free_info+0x1f/0x40 mm/kasan/generic.c:360
 ____kasan_slab_free+0xd8/0x120 mm/kasan/common.c:366
 kasan_slab_free include/linux/kasan.h:230 [inline]
 slab_free_hook mm/slub.c:1705 [inline]
 slab_free_freelist_hook+0xdd/0x160 mm/slub.c:1731
 slab_free mm/slub.c:3499 [inline]
 kfree+0xf1/0x270 mm/slub.c:4559
 dlci_put drivers/tty/n_gsm.c:1988 [inline]
 gsm_dlci_release drivers/tty/n_gsm.c:2021 [inline]
 gsm_cleanup_mux+0x574/0x850 drivers/tty/n_gsm.c:2415
 gsm_config drivers/tty/n_gsm.c:2653 [inline]
 gsmld_ioctl+0xaae/0x15b0 drivers/tty/n_gsm.c:2986
 tty_ioctl+0x8ff/0xc50 drivers/tty/tty_io.c:2816
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:874 [inline]
 __se_sys_ioctl+0xf1/0x160 fs/ioctl.c:860
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x61/0xcb

Bug: 291178675
Fixes: aa371e96f0 ("tty: n_gsm: fix restart handling via CLD command")
Signed-off-by: Chaoyuan Peng <hedonistsmith@gmail.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9615ca54bc)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I947cad0e8080378b40d4098add48992ade5fe638
2023-08-16 09:18:40 +00:00
Liam R. Howlett
683966ac69 UPSTREAM: mm/mmap: Fix extra maple tree write
based on commit 0503ea8f5b upstream.

This was inadvertently fixed during the removal of __vma_adjust().

When __vma_adjust() is adjusting next with a negative value (pushing
vma->vm_end lower), there would be two writes to the maple tree.  The
first write is unnecessary and uses all allocated nodes in the maple
state.  The second write is necessary but will need to allocate nodes
since the first write has used the allocated nodes.  This may be a
problem as it may not be safe to allocate at this time, such as a low
memory situation.  Fix the issue by avoiding the first write and only
write the adjusted "next" VMA.

Reported-by: John Hsu <John.Hsu@mediatek.com>
Link: https://lore.kernel.org/lkml/9cb8c599b1d7f9c1c300d1a334d5eb70ec4d7357.camel@mediatek.com/
Cc: stable@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

(cherry picked from commit a02c6dc0ef
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
linux-6.1.y)

Bug: 295269894
Change-Id: I1a4bdc080d4ee92dbe06dc788961532d0c85fd7c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-15 21:06:30 +00:00
Charan Teja Kalla
f86c79eb86 FROMGIT: Multi-gen LRU: skip CMA pages when they are not eligible
This patch is based on the commit 5da226dbfce3("mm: skip CMA pages when
they are not available") which skips cma pages reclaim when they are not
eligible for the current allocation context.  In mglru, such pages are
added to the tail of the immediate generation to maintain better LRU
order, which is unlike the case of conventional LRU where such pages are
directly added to the head of the LRU list(akin to adding to head of the
youngest generation in mglru).

No observable issue without this patch on MGLRU, but logically it make
sense to skip the CMA page reclaim when those pages can't be satisfied for
the current allocation context.

Link: https://lkml.kernel.org/r/1691568344-13475-1-git-send-email-quic_charante@quicinc.com
Change-Id: I586415b3e3a92da23f3e79b9d63802a2ced03432
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 75d52d9304ef5b268eb798b0c679815290a0fc83 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 288383787
Bug: 291719697
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2023-08-15 19:57:01 +00:00
Zhaoyang Huang
7ae1e02abb UPSTREAM: mm: skip CMA pages when they are not available
This patch fixes unproductive reclaiming of CMA pages by skipping them
when they are not available for current context.  It arises from the below
OOM issue, which was caused by a large proportion of MIGRATE_CMA pages
among free pages.

[   36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0
[   36.189447] [03-19 10:05:52.189] DMA32: 0*4kB 447*8kB (C) 217*16kB (C) 124*32kB (C) 136*64kB (C) 70*128kB (C) 22*256kB (C) 3*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 35848kB
[   36.193125] [03-19 10:05:52.193] Normal: 231*4kB (UMEH) 49*8kB (MEH) 14*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 3236kB
...
[   36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
[   36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0
[   36.234459] [03-19 10:05:52.234] node 0: slabs: 53,objs: 3392, free: 0

This change further decreases the chance for wrong OOMs in the presence
of a lot of CMA memory.

[david@redhat.com: changelog addition]
Link: https://lkml.kernel.org/r/1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com
Change-Id: I84f1145c38b5ff7b825f2122b33bc55997931bd7
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: ke.wang <ke.wang@unisoc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 5da226dbfc)
Bug: 288383787
Bug: 291719697
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2023-08-15 19:57:01 +00:00
Dan Carpenter
7666325265 UPSTREAM: dma-buf: fix an error pointer vs NULL bug
Smatch detected potential error pointer dereference.

    drivers/gpu/drm/drm_syncobj.c:888 drm_syncobj_transfer_to_timeline()
    error: 'fence' dereferencing possible ERR_PTR()

The error pointer comes from dma_fence_allocate_private_stub().  One
caller expected error pointers and one expected NULL pointers.  Change
it to return NULL and update the caller which expected error pointers,
drm_syncobj_assign_null_handle(), to check for NULL instead.

Bug: 286438670
Fixes: f781f661e8 ("dma-buf: keep the signaling time of merged fences v3")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sumit Semwal <sumit.semwal@linaro.org>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Link: https://patchwork.freedesktop.org/patch/msgid/b09f1996-3838-4fa2-9193-832b68262e43@moroto.mountain
(cherry picked from commit 00ae1491f9)
Change-Id: I9fe1e61543e84a0f22d8ec26e01d94b809620744
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
2023-08-15 16:58:34 +00:00
Christian König
e61d76121f UPSTREAM: dma-buf: keep the signaling time of merged fences v3
Some Android CTS is testing if the signaling time keeps consistent
during merges.

v2: use the current time if the fence is still in the signaling path and
the timestamp not yet available.
v3: improve comment, fix one more case to use the correct timestamp

Bug: 286438670
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230630120041.109216-1-christian.koenig@amd.com
(cherry picked from commit f781f661e8)
Change-Id: I5cd3178213fc28ac67146f58fddf83f7d482fd76
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
2023-08-15 16:58:34 +00:00
Pablo Neira Ayuso
fda157ce15 UPSTREAM: netfilter: nf_tables: skip bound chain on rule flush
[ Upstream commit 6eaf41e87a ]

Skip bound chain when flushing table rules, the rule that owns this
chain releases these objects.

Otherwise, the following warning is triggered:

  WARNING: CPU: 2 PID: 1217 at net/netfilter/nf_tables_api.c:2013 nf_tables_chain_destroy+0x1f7/0x210 [nf_tables]
  CPU: 2 PID: 1217 Comm: chain-flush Not tainted 6.1.39 #1
  RIP: 0010:nf_tables_chain_destroy+0x1f7/0x210 [nf_tables]

Bug: 294357305
Fixes: d0e2c7de92 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Reported-by: Kevin Rich <kevinrich1337@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit e18922ce3e)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I48f43d0ce3410efec2513479a1f4c7708a097b01
2023-08-15 16:18:17 +00:00
Pedro Tammela
110a26edd1 UPSTREAM: net/sched: sch_qfq: account for stab overhead in qfq_enqueue
[ Upstream commit 3e337087c3 ]

Lion says:
-------
In the QFQ scheduler a similar issue to CVE-2023-31436
persists.

Consider the following code in net/sched/sch_qfq.c:

static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                struct sk_buff **to_free)
{
     unsigned int len = qdisc_pkt_len(skb), gso_segs;

    // ...

     if (unlikely(cl->agg->lmax < len)) {
         pr_debug("qfq: increasing maxpkt from %u to %u for class %u",
              cl->agg->lmax, len, cl->common.classid);
         err = qfq_change_agg(sch, cl, cl->agg->class_weight, len);
         if (err) {
             cl->qstats.drops++;
             return qdisc_drop(skb, sch, to_free);
         }

    // ...

     }

Similarly to CVE-2023-31436, "lmax" is increased without any bounds
checks according to the packet length "len". Usually this would not
impose a problem because packet sizes are naturally limited.

This is however not the actual packet length, rather the
"qdisc_pkt_len(skb)" which might apply size transformations according to
"struct qdisc_size_table" as created by "qdisc_get_stab()" in
net/sched/sch_api.c if the TCA_STAB option was set when modifying the qdisc.

A user may choose virtually any size using such a table.

As a result the same issue as in CVE-2023-31436 can occur, allowing heap
out-of-bounds read / writes in the kmalloc-8192 cache.
-------

We can create the issue with the following commands:

tc qdisc add dev $DEV root handle 1: stab mtu 2048 tsize 512 mpu 0 \
overhead 999999999 linklayer ethernet qfq
tc class add dev $DEV parent 1: classid 1:1 htb rate 6mbit burst 15k
tc filter add dev $DEV parent 1: matchall classid 1:1
ping -I $DEV 1.1.1.2

This is caused by incorrectly assuming that qdisc_pkt_len() returns a
length within the QFQ_MIN_LMAX < len < QFQ_MAX_LMAX.

Bug: 292249631
Fixes: 462dbc9101 ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost")
Reported-by: Lion <nnamrec@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 70feebdbfa)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I69bec7b092e980fe8e0946c26ed9b5ac7c57bf3d
2023-08-15 16:15:08 +00:00
Pedro Tammela
9db1437238 UPSTREAM: net/sched: sch_qfq: refactor parsing of netlink parameters
[ Upstream commit 25369891fc ]

Two parameters can be transformed into netlink policies and
validated while parsing the netlink message.

Bug: 292249631
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 3e337087c3 ("net/sched: sch_qfq: account for stab overhead in qfq_enqueue")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 4b33836824)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: Ifce65b6b0ce2f7dee2040a4c91fd90ea7b2e8f3c
2023-08-15 16:15:08 +00:00
Florian Westphal
7688102949 UPSTREAM: netfilter: nft_set_pipapo: fix improper element removal
[ Upstream commit 87b5a5c209 ]

end key should be equal to start unless NFT_SET_EXT_KEY_END is present.

Its possible to add elements that only have a start key
("{ 1.0.0.0 . 2.0.0.0 }") without an internval end.

Insertion treats this via:

if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END))
   end = (const u8 *)nft_set_ext_key_end(ext)->data;
else
   end = start;

but removal side always uses nft_set_ext_key_end().
This is wrong and leads to garbage remaining in the set after removal
next lookup/insert attempt will give:

BUG: KASAN: slab-use-after-free in pipapo_get+0x8eb/0xb90
Read of size 1 at addr ffff888100d50586 by task nft-pipapo_uaf_/1399
Call Trace:
 kasan_report+0x105/0x140
 pipapo_get+0x8eb/0xb90
 nft_pipapo_insert+0x1dc/0x1710
 nf_tables_newsetelem+0x31f5/0x4e00
 ..

Bug: 293587745
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: lonial con <kongln9170@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 90c3955beb)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I51a423aaa2c31c4df89776505b602aa2c1523b82
2023-08-15 11:50:35 +01:00
Yifan Hong
37f4509407 ANDROID: Add checkpatch target.
Running the following will run scripts/checkpatch.pl on a
patch of HEAD

  tools/bazel run //common:checkpatch

or a given Git SHA1:

  tools/bazel run //common:checkpatch -- --git_sha1 ...

For additional flags, see

  tools/bazel run //common:checkpatch -- --help

For details, see
  build/kernel/kleaf/docs/checkpatch.md
in your source tree.

Test: TH
Bug: 259995152
Change-Id: Iaad8fd69508cf9be11340166aafbb84930d4805c
Signed-off-by: Yifan Hong <elsk@google.com>
(cherry picked from commit 7dbf26568fcccde88470e7a25c07f0c7229e85f1)
2023-08-11 17:53:56 +00:00
Alan Stern
d7dacaa439 UPSTREAM: USB: Gadget: core: Help prevent panic during UVC unconfigure
Avichal Rakesh reported a kernel panic that occurred when the UVC
gadget driver was removed from a gadget's configuration.  The panic
involves a somewhat complicated interaction between the kernel driver
and a userspace component (as described in the Link tag below), but
the analysis did make one thing clear: The Gadget core should
accomodate gadget drivers calling usb_gadget_deactivate() as part of
their unbind procedure.

Currently this doesn't work.  gadget_unbind_driver() calls
driver->unbind() while holding the udc->connect_lock mutex, and
usb_gadget_deactivate() attempts to acquire that mutex, which will
result in a deadlock.

The simple fix is for gadget_unbind_driver() to release the mutex when
invoking the ->unbind() callback.  There is no particular reason for
it to be holding the mutex at that time, and the mutex isn't held
while the ->bind() callback is invoked.  So we'll drop the mutex
before performing the unbind callback and reacquire it afterward.

We'll also add a couple of comments to usb_gadget_activate() and
usb_gadget_deactivate().  Because they run in process context they
must not be called from a gadget driver's ->disconnect() callback,
which (according to the kerneldoc for struct usb_gadget_driver in
include/linux/usb/gadget.h) may run in interrupt context.  This may
help prevent similar bugs from arising in the future.

Reported-and-tested-by: Avichal Rakesh <arakesh@google.com>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Fixes: 286d9975a8 ("usb: gadget: udc: core: Prevent soft_connect_store() race")
Link: https://lore.kernel.org/linux-usb/4d7aa3f4-22d9-9f5a-3d70-1bd7148ff4ba@google.com/
Cc: Badhri Jagan Sridharan <badhri@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/48b2f1f1-0639-46bf-bbfc-98cb05a24914@rowland.harvard.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Bug: 291976100
Change-Id: Icff01d8e88f041af4bda8726242de9cd518a247a
(cherry picked from commit 65dadb2bee)
Signed-off-by: Avichal Rakesh <arakesh@google.com>
2023-08-11 17:30:16 +00:00
Zhanyuan Hu
4dc009c3a8 ANDROID: GKI: Update symbols to symbol list
Update symbols to symbol list externed by oppo memory group.

ABI DIFFERENCES HAVE BEEN DETECTED!
1 variable symbol(s) added
  'unsigned long zero_pfn'

Bug: 292051411

Change-Id: I913c01c7671729bf33b78a218c61cfb94628fb0e
Signed-off-by: huzhanyuan <huzhanyuan@oppo.com>
2023-08-11 17:12:52 +00:00
xieliujie
fadc35923d ANDROID: vendor_hook: fix the error record position of mutex
Make sure vendorhook trace_android_vh_record_mutex_lock_starttime woking both in fastpath unlock and slowpath unlock.

Fixes: 57750518de ("ANDROID: vendor_hook: Avoid clearing protect-flag before waking waiters")
Bug: 286024926
Change-Id: Ib91c1b88d27aaa4ef872d44102969ffc3c9adb58
Signed-off-by: xieliujie <xieliujie@oppo.com>
2023-08-11 16:57:28 +00:00
Woogeun Lee
3fc69d3f70 ANDROID: ABI: add allowed list for galaxy
19 function symbol(s) added
  'int __fsnotify_parent(struct dentry*, __u32, const void*, int)'
  'int __traceiter_android_vh_wq_lockup_pool(void*, int, unsigned long)'
  'int cleancache_register_ops(const struct cleancache_ops*)'
  'int fsnotify(__u32, const void*, int, struct inode*, const struct qstr*, struct inode*, u32)'
  'void kernel_neon_begin()'
  'void kernel_neon_end()'
  'int kstrtos16(const char*, unsigned int, s16*)'
  'int regulator_get_current_limit(struct regulator*)'
  'int smpboot_register_percpu_thread(struct smp_hotplug_thread*)'
  'void smpboot_unregister_percpu_thread(struct smp_hotplug_thread*)'
  'int snd_soc_add_card_controls(struct snd_soc_card*, const struct snd_kcontrol_new*, int)'
  'unsigned int stack_trace_save_regs(struct pt_regs*, unsigned long*, unsigned int, unsigned int)'
  'int tcp_register_congestion_control(struct tcp_congestion_ops*)'
  'void tcp_reno_cong_avoid(struct sock*, u32, u32)'
  'u32 tcp_reno_ssthresh(struct sock*)'
  'u32 tcp_reno_undo_cwnd(struct sock*)'
  'u32 tcp_slow_start(struct tcp_sock*, u32)'
  'void tcp_unregister_congestion_control(struct tcp_congestion_ops*)'
  'int usb_set_configuration(struct usb_device*, int)'

1 variable symbol(s) added
  'struct tracepoint __tracepoint_android_vh_wq_lockup_pool'

Bug: 294125592
Change-Id: I6c2f2fb274dbe45263e39e43b4b8bc3766ef2bab
Signed-off-by: Woogeun Lee <woogeun.lee@samsung.com>
2023-08-10 19:30:21 +00:00
Jaewon Kim
a5a662187f ANDROID: gfp: add __GFP_CMA in gfpflag_names
The __GFP_CMA was added but not added to the gfpflag_names. Let me add
it to show on %pGg printk.

Bug: 295271520
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Change-Id: I155fdcc0e2c18db390b5166ba8d2b93c793caae6
2023-08-10 18:41:10 +00:00
Ramji Jiyani
b520b90913 ANDROID: ABI: Update to fix slab-out-of-bounds in xhci_vendor_get_ops
type 'struct xhci_hcd' changed
  member 'union { struct xhci_vendor_ops* vendor_ops; struct { u64 android_kabi_reserved1; }; union { }; }' was added
  member 'u64 android_kabi_reserved1' was removed

Bug: 293869685
Test: TH
Change-Id: I1fa551fc1b9263302d38f4e2989eed9f5f0d816a
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
2023-08-10 18:29:38 +00:00
Howard Yen
c2cbb3cc24 ANDROID: usb: host: fix slab-out-of-bounds in xhci_vendor_get_ops
slab-out-of-bounds happens if the xhci platform drivers don't define
the extra_priv_size in their xhci_driver_overrides structure. Move
xhci_vendor_ops structure to xhci main structure to avoid
extra_priv_size affacts xhci_vendor_get_ops which causes the
slab-out-of-bounds error.

Fixes: 90ab8e7f98 ("ANDROID: usb: host: add xhci hooks for USB offload")
Bug: 293869685
Bug: 194461020
Test: build and boot pass
Change-Id: Id17fdfbfd3e8edcc89a05c9c2f553ffab494215e
Signed-off-by: Howard Yen <howardyen@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
(cherry picked from commit 34f6c9c308)
(cherry picked from commit 00666b8e3e)
2023-08-10 18:29:38 +00:00
André Draszik
64787ee451 ANDROID: GKI: update pixel symbol list for xhci
Pixel is using these symbols in its USB driver implementation.

3 function symbol(s) added
  'int xhci_address_device(struct usb_hcd*, struct usb_device*)'
  'int xhci_bus_resume(struct usb_hcd*)'
  'int xhci_bus_suspend(struct usb_hcd*)'

Bug: 277396090
Bug: 287008367
Change-Id: Id89097ab094e0582560383793c91278c88cb078f
Signed-off-by: André Draszik <draszik@google.com>
2023-08-10 14:31:27 +01:00
Andrew Yang
b0c06048a8 FROMGIT: fs: drop_caches: draining pages before dropping caches
We expect a file page access after dropping caches should be a major
fault, but sometimes it's still a minor fault.  That's because a file page
can't be dropped if it's in a per-cpu pagevec.  Draining all pages from
per-cpu pagevec to lru list before trying to drop caches.

Link: https://lkml.kernel.org/r/20230630092203.16080-1-andrew.yang@mediatek.com
Change-Id: I9b03c53e39b87134d5ddd0c40ac9b36cf4d190cd
Signed-off-by: Andrew Yang <andrew.yang@mediatek.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 285794522
(cherry picked from commit a481c6fdf3e4fdf31bda91098dfbf46098037e76
 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
2023-08-10 10:47:24 +00:00
Author Name
2f76bb83b1 ANDROID: GKI: update symbol list file for xiaomi
INFO: ABI DIFFERENCES HAVE BEEN DETECTED!
INFO: 8 function symbol(s) added
  'int sock_wake_async(struct socket_wq *wq, int how, int band)'
  'void bpf_map_put(struct bpf_map *map)'
  'void bpf_map_inc(struct bpf_map *map)'
  'int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)'
  'void napi_busy_loop(unsigned int napi_id,bool (*loop_end)(void *,
                       unsigned long),void *loop_end_arg, bool
                       prefer_busy_poll, u16 budget)'
  'bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)'
  'void page_pool_put_page_bulk(struct page_pool *pool, void **data,
                                int count)'
  'struct sk_buff *build_skb_around(struct sk_buff *skb,void *data,
                                    unsigned int frag_size)'
INFO: 2 variable symbol(s) added
  'DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info),
  'DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg)'

Bug: 294257769

Change-Id: I98da395227810eecb1fd978dedd20fba445757d0
Signed-off-by: dongziqi <dongziqi1@xiaomi.corp-partner.google.com>
2023-08-09 22:53:11 +00:00
Elliot Berman
8e86825eec ANDROID: uid_sys_stats: Use a single work for deferred updates
uid_sys_stats tries to acquire a lock when any task exits to do some
bookkeeping in common data structure. If the lock is contended, it
allocates and schedules a work to do the work later to avoid task exit
latency.

In a stress test which creates many tasks exiting, the workqueue can be
overwhelmed by the number of works being scheduled and allocates more
worker threads to handle queue. The growth of the number of threads is
effectively unbounded and can exhaust the process table. This causes
denial of service to userspace trying to fork().

Instead of allocating a new work each, create a linked list of the
update stats deferred work and have a single work to drain the linked
list. The linked list is implemented using an atomic_long_t.

Bug: 294468796
Fixes: 5586278c0f ("ANDROID: uid_sys_stats: defer process_notifier work if uid_lock is contended")
Change-Id: I15f20f4f69ea66a452bdf815c4ef3a0da3edfd36
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
2023-08-09 20:50:55 +00:00
Junki Min
960d9828ee ANDROID: ABI: Update symbol for Exynos SoC
Update symbols for Exynos WLBT driver.

1 function symbol(s) added
  'unsigned long __find_nth_bit(const unsigned long*, unsigned long, unsigned long)'

Bug: 294470344
Change-Id: I9f8d9d20f643b34bbc475dde468dbaa11f56e667
Signed-off-by: Junki Min <joonki.min@samsung.com>
2023-08-08 18:02:10 +00:00
Jiewen Wang
3926cc6ef8 ANDROID: GKI: Add symbols to symbol list for vivo
INFO: 1 function symbol(s) added
  'int __traceiter_android_vh_tune_scan_type(void*, enum scan_balance*)'

1 variable symbol(s) added
  'struct tracepoint __tracepoint_android_vh_tune_scan_type'

Bug: 294180281

Change-Id: I171099cdbe68c04885e286554f56290356d543d2
Signed-off-by: Jiewen Wang <jiewen.wang@vivo.com>
2023-08-07 18:11:35 +00:00
Jiewen Wang
dbb09068c1 ANDROID: vendor_hooks: Add tune scan type hook in get_scan_count()
Add hook in get_scan_count() for oem to wield customized reclamation strategy

Bug: 294180281
Change-Id: Ic54d35128e458661fc2b641809f5371b1d9a488e
Signed-off-by: Jiewen Wang <jiewen.wang@vivo.com>
2023-08-07 18:11:35 +00:00
Kalesh Singh
5e1d25ac2a FROMGIT: BACKPORT: Multi-gen LRU: Fix can_swap in lru_gen_look_around()
walk->can_swap might be invalid since it's not guaranteed to be
initialized for the particular lruvec.  Instead deduce it from the folio
type (anon/file).

Link: https://lkml.kernel.org/r/20230802025606.346758-3-kaleshsingh@google.com
Fixes: 018ee47f14 ("mm: multi-gen LRU: exploit locality in rmap")
Change-Id: I1ae78011d4972d87bac9f2db8c56352cdb7a9be6
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit fdf19e8c8f1cdcee4eccf4c98a875f44f39d8b9d https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 288383787
Bug: 291719697
[ Kalesh Singh - Fix trivial conflict in lru_gen_look_around() ]
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2023-08-03 20:45:58 +00:00
Kalesh Singh
addf1a9a65 FROMGIT: Multi-gen LRU: Avoid race in inc_min_seq()
inc_max_seq() will try to inc_min_seq() if nr_gens == MAX_NR_GENS. This
is because the generations are reused (the last oldest now empty
generation will become the next youngest generation).

inc_min_seq() is retried until successful, dropping the lru_lock
and yielding the CPU on each failure, and retaking the lock before
trying again:

        while (!inc_min_seq(lruvec, type, can_swap)) {
                spin_unlock_irq(&lruvec->lru_lock);
                cond_resched();
                spin_lock_irq(&lruvec->lru_lock);
        }

However, the initial condition that required incrementing the min_seq
(nr_gens == MAX_NR_GENS) is not retested. This can change by another
call to inc_max_seq() from run_aging() with force_scan=true from the
debugfs interface.

Since the eviction stalls when the nr_gens == MIN_NR_GENS, avoid
unnecessarily incrementing the min_seq by rechecking the number of
generations before each attempt.

This issue was uncovered in previous discussion on the list by Yu Zhao
and Aneesh Kumar [1].

[1] https://lore.kernel.org/linux-mm/CAOUHufbO7CaVm=xjEb1avDhHVvnC8pJmGyKcFf2iY_dpf+zR3w@mail.gmail.com/

Link: https://lkml.kernel.org/r/20230802025606.346758-2-kaleshsingh@google.com
Fixes: d6c3af7d8a ("mm: multi-gen LRU: debugfs interface")
Change-Id: I89e84ef2927eb1b0091f1be28bd03eb04dee4c57
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 250dbd10306126b06415afda8adfc27b2b780428 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 288383787
Bug: 291719697
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2023-08-03 20:45:58 +00:00
Kalesh Singh
a7adb98897 FROMGIT: Multi-gen LRU: Fix per-zone reclaim
MGLRU has a LRU list for each zone for each type (anon/file) in each
generation:

	long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];

The min_seq (oldest generation) can progress independently for each
type but the max_seq (youngest generation) is shared for both anon and
file. This is to maintain a common frame of reference.

In order for eviction to advance the min_seq of a type, all the per-zone
lists in the oldest generation of that type must be empty.

The eviction logic only considers pages from eligible zones for
eviction or promotion.

    scan_folios() {
	...
	for (zone = sc->reclaim_idx; zone >= 0; zone--)  {
	    ...
	    sort_folio(); 	// Promote
	    ...
	    isolate_folio(); 	// Evict
	}
	...
    }

Consider the system has the movable zone configured and default 4
generations. The current state of the system is as shown below
(only illustrating one type for simplicity):

Type: ANON

	Zone    DMA32     Normal    Movable    Device

	Gen 0       0          0        4GB         0

	Gen 1       0        1GB        1MB         0

	Gen 2     1MB        4GB        1MB         0

	Gen 3     1MB        1MB        1MB         0

Now consider there is a GFP_KERNEL allocation request (eligible zone
index <= Normal), evict_folios() will return without doing any work
since there are no pages to scan in the eligible zones of the oldest
generation. Reclaim won't make progress until triggered from a ZONE_MOVABLE
allocation request; which may not happen soon if there is a lot of free
memory in the movable zone. This can lead to OOM kills, although there
is 1GB pages in the Normal zone of Gen 1 that we have not yet tried to
reclaim.

This issue is not seen in the conventional active/inactive LRU since
there are no per-zone lists.

If there are no (not enough) folios to scan in the eligible zones, move
folios from ineligible zone (zone_index > reclaim_index) to the next
generation. This allows for the progression of min_seq and reclaiming
from the next generation (Gen 1).

Qualcomm, Mediatek and raspberrypi [1] discovered this issue independently.

[1] https://github.com/raspberrypi/linux/issues/5395

Link: https://lkml.kernel.org/r/20230802025606.346758-1-kaleshsingh@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Change-Id: I5bbf44bd7ffe42f4347df4be59a75c1603c9b947
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 1462260adc41c5974362cb54ff577c2a15b8c7b2 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 288383787
Bug: 291719697
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
2023-08-03 20:45:58 +00:00
Jaewon Kim
03812b904e ANDROID: ABI: update symbol list for galaxy
INFO: 1 function symbol(s) added
  'int cleancache_register_ops(const struct cleancache_ops*)

Bug: 294177078
Change-Id: Ic22ddae4e92896ed28bc876d98969c6c3e94cb9d
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
2023-08-03 16:48:30 +00:00
huzhanyuan
b283f9b41f ANDROID: oplus: Update the ABI xml and symbol list
INFO: ABI DIFFERENCES HAVE BEEN DETECTED!
INFO: 4 function symbol(s) added
'int __traceiter_android_vh_check_folio_look_around_ref(void*, struct folio*,int*)'
'int __traceiter_android_vh_look_around(void*, struct page_vma_mapped_walk*,struct folio*, struct vm_area_struct*, int*)'
'int __traceiter_android_vh_look_around_migrate_folio(void*, struct folio*, struct folio*)'
'int __traceiter_android_vh_test_clear_look_around_ref(void*, struct page*)'

4 variable symbol(s) added
'struct tracepoint __tracepoint_android_vh_check_folio_look_around_ref'
'struct tracepoint __tracepoint_android_vh_look_around'
'struct tracepoint __tracepoint_android_vh_look_around_migrate_folio'
'struct tracepoint __tracepoint_android_vh_test_clear_look_around_ref'

Bug: 292051411
Change-Id: I25fff4eefc6773d3e1130bd0ff3f3cc21d6c0964
signed-off-by: Zhanyuan Hu <huzhanyuan@oppo.com>
2023-08-02 21:57:15 +00:00