After ptep_clear_flush(), if we find that src_folio is pinned we will fail
UFFDIO_MOVE and put src_folio back to src_pte entry, but the change to
src_folio->{mapping,index} is not restored in this process. This is not
what we expected, so fix it.
This can cause the rmap for that page to be invalid, possibly resulting
in memory corruption. At least swapout+migration would no longer work,
because we might fail to locate the mappings of that folio.
Link: https://lkml.kernel.org/r/20240222080815.46291-1-zhengqi.arch@bytedance.com
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d7a08838ab74652f2b53fee9763f0178278c3a4b)
Conflicts:
mm/userfaultfd.c
1. Replace folio_move_anon_rmap() with page_move_anon_rmap().
Bug: 274911254
Change-Id: Ie4bf5785244271ab233c6230ed71460fd571bd1a
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Implement the uABI of UFFDIO_MOVE ioctl.
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [2].
We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [2]. Furthermore,
UFFDIO_MOVE enables moving swapped-out pages without touching them within
the same vma. Today, it can only be done by mremap, however it forces
splitting the vma.
[1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
[2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/
Update for the ioctl_userfaultfd(2) manpage:
UFFDIO_MOVE
(Since Linux xxx) Move a continuous memory chunk into the
userfault registered range and optionally wake up the blocked
thread. The source and destination addresses and the number of
bytes to move are specified by the src, dst, and len fields of
the uffdio_move structure pointed to by argp:
struct uffdio_move {
__u64 dst; /* Destination of move */
__u64 src; /* Source of move */
__u64 len; /* Number of bytes to move */
__u64 mode; /* Flags controlling behavior of move */
__s64 move; /* Number of bytes moved, or negated error */
};
The following value may be bitwise ORed in mode to change the
behavior of the UFFDIO_MOVE operation:
UFFDIO_MOVE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault
resolution
UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
Allow holes in the source virtual range that is being moved.
When not specified, the holes will result in ENOENT error.
When specified, the holes will be accounted as successfully
moved memory. This is mostly useful to move hugepage aligned
virtual regions without knowing if there are transparent
hugepages in the regions or not, but preventing the risk of
having to split the hugepage during the operation.
The move field is used by the kernel to return the number of
bytes that was actually moved, or an error (a negated errno-
style value). If the value returned in move doesn't match the
value that was specified in len, the operation fails with the
error EAGAIN. The move field is output-only; it is not read by
the UFFDIO_MOVE operation.
The operation may fail for various reasons. Usually, remapping of
pages that are not exclusive to the given process fail; once KSM
might deduplicate pages or fork() COW-shares pages during fork()
with child processes, they are no longer exclusive. Further, the
kernel might only perform lightweight checks for detecting whether
the pages are exclusive, and return -EBUSY in case that check fails.
To make the operation more likely to succeed, KSM should be
disabled, fork() should be avoided or MADV_DONTFORK should be
configured for the source VMA before fork().
This ioctl(2) operation returns 0 on success. In this case, the
entire area was moved. On error, -1 is returned and errno is
set to indicate the error. Possible errors include:
EAGAIN The number of bytes moved (i.e., the value returned in
the move field) does not equal the value that was
specified in the len field.
EINVAL Either dst or len was not a multiple of the system page
size, or the range specified by src and len or dst and len
was invalid.
EINVAL An invalid bit was specified in the mode field.
ENOENT
The source virtual memory range has unmapped holes and
UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
EEXIST
The destination virtual memory range is fully or partially
mapped.
EBUSY
The pages in the source virtual memory range are either
pinned or not exclusive to the process. The kernel might
only perform lightweight checks for detecting whether the
pages are exclusive. To make the operation more likely to
succeed, KSM should be disabled, fork() should be avoided
or MADV_DONTFORK should be configured for the source virtual
memory area before fork().
ENOMEM Allocating memory needed for the operation failed.
ESRCH
The target process has exited at the time of a UFFDIO_MOVE
operation.
Link: https://lkml.kernel.org/r/20231206103702.3873743-3-surenb@google.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit adef440691bab824e39c1b17382322d195e1fab0)
Conflicts:
mm/huge_memory.c
mm/userfaultfd.c
1. Add vma parameter in mmu_notifier_range_init() calls.
2. Replace folio_move_anon_rmap() with page_move_anon_rmap().
3. Remove vma parameter in pmd_mkwrite() calls.
4. Replace pte_offset_map_nolock() with pte_offset_map()+pte_lockptr()
combo.
5. Remove VM_SHADOW_STACK in vma_move_compatible().
6. Replace pmdp_get_lockless() with pmd_read_atomic().
Bug: 274911254
Change-Id: I1116f15a395f1a8bac176906f7f9c2411e59dc54
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Patch series "userfaultfd move option", v6.
This patch series introduces UFFDIO_MOVE feature to userfaultfd, which has
long been implemented and maintained by Andrea in his local tree [1], but
was not upstreamed due to lack of use cases where this approach would be
better than allocating a new page and copying the contents. Previous
upstraming attempts could be found at [6] and [7].
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [3]. We see over 40% reduction (on a Google pixel 6 device) in
the compacting thread's completion time by using UFFDIO_MOVE vs.
UFFDIO_COPY. This was measured using a benchmark that emulates a heap
compaction implementation using userfaultfd (to allow concurrent accesses
by application threads). More details of the usecase are explained in
[3].
Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
touching them within the same vma. Today, it can only be done by mremap,
however it forces splitting the vma.
TODOs for follow-up improvements:
- cross-mm support. Known differences from single-mm and missing pieces:
- memcg recharging (might need to isolate pages in the process)
- mm counters
- cross-mm deposit table moves
- cross-mm test
- document the address space where src and dest reside in struct
uffdio_move
- TLB flush batching. Will require extensive changes to PTL locking in
move_pages_pte(). OTOH that might let us reuse parts of mremap code.
This patch (of 5):
For now, folio_move_anon_rmap() was only used to move a folio to a
different anon_vma after fork(), whereby the root anon_vma stayed
unchanged. For that, it was sufficient to hold the folio lock when
calling folio_move_anon_rmap().
However, we want to make use of folio_move_anon_rmap() to move folios
between VMAs that have a different root anon_vma. As folio_referenced()
performs an RMAP walk without holding the folio lock but only holding the
anon_vma in read mode, holding the folio lock is insufficient.
When moving to an anon_vma with a different root anon_vma, we'll have to
hold both, the folio lock and the anon_vma lock in write mode.
Consequently, whenever we succeeded in folio_lock_anon_vma_read() to
read-lock the anon_vma, we have to re-check if the mapping was changed in
the meantime. If that was the case, we have to retry.
Note that folio_move_anon_rmap() must only be called if the anon page is
exclusive to a process, and must not be called on KSM folios.
This is a preparation for UFFDIO_MOVE, which will hold the folio lock, the
anon_vma lock in write mode, and the mmap_lock in read mode.
Link: https://lkml.kernel.org/r/20231206103702.3873743-1-surenb@google.com
Link: https://lkml.kernel.org/r/20231206103702.3873743-2-surenb@google.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: kernel-team@android.com
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 880a99b60d467eefd96322e27b0a8c0b805dfa43)
Bug: 274911254
Change-Id: Iad9619c0273e050af26356f66ae9fc88b56d68bd
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Currently only the uncompressed hibernation snapshot image is encrypted
before being written to the swap partition. Extend the encryption
support for compression enabled scenarios as well.
Bug: 335581841
Change-Id: Ida781b727f56b664a67e2887a4db3d6b355dafdb
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
In case of hibernation with compression enabled, 'n' number of pages
will be compressed to 'x' number of pages before being written to the
disk. Keep a note of these compressed block counts so that bootloader
can directly read 'x' pages and pass it on to the decompressor. An
array will be maintained which will hold the count of these compressed
blocks and later on written to the the disk as part of the hibernation
image save process.
The vendor hook '__tracepoint_android_vh_hibernated_do_mem_alloc' does
the required memory allocations, for example, the array which is
dynamically allocated based on the snapshot image size so as to hold
the compressed block counts etc. This memory is later freed as part of
PM_POST_HIBERNATION notifier call.
The vendor hook '__tracepoint_android_vh_hibernate_save_cmp_len' saves
the compressed block counts to the array which is later written to the
disk.
Bug: 335581841
Change-Id: I574b641e2d9f4cd503c7768a66a7be3142c2686b
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
After applying commit 990d3701d0 ("BACKPORT: PM: hibernate: Move
to crypto APIs for LZO compression"), CRPTO_LZO is selected by
default. Sync the gki_defconfig accordingly. This doesn't add any
functional change.
Bug: 335581841
Change-Id: Iafa7211245f6d66a96f8f0030e2574c7a220d3a4
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Currently the default compression algorithm is selected based on
compile time options. Introduce a module parameter "hibernate.compressor"
to override this behaviour.
Different compression algorithms have different characteristics and
hibernation may benefit when it uses any of these algorithms, especially
when a secondary algorithm(LZ4) offers better decompression speeds over
a default algorithm(LZO), which in turn reduces hibernation image
restore time.
Users can override the default algorithm in two ways:
1) Passing "hibernate.compressor" as kernel command line parameter.
Usage:
LZO: hibernate.compressor=lzo
LZ4: hibernate.compressor=lz4
2) Specifying the algorithm at runtime.
Usage:
LZO: echo lzo > /sys/module/hibernate/parameters/compressor
LZ4: echo lz4 > /sys/module/hibernate/parameters/compressor
Currently LZO and LZ4 are the supported algorithms. LZO is the default
compression algorithm used with hibernation.
Bug: 335581841
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 3fec6e5961b77af6a952b77f5c2ea26f7513b216)
Change-Id: I3c787939c8d37dfeb2c164de69078d303411f938
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Extend the support for LZ4 compression to be used with hibernation.
The main idea is that different compression algorithms
have different characteristics and hibernation may benefit when it uses
any of these algorithms: a default algorithm, having higher
compression rate but is slower(compression/decompression) and a
secondary algorithm, that is faster(compression/decompression) but has
lower compression rate.
LZ4 algorithm has better decompression speeds over LZO. This reduces
the hibernation image restore time.
As per test results:
LZO LZ4
Size before Compression(bytes) 682696704 682393600
Size after Compression(bytes) 146502402 155993547
Decompression Rate 335.02 MB/s 501.05 MB/s
Restore time 4.4s 3.8s
LZO is the default compression algorithm used for hibernation. Enable
CONFIG_HIBERNATION_COMP_LZ4 to set the default compressor as LZ4.
Bug: 335581841
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 8bc29736357e7f9a6bd0d16b57b5612197e1924b)
Change-Id: I640d834bb626e9a139e41740d4bef7548d5c6401
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Currently for hibernation, LZO is the only compression algorithm
available and uses the existing LZO library calls. However, there
is no flexibility to switch to other algorithms which provides better
results. The main idea is that different compression algorithms have
different characteristics and hibernation may benefit when it uses
alternate algorithms.
By moving to crypto based APIs, it lays a foundation to use other
compression algorithms for hibernation. There are no functional changes
introduced by this approach.
Bug: 335581841
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit a06c6f5d3cc90b3b070d7b99979d57238db77a86)
Change-Id: I8d15262f9823219d291b84eab28b2ec44474dad4
[quic_nprakash: Resolved minor conflicts in kernel/power/(power.h,swap.c)]
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Renaming lzo* to generic names, except for lzo_xxx() APIs. This is
used in the next patch where we move to crypto based APIs for
compression. There are no functional changes introduced by this
approach.
Bug: 335581841
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 89a807625f9701154167bf6bf136adfa1be4d849)
Change-Id: I8e2032658132965bed4c7b24b7409ae7a1bfa6cd
[quic_nprakash: Resolved minor conflicts in kernel/power/swap.c]
Signed-off-by: Nikhil V <quic_nprakash@quicinc.com>
There are no new symbols to be added to GKI symbol list.
We simply update our symbol list.
Bug: 335537438
Change-Id: Iae0594ff776853df2b38eb3215a5378a03995c40
Signed-off-by: Youngmin Nam <youngmin.nam@samsung.com>
Symbols updated to QCOM abi symbol list for Marvell Phy and Mdio Bus Mux:
ethnl_cable_test_amplitude
ethnl_cable_test_pulse
ethnl_cable_test_step
genphy_check_and_restart_aneg
genphy_read_status_fixed
of_mdio_find_bus
phy_config_aneg
phy_gbit_fibre_features
Bug: 335414016
Change-Id: I1fa732935cb1a33bcde782fdfe38f5b3fcfae4cb
Signed-off-by: Andre Ding <quic_shuangxi@quicinc.com>
When building clock drivers for MT8186, some may want to build in only
some of them to, for example, get CPUFreq up faster, and some may want
to leave out some clock drivers entirely as a machine may not need the
Warp Engine or the camera ISP (hence, their clock drivers).
Split the various clock drivers in their own configuration options,
keeping MT8186 configuration options consistent with other MediaTek
SoCs.
While at it, also allow building the remaining clock drivers as modules
by switching COMMON_CLK_MT8186 to tristate.
Bug: 335112842
(cherry picked from commit 5baf38e06ahttps://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ master)
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Chen-Yu Tsai <wenst@chromium.org>
Link: https://lore.kernel.org/r/20230306140543.1813621-47-angelogioacchino.delregno@collabora.com
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Change-Id: Id9dad0e1d56fcae3bb9302b7db63be31afc91d5d
Adding the following symbols:
- devm_drm_bridge_add
Bug: 333511135
Change-Id: I4f815c00515c1f2660032e584778edf1c2c41da4
Signed-off-by: Ken Huang <kenbshuang@google.com>
This reverts commit I764adf995cae6b485d4d98e410c78128a88647e0.
Bug: 333812722
Bug: 308663717
Bug: 319125789
Change-Id: Ie7f03b9d1ab67a33ccbc4311ba26cd746e1aaa22
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
[jyescas@google.com: Call blk_segments() in blk_mq_submit_bio()
for the case when request is defined or it
is null]
Signed-off-by: Juan Yescas <jyescas@google.com>
The USB Type-C Cable and Connector Specification defines the wire
connections for the USB Type-C to USB 2.0 Standard-A cable assembly
(Release 2.2, Chapter 3.5.2).
The Notes says that Pin A5 (CC) of the USB Type-C plug shall be connected
to Vbus through a resister Rp.
However, there is a large amount of such double Rp connected to Vbus
non-standard cables which produced by UGREEN circulating on the market, and
it can affects the normal operations of the state machine easily,
especially to CC1 and CC2 be pulled up at the same time.
In fact, we can regard those cables as sink to avoid abnormal state.
Message as follow:
[ 58.900212] VBUS on
[ 59.265433] CC1: 0 -> 3, CC2: 0 -> 3 [state TOGGLING, polarity 0, connected]
[ 62.623308] CC1: 3 -> 0, CC2: 3 -> 0 [state TOGGLING, polarity 0, disconnected]
[ 62.625006] VBUS off
[ 62.625012] VBUS VSAFE0V
Bug: 335057705
Change-Id: I415db22b0012ace9535039bc4c8e5ec113482e33
Signed-off-by: Michael Wu <michael@allwinnertech.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/20230920063030.66312-1-michael@allwinnertech.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Frank Wang <frank.wang@rock-chips.com>
(cherry picked from commit dbc1defec1aa7d8d80da3ea9e3ddafbcfca8f822)
Adding the following symbols:
- devm_pm_runtime_enable
Bug: 335356311
Change-Id: Iecd45183cead8807974bb2a065c48aab86e47e89
Signed-off-by: John Scheible <johnscheible@google.com>
This adds `struct dwc3` and `struct kernel_all_info` to the KMI via fake
GKI symbols as we know some partners are using these in their
out-of-tree drivers. This ensures that future changes to these structs
will not break partner builds.
Bug: 332277393
Bug: 236036821
Change-Id: Ifa1ac6b71d58415339a63f16a79c1f713dda789f
Signed-off-by: Will McVicker <willmcvicker@google.com>
commit 0d459e2ffb541841714839e8228b845458ed3b27 upstream.
The commit mutex should not be released during the critical section
between nft_gc_seq_begin() and nft_gc_seq_end(), otherwise, async GC
worker could collect expired objects and get the released commit lock
within the same GC sequence.
nf_tables_module_autoload() temporarily releases the mutex to load
module dependencies, then it goes back to replay the transaction again.
Move it at the end of the abort phase after nft_gc_seq_end() is called.
Bug: 332996726
Cc: stable@vger.kernel.org
Fixes: 720344340f ("netfilter: nf_tables: GC transaction race with abort path")
Reported-by: Kuan-Ting Chen <hexrabbit@devco.re>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8038ee3c3e)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I637389421d8eca5ab59a41bd1a4b70432440034c
commit a45e6889575c2067d3c0212b6bc1022891e65b91 upstream.
Unlike early commit path stage which triggers a call to abort, an
explicit release of the batch is required on abort, otherwise mutex is
released and commit_list remains in place.
Add WARN_ON_ONCE to ensure commit_list is empty from the abort path
before releasing the mutex.
After this patch, commit_list is always assumed to be empty before
grabbing the mutex, therefore
03c1f1ef15 ("netfilter: Cleanup nft_net->module_list from nf_tables_exit_net()")
only needs to release the pending modules for registration.
Bug: 332996726
Cc: stable@vger.kernel.org
Fixes: c0391b6ab8 ("netfilter: nf_tables: missing validation from the abort path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b0b36dcbe0)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I38f9b05ac4eadd1d2b7b306cccaf0aeacb61b57a
commit 552705a3650bbf46a22b1adedc1b04181490fc36 upstream.
While the rhashtable set gc runs asynchronously, a race allows it to
collect elements from anonymous sets with timeouts while it is being
released from the commit path.
Mingi Cho originally reported this issue in a different path in 6.1.x
with a pipapo set with low timeouts which is not possible upstream since
7395dfacfff6 ("netfilter: nf_tables: use timestamp to check for set
element timeout").
Fix this by setting on the dead flag for anonymous sets to skip async gc
in this case.
According to 08e4c8c5919f ("netfilter: nf_tables: mark newset as dead on
transaction abort"), Florian plans to accelerate abort path by releasing
objects via workqueue, therefore, this sets on the dead flag for abort
path too.
Bug: 329205787
Cc: stable@vger.kernel.org
Fixes: 5f68718b34 ("netfilter: nf_tables: GC transaction API to avoid race with control plane")
Reported-by: Mingi Cho <mgcho.minic@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 406b0241d0)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I6170493c267e020c50a739150f8c421deb635b35
[ Upstream commit b0e256f3dd2ba6532f37c5c22e07cb07a36031ee ]
Clone already always provides a current view of the lookup table, use it
to destroy the set, otherwise it is possible to destroy elements twice.
This fix requires:
212ed75dc5 ("netfilter: nf_tables: integrate pipapo into commit protocol")
which came after:
9827a0e6e2 ("netfilter: nft_set_pipapo: release elements in clone from abort path").
Bug: 330876672
Fixes: 9827a0e6e2 ("netfilter: nft_set_pipapo: release elements in clone from abort path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit ff90050771)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I8c0811e69f82681c7fcfdca1111f1702e27bb80e
In commit 35a06697da ("FROMLIST: sched: Avoid placing RT threads on cores handling long softirqs"), the patch drop TASKLET_SOFTIRQ from LONG_SOFTIRQS.
Some drivers use tasklet softirq for data reception from firmware. When
there's a high volume or continuous stream of data from the firmware,
it can lead to unexpectedly long execution times for RT tasks and further cause jank or audio glitch.
Bug: 333895950
Test: check tasklet are defered when RT task is running
Change-Id: I53d6b43fe5a8860758898f09810a52fe319344f9
Signed-off-by: lukechang <lukechang@google.com>
To report vendor-specific memory statistics, add restricted
vendor hook since normal vendor hook work with only atomic
context.
Bug: 333482947
Change-Id: I5c32961b30f082a8a4aa78906d2fce1cdf4b0d2b
Signed-off-by: Minchan Kim <minchan@google.com>
Offlining a CPU and bringing it back online is a common operation and it
happens frequently during system suspend/resume, where the non-boot CPUs
are hotplugged out during suspend and brought back at resume.
The cpufreq core already tries to make this path as fast as possible as
the changes are only temporary in nature and full cleanup of resources
isn't required in this case. For example the drivers can implement
online()/offline() callbacks to avoid a lot of tear down of resources.
On similar lines, there is no need to unregister the cpufreq cooling
device during suspend / resume, but only while the policy is getting
removed.
Moreover, unregistering the cpufreq cooling device is resulting in an
unwanted outcome, where the system suspend is eventually aborted in the
process. Currently, during system suspend the cpufreq core unregisters
the cooling device, which in turn removes a kobject using device_del()
and that generates a notification to the userspace via uevent broadcast.
This causes system suspend to abort in some setups.
This was also earlier reported (indirectly) by Roman [1]. Maybe there is
another way around to fixing that problem properly, but this change
makes sense anyways.
Move the registering and unregistering of the cooling device to policy
creation and removal times onlyy.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218521
Reported-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
Reported-by: Roman Stratiienko <r.stratiienko@gmail.com>
Link: https://patchwork.kernel.org/project/linux-pm/patch/20220710164026.541466-1-r.stratiienko@gmail.com/ [1]
Tested-by: Manaf Meethalavalappu Pallikunhi <quic_manafm@quicinc.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit c4d61a529db788d2e52654f5b02c8d1de4952c5b)
Change-Id: I0b9435be1c2b41ea52a26cb578d850fd8121d57a
[jstultz: Tweaked to work around collision with existing android_vh
calls]
Signed-off-by: John Stultz <jstultz@google.com>
Bug: 333300519
Off-by-one errors happen because nr_snk_pdo and nr_src_pdo are
incorrectly added one. The index of the loop is equal to the number of
PDOs to be updated when leaving the loop and it doesn't need to be added
one.
When doing the power negotiation, TCPM relies on the "nr_snk_pdo" as
the size of the local sink PDO array to match the Source capabilities
of the partner port. If the off-by-one overflow occurs, a wrong RDO
might be sent and unexpected power transfer might happen such as over
voltage or over current (than expected).
"nr_src_pdo" is used to set the Rp level when the port is in Source
role. It is also the array size of the local Source capabilities when
filling up the buffer which will be sent as the Source PDOs (such as
in Power Negotiation). If the off-by-one overflow occurs, a wrong Rp
level might be set and wrong Source PDOs will be sent to the partner
port. This could potentially cause over current or port resets.
Fixes: cd099cde4ed2 ("usb: typec: tcpm: Support multiple capabilities")
Cc: stable@vger.kernel.org
Signed-off-by: Kyle Tso <kyletso@google.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/20240404133517.2707955-1-kyletso@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 206108037
(cherry picked from commit c4128304c2169b4664ed6fb6200f228cead2ab70
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git usb-linus)
Change-Id: Icf86f562c7bbaefe7e27885e107b373aa4b64fd0
Signed-off-by: Kyle Tso <kyletso@google.com>
commit e01e3934a1b2d122919f73bc6ddbe1cdafc4bbdb upstream.
Similarly to previous commit, the submitting thread (recvmsg/sendmsg)
may exit as soon as the async crypto handler calls complete().
Reorder scheduling the work before calling complete().
This seems more logical in the first place, as it's
the inverse order of what the submitting thread will do.
Bug: 326214245
Reported-by: valis <sec@valis.email>
Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
[Lee: Fixed merge-conflict in Stable branches linux-6.1.y and older]
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 196f198ca6)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I3128347d1e45018db30b6f2336ece2a4a3a630db
(cherry picked from commit e78d26a9ec366b108c89099b148ae3cea6f1a8e9)
commit 01acb2e8666a6529697141a6017edbf206921913 upstream.
Remove netdevice from inet/ingress basechain in case NETDEV_UNREGISTER
event is reported, otherwise a stale reference to netdevice remains in
the hook list.
Bug: 332803585
Fixes: 60a3815da7 ("netfilter: add inet ingress support")
Cc: stable@vger.kernel.org
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 70f17b48c8)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I28482dca416b61dcf2e722ba0aef62d2d41a8f23
[ Upstream commit aec7961916f3f9e88766e2688992da6980f11b8d ]
The submitting thread (one which called recvmsg/sendmsg)
may exit as soon as the async crypto handler calls complete()
so any code past that point risks touching already freed data.
Try to avoid the locking and extra flags altogether.
Have the main thread hold an extra reference, this way
we can depend solely on the atomic ref counter for
synchronization.
Don't futz with reiniting the completion, either, we are now
tightly controlling when completion fires.
Bug: 326214245
Reported-by: valis <sec@valis.email>
Fixes: 0cada33241 ("net/tls: fix race condition causing kernel panic")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 7a3ca06d04)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: Idda32dd68ed26ae5c85c985305f52c3b2245e32c
[ Upstream commit c57ca512f3b68ddcd62bda9cc24a8f5584ab01b1 ]
Factor out waiting for async encrypt and decrypt to finish.
There are already multiple copies and a subsequent fix will
need more. No functional changes.
Note that crypto_wait_req() returns wait->err
Bug: 326214245
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: aec7961916f3 ("tls: fix race between async notify and socket close")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 2c6841c882)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I7355c32d284623e08650c4d2b2a7d3be40f0cc0c
The current implementation of the mark_victim tracepoint provides only the
process ID (pid) of the victim process. This limitation poses challenges
for userspace tools requiring real-time OOM analysis and intervention.
Although this information is available from the kernel logs, it’s not
the appropriate format to provide OOM notifications. In Android, BPF
programs are used with the mark_victim trace events to notify userspace of
an OOM kill. For consistency, update the trace event to include the same
information about the OOMed victim as the kernel logs.
- UID
In Android each installed application has a unique UID. Including
the `uid` assists in correlating OOM events with specific apps.
- Process Name (comm)
Enables identification of the affected process.
- OOM Score
Will allow userspace to get additional insight of the relative kill
priority of the OOM victim. In Android, the oom_score_adj is used to
categorize app state (foreground, background, etc.), which aids in
analyzing user-perceptible impacts of OOM events [1].
- Total VM, RSS Stats, and pgtables
Amount of memory used by the victim that will, potentially, be freed up
by killing it.
[1] 246dc8fc95:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283
Signed-off-by: Carlos Galo <carlosgalo@google.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 331214192
(cherry picked from commit 72ba14deb40a9e9668ec5e66a341ed657e5215c2)
[ carlosgalo: Manually added struct cred change in mark_oom_victim function ]
Link: https://lore.kernel.org/all/20240223173258.174828-1-carlosgalo@google.com/
Change-Id: I24f503ceca04b83f8abf42fcd04a3409e17be6b5
This reverts commit 6b4c816d17.
Reason for revert: b/331214192
Change-Id: I9f4f56de7d65cee19c7015b0cb1bda339d82a5f5
Signed-off-by: Carlos Galo <carlosgalo@google.com>
Export two functions to help memory reclaim.
Bug: 323406883
Change-Id: I099d414c9b3648224ab077b9929c6622b2d4228a
Signed-off-by: Minchan Kim <minchan@google.com>
This patch adds two exported functions to set/get reclaim parameters.
Bug: 323406883
Change-Id: I8c29073dba3e77cb5db7f45b640518deae04b8a9
Signed-off-by: Minchan Kim <minchan@google.com>
commit 16603605b667b70da974bea8216c93e7db043bf1 upstream.
Anonymous sets are never used with timeout from userspace, reject this.
Exception to this rule is NFT_SET_EVAL to ensure legacy meters still work.
Bug: 329055463
Cc: stable@vger.kernel.org
Fixes: 761da2935d ("netfilter: nf_tables: add set timeout API support")
Reported-by: lonial con <kongln9170@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 72c1efe3f2)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I8c1c818e3d155d5edefee0b741568104081efb38