commit 80f130bfad1dab93b95683fc39b87235682b8f72 upstream.
The documentation in rculist.h explains the absence of list_empty_rcu()
and cautions programmers against relying on a list_empty() ->
list_first() sequence in RCU safe code. This is because each of these
functions performs its own READ_ONCE() of the list head. This can lead
to a situation where the list_empty() sees a valid list entry, but the
subsequent list_first() sees a different view of list head state after a
modification.
In the case of dm-thin, this author had a production box crash from a GP
fault in the process_deferred_bios path. This function saw a valid list
head in get_first_thin() but when it subsequently dereferenced that and
turned it into a thin_c, it got the inside of the struct pool, since the
list was now empty and referring to itself. The kernel on which this
occurred printed both a warning about a refcount_t being saturated, and
a UBSAN error for an out-of-bounds cpuid access in the queued spinlock,
prior to the fault itself. When the resulting kdump was examined, it
was possible to see another thread patiently waiting in thin_dtr's
synchronize_rcu.
The thin_dtr call managed to pull the thin_c out of the active thins
list (and have it be the last entry in the active_thins list) at just
the wrong moment which lead to this crash.
Fortunately, the fix here is straight forward. Switch get_first_thin()
function to use list_first_or_null_rcu() which performs just a single
READ_ONCE() and returns NULL if the list is already empty.
This was run against the devicemapper test suite's thin-provisioning
suites for delete and suspend and no regressions were observed.
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Fixes: b10ebd34cc ("dm thin: fix rcu_read_lock being held in code that can sleep")
Cc: stable@vger.kernel.org
Acked-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 7e25044b804581b9c029d5a28d8800aebde18043 ]
The 'np' device_node is initialized via of_cpu_device_node_get(), which
requires explicit calls to of_node_put() when it is no longer required
to avoid leaking the resource.
Instead of adding the missing calls to of_node_put() in all execution
paths, use the cleanup attribute for 'np' by means of the __free()
macro, which automatically calls of_node_put() when the variable goes
out of scope. Given that 'np' is only used within the
for_each_possible_cpu(), reduce its scope to release the nood after
every iteration of the loop.
Fixes: 6abf32f1d9 ("cpuidle: Add RISC-V SBI CPU idle driver")
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Link: https://lore.kernel.org/r/20241116-cpuidle-riscv-sbi-cleanup-v3-1-a3a46372ce08@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2ac538e40278a2c0c051cca81bcaafc547d61372 ]
When `ksmbd_vfs_kern_path_locked` met an error and it is not the last
entry, it will exit without restoring changed path buffer. But later this
buffer may be used as the filename for creation.
Fixes: c5a709f08d40 ("ksmbd: handle caseless file creation")
Signed-off-by: He Wang <xw897002528@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8fd56ad6e7c90ac2bddb0741c6b248c8c5d56ac8 ]
The kafs filesystem limits the maximum length of a cell to 256 bytes, but a
problem occurs if someone actually does that: kafs tries to create a
directory under /proc/net/afs/ with the name of the cell, but that fails
with a warning:
WARNING: CPU: 0 PID: 9 at fs/proc/generic.c:405
because procfs limits the maximum filename length to 255.
However, the DNS limits the maximum lookup length and, by extension, the
maximum cell name, to 255 less two (length count and trailing NUL).
Fix this by limiting the maximum acceptable cellname length to 253. This
also allows us to be sure we can create the "/afs/.<cell>/" mountpoint too.
Further, split the YFS VL record cell name maximum to be the 256 allowed by
the protocol and ignore the record retrieved by YFSVL.GetCellName if it
exceeds 253.
Fixes: c3e9f88826 ("afs: Implement client support for the YFSVL.GetCellName RPC op")
Reported-by: syzbot+7848fee1f1e5c53f912b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/6776d25d.050a0220.3a8527.0048.GAE@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/376236.1736180460@warthog.procyon.org.uk
Tested-by: syzbot+7848fee1f1e5c53f912b@syzkaller.appspotmail.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4c16e1cadcbcaf3c82d5fc310fbd34d0f5d0db7c ]
In the smb2_send_interim_resp(), if ksmbd_alloc_work_struct()
fails to allocate a node, it returns a NULL pointer to the
in_work pointer. This can lead to an illegal memory write of
in_work->response_buf when allocate_interim_rsp_buf() attempts
to perform a kzalloc() on it.
To address this issue, incorporating a check for the return
value of ksmbd_alloc_work_struct() ensures that the function
returns immediately upon allocation failure, thereby preventing
the aforementioned illegal memory access.
Fixes: 041bba4414 ("ksmbd: fix wrong interim response on compound")
Signed-off-by: Wentao Liang <liangwentao@iscas.ac.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0d68b55887cedc7487036ed34cb4c2097c4228f1 ]
Fix dp mode valid issue to avoid abnormal display of limit state.
After DP passes link training, it can express the lane count of the
current link status is good. Calculate the maximum bandwidth supported
by DP using the current lane count.
The color format will select the best one based on the bandwidth
requirements of the current timing mode. If the current timing mode
uses RGB and meets the DP link bandwidth requirements, RGB will be used.
If the timing mode uses RGB but does not meet the DP link bandwidthi
requirements, it will continue to check whether YUV422 meets
the DP link bandwidth.
FEC overhead is approximately 2.4% from DP 1.4a spec 2.2.1.4.2.
The down-spread amplitude shall either be disabled (0.0%) or up
to 0.5% from 1.4a 3.5.2.6. Add up to approximately 3% total overhead.
Because rate is already divided by 10,
mode->clock does not need to be multiplied by 10.
Fixes: f70ac097a2 ("drm/mediatek: Add MT8195 Embedded DisplayPort driver")
Signed-off-by: Liankun Yang <liankun.yang@mediatek.com>
Link: https://patchwork.kernel.org/project/dri-devel/patch/20241025083036.8829-3-liankun.yang@mediatek.com/
Signed-off-by: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 924d66011f2401a4145e2e814842c5c4572e439f ]
The PHY portion of the mediatek hdmi driver was originally part of
the driver it self and later split out into drivers/phy, which a
'select' to keep the prior behavior.
However, this leads to build failures when the PHY driver cannot
be built:
WARNING: unmet direct dependencies detected for PHY_MTK_HDMI
Depends on [n]: (ARCH_MEDIATEK || COMPILE_TEST [=y]) && COMMON_CLK [=y] && OF [=y] && REGULATOR [=n]
Selected by [m]:
- DRM_MEDIATEK_HDMI [=m] && HAS_IOMEM [=y] && DRM [=m] && DRM_MEDIATEK [=m]
ERROR: modpost: "devm_regulator_register" [drivers/phy/mediatek/phy-mtk-hdmi-drv.ko] undefined!
ERROR: modpost: "rdev_get_drvdata" [drivers/phy/mediatek/phy-mtk-hdmi-drv.ko] undefined!
The best option here is to just not select the phy driver and leave that
up to the defconfig. Do the same for the other PHY and memory drivers
selected here as well for consistency.
Fixes: a481bf2f0c ("drm/mediatek: Separate mtk_hdmi_phy to an independent module")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: CK Hu <ck.hu@mediatek.com>
Link: https://patchwork.kernel.org/project/dri-devel/patch/20241218085837.2670434-1-arnd@kernel.org/
Signed-off-by: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 737d4d91d35b5f7fa5bb442651472277318b0bfd ]
Even though we fixed a logic error in the commit cited below, syzbot
still managed to trigger an underflow of the per-host bulk flow
counters, leading to an out of bounds memory access.
To avoid any such logic errors causing out of bounds memory accesses,
this commit factors out all accesses to the per-host bulk flow counters
to a series of helpers that perform bounds-checking before any
increments and decrements. This also has the benefit of improving
readability by moving the conditional checks for the flow mode into
these helpers, instead of having them spread out throughout the
code (which was the cause of the original logic error).
As part of this change, the flow quantum calculation is consolidated
into a helper function, which means that the dithering applied to the
ost load scaling is now applied both in the DRR rotation and when a
sparse flow's quantum is first initiated. The only user-visible effect
of this is that the maximum packet size that can be sent while a flow
stays sparse will now vary with +/- one byte in some cases. This should
not make a noticeable difference in practice, and thus it's not worth
complicating the code to preserve the old behaviour.
Fixes: 546ea84d07e3 ("sched: sch_cake: fix bulk flow accounting logic for host fairness")
Reported-by: syzbot+f63600d288bfb7057424@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Link: https://patch.msgid.link/20250107120105.70685-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b541ba7d1f5a5b7b3e2e22dc9e40e18a7d6dbc13 ]
Use INT_MAX as maximum size for the conntrack hashtable. Otherwise, it
is possible to hit WARN_ON_ONCE in __kvmalloc_node_noprof() when
resizing hashtable because __GFP_NOWARN is unset. See:
0708a0afe2 ("mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls")
Note: hashtable resize is only possible from init_netns.
Fixes: 9cc1c73ad6 ("netfilter: conntrack: avoid integer overflow when resizing")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 13210fc63f353fe78584048079343413a3cdf819 ]
All these cases cause imbalance between BIND and UNBIND calls:
- Delete an interface from a flowtable with multiple interfaces
- Add a (device to a) flowtable with --check flag
- Delete a netns containing a flowtable
- In an interactive nft session, create a table with owner flag and
flowtable inside, then quit.
Fix it by calling FLOW_BLOCK_UNBIND when unregistering hooks, then
remove late FLOW_BLOCK_UNBIND call when destroying flowtable.
Fixes: ff4bf2f42a ("netfilter: nf_tables: add nft_unregister_flowtable_hook()")
Reported-by: Phil Sutter <phil@nwl.cc>
Tested-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c2994b008492db033d40bd767be1620229a3035e ]
This fixes errors such as the following when Own address type is set to
Random Address but it has not been programmed yet due to either be
advertising or connecting:
< HCI Command: LE Set Exte.. (0x08|0x0041) plen 13
Own address type: Random (0x03)
Filter policy: Ignore not in accept list (0x01)
PHYs: 0x05
Entry 0: LE 1M
Type: Passive (0x00)
Interval: 60.000 msec (0x0060)
Window: 30.000 msec (0x0030)
Entry 1: LE Coded
Type: Passive (0x00)
Interval: 180.000 msec (0x0120)
Window: 90.000 msec (0x0090)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Scan Parameters (0x08|0x0041) ncmd 1
Status: Success (0x00)
< HCI Command: LE Set Exten.. (0x08|0x0042) plen 6
Extended scan: Enabled (0x01)
Filter duplicates: Enabled (0x01)
Duration: 0 msec (0x0000)
Period: 0.00 sec (0x0000)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Scan Enable (0x08|0x0042) ncmd 1
Status: Invalid HCI Command Parameters (0x12)
Fixes: c45074d68a ("Bluetooth: Fix not generating RPA when required")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b341ca51d2679829d26a3f6a4aa9aee9abd94f92 ]
We've noticed that NFS can hang when using RPC over TLS on an unstable
connection, and investigation shows that the RPC layer is stuck in a tight
loop attempting to transmit, but forever getting -EBADMSG back from the
underlying network. The loop begins when tcp_sendmsg_locked() returns
-EPIPE to tls_tx_records(), but that error is converted to -EBADMSG when
calling the socket's error reporting handler.
Instead of converting errors from tcp_sendmsg_locked(), let's pass them
along in this path. The RPC layer handles -EPIPE by reconnecting the
transport, which prevents the endless attempts to transmit on a broken
connection.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Link: https://patch.msgid.link/9594185559881679d81f071b181a10eb07cd079f.1736004079.git.bcodding@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6c5b989116083a98f45aada548ff54e7a83a9c2d ]
ptp4l application reports too high offset when ran on E823 device
with a 100GB/s link. Those values cannot go under 100ns, like in a
working case when using 100 GB/s cable.
This is due to incorrect frequency settings on the PHY clocks for
100 GB/s speed. Changes are introduced to align with the internal
hardware documentation, and correctly initialize frequency in PHY
clocks with the frequency values that are in our HW spec.
To reproduce the issue run ptp4l as a Time Receiver on E823 device,
and observe the offset, which will never approach values seen
in the PTP working case.
Reproduction output:
ptp4l -i enp137s0f3 -m -2 -s -f /etc/ptp4l_8275.conf
ptp4l[5278.775]: master offset 12470 s2 freq +41288 path delay -3002
ptp4l[5278.837]: master offset 10525 s2 freq +39202 path delay -3002
ptp4l[5278.900]: master offset -24840 s2 freq -20130 path delay -3002
ptp4l[5278.963]: master offset 10597 s2 freq +37908 path delay -3002
ptp4l[5279.025]: master offset 8883 s2 freq +36031 path delay -3002
ptp4l[5279.088]: master offset 7267 s2 freq +34151 path delay -3002
ptp4l[5279.150]: master offset 5771 s2 freq +32316 path delay -3002
ptp4l[5279.213]: master offset 4388 s2 freq +30526 path delay -3002
ptp4l[5279.275]: master offset -30434 s2 freq -28485 path delay -3002
ptp4l[5279.338]: master offset -28041 s2 freq -27412 path delay -3002
ptp4l[5279.400]: master offset 7870 s2 freq +31118 path delay -3002
Fixes: 3a7496234d ("ice: implement basic E822 PTP support")
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4c1224501e9d6c5fd12d83752f1c1b444e0e3418 ]
During ARP failure, tid is not inserted but _c4iw_free_ep()
attempts to remove tid which results in error.
This patch fixes the issue by avoiding removal of uninserted tid.
Fixes: 59437d78f0 ("cxgb4/chtls: fix ULD connection failures due to wrong TID base")
Signed-off-by: Anumula Murali Mohan Reddy <anumula@chelsio.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Link: https://patch.msgid.link/20250103092327.1011925-1-anumula@chelsio.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9a79c65f00e2b036e17af3a3a607d7d732b7affb ]
Since commit 099ecf59f0 ("net: annotate lockless accesses to
sk->sk_max_ack_backlog") decided to handle the sk_max_ack_backlog
locklessly, there is one more function mostly called in TCP/DCCP
cases. So this patch completes it:)
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240331090521.71965-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Stable-dep-of: 3479c7549fb1 ("tcp/dccp: allow a connection when sk_max_ack_backlog is zero")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1e9b0e1c550c42c13c111d1a31e822057232abc4 ]
802.2+LLC+SNAP frames received by napi_complete_done() with GRO and DSA
have skb->transport_header set two bytes short, or pointing 2 bytes
before network_header & skb->data. This was an issue as snap_rcv()
expected offset to point to SNAP header (OID:PID), causing packet to
be dropped.
A fix at llc_fixup_skb() (a024e377efed) resets transport_header for any
LLC consumers that may care about it, and stops SNAP packets from being
dropped, but doesn't fix the problem which is that LLC and SNAP should
not use transport_header offset.
Ths patch eliminates the use of transport_header offset for SNAP lookup
of OID:PID so that SNAP does not rely on the offset at all.
The offset is reset after pull for any SNAP packet consumers that may
(but shouldn't) use it.
Fixes: fda55eca5a ("net: introduce skb_transport_header_was_set()")
Signed-off-by: Antonio Pastor <antonio.pastor@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250103012303.746521-1-antonio.pastor@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 32c9c06adb5b157ef259233775a063a43746d699 ]
On Chromebooks based on Mediatek MT8195 or MT8188, the audio frontend
(AFE) is limited to accessing a very small window (1 MiB) of memory,
which is described as a reserved memory region in the device tree.
On these two platforms, the maximum buffer size is given as 512 KiB.
The MediaTek common code uses the same value for preallocations. This
means that only the first two PCM substreams get preallocations, and
then the whole space is exhausted, barring any other substreams from
working. Since the substreams used are not always the first two, this
means audio won't work correctly.
This is observed on the MT8188 Geralt Chromebooks, on which the
"mediatek,dai-link" property was dropped when it was upstreamed. That
property causes the driver to only register the PCM substreams listed
in the property, and in the order given.
Instead of trying to compute an optimal value and figuring out which
streams are used, simply disable preallocation. The PCM buffers are
managed by the core and are allocated and released on the fly. There
should be no impact to any of the other MediaTek platforms.
Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://patch.msgid.link/20241219105303.548437-1-wenst@chromium.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0bb1968da2737ba68fd63857d1af2b301a18d3bf ]
dm_array_cursor_skip() seeks to the target position by loading array
blocks iteratively until the specified number of entries to skip is
reached. When seeking across block boundaries, it uses
dm_array_cursor_next() to step into the next block.
dm_array_cursor_skip() must first move the cursor index to the end
of the current block; otherwise, the cursor position could incorrectly
remain in the same block, causing the actual number of skipped entries
to be much smaller than expected.
This bug affects cache resizing in v2 metadata and could lead to data
loss if the fast device is shrunk during the first-time resume. For
example:
1. create a cache metadata consists of 32768 blocks, with a dirty block
assigned to the second bitmap block. cache_restore v1.0 is required.
cat <<EOF >> cmeta.xml
<superblock uuid="" block_size="64" nr_cache_blocks="32768" \
policy="smq" hint_width="4">
<mappings>
<mapping cache_block="32767" origin_block="0" dirty="true"/>
</mappings>
</superblock>
EOF
dmsetup create cmeta --table "0 8192 linear /dev/sdc 0"
cache_restore -i cmeta.xml -o /dev/mapper/cmeta --metadata-version=2
2. bring up the cache while attempt to discard all the blocks belonging
to the second bitmap block (block# 32576 to 32767). The last command
is expected to fail, but it actually succeeds.
dmsetup create cdata --table "0 2084864 linear /dev/sdc 8192"
dmsetup create corig --table "0 65536 linear /dev/sdc 2105344"
dmsetup create cache --table "0 65536 cache /dev/mapper/cmeta \
/dev/mapper/cdata /dev/mapper/corig 64 2 metadata2 writeback smq \
2 migration_threshold 0"
In addition to the reproducer described above, this fix can be
verified using the "array_cursor/skip" tests in dm-unit:
dm-unit run /pdata/array_cursor/skip/ --kernel-dir <KERNEL_DIR>
Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com>
Fixes: 9b696229aa ("dm persistent data: add cursor skip functions to the cursor APIs")
Reviewed-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 626f128ee9c4133b1cfce4be2b34a1508949370e ]
The cached block pointer in dm_array_cursor might be NULL if it reaches
an unreadable array block, or the array is empty. Therefore,
dm_array_cursor_end() should call dm_btree_cursor_end() unconditionally,
to prevent leaving unreleased btree blocks.
This fix can be verified using the "array_cursor/iterate/empty" test
in dm-unit:
dm-unit run /pdata/array_cursor/iterate/empty --kernel-dir <KERNEL_DIR>
Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com>
Fixes: fdd1315aa5 ("dm array: introduce cursor api")
Reviewed-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a0851ea9cd555c333795b85ddd908898b937c4e1 ]
When committing transaction in jbd2_journal_commit_transaction(), the
disk caches for the filesystem device should be flushed before updating
the journal tail sequence. However, this step is missed if the journal
is not located on the filesystem device. As a result, the filesystem may
become inconsistent following a power failure or system crash. Fix it by
ensuring that the filesystem device is flushed appropriately.
Fixes: 3339578f05 ("jbd2: cleanup journal tail after transaction commit")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20241203014407.805916-3-yi.zhang@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ac1e21bd8c883aeac2f1835fc93b39c1e6838b35 ]
Commit '6a3afb6ac6df ("jbd2: increase the journal IO's priority")'
increases the priority of journal I/O by marking I/O with the
JBD2_JOURNAL_REQ_FLAGS. However, that commit missed the revoke buffers,
so also addresses that kind of I/Os.
Fixes: 6a3afb6ac6df ("jbd2: increase the journal IO's priority")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20241203014407.805916-2-yi.zhang@huaweicloud.com
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 550f7ca98ee028a606aa75705a7e77b1bd11720f upstream.
If the full path to be built by ceph_mdsc_build_path() happens to be
longer than PATH_MAX, then this function will enter an endless (retry)
loop, effectively blocking the whole task. Most of the machine
becomes unusable, making this a very simple and effective DoS
vulnerability.
I cannot imagine why this retry was ever implemented, but it seems
rather useless and harmful to me. Let's remove it and fail with
ENAMETOOLONG instead.
Cc: stable@vger.kernel.org
Reported-by: Dario Weißer <dario@cure53.de>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
[idryomov@gmail.com: backport to 6.1: pr_warn() is still in use]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This is a pre-requisite for the backport of commit 74363ec674cb ("zram:
fix uninitialized ZRAM not releasing backing device"), which has been
implemented differently in commit 7ac07a26de ("zram: preparation for
multi-zcomp support") upstream.
We only need to ensure that zcomp_destroy is not called with a NULL
comp, so add this check as the other commit cannot be backported easily.
Stable-dep-of: 74363ec674cb ("zram: fix uninitialized ZRAM not releasing backing device")
Link: https://lore.kernel.org/Z3ytcILx4S1v_ueJ@codewreck.org
Suggested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 449e6912a2522af672e99992e1201a454910864e upstream.
If the recvmsg() blocks after receiving some data - i.e. due to
SO_RCVLOWAT - the MPTCP code will attempt multiple times to
adjust the receive buffer size, wrongly accounting every time the
cumulative of received data - instead of accounting only for the
delta.
Address the issue moving mptcp_rcv_space_adjust just after the
data reception and passing it only the just received bytes.
This also removes an unneeded difference between the TCP and MPTCP
RX code path implementation.
Fixes: 581302298524 ("mptcp: error out earlier on disconnect")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20241230-net-mptcp-rbuf-fixes-v1-1-8608af434ceb@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6aaced5abd32e2a57cd94fd64f824514d0361da8 upstream.
The task sometimes continues looping in throttle_direct_reclaim() because
allow_direct_reclaim(pgdat) keeps returning false.
#0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
#1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
#2 [ffff80002cb6f990] schedule at ffff800008abc50c
#3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
#4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
#5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
#6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
#7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
#8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
#9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4
At this point, the pgdat contains the following two zones:
NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32"
SIZE: 20480 MIN/LOW/HIGH: 11/28/45
VM_STAT:
NR_FREE_PAGES: 359
NR_ZONE_INACTIVE_ANON: 18813
NR_ZONE_ACTIVE_ANON: 0
NR_ZONE_INACTIVE_FILE: 50
NR_ZONE_ACTIVE_FILE: 0
NR_ZONE_UNEVICTABLE: 0
NR_ZONE_WRITE_PENDING: 0
NR_MLOCK: 0
NR_BOUNCE: 0
NR_ZSPAGES: 0
NR_FREE_CMA_PAGES: 0
NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal"
SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264
VM_STAT:
NR_FREE_PAGES: 146
NR_ZONE_INACTIVE_ANON: 94668
NR_ZONE_ACTIVE_ANON: 3
NR_ZONE_INACTIVE_FILE: 735
NR_ZONE_ACTIVE_FILE: 78
NR_ZONE_UNEVICTABLE: 0
NR_ZONE_WRITE_PENDING: 0
NR_MLOCK: 0
NR_BOUNCE: 0
NR_ZSPAGES: 0
NR_FREE_CMA_PAGES: 0
In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero.
Additionally, since this system lacks swap, the calculation of inactive/
active anonymous pages is skipped.
crash> p nr_swap_pages
nr_swap_pages = $1937 = {
counter = 0
}
As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to
the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having
free pages significantly exceeding the high watermark.
The problem is that the pgdat->kswapd_failures hasn't been incremented.
crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
$1935 = 0x0
This is because the node deemed balanced. The node balancing logic in
balance_pgdat() evaluates all zones collectively. If one or more zones
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the
entire node is deemed balanced. This causes balance_pgdat() to exit early
before incrementing the kswapd_failures, as it considers the overall
memory state acceptable, even though some zones (like ZONE_NORMAL) remain
under significant pressure.
The patch ensures that zone_reclaimable_pages() includes free pages
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are
available (e.g., file-backed or anonymous pages). This change prevents
zones like ZONE_DMA32, which have sufficient free pages, from being
mistakenly deemed unreclaimable. By doing so, the patch ensures proper
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by
allow_direct_reclaim(pgdat) repeatedly returning false.
The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused
by a node being incorrectly deemed balanced despite pressure in certain
zones, such as ZONE_NORMAL. This issue arises from
zone_reclaimable_pages() returning 0 for zones without reclaimable file-
backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
free pages to be skipped.
The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored
during reclaim, masking pressure in other zones. Consequently,
pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
mechanisms in allow_direct_reclaim() from being triggered, leading to an
infinite loop in throttle_direct_reclaim().
This patch modifies zone_reclaimable_pages() to account for free pages
(NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones
with sufficient free pages are not skipped, enabling proper balancing and
reclaim behavior.
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com
Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com
Fixes: 5a1c84b404 ("mm: remove reclaim and compaction retry approximations")
Signed-off-by: Seiji Nishikawa <snishika@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 158cdce87c8c172787063998ad5dd3e2f658b963 upstream.
When testing large folio support with XFS on our servers, we observed that
only a few large folios are mapped when reading large files via mmap.
After a thorough analysis, I identified it was caused by the
`/sys/block/*/queue/read_ahead_kb` setting. On our test servers, this
parameter is set to 128KB. After I tune it to 2MB, the large folio can
work as expected. However, I believe the large folio behavior should not
be dependent on the value of read_ahead_kb. It would be more robust if
the kernel can automatically adopt to it.
With /sys/block/*/queue/read_ahead_kb set to 128KB and performing a
sequential read on a 1GB file using MADV_HUGEPAGE, the differences in
/proc/meminfo are as follows:
- before this patch
FileHugePages: 18432 kB
FilePmdMapped: 4096 kB
- after this patch
FileHugePages: 1067008 kB
FilePmdMapped: 1048576 kB
This shows that after applying the patch, the entire 1GB file is mapped to
huge pages. The stable list is CCed, as without this patch, large folios
don't function optimally in the readahead path.
It's worth noting that if read_ahead_kb is set to a larger value that
isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still fail
to map to hugepages.
Link: https://lkml.kernel.org/r/20241108141710.9721-1-laoar.shao@gmail.com
Link: https://lkml.kernel.org/r/20241206083025.3478-1-laoar.shao@gmail.com
Fixes: 4687fdbb80 ("mm/filemap: Support VM_HUGEPAGE for file mappings")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a37eecb705f33726f1fb7cd2a67e514a15dfe693 upstream.
If a device uses MCP23xxx IO expander to receive IRQs, the following
bug can happen:
BUG: sleeping function called from invalid context
at kernel/locking/mutex.c:283
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, ...
preempt_count: 1, expected: 0
...
Call Trace:
...
__might_resched+0x104/0x10e
__might_sleep+0x3e/0x62
mutex_lock+0x20/0x4c
regmap_lock_mutex+0x10/0x18
regmap_update_bits_base+0x2c/0x66
mcp23s08_irq_set_type+0x1ae/0x1d6
__irq_set_trigger+0x56/0x172
__setup_irq+0x1e6/0x646
request_threaded_irq+0xb6/0x160
...
We observed the problem while experimenting with a touchscreen driver which
used MCP23017 IO expander (I2C).
The regmap in the pinctrl-mcp23s08 driver uses a mutex for protection from
concurrent accesses, which is the default for regmaps without .fast_io,
.disable_locking, etc.
mcp23s08_irq_set_type() calls regmap_update_bits_base(), and the latter
locks the mutex.
However, __setup_irq() locks desc->lock spinlock before calling these
functions. As a result, the system tries to lock the mutex whole holding
the spinlock.
It seems, the internal regmap locks are not needed in this driver at all.
mcp->lock seems to protect the regmap from concurrent accesses already,
except, probably, in mcp_pinconf_get/set.
mcp23s08_irq_set_type() and mcp23s08_irq_mask/unmask() are called under
chip_bus_lock(), which calls mcp23s08_irq_bus_lock(). The latter takes
mcp->lock and enables regmap caching, so that the potentially slow I2C
accesses are deferred until chip_bus_unlock().
The accesses to the regmap from mcp23s08_probe_one() do not need additional
locking.
In all remaining places where the regmap is accessed, except
mcp_pinconf_get/set(), the driver already takes mcp->lock.
This patch adds locking in mcp_pinconf_get/set() and disables internal
locking in the regmap config. Among other things, it fixes the sleeping
in atomic context described above.
Fixes: 8f38910ba4 ("pinctrl: mcp23s08: switch to regmap caching")
Cc: stable@vger.kernel.org
Signed-off-by: Evgenii Shatokhin <e.shatokhin@yadro.com>
Link: https://lore.kernel.org/20241209074659.1442898-1-e.shatokhin@yadro.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d0257e089d1bbd35c69b6c97ff73e3690ab149a9 upstream.
In the expression "cmd.wqe_size * cmd.wr_count", both variables are u32
values that come from the user so the multiplication can lead to integer
wrapping. Then we pass the result to uverbs_request_next_ptr() which also
could potentially wrap. The "cmd.sge_count * sizeof(struct ib_uverbs_sge)"
multiplication can also overflow on 32bit systems although it's fine on
64bit systems.
This patch does two things. First, I've re-arranged the condition in
uverbs_request_next_ptr() so that the use controlled variable "len" is on
one side of the comparison by itself without any math. Then I've modified
all the callers to use size_mul() for the multiplications.
Fixes: 67cdb40ca4 ("[IB] uverbs: Implement more commands")
Cc: stable@vger.kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/b8765ab3-c2da-4611-aae0-ddd6ba173d23@stanley.mountain
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>