[ Upstream commit 087a9eb9e5978e3ba362e1163691e41097e8ca20 ]
When a VNI is deleted from a VXLAN device in 'vnifilter' mode, the FDB
entry associated with the default remote (assuming one was configured)
is deleted without holding the hash lock. This is wrong and will result
in a warning [1] being generated by the lockdep annotation that was
added by commit ebe642067455 ("vxlan: Create wrappers for FDB lookup").
Reproducer:
# ip link add vx0 up type vxlan dstport 4789 external vnifilter local 192.0.2.1
# bridge vni add vni 10010 remote 198.51.100.1 dev vx0
# bridge vni del vni 10010 dev vx0
Fix by acquiring the hash lock before the deletion and releasing it
afterwards. Blame the original commit that introduced the issue rather
than the one that exposed it.
[1]
WARNING: CPU: 3 PID: 392 at drivers/net/vxlan/vxlan_core.c:417 vxlan_find_mac+0x17f/0x1a0
[...]
RIP: 0010:vxlan_find_mac+0x17f/0x1a0
[...]
Call Trace:
<TASK>
__vxlan_fdb_delete+0xbe/0x560
vxlan_vni_delete_group+0x2ba/0x940
vxlan_vni_del.isra.0+0x15f/0x580
vxlan_process_vni_filter+0x38b/0x7b0
vxlan_vnifilter_process+0x3bb/0x510
rtnetlink_rcv_msg+0x2f7/0xb70
netlink_rcv_skb+0x131/0x360
netlink_unicast+0x426/0x710
netlink_sendmsg+0x75a/0xc20
__sock_sendmsg+0xc1/0x150
____sys_sendmsg+0x5aa/0x7b0
___sys_sendmsg+0xfc/0x180
__sys_sendmsg+0x121/0x1b0
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: f9c4bb0b24 ("vxlan: vni filtering support on collect metadata device")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250423145131.513029-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0fb15ae3b0a9221be01715dac0335647c79f3362 ]
plfxlc_mac_release() asserts that mac->lock is held. This assertion is
incorrect, because even if it was possible, it would not be the valid
behaviour. The function is used when probe fails or after the device is
disconnected. In both cases mac->lock can not be held as the driver is
not working with the device at the moment. All functions that use mac->lock
unlock it just after it was held. There is also no need to hold mac->lock
for plfxlc_mac_release() itself, as mac data is not affected, except for
mac->flags, which is modified atomically.
This bug leads to the following warning:
================================================================
WARNING: CPU: 0 PID: 127 at drivers/net/wireless/purelifi/plfxlc/mac.c:106 plfxlc_mac_release+0x7d/0xa0
Modules linked in:
CPU: 0 PID: 127 Comm: kworker/0:2 Not tainted 6.1.124-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Workqueue: usb_hub_wq hub_event
RIP: 0010:plfxlc_mac_release+0x7d/0xa0 drivers/net/wireless/purelifi/plfxlc/mac.c:106
Call Trace:
<TASK>
probe+0x941/0xbd0 drivers/net/wireless/purelifi/plfxlc/usb.c:694
usb_probe_interface+0x5c0/0xaf0 drivers/usb/core/driver.c:396
really_probe+0x2ab/0xcb0 drivers/base/dd.c:639
__driver_probe_device+0x1a2/0x3d0 drivers/base/dd.c:785
driver_probe_device+0x50/0x420 drivers/base/dd.c:815
__device_attach_driver+0x2cf/0x510 drivers/base/dd.c:943
bus_for_each_drv+0x183/0x200 drivers/base/bus.c:429
__device_attach+0x359/0x570 drivers/base/dd.c:1015
bus_probe_device+0xba/0x1e0 drivers/base/bus.c:489
device_add+0xb48/0xfd0 drivers/base/core.c:3696
usb_set_configuration+0x19dd/0x2020 drivers/usb/core/message.c:2165
usb_generic_driver_probe+0x84/0x140 drivers/usb/core/generic.c:238
usb_probe_device+0x130/0x260 drivers/usb/core/driver.c:293
really_probe+0x2ab/0xcb0 drivers/base/dd.c:639
__driver_probe_device+0x1a2/0x3d0 drivers/base/dd.c:785
driver_probe_device+0x50/0x420 drivers/base/dd.c:815
__device_attach_driver+0x2cf/0x510 drivers/base/dd.c:943
bus_for_each_drv+0x183/0x200 drivers/base/bus.c:429
__device_attach+0x359/0x570 drivers/base/dd.c:1015
bus_probe_device+0xba/0x1e0 drivers/base/bus.c:489
device_add+0xb48/0xfd0 drivers/base/core.c:3696
usb_new_device+0xbdd/0x18f0 drivers/usb/core/hub.c:2620
hub_port_connect drivers/usb/core/hub.c:5477 [inline]
hub_port_connect_change drivers/usb/core/hub.c:5617 [inline]
port_event drivers/usb/core/hub.c:5773 [inline]
hub_event+0x2efe/0x5730 drivers/usb/core/hub.c:5855
process_one_work+0x8a9/0x11d0 kernel/workqueue.c:2292
worker_thread+0xa47/0x1200 kernel/workqueue.c:2439
kthread+0x28d/0x320 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>
================================================================
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: 68d57a07bf ("wireless: add plfxlc driver for pureLiFi X, XL, XC devices")
Reported-by: syzbot+7d4f142f6c288de8abfe@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7d4f142f6c288de8abfe
Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru>
Link: https://patch.msgid.link/20250321185226.71-2-m.masimov@mt-integration.ru
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9aff2e8df240e84a36f2607f98a0a9924a24e65d ]
Issue:
When multiple audio streams share a common BE DAI, the BE DAI
widget can be powered up before its hardware parameters are configured.
This incorrect sequence leads to intermittent pcm_write errors.
For example, the below Tegra use-case throws an error:
aplay(2 streams) -> AMX(mux) -> ADX(demux) -> arecord(2 streams),
here, 'AMX TX' and 'ADX RX' are common BE DAIs.
For above usecase when failure happens below sequence is observed:
aplay(1) FE open()
- BE DAI callbacks added to the list
- BE DAI state = SND_SOC_DPCM_STATE_OPEN
aplay(2) FE open()
- BE DAI callbacks are not added to the list as the state is
already SND_SOC_DPCM_STATE_OPEN during aplay(1) FE open().
aplay(2) FE hw_params()
- BE DAI hw_params() callback ignored
aplay(2) FE prepare()
- Widget is powered ON without BE DAI hw_params() call
aplay(1) FE hw_params()
- BE DAI hw_params() is now called
Fix:
Add BE DAIs in the list if its state is either SND_SOC_DPCM_STATE_OPEN
or SND_SOC_DPCM_STATE_HW_PARAMS as well.
It ensures the widget is powered ON after BE DAI hw_params() callback.
Fixes: 0c25db3f76 ("ASoC: soc-pcm: Don't reconnect an already active BE")
Signed-off-by: Sheetal <sheetal@nvidia.com>
Link: https://patch.msgid.link/20250404105953.2784819-1-sheetal@nvidia.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c2fee09fc167c74a64adb08656cb993ea475197e upstream.
Move the conditional loading of hardware DR6 with the guest's DR6 value
out of the core .vcpu_run() loop to fix a bug where KVM can load hardware
with a stale vcpu->arch.dr6.
When the guest accesses a DR and host userspace isn't debugging the guest,
KVM disables DR interception and loads the guest's values into hardware on
VM-Enter and saves them on VM-Exit. This allows the guest to access DRs
at will, e.g. so that a sequence of DR accesses to configure a breakpoint
only generates one VM-Exit.
For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also
identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest)
and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading
DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop.
But for DR6, the guest's value doesn't need to be loaded into hardware for
KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas
VMX requires software to manually load the guest value, and so loading the
guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done
_inside_ the core run loop.
Unfortunately, saving the guest values on VM-Exit is initiated by common
x86, again outside of the core run loop. If the guest modifies DR6 (in
hardware, when DR interception is disabled), and then the next VM-Exit is
a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and
clobber the guest's actual value.
The bug shows up primarily with nested VMX because KVM handles the VMX
preemption timer in the fastpath, and the window between hardware DR6
being modified (in guest context) and DR6 being read by guest software is
orders of magnitude larger in a nested setup. E.g. in non-nested, the
VMX preemption timer would need to fire precisely between #DB injection
and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the
window where hardware DR6 is "dirty" extends all the way from L1 writing
DR6 to VMRESUME (in L1).
L1's view:
==========
<L1 disables DR interception>
CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0
A: L1 Writes DR6
CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1
B: CPU 0/KVM-7289 [023] d.... 2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec
D: L1 reads DR6, arch.dr6 = 0
CPU 0/KVM-7289 [023] d.... 2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0
CPU 0/KVM-7289 [023] d.... 2925.640976: kvm_entry: vcpu 0
L2 reads DR6, L1 disables DR interception
CPU 0/KVM-7289 [023] d.... 2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216
CPU 0/KVM-7289 [023] d.... 2925.640983: kvm_entry: vcpu 0
CPU 0/KVM-7289 [023] d.... 2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0
L2 detects failure
CPU 0/KVM-7289 [023] d.... 2925.640987: kvm_exit: vcpu 0 reason HLT
L1 reads DR6 (confirms failure)
CPU 0/KVM-7289 [023] d.... 2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0
L0's view:
==========
L2 reads DR6, arch.dr6 = 0
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
L2 => L1 nested VM-Exit
CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216
CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_entry: vcpu 23
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason VMREAD
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_entry: vcpu 23
L1 writes DR7, L0 disables DR interception
CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007
CPU 23/KVM-5046 [001] d.... 3410.005613: kvm_entry: vcpu 23
L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0
A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'>
B: CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER
CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_entry: vcpu 23
C: L0 writes DR6 = 0 (arch.dr6)
CPU 23/KVM-5046 [001] d.... 3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0
L1 => L2 nested VM-Enter
CPU 23/KVM-5046 [001] d.... 3410.005616: kvm_exit: vcpu 23 reason VMRESUME
L0 reads DR6, arch.dr6 = 0
Reported-by: John Stultz <jstultz@google.com>
Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com
Fixes: 375e28ffc0 ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT")
Fixes: d67668e9dd ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[jth: Handled conflicts with kvm_x86_ops reshuffle]
Signed-off-by: James Houghton <jthoughton@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 288e1f693f04e66be99f27e7cbe4a45936a66745 ]
xfs/205 produces the following failure when always_cow is enabled:
# --- a/tests/xfs/205.out 2024-02-28 16:20:24.437887970 -0800
# +++ b/tests/xfs/205.out.bad 2024-06-03 21:13:40.584000000 -0700
# @@ -1,4 +1,5 @@
# QA output created by 205
# *** one file
# + !!! disk full (expected)
# *** one file, a few bytes at a time
# *** done
This is the result of overly aggressive attempts to align cow fork
delalloc reservations to the CoW extent size hint. Looking at the trace
data, we're trying to append a single fsblock to the "fred" file.
Trying to create a speculative post-eof reservation fails because
there's not enough space.
We then set @prealloc_blocks to zero and try again, but the cowextsz
alignment code triggers, which expands our request for a 1-fsblock
reservation into a 39-block reservation. There's not enough space for
that, so the whole write fails with ENOSPC even though there's
sufficient space in the filesystem to allocate the single block that we
need to land the write.
There are two things wrong here -- first, we shouldn't be attempting
speculative preallocations beyond what was requested when we're low on
space. Second, if we've already computed a posteof preallocation, we
shouldn't bother trying to align that to the cowextsize hint.
Fix both of these problems by adding a flag that only enables the
expansion of the delalloc reservation to the cowextsize if we're doing a
non-extending write, and only if we're not doing an ENOSPC retry. This
requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc.
I probably should have caught this six years ago when 6ca30729c2 was
being reviewed, but oh well. Update the comments to reflect what the
code does now.
Fixes: 6ca30729c2 ("xfs: bmap code cleanup")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1ec9307fc066dd8a140d5430f8a7576aa9d78cd3 ]
For a very very long time, inode inactivation has set the inode size to
zero before unmapping the extents associated with the data fork.
Unfortunately, commit 3c6f46eacd changed the inode verifier to
prohibit zero-length symlinks and directories. If an inode happens to
get logged in this state and the system crashes before freeing the
inode, log recovery will also fail on the broken inode.
Therefore, allow zero-size symlinks and directories as long as the link
count is zero; nobody will be able to open these files by handle so
there isn't any risk of data exposure.
Fixes: 3c6f46eacd ("xfs: sanity check directory inode di_size")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 610b29161b0aa9feb59b78dc867553274f17fb01 ]
xfs_can_free_eofblocks returns false for files that have persistent
preallocations unless the force flag is passed and there are delayed
blocks. This means it won't free delalloc reservations for files
with persistent preallocations unless the force flag is set, and it
will also free the persistent preallocations if the force flag is
set and the file happens to have delayed allocations.
Both of these are bad, so do away with the force flag and always free
only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC
or APPEND flags set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 58f880711f2ba53fd5e959875aff5b3bf6d5c32e ]
A user with a completely full filesystem experienced an unexpected
shutdown when the filesystem tried to write the superblock during
runtime.
kernel shows the following dmesg:
[ 8.176281] XFS (dm-4): Metadata corruption detected at xfs_sb_write_verify+0x60/0x120 [xfs], xfs_sb block 0x0
[ 8.177417] XFS (dm-4): Unmount and run xfs_repair
[ 8.178016] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
[ 8.178703] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00 XFSB............
[ 8.179487] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 8.180312] 00000020: cf 12 dc 89 ca 26 45 29 92 e6 e3 8d 3b b8 a2 c3 .....&E)....;...
[ 8.181150] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80 ................
[ 8.182003] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................
[ 8.182004] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00 .....d..........
[ 8.182004] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00 ..d.............
[ 8.182005] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19 ................
[ 8.182008] XFS (dm-4): Corruption of in-memory data detected. Shutting down filesystem
[ 8.182010] XFS (dm-4): Please unmount the filesystem and rectify the problem(s)
When xfs_log_sb writes super block to disk, b_fdblocks is fetched from
m_fdblocks without any lock. As m_fdblocks can experience a positive ->
negative -> positive changing when the FS reaches fullness (see
xfs_mod_fdblocks). So there is a chance that sb_fdblocks is negative, and
because sb_fdblocks is type of unsigned long long, it reads super big.
And sb_fdblocks being bigger than sb_dblocks is a problem during log
recovery, xfs_validate_sb_write() complains.
Fix:
As sb_fdblocks will be re-calculated during mount when lazysbcount is
enabled, We just need to make xfs_validate_sb_write() happy -- make sure
sb_fdblocks is not nenative. This patch also takes care of other percpu
counters in xfs_log_sb.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 38de567906d95c397d87f292b892686b7ec6fbc3 ]
An internal user complained about log recovery failing on a symlink
("Bad dinode after recovery") with the following (excerpted) format:
core.magic = 0x494e
core.mode = 0120777
core.version = 3
core.format = 2 (extents)
core.nlinkv2 = 1
core.nextents = 1
core.size = 297
core.nblocks = 1
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
0:[0,12,1,0]
This is a symbolic link with a 297-byte target stored in a disk block,
which is to say this is a symlink with a remote target. The forkoff is
0, which is to say that there's 512 - 176 == 336 bytes in the inode core
to store the data fork.
Eventually, testing of generic/388 failed with the same inode corruption
message during inode recovery. In writing a debugging patch to call
xfs_dinode_verify on dirty inode log items when we're committing
transactions, I observed that xfs/298 can reproduce the problem quite
quickly.
xfs/298 creates a symbolic link, adds some extended attributes, then
deletes them all. The test failure occurs when the final removexattr
also deletes the attr fork because that does not convert the remote
symlink back into a shortform symlink. That is how we trip this test.
The only reason why xfs/298 only triggers with the debug patch added is
that it deletes the symlink, so the final iflush shows the inode as
free.
I wrote a quick fstest to emulate the behavior of xfs/298, except that
it leaves the symlinks on the filesystem after inducing the "corrupt"
state. Kernels going back at least as far as 4.18 have written out
symlink inodes in this manner and prior to 1eb70f54c4 they did not
object to reading them back in.
Because we've been writing out inodes this way for quite some time, the
only way to fix this is to relax the check for symbolic links.
Directories don't have this problem because di_size is bumped to
blocksize during the sf->data conversion.
Fixes: 1eb70f54c4 ("xfs: validate inode fork size against fork format")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5ce5674187c345dc31534d2024c09ad8ef29b7ba ]
Current clone operation could be non-atomic if the destination of a file
is beyond EOF, user could get a file with corrupted (zeroed) data on
crash.
The problem is about preallocations. If you write some data into a file:
[A...B)
and XFS decides to preallocate some post-eof blocks, then it can create
a delayed allocation reservation:
[A.........D)
The writeback path tries to convert delayed extents to real ones by
allocating blocks. If there aren't enough contiguous free space, we can
end up with two extents, the first real and the second still delalloc:
[A....C)[C.D)
After that, both the in-memory and the on-disk file sizes are still B.
If we clone into the range [E...F) from another file:
[A....C)[C.D) [E...F)
then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the
range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
extent, its pagecache will be zeroed and both the in-memory and on-disk
size will be updated to D after flushing but before cloning. This is
wrong, because the user can see the size change and read the zeroes
while the clone operation is ongoing.
We need to keep the in-memory and on-disk size before the clone
operation starts, so instead of writing zeroes through the page cache
for delayed ranges beyond EOF, we convert these ranges to unwritten and
invalidate any cached data over that range beyond EOF.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2e08371a83f1c06fd85eea8cd37c87a224cc4cc4 ]
Since xfs_bmapi_convert_delalloc() only attempts to allocate the entire
delalloc extent and require multiple invocations to allocate the target
offset. So xfs_convert_blocks() add a loop to do this job and we call it
in the write back path, but xfs_convert_blocks() isn't a common helper.
Let's do it in xfs_bmapi_convert_delalloc() and drop
xfs_convert_blocks(), preparing for the post EOF delalloc blocks
converting in the buffered write begin path.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit bb712842a85d595525e72f0e378c143e620b3ea2 ]
Commit 1aa91d9c99 ("xfs: Add async buffered write support") replace
xfs_ilock(XFS_ILOCK_EXCL) with xfs_ilock_for_iomap() when locking the
writing inode, and a new variable lockmode is used to indicate the lock
mode. Although the lockmode should always be XFS_ILOCK_EXCL, it's still
better to use this variable instead of useing XFS_ILOCK_EXCL directly
when unlocking the inode.
Fixes: 1aa91d9c99 ("xfs: Add async buffered write support")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2a009397eb5ae178670cbd7101e9635cf6412b35 ]
In my haste to fix what I thought was a performance problem in the attr
scrub code, I neglected to notice that the xfs_attr_get_ilocked also had
the effect of checking that attributes can actually be looked up through
the attr dabtree. Fix this.
Fixes: 44af6c7e59 ("xfs: don't load local xattr values during scrub")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1c7f09d210aba2f2bb206e2e8c97c9f11a3fd880 ]
Strengthen the xattri log item recovery code by checking that we
actually have the required name and newname buffers for whatever
operation we're replaying.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ad206ae50eca62836c5460ab5bbf2a6c59a268e7 ]
Check that the number of recovered log iovecs is what is expected for
the xattri opcode is expecting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8ef1d96a985e4dc07ffbd71bd7fc5604a80cc644 ]
The XFS_SB_FEAT_INCOMPAT_LOG_XATTRS feature bit protects a filesystem
from old kernels that do not know how to recover extended attribute log
intent items. Make this check mandatory instead of a debugging assert.
Fixes: fd92000878 ("xfs: Set up infrastructure for log attribute replay")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 86de848403abda05bf9c16dcdb6bef65a8d88c41 ]
Accessing if_bytes without the ilock is racy. Remove the initial
if_bytes == 0 check in xfs_reflink_end_cow_extent and let
ext_iext_lookup_extent fail for this case after we've taken the ilock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d69bee6a35d3c5e4873b9e164dd1a9711351a97c ]
xfs_bmap_add_extent_delay_real takes parts or all of a delalloc extent
and converts them to a real extent. It is written to deal with any
potential overlap of the to be converted range with the delalloc extent,
but it turns out that currently only converting the entire extents, or a
part starting at the beginning is actually exercised, as the only caller
always tries to convert the entire delalloc extent, and either succeeds
or at least progresses partially from the start.
If it only converts a tiny part of a delalloc extent, the indirect block
calculation for the new delalloc extent (da_new) might be equivalent to that
of the existing delalloc extent (da_old). If this extent conversion now
requires allocating an indirect block that gets accounted into da_new,
leading to the assert that da_new must be smaller or equal to da_new
unless we split the extent to trigger.
Except for the assert that case is actually handled by just trying to
allocate more space, as that already handled for the split case (which
currently can't be reached at all), so just reusing it should be fine.
Except that without dipping into the reserved block pool that would make
it a bit too easy to trigger a fs shutdown due to ENOSPC. So in addition
to adjusting the assert, also dip into the reserved block pool.
Note that I could only reproduce the assert with a change to only convert
the actually asked range instead of the full delalloc extent from
xfs_bmapi_write.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6773da870ab89123d1b513da63ed59e32a29cb77 ]
xfs_bmapi_write can return 0 without actually returning a mapping in
mval in two different cases:
1) when there is absolutely no space available to do an allocation
2) when converting delalloc space, and the allocation is so small
that it only covers parts of the delalloc extent before the
range requested by the caller
Callers at best can handle one of these cases, but in many cases can't
cope with either one. Switch xfs_bmapi_write to always return a
mapping or return an error code instead. For case 1) above ENOSPC is
the obvious choice which is very much what the callers expect anyway.
For case 2) there is no really good error code, so pick a funky one
from the SysV streams portfolio.
This fixes the reproducer here:
https://lore.kernel.org/linux-xfs/CAEJPjCvT3Uag-pMTYuigEjWZHn1sGMZ0GCjVVCv29tNHK76Cgg@mail.gmail.com0/
which uses reserved blocks to create file systems that are gravely
out of space and thus cause at least xfs_file_alloc_space to hang
and trigger the lack of ENOSPC handling in xfs_dquot_disk_alloc.
Note that this patch does not actually make any caller but
xfs_alloc_file_space deal intelligently with case 2) above.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: 刘通 <lyutoon@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f5178c41bb43444a6008150fe6094497135d07cb upstream.
syzbot reported this bug:
==================================================================
BUG: KASAN: slab-out-of-bounds in trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
BUG: KASAN: slab-out-of-bounds in tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
Write of size 4507 at addr ffff888032b6b000 by task syz.2.320/7260
CPU: 1 UID: 0 PID: 7260 Comm: syz.2.320 Not tainted 6.15.0-rc1-syzkaller-00301-g3bde70a2c827 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xc3/0x670 mm/kasan/report.c:521
kasan_report+0xe0/0x110 mm/kasan/report.c:634
check_region_inline mm/kasan/generic.c:183 [inline]
kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189
__asan_memcpy+0x3c/0x60 mm/kasan/shadow.c:106
trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
....
==================================================================
It has been reported that trace_seq_to_buffer() tries to copy more data
than PAGE_SIZE to buf. Therefore, to prevent this, we should use the
smaller of trace_seq_used(&iter->seq) and PAGE_SIZE as an argument.
Link: https://lore.kernel.org/20250422113026.13308-1-aha310510@gmail.com
Reported-by: syzbot+c8cd2d2c412b868263fb@syzkaller.appspotmail.com
Fixes: 3c56819b14 ("tracing: splice support for tracing_pipe")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b79028039f440e7d2c4df6ab243060c4e3803e84 upstream.
Commit 7491cdf46b5c ("cpufreq: Avoid using inconsistent policy->min and
policy->max") overlooked the fact that policy->min and policy->max were
accessed directly in cpufreq_frequency_table_target() and in the
functions called by it. Consequently, the changes made by that commit
led to problems with setting policy limits.
Address this by passing the target frequency limits to __resolve_freq()
and cpufreq_frequency_table_target() and propagating them to the
functions called by the latter.
Fixes: 7491cdf46b5c ("cpufreq: Avoid using inconsistent policy->min and policy->max")
Cc: 5.16+ <stable@vger.kernel.org> # 5.16+
Closes: https://lore.kernel.org/linux-pm/aAplED3IA_J0eZN0@linaro.org/
Reported-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Reviewed-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Link: https://patch.msgid.link/5896780.DvuYhMxLoT@rjwysocki.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7491cdf46b5cbdf123fc84fbe0a07e9e3d7b7620 upstream.
Since cpufreq_driver_resolve_freq() can run in parallel with
cpufreq_set_policy() and there is no synchronization between them,
the former may access policy->min and policy->max while the latter
is updating them and it may see intermediate values of them due
to the way the update is carried out. Also the compiler is free
to apply any optimizations it wants both to the stores in
cpufreq_set_policy() and to the loads in cpufreq_driver_resolve_freq()
which may result in additional inconsistencies.
To address this, use WRITE_ONCE() when updating policy->min and
policy->max in cpufreq_set_policy() and use READ_ONCE() for reading
them in cpufreq_driver_resolve_freq(). Moreover, rearrange the update
in cpufreq_set_policy() to avoid storing intermediate values in
policy->min and policy->max with the help of the observation that
their new values are expected to be properly ordered upfront.
Also modify cpufreq_driver_resolve_freq() to take the possible reverse
ordering of policy->min and policy->max, which may happen depending on
the ordering of operations when this function and cpufreq_set_policy()
run concurrently, into account by always honoring the max when it
turns out to be less than the min (in case it comes from thermal
throttling or similar).
Fixes: 1517176906 ("cpufreq: Make policy min/max hard requirements")
Cc: 5.16+ <stable@vger.kernel.org> # 5.16+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://patch.msgid.link/5907080.DvuYhMxLoT@rjwysocki.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e86e9134e1d1c90a960dd57f59ce574d27b9a124 upstream.
Setting sess->user = NULL was introduced to fix the dangling pointer
created by ksmbd_free_user. However, it is possible another thread could
be operating on the session and make use of sess->user after it has been
passed to ksmbd_free_user but before sess->user is set to NULL.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Heelan <seanheelan@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8d6955ed76e8a47115f2ea1d9c263ee6f505d737 upstream.
In certain situations, the sysfs for uncore may not be present when all
CPUs in a package are offlined and then brought back online after boot.
This issue can occur if there is an error in adding the sysfs entry due
to a memory allocation failure. Retrying to bring the CPUs online will
not resolve the issue, as the uncore_cpu_mask is already set for the
package before the failure condition occurs.
This issue does not occur if the failure happens during module
initialization, as the module will fail to load in the event of any
error.
To address this, ensure that the uncore_cpu_mask is not set until the
successful return of uncore_freq_add_entry().
Fixes: dbce412a77 ("platform/x86/intel-uncore-freq: Split common and enumeration part")
Signed-off-by: Shouye Liu <shouyeliu@tencent.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250417032321.75580-1-shouyeliu@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2c8a7c66c90832432496616a9a3c07293f1364f3 upstream.
On the Lenovo ThinkPad X201, when Intel VT-d is enabled in the BIOS, the
kernel boots with errors related to DMAR, the graphical interface appeared
quite choppy, and the system resets erratically within a minute after it
booted:
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xb97ff000
[fault reason 0x05] PTE Write access is not set
Upon comparing boot logs with VT-d on/off, I found that the Intel Calpella
quirk (`quirk_calpella_no_shadow_gtt()') correctly applied the igfx IOMMU
disable/quirk correctly:
pci 0000:00:00.0: DMAR: BIOS has allocated no shadow GTT; disabling IOMMU
for graphics
Whereas with VT-d on, it went into the "else" branch, which then
triggered the DMAR handling fault above:
... else if (!disable_igfx_iommu) {
/* we have to ensure the gfx device is idle before we flush */
pci_info(dev, "Disabling batched IOTLB flush on Ironlake\n");
iommu_set_dma_strict();
}
Now, this is not exactly scientific, but moving 0x0044 to quirk_iommu_igfx
seems to have fixed the aforementioned issue. Running a few `git blame'
runs on the function, I have found that the quirk was originally
introduced as a fix specific to ThinkPad X201:
commit 9eecabcb9a ("intel-iommu: Abort IOMMU setup for igfx if BIOS gave
no shadow GTT space")
Which was later revised twice to the "else" branch we saw above:
- 2011: commit 6fbcfb3e46 ("intel-iommu: Workaround IOTLB hang on
Ironlake GPU")
- 2024: commit ba00196ca41c ("iommu/vt-d: Decouple igfx_off from graphic
identity mapping")
I'm uncertain whether further testings on this particular laptops were
done in 2011 and (honestly I'm not sure) 2024, but I would be happy to do
some distro-specific testing if that's what would be required to verify
this patch.
P.S., I also see IDs 0x0040, 0x0062, and 0x006a listed under the same
`quirk_calpella_no_shadow_gtt()' quirk, but I'm not sure how similar these
chipsets are (if they share the same issue with VT-d or even, indeed, if
this issue is specific to a bug in the Lenovo BIOS). With regards to
0x0062, it seems to be a Centrino wireless card, but not a chipset?
I have also listed a couple (distro and kernel) bug reports below as
references (some of them are from 7-8 years ago!), as they seem to be
similar issue found on different Westmere/Ironlake, Haswell, and Broadwell
hardware setups.
Cc: stable@vger.kernel.org
Fixes: 6fbcfb3e46 ("intel-iommu: Workaround IOTLB hang on Ironlake GPU")
Fixes: ba00196ca41c ("iommu/vt-d: Decouple igfx_off from graphic identity mapping")
Link: https://groups.google.com/g/qubes-users/c/4NP4goUds2c?pli=1
Link: https://bugs.archlinux.org/task/65362
Link: https://bbs.archlinux.org/viewtopic.php?id=230323
Reported-by: Wenhao Sun <weiguangtwk@outlook.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=197029
Signed-off-by: Mingcong Bai <jeffbai@aosc.io>
Link: https://lore.kernel.org/r/20250415133330.12528-1-jeffbai@aosc.io
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8dee308e4c01dea48fc104d37f92d5b58c50b96c upstream.
There is a string parsing logic error which can lead to an overflow of hid
or uid buffers. Comparing ACPIID_LEN against a total string length doesn't
take into account the lengths of individual hid and uid buffers so the
check is insufficient in some cases. For example if the length of hid
string is 4 and the length of the uid string is 260, the length of str
will be equal to ACPIID_LEN + 1 but uid string will overflow uid buffer
which size is 256.
The same applies to the hid string with length 13 and uid string with
length 250.
Check the length of hid and uid strings separately to prevent
buffer overflow.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: ca3bf5d47c ("iommu/amd: Introduces ivrs_acpihid kernel parameter")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Paklov <Pavel.Paklov@cyberprotect.ru>
Link: https://lore.kernel.org/r/20250325092259.392844-1-Pavel.Paklov@cyberprotect.ru
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5a2a6c428190f945c5cbf5791f72dbea83e97f66 upstream.
realloc_argv() was only updating the array size if it was called with
old_argv already allocated. The first time it was called to create an
argv array, it would allocate the array but return the array size as
zero. dm_split_args() would think that it couldn't store any arguments
in the array and would call realloc_argv() again, causing it to
reallocate the initial slots (this time using GPF_KERNEL) and finally
return a size. Aside from being wasteful, this could cause deadlocks on
targets that need to process messages without starting new IO. Instead,
realloc_argv should always update the allocated array size on success.
Fixes: a065192655 ("dm table: don't copy from a NULL pointer in realloc_argv()")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0a533c3e4246c29d502a7e0fba0e86d80a906b04 upstream.
If we use the 'B' mode and we have an invalit table line,
cancel_delayed_work_sync would trigger a warning. This commit avoids the
warning.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8e089e7b585d95122c8122d732d1d5ef8f879396 upstream.
The function brcmf_usb_dl_writeimage() calls the function
brcmf_usb_dl_cmd() but dose not check its return value. The
'state.state' and the 'state.bytes' are uninitialized if the
function brcmf_usb_dl_cmd() fails. It is dangerous to use
uninitialized variables in the conditions.
Add error handling for brcmf_usb_dl_cmd() to jump to error
handling path if the brcmf_usb_dl_cmd() fails and the
'state.state' and the 'state.bytes' are uninitialized.
Improve the error message to report more detailed error
information.
Fixes: 71bb244ba2 ("brcm80211: fmac: add USB support for bcm43235/6/8 chipsets")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://patch.msgid.link/20250422042203.2259-1-vulab@iscas.ac.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 38a05c0b87833f5b188ae43b428b1f792df2b384 upstream.
On Qualcomm chipsets not all GPIOs are wakeup capable. Those GPIOs do not
have a corresponding MPM pin and should not be handled inside the MPM
driver. The IRQ domain hierarchy is always applied, so it's required to
explicitly disconnect the hierarchy for those. The pinctrl-msm driver marks
these with GPIO_NO_WAKE_IRQ. qcom-pdc has a check for this, but
irq-qcom-mpm is currently missing the check. This is causing crashes when
setting up interrupts for non-wake GPIOs:
root@rb1:~# gpiomon -c gpiochip1 10
irq: IRQ159: trimming hierarchy from :soc@0:interrupt-controller@f200000-1
Unable to handle kernel paging request at virtual address ffff8000a1dc3820
Hardware name: Qualcomm Technologies, Inc. Robotics RB1 (DT)
pc : mpm_set_type+0x80/0xcc
lr : mpm_set_type+0x5c/0xcc
Call trace:
mpm_set_type+0x80/0xcc (P)
qcom_mpm_set_type+0x64/0x158
irq_chip_set_type_parent+0x20/0x38
msm_gpio_irq_set_type+0x50/0x530
__irq_set_trigger+0x60/0x184
__setup_irq+0x304/0x6bc
request_threaded_irq+0xc8/0x19c
edge_detector_setup+0x260/0x364
linereq_create+0x420/0x5a8
gpio_ioctl+0x2d4/0x6c0
Fix this by copying the check for GPIO_NO_WAKE_IRQ from qcom-pdc.c, so that
MPM is removed entirely from the hierarchy for non-wake GPIOs.
Fixes: a6199bb514 ("irqchip: Add Qualcomm MPM controller driver")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Signed-off-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexey Klimov <alexey.klimov@linaro.org>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250502-irq-qcom-mpm-fix-no-wake-v1-1-8a1eafcd28d4@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f04dd30f1bef1ed2e74a4050af6e5e5e3869bac3 upstream.
According to the XGMAC specification, enabling features such as Layer 3
and Layer 4 Packet Filtering, Split Header and Virtualized Network support
automatically selects the IPC Full Checksum Offload Engine on the receive
side.
When RX checksum offload is disabled, these dependent features must also
be disabled to prevent abnormal behavior caused by mismatched feature
dependencies.
Ensure that toggling RX checksum offload (disabling or enabling) properly
disables or enables all dependent features, maintaining consistent and
expected behavior in the network device.
Cc: stable@vger.kernel.org
Fixes: 1a510ccf58 ("amd-xgbe: Add support for VXLAN offload capabilities")
Signed-off-by: Vishal Badole <Vishal.Badole@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250424130248.428865-1-Vishal.Badole@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 58f6217e5d0132a9f14e401e62796916aa055c1b upstream.
When generating the MSR_IA32_PEBS_ENABLE value that will be loaded on
VM-Entry to a KVM guest, mask the value with the vCPU's desired PEBS_ENABLE
value. Consulting only the host kernel's host vs. guest masks results in
running the guest with PEBS enabled even when the guest doesn't want to use
PEBS. Because KVM uses perf events to proxy the guest virtual PMU, simply
looking at exclude_host can't differentiate between events created by host
userspace, and events created by KVM on behalf of the guest.
Running the guest with PEBS unexpectedly enabled typically manifests as
crashes due to a near-infinite stream of #PFs. E.g. if the guest hasn't
written MSR_IA32_DS_AREA, the CPU will hit page faults on address '0' when
trying to record PEBS events.
The issue is most easily reproduced by running `perf kvm top` from before
commit 7b100989b4 ("perf evlist: Remove __evlist__add_default") (after
which, `perf kvm top` effectively stopped using PEBS). The userspace side
of perf creates a guest-only PEBS event, which intel_guest_get_msrs()
misconstrues a guest-*owned* PEBS event.
Arguably, this is a userspace bug, as enabling PEBS on guest-only events
simply cannot work, and userspace can kill VMs in many other ways (there
is no danger to the host). However, even if this is considered to be bad
userspace behavior, there's zero downside to perf/KVM restricting PEBS to
guest-owned events.
Note, commit 854250329c ("KVM: x86/pmu: Disable guest PEBS temporarily
in two rare situations") fixed the case where host userspace is profiling
KVM *and* userspace, but missed the case where userspace is profiling only
KVM.
Fixes: c59a1f106f ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Closes: https://lore.kernel.org/all/Z_VUswFkWiTYI0eD@do-x1carbon
Reported-by: Seth Forshee <sforshee@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250426001355.1026530-1-seanjc@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit de3629baf5a33af1919dec7136d643b0662e85ef upstream.
Camm noticed that on parisc a SIGFPE exception will crash an application with
a second SIGFPE in the signal handler. Dave analyzed it, and it happens
because glibc uses a double-word floating-point store to atomically update
function descriptors. As a result of lazy binding, we hit a floating-point
store in fpe_func almost immediately.
When the T bit is set, an assist exception trap occurs when when the
co-processor encounters *any* floating-point instruction except for a double
store of register %fr0. The latter cancels all pending traps. Let's fix this
by clearing the Trap (T) bit in the FP status register before returning to the
signal handler in userspace.
The issue can be reproduced with this test program:
root@parisc:~# cat fpe.c
static void fpe_func(int sig, siginfo_t *i, void *v) {
sigset_t set;
sigemptyset(&set);
sigaddset(&set, SIGFPE);
sigprocmask(SIG_UNBLOCK, &set, NULL);
printf("GOT signal %d with si_code %ld\n", sig, i->si_code);
}
int main() {
struct sigaction action = {
.sa_sigaction = fpe_func,
.sa_flags = SA_RESTART|SA_SIGINFO };
sigaction(SIGFPE, &action, 0);
feenableexcept(FE_OVERFLOW);
return printf("%lf\n",1.7976931348623158E308*1.7976931348623158E308);
}
root@parisc:~# gcc fpe.c -lm
root@parisc:~# ./a.out
Floating point exception
root@parisc:~# strace -f ./a.out
execve("./a.out", ["./a.out"], 0xf9ac7034 /* 20 vars */) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
...
rt_sigaction(SIGFPE, {sa_handler=0x1110a, sa_mask=[], sa_flags=SA_RESTART|SA_SIGINFO}, NULL, 8) = 0
--- SIGFPE {si_signo=SIGFPE, si_code=FPE_FLTOVF, si_addr=0x1078f} ---
--- SIGFPE {si_signo=SIGFPE, si_code=FPE_FLTOVF, si_addr=0xf8f21237} ---
+++ killed by SIGFPE +++
Floating point exception
Signed-off-by: Helge Deller <deller@gmx.de>
Suggested-by: John David Anglin <dave.anglin@bell.net>
Reported-by: Camm Maguire <camm@maguirefamily.org>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fee4d171451c1ad9e8aaf65fc0ab7d143a33bd72 upstream.
Commit a5951389e58d ("arm64: errata: Add newer ARM cores to the
spectre_bhb_loop_affected() lists") added some additional CPUs to the
Spectre-BHB workaround, including some new arrays for designs that
require new 'k' values for the workaround to be effective.
Unfortunately, the new arrays omitted the sentinel entry and so
is_midr_in_range_list() will walk off the end when it doesn't find a
match. With UBSAN enabled, this leads to a crash during boot when
is_midr_in_range_list() is inlined (which was more common prior to
c8c2647e69be ("arm64: Make _midr_in_range_list() an exported
function")):
| Internal error: aarch64 BRK: 00000000f2000001 [#1] PREEMPT SMP
| pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : spectre_bhb_loop_affected+0x28/0x30
| lr : is_spectre_bhb_affected+0x170/0x190
| [...]
| Call trace:
| spectre_bhb_loop_affected+0x28/0x30
| update_cpu_capabilities+0xc0/0x184
| init_cpu_features+0x188/0x1a4
| cpuinfo_store_boot_cpu+0x4c/0x60
| smp_prepare_boot_cpu+0x38/0x54
| start_kernel+0x8c/0x478
| __primary_switched+0xc8/0xd4
| Code: 6b09011f 54000061 52801080 d65f03c0 (d4200020)
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: aarch64 BRK: Fatal exception
Add the missing sentinel entries.
Cc: Lee Jones <lee@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: <stable@vger.kernel.org>
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: a5951389e58d ("arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Lee Jones <lee@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20250501104747.28431-1-will@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bbe5679f30d7690a9b6838a583b9690ea73fe0e9 upstream.
Nouveau is mostly designed in a way that it's expected that fences only
ever get signaled through nouveau_fence_signal(). However, in at least
one other place, nouveau_fence_done(), can signal fences, too. If that
happens (race) a signaled fence remains in the pending list for a while,
until it gets removed by nouveau_fence_update().
Should nouveau_fence_context_kill() run in the meantime, this would be
a bug because the function would attempt to set an error code on an
already signaled fence.
Have nouveau_fence_context_kill() check for a fence being signaled.
Cc: stable@vger.kernel.org # v5.10+
Fixes: ea13e5abf8 ("drm/nouveau: signal pending fences when channel has been killed")
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250415121900.55719-3-phasta@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In 6.1 there is no pmdp_get() definition, so use *pmd directly, in order
to avoid such build error due to a recently backport:
arch/loongarch/mm/hugetlbpage.c: In function 'huge_pte_offset':
arch/loongarch/mm/hugetlbpage.c:50:25: error: implicit declaration of function 'pmdp_get'; did you mean 'ptep_get'? [-Wimplicit-function-declaration]
50 | return pmd_none(pmdp_get(pmd)) ? NULL : (pte_t *) pmd;
| ^~~~~~~~
| ptep_get
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/f978ec9a-b103-40af-b116-6a9238197110@roeck-us.net
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fab89a09c8 upstream.
To differentiate between long arrays and cpumasks, the __cpumask() field
was created. Part of the TRACE_EVENT() macros test if the type is signed
or not by using the is_signed_type() macro. The __cpumask() field used the
__dynamic_array() helper but because cpumask_t is a structure, it could
not be used in the is_signed_type() macro as that would fail to build, so
instead it passed in the pointer to cpumask_t.
Unfortunately, that creates in the format file:
field:__data_loc cpumask_t *[] mask; offset:36; size:4; signed:0;
Which looks like an array of pointers to cpumask_t and not a cpumask_t
type, which is misleading to user space parsers.
Douglas Raillard pointed out that the "[]" are also misleading, as
cpumask_t is not an array.
Since cpumask() hasn't been created yet, and the parsers currently fail on
it (but will still produce the raw output), make it be:
field:__data_loc cpumask_t mask; offset:36; size:4; signed:0;
Which is the correct type of the field.
Then the parsers can be updated to handle this.
Link: https://lore.kernel.org/lkml/6dda5e1d-9416-b55e-88f3-31d148bc925f@arm.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221212193814.0e3f1e43@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 8230f27b1c ("tracing: Add __cpumask to denote a trace event field that is a cpumask_t")
Reported-by: Douglas Raillard <douglas.raillard@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>