[ Upstream commit 2bf66e66d2e6feece6175ec09ec590a0a8563bdd ]
Commit c814bf9589 ("powerpc/selftests: Use timersub() for
gettimeofday()"), got the order of arguments to timersub() wrong,
leading to a negative time delta being reported, eg:
test: gettimeofday
tags: git_version:v6.12-rc5-409-gdddf291c3030
time = -3.297781
success: gettimeofday
The correct order is minuend, subtrahend, which in this case is end,
start. Which gives:
test: gettimeofday
tags: git_version:v6.12-rc5-409-gdddf291c3030-dirty
time = 3.300650
success: gettimeofday
Fixes: c814bf9589 ("powerpc/selftests: Use timersub() for gettimeofday()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241218114347.428108-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ac32057acc7f3d7a238dafaa9b2aa2bc9750080e ]
The xa_store() may fail due to memory allocation failure because there
is no guarantee that the index csi is already used. This fix adds an
error check of the return value of xa_store() in nvme_get_effects_log().
Fixes: 1cf7a12e09 ("nvme: use an xarray to lookup the Commands Supported and Effects log")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 844b8cdc681612ff24df62cdefddeab5772fadf1 ]
Following process can cause nbd_config UAF:
1) grab nbd_config temporarily;
2) nbd_genl_disconnect() flush all recv_work() and release the
initial reference:
nbd_genl_disconnect
nbd_disconnect_and_put
nbd_disconnect
flush_workqueue(nbd->recv_workq)
if (test_and_clear_bit(NBD_RT_HAS_CONFIG_REF, ...))
nbd_config_put
-> due to step 1), reference is still not zero
3) nbd_genl_reconfigure() queue recv_work() again;
nbd_genl_reconfigure
config = nbd_get_config_unlocked(nbd)
if (!config)
-> succeed
if (!test_bit(NBD_RT_BOUND, ...))
-> succeed
nbd_reconnect_socket
queue_work(nbd->recv_workq, &args->work)
4) step 1) release the reference;
5) Finially, recv_work() will trigger UAF:
recv_work
nbd_config_put(nbd)
-> nbd_config is freed
atomic_dec(&config->recv_threads)
-> UAF
Fix the problem by clearing NBD_RT_BOUND in nbd_genl_disconnect(), so
that nbd_genl_reconfigure() will fail.
Fixes: b7aa3d3938 ("nbd: add a reconfigure netlink command")
Reported-by: syzbot+6b0df248918b92c33e6a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/675bfb65.050a0220.1a2d0d.0006.GAE@google.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250103092859.3574648-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 457ef47c08d2979f3e59ce66267485c3faed70c8 ]
Set kernel config:
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_LOOP_MIN_COUNT=0
Do latter:
mknod loop0 b 7 0
exec 4<> loop0
Before commit e418de3abc ("block: switch gendisk lookup to a simple
xarray"), lookup_gendisk will first use base_probe to load module loop,
and then the retry will call loop_probe to prepare the loop disk. Finally
open for this disk will success. However, after this commit, we lose the
retry logic, and open will fail with ENXIO. Block device autoloading is
deprecated and will be removed soon, but maybe we should keep open success
until we really remove it. So, give a retry to fix it.
Fixes: e418de3abc ("block: switch gendisk lookup to a simple xarray")
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241209110435.3670985-1-yangerkun@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 07a10767853adcbdbf436dc91393b729b52c4e81 ]
The AFS directory format structure, union afs_xdr_dir_block::meta, has too
many alloc counter slots declared and so pushes the hash table along and
over the data. This doesn't cause a problem at the moment because I'm
currently ignoring the hash table and only using the correct number of
alloc_ctrs in the code anyway. In future, however, I should start using
the hash table to try and speed up afs_lookup().
Fix this by using the correct constant to declare the counter array.
Fixes: 4ea219a839 ("afs: Split the directory content defs into a header")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-14-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b49194da2aff2c879dec9c59ef8dec0f2b0809ef ]
AFS servers pass back a code indicating EEXIST when they're asked to remove
a directory that is not empty rather than ENOTEMPTY because not all the
systems that an AFS server can run on have the latter error available and
AFS preexisted the addition of that error in general.
Fix afs_rmdir() to translate EEXIST to ENOTEMPTY.
Fixes: 260a980317 ("[AFS]: Add "directory write" support.")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20241216204124.3752367-13-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d629d7a8efc33d05d62f4805c0ffb44727e3d99f ]
Commit 8597538712 ("powerpc/fadump: Do not use hugepages when fadump
is active") disabled hugetlb support when fadump is active by returning
early from hugetlbpage_init():arch/powerpc/mm/hugetlbpage.c and not
populating hpage_shift/HPAGE_SHIFT.
Later, commit 2354ad252b ("powerpc/mm: Update default hugetlb size
early") moved the allocation of hpage_shift/HPAGE_SHIFT to early boot,
which inadvertently re-enabled hugetlb support when fadump is active.
Fix this by implementing hugepages_supported() on powerpc. This ensures
that disabling hugetlb for the fadump kernel is independent of
hpage_shift/HPAGE_SHIFT.
Fixes: 2354ad252b ("powerpc/mm: Update default hugetlb size early")
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20241217074640.1064510-1-sourabhjain@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 48c6253fef upstream.
'Headphone Switch' control is already registered from
sound/soc/codecs/wm_hubs.c:479, so duplicating it in midas_wm1811
causes following probe failure:
midas-audio sound: control 2:0:0:Headphone Switch:0 is already present
midas-audio sound: ASoC: Failed to add Headphone Switch: -16
midas-audio sound: Failed to register card: -16
midas-audio: probe of sound failed with error -16
Fix this by dropping duplicated control.
Fixes: d27224a45e ("ASoC: samsung: midas_wm1811: Map missing jack kcontrols")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20230809100446.2105825-1-m.szyprowski@samsung.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 907bc9268a5a9f823ffa751957a5c1dd59f83f42 upstream.
Microsoft defined Meta+Shift+F23 as the Copilot shortcut instead of a
dedicated keycode, and multiple vendors have their keyboards emit this
sequence in response to users pressing a dedicated "Copilot" key.
Unfortunately the default keymap table in atkbd does not map scancode
0x6e (F23) and so the key combination does not work even if userspace
is ready to handle it.
Because this behavior is common between multiple vendors and the
scancode is currently unused map 0x6e to keycode 193 (KEY_F23) so that
key sequence is generated properly.
MS documentation for the scan code:
https://learn.microsoft.com/en-us/windows/win32/inputdev/about-keyboard-input#scan-codes
Confirmed on Lenovo, HP and Dell machines by Canonical.
Tested on Lenovo T14s G6 AMD.
Signed-off-by: Mark Pearson <mpearson-lenovo@squebb.ca>
Link: https://lore.kernel.org/r/20250107034554.25843-1-mpearson-lenovo@squebb.ca
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 575a5adf48b06a2980c9eeffedf699ed5534fade upstream.
This patch addresses a null-ptr-deref in qt2_process_read_urb() due to
an incorrect bounds check in the following:
if (newport > serial->num_ports) {
dev_err(&port->dev,
"%s - port change to invalid port: %i\n",
__func__, newport);
break;
}
The condition doesn't account for the valid range of the serial->port
buffer, which is from 0 to serial->num_ports - 1. When newport is equal
to serial->num_ports, the assignment of "port" in the
following code is out-of-bounds and NULL:
serial_priv->current_port = newport;
port = serial->port[serial_priv->current_port];
The fix checks if newport is greater than or equal to serial->num_ports
indicating it is out-of-bounds.
Reported-by: syzbot <syzbot+506479ebf12fe435d01a@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=506479ebf12fe435d01a
Fixes: f7a33e608d ("USB: serial: add quatech2 usb to serial driver")
Cc: <stable@vger.kernel.org> # 3.5
Signed-off-by: Qasim Ijaz <qasdev00@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 90e0569dd3d32f4f4d2ca691d3fa5a8a14a13c12 upstream.
The per-netns IP tunnel hash table is protected by the RTNL mutex and
ip_tunnel_find() is only called from the control path where the mutex is
taken.
Add a lockdep expression to hlist_for_each_entry_rcu() in
ip_tunnel_find() in order to validate that the mutex is held and to
silence the suspicious RCU usage warning [1].
[1]
WARNING: suspicious RCU usage
6.12.0-rc3-custom-gd95d9a31aceb #139 Not tainted
-----------------------------
net/ipv4/ip_tunnel.c:221 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by ip/362:
#0: ffffffff86fc7cb0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x377/0xf60
stack backtrace:
CPU: 12 UID: 0 PID: 362 Comm: ip Not tainted 6.12.0-rc3-custom-gd95d9a31aceb #139
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0xba/0x110
lockdep_rcu_suspicious.cold+0x4f/0xd6
ip_tunnel_find+0x435/0x4d0
ip_tunnel_newlink+0x517/0x7a0
ipgre_newlink+0x14c/0x170
__rtnl_newlink+0x1173/0x19c0
rtnl_newlink+0x6c/0xa0
rtnetlink_rcv_msg+0x3cc/0xf60
netlink_rcv_skb+0x171/0x450
netlink_unicast+0x539/0x7f0
netlink_sendmsg+0x8c1/0xd80
____sys_sendmsg+0x8f9/0xc20
___sys_sendmsg+0x197/0x1e0
__sys_sendmsg+0x122/0x1f0
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241023123009.749764-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alva Lan <alvalan9@foxmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 23dfdb56581ad92a9967bcd720c8c23356af74c1 upstream.
The following kernel trace can be triggered with fstest generic/629 when
executed against a filesystem with fast-commit feature enabled:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 PID: 866 Comm: mount Not tainted 6.10.0+ #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x66/0x90
register_lock_class+0x759/0x7d0
__lock_acquire+0x85/0x2630
? __find_get_block+0xb4/0x380
lock_acquire+0xd1/0x2d0
? __ext4_journal_get_write_access+0xd5/0x160
_raw_spin_lock+0x33/0x40
? __ext4_journal_get_write_access+0xd5/0x160
__ext4_journal_get_write_access+0xd5/0x160
ext4_reserve_inode_write+0x61/0xb0
__ext4_mark_inode_dirty+0x79/0x270
? ext4_ext_replay_set_iblocks+0x2f8/0x450
ext4_ext_replay_set_iblocks+0x330/0x450
ext4_fc_replay+0x14c8/0x1540
? jread+0x88/0x2e0
? rcu_is_watching+0x11/0x40
do_one_pass+0x447/0xd00
jbd2_journal_recover+0x139/0x1b0
jbd2_journal_load+0x96/0x390
ext4_load_and_init_journal+0x253/0xd40
ext4_fill_super+0x2cc6/0x3180
...
In the replay path there's an attempt to lock sbi->s_bdev_wb_lock in
function ext4_check_bdev_write_error(). Unfortunately, at this point this
spinlock has not been initialized yet. Moving it's initialization to an
earlier point in __ext4_fill_super() fixes this splat.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Link: https://patch.msgid.link/20240718094356.7863-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Bruno VERNAY <bruno.vernay@se.com>
Signed-off-by: Victor Giraud <vgiraud.opensource@witekio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3d88ba86ba6f35a0467f25a88c38aa5639190d04 upstream.
This reverts commit 251efae73bd46b097deec4f9986d926813aed744.
Quoting Wang Yuli:
"The 27C6:01E0 touchpad doesn't require the workaround and applying it
would actually break functionality.
The initial report came from a BBS forum, but we suspect the
information provided by the forum user may be incorrect which could
happen sometimes. [1]
Further investigation showed that the Lenovo Y9000P 2024 doesn't even
use a Goodix touchpad. [2]
For the broader issue of 27c6:01e0 being unusable on some devices, it
just need to address it with a libinput quirk.
In conclusion, we should revert this commit, which is the best
solution."
Reported-by: Ulrich Müller <ulm@gentoo.org>
Reported-by: WangYuli <wangyuli@uniontech.com>
Link: https://lore.kernel.org/all/uikt4wwpw@gentoo.org/
Signed-off-by: Jiri Kosina <jkosina@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c9d9223802fbed4dee1ae301661bf346964c9d2 upstream.
Truncate an inode's address space when flipping the GFS2_DIF_JDATA flag:
depending on that flag, the pages in the address space will either use
buffer heads or iomap_folio_state structs, and we cannot mix the two.
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>, Jiaji Qin <jjtan24@m.fudan.edu.cn>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9c235dfc3d3f901fe22acb20f2ab37ff39f2ce02 ]
When we're recovering ondisk quota records from the log, we need to
validate the recovered buffer contents before writing them to disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ed17f7da5f0c8b65b7b5f7c98beb0aadbc0546ee ]
Since the introduction of xfs_dqblk in V5, xfs really ought to find the
dqblk pointer from the dquot buffer, then compute the xfs_disk_dquot
pointer from the dqblk pointer. Fix the open-coded xfs_buf_offset calls
and do the type checking in the correct order.
Note that this has made no practical difference since the start of the
xfs_disk_dquot is coincident with the start of the xfs_dqblk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 038ca189c0d2c1570b4d922f25b524007c85cf94 ]
Discovered when trying to track down a weird recovery corruption
issue that wasn't detected at recovery time.
The specific corruption was a zero extent count field when big
extent counts are in use, and it turns out the dinode verifier
doesn't detect that specific corruption case, either. So fix it too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f63a5b3769ad7659da4c0420751d78958ab97675 ]
We've been seeing XFS errors like the following:
XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c. Caller xfs_btree_insert+0x1ec/0x280
...
Call Trace:
xfs_corruption_error+0x94/0xa0
xfs_btree_insert+0x221/0x280
xfs_alloc_fixup_trees+0x104/0x3e0
xfs_alloc_ag_vextent_size+0x667/0x820
xfs_alloc_fix_freelist+0x5d9/0x750
xfs_free_extent_fix_freelist+0x65/0xa0
__xfs_free_extent+0x57/0x180
...
This is the XFS_IS_CORRUPT() check in xfs_btree_insert() when
xfs_btree_insrec() fails.
After converting this into a panic and dissecting the core dump, I found
that xfs_btree_insrec() is failing because it's trying to split a leaf
node in the cntbt when the AG free list is empty. In particular, it's
failing to get a block from the AGFL _while trying to refill the AGFL_.
If a single operation splits every level of the bnobt and the cntbt (and
the rmapbt if it is enabled) at once, the free list will be empty. Then,
when the next operation tries to refill the free list, it allocates
space. If the allocation does not use a full extent, it will need to
insert records for the remaining space in the bnobt and cntbt. And if
those new records go in full leaves, the leaves (and potentially more
nodes up to the old root) need to be split.
Fix it by accounting for the additional splits that may be required to
refill the free list in the calculation for the minimum free list size.
P.S. As far as I can tell, this bug has existed for a long time -- maybe
back to xfs-history commit afdf80ae7405 ("Add XFS_AG_MAXLEVELS macros
...") in April 1994! It requires a very unlucky sequence of events, and
in fact we didn't hit it until a particular sparse mmap workload updated
from 5.12 to 5.19. But this bug existed in 5.12, so it must've been
exposed by some other change in allocation or writeback patterns. It's
also much less likely to be hit with the rmapbt enabled, since that
increases the minimum free list size and is unlikely to split at the
same time as the bnobt and cntbt.
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 471de20303dda0b67981e06d59cc6c4a83fd2a3c ]
We flush the data device cache before we issue external log IO. If
the flush fails, we shut down the log immediately and return. However,
the iclog->ic_sema is left in a decremented state so let's add an up().
Prior to this patch, xfs/438 would fail consistently when running with
an external log device:
sync
-> xfs_log_force
-> xlog_write_iclog
-> down(&iclog->ic_sema)
-> blkdev_issue_flush (fail causes us to intiate shutdown)
-> xlog_force_shutdown
-> return
unmount
-> xfs_log_umount
-> xlog_wait_iclog_completion
-> down(&iclog->ic_sema) --------> HANG
There is a second early return / shutdown. Make sure the up() happens
for it as well. Also make sure we cleanup the iclog state,
xlog_state_done_syncing, before dropping the iclog lock.
Fixes: b5d721eaae ("xfs: external logs need to flush data device")
Fixes: 842a42d126 ("xfs: shutdown on failure to add page to log bio")
Fixes: 7d839e325a ("xfs: check return codes when flushing block devices")
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 55f669f34184ecb25b8353f29c7f6f1ae5b313d1 ]
xfs_reflink_end_cow_extent looks up the COW extent and the data fork
extent at offset_fsb, and then proceeds to remap the common subset
between the two.
It does however not limit the remapped extent to the passed in
[*offset_fsbm end_fsb] range and thus potentially remaps more blocks than
the one handled by the current I/O completion. This means that with
sufficiently large data and COW extents we could be remapping COW fork
mappings that have not been written to, leading to a stale data exposure
on a powerfail event.
We use to have a xfs_trim_range to make the remap fit the I/O completion
range, but that got (apparently accidentally) removed in commit
df2fd88f8a ("xfs: rewrite xfs_reflink_end_cow to use intents").
Note that I've only found this by code inspection, and a test case would
probably require very specific delay and error injection.
Fixes: df2fd88f8a ("xfs: rewrite xfs_reflink_end_cow to use intents")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f8f9d952e42dd49ae534f61f2fa7ca0876cb9848 ]
When recovering intents, we capture newly created intent items as part of
committing recovered intent items. If intent recovery fails at a later
point, we forget to remove those newly created intent items from the AIL
and hang:
[root@localhost ~]# cat /proc/539/stack
[<0>] xfs_ail_push_all_sync+0x174/0x230
[<0>] xfs_unmount_flush_inodes+0x8d/0xd0
[<0>] xfs_mountfs+0x15f7/0x1e70
[<0>] xfs_fs_fill_super+0x10ec/0x1b20
[<0>] get_tree_bdev+0x3c8/0x730
[<0>] vfs_get_tree+0x89/0x2c0
[<0>] path_mount+0xecf/0x1800
[<0>] do_mount+0xf3/0x110
[<0>] __x64_sys_mount+0x154/0x1f0
[<0>] do_syscall_64+0x39/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
When newly created intent items fail to commit via transaction, intent
recovery hasn't created done items for these newly created intent items,
so the capture structure is the sole owner of the captured intent items.
We must release them explicitly or else they leak:
unreferenced object 0xffff888016719108 (size 432):
comm "mount", pid 529, jiffies 4294706839 (age 144.463s)
hex dump (first 32 bytes):
08 91 71 16 80 88 ff ff 08 91 71 16 80 88 ff ff ..q.......q.....
18 91 71 16 80 88 ff ff 18 91 71 16 80 88 ff ff ..q.......q.....
backtrace:
[<ffffffff8230c68f>] xfs_efi_init+0x18f/0x1d0
[<ffffffff8230c720>] xfs_extent_free_create_intent+0x50/0x150
[<ffffffff821b671a>] xfs_defer_create_intents+0x16a/0x340
[<ffffffff821bac3e>] xfs_defer_ops_capture_and_commit+0x8e/0xad0
[<ffffffff82322bb9>] xfs_cui_item_recover+0x819/0x980
[<ffffffff823289b6>] xlog_recover_process_intents+0x246/0xb70
[<ffffffff8233249a>] xlog_recover_finish+0x8a/0x9a0
[<ffffffff822eeafb>] xfs_log_mount_finish+0x2bb/0x4a0
[<ffffffff822c0f4f>] xfs_mountfs+0x14bf/0x1e70
[<ffffffff822d1f80>] xfs_fs_fill_super+0x10d0/0x1b20
[<ffffffff81a21fa2>] get_tree_bdev+0x3d2/0x6d0
[<ffffffff81a1ee09>] vfs_get_tree+0x89/0x2c0
[<ffffffff81a9f35f>] path_mount+0xecf/0x1800
[<ffffffff81a9fd83>] do_mount+0xf3/0x110
[<ffffffff81aa00e4>] __x64_sys_mount+0x154/0x1f0
[<ffffffff83968739>] do_syscall_64+0x39/0x80
Fix the problem above by abort intent items that don't have a done item
when recovery intents fail.
Fixes: e6fff81e48 ("xfs: proper replay of deferred ops queued during log recovery")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2a5db859c6825b5d50377dda9c3cc729c20cad43 ]
Factor out xfs_defer_pending_abort() from xfs_defer_trans_abort(), which
not use transaction parameter, so it can be used after the transaction
life cycle.
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 14a537983b228cb050ceca3a5b743d01315dc4aa ]
One of our VM cluster management products needs to snapshot KVM image
files so that they can be restored in case of failure. Snapshotting is
done by redirecting VM disk writes to a sidecar file and using reflink
on the disk image, specifically the FICLONE ioctl as used by
"cp --reflink". Reflink locks the source and destination files while it
operates, which means that reads from the main vm disk image are blocked,
causing the vm to stall. When an image file is heavily fragmented, the
copy process could take several minutes. Some of the vm image files have
50-100 million extent records, and duplicating that much metadata locks
the file for 30 minutes or more. Having activities suspended for such
a long time in a cluster node could result in node eviction.
Clone operations and read IO do not change any data in the source file,
so they should be able to run concurrently. Demote the exclusive locks
taken by FICLONE to shared locks to allow reads while cloning. While a
clone is in progress, writes will take the IOLOCK_EXCL, so they block
until the clone completes.
Link: https://lore.kernel.org/linux-xfs/8911B94D-DD29-4D6E-B5BC-32EAF1866245@oracle.com/
Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 35dc55b9e80cb9ec4bcb969302000b002b2ed850 ]
If xfs_bmapi_write finds a delalloc extent at the requested range, it
tries to convert the entire delalloc extent to a real allocation.
But if the allocator cannot find a single free extent large enough to
cover the start block of the requested range, xfs_bmapi_write will
return 0 but leave *nimaps set to 0.
In that case we simply need to keep looping with the same startoffset_fsb
so that one of the following allocations will eventually reach the
requested range.
Note that this could affect any caller of xfs_bmapi_write that covers
an existing delayed allocation. As far as I can tell we do not have
any other such caller, though - the regular writeback path uses
xfs_bmapi_convert_delalloc to convert delayed allocations to real ones,
and direct I/O invalidates the page cache first.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f6a2dae2a1f52ea23f649c02615d073beba4cc35 ]
In commit 2a6ca4baed, we tried to fix an overflow problem in the
realtime allocator that was caused by an overly large maxlen value
causing xfs_rtcheck_range to run off the end of the realtime bitmap.
Unfortunately, there is a subtle bug here -- maxlen (and minlen) both
have to be aligned with @prod, but @prod can be larger than 1 if the
user has set an extent size hint on the file, and that extent size hint
is larger than the realtime extent size.
If the rt free space extents are not aligned to this file's extszhint
because other files without extent size hints allocated space (or the
number of rt extents is similarly not aligned), then it's possible that
maxlen after clamping to sb_rextents will no longer be aligned to prod.
The allocation will succeed just fine, but we still trip the assertion.
Fix the problem by reducing maxlen by any misalignment with prod. While
we're at it, split the assertions into two so that we can tell which
value had the bad alignment.
Fixes: 2a6ca4baed ("xfs: make sure the rt allocator doesn't run off the end")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ddd98076d5c075c8a6c49d9e6e8ee12844137f23 ]
The unit conversions in this function do not make sense. First we
convert a block count to bytes, then divide that bytes value by
rextsize, which is in blocks, to get an rt extent count. You can't
divide bytes by blocks to get a (possibly multiblock) extent value.
Fortunately nobody uses delalloc on the rt volume so this hasn't
mattered.
Fixes: fa5c836ca8 ("xfs: refactor xfs_bunmapi_cow")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c2988eb5cff75c02bc57e02c323154aa08f55b78 ]
When realtime support is not compiled into the kernel, these functions
should return negative errnos, not positive errnos. While we're at it,
fix a broken macro declaration.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b73494fa9a304ab95b59f07845e8d7d36e4d23e0 ]
Quotas aren't (yet) supported with realtime, so we shouldn't allow
userspace to set up a realtime section when quotas are enabled, even if
they attached one via mount options. IOWS, you shouldn't be able to do:
# mkfs.xfs -f /dev/sda
# mount /dev/sda /mnt -o rtdev=/dev/sdb,usrquota
# xfs_growfs -r /mnt
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6c664484337b37fa0cf6e958f4019623e30d40f7 ]
Currently, xfs_bmap_del_extent_real contains a bunch of code to convert
the physical extent of a data fork mapping for a realtime file into rt
extents and pass that to the rt extent freeing function. Since the
details of this aren't needed when CONFIG_XFS_REALTIME=n, move it to
xfs_rtbitmap.c to reduce code size when realtime isn't enabled.
This will (one day) enable realtime EFIs to reuse the same
unit-converting call with less code duplication.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9488062805943c2d63350d3ef9e4dc093799789a ]
The latest version of the fs geometry structure is v5. Bump this
constant so that xfs_db and mkfs calls to libxfs_fs_geometry will fill
out all the fields.
IOWs, this commit is a no-op for the kernel, but will be useful for
userspace reporting in later changes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>