linux

mirror of https://github.com/hardkernel/linux.git synced 2026-06-11 13:27:06 +09:00

Author	SHA1	Message	Date
Darrick J. Wong	537baedb3e	xfs: estimate post-merge refcounts correctly [ Upstream commit `b25d1984aa` ] Upon enabling fsdax + reflink for XFS, xfs/179 began to report refcount metadata corruptions after being run. Specifically, xfs_repair noticed single-block refcount records that could be combined but had not been. The root cause of this is improper MAXREFCOUNT edge case handling in xfs_refcount_merge_extents. When we're trying to find candidates for a refcount btree record merge, we compute the refcount attribute of the merged record, but we fail to account for the fact that once a record hits rc_refcount == MAXREFCOUNT, it is pinned that way forever. Hence the computed refcount is wrong, and we fail to merge the extents. Fix this by adjusting the merge predicates to compute the adjusted refcount correctly. Fixes: `3172725814` ("xfs: adjust refcount of an extent of blocks in refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:33 +02:00
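The MAXREFCOUNT edge case described in the commit above can be modelled in user space. This is a minimal Python sketch, not the kernel code: the value given to `MAXREFCOUNT` is an assumption, and `merge_refcount`, `can_merge_naive` and `can_merge_fixed` are hypothetical names used only for illustration. It shows how a predicate that compares a neighbour against the unclamped `refcount + adjust` rejects a merge that the clamped computation accepts.

```python
MAXREFCOUNT = 0xFFFFFFFF  # assumed pin value for this model


def merge_refcount(refcount: int, adjust: int) -> int:
    """Refcount a record would carry after applying adjust (+1 or -1).

    A record that has already reached MAXREFCOUNT is pinned and never moves.
    """
    if refcount == MAXREFCOUNT:
        return MAXREFCOUNT
    return refcount + adjust


def can_merge_naive(neighbor: int, target: int, adjust: int) -> bool:
    # Ignores pinning: assumes the target always changes by adjust.
    return neighbor == target + adjust


def can_merge_fixed(neighbor: int, target: int, adjust: int) -> bool:
    # Accounts for pinning when predicting the post-adjust refcount.
    return neighbor == merge_refcount(target, adjust)


# Two adjacent records, both pinned at MAXREFCOUNT, with a decrement pending.
print(can_merge_naive(MAXREFCOUNT, MAXREFCOUNT, -1))  # False: merge is missed
print(can_merge_fixed(MAXREFCOUNT, MAXREFCOUNT, -1))  # True: records combine
```

In this model the naive predicate leaves two mergeable records separate, which is the kind of leftover that xfs_repair reported.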
Darrick J. Wong	131a854c09	xfs: hoist refcount record merge predicates [ Upstream commit `9d720a5a65` ] Hoist these multiline conditionals into separate static inline helpers to improve readability and set the stage for corruption fixes that will be introduced in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:33 +02:00
Guo Xuenan	0d889ae85f	xfs: fix super block buf log item UAF during force shutdown [ Upstream commit `575689fc0f` ] An XFS log I/O error triggers an xlog shutdown, and the end_io worker calls xlog_state_shutdown_callbacks to unpin and release the buf log item. The race is that threads committing transactions that happen not to be intercepted by xlog_is_shutdown still insert their log items into the CIL; when these buf log items are then unpinned and released, a UAF occurs. BTW, adding a delay before `xlog_cil_commit` increases the reproduction probability. The following call graph actually encountered this bad situation. fsstress io end worker kworker/0:1H-216 xlog_ioend_work ->xlog_force_shutdown ->xlog_state_shutdown_callbacks ->xlog_cil_process_committed ->xlog_cil_committed ->xfs_trans_committed_bulk ->xfs_trans_apply_sb_deltas ->li_ops->iop_unpin(lip, 1); ->xfs_trans_getsb ->_xfs_trans_bjoin ->xfs_buf_item_init ->if (bip) { return 0;} //relog ->xlog_cil_commit ->xlog_cil_insert_items //insert into CIL ->xfs_buf_ioend_fail(bp); ->xfs_buf_ioend ->xfs_buf_item_done ->xfs_buf_item_relse ->xfs_buf_item_free When the CIL push worker gathers the percpu CIL and inserts the super block buf log item into ctx->log_items, the UAF occurs. 
================================================================== BUG: KASAN: use-after-free in xlog_cil_push_work+0x1c8f/0x22f0 Write of size 8 at addr ffff88801800f3f0 by task kworker/u4:4/105 CPU: 0 PID: 105 Comm: kworker/u4:4 Tainted: G W 6.1.0-rc1-00001-g274115149b42 #136 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: xfs-cil/sda xlog_cil_push_work Call Trace: <TASK> dump_stack_lvl+0x4d/0x66 print_report+0x171/0x4a6 kasan_report+0xb3/0x130 xlog_cil_push_work+0x1c8f/0x22f0 process_one_work+0x6f9/0xf70 worker_thread+0x578/0xf30 kthread+0x28c/0x330 ret_from_fork+0x1f/0x30 </TASK> Allocated by task 2145: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_slab_alloc+0x54/0x60 kmem_cache_alloc+0x14a/0x510 xfs_buf_item_init+0x160/0x6d0 _xfs_trans_bjoin+0x7f/0x2e0 xfs_trans_getsb+0xb6/0x3f0 xfs_trans_apply_sb_deltas+0x1f/0x8c0 __xfs_trans_commit+0xa25/0xe10 xfs_symlink+0xe23/0x1660 xfs_vn_symlink+0x157/0x280 vfs_symlink+0x491/0x790 do_symlinkat+0x128/0x220 __x64_sys_symlink+0x7a/0x90 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 216: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x40 __kasan_slab_free+0x105/0x1a0 kmem_cache_free+0xb6/0x460 xfs_buf_ioend+0x1e9/0x11f0 xfs_buf_item_unpin+0x3d6/0x840 xfs_trans_committed_bulk+0x4c2/0x7c0 xlog_cil_committed+0xab6/0xfb0 xlog_cil_process_committed+0x117/0x1e0 xlog_state_shutdown_callbacks+0x208/0x440 xlog_force_shutdown+0x1b3/0x3a0 xlog_ioend_work+0xef/0x1d0 process_one_work+0x6f9/0xf70 worker_thread+0x578/0xf30 kthread+0x28c/0x330 ret_from_fork+0x1f/0x30 The buggy address belongs to the object at ffff88801800f388 which belongs to the cache xfs_buf_item of size 272 The buggy address is located 104 bytes inside of 272-byte region [ffff88801800f388, ffff88801800f498) The buggy address belongs to the physical page: page:ffffea0000600380 refcount:1 mapcount:0 mapping:0000000000000000 
index:0xffff88801800f208 pfn:0x1800e head:ffffea0000600380 order:1 compound_mapcount:0 compound_pincount:0 flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff) raw: 001fffff80010200 ffffea0000699788 ffff88801319db50 ffff88800fb50640 raw: ffff88801800f208 000000000015000a 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88801800f280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88801800f300: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88801800f380: fc fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88801800f400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88801800f480: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Disabling lock debugging due to kernel taint Signed-off-by: Guo Xuenan <guoxuenan@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:33 +02:00
Guo Xuenan	2f1eb71ae8	xfs: wait iclog complete before tearing down AIL [ Upstream commit `1eb52a6a71` ] Fix a UAF in xfs_trans_ail_delete during xlog force shutdown. Commit `cd6f79d1fb` ("xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks") changed the order of running callbacks and waiting for iclog completion to keep the unmount path from destroying the AIL prematurely. That does not seem to be enough to ensure this, as adding an mdelay in `xfs_buf_item_unpin` demonstrates. The reproduction is as follows. To destroy the AIL safely, we should wait for all xlog ioend workers to finish and sync the AIL. ================================================================== BUG: KASAN: use-after-free in xfs_trans_ail_delete+0x240/0x2a0 Read of size 8 at addr ffff888023169400 by task kworker/1:1H/43 CPU: 1 PID: 43 Comm: kworker/1:1H Tainted: G W 6.1.0-rc1-00002-gc28266863c4a #137 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: xfs-log/sda xlog_ioend_work Call Trace: <TASK> dump_stack_lvl+0x4d/0x66 print_report+0x171/0x4a6 kasan_report+0xb3/0x130 xfs_trans_ail_delete+0x240/0x2a0 xfs_buf_item_done+0x7b/0xa0 xfs_buf_ioend+0x1e9/0x11f0 xfs_buf_item_unpin+0x4c8/0x860 xfs_trans_committed_bulk+0x4c2/0x7c0 xlog_cil_committed+0xab6/0xfb0 xlog_cil_process_committed+0x117/0x1e0 xlog_state_shutdown_callbacks+0x208/0x440 xlog_force_shutdown+0x1b3/0x3a0 xlog_ioend_work+0xef/0x1d0 process_one_work+0x6f9/0xf70 worker_thread+0x578/0xf30 kthread+0x28c/0x330 ret_from_fork+0x1f/0x30 </TASK> Allocated by task 9606: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_kmalloc+0x7a/0x90 __kmalloc+0x59/0x140 kmem_alloc+0xb2/0x2f0 xfs_trans_ail_init+0x20/0x320 xfs_log_mount+0x37e/0x690 xfs_mountfs+0xe36/0x1b40 xfs_fs_fill_super+0xc5c/0x1a70 get_tree_bdev+0x3c5/0x6c0 vfs_get_tree+0x85/0x250 path_mount+0xec3/0x1830 do_mount+0xef/0x110 __x64_sys_mount+0x150/0x1f0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 9662: 
kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x40 __kasan_slab_free+0x105/0x1a0 __kmem_cache_free+0x99/0x2d0 kvfree+0x3a/0x40 xfs_log_unmount+0x60/0xf0 xfs_unmountfs+0xf3/0x1d0 xfs_fs_put_super+0x78/0x300 generic_shutdown_super+0x151/0x400 kill_block_super+0x9a/0xe0 deactivate_locked_super+0x82/0xe0 deactivate_super+0x91/0xb0 cleanup_mnt+0x32a/0x4a0 task_work_run+0x15f/0x240 exit_to_user_mode_prepare+0x188/0x190 syscall_exit_to_user_mode+0x12/0x30 do_syscall_64+0x42/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The buggy address belongs to the object at ffff888023169400 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 0 bytes inside of 128-byte region [ffff888023169400, ffff888023169480) The buggy address belongs to the physical page: page:ffffea00008c5a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888023168f80 pfn:0x23168 head:ffffea00008c5a00 order:1 compound_mapcount:0 compound_pincount:0 flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff) raw: 001fffff80010200 ffffea00006b3988 ffffea0000577a88 ffff88800f842ac0 raw: ffff888023168f80 0000000000150007 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888023169300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888023169380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888023169400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888023169480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888023169500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Disabling lock debugging due to kernel taint Fixes: `cd6f79d1fb` ("xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks") Signed-off-by: Guo Xuenan <guoxuenan@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:33 +02:00
Darrick J. Wong	e62c784a56	xfs: attach dquots to inode before reading data/cow fork mappings [ Upstream commit `4c6dbfd275` ] I've been running near-continuous integration testing of online fsck, and I've noticed that once a day, one of the ARM VMs will fail the test with out of order records in the data fork. xfs/804 races fsstress with online scrub (aka scan but do not change anything), so I think this might be a bug in the core xfs code. This also only seems to trigger if one runs the test for more than ~6 minutes via TIME_FACTOR=13 or something. https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/tree/tests/xfs/804?h=djwong-wtf I added a debugging patch to the kernel to check the data fork extents after taking the ILOCK, before dropping ILOCK, and before and after each bmapping operation. So far I've narrowed it down to the delalloc code inserting a record in the wrong place in the iext tree: xfs_bmap_add_extent_hole_delay, near line 2691: case 0: /* * New allocation is not contiguous with another * delayed allocation. * Insert a new entry. */ oldlen = newlen = 0; xfs_iunlock_check_datafork(ip); <-- ok here xfs_iext_insert(ip, icur, new, state); xfs_iunlock_check_datafork(ip); <-- bad here break; } I recorded the state of the data fork mappings and iext cursor state when a corrupt data fork is detected immediately after the xfs_bmap_add_extent_hole_delay call in xfs_bmapi_reserve_delalloc: ino 0x140bb3 func xfs_bmapi_reserve_delalloc line 4164 data fork: ino 0x140bb3 nr 0x0 nr_real 0x0 offset 0xb9 blockcount 0x1f startblock 0x935de2 state 1 ino 0x140bb3 nr 0x1 nr_real 0x1 offset 0xe6 blockcount 0xa startblock 0xffffffffe0007 state 0 ino 0x140bb3 nr 0x2 nr_real 0x1 offset 0xd8 blockcount 0xe startblock 0x935e01 state 0 Here we see that a delalloc extent was inserted into the wrong position in the iext leaf, same as all the other times. 
The extra trace data I collected are as follows: ino 0x140bb3 fork 0 oldoff 0xe6 oldlen 0x4 oldprealloc 0x6 isize 0xe6000 ino 0x140bb3 oldgotoff 0xea oldgotstart 0xfffffffffffffffe oldgotcount 0x0 oldgotstate 0 ino 0x140bb3 crapgotoff 0x0 crapgotstart 0x0 crapgotcount 0x0 crapgotstate 0 ino 0x140bb3 freshgotoff 0xd8 freshgotstart 0x935e01 freshgotcount 0xe freshgotstate 0 ino 0x140bb3 nowgotoff 0xe6 nowgotstart 0xffffffffe0007 nowgotcount 0xa nowgotstate 0 ino 0x140bb3 oldicurpos 1 oldleafnr 2 oldleaf 0xfffffc00f0609a00 ino 0x140bb3 crapicurpos 2 crapleafnr 2 crapleaf 0xfffffc00f0609a00 ino 0x140bb3 freshicurpos 1 freshleafnr 2 freshleaf 0xfffffc00f0609a00 ino 0x140bb3 newicurpos 1 newleafnr 3 newleaf 0xfffffc00f0609a00 The first line shows that xfs_bmapi_reserve_delalloc was called with whichfork=XFS_DATA_FORK, off=0xe6, len=0x4, prealloc=6. The second line ("oldgot") shows the contents of @got at the beginning of the call, which are the results of the first iext lookup in xfs_buffered_write_iomap_begin. Line 3 ("crapgot") is the result of duplicating the cursor at the start of the body of xfs_bmapi_reserve_delalloc and performing a fresh lookup at @off. Line 4 ("freshgot") is the result of a new xfs_iext_get_extent right before the call to xfs_bmap_add_extent_hole_delay. Totally garbage. Line 5 ("nowgot") is the contents of @got after the xfs_bmap_add_extent_hole_delay call. Line 6 is the contents of @icur at the beginning of the call. Lines 7-9 are the contents of the iext cursors at the point where the block mappings were sampled. I think @oldgot is a HOLESTARTBLOCK extent because the first lookup didn't find anything, so we filled in imap with "fake hole until the end". At the time of the first lookup, I suspect that there's only one 32-block unwritten extent in the mapping (hence oldicurpos==1) but by the time we get to recording crapgot, crapicurpos==2. 
Dave then added: Ok, that's much simpler to reason about, and implies the smoke is coming from xfs_buffered_write_iomap_begin() or xfs_bmapi_reserve_delalloc(). I suspect the former - it does a lot of stuff with the ILOCK_EXCL held..... .... including calling xfs_qm_dqattach_locked(). xfs_buffered_write_iomap_begin ILOCK_EXCL look up icur xfs_qm_dqattach_locked xfs_qm_dqattach_one xfs_qm_dqget_inode dquot cache miss xfs_iunlock(ip, XFS_ILOCK_EXCL); error = xfs_qm_dqread(mp, id, type, can_alloc, &dqp); xfs_ilock(ip, XFS_ILOCK_EXCL); .... xfs_bmapi_reserve_delalloc(icur) Yup, that's what is letting the magic smoke out - xfs_qm_dqattach_locked() can cycle the ILOCK. If that happens, we can pass a stale icur to xfs_bmapi_reserve_delalloc() and it all goes downhill from there. Back to Darrick now: So. Fix this by moving the dqattach_locked call up before we take the ILOCK, like all the other callers in that file. Fixes: `a526c85c22` ("xfs: move xfs_file_iomap_begin_delay around") # goes further back than this Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:33 +02:00
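The stale-cursor hazard described above can be illustrated with a toy model. This is a simplified Python sketch under stated assumptions, not the XFS extent tree: the extent list is a plain sorted list of start offsets, `lookup` is a hypothetical helper, and the interleaving (another writer inserting an extent at 0xd8 while the lock is dropped) is assumed for illustration. The offsets are borrowed from the trace in the commit message.

```python
import bisect


def lookup(extents, offset):
    """Return the index at which an extent starting at offset belongs."""
    return bisect.bisect_left(extents, offset)


extents = [0xB9]                  # one existing mapping, sorted by offset
cursor = lookup(extents, 0xE6)    # cursor sampled while the lock is held

# The lock is cycled here (modelling the dquot cache miss); in this model
# another writer inserts a mapping at 0xd8 before the lock is retaken.
bisect.insort(extents, 0xD8)

stale = list(extents)
stale.insert(cursor, 0xE6)        # reuse the cursor sampled before the cycle

fresh = list(extents)
fresh.insert(lookup(fresh, 0xE6), 0xE6)   # repeat the lookup after relocking

print([hex(x) for x in stale])    # ['0xb9', '0xe6', '0xd8'] (out of order)
print([hex(x) for x in fresh])    # ['0xb9', '0xd8', '0xe6']
```

The sketch repeats the lookup only to contrast the two outcomes. The commit takes a different route: it attaches the dquots before taking the ILOCK, so the lock is not cycled between the lookup and the use of the cursor.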
Darrick J. Wong	5465403341	xfs: invalidate block device page cache during unmount [ Upstream commit `032e160305` ] Every now and then I see fstests failures on aarch64 (64k pages) that trigger on the following sequence: mkfs.xfs $dev mount $dev $mnt touch $mnt/a umount $mnt xfs_db -c 'path /a' -c 'print' $dev 99% of the time this succeeds, but every now and then xfs_db cannot find /a and fails. This turns out to be a race involving udev/blkid, the page cache for the block device, and the xfs_db process. udev is triggered whenever anyone closes a block device or unmounts it. The default udev rules invoke blkid to read the fs super and create symlinks to the bdev under /dev/disk. For this, it uses buffered reads through the page cache. xfs_db also uses buffered reads to examine metadata. There is no coordination between xfs_db and udev, which means that they can run concurrently. Note there is no coordination between the kernel and blkid either. On a system with 64k pages, the page cache can cache the superblock and the root inode (and hence the root dir) with the same 64k page. If udev spawns blkid after the mkfs and the system is busy enough that it is still running when xfs_db starts up, they'll both read from the same page in the pagecache. The unmount writes updated inode metadata to disk directly. The XFS buffer cache does not use the bdev pagecache, nor does it invalidate the pagecache on umount. If the above scenario occurs, the pagecache no longer reflects what's on disk, xfs_db reads the stale metadata, and fails to find /a. Most of the time this succeeds because closing a bdev invalidates the page cache, but when processes race, everyone loses. Fix the problem by invalidating the bdev pagecache after flushing the bdev, so that xfs_db will see up to date metadata. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:32 +02:00
Long Li	781f80e519	xfs: fix incorrect i_nlink caused by inode racing [ Upstream commit `28b4b05963` ] The following error occurred during the fsstress test: XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452 The problem was that an inode race condition causes an incorrect i_nlink to be written to disk, which is then read back into memory. Consider the following call graph: for inodes that are marked as both XFS_IFLUSHING and XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original value in xfs_reinit_inode(). Therefore, the i_nlink of a directory on disk may be set to 1. xfsaild xfs_inode_item_push xfs_iflush_cluster xfs_iflush xfs_inode_to_disk xfs_iget xfs_iget_cache_hit xfs_iget_recycle xfs_reinit_inode inode_init_always xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal inode state and can race with other RCU protected inode lookups. On the read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu + ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from racing inode updates (during transactions) by that lock. Fixes: `ff7bebeb91` ("xfs: refactor the inode recycling code") # goes further back than this Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:32 +02:00
Long Li	42163ff6c6	xfs: fix sb write verify for lazysbcount [ Upstream commit `59f6ab40fd` ] When lazysbcount is enabled, fsstress and loop mount/unmount test report the following problems: XFS (loop0): SB summary counter sanity check failed XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460, xfs_sb block 0x0 XFS (loop0): Unmount and run xfs_repair XFS (loop0): First 128 bytes of corrupted metadata buffer: 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(.. 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... .......... 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................ 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................ 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................ XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem. XFS (loop0): Please unmount the filesystem and rectify the problem(s) XFS (loop0): log mount/recovery failed: error -117 XFS (loop0): log mount failed This corruption will shut down the file system, and the file system will no longer be mountable. The following script can reproduce the problem, but it may take a long time. #!/bin/bash device=/dev/sda testdir=/mnt/test round=0 function fail() { echo "$*" exit 1 } mkdir -p $testdir while [ $round -lt 10000 ] do echo "**** round $round ******" mkfs.xfs -f $device mount $device $testdir || fail "mount failed!" fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null & sleep 4 killall -w fsstress umount $testdir xfs_repair -e $device > /dev/null if [ $? -eq 2 ];then echo "ERR CODE 2: Dirty log exception during repair." 
exit 1 fi round=$(($round+1)) done With lazysbcount enabled, there is no additional lock protection for reading m_ifree and m_icount in xfs_log_sb(); if another CPU modifies m_ifree, this can make m_ifree greater than m_icount. For example, consider the following sequence, where ifreedelta is positive: CPU0 CPU1 xfs_log_sb xfs_trans_unreserve_and_mod_sb ---------- ------------------------------ percpu_counter_sum(&mp->m_icount) percpu_counter_add_batch(&mp->m_icount, idelta, XFS_ICOUNT_BATCH) percpu_counter_add(&mp->m_ifree, ifreedelta); percpu_counter_sum(&mp->m_ifree) After this, an incorrect inode count (sb_ifree > sb_icount) will be written to the log. In the subsequent writing of sb, the incorrect inode count (sb_ifree > sb_icount) will fail to pass the boundary check in xfs_validate_sb_write(), which causes the file system to shut down. When lazysbcount is enabled, we don't need to guarantee that lazy sb counters are completely correct, but we do need to guarantee that sb_ifree <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount must be satisfied any time that there /cannot/ be other threads allocating or freeing inode chunks. If the constraint is violated under these circumstances, sb_i{count,free} (the ondisk superblock inode counters) may be incorrect and need to be marked sick at unmount; the count will be rebuilt on the next mount. Fixes: `8756a5af18` ("libxfs: add more bounds checking to sb sanity checks") Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:32 +02:00
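The CPU0/CPU1 interleaving in the commit above can be replayed deterministically. This is a minimal Python sketch, not the kernel code: `Counters`, `snapshot` and `alloc_chunk` are hypothetical names, the chunk size of 64 inodes is an assumed value, and clamping the free count to the inode count is shown as one possible way to keep `sb_ifree <= sb_icount`, which is the invariant the message asks for.

```python
class Counters:
    """Stand-in for the in-memory m_icount / m_ifree counters."""

    def __init__(self, icount, ifree):
        self.icount = icount
        self.ifree = ifree


def alloc_chunk(c, chunk=64):
    # The other CPU allocates an inode chunk: both counters grow.
    c.icount += chunk
    c.ifree += chunk


def snapshot(c, between_reads, clamp):
    """Read icount, let another CPU run, then read ifree."""
    sb_icount = c.icount          # first unlocked read
    between_reads(c)              # racing update lands between the reads
    ifree = c.ifree               # second unlocked read
    sb_ifree = min(ifree, sb_icount) if clamp else ifree
    return sb_icount, sb_ifree


print(snapshot(Counters(64, 10), alloc_chunk, clamp=False))  # (64, 74)
print(snapshot(Counters(64, 10), alloc_chunk, clamp=True))   # (64, 64)
```

Without the clamp, the snapshot carries a free count larger than the inode count, which is the condition that fails the boundary check on write.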
Darrick J. Wong	77d31f0c70	xfs: fix incorrect error-out in xfs_remove [ Upstream commit `2653d53345` ] Clean up resources if resetting the dotdot entry doesn't succeed. Observed through code inspection. Fixes: `5838d0356b` ("xfs: reset child dir '..' entry when unlinking child") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:32 +02:00
Dave Chinner	e2ae64993c	xfs: fix off-by-one-block in xfs_discard_folio() [ Upstream commit `8ac5b996bf` ] The recent writeback corruption fixes changed the code in xfs_discard_folio() to calculate a byte range for punching delalloc extents. A mistake was made in using round_up(pos) for the end offset, because when pos points at the first byte of a block, it does not get rounded up to point to the end byte of the block. Hence the punch range is short, and this leads to unexpected behaviour in certain cases in xfs_bmap_punch_delalloc_range. e.g. pos = 0 means we call xfs_bmap_punch_delalloc_range(0,0), so there is no previous extent and it rounds up the punch to the end of the delalloc extent it found at offset 0, not the end of the range given to xfs_bmap_punch_delalloc_range(). Fix this by handling the zero block offset case correctly. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217030 Link: https://lore.kernel.org/linux-xfs/Y+vOfaxIWX1c%2Fyy9@bfoster/ Fixes: `7348b32233` ("xfs: xfs_bmap_punch_delalloc_range() should take a byte range") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Found-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:32 +02:00
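The rounding behaviour behind this off-by-one can be checked with a few lines of arithmetic. This is a small Python sketch under stated assumptions: `round_up` here is a plain arithmetic stand-in rather than the kernel macro, the 4096-byte block size is assumed, and `end_of_block` is a hypothetical helper shown only for contrast; it is not claimed to be what the commit does.

```python
BLOCK = 4096  # assumed block size for this illustration


def round_up(x, align):
    """Round x up to the next multiple of align; aligned values are unchanged."""
    return (x + align - 1) // align * align


def end_of_block(pos, align):
    """Hypothetical alternative: first byte of the block after the one holding pos."""
    return (pos // align + 1) * align


# An aligned pos is left where it is, so [pos, round_up(pos)) is empty.
print(round_up(0, BLOCK), round_up(1, BLOCK), round_up(BLOCK, BLOCK))  # 0 4096 4096

# The alternative always covers the whole block that contains pos.
print(end_of_block(0, BLOCK), end_of_block(4095, BLOCK))               # 4096 4096
```

With pos = 0 the first form gives the (0, 0) range that the commit message describes.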
Dave Chinner	e811fec51c	xfs: drop write error injection is unfixable, remove it [ Upstream commit `6e8af15ccd` ] With the changes to scan the page cache for dirty data to avoid data corruptions from partial write cleanup racing with other page cache operations, the drop writes error injection no longer works the same way it used to and causes xfs/196 to fail. This is because xfs/196 writes to the file and populates the page cache before it turns on the error injection and starts failing -overwrites-. The result is that the original drop-writes code failed writes only -after- overwriting the data in the cache, followed by invalidating the cached data, then punching out the delalloc extent from under that data. On the surface, this looks fine. The problem is that page cache invalidation doesn't guarantee that it removes anything from the page cache and it doesn't change the dirty state of the folio. When block size == page size and we do page aligned IO (as xfs/196 does) everything happens to align perfectly and page cache invalidation removes the single page folios that span the written data. Hence the followup delalloc punch pass does not find cached data over that range and it can punch the extent out. IOWs, xfs/196 "works" for block size == page size with the new code. I say "works", because it actually only works for the case where IO is page aligned, and no data was read from disk before writes occur. Because the moment we actually read data first, the readahead code allocates multipage folios and suddenly the invalidate code goes back to zeroing subfolio ranges without changing dirty state. Hence, with multipage folios in play, block size == page size is functionally identical to block size < page size behaviour, and drop-writes is manifestly broken w.r.t. this case. Invalidation of a subfolio range doesn't result in the folio being removed from the cache, just the range gets zeroed. 
Hence after we've sequentially walked over a folio that we've dirtied (via write data) and then invalidated, we end up with a dirty folio full of zeroed data. And because the new code skips punching ranges that have dirty folios covering them, we end up leaving the delalloc range intact after failing all the writes. Hence failed writes now end up writing zeroes to disk in the cases where invalidation zeroes folios rather than removing them from cache. This is a fundamental change of behaviour that is needed to avoid the data corruption vectors that exist in the old write fail path, and it renders the drop-writes injection non-functional and unworkable as it stands. As it is, I think the error injection is also now unnecessary, as partial writes that need delalloc extent are going to be a lot more common with stale iomap detection in place. Hence this patch removes the drop-writes error injection completely. xfs/196 can remain for testing kernels that don't have this data corruption fix, but those that do will report: xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:32 +02:00
Dave Chinner	ea67e73129	xfs: use iomap_valid method to detect stale cached iomaps [ Upstream commit `304a68b9c6` ] Now that iomap supports a mechanism to validate cached iomaps for buffered write operations, hook it up to the XFS buffered write ops so that we can avoid data corruptions that result from stale cached iomaps. See: https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/ or the ->iomap_valid() introduction commit for exact details of the corruption vector. The validity cookie we store in the iomap is based on the type of iomap we return. It is expected that the iomap->flags we set in xfs_bmbt_to_iomap() is not perturbed by the iomap core and are returned to us in the iomap passed via the .iomap_valid() callback. This ensures that the validity cookie is always checking the correct inode fork sequence numbers to detect potential changes that affect the extent cached by the iomap. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:32 +02:00
Dave Chinner	54a37e5d07	iomap: write iomap validity checks [ Upstream commit `d7b6404116` ] A recent multithreaded write data corruption has been uncovered in the iomap write code. The core of the problem is partial folio writes can be flushed to disk while a new racing write can map it and fill the rest of the page: writeback new write allocate blocks blocks are unwritten submit IO ..... map blocks iomap indicates UNWRITTEN range loop { lock folio copyin data ..... IO completes runs unwritten extent conv blocks are marked written <iomap now stale> get next folio } Now add memory pressure such that memory reclaim evicts the partially written folio that has already been written to disk. When the new write finally gets to the last partial page of the new write, it does not find it in cache, so it instantiates a new page, sees the iomap is unwritten, and zeros the part of the page that it does not have data from. This overwrites the data on disk that was originally written. The full description of the corruption mechanism can be found here: https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/ To solve this problem, we need to check whether the iomap is still valid after we lock each folio during the write. We have to do it after we lock the page so that we don't end up with state changes occurring while we wait for the folio to be locked. Hence we need a mechanism to be able to check that the cached iomap is still valid (similar to what we already do in buffered writeback), and we need a way for ->begin_write to back out and tell the high level iomap iterator that we need to remap the remaining write range. The iomap needs to grow some storage for the validity cookie that the filesystem provides to travel with the iomap. XFS, in particular, also needs to know some more information about what the iomap maps (attribute extents rather than file data extents) for the validity cookie to cover all the types of iomaps we might need to validate. 
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:31 +02:00
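The validity-cookie idea in the commit above can be sketched as a sequence-number check. This is a toy Python model, not the iomap API: `Inode`, `Iomap`, `map_blocks`, `iomap_valid`, `writeback_completes` and `buffered_write` are hypothetical names, and the single `seq` counter is an assumed stand-in for whatever sequence information a filesystem stores in the cookie.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    state: str = "unwritten"   # state of the blocks backing the write range
    seq: int = 0               # bumped on every extent map change


@dataclass
class Iomap:
    state: str                 # extent state cached when the mapping was built
    cookie: int                # inode.seq sampled at the same moment


def map_blocks(inode):
    return Iomap(state=inode.state, cookie=inode.seq)


def iomap_valid(inode, iomap):
    return iomap.cookie == inode.seq


def writeback_completes(inode):
    # Unwritten extent conversion: the cached mapping is now stale.
    inode.state = "written"
    inode.seq += 1


def buffered_write(inode, n_folios, race_at=None):
    """Return the extent state the write used for each folio."""
    iomap = map_blocks(inode)
    used = []
    for i in range(n_folios):
        if i == race_at:
            writeback_completes(inode)    # racing writeback finishes here
        # Folio i is "locked" at this point; revalidate before using the map.
        if not iomap_valid(inode, iomap):
            iomap = map_blocks(inode)     # remap the remaining range
        used.append(iomap.state)
    return used


print(buffered_write(Inode(), 3, race_at=1))  # ['unwritten', 'written', 'written']
```

Without the check after the folio lock, the second and third folios in this model would still be treated as unwritten, which is the stale state that leads to zeroing data already on disk.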
Dave Chinner	580f40b4c9	xfs: xfs_bmap_punch_delalloc_range() should take a byte range [ Upstream commit `7348b32233` ] All the callers of xfs_bmap_punch_delalloc_range() jump through hoops to convert a byte range to filesystem blocks before calling xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to xfs_bmap_punch_delalloc_range() and have it do the conversion to filesystem blocks internally. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:31 +02:00
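As a rough illustration of the byte-to-block conversion the commit above moves inside the callee, here is a minimal sketch. The names `struct fsb_range` and `bytes_to_fsb_range` are hypothetical, and the rounding (start rounded down, end rounded up, half-open range) is an assumption for the example rather than a statement of what XFS does.

```c
#include <stdint.h>

/* Half-open range of filesystem blocks: [start_fsb, end_fsb). */
struct fsb_range {
	uint64_t start_fsb;
	uint64_t end_fsb;
};

/*
 * Convert the byte range [start_byte, end_byte) to filesystem blocks.
 * blocklog is log2 of the block size (12 for 4096-byte blocks).
 */
static struct fsb_range bytes_to_fsb_range(uint64_t start_byte,
					   uint64_t end_byte,
					   unsigned int blocklog)
{
	uint64_t blocksize = 1ULL << blocklog;
	struct fsb_range r;

	r.start_fsb = start_byte >> blocklog;			/* round down */
	r.end_fsb = (end_byte + blocksize - 1) >> blocklog;	/* round up */
	return r;
}
```

Doing the conversion in one place means callers only ever deal in byte offsets, which is the simplification the commit describes.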
Dave Chinner	38be53c3fd	iomap: buffered write failure should not truncate the page cache [ Upstream commit `f43dc4dc3e` ] iomap_file_buffered_write_punch_delalloc() currently invalidates the page cache over the unused range of the delalloc extent that was allocated. While the write allocated the delalloc extent, it does not own it exclusively as the write does not hold any locks that prevent either writeback or mmap page faults from changing the state of either the page cache or the extent state backing this range. Whilst xfs_bmap_punch_delalloc_range() already handles races in extent conversion - it will only punch out delalloc extents and it ignores any other type of extent - the page cache truncate does not discriminate between data written by this write or some other task. As a result, truncating the page cache can result in data corruption if the write races with mmap modifications to the file over the same range. generic/346 exercises this workload, and if we randomly fail writes (as will happen when iomap gets stale iomap detection later in the patchset), it will randomly corrupt the file data because it removes data written by mmap() in the same page as the write() that failed. Hence we do not want to punch out the page cache over the range of the extent we failed to write to - what we actually need to do is detect the ranges that have dirty data in cache over them and not punch them out. To do this, we have to walk the page cache over the range of the delalloc extent we want to remove. This is made complex by the fact we have to handle partially up-to-date folios correctly and this can happen even when the FSB size == PAGE_SIZE because we now support multi-page folios in the page cache. Because we are only interested in discovering the edges of data ranges in the page cache (i.e. hole-data boundaries) we can make use of mapping_seek_hole_data() to find those transitions in the page cache. 
As we hold the invalidate_lock, we know that the boundaries are not going to change while we walk the range. This interface is also byte-based and is sub-page block aware, so we can find the data ranges in the cache based on byte offsets rather than page, folio or fs block sized chunks. This greatly simplifies the logic of finding dirty cached ranges in the page cache. Once we've identified a range that contains cached data, we can then iterate the range folio by folio. This allows us to determine if the data is dirty and hence perform the correct delalloc extent punching operations. The seek interface we use to iterate data ranges will give us sub-folio start/end granularity, so we may end up looking up the same folio multiple times as the seek interface iterates across each discontiguous data region in the folio. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:31 +02:00
Dave Chinner	12339ec6fe	xfs,iomap: move delalloc punching to iomap [ Upstream commit `9c7babf94a` ] Because that's what Christoph wants for this error handling path only XFS uses. It requires a new iomap export for handling errors over delalloc ranges. This is basically the XFS code as it stands, but even though Christoph wants this as iomap functionality, we still have to call it from the filesystem specific ->iomap_end callback, and call into the iomap code with yet another filesystem specific callback to punch the delalloc extent within the defined ranges. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:31 +02:00
Dave Chinner	8b6afad39b	xfs: use byte ranges for write cleanup ranges [ Upstream commit `b71f889c18` ] xfs_buffered_write_iomap_end() currently converts the byte ranges passed to it to filesystem blocks to pass them to the bmap code to punch out delalloc blocks, but then has to convert filesystem blocks back to byte ranges for page cache truncate. We're about to make the page cache truncate go away and replace it with a page cache walk, so having to convert everything to/from/to filesystem blocks is messy and error-prone. It is much easier to pass around byte ranges and convert to page indexes and/or filesystem blocks only where those units are needed. In preparation for the page cache walk being added, add a helper that converts byte ranges to filesystem blocks and calls xfs_bmap_punch_delalloc_range() and convert xfs_buffered_write_iomap_end() to calculate limits in byte ranges. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:31 +02:00
Dave Chinner	142eafd24d	xfs: punching delalloc extents on write failure is racy [ Upstream commit `198dd8aede` ] xfs_buffered_write_iomap_end() has a comment about the safety of punching delalloc extents based on holding the IOLOCK_EXCL. This comment is wrong, and punching delalloc extents is not race free. When we punch out a delalloc extent after a write failure in xfs_buffered_write_iomap_end(), we punch out the page cache with truncate_pagecache_range() before we punch out the delalloc extents. At this point, we only hold the IOLOCK_EXCL, so there is nothing stopping mmap() write faults racing with this cleanup operation, reinstantiating a folio over the range we are about to punch and hence requiring the delalloc extent to be kept. If this race condition is hit, we can end up with a dirty page in the page cache that has no delalloc extent or space reservation backing it. This leads to bad things happening at writeback time. To avoid this race condition, we need the page cache truncation to be atomic w.r.t. the extent manipulation. We can do this by holding the mapping->invalidate_lock exclusively across this operation - this will prevent new pages from being inserted into the page cache whilst we are removing the pages and the backing extent and space reservation. Taking the mapping->invalidate_lock exclusively in the buffered write IO path is safe - it naturally nests inside the IOLOCK (see truncate and fallocate paths). iomap_zero_range() can be called from under the mapping->invalidate_lock (from the truncate path via either xfs_zero_eof() or xfs_truncate_page()), but iomap_zero_iter() will not instantiate new delalloc pages (because it skips holes) and hence will not ever need to punch out delalloc extents on failure. Fix the locking issue, and clean up the code logic a little to avoid unnecessary work if we didn't allocate the delalloc extent or wrote the entire region we allocated. 
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:31 +02:00
Dave Chinner	495e934c66	xfs: write page faults in iomap are not buffered writes [ Upstream commit `118e021b4b` ] When we reserve a delalloc region in xfs_buffered_write_iomap_begin, we mark the iomap as IOMAP_F_NEW so that the write context understands that it allocated the delalloc region. If we then fail that buffered write, xfs_buffered_write_iomap_end() checks for the IOMAP_F_NEW flag and if it is set, it punches out the unused delalloc region that was allocated for the write. The assumption this code makes is that all buffered write operations that can allocate space are run under an exclusive lock (i_rwsem). This is an invalid assumption: page faults in mmap()d regions call through this same function pair to map the file range being faulted and this runs only holding the inode->i_mapping->invalidate_lock in shared mode. IOWs, we can have races between page faults and write() calls that fail the nested page cache write operation that result in data loss. That is, the failing iomap_end call will punch out the data that the other racing iomap iteration brought into the page cache. This can be reproduced with generic/34[46] if we arbitrarily fail page cache copy-in operations from write() syscalls. Code analysis tells us that the iomap_page_mkwrite() function holds the already instantiated and uptodate folio locked across the iomap mapping iterations. Hence the folio cannot be removed from memory whilst we are mapping the range it covers, and as such we do not care if the mapping changes state underneath the iomap iteration loop: 1. if the folio is not already dirty, there are no writeback races possible. 2. if we allocated the mapping (delalloc or unwritten), the folio cannot already be dirty. See #1. 3. If the folio is already dirty, it must be up to date. As we hold it locked, it cannot be reclaimed from memory. Hence we always have valid data in the page cache while iterating the mapping. 4. 
Valid data in the page cache can exist when the underlying mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not change the data in the page - it only affects actions if we are initialising a new page. Hence #3 applies and we don't care about these extent map transitions racing with iomap_page_mkwrite(). 5. iomap_page_mkwrite() checks for page invalidation races (truncate, hole punch, etc) after it locks the folio. We also hold the mapping->invalidation_lock here, and hence the mapping cannot change due to extent removal operations while we are iterating the folio. As such, filesystems that don't use bufferheads will never fail the iomap_folio_mkwrite_iter() operation on the current mapping, regardless of whether the iomap should be considered stale. Further, the range we are asked to iterate is limited to the range inside EOF that the folio spans. Hence, for XFS, we will only map the exact range we are asked for, and we will only do speculative preallocation with delalloc if we are mapping a hole at the EOF page. The iterator will consume the entire range of the folio that is within EOF, and anything beyond the EOF block cannot be accessed. We never need to truncate this post-EOF speculative prealloc away in the context of the iomap_page_mkwrite() iterator because if it remains unused we'll remove it when the last reference to the inode goes away. Hence we don't actually need an .iomap_end() cleanup/error handling path at all for iomap_page_mkwrite() for XFS. This means we can separate the page fault processing from the complexity of the .iomap_end() processing in the buffered write path. This also means that the buffered write path will also be able to take the mapping->invalidate_lock as necessary. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. 
Wong <djwong@kernel.org> Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:31 +02:00
Mengqi Zhang	493a8172e5	mmc: core: Add HS400 tuning in HS400es initialization commit 77e01b49e35f24ebd1659096d5fc5c3b75975545 upstream. During the initialization to HS400es stage, add a HS400 tuning flow as an optional process. For Mediatek IP, the HS400es mode requires a specific tuning to ensure the correct HS400 timing setting. Signed-off-by: Mengqi Zhang <mengqi.zhang@mediatek.com> Link: https://lore.kernel.org/r/20231225093839.22931-2-mengqi.zhang@mediatek.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Cc: "Lin Gui (桂林)" <Lin.Gui@mediatek.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:30 +02:00
Jarkko Sakkinen	5d91238b59	KEYS: trusted: Fix memory leak in tpm2_key_encode() commit ffcaa2172cc1a85ddb8b783de96d38ca8855e248 upstream. 'scratch' is never freed. Fix this by calling kfree() in the success, and in the error case. Cc: stable@vger.kernel.org # +v5.13 Fixes: `f221974525` ("security: keys: trusted: use ASN.1 TPM2 key format for the blobs") Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:30 +02:00
NeilBrown	104ef3d8cd	nfsd: don't allow nfsd threads to be signalled. commit `3903902401` upstream. The original implementation of nfsd used signals to stop threads during shutdown. In Linux 2.3.46pre5 nfsd gained the ability to shut down threads internally if it was asked to run "0" threads. After this user-space transitioned to using "rpc.nfsd 0" to stop nfsd and sending signals to threads was no longer an important part of the API. In commit `3ebdbe5203` ("SUNRPC: discard svo_setup and rename svc_set_num_threads_sync()") (v5.17-rc1~75^2~41) we finally removed the use of signals for stopping threads, using kthread_stop() instead. This patch makes the "obvious" next step and removes the ability to signal nfsd threads - or any svc threads. nfsd stops allowing signals and we don't check for their delivery any more. This will allow for some simplification in later patches. A change worth noting is in nfsd4_ssc_setup_dul(). There was previously a signal_pending() check which would only succeed when the thread was being shut down. It should really have tested kthread_should_stop() as well. Now it just does the latter, not the former. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:30 +02:00
Aidan MacDonald	cf8e6ae857	mfd: stpmic1: Fix swapped mask/unmask in irq chip commit `c79e387389` upstream. The usual behavior of mask registers is writing a '1' bit to disable (mask) an interrupt; similarly, writing a '1' bit to an unmask register enables (unmasks) an interrupt. Due to a longstanding issue in regmap-irq, mask and unmask registers were inverted when both kinds of registers were present on the same chip, ie. regmap-irq actually wrote '1's to the mask register to enable an IRQ and '1's to the unmask register to disable an IRQ. This was fixed by commit `e8ffb12e7f` ("regmap-irq: Fix inverted handling of unmask registers") but the fix is opt-in via mask_unmask_non_inverted = true because it requires manual changes for each affected driver. The new behavior will become the default once all drivers have been updated. The STPMIC1 has a normal mask register with separate set and clear registers. The driver intends to use the set & clear registers with regmap-irq and has compensated for regmap-irq's inverted behavior, and should currently be working properly. Thus, swap mask_base and unmask_base, and opt in to the new non-inverted behavior. Signed-off-by: Aidan MacDonald <aidanmacdonald.0x0@gmail.com> Signed-off-by: Lee Jones <lee@kernel.org> Link: https://lore.kernel.org/r/20221112151835.39059-16-aidanmacdonald.0x0@gmail.com Cc: Yoann Congal <yoann.congal@smile.fr> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:30 +02:00
Sergey Shtylyov	026caf92c6	pinctrl: core: handle radix_tree_insert() errors in pinctrl_register_one_pin() commit `ecfe9a015d` upstream. pinctrl_register_one_pin() doesn't check the result of radix_tree_insert() even though both may return a negative error code. Linus Walleij said he has copied the radix tree code from kernel/irq/ where the functions calling radix_tree_insert() are void themselves; I think it makes more sense to propagate the errors from radix_tree_insert() upstream if we can do that... Found by Linux Verification Center (linuxtesting.org) with the Svace static analysis tool. Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://lore.kernel.org/r/20230719202253.13469-3-s.shtylyov@omp.ru Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: "Hemdan, Hagar Gamal Halim" <hagarhem@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:30 +02:00
Jacob Keller	90cbd4c081	ice: remove unnecessary duplicate checks for VF VSI ID commit 363f689600dd010703ce6391bcfc729a97d21840 upstream. The ice_vc_fdir_param_check() function validates that the VSI ID of the virtchnl flow director command matches the VSI number of the VF. This is already checked by the call to ice_vc_isvalid_vsi_id() immediately following this. This check is unnecessary since ice_vc_isvalid_vsi_id() already confirms this by checking that the VSI ID can locate the VSI associated with the VF structure. Furthermore, a following change is going to refactor the ice driver to report VSI IDs using a relative index for each VF instead of reporting the PF VSI number. This additional check would break that logic since it enforces that the VSI ID matches the VSI number. Since this check duplicates the logic in ice_vc_isvalid_vsi_id() and gets in the way of refactoring that logic, remove it. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:30 +02:00
Jacob Keller	59161a21ca	ice: pass VSI pointer into ice_vc_isvalid_q_id commit a21605993dd5dfd15edfa7f06705ede17b519026 upstream. The ice_vc_isvalid_q_id() function takes a VSI index and a queue ID. It looks up the VSI from its index, and then validates that the queue number is valid for that VSI. The VSI ID passed is typically a VSI index from the VF. This VSI number is validated by the PF to ensure that it matches the VSI associated with the VF already. In every flow where ice_vc_isvalid_q_id() is called, the PF driver already has a pointer to the VSI associated with the VF. This pointer is obtained using ice_get_vf_vsi(), rather than looking up the VSI using the index sent by the VF. Since we already know which VSI to operate on, we can modify ice_vc_isvalid_q_id() to take a VSI pointer instead of a VSI index. Pass the VSI we found from ice_get_vf_vsi() instead of re-doing the lookup. This removes some unnecessary computation and scanning of the VSI list. It also removes the last place where the driver directly used the VSI number from the VF. This will pave the way for refactoring to communicate relative VSI numbers to the VF instead of absolute numbers from the PF space. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:30 +02:00
Ronald Wahl	8a94fc9d20	net: ks8851: Fix another TX stall caused by wrong ISR flag handling commit 317a215d493230da361028ea8a4675de334bfa1a upstream. Under some circumstances it may happen that the ks8851 Ethernet driver stops sending data. Currently the interrupt handler resets the interrupt status flags in the hardware after handling TX. With this approach we may lose interrupts in the time window between handling the TX interrupt and resetting the TX interrupt status bit. When all of the three following conditions are true then transmitting data stops: - TX queue is stopped to wait for room in the hardware TX buffer - no queued SKBs in the driver (txq) that wait for being written to hw - hardware TX buffer is empty and the last TX interrupt was lost This is because reenabling the TX queue happens when handling the TX interrupt status but if the TX status bit has already been cleared then this interrupt will never come. With this commit the interrupt status flags will be cleared before they are handled. That way we stop losing interrupts. The wrong handling of the ISR flags was there from the beginning but with commit 3dc5d4454545 ("net: ks8851: Fix TX stall caused by TX buffer overrun") the issue becomes apparent. Fixes: 3dc5d4454545 ("net: ks8851: Fix TX stall caused by TX buffer overrun") Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: netdev@vger.kernel.org Cc: stable@vger.kernel.org # 5.10+ Signed-off-by: Ronald Wahl <ronald.wahl@raritan.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:29 +02:00
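The ordering problem described in the ks8851 commit above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `struct fake_dev`, `IRQ_TX`, and both handler functions are hypothetical and do not reflect the driver's real register layout or code. The status register is modelled as a plain variable, and a second TX event is injected while the first is being handled.

```c
#include <stdbool.h>
#include <stdint.h>

#define IRQ_TX 0x1u	/* hypothetical TX-complete status bit */

struct fake_dev {
	uint32_t isr;			/* simulated interrupt status register */
	bool tx_during_handling;	/* inject a new TX event mid-handler */
	int tx_handled;
};

static void handle_tx(struct fake_dev *dev)
{
	dev->tx_handled++;
	if (dev->tx_during_handling) {
		dev->isr |= IRQ_TX;	/* "hardware" raises the flag again */
		dev->tx_during_handling = false;
	}
}

/* Old ordering: handle first, clear status afterwards. */
static void irq_ack_last(struct fake_dev *dev)
{
	uint32_t status = dev->isr;

	if (status & IRQ_TX)
		handle_tx(dev);
	dev->isr &= ~status;	/* also wipes the event raised meanwhile */
}

/* Ordering the commit describes: clear status first, then handle. */
static void irq_ack_first(struct fake_dev *dev)
{
	uint32_t status = dev->isr;

	dev->isr &= ~status;
	if (status & IRQ_TX)
		handle_tx(dev);	/* a new event stays pending in isr */
}
```

With `irq_ack_last()` the injected event is cleared without ever being seen; with `irq_ack_first()` it remains set and would trigger another interrupt.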
Jose Fernandez	91402e0e5d	drm/amd/display: Fix division by zero in setup_dsc_config commit 130afc8a886183a94cf6eab7d24f300014ff87ba upstream. When slice_height is 0, the division by slice_height in the calculation of the number of slices will cause a division by zero driver crash. This leaves the kernel in a state that requires a reboot. This patch adds a check to avoid the division by zero. The stack trace below is for the 6.8.4 Kernel. I reproduced the issue on a Z16 Gen 2 Lenovo Thinkpad with a Apple Studio Display monitor connected via Thunderbolt. The amdgpu driver crashed with this exception when I rebooted the system with the monitor connected. kernel: ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447) kernel: ? do_trap (arch/x86/kernel/traps.c:113 arch/x86/kernel/traps.c:154) kernel: ? setup_dsc_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1053) amdgpu kernel: ? do_error_trap (./arch/x86/include/asm/traps.h:58 arch/x86/kernel/traps.c:175) kernel: ? setup_dsc_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1053) amdgpu kernel: ? exc_divide_error (arch/x86/kernel/traps.c:194 (discriminator 2)) kernel: ? setup_dsc_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1053) amdgpu kernel: ? asm_exc_divide_error (./arch/x86/include/asm/idtentry.h:548) kernel: ? setup_dsc_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1053) amdgpu kernel: dc_dsc_compute_config (drivers/gpu/drm/amd/amdgpu/../display/dc/dsc/dc_dsc.c:1109) amdgpu After applying this patch, the driver no longer crashes when the monitor is connected and the system is rebooted. I believe this is the same issue reported for 3113. 
Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Jose Fernandez <josef@netflix.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3113 Signed-off-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: "Limonciello, Mario" <mario.limonciello@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-25 16:21:29 +02:00
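The kind of guard the commit above describes can be sketched as follows. This is a simplified illustration, not the amdgpu code: `compute_num_slices_v` and its parameters are hypothetical names, and treating a zero slice height as "DSC not possible" is an assumption made for the example.

```c
#include <stdbool.h>

/*
 * Compute the number of vertical slices for a picture.
 * Returns false instead of dividing when slice_height is not positive,
 * so the caller can reject the configuration rather than trap.
 */
static bool compute_num_slices_v(int pic_height, int slice_height,
				 int *num_slices_v)
{
	if (slice_height <= 0)
		return false;	/* would otherwise divide by zero */

	*num_slices_v = pic_height / slice_height;
	return true;
}
```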
Qianfeng Rong	68c821783c	UPSTREAM: epoll: be better about file lifetimes epoll can call out to vfs_poll() with a file pointer that may race with the last 'fput()'. That would make f_count go down to zero, and while the ep->mtx locking means that the resulting file pointer tear-down will be blocked until the poll returns, it means that f_count is already dead, and any use of it won't actually get a reference to the file any more: it's dead regardless. Make sure we have a valid ref on the file pointer before we call down to vfs_poll() from the epoll routines. Bug: 341834298 Change-Id: Iefa13cd84102ded3e104c030c8d7d0b7a8c1eab2 Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/ Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 4efaa5acf0a1d2b5947f98abb3acf8bfd966422b) Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.corp-partner.google.com>	2024-05-23 18:50:17 +08:00
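Taking a reference only while the count is still non-zero, as the epoll commit above requires, can be sketched in userspace with C11 atomics. This is an illustration only: `struct fake_file` and `fake_file_tryget` are hypothetical names, and the compare-and-swap loop is a generic "increment unless zero" pattern, not the kernel's implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object with a reference count. */
struct fake_file {
	atomic_long f_count;
};

/*
 * Try to take a reference. Succeeds only if the count is currently
 * non-zero; a count of zero means teardown has begun and the object
 * must not be used.
 */
static bool fake_file_tryget(struct fake_file *f)
{
	long old = atomic_load(&f->f_count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&f->f_count, &old, old + 1))
			return true;
	}
	return false;
}
```

A plain increment would not be enough here, because it would resurrect a count that has already dropped to zero.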
Kyle Tso	84574a4ee9	FROMLIST: usb: typec: tcpm: Ignore received Hard Reset in TOGGLING state Similar to what fixed in Commit a6fe37f428c1 ("usb: typec: tcpm: Skip hard reset when in error recovery"), the handling of the received Hard Reset has to be skipped during TOGGLING state. [ 4086.021288] VBUS off [ 4086.021295] pending state change SNK_READY -> SNK_UNATTACHED @ 650 ms [rev2 NONE_AMS] [ 4086.022113] VBUS VSAFE0V [ 4086.022117] state change SNK_READY -> SNK_UNATTACHED [rev2 NONE_AMS] [ 4086.022447] VBUS off [ 4086.022450] state change SNK_UNATTACHED -> SNK_UNATTACHED [rev2 NONE_AMS] [ 4086.023060] VBUS VSAFE0V [ 4086.023064] state change SNK_UNATTACHED -> SNK_UNATTACHED [rev2 NONE_AMS] [ 4086.023070] disable BIST MODE TESTDATA [ 4086.023766] disable vbus discharge ret:0 [ 4086.023911] Setting usb_comm capable false [ 4086.028874] Setting voltage/current limit 0 mV 0 mA [ 4086.028888] polarity 0 [ 4086.030305] Requesting mux state 0, usb-role 0, orientation 0 [ 4086.033539] Start toggling [ 4086.038496] state change SNK_UNATTACHED -> TOGGLING [rev2 NONE_AMS] // This Hard Reset is unexpected [ 4086.038499] Received hard reset [ 4086.038501] state change TOGGLING -> HARD_RESET_START [rev2 HARD_RESET] Fixes: `f0690a25a1` ("staging: typec: USB Type-C Port Manager (tcpm)") Cc: stable@vger.kernel.org Signed-off-by: Kyle Tso <kyletso@google.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Change-Id: Icfa144f370bd87670df1cd71f247a3528ab4c591 Bug: 331356545 Link: https://lore.kernel.org/all/20240520154858.1072347-1-kyletso@google.com/	2024-05-23 08:26:52 +00:00
Krishna Kurapati	2755f25d0c	UPSTREAM: usb: gadget: ncm: Fix handling of zero block length packets While connecting to a Linux host with CDC_NCM_NTB_DEF_SIZE_TX set to 65536, it has been observed that we receive short packets, which come at interval of 5-10 seconds sometimes and have block length zero but still contain 1-2 valid datagrams present. According to the NCM spec: "If wBlockLength = 0x0000, the block is terminated by a short packet. In this case, the USB transfer must still be shorter than dwNtbInMaxSize or dwNtbOutMaxSize. If exactly dwNtbInMaxSize or dwNtbOutMaxSize bytes are sent, and the size is a multiple of wMaxPacketSize for the given pipe, then no ZLP shall be sent. wBlockLength= 0x0000 must be used with extreme care, because of the possibility that the host and device may get out of sync, and because of test issues. wBlockLength = 0x0000 allows the sender to reduce latency by starting to send a very large NTB, and then shortening it when the sender discovers that there’s not sufficient data to justify sending a large NTB" However, there is a potential issue with the current implementation, as it checks for the occurrence of multiple NTBs in a single giveback by verifying if the leftover bytes to be processed is zero or not. If the block length reads zero, we would process the same NTB infinitely because the leftover bytes is never zero and it leads to a crash. Fix this by bailing out if block length reads zero. 
Cc: stable@vger.kernel.org Fixes: `427694cfaa` ("usb: gadget: ncm: Handle decoding of multiple NTB's in unwrap call") Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20240228115441.2105585-1-quic_kriskura@quicinc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit f90ce1e04cbcc76639d6cba0fdbd820cd80b3c70 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master) Bug: 320608613 Change-Id: I4b60d855f5539e66261e71dc2a29c7d22712e382 Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com> (cherry picked from commit b493b35d3a52a47d92607a03c257fcb71fcc2ef9)	2024-05-22 19:29:41 +00:00
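The infinite-loop hazard described in the commit above, and the bail-out that avoids it, can be sketched with a toy parser. The framing here is hypothetical and much simpler than a real NTB: each block is assumed to start with a 2-byte little-endian length that covers the whole block, and `count_blocks` is an invented name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Walk a buffer of length-prefixed blocks.
 * Returns the number of blocks seen, or -1 on malformed input.
 */
static int count_blocks(const uint8_t *buf, size_t len)
{
	size_t remaining = len;
	int blocks = 0;

	while (remaining >= 2) {
		uint16_t block_len = (uint16_t)(buf[0] | (buf[1] << 8));

		if (block_len == 0) {
			/*
			 * A zero length never advances the cursor, so
			 * looping on "remaining != 0" would spin forever.
			 * Count this block as the last one and stop.
			 */
			blocks++;
			break;
		}
		if (block_len < 2 || block_len > remaining)
			return -1;

		buf += block_len;
		remaining -= block_len;
		blocks++;
	}
	return blocks;
}
```

Without the zero-length check, the loop would keep re-reading the same header because neither `buf` nor `remaining` changes.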
Greg Kroah-Hartman	1dca1fead9	Merge branch 'android14-6.1' into branch 'android14-6.1-lts' This catches the android14-6.1-lts branch up to date with recent changes in the android14-6.1 branch, including symbol additions which are required for us to track in the LTS branch. Included in here are the following commits: * `0a5aada71c` ANDROID: GKI: Update symbol list for mtk * `34a15d3507` UPSTREAM: usb: gadget: ncm: Avoid dropping datagrams of properly parsed NTBs * `bd552fcbbd` ANDROID: GKI: Update rockchip symbols to add iova APIs * `4ed706c20a` FROMLIST: sched/pi: Reweight fair_policy() tasks when inheriting prio * `b1e11ffd90` ANDROID: Update the ABI symbol list * `29a00abe43` ANDROID: mm: Add restricted vendor hook in do_read_fault() * `51c421385e` ANDROID: abi_gki_aarch64_qcom: Update symbol list * `a9dca663a7` ANDROID: Update the ABI symbol list * `6316af1012` ANDROID: add vendor hooks and expoert reclaim_pages to reclaim memory * `1d241d978d` FROMGIT: usb: dwc3: Wait unconditionally after issuing EndXfer command * `f9ca61c8d8` ANDROID: ABI: Update honor symbol list * `c7fcb9bf9a` ANDROID: add vendor hook in do_read_fault to tune fault_around_bytes * `23f2a9f5f1` ANDROID: usb: Optimize the problem of slow transfer rate in USB accessory mode * `6a3d68af9c` ANDROID: Zap kernel/sched/android.h stubs * `274e3e9696` ANDROID: export one function for mm metrics * `117a941226` ANDROID: Update the ABI symbol list * `0d080e01a2` ANDROID: Export sysctl_sched_wakeup_granularity to enable modifying it * `039d2a958c` UPSTREAM: ALSA: virtio: use ack callback * `47dfe41d57` UPSTREAM: usb: typec: tcpm: clear pd_event queue in PORT_RESET * `93188d7732` BACKPORT: usb: typec: tcpm: enforce ready state when queueing alt mode vdm * `4d55129aea` UPSTREAM: crypto: x86/curve25519 - disable gcov * `cf685d2b02` ANDROID: GKI: Update QCOM symbol list and ABI STG * `fae94bc4e7` ANDROID: GKI: update symbol list file for xiaomi * `d5e04556d4` UPSTREAM: netfilter: nft_set_pipapo: do not free 
live element * `dc6facfe02` UPSTREAM: net: tls: handle backlogging of crypto requests * `1794308d46` ANDROID: 16K: Fix show maps CFI failure * `72a9c0a205` ANDROID: 16K: Handle pad VMA splits and merges * `b86b5cb22d` ANDROID: 16K: madvise_vma_pad_pages: Remove filemap_fault check * `1657717c12` ANDROID: 16K: Only madvise padding from dynamic linker context * `2ca5e076c9` ANDROID: 16K: Separate padding from ELF LOAD segment mappings * `1537dbe21b` ANDROID: 16K: Exclude ELF padding for fault around range * `6815ef3195` ANDROID: 16K: Use MADV_DONTNEED to save VMA padding pages. * `6b9e404675` ANDROID: 16K: Introduce ELF padding representation for VMAs * `e79c1d4590` ANDROID: 16K: Introduce /sys/kernel/mm/pgsize_miration/enabled * `ea3c70fb95` FROMGIT: usb: typec: tcpm: Check for port partner validity before consuming it * `13f322e958` Revert "FROMGIT: usb: typec: tcpm: Check for port partner validity before consuming it" * `6657c436ed` FROMGIT: usb: typec: tcpm: Check for port partner validity before consuming it * `1d37bc9913` ANDROID: vendor_hooks: add symbols for lazy preemption * `14f07c1db0` ANDROID: vendor_hooks: add two hooks for lazy preemption * `6364d59412` ANDROID: KVM: arm64: wait_for_initramfs for pKVM module loading procfs * `4744b3a4ed` ANDROID: GKI: Expose device async to userspace * `08cc4037cf` FROMGIT: coresight: etm4x: Fix access to resource selector registers * `7ff054397a` FROMGIT: coresight: etm4x: Safe access for TRCQCLTR * `f401cce7d9` FROMGIT: coresight: etm4x: Do not save/restore Data trace control registers * `d9604db041` FROMGIT: coresight: etm4x: Do not hardcode IOMEM access for register restore * `fa87a072a7` ANDROID: GKI: Update honda symbol list for led-trigger * `c61278bb70` ANDROID: GKI: Update symbols to symbol list * `260bfad693` ANDROID: vendor_hook: Add hooks to support reader optimistic spin in rwsem * `d0c6724b0f` UPSTREAM: af_unix: Fix garbage collector racing against connect() * `94c88f80ff` UPSTREAM: af_unix: Do not use 
atomic ops for unix_sk(sk)->inflight. * `3dfddcb9c2` ANDROID: GKI: fix ABI breakage in struct userfaultfd_ctx * `8dd482be44` UPSTREAM: userfaultfd: fix deadlock warning when locking src and dst VMAs * `ce2896c0c6` BACKPORT: userfaultfd: use per-vma locks in userfaultfd operations * `daf0b0fc4a` BACKPORT: mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK * `a5b6040d5c` BACKPORT: userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx * `6b5ee039a1` BACKPORT: userfaultfd: move userfaultfd_ctx struct to header file * `ac96edb501` BACKPORT: userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb * `51eab7ecc4` BACKPORT: selftests/mm: add separate UFFDIO_MOVE test for PMD splitting * `f152691515` BACKPORT: selftests/mm: add UFFDIO_MOVE ioctl test * `a5d504c067` BACKPORT: selftests/mm: add uffd_test_case_ops to allow test case-specific operations * `ee72d5a7d9` BACKPORT: selftests/mm: call uffd_test_ctx_clear at the end of the test * `abd6748ba6` UPSTREAM: userfaultfd: fix return error if mmap_changing is non-zero in MOVE ioctl * `4f658d7723` BACKPORT: userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE * `bfb4b24b64` BACKPORT: mm: userfaultfd: fix unexpected change to src_folio when UFFDIO_MOVE fails * `6ecd08eaf4` BACKPORT: userfaultfd: handle zeropage moves by UFFDIO_MOVE * `e275c2b743` UPSTREAM: userfaultfd: avoid huge_zero_page in UFFDIO_MOVE * `60c5a0e023` UPSTREAM: userfaultfd: fix move_pages_pte() splitting folio under RCU read lock * `5025ad140e` BACKPORT: userfaultfd: UFFDIO_MOVE uABI * `25db7c13d8` UPSTREAM: mm/rmap: support move to different root anon_vma in folio_move_anon_rmap() * `503add1843` ANDROID: PM: hibernate: Encryption support with compression * `3e99ae28ea` ANDROID: abi_gki_aarch64_qcom: Update symbol list * `8f08ea0d59` ANDROID: vendor_hooks: Add hooks to support hibernation * `e7e8932600` ANDROID: gki_defconfig: Sync gki_defconfig * `54c2418b76` UPSTREAM: PM: hibernate: Support to select compression 
algorithm * `76c7e9747b` UPSTREAM: PM: hibernate: Add support for LZ4 compression for hibernation * `990d3701d0` BACKPORT: PM: hibernate: Move to crypto APIs for LZO compression * `d224d17a14` BACKPORT: PM: hibernate: Rename lzo* to make it generic * `dcb09569bb` ANDROID: ABI: Update symbol list for Exynos SoC * `692e3553d2` ANDROID: abi_gki_aarch64_qcom: Update symbol list * `8943be7d1b` BACKPORT: mtk-mmsys: Change mtk-mmsys & mtk-mutex to modules * `34e8dc4ed0` BACKPORT: clk: mediatek: Split configuration options for MT8186 clock drivers * `a5ce14670a` BACKPORT: clk: mediatek: Add MODULE_LICENSE() where missing * `4bfe25d0b6` ANDROID: Update the ABI symbol list * `24edb63b85` Reapply "ANDROID: block: Add support for filesystem requests and small segments" * `141ebdcb28` UPSTREAM: usb:typec:tcpm:support double Rp to Vbus cable as sink * `8672a5ee4d` ANDROID: Update the ABI symbol list Change-Id: I594743790b6a498847862039bd47c65c51876b73 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2024-05-22 10:50:52 +00:00
Seiya Wang	0a5aada71c	ANDROID: GKI: Update symbol list for mtk 3 function symbol(s) added 'int dev_pm_opp_register_notifier(struct device, struct notifier_block)' 'int dev_pm_opp_unregister_notifier(struct device, struct notifier_block)' 'int snd_soc_suspend(struct device*)' Bug: 341821144 Change-Id: Iafcfaede99a35e10d9162e0298a7e3feb43cec73 Signed-off-by: Seiya Wang <seiya.wang@mediatek.com>	2024-05-21 09:55:20 +00:00
Greg Kroah-Hartman	b98ce0fe28	ANDROID: GKI: update the abi for tracing changes in 6.1.84 In 6.1.84, a number of internal tracing structures changed. Those structures are not used outside of the core kernel, but due to opaque pointers being carried into some abi signatures, they are tracked by the .stg file. Update the .stg file to handle these changes, as they are safe to modify at this point in time. The changes are: INFO: ABI DIFFERENCES HAVE BEEN DETECTED! INFO: type 'struct trace_buffer' changed byte size changed from 224 to 216 2 members ('bool time_stamp_abs' .. 'struct ring_buffer_ext_cb* ext_cb') changed offset changed by -64 type 'struct ring_buffer_per_cpu' changed byte size changed from 496 to 488 type 'struct rb_irq_work' changed byte size changed from 96 to 88 member 'long wait_index' was removed 3 members ('bool waiters_pending' .. 'bool wakeup_full') changed offset changed by -64 Fixes: `347385861c` ("Linux 6.1.84") Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Id0c90b04188335ffa9a40db0397ed5a12080ca95	2024-05-20 16:04:32 +00:00
Greg Kroah-Hartman	5f29666f69	Revert "timers: Rename del_timer_sync() to timer_delete_sync()" This reverts commit `113d5341ee` which is commit `9b13df3fb6` upstream. It breaks the Android kernel ABI by turning del_timer_sync() into an inline function. Fix this by putting it back as needed AND fix up the only use of this new function in drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c which is what caused this commit to be backported to 5.4.274 in the first place. Bug: 161946584 Change-Id: Icd26c7c81e6172f36eeeb69827989bfab1d32afe Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2024-05-20 11:02:50 +00:00
Greg Kroah-Hartman	501c229a8a	Revert "media: mc: Add num_links flag to media_pad" This reverts commit `cff51913c5` which is commit baeddf94aa61879b118f2faa37ed126d772670cc upstream. It breaks the Android kernel abi and can be brought back in the future in an abi-safe way if it is really needed. Bug: 161946584 Change-Id: I5b874c8b01bdd8cdeed6dec216fdad500593f5a7 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2024-05-20 10:34:50 +00:00
Greg Kroah-Hartman	2b84f5edda	Revert "media: mc: Expand MUST_CONNECT flag to always require an enabled link" This reverts commit `e2c545b841` which is commit b3decc5ce7d778224d266423b542326ad469cb5f upstream. It breaks the Android kernel abi and can be brought back in the future in an abi-safe way if it is really needed. Bug: 161946584 Change-Id: I94f10b3fe86210799b5697259e32114f00f080f0 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2024-05-20 10:34:50 +00:00
Krishna Kurapati	34a15d3507	UPSTREAM: usb: gadget: ncm: Avoid dropping datagrams of properly parsed NTBs It is observed sometimes when tethering is used over NCM with Windows 11 as host, at some instances, the gadget_giveback has one byte appended at the end of a proper NTB. When the NTB is parsed, unwrap call looks for any leftover bytes in SKB provided by u_ether and if there are any pending bytes, it treats them as a separate NTB and parses it. But in case the second NTB (as per unwrap call) is faulty/corrupt, all the datagrams that were parsed properly in the first NTB and saved in rx_list are dropped. Adding a few custom traces showed the following: [002] d..1 7828.532866: dwc3_gadget_giveback: ep1out: req 000000003868811a length 1025/16384 zsI ==> 0 [002] d..1 7828.532867: ncm_unwrap_ntb: K: ncm_unwrap_ntb toprocess: 1025 [002] d..1 7828.532867: ncm_unwrap_ntb: K: ncm_unwrap_ntb nth: 1751999342 [002] d..1 7828.532868: ncm_unwrap_ntb: K: ncm_unwrap_ntb seq: 0xce67 [002] d..1 7828.532868: ncm_unwrap_ntb: K: ncm_unwrap_ntb blk_len: 0x400 [002] d..1 7828.532868: ncm_unwrap_ntb: K: ncm_unwrap_ntb ndp_len: 0x10 [002] d..1 7828.532869: ncm_unwrap_ntb: K: Parsed NTB with 1 frames In this case, the giveback is of 1025 bytes and block length is 1024. The rest 1 byte (which is 0x00) won't be parsed resulting in drop of all datagrams in rx_list. 
Same is case with packets of size 2048: [002] d..1 7828.557948: dwc3_gadget_giveback: ep1out: req 0000000011dfd96e length 2049/16384 zsI ==> 0 [002] d..1 7828.557949: ncm_unwrap_ntb: K: ncm_unwrap_ntb nth: 1751999342 [002] d..1 7828.557950: ncm_unwrap_ntb: K: ncm_unwrap_ntb blk_len: 0x800 Lecroy shows one byte coming in extra confirming that the byte is coming in from PC: Transfer 2959 - Bytes Transferred(1025) Timestamp((18.524 843 590) - Transaction 8391 - Data(1025 bytes) Timestamp(18.524 843 590) --- Packet 4063861 Data(1024 bytes) Duration(2.117us) Idle(14.700ns) Timestamp(18.524 843 590) --- Packet 4063863 Data(1 byte) Duration(66.160ns) Time(282.000ns) Timestamp(18.524 845 722) According to Windows driver, no ZLP is needed if wBlockLength is non-zero, because the non-zero wBlockLength has already told the function side the size of transfer to be expected. However, there are in-market NCM devices that rely on ZLP as long as the wBlockLength is multiple of wMaxPacketSize. To deal with such devices, it pads an extra 0 at end so the transfer is no longer multiple of wMaxPacketSize. Cc: <stable@vger.kernel.org> Fixes: `9f6ce4240a` ("usb: gadget: f_ncm.c added") Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20240205074650.200304-1-quic_kriskura@quicinc.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit 76c51146820c5dac629f21deafab0a7039bc3ccd https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master) Bug: 320608613 Change-Id: Iee598bcbede12582235fca38a0c9f50f3b7375c5 Signed-off-by: Krishna Kurapati <quic_kriskura@quicinc.com> (cherry picked from commit c344c3ebe3fead1ed0c12bd686be083748011342)	2024-05-20 05:51:47 +00:00
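The trailing-byte case described in this commit can be sketched in plain C. This is a minimal userspace illustration, not the driver code: the enum `ntb_next` and the helper `ntb_after_block()` are hypothetical names, and the only behaviour taken from the message is that a single leftover 0x00 byte after a properly parsed NTB is treated as padding rather than as the start of another NTB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: what to do once one NTB has been parsed. */
enum ntb_next {
    NTB_DONE,       /* nothing more to parse, keep the datagrams already queued */
    NTB_PARSE_NEXT  /* leftover bytes look like another NTB, parse them too */
};

/*
 * Hypothetical helper. buf holds the whole transfer (total_len bytes) and
 * block_len is the block length reported by the NTB header just parsed.
 */
static enum ntb_next ntb_after_block(const uint8_t *buf, size_t total_len,
                                     size_t block_len)
{
    size_t left;

    assert(buf != NULL && total_len >= block_len);
    left = total_len - block_len;

    /* One extra zero byte is host-side padding, not a second NTB. */
    if (left == 1 && buf[block_len] == 0x00)
        return NTB_DONE;

    /* Any other leftover data is handed back to the parser. */
    if (left > 0 && block_len != 0)
        return NTB_PARSE_NEXT;

    return NTB_DONE;
}
```

With the 1025-byte giveback and 0x400 block length from the trace above, the helper reports that parsing is finished, so the datagrams from the first NTB are not dropped.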
Kever Yang	bd552fcbbd	ANDROID: GKI: Update rockchip symbols to add iova APIs INFO: 2 function symbol(s) added 'struct iova* alloc_iova(struct iova_domain, unsigned long, unsigned long, bool)' 'void free_iova(struct iova_domain, unsigned long)' Bug: 300024866 Change-Id: Iccdadf2b516343411871f1df0f46299af9b51c97 Signed-off-by: Kever Yang <kever.yang@rock-chips.com>	2024-05-18 20:21:31 +00:00
Qais Yousef	4ed706c20a	FROMLIST: sched/pi: Reweight fair_policy() tasks when inheriting prio For fair tasks inheriting the priority (nice) without reweighting is a NOP as the task's share won't change. This is visible when running with PTHREAD_PRIO_INHERIT where fair tasks with low priority values are susceptible to starvation leading to PI like impact on lock contention. The logic in rt_mutex will reset these low priority fair tasks into nice 0, but without the additional reweight operation to actually update the weights, it doesn't have the desired impact of boosting them to allow them to run sooner/longer to release the lock. Apply the reweight for fair_policy() tasks to achieve the desired boost for those low nice values tasks. Note that boost here means resetting their nice to 0; as this is what the current logic does for fair tasks. We need to re-instate ordering fair tasks by their priority order on the waiter tree to ensure we inherit the top_waiter properly. Handling of idle_policy() requires more code refactoring and is not handled yet. idle_policy() are treated specially and only run when the CPU is idle and get a hardcoded low weight value. Changing weights won't be enough without a promotion first to SCHED_OTHER. Tested with a test program that creates three threads. 1. main thread that spawns high prio and low prio task and busy loops 2. low priority thread that holds a pthread_mutex() with PTHREAD_PRIO_INHERIT protocol. Runs at nice +10. Busy loops after holding the lock. 3. high priority thread that holds a pthread_mutex() with PTHREAD_PRIO_INHERIT, but made to start after the low priority thread. Runs at nice 0. Should remain blocked by the low priority thread. All tasks are pinned to CPU0. Without the patch I can see the low priority thread running only for ~10% of the time which is what is expected without it being boosted. With the patch the low priority thread runs for ~50% which is what is expected if it gets boosted to nice 0. 
I modified the test program logic afterwards to ensure that after releasing the lock the low priority thread goes back to running for 10% of the time, and it does. Bug: 263876335 Link: https://lore.kernel.org/lkml/20240514160711.hpdg64grdwc43ux7@airbuntu/ Reported-by: Yabin Cui <yabinc@google.com> Signed-off-by: Qais Yousef <qyousef@layalina.io> [Fix trivial conflict with vendor hook] Signed-off-by: Qais Yousef <qyousef@google.com> Change-Id: Ia954ee528495b5cf5c3a2157c68b4a757cef1f83 (cherry picked from commit 23ac35ed8fc6220e4e498a21d22a9dbe67e7da9b) Signed-off-by: Qais Yousef <qyousef@google.com>	2024-05-18 19:08:51 +00:00
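The ~10% and ~50% figures quoted in this commit follow from the fair scheduler's load weights. The sketch below reproduces that arithmetic under two assumptions that are not stated in the message: the busy-looping main thread runs at nice 0, and only it and the low priority thread compete for CPU0. The weights 1024 (nice 0) and 110 (nice +10) are the values from the kernel's `sched_prio_to_weight[]` table; `cpu_share()` is a hypothetical helper.

```c
#include <assert.h>

/* Load weights from the kernel's sched_prio_to_weight[] table. */
enum {
    WEIGHT_NICE_0  = 1024,  /* nice 0 */
    WEIGHT_NICE_10 = 110    /* nice +10 */
};

/*
 * Hypothetical helper: fraction of one CPU that a task with own_weight
 * receives while competing with a single other busy task.
 */
static double cpu_share(unsigned int own_weight, unsigned int other_weight)
{
    assert(own_weight + other_weight > 0);
    return (double)own_weight / (double)(own_weight + other_weight);
}
```

Under these assumptions, `cpu_share(WEIGHT_NICE_10, WEIGHT_NICE_0)` is 110/1134, a little under 10%, while `cpu_share(WEIGHT_NICE_0, WEIGHT_NICE_0)` is exactly one half once the lock holder is reweighted to nice 0.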
liangjlee	b1e11ffd90	ANDROID: Update the ABI symbol list Adding the following symbols: - __traceiter_android_rvh_do_read_fault - __tracepoint_android_rvh_do_read_fault Bug: 336873696 Change-Id: I7ff2b064942826dcadc949595c9d7df917123986 Signed-off-by: liangjlee <liangjlee@google.com>	2024-05-18 19:08:12 +00:00
liangjlee	29a00abe43	ANDROID: mm: Add restricted vendor hook in do_read_fault() This patch adds a restricted vendor hook in do_read_fault() for tracking which file and offsets are faulted. Bug: 336736235 Change-Id: I425690e58550c4ac44912daa10b5eac0728bfb4e Signed-off-by: liangjlee <liangjlee@google.com>	2024-05-18 19:08:12 +00:00
Greg Kroah-Hartman	4078fa637f	Linux 6.1.91 Link: https://lore.kernel.org/r/20240514101020.320785513@linuxfoundation.org Tested-by: Miguel Ojeda <ojeda@kernel.org> Tested-by: Pavel Machek (CIP) <pavel@denx.de> Tested-by: Allen Pais <apais@linux.microsoft.com> Tested-by: Yann Sionneau <ysionneau@kalrayinc.com> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Link: https://lore.kernel.org/r/20240515082456.986812732@linuxfoundation.org Tested-by: Salvatore Bonaccorso <carnil@debian.org> Tested-by: Conor Dooley <conor.dooley@microchip.com> Tested-by: Ron Economos <re@w6rz.net> Tested-by: Yann Sionneau<ysionneau@kalrayinc.com> Link: https://lore.kernel.org/r/20240516091232.619851361@linuxfoundation.org Tested-by: Pavel Machek (CIP) <pavel@denx.de> Tested-by: Mark Brown <broonie@kernel.org> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: SeongJae Park <sj@kernel.org> Tested-by: Florian Fainelli <florian.fainelli@broadcom.com> Tested-by: kernelci.org bot <bot@kernelci.org> Tested-by: Allen Pais <apais@linux.microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-17 11:56:25 +02:00
Doug Berger	8064a711c4	net: bcmgenet: synchronize UMAC_CMD access commit 0d5e2a82232605b337972fb2c7d0cbc46898aca1 upstream The UMAC_CMD register is written from different execution contexts and has insufficient synchronization protections to prevent possible corruption. Of particular concern are the accesses from the phy_device delayed work context used by the adjust_link call and the BH context that may be used by the ndo_set_rx_mode call. A spinlock is added to the driver to protect contended register accesses (i.e. reg_lock) and it is used to synchronize accesses to UMAC_CMD. Fixes: `1c1008c793` ("net: bcmgenet: add main driver file") Cc: stable@vger.kernel.org Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-17 11:56:25 +02:00
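The pattern this commit describes, a lock held across a register read-modify-write so that two contexts cannot interleave their updates, can be modelled in userspace. This is a sketch only: `struct fake_genet`, `reg_lock_acquire()`, `reg_lock_release()` and `umac_cmd_update()` are hypothetical names, the register is an ordinary variable, and a C11 `atomic_flag` stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's private data. */
struct fake_genet {
    atomic_flag reg_lock;  /* models the reg_lock spinlock */
    uint32_t umac_cmd;     /* models the UMAC_CMD register */
};

static void reg_lock_acquire(struct fake_genet *p)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&p->reg_lock, memory_order_acquire))
        ;
}

static void reg_lock_release(struct fake_genet *p)
{
    atomic_flag_clear_explicit(&p->reg_lock, memory_order_release);
}

/* Clear and set bits in the modelled register as one locked step. */
static uint32_t umac_cmd_update(struct fake_genet *p, uint32_t clear, uint32_t set)
{
    uint32_t val;

    assert(p != NULL);
    reg_lock_acquire(p);
    val = (p->umac_cmd & ~clear) | set;  /* read, modify */
    p->umac_cmd = val;                   /* write back */
    reg_lock_release(p);
    return val;
}
```

Without the lock, two contexts could both read the old value and the later write would silently discard the other's bits, which is the kind of corruption the commit message refers to.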
Doug Berger	9ed299be99	net: bcmgenet: synchronize use of bcmgenet_set_rx_mode() commit 2dbe5f19368caae63b1f59f5bc2af78c7d522b3a upstream The ndo_set_rx_mode function is synchronized with the netif_addr_lock spinlock and BHs disabled. Since this function is also invoked directly from the driver the same synchronization should be applied. Fixes: `72f9634762` ("net: bcmgenet: set Rx mode before starting netif") Cc: stable@vger.kernel.org Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-17 11:56:24 +02:00
Doug Berger	714e053565	net: bcmgenet: synchronize EXT_RGMII_OOB_CTRL access commit d85cf67a339685beae1d0aee27b7f61da95455be upstream The EXT_RGMII_OOB_CTRL register can be written from different contexts. It is predominantly written from the adjust_link handler which is synchronized by the phydev->lock, but can also be written from a different context when configuring the mii in bcmgenet_mii_config(). The chances of contention are quite low, but it is conceivable that adjust_link could occur during resume when WoL is enabled so use the phydev->lock synchronizer in bcmgenet_mii_config() to be sure. Fixes: `afe3f907d2` ("net: bcmgenet: power on MII block for all MII modes") Cc: stable@vger.kernel.org Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-17 11:56:24 +02:00
Florian Fainelli	ed804e9d8b	net: bcmgenet: Clear RGMII_LINK upon link down commit `696450c051` upstream Clear the RGMII_LINK bit upon detecting link down to be consistent with setting the bit upon link up. We also move the clearing of the out-of-band disable to the runtime initialization rather than for each link up/down transition. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20221118213754.1383364-1-f.fainelli@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-17 11:56:24 +02:00
Li Nan	beaf11969f	md: fix kmemleak of rdev->serial commit 6cf350658736681b9d6b0b6e58c5c76b235bb4c4 upstream. If kobject_add() fails in bind_rdev_to_array(), 'rdev->serial' will have been allocated but not freed, and a kmemleak occurs. unreferenced object 0xffff88815a350000 (size 49152): comm "mdadm", pid 789, jiffies 4294716910 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc f773277a): [<0000000058b0a453>] kmemleak_alloc+0x61/0xe0 [<00000000366adf14>] __kmalloc_large_node+0x15e/0x270 [<000000002e82961b>] __kmalloc_node.cold+0x11/0x7f [<00000000f206d60a>] kvmalloc_node+0x74/0x150 [<0000000034bf3363>] rdev_init_serial+0x67/0x170 [<0000000010e08fe9>] mddev_create_serial_pool+0x62/0x220 [<00000000c3837bf0>] bind_rdev_to_array+0x2af/0x630 [<0000000073c28560>] md_add_new_disk+0x400/0x9f0 [<00000000770e30ff>] md_ioctl+0x15bf/0x1c10 [<000000006cfab718>] blkdev_ioctl+0x191/0x3f0 [<0000000085086a11>] vfs_ioctl+0x22/0x60 [<0000000018b656fe>] __x64_sys_ioctl+0xba/0xe0 [<00000000e54e675e>] do_syscall_64+0x71/0x150 [<000000008b0ad622>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 Fixes: `963c555e75` ("md: introduce mddev_create/destroy_wb_pool for the change of member device") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240208085556.2412922-1-linan666@huaweicloud.com [ mddev_destroy_serial_pool third parameter was removed in mainline, where there is no need to suspend within this function anymore. ] Signed-off-by: Jeremy Bongio <jbongio@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-17 11:56:24 +02:00
Oscar Salvador	ea92809e29	mm,swapops: update check in is_pfn_swap_entry for hwpoison entries commit 07a57a338adb6ec9e766d6a6790f76527f45ceb5 upstream. Tony reported that the Machine check recovery was broken in v6.9-rc1, as he was hitting a VM_BUG_ON when injecting uncorrectable memory errors to DRAM. After some more digging and debugging on his side, he realized that this went back to v6.1, with the introduction of 'commit `0d206b5d2e` ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")'. That commit, among other things, introduced swp_offset_pfn(), replacing hwpoison_entry_to_pfn() in its favour. The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(), but is_pfn_swap_entry() never got updated to cover hwpoison entries, which means that we would hit the VM_BUG_ON whenever we would call swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM set. Fix this by updating the check to cover hwpoison entries as well, and update the comment while we are at it. Link: https://lkml.kernel.org/r/20240407130537.16977-1-osalvador@suse.de Fixes: `0d206b5d2e` ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: Tony Luck <tony.luck@intel.com> Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/ Tested-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: <stable@vger.kernel.org> [6.1.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-17 11:56:24 +02:00
Miaohe Lin	2effe407f7	mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio() commit 52ccdde16b6540abe43b6f8d8e1e1ec90b0983af upstream. When I did memory failure tests recently, below warning occurs: DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 8 PID: 1011 at kernel/locking/lockdep.c:232 __lock_acquire+0xccb/0x1ca0 Modules linked in: mce_inject hwpoison_inject CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 RIP: 0010:__lock_acquire+0xccb/0x1ca0 RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082 RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8 RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0 RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10 R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004 FS: 00007ff9f32aa740(0000) GS:ffffa1ce5fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff9f3134ba0 CR3: 00000008484e4000 CR4: 00000000000006f0 Call Trace: <TASK> lock_acquire+0xbe/0x2d0 _raw_spin_lock_irqsave+0x3a/0x60 hugepage_subpool_put_pages.part.0+0xe/0xc0 free_huge_folio+0x253/0x3f0 dissolve_free_huge_page+0x147/0x210 __page_handle_poison+0x9/0x70 memory_failure+0x4e6/0x8c0 hard_offline_page_store+0x55/0xa0 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x380/0x540 ksys_write+0x64/0xe0 do_syscall_64+0xbc/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff9f3114887 RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887 RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001 RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c R13: 00007ff9f321b780 R14: 
00007ff9f3217600 R15: 00007ff9f3216a00 </TASK> Kernel panic - not syncing: kernel: panic_on_warn set ... CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> panic+0x326/0x350 check_panic_on_warn+0x4f/0x50 __warn+0x98/0x190 report_bug+0x18e/0x1a0 handle_bug+0x3d/0x70 exc_invalid_op+0x18/0x70 asm_exc_invalid_op+0x1a/0x20 RIP: 0010:__lock_acquire+0xccb/0x1ca0 RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082 RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8 RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0 RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10 R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004 lock_acquire+0xbe/0x2d0 _raw_spin_lock_irqsave+0x3a/0x60 hugepage_subpool_put_pages.part.0+0xe/0xc0 free_huge_folio+0x253/0x3f0 dissolve_free_huge_page+0x147/0x210 __page_handle_poison+0x9/0x70 memory_failure+0x4e6/0x8c0 hard_offline_page_store+0x55/0xa0 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x380/0x540 ksys_write+0x64/0xe0 do_syscall_64+0xbc/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff9f3114887 RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887 RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001 RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00 </TASK> After git bisecting and digging into the code, I believe the root cause is that _deferred_list field of folio is unioned with _hugetlb_subpool field. 
In __update_and_free_hugetlb_folio(), folio->_deferred_list is initialized leading to corrupted folio->_hugetlb_subpool when folio is hugetlb. Later free_huge_folio() will use _hugetlb_subpool and above warning happens. But it is assumed hugetlb flag must have been cleared when calling folio_put() in update_and_free_hugetlb_folio(). This assumption is broken due to below race: CPU1 CPU2 dissolve_free_huge_page update_and_free_pages_bulk update_and_free_hugetlb_folio hugetlb_vmemmap_restore_folios folio_clear_hugetlb_vmemmap_optimized clear_flag = folio_test_hugetlb_vmemmap_optimized if (clear_flag) <-- False, it's already cleared. __folio_clear_hugetlb(folio) <-- Hugetlb is not cleared. folio_put free_huge_folio <-- free_the_page is expected. list_for_each_entry() __folio_clear_hugetlb <-- Too late. Fix this issue by checking whether folio is hugetlb directly instead of checking clear_flag to close the race window. Link: https://lkml.kernel.org/r/20240419085819.1901645-1-linmiaohe@huawei.com Fixes: `32c877191e` ("hugetlb: do not clear hugetlb dtor until allocating vmemmap") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-17 11:56:24 +02:00

1165241 Commits