Commit Graph

1165241 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
524ae3c9d3 Merge 6.1.107 into android14-6.1-lts
Changes in 6.1.107
	tty: atmel_serial: use the correct RTS flag.
	fuse: Initialize beyond-EOF page contents before setting uptodate
	char: xillybus: Don't destroy workqueue from work item running on it
	char: xillybus: Refine workqueue handling
	char: xillybus: Check USB endpoints when probing device
	ALSA: usb-audio: Add delay quirk for VIVO USB-C-XE710 HEADSET
	ALSA: usb-audio: Support Yamaha P-125 quirk entry
	xhci: Fix Panther point NULL pointer deref at full-speed re-enumeration
	thunderbolt: Mark XDomain as unplugged when router is removed
	s390/dasd: fix error recovery leading to data corruption on ESE devices
	riscv: change XIP's kernel_map.size to be size of the entire kernel
	arm64: ACPI: NUMA: initialize all values of acpi_early_node_map to NUMA_NO_NODE
	dm resume: don't return EINVAL when signalled
	dm persistent data: fix memory allocation failure
	vfs: Don't evict inode under the inode lru traversing context
	fs/ntfs3: add prefix to bitmap_size() and use BITS_TO_U64()
	s390/cio: rename bitmap_size() -> idset_bitmap_size()
	btrfs: rename bitmap_set_bits() -> btrfs_bitmap_set_bits()
	bitmap: introduce generic optimized bitmap_size()
	fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE
	i2c: qcom-geni: Add missing geni_icc_disable in geni_i2c_runtime_resume
	rtla/osnoise: Prevent NULL dereference in error handling
	fs/netfs/fscache_cookie: add missing "n_accesses" check
	selinux: fix potential counting error in avc_add_xperms_decision()
	mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
	btrfs: zoned: properly take lock to read/update block group's zoned variables
	btrfs: tree-checker: add dev extent item checks
	drm/amdgpu: Actually check flags for all context ops.
	memcg_write_event_control(): fix a user-triggerable oops
	drm/amdgpu/jpeg2: properly set atomics vmid field
	s390/uv: Panic for set and remove shared access UVC errors
	bpf: Fix updating attached freplace prog in prog_array map
	nilfs2: prevent WARNING in nilfs_dat_commit_end()
	ext4, jbd2: add an optimized bmap for the journal inode
	9P FS: Fix wild-memory-access write in v9fs_get_acl
	nilfs2: initialize "struct nilfs_binfo_dat"->bi_pad field
	mm: khugepaged: fix kernel BUG in hpage_collapse_scan_file()
	bpf: Split off basic BPF verifier log into separate file
	bpf: drop unnecessary user-triggerable WARN_ONCE in verifierl log
	posix-timers: Ensure timer ID search-loop limit is valid
	pid: Replace struct pid 1-element array with flex-array
	gfs2: Rename remaining "transaction" glock references
	gfs2: Rename the {freeze,thaw}_super callbacks
	gfs2: Rename gfs2_freeze_lock{ => _shared }
	gfs2: Rename SDF_{FS_FROZEN => FREEZE_INITIATOR}
	gfs2: Rework freeze / thaw logic
	gfs2: Stop using gfs2_make_fs_ro for withdraw
	Bluetooth: Fix hci_link_tx_to RCU lock usage
	wifi: mac80211: take wiphy lock for MAC addr change
	wifi: mac80211: fix change_address deadlock during unregister
	net: sched: Print msecs when transmit queue time out
	net: don't dump stack on queue timeout
	jfs: fix shift-out-of-bounds in dbJoin
	squashfs: squashfs_read_data need to check if the length is 0
	Squashfs: fix variable overflow triggered by sysbot
	reiserfs: fix uninit-value in comp_keys
	erofs: avoid debugging output for (de)compressed data
	quota: Detect loops in quota tree
	net:rds: Fix possible deadlock in rds_message_put
	net: sctp: fix skb leak in sctp_inq_free()
	pppoe: Fix memory leak in pppoe_sendmsg()
	wifi: mac80211: fix and simplify unencrypted drop check for mesh
	wifi: cfg80211: move A-MSDU check in ieee80211_data_to_8023_exthdr
	wifi: cfg80211: factor out bridge tunnel / RFC1042 header check
	wifi: mac80211: remove mesh forwarding congestion check
	wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces
	wifi: mac80211: add a workaround for receiving non-standard mesh A-MSDU
	wifi: cfg80211: check A-MSDU format more carefully
	docs/bpf: Document BPF_MAP_TYPE_LPM_TRIE map
	bpf: Replace bpf_lpm_trie_key 0-length array with flexible array
	bpf: Avoid kfree_rcu() under lock in bpf_lpm_trie.
	Bluetooth: RFCOMM: Fix not validating setsockopt user input
	ext4: check the return value of ext4_xattr_inode_dec_ref()
	ext4: fold quota accounting into ext4_xattr_inode_lookup_create()
	ext4: do not create EA inode under buffer lock
	udf: Fix bogus checksum computation in udf_rename()
	bpf, net: Use DEV_STAT_INC()
	fou: remove warn in gue_gro_receive on unsupported protocol
	jfs: fix null ptr deref in dtInsertEntry
	jfs: Fix shift-out-of-bounds in dbDiscardAG
	fs/ntfs3: Do copy_to_user out of run_lock
	ALSA: usb: Fix UBSAN warning in parse_audio_unit()
	igc: Correct the launchtime offset
	igc: Fix packet still tx after gate close by reducing i226 MAC retry buffer
	net/mlx5e: Take state lock during tx timeout reporter
	net/mlx5e: Correctly report errors for ethtool rx flows
	atm: idt77252: prevent use after free in dequeue_rx()
	net: axienet: Fix register defines comment description
	net: dsa: vsc73xx: pass value in phy_write operation
	net: dsa: vsc73xx: use read_poll_timeout instead delay loop
	net: dsa: vsc73xx: check busy flag in MDIO operations
	mlxbf_gige: Remove two unused function declarations
	mlxbf_gige: disable RX filters until RX path initialized
	mptcp: correct MPTCP_SUBFLOW_ATTR_SSN_OFFSET reserved size
	netfilter: allow ipv6 fragments to arrive on different devices
	netfilter: flowtable: initialise extack before use
	netfilter: nf_queue: drop packets with cloned unconfirmed conntracks
	netfilter: nf_tables: Audit log dump reset after the fact
	netfilter: nf_tables: Drop pointless memset in nf_tables_dump_obj
	netfilter: nf_tables: Unconditionally allocate nft_obj_filter
	netfilter: nf_tables: A better name for nft_obj_filter
	netfilter: nf_tables: Carry s_idx in nft_obj_dump_ctx
	netfilter: nf_tables: nft_obj_filter fits into cb->ctx
	netfilter: nf_tables: Carry reset boolean in nft_obj_dump_ctx
	netfilter: nf_tables: Introduce nf_tables_getobj_single
	netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests
	net: hns3: fix wrong use of semaphore up
	net: hns3: use the user's cfg after reset
	net: hns3: fix a deadlock problem when config TC during resetting
	ALSA: hda/realtek: Fix noise from speakers on Lenovo IdeaPad 3 15IAU7
	drm/amd/amdgpu/imu_v11_0: Increase buffer size to ensure all possible values can be stored
	ssb: Fix division by zero issue in ssb_calc_clock_rate
	wifi: cfg80211: check wiphy mutex is held for wdev mutex
	wifi: mac80211: fix BA session teardown race
	mm: Remove kmem_valid_obj()
	rcu: Dump memory object info if callback function is invalid
	rcu: Eliminate rcu_gp_slow_unregister() false positive
	wifi: cw1200: Avoid processing an invalid TIM IE
	cgroup: Avoid extra dereference in css_populate_dir()
	i2c: riic: avoid potential division by zero
	RDMA/rtrs: Fix the problem of variable not initialized fully
	s390/smp,mcck: fix early IPI handling
	drm/bridge: tc358768: Attempt to fix DSI horizontal timings
	i3c: mipi-i3c-hci: Remove BUG() when Ring Abort request times out
	i3c: mipi-i3c-hci: Do not unmap region not mapped for transfer
	drm/amdkfd: Move dma unmapping after TLB flush
	media: radio-isa: use dev_name to fill in bus_info
	staging: iio: resolver: ad2s1210: fix use before initialization
	usb: gadget: uvc: cleanup request when not in correct state
	drm/amd/display: Validate hw_points_num before using it
	staging: ks7010: disable bh on tx_dev_lock
	media: s5p-mfc: Fix potential deadlock on condlock
	md/raid5-cache: use READ_ONCE/WRITE_ONCE for 'conf->log'
	binfmt_misc: cleanup on filesystem umount
	drm/tegra: Zero-initialize iosys_map
	media: qcom: venus: fix incorrect return value
	scsi: spi: Fix sshdr use
	gfs2: setattr_chown: Add missing initialization
	wifi: iwlwifi: abort scan when rfkill on but device enabled
	wifi: iwlwifi: fw: Fix debugfs command sending
	clk: visconti: Add bounds-checking coverage for struct visconti_pll_provider
	IB/hfi1: Fix potential deadlock on &irq_src_lock and &dd->uctxt_lock
	hwmon: (ltc2992) Avoid division by zero
	kbuild: rust_is_available: normalize version matching
	kbuild: rust_is_available: handle failures calling `$RUSTC`/`$BINDGEN`
	rust: work around `bindgen` 0.69.0 issue
	rust: suppress error messages from CONFIG_{RUSTC,BINDGEN}_VERSION_TEXT
	rust: fix the default format for CONFIG_{RUSTC,BINDGEN}_VERSION_TEXT
	arm64: Fix KASAN random tag seed initialization
	block: Fix lockdep warning in blk_mq_mark_tag_wait
	drm/msm: Reduce fallout of fence signaling vs reclaim hangs
	memory: tegra: Skip SID programming if SID registers aren't set
	powerpc/xics: Check return value of kasprintf in icp_native_map_one_cpu
	ASoC: SOF: ipc4: check return value of snd_sof_ipc_msg_data
	hwmon: (pc87360) Bounds check data->innr usage
	drm/rockchip: vop2: clear afbc en and transform bit for cluster window at linear mode
	Bluetooth: hci_conn: Check non NULL function before calling for HFP offload
	gfs2: Refcounting fix in gfs2_thaw_super
	nvmet-trace: avoid dereferencing pointer too early
	ext4: do not trim the group with corrupted block bitmap
	afs: fix __afs_break_callback() / afs_drop_open_mmap() race
	fuse: fix UAF in rcu pathwalks
	quota: Remove BUG_ON from dqget()
	kernfs: fix false-positive WARN(nr_mmapped) in kernfs_drain_open_files
	media: pci: cx23885: check cx23885_vdev_init() return
	fs: binfmt_elf_efpic: don't use missing interpreter's properties
	scsi: lpfc: Initialize status local variable in lpfc_sli4_repost_sgl_list()
	media: drivers/media/dvb-core: copy user arrays safely
	net/sun3_82586: Avoid reading past buffer in debug output
	drm/lima: set gp bus_stop bit before hard reset
	hrtimer: Select housekeeping CPU during migration
	virtiofs: forbid newlines in tags
	clocksource/drivers/arm_global_timer: Guard against division by zero
	netlink: hold nlk->cb_mutex longer in __netlink_dump_start()
	md: clean up invalid BUG_ON in md_ioctl
	x86: Increase brk randomness entropy for 64-bit systems
	memory: stm32-fmc2-ebi: check regmap_read return value
	parisc: Use irq_enter_rcu() to fix warning at kernel/context_tracking.c:367
	powerpc/boot: Handle allocation failure in simple_realloc()
	powerpc/boot: Only free if realloc() succeeds
	btrfs: delayed-inode: drop pointless BUG_ON in __btrfs_remove_delayed_item()
	btrfs: change BUG_ON to assertion when checking for delayed_node root
	btrfs: tests: allocate dummy fs_info and root in test_find_delalloc()
	btrfs: handle invalid root reference found in may_destroy_subvol()
	btrfs: send: handle unexpected data in header buffer in begin_cmd()
	btrfs: change BUG_ON to assertion in tree_move_down()
	btrfs: delete pointless BUG_ON check on quota root in btrfs_qgroup_account_extent()
	f2fs: fix to do sanity check in update_sit_entry
	usb: gadget: fsl: Increase size of name buffer for endpoints
	nvme: clear caller pointer on identify failure
	Bluetooth: bnep: Fix out-of-bound access
	firmware: cirrus: cs_dsp: Initialize debugfs_root to invalid
	rtc: nct3018y: fix possible NULL dereference
	net: hns3: add checking for vf id of mailbox
	nvmet-tcp: do not continue for invalid icreq
	NFS: avoid infinite loop in pnfs_update_layout.
	openrisc: Call setup_memory() earlier in the init sequence
	s390/iucv: fix receive buffer virtual vs physical address confusion
	irqchip/renesas-rzg2l: Do not set TIEN and TINT source at the same time
	clocksource: Make watchdog and suspend-timing multiplication overflow safe
	platform/x86: lg-laptop: fix %s null argument warning
	usb: dwc3: core: Skip setting event buffers for host only controllers
	fbdev: offb: replace of_node_put with __free(device_node)
	irqchip/gic-v3-its: Remove BUG_ON in its_vpe_irq_domain_alloc
	ext4: set the type of max_zeroout to unsigned int to avoid overflow
	nvmet-rdma: fix possible bad dereference when freeing rsps
	drm/amdgpu: fix dereference null return value for the function amdgpu_vm_pt_parent
	hrtimer: Prevent queuing of hrtimer without a function callback
	gtp: pull network headers in gtp_dev_xmit()
	media: solo6x10: replace max(a, min(b, c)) by clamp(b, a, c)
	i2c: tegra: allow DVC support to be compiled out
	i2c: tegra: allow VI support to be compiled out
	i2c: tegra: Do not mark ACPI devices as irq safe
	dm suspend: return -ERESTARTSYS instead of -EINTR
	net: mana: Fix doorbell out of order violation and avoid unnecessary doorbell rings
	btrfs: replace sb::s_blocksize by fs_info::sectorsize
	btrfs: send: allow cloning non-aligned extent if it ends at i_size
	drm/amd/display: Adjust cursor position
	platform/surface: aggregator: Fix warning when controller is destroyed in probe
	drm/amdkfd: reserve the BO before validating it
	Bluetooth: hci_core: Fix LE quote calculation
	Bluetooth: SMP: Fix assumption of Central always being Initiator
	net: dsa: tag_ocelot: do not rely on skb_mac_header() for VLAN xmit
	net: dsa: tag_ocelot: call only the relevant portion of __skb_vlan_pop() on TX
	net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and register injection
	net: mscc: ocelot: fix QoS class for injected packets with "ocelot-8021q"
	net: mscc: ocelot: serialize access to the injection/extraction groups
	tc-testing: don't access non-existent variable on exception
	selftests/net: synchronize udpgro tests' tx and rx connection
	selftests: udpgro: report error when receive failed
	tcp/dccp: bypass empty buckets in inet_twsk_purge()
	tcp/dccp: do not care about families in inet_twsk_purge()
	tcp: prevent concurrent execution of tcp_sk_exit_batch
	net: mctp: test: Use correct skb for route input check
	kcm: Serialise kcm_sendmsg() for the same socket.
	netfilter: nft_counter: Disable BH in nft_counter_offload_stats().
	netfilter: nft_counter: Synchronize nft_counter_reset() against reader.
	ip6_tunnel: Fix broken GRO
	bonding: fix bond_ipsec_offload_ok return type
	bonding: fix null pointer deref in bond_ipsec_offload_ok
	bonding: fix xfrm real_dev null pointer dereference
	bonding: fix xfrm state handling when clearing active slave
	ice: Prepare legacy-rx for upcoming XDP multi-buffer support
	ice: Add xdp_buff to ice_rx_ring struct
	ice: Store page count inside ice_rx_buf
	ice: Pull out next_to_clean bump out of ice_put_rx_buf()
	ice: fix page reuse when PAGE_SIZE is over 8k
	ice: fix ICE_LAST_OFFSET formula
	dpaa2-switch: Fix error checking in dpaa2_switch_seed_bp()
	net: dsa: mv88e6xxx: Fix out-of-bound access
	netem: fix return value if duplicate enqueue fails
	ipv6: prevent UAF in ip6_send_skb()
	ipv6: fix possible UAF in ip6_finish_output2()
	ipv6: prevent possible UAF in ip6_xmit()
	netfilter: flowtable: validate vlan header
	octeontx2-af: Fix CPT AF register offset calculation
	net: xilinx: axienet: Always disable promiscuous mode
	net: xilinx: axienet: Fix dangling multicast addresses
	drm/msm/dpu: don't play tricks with debug macros
	drm/msm/dp: fix the max supported bpp logic
	drm/msm/dp: reset the link phy params before link training
	drm/msm/dpu: cleanup FB if dpu_format_populate_layout fails
	mmc: mmc_test: Fix NULL dereference on allocation failure
	Bluetooth: MGMT: Add error handling to pair_device()
	scsi: core: Fix the return value of scsi_logical_block_count()
	ksmbd: the buffer of smb2 query dir response has at least 1 byte
	drm/amdgpu: Validate TA binary size
	MIPS: Loongson64: Set timer mode in cpu-probe
	HID: wacom: Defer calculation of resolution until resolution_code is known
	HID: microsoft: Add rumble support to latest xbox controllers
	Input: i8042 - add forcenorestore quirk to leave controller untouched even on s3
	Input: i8042 - use new forcenorestore quirk to replace old buggy quirk combination
	cxgb4: add forgotten u64 ivlan cast before shift
	KVM: arm64: Make ICC_*SGI*_EL1 undef in the absence of a vGICv3
	mmc: dw_mmc: allow biu and ciu clocks to defer
	pmdomain: imx: wait SSAR when i.MX93 power domain on
	mptcp: pm: re-using ID of unused removed ADD_ADDR
	mptcp: pm: re-using ID of unused removed subflows
	mptcp: pm: re-using ID of unused flushed subflows
	mptcp: pm: only decrement add_addr_accepted for MPJ req
	Revert "usb: gadget: uvc: cleanup request when not in correct state"
	Revert "drm/amd/display: Validate hw_points_num before using it"
	tcp: do not export tcp_twsk_purge()
	hwmon: (ltc2992) Fix memory leak in ltc2992_parse_dt()
	ALSA: timer: Relax start tick time check for slave timer elements
	mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0
	mm/numa: no task_numa_fault() call if PMD is changed
	mm/numa: no task_numa_fault() call if PTE is changed
	nfsd: Simplify code around svc_exit_thread() call in nfsd()
	nfsd: separate nfsd_last_thread() from nfsd_put()
	NFSD: simplify error paths in nfsd_svc()
	nfsd: call nfsd_last_thread() before final nfsd_put()
	nfsd: drop the nfsd_put helper
	nfsd: don't call locks_release_private() twice concurrently
	nfsd: Fix a regression in nfsd_setattr()
	Bluetooth: hci_ldisc: check HCI_UART_PROTO_READY flag in HCIUARTGETPROTO
	drm/amdgpu/vcn: identify unified queue in sw init
	drm/amdgpu/vcn: not pause dpg for unified queue
	KVM: x86: fire timer when it is migrated and expired, and in oneshot mode
	Revert "s390/dasd: Establish DMA alignment"
	udp: allow header check for dodgy GSO_UDP_L4 packets.
	gso: fix dodgy bit handling for GSO_UDP_L4
	net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation
	net: drop bad gso csum_start and offset in virtio_net_hdr
	wifi: mac80211: add documentation for amsdu_mesh_control
	wifi: mac80211: fix mesh path discovery based on unicast packets
	wifi: mac80211: fix mesh forwarding
	wifi: mac80211: fix flow dissection for forwarded packets
	wifi: mac80211: fix receiving mesh packets in forwarding=0 networks
	wifi: mac80211: drop bogus static keywords in A-MSDU rx
	wifi: mac80211: fix potential null pointer dereference
	wifi: cfg80211: fix receiving mesh packets without RFC1042 header
	gfs2: Fix another freeze/thaw hang
	gfs2: don't withdraw if init_threads() got interrupted
	gfs2: Remove LM_FLAG_PRIORITY flag
	gfs2: Remove freeze_go_demote_ok
	udp: fix receiving fraglist GSO packets
	ice: fix W=1 headers mismatch
	Revert "jfs: fix shift-out-of-bounds in dbJoin"
	net: change maximum number of UDP segments to 128
	selftests: net: more strict check in net_helper
	Input: MT - limit max slots
	tools: move alignment-related macros to new <linux/align.h>
	Linux 6.1.107

Change-Id: I11d18ae169b1e55f18f0dc2953df2dd3a1f25624
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-10-01 11:35:58 +00:00
Snehal Koukuntla
bd3cc5c733 UPSTREAM: KVM: arm64: Add memory length checks and remove inline in do_ffa_mem_xfer
When we share memory through FF-A and the description of the buffers
exceeds the size of the mapped buffer, the fragmentation API is used.
The fragmentation API allows specifying chunks of descriptors in subsequent
FF-A fragment calls and no upper limit has been established for this.
The entire memory region transferred is identified by a handle which can be
used to reclaim the transferred memory.
To be able to reclaim the memory, the description of the buffers has to fit
in the ffa_desc_buf.
Add a bounds check on the FF-A sharing path to prevent the memory reclaim
from failing.

Also do_ffa_mem_xfer() does not need __always_inline, except for the
BUILD_BUG_ON() aspect, which gets moved to a macro.

[maz: fixed the BUILD_BUG_ON() breakage with LLVM, thanks to Wei-Lin Chang
 for the timely report]

Fixes: 634d90cf0a ("KVM: arm64: Handle FFA_MEM_LEND calls from the host")
Cc: stable@vger.kernel.org
Reviewed-by: Sebastian Ene <sebastianene@google.com>
Signed-off-by: Snehal Koukuntla <snehalreddy@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240909180154.3267939-1-snehalreddy@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit f26a525b77e040d584e967369af1e018d2d59112)

Bug: 298514256
Change-Id: If515f5c03db42e7c994f9f82bef6167b104d75e2
Signed-off-by: Snehal Koukuntla <snehalreddy@google.com>
2024-10-01 08:13:54 +00:00
Pierre Couillaud
a43e7c2c12 ANDROID: GKI: Update symbol list for BCMSTB
INFO: 6 function symbol(s) added
  'void __read_overflow2_field(size_t, size_t)'
  'int iommu_fwspec_init(struct device*, struct fwnode_handle*, const struct iommu_ops*)'
  'struct pci_host_bridge* pci_find_host_bridge(struct pci_bus*)'
  'long strnlen_user(const char*, long)'
  'int tty_buffer_request_room(struct tty_port*, size_t)'
  'int tty_buffer_set_limit(struct tty_port*, int)'

Bug: 369085303
Change-Id: Ia56882467e3f523ad476db0d237f69ffc4e80084
Signed-off-by: Pierre Couillaud <pierre@broadcom.com>
2024-09-30 22:48:00 +00:00
Besar Wicaksono
5162f9a67b UPSTREAM: arm64: Add Neoverse-V2 part
[ Upstream commit f4d9d9dcc70b96b5e5d7801bd5fbf8491b07b13d ]

Add the part number and MIDR for Neoverse-V2

Bug: 342491759

Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Reviewed-by: James Clark <james.clark@arm.com>
Link: https://lore.kernel.org/r/20240109192310.16234-2-bwicaksono@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
[ Mark: trivial backport ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chunhui Li <chunhui.li@mediatek.com>
Change-Id: I2811f794a836d7c8f868b4f069e5d1e05ed69741
2024-09-30 18:10:28 +00:00
Greg Kroah-Hartman
8f2e4ac396 Revert "cgroup: Make operations on the cgroup root_list RCU safe"
This reverts commit f5b7a97920 which is
commit d23b5c577715892c87533b13923306acc6243f93 upstream.

It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.

Bug: 161946584
Change-Id: I6b60c2d6e3ef46d02d40e76c8fd0d0ca8ac58af9
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-09-30 15:23:48 +00:00
Greg Kroah-Hartman
b4c085bbdb Revert "cgroup: Move rcu_head up near the top of cgroup_root"
This reverts commit 0e76e9bb1d which is
commit a7fb0423c201ba12815877a0b5a68a6a1710b23a upstream.

It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.

Bug: 161946584
Change-Id: Ib15a38a3826b47d2a058a29c6b042107e70d2e33
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-09-30 15:23:36 +00:00
Greg Kroah-Hartman
aa4cd140bb Linux 6.1.112
Link: https://lore.kernel.org/r/20240927121719.897851549@linuxfoundation.org
Tested-by: Peter Schneider <pschneider1968@googlemail.com>
Tested-by: Allen Pais <apais@linux.microsoft.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Ron Economos <re@w6rz.net>
Tested-by: kernelci.org bot <bot@kernelci.org>
Tested-by: Pavel Machek (CIP) <pavel@denx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:56 +02:00
Edward Adam Davis
ba6269e187 USB: usbtmc: prevent kernel-usb-infoleak
commit 625fa77151f00c1bd00d34d60d6f2e710b3f9aad upstream.

The syzbot reported a kernel-usb-infoleak in usbtmc_write,
we need to clear the structure before filling fields.

Fixes: 4ddc645f40 ("usb: usbtmc: Add ioctl for vendor specific write")
Reported-and-tested-by: syzbot+9d34f80f841e948c3fdb@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9d34f80f841e948c3fdb
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Cc: stable <stable@kernel.org>
Link: https://lore.kernel.org/r/tencent_9649AA6EC56EDECCA8A7D106C792D1C66B06@qq.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:56 +02:00
Junhao Xie
c74796ff4f USB: serial: pl2303: add device id for Macrosilicon MS3020
commit 7d47d22444bb7dc1b6d768904a22070ef35e1fc0 upstream.

Add the device id for the Macrosilicon MS3020 which is a
PL2303HXN based device.

Signed-off-by: Junhao Xie <bigfoot@classfun.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:56 +02:00
Tony Luck
a20eea14a6 x86/mm: Switch to new Intel CPU model defines
commit 2eda374e883ad297bd9fe575a16c1dc850346075 upstream.

New CPU #defines encode vendor and family as well as model.

[ dhansen: vertically align 0's in invlpg_miss_ids[] ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240424181518.41946-1-tony.luck%40intel.com
[ Ricardo: I used the old match macro X86_MATCH_INTEL_FAM6_MODEL()
  instead of X86_MATCH_VFM() as in the upstream commit.
  I also kept the ALDERLAKE_N name instead of ATOM_GRACEMONT. Both refer
  to the same CPU model. ]
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:56 +02:00
Sumeet Pawnikar
ee8adcb4c0 powercap: RAPL: fix invalid initialization for pl4_supported field
commit d05b5e0baf upstream.

The current initialization of the struct x86_cpu_id via
pl4_support_ids[] is partial and wrong. It is initializing
"stepping" field with "X86_FEATURE_ANY" instead of "feature" field.

Use X86_MATCH_INTEL_FAM6_MODEL macro instead of initializing
each field of the struct x86_cpu_id for pl4_supported list of CPUs.
This X86_MATCH_INTEL_FAM6_MODEL macro internally uses another macro
X86_MATCH_VENDOR_FAM_MODEL_FEATURE for X86 based CPU matching with
appropriate initialized values.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lore.kernel.org/lkml/28ead36b-2d9e-1a36-6f4e-04684e420260@intel.com
Fixes: eb52bc2ae5 ("powercap: RAPL: Add Power Limit4 support for Meteor Lake SoC")
Fixes: b08b95cf30 ("powercap: RAPL: Add Power Limit4 support for Alder Lake-N and Raptor Lake-P")
Fixes: 5157559069 ("powercap: RAPL: Add Power Limit4 support for RaptorLake")
Fixes: 1cc5b9a411 ("powercap: Add Power Limit4 support for Alder Lake SoC")
Fixes: 8365a898fe ("powercap: Add Power Limit4 support")
Signed-off-by: Sumeet Pawnikar <sumeet.r.pawnikar@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
[ Ricardo: I removed METEORLAKE and METEORLAKE_L from pl4_support_ids as
  they are not included in v6.1. ]
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:56 +02:00
Filipe Manana
563df8b411 btrfs: calculate the right space for delayed refs when updating global reserve
commit f8f210dc84 upstream.

When updating the global block reserve, we account for the 6 items needed
by an unlink operation and the 6 delayed references for each one of those
items. However the calculation for the delayed references is not correct
in case we have the free space tree enabled, as in that case we need to
touch the free space tree as well and therefore need twice the number of
bytes. So use the btrfs_calc_delayed_ref_bytes() helper to calculate the
number of bytes need for the delayed references at
btrfs_update_global_block_rsv().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Diogo: this patch has been cherry-picked from the original commit;
conflicts included lack of a define (picked from commit 5630e2bcfe)
and lack of btrfs_calc_delayed_ref_bytes (picked from commit 0e55a54502)
- changed const struct -> struct for compatibility.]
Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:55 +02:00
Matthieu Baerts (NGI0)
2626cbee1f selftests: mptcp: join: restrict fullmesh endp on 1st sf
commit 49ac6f05ace5bb0070c68a0193aa05d3c25d4c83 upstream.

A new endpoint using the IP of the initial subflow has been recently
added to increase the code coverage. But it breaks the test when using
old kernels not having commit 86e39e0448 ("mptcp: keep track of local
endpoint still available for each msk"), e.g. on v5.15.

Similar to commit d4c81bbb86 ("selftests: mptcp: join: support local
endpoint being tracked or not"), it is possible to add the new endpoint
conditionally, by checking if "mptcp_pm_subflow_check_next" is present
in kallsyms: this is not directly linked to the commit introducing this
symbol but for the parent one which is linked anyway. So we can know in
advance what will be the expected behaviour, and add the new endpoint
only when it makes sense to do so.

Fixes: 4878f9f8421f ("selftests: mptcp: join: validate fullmesh endp on 1st sf")
Cc: stable@vger.kernel.org
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20240910-net-selftests-mptcp-fix-install-v1-1-8f124aa9156d@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[ Conflicts in mptcp_join.sh, because the 'run_tests' helper has been
  modified in multiple commits that are not in this version, e.g. commit
  e571fb09c8 ("selftests: mptcp: add speed env var"). The conflict was
  in the context, the new lines can still be added at the same place. ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:55 +02:00
Marc Kleine-Budde
0ba8b599c3 can: mcp251xfd: move mcp251xfd_timestamp_start()/stop() into mcp251xfd_chip_start/stop()
commit a7801540f325d104de5065850a003f1d9bdc6ad3 upstream.

The mcp251xfd wakes up from Low Power or Sleep Mode when SPI activity
is detected. To avoid this, make sure that the timestamp worker is
stopped before shutting down the chip.

Split the starting of the timestamp worker out of
mcp251xfd_timestamp_init() into the separate function
mcp251xfd_timestamp_start().

Call mcp251xfd_timestamp_init() before mcp251xfd_chip_start(), move
mcp251xfd_timestamp_start() to mcp251xfd_chip_start(). In this way,
mcp251xfd_timestamp_stop() can be called unconditionally by
mcp251xfd_chip_stop().

Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:55 +02:00
Marc Kleine-Budde
88047c4b2d can: mcp251xfd: properly indent labels
commit 51b2a721612236335ddec4f3fb5f59e72a204f3a upstream.

To fix the coding style, remove the whitespace in front of labels.

Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:55 +02:00
Hagar Hemdan
672c19165f gpio: prevent potential speculation leaks in gpio_device_get_desc()
commit d795848ecce24a75dfd46481aee066ae6fe39775 upstream.

Userspace may trigger a speculative read of an address outside the gpio
descriptor array.
Users can do that by calling gpio_ioctl() with an offset out of range.
Offset is copied from user and then used as an array index to get
the gpio descriptor without sanitization in gpio_device_get_desc().

This change ensures that the offset is sanitized by using
array_index_nospec() to mitigate any possibility of speculative
information leaks.

This bug was discovered and resolved using Coverity Static Analysis
Security Testing (SAST) by Synopsys, Inc.

Signed-off-by: Hagar Hemdan <hagarhem@amazon.com>
Link: https://lore.kernel.org/r/20240523085332.1801-1-hagarhem@amazon.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Hugo SIMELIERE <hsimeliere.opensource@witekio.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:55 +02:00
Kent Gibson
5c3a421c1f gpiolib: cdev: Ignore reconfiguration without direction
commit b440396387418fe2feaacd41ca16080e7a8bc9ad upstream.

linereq_set_config() behaves badly when direction is not set.
The configuration validation is borrowed from linereq_create(), where,
to verify the intent of the user, the direction must be set to in order to
effect a change to the electrical configuration of a line. But, when
applied to reconfiguration, that validation does not allow for the unset
direction case, making it possible to clear flags set previously without
specifying the line direction.

Adding to the inconsistency, those changes are not immediately applied by
linereq_set_config(), but will take effect when the line value is next get
or set.

For example, by requesting a configuration with no flags set, an output
line with GPIO_V2_LINE_FLAG_ACTIVE_LOW and GPIO_V2_LINE_FLAG_OPEN_DRAIN
set could have those flags cleared, inverting the sense of the line and
changing the line drive to push-pull on the next line value set.

Skip the reconfiguration of lines for which the direction is not set, and
only reconfigure the lines for which direction is set.

Fixes: a54756cb24 ("gpiolib: cdev: support GPIO_V2_LINE_SET_CONFIG_IOCTL")
Signed-off-by: Kent Gibson <warthog618@gmail.com>
Link: https://lore.kernel.org/r/20240626052925.174272-3-warthog618@gmail.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:55 +02:00
Ping-Ke Shih
e388656a85 Revert "wifi: cfg80211: check wiphy mutex is held for wdev mutex"
This reverts commit 19d13ec00a which is
commmit 1474bc87fe57deac726cc10203f73daa6c3212f7 upstream.

The reverted commit is based on implementation of wiphy locking that isn't
planned to redo on a stable kernel, so revert it to avoid warning:

 WARNING: CPU: 0 PID: 9 at net/wireless/core.h:231 disconnect_work+0xb8/0x144 [cfg80211]
 CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.51-00141-ga1649b6f8ed6 #7
 Hardware name: Freescale i.MX6 SoloX (Device Tree)
 Workqueue: events disconnect_work [cfg80211]
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x58/0x70
  dump_stack_lvl from __warn+0x70/0x1c0
  __warn from warn_slowpath_fmt+0x16c/0x294
  warn_slowpath_fmt from disconnect_work+0xb8/0x144 [cfg80211]
  disconnect_work [cfg80211] from process_one_work+0x204/0x620
  process_one_work from worker_thread+0x1b0/0x474
  worker_thread from kthread+0x10c/0x12c
  kthread from ret_from_fork+0x14/0x24

Reported-by: petter@technux.se
Closes: https://lore.kernel.org/linux-wireless/9e98937d781c990615ef27ee0c858ff9@technux.se/T/#t
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Ping-Ke Shih <pkshih@realtek.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:55 +02:00
Pablo Neira Ayuso
ddeead4761 netfilter: nf_tables: missing iterator type in lookup walk
commit efefd4f00c967d00ad7abe092554ffbb70c1a793 upstream.

Add missing decorator type to lookup expression and tighten WARN_ON_ONCE
check in pipapo to spot earlier that this is unset.

Fixes: 29b359cf6d95 ("netfilter: nft_set_pipapo: walk over current view on netlink dump")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:54 +02:00
Pablo Neira Ayuso
52735a010f netfilter: nft_set_pipapo: walk over current view on netlink dump
commit 29b359cf6d95fd60730533f7f10464e95bd17c73 upstream.

The generation mask can be updated while netlink dump is in progress.
The pipapo set backend walk iterator cannot rely on it to infer what
view of the datastructure is to be used. Add notation to specify if user
wants to read/update the set.

Based on patch from Florian Westphal.

Fixes: 2b84e215f8 ("netfilter: nft_set_pipapo: .walk does not deal with generations")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:54 +02:00
Dan Carpenter
8a64f87e74 netfilter: nft_socket: Fix a NULL vs IS_ERR() bug in nft_socket_cgroup_subtree_level()
commit 7052622fccb1efb850c6b55de477f65d03525a30 upstream.

The cgroup_get_from_path() function never returns NULL, it returns error
pointers.  Update the error handling to match.

Fixes: 7f3287db6543 ("netfilter: nft_socket: make cgroupsv2 matching work with namespaces")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://patch.msgid.link/bbc0c4e0-05cc-4f44-8797-2f4b3920a820@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:54 +02:00
Florian Westphal
ace0db36b4 netfilter: nft_socket: make cgroupsv2 matching work with namespaces
commit 7f3287db654395f9c5ddd246325ff7889f550286 upstream.

When running in container environmment, /sys/fs/cgroup/ might not be
the real root node of the sk-attached cgroup.

Example:

In container:
% stat /sys//fs/cgroup/
Device: 0,21    Inode: 2214  ..
% stat /sys/fs/cgroup/foo
Device: 0,21    Inode: 2264  ..

The expectation would be for:

  nft add rule .. socket cgroupv2 level 1 "foo" counter

to match traffic from a process that got added to "foo" via
"echo $pid > /sys/fs/cgroup/foo/cgroup.procs".

However, 'level 3' is needed to make this work.

Seen from initial namespace, the complete hierarchy is:

% stat /sys/fs/cgroup/system.slice/docker-.../foo
  Device: 0,21    Inode: 2264 ..

i.e. hierarchy is
0    1               2              3
/ -> system.slice -> docker-1... -> foo

... but the container doesn't know that its "/" is the "docker-1.."
cgroup.  Current code will retrieve the 'system.slice' cgroup node
and store its kn->id in the destination register, so compare with
2264 ("foo" cgroup id) will not match.

Fetch "/" cgroup from ->init() and add its level to the level we try to
extract.  cgroup root-level is 0 for the init-namespace or the level
of the ancestor that is exposed as the cgroup root inside the container.

In the above case, cgrp->level of "/" resolved in the container is 2
(docker-1...scope/) and request for 'level 1' will get adjusted
to fetch the actual level (3).

v2: use CONFIG_SOCK_CGROUP_DATA, eval function depends on it.
    (kernel test robot)

Cc: cgroups@vger.kernel.org
Fixes: e0bb96db96 ("netfilter: nft_socket: add support for cgroupsv2")
Reported-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:54 +02:00
Dave Chinner
5899daf1d8 xfs: journal geometry is not properly bounds checked
[ Upstream commit f1e1765aad ]

If the journal geometry results in a sector or log stripe unit
validation problem, it indicates that we cannot set the log up to
safely write to the the journal. In these cases, we must abort the
mount because the corruption needs external intervention to resolve.
Similarly, a journal that is too large cannot be written to safely,
either, so we shouldn't allow those geometries to mount, either.

If the log is too small, we risk having transaction reservations
overruning the available log space and the system hanging waiting
for space it can never provide. This is purely a runtime hang issue,
not a corruption issue as per the first cases listed above. We abort
mounts of the log is too small for V5 filesystems, but we must allow
v4 filesystems to mount because, historically, there was no log size
validity checking and so some systems may still be out there with
undersized logs.

The problem is that on V4 filesystems, when we discover a log
geometry problem, we skip all the remaining checks and then allow
the log to continue mounting. This mean that if one of the log size
checks fails, we skip the log stripe unit check. i.e. we allow the
mount because a "non-fatal" geometry is violated, and then fail to
check the hard fail geometries that should fail the mount.

Move all these fatal checks to the superblock verifier, and add a
new check for the two log sector size geometry variables having the
same values. This will prevent any attempt to mount a log that has
invalid or inconsistent geometries long before we attempt to mount
the log.

However, for the minimum log size checks, we can only do that once
we've setup up the log and calculated all the iclog sizes and
roundoffs. Hence this needs to remain in the log mount code after
the log has been initialised. It is also the only case where we
should allow a v4 filesystem to continue running, so leave that
handling in place, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:54 +02:00
Darrick J. Wong
68e6efe0d4 xfs: set bnobt/cntbt numrecs correctly when formatting new AGs
[ Upstream commit 8e698ee72c ]

Through generic/300, I discovered that mkfs.xfs creates corrupt
filesystems when given these parameters:

Filesystems formatted with --unsupported are not supported!!
meta-data=/dev/sda               isize=512    agcount=8, agsize=16352 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=130816, imaxpct=25
         =                       sunit=32     swidth=128 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=8192, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 blks
Discarding blocks...Done.
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - 16:30:50: zeroing log - 16320 of 16320 blocks done
        - scan filesystem freespace and inode maps...
agf_freeblks 25, counted 0 in ag 4
sb_fdblocks 8823, counted 8798

The root cause of this problem is the numrecs handling in
xfs_freesp_init_recs, which is used to initialize a new AG.  Prior to
calling the function, we set up the new bnobt block with numrecs == 1
and rely on _freesp_init_recs to format that new record.  If the last
record created has a blockcount of zero, then it sets numrecs = 0.

That last bit isn't correct if the AG contains the log, the start of the
log is not immediately after the initial blocks due to stripe alignment,
and the end of the log is perfectly aligned with the end of the AG.  For
this case, we actually formatted a single bnobt record to handle the
free space before the start of the (stripe aligned) log, and incremented
arec to try to format a second record.  That second record turned out to
be unnecessary, so what we really want is to leave numrecs at 1.

The numrecs handling itself is overly complicated because a different
function sets numrecs == 1.  Change the bnobt creation code to start
with numrecs set to zero and only increment it after successfully
formatting a free space extent into the btree block.

Fixes: f327a00745 ("xfs: account for log space when formatting new AGs")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:54 +02:00
Darrick J. Wong
af871df651 xfs: fix reloading entire unlinked bucket lists
[ Upstream commit 537c013b14 ]

During review of the patcheset that provided reloading of the incore
iunlink list, Dave made a few suggestions, and I updated the copy in my
dev tree.  Unfortunately, I then got distracted by ... who even knows
what ... and forgot to backport those changes from my dev tree to my
release candidate branch.  I then sent multiple pull requests with stale
patches, and that's what was merged into -rc3.

So.

This patch re-adds the use of an unlocked iunlink list check to
determine if we want to allocate the resources to recreate the incore
list.  Since lost iunlinked inodes are supposed to be rare, this change
helps us avoid paying the transaction and AGF locking costs every time
we open any inode.

This also re-adds the shutdowns on failure, and re-applies the
restructuring of the inner loop in xfs_inode_reload_unlinked_bucket, and
re-adds a requested comment about the quotachecking code.

Retain the original RVB tag from Dave since there's no code change from
the last submission.

Fixes: 68b957f64f ("xfs: load uncached unlinked inodes into memory on demand")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:54 +02:00
Darrick J. Wong
62ca591045 xfs: make inode unlinked bucket recovery work with quotacheck
[ Upstream commit 49813a21ed ]

Teach quotacheck to reload the unlinked inode lists when walking the
inode table.  This requires extra state handling, since it's possible
that a reloaded inode will get inactivated before quotacheck tries to
scan it; in this case, we need to ensure that the reloaded inode does
not have dquots attached when it is freed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:54 +02:00
Darrick J. Wong
e9d1551f80 xfs: reload entire unlinked bucket lists
[ Upstream commit 83771c50e4 ]

The previous patch to reload unrecovered unlinked inodes when adding a
newly created inode to the unlinked list is missing a key piece of
functionality.  It doesn't handle the case that someone calls xfs_iget
on an inode that is not the last item in the incore list.  For example,
if at mount time the ondisk iunlink bucket looks like this:

AGI -> 7 -> 22 -> 3 -> NULL

None of these three inodes are cached in memory.  Now let's say that
someone tries to open inode 3 by handle.  We need to walk the list to
make sure that inodes 7 and 22 get loaded cold, and that the
i_prev_unlinked of inode 3 gets set to 22.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:53 +02:00
Darrick J. Wong
8ffd3ae7a0 xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
[ Upstream commit f12b96683d ]

Alter the definition of i_prev_unlinked slightly to make it more obvious
when an inode with 0 link count is not part of the iunlink bucket lists
rooted in the AGI.  This distinction is necessary because it is not
sufficient to check inode.i_nlink to decide if an inode is on the
unlinked list.  Updates to i_nlink can happen while holding only
ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list
(which happen after the nlink update) requires both ILOCK_EXCL and the
AGI buffer lock.

The next few patches will make it possible to reload an entire unlinked
bucket list when we're walking the inode table or performing handle
operations and need more than the ability to iget the last inode in the
chain.

The upcoming directory repair code also needs to be able to make this
distinction to decide if a zero link count directory should be moved to
the orphanage or allowed to inactivate.  An upcoming enhancement to the
online AGI fsck code will need this distinction to check and rebuild the
AGI unlinked buckets.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:53 +02:00
Shiyang Ruan
8e2147f37f xfs: correct calculation for agend and blockcount
[ Upstream commit 3c90c01e49 ]

The agend should be "start + length - 1", then, blockcount should be
"end + 1 - start".  Correct 2 calculation mistakes.

Also, rename "agend" to "range_agend" because it's not the end of the AG
per se; it's the end of the dead region within an AG's agblock space.

Fixes: 5cf32f63b0 ("xfs: fix the calculation for "end" and "length"")
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:53 +02:00
Dave Chinner
d931b6c6a9 xfs: fix unlink vs cluster buffer instantiation race
[ Upstream commit 348a1983cf4cf5099fc398438a968443af4c9f65 ]

Luis has been reporting an assert failure when freeing an inode
cluster during inode inactivation for a while. The assert looks
like:

 XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
 ------------[ cut here ]------------
 kernel BUG at fs/xfs/xfs_message.c:102!
 Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
 CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
 RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
 RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
 RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
 RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
 R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
 R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
 FS:  0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  <TASK>
 xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
 xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
 xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
 xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
 __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
 xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
 xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
 xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
 process_one_work (kernel/workqueue.c:3231)
 worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
 kthread (kernel/kthread.c:389)
 ret_from_fork (arch/x86/kernel/process.c:147)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
  </TASK>

And occurs when the the inode precommit handlers is attempt to look
up the inode cluster buffer to attach the inode for writeback.

The trail of logic that I can reconstruct is as follows.

	1. the inode is clean when inodegc runs, so it is not
	   attached to a cluster buffer when precommit runs.

	2. #1 implies the inode cluster buffer may be clean and not
	   pinned by dirty inodes when inodegc runs.

	3. #2 implies that the inode cluster buffer can be reclaimed
	   by memory pressure at any time.

	4. The assert failure implies that the cluster buffer was
	   attached to the transaction, but not marked done. It had
	   been accessed earlier in the transaction, but not marked
	   done.

	5. #4 implies the cluster buffer has been invalidated (i.e.
	   marked stale).

	6. #5 implies that the inode cluster buffer was instantiated
	   uninitialised in the transaction in xfs_ifree_cluster(),
	   which only instantiates the buffers to invalidate them
	   and never marks them as done.

Given factors 1-3, this issue is highly dependent on timing and
environmental factors. Hence the issue can be very difficult to
reproduce in some situations, but highly reliable in others. Luis
has an environment where it can be reproduced easily by g/531 but,
OTOH, I've reproduced it only once in ~2000 cycles of g/531.

I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
on the cluster buffers, even though they may not be initialised. The
reasons why I think this is safe are:

	1. A buffer cache lookup hit on a XBF_STALE buffer will
	   clear the XBF_DONE flag. Hence all future users of the
	   buffer know they have to re-initialise the contents
	   before use and mark it done themselves.

	2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
	   means the buffer remains locked until the journal commit
	   completes and the buffer is unpinned. Hence once marked
	   XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
	   context that can access the freed buffer is the currently
	   running transaction.

	3. #2 implies that future buffer lookups in the currently
	   running transaction will hit the transaction match code
	   and not the buffer cache. Hence XBF_STALE and
	   XFS_BLI_STALE will not be cleared unless the transaction
	   initialises and logs the buffer with valid contents
	   again. At which point, the buffer will be marked marked
	   XBF_DONE again, so having XBF_DONE already set on the
	   stale buffer is a moot point.

	4. #2 also implies that any concurrent access to that
	   cluster buffer will block waiting on the buffer lock
	   until the inode cluster has been fully freed and is no
	   longer an active inode cluster buffer.

	5. #4 + #1 means that any future user of the disk range of
	   that buffer will always see the range of disk blocks
	   covered by the cluster buffer as not done, and hence must
	   initialise the contents themselves.

	6. Setting XBF_DONE in xfs_ifree_cluster() then means the
	   unlinked inode precommit code will see a XBF_DONE buffer
	   from the transaction match as it expects. It can then
	   attach the stale but newly dirtied inode to the stale
	   but newly dirtied cluster buffer without unexpected
	   failures. The stale buffer will then sail through the
	   journal and do the right thing with the attached stale
	   inode during unpin.

Hence the fix is just one line of extra code. The explanation of
why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
complex....

Fixes: 82842fee6e ("xfs: fix AGF vs inode cluster buffer deadlock")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:53 +02:00
Darrick J. Wong
1486aeb788 xfs: fix negative array access in xfs_getbmap
[ Upstream commit 1bba82fe1a ]

In commit 8ee81ed581, Ye Bin complained about an ASSERT in the bmapx
code that trips if we encounter a delalloc extent after flushing the
pagecache to disk.  The ioctl code does not hold MMAPLOCK so it's
entirely possible that a racing write page fault can create a delalloc
extent after the file has been flushed.  The proposed solution was to
replace the assertion with an early return that avoids filling out the
bmap recordset with a delalloc entry if the caller didn't ask for it.

At the time, I recall thinking that the forward logic sounded ok, but
felt hesitant because I suspected that changing this code would cause
something /else/ to burst loose due to some other subtlety.

syzbot of course found that subtlety.  If all the extent mappings found
after the flush are delalloc mappings, we'll reach the end of the data
fork without ever incrementing bmv->bmv_entries.  This is new, since
before we'd have emitted the delalloc mappings even though the caller
didn't ask for them.  Once we reach the end, we'll try to set
BMV_OF_LAST on the -1st entry (because bmv_entries is zero) and go
corrupt something else in memory.  Yay.

I really dislike all these stupid patches that fiddle around with debug
code and break things that otherwise worked well enough.  Nobody was
complaining that calling XFS_IOC_BMAPX without BMV_IF_DELALLOC would
return BMV_OF_DELALLOC records, and now we've gone from "weird behavior
that nobody cared about" to "bad behavior that must be addressed
immediately".

Maybe I'll just ignore anything from Huawei from now on for my own sake.

Reported-by: syzbot+c103d3808a0de5faaf80@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-xfs/20230412024907.GP360889@frogsfrogsfrogs/
Fixes: 8ee81ed581 ("xfs: fix BUG_ON in xfs_getbmap()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:53 +02:00
Darrick J. Wong
4790c167cc xfs: load uncached unlinked inodes into memory on demand
[ Upstream commit 68b957f64f ]

shrikanth hegde reports that filesystems fail shortly after mount with
the following failure:

	WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]

This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:

	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }

>From diagnostic data collected by the bug reporters, it would appear
that we cleanly mounted a filesystem that contained unlinked inodes.
Unlinked inodes are only processed as a final step of log recovery,
which means that clean mounts do not process the unlinked list at all.

Prior to the introduction of the incore unlinked lists, this wasn't a
problem because the unlink code would (very expensively) traverse the
entire ondisk metadata iunlink chain to keep things up to date.
However, the incore unlinked list code complains when it realizes that
it is out of sync with the ondisk metadata and shuts down the fs, which
is bad.

Ritesh proposed to solve this problem by unconditionally parsing the
unlinked lists at mount time, but this imposes a mount time cost for
every filesystem to catch something that should be very infrequent.
Instead, let's target the places where we can encounter a next_unlinked
pointer that refers to an inode that is not in cache, and load it into
cache.

Note: This patch does not address the problem of iget loading an inode
from the middle of the iunlink list and needing to set i_prev_unlinked
correctly.

Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com>
Triaged-by: Ritesh Harjani <ritesh.list@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:53 +02:00
Shiyang Ruan
0cc1922687 xfs: fix the calculation for "end" and "length"
[ Upstream commit 5cf32f63b0 ]

The value of "end" should be "start + length - 1".

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:53 +02:00
Dave Chinner
4427e3d362 xfs: remove WARN when dquot cache insertion fails
[ Upstream commit 4b827b3f30 ]

It just creates unnecessary bot noise these days.

Reported-by: syzbot+6ae213503fb12e87934f@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:52 +02:00
Long Li
e8c6533404 xfs: fix ag count overflow during growfs
[ Upstream commit c3b880acad ]

I found a corruption during growfs:

 XFS (loop0): Internal error agbno >= mp->m_sb.sb_agblocks at line 3661 of
   file fs/xfs/libxfs/xfs_alloc.c.  Caller __xfs_free_extent+0x28e/0x3c0
 CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
 Call Trace:
  <TASK>
  dump_stack_lvl+0x50/0x70
  xfs_corruption_error+0x134/0x150
  __xfs_free_extent+0x2c1/0x3c0
  xfs_ag_extend_space+0x291/0x3e0
  xfs_growfs_data+0xd72/0xe90
  xfs_file_ioctl+0x5f9/0x14a0
  __x64_sys_ioctl+0x13e/0x1c0
  do_syscall_64+0x39/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 XFS (loop0): Corruption detected. Unmount and run xfs_repair
 XFS (loop0): Internal error xfs_trans_cancel at line 1097 of file
   fs/xfs/xfs_trans.c.  Caller xfs_growfs_data+0x691/0xe90
 CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
 Call Trace:
  <TASK>
  dump_stack_lvl+0x50/0x70
  xfs_error_report+0x93/0xc0
  xfs_trans_cancel+0x2c0/0x350
  xfs_growfs_data+0x691/0xe90
  xfs_file_ioctl+0x5f9/0x14a0
  __x64_sys_ioctl+0x13e/0x1c0
  do_syscall_64+0x39/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7f2d86706577

The bug can be reproduced with the following sequence:

 # truncate -s  1073741824 xfs_test.img
 # mkfs.xfs -f -b size=1024 -d agcount=4 xfs_test.img
 # truncate -s 2305843009213693952  xfs_test.img
 # mount -o loop xfs_test.img /mnt/test
 # xfs_growfs -D  1125899907891200  /mnt/test

The root cause is that during growfs, user space passed in a large value
of newblcoks to xfs_growfs_data_private(), due to current sb_agblocks is
too small, new AG count will exceed UINT_MAX. Because of AG number type
is unsigned int and it would overflow, that caused nagcount much smaller
than the actual value. During AG extent space, delta blocks in
xfs_resizefs_init_new_ags() will much larger than the actual value due to
incorrect nagcount, even exceed UINT_MAX. This will cause corruption and
be detected in __xfs_free_extent. Fix it by growing the filesystem to up
to the maximally allowed AGs and not return EINVAL when new AG count
overflow.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:52 +02:00
Dave Chinner
02f44e7ff6 xfs: collect errors from inodegc for unlinked inode recovery
[ Upstream commit d4d12c02bf ]

Unlinked list recovery requires errors removing the inode the from
the unlinked list get fed back to the main recovery loop. Now that
we offload the unlinking to the inodegc work, we don't get errors
being fed back when we trip over a corruption that prevents the
inode from being removed from the unlinked list.

This means we never clear the corrupt unlinked list bucket,
resulting in runtime operations eventually tripping over it and
shutting down.

Fix this by collecting inodegc worker errors and feed them
back to the flush caller. This is largely best effort - the only
context that really cares is log recovery, and it only flushes a
single inode at a time so we don't need complex synchronised
handling. Essentially the inodegc workers will capture the first
error that occurs and the next flush will gather them and clear
them. The flush itself will only report the first gathered error.

In the cases where callers can return errors, propagate the
collected inodegc flush error up the error handling chain.

In the case of inode unlinked list recovery, there are several
superfluous calls to flush queued unlinked inodes -
xlog_recover_iunlink_bucket() guarantees that it has flushed the
inodegc and collected errors before it returns. Hence nothing in the
calling path needs to run a flush, even when an error is returned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:52 +02:00
Dave Chinner
65fc94fc87 xfs: fix AGF vs inode cluster buffer deadlock
[ Upstream commit 82842fee6e ]

Lock order in XFS is AGI -> AGF, hence for operations involving
inode unlinked list operations we always lock the AGI first. Inode
unlinked list operations operate on the inode cluster buffer,
so the lock order there is AGI -> inode cluster buffer.

For O_TMPFILE operations, this now means the lock order set down in
xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
unlinked ops are done before the directory modifications that may
allocate space and lock the AGF.

Unfortunately, we also now lock the inode cluster buffer when
logging an inode so that we can attach the inode to the cluster
buffer and pin it in memory. This creates a lock order of AGF ->
inode cluster buffer in directory operations as we have to log the
inode after we've allocated new space for it.

This creates a lock inversion between the AGF and the inode cluster
buffer. Because the inode cluster buffer is shared across multiple
inodes, the inversion is not specific to individual inodes but can
occur when inodes in the same cluster buffer are accessed in
different orders.

To fix this we need move all the inode log item cluster buffer
interactions to the end of the current transaction. Unfortunately,
xfs_trans_log_inode() calls are littered throughout the transactions
with no thought to ordering against other items or locking. This
makes it difficult to do anything that involves changing the call
sites of xfs_trans_log_inode() to change locking orders.

However, we do now have a mechanism that allows is to postpone dirty
item processing to just before we commit the transaction: the
->iop_precommit method. This will be called after all the
modifications are done and high level objects like AGI and AGF
buffers have been locked and modified, thereby providing a mechanism
that guarantees we don't lock the inode cluster buffer before those
high level objects are locked.

This change is largely moving the guts of xfs_trans_log_inode() to
xfs_inode_item_precommit() and providing an extra flag context in
the inode log item to track the dirty state of the inode in the
current transaction. This also means we do a lot less repeated work
in xfs_trans_log_inode() by only doing it once per transaction when
all the work is done.

Fixes: 298f7bec50 ("xfs: pin inode backing buffer to the inode log item")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:52 +02:00
Dave Chinner
b4aea9f9e0 xfs: defered work could create precommits
[ Upstream commit cb04211748 ]

To fix a AGI-AGF-inode cluster buffer deadlock, we need to move
inode cluster buffer operations to the ->iop_precommit() method.
However, this means that deferred operations can require precommits
to be run on the final transaction that the deferred ops pass back
to xfs_trans_commit() context. This will be exposed by attribute
handling, in that the last changes to the inode in the attr set
state machine "disappear" because the precommit operation is not run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:52 +02:00
Dave Chinner
8127489103 xfs: buffer pins need to hold a buffer reference
[ Upstream commit 89a4bf0dc3 ]

When a buffer is unpinned by xfs_buf_item_unpin(), we need to access
the buffer after we've dropped the buffer log item reference count.
This opens a window where we can have two racing unpins for the
buffer item (e.g. shutdown checkpoint context callback processing
racing with journal IO iclog completion processing) and both attempt
to access the buffer after dropping the BLI reference count.  If we
are unlucky, the "BLI freed" context wins the race and frees the
buffer before the "BLI still active" case checks the buffer pin
count.

This results in a use after free that can only be triggered
in active filesystem shutdown situations.

To fix this, we need to ensure that buffer existence extends beyond
the BLI reference count checks and until the unpin processing is
complete. This implies that a buffer pin operation must also take a
buffer reference to ensure that the buffer cannot be freed until the
buffer unpin processing is complete.

Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:52 +02:00
Ye Bin
cbf91ddb88 xfs: fix BUG_ON in xfs_getbmap()
[ Upstream commit 8ee81ed581 ]

There's issue as follows:
XFS: Assertion failed: (bmv->bmv_iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c, line: 329
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 14612 Comm: xfs_io Not tainted 6.3.0-rc2-next-20230315-00006-g2729d23ddb3b-dirty #422
RIP: 0010:assfail+0x96/0xa0
RSP: 0018:ffffc9000fa178c0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff888179a18000
RDX: 0000000000000000 RSI: ffff888179a18000 RDI: 0000000000000002
RBP: 0000000000000000 R08: ffffffff8321aab6 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed1105f85139 R12: ffffffff8aacc4c0
R13: 0000000000000149 R14: ffff888269f58000 R15: 000000000000000c
FS:  00007f42f27a4740(0000) GS:ffff88882fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000b92388 CR3: 000000024f006000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 xfs_getbmap+0x1a5b/0x1e40
 xfs_ioc_getbmap+0x1fd/0x5b0
 xfs_file_ioctl+0x2cb/0x1d50
 __x64_sys_ioctl+0x197/0x210
 do_syscall_64+0x39/0xb0
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Above issue may happen as follows:
         ThreadA                       ThreadB
do_shared_fault
 __do_fault
  xfs_filemap_fault
   __xfs_filemap_fault
    filemap_fault
                             xfs_ioc_getbmap -> Without BMV_IF_DELALLOC flag
			      xfs_getbmap
			       xfs_ilock(ip, XFS_IOLOCK_SHARED);
			       filemap_write_and_wait
 do_page_mkwrite
  xfs_filemap_page_mkwrite
   __xfs_filemap_fault
    xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
    iomap_page_mkwrite
     ...
     xfs_buffered_write_iomap_begin
      xfs_bmapi_reserve_delalloc -> Allocate delay extent
                              xfs_ilock_data_map_shared(ip)
	                      xfs_getbmap_report_one
			       ASSERT((bmv->bmv_iflags & BMV_IF_DELALLOC) != 0)
	                        -> trigger BUG_ON

As xfs_filemap_page_mkwrite() only hold XFS_MMAPLOCK_SHARED lock, there's
small window mkwrite can produce delay extent after file write in xfs_getbmap().
To solve above issue, just skip delalloc extents.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:52 +02:00
Dave Chinner
fcd6ff906d xfs: quotacheck failure can race with background inode inactivation
[ Upstream commit 0c7273e494 ]

The background inode inactivation can attached dquots to inodes, but
this can race with a foreground quotacheck failure that leads to
disabling quotas and freeing the mp->m_quotainfo structure. The
background inode inactivation then tries to allocate a quota, tries
to dereference mp->m_quotainfo, and crashes like so:

XFS (loop1): Quotacheck: Unsuccessful (Error -5): Disabling quotas.
xfs filesystem being mounted at /root/syzkaller.qCVHXV/0/file0 supports timestamps until 2038 (0x7fffffff)
BUG: kernel NULL pointer dereference, address: 00000000000002a8
....
CPU: 0 PID: 161 Comm: kworker/0:4 Not tainted 6.2.0-c9c3395d5e3d #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: xfs-inodegc/loop1 xfs_inodegc_worker
RIP: 0010:xfs_dquot_alloc+0x95/0x1e0
....
Call Trace:
 <TASK>
 xfs_qm_dqread+0x46/0x440
 xfs_qm_dqget_inode+0x154/0x500
 xfs_qm_dqattach_one+0x142/0x3c0
 xfs_qm_dqattach_locked+0x14a/0x170
 xfs_qm_dqattach+0x52/0x80
 xfs_inactive+0x186/0x340
 xfs_inodegc_worker+0xd3/0x430
 process_one_work+0x3b1/0x960
 worker_thread+0x52/0x660
 kthread+0x161/0x1a0
 ret_from_fork+0x29/0x50
 </TASK>
....

Prevent this race by flushing all the queued background inode
inactivations pending before purging all the cached dquots when
quotacheck fails.

Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:51 +02:00
Darrick J. Wong
120108df92 xfs: fix uninitialized variable access
[ Upstream commit 60b730a40c ]

If the end position of a GETFSMAP query overlaps an allocated space and
we're using the free space info to generate fsmap info, the akeys
information gets fed into the fsmap formatter with bad results.
Zero-init the space.

Reported-by: syzbot+090ae72d552e6bd93cfe@syzkaller.appspotmail.com
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:51 +02:00
Dave Chinner
ce563912b0 xfs: block reservation too large for minleft allocation
[ Upstream commit d5753847b2 ]

When we enter xfs_bmbt_alloc_block() without having first allocated
a data extent (i.e. tp->t_firstblock == NULLFSBLOCK) because we
are doing something like unwritten extent conversion, the transaction
block reservation is used as the minleft value.

This works for operations like unwritten extent conversion, but it
assumes that the block reservation is only for a BMBT split. THis is
not always true, and sometimes results in larger than necessary
minleft values being set. We only actually need enough space for a
btree split, something we already handle correctly in
xfs_bmapi_write() via the xfs_bmapi_minleft() calculation.

We should use xfs_bmapi_minleft() in xfs_bmbt_alloc_block() to
calculate the number of blocks a BMBT split on this inode is going to
require, not use the transaction block reservation that contains the
maximum number of blocks this transaction may consume in it...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:51 +02:00
Dave Chinner
0e3c9d6950 xfs: prefer free inodes at ENOSPC over chunk allocation
[ Upstream commit f08f984c63 ]

When an XFS filesystem has free inodes in chunks already allocated
on disk, it will still allocate new inode chunks if the target AG
has no free inodes in it. Normally, this is a good idea as it
preserves locality of all the inodes in a given directory.

However, at ENOSPC this can lead to using the last few remaining
free filesystem blocks to allocate a new chunk when there are many,
many free inodes that could be allocated without consuming free
space. This results in speeding up the consumption of the last few
blocks and inode create operations then returning ENOSPC when there
free inodes available because we don't have enough block left in the
filesystem for directory creation reservations to proceed.

Hence when we are near ENOSPC, we should be attempting to preserve
the remaining blocks for directory block allocation rather than
using them for unnecessary inode chunk creation.

This particular behaviour is exposed by xfs/294, when it drives to
ENOSPC on empty file creation whilst there are still thousands of
free inodes available for allocation in other AGs in the filesystem.

Hence, when we are within 1% of ENOSPC, change the inode allocation
behaviour to prefer to use existing free inodes over allocating new
inode chunks, even though it results is poorer locality of the data
set. It is more important for the allocations to be space efficient
near ENOSPC than to have optimal locality for performance, so lets
modify the inode AG selection code to reflect that fact.

This allows generic/294 to not only pass with this allocator rework
patchset, but to increase the number of post-ENOSPC empty inode
allocations to from ~600 to ~9080 before we hit ENOSPC on the
directory create transaction reservation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:51 +02:00
Dave Chinner
bb798c9128 xfs: fix low space alloc deadlock
[ Upstream commit 1dd0510f6d ]

I've recently encountered an ABBA deadlock with g/476. The upcoming
changes seem to make this much easier to hit, but the underlying
problem is a pre-existing one.

Essentially, if we select an AG for allocation, then lock the AGF
and then fail to allocate for some reason (e.g. minimum length
requirements cannot be satisfied), then we drop out of the
allocation with the AGF still locked.

The caller then modifies the allocation constraints - usually
loosening them up - and tries again. This can result in trying to
access AGFs that are lower than the AGF we already have locked from
the failed attempt. e.g. the failed attempt skipped several AGs
before failing, so we have locks an AG higher than the start AG.
Retrying the allocation from the start AG then causes us to violate
AGF lock ordering and this can lead to deadlocks.

The deadlock exists even if allocation succeeds - we can do a
followup allocations in the same transaction for BMBT blocks that
aren't guaranteed to be in the same AG as the original, and can move
into higher AGs. Hence we really need to move the tp->t_firstblock
tracking down into xfs_alloc_vextent() where it can be set when we
exit with a locked AG.

xfs_alloc_vextent() can also check there if the requested
allocation falls within the allow range of AGs set by
tp->t_firstblock. If we can't allocate within the range set, we have
to fail the allocation. If we are allowed to to non-blocking AGF
locking, we can ignore the AG locking order limitations as we can
use try-locks for the first iteration over requested AG range.

This invalidates a set of post allocation asserts that check that
the allocation is always above tp->t_firstblock if it is set.
Because we can use try-locks to avoid the deadlock in some
circumstances, having a pre-existing locked AGF doesn't always
prevent allocation from lower order AGFs. Hence those ASSERTs need
to be removed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:51 +02:00
Dave Chinner
cdbc02da9f xfs: don't use BMBT btree split workers for IO completion
[ Upstream commit c85007e2e3 ]

When we split a BMBT due to record insertion, we offload it to a
worker thread because we can be deep in the stack when we try to
allocate a new block for the BMBT. Allocation can use several
kilobytes of stack (full memory reclaim, swap and/or IO path can
end up on the stack during allocation) and we can already be several
kilobytes deep in the stack when we need to split the BMBT.

A recent workload demonstrated a deadlock in this BMBT split
offload. It requires several things to happen at once:

1. two inodes need a BMBT split at the same time, one must be
unwritten extent conversion from IO completion, the other must be
from extent allocation.

2. there must be a no available xfs_alloc_wq worker threads
available in the worker pool.

3. There must be sustained severe memory shortages such that new
kworker threads cannot be allocated to the xfs_alloc_wq pool for
both threads that need split work to be run

4. The split work from the unwritten extent conversion must run
first.

5. when the BMBT block allocation runs from the split work, it must
loop over all AGs and not be able to either trylock an AGF
successfully, or each AGF is is able to lock has no space available
for a single block allocation.

6. The BMBT allocation must then attempt to lock the AGF that the
second task queued to the rescuer thread already has locked before
it finds an AGF it can allocate from.

At this point, we have an ABBA deadlock between tasks queued on the
xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
holding the AGF lock can't be run by the rescuer thread until the
task the rescuer thread is runing gets the AGF lock....

This is a highly improbably series of events, but there it is.

There's a couple of ways to fix this, but the easiest way to ensure
that we only punt tasks with a locked AGF that holds enough space
for the BMBT block allocations to the worker thread.

This works for unwritten extent conversion in IO completion (which
doesn't have a locked AGF and space reservations) because we have
tight control over the IO completion stack. It is typically only 6
functions deep when xfs_btree_split() is called because we've
already offloaded the IO completion work to a worker thread and
hence we don't need to worry about stack overruns here.

The other place we can be called for a BMBT split without a
preceeding allocation is __xfs_bunmapi() when punching out the
center of an existing extent. We don't remove extents in the IO
path, so these operations don't tend to be called with a lot of
stack consumed. Hence we don't really need to ship the split off to
a worker thread in these cases, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:51 +02:00
Wengang Wang
98b8fd60b3 xfs: fix extent busy updating
[ Upstream commit 601a27ea09 ]

In xfs_extent_busy_update_extent() case 6 and 7, whenever bno is modified on
extent busy, the relavent length has to be modified accordingly.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:51 +02:00
Wu Guanghao
b36c2ae02a xfs: Fix deadlock on xfs_inodegc_worker
[ Upstream commit 4da112513c ]

We are doing a test about deleting a large number of files
when memory is low. A deadlock problem was found.

[ 1240.279183] -> #1 (fs_reclaim){+.+.}-{0:0}:
[ 1240.280450]        lock_acquire+0x197/0x460
[ 1240.281548]        fs_reclaim_acquire.part.0+0x20/0x30
[ 1240.282625]        kmem_cache_alloc+0x2b/0x940
[ 1240.283816]        xfs_trans_alloc+0x8a/0x8b0
[ 1240.284757]        xfs_inactive_ifree+0xe4/0x4e0
[ 1240.285935]        xfs_inactive+0x4e9/0x8a0
[ 1240.286836]        xfs_inodegc_worker+0x160/0x5e0
[ 1240.287969]        process_one_work+0xa19/0x16b0
[ 1240.289030]        worker_thread+0x9e/0x1050
[ 1240.290131]        kthread+0x34f/0x460
[ 1240.290999]        ret_from_fork+0x22/0x30
[ 1240.291905]
[ 1240.291905] -> #0 ((work_completion)(&gc->work)){+.+.}-{0:0}:
[ 1240.293569]        check_prev_add+0x160/0x2490
[ 1240.294473]        __lock_acquire+0x2c4d/0x5160
[ 1240.295544]        lock_acquire+0x197/0x460
[ 1240.296403]        __flush_work+0x6bc/0xa20
[ 1240.297522]        xfs_inode_mark_reclaimable+0x6f0/0xdc0
[ 1240.298649]        destroy_inode+0xc6/0x1b0
[ 1240.299677]        dispose_list+0xe1/0x1d0
[ 1240.300567]        prune_icache_sb+0xec/0x150
[ 1240.301794]        super_cache_scan+0x2c9/0x480
[ 1240.302776]        do_shrink_slab+0x3f0/0xaa0
[ 1240.303671]        shrink_slab+0x170/0x660
[ 1240.304601]        shrink_node+0x7f7/0x1df0
[ 1240.305515]        balance_pgdat+0x766/0xf50
[ 1240.306657]        kswapd+0x5bd/0xd20
[ 1240.307551]        kthread+0x34f/0x460
[ 1240.308346]        ret_from_fork+0x22/0x30
[ 1240.309247]
[ 1240.309247] other info that might help us debug this:
[ 1240.309247]
[ 1240.310944]  Possible unsafe locking scenario:
[ 1240.310944]
[ 1240.312379]        CPU0                    CPU1
[ 1240.313363]        ----                    ----
[ 1240.314433]   lock(fs_reclaim);
[ 1240.315107]                                lock((work_completion)(&gc->work));
[ 1240.316828]                                lock(fs_reclaim);
[ 1240.318088]   lock((work_completion)(&gc->work));
[ 1240.319203]
[ 1240.319203]  *** DEADLOCK ***
...
[ 2438.431081] Workqueue: xfs-inodegc/sda xfs_inodegc_worker
[ 2438.432089] Call Trace:
[ 2438.432562]  __schedule+0xa94/0x1d20
[ 2438.435787]  schedule+0xbf/0x270
[ 2438.436397]  schedule_timeout+0x6f8/0x8b0
[ 2438.445126]  wait_for_completion+0x163/0x260
[ 2438.448610]  __flush_work+0x4c4/0xa40
[ 2438.455011]  xfs_inode_mark_reclaimable+0x6ef/0xda0
[ 2438.456695]  destroy_inode+0xc6/0x1b0
[ 2438.457375]  dispose_list+0xe1/0x1d0
[ 2438.458834]  prune_icache_sb+0xe8/0x150
[ 2438.461181]  super_cache_scan+0x2b3/0x470
[ 2438.461950]  do_shrink_slab+0x3cf/0xa50
[ 2438.462687]  shrink_slab+0x17d/0x660
[ 2438.466392]  shrink_node+0x87e/0x1d40
[ 2438.467894]  do_try_to_free_pages+0x364/0x1300
[ 2438.471188]  try_to_free_pages+0x26c/0x5b0
[ 2438.473567]  __alloc_pages_slowpath.constprop.136+0x7aa/0x2100
[ 2438.482577]  __alloc_pages+0x5db/0x710
[ 2438.485231]  alloc_pages+0x100/0x200
[ 2438.485923]  allocate_slab+0x2c0/0x380
[ 2438.486623]  ___slab_alloc+0x41f/0x690
[ 2438.490254]  __slab_alloc+0x54/0x70
[ 2438.491692]  kmem_cache_alloc+0x23e/0x270
[ 2438.492437]  xfs_trans_alloc+0x88/0x880
[ 2438.493168]  xfs_inactive_ifree+0xe2/0x4e0
[ 2438.496419]  xfs_inactive+0x4eb/0x8b0
[ 2438.497123]  xfs_inodegc_worker+0x16b/0x5e0
[ 2438.497918]  process_one_work+0xbf7/0x1a20
[ 2438.500316]  worker_thread+0x8c/0x1060
[ 2438.504938]  ret_from_fork+0x22/0x30

When the memory is insufficient, xfs_inonodegc_worker will trigger memory
reclamation when memory is allocated, then flush_work() may be called to
wait for the work to complete. This causes a deadlock.

So use memalloc_nofs_save() to avoid triggering memory reclamation in
xfs_inodegc_worker.

Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:50 +02:00
Dave Chinner
d2b4752119 xfs: dquot shrinker doesn't check for XFS_DQFLAG_FREEING
[ Upstream commit 52f31ed228 ]

Resulting in a UAF if the shrinker races with some other dquot
freeing mechanism that sets XFS_DQFLAG_FREEING before the dquot is
removed from the LRU. This can occur if a dquot purge races with
drop_caches.

Reported-by: syzbot+912776840162c13db1a3@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-09-30 16:23:50 +02:00
Ferry Meng
cfb926051f ocfs2: strict bound check before memcmp in ocfs2_xattr_find_entry()
[ Upstream commit af77c4fc1871847b528d58b7fdafb4aa1f6a9262 ]

xattr in ocfs2 maybe 'non-indexed', which saved with additional space
requested.  It's better to check if the memory is out of bound before
memcmp, although this possibility mainly comes from crafted poisonous
images.

Link: https://lkml.kernel.org/r/20240520024024.1976129-2-joseph.qi@linux.alibaba.com
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: lei lu <llfamsec@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-30 16:23:50 +02:00