CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new
performance monitoring features for AMD processors.
Bit 0 of EAX indicates support for Performance Monitoring Version 2
(PerfMonV2) features. If found to be set during PMU initialization,
the EBX bits of the same CPUID function can be used to determine
the number of available PMCs for different PMU types.
Expose the relevant bits via KVM_GET_SUPPORTED_CPUID so that
guests can make use of the PerfMonV2 features.
Co-developed-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
If AMD Performance Monitoring Version 2 (PerfMonV2) is detected by
the guest, it can use a new scheme to manage the Core PMCs using the
new global control and status registers.
In addition to benefiting from the PerfMonV2 functionality in the same
way as the host (higher precision), the guest also can reduce the number
of vm-exits by lowering the total number of MSRs accesses.
In terms of implementation details, amd_is_valid_msr() is resurrected
since three newly added MSRs could not be mapped to one vPMC.
The possibility of emulating PerfMonV2 on the mainframe has also
been eliminated for reasons of precision.
Co-developed-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: drop "Based on the observed HW." comments]
Link: https://lore.kernel.org/r/20230603011058.1038821-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Cap the number of general purpose counters enumerated on AMD to what KVM
actually supports, i.e. don't allow userspace to coerce KVM into thinking
there are more counters than actually exist, e.g. by enumerating
X86_FEATURE_PERFCTR_CORE in guest CPUID when its not supported.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20230603011058.1038821-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Enable and advertise PERFCTR_CORE if and only if the minimum number of
required counters are available, i.e. if perf says there are less than six
general purpose counters.
Opportunistically, use kvm_cpu_cap_check_and_set() instead of open coding
the check for host support.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: massage shortlog and changelog]
Link: https://lore.kernel.org/r/20230603011058.1038821-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable PMU support when running on AMD and perf reports fewer than four
general purpose counters. All AMD PMUs must define at least four counters
due to AMD's legacy architecture hardcoding the number of counters
without providing a way to enumerate the number of counters to software,
e.g. from AMD's APM:
The legacy architecture defines four performance counters (PerfCtrn)
and corresponding event-select registers (PerfEvtSeln).
Virtualizing fewer than four counters can lead to guest instability as
software expects four counters to be available. Rather than bleed AMD
details into the common code, just define a const unsigned int and
provide a convenient location to document why Intel and AMD have different
mins (in particular, AMD's lack of any way to enumerate less than four
counters to the guest).
Keep the minimum number of counters at Intel at one, even though old P6
and Core Solo/Duo processor effectively require a minimum of two counters.
KVM can, and more importantly has up until this point, supported a vPMU so
long as the CPU has at least one counter. Perf's support for P6/Core CPUs
does require two counters, but perf will happily chug along with a single
counter when running on a modern CPU.
Cc: Jim Mattson <jmattson@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: set Intel min to '1', not '2']
Link: https://lore.kernel.org/r/20230603011058.1038821-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the Intel PMU implementation of pmc_is_enabled() to common x86 code
as pmc_is_globally_enabled(), and drop AMD's implementation. AMD PMU
currently supports only v1, and thus not PERF_GLOBAL_CONTROL, thus the
semantics for AMD are unchanged. And when support for AMD PMU v2 comes
along, the common behavior will also Just Work.
Signed-off-by: Like Xu <likexu@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the handling of GLOBAL_CTRL, GLOBAL_STATUS, and GLOBAL_OVF_CTRL,
a.k.a. GLOBAL_STATUS_RESET, from Intel PMU code to generic x86 PMU code.
AMD PerfMonV2 defines three registers that have the same semantics as
Intel's variants, just with different names and indices. Conveniently,
since KVM virtualizes GLOBAL_CTRL on Intel only for PMU v2 and above, and
AMD's version shows up in v2, KVM can use common code for the existence
check as well.
Signed-off-by: Like Xu <likexu@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reject userspace writes to MSR_CORE_PERF_GLOBAL_STATUS that attempt to set
reserved bits. Allowing userspace to stuff reserved bits doesn't harm KVM
itself, but it's architecturally wrong and the guest can't clear the
unsupported bits, e.g. makes the guest's PMI handler very confused.
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: rewrite changelog to avoid use of #GP, rebase on name change]
Link: https://lore.kernel.org/r/20230603011058.1038821-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename global_ovf_ctrl_mask to global_status_mask to avoid confusion now
that Intel has renamed GLOBAL_OVF_CTRL to GLOBAL_STATUS_RESET in PMU v4.
GLOBAL_OVF_CTRL and GLOBAL_STATUS_RESET are the same MSR index, i.e. are
just different names for the same thing, but the SDM provides different
entries in the IA-32 Architectural MSRs table, which gets really confusing
when looking at PMU v4 definitions since it *looks* like GLOBAL_STATUS has
bits that don't exist in GLOBAL_OVF_CTRL, but in reality the bits are
simply defined in the GLOBAL_STATUS_RESET entry.
No functional change intended.
Cc: Like Xu <like.xu.linux@gmail.com>
Link: https://lore.kernel.org/r/20230603011058.1038821-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add MSR_IA32_TSX_CTRL into msrs_to_save[] to explicitly tell userspace to
save/restore the register value during migration. Missing this may cause
userspace that relies on KVM ioctl(KVM_GET_MSR_INDEX_LIST) fail to port the
value to the target VM.
In addition, there is no need to add MSR_IA32_TSX_CTRL when
ARCH_CAP_TSX_CTRL_MSR is not supported in kvm_get_arch_capabilities(). So
add the checking in kvm_probe_msr_to_save().
Fixes: c11f83e062 ("KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality")
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20230509032348.1153070-1-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop KVM's manipulation of guest's CPUID.0x12.1 ECX and EDX, i.e. the
allowed XFRM of SGX enclaves, now that KVM explicitly checks the guest's
allowed XCR0 when emulating ECREATE.
Note, this could theoretically break a setup where userspace advertises
a "bad" XFRM and relies on KVM to provide a sane CPUID model, but QEMU
is the only known user of KVM SGX, and QEMU explicitly sets the SGX CPUID
XFRM subleaf based on the guest's XCR0.
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230503160838.3412617-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly check the vCPU's supported XCR0 when determining whether or not
the XFRM for ECREATE is valid. Checking CPUID works because KVM updates
guest CPUID.0x12.1 to restrict the leaf to a subset of the guest's allowed
XCR0, but that is rather subtle and KVM should not modify guest CPUID
except for modeling true runtime behavior (allowed XFRM is most definitely
not "runtime" behavior).
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230503160838.3412617-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
no failure path requiring vcpu removal and destruction. Such order is
important because vcpu_array accessors may end up referencing vcpu at
vcpu_array[0] even before online_vcpus is set to 1.
When online_vcpus=0, any call to kvm_get_vcpu() goes through
array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
int num_vcpus = atomic_read(&kvm->online_vcpus);
i = array_index_nospec(i, num_vcpus);
return xa_load(&kvm->vcpu_array, i);
Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
an "empty" range, but actually [0, ULONG_MAX]:
xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
(atomic_read(&kvm->online_vcpus) - 1))
In both cases, such online_vcpus=0 edge case, even if leading to
unnecessary calls to XArray API, should not be an issue; requesting
unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
However, this means that when the first vCPU is created and inserted in
vcpu_array *and* before online_vcpus is incremented, code calling
kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
This should not pose a problem assuming that once a vcpu is stored in
vcpu_array, it will remain there, but that's not the case:
kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed
from the vcpu_array, then destroyed:
vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
kvm_get_kvm(kvm);
r = create_vcpu_fd(vcpu);
if (r < 0) {
xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
kvm_put_kvm_no_destroy(kvm);
goto unlock_vcpu_destroy;
}
atomic_inc(&kvm->online_vcpus);
This results in a possible race condition when a reference to a vcpu is
acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
vcpu is destroyed.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
Cc: stable@vger.kernel.org
Fixes: c5b0775491 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
been initiated to avoid re-enabling hardware between kvm_reboot() and
machine_{halt,power_off,restart}(). The restart case is especially
problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
is unable to wake and rendezvous with APs.
Note, this bug, and the original issue that motivated the addition of
kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
In a "normal" reboot, userspace will gracefully teardown userspace before
triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process
that might do ioctl(KVM_CREATE_VM) is long gone.
Fixes: 8e1c18157d ("KVM: VMX: Disable VMX when system shutdown")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use syscore_ops.shutdown to disable hardware virtualization during a
reboot instead of using the dedicated reboot_notifier so that KVM disables
virtualization _after_ system_state has been updated. This will allow
fixing a race in KVM's handling of a forced reboot where KVM can end up
enabling hardware virtualization between kernel_restart_prepare() and
machine_restart().
Rename KVM's hook to match the syscore op to avoid any possible confusion
from wiring up a "reboot" helper to a "shutdown" hook (neither "shutdown
nor "reboot" is completely accurate as the hook handles both).
Opportunistically rewrite kvm_shutdown()'s comment to make it less VMX
specific, and to explain why kvm_rebooting exists.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Zenghui Yu <yuzenghui@huawei.com>
Cc: kvmarm@lists.linux.dev
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
Cc: Anup Patel <anup@brainfault.org>
Cc: Atish Patra <atishp@atishpatra.org>
Cc: kvm-riscv@lists.infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20230512233127.804012-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM/arm64 fixes for 6.4, take #1
- Plug a race in the stage-2 mapping code where the IPA and the PA
would end up being out of sync
- Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)
- FP/SVE/SME documentation update, in the hope that this field
becomes clearer...
- Add workaround for the usual Apple SEIS brokenness
- Random comment fixes
Pull compute express link fixes from Dan Williams:
- Fix a compilation issue with DEFINE_STATIC_SRCU() in the unit tests
- Fix leaking kernel memory to a root-only sysfs attribute
* tag 'cxl-fixes-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl: Add missing return to cdat read error path
tools/testing/cxl: Use DEFINE_STATIC_SRCU()
Pull parisc architecture fixes from Helge Deller:
- Fix encoding of swp_entry due to added SWP_EXCLUSIVE flag
- Include reboot.h to avoid gcc-12 compiler warning
* tag 'parisc-for-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix encoding of swp_entry due to added SWP_EXCLUSIVE flag
parisc: kexec: include reboot.h
Pull ARM fixes from Russell King:
- fix unwinder for uleb128 case
- fix kernel-doc warnings for HP Jornada 7xx
- fix unbalanced stack on vfp success path
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9297/1: vfp: avoid unbalanced stack on 'success' return path
ARM: 9296/1: HP Jornada 7XX: fix kernel-doc warnings
ARM: 9295/1: unwind:fix unwind abort for uleb128 case
Pull locking fix from Borislav Petkov:
- Make sure __down_read_common() is always inlined so that the callers'
names land in traceevents output and thus the blocked function can be
identified
* tag 'locking_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Add __always_inline annotation to __down_read_common() and inlined callers
Pull perf fixes from Borislav Petkov:
- Make sure the PEBS buffer is flushed before reprogramming the
hardware so that the correct record sizes are used
- Update the sample size for AMD BRS events
- Fix a confusion with using the same on-stack struct with different
events in the event processing path
* tag 'perf_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG
perf/x86: Fix missing sample size update on AMD BRS
perf/core: Fix perf_sample_data not properly initialized for different swevents in perf_tp_event()
Pull scheduler fix from Borislav Petkov:
- Fix a couple of kernel-doc warnings
* tag 'sched_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: fix cid_lock kernel-doc warnings
Pull x86 fix from Borislav Petkov:
- Add the required PCI IDs so that the generic SMN accesses provided by
amd_nb.c work for drivers which switch to them. Add a PCI device ID
to k10temp's table so that latter is loaded on such systems too
* tag 'x86_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hwmon: (k10temp) Add PCI ID for family 19, model 78h
x86/amd_nb: Add PCI ID for family 19h model 78h
Pull timer fix from Borislav Petkov:
- Prevent CPU state corruption when an active clockevent broadcast
device is replaced while the system is already in oneshot mode
* tag 'timers_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/broadcast: Make broadcast device replacement work correctly
Pull ext4 fixes from Ted Ts'o:
"Some ext4 bug fixes (mostly to address Syzbot reports)"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: bail out of ext4_xattr_ibody_get() fails for any reason
ext4: add bounds checking in get_max_inline_xattr_value_size()
ext4: add indication of ro vs r/w mounts in the mount message
ext4: fix deadlock when converting an inline directory in nojournal mode
ext4: improve error recovery code paths in __ext4_remount()
ext4: improve error handling from ext4_dirhash()
ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled
ext4: check iomap type only if ext4_iomap_begin() does not fail
ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csum
ext4: fix data races when using cached status extents
ext4: avoid deadlock in fs reclaim with page writeback
ext4: fix invalid free tracking in ext4_xattr_move_to_block()
ext4: remove a BUG_ON in ext4_mb_release_group_pa()
ext4: allow ext4_get_group_info() to fail
ext4: fix lockdep warning when enabling MMP
ext4: fix WARNING in mb_find_extent
Pull SCSI fix from James Bottomley:
"A single small fix for the UFS driver to fix a power management
failure"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Fix I/O hang that occurs when BKOPS fails in W-LUN suspend
In ext4_update_inline_data(), if ext4_xattr_ibody_get() fails for any
reason, it's best if we just fail as opposed to stumbling on,
especially if the failure is EFSCORRUPTED.
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Whether the file system is mounted read-only or read/write is more
important than the quota mode, which we are already printing. Add the
ro vs r/w indication since this can be helpful in debugging problems
from the console log.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If there are failures while changing the mount options in
__ext4_remount(), we need to restore the old mount options.
This commit fixes two problem. The first is there is a chance that we
will free the old quota file names before a potential failure leading
to a use-after-free. The second problem addressed in this commit is
if there is a failed read/write to read-only transition, if the quota
has already been suspended, we need to renable quota handling.
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230506142419.984260-2-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When a file system currently mounted read/only is remounted
read/write, if we clear the SB_RDONLY flag too early, before the quota
is initialized, and there is another process/thread constantly
attempting to create a directory, it's possible to trigger the
WARN_ON_ONCE(dquot_initialize_needed(inode));
in ext4_xattr_block_set(), with the following stack trace:
WARNING: CPU: 0 PID: 5338 at fs/ext4/xattr.c:2141 ext4_xattr_block_set+0x2ef2/0x3680
RIP: 0010:ext4_xattr_block_set+0x2ef2/0x3680 fs/ext4/xattr.c:2141
Call Trace:
ext4_xattr_set_handle+0xcd4/0x15c0 fs/ext4/xattr.c:2458
ext4_initxattrs+0xa3/0x110 fs/ext4/xattr_security.c:44
security_inode_init_security+0x2df/0x3f0 security/security.c:1147
__ext4_new_inode+0x347e/0x43d0 fs/ext4/ialloc.c:1324
ext4_mkdir+0x425/0xce0 fs/ext4/namei.c:2992
vfs_mkdir+0x29d/0x450 fs/namei.c:4038
do_mkdirat+0x264/0x520 fs/namei.c:4061
__do_sys_mkdirat fs/namei.c:4076 [inline]
__se_sys_mkdirat fs/namei.c:4074 [inline]
__x64_sys_mkdirat+0x89/0xa0 fs/namei.c:4074
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230506142419.984260-1-tytso@mit.edu
Reported-by: syzbot+6385d7d3065524c5ca6d@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=6513f6cb5cd6b5fc9f37e3bb70d273b94be9c34c
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Ext4 has a filesystem wide lock protecting ext4_writepages() calls to
avoid races with switching of journalled data flag or inode format. This
lock can however cause a deadlock like:
CPU0 CPU1
ext4_writepages()
percpu_down_read(sbi->s_writepages_rwsem);
ext4_change_inode_journal_flag()
percpu_down_write(sbi->s_writepages_rwsem);
- blocks, all readers block from now on
ext4_do_writepages()
ext4_init_io_end()
kmem_cache_zalloc(io_end_cachep, GFP_KERNEL)
fs_reclaim frees dentry...
dentry_unlink_inode()
iput() - last ref =>
iput_final() - inode dirty =>
write_inode_now()...
ext4_writepages() tries to acquire sbi->s_writepages_rwsem
and blocks forever
Make sure we cannot recurse into filesystem reclaim from writeback code
to avoid the deadlock.
Reported-by: syzbot+6898da502aef574c5f8a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/0000000000004c66b405fa108e27@google.com
Fixes: c8585c6fca ("ext4: fix races between changing inode journal mode and ext4_writepages")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230504124723.20205-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Previously, ext4_get_group_info() would treat an invalid group number
as BUG(), since in theory it should never happen. However, if a
malicious attaker (or fuzzer) modifies the superblock via the block
device while it is the file system is mounted, it is possible for
s_first_data_block to get set to a very large number. In that case,
when calculating the block group of some block number (such as the
starting block of a preallocation region), could result in an
underflow and very large block group number. Then the BUG_ON check in
ext4_get_group_info() would fire, resutling in a denial of service
attack that can be triggered by root or someone with write access to
the block device.
For a quality of implementation perspective, it's best that even if
the system administrator does something that they shouldn't, that it
will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info()
will call ext4_error and return NULL. We also add fallback code in
all of the callers of ext4_get_group_info() that it might NULL.
Also, since ext4_get_group_info() was already borderline to be an
inline function, un-inline it. The results in a next reduction of the
compiled text size of ext4 by roughly 2k.
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Pull block fixes from Jens Axboe:
"Just a few minor fixes for drivers, and a deletion of a file that is
woefully out-of-date these days"
* tag 'block-6.4-2023-05-13' of git://git.kernel.dk/linux:
Documentation/block: drop the request.rst file
ublk: fix command op code check
block/rnbd: replace REQ_OP_FLUSH with REQ_OP_WRITE
nbd: Fix debugfs_create_dir error checking