[ Upstream commit cd24e2a60a ]
Fix an information leak where an uninitialized hole in struct
vfio_iommu_type1_info_cap_migration on the stack is exposed to userspace.
The definition of struct vfio_iommu_type1_info_cap_migration contains a hole as
shown in this pahole(1) output:
struct vfio_iommu_type1_info_cap_migration {
struct vfio_info_cap_header header; /* 0 8 */
__u32 flags; /* 8 4 */
/* XXX 4 bytes hole, try to pack */
__u64 pgsize_bitmap; /* 16 8 */
__u64 max_dirty_bitmap_size; /* 24 8 */
/* size: 32, cachelines: 1, members: 4 */
/* sum members: 28, holes: 1, sum holes: 4 */
/* last cacheline: 32 bytes */
};
The cap_mig variable is filled in without initializing the hole:
static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu,
struct vfio_info_cap *caps)
{
struct vfio_iommu_type1_info_cap_migration cap_mig;
cap_mig.header.id = VFIO_IOMMU_TYPE1_INFO_CAP_MIGRATION;
cap_mig.header.version = 1;
cap_mig.flags = 0;
/* support minimum pgsize */
cap_mig.pgsize_bitmap = (size_t)1 << __ffs(iommu->pgsize_bitmap);
cap_mig.max_dirty_bitmap_size = DIRTY_BITMAP_SIZE_MAX;
return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
}
The structure is then copied to a temporary location on the heap. At this point
it's already too late and ioctl(VFIO_IOMMU_GET_INFO) copies it to userspace
later:
int vfio_info_add_capability(struct vfio_info_cap *caps,
struct vfio_info_cap_header *cap, size_t size)
{
struct vfio_info_cap_header *header;
header = vfio_info_cap_add(caps, size, cap->id, cap->version);
if (IS_ERR(header))
return PTR_ERR(header);
memcpy(header + 1, cap + 1, size - sizeof(*header));
return 0;
}
This issue was found by code inspection.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Fixes: ad721705d0 ("vfio iommu: Add migration capability to report supported features")
Link: https://lore.kernel.org/r/20230801155352.1391945-1-stefanha@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4a9dd8f292 ]
With skiboot_defconfig, Clang reports:
CC arch/powerpc/mm/book3s64/radix_tlb.o
arch/powerpc/mm/book3s64/radix_tlb.c:419:20: error: unused function '_tlbie_pid_lpid' [-Werror,-Wunused-function]
static inline void _tlbie_pid_lpid(unsigned long pid, unsigned long lpid,
^
arch/powerpc/mm/book3s64/radix_tlb.c:663:20: error: unused function '_tlbie_va_range_lpid' [-Werror,-Wunused-function]
static inline void _tlbie_va_range_lpid(unsigned long start, unsigned long end,
^
This is because those functions are only called from functions
enclosed in a #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
Move below functions inside that #ifdef
* __tlbie_pid_lpid(unsigned long pid,
* __tlbie_va_lpid(unsigned long va, unsigned long pid,
* fixup_tlbie_pid_lpid(unsigned long pid, unsigned long lpid)
* _tlbie_pid_lpid(unsigned long pid, unsigned long lpid,
* fixup_tlbie_va_range_lpid(unsigned long va,
* __tlbie_va_range_lpid(unsigned long start, unsigned long end,
* _tlbie_va_range_lpid(unsigned long start, unsigned long end,
Fixes: f0c6fbbb90 ("KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202307260802.Mjr99P5O-lkp@intel.com/
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/3d72efd39f986ee939d068af69fdce28bd600766.1691568093.git.christophe.leroy@csgroup.eu
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4dd432d985 ]
Reconfiguring the clock divider to the exact same value is observed
on an i.MX8MN to often cause a longer than usual clock pause, probably
because the divider restarts counting whenever the register is rewritten.
This issue doesn't show up normally, because the clock framework will
take care to not call set_rate when the clock rate is the same.
However, when we reconfigure an upstream clock, the common code will
call set_rate with the newly calculated rate on all children, e.g.:
- sai5 is running normally and divides Audio PLL out by 16.
- Audio PLL rate is increased by 32Hz (glitch-free kdiv change)
- rates for children are recalculated and rates are set recursively
- imx8m_clk_composite_divider_set_rate(sai5) is called with
32/16 = 2Hz more
- imx8m_clk_composite_divider_set_rate computes same divider as before
- divider register is written, so it restarts counting from zero and
MCLK is briefly paused, so instead of e.g. 40ns, MCLK is low for 120ns.
Some external clock consumers can be upset by such unexpected clock pauses,
so let's make sure we only rewrite the divider value when the value to be
written is actually different.
Fixes: d3ff972813 ("clk: imx: Add imx composite clock")
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Link: https://lore.kernel.org/r/20230807082201.2332746-1-a.fatoum@pengutronix.de
Signed-off-by: Abel Vesa <abel.vesa@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e09060b3b6 ]
Don't assume that the device is fully under the control of ASPM and use RMW
capability accessors which do proper locking to avoid losing concurrent
updates to the register values.
If configuration fails in pcie_aspm_configure_common_clock(), the
function attempts to restore the old PCI_EXP_LNKCTL_CCC settings. Store
only the old PCI_EXP_LNKCTL_CCC bit for the relevant devices rather
than the content of the whole LNKCTL registers. It aligns better with
how pcie_lnkctl_clear_and_set() expects its parameter and makes the
code more obvious to understand.
Suggested-by: Lukas Wunner <lukas@wunner.de>
Fixes: 2a42d9dba7 ("PCIe: ASPM: Break out of endless loop waiting for PCI config bits to switch")
Fixes: 7d715a6c1a ("PCI: add PCI Express ASPM support")
Link: https://lore.kernel.org/r/20230717120503.15276-5-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5e70d0acf0 ]
Many places in the kernel write the Link Control and Root Control PCI
Express Capability Registers without proper concurrency control and this
could result in losing the changes one of the writers intended to make.
Add pcie_cap_lock spinlock into the struct pci_dev and use it to protect
bit changes made in the RMW capability accessors. Protect only a selected
set of registers by differentiating the RMW accessor internally to
locked/unlocked variants using a wrapper which has the same signature as
pcie_capability_clear_and_set_word(). As the Capability Register (pos)
given to the wrapper is always a constant, the compiler should be able to
simplify all the dead-code away.
So far only the Link Control Register (ASPM, hotplug, link retraining,
various drivers) and the Root Control Register (AER & PME) seem to
require RMW locking.
Suggested-by: Lukas Wunner <lukas@wunner.de>
Fixes: c7f486567c ("PCI PM: PCIe PME root port service driver")
Fixes: f12eb72a26 ("PCI/ASPM: Use PCI Express Capability accessors")
Fixes: 7d715a6c1a ("PCI: add PCI Express ASPM support")
Fixes: affa48de84 ("staging/rdma/hfi1: Add support for enabling/disabling PCIe ASPM")
Fixes: 849a9366cb ("misc: rtsx: Add support new chip rts5228 mmc: rtsx: Add support MMC_CAP2_NO_MMC")
Fixes: 3d1e7aa80d ("misc: rtsx: Use pcie_capability_clear_and_set_word() for PCI_EXP_LNKCTL")
Fixes: c0e5f4e73a ("misc: rtsx: Add support for RTS5261")
Fixes: 3df4fce739 ("misc: rtsx: separate aspm mode into MODE_REG and MODE_CFG")
Fixes: 121e9c6b5c ("misc: rtsx: modify and fix init_hw function")
Fixes: 19f3bd548f ("mfd: rtsx: Remove LCTLR defination")
Fixes: 773ccdfd9c ("mfd: rtsx: Read vendor setting from config space")
Fixes: 8275b77a15 ("mfd: rts5249: Add support for RTS5250S power saving")
Fixes: 5da4e04ae4 ("misc: rtsx: Add support for RTS5260")
Fixes: 0f49bfbd0f ("tg3: Use PCI Express Capability accessors")
Fixes: 5e7dfd0fb9 ("tg3: Prevent corruption at 10 / 100Mbps w CLKREQ")
Fixes: b726e493e8 ("r8169: sync existing 8168 device hardware start sequences with vendor driver")
Fixes: e6de30d63e ("r8169: more 8168dp support.")
Fixes: 8a06127602 ("Bluetooth: hci_bcm4377: Add new driver for BCM4377 PCIe boards")
Fixes: 6f461f6c7c ("e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata")
Fixes: 1eae4eb2a1 ("e1000e: Disable L1 ASPM power savings for 82573 mobile variants")
Fixes: 8060e169e0 ("ath9k: Enable extended synch for AR9485 to fix L0s recovery issue")
Fixes: 69ce674bfa ("ath9k: do btcoex ASPM disabling at initialization time")
Fixes: f37f055035 ("mt76: mt76x2e: disable pcie_aspm by default")
Link: https://lore.kernel.org/r/20230717120503.15276-2-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 278294798a ]
PCI config space access from user space has traditionally been
unrestricted with writes being an understood risk for device operation.
Unfortunately, device breakage or odd behavior from config writes lacks
indicators that can leave driver writers confused when evaluating
failures. This is especially true with the new PCIe Data Object
Exchange (DOE) mailbox protocol where backdoor shenanigans from user
space through things such as vendor defined protocols may affect device
operation without complete breakage.
A prior proposal restricted read and writes completely.[1] Greg and
Bjorn pointed out that proposal is flawed for a couple of reasons.
First, lspci should always be allowed and should not interfere with any
device operation. Second, setpci is a valuable tool that is sometimes
necessary and it should not be completely restricted.[2] Finally
methods exist for full lock of device access if required.
Even though access should not be restricted it would be nice for driver
writers to be able to flag critical parts of the config space such that
interference from user space can be detected.
Introduce pci_request_config_region_exclusive() to mark exclusive config
regions. Such regions trigger a warning and kernel taint if accessed
via user space.
Create pci_warn_once() to restrict the user from spamming the log.
[1] https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@dwillia2-desk3.amr.corp.intel.com/
[2] https://lore.kernel.org/all/YF8NGeGv9vYcMfTV@kroah.com/
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20220926215711.2893286-2-ira.weiny@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Stable-dep-of: 5e70d0acf0 ("PCI: Add locking to RMW PCI Express Capability Register accessors")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9e0f4f2918 ]
kvm_vfio_group_add() creates kvg instance, links it to kv->group_list,
and calls kvm_vfio_file_set_kvm() with kvg->file as an argument after
dropping kv->lock. If we race group addition and deletion calls, kvg
instance may get freed by the time we get around to calling
kvm_vfio_file_set_kvm().
Previous iterations of the code did not reference kvg->file outside of
the critical section, but used a temporary variable. Still, they had
similar problem of the file reference being owned by kvg structure and
potential for kvm_vfio_group_del() dropping it before
kvm_vfio_group_add() had a chance to complete.
Fix this by moving call to kvm_vfio_file_set_kvm() under the protection
of kv->lock. We already call it while holding the same lock when vfio
group is being deleted, so it should be safe here as well.
Fixes: 2fc1bec158 ("kvm: set/clear kvm to/from vfio_group when group add/delete")
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20230714224538.404793-1-dmitry.torokhov@gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 60c672b7f2 ]
ngroups is ext4_group_t (unsigned int) while next_linear_group treat it
in int. If ngroups is bigger than max number described by int, it will
be treat as a negative number. Then "return group + 1 >= ngroups ? 0 :
group + 1;" may keep returning 0.
Switch int to ext4_group_t in next_linear_group to fix the overflow.
Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ce53ad81ed ]
Current igen6_edac checks for pending errors before the registration
of the error handler. However, there is a possibility that the error
occurs during the registration process, leading to unhandled pending
errors and no future error events. This issue can be reproduced by
repeatedly injecting errors during the loading of the igen6_edac.
Fix this issue by moving the pending error handler after the registration
of the error handler, ensuring that no pending errors are left unhandled.
Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Reported-by: Ee Wey Lim <ee.wey.lim@intel.com>
Tested-by: Ee Wey Lim <ee.wey.lim@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20230725080427.23883-1-qiuxu.zhuo@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e3a3a097ea ]
The following debug object splat was observed in testing:
ODEBUG: free active (active state 0) object: 0000000097d23782 object type: work_struct hint: doe_statemachine_work+0x0/0x510
WARNING: CPU: 1 PID: 71 at lib/debugobjects.c:514 debug_print_object+0x7d/0xb0
...
Workqueue: pci 0000:36:00.0 DOE [1 doe_statemachine_work
RIP: 0010:debug_print_object+0x7d/0xb0
...
Call Trace:
? debug_print_object+0x7d/0xb0
? __pfx_doe_statemachine_work+0x10/0x10
debug_object_free.part.0+0x11b/0x150
doe_statemachine_work+0x45e/0x510
process_one_work+0x1d4/0x3c0
This occurs because destroy_work_on_stack() was called after signaling
the completion in the calling thread. This creates a race between
destroy_work_on_stack() and the task->work struct going out of scope in
pci_doe().
Signal the work complete after destroying the work struct. This is safe
because signal_task_complete() is the final thing the work item does and
the workqueue code is careful not to access the work struct after.
Fixes: abf04be0e7 ("PCI/DOE: Fix memory leak with CONFIG_DEBUG_OBJECTS=y")
Link: https://lore.kernel.org/r/20230726-doe-fix-v1-1-af07e614d4dd@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b9cbc06049 ]
Currently, as part of the qcom_pcie_perst_deassert() function, instead
of writing the updated value to clear PARF_MSTR_AXI_CLK_EN, the variable
"val" is re-read.
This must be fixed to ensure that the master clock supplied to the MHI
bus is correctly gated during L1.1/L1.2 to save power.
Thus, replace the line that re-reads "val" with a line that writes the
updated value to the register to clear PARF_MSTR_AXI_CLK_EN.
[kwilczynski: commit log]
Fixes: c457ac029e ("PCI: qcom-ep: Gate Master AXI clock to MHI bus during L1SS")
Link: https://lore.kernel.org/linux-pci/20230627141036.11600-1-manivannan.sadhasivam@linaro.org
Reported-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d8650c0c2a ]
The apple_pcie_setup_port() function computes ilog2(pcie->nvecs) to set
up the number of MSIs available for each port. However, it's called
before apple_msi_init(), which initializes pcie->nvecs.
Luckily, pcie->nvecs is part of kzalloc()-ed structure and, as such,
initialized as zero. ilog2(0) happens to be 0xffffffff which then simply
configures more MSIs in hardware than we have. This doesn't break
anything because we never hand out those vectors.
Thus, swap the order of the two calls so that the correctly initialized
value is then used.
[kwilczynski: commit log]
Link: https://lore.kernel.org/linux-pci/20230311133453.63246-1-sven@svenpeter.dev
Fixes: 476c41ed45 ("PCI: apple: Implement MSI support")
Signed-off-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reviewed-by: Eric Curtin <ecurtin@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b8d72e32e1 ]
The adapter scan ssif_info_find() sets info->adapter_name if the adapter
info came from SMBIOS, as it's not set in that case. However, this
function can be called more than once, and it will leak the adapter name
if it had already been set. So check for NULL before setting it.
Fixes: c4436c9149 ("ipmi_ssif: avoid registering duplicate ssif interface")
Signed-off-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 67de40c9df ]
Before committing 79597c8bf6, *rac97 always be NULL if there is
an error. When error happens, make sure *rac97 is NULL is safer.
For examble, in snd_vortex_mixer():
err = snd_ac97_mixer(pbus, &ac97, &vortex->codec);
vortex->isquad = ((vortex->codec == NULL) ?
0 : (vortex->codec->ext_id&0x80));
If error happened but vortex->codec isn't NULL, this may cause some
problems.
Move the judgement order to be clearer and better.
Fixes: 79597c8bf6 ("ALSA: ac97: Fix possible NULL dereference in snd_ac97_mixer")
Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Su Hui <suhui@nfschina.com>
Link: https://lore.kernel.org/r/20230823025212.1000961-1-suhui@nfschina.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a9515ff4fb ]
When of_overlay_fdt_apply() fails, the changeset may be partially
applied, and the caller is still expected to call of_overlay_remove() to
clean up this partial state.
However, of_overlay_apply() calls of_resolve_phandles() before
init_overlay_changeset(). Hence if the overlay fails to apply due to an
unresolved symbol, the overlay_changeset.cset.entries list is still
uninitialized, and cleanup will crash with a NULL-pointer dereference in
overlay_removal_is_ok().
Fix this by moving the call to of_changeset_init() from
init_overlay_changeset() to of_overlay_fdt_apply(), where all other
early initialization is done.
Fixes: f948d6d8b7 ("of: overlay: avoid race condition between applying multiple overlays")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/4f1d6d74b61cba2599026adb6d1948ae559ce91f.1690533838.git.geert+renesas@glider.be
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 38592ae6dc ]
DSP_SW_INTR_STAT_OFFSET is a common interrupt register which will be
accessed by both ACP firmware and driver. This register contains register
bits corresponds to host to dsp interrupts and vice versa.
when dsp to host interrupt is reported, only clear dsp to host
interrupt bit in DSP_SW_INTR_STAT_OFFSET.
Fixes: 2e7c6652f9 ("ASoC: SOF: amd: Fix for handling spurious interrupts from DSP")
Signed-off-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Link: https://lore.kernel.org/r/20230823073340.2829821-7-Vijendar.Mukunda@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cc22b5407e ]
When a bio is split by md raid0, the newly created bio will not be tracked
by md for I/O accounting. Only the portion of I/O still assigned to the
original bio which was reduced by the split will be accounted for. This
results in md iostat data sometimes showing I/O values far below the actual
amount of data being sent through md.
md_account_bio() needs to be called for all bio generated by the bio split.
A simple example of the issue was generated using a raid0 device on partitions
to the same device. Since all raid0 I/O then goes to one device, it makes it
easy to see a gap between the md device and its sd storage. Reading an lvm
device on top of the md device, the iostat output (some 0 columns and extra
devices removed to make the data more compact) was:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read
md2 0.00 0.00 0.00 0.00 0
sde 0.00 0.00 0.00 0.00 0
md2 1364.00 411496.00 0.00 0.00 411496
sde 1734.00 646144.00 0.00 0.00 646144
md2 1699.00 510680.00 0.00 0.00 510680
sde 2155.00 802784.00 0.00 0.00 802784
md2 803.00 241480.00 0.00 0.00 241480
sde 1016.00 377888.00 0.00 0.00 377888
md2 0.00 0.00 0.00 0.00 0
sde 0.00 0.00 0.00 0.00 0
I/O was generated doing large direct I/O reads (12M) with dd to a linear
lvm volume on top of the 4 leg raid0 device.
The md2 reads were showing as roughly 2/3 of the reads to the sde device
containing all of md2's raid partitions. The sum of reads to sde was
1826816 kB, which was the expected amount as it was the amount read by
dd. With the patch, the total reads from md will match the reads from
sde and be consistent with the amount of I/O generated.
Fixes: 10764815ff ("md: add io accounting for raid0 and raid5")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 319ff40a54 ]
Commit f00d7c85be ("md/raid0: fix up bio splitting.") among other
things changed how bio that needs to be split is submitted. Before this
commit, we have split the bio, mapped and submitted each part. After
this commit, we map only the first part of the split bio and submit the
second part unmapped. Due to bio sorting in __submit_bio_noacct() this
results in the following request ordering:
9,0 18 1181 0.525037895 15995 Q WS 1479315464 + 63392
Split off chunk-sized (1024 sectors) request:
9,0 18 1182 0.629019647 15995 X WS 1479315464 / 1479316488
Request is unaligned to the chunk so it's split in
raid0_make_request(). This is the first part mapped and punted to
bio_list:
8,0 18 7053 0.629020455 15995 A WS 739921928 + 1016 <- (9,0) 1479315464
Now raid0_make_request() returns, second part is postponed on
bio_list. __submit_bio_noacct() resorts the bio_list, mapped request
is submitted to the underlying device:
8,0 18 7054 0.629022782 15995 G WS 739921928 + 1016
Now we take another request from the bio_list which is the remainder
of the original huge request. Split off another chunk-sized bit from
it and the situation repeats:
9,0 18 1183 0.629024499 15995 X WS 1479316488 / 1479317512
8,16 18 6998 0.629025110 15995 A WS 739921928 + 1016 <- (9,0) 1479316488
8,16 18 6999 0.629026728 15995 G WS 739921928 + 1016
...
9,0 18 1184 0.629032940 15995 X WS 1479317512 / 1479318536 [libnetacq-write]
8,0 18 7059 0.629033294 15995 A WS 739922952 + 1016 <- (9,0) 1479317512
8,0 18 7060 0.629033902 15995 G WS 739922952 + 1016
...
This repeats until we consume the whole original huge request. Now we
finally get to processing the second parts of the split off requests
(in reverse order):
8,16 18 7181 0.629161384 15995 A WS 739952640 + 8 <- (9,0) 1479377920
8,0 18 7239 0.629162140 15995 A WS 739952640 + 8 <- (9,0) 1479376896
8,16 18 7186 0.629163881 15995 A WS 739951616 + 8 <- (9,0) 1479375872
8,0 18 7242 0.629164421 15995 A WS 739951616 + 8 <- (9,0) 1479374848
...
I guess it is obvious that this IO pattern is extremely inefficient way
to perform sequential IO. It also makes bio_list to grow to rather long
lengths.
Change raid0_make_request() to map both parts of the split bio. Since we
know we are provided with at most chunk-sized bios, we will always need
to split the incoming bio at most once.
Fixes: f00d7c85be ("md/raid0: fix up bio splitting.")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c31fea2f8e ]
After the commit 9631abdbf406c("md: Set MD_BROKEN for RAID1 and RAID10")
MD_BROKEN must be set if array is failed because state_store() checks it.
If it is set then -EBUSY is returned to userspace.
For raid0 and linear MD_BROKEN is not set by error_handler(). As a result
mdadm is unable to trigger clean-up actions. It is a regression.
This patch adds appropriate error_handler for raid0 and linear. The
error handler sets MD_BROKEN for this device.
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com
Stable-dep-of: 319ff40a54 ("md/raid0: Fix performance regression for large sequential writes")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7ac1102b22 ]
Before adding a new FW control, its name is checked against
existing controls list. But the string length in strncmp used
to compare controls names is taken from the list, so if beginnings
of the controls are matching, then the new control is not created.
For example, if CAL_R control already exists, CAL_R_SELECTED
is not created.
The fix is to compare string lengths as well.
Fixes: 6477960755 ("ASoC: wm_adsp: Move check for control existence")
Signed-off-by: Vlad Karpovich <vkarpovi@opensource.cirrus.com>
Link: https://lore.kernel.org/r/20230815172908.3454056-1-vkarpovi@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0d0bd28c50 ]
r5l_flush_stripe_to_raid() will check if the list 'flushing_ios' is
empty, and then submit 'flush_bio', however, r5l_log_flush_endio()
is clearing the list first and then clear the bio, which will cause
null-ptr-deref:
T1: submit flush io
raid5d
handle_active_stripes
r5l_flush_stripe_to_raid
// list is empty
// add 'io_end_ios' to the list
bio_init
submit_bio
// io1
T2: io1 is done
r5l_log_flush_endio
list_splice_tail_init
// clear the list
T3: submit new flush io
...
r5l_flush_stripe_to_raid
// list is empty
// add 'io_end_ios' to the list
bio_init
bio_uninit
// clear bio->bi_blkg
submit_bio
// null-ptr-deref
Fix this problem by clearing bio before clearing the list in
r5l_log_flush_endio().
Fixes: 0dd00cba99 ("raid5-cache: fully initialize flush_bio when needed")
Reported-and-tested-by: Corey Hickey <bugfood-ml@fatooh.org>
Closes: https://lore.kernel.org/all/cddd7213-3dfd-4ab7-a3ac-edd54d74a626@fatooh.org/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a705b11b35 ]
Commit b13015af94 ("md/raid5-cache: Clear conf->log after finishing
work") introduce a new problem:
// caller hold reconfig_mutex
r5l_exit_log
flush_work(&log->disable_writeback_work)
r5c_disable_writeback_async
wait_event
/*
* conf->log is not NULL, and mddev_trylock()
* will fail, wait_event() can never pass.
*/
conf->log = NULL
Fix this problem by setting 'config->log' to NULL before wake_up() as it
used to be, so that wait_event() from r5c_disable_writeback_async() can
exist. In the meantime, move forward md_unregister_thread() so that
null-ptr-deref this commit fixed can still be fixed.
Fixes: b13015af94 ("md/raid5-cache: Clear conf->log after finishing work")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230708091727.1417894-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>