linux

mirror of https://github.com/hardkernel/linux.git synced 2026-06-07 03:15:31 +09:00

Author	SHA1	Message	Date
Dmitry Baryshkov	59938d4f05	clk: qcom: dispcc-sm8550: fix several supposed typos [ Upstream commit 7b6a4b907297b28727933493b9e0c95494504634 ] Fix seveal odd-looking places in SM8550's dispcc driver: - duplicate entries in disp_cc_parent_map_4 and disp_cc_parent_map_5 - using &disp_cc_mdss_dptx0_link_div_clk_src as a source for disp_cc_mdss_dptx1_usb_router_link_intf_clk The SM8650 driver has been used as a reference. Fixes: `90114ca114` ("clk: qcom: add SM8550 DISPCC driver") Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org> Link: https://lore.kernel.org/r/20240717-dispcc-sm8550-fixes-v2-1-5c4a3128c40b@linaro.org Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:26 +02:00
Jonas Karlman	77c859e8b8	clk: rockchip: Set parent rate for DCLK_VOP clock on RK3228 [ Upstream commit 1d34b9757523c1ad547bd6d040381f62d74a3189 ] Similar to DCLK_LCDC on RK3328, the DCLK_VOP on RK3228 is typically parented by the hdmiphy clk and it is expected that the DCLK_VOP and hdmiphy clk rate are kept in sync. Use CLK_SET_RATE_PARENT and CLK_SET_RATE_NO_REPARENT flags, same as used on RK3328, to make full use of all possible supported display modes. Fixes: `0a9d4ac08e` ("clk: rockchip: set the clock ids for RK3228 VOP") Fixes: `307a2e9ac5` ("clk: rockchip: add clock controller for rk3228") Signed-off-by: Jonas Karlman <jonas@kwiboo.se> Link: https://lore.kernel.org/r/20240615170417.3134517-3-jonas@kwiboo.se Signed-off-by: Heiko Stuebner <heiko@sntech.de> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:26 +02:00
Peng Fan	d271e66f74	remoteproc: imx_rproc: Initialize workqueue earlier [ Upstream commit 858e57c1d3dd7b92cc0fa692ba130a0a5d57e49d ] Initialize workqueue before requesting mailbox channel, otherwise if mailbox interrupt comes before workqueue ready, the imx_rproc_rx_callback will trigger issue. Fixes: `2df7062002` ("remoteproc: imx_proc: enable virtio/mailbox") Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com> Link: https://lore.kernel.org/r/20240719-imx_rproc-v2-3-10d0268c7eb1@nxp.com Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:26 +02:00
Peng Fan	2941577c76	remoteproc: imx_rproc: Correct ddr alias for i.MX8M [ Upstream commit c901f817792822eda9cec23814a4621fa3e66695 ] The DDR Alias address should be 0x40000000 according to RM, so correct it. Fixes: `4ab8f9607a` ("remoteproc: imx_rproc: support i.MX8MQ/M") Reported-by: Terry Lv <terry.lv@nxp.com> Reviewed-by: Iuliana Prodan <iuliana.prodan@nxp.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com> Link: https://lore.kernel.org/r/20240719-imx_rproc-v2-1-10d0268c7eb1@nxp.com Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:26 +02:00
Peng Fan	af70d9395d	clk: imx: imx8qxp: Parent should be initialized earlier than the clock [ Upstream commit 766c386c16c9899461b83573a06380d364c6e261 ] The initialization order of SCU clocks affects the sequence of SCU clock resume. If there are no other effects, the earlier the initialization, the earlier the resume. During SCU clock resume, the clock rate is restored. As SCFW guidelines, configure the parent clock rate before configuring the child rate. Fixes: `babfaa9556` ("clk: imx: scu: add more scu clocks") Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Link: https://lore.kernel.org/r/20240607133347.3291040-15-peng.fan@oss.nxp.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:26 +02:00
Peng Fan	d64513b2da	clk: imx: imx8qxp: Register dc0_bypass0_clk before disp clk [ Upstream commit e61352d5ecdc0da2e7253121c15d9a3e040f78a1 ] The initialization order of SCU clocks affects the sequence of SCU clock resume. If there are no other effects, the earlier the initialization, the earlier the resume. During SCU clock resume, the clock rate is restored. As SCFW guidelines, configure the parent clock rate before configuring the child rate. Fixes: `91e916771d` ("clk: imx: scu: remove legacy scu clock binding support") Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Link: https://lore.kernel.org/r/20240607133347.3291040-14-peng.fan@oss.nxp.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:26 +02:00
Zhipeng Wang	5b44298953	clk: imx: imx8mp: fix clock tree update of TF-A managed clocks [ Upstream commit 3d29036853b9cb07ac49e8261fca82a940be5c41 ] On the i.MX8M*, the TF-A exposes a SiP (Silicon Provider) service for DDR frequency scaling. The imx8m-ddrc-devfreq driver calls the SiP and then does clk_set_parent on the DDR muxes to synchronize the clock tree. since commit `936c383673` ("clk: imx: fix composite peripheral flags"), these TF-A managed muxes have SET_PARENT_GATE set, which results in imx8m-ddrc-devfreq's clk_set_parent after SiP failing with -EBUSY: clk_set_parent(dram_apb_src, sys1_pll_40m);(busfreq-imx8mq.c) commit `926bf91248` ("clk: imx8m: fix clock tree update of TF-A managed clocks") adds this method and enables 8mm, 8mn and 8mq. i.MX8MP also needs it. This is safe to do, because updating the Linux clock tree to reflect reality will always be glitch-free. Another reason to this patch is that powersave image BT music requires dram to be 400MTS, so clk_set_parent(dram_alt_src, sys1_pll_800m); is required. Without this patch, it will not succeed. Fixes: `936c383673` ("clk: imx: fix composite peripheral flags") Signed-off-by: Zhipeng Wang <zhipeng.wang_1@nxp.com> Reviewed-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Link: https://lore.kernel.org/r/20240607133347.3291040-7-peng.fan@oss.nxp.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:26 +02:00
Pengfei Li	908165b5d3	clk: imx: fracn-gppll: fix fractional part of PLL getting lost [ Upstream commit 7622f888fca125ae46f695edf918798ebc0506c5 ] Fractional part of PLL gets lost after re-enabling the PLL. the MFN can NOT be automatically loaded when doing frac PLL enable/disable, So when re-enable PLL, configure mfn explicitly. Fixes: `1b26cb8a77` ("clk: imx: support fracn gppll") Signed-off-by: Pengfei Li <pengfei.li_1@nxp.com> Reviewed-by: Jacky Bai <ping.bai@nxp.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Link: https://lore.kernel.org/r/20240607133347.3291040-5-peng.fan@oss.nxp.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:25 +02:00
Ye Li	ed323659a0	clk: imx: composite-7ulp: Check the PCC present bit [ Upstream commit 4717ccadb51e2630790dddd222830702de17f090 ] When some module is disabled by fuse, its PCC PR bit is default 0 and PCC is not operational. Any write to this PCC will cause SError. Fixes: `b40ba80653` ("clk: imx: Update the compsite driver to support imx8ulp") Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Ye Li <ye.li@nxp.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Link: https://lore.kernel.org/r/20240607133347.3291040-4-peng.fan@oss.nxp.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:25 +02:00
Jacky Bai	c1eb71fd98	clk: imx: composite-93: keep root clock on when mcore enabled [ Upstream commit d342df11726bfac9c3a9d2037afa508ac0e9e44e ] Previously we assumed that the root clock slice is enabled by default when kernel boot up. But the bootloader may disable the clocks before jump into kernel. The gate ops should be registered rather than NULL to make sure the disabled clock can be enabled when kernel boot up. Refine the code to skip disable the clock if mcore booted. Fixes: `a740d7350f` ("clk: imx: imx93: add mcore_booted module paratemter") Signed-off-by: Jacky Bai <ping.bai@nxp.com> Reviewed-by: Peng Fan <peng.fan@nxp.com> Tested-by: Chancel Liu <chancel.liu@nxp.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Link: https://lore.kernel.org/r/20240607133347.3291040-3-peng.fan@oss.nxp.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:25 +02:00
Peng Fan	73034d130b	clk: imx: composite-8m: Enable gate clk with mcore_booted [ Upstream commit 8f32e9dd0916eb3fd4bcf550ed1d04542a65cb9e ] Bootloader might disable some CCM ROOT Slices. So if mcore_booted set with display CCM ROOT disabled by Bootloader, kernel display BLK CTRL driver imx8m_blk_ctrl_driver_init may hang the system because the BUS clk is disabled. Add back gate ops, but with disable doing nothing, then the CCM ROOT will be enabled when used. Fixes: `bb7e897b00` ("clk: imx8m: check mcore_booted before register clk") Reviewed-by: Ye Li <ye.li@nxp.com> Reviewed-by: Jacky Bai <ping.bai@nxp.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Abel Vesa <abel.vesa@linaro.org> Link: https://lore.kernel.org/r/20240607133347.3291040-2-peng.fan@oss.nxp.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:25 +02:00
Markus Elfring	554c590d22	clk: imx: composite-8m: Less function calls in __imx8m_clk_hw_composite() after error detection [ Upstream commit fed6bf52c86df27ad4f39a72cdad8c27da9a50ba ] The function “kfree” was called in up to three cases by the function “__imx8m_clk_hw_composite” during error handling even if the passed variables contained a null pointer. Adjust jump targets according to the Linux coding style convention. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/147ca1e6-69f3-4586-b5b3-b69f9574a862@web.de Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Stable-dep-of: 8f32e9dd0916 ("clk: imx: composite-8m: Enable gate clk with mcore_booted") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:25 +02:00
Sebastien Laveze	c2ee6de22d	clk: imx: imx6ul: fix default parent for enet*_ref_sel [ Upstream commit e52fd71333b4ed78fd5bb43094bdc46476614d25 ] The clk_set_parent for "enet1_ref_sel" and "enet2_ref_sel" are incorrect, therefore the original requirements to have "enet_clk_ref" as output sourced by iMX ENET PLL as a default config is not met. Only "enet[1,2]_ref_125m" "enet[1,2]_ref_pad" are possible parents for "enet1_ref_sel" and "enet2_ref_sel". This was observed as a regression using a custom device tree which was expecting this default config. This can be fixed at the device tree level but having a default config matching the original behavior (before refclock mux) will avoid breaking existing configs. Fixes: `4e197ee880` ("clk: imx6ul: add ethernet refclock mux support") Link: https://lore.kernel.org/lkml/20230306020226.GC143566@dragon/T/ Signed-off-by: Sebastien Laveze <slaveze@smartandconnective.com> Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20240528151434.227602-1-slaveze@smartandconnective.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:25 +02:00
Shengjiu Wang	bd553be1cf	clk: imx: clk-audiomix: Correct parent clock for earc_phy and audpll [ Upstream commit d40371a1c963db688b37826adaf5ffdafb0862a1 ] According to Reference Manual of i.MX8MP The parent clock of "earc_phy" is "sai_pll_out_div2", The parent clock of "audpll" is "osc_24m". Add CLK_GATE_PARENT() macro for usage of specifying parent clock. Fixes: `6cd95f7b15` ("clk: imx: imx8mp: Add audiomix block control") Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com> Reviewed-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/1718350923-21392-6-git-send-email-shengjiu.wang@nxp.com Signed-off-by: Abel Vesa <abel.vesa@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:25 +02:00
Ian Rogers	3ba5a2e91c	perf time-utils: Fix 32-bit nsec parsing [ Upstream commit 38e2648a81204c9fc5b4c87a8ffce93a6ed91b65 ] The "time utils" test fails in 32-bit builds: ... parse_nsec_time("18446744073.709551615") Failed. ptime 4294967295709551615 expected 18446744073709551615 ... Switch strtoul to strtoull as an unsigned long in 32-bit build isn't 64-bits. Fixes: `c284d669a2` ("perf tools: Move parse_nsec_time to time-utils.c") Signed-off-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Colin Ian King <colin.i.king@gmail.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Garry <john.g.garry@oracle.com> Cc: Junhao He <hejunhao3@huawei.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Yang Jihong <yangjihong@bytedance.com> Link: https://lore.kernel.org/r/20240831070415.506194-3-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:24 +02:00
Yang Jihong	022f9328ef	perf sched timehist: Fixed timestamp error when unable to confirm event sched_in time [ Upstream commit 39c243411bdb8fb35777adf49ee32549633c4e12 ] If sched_in event for current task is not recorded, sched_in timestamp will be set to end_time of time window interest, causing an error in timestamp show. In this case, we choose to ignore this event. Test scenario: perf[1229608] does not record the first sched_in event, run time and sch delay are both 0 # perf sched timehist Samples of sched_switch event do not have callchains. time cpu task name wait time sch delay run time [tid/pid] (msec) (msec) (msec) --------------- ------ ------------------------------ --------- --------- --------- 2090450.763231 [0000] perf[1229608] 0.000 0.000 0.000 2090450.763235 [0000] migration/0[15] 0.000 0.001 0.003 2090450.763263 [0001] perf[1229608] 0.000 0.000 0.000 2090450.763268 [0001] migration/1[21] 0.000 0.001 0.004 2090450.763302 [0002] perf[1229608] 0.000 0.000 0.000 2090450.763309 [0002] migration/2[27] 0.000 0.001 0.007 2090450.763338 [0003] perf[1229608] 0.000 0.000 0.000 2090450.763343 [0003] migration/3[33] 0.000 0.001 0.004 Before: arbitrarily specify a time window of interest, timestamp will be set to an incorrect value # perf sched timehist --time 100,200 Samples of sched_switch event do not have callchains. time cpu task name wait time sch delay run time [tid/pid] (msec) (msec) (msec) --------------- ------ ------------------------------ --------- --------- --------- 200.000000 [0000] perf[1229608] 0.000 0.000 0.000 200.000000 [0001] perf[1229608] 0.000 0.000 0.000 200.000000 [0002] perf[1229608] 0.000 0.000 0.000 200.000000 [0003] perf[1229608] 0.000 0.000 0.000 200.000000 [0004] perf[1229608] 0.000 0.000 0.000 200.000000 [0005] perf[1229608] 0.000 0.000 0.000 200.000000 [0006] perf[1229608] 0.000 0.000 0.000 200.000000 [0007] perf[1229608] 0.000 0.000 0.000 After: # perf sched timehist --time 100,200 Samples of sched_switch event do not have callchains. time cpu task name wait time sch delay run time [tid/pid] (msec) (msec) (msec) --------------- ------ ------------------------------ --------- --------- --------- Fixes: `853b740711` ("perf sched timehist: Add option to specify time window of interest") Signed-off-by: Yang Jihong <yangjihong@bytedance.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240819024720.2405244-1-yangjihong@bytedance.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:24 +02:00
Yicong Yang	fa0720b32a	perf stat: Display iostat headers correctly [ Upstream commit 2615639352420e6e3115952c5b8f46846e1c6d0e ] Currently we'll only print metric headers for metric leader in aggregration mode. This will make `perf iostat` header not shown since it'll aggregrated globally but don't have metric events: root@ubuntu204:/home/yang/linux/tools/perf# ./perf stat --iostat --timeout 1000 Performance counter stats for 'system wide': port 0000:00 0 0 0 0 0000:80 0 0 0 0 [...] Fix this by excluding the iostat in the check of printing metric headers. Then we can see the headers: root@ubuntu204:/home/yang/linux/tools/perf# ./perf stat --iostat --timeout 1000 Performance counter stats for 'system wide': port Inbound Read(MB) Inbound Write(MB) Outbound Read(MB) Outbound Write(MB) 0000:00 0 0 0 0 0000:80 0 0 0 0 [...] Fixes: 193a9e30207f5477 ("perf stat: Don't display metric header for non-leader uncore events") Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Junhao He <hejunhao3@huawei.com> Cc: linuxarm@huawei.com Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Cc: Zeng Tao <prime.zeng@hisilicon.com> Link: https://lore.kernel.org/r/20240802065800.48774-1-yangyicong@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:24 +02:00
Yang Jihong	505ec05002	perf sched timehist: Fix missing free of session in perf_sched__timehist() [ Upstream commit 6bdf5168b6fb19541b0c1862bdaa596d116c7bfb ] When perf_time__parse_str() fails in perf_sched__timehist(), need to free session that was previously created, fix it. Fixes: `853b740711` ("perf sched timehist: Add option to specify time window of interest") Signed-off-by: Yang Jihong <yangjihong@bytedance.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240806023533.1316348-1-yangjihong@bytedance.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:24 +02:00
Kan Liang	88c4b5dd21	perf report: Fix --total-cycles --stdio output error [ Upstream commit 3ef44458071a19e5b5832cdfe6f75273aa521b6e ] The --total-cycles may output wrong information with the --stdio. For example: # perf record -e "{cycles,instructions}",cache-misses -b sleep 1 # perf report --total-cycles --stdio The total cycles output of {cycles,instructions} and cache-misses are almost the same. # Samples: 938 of events 'anon group { cycles, instructions }' # Event count (approx.): 938 # # Sampled Cycles% Sampled Cycles Avg Cycles% Avg Cycles [Program Block Range] # ............... .............. ........... .......... ..................................................> # 11.19% 2.6K 0.10% 21 [perf_iterate_ctx+48 -> > 5.79% 1.4K 0.45% 97 [__intel_pmu_enable_all.constprop.0+80 -> __intel_> 5.11% 1.2K 0.33% 71 [native_write_msr+0 ->> # Samples: 293 of event 'cache-misses' # Event count (approx.): 293 # # Sampled Cycles% Sampled Cycles Avg Cycles% Avg Cycles [Program Block Range] # ............... .............. ........... .......... ..................................................> # 11.19% 2.6K 0.13% 21 [perf_iterate_ctx+48 -> > 5.79% 1.4K 0.59% 97 [__intel_pmu_enable_all.constprop.0+80 -> __intel_> 5.11% 1.2K 0.43% 71 [native_write_msr+0 ->> With the symbol_conf.event_group, the 'perf report' should only report the block information of the leader event in a group. However, the current implementation retrieves the next event's block information, rather than the next group leader's block information. Make sure the index is updated even if the event is skipped. With the patch, # Samples: 293 of event 'cache-misses' # Event count (approx.): 293 # # Sampled Cycles% Sampled Cycles Avg Cycles% Avg Cycles [Program Block Range] # ............... .............. ........... .......... ..................................................> # 37.98% 9.0K 4.05% 299 [perf_event_addr_filters_exec+0 -> perf_event_a> 11.19% 2.6K 0.28% 21 [perf_iterate_ctx+48 -> > 5.79% 1.4K 1.32% 97 [__intel_pmu_enable_all.constprop.0+80 -> __intel_> Fixes: `6f7164fa23` ("perf report: Sort by sampled cycles percent per block for stdio") Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20240813160208.2493643-2-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:24 +02:00
Namhyung Kim	297871cb51	perf ui/browser/annotate: Use global annotation_options [ Upstream commit 22197fb296913f83c7182befd2a8b23bf042f279 ] Now it can use the global options and no need save local browser options separately. Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20231128175441.721579-6-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Stable-dep-of: 3ef44458071a ("perf report: Fix --total-cycles --stdio output error") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:24 +02:00
Namhyung Kim	4c857dcf34	perf annotate: Move some source code related fields from 'struct annotation' to 'struct annotated_source' [ Upstream commit 0aae4c99c5f8f748c6cb5ca03bb3b3ae8cfb10df ] Some fields in the 'struct annotation' are only used with 'struct annotated_source' so better to be moved there in order to reduce memory consumption for other symbols. Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20231103191907.54531-5-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Stable-dep-of: 3ef44458071a ("perf report: Fix --total-cycles --stdio output error") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:24 +02:00
Namhyung Kim	4ef032d899	perf annotate: Split branch stack cycles info from 'struct annotation' [ Upstream commit b7f87e32590bf48eca84f729d3422be7b8dc22d3 ] The cycles info is only meaningful when sample has branch stacks. To save the memory for normal cases, move those fields to a new 'struct annotated_branch' and dynamically allocate it when needed. Also move cycles_hist from annotated_source as it's related here. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20231103191907.54531-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Stable-dep-of: 3ef44458071a ("perf report: Fix --total-cycles --stdio output error") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:23 +02:00
Ian Rogers	ba18185bea	perf inject: Fix leader sampling inserting additional samples [ Upstream commit 79bcd34e0f3da39fda841406ccc957405e724852 ] The processing of leader samples would turn an individual sample with a group of read values into multiple samples. 'perf inject' would pass through the additional samples increasing the output data file size: $ perf record -g -e "{instructions,cycles}:S" -o perf.orig.data true $ perf script -D -i perf.orig.data \| sed -e 's/perf.orig.data/perf.data/g' > orig.txt $ perf inject -i perf.orig.data -o perf.new.data $ perf script -D -i perf.new.data \| sed -e 's/perf.new.data/perf.data/g' > new.txt $ diff -u orig.txt new.txt --- orig.txt 2024-07-29 14:29:40.606576769 -0700 +++ new.txt 2024-07-29 14:30:04.142737434 -0700 ... -0xc550@perf.data [0x30]: event: 3 +0xc550@perf.data [0xd0]: event: 9 +. +. ... raw event: size 208 bytes +. 0000: 09 00 00 00 01 00 d0 00 fc 72 01 86 ff ff ff ff .........r...... +. 0010: 74 7d 2c 00 74 7d 2c 00 fb c3 79 f9 ba d5 05 00 t},.t},...y..... +. 0020: e6 cb 1a 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ +. 0030: 02 00 00 00 00 00 00 00 76 01 00 00 00 00 00 00 ........v....... +. 0040: e6 cb 1a 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ +. 0050: 62 18 00 00 00 00 00 00 f6 cb 1a 00 00 00 00 00 b............... +. 0060: 00 00 00 00 00 00 00 00 0c 00 00 00 00 00 00 00 ................ +. 0070: 80 ff ff ff ff ff ff ff fc 72 01 86 ff ff ff ff .........r...... +. 0080: f3 0e 6e 85 ff ff ff ff 0c cb 7f 85 ff ff ff ff ..n............. +. 0090: bc f2 87 85 ff ff ff ff 44 af 7f 85 ff ff ff ff ........D....... +. 00a0: bd be 7f 85 ff ff ff ff 26 d0 7f 85 ff ff ff ff ........&....... +. 00b0: 6d a4 ff 85 ff ff ff ff ea 00 20 86 ff ff ff ff m......... ..... +. 00c0: 00 fe ff ff ff ff ff ff 57 14 4f 43 fc 7e 00 00 ........W.OC.~.. + +1642373909693435 0xc550 [0xd0]: PERF_RECORD_SAMPLE(IP, 0x1): 2915700/2915700: 0xffffffff860172fc period: 1 addr: 0 +... FP chain: nr:12 +..... 0: ffffffffffffff80 +..... 1: ffffffff860172fc +..... 2: ffffffff856e0ef3 +..... 3: ffffffff857fcb0c +..... 4: ffffffff8587f2bc +..... 5: ffffffff857faf44 +..... 6: ffffffff857fbebd +..... 7: ffffffff857fd026 +..... 8: ffffffff85ffa46d +..... 9: ffffffff862000ea +..... 10: fffffffffffffe00 +..... 11: 00007efc434f1457 +... sample_read: +.... group nr 2 +..... id 00000000001acbe6, value 0000000000000176, lost 0 +..... id 00000000001acbf6, value 0000000000001862, lost 0 + +0xc620@perf.data [0x30]: event: 3 ... This behavior is incorrect as in the case above 'perf inject' should have done nothing. Fix this behavior by disabling separating samples for a tool that requests it. Only request this for `perf inject` so as to not affect other perf tools. With the patch and the test above there are no differences between the orig.txt and new.txt. Fixes: `e4caec0d1a` ("perf evsel: Add PERF_SAMPLE_READ sample related processing") Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240729220620.2957754-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:23 +02:00
Namhyung Kim	1490a5dbd5	perf mem: Free the allocated sort string, fixing a leak [ Upstream commit 3da209bb1177462b6fe8e3021a5527a5a49a9336 ] The get_sort_order() returns either a new string (from strdup) or NULL but it never gets freed. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Fixes: `2e7f545096` ("perf mem: Factor out a function to generate sort order") Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20240731235505.710436-3-namhyung@kernel.org [ Added Fixes tag ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:23 +02:00
Daniel Borkmann	a634fa8e48	bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error [ Upstream commit 4b3786a6c5397dc220b1483d8e2f4867743e966f ] For all non-tracing helpers which formerly had ARG_PTR_TO_{LONG,INT} as input arguments, zero the value for the case of an error as otherwise it could leak memory. For tracing, it is not needed given CAP_PERFMON can already read all kernel memory anyway hence bpf_get_func_arg() and bpf_get_func_ret() is skipped in here. Also, the MTU helpers mtu_len pointer value is being written but also read. Technically, the MEM_UNINIT should not be there in order to always force init. Removing MEM_UNINIT needs more verifier rework though: MEM_UNINIT right now implies two things actually: i) write into memory, ii) memory does not have to be initialized. If we lift MEM_UNINIT, it then becomes: i) read into memory, ii) memory must be initialized. This means that for bpf__check_mtu() we're readding the issue we're trying to fix, that is, it would then be able to write back into things like .rodata BPF maps. Follow-up work will rework the MEM_UNINIT semantics such that the intent can be better expressed. For now just clear the mtu_len on error path which can be lifted later again. Fixes: `8a67f2de9b` ("bpf: expose bpf_strtol and bpf_strtoul to all program types") Fixes: `d7a4cb9b67` ("bpf: Introduce bpf_strtol and bpf_strtoul helpers") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/e5edd241-59e7-5e39-0ee5-a51e31b6840a@iogearbox.net Link: https://lore.kernel.org/r/20240913191754.13290-5-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:23 +02:00
Daniel Borkmann	abf7559b4f	bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged types [ Upstream commit 18752d73c1898fd001569195ba4b0b8c43255f4a ] When checking malformed helper function signatures, also take other argument types into account aside from just ARG_PTR_TO_UNINIT_MEM. This concerns (formerly) ARG_PTR_TO_{INT,LONG} given uninitialized memory can be passed there, too. The func proto sanity check goes back to commit `435faee1aa` ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type"), and its purpose was to detect wrong func protos which had more than just one MEM_UNINIT-tagged type as arguments. The reason more than one is currently not supported is as we mark stack slots with STACK_MISC in check_helper_call() in case of raw mode based on meta.access_size to allow uninitialized stack memory to be passed to helpers when they just write into the buffer. Probing for base type as well as MEM_UNINIT tagging ensures that other types do not get missed (as it used to be the case for ARG_PTR_TO_{INT,LONG}). Fixes: `57c3bb725a` ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types") Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240913191754.13290-4-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:23 +02:00
Daniel Borkmann	a2c8dc7e21	bpf: Fix helper writes to read-only maps [ Upstream commit 32556ce93bc45c730829083cb60f95a2728ea48b ] Lonial found an issue that despite user- and BPF-side frozen BPF map (like in case of .rodata), it was still possible to write into it from a BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT} as arguments. In check_func_arg() when the argument is as mentioned, the meta->raw_mode is never set. Later, check_helper_mem_access(), under the case of PTR_TO_MAP_VALUE as register base type, it assumes BPF_READ for the subsequent call to check_map_access_type() and given the BPF map is read-only it succeeds. The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} \| MEM_UNINIT when results are written into them as opposed to read out of them. The latter indicates that it's okay to pass a pointer to uninitialized memory as the memory is written to anyway. However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM just with additional alignment requirement. So it is better to just get rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the fixed size memory types. For this, add MEM_ALIGNED to additionally ensure alignment given these helpers write directly into the args via <ptr> = val. The .arg_size has been initialized reflecting the actual sizeof(<ptr>). MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated argument types, since in !MEM_FIXED_SIZE cases the verifier does not know the buffer size a priori and therefore cannot blindly write <ptr> = val. Fixes: `57c3bb725a` ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:23 +02:00
Daniel Borkmann	81197a9b45	bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bit [ Upstream commit cfe69c50b05510b24e26ccb427c7cc70beafd6c1 ] The bpf_strtol() and bpf_strtoul() helpers are currently broken on 32bit: The argument type ARG_PTR_TO_LONG is BPF-side "long", not kernel-side "long" and therefore always considered fixed 64bit no matter if 64 or 32bit underlying architecture. This contract breaks in case of the two mentioned helpers since their BPF_CALL definition for the helpers was added with {unsigned,}long res. Meaning, the transition from BPF-side "long" (BPF program) to kernel-side "long" (BPF helper) breaks here. Both helpers call __bpf_strtoll() with "long long" correctly, but later assigning the result into 32-bit "(long )" on 32bit architectures. From a BPF program point of view, this means upper bits will be seen as uninitialised. Therefore, fix both BPF_CALL signatures to {s,u}64 types to fix this situation. Now, changing also uapi/bpf.h helper documentation which generates bpf_helper_defs.h for BPF programs is tricky: Changing signatures there to __{s,u}64 would trigger compiler warnings (incompatible pointer types passing 'long ' to parameter of type '__s64 ' (aka 'long long ')) for existing BPF programs. Leaving the signatures as-is would be fine as from BPF program point of view it is still BPF-side "long" and thus equivalent to __{s,u}64 on 64 or 32bit underlying architectures. Note that bpf_strtol() and bpf_strtoul() are the only helpers with this issue. Fixes: `d7a4cb9b67` ("bpf: Introduce bpf_strtol and bpf_strtoul helpers") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/481fcec8-c12c-9abb-8ecb-76c71c009959@iogearbox.net Link: https://lore.kernel.org/r/20240913191754.13290-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:22 +02:00
Ryusuke Konishi	257f9e5185	nilfs2: fix potential oob read in nilfs_btree_check_delete() [ Upstream commit f9c96351aa6718b42a9f42eaf7adce0356bdb5e8 ] The function nilfs_btree_check_delete(), which checks whether degeneration to direct mapping occurs before deleting a b-tree entry, causes memory access outside the block buffer when retrieving the maximum key if the root node has no entries. This does not usually happen because b-tree mappings with 0 child nodes are never created by mkfs.nilfs2 or nilfs2 itself. However, it can happen if the b-tree root node read from a device is configured that way, so fix this potential issue by adding a check for that case. Link: https://lkml.kernel.org/r/20240904081401.16682-4-konishi.ryusuke@gmail.com Fixes: `17c76b0104` ("nilfs2: B-tree based block mapping") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:22 +02:00
Ryusuke Konishi	0f28b3b51f	nilfs2: determine empty node blocks as corrupted [ Upstream commit 111b812d3662f3a1b831d19208f83aa711583fe6 ] Due to the nature of b-trees, nilfs2 itself and admin tools such as mkfs.nilfs2 will never create an intermediate b-tree node block with 0 child nodes, nor will they delete (key, pointer)-entries that would result in such a state. However, it is possible that a b-tree node block is corrupted on the backing device and is read with 0 child nodes. Because operation is not guaranteed if the number of child nodes is 0 for intermediate node blocks other than the root node, modify nilfs_btree_node_broken(), which performs sanity checks when reading a b-tree node block, so that such cases will be judged as metadata corruption. Link: https://lkml.kernel.org/r/20240904081401.16682-3-konishi.ryusuke@gmail.com Fixes: `17c76b0104` ("nilfs2: B-tree based block mapping") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:22 +02:00
Ryusuke Konishi	21839b6fbc	nilfs2: fix potential null-ptr-deref in nilfs_btree_insert() [ Upstream commit 9403001ad65ae4f4c5de368bdda3a0636b51d51a ] Patch series "nilfs2: fix potential issues with empty b-tree nodes". This series addresses three potential issues with empty b-tree nodes that can occur with corrupted filesystem images, including one recently discovered by syzbot. This patch (of 3): If a b-tree is broken on the device, and the b-tree height is greater than 2 (the level of the root node is greater than 1) even if the number of child nodes of the b-tree root is 0, a NULL pointer dereference occurs in nilfs_btree_prepare_insert(), which is called from nilfs_btree_insert(). This is because, when the number of child nodes of the b-tree root is 0, nilfs_btree_do_lookup() does not set the block buffer head in any of path[x].bp_bh, leaving it as the initial value of NULL, but if the level of the b-tree root node is greater than 1, nilfs_btree_get_nonroot_node(), which accesses the buffer memory of path[x].bp_bh, is called. Fix this issue by adding a check to nilfs_btree_root_broken(), which performs sanity checks when reading the root node from the device, to detect this inconsistency. Thanks to Lizhi Xu for trying to solve the bug and clarifying the cause early on. Link: https://lkml.kernel.org/r/20240904081401.16682-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240902084101.138971-1-lizhi.xu@windriver.com Link: https://lkml.kernel.org/r/20240904081401.16682-2-konishi.ryusuke@gmail.com Fixes: `17c76b0104` ("nilfs2: B-tree based block mapping") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+9bff4c7b992038a7409f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9bff4c7b992038a7409f Cc: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:22 +02:00
Yujie Liu	66f3fc7411	sched/numa: Fix the vma scan starving issue [ Upstream commit f22cde4371f3c624e947a35b075c06c771442a43 ] Problem statement: Since commit `fc137c0dda` ("sched/numa: enhance vma scanning logic"), the Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of the vma scan might create less Numa page fault information. The insufficient information makes it harder for the Numa balancer to make decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found to bring back part of the performance. Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a long duration of remote Numa node read was observed by PMU events: A few cores having ~500MB/s remote memory access for ~20 seconds. It causes high core-to-core variance and performance penalty. After the investigation, it is found that many vmas are skipped due to the active PID check. According to the trace events, in most cases, vma_is_accessed() returns false because the history access info stored in pids_active array has been cleared. Proposal: The main idea is to adjust vma_is_accessed() to let it return true easier. Thus compare the diff between mm->numa_scan_seq and vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold, scan the vma. This patch especially helps the cases where there are small number of threads, like the process-based SPECcpu. Without this patch, if the SPECcpu process access the vma at the beginning, then sleeps for a long time, the pid_active array will be cleared. A a result, if this process is woken up again, it never has a chance to set prot_none anymore. Because only the first 2 times of access is granted for vma scan: (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2 to be worse, no other threads within the task can help set the prot_none. This causes information lost. Raghavendra helped test current patch and got the positive result on the AMD platform: autonumabench NUMA01 base patched Amean syst-NUMA01 194.05 ( 0.00%) 165.11 * 14.92%* Amean elsp-NUMA01 324.86 ( 0.00%) 315.58 * 2.86%* Duration User 380345.36 368252.04 Duration System 1358.89 1156.23 Duration Elapsed 2277.45 2213.25 autonumabench NUMA02 Amean syst-NUMA02 1.12 ( 0.00%) 1.09 * 2.93%* Amean elsp-NUMA02 3.50 ( 0.00%) 3.56 * -1.84%* Duration User 1513.23 1575.48 Duration System 8.33 8.13 Duration Elapsed 28.59 29.71 kernbench Amean user-256 22935.42 ( 0.00%) 22535.19 * 1.75%* Amean syst-256 7284.16 ( 0.00%) 7608.72 * -4.46%* Amean elsp-256 159.01 ( 0.00%) 158.17 * 0.53%* Duration User 68816.41 67615.74 Duration System 21873.94 22848.08 Duration Elapsed 506.66 504.55 Intel 256 CPUs/2 Sockets: autonuma benchmark also shows improvements: v6.10-rc5 v6.10-rc5 +patch Amean syst-NUMA01 245.85 ( 0.00%) 230.84 * 6.11%* Amean syst-NUMA01_THREADLOCAL 205.27 ( 0.00%) 191.86 * 6.53%* Amean syst-NUMA02 18.57 ( 0.00%) 18.09 * 2.58%* Amean syst-NUMA02_SMT 2.63 ( 0.00%) 2.54 * 3.47%* Amean elsp-NUMA01 517.17 ( 0.00%) 526.34 * -1.77%* Amean elsp-NUMA01_THREADLOCAL 99.92 ( 0.00%) 100.59 * -0.67%* Amean elsp-NUMA02 15.81 ( 0.00%) 15.72 * 0.59%* Amean elsp-NUMA02_SMT 13.23 ( 0.00%) 12.89 * 2.53%* v6.10-rc5 v6.10-rc5 +patch Duration User 1064010.16 1075416.23 Duration System 3307.64 3104.66 Duration Elapsed 4537.54 4604.73 The SPECcpu remote node access issue disappears with the patch applied. Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com Fixes: `fc137c0dda` ("sched/numa: enhance vma scanning logic") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Yujie Liu <yujie.liu@intel.com> Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com> Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:22 +02:00
Mel Gorman	e3a2d3f6c4	sched/numa: Complete scanning of inactive VMAs when there is no alternative [ Upstream commit f169c62ff7cd1acf8bac8ae17bfeafa307d9e6fa ] VMAs are skipped if there is no recent fault activity but this represents a chicken-and-egg problem as there may be no fault activity if the PTEs are never updated to trap NUMA hints. There is an indirect reliance on scanning to be forced early in the lifetime of a task but this may fail to detect changes in phase behaviour. Force inactive VMAs to be scanned when all other eligible VMAs have been updated within the same scan sequence. Test results in general look good with some changes in performance, both negative and positive, depending on whether the additional scanning and faulting was beneficial or not to the workload. The autonuma benchmark workload NUMA01_THREADLOCAL was picked for closer examination. The workload creates two processes with numerous threads and thread-local storage that is zero-filled in a loop. It exercises the corner case where unrelated threads may skip VMAs that are thread-local to another thread and still has some VMAs that inactive while the workload executes. The VMA skipping activity frequency with and without the patch: 6.6.0-rc2-sched-numabtrace-v1 ============================= 649 reason=scan_delay 9,094 reason=unsuitable 48,915 reason=shared_ro 143,919 reason=inaccessible 193,050 reason=pid_inactive 6.6.0-rc2-sched-numabselective-v1 ============================= 146 reason=seq_completed 622 reason=ignore_pid_inactive 624 reason=scan_delay 6,570 reason=unsuitable 16,101 reason=shared_ro 27,608 reason=inaccessible 41,939 reason=pid_inactive Note that with the patch applied, the PID activity is ignored (ignore_pid_inactive) to ensure a VMA with some activity is completely scanned. In addition, a small number of VMAs are scanned when no other eligible VMA is available during a single scan window (seq_completed). The number of times a VMA is skipped due to no PID activity from the scanning task (pid_inactive) drops dramatically. It is expected that this will increase the number of PTEs updated for NUMA hinting faults as well as hinting faults but these represent PTEs that would otherwise have been missed. The tradeoff is scan+fault overhead versus improving locality due to migration. On a 2-socket Cascade Lake test machine, the time to complete the workload is as follows; 6.6.0-rc2 6.6.0-rc2 sched-numabtrace-v1 sched-numabselective-v1 Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%) Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%* Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%) CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%) Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%) The time to complete the workload is reduced by almost 30%: 6.6.0-rc2 6.6.0-rc2 sched-numabtrace-v1 sched-numabselective-v1 / Duration User 91201.80 63506.64 Duration System 2015.53 1819.78 Duration Elapsed 1234.77 868.37 In this specific case, system CPU time was not increased but it's not universally true. From vmstat, the NUMA scanning and fault activity is as follows; 6.6.0-rc2 6.6.0-rc2 sched-numabtrace-v1 sched-numabselective-v1 Ops NUMA base-page range updates 64272.00 26374386.00 Ops NUMA PTE updates 36624.00 55538.00 Ops NUMA PMD updates 54.00 51404.00 Ops NUMA hint faults 15504.00 75786.00 Ops NUMA hint local faults % 14860.00 56763.00 Ops NUMA hint local percent 95.85 74.90 Ops NUMA pages migrated 1629.00 6469222.00 Both the number of PTE updates and hint faults is dramatically increased. While this is superficially unfortunate, it represents ranges that were simply skipped without the patch. As a result of the scanning and hinting faults, many more pages were also migrated but as the time to completion is reduced, the overhead is offset by the gain. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net Stable-dep-of: f22cde4371f3 ("sched/numa: Fix the vma scan starving issue") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:22 +02:00
Mel Gorman	cb7846df6b	sched/numa: Complete scanning of partial VMAs regardless of PID activity [ Upstream commit b7a5b537c55c088d891ae554103d1b281abef781 ] NUMA Balancing skips VMAs when the current task has not trapped a NUMA fault within the VMA. If the VMA is skipped then mm->numa_scan_offset advances and a task that is trapping faults within the VMA may never fully update PTEs within the VMA. Force tasks to update PTEs for partially scanned PTEs. The VMA will be tagged for NUMA hints by some task but this removes some of the benefit of tracking PID activity within a VMA. A follow-on patch will mitigate this problem. The test cases and machines evaluated did not trigger the corner case so the performance results are neutral with only small changes within the noise from normal test-to-test variance. However, the next patch makes the corner case easier to trigger. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Link: https://lore.kernel.org/r/20231010083143.19593-6-mgorman@techsingularity.net Stable-dep-of: f22cde4371f3 ("sched/numa: Fix the vma scan starving issue") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:22 +02:00
Raghavendra K T	7f01977665	sched/numa: Move up the access pid reset logic [ Upstream commit 2e2675db1906ac04809f5399bf1f5e30d56a6f3e ] Recent NUMA hinting faulting activity is reset approximately every VMA_PID_RESET_PERIOD milliseconds. However, if the current task has not accessed a VMA then the reset check is missed and the reset is potentially deferred forever. Check if the PID activity information should be reset before checking if the current task recently trapped a NUMA hinting fault. [ mgorman@techsingularity.net: Rewrite changelog ] Suggested-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-5-mgorman@techsingularity.net Stable-dep-of: f22cde4371f3 ("sched/numa: Fix the vma scan starving issue") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:21 +02:00
Mel Gorman	6654e54ae7	sched/numa: Trace decisions related to skipping VMAs [ Upstream commit ed2da8b725b932b1e2b2f4835bb664d47ed03031 ] NUMA balancing skips or scans VMAs for a variety of reasons. In preparation for completing scans of VMAs regardless of PID access, trace the reasons why a VMA was skipped. In a later patch, the tracing will be used to track if a VMA was forcibly scanned. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-4-mgorman@techsingularity.net Stable-dep-of: f22cde4371f3 ("sched/numa: Fix the vma scan starving issue") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:21 +02:00
Mel Gorman	707e9a6c88	sched/numa: Rename vma_numab_state::access_pids[] => ::pids_active[], ::next_pid_reset => ::pids_active_reset [ Upstream commit f3a6c97940fbd25d6c84c2d5642338fc99a9b35b ] The access_pids[] field name is somewhat ambiguous as no PIDs are accessed. Similarly, it's not clear that next_pid_reset is related to access_pids[]. Rename the fields to more accurately reflect their purpose. [ mingo: Rename in the comments too. ] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-3-mgorman@techsingularity.net Stable-dep-of: f22cde4371f3 ("sched/numa: Fix the vma scan starving issue") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:21 +02:00
Mel Gorman	ba4eb7f258	sched/numa: Document vma_numab_state fields [ Upstream commit 9ae5c00ea2e600a8b823f9b95606dd244f3096bf ] Document the intended usage of the fields. [ mingo: Reformatted to take less vertical space & tidied it up. ] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231010083143.19593-2-mgorman@techsingularity.net Stable-dep-of: f22cde4371f3 ("sched/numa: Fix the vma scan starving issue") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:21 +02:00
Ojaswin Mujoo	faeff8b1ee	ext4: check stripe size compatibility on remount as well [ Upstream commit ee85e0938aa8f9846d21e4d302c3cf6a2a75110d ] We disable stripe size in __ext4_fill_super if it is not a multiple of the cluster ratio however this check is missed when trying to remount. This can leave us with cases where stripe < cluster_ratio after remount:set making EXT4_B2C(sbi->s_stripe) become 0 that can cause some unforeseen bugs like divide by 0. Fix that by adding the check in remount path as well. Reported-by: syzbot+1ad8bac5af24d01e2cbd@syzkaller.appspotmail.com Tested-by: syzbot+1ad8bac5af24d01e2cbd@syzkaller.appspotmail.com Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Fixes: `c3defd99d5` ("ext4: treat stripe in block unit") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/3a493bb503c3598e25dcfbed2936bb2dff3fece7.1725002410.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:21 +02:00
Thadeu Lima de Souza Cascardo	2a6579ef5f	ext4: avoid OOB when system.data xattr changes underneath the filesystem [ Upstream commit c6b72f5d82b1017bad80f9ebf502832fc321d796 ] When looking up for an entry in an inlined directory, if e_value_offs is changed underneath the filesystem by some change in the block device, it will lead to an out-of-bounds access that KASAN detects as an UAF. EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none. loop0: detected capacity change from 2048 to 2047 ================================================================== BUG: KASAN: use-after-free in ext4_search_dir+0xf2/0x1c0 fs/ext4/namei.c:1500 Read of size 1 at addr ffff88803e91130f by task syz-executor269/5103 CPU: 0 UID: 0 PID: 5103 Comm: syz-executor269 Not tainted 6.11.0-rc4-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 ext4_search_dir+0xf2/0x1c0 fs/ext4/namei.c:1500 ext4_find_inline_entry+0x4be/0x5e0 fs/ext4/inline.c:1697 __ext4_find_entry+0x2b4/0x1b30 fs/ext4/namei.c:1573 ext4_lookup_entry fs/ext4/namei.c:1727 [inline] ext4_lookup+0x15f/0x750 fs/ext4/namei.c:1795 lookup_one_qstr_excl+0x11f/0x260 fs/namei.c:1633 filename_create+0x297/0x540 fs/namei.c:3980 do_symlinkat+0xf9/0x3a0 fs/namei.c:4587 __do_sys_symlinkat fs/namei.c:4610 [inline] __se_sys_symlinkat fs/namei.c:4607 [inline] __x64_sys_symlinkat+0x95/0xb0 fs/namei.c:4607 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f3e73ced469 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 21 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff4d40c258 EFLAGS: 00000246 ORIG_RAX: 000000000000010a RAX: ffffffffffffffda RBX: 0032656c69662f2e RCX: 00007f3e73ced469 RDX: 0000000020000200 RSI: 00000000ffffff9c RDI: 00000000200001c0 RBP: 0000000000000000 R08: 00007fff4d40c290 R09: 00007fff4d40c290 R10: 0023706f6f6c2f76 R11: 0000000000000246 R12: 00007fff4d40c27c R13: 0000000000000003 R14: 431bde82d7b634db R15: 00007fff4d40c2b0 </TASK> Calling ext4_xattr_ibody_find right after reading the inode with ext4_get_inode_loc will lead to a check of the validity of the xattrs, avoiding this problem. Reported-by: syzbot+0c2508114d912a54ee79@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0c2508114d912a54ee79 Fixes: `e8e948e780` ("ext4: let ext4_find_entry handle inline data") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Link: https://patch.msgid.link/20240821152324.3621860-5-cascardo@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:21 +02:00
Thadeu Lima de Souza Cascardo	dd3f90e8c4	ext4: return error on ext4_find_inline_entry [ Upstream commit 4d231b91a944f3cab355fce65af5871fb5d7735b ] In case of errors when reading an inode from disk or traversing inline directory entries, return an error-encoded ERR_PTR instead of returning NULL. ext4_find_inline_entry only caller, __ext4_find_entry already returns such encoded errors. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Link: https://patch.msgid.link/20240821152324.3621860-3-cascardo@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: c6b72f5d82b1 ("ext4: avoid OOB when system.data xattr changes underneath the filesystem") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:20 +02:00
Kemeng Shi	9f70768554	ext4: avoid negative min_clusters in find_group_orlov() [ Upstream commit bb0a12c3439b10d88412fd3102df5b9a6e3cd6dc ] min_clusters is signed integer and will be converted to unsigned integer when compared with unsigned number stats.free_clusters. If min_clusters is negative, it will be converted to a huge unsigned value in which case all groups may not meet the actual desired free clusters. Set negative min_clusters to 0 to avoid unexpected behavior. Fixes: `ac27a0ec11` ("[PATCH] ext4: initial copy of files from ext3") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:20 +02:00
Kemeng Shi	fae0793abd	ext4: avoid potential buffer_head leak in __ext4_new_inode() [ Upstream commit 227d31b9214d1b9513383cf6c7180628d4b3b61f ] If a group is marked EXT4_GROUP_INFO_IBITMAP_CORRUPT after it's inode bitmap buffer_head was successfully verified, then __ext4_new_inode() will get a valid inode_bitmap_bh of a corrupted group from ext4_read_inode_bitmap() in which case inode_bitmap_bh misses a release. Hnadle "IS_ERR(inode_bitmap_bh)" and group corruption separately like how ext4_free_inode() does to avoid buffer_head leak. Fixes: `9008a58e5d` ("ext4: make the bitmap read routines return real error codes") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:20 +02:00
Kemeng Shi	7a349feead	ext4: avoid buffer_head leak in ext4_mark_inode_used() [ Upstream commit 5e5b2a56c57def1b41efd49596621504d7bcc61c ] Release inode_bitmap_bh from ext4_read_inode_bitmap() in ext4_mark_inode_used() to avoid buffer_head leak. By the way, remove unneeded goto for invalid ino when inode_bitmap_bh is NULL. Fixes: `8016e29f43` ("ext4: fast commit recovery path") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:20 +02:00
Jiawei Ye	72eef5226f	smackfs: Use rcu_assign_pointer() to ensure safe assignment in smk_set_cipso [ Upstream commit 2749749afa071f8a0e405605de9da615e771a7ce ] In the `smk_set_cipso` function, the `skp->smk_netlabel.attr.mls.cat` field is directly assigned to a new value without using the appropriate RCU pointer assignment functions. According to RCU usage rules, this is illegal and can lead to unpredictable behavior, including data inconsistencies and impossible-to-diagnose memory corruption issues. This possible bug was identified using a static analysis tool developed by myself, specifically designed to detect RCU-related issues. To address this, the assignment is now done using rcu_assign_pointer(), which ensures that the pointer assignment is done safely, with the necessary memory barriers and synchronization. This change prevents potential RCU dereference issues by ensuring that the `cat` field is safely updated while still adhering to RCU's requirements. Fixes: `0817534ff9` ("smackfs: Fix use-after-free in netlbl_catmap_walk()") Signed-off-by: Jiawei Ye <jiawei.ye@foxmail.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:20 +02:00
yangerkun	e4006410b0	ext4: clear EXT4_GROUP_INFO_WAS_TRIMMED_BIT even mount with discard [ Upstream commit 20cee68f5b44fdc2942d20f3172a262ec247b117 ] Commit `3d56b8d2c7` ("ext4: Speed up FITRIM by recording flags in ext4_group_info") speed up fstrim by skipping trim trimmed group. We also has the chance to clear trimmed once there exists some block free for this group(mount without discard), and the next trim for this group will work well too. For mount with discard, we will issue dicard when we free blocks, so leave trimmed flag keep alive to skip useless trim trigger from userspace seems reasonable. But for some case like ext4 build on dm-thinpool(ext4 blocksize 4K, pool blocksize 128K), discard from ext4 maybe unaligned for dm thinpool, and thinpool will just finish this discard(see process_discard_bio when begein equals to end) without actually process discard. For this case, trim from userspace can really help us to free some thinpool block. So convert to clear trimmed flag for all case no matter mounted with discard or not. Fixes: `3d56b8d2c7` ("ext4: Speed up FITRIM by recording flags in ext4_group_info") Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240817085510.2084444-1-yangerkun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:20 +02:00
Chen Yu	cfd257f5e8	kthread: fix task state in kthread worker if being frozen [ Upstream commit e16c7b07784f3fb03025939c4590b9a7c64970a7 ] When analyzing a kernel waring message, Peter pointed out that there is a race condition when the kworker is being frozen and falls into try_to_freeze() with TASK_INTERRUPTIBLE, which could trigger a might_sleep() warning in try_to_freeze(). Although the root cause is not related to freeze()[1], it is still worthy to fix this issue ahead. One possible race scenario: CPU 0 CPU 1 ----- ----- // kthread_worker_fn set_current_state(TASK_INTERRUPTIBLE); suspend_freeze_processes() freeze_processes static_branch_inc(&freezer_active); freeze_kernel_threads pm_nosig_freezing = true; if (work) { //false __set_current_state(TASK_RUNNING); } else if (!freezing(current)) //false, been frozen freezing(): if (static_branch_unlikely(&freezer_active)) if (pm_nosig_freezing) return true; schedule() } // state is still TASK_INTERRUPTIBLE try_to_freeze() might_sleep() <--- warning Fix this by explicitly set the TASK_RUNNING before entering try_to_freeze(). Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1] Link: https://lkml.kernel.org/r/20240827112308.181081-1-yu.c.chen@intel.com Fixes: `b56c0d8937` ("kthread: implement kthread_worker") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: David Gow <davidgow@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mickaël Salaün <mic@digikod.net> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:20 +02:00
Lasse Collin	b7d6e724e4	xz: cleanup CRC32 edits from 2018 [ Upstream commit 2ee96abef214550d9e92f5143ee3ac1fd1323e67 ] In 2018, a dependency on <linux/crc32poly.h> was added to avoid duplicating the same constant in multiple files. Two months later it was found to be a bad idea and the definition of CRC32_POLY_LE macro was moved into xz_private.h to avoid including <linux/crc32poly.h>. xz_private.h is a wrong place for it too. Revert back to the upstream version which has the poly in xz_crc32_init() in xz_crc32.c. Link: https://lkml.kernel.org/r/20240721133633.47721-10-lasse.collin@tukaani.org Fixes: `faa16bc404` ("lib: Use existing define with polynomial") Fixes: `242cdad873` ("lib/xz: Put CRC32_POLY_LE in xz_private.h") Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Reviewed-by: Sam James <sam@gentoo.org> Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Joel Stanley <joel@jms.id.au> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Emil Renner Berthing <emil.renner.berthing@canonical.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jubin Zhong <zhongjubin@huawei.com> Cc: Jules Maselbas <jmaselbas@zdiv.net> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Rui Li <me@lirui.org> Cc: Simon Glass <sjg@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:19 +02:00
Eduard Zingerman	2288b54b96	bpf: correctly handle malformed BPF_CORE_TYPE_ID_LOCAL relos [ Upstream commit 3d2786d65aaa954ebd3fcc033ada433e10da21c4 ] In case of malformed relocation record of kind BPF_CORE_TYPE_ID_LOCAL referencing a non-existing BTF type, function bpf_core_calc_relo_insn would cause a null pointer deference. Fix this by adding a proper check upper in call stack, as malformed relocation records could be passed from user space. Simplest reproducer is a program: r0 = 0 exit With a single relocation record: .insn_off = 0, /* patch first instruction / .type_id = 100500, / this type id does not exist / .access_str_off = 6, / offset of string "0" */ .kind = BPF_CORE_TYPE_ID_LOCAL, See the link for original reproducer or next commit for a test case. Fixes: `74753e1462` ("libbpf: Replace btf__type_by_id() with btf_type_by_id().") Reported-by: Liu RuiTong <cnitlrt@gmail.com> Closes: https://lore.kernel.org/bpf/CAK55_s6do7C+DVwbwY_7nKfUz0YLDoiA1v6X3Y9+p0sWzipFSA@mail.gmail.com/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240822080124.2995724-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:19 +02:00
Jiangshan Yi	fc2b89707e	samples/bpf: Fix compilation errors with cf-protection option [ Upstream commit fdf1c728fac541891ef1aa773bfd42728626769c ] Currently, compiling the bpf programs will result the compilation errors with the cf-protection option as follows in arm64 and loongarch64 machine when using gcc 12.3.1 and clang 17.0.6. This commit fixes the compilation errors by limited the cf-protection option only used in x86 platform. [root@localhost linux]# make M=samples/bpf ...... CLANG-bpf samples/bpf/xdp2skb_meta_kern.o error: option 'cf-protection=return' cannot be specified on this target error: option 'cf-protection=branch' cannot be specified on this target 2 errors generated. CLANG-bpf samples/bpf/syscall_tp_kern.o error: option 'cf-protection=return' cannot be specified on this target error: option 'cf-protection=branch' cannot be specified on this target 2 errors generated. ...... Fixes: `34f6e38f58` ("samples/bpf: fix warning with ignored-attributes") Reported-by: Jiangshan Yi <yijiangshan@kylinos.cn> Signed-off-by: Jiangshan Yi <yijiangshan@kylinos.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Qiang Wang <wangqiang1@kylinos.cn> Link: https://lore.kernel.org/bpf/20240815135524.140675-1-13667453960@163.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:19 +02:00

1 2 3 4 5 ...

1228156 Commits