The process exit time is mainly caused by freeing its swp_entrys. When
low memory triggers to kill lots of processes exit simultaneously, there
is a problem of CPU high load occupied by the exiting processes.
This feature is used to asynchronously maintain and release the
swp_entrys of the exiting process to accelerate the efficiency of the
exiting process.
Bug: 364480846
Bug: 340798358
Change-Id: I6fc0b813e7ac6a0796e08ce7a17d5ff3ab2b799b
Signed-off-by: Justin Jiang <justinjiang@vivo.corp-partner.google.com>
Attaching/detaching of a device to multiple PM domains has started to
become a common operation for many drivers, typically during ->probe() and
->remove(). In most cases, this has lead to lots of boilerplate code in the
drivers.
To fixup up the situation, let's introduce a pair of helper functions,
dev_pm_domain_attach|detach_list(), that driver can use instead of the
open-coding. Note that, it seems reasonable to limit the support for these
helpers to DT based platforms, at it's the only valid use case for now.
Suggested-by: Daniel Baluta <daniel.baluta@nxp.com>
Tested-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org>
Tested-by: Iuliana Prodan <iuliana.prodan@nxp.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20240130123951.236243-2-ulf.hansson@linaro.org
Bug: 323966425
Change-Id: Ib0cc368d9eac3a9615b75ca5954d81ab6b14b4b2
(cherry picked from commit 161e16a5e50a82d219b3df3ce32286b0a2ae08bd)
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
Clang warns (or errors with CONFIG_WERROR=y):
drivers/opp/of.c:1081:28: error: multiple unsequenced modifications to 'val' [-Werror,-Wunsequenced]
1081 | .freq = be32_to_cpup(val++) * 1000,
| ^
1082 | .u_volt = be32_to_cpup(val++),
| ~~
1 error generated.
There is no sequence point in a designated initializer. Move back to
separate variables for the creation of the values, so that there are
sequence points between each evaluation and increment of val.
Fixes: 75bbc92c09d8 ("OPP: Add dev_pm_opp_add_dynamic() to allow more flexibility")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Bug: 323966425
Change-Id: I870855263a0db8e35bcf8f150f5e05b2cfbdbc6b
(cherry picked from commit 184ff4f721638e37a5a5907bf98962b6d9318ef6)
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
To enable the performance level to be used for OPPs, let's convert into
using the dev_pm_opp_add_dynamic() API when creating them. This will be
particularly useful for the SCMI performance domain, as shown through
subsequent changes.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20230925131715.138411-9-ulf.hansson@linaro.org
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Bug: 323966425
Change-Id: I0d831c1e923d64e3e88e42acf4f4f16587c21696
(cherry picked from commit 5a6a104193520dc3b66ad2c7d823e00b50734ab6)
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
At this point the level (performance state) for an OPP is currently limited
to be requested for a device that is attached to a PM domain. Moreover,
the device needs to have the so called required-opps assigned to it, which
are based upon OPP tables being described in DT.
To extend the support beyond required-opps and DT, let's enable the level
to be set for all OPPs. More precisely, if the requested OPP has a valid
level let's try to request it through the device's optional PM domain, via
calling dev_pm_domain_set_performance_state().
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
[ Viresh: Handle NULL opp in _set_opp_level() ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Bug: 323966425
Change-Id: I6176127586d50324fbb96f099ddad60bb3c910e5
(cherry picked from commit 0025ff64ffcf6bd6ece5484e7818401f77bf115f)
[nikunj: Resolved minor conflict in drivers/opp/core.c ]
[anantg: Use dev_pm_genpd_set_performance_state in drivers/opp/core.c ]
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
Let's extend the dev_pm_opp_data with a level variable, to allow users to
specify a corresponding level (performance state) for a dynamically added
OPP.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Bug: 323966425
Change-Id: I1cc007355fd53f26d29b88ad0d9fa2d56a1e4e9d
(cherry picked from commit 3166383da081461244918aeed7ad028ef11b17cc)
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
The dev_pm_opp_add() API is limited to add dynamic OPPs with a frequency
and a voltage level. To enable more flexibility, let's add a new API,
dev_pm_opp_add_dynamic() that's takes a struct dev_pm_opp_data* instead of
a list of in-parameters.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Bug: 323966425
Change-Id: I37e08f4b4c9f42e1f8c42a876d46d89447fda7a1
(cherry picked from commit 248a38d5cc3f3505e6cfbbc0514435c9f1ba00af)
[anantg: Kept dev_pm_opp_add() as an exported API to maintain the ABI]
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
Most scmi_perf_proto_ops are already using an "u32 domain" as an
in-parameter to indicate what performance domain we shall operate upon.
However, some of the ops are using a "struct device *dev", which means that
an additional OF parsing is needed each time the perf ops gets called, to
find the corresponding domain-id.
To avoid the above, but also to make the code more consistent, let's
replace the in-parameter "struct device *dev" with an "u32 domain". Note
that, this requires us to make some corresponding changes to the scmi
cpufreq driver, so let's do that too.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20230825112633.236607-5-ulf.hansson@linaro.org
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Bug: 323966425
Change-Id: I524cfe6d5f28382d407b6323a1fce63620cc089b
(cherry picked from commit 39dfa5b9e1f0fa63b811a0a87f1c2fb9c76a0456)
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
The OF parsing of the clock domain specifier seems to better belong in the
scmi cpufreq driver, rather than being implemented behind the generic
->device_domain_id() perf protocol ops.
To prepare to remove the ->device_domain_id() ops, let's implement the OF
parsing in the scmi cpufreq driver instead.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20230825112633.236607-4-ulf.hansson@linaro.org
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Bug: 323966425
Change-Id: I7b777bc375373f5a716afd38ca23ece5e2750606
(cherry picked from commit e336baa4193ecc788a06c0c4659e400bb53689b4)
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
Similar to other protocol ops, it's useful for an scmi module driver to get
some generic information of a performance domain. Therefore, let's add a
new callback to provide this information. The information is currently
limited to the name of the performance domain and whether the set-level
operation is supported, although this can easily be extended if we find the
need for it.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20230825112633.236607-3-ulf.hansson@linaro.org
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Bug: 323966425
Change-Id: Ica11495c1ba40c646cf835418788795e6671e7db
(cherry picked from commit 3d99ed60721bf2e108c8fc660775766057689a92)
[nikunj: Resolved minor conflict in drivers/firmware/arm_scmi/perf.c ]
[anantg: Resolved minor conflict in drivers/firmware/arm_scmi/perf.c ]
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
Similar to other protocol ops, it's useful for an scmi module driver to get
the number of supported performance domains, hence let's make this
available by adding a new perf protocol callback. Note that, a user is
being added from subsequent changes.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20230825112633.236607-2-ulf.hansson@linaro.org
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Bug: 323966425
Change-Id: Idc17b94ead5ec08c58df9dcf2194d730b0eef175
(cherry picked from commit e9090e70e618cd62ab7bf2914511e5eea31a2535)
[nikunj: Resolved minor conflict in drivers/firmware/arm_scmi/perf.c ]
Signed-off-by: Nikunj Kela <quic_nkela@quicinc.com>
Signed-off-by: Anant Goel <quic_anantg@quicinc.com>
Export shrink_slab to module for do shrink-memory action.
Bug: 363907036
Change-Id: Ia96e6595186927c1fe31a6504675b8762154c0a6
Signed-off-by: huangshaobo1 <huangshaobo1@xiaomi.corp-partner.google.com>
[ Upstream commit 4e7aaa6b82d63e8ddcbfb56b4fd3d014ca586f10 ]
Lion Ackermann reported that there is a race condition between namespace cleanup
in ipset and the garbage collection of the list:set type. The namespace
cleanup can destroy the list:set type of sets while the gc of the set type is
waiting to run in rcu cleanup. The latter uses data from the destroyed set which
thus leads use after free. The patch contains the following parts:
- When destroying all sets, first remove the garbage collectors, then wait
if needed and then destroy the sets.
- Fix the badly ordered "wait then remove gc" for the destroy a single set
case.
- Fix the missing rcu locking in the list:set type in the userspace test
case.
- Use proper RCU list handlings in the list:set type.
The patch depends on c1193d9bbbd3 (netfilter: ipset: Add list flush to cancel_gc).
Bug: 347636817
Fixes: 97f7cf1cd80e (netfilter: ipset: fix performance regression in swap operation)
Reported-by: Lion Ackermann <nnamrec@gmail.com>
Tested-by: Lion Ackermann <nnamrec@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 390b353d1a)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I92a8681d8a0f475aa20c34c17daa222d02cf7935
Add initial symbol list for sunxi.
Bug: 363322588
Signed-off-by: Aran Dalton <arda@allwinnertech.com>
Change-Id: I1f9ab11c8d583b6905bf55420c6bc7d22be67346
In commit cfcd544a99 ("usb: typec: tcpm: unregister existing source
caps before re-registration"), quilt, and git, applied the diff to the
incorrect function, which would cause bad problems if exercised in a
device with these capabilities.
Fix this all up to be in the correct function.
Fixes: cfcd544a99 ("usb: typec: tcpm: unregister existing source caps before re-registration")
Reported-by: Charles Yo <charlesyo@google.com>
Cc: Kyle Tso <kyletso@google.com>
Cc: Amit Sunil Dhamne <amitsd@google.com>
Cc: Ondrej Jirman <megi@xff.cz>
Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/linux-usb/2024083008-granddad-unmoving-828c@gregkh/
Bug: 363121994
[ note, only 1/3 of the upstream commit is needed here due to half
already being present due to manual UPSTREAM changes made to the tree,
and a second follow-up fix not being merged from LTS here yet - gregkh]
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I33b07fcb8d1e64e7f0424b9a7b7d056aa9e44b2f
[ Upstream commit affc18fdc694190ca7575b9a86632a73b9fe043d ]
q->bands will be assigned to qopt->bands to execute subsequent code logic
after kmalloc. So the old q->bands should not be used in kmalloc.
Otherwise, an out-of-bounds write will occur.
Bug: 349777785
Fixes: c2999f7fb0 ("net: sched: multiq: don't call qdisc_put() while holding tree lock")
Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 0f208fad86)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: Iec8413c39878596795420ae58bbe6974890cf2de
When the target context passed to enter_vmid_context() matches the
current running context, the function returns early without manipulating
the registers of the stage-2 MMU. This can result in a stale VMID due to
the lack of an ISB instruction in exit_vmid_context() after writing the
VTTBR when ARM64_WORKAROUND_SPECULATIVE_AT is not enabled.
For example, with pKVM enabled:
// Initially running in host context
enter_vmid_context(guest);
-> __load_stage2(guest); isb // Writes VTCR & VTTBR
exit_vmid_context(guest);
-> __load_stage2(host); // Restores VTCR & VTTBR
enter_vmid_context(host);
-> Returns early as we're already in host context
tlbi vmalls12e1is // !!! Can use the stale VMID as we
// haven't performed context
// synchronisation since restoring
// VTTBR.VMID
Add an unconditional ISB instruction to exit_vmid_context() after
restoring the VTTBR. This already existed for the
ARM64_WORKAROUND_SPECULATIVE_AT path, so we can simply hoist that onto
the common path.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Fuad Tabba <tabba@google.com>
Fixes: 58f3b0fc3b87 ("KVM: arm64: Support TLB invalidation in guest context")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240814123429.20457-3-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit ed49fe5a6fb9c1a1bbbf4b5b648c7d34a756cb6d
kvmarm/next)
Bug: 311571169
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I1612ebdc5625e44694897f2c5b26fe38cdaa3179
When initialising the nVHE hypervisor, we invalidate potentially stale
TLB entries for the EL1&0 regime using a 'vmalls12e1' invalidation.
However, this invalidation operation applies only to the active VMID
and therefore we could proceed with stale TLB entries for other VMIDs.
Replace the operation with an 'alle1' which applies to all entries for
the EL1&0 regime, regardless of the VMID.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Fixes: 1025c8c0c6 ("KVM: arm64: Wrap the host with a stage 2")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240814123429.20457-2-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit dc0dddb1d66de88c571cf1a5bc3b484521a578af
kvmarm/next)
Bug: 311571169
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Ib116a4b3b08501e84340ce63ea6cded67824c7aa
commit 36e008323926036650299cfbb2dca704c7aba849 upstream.
The TLBI level hints are for leaf entries only, so take care not to pass
them incorrectly after clearing a table entry.
Cc: Gavin Shan <gshan@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Quentin Perret <qperret@google.com>
Fixes: 82bb02445d ("KVM: arm64: Implement kvm_pgtable_hyp_unmap() at EL2")
Fixes: 6d9d2115c4 ("KVM: arm64: Add support for stage-2 map()/unmap() in generic page-table")
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-3-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Cc: <stable@vger.kernel.org> # 6.1.y only
[will@: Use '0' instead of TLBI_TTL_UNKNOWN_to indicate "no level". Force
level to 0 in stage2_put_pte() if we're clearing a table entry.]
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit 298e875b36
stable/linux-6.1.y)
Bug: 311571169
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Icefe7099ccf03c4ba96e395dfc225f0015e3fccc
Vendor may have specific actions after task renamed.
Bug: 357956265
Change-Id: I78263dc023af6fd1ee2db03eee4ccb3ca3ebb278
Signed-off-by: Rick Yiu <rickyiu@google.com>
It was reported that in moving to 6.1, a larger then 10%
regression was seen in the performance of
clock_gettime(CLOCK_THREAD_CPUTIME_ID,...).
Using a simple reproducer, I found:
5.10:
100000000 calls in 24345994193 ns => 243.460 ns per call
100000000 calls in 24288172050 ns => 242.882 ns per call
100000000 calls in 24289135225 ns => 242.891 ns per call
6.1:
100000000 calls in 28248646742 ns => 282.486 ns per call
100000000 calls in 28227055067 ns => 282.271 ns per call
100000000 calls in 28177471287 ns => 281.775 ns per call
The cause of this was finally narrowed down to the addition of
psi_account_irqtime() in update_rq_clock_task(), in commit
52b1364ba0 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
pressure").
In my initial attempt to resolve this, I leaned towards moving
all accounting work out of the clock_gettime() call path, but it
wasn't very pretty, so it will have to wait for a later deeper
rework. Instead, Peter shared this approach:
Rework psi_account_irqtime() to use its own psi_irq_time base
for accounting, and move it out of the hotpath, calling it
instead from sched_tick() and __schedule().
In testing this, we found the importance of ensuring
psi_account_irqtime() is run under the rq_lock, which Johannes
Weiner helpfully explained, so also add some lockdep annotations
to make that requirement clear.
With this change the performance is back in-line with 5.10:
6.1+fix:
100000000 calls in 24297324597 ns => 242.973 ns per call
100000000 calls in 24318869234 ns => 243.189 ns per call
100000000 calls in 24291564588 ns => 242.916 ns per call
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: kernel-team@android.com
Reported-by: Jimmy Shiu <jimmyshiu@google.com>
Originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com
Change-Id: I5c4f04d047ca0aa11fccaec9a034dfe60dbeb295
Bug: 343748421
(cherry picked from commit ddae0ca2a8fe12d0e24ab10ba759c3fbd755ada8)
[jstultz: Backported and reworked to use per-cpu values instead of
adding a field to the struct rq]
Signed-off-by: John Stultz <jstultz@google.com>
Binder objects are processed and copied individually into the target
buffer during transactions. Any raw data in-between these objects is
copied as well. However, this raw data copy lacks an out-of-bounds
check. If the raw data exceeds the data section size then the copy
overwrites the offsets section. This eventually triggers an error that
attempts to unwind the processed objects. However, at this point the
offsets used to index these objects are now corrupted.
Unwinding with corrupted offsets can result in decrements of arbitrary
nodes and lead to their premature release. Other users of such nodes are
left with a dangling pointer triggering a use-after-free. This issue is
made evident by the following KASAN report (trimmed):
==================================================================
BUG: KASAN: slab-use-after-free in _raw_spin_lock+0xe4/0x19c
Write of size 4 at addr ffff47fc91598f04 by task binder-util/743
CPU: 9 UID: 0 PID: 743 Comm: binder-util Not tainted 6.11.0-rc4 #1
Hardware name: linux,dummy-virt (DT)
Call trace:
_raw_spin_lock+0xe4/0x19c
binder_free_buf+0x128/0x434
binder_thread_write+0x8a4/0x3260
binder_ioctl+0x18f0/0x258c
[...]
Allocated by task 743:
__kmalloc_cache_noprof+0x110/0x270
binder_new_node+0x50/0x700
binder_transaction+0x413c/0x6da8
binder_thread_write+0x978/0x3260
binder_ioctl+0x18f0/0x258c
[...]
Freed by task 745:
kfree+0xbc/0x208
binder_thread_read+0x1c5c/0x37d4
binder_ioctl+0x16d8/0x258c
[...]
==================================================================
To avoid this issue, let's check that the raw data copy is within the
boundaries of the data section.
Fixes: 6d98eb95b4 ("binder: avoid potential data leakage when copying txn")
Cc: Todd Kjos <tkjos@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Bug: 352520660
Link: https://lore.kernel.org/all/20240822182353.2129600-1-cmllamas@google.com/
Change-Id: I1b2dd8403b63e5eeb58904558b7b542141c83fc2
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Add 'struct binder_proc_wrap' to support the addition of new members in
'struct binder_proc' without breaking the KMI. In this case, proc->dmap
was backported from upstream and needs to be migrated into this wrapper.
Avoids the following KMI issue:
function symbol 'int __traceiter_binder_transaction_received(void*, struct binder_transaction*)' changed
CRC changed from 0x74e9c98b to 0x7af6cf5a
type 'struct binder_proc' changed
byte size changed from 584 to 600
member 'struct dbitmap dmap' was added
16 members ('struct list_head todo' .. 'u64 android_oem_data1') changed
offset changed by 128
Bug: 298520209
Change-Id: Icbbee14a8f16d0881faf8d5673582e785f98e8cf
Signed-off-by: Carlos Llamas <cmllamas@google.com>
(cherry picked from commit af55892f201c8f709725d75d306b1cdd20984b97)
[cmllamas: merge binder_proc_wrap_entry() with new proc_wrapper()]
Signed-off-by: Carlos Llamas <cmllamas@google.com>
In commit 15d9da3f818c ("binder: use bitmap for faster descriptor
lookup"), it was incorrectly assumed that references to the context
manager node should always get descriptor zero assigned to them.
However, if the context manager dies and a new process takes its place,
then assigning descriptor zero to the new context manager might lead to
collisions, as there could still be references to the older node. This
issue was reported by syzbot with the following trace:
kernel BUG at drivers/android/binder.c:1173!
Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 447 Comm: binder-util Not tainted 6.10.0-rc6-00348-g31643d84b8c3 #10
Hardware name: linux,dummy-virt (DT)
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : binder_inc_ref_for_node+0x500/0x544
lr : binder_inc_ref_for_node+0x1e4/0x544
sp : ffff80008112b940
x29: ffff80008112b940 x28: ffff0e0e40310780 x27: 0000000000000000
x26: 0000000000000001 x25: ffff0e0e40310738 x24: ffff0e0e4089ba34
x23: ffff0e0e40310b00 x22: ffff80008112bb50 x21: ffffaf7b8f246970
x20: ffffaf7b8f773f08 x19: ffff0e0e4089b800 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 000000002de4aa60
x14: 0000000000000000 x13: 2de4acf000000000 x12: 0000000000000020
x11: 0000000000000018 x10: 0000000000000020 x9 : ffffaf7b90601000
x8 : ffff0e0e48739140 x7 : 0000000000000000 x6 : 000000000000003f
x5 : ffff0e0e40310b28 x4 : 0000000000000000 x3 : ffff0e0e40310720
x2 : ffff0e0e40310728 x1 : 0000000000000000 x0 : ffff0e0e40310710
Call trace:
binder_inc_ref_for_node+0x500/0x544
binder_transaction+0xf68/0x2620
binder_thread_write+0x5bc/0x139c
binder_ioctl+0xef4/0x10c8
[...]
This patch adds back the previous behavior of assigning the next
non-zero descriptor if references to previous context managers still
exist. It amends both strategies, the newer dbitmap code and also the
legacy slow_desc_lookup_olocked(), by allowing them to start looking
for available descriptors at a given offset.
Fixes: 15d9da3f818c ("binder: use bitmap for faster descriptor lookup")
Cc: stable@vger.kernel.org
Reported-and-tested-by: syzbot+3dae065ca76952a67257@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000c1c0a0061d1e6979@google.com/
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20240722150512.4192473-1-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 298520209
Change-Id: I5b888c138163eff263239ebcc85c59cd7f26d64f
(cherry picked from commit 11512c197d387b59569d3a93af93de204d3bdaa6)
Signed-off-by: Carlos Llamas <cmllamas@google.com>
When creating new binder references, the driver assigns a descriptor id
that is shared with userspace. Regrettably, the driver needs to keep the
descriptors small enough to accommodate userspace potentially using them
as Vector indexes. Currently, the driver performs a linear search on the
rb-tree of references to find the smallest available descriptor id. This
approach, however, scales poorly as the number of references grows.
This patch introduces the usage of bitmaps to boost the performance of
descriptor assignments. This optimization results in notable performance
gains, particularly in processes with a large number of references. The
following benchmark with 100,000 references showcases the difference in
latency between the dbitmap implementation and the legacy approach:
[ 587.145098] get_ref_desc_olocked: 15us (dbitmap on)
[ 602.788623] get_ref_desc_olocked: 47343us (dbitmap off)
Note the bitmap size is dynamically adjusted in line with the number of
references, ensuring efficient memory usage. In cases where growing the
bitmap is not possible, the driver falls back to the slow legacy method.
A previous attempt to solve this issue was proposed in [1]. However,
such method involved adding new ioctls which isn't great, plus older
userspace code would not have benefited from the optimizations either.
Link: https://lore.kernel.org/all/20240417191418.1341988-1-cmllamas@google.com/ [1]
Cc: Tim Murray <timmurray@google.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Martijn Coenen <maco@android.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: John Stultz <jstultz@google.com>
Cc: Steven Moreland <smoreland@google.com>
Suggested-by: Nick Chen <chenjia3@oppo.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Link: https://lore.kernel.org/r/20240612042535.1556708-1-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 298520209
Change-Id: Iaf32794ab7786c603706f6806cabec9d031559a2
(cherry picked from commit 15d9da3f818cae676f822a04407d3c17b53357d2)
[cmllamas: fixed trivial conflicts with KMI work-around]
Signed-off-by: Carlos Llamas <cmllamas@google.com>
When ufshcd_abort_one is racing with the completion ISR, the completed tag
of the request's mq_hctx pointer will be set to NULL by ISR. Return
success when request is completed by ISR because ufshcd_abort_one does not
need to do anything.
The racing flow is:
Thread A
ufshcd_err_handler step 1
...
ufshcd_abort_one
ufshcd_try_to_abort_task
ufshcd_cmd_inflight(true) step 3
ufshcd_mcq_req_to_hwq
blk_mq_unique_tag
rq->mq_hctx->queue_num step 5
Thread B
ufs_mtk_mcq_intr(cq complete ISR) step 2
scsi_done
...
__blk_mq_free_request
rq->mq_hctx = NULL; step 4
Below is KE back trace.
ufshcd_try_to_abort_task: cmd at tag 41 not pending in the device.
ufshcd_try_to_abort_task: cmd at tag=41 is cleared.
Aborting tag 41 / CDB 0x28 succeeded
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000194
pc : [0xffffffddd7a79bf8] blk_mq_unique_tag+0x8/0x14
lr : [0xffffffddd6155b84] ufshcd_mcq_req_to_hwq+0x1c/0x40 [ufs_mediatek_mod_ise]
do_mem_abort+0x58/0x118
el1_abort+0x3c/0x5c
el1h_64_sync_handler+0x54/0x90
el1h_64_sync+0x68/0x6c
blk_mq_unique_tag+0x8/0x14
ufshcd_err_handler+0xae4/0xfa8 [ufs_mediatek_mod_ise]
process_one_work+0x208/0x4fc
worker_thread+0x228/0x438
kthread+0x104/0x1d4
ret_from_fork+0x10/0x20
Bug: 361140026
Fixes: 93e6c0e19d5b ("scsi: ufs: core: Clear cmd if abort succeeds in MCQ mode")
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Change-Id: I42f9b93dae33eac8cf41ac3085858b6adf0ee9ee
Signed-off-by: Peter Wang <peter.wang@mediatek.com>
Link: https://lore.kernel.org/r/20240628070030.30929-3-peter.wang@mediatek.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 74736103fb4123c71bf11fb7a6abe7c884c5269e)
[ Resolved minor conflict in drivers/ufs/core/ufshcd.c ]
When ufshcd_clear_cmd is racing with the completion ISR, the completed tag
of the request's mq_hctx pointer will be set to NULL by the ISR. And
ufshcd_clear_cmd's call to ufshcd_mcq_req_to_hwq will get NULL pointer KE.
Return success when the request is completed by ISR because sq does not
need cleanup.
The racing flow is:
Thread A
ufshcd_err_handler step 1
ufshcd_try_to_abort_task
ufshcd_cmd_inflight(true) step 3
ufshcd_clear_cmd
...
ufshcd_mcq_req_to_hwq
blk_mq_unique_tag
rq->mq_hctx->queue_num step 5
Thread B
ufs_mtk_mcq_intr(cq complete ISR) step 2
scsi_done
...
__blk_mq_free_request
rq->mq_hctx = NULL; step 4
Below is KE back trace:
ufshcd_try_to_abort_task: cmd pending in the device. tag = 6
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000194
pc : [0xffffffd589679bf8] blk_mq_unique_tag+0x8/0x14
lr : [0xffffffd5862f95b4] ufshcd_mcq_sq_cleanup+0x6c/0x1cc [ufs_mediatek_mod_ise]
Workqueue: ufs_eh_wq_0 ufshcd_err_handler [ufs_mediatek_mod_ise]
Call trace:
dump_backtrace+0xf8/0x148
show_stack+0x18/0x24
dump_stack_lvl+0x60/0x7c
dump_stack+0x18/0x3c
mrdump_common_die+0x24c/0x398 [mrdump]
ipanic_die+0x20/0x34 [mrdump]
notify_die+0x80/0xd8
die+0x94/0x2b8
__do_kernel_fault+0x264/0x298
do_page_fault+0xa4/0x4b8
do_translation_fault+0x38/0x54
do_mem_abort+0x58/0x118
el1_abort+0x3c/0x5c
el1h_64_sync_handler+0x54/0x90
el1h_64_sync+0x68/0x6c
blk_mq_unique_tag+0x8/0x14
ufshcd_clear_cmd+0x34/0x118 [ufs_mediatek_mod_ise]
ufshcd_try_to_abort_task+0x2c8/0x5b4 [ufs_mediatek_mod_ise]
ufshcd_err_handler+0xa7c/0xfa8 [ufs_mediatek_mod_ise]
process_one_work+0x208/0x4fc
worker_thread+0x228/0x438
kthread+0x104/0x1d4
ret_from_fork+0x10/0x20
Bug: 361140026
Fixes: 8d72903489 ("scsi: ufs: mcq: Add supporting functions for MCQ abort")
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Change-Id: I59fc0a8246d2fc2421e38c42b619b2393d2b22e6
Signed-off-by: Peter Wang <peter.wang@mediatek.com>
Link: https://lore.kernel.org/r/20240628070030.30929-2-peter.wang@mediatek.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 9307a998cb9846a2557fdca286997430bee36a2a)
[ Resolved minor conflict in drivers/ufs/core/ufshcd.c ]
The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
interrupt affinity reconfiguration via procfs. Instead, the change is
deferred until the next instance of the interrupt being triggered on the
original CPU.
When the interrupt next triggers on the original CPU, the new affinity is
enforced within __irq_move_irq(). A vector is allocated from the new CPU,
but the old vector on the original CPU remains and is not immediately
reclaimed. Instead, apicd->move_in_progress is flagged, and the reclaiming
process is delayed until the next trigger of the interrupt on the new CPU.
Upon the subsequent triggering of the interrupt on the new CPU,
irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
remains online. Subsequently, the timer on the old CPU iterates over its
vector_cleanup list, reclaiming old vectors.
However, a rare scenario arises if the old CPU is outgoing before the
interrupt triggers again on the new CPU.
In that case irq_force_complete_move() is not invoked on the outgoing CPU
to reclaim the old apicd->prev_vector because the interrupt isn't currently
affine to the outgoing CPU, and irq_needs_fixup() returns false. Even
though __vector_schedule_cleanup() is later called on the new CPU, it
doesn't reclaim apicd->prev_vector; instead, it simply resets both
apicd->move_in_progress and apicd->prev_vector to 0.
As a result, the vector remains unreclaimed in vector_matrix, leading to a
CPU vector leak.
To address this issue, move the invocation of irq_force_complete_move()
before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the
interrupt is currently or used to be affine to the outgoing CPU.
Additionally, reclaim the vector in __vector_schedule_cleanup() as well,
following a warning message, although theoretically it should never see
apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU.
Fixes: f0383c24b4 ("genirq/cpuhotplug: Add support for cleaning up move in progress")
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240522220218.162423-1-dongli.zhang@oracle.com
Bug: 359158960
Change-Id: Ib9a564add4cd1c90ec725ab7f0bdc0db505ae29f
(cherry picked from commit a6c11c0a5235fb144a65e0cb2ffd360ddc1f6c32)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
When a CPU goes offline, the interrupts affine to that CPU are
re-configured.
Managed interrupts undergo either migration to other CPUs or shutdown if
all CPUs listed in the affinity are offline. The migration of managed
interrupts is guaranteed on x86 because there are interrupt vectors
reserved.
Regular interrupts are migrated to a still online CPU in the affinity mask
or if there is no online CPU to any online CPU.
This works as long as the still online CPUs in the affinity mask have
interrupt vectors available, but in case that none of those CPUs has a
vector available the migration fails and the device interrupt becomes
stale.
This is not any different from the case where the affinity mask does not
contain any online CPU, but there is no fallback operation for this.
Instead of giving up, retry the migration attempt with the online CPU mask
if the interrupt is not managed, as managed interrupts cannot be affected
by this problem.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240423073413.79625-1-dongli.zhang@oracle.com
Bug: 359158960
Change-Id: I3e8acfe0598b0cb94389115d680d2216833d6a0c
(cherry picked from commit 88d724e2301a69c1ab805cd74fc27aa36ae529e0)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
irq_restore_affinity_of_irq() restarts managed interrupts unconditionally
when the first CPU in the affinity mask comes online. That's correct during
normal hotplug operations, but not when resuming from S3 because the
drivers are not resumed yet and interrupt delivery is not expected by them.
Skip the startup of suspended interrupts and let resume_device_irqs() deal
with restoring them. This ensures that irqs are not delivered to drivers
during the noirq phase of resuming from S3, after non-boot CPUs are brought
back online.
Signed-off-by: David Stevens <stevensd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240424090341.72236-1-stevensd@chromium.org
Bug: 359158960
Change-Id: Ie5610f7dffe05a141e6db1a8b6b067845c910b4a
(cherry picked from commit a60dd06af674d3bb76b40da5d722e4a0ecefe650)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Currently, the UTP_TASK_REQ_LIST_BASE_L/UTP_TASK_REQ_LIST_BASE_H regs are
written to and then completed with an mb().
mb() ensures that the write completes, but completion doesn't mean that it
isn't stored in a buffer somewhere. The recommendation for ensuring these
bits have taken effect on the device is to perform a read back to force it
to make it all the way to the device. This is documented in device-io.rst
and a talk by Will Deacon on this can be seen over here:
https://youtu.be/i6DayghhA8Q?si=MiyxB5cKJXSaoc01&t=1678
Let's do that to ensure the bits hit the device. Because the mb()'s purpose
wasn't to add extra ordering (on top of the ordering guaranteed by
writel()/readl()), it can safely be removed.
Bug: 254441685
Fixes: 88441a8d35 ("scsi: ufs: core: Add hibernation callbacks")
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Can Guo <quic_cang@quicinc.com>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Link: https://lore.kernel.org/r/20240329-ufs-reset-ensure-effect-before-delay-v5-7-181252004586@redhat.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 408e28086f1c7a6423efc79926a43d7001902fae)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: Iadf1f45aebea54393365a5387d35440aa5df3e00
Commit c33c794828 ("mm: ptep_get() conversion") converted all (non-arch)
call sites to use ptep_get() instead of doing a direct dereference of the
pte. Full rationale can be found in that commit's log.
Since then, UFFDIO_MOVE has been implemented which does 7 direct pte
dereferences. Let's fix those up to use ptep_get().
I've asserted in the past that there is no reliable automated mechanism to
catch these; I'm relying on a combination of Coccinelle (which throws up a
lot of false positives) and some compiler magic to force a compiler error
on dereference. But given the frequency with which new issues are coming
up, I'll add it to my todo list to try to find an automated solution.
Bug: 254441685
Link: https://lkml.kernel.org/r/20240123141755.3836179-1-ryan.roberts@arm.com
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 56ae10cf628b02279980d17439c6241a643959c2)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I0573208f1c7ad5085e0f7e82d484199074a5f6ef
BUG: kernel NULL pointer dereference, address: 0000000000000014
RIP: 0010:f2fs_submit_page_write+0x6cf/0x780 [f2fs]
Call Trace:
<TASK>
? show_regs+0x6e/0x80
? __die+0x29/0x70
? page_fault_oops+0x154/0x4a0
? prb_read_valid+0x20/0x30
? __irq_work_queue_local+0x39/0xd0
? irq_work_queue+0x36/0x70
? do_user_addr_fault+0x314/0x6c0
? exc_page_fault+0x7d/0x190
? asm_exc_page_fault+0x2b/0x30
? f2fs_submit_page_write+0x6cf/0x780 [f2fs]
? f2fs_submit_page_write+0x736/0x780 [f2fs]
do_write_page+0x50/0x170 [f2fs]
f2fs_outplace_write_data+0x61/0xb0 [f2fs]
f2fs_do_write_data_page+0x3f8/0x660 [f2fs]
f2fs_write_single_data_page+0x5bb/0x7a0 [f2fs]
f2fs_write_cache_pages+0x3da/0xbe0 [f2fs]
...
It is possible that other threads have added this fio to io->bio
and submitted the io->bio before entering f2fs_submit_page_write().
At this point io->bio = NULL.
If is_end_zone_blkaddr(sbi, fio->new_blkaddr) of this fio is true,
then an NULL pointer dereference error occurs at bio_get(io->bio).
The original code for determining zone end was after "out:",
which would have missed some fio who is zone end. I've moved
this code before "skip:" to make sure it's done for each fio.
Bug: 254441685
Fixes: e067dc3c6b ("f2fs: maintain six open zones for zoned devices")
Signed-off-by: Wenjie Qi <qwjhust@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(cherry picked from commit c2034ef6192a65a986a45c2aa2ed05824fdc0e9f)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I98714afbda136a8932eaa417fff129bd4db61b26