Certain AMD processors are vulnerable to a cross-thread return address
predictions bug. When running in SMT mode and one of the sibling threads
transitions out of C0 state, the other sibling thread could use return
target predictions from the sibling thread that transitioned out of C0.
The Spectre v2 mitigations cover the Linux kernel, as it fills the RSB
when context switching to the idle thread. However, KVM allows a VMM to
prevent exiting guest mode when transitioning out of C0. A guest could
act maliciously in this situation, so create a new x86 BUG that can be
used to detect if the processor is vulnerable.
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <91cec885656ca1fcd4f0185ce403a53dd9edecb7.1675956146.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Amlogic fixes for v6.2-rc, take2:
- Change MMC controllers interrupts flag to level on all families, fixes irq loss & performance issues when cpu loaded
* tag 'amlogic-fixes-v6.2-rc-take2' of https://git.kernel.org/pub/scm/linux/kernel/git/amlogic/linux:
arm64: dts: meson-gx: Make mmc host controller interrupts level-sensitive
arm64: dts: meson-g12-common: Make mmc host controller interrupts level-sensitive
arm64: dts: meson-axg: Make mmc host controller interrupts level-sensitive
Link: https://lore.kernel.org/r/761c2ebc-7c93-8504-35ae-3e84ad216bcf@linaro.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
When a fbdev with deferred I/O is once opened and closed, the dirty
pages still remain queued in the pageref list, and eventually later
those may be processed in the delayed work. This may lead to a
corruption of pages, hitting an Oops.
This patch makes sure to cancel the delayed work and clean up the
pageref list at closing the device for addressing the bug. A part of
the cleanup code is factored out as a new helper function that is
called from the common fb_release().
Reviewed-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Miko Larsson <mikoxyzzz@gmail.com>
Fixes: 56c134f7f1 ("fbdev: Track deferred-I/O pages in pageref struct")
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20230129082856.22113-1-tiwai@suse.de
skbuff_head_cache is misnamed (perhaps for historical reasons?)
because it does not hold heads. Head is the buffer which skb->data
points to, and also where shinfo lives. struct sk_buff is a metadata
structure, not the head.
Eric recently added skb_small_head_cache (which allocates actual
head buffers), let that serve as an excuse to finally clean this up :)
Leave the user-space visible name intact, it could possibly be uAPI.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder says:
====================
net: ipa: prepare for GSI register updtaes
An upcoming series (or two) will convert the definitions of GSI
registers used by IPA so they use the "IPA reg" mechanism to specify
register offsets and their fields. This will simplify implementing
the fairly large number of changes required in GSI registers to
support more than 32 GSI channels (introduced in IPA v5.0).
A few minor problems and inconsistencies were found, and they're
fixed here. The last three patches in this series change the
"ipa_reg" code to separate the IPA-specific part (the base virtual
address, basically) from the generic register part, and the now-
generic code is renamed to use just "reg_" or "REG_" as a prefix
rather than "ipa_reg" or "IPA_REG_".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename functions related to register fields so they don't appear to
be IPA-specific, and move their definitions into "reg.h":
ipa_reg_fmask() -> reg_fmask()
ipa_reg_bit() -> reg_bit()
ipa_reg_field_max() -> reg_field_max()
ipa_reg_encode() -> reg_encode()
ipa_reg_decode() -> reg_decode()
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ipa_reg_offset() to be reg_offset() and move its definition
to "reg.h". Rename ipa_reg_n_offset() to be reg_n_offset() also.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPA register definitions have evolved with each new version. The
changes required to support more than 32 endpoints in IPA v5.0 made
it best to define a unified mechanism for defining registers and
their fields.
GSI register definitions, meanwhile, have remained fairly stable.
And even as the total number of IPA endpoints goes beyond 32, the
number of GSI channels on a given EE that underly endpoints still
remains 32 or less.
Despite that, GSI v3.0 (which is used with IPA v5.0) extends the
number of channels (and events) it supports to be about 256, and as
a result, many GSI register definitions must change significantly.
To address this, we'll use the same "ipa_reg" mechanism to define
the GSI registers.
As a first step in generalizing the "ipa_reg" to also support GSI
registers, isolate the definitions of the "ipa_reg" and "ipa_regs"
structure types (and some supporting macros) into a new header file,
and remove the "ipa_" and "IPA_" from symbol names.
Separate the IPA register ID validity checking from the generic
check that a register ID is in range. Aside from that, this is
intended to have no functional effect on the code.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some static inline function definitions out of "gsi_reg.h" and
into "gsi.c", which is the only place they're used. Rename them so
their names identify the register they're associated with.
Move the gsi_channel_type enumerated type definition below the
offset and field definitions for the CH_C_CNTXT_0 register where
it's used.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are seven GSI interrupt types that can be signaled by a single
GSI IRQ. These are represented in a bitmask, and the gsi_irq_type_id
enumerated type defines what each bit position represents.
Similarly, the global and general GSI interrupt types each has a set
of conditions it signals, and both types have an enumerated type
that defines which bit that represents each condition.
When used, these enumerated values are passed as an argument to BIT()
in *all* cases. So clean up the code a little bit by defining the
enumerated type values as one-bit masks rather than bit positions.
Rename gsi_general_id to be gsi_general_irq_id for consistency.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When checking the validity of an IPA register ID, compare it against
all possible ipa_reg_id values.
Rename the function ipa_reg_id_valid() to be specific about what's
being checked.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Soon IPA v5.0+ will be supported, and when that happens we will be
able to enable support for the SDX65 (IPA v5.0), SM8450 (IPA v5.1),
and SM8550 (IPA v5.5).
Fix the comment about the GSI version used for IPA v3.1.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reg_addr field in the IPA structure is set but never used.
Get rid of it.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Starting at IPA v4.11, the GSI_GENERIC_COMMAND GSI register got a
new PARAMS field. The code that encodes a value into that field
sets it unconditionally, which is wrong.
We currently only provide 0 as the field's value, so this error has
no real effect. Still, it's a bug, so let's fix it.
Fix an (unrelated) incorrect comment as well. Fields in the
ERROR_LOG GSI register actually *are* defined for IPA versions
prior to v3.5.1.
Fixes: fe68c43ce3 ("net: ipa: support enhanced channel flow control")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following TC flower filter keys to lan966x for IS2:
- ipv4_addr (sip and dip)
- ipv6_addr (sip and dip)
- control (IPv4 fragments)
- portnum (tcp and udp port numbers)
- basic (L3 and L4 protocol)
- vlan (outer vlan tag info)
- tcp (tcp flags)
- ip (tos field)
As the parsing of these keys is similar between lan966x and sparx5, move
the code in a separate file to be shared by these 2 chips. And put the
specific parsing outside of the common functions.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells says:
====================
rxrpc development
Here are some miscellaneous changes for rxrpc:
(1) Use consume_skb() rather than kfree_skb_reason().
(2) Fix unnecessary waking when poking and already-poked call.
(3) Add ack.rwind to the rxrpc_tx_ack tracepoint as this indicates how
many incoming DATA packets we're telling the peer that we are
currently willing to accept on this call.
(4) Reduce duplicate ACK transmission. We send ACKs to let the peer know
that we're increasing the receive window (ack.rwind) as we consume
packets locally. Normal ACK transmission is triggered in three places
and that leads to duplicates being sent.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b3973bb400 ("vmxnet3: set correct hash type based on
rss information") added hashType information into skb. However,
rssType field is populated for eop descriptor. This can lead
to incorrectly reporting of hashType for packets which use
multiple rx descriptors. Multiple rx descriptors are used
for Jumbo frame or LRO packets, which can hit this issue.
This patch moves the RSS codeblock under eop descritor.
Cc: stable@vger.kernel.org
Fixes: b3973bb400 ("vmxnet3: set correct hash type based on rss information")
Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Peng Li <lpeng@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Link: https://lore.kernel.org/r/20230208223900.5794-1-doshir@vmware.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot was able to trigger a warning [1] from net_free()
calling ref_tracker_dir_exit(&net->notrefcnt_tracker)
while the corresponding ref_tracker_dir_init() has not been
done yet.
copy_net_ns() can indeed bypass the call to setup_net()
in some error conditions.
Note:
We might factorize/move more code in preinit_net() in the future.
[1]
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 PID: 5817 Comm: syz-executor.3 Not tainted 6.2.0-rc7-next-20230208-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
assign_lock_key kernel/locking/lockdep.c:982 [inline]
register_lock_class+0xdb6/0x1120 kernel/locking/lockdep.c:1295
__lock_acquire+0x10a/0x5df0 kernel/locking/lockdep.c:4951
lock_acquire.part.0+0x11c/0x370 kernel/locking/lockdep.c:5691
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x3d/0x60 kernel/locking/spinlock.c:162
ref_tracker_dir_exit+0x52/0x600 lib/ref_tracker.c:24
net_free net/core/net_namespace.c:442 [inline]
net_free+0x98/0xd0 net/core/net_namespace.c:436
copy_net_ns+0x4f3/0x6b0 net/core/net_namespace.c:493
create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:228
ksys_unshare+0x449/0x920 kernel/fork.c:3205
__do_sys_unshare kernel/fork.c:3276 [inline]
__se_sys_unshare kernel/fork.c:3274 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:3274
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
Fixes: 0cafd77dcd ("net: add a refcount tracker for kernel sockets")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230208182123.3821604-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guillaume Nault says:
====================
ipv6: Fix socket connection with DSCP fib-rules.
The "flowlabel" field of struct flowi6 is used to store both the actual
flow label and the DS Field (or Traffic Class). However the .connect
handlers of datagram and TCP sockets don't set the DS Field part when
doing their route lookup. This breaks fib-rules that match on DSCP.
====================
Link: https://lore.kernel.org/r/cover.1675875519.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the fib_rule6_send and fib_rule4_send tests to verify that DSCP
values are properly taken into account when UDP or TCP sockets try to
connect().
Tests are done with nettest, which needs a new option to specify
the DS Field value of the socket being tested. This new option is
named '-Q', in reference to the similar option used by ping.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Take into account the IPV6_TCLASS socket option (DSCP) in
tcp_v6_connect(). Otherwise fib6_rule_match() can't properly
match the DSCP value, resulting in invalid route lookup.
For example:
ip route add unreachable table main 2001:db8::10/124
ip route add table 100 2001:db8::10/124 dev eth0
ip -6 rule add dsfield 0x04 table 100
echo test | socat - TCP6:[2001:db8::11]:54321,ipv6-tclass=0x04
Without this patch, socat fails at connect() time ("No route to host")
because the fib-rule doesn't jump to table 100 and the lookup ends up
being done in the main table.
Fixes: 2cc67cc731 ("[IPV6] ROUTE: Routing by Traffic Class.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Take into account the IPV6_TCLASS socket option (DSCP) in
ip6_datagram_flow_key_init(). Otherwise fib6_rule_match() can't
properly match the DSCP value, resulting in invalid route lookup.
For example:
ip route add unreachable table main 2001:db8::10/124
ip route add table 100 2001:db8::10/124 dev eth0
ip -6 rule add dsfield 0x04 table 100
echo test | socat - UDP6:[2001:db8::11]:54321,ipv6-tclass=0x04
Without this patch, socat fails at connect() time ("No route to host")
because the fib-rule doesn't jump to table 100 and the lookup ends up
being done in the main table.
Fixes: 2cc67cc731 ("[IPV6] ROUTE: Routing by Traffic Class.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Horman says:
====================
nfp: fix schedule in atomic context when offloading sa
Yinjun Zhang says:
IPsec offloading callbacks may be called in atomic context, sleep is
not allowed in the implementation. Now use workqueue mechanism to
avoid this issue.
Extend existing workqueue mechanism for multicast configuration only
to universal use, so that all configuring through mailbox asynchoronously
can utilize it.
Also fix another two incorrect use of mailbox in IPsec:
1. Need lock for race condition when accessing mbox
2. Offset of mbox access should depends on tlv caps
====================
Link: https://lore.kernel.org/r/20230208102258.29639-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
IPsec offloading callbacks may be called in atomic context, sleep is
not allowed in the implementation. Now use workqueue mechanism to
avoid this issue.
Extend existing workqueue mechanism for multicast configuration only
to universal use, so that all configuring through mailbox asynchronously
can utilize it.
Fixes: 859a497fe8 ("nfp: implement xfrm callbacks and expose ipsec offload feature to upper layer")
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The mailbox configuration mechanism requires writing several registers,
which shouldn't be interrupted, so need lock to avoid race condition.
The base offset of mailbox configuration registers is not fixed, it
depends on TLV caps read from application firmware.
Fixes: 859a497fe8 ("nfp: implement xfrm callbacks and expose ipsec offload feature to upper layer")
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Code blocks handling BCMA_CHIP_ID_BCM5357 and BCMA_CHIP_ID_BCM53572 were
incorrectly unified. Chip package values are not unique and cannot be
checked independently. They are meaningful only in a context of a given
chip.
Packages BCM5358 and BCM47188 share the same value but then belong to
different chips. Code unification resulted in treating BCM5358 as
BCM47188 and broke its initialization.
Link: https://github.com/openwrt/openwrt/issues/8278
Fixes: cb1b0f90ac ("net: ethernet: bgmac: unify code of the same family")
Cc: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230208091637.16291-1-zajec5@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add IPsec offloading support for NFP3800. Include data
plane and control plane.
Data plane: add IPsec packet process flow in NFP3800
datapath (NFDk).
Control plane: add an algorithm support distinction flow
in xfrm hook function xdo_dev_state_add(), as NFP3800 has
a different set of IPsec algorithm support.
This matches existing support for the NFP6000/NFP4000 and
their NFD3 datapath.
In addition, fixup the md_bytes calculation for NFD3 datapath
to make sure the two datapahts are keept in sync.
Signed-off-by: Huanhuan Wang <huanhuan.wang@corigine.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230208091000.4139974-1-simon.horman@corigine.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull drm fixes from Dave Airlie:
"Weekly fixes.
The amdgpu had a few small fixes to display flicker on certain
configurations, however it was found the the flicker was lessened but
there were other unintended consequences, so for now they've been
reverted and replaced with an option for users to test with so future
fixes can be developed.
Otherwise apart from the usual bunch of i915 and amdgpu, there's a
client, virtio-gpu and an nvidiafb fix that reorders its loading to
avoid failure.
client:
- refcount fix
amdgpu:
- a bunch of attempted flicker fixes that regressed turned into a
user workaround option for now
- Properly fix S/G display with AGP aperture enabled
- Fix cursor offset with 180 rotation
- SMU13 fixes
- Use TGID for GPUVM traces
- Fix oops on in fence error path
- Don't run IB tests on hw rings when sw rings are in use
- memory leak fix
i915:
- Display watermark fix
- fbdev fix for PSR, FBC, DRRS
- Move fd_install after last use of fence
- Initialize the obj flags for shmem objects
- Fix VBT DSI DVO port handling
virtio-gpu:
- fence fix
nvidiafb:
- regression fix for driver load when no hw supported"
* tag 'drm-fixes-2023-02-10' of git://anongit.freedesktop.org/drm/drm: (27 commits)
Revert "drm/amd/display: disable S/G display on DCN 3.1.5"
Revert "drm/amd/display: disable S/G display on DCN 2.1.0"
Revert "drm/amd/display: disable S/G display on DCN 3.1.2/3"
drm/amdgpu: add S/G display parameter
drm/amdgpu/smu: skip pptable init under sriov
amd/amdgpu: remove test ib on hw ring
drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
drm/amdgpu: Use the TGID for trace_amdgpu_vm_update_ptes
drm/amdgpu: Add unique_id support for GC 11.0.1/2
drm/amd/pm: bump SMU 13.0.7 driver_if header version
drm/amd/pm: bump SMU 13.0.0 driver_if header version
drm/amd/pm: add SMU 13.0.7 missing GetPptLimit message mapping
drm/amd/display: fix cursor offset on rotation 180
drm/amd/amdgpu: enable athub cg 11.0.3
Revert "drm/amd/display: disable S/G display on DCN 3.1.4"
drm/amd/display: properly handling AGP aperture in vm setup
drm/amd/display: disable S/G display on DCN 3.1.2/3
drm/amd/display: disable S/G display on DCN 2.1.0
drm/i915: Fix VBT DSI DVO port handling
drm/client: fix circular reference counting issue
...
Paolo Abeni says:
====================
net: introduce rps_default_mask
Real-time setups try hard to ensure proper isolation between time
critical applications and e.g. network processing performed by the
network stack in softirq and RPS is used to move the softirq
activity away from the isolated core.
If the network configuration is dynamic, with netns and devices
routinely created at run-time, enforcing the correct RPS setting
on each newly created device allowing to transient bad configuration
became complex.
Additionally, when multi-queue devices are involved, configuring rps
in user-space on each queue easily becomes very expensive, e.g.
some setups use veths with per cpu queues.
These series try to address the above, introducing a new
sysctl knob: rps_default_mask. The new sysctl entry allows
configuring a netns-wide RPS mask, to be enforced since receive
queue creation time without any fourther per device configuration
required.
Additionally, a simple self-test is introduced to check the
rps_default_mask behavior.
====================
Link: https://lore.kernel.org/r/cover.1675789134.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ensure that RPS default mask changes take place on
all newly created netns/devices and don't affect
existing ones.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If RPS is enabled, this allows configuring a default rps
mask, which is effective since receive queue creation time.
A default RPS mask allows the system admin to ensure proper
isolation, avoiding races at network namespace or device
creation time.
The default RPS mask is initially empty, and can be
modified via a newly added sysctl entry.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Will be used by the following patch to avoid code
duplication. No functional changes intended.
The only difference is that now flow_limit_cpu_sysctl() will
always compute the flow limit mask on each read operation,
even when read() will not return any byte to user-space.
Note that the new helper is placed under a new #ifdef at
the file start to better fit the usage in the later patch
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull rdma fixes from Jason Gunthorpe:
"The usual collection of small driver bug fixes:
- Fix error unwind bugs in hfi1, irdma rtrs
- Old bug with IPoIB children interfaces possibly using the wrong
number of queues
- Really old bug in usnic calling iommu_map in an atomic context
- Recent regression from the DMABUF locking rework
- Missing user data validation in MANA"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/rtrs: Don't call kobject_del for srv_path->kobj
RDMA/mana_ib: Prevent array underflow in mana_ib_create_qp_raw()
IB/hfi1: Assign npages earlier
RDMA/umem: Use dma-buf locked API to solve deadlock
RDMA/usnic: use iommu_map_atomic() under spin_lock()
RDMA/irdma: Fix potential NULL-ptr-dereference
IB/IPoIB: Fix legacy IPoIB due to wrong number of queues
IB/hfi1: Restore allocated resources on failed copyout
Patch series "Fix kmemleak crashes when scanning CMA regions", v2.
When trying to boot a device with an ARM64 kernel with the following
config options enabled:
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_DEBUG_KMEMLEAK=y
a crash is encountered when kmemleak starts to scan the list of gray
or allocated objects that it maintains. Upon closer inspection, it was
observed that these page-faults always occurred when kmemleak attempted
to scan a CMA region.
At the moment, kmemleak is made aware of CMA regions that are specified
through the devicetree to be dynamically allocated within a range of
addresses. However, kmemleak should not need to scan CMA regions or any
reserved memory region, as those regions can be used for DMA transfers
between drivers and peripherals, and thus wouldn't contain anything
useful for kmemleak.
Additionally, since CMA regions are unmapped from the kernel's address
space when they are freed to the buddy allocator at boot when
CONFIG_DEBUG_PAGEALLOC is enabled, kmemleak shouldn't attempt to access
those memory regions, as that will trigger a crash. Thus, kmemleak
should ignore all dynamically allocated reserved memory regions.
This patch (of 1):
Currently, kmemleak ignores dynamically allocated reserved memory regions
that don't have a kernel mapping. However, regions that do retain a
kernel mapping (e.g. CMA regions) do get scanned by kmemleak.
This is not ideal for two reasons:
1 kmemleak works by scanning memory regions for pointers to allocated
objects to determine if those objects have been leaked or not.
However, reserved memory regions can be used between drivers and
peripherals for DMA transfers, and thus, would not contain pointers to
allocated objects, making it unnecessary for kmemleak to scan these
reserved memory regions.
2 When CONFIG_DEBUG_PAGEALLOC is enabled, along with kmemleak, the
CMA reserved memory regions are unmapped from the kernel's address
space when they are freed to buddy at boot. These CMA reserved regions
are still tracked by kmemleak, however, and when kmemleak attempts to
scan them, a crash will happen, as accessing the CMA region will result
in a page-fault, since the regions are unmapped.
Thus, use kmemleak_ignore_phys() for all dynamically allocated reserved
memory regions, instead of those that do not have a kernel mapping
associated with them.
Link: https://lkml.kernel.org/r/20230208232001.2052777-1-isaacmanjarres@google.com
Link: https://lkml.kernel.org/r/20230208232001.2052777-2-isaacmanjarres@google.com
Fixes: a7259df767 ("memblock: make memblock_find_in_range method private")
Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Kirill A. Shutemov <kirill.shtuemov@linux.intel.com>
Cc: Nick Kossifidis <mick@ics.forth.gr>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Cc: Saravana Kannan <saravanak@google.com>
Cc: <stable@vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The debugfs_remove_recursive() is invoked by unregister_shrinker(), which
is holding the write lock of shrinker_rwsem. It will waits for the
handler of debugfs file complete. The handler also needs to hold the read
lock of shrinker_rwsem to do something. So it may cause the following
deadlock:
CPU0 CPU1
debugfs_file_get()
shrinker_debugfs_count_show()/shrinker_debugfs_scan_write()
unregister_shrinker()
--> down_write(&shrinker_rwsem);
debugfs_remove_recursive()
// wait for (A)
--> wait_for_completion();
// wait for (B)
--> down_read_killable(&shrinker_rwsem)
debugfs_file_put() -- (A)
up_write() -- (B)
The down_read_killable() can be killed, so that the above deadlock can be
recovered. But it still requires an extra kill action, otherwise it will
block all subsequent shrinker-related operations, so it's better to fix
it.
[akpm@linux-foundation.org: fix CONFIG_SHRINKER_DEBUG=n stub]
Link: https://lkml.kernel.org/r/20230202105612.64641-1-zhengqi.arch@bytedance.com
Fixes: 5035ebc644 ("mm: shrinkers: introduce debugfs interface for memory shrinkers")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>