commit ff083a2d97 upstream.
Protect perf_guest_cbs with RCU to fix multiple possible errors. Luckily,
all paths that read perf_guest_cbs already require RCU protection, e.g. to
protect the callback chains, so only the direct perf_guest_cbs touchpoints
need to be modified.
Bug #1 is a simple lack of WRITE_ONCE/READ_ONCE behavior to ensure
perf_guest_cbs isn't reloaded between a !NULL check and a dereference.
Fixed via the READ_ONCE() in rcu_dereference().
Bug #2 is that on weakly-ordered architectures, updates to the callbacks
themselves are not guaranteed to be visible before the pointer is made
visible to readers. Fixed by the smp_store_release() in
rcu_assign_pointer() when the new pointer is non-NULL.
Bug #3 is that, because the callbacks are global, it's possible for
readers to run in parallel with an unregisters, and thus a module
implementing the callbacks can be unloaded while readers are in flight,
resulting in a use-after-free. Fixed by a synchronize_rcu() call when
unregistering callbacks.
Bug #1 escaped notice because it's extremely unlikely a compiler will
reload perf_guest_cbs in this sequence. perf_guest_cbs does get reloaded
for future derefs, e.g. for ->is_user_mode(), but the ->is_in_guest()
guard all but guarantees the consumer will win the race, e.g. to nullify
perf_guest_cbs, KVM has to completely exit the guest and teardown down
all VMs before KVM start its module unload / unregister sequence. This
also makes it all but impossible to encounter bug #3.
Bug #2 has not been a problem because all architectures that register
callbacks are strongly ordered and/or have a static set of callbacks.
But with help, unloading kvm_intel can trigger bug #1 e.g. wrapping
perf_guest_cbs with READ_ONCE in perf_misc_flags() while spamming
kvm_intel module load/unload leads to:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:perf_misc_flags+0x1c/0x70
Call Trace:
perf_prepare_sample+0x53/0x6b0
perf_event_output_forward+0x67/0x160
__perf_event_overflow+0x52/0xf0
handle_pmi_common+0x207/0x300
intel_pmu_handle_irq+0xcf/0x410
perf_event_nmi_handler+0x28/0x50
nmi_handle+0xc7/0x260
default_do_nmi+0x6b/0x170
exc_nmi+0x103/0x130
asm_exc_nmi+0x76/0xbf
Fixes: 39447b386c ("perf: Enhance perf to allow for guest statistic collection from host")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211111020738.2512932-2-seanjc@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a6097180d8 upstream.
Prior to Linux v5.4 devtmpfs used mount_single() which treats the given
mount options as "remount" options, so it updates the configuration of
the single super_block on each mount.
Since that was changed, the mount options used for devtmpfs are ignored.
This is a regression which affect systemd - which mounts devtmpfs with
"-o mode=755,size=4m,nr_inodes=1m".
This patch restores the "remount" effect by calling reconfigure_single()
Fixes: d401727ea0 ("devtmpfs: don't mix {ramfs,shmem}_fill_super() with mount_single()")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f9d31c4cf4 upstream.
The same fix in commit 5ec7d18d18 ("sctp: use call_rcu to free endpoint")
is also needed for dumping one asoc and sock after the lookup.
Fixes: 86fdb3448c ("sctp: ensure ep is not destroyed before doing the dump")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7175f02c4e upstream.
Replace sa_family_t with __kernel_sa_family_t to fix the following
linux/nfc.h userspace compilation errors:
/usr/include/linux/nfc.h:266:2: error: unknown type name 'sa_family_t'
sa_family_t sa_family;
/usr/include/linux/nfc.h:274:2: error: unknown type name 'sa_family_t'
sa_family_t sa_family;
Fixes: 23b7869c0f ("NFC: add the NFC socket raw protocol")
Fixes: d646960f79 ("NFC: Initial LLCP support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 79b69a8370 upstream.
Fix user-space builds if it includes /usr/include/linux/nfc.h before
some of other headers:
/usr/include/linux/nfc.h:281:9: error: unknown type name ‘size_t’
281 | size_t service_name_len;
| ^~~~~~
Fixes: d646960f79 ("NFC: Initial LLCP support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5ec7d18d18 ]
This patch is to delay the endpoint free by calling call_rcu() to fix
another use-after-free issue in sctp_sock_dump():
BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20
Call Trace:
__lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218
lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844
__raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
_raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
spin_lock_bh include/linux/spinlock.h:334 [inline]
__lock_sock+0x203/0x350 net/core/sock.c:2253
lock_sock_nested+0xfe/0x120 net/core/sock.c:2774
lock_sock include/net/sock.h:1492 [inline]
sctp_sock_dump+0x122/0xb20 net/sctp/diag.c:324
sctp_for_each_transport+0x2b5/0x370 net/sctp/socket.c:5091
sctp_diag_dump+0x3ac/0x660 net/sctp/diag.c:527
__inet_diag_dump+0xa8/0x140 net/ipv4/inet_diag.c:1049
inet_diag_dump+0x9b/0x110 net/ipv4/inet_diag.c:1065
netlink_dump+0x606/0x1080 net/netlink/af_netlink.c:2244
__netlink_dump_start+0x59a/0x7c0 net/netlink/af_netlink.c:2352
netlink_dump_start include/linux/netlink.h:216 [inline]
inet_diag_handler_cmd+0x2ce/0x3f0 net/ipv4/inet_diag.c:1170
__sock_diag_cmd net/core/sock_diag.c:232 [inline]
sock_diag_rcv_msg+0x31d/0x410 net/core/sock_diag.c:263
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2477
sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:274
This issue occurs when asoc is peeled off and the old sk is freed after
getting it by asoc->base.sk and before calling lock_sock(sk).
To prevent the sk free, as a holder of the sk, ep should be alive when
calling lock_sock(). This patch uses call_rcu() and moves sock_put and
ep free into sctp_endpoint_destroy_rcu(), so that it's safe to try to
hold the ep under rcu_read_lock in sctp_transport_traverse_process().
If sctp_endpoint_hold() returns true, it means this ep is still alive
and we have held it and can continue to dump it; If it returns false,
it means this ep is dead and can be freed after rcu_read_unlock, and
we should skip it.
In sctp_sock_dump(), after locking the sk, if this ep is different from
tsp->asoc->ep, it means during this dumping, this asoc was peeled off
before calling lock_sock(), and the sk should be skipped; If this ep is
the same with tsp->asoc->ep, it means no peeloff happens on this asoc,
and due to lock_sock, no peeloff will happen either until release_sock.
Note that delaying endpoint free won't delay the port release, as the
port release happens in sctp_endpoint_destroy() before calling call_rcu().
Also, freeing endpoint by call_rcu() makes it safe to access the sk by
asoc->base.sk in sctp_assocs_seq_show() and sctp_rcv().
Thanks Jones to bring this issue up.
v1->v2:
- improve the changelog.
- add kfree(ep) into sctp_endpoint_destroy_rcu(), as Jakub noticed.
Reported-by: syzbot+9276d76e83e3bcde6c99@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Fixes: d25adbeb0c ("sctp: fix an use-after-free issue in sctp_sock_dump")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 4bc5e64e6c upstream.
Commit 8633ef82f1 ("drivers/firmware: consolidate EFI framebuffer setup
for all arches") made the Generic System Framebuffers (sysfb) driver able
to be built on non-x86 architectures.
But it left the efifb_setup_from_dmi() function prototype declaration in
the architecture specific headers. This could lead to the following
compiler warning as reported by the kernel test robot:
drivers/firmware/efi/sysfb_efi.c:70:6: warning: no previous prototype for function 'efifb_setup_from_dmi' [-Wmissing-prototypes]
void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
^
drivers/firmware/efi/sysfb_efi.c:70:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
Fixes: 8633ef82f1 ("drivers/firmware: consolidate EFI framebuffer setup for all arches")
Reported-by: kernel test robot <lkp@intel.com>
Cc: <stable@vger.kernel.org> # 5.15.x
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/r/20211126001333.555514-1-javierm@redhat.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d7f55471db ]
Fix modpost Section mismatch error in memblock_phys_alloc()
[...]
WARNING: modpost: vmlinux.o(.text.unlikely+0x1dcc): Section mismatch in reference
from the function memblock_phys_alloc() to the function .init.text:memblock_phys_alloc_range()
The function memblock_phys_alloc() references
the function __init memblock_phys_alloc_range().
This is often because memblock_phys_alloc lacks a __init
annotation or the annotation of memblock_phys_alloc_range is wrong.
ERROR: modpost: Section mismatches detected.
Set CONFIG_SECTION_MISMATCH_WARN_ONLY=y to allow them.
[...]
memblock_phys_alloc() is a one-line wrapper, make it __always_inline to
avoid these section mismatches.
Reported-by: k2ci <kernel-bot@kylinos.cn>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
[rppt: slightly massaged changelog ]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/20211217020754.2874872-1-liu.yun@linux.dev
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ec624fe740 ]
BPF layer extends the qdisc control block via struct bpf_skb_data_end
and because of that there is no more room to add variables to the
qdisc layer control block without going over the skb->cb size.
Extend the qdisc control block with a tc control block,
and move all tc related variables to there as a pre-step for
extending the tc control block with additional members.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit dfd0743f1d upstream.
Since the tee subsystem does not keep a strong reference to its idle
shared memory buffers, it races with other threads that try to destroy a
shared memory through a close of its dma-buf fd or by unmapping the
memory.
In tee_shm_get_from_id() when a lookup in teedev->idr has been
successful, it is possible that the tee_shm is in the dma-buf teardown
path, but that path is blocked by the teedev mutex. Since we don't have
an API to tell if the tee_shm is in the dma-buf teardown path or not we
must find another way of detecting this condition.
Fix this by doing the reference counting directly on the tee_shm using a
new refcount_t refcount field. dma-buf is replaced by using
anon_inode_getfd() instead, this separates the life-cycle of the
underlying file from the tee_shm. tee_shm_put() is updated to hold the
mutex when decreasing the refcount to 0 and then remove the tee_shm from
teedev->idr before releasing the mutex. This means that the tee_shm can
never be found unless it has a refcount larger than 0.
Fixes: 967c9cca2c ("tee: generic TEE subsystem")
Cc: stable@vger.kernel.org
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Lars Persson <larper@axis.com>
Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
Reported-by: Patrik Lantz <patrik.lantz@axis.com>
Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dcce50e6cc ]
When building with Clang and CONFIG_TRACE_BRANCH_PROFILING, there are a
lot of unreachable warnings, like:
arch/x86/kernel/traps.o: warning: objtool: handle_xfd_event()+0x134: unreachable instruction
Without an input to the inline asm, 'volatile' is ignored for some
reason and Clang feels free to move the reachable() annotation away from
its intended location.
Fix that by re-adding the counter value to the inputs.
Fixes: f1069a8756 ("compiler.h: Avoid using inline asm operand modifiers")
Fixes: c199f64ff9 ("instrumentation.h: Avoid using inline asm operand modifiers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/0417e96909b97a406323409210de7bf13df0b170.1636410380.git.jpoimboe@redhat.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: x86@kernel.org
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cb8747b7d2 ]
This macro is defined by glibc itself, which makes the issue go unnoticed on
those systems. On non-glibc systems it causes build failures on several
utilities and libraries, like bpftool and objtool.
Fixes: 1d509f2a6e ("x86/insn: Support big endian cross-compiles")
Fixes: 2d7ce0e8a7 ("tools/virtio: more stubs")
Fixes: 3fb321fde2 ("selftests/net: ipv6 flowlabel")
Fixes: 50b3ed57de ("selftests/bpf: test bpf flow dissection")
Fixes: 9cacf81f81 ("bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE")
Fixes: a4b2061242 ("tools include uapi: Grab a copy of linux/in.h")
Fixes: b12d6ec097 ("bpf: btf: add btf print functionality")
Fixes: c0dd967818 ("tools, include: Grab a copy of linux/erspan.h")
Fixes: c4b6014e8b ("tools: Add copy of perf_event.h to tools/include/linux/")
Signed-off-by: Ismael Luceno <ismael@iodev.co.uk>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20211115134647.1921-1-ismael@iodev.co.uk
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1ed1d59211 ]
virtio_net_hdr_set_proto infers skb->protocol from the virtio_net_hdr
gso_type, to avoid packets getting dropped for lack of a proto type.
Its protocol choice is a guess, especially in the case of UFO, where
the single VIRTIO_NET_HDR_GSO_UDP label covers both UFOv4 and UFOv6.
Skip this best effort if the field is already initialized. Whether
explicitly from userspace, or implicitly based on an earlier call to
dev_parse_header_protocol (which is more robust, but was introduced
after this patch).
Fixes: 9d2f67e43b ("net/packet: fix packet drop as of virtio gso")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20211220145027.2784293-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ef57c1610d ]
Increase cache locality by moving rx_dst_coookie next to sk->sk_rx_dst
This removes one or two cache line misses in IPv6 early demux (TCP/UDP)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0c0a5ef809 ]
Increase cache locality by moving rx_dst_ifindex next to sk->sk_rx_dst
This is part of an effort to reduce cache line misses in TCP fast path.
This removes one cache line miss in early demux.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit fe415186b4 upstream.
The Xen console driver is still vulnerable for an attack via excessive
number of events sent by the backend. Fix that by using a lateeoi event
channel.
For the normal domU initial console this requires the introduction of
bind_evtchn_to_irq_lateeoi() as there is no xenbus device available
at the time the event channel is bound to the irq.
As the decision whether an interrupt was spurious or not requires to
test for bytes having been read from the backend, move sending the
event into the if statement, as sending an event without having found
any bytes to be read is making no sense at all.
This is part of XSA-391
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 444dd878e8 upstream.
The kerneldoc comment of pm_runtime_active() does not reflect the
behavior of the function, so update it accordingly.
Fixes: 403d2d116e ("PM: runtime: Add kerneldoc comments to multiple helpers")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 50252e4b5e upstream.
signalfd_poll() and binder_poll() are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case. This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution. This solution is for the queue to be cleared
before it is freed, by sending a POLLFREE notification to all waiters.
Unfortunately, only eventpoll handles POLLFREE. A second type of
non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't
handle POLLFREE. This allows a use-after-free to occur if a signalfd or
binder fd is polled with aio poll, and the waitqueue gets freed.
Fix this by making aio poll handle POLLFREE.
A patch by Ramji Jiyani <ramjiyani@google.com>
(https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com)
tried to do this by making aio_poll_wake() always complete the request
inline if POLLFREE is seen. However, that solution had two bugs.
First, it introduced a deadlock, as it unconditionally locked the aio
context while holding the waitqueue lock, which inverts the normal
locking order. Second, it didn't consider that POLLFREE notifications
are missed while the request has been temporarily de-queued.
The second problem was solved by my previous patch. This patch then
properly fixes the use-after-free by handling POLLFREE in a
deadlock-free way. It does this by taking advantage of the fact that
freeing of the waitqueue is RCU-delayed, similar to what eventpoll does.
Fixes: 2c14fa838c ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 42288cb44c upstream.
Several ->poll() implementations are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case. This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution. This solution is for the queue to be cleared
before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
However, that has a bug: wake_up_poll() calls __wake_up() with
nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters,
and the wakeup function for the first one returns a positive value, only
that one will be called. That's *not* what's needed for POLLFREE;
POLLFREE is special in that it really needs to wake up everyone.
Considering the three non-blocking poll systems:
- io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.
- aio poll is unaffected, since it doesn't support exclusive waits.
However, that's fragile, as someone could add this feature later.
- epoll doesn't appear to be broken by this, since its wakeup function
returns 0 when it sees POLLFREE. But this is fragile.
Although there is a workaround (see epoll), it's better to define a
function which always sends POLLFREE to all waiters. Add such a
function. Also make it verify that the queue really becomes empty after
all waiters have been woken up.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 79364031c5 upstream.
The initial implementation of migrate_disable() for mainline was a
wrapper around preempt_disable(). RT kernels substituted this with a
real migrate disable implementation.
Later on mainline gained true migrate disable support, but neither
documentation nor affected code were updated.
Remove stale comments claiming that migrate_disable() is PREEMPT_RT only.
Don't use __this_cpu_inc() in the !PREEMPT_RT path because preemption is
not disabled and the RMW operation can be preempted.
Fixes: 74d862b682 ("sched: Make migrate_disable/enable() independent of RT")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211127163200.10466-3-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f45b2974cc upstream.
The arch_prepare_bpf_dispatcher function does not have a prototype, and
yields the following warning when W=1 is enabled for the kernel build.
>> arch/x86/net/bpf_jit_comp.c:2188:5: warning: no previous \
prototype for 'arch_prepare_bpf_dispatcher' [-Wmissing-prototypes]
2188 | int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, \
int num_funcs)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Remove the warning by adding a function declaration to include/linux/bpf.h.
Fixes: 75ccbef636 ("bpf: Introduce BPF dispatcher")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Björn Töpel <bjorn@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211117125708.769168-1-bjorn@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit eaee12f046 ]
This series introduces new packet merge type, therefore rename lro
functions to packet merge to support the new merge type:
- Generalize + rename mlx5e_build_tir_ctx_lro to
mlx5e_build_tir_ctx_packet_merge.
- Rename mlx5e_modify_tirs_lro to mlx5e_modify_tirs_packet_merge.
- Rename lro bit in mlx5_ifc_modify_tir_bitmask_bits to packet_merge.
- Rename lro_en in mlx5e_params to packet_merge_type type and combine
packet_merge params into one struct mlx5e_packet_merge_param.
Signed-off-by: Khalid Manaa <khalidm@nvidia.com>
Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 50f477fe99 ]
TIR stands for transport interface receive, the TIR object is
responsible for performing all transport related operations on
the receive side like packet processing, demultiplexing the packets
to different RQ's, etc.
lro_timeout is a field in the TIR that is used to set the timeout for lro
session, this series introduces new packet merge type, therefore rename
lro_timeout to packet_merge_timeout for all packet merge types.
Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 213f5f8f31 upstream.
Before commit faa041a40b ("ipv4: Create cleanup helper for fib_nh")
changes to net->ipv4.fib_num_tclassid_users were protected by RTNL.
After the change, this is no longer the case, as free_fib_info_rcu()
runs after rcu grace period, without rtnl being held.
Fixes: faa041a40b ("ipv4: Create cleanup helper for fib_nh")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f7e5b9bfa6 upstream.
On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
because the ordinary load/store instructions (ldr, ldrh, ldrb) can
tolerate any misalignment of the memory address. However, load/store
double and load/store multiple instructions (ldrd, ldm) may still only
be used on memory addresses that are 32-bit aligned, and so we have to
use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we
may end up with a severe performance hit due to alignment traps that
require fixups by the kernel. Testing shows that this currently happens
with clang-13 but not gcc-11. In theory, any compiler version can
produce this bug or other problems, as we are dealing with undefined
behavior in C99 even on architectures that support this in hardware,
see also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363.
Fortunately, the get_unaligned() accessors do the right thing: when
building for ARMv6 or later, the compiler will emit unaligned accesses
using the ordinary load/store instructions (but avoid the ones that
require 32-bit alignment). When building for older ARM, those accessors
will emit the appropriate sequence of ldrb/mov/orr instructions. And on
architectures that can truly tolerate any kind of misalignment, the
get_unaligned() accessors resolve to the leXX_to_cpup accessors that
operate on aligned addresses.
Since the compiler will in fact emit ldrd or ldm instructions when
building this code for ARM v6 or later, the solution is to use the
unaligned accessors unconditionally on architectures where this is
known to be fast. The _aligned version of the hash function is
however still needed to get the best performance on architectures
that cannot do any unaligned access in hardware.
This new version avoids the undefined behavior and should produce
the fastest hash on all architectures we support.
Link: https://lore.kernel.org/linux-arm-kernel/20181008211554.5355-4-ard.biesheuvel@linaro.org/
Link: https://lore.kernel.org/linux-crypto/CAK8P3a2KfmmGDbVHULWevB0hv71P2oi2ZCHEAqT=8dQfa0=cqQ@mail.gmail.com/
Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Fixes: 2c956a6077 ("siphash: add cryptographically secure PRF")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dacb5d8875 upstream.
Steffen reported a TCP stream corruption for HTTP requests
served by the apache web-server using a cifs mount-point
and memory mapping the relevant file.
The root cause is quite similar to the one addressed by
commit 20eb4f29b6 ("net: fix sk_page_frag() recursion from
memory reclaim"). Here the nested access to the task page frag
is caused by a page fault on the (mmapped) user-space memory
buffer coming from the cifs file.
The page fault handler performs an smb transaction on a different
socket, inside the same process context. Since sk->sk_allaction
for such socket does not prevent the usage for the task_frag,
the nested allocation modify "under the hood" the page frag
in use by the outer sendmsg call, corrupting the stream.
The overall relevant stack trace looks like the following:
httpd 78268 [001] 3461630.850950: probe:tcp_sendmsg_locked:
ffffffff91461d91 tcp_sendmsg_locked+0x1
ffffffff91462b57 tcp_sendmsg+0x27
ffffffff9139814e sock_sendmsg+0x3e
ffffffffc06dfe1d smb_send_kvec+0x28
[...]
ffffffffc06cfaf8 cifs_readpages+0x213
ffffffff90e83c4b read_pages+0x6b
ffffffff90e83f31 __do_page_cache_readahead+0x1c1
ffffffff90e79e98 filemap_fault+0x788
ffffffff90eb0458 __do_fault+0x38
ffffffff90eb5280 do_fault+0x1a0
ffffffff90eb7c84 __handle_mm_fault+0x4d4
ffffffff90eb8093 handle_mm_fault+0xc3
ffffffff90c74f6d __do_page_fault+0x1ed
ffffffff90c75277 do_page_fault+0x37
ffffffff9160111e page_fault+0x1e
ffffffff9109e7b5 copyin+0x25
ffffffff9109eb40 _copy_from_iter_full+0xe0
ffffffff91462370 tcp_sendmsg_locked+0x5e0
ffffffff91462370 tcp_sendmsg_locked+0x5e0
ffffffff91462b57 tcp_sendmsg+0x27
ffffffff9139815c sock_sendmsg+0x4c
ffffffff913981f7 sock_write_iter+0x97
ffffffff90f2cc56 do_iter_readv_writev+0x156
ffffffff90f2dff0 do_iter_write+0x80
ffffffff90f2e1c3 vfs_writev+0xa3
ffffffff90f2e27c do_writev+0x5c
ffffffff90c042bb do_syscall_64+0x5b
ffffffff916000ad entry_SYSCALL_64_after_hwframe+0x65
The cifs filesystem rightfully sets sk_allocations to GFP_NOFS,
we can avoid the nesting using the sk page frag for allocation
lacking the __GFP_FS flag. Do not define an additional mm-helper
for that, as this is strictly tied to the sk page frag usage.
v1 -> v2:
- use a stricted sk_page_frag() check instead of reordering the
code (Eric)
Reported-by: Steffen Froemer <sfroemer@redhat.com>
Fixes: 5640f76858 ("net: use a per task frag allocator")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 20ae1d6aa1 upstream.
Each peer's endpoint contains a dst_cache entry that takes a reference
to another netdev. When the containing namespace exits, we take down the
socket and prevent future sockets from being created (by setting
creating_net to NULL), which removes that potential reference on the
netns. However, it doesn't release references to the netns that a netdev
cached in dst_cache might be taking, so the netns still might fail to
exit. Since the socket is gimped anyway, we can simply clear all the
dst_caches (by way of clearing the endpoint src), which will release all
references.
However, the current dst_cache_reset function only releases those
references lazily. But it turns out that all of our usages of
wg_socket_clear_peer_endpoint_src are called from contexts that are not
exactly high-speed or bottle-necked. For example, when there's
connection difficulty, or when userspace is reconfiguring the interface.
And in particular for this patch, when the netns is exiting. So for
those cases, it makes more sense to call dst_release immediately. For
that, we add a small helper function to dst_cache.
This patch also adds a test to netns.sh from Hangbin Liu to ensure this
doesn't regress.
Tested-by: Hangbin Liu <liuhangbin@gmail.com>
Reported-by: Xiumei Mu <xmu@redhat.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Fixes: 900575aa33 ("wireguard: device: avoid circular netns references")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cdef485217 upstream.
The kernel leaks memory when a `fib` rule is present in IPv6 nftables
firewall rules and a suppress_prefix rule is present in the IPv6 routing
rules (used by certain tools such as wg-quick). In such scenarios, every
incoming packet will leak an allocation in `ip6_dst_cache` slab cache.
After some hours of `bpftrace`-ing and source code reading, I tracked
down the issue to ca7a03c417 ("ipv6: do not free rt if
FIB_LOOKUP_NOREF is set on suppress rule").
The problem with that change is that the generic `args->flags` always have
`FIB_LOOKUP_NOREF` set[1][2] but the IPv6-specific flag
`RT6_LOOKUP_F_DST_NOREF` might not be, leading to `fib6_rule_suppress` not
decreasing the refcount when needed.
How to reproduce:
- Add the following nftables rule to a prerouting chain:
meta nfproto ipv6 fib saddr . mark . iif oif missing drop
This can be done with:
sudo nft create table inet test
sudo nft create chain inet test test_chain '{ type filter hook prerouting priority filter + 10; policy accept; }'
sudo nft add rule inet test test_chain meta nfproto ipv6 fib saddr . mark . iif oif missing drop
- Run:
sudo ip -6 rule add table main suppress_prefixlength 0
- Watch `sudo slabtop -o | grep ip6_dst_cache` to see memory usage increase
with every incoming ipv6 packet.
This patch exposes the protocol-specific flags to the protocol
specific `suppress` function, and check the protocol-specific `flags`
argument for RT6_LOOKUP_F_DST_NOREF instead of the generic
FIB_LOOKUP_NOREF when decreasing the refcount, like this.
[1]: ca7a03c417/net/ipv6/fib6_rules.c (L71)
[2]: ca7a03c417/net/ipv6/fib6_rules.c (L99)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215105
Fixes: ca7a03c417 ("ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6bbfa44116 upstream.
The 'kprobe::data_size' is unsigned, thus it can not be negative. But if
user sets it enough big number (e.g. (size_t)-8), the result of 'data_size
+ sizeof(struct kretprobe_instance)' becomes smaller than sizeof(struct
kretprobe_instance) or zero. In result, the kretprobe_instance are
allocated without enough memory, and kretprobe accesses outside of
allocated memory.
To avoid this issue, introduce a max limitation of the
kretprobe::data_size. 4KB per instance should be OK.
Link: https://lkml.kernel.org/r/163836995040.432120.10322772773821182925.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: f47cd9b553 ("kprobes: kretprobe user entry-handler")
Reported-by: zhangyue <zhangyue1@kylinos.cn>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8837cbbf85 ]
We need a way to release a fib6_nh's per-cpu dsts when replacing
nexthops otherwise we can end up with stale per-cpu dsts which hold net
device references, so add a new IPv6 stub called fib6_nh_release_dsts.
It must be used after an RCU grace period, so no new dsts can be created
through a group's nexthop entry.
Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed
so it doesn't need a dummy stub when IPv6 is not enabled.
Fixes: 7bf4796dd0 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 451dc48c80 ]
This patch fixes an issue that an u32 netlink value is handled as a
signed enum value which doesn't fit into the range of u32 netlink type.
If it's handled as -1 value some BIT() evaluation ends in a
shift-out-of-bounds issue. To solve the issue we set the to u32 max which
is s32 "-1" value to keep backwards compatibility and let the followed enum
values start counting at 0. This brings the compiler to never handle the
enum as signed and a check if the value is above NL802154_IFTYPE_MAX should
filter -1 out.
Fixes: f3ea5e4423 ("ieee802154: add new interface command")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/r/20211112030916.685793-1-aahringo@redhat.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 868ddfcef3 upstream.
The code for hdac_ext_stream seems inherited from hdac_stream, and
similar locking issues are present: the use of the bus->reg_lock
spinlock is inconsistent, with only writes to specific fields being
protected.
Apply similar fix as in hdac_stream by protecting all accesses to
'link_locked' and 'decoupled' fields, with a new helper
snd_hdac_ext_stream_decouple_locked() added to simplify code
changes.
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20210924192417.169243-4-pierre-louis.bossart@linux.intel.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c4777efa75 upstream.
While commit 097b9146c0 ("net: fix up truesize of cloned
skb in skb_prepare_for_shift()") fixed immediate issues found
when KFENCE was enabled/tested, there are still similar issues,
when tcp_trim_head() hits KFENCE while the master skb
is cloned.
This happens under heavy networking TX workloads,
when the TX completion might be delayed after incoming ACK.
This patch fixes the WARNING in sk_stream_kill_queues
when sk->sk_mem_queued/sk->sk_forward_alloc are not zero.
Fixes: d3fb45f370 ("mm, kfence: insert KFENCE hooks for SLAB")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20211102004555.1359210-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e60feb445f upstream.
If you already have an inode and need to update the time on the inode
there is no way to do this properly. Export this helper to allow file
systems to update time on the inode so the appropriate handler is
called, either ->update_time or generic_update_time.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fcb116bc43 upstream.
Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.
Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.
Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit. This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.
This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.
In the future is should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig. That can be done where it matters on
a case-by-case basis with careful analysis.
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Fixes: a3616a3c02 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
Fixes: 83a1f27ad7 ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf07 ("signal/s390: Use force_sigsegv in default_trap_handler")
Fixes: 086ec444f8 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
Fixes: c317d306d5 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
Fixes: 695dd0d634 ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
Fixes: 1fbd60df8a ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
Fixes: 941edc5bf1 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Backlund <tmb@iki.fi>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 26d5badbcc upstream.
Add a simple helper force_fatal_sig that causes a signal to be
delivered to a process as if the signal handler was set to SIG_DFL.
Reimplement force_sigsegv based upon this new helper. This fixes
force_sigsegv so that when it forces the default signal handler
to be used the code now forces the signal to be unblocked as well.
Reusing the tested logic in force_sig_info_to_task that was built for
force_sig_seccomp this makes the implementation trivial.
This is interesting both because it makes force_sigsegv simpler and
because there are a couple of buggy places in the kernel that call
do_exit(SIGILL) or do_exit(SIGSYS) because there is no straight
forward way today for those places to simply force the exit of a
process with the chosen signal. Creating force_fatal_sig allows
those places to be implemented with normal signal exits.
Link: https://lkml.kernel.org/r/20211020174406.17889-13-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Thomas Backlund <tmb@iki.fi>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>