mirror of
https://github.com/hardkernel/linux.git
synced 2026-03-26 12:30:23 +09:00
f10aa9bc94de0aa334380b3903dd2f5ec1f48231
76157 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c74b84544d |
wifi: mac80211: Purge vif txq in ieee80211_do_stop()
[ Upstream commit 378677eb8f44621ecc9ce659f7af61e5baa94d81 ]
After ieee80211_do_stop() SKB from vif's txq could still be processed.
Indeed another concurrent vif schedule_and_wake_txq call could cause
those packets to be dequeued (see ieee80211_handle_wake_tx_queue())
without checking the sdata current state.
Because vif.drv_priv is now cleared in this function, this could lead to
driver crash.
For example in ath12k, ahvif is store in vif.drv_priv. Thus if
ath12k_mac_op_tx() is called after ieee80211_do_stop(), ahvif->ah can be
NULL, leading the ath12k_warn(ahvif->ah,...) call in this function to
trigger the NULL deref below.
Unable to handle kernel paging request at virtual address dfffffc000000001
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
batman_adv: bat0: Interface deactivated: brbh1337
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
Data abort info:
ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[dfffffc000000001] address between user and kernel address ranges
Internal error: Oops: 0000000096000004 [#1] SMP
CPU: 1 UID: 0 PID: 978 Comm: lbd Not tainted 6.13.0-g633f875b8f1e #114
Hardware name: HW (DT)
pstate: 10000005 (nzcV daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : ath12k_mac_op_tx+0x6cc/0x29b8 [ath12k]
lr : ath12k_mac_op_tx+0x174/0x29b8 [ath12k]
sp : ffffffc086ace450
x29: ffffffc086ace450 x28: 0000000000000000 x27: 1ffffff810d59ca4
x26: ffffff801d05f7c0 x25: 0000000000000000 x24: 000000004000001e
x23: ffffff8009ce4926 x22: ffffff801f9c0800 x21: ffffff801d05f7f0
x20: ffffff8034a19f40 x19: 0000000000000000 x18: ffffff801f9c0958
x17: ffffff800bc0a504 x16: dfffffc000000000 x15: ffffffc086ace4f8
x14: ffffff801d05f83c x13: 0000000000000000 x12: ffffffb003a0bf03
x11: 0000000000000000 x10: ffffffb003a0bf02 x9 : ffffff8034a19f40
x8 : ffffff801d05f818 x7 : 1ffffff0069433dc x6 : ffffff8034a19ee0
x5 : ffffff801d05f7f0 x4 : 0000000000000000 x3 : 0000000000000001
x2 : 0000000000000000 x1 : dfffffc000000000 x0 : 0000000000000008
Call trace:
ath12k_mac_op_tx+0x6cc/0x29b8 [ath12k] (P)
ieee80211_handle_wake_tx_queue+0x16c/0x260
ieee80211_queue_skb+0xeec/0x1d20
ieee80211_tx+0x200/0x2c8
ieee80211_xmit+0x22c/0x338
__ieee80211_subif_start_xmit+0x7e8/0xc60
ieee80211_subif_start_xmit+0xc4/0xee0
__ieee80211_subif_start_xmit_8023.isra.0+0x854/0x17a0
ieee80211_subif_start_xmit_8023+0x124/0x488
dev_hard_start_xmit+0x160/0x5a8
__dev_queue_xmit+0x6f8/0x3120
br_dev_queue_push_xmit+0x120/0x4a8
__br_forward+0xe4/0x2b0
deliver_clone+0x5c/0xd0
br_flood+0x398/0x580
br_dev_xmit+0x454/0x9f8
dev_hard_start_xmit+0x160/0x5a8
__dev_queue_xmit+0x6f8/0x3120
ip6_finish_output2+0xc28/0x1b60
__ip6_finish_output+0x38c/0x638
ip6_output+0x1b4/0x338
ip6_local_out+0x7c/0xa8
ip6_send_skb+0x7c/0x1b0
ip6_push_pending_frames+0x94/0xd0
rawv6_sendmsg+0x1a98/0x2898
inet_sendmsg+0x94/0xe0
__sys_sendto+0x1e4/0x308
__arm64_sys_sendto+0xc4/0x140
do_el0_svc+0x110/0x280
el0_svc+0x20/0x60
el0t_64_sync_handler+0x104/0x138
el0t_64_sync+0x154/0x158
To avoid that, empty vif's txq at ieee80211_do_stop() so no packet could
be dequeued after ieee80211_do_stop() (new packets cannot be queued
because SDATA_STATE_RUNNING is cleared at this point).
Fixes:
|
||
|
|
7fa75affe2 |
wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()
[ Upstream commit a104042e2bf6528199adb6ca901efe7b60c2c27f ]
The ieee80211 skb control block key (set when skb was queued) could have
been removed before ieee80211_tx_dequeue() call. ieee80211_tx_dequeue()
already called ieee80211_tx_h_select_key() to get the current key, but
the latter do not update the key in skb control block in case it is
NULL. Because some drivers actually use this key in their TX callbacks
(e.g. ath1{1,2}k_mac_op_tx()) this could lead to the use after free
below:
BUG: KASAN: slab-use-after-free in ath11k_mac_op_tx+0x590/0x61c
Read of size 4 at addr ffffff803083c248 by task kworker/u16:4/1440
CPU: 3 UID: 0 PID: 1440 Comm: kworker/u16:4 Not tainted 6.13.0-ge128f627f404 #2
Hardware name: HW (DT)
Workqueue: bat_events batadv_send_outstanding_bcast_packet
Call trace:
show_stack+0x14/0x1c (C)
dump_stack_lvl+0x58/0x74
print_report+0x164/0x4c0
kasan_report+0xac/0xe8
__asan_report_load4_noabort+0x1c/0x24
ath11k_mac_op_tx+0x590/0x61c
ieee80211_handle_wake_tx_queue+0x12c/0x1c8
ieee80211_queue_skb+0xdcc/0x1b4c
ieee80211_tx+0x1ec/0x2bc
ieee80211_xmit+0x224/0x324
__ieee80211_subif_start_xmit+0x85c/0xcf8
ieee80211_subif_start_xmit+0xc0/0xec4
dev_hard_start_xmit+0xf4/0x28c
__dev_queue_xmit+0x6ac/0x318c
batadv_send_skb_packet+0x38c/0x4b0
batadv_send_outstanding_bcast_packet+0x110/0x328
process_one_work+0x578/0xc10
worker_thread+0x4bc/0xc7c
kthread+0x2f8/0x380
ret_from_fork+0x10/0x20
Allocated by task 1906:
kasan_save_stack+0x28/0x4c
kasan_save_track+0x1c/0x40
kasan_save_alloc_info+0x3c/0x4c
__kasan_kmalloc+0xac/0xb0
__kmalloc_noprof+0x1b4/0x380
ieee80211_key_alloc+0x3c/0xb64
ieee80211_add_key+0x1b4/0x71c
nl80211_new_key+0x2b4/0x5d8
genl_family_rcv_msg_doit+0x198/0x240
<...>
Freed by task 1494:
kasan_save_stack+0x28/0x4c
kasan_save_track+0x1c/0x40
kasan_save_free_info+0x48/0x94
__kasan_slab_free+0x48/0x60
kfree+0xc8/0x31c
kfree_sensitive+0x70/0x80
ieee80211_key_free_common+0x10c/0x174
ieee80211_free_keys+0x188/0x46c
ieee80211_stop_mesh+0x70/0x2cc
ieee80211_leave_mesh+0x1c/0x60
cfg80211_leave_mesh+0xe0/0x280
cfg80211_leave+0x1e0/0x244
<...>
Reset SKB control block key before calling ieee80211_tx_h_select_key()
to avoid that.
Fixes:
|
||
|
|
c6fefcb71d |
sctp: detect and prevent references to a freed transport in sendmsg
commit f1a69a940de58b16e8249dff26f74c8cc59b32be upstream. sctp_sendmsg() re-uses associations and transports when possible by doing a lookup based on the socket endpoint and the message destination address, and then sctp_sendmsg_to_asoc() sets the selected transport in all the message chunks to be sent. There's a possible race condition if another thread triggers the removal of that selected transport, for instance, by explicitly unbinding an address with setsockopt(SCTP_SOCKOPT_BINDX_REM), after the chunks have been set up and before the message is sent. This can happen if the send buffer is full, during the period when the sender thread temporarily releases the socket lock in sctp_wait_for_sndbuf(). This causes the access to the transport data in sctp_outq_select_transport(), when the association outqueue is flushed, to result in a use-after-free read. This change avoids this scenario by having sctp_transport_free() signal the freeing of the transport, tagging it as "dead". In order to do this, the patch restores the "dead" bit in struct sctp_transport, which was removed in commit |
||
|
|
29b2114572 |
mptcp: only inc MPJoinAckHMacFailure for HMAC failures
commit 21c02e8272bc95ba0dd44943665c669029b42760 upstream.
Recently, during a debugging session using local MPTCP connections, I
noticed MPJoinAckHMacFailure was not zero on the server side. The
counter was in fact incremented when the PM rejected new subflows,
because the 'subflow' limit was reached.
The fix is easy, simply dissociating the two cases: only the HMAC
validation check should increase MPTCP_MIB_JOINACKMAC counter.
Fixes:
|
||
|
|
7f9ae060ed |
mptcp: fix NULL pointer in can_accept_new_subflow
commit 443041deb5ef6a1289a99ed95015ec7442f141dc upstream.
When testing valkey benchmark tool with MPTCP, the kernel panics in
'mptcp_can_accept_new_subflow' because subflow_req->msk is NULL.
Call trace:
mptcp_can_accept_new_subflow (./net/mptcp/subflow.c:63 (discriminator 4)) (P)
subflow_syn_recv_sock (./net/mptcp/subflow.c:854)
tcp_check_req (./net/ipv4/tcp_minisocks.c:863)
tcp_v4_rcv (./net/ipv4/tcp_ipv4.c:2268)
ip_protocol_deliver_rcu (./net/ipv4/ip_input.c:207)
ip_local_deliver_finish (./net/ipv4/ip_input.c:234)
ip_local_deliver (./net/ipv4/ip_input.c:254)
ip_rcv_finish (./net/ipv4/ip_input.c:449)
...
According to the debug log, the same req received two SYN-ACK in a very
short time, very likely because the client retransmits the syn ack due
to multiple reasons.
Even if the packets are transmitted with a relevant time interval, they
can be processed by the server on different CPUs concurrently). The
'subflow_req->msk' ownership is transferred to the subflow the first,
and there will be a risk of a null pointer dereference here.
This patch fixes this issue by moving the 'subflow_req->msk' under the
`own_req == true` conditional.
Note that the !msk check in subflow_hmac_valid() can be dropped, because
the same check already exists under the own_req mpj branch where the
code has been moved to.
Fixes:
|
||
|
|
21b0c54546 |
wifi: mac80211: fix integer overflow in hwmp_route_info_get()
commit d00c0c4105e5ab8a6a13ed23d701cceb285761fa upstream.
Since the new_metric and last_hop_metric variables can reach
the MAX_METRIC(0xffffffff) value, an integer overflow may occur
when multiplying them by 10/9. It can lead to incorrect behavior.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
Fixes:
|
||
|
|
92b6844279 |
mptcp: sockopt: fix getting IPV6_V6ONLY
commit 8c39633759885b6ff85f6d96cf445560e74df5e8 upstream.
When adding a socket option support in MPTCP, both the get and set parts
are supposed to be implemented.
IPV6_V6ONLY support for the setsockopt part has been added a while ago,
but it looks like the get part got forgotten. It should have been
present as a way to verify a setting has been set as expected, and not
to act differently from TCP or any other socket types.
Not supporting this getsockopt(IPV6_V6ONLY) blocks some apps which want
to check the default value, before doing extra actions. On Linux, the
default value is 0, but this can be changed with the net.ipv6.bindv6only
sysctl knob. On Windows, it is set to 1 by default. So supporting the
get part, like for all other socket options, is important.
Everything was in place to expose it, just the last step was missing.
Only new code is added to cover this specific getsockopt(), that seems
safe.
Fixes:
|
||
|
|
d5cba7730d |
bpf: support SKF_NET_OFF and SKF_LL_OFF on skb frags
[ Upstream commit d4bac0288a2b444e468e6df9cb4ed69479ddf14a ]
Classic BPF socket filters with SKB_NET_OFF and SKB_LL_OFF fail to
read when these offsets extend into frags.
This has been observed with iwlwifi and reproduced with tun with
IFF_NAPI_FRAGS. The below straightforward socket filter on UDP port,
applied to a RAW socket, will silently miss matching packets.
const int offset_proto = offsetof(struct ip6_hdr, ip6_nxt);
const int offset_dport = sizeof(struct ip6_hdr) + offsetof(struct udphdr, dest);
struct sock_filter filter_code[] = {
BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_AD_OFF + SKF_AD_PKTTYPE),
BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, PACKET_HOST, 0, 4),
BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_NET_OFF + offset_proto),
BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_UDP, 0, 2),
BPF_STMT(BPF_LD + BPF_H + BPF_ABS, SKF_NET_OFF + offset_dport),
This is unexpected behavior. Socket filter programs should be
consistent regardless of environment. Silent misses are
particularly concerning as hard to detect.
Use skb_copy_bits for offsets outside linear, same as done for
non-SKF_(LL|NET) offsets.
Offset is always positive after subtracting the reference threshold
SKB_(LL|NET)_OFF, so is always >= skb_(mac|network)_offset. The sum of
the two is an offset against skb->data, and may be negative, but it
cannot point before skb->head, as skb_(mac|network)_offset would too.
This appears to go back to when frag support was introduced to
sk_run_filter in linux-2.4.4, before the introduction of git.
The amount of code change and 8/16/32 bit duplication are unfortunate.
But any attempt I made to be smarter saved very few LoC while
complicating the code.
Fixes:
|
||
|
|
d537859e56 |
net: vlan: don't propagate flags on open
[ Upstream commit 27b918007d96402aba10ed52a6af8015230f1793 ] With the device instance lock, there is now a possibility of a deadlock: [ 1.211455] ============================================ [ 1.211571] WARNING: possible recursive locking detected [ 1.211687] 6.14.0-rc5-01215-g032756b4ca7a-dirty #5 Not tainted [ 1.211823] -------------------------------------------- [ 1.211936] ip/184 is trying to acquire lock: [ 1.212032] ffff8881024a4c30 (&dev->lock){+.+.}-{4:4}, at: dev_set_allmulti+0x4e/0xb0 [ 1.212207] [ 1.212207] but task is already holding lock: [ 1.212332] ffff8881024a4c30 (&dev->lock){+.+.}-{4:4}, at: dev_open+0x50/0xb0 [ 1.212487] [ 1.212487] other info that might help us debug this: [ 1.212626] Possible unsafe locking scenario: [ 1.212626] [ 1.212751] CPU0 [ 1.212815] ---- [ 1.212871] lock(&dev->lock); [ 1.212944] lock(&dev->lock); [ 1.213016] [ 1.213016] *** DEADLOCK *** [ 1.213016] [ 1.213143] May be due to missing lock nesting notation [ 1.213143] [ 1.213294] 3 locks held by ip/184: [ 1.213371] #0: ffffffff838b53e0 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_nets_lock+0x1b/0xa0 [ 1.213543] #1: ffffffff84e5fc70 (&net->rtnl_mutex){+.+.}-{4:4}, at: rtnl_nets_lock+0x37/0xa0 [ 1.213727] #2: ffff8881024a4c30 (&dev->lock){+.+.}-{4:4}, at: dev_open+0x50/0xb0 [ 1.213895] [ 1.213895] stack backtrace: [ 1.213991] CPU: 0 UID: 0 PID: 184 Comm: ip Not tainted 6.14.0-rc5-01215-g032756b4ca7a-dirty #5 [ 1.213993] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 1.213994] Call Trace: [ 1.213995] <TASK> [ 1.213996] dump_stack_lvl+0x8e/0xd0 [ 1.214000] print_deadlock_bug+0x28b/0x2a0 [ 1.214020] lock_acquire+0xea/0x2a0 [ 1.214027] __mutex_lock+0xbf/0xd40 [ 1.214038] dev_set_allmulti+0x4e/0xb0 # real_dev->flags & IFF_ALLMULTI [ 1.214040] vlan_dev_open+0xa5/0x170 # ndo_open on vlandev [ 1.214042] __dev_open+0x145/0x270 [ 1.214046] __dev_change_flags+0xb0/0x1e0 [ 1.214051] netif_change_flags+0x22/0x60 # IFF_UP vlandev [ 1.214053] dev_change_flags+0x61/0xb0 # for each device in group from dev->vlan_info [ 1.214055] vlan_device_event+0x766/0x7c0 # on netdevsim0 [ 1.214058] notifier_call_chain+0x78/0x120 [ 1.214062] netif_open+0x6d/0x90 [ 1.214064] dev_open+0x5b/0xb0 # locks netdevsim0 [ 1.214066] bond_enslave+0x64c/0x1230 [ 1.214075] do_set_master+0x175/0x1e0 # on netdevsim0 [ 1.214077] do_setlink+0x516/0x13b0 [ 1.214094] rtnl_newlink+0xaba/0xb80 [ 1.214132] rtnetlink_rcv_msg+0x440/0x490 [ 1.214144] netlink_rcv_skb+0xeb/0x120 [ 1.214150] netlink_unicast+0x1f9/0x320 [ 1.214153] netlink_sendmsg+0x346/0x3f0 [ 1.214157] __sock_sendmsg+0x86/0xb0 [ 1.214160] ____sys_sendmsg+0x1c8/0x220 [ 1.214164] ___sys_sendmsg+0x28f/0x2d0 [ 1.214179] __x64_sys_sendmsg+0xef/0x140 [ 1.214184] do_syscall_64+0xec/0x1d0 [ 1.214190] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 1.214191] RIP: 0033:0x7f2d1b4a7e56 Device setup: netdevsim0 (down) ^ ^ bond netdevsim1.100@netdevsim1 allmulticast=on (down) When we enslave the lower device (netdevsim0) which has a vlan, we propagate vlan's allmuti/promisc flags during ndo_open. This causes (re)locking on of the real_dev. Propagate allmulti/promisc on flags change, not on the open. There is a slight semantics change that vlans that are down now propagate the flags, but this seems unlikely to result in the real issues. Reproducer: echo 0 1 > /sys/bus/netdevsim/new_device dev_path=$(ls -d /sys/bus/netdevsim/devices/netdevsim0/net/*) dev=$(echo $dev_path | rev | cut -d/ -f1 | rev) ip link set dev $dev name netdevsim0 ip link set dev netdevsim0 up ip link add link netdevsim0 name netdevsim0.100 type vlan id 100 ip link set dev netdevsim0.100 allmulticast on down ip link add name bond1 type bond mode 802.3ad ip link set dev netdevsim0 down ip link set dev netdevsim0 master bond1 ip link set dev bond1 up ip link show Reported-by: syzbot+b0c03d76056ef6cd12a6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/Z9CfXjLMKn6VLG5d@mini-arch/T/#m15ba130f53227c883e79fb969687d69d670337a0 Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250313100657.2287455-1-sdf@fomichev.me Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
95f17738b8 |
page_pool: avoid infinite loop to schedule delayed worker
[ Upstream commit 43130d02baa137033c25297aaae95fd0edc41654 ] We noticed the kworker in page_pool_release_retry() was waken up repeatedly and infinitely in production because of the buggy driver causing the inflight less than 0 and warning us in page_pool_inflight()[1]. Since the inflight value goes negative, it means we should not expect the whole page_pool to get back to work normally. This patch mitigates the adverse effect by not rescheduling the kworker when detecting the inflight negative in page_pool_release_retry(). [1] [Mon Feb 10 20:36:11 2025] ------------[ cut here ]------------ [Mon Feb 10 20:36:11 2025] Negative(-51446) inflight packet-pages ... [Mon Feb 10 20:36:11 2025] Call Trace: [Mon Feb 10 20:36:11 2025] page_pool_release_retry+0x23/0x70 [Mon Feb 10 20:36:11 2025] process_one_work+0x1b1/0x370 [Mon Feb 10 20:36:11 2025] worker_thread+0x37/0x3a0 [Mon Feb 10 20:36:11 2025] kthread+0x11a/0x140 [Mon Feb 10 20:36:11 2025] ? process_one_work+0x370/0x370 [Mon Feb 10 20:36:11 2025] ? __kthread_cancel_work+0x40/0x40 [Mon Feb 10 20:36:11 2025] ret_from_fork+0x35/0x40 [Mon Feb 10 20:36:11 2025] ---[ end trace ebffe800f33e7e34 ]--- Note: before this patch, the above calltrace would flood the dmesg due to repeated reschedule of release_dw kworker. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250214064250.85987-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
90cdd7e5a4 |
nft_set_pipapo: fix incorrect avx2 match of 5th field octet
[ Upstream commit e042ed950d4e176379ba4c0722146cd96fb38aa2 ]
Given a set element like:
icmpv6 . dead:beef:00ff::1
The value of 'ff' is irrelevant, any address will be matched
as long as the other octets are the same.
This is because of too-early register clobbering:
ymm7 is reloaded with new packet data (pkt[9]) but it still holds data
of an earlier load that wasn't processed yet.
The existing tests in nft_concat_range.sh selftests do exercise this code
path, but do not trigger incorrect matching due to the network prefix
limitation.
Fixes:
|
||
|
|
6509e2e17d |
ipv6: Align behavior across nexthops during path selection
[ Upstream commit 6933cd4714861eea6848f18396a119d741f25fc3 ]
A nexthop is only chosen when the calculated multipath hash falls in the
nexthop's hash region (i.e., the hash is smaller than the nexthop's hash
threshold) and when the nexthop is assigned a non-negative score by
rt6_score_route().
Commit 4d0ab3a6885e ("ipv6: Start path selection from the first
nexthop") introduced an unintentional difference between the first
nexthop and the rest when the score is negative.
When the first nexthop matches, but has a negative score, the code will
currently evaluate subsequent nexthops until one is found with a
non-negative score. On the other hand, when a different nexthop matches,
but has a negative score, the code will fallback to the nexthop with
which the selection started ('match').
Align the behavior across all nexthops and fallback to 'match' when the
first nexthop matches, but has a negative score.
Fixes:
|
||
|
|
d2718324f9 |
net_sched: sch_sfq: move the limit validation
[ Upstream commit b3bf8f63e6179076b57c9de660c9f80b5abefe70 ] It is not sufficient to directly validate the limit on the data that the user passes as it can be updated based on how the other parameters are changed. Move the check at the end of the configuration update process to also catch scenarios where the limit is indirectly updated, for example with the following configurations: tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 1 depth 1 tc qdisc add dev dummy0 handle 1: root sfq limit 2 flows 1 divisor 1 This fixes the following syzkaller reported crash: ------------[ cut here ]------------ UBSAN: array-index-out-of-bounds in net/sched/sch_sfq.c:203:6 index 65535 is out of range for type 'struct sfq_head[128]' CPU: 1 UID: 0 PID: 3037 Comm: syz.2.16 Not tainted 6.14.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/27/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x201/0x300 lib/dump_stack.c:120 ubsan_epilogue lib/ubsan.c:231 [inline] __ubsan_handle_out_of_bounds+0xf5/0x120 lib/ubsan.c:429 sfq_link net/sched/sch_sfq.c:203 [inline] sfq_dec+0x53c/0x610 net/sched/sch_sfq.c:231 sfq_dequeue+0x34e/0x8c0 net/sched/sch_sfq.c:493 sfq_reset+0x17/0x60 net/sched/sch_sfq.c:518 qdisc_reset+0x12e/0x600 net/sched/sch_generic.c:1035 tbf_reset+0x41/0x110 net/sched/sch_tbf.c:339 qdisc_reset+0x12e/0x600 net/sched/sch_generic.c:1035 dev_reset_queue+0x100/0x1b0 net/sched/sch_generic.c:1311 netdev_for_each_tx_queue include/linux/netdevice.h:2590 [inline] dev_deactivate_many+0x7e5/0xe70 net/sched/sch_generic.c:1375 Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 10685681bafc ("net_sched: sch_sfq: don't allow 1 packet limit") Signed-off-by: Octavian Purdila <tavip@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
00d44fe29e |
net_sched: sch_sfq: use a temporary work area for validating configuration
[ Upstream commit 8c0cea59d40cf6dd13c2950437631dd614fbade6 ] Many configuration parameters have influence on others (e.g. divisor -> flows -> limit, depth -> limit) and so it is difficult to correctly do all of the validation before applying the configuration. And if a validation error is detected late it is difficult to roll back a partially applied configuration. To avoid these issues use a temporary work area to update and validate the configuration and only then apply the configuration to the internal state. Signed-off-by: Octavian Purdila <tavip@google.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: b3bf8f63e617 ("net_sched: sch_sfq: move the limit validation") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
c5ed0eaddc |
net: ethtool: Don't call .cleanup_data when prepare_data fails
[ Upstream commit 4f038a6a02d20859a3479293cbf172b0f14cbdd6 ]
There's a consistent pattern where the .cleanup_data() callback is
called when .prepare_data() fails, when it should really be called to
clean after a successful .prepare_data() as per the documentation.
Rewrite the error-handling paths to make sure we don't cleanup
un-prepared data.
Fixes:
|
||
|
|
8b158a0d19 |
tc: Ensure we have enough buffer space when sending filter netlink notifications
[ Upstream commit 369609fc6272c2f6ad666ba4fd913f3baf32908f ] The tfilter_notify() and tfilter_del_notify() functions assume that NLMSG_GOODSIZE is always enough to dump the filter chain. This is not always the case, which can lead to silent notify failures (because the return code of tfilter_notify() is not always checked). In particular, this can lead to NLM_F_ECHO not being honoured even though an action succeeds, which forces userspace to create workarounds[0]. Fix this by increasing the message size if dumping the filter chain into the allocated skb fails. Use the size of the incoming skb as a size hint if set, so we can start at a larger value when appropriate. To trigger this, run the following commands: # ip link add type veth # tc qdisc replace dev veth0 root handle 1: fq_codel # tc -echo filter add dev veth0 parent 1: u32 match u32 0 0 $(for i in $(seq 32); do echo action pedit munge ip dport set 22; done) Before this fix, tc just returns: Not a filter(cmd 2) After the fix, we get the correct echo: added filter dev veth0 parent 1: protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid not_in_hw match 00000000/00000000 at 0 action order 1: pedit action pass keys 1 index 1 ref 1 bind 1 key #0 at 20: val 00000016 mask ffff0000 [repeated 32 times] [0] |
||
|
|
28f2cd143b |
net/sched: cls_api: conditional notification of events
[ Upstream commit 93775590b1ee98bf2976b1f4a1ed24e9ff76170f ] As of today tc-filter/chain events are unconditionally built and sent to RTNLGRP_TC. As with the introduction of rtnl_notify_needed we can check before-hand if they are really needed. This will help to alleviate system pressure when filters are concurrently added without the rtnl lock as in tc-flower. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Link: https://lore.kernel.org/r/20231208192847.714940-8-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 369609fc6272 ("tc: Ensure we have enough buffer space when sending filter netlink notifications") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
2bcad8fefc |
net: tls: explicitly disallow disconnect
[ Upstream commit 5071a1e606b30c0c11278d3c6620cd6a24724cf6 ]
syzbot discovered that it can disconnect a TLS socket and then
run into all sort of unexpected corner cases. I have a vague
recollection of Eric pointing this out to us a long time ago.
Supporting disconnect is really hard, for one thing if offload
is enabled we'd need to wait for all packets to be _acked_.
Disconnect is not commonly used, disallow it.
The immediate problem syzbot run into is the warning in the strp,
but that's just the easiest bug to trigger:
WARNING: CPU: 0 PID: 5834 at net/tls/tls_strp.c:486 tls_strp_msg_load+0x72e/0xa80 net/tls/tls_strp.c:486
RIP: 0010:tls_strp_msg_load+0x72e/0xa80 net/tls/tls_strp.c:486
Call Trace:
<TASK>
tls_rx_rec_wait+0x280/0xa60 net/tls/tls_sw.c:1363
tls_sw_recvmsg+0x85c/0x1c30 net/tls/tls_sw.c:2043
inet6_recvmsg+0x2c9/0x730 net/ipv6/af_inet6.c:678
sock_recvmsg_nosec net/socket.c:1023 [inline]
sock_recvmsg+0x109/0x280 net/socket.c:1045
__sys_recvfrom+0x202/0x380 net/socket.c:2237
Fixes:
|
||
|
|
2f9761a94b |
codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()
[ Upstream commit 342debc12183b51773b3345ba267e9263bdfaaef ] After making all ->qlen_notify() callbacks idempotent, now it is safe to remove the check of qlen!=0 from both fq_codel_dequeue() and codel_qdisc_dequeue(). Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg> Fixes: |
||
|
|
09c2dcda2c |
tipc: fix memory leak in tipc_link_xmit
[ Upstream commit 69ae94725f4fc9e75219d2d69022029c5b24bc9a ]
In case the backlog transmit queue for system-importance messages is overloaded,
tipc_link_xmit() returns -ENOBUFS but the skb list is not purged. This leads to
memory leak and failure when a skb is allocated.
This commit fixes this issue by purging the skb list before tipc_link_xmit()
returns.
Fixes:
|
||
|
|
fa2f9fc35f |
ipv6: Do not consider link down nexthops in path selection
[ Upstream commit 8b8e0dd357165e0258d9f9cdab5366720ed2f619 ]
Nexthops whose link is down are not supposed to be considered during
path selection when the "ignore_routes_with_linkdown" sysctl is set.
This is done by assigning them a negative region boundary.
However, when comparing the computed hash (unsigned) with the region
boundary (signed), the negative region boundary is treated as unsigned,
resulting in incorrect nexthop selection.
Fix by treating the computed hash as signed. Note that the computed hash
is always in range of [0, 2^31 - 1].
Fixes:
|
||
|
|
21f678f672 |
ipv6: Start path selection from the first nexthop
[ Upstream commit 4d0ab3a6885e3e9040310a8d8f54503366083626 ]
Cited commit transitioned IPv6 path selection to use hash-threshold
instead of modulo-N. With hash-threshold, each nexthop is assigned a
region boundary in the multipath hash function's output space and a
nexthop is chosen if the calculated hash is smaller than the nexthop's
region boundary.
Hash-threshold does not work correctly if path selection does not start
with the first nexthop. For example, if fib6_select_path() is always
passed the last nexthop in the group, then it will always be chosen
because its region boundary covers the entire hash function's output
space.
Fix this by starting the selection process from the first nexthop and do
not consider nexthops for which rt6_score_route() provided a negative
score.
Fixes:
|
||
|
|
5a2976cc4d |
net: fix geneve_opt length integer overflow
[ Upstream commit b27055a08ad4b415dcf15b63034f9cb236f7fb40 ] struct geneve_opt uses 5 bit length for each single option, which means every vary size option should be smaller than 128 bytes. However, all current related Netlink policies cannot promise this length condition and the attacker can exploit a exact 128-byte size option to *fake* a zero length option and confuse the parsing logic, further achieve heap out-of-bounds read. One example crash log is like below: [ 3.905425] ================================================================== [ 3.905925] BUG: KASAN: slab-out-of-bounds in nla_put+0xa9/0xe0 [ 3.906255] Read of size 124 at addr ffff888005f291cc by task poc/177 [ 3.906646] [ 3.906775] CPU: 0 PID: 177 Comm: poc-oob-read Not tainted 6.1.132 #1 [ 3.907131] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 3.907784] Call Trace: [ 3.907925] <TASK> [ 3.908048] dump_stack_lvl+0x44/0x5c [ 3.908258] print_report+0x184/0x4be [ 3.909151] kasan_report+0xc5/0x100 [ 3.909539] kasan_check_range+0xf3/0x1a0 [ 3.909794] memcpy+0x1f/0x60 [ 3.909968] nla_put+0xa9/0xe0 [ 3.910147] tunnel_key_dump+0x945/0xba0 [ 3.911536] tcf_action_dump_1+0x1c1/0x340 [ 3.912436] tcf_action_dump+0x101/0x180 [ 3.912689] tcf_exts_dump+0x164/0x1e0 [ 3.912905] fw_dump+0x18b/0x2d0 [ 3.913483] tcf_fill_node+0x2ee/0x460 [ 3.914778] tfilter_notify+0xf4/0x180 [ 3.915208] tc_new_tfilter+0xd51/0x10d0 [ 3.918615] rtnetlink_rcv_msg+0x4a2/0x560 [ 3.919118] netlink_rcv_skb+0xcd/0x200 [ 3.919787] netlink_unicast+0x395/0x530 [ 3.921032] netlink_sendmsg+0x3d0/0x6d0 [ 3.921987] __sock_sendmsg+0x99/0xa0 [ 3.922220] __sys_sendto+0x1b7/0x240 [ 3.922682] __x64_sys_sendto+0x72/0x90 [ 3.922906] do_syscall_64+0x5e/0x90 [ 3.923814] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 3.924122] RIP: 0033:0x7e83eab84407 [ 3.924331] Code: 48 89 fa 4c 89 df e8 38 aa 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 faf [ 3.925330] RSP: 002b:00007ffff505e370 EFLAGS: 00000202 ORIG_RAX: 000000000000002c [ 3.925752] RAX: ffffffffffffffda RBX: 00007e83eaafa740 RCX: 00007e83eab84407 [ 3.926173] RDX: 00000000000001a8 RSI: 00007ffff505e3c0 RDI: 0000000000000003 [ 3.926587] RBP: 00007ffff505f460 R08: 00007e83eace1000 R09: 000000000000000c [ 3.926977] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffff505f3c0 [ 3.927367] R13: 00007ffff505f5c8 R14: 00007e83ead1b000 R15: 00005d4fbbe6dcb8 Fix these issues by enforing correct length condition in related policies. Fixes: |
||
|
|
fbab7bbf72 |
ipv6: fix omitted netlink attributes when using RTEXT_FILTER_SKIP_STATS
[ Upstream commit 7ac6ea4a3e0898db76aecccd68fb2c403eb7d24e ]
Using RTEXT_FILTER_SKIP_STATS is incorrectly skipping non-stats IPv6
netlink attributes on link dump. This causes issues on userspace tools,
e.g iproute2 is not rendering address generation mode as it should due
to missing netlink attribute.
Move the filling of IFLA_INET6_STATS and IFLA_INET6_ICMP6STATS to a
helper function guarded by a flag check to avoid hitting the same
situation in the future.
Fixes:
|
||
|
|
28d88ee1e1 |
netfilter: nft_tunnel: fix geneve_opt type confusion addition
[ Upstream commit 1b755d8eb1ace3870789d48fbd94f386ad6e30be ]
When handling multiple NFTA_TUNNEL_KEY_OPTS_GENEVE attributes, the
parsing logic should place every geneve_opt structure one by one
compactly. Hence, when deciding the next geneve_opt position, the
pointer addition should be in units of char *.
However, the current implementation erroneously does type conversion
before the addition, which will lead to heap out-of-bounds write.
[ 6.989857] ==================================================================
[ 6.990293] BUG: KASAN: slab-out-of-bounds in nft_tunnel_obj_init+0x977/0xa70
[ 6.990725] Write of size 124 at addr ffff888005f18974 by task poc/178
[ 6.991162]
[ 6.991259] CPU: 0 PID: 178 Comm: poc-oob-write Not tainted 6.1.132 #1
[ 6.991655] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 6.992281] Call Trace:
[ 6.992423] <TASK>
[ 6.992586] dump_stack_lvl+0x44/0x5c
[ 6.992801] print_report+0x184/0x4be
[ 6.993790] kasan_report+0xc5/0x100
[ 6.994252] kasan_check_range+0xf3/0x1a0
[ 6.994486] memcpy+0x38/0x60
[ 6.994692] nft_tunnel_obj_init+0x977/0xa70
[ 6.995677] nft_obj_init+0x10c/0x1b0
[ 6.995891] nf_tables_newobj+0x585/0x950
[ 6.996922] nfnetlink_rcv_batch+0xdf9/0x1020
[ 6.998997] nfnetlink_rcv+0x1df/0x220
[ 6.999537] netlink_unicast+0x395/0x530
[ 7.000771] netlink_sendmsg+0x3d0/0x6d0
[ 7.001462] __sock_sendmsg+0x99/0xa0
[ 7.001707] ____sys_sendmsg+0x409/0x450
[ 7.002391] ___sys_sendmsg+0xfd/0x170
[ 7.003145] __sys_sendmsg+0xea/0x170
[ 7.004359] do_syscall_64+0x5e/0x90
[ 7.005817] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 7.006127] RIP: 0033:0x7ec756d4e407
[ 7.006339] Code: 48 89 fa 4c 89 df e8 38 aa 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 faf
[ 7.007364] RSP: 002b:00007ffed5d46760 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 7.007827] RAX: ffffffffffffffda RBX: 00007ec756cc4740 RCX: 00007ec756d4e407
[ 7.008223] RDX: 0000000000000000 RSI: 00007ffed5d467f0 RDI: 0000000000000003
[ 7.008620] RBP: 00007ffed5d468a0 R08: 0000000000000000 R09: 0000000000000000
[ 7.009039] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 7.009429] R13: 00007ffed5d478b0 R14: 00007ec756ee5000 R15: 00005cbd4e655cb8
Fix this bug with correct pointer addition and conversion in parse
and dump code.
Fixes:
|
||
|
|
ccc331fd5b |
net: decrease cached dst counters in dst_release
[ Upstream commit 3a0a3ff6593d670af2451ec363ccb7b18aec0c0a ]
Upstream fix ac888d58869b ("net: do not delay dst_entries_add() in
dst_release()") moved decrementing the dst count from dst_destroy to
dst_release to avoid accessing already freed data in case of netns
dismantle. However in case CONFIG_DST_CACHE is enabled and OvS+tunnels
are used, this fix is incomplete as the same issue will be seen for
cached dsts:
Unable to handle kernel paging request at virtual address ffff5aabf6b5c000
Call trace:
percpu_counter_add_batch+0x3c/0x160 (P)
dst_release+0xec/0x108
dst_cache_destroy+0x68/0xd8
dst_destroy+0x13c/0x168
dst_destroy_rcu+0x1c/0xb0
rcu_do_batch+0x18c/0x7d0
rcu_core+0x174/0x378
rcu_core_si+0x18/0x30
Fix this by invalidating the cache, and thus decrementing cached dst
counters, in dst_release too.
Fixes:
|
||
|
|
8586953246 |
tunnels: Accept PACKET_HOST in skb_tunnel_check_pmtu().
[ Upstream commit 8930424777e43257f5bf6f0f0f53defd0d30415c ]
Because skb_tunnel_check_pmtu() doesn't handle PACKET_HOST packets,
commit 30a92c9e3d6b ("openvswitch: Set the skbuff pkt_type for proper
pmtud support.") forced skb->pkt_type to PACKET_OUTGOING for
openvswitch packets that are sent using the OVS_ACTION_ATTR_OUTPUT
action. This allowed such packets to invoke the
iptunnel_pmtud_check_icmp() or iptunnel_pmtud_check_icmpv6() helpers
and thus trigger PMTU update on the input device.
However, this also broke other parts of PMTU discovery. Since these
packets don't have the PACKET_HOST type anymore, they won't trigger the
sending of ICMP Fragmentation Needed or Packet Too Big messages to
remote hosts when oversized (see the skb_in->pkt_type condition in
__icmp_send() for example).
These two skb->pkt_type checks are therefore incompatible as one
requires skb->pkt_type to be PACKET_HOST, while the other requires it
to be anything but PACKET_HOST.
It makes sense to not trigger ICMP messages for non-PACKET_HOST packets
as these messages should be generated only for incoming l2-unicast
packets. However there doesn't seem to be any reason for
skb_tunnel_check_pmtu() to ignore PACKET_HOST packets.
Allow both cases to work by allowing skb_tunnel_check_pmtu() to work on
PACKET_HOST packets and not overriding skb->pkt_type in openvswitch
anymore.
Fixes: 30a92c9e3d6b ("openvswitch: Set the skbuff pkt_type for proper pmtud support.")
Fixes:
|
||
|
|
b0a1055e0a |
vsock: avoid timeout during connect() if the socket is closing
[ Upstream commit fccd2b711d9628c7ce0111d5e4938652101ee30a ]
When a peer attempts to establish a connection, vsock_connect() contains
a loop that waits for the state to be TCP_ESTABLISHED. However, the
other peer can be fast enough to accept the connection and close it
immediately, thus moving the state to TCP_CLOSING.
When this happens, the peer in the vsock_connect() is properly woken up,
but since the state is not TCP_ESTABLISHED, it goes back to sleep
until the timeout expires, returning -ETIMEDOUT.
If the socket state is TCP_CLOSING, waiting for the timeout is pointless.
vsock_connect() can return immediately without errors or delay since the
connection actually happened. The socket will be in a closing state,
but this is not an issue, and subsequent calls will fail as expected.
We discovered this issue while developing a test that accepts and
immediately closes connections to stress the transport switch between
two connect() calls, where the first one was interrupted by a signal
(see Closes link).
Reported-by: Luigi Leonardi <leonardi@redhat.com>
Closes: https://lore.kernel.org/virtualization/bq6hxrolno2vmtqwcvb5bljfpb7mvwb3kohrvaed6auz5vxrfv@ijmd2f3grobn/
Fixes:
|
||
|
|
aeef645669 |
udp: Fix memory accounting leak.
[ Upstream commit df207de9d9e7a4d92f8567e2c539d9c8c12fd99d ]
Matt Dowling reported a weird UDP memory usage issue.
Under normal operation, the UDP memory usage reported in /proc/net/sockstat
remains close to zero. However, it occasionally spiked to 524,288 pages
and never dropped. Moreover, the value doubled when the application was
terminated. Finally, it caused intermittent packet drops.
We can reproduce the issue with the script below [0]:
1. /proc/net/sockstat reports 0 pages
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 1 mem 0
2. Run the script till the report reaches 524,288
# python3 test.py & sleep 5
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 3 mem 524288 <-- (INT_MAX + 1) >> PAGE_SHIFT
3. Kill the socket and confirm the number never drops
# pkill python3 && sleep 5
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 1 mem 524288
4. (necessary since v6.0) Trigger proto_memory_pcpu_drain()
# python3 test.py & sleep 1 && pkill python3
5. The number doubles
# cat /proc/net/sockstat | grep UDP:
UDP: inuse 1 mem 1048577
The application set INT_MAX to SO_RCVBUF, which triggered an integer
overflow in udp_rmem_release().
When a socket is close()d, udp_destruct_common() purges its receive
queue and sums up skb->truesize in the queue. This total is calculated
and stored in a local unsigned integer variable.
The total size is then passed to udp_rmem_release() to adjust memory
accounting. However, because the function takes a signed integer
argument, the total size can wrap around, causing an overflow.
Then, the released amount is calculated as follows:
1) Add size to sk->sk_forward_alloc.
2) Round down sk->sk_forward_alloc to the nearest lower multiple of
PAGE_SIZE and assign it to amount.
3) Subtract amount from sk->sk_forward_alloc.
4) Pass amount >> PAGE_SHIFT to __sk_mem_reduce_allocated().
When the issue occurred, the total in udp_destruct_common() was 2147484480
(INT_MAX + 833), which was cast to -2147482816 in udp_rmem_release().
At 1) sk->sk_forward_alloc is changed from 3264 to -2147479552, and
2) sets -2147479552 to amount. 3) reverts the wraparound, so we don't
see a warning in inet_sock_destruct(). However, udp_memory_allocated
ends up doubling at 4).
Since commit
|
||
|
|
864ca690ff |
net_sched: skbprio: Remove overly strict queue assertions
[ Upstream commit ce8fe975fd99b49c29c42e50f2441ba53112b2e8 ]
In the current implementation, skbprio enqueue/dequeue contains an assertion
that fails under certain conditions when SKBPRIO is used as a child qdisc under
TBF with specific parameters. The failure occurs because TBF sometimes peeks at
packets in the child qdisc without actually dequeuing them when tokens are
unavailable.
This peek operation creates a discrepancy between the parent and child qdisc
queue length counters. When TBF later receives a high-priority packet,
SKBPRIO's queue length may show a different value than what's reflected in its
internal priority queue tracking, triggering the assertion.
The fix removes this overly strict assertions in SKBPRIO, they are not
necessary at all.
Reported-by: syzbot+a3422a19b05ea96bee18@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=a3422a19b05ea96bee18
Fixes:
|
||
|
|
1927d0bcd5 |
netlabel: Fix NULL pointer exception caused by CALIPSO on IPv4 sockets
[ Upstream commit 078aabd567de3d63d37d7673f714e309d369e6e2 ]
When calling netlbl_conn_setattr(), addr->sa_family is used
to determine the function behavior. If sk is an IPv4 socket,
but the connect function is called with an IPv6 address,
the function calipso_sock_setattr() is triggered.
Inside this function, the following code is executed:
sk_fullsock(__sk) ? inet_sk(__sk)->pinet6 : NULL;
Since sk is an IPv4 socket, pinet6 is NULL, leading to a
null pointer dereference.
This patch fixes the issue by checking if inet6_sk(sk)
returns a NULL pointer before accessing pinet6.
Signed-off-by: Debin Zhu <mowenroot@163.com>
Signed-off-by: Bitao Ouyang <1985755126@qq.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Fixes:
|
||
|
|
6134d1ea1e |
netfilter: nf_tables: don't unregister hook when table is dormant
[ Upstream commit 688c15017d5cd5aac882400782e7213d40dc3556 ]
When nf_tables_updchain encounters an error, hook registration needs to
be rolled back.
This should only be done if the hook has been registered, which won't
happen when the table is flagged as dormant (inactive).
Just move the assignment into the registration block.
Reported-by: syzbot+53ed3a6440173ddbf499@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=53ed3a6440173ddbf499
Fixes:
|
||
|
|
073b04796c |
netfilter: nft_set_hash: GC reaps elements with conncount for dynamic sets only
[ Upstream commit 9d74da1177c800eb3d51c13f9821b7b0683845a5 ]
conncount has its own GC handler which determines when to reap stale
elements, this is convenient for dynamic sets. However, this also reaps
non-dynamic sets with static configurations coming from control plane.
Always run connlimit gc handler but honor feedback to reap element if
this set is dynamic.
Fixes:
|
||
|
|
68adc6f17a |
can: statistics: use atomic access in hot path
[ Upstream commit 80b5f90158d1364cbd80ad82852a757fc0692bf2 ] In can_send() and can_receive() CAN messages and CAN filter matches are counted to be visible in the CAN procfs files. KCSAN detected a data race within can_send() when two CAN frames have been generated by a timer event writing to the same CAN netdevice at the same time. Use atomic operations to access the statistics in the hot path to fix the KCSAN complaint. Reported-by: syzbot+78ce4489b812515d5e4d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67cd717d.050a0220.e1a89.0006.GAE@google.com Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Link: https://patch.msgid.link/20250310143353.3242-1-socketcan@hartkopp.net Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
e87b8f209c |
wifi: mac80211: flush the station before moving it to UN-AUTHORIZED state
[ Upstream commit 43e04077170799d0e6289f3e928f727e401b3d79 ] We first want to flush the station to make sure we no longer have any frames being Tx by the station before the station is moved to un-authorized state. Failing to do that will lead to races: a frame may be sent after the station's state has been changed. Since the API clearly states that the driver can't fail the sta_state() transition down the list of state, we can easily flush the station first, and only then call the driver's sta_state(). Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250306123626.450bc40e8b04.I636ba96843c77f13309c15c9fd6eb0c5a52a7976@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
15f150771e |
rtnetlink: Allocate vfinfo size for VF GUIDs when supported
[ Upstream commit 23f00807619d15063d676218f36c5dfeda1eb420 ] Commit |
||
|
|
5251041573 |
netfilter: socket: Lookup orig tuple for IPv6 SNAT
commit 932b32ffd7604fb00b5c57e239a3cc4d901ccf6e upstream.
nf_sk_lookup_slow_v4 does the conntrack lookup for IPv4 packets to
restore the original 5-tuple in case of SNAT, to be able to find the
right socket (if any). Then socket_match() can correctly check whether
the socket was transparent.
However, the IPv6 counterpart (nf_sk_lookup_slow_v6) lacks this
conntrack lookup, making xt_socket fail to match on the socket when the
packet was SNATed. Add the same logic to nf_sk_lookup_slow_v6.
IPv6 SNAT is used in Kubernetes clusters for pod-to-world packets, as
pods' addresses are in the fd00::/8 ULA subnet and need to be replaced
with the node's external address. Cilium leverages Envoy to enforce L7
policies, and Envoy uses transparent sockets. Cilium inserts an iptables
prerouting rule that matches on `-m socket --transparent` and redirects
the packets to localhost, but it fails to match SNATed IPv6 packets due
to that missing conntrack lookup.
Closes: https://github.com/cilium/cilium/issues/37932
Fixes:
|
||
|
|
9da6b6340d |
atm: Fix NULL pointer dereference
commit bf2986fcf82a449441f9ee4335df19be19e83970 upstream.
When MPOA_cache_impos_rcvd() receives the msg, it can trigger
Null Pointer Dereference Vulnerability if both entry and
holding_time are NULL. Because there is only for the situation
where entry is NULL and holding_time exists, it can be passed
when both entry and holding_time are NULL. If these are NULL,
the entry will be passd to eg_cache_put() as parameter and
it is referenced by entry->use code in it.
kasan log:
[ 3.316691] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006:I
[ 3.317568] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
[ 3.318188] CPU: 3 UID: 0 PID: 79 Comm: ex Not tainted 6.14.0-rc2 #102
[ 3.318601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 3.319298] RIP: 0010:eg_cache_remove_entry+0xa5/0x470
[ 3.319677] Code: c1 f7 6e fd 48 c7 c7 00 7e 38 b2 e8 95 64 54 fd 48 c7 c7 40 7e 38 b2 48 89 ee e80
[ 3.321220] RSP: 0018:ffff88800583f8a8 EFLAGS: 00010006
[ 3.321596] RAX: 0000000000000006 RBX: ffff888005989000 RCX: ffffffffaecc2d8e
[ 3.322112] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000030
[ 3.322643] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff6558b88
[ 3.323181] R10: 0000000000000003 R11: 203a207972746e65 R12: 1ffff11000b07f15
[ 3.323707] R13: dffffc0000000000 R14: ffff888005989000 R15: ffff888005989068
[ 3.324185] FS: 000000001b6313c0(0000) GS:ffff88806d380000(0000) knlGS:0000000000000000
[ 3.325042] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3.325545] CR2: 00000000004b4b40 CR3: 000000000248e000 CR4: 00000000000006f0
[ 3.326430] Call Trace:
[ 3.326725] <TASK>
[ 3.326927] ? die_addr+0x3c/0xa0
[ 3.327330] ? exc_general_protection+0x161/0x2a0
[ 3.327662] ? asm_exc_general_protection+0x26/0x30
[ 3.328214] ? vprintk_emit+0x15e/0x420
[ 3.328543] ? eg_cache_remove_entry+0xa5/0x470
[ 3.328910] ? eg_cache_remove_entry+0x9a/0x470
[ 3.329294] ? __pfx_eg_cache_remove_entry+0x10/0x10
[ 3.329664] ? console_unlock+0x107/0x1d0
[ 3.329946] ? __pfx_console_unlock+0x10/0x10
[ 3.330283] ? do_syscall_64+0xa6/0x1a0
[ 3.330584] ? entry_SYSCALL_64_after_hwframe+0x47/0x7f
[ 3.331090] ? __pfx_prb_read_valid+0x10/0x10
[ 3.331395] ? down_trylock+0x52/0x80
[ 3.331703] ? vprintk_emit+0x15e/0x420
[ 3.331986] ? __pfx_vprintk_emit+0x10/0x10
[ 3.332279] ? down_trylock+0x52/0x80
[ 3.332527] ? _printk+0xbf/0x100
[ 3.332762] ? __pfx__printk+0x10/0x10
[ 3.333007] ? _raw_write_lock_irq+0x81/0xe0
[ 3.333284] ? __pfx__raw_write_lock_irq+0x10/0x10
[ 3.333614] msg_from_mpoad+0x1185/0x2750
[ 3.333893] ? __build_skb_around+0x27b/0x3a0
[ 3.334183] ? __pfx_msg_from_mpoad+0x10/0x10
[ 3.334501] ? __alloc_skb+0x1c0/0x310
[ 3.334809] ? __pfx___alloc_skb+0x10/0x10
[ 3.335283] ? _raw_spin_lock+0xe0/0xe0
[ 3.335632] ? finish_wait+0x8d/0x1e0
[ 3.335975] vcc_sendmsg+0x684/0xba0
[ 3.336250] ? __pfx_vcc_sendmsg+0x10/0x10
[ 3.336587] ? __pfx_autoremove_wake_function+0x10/0x10
[ 3.337056] ? fdget+0x176/0x3e0
[ 3.337348] __sys_sendto+0x4a2/0x510
[ 3.337663] ? __pfx___sys_sendto+0x10/0x10
[ 3.337969] ? ioctl_has_perm.constprop.0.isra.0+0x284/0x400
[ 3.338364] ? sock_ioctl+0x1bb/0x5a0
[ 3.338653] ? __rseq_handle_notify_resume+0x825/0xd20
[ 3.339017] ? __pfx_sock_ioctl+0x10/0x10
[ 3.339316] ? __pfx___rseq_handle_notify_resume+0x10/0x10
[ 3.339727] ? selinux_file_ioctl+0xa4/0x260
[ 3.340166] __x64_sys_sendto+0xe0/0x1c0
[ 3.340526] ? syscall_exit_to_user_mode+0x123/0x140
[ 3.340898] do_syscall_64+0xa6/0x1a0
[ 3.341170] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 3.341533] RIP: 0033:0x44a380
[ 3.341757] Code: 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c00
[ 3.343078] RSP: 002b:00007ffc1d404098 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 3.343631] RAX: ffffffffffffffda RBX: 00007ffc1d404458 RCX: 000000000044a380
[ 3.344306] RDX: 000000000000019c RSI: 00007ffc1d4040b0 RDI: 0000000000000003
[ 3.344833] RBP: 00007ffc1d404260 R08: 0000000000000000 R09: 0000000000000000
[ 3.345381] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 3.346015] R13: 00007ffc1d404448 R14: 00000000004c17d0 R15: 0000000000000001
[ 3.346503] </TASK>
[ 3.346679] Modules linked in:
[ 3.346956] ---[ end trace 0000000000000000 ]---
[ 3.347315] RIP: 0010:eg_cache_remove_entry+0xa5/0x470
[ 3.347737] Code: c1 f7 6e fd 48 c7 c7 00 7e 38 b2 e8 95 64 54 fd 48 c7 c7 40 7e 38 b2 48 89 ee e80
[ 3.349157] RSP: 0018:ffff88800583f8a8 EFLAGS: 00010006
[ 3.349517] RAX: 0000000000000006 RBX: ffff888005989000 RCX: ffffffffaecc2d8e
[ 3.350103] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000030
[ 3.350610] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffbfff6558b88
[ 3.351246] R10: 0000000000000003 R11: 203a207972746e65 R12: 1ffff11000b07f15
[ 3.351785] R13: dffffc0000000000 R14: ffff888005989000 R15: ffff888005989068
[ 3.352404] FS: 000000001b6313c0(0000) GS:ffff88806d380000(0000) knlGS:0000000000000000
[ 3.353099] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3.353544] CR2: 00000000004b4b40 CR3: 000000000248e000 CR4: 00000000000006f0
[ 3.354072] note: ex[79] exited with irqs disabled
[ 3.354458] note: ex[79] exited with preempt_count 1
Signed-off-by: Minjoong Kim <pwn9uin@gmail.com>
Fixes:
|
||
|
|
fa81cb19f5 |
netfilter: nft_counter: Use u64_stats_t for statistic.
commit 4a1d3acd6ea86075e77fcc1188c3fc372833ba73 upstream. The nft_counter uses two s64 counters for statistics. Those two are protected by a seqcount to ensure that the 64bit variable is always properly seen during updates even on 32bit architectures where the store is performed by two writes. A side effect is that the two counter (bytes and packet) are written and read together in the same window. This can be replaced with u64_stats_t. write_seqcount_begin()/ end() is replaced with u64_stats_update_begin()/ end() and behaves the same way as with seqcount_t on 32bit architectures. Additionally there is a preempt_disable on PREEMPT_RT to ensure that a reader does not preempt a writer. On 64bit architectures the macros are removed and the reads happen without any retries. This also means that the reader can observe one counter (bytes) from before the update and the other counter (packets) but that is okay since there is no requirement to have both counter from the same update window. Convert the statistic to u64_stats_t. There is one optimisation: nft_counter_do_init() and nft_counter_clone() allocate a new per-CPU counter and assign a value to it. During this assignment preemption is disabled which is not needed because the counter is not yet exposed to the system so there can not be another writer or reader. Therefore disabling preemption is omitted and raw_cpu_ptr() is used to obtain a pointer to a counter for the assignment. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
b44a378248 |
mptcp: Fix data stream corruption in the address announcement
commit 2c1f97a52cb827a5f2768e67a9dddffae1ed47ab upstream.
Because of the size restriction in the TCP options space, the MPTCP
ADD_ADDR option is exclusive and cannot be sent with other MPTCP ones.
For this reason, in the linked mptcp_out_options structure, group of
fields linked to different options are part of the same union.
There is a case where the mptcp_pm_add_addr_signal() function can modify
opts->addr, but not ended up sending an ADD_ADDR. Later on, back in
mptcp_established_options, other options will be sent, but with
unexpected data written in other fields due to the union, e.g. in
opts->ext_copy. This could lead to a data stream corruption in the next
packet.
Using an intermediate variable, prevents from corrupting previously
established DSS option. The assignment of the ADD_ADDR option
parameters is now done once we are sure this ADD_ADDR option can be set
in the packet, e.g. after having dropped other suboptions.
Fixes:
|
||
|
|
6e38b4a4b3 |
batman-adv: Ignore own maximum aggregation size during RX
commit 548b0c5de7619ef53bbde5590700693f2f6d2a56 upstream. An OGMv1 and OGMv2 packet receive processing were not only limited by the number of bytes in the received packet but also by the nodes maximum aggregation packet size limit. But this limit is relevant for TX and not for RX. It must not be enforced by batadv_(i)v_ogm_aggr_packet to avoid loss of information in case of a different limit for sender and receiver. This has a minor side effect for B.A.T.M.A.N. IV because the batadv_iv_ogm_aggr_packet is also used for the preprocessing for the TX. But since the aggregation code itself will not allow more than BATADV_MAX_AGGREGATION_BYTES bytes, this check was never triggering (in this context) prior of removing it. Cc: stable@vger.kernel.org Fixes: |
||
|
|
b7b4be1fa4 |
xsk: fix an integer overflow in xp_create_and_assign_umem()
commit 559847f56769037e5b2e0474d3dbff985b98083d upstream.
Since the i and pool->chunk_size variables are of type 'u32',
their product can wrap around and then be cast to 'u64'.
This can lead to two different XDP buffers pointing to the same
memory area.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
Fixes:
|
||
|
|
f0372a4cf2 |
Revert "gre: Fix IPv6 link-local address generation."
[ Upstream commit fc486c2d060f67d672ddad81724f7c8a4d329570 ] This reverts commit 183185a18ff96751db52a46ccf93fff3a1f42815. This patch broke net/forwarding/ip6gre_custom_multipath_hash.sh in some circumstances (https://lore.kernel.org/netdev/Z9RIyKZDNoka53EO@mini-arch/). Let's revert it while the problem is being investigated. Fixes: 183185a18ff9 ("gre: Fix IPv6 link-local address generation.") Signed-off-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/8b1ce738eb15dd841aab9ef888640cab4f6ccfea.1742418408.git.gnault@redhat.com Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
ae2ec5a51f |
net/neighbor: add missing policy for NDTPA_QUEUE_LENBYTES
[ Upstream commit 90a7138619a0c55e2aefaad27b12ffc2ddbeed78 ] Previous commit |
||
|
|
e4f6de68de |
net: lwtunnel: fix recursion loops
[ Upstream commit 986ffb3a57c5650fb8bf6d59a8f0f07046abfeb6 ] This patch acts as a parachute, catch all solution, by detecting recursion loops in lwtunnel users and taking care of them (e.g., a loop between routes, a loop within the same route, etc). In general, such loops are the consequence of pathological configurations. Each lwtunnel user is still free to catch such loops early and do whatever they want with them. It will be the case in a separate patch for, e.g., seg6 and seg6_local, in order to provide drop reasons and update statistics. Another example of a lwtunnel user taking care of loops is ioam6, which has valid use cases that include loops (e.g., inline mode), and which is addressed by the next patch in this series. Overall, this patch acts as a last resort to catch loops and drop packets, since we don't want to leak something unintentionally because of a pathological configuration in lwtunnels. The solution in this patch reuses dev_xmit_recursion(), dev_xmit_recursion_inc(), and dev_xmit_recursion_dec(), which seems fine considering the context. Closes: https://lore.kernel.org/netdev/2bc9e2079e864a9290561894d2a602d6@akamai.com/ Closes: https://lore.kernel.org/netdev/Z7NKYMY7fJT5cYWu@shredder/ Fixes: |
||
|
|
9566f6ee13 |
net: atm: fix use after free in lec_send()
[ Upstream commit f3009d0d6ab78053117f8857b921a8237f4d17b3 ]
The ->send() operation frees skb so save the length before calling
->send() to avoid a use after free.
Fixes:
|
||
|
|
a235ec29c9 |
ipv6: Set errno after ip_fib_metrics_init() in ip6_route_info_create().
[ Upstream commit 9a81fc3480bf5dbe2bf80e278c440770f6ba2692 ]
While creating a new IPv6, we could get a weird -ENOMEM when
RTA_NH_ID is set and either of the conditions below is true:
1) CONFIG_IPV6_SUBTREES is enabled and rtm_src_len is specified
2) nexthop_get() fails
e.g.)
# strace ip -6 route add fe80::dead:beef:dead:beef nhid 1 from ::
recvmsg(3, {msg_iov=[{iov_base=[...[
{error=-ENOMEM, msg=[... [...]]},
[{nla_len=49, nla_type=NLMSGERR_ATTR_MSG}, "Nexthops can not be used with so"...]
]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148
Let's set err explicitly after ip_fib_metrics_init() in
ip6_route_info_create().
Fixes:
|
||
|
|
119dcafe36 |
ipv6: Fix memleak of nhc_pcpu_rth_output in fib_check_nh_v6_gw().
[ Upstream commit 9740890ee20e01f99ff1dde84c63dcf089fabb98 ] fib_check_nh_v6_gw() expects that fib6_nh_init() cleans up everything when it fails. Commit |
||
|
|
ecd06ad082 |
Bluetooth: Fix error code in chan_alloc_skb_cb()
[ Upstream commit 72d061ee630d0dbb45c2920d8d19b3861c413e54 ]
The chan_alloc_skb_cb() function is supposed to return error pointers on
error. Returning NULL will lead to a NULL dereference.
Fixes:
|
||
|
|
8e1704e5b2 |
xfrm_output: Force software GSO only in tunnel mode
[ Upstream commit 0aae2867aa6067f73d066bc98385e23c8454a1d7 ]
The cited commit fixed a software GSO bug with VXLAN + IPSec in tunnel
mode. Unfortunately, it is slightly broader than necessary, as it also
severely affects performance for Geneve + IPSec transport mode over a
device capable of both HW GSO and IPSec crypto offload. In this case,
xfrm_output unnecessarily triggers software GSO instead of letting the
HW do it. In simple iperf3 tests over Geneve + IPSec transport mode over
a back-2-back pair of NICs with MTU 1500, the performance was observed
to be up to 6x worse when doing software GSO compared to leaving it to
the hardware.
This commit makes xfrm_output only trigger software GSO in crypto
offload cases for already encapsulated packets in tunnel mode, as not
doing so would then cause the inner tunnel skb->inner_networking_header
to be overwritten and break software GSO for that packet later if the
device turns out to not be capable of HW GSO.
Taking a closer look at the conditions for the original bug, to better
understand the reasons for this change:
- vxlan_build_skb -> iptunnel_handle_offloads sets inner_protocol and
inner network header.
- then, udp_tunnel_xmit_skb -> ip_tunnel_xmit adds outer transport and
network headers.
- later in the xmit path, xfrm_output -> xfrm_outer_mode_output ->
xfrm4_prepare_output -> xfrm4_tunnel_encap_add overwrites the inner
network header with the one set in ip_tunnel_xmit before adding the
second outer header.
- __dev_queue_xmit -> validate_xmit_skb checks whether GSO segmentation
needs to happen based on dev features. In the original bug, the hw
couldn't segment the packets, so skb_gso_segment was invoked.
- deep in the .gso_segment callback machinery, __skb_udp_tunnel_segment
tries to use the wrong inner network header, expecting the one set in
iptunnel_handle_offloads but getting the one set by xfrm instead.
- a bit later, ipv6_gso_segment accesses the wrong memory based on that
wrong inner network header.
With the new change, the original bug (or similar ones) cannot happen
again, as xfrm will now trigger software GSO before applying a tunnel.
This concern doesn't exist in packet offload mode, when the HW adds
encapsulation headers. For the non-offloaded packets (crypto in SW),
software GSO is still done unconditionally in the else branch.
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Yael Chemla <ychemla@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Fixes:
|