[ Upstream commit ccdb0900a0 ]
Add a debug log to this quirk as well so we can see what quirks have
been applied when debugging.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0ac448e0d2 ]
Prevent KVM hang when a Solidgm P44 Pro NVMe is passed through to a guest
via IOMMU and the guest is subsequently rebooted.
A similar issue was identified and patched by 51ba09452d ("PCI: Delay
after FLR of Intel DC P3700 NVMe") and the same fix can be applied for this
case. (Intel spun off their NAND and SSD business as Solidigm and sold it
to SK Hynix in late 2021.)
Link: https://lore.kernel.org/r/20230507073519.9737-1-mike@oobak.org
Signed-off-by: Mike Pastore <mike@oobak.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d089d69cc1 ]
This patch fixes a FLR bug on the SNET DPU rev 1 by setting the
PCI_DEV_FLAGS_NO_FLR_RESET flag.
As there is a quirk to avoid FLR (quirk_no_flr), I added a new quirk
to check the rev ID before calling to quirk_no_flr.
Without this patch, a SNET DPU rev 1 may hang when FLR is applied.
Signed-off-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Message-Id: <20230110165638.123745-3-alvaro.karsz@solid-run.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cf92bb778eda7830e79452c6917efa8474a30c1e ]
When Output Resource (dcb->or) value is assigned in
fabricate_dcb_output(), there may be out of bounds access to
dac_users array in case dcb->or is zero because ffs(dcb->or) is
used as index there.
The 'or' argument of fabricate_dcb_output() must be interpreted as a
number of bit to set, not value.
Utilize macros from 'enum nouveau_or' in calls instead of hardcoding.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 2e5702aff3 ("drm/nouveau: fabricate DCB encoder table for iMac G4")
Fixes: 670820c0e6 ("drm/nouveau: Workaround incorrect DCB entry on a GeForce3 Ti 200.")
Signed-off-by: Mikhail Kobuk <m.kobuk@ispras.ru>
Signed-off-by: Danilo Krummrich <dakr@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240411110854.16701-1-m.kobuk@ispras.ru
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2d8527f2f911fab84aec04df4788c0c23af3df48 ]
A race condition exists in ccw_device_set_online() that can cause the
online process to fail, leaving the affected device in an inconsistent
state. As a result, subsequent attempts to set that device online fail
with return code ENODEV.
The problem occurs when a path verification request arrives after
a wait for final device state completed, but before the result state
is evaluated.
Fix this by ensuring that the CCW-device lock is held between
determining final state and checking result state.
Note that since:
commit 2297791c92 ("s390/cio: dont unregister subchannel from child-drivers")
path verification requests are much more likely to occur during boot,
resulting in an increased chance of this race condition occurring.
Fixes: 2297791c92 ("s390/cio: dont unregister subchannel from child-drivers")
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 607638faf2ff1cede37458111496e7cc6c977f6f ]
A deferred condition code 1 response indicates that I/O was not started
and should be retried. The current QDIO implementation handles a cc1
response as I/O error, resulting in a failed QDIO setup. This can happen
for example when a path verification request arrives at the same time
as QDIO setup I/O is started.
Fix this by retrying the QDIO setup I/O when a cc1 response is received.
Note that since
commit 2297791c92 ("s390/cio: dont unregister subchannel from child-drivers")
commit 5ef1dc40ffa6 ("s390/cio: fix invalid -EBUSY on ccw_device_start")
deferred cc1 responses are much more likely to occur. See the commit
message of the latter for more background information.
Fixes: 2297791c92 ("s390/cio: dont unregister subchannel from child-drivers")
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c24cd679b075b0e953ea167b0aa2b2d59e4eba7f ]
The TX and RX DMA Channels used by the driver to exchange data with CPSW
are not guaranteed to be in a clean state during driver initialization.
The Bootloader could have used the same DMA Channels without cleaning them
up in the event of failure. Thus, reset and disable the DMA Channels to
ensure that they are in a clean state before using them.
Fixes: 93a7653031 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
Reported-by: Schuyler Patton <spatton@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Roger Quadros <rogerq@kernel.org>
Link: https://lore.kernel.org/r/20240417095425.2253876-1-s-vadapalli@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d59cf049c8378677053703e724808836f180888e ]
This switch intellectual property provides a bit on the ARL global control
register which controls allowing mirroring frames which are received on the
local port (monitor port). This bit is unset after reset.
This ability must be enabled to fully support the port mirroring feature on
this switch intellectual property.
Therefore, this patch fixes the traffic not being reflected on a port,
which would be configured like below:
tc qdisc add dev swp0 clsact
tc filter add dev swp0 ingress matchall skip_sw \
action mirred egress mirror dev swp0
As a side note, this configuration provides the hairpinning feature for a
single port.
Fixes: 37feab6076 ("net: dsa: mt7530: add support for port mirroring")
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f8bbc07ac535593139c875ffa19af924b1084540 ]
vhost_worker will call tun call backs to receive packets. If too many
illegal packets arrives, tun_do_read will keep dumping packet contents.
When console is enabled, it will costs much more cpu time to dump
packet and soft lockup will be detected.
net_ratelimit mechanism can be used to limit the dumping rate.
PID: 33036 TASK: ffff949da6f20000 CPU: 23 COMMAND: "vhost-32980"
#0 [fffffe00003fce50] crash_nmi_callback at ffffffff89249253
#1 [fffffe00003fce58] nmi_handle at ffffffff89225fa3
#2 [fffffe00003fceb0] default_do_nmi at ffffffff8922642e
#3 [fffffe00003fced0] do_nmi at ffffffff8922660d
#4 [fffffe00003fcef0] end_repeat_nmi at ffffffff89c01663
[exception RIP: io_serial_in+20]
RIP: ffffffff89792594 RSP: ffffa655314979e8 RFLAGS: 00000002
RAX: ffffffff89792500 RBX: ffffffff8af428a0 RCX: 0000000000000000
RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff8af428a0
RBP: 0000000000002710 R8: 0000000000000004 R9: 000000000000000f
R10: 0000000000000000 R11: ffffffff8acbf64f R12: 0000000000000020
R13: ffffffff8acbf698 R14: 0000000000000058 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffffa655314979e8] io_serial_in at ffffffff89792594
#6 [ffffa655314979e8] wait_for_xmitr at ffffffff89793470
#7 [ffffa65531497a08] serial8250_console_putchar at ffffffff897934f6
#8 [ffffa65531497a20] uart_console_write at ffffffff8978b605
#9 [ffffa65531497a48] serial8250_console_write at ffffffff89796558
#10 [ffffa65531497ac8] console_unlock at ffffffff89316124
#11 [ffffa65531497b10] vprintk_emit at ffffffff89317c07
#12 [ffffa65531497b68] printk at ffffffff89318306
#13 [ffffa65531497bc8] print_hex_dump at ffffffff89650765
#14 [ffffa65531497ca8] tun_do_read at ffffffffc0b06c27 [tun]
#15 [ffffa65531497d38] tun_recvmsg at ffffffffc0b06e34 [tun]
#16 [ffffa65531497d68] handle_rx at ffffffffc0c5d682 [vhost_net]
#17 [ffffa65531497ed0] vhost_worker at ffffffffc0c644dc [vhost]
#18 [ffffa65531497f10] kthread at ffffffff892d2e72
#19 [ffffa65531497f50] ret_from_fork at ffffffff89c0022f
Fixes: ef3db4a595 ("tun: avoid BUG, dump packet on GSO errors")
Signed-off-by: Lei Chen <lei.chen@smartx.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20240415020247.2207781-1-lei.chen@smartx.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 73278715725a8347032acf233082ca4eb31e6a56 ]
The check for flags is done to not pass empty lookups to adding switch
rule functions. Since metadata is always added to lookups there is no
need to check against the flag.
It is also fixing the problem with such rule:
$ tc filter add dev gtp_dev ingress protocol ip prio 0 flower \
enc_dst_port 2123 action drop
Switch block in case of GTP can't parse the destination port, because it
should always be set to GTP specific value. The same with ethertype. The
result is that there is no other matching criteria than GTP tunnel. In
this case flags is 0, rule can't be added only because of defensive
check against flags.
Fixes: 9a225f81f5 ("ice: Support GTP-U and GTP-C offload in switchdev")
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fef965764cf562f28afb997b626fc7c3cec99693 ]
When disabling aRFS under the `priv->state_lock`, any scheduled
aRFS works are canceled using the `cancel_work_sync` function,
which waits for the work to end if it has already started.
However, while waiting for the work handler, the handler will
try to acquire the `state_lock` which is already acquired.
The worker acquires the lock to delete the rules if the state
is down, which is not the worker's responsibility since
disabling aRFS deletes the rules.
Add an aRFS state variable, which indicates whether the aRFS is
enabled and prevent adding rules when the aRFS is disabled.
Kernel log:
======================================================
WARNING: possible circular locking dependency detected
6.7.0-rc4_net_next_mlx5_5483eb2 #1 Tainted: G I
------------------------------------------------------
ethtool/386089 is trying to acquire lock:
ffff88810f21ce68 ((work_completion)(&rule->arfs_work)){+.+.}-{0:0}, at: __flush_work+0x74/0x4e0
but task is already holding lock:
ffff8884a1808cc0 (&priv->state_lock){+.+.}-{3:3}, at: mlx5e_ethtool_set_channels+0x53/0x200 [mlx5_core]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&priv->state_lock){+.+.}-{3:3}:
__mutex_lock+0x80/0xc90
arfs_handle_work+0x4b/0x3b0 [mlx5_core]
process_one_work+0x1dc/0x4a0
worker_thread+0x1bf/0x3c0
kthread+0xd7/0x100
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
-> #0 ((work_completion)(&rule->arfs_work)){+.+.}-{0:0}:
__lock_acquire+0x17b4/0x2c80
lock_acquire+0xd0/0x2b0
__flush_work+0x7a/0x4e0
__cancel_work_timer+0x131/0x1c0
arfs_del_rules+0x143/0x1e0 [mlx5_core]
mlx5e_arfs_disable+0x1b/0x30 [mlx5_core]
mlx5e_ethtool_set_channels+0xcb/0x200 [mlx5_core]
ethnl_set_channels+0x28f/0x3b0
ethnl_default_set_doit+0xec/0x240
genl_family_rcv_msg_doit+0xd0/0x120
genl_rcv_msg+0x188/0x2c0
netlink_rcv_skb+0x54/0x100
genl_rcv+0x24/0x40
netlink_unicast+0x1a1/0x270
netlink_sendmsg+0x214/0x460
__sock_sendmsg+0x38/0x60
__sys_sendto+0x113/0x170
__x64_sys_sendto+0x20/0x30
do_syscall_64+0x40/0xe0
entry_SYSCALL_64_after_hwframe+0x46/0x4e
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&priv->state_lock);
lock((work_completion)(&rule->arfs_work));
lock(&priv->state_lock);
lock((work_completion)(&rule->arfs_work));
*** DEADLOCK ***
3 locks held by ethtool/386089:
#0: ffffffff82ea7210 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40
#1: ffffffff82e94c88 (rtnl_mutex){+.+.}-{3:3}, at: ethnl_default_set_doit+0xd3/0x240
#2: ffff8884a1808cc0 (&priv->state_lock){+.+.}-{3:3}, at: mlx5e_ethtool_set_channels+0x53/0x200 [mlx5_core]
stack backtrace:
CPU: 15 PID: 386089 Comm: ethtool Tainted: G I 6.7.0-rc4_net_next_mlx5_5483eb2 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x60/0xa0
check_noncircular+0x144/0x160
__lock_acquire+0x17b4/0x2c80
lock_acquire+0xd0/0x2b0
? __flush_work+0x74/0x4e0
? save_trace+0x3e/0x360
? __flush_work+0x74/0x4e0
__flush_work+0x7a/0x4e0
? __flush_work+0x74/0x4e0
? __lock_acquire+0xa78/0x2c80
? lock_acquire+0xd0/0x2b0
? mark_held_locks+0x49/0x70
__cancel_work_timer+0x131/0x1c0
? mark_held_locks+0x49/0x70
arfs_del_rules+0x143/0x1e0 [mlx5_core]
mlx5e_arfs_disable+0x1b/0x30 [mlx5_core]
mlx5e_ethtool_set_channels+0xcb/0x200 [mlx5_core]
ethnl_set_channels+0x28f/0x3b0
ethnl_default_set_doit+0xec/0x240
genl_family_rcv_msg_doit+0xd0/0x120
genl_rcv_msg+0x188/0x2c0
? ethnl_ops_begin+0xb0/0xb0
? genl_family_rcv_msg_dumpit+0xf0/0xf0
netlink_rcv_skb+0x54/0x100
genl_rcv+0x24/0x40
netlink_unicast+0x1a1/0x270
netlink_sendmsg+0x214/0x460
__sock_sendmsg+0x38/0x60
__sys_sendto+0x113/0x170
? do_user_addr_fault+0x53f/0x8f0
__x64_sys_sendto+0x20/0x30
do_syscall_64+0x40/0xe0
entry_SYSCALL_64_after_hwframe+0x46/0x4e
</TASK>
Fixes: 45bf454ae8 ("net/mlx5e: Enabling aRFS mechanism")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240411115444.374475-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 37cc10da3a50e6d0cb9808a90b7da9b4868794dd ]
The cited patch introduces the concept of buckets in LAG in hash mode.
However, the patch doesn't clear the number of buckets in the LAG
deactivation. This results in using the wrong number of buckets in
case user create a hash mode LAG and afterwards create a non-hash
mode LAG.
Hence, restore buckets number to default after hash mode LAG
deactivation.
Fixes: 352899f384 ("net/mlx5: Lag, use buckets in hash mode")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240411115444.374475-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 22dd70eb2c3d754862964377a75abafd3167346b ]
Currently, we can read OOB data without MSG_OOB by using MSG_PEEK
when OOB data is sitting on the front row, which is apparently
wrong.
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'a', MSG_OOB)
1
>>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
b'a'
If manage_oob() is called when no data has been copied, we only
check if the socket enables SO_OOBINLINE or MSG_PEEK is not used.
Otherwise, the skb is returned as is.
However, here we should return NULL if MSG_PEEK is set and no data
has been copied.
Also, in such a case, we should not jump to the redo label because
we will be caught in the loop and hog the CPU until normal data
comes in.
Then, we need to handle skb == NULL case with the if-clause below
the manage_oob() block.
With this patch:
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'a', MSG_OOB)
1
>>> c2.recv(1, MSG_PEEK | MSG_DONTWAIT)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BlockingIOError: [Errno 11] Resource temporarily unavailable
Fixes: 314001f0bf ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240410171016.7621-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 283454c8a123072e5c386a5a2b5fc576aa455b6f ]
When we call recv() for AF_UNIX socket, we first peek one skb and
calls manage_oob() to check if the skb is sent with MSG_OOB.
However, when we fetch the next (and the following) skb, manage_oob()
is not called now, leading a wrong behaviour.
Let's say a socket send()s "hello" with MSG_OOB and the peer tries
to recv() 5 bytes with MSG_PEEK. Here, we should get only "hell"
without 'o', but actually not:
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'hello', MSG_OOB)
5
>>> c2.recv(5, MSG_PEEK)
b'hello'
The first skb fills 4 bytes, and the next skb is peeked but not
properly checked by manage_oob().
Let's move up the again label to call manage_oob() for evry skb.
With this patch:
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX, SOCK_STREAM)
>>> c1.send(b'hello', MSG_OOB)
5
>>> c2.recv(5, MSG_PEEK)
b'hell'
Fixes: 314001f0bf ("af_unix: Add OOB support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240410171016.7621-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6db5dc7b351b9569940cd1cf445e237c42cd6d27 ]
pppoe traffic reaching ingress path does not match the flowtable entry
because the pppoe header is expected to be at the network header offset.
This bug causes a mismatch in the flow table lookup, so pppoe packets
enter the classical forwarding path.
Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 87b3593bed1868b2d9fe096c01bcdf0ea86cbebf ]
Ensure there is sufficient room to access the protocol field of the
PPPoe header. Validate it once before the flowtable lookup, then use a
helper function to access protocol field.
Reported-by: syzbot+b6f07e1c07ef40199081@syzkaller.appspotmail.com
Fixes: 72efd585f7 ("netfilter: flowtable: add pppoe support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3cfc9ec039af60dbd8965ae085b2c2ccdcfbe1cc ]
Pablo reports a crash with large batches of elements with a
back-to-back add/remove pattern. Quoting Pablo:
add_elem("00000000") timeout 100 ms
...
add_elem("0000000X") timeout 100 ms
del_elem("0000000X") <---------------- delete one that was just added
...
add_elem("00005000") timeout 100 ms
1) nft_pipapo_remove() removes element 0000000X
Then, KASAN shows a splat.
Looking at the remove function there is a chance that we will drop a
rule that maps to a non-deactivated element.
Removal happens in two steps, first we do a lookup for key k and return the
to-be-removed element and mark it as inactive in the next generation.
Then, in a second step, the element gets removed from the set/map.
The _remove function does not work correctly if we have more than one
element that share the same key.
This can happen if we insert an element into a set when the set already
holds an element with same key, but the element mapping to the existing
key has timed out or is not active in the next generation.
In such case its possible that removal will unmap the wrong element.
If this happens, we will leak the non-deactivated element, it becomes
unreachable.
The element that got deactivated (and will be freed later) will
remain reachable in the set data structure, this can result in
a crash when such an element is retrieved during lookup (stale
pointer).
Add a check that the fully matching key does in fact map to the element
that we have marked as inactive in the deactivation step.
If not, we need to continue searching.
Add a bug/warn trap at the end of the function as well, the remove
function must not ever be called with an invisible/unreachable/non-existent
element.
v2: avoid uneeded temporary variable (Stefano)
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d78d867dcea69c328db30df665be5be7d0148484 ]
nft_unregister_obj() can concurrent with __nft_obj_type_get(),
and there is not any protection when iterate over nf_tables_objects
list in __nft_obj_type_get(). Therefore, there is potential data-race
of nf_tables_objects list entry.
Use list_for_each_entry_rcu() to iterate over nf_tables_objects
list in __nft_obj_type_get(), and use rcu_read_lock() in the caller
nft_obj_type_get() to protect the entire type query process.
Fixes: e50092404c ("netfilter: nf_tables: add stateful objects")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f969eb84ce482331a991079ab7a5c4dc3b7f89bf ]
nft_unregister_expr() can concurrent with __nft_expr_type_get(),
and there is not any protection when iterate over nf_tables_expressions
list in __nft_expr_type_get(). Therefore, there is potential data-race
of nf_tables_expressions list entry.
Use list_for_each_entry_rcu() to iterate over nf_tables_expressions
list in __nft_expr_type_get(), and use rcu_read_lock() in the caller
nft_expr_type_get() to protect the entire type query process.
Fixes: ef1f7df917 ("netfilter: nf_tables: expression ops overloading")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Commit 9c55461040a9264b7e44444c53d26480b438eda6 upstream ]
Currently, the EFI stub invokes the EFI memory attributes protocol to
strip any NX restrictions from the entire loaded kernel, resulting in
all code and data being mapped read-write-execute.
The point of the EFI memory attributes protocol is to remove the need
for all memory allocations to be mapped with both write and execute
permissions by default, and make it the OS loader's responsibility to
transition data mappings to code mappings where appropriate.
Even though the UEFI specification does not appear to leave room for
denying memory attribute changes based on security policy, let's be
cautious and avoid relying on the ability to create read-write-execute
mappings. This is trivially achievable, given that the amount of kernel
code executing via the firmware's 1:1 mapping is rather small and
limited to the .head.text region. So let's drop the NX restrictions only
on that subregion, but not before remapping it as read-only first.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit 48204aba801f1b512b3abed10b8e1a63e03f3dd1 upstream ]
The .head.text section is the initial primary entrypoint of the core
kernel, and is entered with the CPU executing from a 1:1 mapping of
memory. Such code must never access global variables using absolute
references, as these are based on the kernel virtual mapping which is
not active yet at this point.
Given that the SME startup code is also called from this early execution
context, move it into .head.text as well. This will allow more thorough
build time checks in the future to ensure that early startup code only
uses RIP-relative references to global variables.
Also replace some occurrences of __pa_symbol() [which relies on the
compiler generating an absolute reference, which is not guaranteed] and
an open coded RIP-relative access with RIP_REL_REF().
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-18-ardb+git@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit 1ad55cecf22f05f1c884adf63cc09d3c3e609ebf upstream ]
The .compat section is a dummy PE section that contains the address of
the 32-bit entrypoint of the 64-bit kernel image if it is bootable from
32-bit firmware (i.e., CONFIG_EFI_MIXED=y)
This section is only 8 bytes in size and is only referenced from the
loader, and so it is placed at the end of the memory view of the image,
to avoid the need for padding it to 4k, which is required for sections
appearing in the middle of the image.
Unfortunately, this violates the PE/COFF spec, and even if most EFI
loaders will work correctly (including the Tianocore reference
implementation), PE loaders do exist that reject such images, on the
basis that both the file and memory views of the file contents should be
described by the section headers in a monotonically increasing manner
without leaving any gaps.
So reorganize the sections to avoid this issue. This results in a slight
padding overhead (< 4k) which can be avoided if desired by disabling
CONFIG_EFI_MIXED (which is only needed in rare cases these days)
Fixes: 3e3eabe26dc8 ("x86/boot: Increase section and file alignment to 4k/512")
Reported-by: Mike Beaton <mjsbeaton@gmail.com>
Link: https://lkml.kernel.org/r/CAHzAAWQ6srV6LVNdmfbJhOwhBw5ZzxxZZ07aHt9oKkfYAdvuQQ%40mail.gmail.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit 3e3eabe26dc88692d34cf76ca0e0dd331481cc15 upstream ]
Align x86 with other EFI architectures, and increase the section
alignment to the EFI page size (4k), so that firmware is able to honour
the section permission attributes and map code read-only and data
non-executable.
There are a number of requirements that have to be taken into account:
- the sign tools get cranky when there are gaps between sections in the
file view of the image
- the virtual offset of each section must be aligned to the image's
section alignment
- the file offset *and size* of each section must be aligned to the
image's file alignment
- the image size must be aligned to the section alignment
- each section's virtual offset must be greater than or equal to the
size of the headers.
In order to meet all these requirements, while avoiding the need for
lots of padding to accommodate the .compat section, the latter is placed
at an arbitrary offset towards the end of the image, but aligned to the
minimum file alignment (512 bytes). The space before the .text section
is therefore distributed between the PE header, the .setup section and
the .compat section, leaving no gaps in the file coverage, making the
signing tools happy.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230915171623.655440-18-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit 34951f3c28bdf6481d949a20413b2ce7693687b2 upstream ]
Describe the code and data of the decompressor binary using separate
.text and .data PE/COFF sections, so that we will be able to map them
using restricted permissions once we increase the section and file
alignment sufficiently. This avoids the need for memory mappings that
are writable and executable at the same time, which is something that
is best avoided for security reasons.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230915171623.655440-17-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit efa089e63b56bdc5eca754b995cb039dd7a5457e upstream ]
Now that the size of the setup block is visible to the assembler, it is
possible to populate the PE/COFF header fields from the asm code
directly, instead of poking the values into the binary using the build
tool. This will make it easier to reorganize the section layout without
having to tweak the build tool in lockstep.
This change has no impact on the resulting bzImage binary.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230915171623.655440-15-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit aeb92067f6ae994b541d7f9752fe54ed3d108bcc upstream ]
Tweak the linker script so that the value of _edata represents the
decompressor binary's file size rounded up to the appropriate alignment.
This removes the need to calculate it in the build tool, and will make
it easier to refer to the file size from the header directly in
subsequent changes to the PE header layout.
While adding _edata to the sed regex that parses the compressed
vmlinux's symbol list, tweak the regex a bit for conciseness.
This change has no impact on the resulting bzImage binary when
configured with CONFIG_EFI_STUB=y.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230915171623.655440-14-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit 093ab258e3fb1d1d3afdfd4a69403d44ce90e360 upstream ]
The setup block contains the real mode startup code that is used when
booting from a legacy BIOS, along with the boot_params/setup_data that
is used by legacy x86 bootloaders to pass the command line and initial
ramdisk parameters, among other things.
The setup block also contains the PE/COFF header of the entire combined
image, which includes the compressed kernel image, the decompressor and
the EFI stub.
This PE header describes the layout of the executable image in memory,
and currently, the fact that the setup block precedes it makes it rather
fiddly to get the right values into the right place in the final image.
Let's make things a bit easier by defining the setup_size in the linker
script so it can be referenced from the asm code directly, rather than
having to rely on the build tool to calculate it. For the time being,
add 64 bytes of fixed padding for the .reloc and .compat sections - this
will be removed in a subsequent patch after the PE/COFF header has been
reorganized.
This change has no impact on the resulting bzImage binary when
configured with CONFIG_EFI_MIXED=y.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230915171623.655440-13-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit eac956345f99dda3d68f4ae6cf7b494105e54780 upstream ]
The offsets of the EFI handover entrypoints are available to the
assembler when constructing the header, so there is no need to set them
from the build tool afterwards.
This change has no impact on the resulting bzImage binary.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230915171623.655440-12-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit 2e765c02dcbfc2a8a4527c621a84b9502f6b9bd2 upstream ]
Instead of parsing zoffset.h and poking the kernel_info offset value
into the header from the build tool, just grab the value directly in the
asm file that describes this header.
This change has no impact on the resulting bzImage binary.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230915171623.655440-11-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit b618d31f112bea3d2daea19190d63e567f32a4db upstream ]
The x86 boot image generation tool assign a default value to startup_64
and subsequently parses the actual value from zoffset.h but it never
actually uses the value anywhere. So remove this code.
This change has no impact on the resulting bzImage binary.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230912090051.4014114-25-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit 7448e8e5d15a3c4df649bf6d6d460f78396f7e1e upstream ]
The root device defaults to 0,0 and is no longer configurable at build
time [0], so there is no need for the build tool to ever write to this
field.
[0] 079f85e624 ("x86, build: Do not set the root_dev field in bzImage")
This change has no impact on the resulting bzImage binary.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230912090051.4014114-23-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit 8eace5b3555606e684739bef5bcdfcfe68235257 upstream ]
Now that the EFI stub decompresses the kernel and hands over to the
decompressed image directly, there is no longer a need to provide a
decompression buffer as part of the .BSS allocation of the PE/COFF
image. It also means the PE/COFF image can be loaded anywhere in memory,
and setting the preferred image base is unnecessary. So drop the
handling of this from the header and from the build tool.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230912090051.4014114-22-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit 768171d7ebbce005210e1cf8456f043304805c15 upstream ]
Ancient (pre-2003) x86 kernels could boot from a floppy disk straight from
the BIOS, using a small real mode boot stub at the start of the image
where the BIOS would expect the boot record (or boot block) to appear.
Due to its limitations (kernel size < 1 MiB, no support for IDE, USB or
El Torito floppy emulation), this support was dropped, and a Linux aware
bootloader is now always required to boot the kernel from a legacy BIOS.
To smoothen this transition, the boot stub was not removed entirely, but
replaced with one that just prints an error message telling the user to
install a bootloader.
As it is unlikely that anyone doing direct floppy boot with such an
ancient kernel is going to upgrade to v6.5+ and expect that this boot
method still works, printing this message is kind of pointless, and so
it should be possible to remove the logic that emits it.
Let's free up this space so it can be used to expand the PE header in a
subsequent patch.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lore.kernel.org/r/20230912090051.4014114-21-ardb@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Commit decd347c2a75d32984beb8807d470b763a53b542 upstream ]
Commit
8117961d98 ("x86/efi: Disregard setup header of loaded image")
dropped the memcopy of the image's setup header into the boot_params
struct provided to the core kernel, on the basis that EFI boot does not
need it and should rely only on a single protocol to interface with the
boot chain. It is also a prerequisite for being able to increase the
section alignment to 4k, which is needed to enable memory protections
when running in the boot services.
So only the setup_header fields that matter to the core kernel are
populated explicitly, and everything else is ignored. One thing was
overlooked, though: the initrd_addr_max field in the setup_header is not
used by the core kernel, but it is used by the EFI stub itself when it
loads the initrd, where its default value of INT_MAX is used as the soft
limit for memory allocation.
This means that, in the old situation, the initrd was virtually always
loaded in the lower 2G of memory, but now, due to initrd_addr_max being
0x0, the initrd may end up anywhere in memory. This should not be an
issue principle, as most systems can deal with this fine. However, it
does appear to tickle some problems in older UEFI implementations, where
the memory ends up being corrupted, resulting in errors when unpacking
the initramfs.
So set the initrd_addr_max field to INT_MAX like it was before.
Fixes: 8117961d98 ("x86/efi: Disregard setup header of loaded image")
Reported-by: Radek Podgorny <radek@podgorny.cz>
Closes: https://lore.kernel.org/all/a99a831a-8ad5-4cb0-bff9-be637311f771@podgorny.cz
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>