[ Upstream commit 88c7a9fd9b ]
When sending a packet, we will prepend it with an LAPB header.
This modifies the shared parts of a cloned skb, so we should copy the
skb rather than just clone it, before we prepend the header.
In "Documentation/networking/driver.rst" (the 2nd point), it states
that drivers shouldn't modify the shared parts of a cloned skb when
transmitting.
The "dev_queue_xmit_nit" function in "net/core/dev.c", which is called
when an skb is being sent, clones the skb and sents the clone to
AF_PACKET sockets. Because the LAPB drivers first remove a 1-byte
pseudo-header before handing over the skb to us, if we don't copy the
skb before prepending the LAPB header, the first byte of the packets
received on AF_PACKET sockets can be corrupted.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/20210201055706.415842-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit aa880c6f3e ]
Dcfg was overlapping with clockgen address space which resulted
in failure in memory allocation for dcfg. According regs description
dcfg size should not be bigger than 4KB.
Signed-off-by: Zyta Szpak <zr@semihalf.com>
Fixes: 8126d88162 ("arm64: dts: add QorIQ LS1046A SoC support")
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5399d52233 ]
AF_RXRPC sockets use UDP ports in encap mode. This causes socket and dst
from an incoming packet to get stolen and attached to the UDP socket from
whence it is leaked when that socket is closed.
When a network namespace is removed, the wait for dst records to be cleaned
up happens before the cleanup of the rxrpc and UDP socket, meaning that the
wait never finishes.
Fix this by moving the rxrpc (and, by dependence, the afs) private
per-network namespace registrations to the device group rather than subsys
group. This allows cached rxrpc local endpoints to be cleared and their
UDP sockets closed before we try waiting for the dst records.
The symptom is that lines looking like the following:
unregister_netdevice: waiting for lo to become free
get emitted at regular intervals after running something like the
referenced syzbot test.
Thanks to Vadim for tracking this down and work out the fix.
Reported-by: syzbot+df400f2f24a1677cd7e0@syzkaller.appspotmail.com
Reported-by: Vadim Fedorenko <vfedorenko@novek.ru>
Fixes: 5271953cad ("rxrpc: Use the UDP encap_rcv hook")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/161196443016.3868642.5577440140646403533.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 640f17c824 ]
create_worker() will already set the right affinity using
kthread_bind_mask(), this means only the rescuer will need to change
it's affinity.
Howveer, while in cpu-hot-unplug a regular task is not allowed to run
on online&&!active as it would be pushed away quite agressively. We
need KTHREAD_IS_PER_CPU to survive in that environment.
Therefore set the affinity after getting that magic flag.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.826629830@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ac687e6e8c ]
There is a need to distinguish geniune per-cpu kthreads from kthreads
that happen to have a single CPU affinity.
Geniune per-cpu kthreads are kthreads that are CPU affine for
correctness, these will obviously have PF_KTHREAD set, but must also
have PF_NO_SETAFFINITY set, lest userspace modify their affinity and
ruins things.
However, these two things are not sufficient, PF_NO_SETAFFINITY is
also set on other tasks that have their affinities controlled through
other means, like for instance workqueues.
Therefore another bit is needed; it turns out kthread_create_per_cpu()
already has such a bit: KTHREAD_IS_PER_CPU, which is used to make
kthread_park()/kthread_unpark() work correctly.
Expose this flag and remove the implicit setting of it from
kthread_create_on_cpu(); the io_uring usage of it seems dubious at
best.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1d489151e9 ]
Thanks to a recent binutils change which doesn't generate unused
symbols, it's now possible for thunk_64.o be completely empty without
CONFIG_PREEMPTION: no text, no data, no symbols.
We could edit the Makefile to only build that file when
CONFIG_PREEMPTION is enabled, but that will likely create confusion
if/when the thunks end up getting used by some other code again.
Just ignore it and move on.
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1254
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dd3a44c06f ]
Newer binutils (>= 2.36) refuse to assemble lmw/stmw when building in
little endian mode. That breaks compilation of our alignment handler
test:
/tmp/cco4l14N.s: Assembler messages:
/tmp/cco4l14N.s:1440: Error: `lmw' invalid when little-endian
/tmp/cco4l14N.s:1814: Error: `stmw' invalid when little-endian
make[2]: *** [../../lib.mk:139: /output/kselftest/powerpc/alignment/alignment_handler] Error 1
These tests do pass on little endian machines, as the kernel will
still emulate those instructions even when running little
endian (which is arguably a kernel bug).
But we don't really need to test that case, so ifdef those
instructions out to get the alignment test building again.
Reported-by: Libor Pechacek <lpechacek@suse.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Libor Pechacek <lpechacek@suse.com>
Link: https://lore.kernel.org/r/20210119041800.3093047-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 764907293e ]
While testing live partition mobility, we have observed occasional crashes
of the Linux partition. What we've seen is that during the live migration,
for specific configurations with large amounts of memory, slow network
links, and workloads that are changing memory a lot, the partition can end
up being suspended for 30 seconds or longer. This resulted in the following
scenario:
CPU 0 CPU 1
------------------------------- ----------------------------------
scsi_queue_rq migration_store
-> blk_mq_start_request -> rtas_ibm_suspend_me
-> blk_add_timer -> on_each_cpu(rtas_percpu_suspend_me
_______________________________________V
|
V
-> IPI from CPU 1
-> rtas_percpu_suspend_me
-> __rtas_suspend_last_cpu
-- Linux partition suspended for > 30 seconds --
-> for_each_online_cpu(cpu)
plpar_hcall_norets(H_PROD
-> scsi_dispatch_cmd
-> scsi_times_out
-> scsi_abort_command
-> queue_delayed_work
-> ibmvfc_queuecommand_lck
-> ibmvfc_send_event
-> ibmvfc_send_crq
- returns H_CLOSED
<- returns SCSI_MLQUEUE_HOST_BUSY
-> __blk_mq_requeue_request
-> scmd_eh_abort_handler
-> scsi_try_to_abort_cmd
- returns SUCCESS
-> scsi_queue_insert
Normally, the SCMD_STATE_COMPLETE bit would protect against the command
completion and the timeout, but that doesn't work here, since we don't
check that at all in the SCSI_MLQUEUE_HOST_BUSY path.
In this case we end up calling scsi_queue_insert on a request that has
already been queued, or possibly even freed, and we crash.
The patch below simply increases the default I/O timeout to avoid this race
condition. This is also the timeout value that nearly all IBM SAN storage
recommends setting as the default value.
Link: https://lore.kernel.org/r/1610463998-19791-1-git-send-email-brking@linux.vnet.ibm.com
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b2b0f16fa6 ]
A race condition exists between the response handler getting called because
of exchange_mgr_reset() (which clears out all the active XIDs) and the
response we get via an interrupt.
Sequence of events:
rport ba0200: Port timeout, state PLOGI
rport ba0200: Port entered PLOGI state from PLOGI state
xid 1052: Exchange timer armed : 20000 msecs xid timer armed here
rport ba0200: Received LOGO request while in state PLOGI
rport ba0200: Delete port
rport ba0200: work event 3
rport ba0200: lld callback ev 3
bnx2fc: rport_event_hdlr: event = 3, port_id = 0xba0200
bnx2fc: ba0200 - rport not created Yet!!
/* Here we reset any outstanding exchanges before
freeing rport using the exch_mgr_reset() */
xid 1052: Exchange timer canceled
/* Here we got two responses for one xid */
xid 1052: invoking resp(), esb 20000000 state 3
xid 1052: invoking resp(), esb 20000000 state 3
xid 1052: fc_rport_plogi_resp() : ep->resp_active 2
xid 1052: fc_rport_plogi_resp() : ep->resp_active 2
Skip the response if the exchange is already completed.
Link: https://lore.kernel.org/r/20201215194731.2326-1-jhasan@marvell.com
Signed-off-by: Javed Hasan <jhasan@marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 46c54cf270 ]
The Estar Beauty HD (MID 7316R) tablet uses a Goodix touchscreen,
with the X and Y coordinates swapped compared to the LCD panel.
Add a touchscreen_dmi entry for this adding a "touchscreen-swapped-x-y"
device-property to the i2c-client instantiated for this device before
the driver binds.
This is the first entry of a Goodix touchscreen to touchscreen_dmi.c,
so far DMI quirks for Goodix touchscreen's have been added directly
to drivers/input/touchscreen/goodix.c. Currently there are 3
DMI tables in goodix.c:
1. rotated_screen[] for devices where the touchscreen is rotated
180 degrees vs the LCD panel
2. inverted_x_screen[] for devices where the X axis is inverted
3. nine_bytes_report[] for devices which use a non standard touch
report size
Arguably only 3. really needs to be inside the driver and the other
2 cases are better handled through the generic touchscreen DMI quirk
mechanism from touchscreen_dmi.c, which allows adding device-props to
any i2c-client. Esp. now that goodix.c is using the generic
touchscreen_properties code.
Alternative to the approach from this patch we could add a 4th
dmi_system_id table for devices with swapped-x-y axis to goodix.c,
but that seems undesirable.
Cc: Bastien Nocera <hadess@hadess.net>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20201224135158.10976-1-hdegoede@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 764257d906 ]
On deferred probe, we will get the following splat:
cpcap-usb-phy cpcap-usb-phy.0: could not initialize VBUS or ID IIO: -517
WARNING: CPU: 0 PID: 21 at drivers/regulator/core.c:2123 regulator_put+0x68/0x78
...
(regulator_put) from [<c068ebf0>] (release_nodes+0x1b4/0x1fc)
(release_nodes) from [<c068a9a4>] (really_probe+0x104/0x4a0)
(really_probe) from [<c068b034>] (driver_probe_device+0x58/0xb4)
Signed-off-by: Tony Lindgren <tony@atomide.com>
Link: https://lore.kernel.org/r/20201230102105.11826-1-tony@atomide.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7f2923c4f7 upstream.
proc_get_long() is a funny function. It uses simple_strtoul() and for a
good reason. proc_get_long() wants to always succeed the parse and
return the maybe incorrect value and the trailing characters to check
against a pre-defined list of acceptable trailing values. However,
simple_strtoul() explicitly ignores overflows which can cause funny
things like the following to happen:
echo 18446744073709551616 > /proc/sys/fs/file-max
cat /proc/sys/fs/file-max
0
(Which will cause your system to silently die behind your back.)
On the other hand kstrtoul() does do overflow detection but does not
return the trailing characters, and also fails the parse when anything
other than '\n' is a trailing character whereas proc_get_long() wants to
be more lenient.
Now, before adding another kstrtoul() function let's simply add a static
parse strtoul_lenient() which:
- fails on overflow with -ERANGE
- returns the trailing characters to the caller
The reason why we should fail on ERANGE is that we already do a partial
fail on overflow right now. Namely, when the TMPBUFLEN is exceeded. So
we already reject values such as 184467440737095516160 (21 chars) but
accept values such as 18446744073709551616 (20 chars) but both are
overflows. So we should just always reject 64bit overflows and not
special-case this based on the number of chars.
Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Joerg Vehlow <lkml@jv-coder.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 81b704d3e4 upstream.
Calling acpi_thermal_check() from acpi_thermal_notify() directly
is problematic if _TMP triggers Notify () on the thermal zone for
which it has been evaluated (which happens on some systems), because
it causes a new acpi_thermal_notify() invocation to be queued up
every time and if that takes place too often, an indefinite number of
pending work items may accumulate in kacpi_notify_wq over time.
Besides, it is not really useful to queue up a new invocation of
acpi_thermal_check() if one of them is pending already.
For these reasons, rework acpi_thermal_notify() to queue up a thermal
check instead of calling acpi_thermal_check() directly and only allow
one thermal check to be pending at a time. Moreover, only allow one
acpi_thermal_check_fn() instance at a time to run
thermal_zone_device_update() for one thermal zone and make it return
early if it sees other instances running for the same thermal zone.
While at it, fold acpi_thermal_check() into acpi_thermal_check_fn(),
as it is only called from there after the other changes made here.
[This issue appears to have been exposed by commit 6d25be5782
("sched/core, workqueues: Distangle worker accounting from rq
lock"), but it is unclear why it was not visible earlier.]
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208877
Reported-by: Stephen Berman <stephen.berman@gmx.net>
Diagnosed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Stephen Berman <stephen.berman@gmx.net>
Cc: All applicable <stable@vger.kernel.org>
[bigeasy: Backported to v5.4.y]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 62d9f1a694 upstream.
Upon receiving a cumulative ACK that changes the congestion state from
Disorder to Open, the TLP timer is not set. If the sender is app-limited,
it can only wait for the RTO timer to expire and retransmit.
The reason for this is that the TLP timer is set before the congestion
state changes in tcp_ack(), so we delay the time point of calling
tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and
remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer
is set.
This commit has two additional benefits:
1) Make sure to reset RTO according to RFC6298 when receiving ACK, to
avoid spurious RTO caused by RTO timer early expires.
2) Reduce the xmit timer reschedule once per ACK when the RACK reorder
timer is set.
Fixes: df92c8394e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f0947d0d21 upstream.
Function __team_compute_features() is protected by team->lock
mutex when it is called from team_compute_features() used when
features of an underlying device is changed. This causes
a deadlock when NETDEV_FEAT_CHANGE notifier for underlying device
is fired due to change propagated from team driver (e.g. MTU
change). It's because callbacks like team_change_mtu() or
team_vlan_rx_{add,del}_vid() protect their port list traversal
by team->lock mutex.
Example (r8169 case where this driver disables TSO for certain MTU
values):
...
[ 6391.348202] __mutex_lock.isra.6+0x2d0/0x4a0
[ 6391.358602] team_device_event+0x9d/0x160 [team]
[ 6391.363756] notifier_call_chain+0x47/0x70
[ 6391.368329] netdev_update_features+0x56/0x60
[ 6391.373207] rtl8169_change_mtu+0x14/0x50 [r8169]
[ 6391.378457] dev_set_mtu_ext+0xe1/0x1d0
[ 6391.387022] dev_set_mtu+0x52/0x90
[ 6391.390820] team_change_mtu+0x64/0xf0 [team]
[ 6391.395683] dev_set_mtu_ext+0xe1/0x1d0
[ 6391.399963] do_setlink+0x231/0xf50
...
In fact team_compute_features() called from team_device_event()
does not need to be protected by team->lock mutex and rcu_read_lock()
is sufficient there for port list traversal.
Fixes: 3d249d4ca7 ("net: introduce ethernet teaming device")
Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20210125074416.4056484-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9def3b1a07 upstream.
Since commit c40aaaac10 ("iommu/vt-d: Gracefully handle DMAR units
with no supported address widths") dmar.c needs struct iommu_device to
be selected. We can drop this dependency by not dereferencing struct
iommu_device if IOMMU_API is not selected and by reusing the information
stored in iommu->drhd->ignored instead.
This fixes the following build error when IOMMU_API is not selected:
drivers/iommu/dmar.c: In function ‘free_iommu’:
drivers/iommu/dmar.c:1139:41: error: ‘struct iommu_device’ has no member named ‘ops’
1139 | if (intel_iommu_enabled && iommu->iommu.ops) {
^
Fixes: c40aaaac10 ("iommu/vt-d: Gracefully handle DMAR units with no supported address widths")
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20201013073055.11262-1-brgl@bgdev.pl
Signed-off-by: Joerg Roedel <jroedel@suse.de>
[ - context change due to moving drivers/iommu/dmar.c to
drivers/iommu/intel/dmar.c
- set the drhr in the iommu like in upstream commit b1012ca8dc
("iommu/vt-d: Skip TE disabling on quirky gfx dedicated iommu") ]
Signed-off-by: Filippo Sironi <sironi@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 487c6ef81e ]
When we create the ft object we also init rhltable in ft->fgs_hash.
So in error flow before kfree of ft we need to destroy that rhltable.
Fixes: 693c6883bb ("net/mlx5: Add hash table for flow groups in flow table")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3d372c4edf ]
If we spin for a long time in memory reads that (for some reason in
hardware) take a long time, then we'll eventually get messages such
as
watchdog: BUG: soft lockup - CPU#2 stuck for 24s! [kworker/2:2:272]
This is because the reading really does take a very long time, and
we don't schedule, so we're hogging the CPU with this task, at least
if CONFIG_PREEMPT is not set, e.g. with CONFIG_PREEMPT_VOLUNTARY=y.
Previously I misinterpreted the situation and thought that this was
only going to happen if we had interrupts disabled, and then fixed
this (which is good anyway, however), but that didn't always help;
looking at it again now I realized that the spin unlock will only
reschedule if CONFIG_PREEMPT is used.
In order to avoid this issue, change the code to cond_resched() if
we've been spinning for too long here.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fixes: 04516706bb ("iwlwifi: pcie: limit memory read spin time")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/iwlwifi.20210115130253.217a9d6a6a12.If964cb582ab0aaa94e81c4ff3b279eaafda0fd3f@changeid
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 814b849713 ]
If the server returns a new stateid that does not match the one in our
cache, then pnfs_layout_process() will leak the layout segments returned
by pnfs_mark_layout_stateid_invalid().
Fixes: 9888d837f3 ("pNFS: Force a retry of LAYOUTGET if the stateid doesn't match our cache")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9f8550e4bd ]
The disable_xfrm flag signals that xfrm should not be performed during
routing towards a device before reaching device xmit.
For xfrm interfaces this is usually desired as they perform the outbound
policy lookup as part of their xmit using their if_id.
Before this change enabling this flag on xfrm interfaces prevented them
from xmitting as xfrm_lookup_with_ifid() would not perform a policy lookup
in case the original dst had the DST_NOXFRM flag.
This optimization is incorrect when the lookup is done by the xfrm
interface xmit logic.
Fix by performing policy lookup when invoked by xfrmi as if_id != 0.
Similarly it's unlikely for the 'no policy exists on net' check to yield
any performance benefits when invoked from xfrmi.
Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 56ce7c25ae ]
When setting xfrm replay_window to values higher than 32, a rare
page-fault occurs in xfrm_replay_advance_bmp:
BUG: unable to handle page fault for address: ffff8af350ad7920
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD ad001067 P4D ad001067 PUD 0
Oops: 0002 [#1] SMP PTI
CPU: 3 PID: 30 Comm: ksoftirqd/3 Kdump: loaded Not tainted 5.4.52-050452-generic #202007160732
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
RIP: 0010:xfrm_replay_advance_bmp+0xbb/0x130
RSP: 0018:ffffa1304013ba40 EFLAGS: 00010206
RAX: 000000000000010d RBX: 0000000000000002 RCX: 00000000ffffff4b
RDX: 0000000000000018 RSI: 00000000004c234c RDI: 00000000ffb3dbff
RBP: ffffa1304013ba50 R08: ffff8af330ad7920 R09: 0000000007fffffa
R10: 0000000000000800 R11: 0000000000000010 R12: ffff8af29d6258c0
R13: ffff8af28b95c700 R14: 0000000000000000 R15: ffff8af29d6258fc
FS: 0000000000000000(0000) GS:ffff8af339ac0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8af350ad7920 CR3: 0000000015ee4000 CR4: 00000000001406e0
Call Trace:
xfrm_input+0x4e5/0xa10
xfrm4_rcv_encap+0xb5/0xe0
xfrm4_udp_encap_rcv+0x140/0x1c0
Analysis revealed offending code is when accessing:
replay_esn->bmp[nr] |= (1U << bitnr);
with 'nr' being 0x07fffffa.
This happened in an SMP system when reordering of packets was present;
A packet arrived with a "too old" sequence number (outside the window,
i.e 'diff > replay_window'), and therefore the following calculation:
bitnr = replay_esn->replay_window - (diff - pos);
yields a negative result, but since bitnr is u32 we get a large unsigned
quantity (in crash dump above: 0xffffff4b seen in ecx).
This was supposed to be protected by xfrm_input()'s former call to:
if (x->repl->check(x, skb, seq)) {
However, the state's spinlock x->lock is *released* after '->check()'
is performed, and gets re-acquired before '->advance()' - which gives a
chance for a different core to update the xfrm state, e.g. by advancing
'replay_esn->seq' when it encounters more packets - leading to a
'diff > replay_window' situation when original core continues to
xfrm_replay_advance_bmp().
An attempt to fix this issue was suggested in commit bcf66bf54a
("xfrm: Perform a replay check after return from async codepaths"),
by calling 'x->repl->recheck()' after lock is re-acquired, but fix
applied only to asyncronous crypto algorithms.
Augment the fix, by *always* calling 'recheck()' - irrespective if we're
using async crypto.
Fixes: 0ebea8ef35 ("[IPSEC]: Move state lock into x->type->input")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 0c5b7a501e upstream.
Otherwise, the newly create element shows no timeout when listing the
ruleset. If the set definition does not specify a default timeout, then
the set element only shows the expiration time, but not the timeout.
This is a problem when restoring a stateful ruleset listing since it
skips the timeout policy entirely.
Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a88afa46b8 upstream.
When the kernel is configured to use the Thumb-2 instruction set
"suspend-to-memory" fails to resume. Observed on a Colibri iMX6ULL
(i.MX 6ULL) and Apalis iMX6 (i.MX 6Q).
It looks like the CPU resumes unconditionally in ARM instruction mode
and then chokes on the presented Thumb-2 code it should execute.
Fix this by using the arm instruction set for all code in
suspend-imx6.S.
Signed-off-by: Max Krummenacher <max.krummenacher@toradex.com>
Fixes: df595746fa ("ARM: imx: add suspend in ocram support for i.mx6q")
Acked-by: Oleksandr Suvorov <oleksandr.suvorov@toradex.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0549cd67b0 upstream.
This is inline with the specification described in blkif.h:
* discard-granularity: should be set to the physical block size if
node is not present.
* discard-alignment, discard-secure: should be set to 0 if node not
present.
This was detected as QEMU would only create the discard-granularity
node but not discard-alignment, and thus the setup done in
blkfront_setup_discard would fail.
Fix blkfront_setup_discard to not fail on missing nodes, and also fix
blkif_set_queue_limits to set the discard granularity to the physical
block size if none is specified in xenbus.
Fixes: ed30bf317c ('xen-blkfront: Handle discard requests.')
Reported-by: Arthur Borsboom <arthurborsboom@gmail.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-By: Arthur Borsboom <arthurborsboom@gmail.com>
Link: https://lore.kernel.org/r/20210119105727.95173-1-roger.pau@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 27af8e2c90 upstream.
We have the following potential deadlock condition:
========================================================
WARNING: possible irq lock inversion dependency detected
5.10.0-rc2+ #25 Not tainted
--------------------------------------------------------
swapper/3/0 just changed the state of lock:
ffff8880063bd618 (&host->lock){-...}-{2:2}, at: ata_bmdma_interrupt+0x27/0x200
but this lock took another, HARDIRQ-READ-unsafe lock in the past:
(&trig->leddev_list_lock){.+.?}-{2:2}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&trig->leddev_list_lock);
local_irq_disable();
lock(&host->lock);
lock(&trig->leddev_list_lock);
<Interrupt>
lock(&host->lock);
*** DEADLOCK ***
no locks held by swapper/3/0.
the shortest dependencies between 2nd lock and 1st lock:
-> (&trig->leddev_list_lock){.+.?}-{2:2} ops: 46 {
HARDIRQ-ON-R at:
lock_acquire+0x15f/0x420
_raw_read_lock+0x42/0x90
led_trigger_event+0x2b/0x70
rfkill_global_led_trigger_worker+0x94/0xb0
process_one_work+0x240/0x560
worker_thread+0x58/0x3d0
kthread+0x151/0x170
ret_from_fork+0x1f/0x30
IN-SOFTIRQ-R at:
lock_acquire+0x15f/0x420
_raw_read_lock+0x42/0x90
led_trigger_event+0x2b/0x70
kbd_bh+0x9e/0xc0
tasklet_action_common.constprop.0+0xe9/0x100
tasklet_action+0x22/0x30
__do_softirq+0xcc/0x46d
run_ksoftirqd+0x3f/0x70
smpboot_thread_fn+0x116/0x1f0
kthread+0x151/0x170
ret_from_fork+0x1f/0x30
SOFTIRQ-ON-R at:
lock_acquire+0x15f/0x420
_raw_read_lock+0x42/0x90
led_trigger_event+0x2b/0x70
rfkill_global_led_trigger_worker+0x94/0xb0
process_one_work+0x240/0x560
worker_thread+0x58/0x3d0
kthread+0x151/0x170
ret_from_fork+0x1f/0x30
INITIAL READ USE at:
lock_acquire+0x15f/0x420
_raw_read_lock+0x42/0x90
led_trigger_event+0x2b/0x70
rfkill_global_led_trigger_worker+0x94/0xb0
process_one_work+0x240/0x560
worker_thread+0x58/0x3d0
kthread+0x151/0x170
ret_from_fork+0x1f/0x30
}
... key at: [<ffffffff83da4c00>] __key.0+0x0/0x10
... acquired at:
_raw_read_lock+0x42/0x90
led_trigger_blink_oneshot+0x3b/0x90
ledtrig_disk_activity+0x3c/0xa0
ata_qc_complete+0x26/0x450
ata_do_link_abort+0xa3/0xe0
ata_port_freeze+0x2e/0x40
ata_hsm_qc_complete+0x94/0xa0
ata_sff_hsm_move+0x177/0x7a0
ata_sff_pio_task+0xc7/0x1b0
process_one_work+0x240/0x560
worker_thread+0x58/0x3d0
kthread+0x151/0x170
ret_from_fork+0x1f/0x30
-> (&host->lock){-...}-{2:2} ops: 69 {
IN-HARDIRQ-W at:
lock_acquire+0x15f/0x420
_raw_spin_lock_irqsave+0x52/0xa0
ata_bmdma_interrupt+0x27/0x200
__handle_irq_event_percpu+0xd5/0x2b0
handle_irq_event+0x57/0xb0
handle_edge_irq+0x8c/0x230
asm_call_irq_on_stack+0xf/0x20
common_interrupt+0x100/0x1c0
asm_common_interrupt+0x1e/0x40
native_safe_halt+0xe/0x10
arch_cpu_idle+0x15/0x20
default_idle_call+0x59/0x1c0
do_idle+0x22c/0x2c0
cpu_startup_entry+0x20/0x30
start_secondary+0x11d/0x150
secondary_startup_64_no_verify+0xa6/0xab
INITIAL USE at:
lock_acquire+0x15f/0x420
_raw_spin_lock_irqsave+0x52/0xa0
ata_dev_init+0x54/0xe0
ata_link_init+0x8b/0xd0
ata_port_alloc+0x1f1/0x210
ata_host_alloc+0xf1/0x130
ata_host_alloc_pinfo+0x14/0xb0
ata_pci_sff_prepare_host+0x41/0xa0
ata_pci_bmdma_prepare_host+0x14/0x30
piix_init_one+0x21f/0x600
local_pci_probe+0x48/0x80
pci_device_probe+0x105/0x1c0
really_probe+0x221/0x490
driver_probe_device+0xe9/0x160
device_driver_attach+0xb2/0xc0
__driver_attach+0x91/0x150
bus_for_each_dev+0x81/0xc0
driver_attach+0x1e/0x20
bus_add_driver+0x138/0x1f0
driver_register+0x91/0xf0
__pci_register_driver+0x73/0x80
piix_init+0x1e/0x2e
do_one_initcall+0x5f/0x2d0
kernel_init_freeable+0x26f/0x2cf
kernel_init+0xe/0x113
ret_from_fork+0x1f/0x30
}
... key at: [<ffffffff83d9fdc0>] __key.6+0x0/0x10
... acquired at:
__lock_acquire+0x9da/0x2370
lock_acquire+0x15f/0x420
_raw_spin_lock_irqsave+0x52/0xa0
ata_bmdma_interrupt+0x27/0x200
__handle_irq_event_percpu+0xd5/0x2b0
handle_irq_event+0x57/0xb0
handle_edge_irq+0x8c/0x230
asm_call_irq_on_stack+0xf/0x20
common_interrupt+0x100/0x1c0
asm_common_interrupt+0x1e/0x40
native_safe_halt+0xe/0x10
arch_cpu_idle+0x15/0x20
default_idle_call+0x59/0x1c0
do_idle+0x22c/0x2c0
cpu_startup_entry+0x20/0x30
start_secondary+0x11d/0x150
secondary_startup_64_no_verify+0xa6/0xab
This lockdep splat is reported after:
commit e918188611 ("locking: More accurate annotations for read_lock()")
To clarify:
- read-locks are recursive only in interrupt context (when
in_interrupt() returns true)
- after acquiring host->lock in CPU1, another cpu (i.e. CPU2) may call
write_lock(&trig->leddev_list_lock) that would be blocked by CPU0
that holds trig->leddev_list_lock in read-mode
- when CPU1 (ata_ac_complete()) tries to read-lock
trig->leddev_list_lock, it would be blocked by the write-lock waiter
on CPU2 (because we are not in interrupt context, so the read-lock is
not recursive)
- at this point if an interrupt happens on CPU0 and
ata_bmdma_interrupt() is executed it will try to acquire host->lock,
that is held by CPU1, that is currently blocked by CPU2, so:
* CPU0 blocked by CPU1
* CPU1 blocked by CPU2
* CPU2 blocked by CPU0
*** DEADLOCK ***
The deadlock scenario is better represented by the following schema
(thanks to Boqun Feng <boqun.feng@gmail.com> for the schema and the
detailed explanation of the deadlock condition):
CPU 0: CPU 1: CPU 2:
----- ----- -----
led_trigger_event():
read_lock(&trig->leddev_list_lock);
<workqueue>
ata_hsm_qc_complete():
spin_lock_irqsave(&host->lock);
write_lock(&trig->leddev_list_lock);
ata_port_freeze():
ata_do_link_abort():
ata_qc_complete():
ledtrig_disk_activity():
led_trigger_blink_oneshot():
read_lock(&trig->leddev_list_lock);
// ^ not in in_interrupt() context, so could get blocked by CPU 2
<interrupt>
ata_bmdma_interrupt():
spin_lock_irqsave(&host->lock);
Fix by using read_lock_irqsave/irqrestore() in led_trigger_event(), so
that no interrupt can happen in between, preventing the deadlock
condition.
Apply the same change to led_trigger_blink_setup() as well, since the
same deadlock scenario can also happen in power_supply_update_bat_leds()
-> led_trigger_blink() -> led_trigger_blink_setup() (workqueue context),
and potentially prevent other similar usages.
Link: https://lore.kernel.org/lkml/20201101092614.GB3989@xps-13-7390/
Fixes: eb25cb9956 ("leds: convert IDE trigger to common disk trigger")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>