Pull RCU updates from Paul McKenney:
"Documentation updates
Miscellaneous fixes, perhaps most notably:
- Remove RCU_NONIDLE(). The new visibility of most of the idle loop
to RCU has obsoleted this API.
- Make the RCU_SOFTIRQ callback-invocation time limit also apply to
the rcuc kthreads that invoke callbacks for CONFIG_PREEMPT_RT.
- Add a jiffies-based callback-invocation time limit to handle
long-running callbacks. (The local_clock() function is only invoked
once per 32 callbacks due to its high overhead.)
- Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs, which
fixes a bug that can occur on systems with non-contiguous CPU
numbering.
kvfree_rcu updates:
- Eliminate the single-argument variant of k[v]free_rcu() now that
all uses have been converted to k[v]free_rcu_mightsleep().
- Add WARN_ON_ONCE() checks for k[v]free_rcu*() freeing callbacks too
soon. Yes, this is closing the barn door after the horse has
escaped, but Murphy says that there will be more horses.
Callback-offloading updates:
- Fix a number of bugs involving the shrinker and lazy callbacks.
Tasks RCU updates
Torture-test updates"
* tag 'rcu.2023.06.22a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (32 commits)
torture: Remove duplicated argument -enable-kvm for ppc64
doc/rcutorture: Add description of rcutorture.stall_cpu_block
rcu/rcuscale: Stop kfree_scale_thread thread(s) after unloading rcuscale
rcu/rcuscale: Move rcu_scale_*() after kfree_scale_cleanup()
rcutorture: Correct name of use_softirq module parameter
locktorture: Add long_hold to adjust lock-hold delays
rcu/nocb: Make shrinker iterate only over NOCB CPUs
rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUs
rcu: Make rcu_cpu_starting() rely on interrupts being disabled
rcu: Mark rcu_cpu_kthread() accesses to ->rcu_cpu_has_work
rcu: Mark additional concurrent load from ->cpu_no_qs.b.exp
rcu: Employ jiffies-based backstop to callback time limit
rcu: Check callback-invocation time limit for rcuc kthreads
rcu: Remove RCU_NONIDLE()
rcu: Add more RCU files to kernel-api.rst
rcu-tasks: Clarify the cblist_init_generic() function's pr_info() output
rcu-tasks: Avoid pr_info() with spin lock in cblist_init_generic()
rcu/nocb: Recheck lazy callbacks under the ->nocb_lock from shrinker
rcu/nocb: Fix shrinker race against callback enqueuer
rcu/nocb: Protect lazy shrinker against concurrent (de-)offloading
...
It feels very unlikely that anybody would want to do a GUP in an
unmapped area under the stack pointer, but real users sometimes do some
really strange things. So add a (temporary) warning for the case where
a GUP fails and expanding the stack might have made it work.
It's trivial to do the expansion in the caller as part of getting the mm
lock in the first place - see __access_remote_vm() for ptrace, for
example - it's just that it's unnecessarily painful to do it deep in the
guts of the GUP lookup when we might have to drop and re-take the lock.
I doubt anybody actually does anything quite this strange, but let's be
proactive: adding these warnings is simple, and will make debugging it
much easier if they trigger.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When using the felix driver (the only one which supports UC filtering
and MC filtering) as a DSA master for a random other DSA switch, one can
see the following stack trace when the downstream switch ports join a
VLAN-aware bridge:
=============================
WARNING: suspicious RCU usage
-----------------------------
net/8021q/vlan_core.c:238 suspicious rcu_dereference_protected() usage!
stack backtrace:
Workqueue: dsa_ordered dsa_slave_switchdev_event_work
Call trace:
lockdep_rcu_suspicious+0x170/0x210
vlan_for_each+0x8c/0x188
dsa_slave_sync_uc+0x128/0x178
__hw_addr_sync_dev+0x138/0x158
dsa_slave_set_rx_mode+0x58/0x70
__dev_set_rx_mode+0x88/0xa8
dev_uc_add+0x74/0xa0
dsa_port_bridge_host_fdb_add+0xec/0x180
dsa_slave_switchdev_event_work+0x7c/0x1c8
process_one_work+0x290/0x568
What it's saying is that vlan_for_each() expects rtnl_lock() context and
it's not getting it, when it's called from the DSA master's ndo_set_rx_mode().
The caller of that - dsa_slave_set_rx_mode() - is the slave DSA
interface's dsa_port_bridge_host_fdb_add() which comes from the deferred
dsa_slave_switchdev_event_work().
We went to great lengths to avoid the rtnl_lock() context in that call
path in commit 0faf890fc5 ("net: dsa: drop rtnl_lock from
dsa_slave_switchdev_event_work"), and calling rtnl_lock() is simply not
an option due to the possibility of deadlocking when calling
dsa_flush_workqueue() from the call paths that do hold rtnl_lock() -
basically all of them.
So, when the DSA master calls vlan_for_each() from its ndo_set_rx_mode(),
the state of the 8021q driver on this device is really not protected
from concurrent access by anything.
Looking at net/8021q/, I don't think that vlan_info->vid_list was
particularly designed with RCU traversal in mind, so introducing an RCU
read-side form of vlan_for_each() - vlan_for_each_rcu() - won't be so
easy, and it also wouldn't be exactly what we need anyway.
In general I believe that the solution isn't in net/8021q/ anyway;
vlan_for_each() is not cut out for this task. DSA doesn't need rtnl_lock()
to be held per se - since it's not a netdev state change that we're
blocking, but rather, just concurrent additions/removals to a VLAN list.
We don't even need sleepable context - the callback of vlan_for_each()
just schedules deferred work.
The proposed escape is to remove the dependency on vlan_for_each() and
to open-code a non-sleepable, rtnl-free alternative to that, based on
copies of the VLAN list modified from .ndo_vlan_rx_add_vid() and
.ndo_vlan_rx_kill_vid().
Fixes: 64fdc5f341 ("net: dsa: sync unicast and multicast addresses for VLAN filters too")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20230626154402.3154454-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With commit 9da0763bdd ("kbuild: Use relative path when building in
a subdir of the source tree"), compiler messages in out-of-tree builds
include relative paths, which are relative to the build directory, not
the directory where make was started.
To help IDEs/editors find the source files, Kbuild lets GNU Make print
"Entering directory ..." when it changes the working directory. It has
been working fine for a long time, but David reported it is broken with
the latest GNU Make.
The behavior was changed by GNU Make commit 8f9e7722ff0f ("[SV 63537]
Fix setting -w in makefiles"). Previously, setting --no-print-directory
to MAKEFLAGS only affected child makes, but it is now interpreted in
the current make as soon as it is set.
[test code]
$ cat /tmp/Makefile
ifneq ($(SUBMAKE),1)
MAKEFLAGS += --no-print-directory
all: ; $(MAKE) SUBMAKE=1
else
all: ; :
endif
[before 8f9e7722ff0f]
$ make -C /tmp
make: Entering directory '/tmp'
make SUBMAKE=1
:
make: Leaving directory '/tmp'
[after 8f9e7722ff0f]
$ make -C /tmp
make SUBMAKE=1
:
Previously, the effect of --no-print-directory was delayed until Kbuild
started the directory descending, but it is no longer true with GNU Make
4.4.1.
This commit adds one more recursion to cater to GNU Make >= 4.4.1.
When Kbuild needs to change the working directory, __submake will be
executed twice.
__submake without --no-print-directory --> show "Entering directory ..."
__submake with --no-print-directory --> parse the rest of Makefile
We end up with one more recursion than needed for GNU Make < 4.4.1, but
I do not want to complicate the version check.
Reported-by: David Howells <dhowells@redhat.com>
Closes: https://lore.kernel.org/all/2427604.1686237298@warthog.procyon.org.uk/
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Nicolas Schier <n.schier@avm.de>
When you run 'make rpm-pkg', the rpmbuild tool builds the kernel in
rpmbuild/BUILD, but $(abs_srctree) and $(abs_objtree) point to the
directory path where make was started, not the kernel is actually
being built. The same applies to 'make snap-pkg'. Fix it.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Stephen reports warnings when rendering phylink kdocs as HTML:
include/linux/phylink.h:110: ERROR: Unexpected indentation.
include/linux/phylink.h:111: WARNING: Block quote ends without a blank line; unexpected unindent.
include/linux/phylink.h:614: WARNING: Inline literal start-string without end-string.
include/linux/phylink.h:644: WARNING: Inline literal start-string without end-string.
Make phylink_pcs_neg_mode() use a proper list format to fix the first
two warnings.
The last two warnings, AFAICT, come from the use of shorthand like
phylink_mode_*(). Perhaps those should be special-cased at the Sphinx
level.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lore.kernel.org/all/20230626162908.2f149f98@canb.auug.org.au/
Link: https://lore.kernel.org/r/20230626214640.3142252-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Two deadly typos break RX and TX traffic on the VSC8502 PHY using RGMII
if phy-mode = "rgmii-id" or "rgmii-txid", and no "tx-internal-delay-ps"
override exists. The negative error code from phy_get_internal_delay()
does not get overridden with the delay deduced from the phy-mode, and
later gets committed to hardware. Also, the rx_delay gets overridden by
what should have been the tx_delay.
Fixes: dbb050d2bf ("phy: mscc: Add support for RGMII delay configuration")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Harini Katakam <harini.katakam@amd.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230627134235.3453358-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Julia Lawall says:
====================
use vmalloc_array and vcalloc
The functions vmalloc_array and vcalloc were introduced in
commit a8749a35c3 ("mm: vmalloc: introduce array allocation functions")
but are not used much yet. This series introduces uses of
these functions, to protect against multiplication overflows.
The changes were done using the following Coccinelle semantic
patch.
@initialize:ocaml@
@@
let rename alloc =
match alloc with
"vmalloc" -> "vmalloc_array"
| "vzalloc" -> "vcalloc"
| _ -> failwith "unknown"
@@
size_t e1,e2;
constant C1, C2;
expression E1, E2, COUNT, x1, x2, x3;
typedef u8;
typedef __u8;
type t = {u8,__u8,char,unsigned char};
identifier alloc = {vmalloc,vzalloc};
fresh identifier realloc = script:ocaml(alloc) { rename alloc };
@@
(
alloc(x1*x2*x3)
|
alloc(C1 * C2)
|
alloc((sizeof(t)) * (COUNT), ...)
|
- alloc((e1) * (e2))
+ realloc(e1, e2)
|
- alloc((e1) * (COUNT))
+ realloc(COUNT, e1)
|
- alloc((E1) * (E2))
+ realloc(E1, E2)
)
v2: This series uses vmalloc_array and vcalloc instead of
array_size. It also leaves a multiplication of a constant by a
sizeof as is. Two patches are thus dropped from the series.
====================
Link: https://lore.kernel.org/r/20230627144339.144478-1-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In nfsd4_encode_fattr(), TIME_CREATE was being written out after all
other times. However, they should be written out in an order that
matches the bit flags in bmval1, which in this case are
#define FATTR4_WORD1_TIME_ACCESS (1UL << 15)
#define FATTR4_WORD1_TIME_CREATE (1UL << 18)
#define FATTR4_WORD1_TIME_DELTA (1UL << 19)
#define FATTR4_WORD1_TIME_METADATA (1UL << 20)
#define FATTR4_WORD1_TIME_MODIFY (1UL << 21)
so TIME_CREATE should come second.
I noticed this on a FreeBSD NFSv4.2 client, which supports creation
times. On this client, file times were weirdly permuted. With this
patch applied on the server, times looked normal on the client.
Fixes: e377a3e698 ("nfsd: Add support for the birth time attribute")
Link: https://unix.stackexchange.com/q/749605/56202
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It is possible that the next available path we failover to, happens to
be frozen (for example if it is during connection establishment). If
the original I/O was set with NOWAIT, this cause the I/O to unnecessarily
fail because the request queue cannot be entered, hence the I/O fails with
EAGAIN.
The NOWAIT restriction that was originally set for the I/O is no longer
relevant or needed because this is the nvme requeue context. Hence we
clear the REQ_NOWAIT flag when failing over I/O.
This fix a simple test case of nvme controller reset during I/O when the
multipath device that has only a single path and I/O fails with "Resource
temporarily unavailable" errno. Note that this reproduces with io_uring
which by default sets IOCB_NOWAIT by default.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Correctly spell "Zeroes" in nvme_cmd_write_zeroes command name.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Commit 896e97bf99 ("ACPI: EC: Clear GPE on interrupt handling only")
broke suspend-to-idle at least on Dell XPS13 9360 and 9380.
The problem is that acpi_ec_dispatch_gpe() must clear the EC GPE,
because the EC GPE handler never runs when the system is in the
suspend-to-idle state and if the EC GPE is not cleared by the suspend-
to-idle loop, it is never cleared at all which leads to a GPE storm.
This causes suspend-to-idle to burn energy instead of saving it which
is potentially dangerous (the affected machines heat up rather badly
when that happens).
Addess this by making acpi_ec_dispatch_gpe() clear the EC GPE as it did
before.
Fixes: 896e97bf99 ("ACPI: EC: Clear GPE on interrupt handling only")
Tested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add support for generation of interrupts from the device directly to the
VM to the VCPU thus avoiding the overhead on the host CPU.
When supported, the driver will attempt to allocate vectors for each
data virtqueue. If a vector for a virtqueue cannot be provided it will
use the QP mode where notifications go through the driver.
In addition, we add a shutdown callback to make sure allocated
interrupts are released in case of shutdown to allow clean shutdown.
Signed-off-by: Eli Cohen <elic@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Message-Id: <20230607190007.290505-1-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Add missing documentation for the vqs_list_lock field of struct virtio_device,
and the validate field of struct virtio_driver.
./scripts/kernel-doc says:
.../virtio.h:131: warning: Function parameter or member 'vqs_list_lock' not described in 'virtio_device'
.../virtio.h:192: warning: Function parameter or member 'validate' not described in 'virtio_driver'
2 warnings as Errors
No functional changes intended.
Signed-off-by: Simon Horman <horms@kernel.org>
Message-Id: <20230510-virtio-kdoc-v3-1-e2681ed7a425@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
These are the adminq commands that will be needed for
setting up and using the vDPA device. There are a number
of commands defined in the FW's API, but by making use of
the FW's virtio BAR we only need a few of these commands
for vDPA support.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-9-shannon.nelson@amd.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The pds_core_logical_qtype enum and IFNAMSIZ are not needed
in the common PDS header, only needed when working with the
adminq, so move them to the adminq header.
Note: This patch might conflict with pds_vfio patches that are
in review, depending on which patchset gets pulled first.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-5-shannon.nelson@amd.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This is the initial auxiliary driver framework for a new vDPA
device driver, an auxiliary_bus client of the pds_core driver.
The pds_core driver supplies the PCI services for the VF device
and for accessing the adminq in the PF device.
This patch adds the very basics of registering for the auxiliary
device and setting up debugfs entries.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-4-shannon.nelson@amd.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
To add a bit of vendor flexibility with various virtio based devices,
allow the caller to specify a different DMA mask. This adds a dma_mask
field to struct virtio_pci_modern_device. If defined by the driver,
this mask will be used in a call to dma_set_mask_and_coherent() instead
of the traditional DMA_BIT_MASK(64). This allows limiting the DMA space
on vendor devices with address limitations.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-3-shannon.nelson@amd.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
To add a bit of vendor flexibility with various virtio based devices,
allow the caller to check for a different device id. This adds a function
pointer field to struct virtio_pci_modern_device to specify an override
device id check. If defined by the driver, this function will be called
to check that the PCI device is the vendor's expected device, and will
return the found device id to be stored in mdev->id.device. This allows
vendors with alternative vendor device ids to use this library on their
own device BAR.
Note: A lot of the diff in this is simply indenting the existing code
into an else block.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230519215632.12343-2-shannon.nelson@amd.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Improve the size of the virtio_pci_device structure, which is commonly
used to represent a virtio PCI device. A given virtio PCI device can
either of legacy type or modern type, with the
struct virtio_pci_legacy_device occupying 32 bytes and the
struct virtio_pci_modern_device occupying 88 bytes. Make them a union,
thereby save 32 bytes of memory as shown by the pahole tool. This
improvement is particularly beneficial when dealing with numerous
devices, as it helps conserve memory resources.
Before the modification, pahole tool reported the following:
struct virtio_pci_device {
[...]
struct virtio_pci_legacy_device ldev; /* 824 32 */
/* --- cacheline 13 boundary (832 bytes) was 24 bytes ago --- */
struct virtio_pci_modern_device mdev; /* 856 88 */
/* XXX last struct has 4 bytes of padding */
[...]
/* size: 1056, cachelines: 17, members: 19 */
[...]
};
After the modification, pahole tool reported the following:
struct virtio_pci_device {
[...]
union {
struct virtio_pci_legacy_device ldev; /* 824 32 */
struct virtio_pci_modern_device mdev; /* 824 88 */
}; /* 824 88 */
[...]
/* size: 1024, cachelines: 16, members: 18 */
[...]
};
Signed-off-by: Feng Liu <feliu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Message-Id: <20230516135446.16266-1-feliu@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
"-mfunction-return=thunk -mindirect-branch-register" are only valid
for x86. So introduce compiler operation check to avoid such issues
Fixes: 0d0ed40061 ("tools/virtio: enable to build with retpoline")
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Message-Id: <20230323040024.3809108-1-peng.fan@oss.nxp.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Rather than former lazy-initialization mechanism,
now the virtqueue operations and driver_features related
ops access the virtio registers directly to take
immediate actions. So ifcvf_start_datapath() should
retire.
ifcvf_add_status() is retierd because we should not change
device status by a vendor driver's decision, this driver should
only set device status which is from virito drivers
upon vdpa_ops.set_status()
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230526145254.39537-4-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This commit implements a new function ifcvf_get_driver_feature()
which read driver_features from virtio registers.
To be less ambiguous, ifcvf_set_features() is renamed to
ifcvf_set_driver_features(), and ifcvf_get_features()
is renamed to ifcvf_get_dev_features() which returns
the provisioned vDPA device features.
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230526145254.39537-3-lingshan.zhu@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>