Commit Graph

1140015 Commits

Author SHA1 Message Date
Marc Kleine-Budde
e85e9e0d8c spi: spi-imx: spi_imx_transfer_one(): check for DMA transfer first
The SPI framework checks for each transfer (with the struct
spi_controller::can_dma callback) whether the driver wants to use DMA
for the transfer. If the driver returns true, the SPI framework will
map the transfer's data to the device, start the actual transfer and
map the data back.

In commit 07e7593877 ("spi: spi-imx: add PIO polling support") the
spi-imx driver's spi_imx_transfer_one() function was extended. If the
estimated duration of a transfer does not exceed a configurable
duration, a polling transfer function is used. This check happens
before checking if the driver decided earlier for a DMA transfer.

If spi_imx_can_dma() decided to use a DMA transfer, and the user
configured a big maximum polling duration, a polling transfer will be
used. The DMA unmap after the transfer destroys the transferred data.

To fix this problem check in spi_imx_transfer_one() if the driver
decided for DMA transfer first, then check the limits for a polling
transfer.

Fixes: 07e7593877 ("spi: spi-imx: add PIO polling support")
Link: https://lore.kernel.org/all/20221111003032.82371-1-festevam@gmail.com
Reported-by: Frieder Schrempf <frieder.schrempf@kontron.de>
Reported-by: Fabio Estevam <festevam@gmail.com>
Tested-by: Fabio Estevam <festevam@gmail.com>
Cc: David Jander <david@protonic.nl>
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Tested-by: Frieder Schrempf <frieder.schrempf@kontron.de>
Reviewed-by: Frieder Schrempf <frieder.schrempf@kontron.de>
Link: https://lore.kernel.org/r/20221116164930.855362-1-mkl@pengutronix.de
Signed-off-by: Mark Brown <broonie@kernel.org>
2022-11-18 11:39:50 +00:00
Pierre-Louis Bossart
f5f8ad3fcd ASoC: SOF: dai: move AMD_HS to end of list to restore backwards-compatibility
The addition of AMD_HS breaks Mediatek platforms by using an index
previously allocated to Mediatek. This is a backwards-compatibility
issue and needs to be fixed. All firmware released by AMD needs to be
re-generated and re-distributed.

Fixes: ed2562c64b ("ASoC: SOF: Adding amd HS functionality to the sof core")
Link: https://github.com/thesofproject/sof/issues/6615
Link: https://lore.kernel.org/alsa-devel/36a45c7a-820a-7675-d740-c0e83ae2c417@collabora.com/
Reported-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com>
Reviewed-by: Basavaraj Hiregoudar <basavaraj.hiregoudar@amd.com>
Reviewed-by: V sujith kumar Reddy <Vsujithkumar.Reddy@amd.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20221117232120.112639-1-pierre-louis.bossart@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2022-11-18 11:39:46 +00:00
Jiawen Wu
0b6ffefbb0 net: libwx: Fix dead code for duplicate check
Fix duplicate check on polling timeout.

Fixes: 1efa9bfe58 ("net: libwx: Implement interaction with firmware")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-18 10:53:13 +00:00
Mahesh Bandewar
40b9d1ab63 ipvlan: hold lower dev to avoid possible use-after-free
Recently syzkaller discovered the issue of disappearing lower
device (NETDEV_UNREGISTER) while the virtual device (like
macvlan) is still having it as a lower device. So it's just
a matter of time similar discovery will be made for IPvlan
device setup. So fixing it preemptively. Also while at it,
add a refcount tracker.

Fixes: 2ad7bf3638 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-18 10:39:22 +00:00
Thomas Zeitlhofer
8207f253a0 net: neigh: decrement the family specific qlen
Commit 0ff4eb3d5e ("neighbour: make proxy_queue.qlen limit
per-device") introduced the length counter qlen in struct neigh_parms.
There are separate neigh_parms instances for IPv4/ARP and IPv6/ND, and
while the family specific qlen is incremented in pneigh_enqueue(), the
mentioned commit decrements always the IPv4/ARP specific qlen,
regardless of the currently processed family, in pneigh_queue_purge()
and neigh_proxy_process().

As a result, with IPv6/ND, the family specific qlen is only incremented
(and never decremented) until it exceeds PROXY_QLEN, and then, according
to the check in pneigh_enqueue(), neighbor solicitations are not
answered anymore. As an example, this is noted when using the
subnet-router anycast address to access a Linux router. After a certain
amount of time (in the observed case, qlen exceeded PROXY_QLEN after two
days), the Linux router stops answering neighbor solicitations for its
subnet-router anycast address and effectively becomes unreachable.

Another result with IPv6/ND is that the IPv4/ARP specific qlen is
decremented more often than incremented. This leads to negative qlen
values, as a signed integer has been used for the length counter qlen,
and potentially to an integer overflow.

Fix this by introducing the helper function neigh_parms_qlen_dec(),
which decrements the family specific qlen. Thereby, make use of the
existing helper function neigh_get_dev_parms_rcu(), whose definition
therefore needs to be placed earlier in neighbour.c. Take the family
member from struct neigh_table to determine the currently processed
family and appropriately call neigh_parms_qlen_dec() from
pneigh_queue_purge() and neigh_proxy_process().

Additionally, use an unsigned integer for the length counter qlen.

Fixes: 0ff4eb3d5e ("neighbour: make proxy_queue.qlen limit per-device")
Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-18 10:29:50 +00:00
Leon Romanovsky
733d4bbf95 net: liquidio: simplify if expression
Fix the warning reported by kbuild:

cocci warnings: (new ones prefixed by >>)
>> drivers/net/ethernet/cavium/liquidio/lio_main.c:1797:54-56: WARNING !A || A && B is equivalent to !A || B
   drivers/net/ethernet/cavium/liquidio/lio_main.c:1827:54-56: WARNING !A || A && B is equivalent to !A || B

Fixes: 8979f428a4 ("net: liquidio: release resources when liquidio driver open failed")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-18 10:27:20 +00:00
Antoine Tenart
0dc33c6583 net: phy: mscc: macsec: do not copy encryption keys
Following 1b16b3fdf6 ("net: phy: mscc: macsec: clear encryption keys when freeing a flow"),
go one step further and instead of calling memzero_explicit on the key
when freeing a flow, simply not copy the key in the first place as it's
only used when a new flow is set up.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-18 09:17:42 +00:00
Robin Murphy
c2418f911a gpu: host1x: Avoid trying to use GART on Tegra20
Since commit c7e3ca515e ("iommu/tegra: gart: Do not register with
bus") quite some time ago, the GART driver has effectively disabled
itself to avoid issues with the GPU driver expecting it to work in ways
that it doesn't. As of commit 57365a04c9 ("iommu: Move bus setup to
IOMMU device registration") that bodge no longer works, but really the
GPU driver should be responsible for its own behaviour anyway. Make the
workaround explicit.

Reported-by: Jon Hunter <jonathanh@nvidia.com>
Suggested-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
2022-11-18 09:33:20 +01:00
Jakub Kicinski
11b64a466c Merge branch 'mptcp-selftests-fix-timeouts-and-test-isolation'
Mat Martineau says:

====================
mptcp: selftests: Fix timeouts and test isolation

Patches 1 and 3 adjust test timeouts to reduce false negatives on slow
machines.

Patch 2 improves test isolation by running the mptcp_sockopt test in its
own net namespace.
====================

Link: https://lore.kernel.org/r/20221115221046.20370-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:47:27 -08:00
Jakub Kicinski
a452d30f3f Merge branch 'net-ipa-change-gsi-firmware-load-specification'
Alex Elder says:

====================
net: ipa: change GSI firmware load specification

Currently, GSI firmware must be loaded for IPA before it can be
used--either by the modem, or by the AP.  New hardware supports a
third option, with the bootloader taking responsibility for loading
GSI firmware.  In that case, neither the AP nor the modem needs to
do that.

The first patch in this series deprecates the "modem-init" Device
Tree property in the IPA binding, using a new "qcom,gsi-loader"
property instead.  The second and third implement logic in the code
to support either the "old" or the "new" way of specifying how GSI
firmware is loaded.

The last two patches implement a new value for the "qcom,gsi-loader"
property.  If the value is "skip", neither the AP nor modem needs to
load the GSI firmware.  The first of these patches implements the
change in the IPA binding; the second implements it in the code.
====================

Link: https://lore.kernel.org/r/20221116073257.34010-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:46:59 -08:00
Alex Elder
7569805ec2 net: ipa: permit GSI firmware loading to be skipped
Define a new value "skip" for the "qcom,gsi-loader" Device Tree
property.  If used, it indicates that neither the AP nor the modem
need to load GSI firmware (because it has already been loaded--for
example by the boot loader).

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:46:55 -08:00
Alex Elder
a49c3ab7d7 dt-bindings: net: qcom,ipa: support skipping GSI firmware load
Add a new enumerated value to those defined for the qcom,gsi-loader
property.  If the qcom,gsi-loader is "skip", the GSI firmware will
already be loaded, so neither the AP nor modem is required to load
GSI firmware.

Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:46:55 -08:00
Alex Elder
07f2f8e1b7 net: ipa: introduce "qcom,gsi-loader" property
Introduce a new way of specifying how the GSI firmware gets loaded
for IPA.  Currently, this is indicated by the presence or absence of
the Boolean "modem-init" Device Tree property.  The new property
must have a value--either "self" or "modem"--which indicates whether
the AP or modem is the GSI firmware loader, respectively.

For legacy systems, the new property will not exist, and the
"modem-init" property will be used.  For newer systems, the
"qcom,gsi-loader" property *must* exist, and must have one of the
two prescribed values.  It is an error to have both properties
defined, and it is an error for the new property to have an
unrecognized value.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:46:54 -08:00
Alex Elder
50f803d4aa net: ipa: encapsulate decision about firmware load
The GSI layer used for IPA requires firmware to be loaded.

Currently either the AP or the modem loads the firmware,
distinguished by whether the "modem-init" Device Tree
property is defined.

Some newer systems implement a third option.  In preparation for
that, encapsulate the code that determines how the GSI firmware
gets loaded in a new function, ipa_firmware_loader().

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:46:54 -08:00
Alex Elder
4ca0c6474f dt-bindings: net: qcom,ipa: deprecate modem-init
GSI firmware for IPA must be loaded during initialization, either by
the AP or by the modem.  The loader is currently specified based on
whether the Boolean modem-init property is present.

Instead, use a new property with an enumerated value to indicate
explicitly how GSI firmware gets loaded.  With this in place, a
third approach can be added in an upcoming patch.

The new qcom,gsi-loader property has two defined values:
  - self:   The AP loads GSI firmware
  - modem:  The modem loads GSI firmware
The modem-init property must still be supported, but is now marked
deprecated.

Update the example so it represents the SC7180 SoC, and provide
examples for the qcom,gsi-loader, memory-region, and firmware-name
properties.

Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:46:53 -08:00
Matthieu Baerts
3de88b95c4 selftests: mptcp: fix mibit vs mbit mix up
The estimated time was supposing the rate was expressed in mibit
(bit * 1024^2) but it is in mbit (bit * 1000^2).

This makes the threshold higher but in a more realistic way to avoid
false positives reported by CI instances.

Before this patch, the thresholds were at 7561/4005ms and now they are
at 7906/4178ms.

While at it, also fix a typo in the linked comment, spotted by Mat.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/310
Fixes: 1a418cb8e8 ("mptcp: simult flow self-tests")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:43:58 -08:00
Matthieu Baerts
7e68d31020 selftests: mptcp: run mptcp_sockopt from a new netns
Not running it from a new netns causes issues if some MPTCP settings are
modified, e.g. if MPTCP is disabled from the sysctl knob, if multiple
addresses are available and added to the MPTCP path-manager, etc.

In these cases, the created connection will not behave as expected, e.g.
unable to create an MPTCP socket, more than one subflow is seen, etc.

A new "sandbox" net namespace is now created and used to run
mptcp_sockopt from this controlled environment.

Fixes: ce9979129a ("selftests: mptcp: add mptcp getsockopt test cases")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:43:57 -08:00
Paolo Abeni
22b29557ae selftests: mptcp: gives slow test-case more time
On slow or busy VM, some test-cases still fail because the
data transfer completes before the endpoint manipulation
actually took effect.

Address the issue by artificially increasing the runtime for
the relevant test-cases.

Fixes: ef360019db ("selftests: mptcp: signal addresses testcases")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/309
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:43:57 -08:00
Xin Long
647541ea06 sctp: move SCTP_PAD4 and SCTP_TRUNC4 to linux/sctp.h
Move these two macros from net/sctp/sctp.h to linux/sctp.h, so that
it will be enough to include only linux/sctp.h in nft_exthdr.c and
xt_sctp.c. It should not include "net/sctp/sctp.h" if a module does
not have a dependence on SCTP module.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Link: https://lore.kernel.org/r/ef6468a687f36da06f575c2131cd4612f6b7be88.1668526821.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:43:34 -08:00
Xin Long
b78c416282 sctp: change to include linux/sctp.h in net/sctp/checksum.h
Currently "net/sctp/checksum.h" including "net/sctp/sctp.h" is
included in quite some places in netfilter and openswitch and
net/sched. It's not necessary to include "net/sctp/sctp.h" if
a module does not have dependence on SCTP, "linux/sctp.h" is
the right one to include.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Link: https://lore.kernel.org/r/ca7ea96d62a26732f0491153c3979dc1c0d8d34a.1668526793.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:43:34 -08:00
Jakub Kicinski
24f627a3a8 Merge branch 'implement-devlink-rate-api-and-extend-it'
Michal Wilczynski says:

====================
Implement devlink-rate API and extend it

This patch series implements devlink-rate for ice driver. Unfortunately
current API isn't flexible enough for our use case, so there is a need to
extend it. Some functions have been introduced to enable the driver to
export current Tx scheduling configuration.

Pasting justification for this series from commit implementing devlink-rate
in ice driver(that is a part of this series):

There is a need to support modification of Tx scheduler tree, in the
ice driver. This will allow user to control Tx settings of each node in
the internal hierarchy of nodes. As a result user will be able to use
Hierarchy QoS implemented entirely in the hardware.

This patch implemenents devlink-rate API. It also exports initial
default hierarchy. It's mostly dictated by the fact that the tree
can't be removed entirely, all we can do is enable the user to modify
it. For example root node shouldn't ever be removed, also nodes that
have children are off-limits.

Example initial tree with 2 VF's:

[root@fedora ~]# devlink port function rate show
pci/0000:4b:00.0/node_27: type node parent node_26
pci/0000:4b:00.0/node_26: type node parent node_0
pci/0000:4b:00.0/node_34: type node parent node_33
pci/0000:4b:00.0/node_33: type node parent node_32
pci/0000:4b:00.0/node_32: type node parent node_16
pci/0000:4b:00.0/node_19: type node parent node_18
pci/0000:4b:00.0/node_18: type node parent node_17
pci/0000:4b:00.0/node_17: type node parent node_16
pci/0000:4b:00.0/node_21: type node parent node_20
pci/0000:4b:00.0/node_20: type node parent node_3
pci/0000:4b:00.0/node_14: type node parent node_5
pci/0000:4b:00.0/node_5: type node parent node_3
pci/0000:4b:00.0/node_13: type node parent node_4
pci/0000:4b:00.0/node_12: type node parent node_4
pci/0000:4b:00.0/node_11: type node parent node_4
pci/0000:4b:00.0/node_10: type node parent node_4
pci/0000:4b:00.0/node_9: type node parent node_4
pci/0000:4b:00.0/node_8: type node parent node_4
pci/0000:4b:00.0/node_7: type node parent node_4
pci/0000:4b:00.0/node_6: type node parent node_4
pci/0000:4b:00.0/node_4: type node parent node_3
pci/0000:4b:00.0/node_3: type node parent node_16
pci/0000:4b:00.0/node_16: type node parent node_15
pci/0000:4b:00.0/node_15: type node parent node_0
pci/0000:4b:00.0/node_2: type node parent node_1
pci/0000:4b:00.0/node_1: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_27
pci/0000:4b:00.0/2: type leaf parent node_27

Let me visualize part of the tree:

                        +---------+
                        |  node_0 |
                        +---------+
                             |
                        +----v----+
                        | node_26 |
                        +----+----+
                             |
                        +----v----+
                        | node_27 |
                        +----+----+
                             |
                    |-----------------|
               +----v----+       +----v----+
               |   VF 1  |       |   VF 2  |
               +----+----+       +----+----+

So at this point there is a couple things that can be done.
For example we could only assign parameters to VF's.

[root@fedora ~]# devlink port function rate set pci/0000:4b:00.0/1 \
                 tx_max 5Gbps

This would cap the VF 1 BW to 5Gbps.

But let's say you would like to create a completely new branch.
This can be done like this:

[root@fedora ~]# devlink port function rate add \
                 pci/0000:4b:00.0/node_custom parent node_0
[root@fedora ~]# devlink port function rate add \
                 pci/0000:4b:00.0/node_custom_1 parent node_custom
[root@fedora ~]# devlink port function rate set \
                 pci/0000:4b:00.0/1 parent node_custom_1

This creates a completely new branch and reassigns VF 1 to it.

A number of parameters is supported per each node: tx_max, tx_share,
tx_priority and tx_weight.
====================

Link: https://lore.kernel.org/r/20221115104825.172668-1-michal.wilczynski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:43:34 -08:00
Michal Wilczynski
242dd64375 Documentation: Add documentation for new devlink-rate attributes
Provide documentation for newly introduced netlink attributes for
devlink-rate: tx_priority and tx_weight.

Mention the possibility to export tree from the driver.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:43:33 -08:00
Michal Wilczynski
16eb4afc5d ice: Add documentation for devlink-rate implementation
Add documentation to a newly added devlink-rate feature. Provide some
examples on how to use the commands, which netlink attributes are
supported and descriptions of the attributes.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:43:27 -08:00
Michal Wilczynski
80fe30a8c1 ice: Prevent ADQ, DCB coexistence with Custom Tx scheduler
ADQ, DCB might interfere with Custom Tx Scheduler changes that user
might introduce using devlink-rate API.

Check if ADQ, DCB is active, when user tries to change any setting
in exported Tx scheduler tree. If any of those are active block the user
from doing so, and log an appropriate message.

Remove the exported hierarchy if user enable ADQ or DCB.
Prevent ADQ or DCB from getting configured if user already made some
changes using devlink-rate API.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:28 -08:00
Michal Wilczynski
42c2eb6b1f ice: Implement devlink-rate API
There is a need to support modification of Tx scheduler tree, in the
ice driver. This will allow user to control Tx settings of each node in
the internal hierarchy of nodes. As a result user will be able to use
Hierarchy QoS implemented entirely in the hardware.

This patch implemenents devlink-rate API. It also exports initial
default hierarchy. It's mostly dictated by the fact that the tree
can't be removed entirely, all we can do is enable the user to modify
it. For example root node shouldn't ever be removed, also nodes that
have children are off-limits.

Example initial tree with 2 VF's:

[root@fedora ~]# devlink port function rate show

pci/0000:4b:00.0/node_27: type node parent node_26
pci/0000:4b:00.0/node_26: type node parent node_0
pci/0000:4b:00.0/node_34: type node parent node_33
pci/0000:4b:00.0/node_33: type node parent node_32
pci/0000:4b:00.0/node_32: type node parent node_16
pci/0000:4b:00.0/node_19: type node parent node_18
pci/0000:4b:00.0/node_18: type node parent node_17
pci/0000:4b:00.0/node_17: type node parent node_16
pci/0000:4b:00.0/node_21: type node parent node_20
pci/0000:4b:00.0/node_20: type node parent node_3
pci/0000:4b:00.0/node_14: type node parent node_5
pci/0000:4b:00.0/node_5: type node parent node_3
pci/0000:4b:00.0/node_13: type node parent node_4
pci/0000:4b:00.0/node_12: type node parent node_4
pci/0000:4b:00.0/node_11: type node parent node_4
pci/0000:4b:00.0/node_10: type node parent node_4
pci/0000:4b:00.0/node_9: type node parent node_4
pci/0000:4b:00.0/node_8: type node parent node_4
pci/0000:4b:00.0/node_7: type node parent node_4
pci/0000:4b:00.0/node_6: type node parent node_4
pci/0000:4b:00.0/node_4: type node parent node_3
pci/0000:4b:00.0/node_3: type node parent node_16
pci/0000:4b:00.0/node_16: type node parent node_15
pci/0000:4b:00.0/node_15: type node parent node_0
pci/0000:4b:00.0/node_2: type node parent node_1
pci/0000:4b:00.0/node_1: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_27
pci/0000:4b:00.0/2: type leaf parent node_27

Let me visualize part of the tree:

                    +---------+
                    |  node_0 |
                    +---------+
                         |
                    +----v----+
                    | node_26 |
                    +----+----+
                         |
                    +----v----+
                    | node_27 |
                    +----+----+
                         |
                |-----------------|
           +----v----+       +----v----+
           |   VF 1  |       |   VF 2  |
           +----+----+       +----+----+

So at this point there is a couple things that can be done.
For example we could only assign parameters to VF's.

[root@fedora ~]# devlink port function rate set pci/0000:4b:00.0/1 \
                 tx_max 5Gbps

This would cap the VF 1 BW to 5Gbps.

But let's say you would like to create a completely new branch.
This can be done like this:

[root@fedora ~]# devlink port function rate add \
                 pci/0000:4b:00.0/node_custom parent node_0
[root@fedora ~]# devlink port function rate add \
                 pci/0000:4b:00.0/node_custom_1 parent node_custom
[root@fedora ~]# devlink port function rate set \
                 pci/0000:4b:00.0/1 parent node_custom_1

This creates a completely new branch and reassigns VF 1 to it.

A number of parameters is supported per each node: tx_max, tx_share,
tx_priority and tx_weight.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:28 -08:00
Michal Wilczynski
bdf96d965a ice: Add an option to pre-allocate memory for ice_sched_node
devlink-rate API requires a priv object to be allocated when node still
doesn't have a parent. This is problematic, because ice_sched_node can't
be currently created without a parent.

Add an option to pre-allocate memory for ice_sched_node struct. Add
new arguments to ice_sched_add() and ice_sched_add_elems() that allow
for pre-allocation of memory for ice_sched_node struct.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:28 -08:00
Michal Wilczynski
16dfa49406 ice: Introduce new parameters in ice_sched_node
To support new devlink-rate API ice_sched_node struct needs to store
a number of additional parameters. This includes tx_max, tx_share,
tx_weight, and tx_priority.

Add new fields to ice_sched_node struct. Add new functions to configure
the hardware with new parameters. Introduce new xarray to identify
nodes uniquely.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:27 -08:00
Michal Wilczynski
f2fc15e271 devlink: Allow to set up parent in devl_rate_leaf_create()
Currently the driver is able to create leaf nodes for the devlink-rate,
but is unable to set parent for them. This wasn't as issue before the
possibility to export hierarchy from the driver. After adding the export
feature, in order for the driver to supply correct hierarchy, it's
necessary for it to be able to supply a parent name to
devl_rate_leaf_create().

Introduce a new parameter 'parent_name' in devl_rate_leaf_create().

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:27 -08:00
Michal Wilczynski
04d674f04e devlink: Allow for devlink-rate nodes parent reassignment
Currently it's not possible to reassign the parent of the node using one
command. As the previous commit introduced a way to export entire
hierarchy from the driver, being able to modify and reassign parents
become important. This way user might easily change QoS settings without
interrupting traffic.

Example command:
devlink port function rate set pci/0000:4b:00.0/1 parent node_custom_1

This reassigns leaf node parent to node_custom_1.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:26 -08:00
Michal Wilczynski
caba177d7f devlink: Enable creation of the devlink-rate nodes from the driver
Intel 100G card internal firmware hierarchy for Hierarchicial QoS is very
rigid and can't be easily removed. This requires an ability to export
default hierarchy to allow user to modify it. Currently the driver is
only able to create the 'leaf' nodes, which usually represent the vport.
This is not enough for HQoS implemented in Intel hardware.

Introduce new function devl_rate_node_create() that allows for creation
of the devlink-rate nodes from the driver.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:26 -08:00
Michal Wilczynski
6e2d7e84fc devlink: Introduce new attribute 'tx_weight' to devlink-rate
To fully utilize offload capabilities of Intel 100G card QoS capabilities
new attribute 'tx_weight' needs to be introduced. This attribute allows
for usage of Weighted Fair Queuing arbitration scheme among siblings.
This arbitration scheme can be used simultaneously with the strict
priority.

Introduce new attribute in devlink-rate that will allow for configuration
of Weighted Fair Queueing. New attribute is optional.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:26 -08:00
Michal Wilczynski
cd50223683 devlink: Introduce new attribute 'tx_priority' to devlink-rate
To fully utilize offload capabilities of Intel 100G card QoS capabilities
new attribute 'tx_priority' needs to be introduced. This attribute allows
for usage of strict priority arbiter among siblings. This arbitration
scheme attempts to schedule nodes based on their priority as long as the
nodes remain within their bandwidth limit.

Introduce new attribute in devlink-rate that will allow for configuration
of strict priority. New attribute is optional.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:41:25 -08:00
Jakub Kicinski
4ab45e973c Merge branch 'autoload-dsa-tagging-driver-when-dynamically-changing-protocol'
Vladimir Oltean says:

====================
Autoload DSA tagging driver when dynamically changing protocol

This patch set solves the issue reported by Michael and Heiko here:
https://lore.kernel.org/lkml/20221027113248.420216-1-michael@walle.cc/
making full use of Michael's suggestion of having two modaliases: one
gets used for loading the tagging protocol when it's the default one
reported by the switch driver, the other gets loaded at user's request,
by name.

  # modinfo tag_ocelot
  filename:       /lib/modules/6.1.0-rc4+/kernel/net/dsa/tag_ocelot.ko
  license:        GPL v2
  alias:          dsa_tag:seville
  alias:          dsa_tag:id-21
  alias:          dsa_tag:ocelot
  alias:          dsa_tag:id-15
  depends:        dsa_core
  intree:         Y
  name:           tag_ocelot
  vermagic:       6.1.0-rc4+ SMP preempt mod_unload modversions aarch64

Tested on NXP LS1028A-RDB with the following device tree addition:

&mscc_felix_port4 {
	dsa-tag-protocol = "ocelot-8021q";
};

&mscc_felix_port5 {
	dsa-tag-protocol = "ocelot-8021q";
};

CONFIG_NET_DSA and everything that depends on it is built as module.
Everything auto-loads, and "cat /sys/class/net/eno2/dsa/tagging" shows
"ocelot-8021q". Traffic works as well. Furthermore, "echo ocelot-8021q"
into the aforementioned sysfs file now auto-loads the driver for it.
====================

Link: https://lore.kernel.org/r/20221115011847.2843127-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:16:47 -08:00
Vladimir Oltean
0184c07a11 net: dsa: autoload tag driver module on tagging protocol change
Issue a request_module() call when an attempt to change the tagging
protocol is made, either by sysfs or by device tree. In the case of
ocelot (the only driver for which the default and the alternative
tagging protocol are compiled as different modules), the user is now no
longer required to insert tag_ocelot_8021q.ko manually.

In the particular case of ocelot, this solves a problem where
tag_ocelot_8021q.ko is built as module, and this is present in the
device tree:

&mscc_felix_port4 {
	dsa-tag-protocol = "ocelot-8021q";
};

&mscc_felix_port5 {
	dsa-tag-protocol = "ocelot-8021q";
};

Because no one attempts to load the module into the kernel at boot time,
the switch driver will fail to probe (actually forever defer) until
someone manually inserts tag_ocelot_8021q.ko. This is now no longer
necessary and happens automatically.

Rename dsa_find_tagger_by_name() to denote the change in functionality:
there is now feature parity with dsa_tag_driver_get_by_id(), i.o.w. we
also load the module if it's missing.

Link: https://lore.kernel.org/lkml/20221027113248.420216-1-michael@walle.cc/
Suggested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc> # on kontron-sl28 w/ ocelot_8021q
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:16:42 -08:00
Vladimir Oltean
54c087e839 net: dsa: rename dsa_tag_driver_get() to dsa_tag_driver_get_by_id()
A future patch will introduce one more way of getting a reference on a
tagging protocl driver (by name). Rename the current method to "by_id".

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:16:42 -08:00
Vladimir Oltean
e8666130b9 net: dsa: strip sysfs "tagging" string of trailing newline
Currently, dsa_find_tagger_by_name() uses sysfs_streq() which works both
with strings that contain \n at the end (echo ocelot > .../dsa/tagging)
and with strings that don't (printf ocelot > .../dsa/tagging).

There will be a problem once we'll want to construct the modalias string
based on which we auto-load the protocol kernel module. If the sysfs
buffer ends in a newline, we need to strip it first. This is a
preparatory patch specifically for that.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:16:42 -08:00
Vladimir Oltean
94793a56b3 net: dsa: provide a second modalias to tag proto drivers based on their name
Currently, tagging protocol drivers have a modalias of
"dsa_tag:id-<number>", where the number is one of DSA_TAG_PROTO_*_VALUE.

This modalias makes it possible for the request_module() call in
dsa_tag_driver_get() to work, given the input it has - an integer
returned by ds->ops->get_tag_protocol().

It is also possible to change tagging protocols at (pseudo-)runtime, via
sysfs or via device tree, and this works via the name string of the
tagging protocol rather than via its id (DSA_TAG_PROTO_*_VALUE).

In the latter case, there is no request_module() call, because there is
no association that the DSA core has between the string name and the ID,
to construct the modalias. The module is simply assumed to have been
inserted. This is actually slightly problematic when the tagging
protocol change should take place at probe time, since it's expected
that the dependency module should get autoloaded.

For this purpose, let's introduce a second modalias, so that the DSA
core can call request_module() by name. There is no reason to make the
modalias by name optional, so just modify the MODULE_ALIAS_DSA_TAG_DRIVER()
macro to take both the ID and the name as arguments, and generate two
modaliases behind the scenes.

Suggested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc> # on kontron-sl28 w/ ocelot_8021q
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:16:41 -08:00
Vladimir Oltean
2610937d7e net: dsa: rename tagging protocol driver modalias
It's autumn cleanup time, and today's target are modaliases.

Michael says that for users of modinfo, "dsa_tag-20" is not the most
suggestive name, and recommends a change to "dsa_tag-id-20".

Andrew points out that other modaliases have a prefix delimited by
colons, so he recommends "dsa_tag:20" instead of "dsa_tag-20".

To satisfy both proposals, Florian recommends "dsa_tag:id-20".

The modaliases are not stable ABI, and the essential information
(protocol ID) is still conveyed in the new string, which
request_module() must be adapted to form.

Link: 20221027210830.3577793-1-vladimir.oltean@nxp.com
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Michael Walle <michael@walle.cc>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:16:41 -08:00
Vladimir Oltean
9999f85ba3 net: dsa: stop exposing tag proto module helpers to the world
The DSA tagging protocol driver macros are in the public include/net/dsa.h
probably because that's also where the DSA_TAG_PROTO_*_VALUE macros are
(MODULE_ALIAS_DSA_TAG_DRIVER hinges on those macro definitions).

But there is no reason to expose these helpers to <net/dsa.h>. That
header is shared between switch drivers (drivers/net/dsa/), tagging
protocol drivers (net/dsa/tag_*.c), the DSA core (net/dsa/ sans tag_*.c),
and the rest of the world (DSA master drivers, network stack, etc).
Too much exposure.

On the other hand, net/dsa/dsa_priv.h is included only by the DSA core
and by DSA tagging protocol drivers (or IOW, "friend" modules). Also a
bit too much exposure - I've contemplated creating a new header which is
only included by tagging protocol drivers, but completely separating a
new dsa_tag_proto.h from dsa_priv.h is not immediately trivial - for
example dsa_slave_to_port() is used both from the fast path and from the
control path.

So for now, move these definitions to dsa_priv.h which at least hides
them from the world.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Michael Walle <michael@walle.cc>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:16:40 -08:00
Robert Marko
4a8c14384f dt-bindings: net: ipq4019-mdio: document required clock-names
IPQ5018, IPQ6018 and IPQ8074 require clock-names to be set as driver is
requesting the clock based on it and not index, so document that and make
it required for the listed SoC-s.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20221114194734.3287854-4-robimarko@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:14:38 -08:00
Robert Marko
e50c50367d dt-bindings: net: ipq4019-mdio: require and validate clocks
Now that we can match the platforms requiring clocks by compatible start
using those to allow clocks per compatible and make them required.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20221114194734.3287854-3-robimarko@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:14:37 -08:00
Robert Marko
05c1cbb96f dt-bindings: net: ipq4019-mdio: add IPQ8074 compatible
Allow using IPQ8074 specific compatible along with the fallback IPQ4019
one in order to be able to specify which compatibles require clocks to
be able to validate them via schema.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20221114194734.3287854-2-robimarko@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:14:37 -08:00
Robert Marko
cbe5f7c0fb dt-bindings: net: ipq4019-mdio: document IPQ6018 compatible
Document IPQ6018 compatible that is already being used in the DTS along
with the fallback IPQ4019 compatible as driver itself only gets probed
on IPQ4019 and IPQ5018 compatibles.

This is also required in order to specify which platform require clock to
be defined and validate it in schema.

Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20221114194734.3287854-1-robimarko@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 21:14:36 -08:00
Jakub Kicinski
2ffff1449d Merge branch 'net-dsa-use-more-appropriate-net_name_-constants-for-user-ports'
Rasmus Villemoes says:

====================
net: dsa: use more appropriate NET_NAME_* constants for user ports

The intention of commit 685343fc3b ("net: add name_assign_type
netdev attribute") was clearly that drivers be switched over one by
one to select appropriate NET_NAME_* constants instead of
NET_NAME_UNKNOWN. This small series attempts to do that for DSA user
ports.

This is obviously and intentionally user-visible changes, so there's a
small chance that it could lead to a regression. To make it easy to
revert either of the "label in DT" and "fallback to eth%d" changes,
this is done as a refactoring which shouldn't introduce any functional
change (but by itself adds code which looks a little odd, with the two
identical assignments in the two branches), followed by changing the
constant used in each case in two different patches.
====================

Link: https://lore.kernel.org/r/20221116105205.1127843-1-linux@rasmusvillemoes.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 19:40:15 -08:00
Rasmus Villemoes
b8790661d9 net: dsa: set name_assign_type to NET_NAME_ENUM for enumerated user ports
When a user port does not have a label in device tree, and we thus
fall back to the eth%d scheme, the proper constant to use is
NET_NAME_ENUM. See also commit e9f656b7a2 ("net: ethernet: set
default assignment identifier to NET_NAME_ENUM"), which in turn quoted
commit 685343fc3b ("net: add name_assign_type netdev attribute"):

    ... when the kernel has given the interface a name using global
    device enumeration based on order of discovery (ethX, wlanY, etc)
    ... are labelled NET_NAME_ENUM.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.faineli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 19:40:08 -08:00
Rasmus Villemoes
6fdb038420 net: dsa: use NET_NAME_PREDICTABLE for user ports with name given in DT
When a user port has a label in device tree, the corresponding
netdevice is, to quote include/uapi/linux/netdevice.h, "predictably
named by the kernel". This is also explicitly one of the intended use
cases for NET_NAME_PREDICTABLE, quoting 685343fc3b ("net: add
name_assign_type netdev attribute"):

  NET_NAME_PREDICTABLE:
    The ifname has been assigned by the kernel in a predictable way
    [...] Examples include [...] and names deduced from hardware
    properties (including being given explicitly by the firmware).

Expose that information properly for the benefit of userspace tools
that make decisions based on the name_assign_type attribute,
e.g. a systemd-udev rule with "kernel" in NamePolicy.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.faineli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 19:40:08 -08:00
Rasmus Villemoes
0171a1d22b net: dsa: refactor name assignment for user ports
The following two patches each have a (small) chance of causing
regressions for userspace and will in that case of course need to be
reverted.

In order to prepare for that and make those two patches independent
and individually revertable, refactor the code which sets the names
for user ports by moving the "fall back to eth%d if no label is given
in device tree" to dsa_slave_create().

No functional change (at least none intended).

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.faineli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 19:40:07 -08:00
Vincent Mailhol
f20a0a0519 ethtool: doc: clarify what drivers can implement in their get_drvinfo()
Many of the drivers which implement ethtool_ops::get_drvinfo() will
prints the .driver, .version or .bus_info of struct ethtool_drvinfo.
To have a glance of current state, do:

  $ git grep -W "get_drvinfo(struct"

Printing in those three fields is useless because:

  - since [1], the driver version should be the kernel version (at
    least for upstream drivers). Arguably, out of tree drivers might
    still want to set a custom version, but out of tree is not our
    focus.

  - since [2], the core is able to provide default values for .driver
    and .bus_info.

In summary, drivers may provide .fw_version and .erom_version, the
rest is expected to be done by the core.

In struct ethtool_ops doc from linux/ethtool: rephrase field
get_drvinfo() doc to discourage developers from implementing this
callback.

In struct ethtool_drvinfo doc from uapi/linux/ethtool.h: remove the
paragraph mentioning what drivers should do. Rationale: no need to
repeat what is already written in struct ethtool_ops doc. But add a
note that .fw_version and .erom_version are driver defined.

Also update the dummy driver and simply remove the callback in order
not to confuse the newcomers: most of the drivers will not need this
callback function any more.

[1] commit 6a7e25c7fb ("net/core: Replace driver version to be
    kernel version")
Link: https://git.kernel.org/torvalds/linux/c/6a7e25c7fb48

[2] commit edaf5df22c ("ethtool: ethtool_get_drvinfo: populate
    drvinfo fields even if callback exits")
Link: https://git.kernel.org/netdev/net-next/c/edaf5df22cb8

Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://lore.kernel.org/r/20221116171828.4093-1-mailhol.vincent@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 19:26:02 -08:00
Alexei Starovoitov
db6bf99954 Merge branch 'Allocated objects, BPF linked lists'
Kumar Kartikeya Dwivedi says:

====================

This series introduces user defined BPF objects of a type in program
BTF. This allows BPF programs to allocate their own objects, build their
own object hierarchies, and use the basic building blocks provided by
BPF runtime to build their own data structures flexibly.

Then, we introduce the support for single ownership BPF linked lists,
which can be put inside BPF maps, or allocated objects, and hold such
allocated objects as elements. It works as an instrusive collection,
which is done to allow making allocated objects part of multiple data
structures at the same time in the future.

The eventual goal of this and future patches is to allow one to do some
limited form of kernel style programming in BPF C, and allow programmers
to build their own complex data structures flexibly out of basic
building blocks.

The key difference will be that such programs are verified to be safe,
preserve runtime integrity of the system, and are proven to be bug free
as far as the invariants of BPF specific APIs are concerned.

One immediate use case that will be using the entire infrastructure this
series is introducing will be managing percpu NMI safe linked lists
inside BPF programs.

The other use case this will serve in the near future will be linking
kernel structures like XDP frame and sk_buff directly into user data
structures (rbtree, pifomap, etc.) for packet queueing. This will follow
single ownership concept included in this series.

The user has complete control of the internal locking, and hence also
the batching of operations for each critical section.

The features are:
- Allocated objects.
- bpf_obj_new, bpf_obj_drop to allocate and free them.
- Single ownership BPF linked lists.
  - Support for them in BPF maps.
  - Support for them in allocated objects.
- Global spin locks.
- Spin locks inside allocated objects.

Some other notable things:
- Completely static verification of locking.
- Kfunc argument handling has been completely reworked.
- Argument rewriting support for kfuncs.
- A new bpf_experimental.h header as a dumping ground for these APIs.

Any functionality exposed in this series is NOT part of UAPI. It is only
available through use of kfuncs, and structs that can be added to map
value may also change their size or name in the future. Hence, every
feature in this series must be considered experimental.

Follow-ups:
-----------
 * Support for kptrs (local and kernel) in local storage and percpu maps + kptr tests
 * Fixes for helper access checks rebasing on top of this series

Next steps:
-----------
 * NMI safe percpu single ownership linked lists (using local_t protection).
 * Lockless linked lists.
 * Allow RCU protected BPF allocated objects. This then allows RCU
   protected list lookups, since spinlock protection for readers does
   not scale.
 * Introduce bpf_refcount for local kptrs, shared ownership.
 * Introduce shared ownership linked lists.
 * Documentation.

Changelog:
----------
 v9 -> v10
 v9: https://lore.kernel.org/bpf/20221117225510.1676785-1-memxor@gmail.com

  * Deduplicate code to find btf_record of reg (Alexei)
  * Add linked_list test to DENYLIST.aarch64 (Alexei)
  * Disable some linked list tests for now so that they compile with
    clang nightly (Alexei)

 v8 -> v9
 v8: https://lore.kernel.org/bpf/20221117162430.1213770-1-memxor@gmail.com

  * Fix up commit log of patch 2, Simplify patch 3
  * Explain the implicit requirement of bpf_list_head requiring map BTF
    to match in btf_record_equal in a separate patch.

 v7 -> v8
 v7: https://lore.kernel.org/bpf/20221114191547.1694267-1-memxor@gmail.com

  * Fix early return in map_check_btf (Dan Carpenter)
  * Fix two memory leak bugs in local storage maps, outer maps
  * Address comments from Alexei and Dave
   * More local kptr -> allocated object renaming
   * Use krealloc with NULL instead kmalloc + krealloc
   * Drop WARN_ON_ONCE for field_offs parsing
   * Combine kfunc add + remove patches into one
   * Drop STRONG suffix from KF_ARG_PTR_TO_KPTR
   * Rename is_kfunc_arg_ret_buf_size to is_kfunc_arg_scalar_with_name
   * Remove redundant check for reg->type and arg type in it
   * Drop void * ret type check
   * Remove code duplication in checks for NULL pointer with offset != 0
   * Fix two bpf_list_node typos
   * Improve log message for bpf_list_head operations
   * Improve comments for active_lock struct
   * Improve comments for Implementation details of process_spin_lock
  * Add Dave's acks

 v6 -> v7
 v6: https://lore.kernel.org/bpf/20221111193224.876706-1-memxor@gmail.com

  * Fix uninitialized variable warning (Dan Carpenter, Kernel Test Robot)
  * One more local_kptr renaming

 v5 -> v6
 v5: https://lore.kernel.org/bpf/20221107230950.7117-1-memxor@gmail.com

  * Replace (i && !off) check with next_off, include test (Andrii)
  * Drop local kptrs naming (Andrii, Alexei)
  * Drop reg->precise == EXACT patch (Andrii)
  * Add comment about ptr member of struct active_lock (Andrii)
  * Use btf__new_empty + btf__add_xxx APIs (Andrii)
  * Address other misc nits from Andrii

 v4 -> v5
 v4: https://lore.kernel.org/bpf/20221103191013.1236066-1-memxor@gmail.com

  * Add a lot more selftests (failure, success, runtime, BTF)
  * Make sure series is bisect friendly
  * Move list draining out of spin lock
    * This exposed an issue where bpf_mem_free can now be called in
      map_free path without migrate_disable, also fixed that.
  * Rename MEM_ALLOC -> MEM_RINGBUF, MEM_TYPE_LOCAL -> MEM_ALLOC (Alexei)
  * Group lock identity into a struct active_lock { ptr, id } (Dave)
  * Split set_release_on_unlock logic into separate patch (Alexei)

 v3 -> v4
 v3: https://lore.kernel.org/bpf/20221102202658.963008-1-memxor@gmail.com

  * Fix compiler error for !CONFIG_BPF_SYSCALL (Kernel Test Robot)
  * Fix error due to BUILD_BUG_ON on 32-bit platforms (Kernel Test Robot)

 v2 -> v3
 v2: https://lore.kernel.org/bpf/20221013062303.896469-1-memxor@gmail.com

  * Add ack from Dave for patch 5
  * Rename btf_type_fields -> btf_record, btf_type_fields_off ->
    btf_field_offs, rename functions similarly (Alexei)
  * Remove 'kind' component from contains declaration tag (Alexei)
  * Move bpf_list_head, bpf_list_node definitions to UAPI bpf.h (Alexei)
  * Add note in commit log about modifying btf_struct_access API (Dave)
  * Downgrade WARN_ON_ONCE to verbose(env, "...") and return -EFAULT (Dave)
  * Add type_is_local_kptr wrapper to avoid noisy checks (Dave)
  * Remove unused flags parameter from bpf_kptr_new (Alexei)
  * Rename bpf_kptr_new -> bpf_obj_new, bpf_kptr_drop -> bpf_obj_drop (Alexei)
  * Reword comment in ref_obj_id_set_release_on_unlock (Dave)
  * Fix return type of ref_obj_id_set_release_on_unlock (Dave)
  * Introduce is_bpf_list_api_kfunc to dedup checks (Dave)
  * Disallow BPF_WRITE to untrusted local kptrs
  * Add details about soundness of check_reg_allocation_locked logic
  * List untrusted local kptrs for PROBE_MEM handling

 v1 -> v2
 v1: https://lore.kernel.org/bpf/20221011012240.3149-1-memxor@gmail.com

  * Rebase on bpf-next to resolve merge conflict in DENYLIST.s390x
  * Fix a couple of mental lapses in bpf_list_head_free

 RFC v1 -> v1
 RFC v1: https://lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com

  * Mostly a complete rewrite of BTF parsing, refactor existing code (Kartikeya)
  * Rebase kfunc rewrite for bpf-next, add support for more changes
  * Cache type metadata in BTF to avoid recomputation inside verifier (Kartikeya)
  * Remove __kernel tag, make things similar to map values, reserve bpf_ prefix
  * bpf_kptr_new, bpf_kptr_drop
  * Rename precision state enum values (Alexei)
  * Drop explicit constructor/destructor support (Alexei)
  * Rewrite code for constructing/destructing objects and offload to runtime
  * Minimize duplication in bpf_map_value_off_desc handling (Alexei)
  * Expose global memory allocator (Alexei)
  * Address other nits from Alexei
  * Split out local kptrs in maps, more kptrs in maps support into a follow up

Links:
------
 * Dave's BPF RB-Tree RFC series
   v1 (Discussion thread)
     https://lore.kernel.org/bpf/20220722183438.3319790-1-davemarchevsky@fb.com
   v2 (With support for static locks)
     https://lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
 * BPF Linked Lists Discussion
   https://lore.kernel.org/bpf/CAP01T74U30+yeBHEgmgzTJ-XYxZ0zj71kqCDJtTH9YQNfTK+Xw@mail.gmail.com
 * BPF Memory Allocator from Alexei
   https://lore.kernel.org/bpf/20220902211058.60789-1-alexei.starovoitov@gmail.com
 * BPF Memory Allocator UAPI Discussion
   https://lore.kernel.org/bpf/d3f76b27f4e55ec9e400ae8dcaecbb702a4932e8.camel@fb.com
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:15 -08:00
Kumar Kartikeya Dwivedi
0a2f85a1be selftests/bpf: Temporarily disable linked list tests
The latest clang nightly as of writing crashes with the given test case
for BPF linked lists wherever global glock, ghead, glock2 are used,
hence comment out the parts that cause the crash, and prepare this commit
so that it can be reverted when the fix has been made. More context in [0].

 [0]: https://lore.kernel.org/bpf/d56223f9-483e-fbc1-4564-44c0858a1e3e@meta.com

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-25-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:15 -08:00