The phylink conflict was between a bug fix by Russell King
to make sure we have a consistent PHY interface mode, and
a change in net-next to pull some code in phylink_resolve()
into the helper functions phylink_mac_link_{up,down}()
On the dp83867 side it's mostly overlapping changes, with
the 'net' side removing a condition that was supposed to
trigger for RGMII but because of how it was coded never
actually could trigger.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a few problems with CONFIG_IPV6=y and
CONFIG_NF_CONNTRACK_BRIDGE=m:
In file included from net/netfilter/utils.c:5:
include/linux/netfilter_ipv6.h: In function 'nf_ipv6_br_defrag':
include/linux/netfilter_ipv6.h:110:9: error: implicit declaration of function 'nf_ct_frag6_gather'; did you mean 'nf_ct_attach'? [-Werror=implicit-function-declaration]
And these too:
net/ipv6/netfilter.c:242:2: error: unknown field 'br_defrag' specified in initializer
net/ipv6/netfilter.c:243:2: error: unknown field 'br_fragment' specified in initializer
This patch includes an original chunk from wenxu.
Fixes: 764dd163ac ("netfilter: nf_conntrack_bridge: add support for IPv6")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Yuehaibing <yuehaibing@huawei.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix OOPS during nf_tables rule dump, from Florian Westphal.
2) Use after free in ip_vs_in, from Yue Haibing.
3) Fix various kTLS bugs (NULL deref during device removal resync,
netdev notification ignoring, etc.) From Jakub Kicinski.
4) Fix ipv6 redirects with VRF, from David Ahern.
5) Memory leak fix in igmpv3_del_delrec(), from Eric Dumazet.
6) Missing memory allocation failure check in ip6_ra_control(), from
Gen Zhang. And likewise fix ip_ra_control().
7) TX clean budget logic error in aquantia, from Igor Russkikh.
8) SKB leak in llc_build_and_send_ui_pkt(), from Eric Dumazet.
9) Double frees in mlx5, from Parav Pandit.
10) Fix lost MAC address in r8169 during PCI D3, from Heiner Kallweit.
11) Fix botched register access in mvpp2, from Antoine Tenart.
12) Use after free in napi_gro_frags(), from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (89 commits)
net: correct zerocopy refcnt with udp MSG_MORE
ethtool: Check for vlan etype or vlan tci when parsing flow_rule
net: don't clear sock->sk early to avoid trouble in strparser
net-gro: fix use-after-free read in napi_gro_frags()
net: dsa: tag_8021q: Create a stable binary format
net: dsa: tag_8021q: Change order of rx_vid setup
net: mvpp2: fix bad MVPP2_TXQ_SCHED_TOKEN_CNTR_REG queue value
ipv4: tcp_input: fix stack out of bounds when parsing TCP options.
mlxsw: spectrum: Prevent force of 56G
mlxsw: spectrum_acl: Avoid warning after identical rules insertion
net: dsa: mv88e6xxx: fix handling of upper half of STATS_TYPE_PORT
r8169: fix MAC address being lost in PCI D3
net: core: support XDP generic on stacked devices.
netvsc: unshare skb in VF rx handler
udp: Avoid post-GRO UDP checksum recalculation
net: phy: dp83867: Set up RGMII TX delay
net: phy: dp83867: do not call config_init twice
net: phy: dp83867: increase SGMII autoneg timer duration
net: phy: dp83867: fix speed 10 in sgmii mode
net: phy: marvell10g: report if the PHY fails to boot firmware
...
We face the issue that link change interrupt and link status may be
reported by different PHY layers. As a result the link change
interrupt may occur before the link status changes.
Export phy_queue_state_machine to allow PHY drivers to specify a
delay between link status change interrupt and link status check.
v2:
- change jiffies parameter type to unsigned long
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Suggested-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The phylib interrupt handler handles link change events only currently.
However PHY drivers may want to use other interrupt sources too,
e.g. to report temperature monitoring events. Therefore add a callback
to struct phy_driver allowing PHY drivers to implement a custom
interrupt handler.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Suggested-by: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is a step towards allowing PHY drivers to handle more
interrupt sources than just link change. E.g. several PHY's have
built-in temperature monitoring and can raise an interrupt if a
temperature threshold is exceeded. We may be interested in such
interrupts also if the phylib state machine isn't started.
Therefore move enabling interrupts to phy_request_interrupt().
v2:
- patch added to series
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same skb_checksum_ops struct is defined twice in two different places,
leading to code duplication. Declare it as a global variable into a common
header instead of allocating it on the stack on each function call.
bloat-o-meter reports a slight code shrink.
add/remove: 1/1 grow/shrink: 0/10 up/down: 128/-1282 (-1154)
Function old new delta
sctp_csum_ops - 128 +128
crc32c_csum_ops 16 - -16
sctp_rcv 6616 6583 -33
sctp_packet_pack 4542 4504 -38
nf_conntrack_sctp_packet 4980 4926 -54
execute_masked_set_action 6453 6389 -64
tcf_csum_sctp 575 428 -147
sctp_gso_segment 1292 1126 -166
sctp_csum_check 579 412 -167
sctp_snat_handler 957 772 -185
sctp_dnat_handler 1321 1132 -189
l4proto_manip_pkt 2536 2313 -223
Total: Before=359297613, After=359296459, chg -0.00%
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_defrag() and br_fragment() indirections are added in case that IPv6
support comes as a module, to avoid pulling innecessary dependencies in.
The new fraglist iterator and fragment transformer APIs are used to
implement the refragmentation code.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds basic connection tracking support for the bridge,
including initial IPv4 support.
This patch register two hooks to deal with the bridge forwarding path,
one from the bridge prerouting hook to call nf_conntrack_in(); and
another from the bridge postrouting hook to confirm the entry.
The conntrack bridge prerouting hook defragments packets before passing
them to nf_conntrack_in() to look up for an existing entry, otherwise a
new entry is allocated and it is attached to the skbuff. The conntrack
bridge postrouting hook confirms new conntrack entries, ie. if this is
the first packet seen, then it adds the entry to the hashtable and (if
needed) it refragments the skbuff into the original fragments, leaving
the geometry as is if possible. Exceptions are linearized skbuffs, eg.
skbuffs that are passed up to nfqueue and conntrack helpers, as well as
cloned skbuff for the local delivery (eg. tcpdump), also in case of
bridge port flooding (cloned skbuff too).
The packet defragmentation is done through the ip_defrag() call. This
forces us to save the bridge control buffer, reset the IP control buffer
area and then restore it after call. This function also bumps the IP
fragmentation statistics, it would be probably desiderable to have
independent statistics for the bridge defragmentation/refragmentation.
The maximum fragment length is stored in the control buffer and it is
used to refragment the skbuff from the postrouting path.
The new fraglist splitter and fragment transformer APIs are used to
implement the bridge refragmentation code. The br_ip_fragment() function
drops the packet in case the maximum fragment size seen is larger than
the output port MTU.
This patchset follows the principle that conntrack should not drop
packets, so users can do it through policy via invalid state matching.
Like br_netfilter, there is no refragmentation for packets that are
passed up for local delivery, ie. prerouting -> input path. There are
calls to nf_reset() already in several spots in the stack since time ago
already, eg. af_packet, that show that skbuff fraglist handling from the
netif_rx path is supported already.
The helpers are called from the postrouting hook, before confirmation,
from there we may see packet floods to bridge ports. Then, although
unlikely, this may result in exercising the helpers many times for each
clone. It would be good to explore how to pass all the packets in a list
to the conntrack hook to do this handle only once for this case.
Thanks to Florian Westphal for handing me over an initial patchset
version to add support for conntrack bridge.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds infrastructure to register and to unregister bridge
support for the conntrack module via nf_ct_bridge_register() and
nf_ct_bridge_unregister().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exposes a new API to refragment a skbuff. This allows you to
split either a linear skbuff or to force the refragmentation of an
existing fraglist using a different mtu. The API consists of:
* ip6_frag_init(), that initializes the internal state of the transformer.
* ip6_frag_next(), that allows you to fetch the next fragment. This function
internally allocates the skbuff that represents the fragment, it pushes
the IPv6 header, and it also copies the payload for each fragment.
The ip6_frag_state object stores the internal state of the splitter.
This code has been extracted from ip6_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch exposes a new API to refragment a skbuff. This allows you to
split either a linear skbuff or to force the refragmentation of an
existing fraglist using a different mtu. The API consists of:
* ip_frag_init(), that initializes the internal state of the transformer.
* ip_frag_next(), that allows you to fetch the next fragment. This function
internally allocates the skbuff that represents the fragment, it pushes
the IPv4 header, and it also copies the payload for each fragment.
The ip_frag_state object stores the internal state of the splitter.
This code has been extracted from ip_do_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the skbuff fraglist split iterator. This API provides an
iterator to transform the fraglist into single skbuff objects, it
consists of:
* ip6_fraglist_init(), that initializes the internal state of the
fraglist iterator.
* ip6_fraglist_prepare(), that restores the IPv6 header on the fragment.
* ip6_fraglist_next(), that retrieves the fragment from the fraglist and
updates the internal state of the iterator to point to the next
fragment in the fraglist.
The ip6_fraglist_iter object stores the internal state of the iterator.
This code has been extracted from ip6_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the skbuff fraglist splitter. This API provides an
iterator to transform the fraglist into single skbuff objects, it
consists of:
* ip_fraglist_init(), that initializes the internal state of the
fraglist splitter.
* ip_fraglist_prepare(), that restores the IPv4 header on the
fragments.
* ip_fraglist_next(), that retrieves the fragment from the fraglist and
it updates the internal state of the splitter to point to the next
fragment skbuff in the fraglist.
The ip_fraglist_iter object stores the internal state of the iterator.
This code has been extracted from ip_do_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We would like to be able to rotate TFO keys while minimizing the number of
client cookies that are rejected. Currently, we have only one key which can
be used to generate and validate cookies, thus if we simply replace this
key clients can easily have cookies rejected upon rotation.
We propose having the ability to have both a primary key and a backup key.
The primary key is used to generate as well as to validate cookies.
The backup is only used to validate cookies. Thus, keys can be rotated as:
1) generate new key
2) add new key as the backup key
3) swap the primary and backup key, thus setting the new key as the primary
We don't simply set the new key as the primary key and move the old key to
the backup slot because the ip may be behind a load balancer and we further
allow for the fact that all machines behind the load balancer will not be
updated simultaneously.
We make use of this infrastructure in subsequent patches.
Suggested-by: Igor Lubashev <ilubashe@akamai.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when resegmenting an unexpected UDP GRO packet, the full UDP
checksum will be calculated for every new SKB created by skb_segment()
because the netdev features passed in by udp_rcv_segment() lack any
information about checksum offload capabilities.
Usually, we have no need to perform this calculation again, as
1) The GRO implementation guarantees that any packets making it to the
udp_rcv_segment() function had correct checksums, and, more
importantly,
2) Upon the successful return of udp_rcv_segment(), we immediately pull
the UDP header off and either queue the segment to the socket or
hand it off to a new protocol handler.
Unless userspace has set the IP_CHECKSUM sockopt to indicate that they
want the final checksum values, we can pass the needed netdev feature
flags to __skb_gso_segment() to avoid checksumming each segment in
skb_segment().
Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the PHYLINK_DEV operation type, the PHYLINK infrastructure can work
without an attached net_device. For printing usecases, instead, a struct
device * should be passed to PHYLINK using the phylink_config structure.
Also, netif_carrier_* calls ar guarded by the presence of a valid
net_device. When using the PHYLINK_DEV operation type, we cannot check
link status using the netif_carrier_ok() API so instead, keep an
internal state of the MAC and call mac_link_{down,up} only when the link
changed.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The phylink_config structure will encapsulate a pointer to a struct
device and the operation type requested for this instance of PHYLINK.
This patch does not make any functional changes, it just transitions the
PHYLINK internals and all its users to the new API.
A pointer to a phylink_config structure will be passed to
phylink_create() instead of the net_device directly. Also, the same
phylink_config pointer will be passed back to all phylink_mac_ops
callbacks instead of the net_device. Using this mechanism, a PHYLINK
user can get the original net_device using a structure such as
'to_net_dev(config->dev)' or directly the structure containing the
phylink_config using a container_of call.
At the moment, only the PHYLINK_NETDEV is defined as a valid operation
type for PHYLINK. In this mode, a valid reference to a struct device
linked to the original net_device should be passed to PHYLINK through
the phylink_config structure.
This API changes is mainly driven by the necessity of adding a new
operation type in PHYLINK that disconnects the phy_device from the
net_device and also works when the net_device is lacking.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ctinfo is a new tc filter action module. It is designed to restore
information contained in firewall conntrack marks to other packet fields
and is typically used on packet ingress paths. At present it has two
independent sub-functions or operating modes, DSCP restoration mode &
skb mark restoration mode.
The DSCP restore mode:
This mode copies DSCP values that have been placed in the firewall
conntrack mark back into the IPv4/v6 diffserv fields of relevant
packets.
The DSCP restoration is intended for use and has been found useful for
restoring ingress classifications based on egress classifications across
links that bleach or otherwise change DSCP, typically home ISP Internet
links. Restoring DSCP on ingress on the WAN link allows qdiscs such as
but by no means limited to CAKE to shape inbound packets according to
policies that are easier to set & mark on egress.
Ingress classification is traditionally a challenging task since
iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
lookups, hence are unable to see internal IPv4 addresses as used on the
typical home masquerading gateway. Thus marking the connection in some
manner on egress for later restoration of classification on ingress is
easier to implement.
Parameters related to DSCP restore mode:
dscpmask - a 32 bit mask of 6 contiguous bits and indicate bits of the
conntrack mark field contain the DSCP value to be restored.
statemask - a 32 bit mask of (usually) 1 bit length, outside the area
specified by dscpmask. This represents a conditional operation flag
whereby the DSCP is only restored if the flag is set. This is useful to
implement a 'one shot' iptables based classification where the
'complicated' iptables rules are only run once to classify the
connection on initial (egress) packet and subsequent packets are all
marked/restored with the same DSCP. A mask of zero disables the
conditional behaviour ie. the conntrack mark DSCP bits are always
restored to the ip diffserv field (assuming the conntrack entry is found
& the skb is an ipv4/ipv6 type)
e.g. dscpmask 0xfc000000 statemask 0x01000000
|----0xFC----conntrack mark----000000---|
| Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
| DSCP | unused | flag |unused |
|-----------------------0x01---000000---|
| |
| |
---| Conditional flag
v only restore if set
|-ip diffserv-|
| 6 bits |
|-------------|
The skb mark restore mode (cpmark):
This mode copies the firewall conntrack mark to the skb's mark field.
It is completely the functional equivalent of the existing act_connmark
action with the additional feature of being able to apply a mask to the
restored value.
Parameters related to skb mark restore mode:
mask - a 32 bit mask applied to the firewall conntrack mark to mask out
bits unwanted for restoration. This can be useful where the conntrack
mark is being used for different purposes by different applications. If
not specified and by default the whole mark field is copied (i.e.
default mask of 0xffffffff)
e.g. mask 0x00ffffff to mask out the top 8 bits being used by the
aforementioned DSCP restore mode.
|----0x00----conntrack mark----ffffff---|
| Bits 31-24 | |
| DSCP & flag| some value here |
|---------------------------------------|
|
|
v
|------------skb mark-------------------|
| | |
| zeroed | |
|---------------------------------------|
Overall parameters:
zone - conntrack zone
control - action related control (reclassify | pipe | drop | continue |
ok | goto chain <CHAIN_INDEX>)
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves common code between rht_ptr and rht_ptr_exclusive
into __rht_ptr. It also adds a new helper rht_ptr_rcu exclusively
for the RCU case. This way rht_ptr becomes a lock-only construct
so we can use the lighter rcu_dereference_protected primitive.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the creation of nexthop groups which reference other nexthop
objects to create multipath routes:
+--------------+
+------------+ +--------------+ |
| nh nh_grp --->| nh_grp_entry |-+
+------------+ +---------|----+
^ | | +------------+
+----------------+ +--->| nh, weight |
nh_parent +------------+
A group entry points to a nexthop with a weight for that hop within the
group. The nexthop has a list_head, grp_list, for tracking which groups
it is a member of and the group entry has a reference back to the parent.
The grp_list is used when a nexthop is deleted - to efficiently remove
it from groups using it.
If a nexthop group spec is given, no other attributes can be set. Each
nexthop id in a group spec must already exist.
Similar to single nexthops, the specification of a nexthop group can be
updated so that data is managed with rcu locking.
Add path selection function to account for multiple paths and add
ipv{4,6}_good_nh helpers to know that if a neighbor entry exists it is
in a good state.
Update NETDEV event handling to rebalance multipath nexthop groups if
a nexthop is deleted due to a link event (down or unregister).
When a nexthop is removed any groups using it are updated. Groups using a
nexthop a tracked via a grp_list.
Nexthop dumps can be limited to groups only by adding NHA_GROUPS to the
request.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for NHA_ENCAP and NHA_ENCAP_TYPE. Leverages the existing code
for lwtunnel within fib_nh_common, so the only change needed is handling
the attributes in the nexthop code.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Handle IPv6 gateway in a nexthop spec. If nh_family is set to AF_INET6,
NHA_GATEWAY is expected to be an IPv6 address. Add ipv6 option to gw in
nh_config to hold the address, add fib6_nh to nh_info to leverage the
ipv6 initialization and cleanup code. Update nh_fill_node to dump the v6
address.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for IPv4 nexthops. If nh_family is set to AF_INET, then
NHA_GATEWAY is expected to be an IPv4 address.
Register for netdev events to be notified of admin up/down changes as
well as deletes. A hash table is used to track nexthop per devices to
quickly convert device events to the affected nexthops.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Barebones start point for nexthops. Implementation for RTM commands,
notifications, management of rbtree for holding nexthops by id, and
kernel side data structures for nexthops and nexthop config.
Nexthops are maintained in an rbtree sorted by id. Similar to routes,
nexthops are configured per namespace using netns_nexthop struct added
to struct net.
Nexthop notifications are sent when a nexthop is added or deleted,
but NOT if the delete is due to a device event or network namespace
teardown (which also involves device events). Applications are
expected to use the device down event to flush nexthops and any
routes used by the nexthops.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
New UAPI for nexthops as standalone objects:
- defines netlink ancillary header, struct nhmsg
- RTM commands for nexthop objects, RTM_*NEXTHOP,
- RTNLGRP for nexthop notifications, RTNLGRP_NEXTHOP,
- Attributes for creating nexthops, NHA_*
- Attribute for route specs to specify a nexthop by id, RTA_NH_ID.
The nexthop attributes and semantics follow the route and RTA ones for
device, gateway and lwt encap. Unique to nexthop objects are a blackhole
and a group which contains references to other nexthop objects. With the
exception of blackhole and group, nexthop objects MUST contain a device.
Gateway and encap are optional. Nexthop groups can only reference other
pre-existing nexthops by id. If the NHA_ID attribute is present that id
is used for the nexthop. If not specified, one is auto assigned.
Dump requests can include attributes:
- NHA_GROUPS to return only nexthop groups,
- NHA_MASTER to limit dumps to nexthops with devices enslaved to the
given master (e.g., VRF)
- NHA_OIF to limit dumps to nexthops using given device
nlmsg_route_perms in selinux code is updated for the new RTM comands.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fqdir_init() is not fast path and is getting bigger.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syszbot found an interesting use-after-free [1] happening
while IPv4 fragment rhashtable was destroyed at netns dismantle.
While no insertions can possibly happen at the time a dismantling
netns is destroying this rhashtable, timers can still fire and
attempt to remove elements from this rhashtable.
This is forbidden, since rhashtable_free_and_destroy() has
no synchronization against concurrent inserts and deletes.
Add a new fqdir->dead flag so that timers do not attempt
a rhashtable_remove_fast() operation.
We also have to respect an RCU grace period before starting
the rhashtable_free_and_destroy() from process context,
thus we use rcu_work infrastructure.
This is a refinement of a prior rough attempt to fix this bug :
https://marc.info/?l=linux-netdev&m=153845936820900&w=2
Since the rhashtable cleanup is now deferred to a work queue,
netns dismantles should be slightly faster.
[1]
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:194 [inline]
BUG: KASAN: use-after-free in rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
Read of size 8 at addr ffff8880a6497b70 by task kworker/0:0/5
CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.2.0-rc1+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events rht_deferred_worker
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
__kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
kasan_report+0x12/0x20 mm/kasan/common.c:614
__asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
__read_once_size include/linux/compiler.h:194 [inline]
rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
rht_deferred_worker+0x111/0x2030 lib/rhashtable.c:411
process_one_work+0x989/0x1790 kernel/workqueue.c:2269
worker_thread+0x98/0xe40 kernel/workqueue.c:2415
kthread+0x354/0x420 kernel/kthread.c:255
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Allocated by task 32687:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_kmalloc mm/kasan/common.c:489 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
__do_kmalloc_node mm/slab.c:3620 [inline]
__kmalloc_node+0x4e/0x70 mm/slab.c:3627
kmalloc_node include/linux/slab.h:590 [inline]
kvmalloc_node+0x68/0x100 mm/util.c:431
kvmalloc include/linux/mm.h:637 [inline]
kvzalloc include/linux/mm.h:645 [inline]
bucket_table_alloc+0x90/0x480 lib/rhashtable.c:178
rhashtable_init+0x3f4/0x7b0 lib/rhashtable.c:1057
inet_frags_init_net include/net/inet_frag.h:109 [inline]
ipv4_frags_init_net+0x182/0x410 net/ipv4/ip_fragment.c:683
ops_init+0xb3/0x410 net/core/net_namespace.c:130
setup_net+0x2d3/0x740 net/core/net_namespace.c:316
copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
ksys_unshare+0x440/0x980 kernel/fork.c:2692
__do_sys_unshare kernel/fork.c:2760 [inline]
__se_sys_unshare kernel/fork.c:2758 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 7:
save_stack+0x23/0x90 mm/kasan/common.c:71
set_track mm/kasan/common.c:79 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
__cache_free mm/slab.c:3432 [inline]
kfree+0xcf/0x220 mm/slab.c:3755
kvfree+0x61/0x70 mm/util.c:460
bucket_table_free+0x69/0x150 lib/rhashtable.c:108
rhashtable_free_and_destroy+0x165/0x8b0 lib/rhashtable.c:1155
inet_frags_exit_net+0x3d/0x50 net/ipv4/inet_fragment.c:152
ipv4_frags_exit_net+0x73/0x90 net/ipv4/ip_fragment.c:695
ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154
cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553
process_one_work+0x989/0x1790 kernel/workqueue.c:2269
worker_thread+0x98/0xe40 kernel/workqueue.c:2415
kthread+0x354/0x420 kernel/kthread.c:255
ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
The buggy address belongs to the object at ffff8880a6497b40
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 48 bytes inside of
1024-byte region [ffff8880a6497b40, ffff8880a6497f40)
The buggy address belongs to the page:
page:ffffea0002992580 refcount:1 mapcount:0 mapping:ffff8880aa400ac0 index:0xffff8880a64964c0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea0002916e88 ffffea000218fe08 ffff8880aa400ac0
raw: ffff8880a64964c0 ffff8880a6496040 0000000100000005 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8880a6497a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880a6497a80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff8880a6497b00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^
ffff8880a6497b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880a6497c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 648700f76b ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch will add rcu grace period before fqdir
rhashtable destruction, so we need to dynamically allocate
fqdir structures to not force expensive synchronize_rcu() calls
in netns dismantle path.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fqdir will soon be dynamically allocated.
We need to reach the struct net pointer from fqdir,
so add it, and replace the various container_of() constructs
by direct access to the new field.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
And pass an extra parameter, since we will soon
dynamically allocate fqdir structures.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the @frags fields from structs netns_ipv4, netns_ipv6,
netns_nf_frag and netns_ieee802154_lowpan to @fqdir
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1) struct netns_frags is renamed to struct fqdir
This structure is really holding many frag queues in a hash table.
2) (struct inet_frag_queue)->net field is renamed to fqdir
since net is generally associated to a 'struct net' pointer
in networking stack.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add iWARP engine affinity setting for supporting iWARP over 100g.
iWARP cannot be distinguished by the LLH from L2, hence the
engine division will affect L2 as well. For this reason we add
a parameter to devlink to determine the engine division.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ethtool ops get_rxfh_context and set_rxfh_context are used to create,
remove and access parameters associated to RSS contexts, in a similar
fashion to get_rxfh and set_rxfh.
Add a small descritopn of these callbacks in the struct ethtool_ops doc.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull libnvdimm fixes from Dan Williams:
- Fix a regression that disabled device-mapper dax support
- Remove unnecessary hardened-user-copy overhead (>30%) for dax
read(2)/write(2).
- Fix some compilation warnings.
* tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
dax: Arrange for dax_supported check to span multiple devices
libnvdimm: Fix compilation warnings with W=1
Pull block fixes from Jens Axboe:
- NVMe pull request from Keith, with fixes from a few folks.
- bio and sbitmap before atomic barrier fixes (Andrea)
- Hang fix for blk-mq freeze and unfreeze (Bob)
- Single segment count regression fix (Christoph)
- AoE now has a new maintainer
- tools/io_uring/ Makefile fix, and sync with liburing (me)
* tag 'for-linus-20190524' of git://git.kernel.dk/linux-block: (23 commits)
tools/io_uring: sync with liburing
tools/io_uring: fix Makefile for pthread library link
blk-mq: fix hang caused by freeze/unfreeze sequence
block: remove the bi_seg_{front,back}_size fields in struct bio
block: remove the segment size check in bio_will_gap
block: force an unlimited segment size on queues with a virt boundary
block: don't decrement nr_phys_segments for physically contigous segments
sbitmap: fix improper use of smp_mb__before_atomic()
bio: fix improper use of smp_mb__before_atomic()
aoe: list new maintainer for aoe driver
nvme-pci: use blk-mq mapping for unmanaged irqs
nvme: update MAINTAINERS
nvme: copy MTFA field from identify controller
nvme: fix memory leak for power latency tolerance
nvme: release namespace SRCU protection before performing controller ioctls
nvme: merge nvme_ns_ioctl into nvme_ioctl
nvme: remove the ifdef around nvme_nvm_ioctl
nvme: fix srcu locking on error return in nvme_get_ns_from_disk
nvme: Fix known effects
nvme-pci: Sync queues on reset
...
Pule more SPDX updates from Greg KH:
"Here is another set of reviewed patches that adds SPDX tags to
different kernel files, based on a set of rules that are being used to
parse the comments to try to determine that the license of the file is
"GPL-2.0-or-later".
Only the "obvious" versions of these matches are included here, a
number of "non-obvious" variants of text have been found but those
have been postponed for later review and analysis.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on
the patches are reviewers"
* tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (85 commits)
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 125
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 123
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 122
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 121
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 120
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 119
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 116
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 113
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 112
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 111
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 110
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 106
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 105
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 102
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
...
Move fib6_nh to the end of fib6_info and make it an array of
size 0. Pass a flag to fib6_info_alloc indicating if the
allocation needs to add space for a fib6_nh.
The current code path always has a fib6_nh allocated with a
fib6_info; with nexthop objects they will be separate.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the pcpu routes exceptions are really per nexthop, so move
rt6i_exception_bucket from fib6_info to fib6_nh.
To avoid additional increases to the size of fib6_nh for a 1-bit flag,
use the lowest bit in the allocated memory pointer for the flushed flag.
Add helpers for retrieving the bucket pointer to mask off the flag.
The cleanup of the exception bucket is moved to fib6_nh_release.
fib6_nh_flush_exceptions can now be called from 2 contexts:
1. deleting a fib entry
2. deleting a fib6_nh
For 1., fib6_nh_flush_exceptions is called for a specific fib6_info that
is getting deleted. All exceptions in the cache using the entry are
deleted. For 2, the fib6_nh itself is getting destroyed so
fib6_nh_flush_exceptions is called for a NULL fib6_info which means
flush all entries.
The pmtu.sh selftest exercises the affected code paths - from creating
exceptions to cleaning them up on device delete. All tests pass without
any rcu locking or memleak warnings.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rt6_info are specific instances of a fib entry and are tied to a
device and gateway - ie., a nexthop. Before nexthop objects, IPv6 fib
entries have separate fib6_info for each nexthop in a multipath route,
so the location of the pcpu cache in the fib6_info struct worked.
However, with nexthop objects a fib6_info can point to a set of nexthops
(yet another alignment of ipv6 with ipv4). Accordingly, the pcpu
cache needs to be moved to the fib6_nh struct so the cached entries
are local to the nexthop specification used to create the rt6_info.
Initialization and free of the pcpu entries moved to fib6_nh_init and
fib6_nh_release.
Change in location only, from fib6_info down to fib6_nh; no other
functional change intended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not see the file copying or write to the free
software foundation inc
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 12 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091651.231300438@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 or at your option any
later version this program is distributed in the hope that it will
be useful but without any warranty without even the implied warranty
of merchantability or fitness for a particular purpose see the gnu
general public license for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 44 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091651.032047323@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this sctp implementation is free software you can redistribute it
and or modify it under the terms of the gnu general public license
as published by the free software foundation either version 2 or at
your option any later version this sctp implementation is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with gnu cc see the file copying if not see
http www gnu org licenses
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 42 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091649.683323110@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
51 franklin st fifth floor boston ma 02110 1301 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 50 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091649.499889647@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>