linux

mirror of https://github.com/hardkernel/linux.git synced 2026-04-14 01:20:41 +09:00

Author	SHA1	Message	Date
Amol Grover	51cae673d0	sunrpc: Pass lockdep expression to RCU lists detail->hash_table[] is traversed using hlist_for_each_entry_rcu outside an RCU read-side critical section but under the protection of detail->hash_lock. Hence, add corresponding lockdep expression to silence false-positive warnings, and harden RCU lists. Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2020-03-16 12:04:31 -04:00
Chuck Lever	d162372af3	SUNRPC: Trim stack utilization in the wrap and unwrap paths By preventing compiler inlining of the integrity and privacy helpers, stack utilization for the common case (authentication only) goes way down. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-16 10:18:45 -04:00
Chuck Lever	8d6bda7f23	SUNRPC: Remove xdr_buf_read_mic() Clean up: this function is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-16 10:18:45 -04:00
Chuck Lever	4047aa909c	sunrpc: Fix gss_unwrap_resp_integ() again xdr_buf_read_mic() tries to find unused contiguous space in a received xdr_buf in order to linearize the checksum for the call to gss_verify_mic. However, the corner cases in this code are numerous and we seem to keep missing them. I've just hit yet another buffer overrun related to it. This overrun is at the end of xdr_buf_read_mic():
    if (buf->tail[0].iov_len != 0)
        mic->data = buf->tail[0].iov_base + buf->tail[0].iov_len;
    else
        mic->data = buf->head[0].iov_base + buf->head[0].iov_len;
    __read_bytes_from_xdr_buf(&subbuf, mic->data, mic->len);
    return 0;
This logic assumes the transport has set the length of the tail based on the size of the received message. base + len is then supposed to be off the end of the message but still within the actual buffer. In fact, the length of the tail is set by the upper layer when the Call is encoded so that the end of the tail is actually the end of the allocated buffer itself. This causes the logic above to set mic->data to point past the end of the receive buffer. The "mic->data = head" arm of this if statement is no less fragile. As near as I can tell, this has been a problem forever. I'm not sure that minimizing au_rslack recently changed this pathology much. So instead, let's use a more straightforward approach: kmalloc a separate buffer to linearize the checksum. This is similar to how gss_validate() currently works. Coming back to this code, I had some trouble understanding what was going on. So I've cleaned up the variable naming and added a few comments that point back to the XDR definition in RFC 2203 to help guide future spelunkers, including myself. As an added clean up, the functionality that was in xdr_buf_read_mic() is folded directly into gss_unwrap_resp_integ(), as that is its only caller.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-16 10:18:44 -04:00
Colin Ian King	68e9a2463d	SUNRPC: remove redundant assignments to variable status The variable status is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-16 10:10:36 -04:00
Trond Myklebust	263fb9c21e	SUNRPC: Don't take a reference to the cred on synchronous tasks If the RPC call is synchronous, assume the cred is already pinned by the caller. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-16 08:34:29 -04:00
Trond Myklebust	7eac52648a	SUNRPC: Add a flag to avoid reference counts on credentials Add a flag to signal to the RPC layer that the credential is already pinned for the duration of the RPC call. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-03-16 08:34:28 -04:00
Torsten Hilbrich	2a9de3af21	vti6: Fix memory leak of skb if input policy check fails The vti6_rcv function performs some tests on the retrieved tunnel including checking the IP protocol, the XFRM input policy, the source and destination address. In all but one place, the skb is released in the error case. When the input policy check fails the network packet is leaked. Use the same goto-label discard in this case to fix this problem. Fixes: `ed1efb2aef` ("ipv6: Add support for IPsec virtual tunnel interfaces") Signed-off-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>	2020-03-16 11:13:48 +01:00
Jiri Pirko	74522e7baa	net: sched: set the hw_stats_type in pedit loop For a single pedit action, multiple offload entries may be used. Set the hw_stats_type to all of them. Fixes: `44f8658017` ("sched: act: allow user to specify type of HW stats for a filter") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-16 02:13:43 -07:00
Michal Kubecek	2363d73a2f	ethtool: reject unrecognized request flags As pointed out by Jakub Kicinski, the ethtool netlink code should respond with an error if request head has flags set which are not recognized by kernel, either as a mistake or because it expects functionality introduced in later kernel versions. To avoid unnecessary roundtrips, use extack cookie to provide the information about supported request flags. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-16 02:04:24 -07:00
Michal Kubecek	fe2a31d790	netlink: allow extack cookie also for error messages Commit `ba0dc5f6e0` ("netlink: allow sending extended ACK with cookie on success") introduced a cookie which can be sent to userspace as part of extended ack message in the form of NLMSGERR_ATTR_COOKIE attribute. Currently the cookie is ignored if error code is non-zero but there is no technical reason for such limitation and it can be useful to provide machine parseable information as part of an error message. Include NLMSGERR_ATTR_COOKIE whenever the cookie has been set, regardless of error code. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-16 02:04:24 -07:00
Cong Wang	ef299cc3fa	net_sched: cls_route: remove the right filter from hashtable route4_change() allocates a new filter and copies values from the old one. After the new filter is inserted into the hash table, the old filter should be removed and freed, as the final step of the update. However, the current code mistakenly removes the new one. This looks apparently wrong to me, and it causes double "free" and use-after-free too, as reported by syzbot. Reported-and-tested-by: syzbot+f9b32aaacd60305d9687@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+2f8c233f131943d6056d@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+9c2df9fd5e9445b74e01@syzkaller.appspotmail.com Fixes: `1109c00547` ("net: sched: RCU cls_route") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-16 01:59:32 -07:00
Taehee Yoo	09e91dbea0	hsr: set .netnsok flag The hsr module has been supporting the list and status command. (HSR_C_GET_NODE_LIST and HSR_C_GET_NODE_STATUS) These commands send node information to the user-space via generic netlink. But, in the non-init_net namespace, these commands are not allowed because .netnsok flag is false. So, there is no way to get node information in the non-init_net namespace. Fixes: `f421436a59` ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-16 01:46:09 -07:00
Taehee Yoo	ca19c70f52	hsr: add restart routine into hsr_get_node_list() The hsr_get_node_list() is to send node addresses to the userspace. If there are so many nodes, it could fail because of buffer size. In order to avoid this failure, the restart routine is added. Fixes: `f421436a59` ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-16 01:46:09 -07:00
Taehee Yoo	173756b868	hsr: use rcu_read_lock() in hsr_get_node_{list/status}() hsr_get_node_{list/status}() are not under rtnl_lock() because they are callback functions of generic netlink. But they use __dev_get_by_index() without rtnl_lock(). So, it would use unsafe data. In order to fix it, rcu_read_lock() and dev_get_by_index_rcu() are used instead of __dev_get_by_index(). Fixes: `f421436a59` ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-16 01:46:09 -07:00
Russell King	87615c96e7	net: dsa: warn if phylink_mac_link_state returns error Issue a warning to the kernel log if phylink_mac_link_state() returns an error. This should not occur, but let's make it visible. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-15 17:11:12 -07:00
Florian Westphal	d0febd81ae	netfilter: conntrack: re-visit sysctls in unprivileged namespaces since commit `b884fa4617` ("netfilter: conntrack: unify sysctl handling") conntrack no longer exposes most of its sysctls (e.g. tcp timeouts settings) to network namespaces that are not owned by the initial user namespace. This patch exposes all sysctls even if the namespace is unprivileged. compared to a 4.19 kernel, the newly visible and writeable sysctls are: net.netfilter.nf_conntrack_acct net.netfilter.nf_conntrack_timestamp .. to allow to enable accounting and timestamp extensions. net.netfilter.nf_conntrack_events .. to turn off conntrack event notifications. net.netfilter.nf_conntrack_checksum .. to disable checksum validation. net.netfilter.nf_conntrack_log_invalid .. to enable logging of packets deemed invalid by conntrack. newly visible sysctls that are only exported as read-only: net.netfilter.nf_conntrack_count .. current number of conntrack entries living in this netns. net.netfilter.nf_conntrack_max .. global upperlimit (maximum size of the table). net.netfilter.nf_conntrack_buckets .. size of the conntrack table (hash buckets). net.netfilter.nf_conntrack_expect_max .. maximum number of permitted expectations in this netns. net.netfilter.nf_conntrack_helper .. conntrack helper auto assignment. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:51 +01:00
Pablo Neira Ayuso	339706bc21	netfilter: nft_lookup: update element stateful expression If the set element comes with a stateful expression, update it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:50 +01:00
Pablo Neira Ayuso	76adfafeca	netfilter: nf_tables: add nft_set_elem_update_expr() helper function This helper function runs the eval path of the stateful expression of an existing set element. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:49 +01:00
Pablo Neira Ayuso	4094445229	netfilter: nf_tables: add elements with stateful expressions Update nft_add_set_elem() to handle the NFTA_SET_ELEM_EXPR netlink attribute. This patch allows users to add elements with stateful expressions. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:49 +01:00
Pablo Neira Ayuso	795a6d6b42	netfilter: nf_tables: statify nft_expr_init() Not exposed anymore to modules, statify this function. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:48 +01:00
Pablo Neira Ayuso	a7fc936804	netfilter: nf_tables: add nft_set_elem_expr_alloc() Add helper function to create stateful expression. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:47 +01:00
Stefano Brivio	eb16933aa5	nft_set_pipapo: Prepare for single ranged field usage A few adjustments in nft_pipapo_init() are needed to allow usage of this set back-end for a single, ranged field. Provide a convenient NFT_PIPAPO_MIN_FIELDS definition that currently makes sure that the rbtree back-end is selected instead, for sets with a single field. This finally allows a fair comparison with rbtree sets, by defining NFT_PIPAPO_MIN_FIELDS as 0 and skipping rbtree back-end initialisation:
---------------.--------------------------.-------------------------.
AMD Epyc 7402  |      baselines, Mpps     |   Mpps, % over rbtree   |
 1 thread      |__________________________|_________________________|
 3.35GHz       |        |        |        |            |            |
 768KiB L1D$   | netdev |  hash  | rbtree |            |   pipapo   |
---------------|  hook  |   no   | single |   pipapo   |single field|
type   entries |  drop  | ranges | field  |single field|    AVX2    |
---------------|--------|--------|--------|------------|------------|
net,port       |        |        |        |            |            |
         1000  |   19.0 |   10.4 |    3.8 |  6.0  +58% |  9.6 +153% |
---------------|--------|--------|--------|------------|------------|
port,net       |        |        |        |            |            |
          100  |   18.8 |   10.3 |    5.8 |  9.1  +57% |11.6  +100% |
---------------|--------|--------|--------|------------|------------|
net6,port      |        |        |        |            |            |
         1000  |   16.4 |    7.6 |    1.8 |  2.8  +55% |  6.5 +261% |
---------------|--------|--------|--------|------------|------------|
port,proto     |        |        |        |    [1]     |    [1]     |
        30000  |   19.6 |   11.6 |    3.9 |  0.9  -77% |  2.7  -31% |
---------------|--------|--------|--------|------------|------------|
port,proto     |        |        |        |            |            |
        10000  |   19.6 |   11.6 |    4.4 |  2.1  -52% |  5.6  +27% |
---------------|--------|--------|--------|------------|------------|
port,proto     |        |        |        |            |            |
4 threads 10000|   77.9 |   45.1 |   17.4 |  8.3  -52% |22.4   +29% |
---------------|--------|--------|--------|------------|------------|
net6,port,mac  |        |        |        |            |            |
           10  |   16.5 |    5.4 |    4.3 |  4.5   +5% |  8.2  +91% |
---------------|--------|--------|--------|------------|------------|
net6,port,mac, |        |        |        |            |            |
proto    1000  |   16.5 |    5.7 |    1.9 |  2.8  +47% |  6.6 +247% |
---------------|--------|--------|--------|------------|------------|
net,mac        |        |        |        |            |            |
         1000  |   19.0 |    8.4 |    3.9 |  6.0  +54% |  9.9 +154% |
---------------'--------'--------'--------'------------'------------'
[1] Causes switch of lookup table buckets for 'port' to 4-bit groups
Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:46 +01:00
Stefano Brivio	7400b06396	nft_set_pipapo: Introduce AVX2-based lookup implementation If the AVX2 set is available, we can exploit the repetitive characteristic of this algorithm to provide a fast, vectorised version by using 256-bit wide AVX2 operations for bucket loads and bitwise intersections. In most cases, this implementation consistently outperforms rbtree set instances despite the fact they are configured to use a given, single, ranged data type out of the ones used for performance measurements by the nft_concat_range.sh kselftest. That script, injecting packets directly on the ingoing device path with pktgen, reports, averaged over five runs on a single AMD Epyc 7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below. CONFIG_RETPOLINE was not set here. Note that this is not a fair comparison over hash and rbtree set types: non-ranged entries (used to have a reference for hash types) would be matched faster than this, and matching on a single field only (which is the case for rbtree) is also significantly faster. However, it's not possible at the moment to choose this set type for non-ranged entries, and the current implementation also needs a few minor adjustments in order to match on less than two fields.
---------------.-----------------------------------.------------.
AMD Epyc 7402  |          baselines, Mpps          | this patch |
 1 thread      |___________________________________|____________|
 3.35GHz       |        |        |        |        |            |
 768KiB L1D$   | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single |        |   pipapo   |
type   entries |  drop  | ranges | field  | pipapo |    AVX2    |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
         1000  |   19.0 |   10.4 |    3.8 |    4.0 |  7.5  +87% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
          100  |   18.8 |   10.3 |    5.8 |    6.3 |  8.1  +29% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
         1000  |   16.4 |    7.6 |    1.8 |    2.1 |  4.8 +128% |
---------------|--------|--------|--------|--------|------------|
port,proto     |        |        |        |        |            |
        30000  |   19.6 |   11.6 |    3.9 |    0.5 |  2.6 +420% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
           10  |   16.5 |    5.4 |    4.3 |    3.4 |  4.7  +38% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto    1000  |   16.5 |    5.7 |    1.9 |    1.4 |  3.6  +26% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
         1000  |   19.0 |    8.4 |    3.9 |    2.5 |  6.4 +156% |
---------------'--------'--------'--------'--------'------------'
A similar strategy could be easily reused to implement specialised versions for other SIMD sets, and I plan to post at least a NEON version at a later time. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:45 +01:00
Stefano Brivio	8683f4b995	nft_set_pipapo: Prepare for vectorised implementation: helpers Move most macros and helpers to a header file, so that they can be conveniently used by related implementations. No functional changes are intended here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:44 +01:00
Stefano Brivio	bf3e583923	nft_set_pipapo: Prepare for vectorised implementation: alignment SIMD vector extension sets require stricter alignment than native instruction sets to operate efficiently (AVX, NEON) or for some instructions to work at all (AltiVec). Provide facilities to define arbitrary alignment for lookup tables and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN, lt_aligned and scratch_aligned pointers become available. Additional headroom is allocated, and pointers to the possibly unaligned, originally allocated areas are kept so that they can be freed. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:43 +01:00
Stefano Brivio	4051f43116	nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch While grouping matching bits in groups of four saves memory compared to the more natural choice of 8-bit words (lookup table size is one eighth), it comes at a performance cost, as the number of lookup comparisons is doubled, and those also need bitshifts and masking. Introduce support for 8-bit lookup groups, together with a mapping mechanism to dynamically switch, based on defined per-table size thresholds and hysteresis, between 8-bit and 4-bit groups, as tables grow and shrink. Empty sets start with 8-bit groups, and per-field tables are converted to 4-bit groups if they get too big. An alternative approach would have been to swap per-set lookup operation functions as needed, but this doesn't allow for different group sizes in the same set, which looks desirable if some fields need significantly more matching data compared to others due to heavier impact of ranges (e.g. a big number of subnets with relatively simple port specifications). Allowing different group sizes for the same lookup functions implies the need for further conditional clauses, whose cost, however, appears to be negligible in tests. The matching rate figures below were obtained for x86_64 running the nft_concat_range.sh "performance" cases, averaged over five runs, on a single thread of an AMD Epyc 7402 CPU, and for aarch64 on a single thread of a BCM2711 (Raspberry Pi 4 Model B 4GB), clocked at a stable 2147MHz frequency:
---------------.-----------------------------------.------------.
AMD Epyc 7402  |          baselines, Mpps          | this patch |
 1 thread      |___________________________________|____________|
 3.35GHz       |        |        |        |        |            |
 768KiB L1D$   | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
         1000  |   19.0 |   10.4 |    3.8 |    2.8 |  4.0  +43% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
          100  |   18.8 |   10.3 |    5.8 |    5.5 |  6.3  +14% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
         1000  |   16.4 |    7.6 |    1.8 |    1.3 |  2.1  +61% |
---------------|--------|--------|--------|--------|------------|
port,proto     |        |        |        |        |    [1]     |
        30000  |   19.6 |   11.6 |    3.9 |    0.3 |  0.5  +66% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
           10  |   16.5 |    5.4 |    4.3 |    2.6 |  3.4  +31% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto    1000  |   16.5 |    5.7 |    1.9 |    1.0 |  1.4  +40% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
         1000  |   19.0 |    8.4 |    3.9 |    1.7 |  2.5  +47% |
---------------'--------'--------'--------'--------'------------'
[1] Causes switch of lookup table buckets for 'port', not 'proto', to 4-bit groups
---------------.-----------------------------------.------------.
BCM2711        |          baselines, Mpps          | this patch |
 1 thread      |___________________________________|____________|
 2147MHz       |        |        |        |        |            |
 32KiB L1D$    | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
         1000  |   1.63 |   1.37 |   0.87 |   0.61 | 0.70  +17% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
          100  |   1.64 |   1.36 |   1.02 |   0.78 | 0.81   +4% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
         1000  |   1.56 |   1.27 |   0.65 |   0.34 | 0.50  +47% |
---------------|--------|--------|--------|--------|------------|
port,proto [2] |        |        |        |        |            |
        10000  |   1.68 |   1.43 |   0.84 |   0.30 | 0.40  +13% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
           10  |   1.56 |   1.14 |   1.02 |   0.62 | 0.66   +6% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto    1000  |   1.56 |   1.12 |   0.64 |   0.27 | 0.40  +48% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
         1000  |   1.63 |   1.26 |   0.87 |   0.41 | 0.53  +29% |
---------------'--------'--------'--------'--------'------------'
[2] Using 10000 entries instead of 30000 as it would take way too long for the test script to generate all of them
Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:43 +01:00
Stefano Brivio	e807b13cb3	nft_set_pipapo: Generalise group size for buckets Get rid of all hardcoded assumptions that buckets in lookup tables correspond to four-bit groups, and replace them with appropriate calculations based on a variable group size, now stored in struct field. The group size could now be in principle any divisor of eight. Note, though, that lookup and get functions need an implementation intimately depending on the group size, and the only supported size there, currently, is four bits, which is also the initial and only used size at the moment. While at it, drop 'groups' from struct nft_pipapo: it was never used. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:42 +01:00
wenxu	88bf6e4114	netfilter: flowtable: add tunnel encap/decap action offload support This patch adds tunnel encap/decap action offload to the flowtable offload. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:27:23 +01:00
wenxu	cfab6dbd0e	netfilter: flowtable: add tunnel match offload support This patch supports both ipv4 and ipv6 tunnel_id, tunnel_src and tunnel_dst match for flowtable offload. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:26:17 +01:00
wenxu	b5140a36da	netfilter: flowtable: add indr block setup support Add netfilter flowtable support for indr-block setup. This enables flowtable offload for vlan and tunnel devices. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:22:50 +01:00
wenxu	4679877921	netfilter: flowtable: add nf_flow_table_block_offload_init() Add nf_flow_table_block_offload_init() to prepare for the indr block offload patch. Signed-off-by: wenxu <wenxu@ucloud.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:22:32 +01:00
Dan Carpenter	f628c27d85	netfilter: xt_IDLETIMER: clean up some indenting These lines were indented wrong so Smatch complained. net/netfilter/xt_IDLETIMER.c:81 idletimer_tg_show() warn: inconsistent indenting Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:17 +01:00
Jeremy Sowden	049dee95f8	netfilter: bitwise: use more descriptive variable-names. Name the mask and xor data variables, "mask" and "xor," instead of "d1" and "d2." Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:16 +01:00
Gustavo A. R. Silva	6daf141401	netfilter: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] Lastly, fix checkpatch.pl warning WARNING: __aligned(size) is preferred over __attribute__((aligned(size))) in net/bridge/netfilter/ebtables.c This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:16 +01:00
Chen Wandun	eb9d7af3b7	netfilter: nft_set_pipapo: make the symbol 'nft_pipapo_get' static Fix the following sparse warning: net/netfilter/nft_set_pipapo.c:739:6: warning: symbol 'nft_pipapo_get' was not declared. Should it be static? Fixes: `3c4287f620` ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Acked-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:16 +01:00
Li RongQing	9325f070f7	netfilter: cleanup unused macro TEMPLATE_NULLS_VAL is not used after commit `0838aa7fcf` ("netfilter: fix netns dependencies with conntrack templates") PFX is not used after commit `8bee4bad03` ("netfilter: xt extensions: use pr_<level>") Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:16 +01:00
Florian Westphal	24d19826fc	netfilter: nf_tables: make all set structs const They do not need to be writeable anymore. v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano) Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:16 +01:00
Florian Westphal	e32a4dc651	netfilter: nf_tables: make sets built-in Placing nftables set support in an extra module is pointless: 1. nf_tables needs dynamic registration interface for sake of one module 2. nft heavily relies on sets, e.g. even simple rule like "nft ... tcp dport { 80, 443 }" will not work with _SETS=n. IOW, either nftables isn't used or both nf_tables and nf_tables_set modules are needed anyway. With extra module: 307K net/netfilter/nf_tables.ko 79K net/netfilter/nf_tables_set.ko text data bss dec filename 146416 3072 545 150033 nf_tables.ko 35496 1817 0 37313 nf_tables_set.ko This patch: 373K net/netfilter/nf_tables.ko 178563 4049 545 183157 nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:16 +01:00
Xin Long	925d844696	netfilter: nft_tunnel: add support for geneve opts Like vxlan and erspan opts, geneve opts should also be supported in nft_tunnel. The difference is geneve RFC (draft-ietf-nvo3-geneve-14) allows a geneve packet to carry multiple geneve opts. So with this patch, nftables/libnftnl would do:
    # nft add table ip filter
    # nft add chain ip filter input { type filter hook input priority 0 \; }
    # nft add tunnel filter geneve_02 { type geneve\; id 2\; \
      ip saddr 192.168.1.1\; ip daddr 192.168.1.2\; \
      sport 9000\; dport 9001\; dscp 1234\; ttl 64\; flags 1\; \
      opts \"1:1:34567890,2:2:12121212,3:3:1212121234567890\"\; }
    # nft list tunnels table filter
    table ip filter {
        tunnel geneve_02 {
            id 2
            ip saddr 192.168.1.1
            ip daddr 192.168.1.2
            sport 9000
            dport 9001
            tos 18
            ttl 64
            flags 1
            geneve opts 1:1:34567890,2:2:12121212,3:3:1212121234567890
        }
    }
v1->v2: - no changes, just post it separately. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:16 +01:00
Manoj Basapathi	68983a354a	netfilter: xtables: Add snapshot of hardidletimer target This is a snapshot of hardidletimer netfilter target. This patch implements a hardidletimer Xtables target that can be used to identify when interfaces have been idle for a certain period of time. Timers are identified by labels and are created when a rule is set with a new label. The rules also take a timeout value (in seconds) as an option. If more than one rule uses the same timer label, the timer will be restarted whenever any of the rules get a hit. One entry for each timer is created in sysfs. This attribute contains the time remaining for the timer to expire. The attributes are located under the xt_idletimer class: /sys/class/xt_idletimer/timers/<label> When the timer expires, the target module sends a sysfs notification to the userspace, which can then decide what to do (e.g. disconnect to save power). Compared to IDLETIMER, HARDIDLETIMER can send notifications when CPU is in suspend too, to notify the timer expiry. v1->v2: Moved all functionality into IDLETIMER module to avoid code duplication per comment from Florian. Signed-off-by: Manoj Basapathi <manojbm@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:16 +01:00
Paul Blakey	c3c831b0a2	netfilter: flowtable: Use nf_flow_offload_tuple for stats as well This patch doesn't change any functionality. Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2020-03-15 15:20:15 +01:00
Willem de Bruijn	61fad6816f	net/packet: tpacket_rcv: avoid a producer race condition PACKET_RX_RING can cause multiple writers to access the same slot if a fast writer wraps the ring while a slow writer is still copying. This is particularly likely with few, large, slots (e.g., GSO packets). Synchronize kernel thread ownership of rx ring slots with a bitmap. Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL while holding the sk receive queue lock. They release this lock before copying and set tp_status to TP_STATUS_USER to release to userspace when done. During copying, another writer may take the lock, also see TP_STATUS_KERNEL, and start writing to the same slot. Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a slot, test and set with the lock held. To release race-free, update tp_status and owner bit as a transaction, so take the lock again. This is one of a variety of discussed options (see Link below): * instead of a shadow ring, embed the data in the slot itself, such as in tp_padding. But any test for this field may match a value left by userspace, causing deadlock. * avoid the lock on release. This leaves a small race if releasing the shadow slot before setting TP_STATUS_USER. The below reproducer showed that this race is not academic. If releasing the slot after tp_status, the race is more subtle. See the first link for details. * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store of two fields. But, legacy applications may interpret all non-zero tp_status as owned by the user. As libpcap does. So this is possible only opt-in by newer processes. It can be added as an optional mode. * embed the struct at the tail of pg_vec to avoid extra allocation. The implementation proved no less complex than a separate field. The additional locking cost on release adds contention, no different than scaling on multicore or multiqueue h/w. 
In practice, neither the reproducer below nor small-packet tcpdump showed a noticeable change in perf report in cycles spent in spinlock. Where contention is problematic, packet sockets support mitigation through PACKET_FANOUT. And we can consider adding opt-in state TP_KERNEL_OWNED. Easy to reproduce by running multiple netperf or similar TCP_STREAM flows concurrently with `tcpdump -B 129 -n greater 60000`. Based on an earlier patchset by Jon Rosen. See links below. I believe this issue goes back to the introduction of tpacket_rcv, which predates git history. Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html Suggested-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-15 00:25:25 -07:00
Paolo Abeni	dc093db5cc	mptcp: drop unneeded checks After the previous patch subflow->conn is always != NULL and is never changed. We can drop a bunch of now unneeded checks. v1 -> v2: - rebased on top of commit `2398e3991b` ("mptcp: always include dack if possible.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-15 00:19:03 -07:00
Paolo Abeni	58b0991962	mptcp: create msk early This change moves the mptcp socket allocation from mptcp_accept() to subflow_syn_recv_sock(), so that subflow->conn is now always set for the non fallback scenario. It allows cleaning up mptcp_accept() a bit, reducing the additional locking, and will allow further cleanup in the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-15 00:19:03 -07:00
Petr Machata	e1f8f78ffe	net: ip_gre: Separate ERSPAN newlink / changelink callbacks ERSPAN shares most of the code path with GRE and gretap code. While that helps keep the code compact, it is also error prone. Currently a broken userspace can turn a gretap tunnel into a de facto ERSPAN one by passing IFLA_GRE_ERSPAN_VER. There has been a similar issue in ip6gretap in the past. To prevent these problems in future, split the newlink and changelink code paths. Split the ERSPAN code out of ipgre_netlink_parms() into a new function erspan_netlink_parms(). Extract a piece of common logic from ipgre_newlink() and ipgre_changelink() into ipgre_newlink_encap_setup(). Add erspan_newlink() and erspan_changelink(). Fixes: `84e54fe0a5` ("gre: introduce native tunnel support for ERSPAN") Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-15 00:14:08 -07:00
Hoang Le	746a1eda68	tipc: add NULL pointer check to prevent kernel oops Calling: tipc_node_link_down()-> - tipc_node_write_unlock()->tipc_mon_peer_down() - tipc_mon_peer_down() just after disabling bearer could cause a kernel oops. Fix this by adding a sanity check to ensure valid memory access. Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-15 00:07:00 -07:00
Hoang Le	e228c5c088	tipc: simplify trivial boolean return Checking and returning 'true' boolean is useless as it will be returned at the end of the function. Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-15 00:07:00 -07:00
Petr Machata	0a7fad2376	net: sched: RED: Introduce an ECN nodrop mode When the RED Qdisc is currently configured to enable ECN, the RED algorithm is used to decide whether a certain SKB should be marked. If that SKB is not ECN-capable, it is early-dropped. It is also possible to keep all traffic in the queue, and just mark the ECN-capable subset of it, as appropriate under the RED algorithm. Some switches support this mode, and some installations make use of it. To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is configured with this flag, non-ECT traffic is enqueued instead of being early-dropped. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-14 21:03:46 -07:00
Petr Machata	14bc175d9c	net: sched: Allow extending set of supported RED flags The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool of global RED flags. These are passed in tc_red_qopt.flags. However none of these qdiscs validate the flag field, and just copy it over wholesale to internal structures, and later dump it back. (An exception is GRED, which does validate for VQs -- however not for the main setup.) A broken userspace can therefore configure a qdisc with arbitrary unsupported flags, and later expect to see the flags on qdisc dump. The current ABI therefore allows storage of several bits of custom data to qdisc instances of the types mentioned above. How many bits, depends on which flags are meaningful for the qdisc in question. E.g. SFQ recognizes flags ECN and HARDDROP, and the rest is not interpreted. If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it, and at the same time it needs to retain the possibility to store 6 bits of uninterpreted data. Likewise RED, which adds a new flag later in this patchset. To that end, this patch adds a new function, red_get_flags(), to split the passed flags of RED-like qdiscs to flags and user bits, and red_validate_flags() to validate the resulting configuration. It further adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-03-14 21:03:46 -07:00
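
The flag/user-bit split described in the last commit above can be illustrated with a minimal userspace sketch. This is not the kernel's red_get_flags(): the helper name demo_split_flags() and the DEMO_* constants are hypothetical and chosen only for illustration. It assumes an 8-bit flags byte, of which a qdisc such as SFQ recognizes two bits (ECN and HARDDROP), leaving six bits of uninterpreted data that must be preserved for dump.

```c
#include <stdio.h>

/* Hypothetical flag bits for illustration only; not taken from kernel headers. */
#define DEMO_RED_ECN      0x01u
#define DEMO_RED_HARDDROP 0x02u

/*
 * Split a raw 8-bit flags byte into the bits a qdisc recognizes (flags)
 * and the remaining bits it must store and dump back untouched (userbits).
 * 'supported' is the mask of flags meaningful to the qdisc in question.
 */
static void demo_split_flags(unsigned char raw, unsigned char supported,
                             unsigned char *flags, unsigned char *userbits)
{
    *flags = raw & supported;
    *userbits = raw & (unsigned char)~supported;
}

/* Recombine the two halves, as a dump path would need to do. */
static unsigned char demo_join_flags(unsigned char flags, unsigned char userbits)
{
    return (unsigned char)(flags | userbits);
}
```

With a supported mask of DEMO_RED_ECN | DEMO_RED_HARDDROP, a raw value of 0xFF splits into flags 0x03 and user bits 0xFC, and joining the two halves returns the original byte. Keeping the user bits separate is what would let a qdisc interpret a new flag later without changing what existing userspace sees on dump.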


60851 Commits