linux

mirror of https://github.com/hardkernel/linux.git synced 2026-06-04 10:12:16 +09:00

Author	SHA1	Message	Date
Alexei Starovoitov	1fc5bdb2b8	Merge branch 'Split bpf_sock dst_port field' Jakub Sitnicki says: ==================== This is a follow-up to discussion around the idea of making dst_port in struct bpf_sock a 16-bit field that happened in [1]. v2: - use an anonymous field for zero padding (Alexei) v1: - keep dst_field offset unchanged to prevent existing BPF program breakage (Martin) - allow 8-bit loads from dst_port[0] and [1] - add test coverage for the verifier and the context access converter [1] https://lore.kernel.org/bpf/87sftbobys.fsf@cloudflare.com/ ==================== Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-01-31 12:39:19 -08:00
Jakub Sitnicki	8f50f16ff3	selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads Add coverage to the verifier tests and tests for reading bpf_sock fields to ensure that 32-bit, 16-bit, and 8-bit loads from dst_port field are allowed only at intended offsets and produce expected values. While 16-bit and 8-bit access to dst_port field is straight-forward, 32-bit wide loads need be allowed and produce a zero-padded 16-bit value for backward compatibility. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20220130115518.213259-3-jakub@cloudflare.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-01-31 12:39:12 -08:00
Jakub Sitnicki	4421a58271	bpf: Make dst_port field in struct bpf_sock 16-bit wide Menglong Dong reports that the documentation for the dst_port field in struct bpf_sock is inaccurate and confusing. From the BPF program PoV, the field is a zero-padded 16-bit integer in network byte order. The value appears to the BPF user as if laid out in memory as so: offsetof(struct bpf_sock, dst_port) + 0 <port MSB> + 8 <port LSB> +16 0x00 +24 0x00 32-, 16-, and 8-bit wide loads from the field are all allowed, but only if the offset into the field is 0. 32-bit wide loads from dst_port are especially confusing. The loaded value, after converting to host byte order with bpf_ntohl(dst_port), contains the port number in the upper 16-bits. Remove the confusion by splitting the field into two 16-bit fields. For backward compatibility, allow 32-bit wide loads from offsetof(struct bpf_sock, dst_port). While at it, allow loads 8-bit loads at offset [0] and [1] from dst_port. Reported-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20220130115518.213259-2-jakub@cloudflare.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-01-31 12:39:12 -08:00
Alexander Lobakin	f322a620be	ixgbe: respect metadata on XSK Rx to skb For now, if the XDP prog returns XDP_PASS on XSK, the metadata will be lost as it doesn't get copied to the skb. Copy it along with the frame headers. Account its size on skb allocation, and when copying just treat it as a part of the frame and do a pull after to "move" it to the "reserved" zone. net_prefetch() xdp->data_meta and align the copy size to speed-up memcpy() a little and better match ixgbe_construct_skb(). Fixes: `d0bcacd0a1` ("ixgbe: add AF_XDP zero-copy Rx support") Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-01-31 09:50:20 -08:00
Alexander Lobakin	8f405221a7	ixgbe: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb {__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD + NET_IP_ALIGN for any skb. OTOH, ixgbe_construct_skb_zc() currently allocates and reserves additional `xdp->data - xdp->data_hard_start`, which is XDP_PACKET_HEADROOM for XSK frames. There's no need for that at all as the frame is post-XDP and will go only to the networking stack core. Pass the size of the actual data only to __napi_alloc_skb() and don't reserve anything. This will give enough headroom for stack processing. Fixes: `d0bcacd0a1` ("ixgbe: add AF_XDP zero-copy Rx support") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-01-31 09:47:21 -08:00
Alexander Lobakin	1fbdaa1338	ixgbe: pass bi->xdp to ixgbe_construct_skb_zc() directly To not dereference bi->xdp each time in ixgbe_construct_skb_zc(), pass bi->xdp as an argument instead of bi. We can also call xsk_buff_free() outside of the function as well as assign bi->xdp to NULL, which seems to make it closer to its name. Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-01-31 09:47:21 -08:00
Alexander Lobakin	f9e61d365b	igc: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb {__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD + NET_IP_ALIGN for any skb. OTOH, igc_construct_skb_zc() currently allocates and reserves additional `xdp->data_meta - xdp->data_hard_start`, which is about XDP_PACKET_HEADROOM for XSK frames. There's no need for that at all as the frame is post-XDP and will go only to the networking stack core. Pass the size of the actual data only (+ meta) to __napi_alloc_skb() and don't reserve anything. This will give enough headroom for stack processing. Also, net_prefetch() xdp->data_meta and align the copy size to speed-up memcpy() a little and better match igc_construct_skb(). Fixes: `fc9df2a0b5` ("igc: Enable RX via AF_XDP zero-copy") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Nechama Kraus <nechamax.kraus@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-01-31 09:47:13 -08:00
Alexander Lobakin	45a34ca680	ice: respect metadata on XSK Rx to skb For now, if the XDP prog returns XDP_PASS on XSK, the metadata will be lost as it doesn't get copied to the skb. Copy it along with the frame headers. Account its size on skb allocation, and when copying just treat it as a part of the frame and do a pull after to "move" it to the "reserved" zone. net_prefetch() xdp->data_meta and align the copy size to speed-up memcpy() a little and better match ice_construct_skb(). Fixes: `2d4238f556` ("ice: Add support for AF_XDP") Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-01-31 09:46:05 -08:00
Alexander Lobakin	dc44572d19	ice: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb {__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD + NET_IP_ALIGN for any skb. OTOH, ice_construct_skb_zc() currently allocates and reserves additional `xdp->data - xdp->data_hard_start`, which is XDP_PACKET_HEADROOM for XSK frames. There's no need for that at all as the frame is post-XDP and will go only to the networking stack core. Pass the size of the actual data only to __napi_alloc_skb() and don't reserve anything. This will give enough headroom for stack processing. Fixes: `2d4238f556` ("ice: Add support for AF_XDP") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-01-31 09:43:35 -08:00
Alexander Lobakin	ee803dca96	ice: respect metadata in legacy-rx/ice_construct_skb() In "legacy-rx" mode represented by ice_construct_skb(), we can still use XDP (and XDP metadata), but after XDP_PASS the metadata will be lost as it doesn't get copied to the skb. Copy it along with the frame headers. Account its size on skb allocation, and when copying just treat it as a part of the frame and do a pull after to "move" it to the "reserved" zone. Point net_prefetch() to xdp->data_meta instead of data. This won't change anything when the meta is not here, but will save some cache misses otherwise. Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-01-31 09:43:35 -08:00
Alexander Lobakin	6dba29537c	i40e: respect metadata on XSK Rx to skb For now, if the XDP prog returns XDP_PASS on XSK, the metadata will be lost as it doesn't get copied to the skb. Copy it along with the frame headers. Account its size on skb allocation, and when copying just treat it as a part of the frame and do a pull after to "move" it to the "reserved" zone. net_prefetch() xdp->data_meta and align the copy size to speed-up memcpy() a little and better match i40e_construct_skb(). Fixes: `0a714186d3` ("i40e: add AF_XDP zero-copy Rx support") Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-01-31 09:42:58 -08:00
Alexander Lobakin	bc97f9c6f9	i40e: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb {__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD + NET_IP_ALIGN for any skb. OTOH, i40e_construct_skb_zc() currently allocates and reserves additional `xdp->data - xdp->data_hard_start`, which is XDP_PACKET_HEADROOM for XSK frames. There's no need for that at all as the frame is post-XDP and will go only to the networking stack core. Pass the size of the actual data only to __napi_alloc_skb() and don't reserve anything. This will give enough headroom for stack processing. Fixes: `0a714186d3` ("i40e: add AF_XDP zero-copy Rx support") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-01-31 09:37:47 -08:00
Vincenzo Frascino	ec049891b2	kselftest: Fix vdso_test_abi return status vdso_test_abi contains a batch of tests that verify the validity of the vDSO ABI. When a vDSO symbol is not found the relevant test is skipped reporting KSFT_SKIP. All the tests return values are then added in a single variable which is checked to verify failures. This approach can have side effects which result in reporting the wrong kselftest exit status. Fix vdso_test_abi verifying the return code of each test separately. Cc: Shuah Khan <shuah@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>	2022-01-31 10:35:14 -07:00
David S. Miller	b43471cc10	Merge branch 'mana-XDP-counters' Haiyang Zhang says: ==================== net: mana: Add XDP counters, reuse dropped pages Add drop, tx counters for XDP. Reuse dropped pages ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:39:59 +00:00
Haiyang Zhang	a6bf5703f1	net: mana: Reuse XDP dropped page Reuse the dropped page in RX path to save page allocation overhead. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:39:58 +00:00
Haiyang Zhang	d356abb95b	net: mana: Add counter for XDP_TX This counter will show up in ethtool stat. It is the number of packets received and forwarded by XDP program. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:39:58 +00:00
Haiyang Zhang	f90f84201e	net: mana: Add counter for packet dropped by XDP This counter will show up in ethtool stat data. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:39:58 +00:00
Benjamin Gaignard	f83a96e5f0	spi: mediatek: Avoid NULL pointer crash in interrupt In some case, like after a transfer timeout, master->cur_msg pointer is NULL which led to a kernel crash when trying to use master->cur_msg->spi. mtk_spi_can_dma(), pointed by master->can_dma, doesn't use this parameter avoid the problem by setting NULL as second parameter. Fixes: `a568231f46` ("spi: mediatek: Add spi bus for Mediatek MT8173") Signed-off-by: Benjamin Gaignard <benjamin.gaignard@collabora.com> Link: https://lore.kernel.org/r/20220131141708.888710-1-benjamin.gaignard@collabora.com Signed-off-by: Mark Brown <broonie@kernel.org>	2022-01-31 15:24:05 +00:00
David S. Miller	780bf05f44	Merge branch 'smc-improvements' Tony Lu says: ==================== net/smc: Improvements for TCP_CORK and sendfile() Currently, SMC use default implement for syscall sendfile() [1], which is wildly used in nginx and big data sences. Usually, applications use sendfile() with TCP_CORK: fstat(20, {st_mode=S_IFREG\|0644, st_size=4096, ...}) = 0 setsockopt(19, SOL_TCP, TCP_CORK, [1], 4) = 0 writev(19, [{iov_base="HTTP/1.1 200 OK\r\nServer: nginx/1"..., iov_len=240}], 1) = 240 sendfile(19, 20, [0] => [4096], 4096) = 4096 close(20) = 0 setsockopt(19, SOL_TCP, TCP_CORK, [0], 4) = 0 The above is an example of Nginx, when sendfile() on, Nginx first enables TCP_CORK, write headers, the data will not be sent. Then call sendfile(), it reads file and write to sndbuf. When TCP_CORK is cleared, all pending data is sent out. The performance of the default implement of sendfile is lower than when it is off. After investigation, it shows two parts to improve: - unnecessary lock contention of delayed work - less data per send than when sendfile off Patch #1 tries to reduce lock_sock() contention in smc_tx_work(). Patch #2 removes timed work for corking, and let applications control it. See TCP_CORK [2] MSG_MORE [3]. Patch #3 adds MSG_SENDPAGE_NOTLAST for corking more data when sendfile(). Test environments: - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4 - socket sndbuf / rcvbuf: 16384 / 131072 bytes - server: smc_run nginx - client: smc_run ./wrk -c 100 -t 2 -d 30 http://192.168.100.1:8080/4k.html - payload: 4KB local disk file Items QPS sendfile off 272477.10 sendfile on (orig) 223622.79 sendfile on (this) 395847.21 This benchmark shows +45.28% improvement compared with sendfile off, and +77.02% compared with original sendfile implement. [1] https://man7.org/linux/man-pages/man2/sendfile.2.html [2] https://linux.die.net/man/7/tcp [3] https://man7.org/linux/man-pages/man2/send.2.html ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:08:20 +00:00
Tony Lu	be9a16ccca	net/smc: Cork when sendpage with MSG_SENDPAGE_NOTLAST flag This introduces a new corked flag, MSG_SENDPAGE_NOTLAST, which is involved in syscall sendfile() [1], it indicates this is not the last page. So we can cork the data until the page is not specify this flag. It has the same effect as MSG_MORE, but existed in sendfile() only. This patch adds a option MSG_SENDPAGE_NOTLAST for corking data, try to cork more data before sending when using sendfile(), which acts like TCP's behaviour. Also, this reimplements the default sendpage to inform that it is supported to some extent. [1] https://man7.org/linux/man-pages/man2/sendfile.2.html Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:08:20 +00:00
Tony Lu	139653bc66	net/smc: Remove corked dealyed work Based on the manual of TCP_CORK [1] and MSG_MORE [2], these two options have the same effect. Applications can set these options and informs the kernel to pend the data, and send them out only when the socket or syscall does not specify this flag. In other words, there's no need to send data out by a delayed work, which will queue a lot of work. This removes corked delayed work with SMC_TX_CORK_DELAY (250ms), and the applications control how/when to send them out. It improves the performance for sendfile and throughput, and remove unnecessary race of lock_sock(). This also unlocks the limitation of sndbuf, and try to fill it up before sending. [1] https://linux.die.net/man/7/tcp [2] https://man7.org/linux/man-pages/man2/send.2.html Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:08:20 +00:00
Tony Lu	ea785a1a57	net/smc: Send directly when TCP_CORK is cleared According to the man page of TCP_CORK [1], if set, don't send out partial frames. All queued partial frames are sent when option is cleared again. When applications call setsockopt to disable TCP_CORK, this call is protected by lock_sock(), and tries to mod_delayed_work() to 0, in order to send pending data right now. However, the delayed work smc_tx_work is also protected by lock_sock(). There introduces lock contention for sending data. To fix it, send pending data directly which acts like TCP, without lock_sock() protected in the context of setsockopt (already lock_sock()ed), and cancel unnecessary dealyed work, which is protected by lock. [1] https://linux.die.net/man/7/tcp Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:08:20 +00:00
David S. Miller	01b2a99515	Merge branch 'hash-rethink' Akhmat Karakotov says: ==================== Make hash rethink configurable As it was shown in the report by Alexander Azimov, hash rethink at the client-side may lead to connection timeout toward stateful anycast services. Tom Herbert created a patchset to address this issue by applying hash rethink only after a negative routing event (3RTOs) [1]. This change also affects server-side behavior, which we found undesirable. This patchset changes defaults in a way to make them safe: hash rethink at the client-side is disabled and enabled at the server-side upon each RTO event or in case of duplicate acknowledgments. This patchset provides two options to change default behaviour. The hash rethink may be disabled at the server-side by the new sysctl option. Changes in the sysctl option don't affect default behavior at the client-side. Hash rethink can also be enabled/disabled with socket option or bpf syscalls which ovewrite both default and sysctl settings. This socket option is available on both client and server-side. This should provide mechanics to enable hash rethink inside administrative domain, such as DC, where hash rethink at the client-side can be desirable. [1] https://lore.kernel.org/netdev/20210809185314.38187-1-tom@herbertland.com/ v2: - Changed sysctl default to ENABLED in all patches. Reduced sysctl and socket option size to u8. Fixed netns bug reported by kernel test robot. v3: - Fixed bug with bad u8 comparison. Moved sk_txrehash to use less bytes in struct. Added WRITE_ONCE() in setsockopt in and READ_ONCE() in tcp_rtx_synack. v4: - Rebase and add documentation for sysctl option. v5: - Move sk_txrehash out of busy poll ifdef. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:05:25 +00:00
Akhmat Karakotov	cb6cd2cec7	tcp: Change SYN ACK retransmit behaviour to account for rehash Disabling rehash behavior did not affect SYN ACK retransmits because hash was forcefully changed bypassing the sk_rethink_hash function. This patch adds a condition which checks for rehash mode before resetting hash. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:05:25 +00:00
Akhmat Karakotov	e7b9bfd184	bpf: Add SO_TXREHASH setsockopt Add bpf socket option to override rehash behaviour from userspace or from bpf. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:05:25 +00:00
Akhmat Karakotov	2127324a7d	txhash: Add txrehash sysctl description Update Documentation/admin-guide/sysctl/net.rst with txrehash usage description. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:05:25 +00:00
Akhmat Karakotov	26859240e4	txhash: Add socket option to control TX hash rethink behavior Add the SO_TXREHASH socket option to control hash rethink behavior per socket. When default mode is set, sockets disable rehash at initialization and use sysctl option when entering listen state. setsockopt() overrides default behavior. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:05:25 +00:00
Akhmat Karakotov	e187013abe	txhash: Make rethinking txhash behavior configurable via sysctl Add a per ns sysctl that controls the txhash rethink behavior: net.core.txrehash. When enabled, the same behavior is retained, when disabled, rethink is not performed. Sysctl is enabled by default. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 15:05:24 +00:00
Gerhard Engleder	678dfd5280	selftests/net: timestamping: Fix bind_phc check timestamping checks socket options during initialisation. For the field bind_phc of the socket option SO_TIMESTAMPING it expects the value -1 if PHC is not bound. Actually the value of bind_phc is 0 if PHC is not bound. This results in the following output: SIOCSHWTSTAMP: tx_type 0 requested, got 0; rx_filter 0 requested, got 0 SO_TIMESTAMP 0 SO_TIMESTAMPNS 0 SO_TIMESTAMPING flags 0, bind phc 0 not expected, flags 0, bind phc -1 This is fixed by setting default value and expected value of bind_phc to 0. Fixes: `2214d70324` ("selftests/net: timestamping: support binding PHC") Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:44:04 +00:00
David S. Miller	116ea68dc7	Merge branch 'renesas-dead-code' Sergey Shtylyov says: ==================== Remove some dead code in the Renesas Ethernet drivers Here are 2 patches against DaveM's 'net-next.git' repo. The Renesas drivers call their ndo_stop() methods directly and they always return 0, making the result checks pointless, hence remove them... ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:42:13 +00:00
Sergey Shtylyov	e7d966f9ea	sh_eth: sh_eth_close() always returns 0 sh_eth_close() always returns 0, hence the check in sh_eth_wol_restore() is pointless (however we cannot change the prototype of sh_eth_close() as it implements the driver's ndo_stop() method). Found by Linux Verification Center (linuxtesting.org) with the SVACE static analysis tool. Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:42:13 +00:00
Sergey Shtylyov	be94a51f3e	ravb: ravb_close() always returns 0 ravb_close() always returns 0, hence the check in ravb_wol_restore() is pointless (however, we cannot change the prototype of ravb_close() as it implements the driver's ndo_stop() method). Found by Linux Verification Center (linuxtesting.org) with the SVACE static analysis tool. Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:42:13 +00:00
Wei Yongjun	cc4598cf17	net/fsl: xgmac_mdio: fix return value check in xgmac_mdio_probe() In case of error, the function devm_ioremap() returns NULL pointer not ERR_PTR(). The IS_ERR() test in the return value check should be replaced with NULL test. Fixes: `1d14eb15dc` ("net/fsl: xgmac_mdio: Use managed device resources") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:36:19 +00:00
David Ahern	47ed9442b2	ipv4: Make ip_idents_reserve static ip_idents_reserve is only used in net/ipv4/route.c. Make it static and remove the export. Signed-off-by: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:33:10 +00:00
Heiner Kallweit	d192181c2c	r8169: add rtl_disable_exit_l1() Add rtl_disable_exit_l1() for ensuring that the chip doesn't inadvertently exit ASPM L1 when being in a low-power mode. The new function is called from rtl_prepare_power_down() which has to be moved in the code to avoid a forward declaration. According to Realtek OCP register 0xc0ac shadows ERI register 0xd4 on RTL8168 versions from RTL8168g. This allows to simplify the code a little. v2: - call rtl_disable_exit_l1() also if DASH or WoL are enabled Suggested-by: Chun-Hao Lin <hau@realtek.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:32:20 +00:00
Sergey Shtylyov	73c105ad2a	phy: make phy_set_max_speed() void After following the call tree of phy_set_max_speed(), it became clear that this function never returns anything but 0, so we can change its result type to void and drop the result checks from the three drivers that actually bothered to do it... Found by Linux Verification Center (linuxtesting.org) with the SVACE static analysis tool. Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:30:56 +00:00
David S. Miller	fe8930278c	Merge branch 'dsa-mv88e6xxx-Improve-indirect-addressing-performance' Tobias Waldekranz says: ==================== net: dsa: mv88e6xxx: Improve indirect addressing performance The individual patches have all the details. This work was triggered by recent work on a platform that took 16s (sic) to load the mv88e6xxx module. The first patch gets rid of most of that time by replacing a very long delay with a tighter poll loop to wait for the busy bit to clear. The second patch shaves off some more time by avoiding redundant busy-bit-checks, saving 1 out of 4 MDIO operations for every register read/write in the optimal case. v1 -> v2: - Make sure that we always poll the busy bit at least twice, in the unlikely event that the first one is quick to query the hardware, but is then scheduled out for a long time before the timeout is checked. v2 -> v3: - Fallback to the longer sleeps after the initial two poll attempts. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:29:13 +00:00
Tobias Waldekranz	7bca16b22e	net: dsa: mv88e6xxx: Improve indirect addressing performance Before this change, both the read and write callback would start out by asserting that the chip's busy flag was cleared. However, both callbacks also made sure to wait for the clearing of the busy bit before returning - making the initial check superfluous. The only time that would ever have an effect was if the busy bit was initially set for some reason. With that in mind, make sure to perform an initial check of the busy bit, after which both read and write can rely the previous operation to have waited for the bit to clear. This cuts the number of operations on the underlying MDIO bus by 25% Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:29:12 +00:00
Tobias Waldekranz	35da1dfd94	net: dsa: mv88e6xxx: Improve performance of busy bit polling Avoid a long delay when a busy bit is still set and has to be polled again. Measurements on a system with 2 Opals (6097F) and one Agate (6352) show that even with this much tighter loop, we have about a 50% chance of the bit being cleared on the first poll, all other accesses see the bit being cleared on the second poll. On a standard MDIO bus running MDC at 2.5MHz, a single access with 32 bits of preamble plus 32 bits of data takes 64(1/2.5MHz) = 25.6us. This means that mv88e6xxx_smi_direct_wait took 26us + CPU overhead in the fast scenario, but 26us + 1500us + 26us + CPU overhead in the slow case - bringing the average close to 1ms. With this change in place, the slow case is closer to 226us + CPU overhead, with the average well below 100us - a 10x improvement. This translates to real-world winnings. On a 3-chip 20-port system, the modprobe time drops by 88%: Before: root@coronet:~# time modprobe mv88e6xxx real 0m 15.99s user 0m 0.00s sys 0m 1.52s After: root@coronet:~# time modprobe mv88e6xxx real 0m 2.21s user 0m 0.00s sys 0m 1.54s Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:29:12 +00:00
Sun Shouxin	0da8aa00bf	net: bonding: Add support for IPV6 ns/na to balance-alb/balance-tlb mode Since ipv6 neighbor solicitation and advertisement messages isn't handled gracefully in bond6 driver, we can see packet drop due to inconsistency between mac address in the option message and source MAC . Another examples is ipv6 neighbor solicitation and advertisement messages from VM via tap attached to host bridge, the src mac might be changed through balance-alb mode, but it is not synced with Link-layer address in the option message. The patch implements bond6's tx handle for ipv6 neighbor solicitation and advertisement messages. Suggested-by: Hu Yadi <huyd12@chinatelecom.cn> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Sun Shouxin <sunshouxin@chinatelecom.cn> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:24:12 +00:00
Horatiu Vultur	baf927a833	pinctrl: microchip-sgpio: Fix support for regmap Initially the driver accessed the registers using u32 __iomem but then in the blamed commit it changed it to use regmap. The problem is that now the offset of the registers is not calculated anymore at word offset but at byte offset. Therefore make sure to multiply the offset with word size. Acked-by: Steen Hegelund <Steen.Hegelund@microchip.com> Reviewed-by: Colin Foster <colin.foster@in-advantage.com> Fixes: `2afbbab45c` ("pinctrl: microchip-sgpio: update to support regmap") Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Link: https://lore.kernel.org/r/20220131085201.307031-1-horatiu.vultur@microchip.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2022-01-31 12:07:31 +01:00
Wen Gu	341adeec9a	net/smc: Forward wakeup to smc socket waitqueue after fallback When we replace TCP with SMC and a fallback occurs, there may be some socket waitqueue entries remaining in smc socket->wq, such as eppoll_entries inserted by userspace applications. After the fallback, data flows over TCP/IP and only clcsocket->wq will be woken up. Applications can't be notified by the entries which were inserted in smc socket->wq before fallback. So we need a mechanism to wake up smc socket->wq at the same time if some entries remaining in it. The current workaround is to transfer the entries from smc socket->wq to clcsock->wq during the fallback. But this may cause a crash like this: general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G E 5.16.0+ #107 RIP: 0010:__wake_up_common+0x65/0x170 Call Trace: <IRQ> __wake_up_common_lock+0x7a/0xc0 sock_def_readable+0x3c/0x70 tcp_data_queue+0x4a7/0xc40 tcp_rcv_established+0x32f/0x660 ? sk_filter_trim_cap+0xcb/0x2e0 tcp_v4_do_rcv+0x10b/0x260 tcp_v4_rcv+0xd2a/0xde0 ip_protocol_deliver_rcu+0x3b/0x1d0 ip_local_deliver_finish+0x54/0x60 ip_local_deliver+0x6a/0x110 ? tcp_v4_early_demux+0xa2/0x140 ? tcp_v4_early_demux+0x10d/0x140 ip_sublist_rcv_finish+0x49/0x60 ip_sublist_rcv+0x19d/0x230 ip_list_rcv+0x13e/0x170 __netif_receive_skb_list_core+0x1c2/0x240 netif_receive_skb_list_internal+0x1e6/0x320 napi_complete_done+0x11d/0x190 mlx5e_napi_poll+0x163/0x6b0 [mlx5_core] __napi_poll+0x3c/0x1b0 net_rx_action+0x27c/0x300 __do_softirq+0x114/0x2d2 irq_exit_rcu+0xb4/0xe0 common_interrupt+0xba/0xe0 </IRQ> <TASK> The crash is caused by privately transferring waitqueue entries from smc socket->wq to clcsock->wq. The owners of these entries, such as epoll, have no idea that the entries have been transferred to a different socket wait queue and still use original waitqueue spinlock (smc socket->wq.wait.lock) to make the entries operation exclusive, but it doesn't work. The operations to the entries, such as removing from the waitqueue (now is clcsock->wq after fallback), may cause a crash when clcsock waitqueue is being iterated over at the moment. This patch tries to fix this by no longer transferring wait queue entries privately, but introducing own implementations of clcsock's callback functions in fallback situation. The callback functions will forward the wakeup to smc socket->wq if clcsock->wq is actually woken up and smc socket->wq has remaining entries. Fixes: `2153bd1e3d` ("net/smc: Transfer remaining wait queue entries during fallback") Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-01-31 11:07:13 +00:00
Linus Torvalds	26291c54e1	Linux 5.17-rc2	2022-01-30 15:37:07 +02:00
Linus Torvalds	c5fe9de790	Merge tag 'irq_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Drop an unused private data field in the AIC driver - Various fixes to the realtek-rtl driver - Make the GICv3 ITS driver compile again in !SMP configurations - Force reset of the GICv3 ITSs at probe time to avoid issues during kexec - Yet another kfree/bitmap_free conversion - Various DT updates (Renesas, SiFive) * tag 'irq_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: dt-bindings: interrupt-controller: sifive,plic: Group interrupt tuples dt-bindings: interrupt-controller: sifive,plic: Fix number of interrupts dt-bindings: irqchip: renesas-irqc: Add R-Car V3U support irqchip/gic-v3-its: Reset each ITS's BASERn register before probe irqchip/gic-v3-its: Fix build for !SMP irqchip/loongson-pch-ms: Use bitmap_free() to free bitmap irqchip/realtek-rtl: Service all pending interrupts irqchip/realtek-rtl: Fix off-by-one in routing irqchip/realtek-rtl: Map control data to virq irqchip/apple-aic: Drop unused ipi_hwirq field	2022-01-30 15:12:02 +02:00
Linus Torvalds	27a96c4feb	Merge tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Prevent accesses to the per-CPU cgroup context list from another CPU except the one it belongs to, to avoid list corruption - Make sure parent events are always woken up to avoid indefinite hangs in the traced workload * tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix cgroup event list management perf: Always wake the parent event	2022-01-30 15:02:32 +02:00
Linus Torvalds	24f4db1f3a	Merge tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "Make sure the membarrier-rseq fence commands are part of the reported set when querying membarrier(2) commands through MEMBARRIER_CMD_QUERY" * tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/membarrier: Fix membarrier-rseq fence command missing from query bitmask	2022-01-30 13:09:00 +02:00
Linus Torvalds	a96d3a5b15	Merge tag 'x86_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Add another Intel CPU model to the list of CPUs supporting the processor inventory unique number - Allow writing to MCE thresholding sysfs files again - a previous change had accidentally disabled it and no one noticed. Goes to show how much is this stuff used * tag 'x86_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN x86/MCE/AMD: Allow thresholding interface updates after init	2022-01-30 12:55:06 +02:00
Linus Torvalds	8dd71685dc	Merge branch 'akpm' (patches from Andrew) Merge misc fixes from Andrew Morton: "12 patches. Subsystems affected by this patch series: sysctl, binfmt, ia64, mm (memory-failure, folios, kasan, and psi), selftests, and ocfs2" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: ocfs2: fix a deadlock when commit trans jbd2: export jbd2_journal_[grab\|put]_journal_head psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n psi: fix "no previous prototype" warnings when CONFIG_CGROUPS=n mm, kasan: use compare-exchange operation to set KASAN page tag kasan: test: fix compatibility with FORTIFY_SOURCE tools/testing/scatterlist: add missing defines mm: page->mapping folio->mapping should have the same offset memory-failure: fetch compound_head after pgmap_pfn_valid() ia64: make IA64_MCA_RECOVERY bool instead of tristate binfmt_misc: fix crash when load/unload module include/linux/sysctl.h: fix register_sysctl_mount_point() return type	2022-01-30 11:21:50 +02:00
Joseph Qi	ddf4b773aa	ocfs2: fix a deadlock when commit trans commit `6f1b228529` introduces a regression which can deadlock as follows: Task1: Task2: jbd2_journal_commit_transaction ocfs2_test_bg_bit_allocatable spin_lock(&jh->b_state_lock) jbd_lock_bh_journal_head __jbd2_journal_remove_checkpoint spin_lock(&jh->b_state_lock) jbd2_journal_put_journal_head jbd_lock_bh_journal_head Task1 and Task2 lock bh->b_state and jh->b_state_lock in different order, which finally result in a deadlock. So use jbd2_journal_[grab\|put]_journal_head instead in ocfs2_test_bg_bit_allocatable() to fix it. Link: https://lkml.kernel.org/r/20220121071205.100648-3-joseph.qi@linux.alibaba.com Fixes: `6f1b228529` ("ocfs2: fix race between searching chunks and release journal_head from buffer_head") Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Tested-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reported-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-30 09:56:58 +02:00
Joseph Qi	4cd1103d8c	jbd2: export jbd2_journal_[grab\|put]_journal_head Patch series "ocfs2: fix a deadlock case". This fixes a deadlock case in ocfs2. We firstly export jbd2 symbols jbd2_journal_[grab\|put]_journal_head as preparation and later use them in ocfs2 insread of jbd_[lock\|unlock]_bh_journal_head to fix the deadlock. This patch (of 2): This exports symbols jbd2_journal_[grab\|put]_journal_head, which will be used outside modules, e.g. ocfs2. Link: https://lkml.kernel.org/r/20220121071205.100648-2-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Cc: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-30 09:56:58 +02:00

... 8 9 10 11 12 ...

1073509 Commits