When resuming, set up registers that have been lost in the sleep state.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Pull "Third Round of Renesas ARM Based SoC Fixes for v4.13" from Simon Horman:
Fix deadlock in regulator quirk for R-Car Gen 2 SoCs
The da9063/da9210 regulator quirk for R-Car Gen2 boards uses a bus
notifier, and unregisters the notifier when it is no longer needed.
However, a notifier must not be unregistered from within the call chain.
This bug went unnoticed, as blocking_notifier_chain_unregister() didn't
take the semaphore during early boot. This is no longer the case as of
upstream commit 1c3c5eab17 ("sched/core: Enable might_sleep() and
smp_processor_id() checks early") and a deadlock occurs.
* tag 'renesas-fixes3-for-v4.13' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas:
ARM: shmobile: rcar-gen2: Fix deadlock in regulator quirk
Pull "mvebu fixes for 4.13 (part 2)" from Gregory CLEMENT:
All the fixes are for ARM64 mvebu:
- Fix the RTC interrupt on A7K/A8K which was missed when switching
from GIC to ICU
- Mark the A7K/A8K crypto engine as dma coherent
- Fix the number of GPIO on south bridge on Armada 3700
* tag 'mvebu-fixes-4.13-2' of git://git.infradead.org/linux-mvebu:
ARM64: dts: marvell: armada-37xx: Fix the number of GPIO on south bridge
arm64: dts: marvell: mark the cp110 crypto engine as dma coherent
arm64: dts: marvell: use ICU for the CP110 slave RTC
Pull "Amlogic fixes for v4.13-rc" from Kevin Hilman:
- 2 minor DT fixes
* tag 'amlogic-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-amlogic:
ARM64: dts: meson-gxl-s905x-libretech-cc: fixup board definition
ARM64: dts: meson-gx: use specific compatible for the AO pwms
Pull "Rockchip dts32 fixes for 4.13" from Heiko Stübner:
Fix for the recently added mali dt support. The example
showed a wrong value, so fix it before it gets copy-pasted
to much.
* tag 'v4.13-rockchip-dts32fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
ARM: dts: rockchip: fix mali gpu node on rk3288
dt-bindings: gpu: drop wrong compatible from midgard binding example
MMIO block with tracked mmio, is introduced for the sake of performance
of searching tracked mmio. All the tracked mmio needs to get the initial
value from the HW state during vGPU being created. This patch is to
initialize the tracked registers in MMIO block with the HW state.
v2: Add "Fixes:" line for this patch (Zhenyu)
Fixes: 65f9f6febf ("drm/i915/gvt: Optimize MMIO register handling for some large MMIO blocks")
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
PAE40 confiuration in hardware extends some of the address registers
for TLB/cache ops to 2 words.
So far kernel was NOT setting the higher word if feature was not enabled
in software which is wrong. Those need to be set to 0 in such case.
Normally this would be done in the cache flush / tlb ops, however since
these registers only exist conditionally, this would have to be
conditional to a flag being set on boot which is expensive/ugly -
specially for the more common case of PAE exists but not in use.
Optimize that by zero'ing them once at boot - nobody will write to
them afterwards
Cc: stable@vger.kernel.org #4.4+
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
It is necessary to explicitly set both SLC_AUX_RGN_START1 and SLC_AUX_RGN_END1
which hold MSB bits of the physical address correspondingly of region start
and end otherwise SLC region operation is executed in unpredictable manner
Without this patch, SLC flushes on HSDK (IOC disabled) were taking
seconds.
Cc: stable@vger.kernel.org #4.4+
Reported-by: Vladimir Kondratiev <vladimir.kondratiev@intel.com>
Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
[vgupta: PAR40 regs only written if PAE40 exist]
c70c473396 "ARCv2: SLC: Make sure busy bit is set properly on SLC flushing"
fixes problem for entire SLC operation where the problem was initially
caught. But given a nature of the issue it is perfectly possible for
busy bit to be read incorrectly even when region operation was started.
So extending initial fix for regional operation as well.
Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
Cc: stable@vger.kernel.org #4.10
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Essentially remove CONFIG_ARC_PLAT_SIM
There is no need for any platform specific code, just the board DTS
match strings which we can include unconditionally
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Iommu would get page fault with following path:
vop_disable:
1, disable all windows and set vop config done
2, vop enter to standy, all windows not works, but their registers
are not clean, when you read window's enable bit, may found the
window is enable.
vop_enable:
1, memcpy(vop->regsbak, vop->regs, len)
save current vop registers to vop->regsbak, then you can found
window is enable on regsbak.
2, VOP_WIN_SET(vop, win, gate, 1);
force enable window gate, but gate and enable are on same
hardware register, then window enable bit rewrite to vop hardware.
3, vop power on, and vop might try to scan destroyed buffer,
then iommu get page fault.
Move windows disable after vop regsbak restore, then vop regsbak mechanism
would keep tracing the modify, everything would be safe.
Signed-off-by: Mark Yao <mark.yao@rock-chips.com>
Reviewed-by: Sandy huang <sandy.huang@rock-chips.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1501494582-6934-1-git-send-email-mark.yao@rock-chips.com
Willem de Bruijn says:
====================
socket sendmsg MSG_ZEROCOPY
Introduce zerocopy socket send flag MSG_ZEROCOPY. This extends the
shared page support (SKBTX_SHARED_FRAG) from sendpage to sendmsg.
Implement the feature for TCP initially, as large writes benefit
most.
On a send call with MSG_ZEROCOPY, the kernel pins user pages and
links these directly into the skbuff frags[] array.
Each send call with MSG_ZEROCOPY that transmits data will eventually
queue a completion notification on the error queue: a per-socket u32
incremented on each such call. A request may have to revert to copy
to succeed, for instance when a device cannot support scatter-gather
IO. In that case a flag is passed along to notify that the operation
succeeded without zerocopy optimization.
The implementation extends the existing zerocopy infra for tuntap,
vhost and xen with features needed for TCP, notably reference
counting to handle cloning on retransmit and GSO.
For more details, see also the netdev 2.1 paper and presentation at
https://netdevconf.org/2.1/session.html?debruijn
Changelog:
v3 -> v4:
- dropped UDP, RAW and PF_PACKET for now
Without loopback support, datagrams are usually smaller than
the ~8KB size threshold needed to benefit from zerocopy.
- style: a few reverse chrismas tree
- minor: SO_ZEROCOPY returns ENOTSUPP on unsupported protocols
- minor: squashed SO_EE_CODE_ZEROCOPY_COPIED patch
- minor: rebased on top of net-next with kmap_atomic fix
v2 -> v3:
- fix rebase conflict: SO_ZEROCOPY 59 -> 60
v1 -> v2:
- fix (kbuild-bot): do not remove uarg until patch 5
- fix (kbuild-bot): move zerocopy_sg_from_iter doc with function
- fix: remove unused extern in header file
RFCv2 -> v1:
- patch 2
- review comment: in skb_copy_ubufs, always allocate order-0
page, also when replacing compound source pages.
- patch 3
- fix: always queue completion notification on MSG_ZEROCOPY,
also if revert to copy.
- fix: on syscall abort, correctly revert notification state
- minor: skip queue notification on SOCK_DEAD
- minor: replace BUG_ON with WARN_ON in recoverable error
- patch 4
- new: add socket option SOCK_ZEROCOPY.
only honor MSG_ZEROCOPY if set, ignore for legacy apps.
- patch 5
- fix: clear zerocopy state on skb_linearize
- patch 6
- fix: only coalesce if prev errqueue elem is zerocopy
- minor: try coalescing with list tail instead of head
- minor: merge bytelen limit patch
- patch 7
- new: signal when data had to be copied
- patch 8 (tcp)
- optimize: avoid setting PSH bit when exceeding max frags.
that limits GRO on the client. do not goto new_segment.
- fix: fail on MSG_ZEROCOPY | MSG_FASTOPEN
- minor: do not wait for memory: does not work for optmem
- minor: simplify alloc
- patch 9 (udp)
- new: add PF_INET6
- fix: attach zerocopy notification even if revert to copy
- minor: simplify alloc size arithmetic
- patch 10 (raw hdrinc)
- new: add PF_INET6
- patch 11 (pf_packet)
- minor: simplify slightly
- patch 12
- new msg_zerocopy regression test: use veth pair to test
all protocols: ipv4/ipv6/packet, tcp/udp/raw, cork
all relevant ethtool settings: rx off, sg off
all relevant packet lengths: 0, <MAX_HEADER, max size
RFC -> RFCv2:
- review comment: do not loop skb with zerocopy frags onto rx:
add skb_orphan_frags_rx to orphan even refcounted frags
call this in __netif_receive_skb_core, deliver_skb and tun:
same as commit 1080e512d4 ("net: orphan frags on receive")
- fix: hold an explicit sk reference on each notification skb.
previously relied on the reference (or wmem) held by the
data skb that would trigger notification, but this breaks
on skb_orphan.
- fix: when aborting a send, do not inc the zerocopy counter
this caused gaps in the notification chain
- fix: in packet with SOCK_DGRAM, pull ll headers before calling
zerocopy_sg_from_iter
- fix: if sock_zerocopy_realloc does not allow coalescing,
do not fail, just allocate a new ubuf
- fix: in tcp, check return value of second allocation attempt
- chg: allocate notification skbs from optmem
to avoid affecting tcp write queue accounting (TSQ)
- chg: limit #locked pages (ulimit) per user instead of per process
- chg: grow notification ids from 16 to 32 bit
- pass range [lo, hi] through 32 bit fields ee_info and ee_data
- chg: rebased to davem-net-next on top of v4.10-rc7
- add: limit notification coalescing
sharing ubufs limits overhead, but delays notification until
the last packet is released, possibly unbounded. Add a cap.
- tests: add snd_zerocopy_lo pf_packet test
- tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)
Limitations / Known Issues:
- TCP may build slightly smaller than max TSO packets due to
exceeding MAX_SKB_FRAGS frags when zerocopy pages are unaligned.
- All SKBTX_SHARED_FRAG may require additional __skb_linearize or
skb_copy_ubufs calls in u32, skb_find_text, similar to
skb_checksum_help.
Notification skbuffs are allocated from optmem. For sockets that
cannot effectively coalesce notifications, the optmem max may need
to be increased to avoid hitting -ENOBUFS:
sysctl -w net.core.optmem_max=1048576
In application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction in
wall clock time on early tensorflow test workloads.
For the single-machine veth tests to succeed, loopback support has to
be temporarily enabled by making skb_orphan_frags_rx map to
skb_orphan_frags.
* Performance
The below table shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The second three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
are disabled and the kernel is booted with idle=halt.
NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size
perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF
--process cycles-- ----cpu cycles----
std zc % std zc %
4K 27,609 11,217 41 49,217 39,175 79
16K 21,370 3,823 18 43,540 29,213 67
64K 20,557 2,312 11 42,189 26,910 64
256K 21,110 2,134 10 43,006 27,104 63
1M 20,987 1,610 8 42,759 25,931 61
Perf record indicates the main source of these differences. Process
cycles only at 1M writes (perf record; perf report -n):
std:
Samples: 42K of event 'cycles', Event count (approx.): 21258597313
79.41% 33884 netperf [kernel.kallsyms] [k] copy_user_generic_string
3.27% 1396 netperf [kernel.kallsyms] [k] tcp_sendmsg
1.66% 694 netperf [kernel.kallsyms] [k] get_page_from_freelist
0.79% 325 netperf [kernel.kallsyms] [k] tcp_ack
0.43% 188 netperf [kernel.kallsyms] [k] __alloc_skb
zc:
Samples: 1K of event 'cycles', Event count (approx.): 1439509124
30.36% 584 netperf.zerocop [kernel.kallsyms] [k] gup_pte_range
14.63% 284 netperf.zerocop [kernel.kallsyms] [k] __zerocopy_sg_from_iter
8.03% 159 netperf.zerocop [kernel.kallsyms] [k] skb_zerocopy_add_frags_iter
4.84% 96 netperf.zerocop [kernel.kallsyms] [k] __alloc_skb
3.10% 60 netperf.zerocop [kernel.kallsyms] [k] kmem_cache_alloc_node
* Safety
The number of pages that can be pinned on behalf of a user with
MSG_ZEROCOPY is bound by the locked memory ulimit.
While the kernel holds process memory pinned, a process cannot safely
reuse those pages for other purposes. Packets looped onto the receive
stack and queued to a socket can be held indefinitely. Avoid unbounded
notification latency by restricting user pages to egress paths only.
skb_orphan_frags_rx() will create a private copy of pages even for
refcounted packets when these are looped, as did skb_orphan_frags for
the original tun zerocopy implementation.
Pages are not remapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.
Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional sk_filter/tc u32/.. and device +
driver logic. The effect of wrong checksums is limited to the
misbehaving process. TC filters that access contents may have to be
excluded by adding an skb_orphan_frags_rx.
Processes can also safely avoid OOM conditions by bounding the number
of bytes passed with MSG_ZEROCOPY and by removing shared pages after
transmission from their own memory map.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce regression test for msg_zerocopy feature. Send traffic from
one process to another with and without zerocopy.
Evaluate tcp, udp, raw and packet sockets, including variants
- udp: corking and corking with mixed copy/zerocopy calls
- raw: with and without hdrincl
- packet: at both raw and dgram level
Test on both ipv4 and ipv6, optionally with ethtool changes to
disable scatter-gather, tx checksum or tso offload. All of these
can affect zerocopy behavior.
The regression test can be run on a single machine if over a veth
pair. Then skb_orphan_frags_rx must be modified to be identical to
skb_orphan_frags to allow forwarding zerocopy locally.
The msg_zerocopy.sh script will setup the veth pair in network
namespaces and run all tests.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable support for MSG_ZEROCOPY to the TCP stack. TSO and GSO are
both supported. Only data sent to remote destinations is sent without
copying. Packets looped onto a local destination have their payload
copied to avoid unbounded latency.
Tested:
A 10x TCP_STREAM between two hosts showed a reduction in netserver
process cycles by up to 70%, depending on packet size. Systemwide,
savings are of course much less pronounced, at up to 20% best case.
msg_zerocopy.sh 4 tcp:
without zerocopy
tx=121792 (7600 MB) txc=0 zc=n
rx=60458 (7600 MB)
with zerocopy
tx=286257 (17863 MB) txc=286257 zc=y
rx=140022 (17863 MB)
This test opens a pair of sockets over veth, one one calls send with
64KB and optionally MSG_ZEROCOPY and on the other reads the initial
bytes. The receiver truncates, so this is strictly an upper bound on
what is achievable. It is more representative of sending data out of
a physical NIC (when payload is not touched, either).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bound the number of pages that a user may pin.
Follow the lead of perf tools to maintain a per-user bound on memory
locked pages commit 789f90fcf6 ("perf_counter: per user mlock gift")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.
TCP and corked sockets can cause send() calls to append new data to an
existing sk_buff and, thus, ubuf_info. In that case the notification
must hold a range. odify ubuf_info to store a inclusive range [N..N+m]
and add skb_zerocopy_realloc() to optionally extend an existing range.
Also coalesce notifications in this common case: if a notification
[1, 1] is about to be queued while [0, 0] is the queue tail, just modify
the head of the queue to read [0, 1].
Coalescing is limited to a few TSO frames worth of data to bound
notification latency.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.
Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy copy buffers, ref-counted or not, in this path.
The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
the SKBTX_DEV_ZEROCOPY tx_flags bit.
The changes err on the safe side, in two ways.
(1) legacy ubuf_info paths virtio and tap are not modified. They keep
a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
still call skb_copy_ubufs and thus copy frags in this case.
(2) not all copies deep in the stack are addressed yet. skb_shift,
skb_split and skb_try_coalesce can be refined to avoid copying.
These are not in the hot path and this patch is hairy enough as
is, so that is left for future refinement.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The send call ignores unknown flags. Legacy applications may already
unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
socket opts in to zerocopy.
Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
Processes can also query this socket option to detect kernel support
for the feature. Older kernels will return ENOPROTOOPT.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.
Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.
The patch does not yet modify any datapaths.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refine skb_copy_ubufs to support compound pages. With upcoming TCP
zerocopy sendmsg, such fragments may appear.
The existing code replaces each page one for one. Splitting each
compound page into an independent number of regular pages can result
in exceeding limit MAX_SKB_FRAGS if data is not exactly page aligned.
Instead, fill all destination pages but the last to PAGE_SIZE.
Split the existing alloc + copy loop into separate stages:
1. compute bytelength and minimum number of pages to store this.
2. allocate
3. copy, filling each page except the last to PAGE_SIZE bytes
4. update skb frag array
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add sock_omalloc and sock_ofree to be able to allocate control skbs,
for instance for looping errors onto sk_error_queue.
The transmit budget (sk_wmem_alloc) is involved in transmit skb
shaping, most notably in TCP Small Queues. Using this budget for
control packets would impact transmission.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the decrementer wraps again and de-asserts the decrementer
exception while hard-disabled, __check_irq_replay() has a test to
notice the wrap when interrupts are re-enabled.
The decrementer check must be done when clearing the PACA_IRQ_HARD_DIS
flag, not when the PACA_IRQ_DEC flag is tested. Previously this worked
because the decrementer interrupt was always the first one checked
after clearing the hard disable flag, but HMI check was moved ahead of
that, which introduced this bug.
This can cause a missed decrementer interrupt if we soft-disable
interrupts then take an HMI which is recorded in irq_happened, then
hard-disable interrupts for > 4s to wrap the decrementer.
Fixes: e0e0d6b739 ("powerpc/64: Replay hypervisor maintenance interrupt first")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 DD2 PMU can stop after a state-loss idle in some conditions.
A solution is to set then clear MMCRA[60] after wake from state-loss
idle. MMCRA[60] is a non-architected bit, see the user manual for
details.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Just a few small fixes for 4.13.
* 'drm-fixes-4.13' of git://people.freedesktop.org/~agd5f/linux:
drm/amdgpu: Use list_del_init in amdgpu_mn_unregister
drm/amdgpu: Fix undue fallthroughs in golden registers initialization
drm/amdgpu: fix header on gfx9 clear state
Neal Cardwell says:
====================
tcp: fix xmit timer rearming to avoid stalls
This patch series is a bug fix for a TCP loss recovery performance bug
reported independently in recent netdev threads:
(i) July 26, 2017: netdev thread "TCP fast retransmit issues"
(ii) July 26, 2017: netdev thread:
"[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
outstanding TLP retransmission"
Many thanks to Klavs Klavsen and Mao Wenan for the detailed reports,
traces, and packetdrill test cases, which enabled us to root-cause
this issue and verify the fix.
- v1 -> v2:
- In patch 2/3, changed an unclear comment in the pre-existing code
in tcp_schedule_loss_probe() to be more clear (thanks to Eric Dumazet
for suggesting we improve this).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a TCP loss recovery performance bug raised recently on the netdev
list, in two threads:
(i) July 26, 2017: netdev thread "TCP fast retransmit issues"
(ii) July 26, 2017: netdev thread:
"[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
outstanding TLP retransmission"
The basic problem is that incoming TCP packets that did not indicate
forward progress could cause the xmit timer (TLP or RTO) to be rearmed
and pushed back in time. In certain corner cases this could result in
the following problems noted in these threads:
- Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
could cause TCP to repeatedly schedule TLPs forever. We kept
sending TLPs after every ~200ms, which elicited bogus SACKs, which
caused more TLPs, ad infinitum; we never fired an RTO to fill in
the holes.
- Incoming data segments could, in some cases, cause us to reschedule
our RTO or TLP timer further out in time, for no good reason. This
could cause repeated inbound data to result in stalls in outbound
data, in the presence of packet loss.
This commit fixes these bugs by changing the TLP and RTO ACK
processing to:
(a) Only reschedule the xmit timer once per ACK.
(b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
ACK indicates sufficient forward progress (a packet was
cumulatively ACKed, or we got a SACK for a packet that was sent
before the most recent retransmit of the write queue head).
This brings us back into closer compliance with the RFCs, since, as
the comment for tcp_rearm_rto() notes, we should only restart the RTO
timer after forward progress on the connection. Previously we were
restarting the xmit timer even in these cases where there was no
forward progress.
As a side benefit, this commit simplifies and speeds up the TCP timer
arming logic. We had been calling inet_csk_reset_xmit_timer() three
times on normal ACKs that cumulatively acknowledged some data:
1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
tcp_rearm_rto(sk);
2) Once in tcp_clean_rtx_queue(), to update the RTO:
if (flag & FLAG_ACKED) {
tcp_rearm_rto(sk);
3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
to TLP:
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
tcp_schedule_loss_probe(sk);
This commit, by only rescheduling the xmit timer once per ACK,
simplifies the code and reduces CPU overhead.
This commit was tested in an A/B test with Google web server
traffic. SNMP stats and request latency metrics were within noise
levels, substantiating that for normal web traffic patterns this is a
rare issue. This commit was also tested with packetdrill tests to
verify that it fixes the timer behavior in the corner cases discussed
in the netdev threads mentioned above.
This patch is a bug fix patch intended to be queued for -stable
relases.
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Reported-by: Klavs Klavsen <kl@vsen.dk>
Reported-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Have tcp_schedule_loss_probe() base the TLP scheduling decision based
on when the RTO *should* fire. This is to enable the upcoming xmit
timer fix in this series, where tcp_schedule_loss_probe() cannot
assume that the last timer installed was an RTO timer (because we are
no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
ACK). So tcp_schedule_loss_probe() must independently figure out when
an RTO would want to fire.
In the new TLP implementation following in this series, we cannot
assume that icsk_timeout was set based on an RTO; after processing a
cumulative ACK the icsk_timeout we see can be from a previous TLP or
RTO. So we need to independently recalculate the RTO time (instead of
reading it out of icsk_timeout). Removing this dependency on the
nature of icsk_timeout makes things a little easier to reason about
anyway.
Note that the old and new code should be equivalent, since they are
both saying: "if the RTO is in the future, but at an earlier time than
the normal TLP time, then set the TLP timer to fire when the RTO would
have fired".
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pure refactor. This helper will be required in the xmit timer fix
later in the patch series. (Because the TLP logic will want to make
this calculation.)
Fixes: 6ba8a3b19e ("tcp: Tail loss probe (TLP)")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko says:
====================
mlxsw: Support for IPv6 UC router
Ido says:
This set adds support for IPv6 unicast routes offload. The first four
patches make the FIB notification chain generic so that it could be used
by address families other than IPv4. This is done by having each address
family register its callbacks with the common code, so that its FIB tables
and rules could be dumped upon registration to the chain, while ensuring
the integrity of the dump. The exact mechanics are explained in detail in
the first patch.
The next six patches build upon this work and add the necessary callbacks
in IPv6 code. This allows listeners of the chain to receive notifications
about IPv6 routes addition, deletion and replacement as well as FIB rules
notifications.
Unlike user space notifications for IPv6 multipath routes, the FIB
notification chain notifies these on a per-nexthop basis. This allows
us to keep the common code lean and is also unnecessary, as notifications
are serialized by each table's lock whereas applications maintaining
netlink caches may suffer from concurrent dumps and deletions / additions
of routes.
The next five patches audit the different code paths reading the route's
reference count (rt6i_ref) and remove assumptions regarding its meaning.
This is needed since non-FIB users need to be able to hold a reference on
the route and a non-zero reference count no longer means the route is in
the FIB.
The last six patches enable the mlxsw driver to offload IPv6 unicast
routes to the Spectrum ASIC. Without resorting to ACLs, lookup is done
solely based on the destination IP, so the abort mechanism is invoked
upon the addition of source-specific routes.
Follow-up patch sets will increase the scale of gatewayed routes by
consolidating identical nexthop groups to one adjacency entry in the
device's adjacency table (as in IPv4), as well as add support for
NH_{ADD,DEL} events which enable support for the
'ignore_routes_with_linkdown' sysctl.
Changes in v2:
* Provide offload indication for individual nexthops (David Ahern).
* Use existing route reference count instead of adding another one.
This resulted in several new patches to remove assumptions regarding
current semantics of the existing reference count (David Ahern).
* Add helpers to allow non-FIB users to take a reference on route.
* Remove use of tb6_lock in mlxsw (David Ahern).
* Add IPv6 dependency to mlxsw.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We now have all the necessary IPv6 infrastructure in place, so stop
ignoring these notifications.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without resorting to ACLs, the device performs route lookup solely based
on the destination IP address.
In case source-specific routing is needed, an error is returned and the
abort mechanism is activated, thus allowing the kernel to take over
forwarding decisions.
Instead of aborting, we can trap specific destination prefixes where
source-specific routes are present, but this will result in a lot more
code that is unlikely to ever be used.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we got a replace event, then the replaced route must exist. If
the route isn't capable of multipath, then replace first matching
non-multipath capable route.
If the route is capable of multipath and matching multipath capable
route is found, then replace it. Otherwise, replace first matching
non-multipath capable route.
The new route is inserted before the replaced one. In case the replaced
route is currently offloaded, then it's overwritten in the device's table
by the new route and later deleted, thus not impacting routed traffic.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow directly connected and remote unicast IPv6 routes to be programmed
to the device's tables.
As with IPv4, identical routes - sharing the same destination prefix -
are ordered in a FIB node according to their table ID and then the
metric. While the kernel doesn't share the same trie for the local and
main table, this does happen in the device, so ordering according to
table ID is needed.
Since individual nexthops can be added and deleted in IPv6, each FIB
entry stores a linked list of the rt6_info structs it represents. Upon
the addition or deletion of a nexthop, a new nexthop group is allocated
according to the new configuration and the old one is destroyed.
Identical groups aren't currently consolidated, but will be in a
follow-up patchset.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We only allow FIB offload in the presence of default rules or an l3mdev
rule. In a similar fashion to IPv4 FIB rules, sanitize IPv6 rules.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FIB notification block currently only handles IPv4 events, but we
want to start handling IPv6 events soon, so lay the groundwork now.
Do that by preparing the work item and process it according to the
notified address family.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to commit 1c677b3d28 ("ipv4: fib: Add fib_info_hold() helper")
and commit b423cb1080 ("ipv4: fib: Export free_fib_info()") add an
helper to hold a reference on rt6_info and export rt6_release() to drop
it and potentially release the route.
This is needed so that drivers capable of FIB offload could hold a
reference on the route before queueing it for offload and drop it after
the route has been programmed to the device's tables.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an interface is brought back up, the kernel tries to restore the
host routes tied to its permanent addresses.
However, if the host route was removed from the FIB, then we need to
reinsert it. This is done by releasing the current dst and allocating a
new, so as to not reuse a dst with obsolete values.
Since this function is called under RTNL and using the same explanation
from the previous patch, we can test if the route is in the FIB by
checking its node pointer instead of its reference count.
Tested using the following script and Andrey's reproducer mentioned
in commit 8048ced9be ("net: ipv6: regenerate host route if moved to gc
list") and linked below:
$ ip link set dev lo up
$ ip link add dummy1 type dummy
$ ip -6 address add cafe::1/64 dev dummy1
$ ip link set dev lo down # cafe::1/128 is removed
$ ip link set dev dummy1 up
$ ip link set dev lo up
The host route is correctly regenerated.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Link: http://lkml.kernel.org/r/CAAeHK+zSe82vc5gCRgr_EoUwiALPnWVdWJBPwJZBpbxYz=kGJw@mail.gmail.com
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the loopback device is brought back up we need to check if the host
route attached to the address is still in the FIB and regenerate one in
case it's not.
Host routes using the loopback device are always inserted into and
removed from the FIB under RTNL (under which this function is called),
so we can test their node pointer instead of the reference count in
order to check if the route is in the FIB or not.
Tested using the following script from Nicolas mentioned in
commit a220445f9f ("ipv6: correctly add local routes when lo goes up"):
$ ip link add dummy1 type dummy
$ ip link set dummy1 up
$ ip link set lo down ; ip link set lo up
The host route is correctly regenerated.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a route is deleted its node pointer is set to NULL to indicate it's
no longer linked to its node. Do the same for routes that are replaced.
This will later allow us to test if a route is still in the FIB by
checking its node pointer instead of its reference count.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code currently assumes that only FIB nodes can hold a reference on
routes. Therefore, after fib6_purge_rt() has run and the route is no
longer present in any intermediate nodes, it's assumed that its
reference count would be 1 - taken by the node where it's currently
stored.
However, we're going to allow users other than the FIB to take a
reference on a route, so this assumption is no longer valid and the
BUG_ON() needs to be removed.
Note that purging only takes place if the initial reference count is
different than 1. I've left that check intact, as in the majority of
systems (where routes are only referenced by the FIB), it does actually
mean the route is present in intermediate nodes.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow user space applications to see which routes are offloaded and
which aren't by setting the RTNH_F_OFFLOAD flag when dumping them.
To be consistent with IPv4, offload indication is provided on a
per-nexthop basis.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dump all the FIB tables in each net namespace upon registration to the
FIB notification chain so that the callee will have a complete view of
the tables.
The integrity of the dump is ensured by a per-table sequence counter
that is incremented (under write lock) whenever a route is added or
deleted from the table.
All the sequence counters are read (under each table's read lock) and
summed, prior and after the dump. In case the counters differ, then the
dump is either restarted or the registration fails.
While it's possible for a table to be modified after its counter has
been read, this isn't really a problem. In case it happened before it
was read the second time, then the comparison at the end will fail. If
it happened afterwards, then we're guaranteed to be notified about the
change, as the notification block is registered prior to the second
read.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow users of the FIB notification chain to receive a complete view of
the IPv6 FIB rules upon registration to the chain.
The integrity of the dump is ensured by a per-family sequence counter
that is incremented (under RTNL) whenever a rule is added or deleted.
All the sequence counters are read (under RTNL) and summed, prior and
after the dump. In case the counters differ, then the dump is either
restarted or the registration fails.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As with IPv4, allow listeners of the FIB notification chain to receive
notifications whenever a route is added, replaced or deleted. This is
done by placing calls to the FIB notification chain in the two lowest
level functions that end up performing these operations - namely,
fib6_add_rt2node() and fib6_del_route().
Unlike IPv4, APPEND notifications aren't sent as the kernel doesn't
distinguish between "append" (NLM_F_CREATE|NLM_F_APPEND) and "prepend"
(NLM_F_CREATE). If NLM_F_EXCL isn't set, duplicate routes are always
added after the existing duplicate routes.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>