commit 36de10c478 upstream.
Virtual and translated addresses retrieved by the xtensa TLB sanity
checker must be consistent, i.e. correspond to the same state of the
checked TLB entry. KASAN shadow memory is mapped dynamically using
auto-refill TLB entries and thus may change TLB state between the
virtual and translated address retrieval, resulting in false TLB
insanity report.
Move read_xtlb_translation close to read_xtlb_virtual to make sure that
read values are consistent.
Cc: stable@vger.kernel.org
Fixes: a99e07ee5e ("xtensa: check TLB sanity on return to userspace")
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fe5e7ba11f upstream.
Commit 9287c6452d fixed a situation in which gfs2 could use a glock
after it had been freed. To do that, it temporarily added a new glock
reference by calling gfs2_glock_hold in function gfs2_add_revoke.
However, if the bd element was removed by gfs2_trans_remove_revoke, it
failed to drop the additional reference.
This patch adds logic to gfs2_trans_remove_revoke to properly drop the
additional glock reference.
Fixes: 9287c6452d ("gfs2: Fix occasional glock use-after-free")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f53056c430 upstream.
In gfs2_page_mkwrite's gfs2_allocate_page_backing helper, try to
allocate as many blocks at once as we need. Pass in the size of the
requested allocation.
Fixes: 35af80aef9 ("gfs2: don't use buffer_heads in gfs2_allocate_page_backing")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e64681b487 upstream.
KASAN shadow map doesn't need to be accessible through the linear kernel
mapping, allocate its pages with MEMBLOCK_ALLOC_ANYWHERE so that high
memory can be used. This frees up to ~100MB of low memory on xtensa
configurations with KASAN and high memory.
Cc: stable@vger.kernel.org # v5.1+
Fixes: f240ec09bb ("memblock: replace memblock_alloc_base(ANYWHERE) with memblock_phys_alloc")
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cc90bc6842 upstream.
This partially reverts commit e3a5d8e386.
Commit e3a5d8e386 ("check bi_size overflow before merge") adds a bio_full
check to __bio_try_merge_page. This will cause __bio_try_merge_page to fail
when the last bi_io_vec has been reached. Instead, what we want here is only
the bi_size overflow check.
Fixes: e3a5d8e386 ("block: check bi_size overflow before merge")
Cc: stable@vger.kernel.org # v5.4+
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c6a3aea935 upstream.
QOS requests for DEFAULT_VALUE are supposed to be ignored but this is
not the case for FREQ_QOS_MAX. Adding one request for MAX_DEFAULT_VALUE
and one for a real value will cause freq_qos_read_value to unexpectedly
return MAX_DEFAULT_VALUE (-1).
This happens because freq_qos max value is aggregated with PM_QOS_MIN
but FREQ_QOS_MAX_DEFAULT_VALUE is (-1) so it's smaller than other
values.
Fix this by redefining FREQ_QOS_MAX_DEFAULT_VALUE to S32_MAX.
Looking at current users for freq_qos it seems that none of them create
requests for FREQ_QOS_MAX_DEFAULT_VALUE.
Fixes: 77751a466e ("PM: QoS: Introduce frequency QoS")
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Reported-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c7e53e1c9 upstream.
The R-Car Gen2/3 manual - available at:
https://www.renesas.com/eu/en/products/microcontrollers-microprocessors/rz/rzg/rzg1m.html#documents
"RZ/G Series User's Manual: Hardware" section
strictly enforces the MACCTLR inizialization value - 39.3.1 - "Initial
Setting of PCI Express":
"Be sure to write the initial value (= H'80FF 0000) to MACCTLR before
enabling PCIETCTLR.CFINIT".
To avoid unexpected behavior and to match the SW initialization sequence
guidelines, this patch programs the MACCTLR with the correct value.
Note that the MACCTLR.SPCHG bit in the MACCTLR register description
reports that "Only writing 1 is valid and writing 0 is invalid" but this
"invalid" has to be interpreted as a write-ignore aka "ignored", not
"prohibited".
Reported-by: Eugeniu Rosca <erosca@de.adit-jv.com>
Fixes: c25da47788 ("PCI: rcar: Add Renesas R-Car PCIe driver")
Fixes: be20bbcb0a ("PCI: rcar: Add the initialization of PCIe link in resume_noirq()")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: <stable@vger.kernel.org> # v5.2+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 73884a7082 upstream.
As per PCIe r5.0, sec 7.8.5.2, fixed bus numbers of a bridge must be zero
when no function that uses EA is located behind it. Hence, if EA supplies
bus numbers of zero, assign bus numbers normally. A secondary bus can
never have a bus number of zero, so setting a bridge's Secondary Bus Number
to zero makes downstream devices unreachable.
[bhelgaas: retain bool return value so "zero is invalid" logic is local]
Fixes: 2dbce59011 ("PCI: Assign bus numbers present in EA capability for bridges")
Link: https://lore.kernel.org/r/1572850664-9861-1-git-send-email-sundeep.lkml@gmail.com
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e045fa29e8 upstream.
When a driver enables MSI-X, msix_program_entries() reads the MSI-X Vector
Control register for each vector and saves it in desc->masked. Each
register is 32 bits and bit 0 is the actual Mask bit.
When we restored these registers during resume, we previously set the Mask
bit if *any* bit in desc->masked was set instead of when the Mask bit
itself was set:
pci_restore_state
pci_restore_msi_state
__pci_restore_msix_state
for_each_pci_msi_entry
msix_mask_irq(entry, entry->masked) <-- entire u32 word
__pci_msix_desc_mask_irq(desc, flag)
mask_bits = desc->masked & ~PCI_MSIX_ENTRY_CTRL_MASKBIT
if (flag) <-- testing entire u32, not just bit 0
mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT
writel(mask_bits, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL)
This means that after resume, MSI-X vectors were masked when they shouldn't
be, which leads to timeouts like this:
nvme nvme0: I/O 978 QID 3 timeout, completion polled
On resume, set the Mask bit only when the saved Mask bit from suspend was
set.
This should remove the need for 19ea025e1d ("nvme: Add quirk for Kingston
NVME SSD running FW E8FK11.T").
[bhelgaas: commit log, move fix to __pci_msix_desc_mask_irq()]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204887
Link: https://lore.kernel.org/r/20191008034238.2503-1-jian-hong@endlessm.com
Fixes: f2440d9acb ("PCI MSI: Refactor interrupt masking code")
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f2c33ccacb upstream.
pci_pm_thaw_noirq() is supposed to return the device to D0 and restore its
configuration registers, but previously it only did that for devices whose
drivers implemented the new power management ops.
Hibernation, e.g., via "echo disk > /sys/power/state", involves freezing
devices, creating a hibernation image, thawing devices, writing the image,
and powering off. The fact that thawing did not return devices with legacy
power management to D0 caused errors, e.g., in this path:
pci_pm_thaw_noirq
if (pci_has_legacy_pm_support(pci_dev)) # true for Mellanox VF driver
return pci_legacy_resume_early(dev) # ... legacy PM skips the rest
pci_set_power_state(pci_dev, PCI_D0)
pci_restore_state(pci_dev)
pci_pm_thaw
if (pci_has_legacy_pm_support(pci_dev))
pci_legacy_resume
drv->resume
mlx4_resume
...
pci_enable_msix_range
...
if (dev->current_state != PCI_D0) # <---
return -EINVAL;
which caused these warnings:
mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode, aborting
PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95
PM: Device a6d1:00:02.0 failed to thaw: error -95
Return devices to D0 and restore config registers for all devices, not just
those whose drivers support new power management.
[bhelgaas: also call pci_restore_state() before pci_legacy_resume_early(),
update comment, add stable tag, commit log]
Link: https://lore.kernel.org/r/KU1P153MB016637CAEAD346F0AA8E3801BFAD0@KU1P153MB0166.APCP153.PROD.OUTLOOK.COM
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2ac55d5e5e upstream.
It have turned out that it's not a good idea to unconditionally do a power
cycle and then to re-initialize the SDIO card, as currently done through
mmc_hw_reset() -> mmc_sdio_hw_reset(). This because there may be multiple
SDIO func drivers probed, who also shares the same SDIO card.
To address these scenarios, one may be tempted to use a notification
mechanism, as to allow the core to inform each of the probed func drivers,
about an ongoing HW reset. However, supporting such an operation from the
func driver point of view, may not be entirely trivial.
Therefore, let's use a more simplistic approach to solve the problem, by
instead forcing the card to be removed and re-detected, via scheduling a
rescan-work. In this way, we can rely on existing infrastructure, as the
func driver's ->remove() and ->probe() callbacks, becomes invoked to deal
with the cleanup and the re-initialization.
This solution may be considered as rather heavy, especially if a func
driver doesn't share its card with other func drivers. To address this,
let's keep the current immediate HW reset option as well, but run it only
when there is one func driver probed for the card.
Finally, to allow the caller of mmc_hw_reset(), to understand if the reset
is being asynchronously managed from a scheduled work, it returns 1
(propagated from mmc_sdio_hw_reset()). If the HW reset is executed
successfully and synchronously it returns 0, which maintains the existing
behaviour.
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 99b4ddd8b7 upstream.
Upfront in mmc_rescan() we use the host->rescan_entered flag, to allow
scanning only once for non-removable cards. Therefore, it's also not
possible that we can have a corresponding card bus attached (host->bus_ops
is NULL), when we are scanning non-removable cards.
For this reason, let' drop the check for mmc_card_is_removable() as it's
redundant.
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a0d4c7eb71 upstream.
MMC IOCTLS with R1B responses may cause the card to enter the busy state,
which means it's not ready to receive a new request. To prevent new
requests from being sent to the card, use a CMD13 polling loop to verify
that the card returns to the transfer state, before completing the request.
Signed-off-by: Chaotian Jing <chaotian.jing@mediatek.com>
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f8c63edfd7 upstream.
Fix commit 7b81cb6bdd ("usb: add a HCD_DMA flag instead of
guestimating DMA capabilities") where local memory USB drivers
erroneously allocate DMA memory instead of pool memory, causing
OHCI Unrecoverable Error, disabled
HC died; cleaning up
The order between hcd_uses_dma() and hcd->localmem_pool is now
arranged as in hcd_buffer_alloc() and hcd_buffer_free(), with the
test for hcd->localmem_pool placed first.
As an alternative, one might consider adjusting hcd_uses_dma() with
static inline bool hcd_uses_dma(struct usb_hcd *hcd)
{
- return IS_ENABLED(CONFIG_HAS_DMA) && (hcd->driver->flags & HCD_DMA);
+ return IS_ENABLED(CONFIG_HAS_DMA) &&
+ (hcd->driver->flags & HCD_DMA) &&
+ (hcd->localmem_pool == NULL);
}
One can also consider unsetting HCD_DMA for local memory pool drivers.
Fixes: 7b81cb6bdd ("usb: add a HCD_DMA flag instead of guestimating DMA capabilities")
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Fredrik Noring <noring@nocrew.org>
Link: https://lore.kernel.org/r/20191210172905.GA52526@sx9
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ffac2027e1 ]
If the user has specified their own RSS hash key, don't
lose it across queue resets such as DOWN/UP, MTU change,
and number of channels change. This is fixed by moving
the key initialization to a little earlier in the lif
creation.
Also, let's clean up the RSS config a little better on
the way down by setting it all to 0.
Fixes: aa3198819b ("ionic: Add RSS support")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 86c76c0989 ]
A lockdep splat was observed when trying to remove an xdp memory
model from the table since the mutex was obtained when trying to
remove the entry, but not before the table walk started:
Fix the splat by obtaining the lock before starting the table walk.
Fixes: c3f812cea0 ("page_pool: do not release pool until inflight == 0.")
Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c3f812cea0 ]
The page pool keeps track of the number of pages in flight, and
it isn't safe to remove the pool until all pages are returned.
Disallow removing the pool until all pages are back, so the pool
is always available for page producers.
Make the page pool responsible for its own delayed destruction
instead of relying on XDP, so the page pool can be used without
the xdp memory model.
When all pages are returned, free the pool and notify xdp if the
pool is registered with the xdp memory system. Have the callback
perform a table walk since some drivers (cpsw) may share the pool
among multiple xdp_rxq_info.
Note that the increment of pages_state_release_cnt may result in
inflight == 0, resulting in the pool being released.
Fixes: d956a048cd ("xdp: force mem allocator removal and periodic warning")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3d7cadae51 ]
When setting speed to 100G via ethtool (AN is set to off), only 25G*4 is
configured while the user, who has an advanced HW which supports
extended PTYS, expects also 50G*2 to be configured.
With this patch, when extended PTYS mode is available, configure
PTYS via extended fields.
Fixes: 4b95840a6c ("net/mlx5e: Fix matching of speed to PRM link modes")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6d485e5e55 ]
Add a missing value in translation of PTYS ext_eth_proto_oper to its
corresponding speed. When ext_eth_proto_oper bit 10 is set, ethtool
shows unknown speed. With this fix, ethtool shows speed is 100G as
expected.
Fixes: a08b4ed137 ("net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 95219afbb9 ]
The act_ct TC module shares a common conntrack and NAT infrastructure
exposed via netfilter. It's possible that a packet needs both SNAT and
DNAT manipulation, due to e.g. tuple collision. Netfilter can support
this because it runs through the NAT table twice - once on ingress and
again after egress. The act_ct action doesn't have such capability.
Like netfilter hook infrastructure, we should run through NAT twice to
keep the symmetry.
Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c55d8b108c ]
Cited patch changed (channel index, tc) => (TXQ index) mapping to be a
static one, in order to keep indices consistent when changing number of
channels or TCs.
For 32 channels (OOB) and 8 TCs, real num of TXQs is 256.
When reducing the amount of channels to 8, the real num of TXQs will be
changed to 64.
This indices method is buggy:
- Channel #0, TC 3, the TXQ index is 96.
- Index 8 is not valid, as there is no such TXQ from driver perspective
(As it represents channel #8, TC 0, which is not valid with the above
configuration).
As part of driver's select queue, it calls netdev_pick_tx which returns an
index in the range of real number of TXQs. Depends on the return value,
with the examples above, driver could have returned index larger than the
real number of tx queues, or crash the kernel as it tries to read invalid
address of SQ which was not allocated.
Fix that by allocating sequential TXQ indices, and hold a new mapping
between (channel index, tc) => (real TXQ index). This mapping will be
updated as part of priv channels activation, and is used in
mlx5e_select_queue to find the selected queue index.
The existing indices mapping (channel_tc2txq) is no longer needed, as it
is used only for statistics structures and can be calculated on run time.
Delete its definintion and updates.
Fixes: 8bfaf07f78 ("net/mlx5e: Present SW stats when state is not opened")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d04ac224b1 ]
The skb_mpls_push was not updating ethertype of an ethernet packet if
the packet was originally received from a non ARPHRD_ETHER device.
In the below OVS data path flow, since the device corresponding to
port 7 is an l3 device (ARPHRD_NONE) the skb_mpls_push function does
not update the ethertype of the packet even though the previous
push_eth action had added an ethernet header to the packet.
recirc_id(0),in_port(7),eth_type(0x0800),ipv4(tos=0/0xfc,ttl=64,frag=no),
actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
push_mpls(label=13,tc=0,ttl=64,bos=1,eth_type=0x8847),4
Fixes: 8822e270d6 ("net: core: move push MPLS functionality from OvS to core helper")
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 040b5cfbce ]
The skb_mpls_pop was not updating ethertype of an ethernet packet if the
packet was originally received from a non ARPHRD_ETHER device.
In the below OVS data path flow, since the device corresponding to port 7
is an l3 device (ARPHRD_NONE) the skb_mpls_pop function does not update
the ethertype of the packet even though the previous push_eth action had
added an ethernet header to the packet.
recirc_id(0),in_port(7),eth_type(0x8847),
mpls(label=12/0xfffff,tc=0/0,ttl=0/0x0,bos=1/1),
actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00),
pop_mpls(eth_type=0x800),4
Fixes: ed246cee09 ("net: core: move pop MPLS functionality from OvS to core helper")
Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0e4940928c ]
After pskb_may_pull() we should always refetch the header
pointers from the skb->data in case it got reallocated.
In gre_parse_header(), the erspan header is still fetched
from the 'options' pointer which is fetched before
pskb_may_pull().
Found this during code review of a KMSAN bug report.
Fixes: cb73ee40b1 ("net: ip_gre: use erspan key field for tunnel lookup")
Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Acked-by: William Tu <u9012063@gmail.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8ffb055bea ]
The recent commit 5c72299fba ("net: sched: cls_flower: Classify
packets using port ranges") had added filtering based on port ranges
to tc flower. However the commit missed necessary changes in hw-offload
code, so the feature gave rise to generating incorrect offloaded flow
keys in NIC.
One more detailed example is below:
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
dst_port 100-200 action drop
With the setup above, an exact match filter with dst_port == 0 will be
installed in NIC by hw-offload. IOW, the NIC will have a rule which is
equivalent to the following one.
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
dst_port 0 action drop
The behavior was caused by the flow dissector which extracts packet
data into the flow key in the tc flower. More specifically, regardless
of exact match or specified port ranges, fl_init_dissector() set the
FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port
numbers from skb in skb_flow_dissect() called by fl_classify(). Note
that device drivers received the same struct flow_dissector object as
used in skb_flow_dissect(). Thus, offloaded drivers could not identify
which of these is used because the FLOW_DISSECTOR_KEY_PORTS flag was
set to struct flow_dissector in either case.
This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new
tp_range field in struct fl_flow_key to recognize which filters are applied
to offloaded drivers. At this point, when filters based on port ranges
passed to drivers, drivers return the EOPNOTSUPP error because they do
not support the feature (the newly created FLOW_DISSECTOR_KEY_PORTS_RANGE
flag).
Fixes: 5c72299fba ("net: sched: cls_flower: Classify packets using port ranges")
Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 25a443f74b ]
When a device is bound to a clsact qdisc, bind events are triggered to
registered drivers for both ingress and egress. However, if a driver
registers to such a device using the indirect block routines then it is
assumed that it is only interested in ingress offload and so only replays
ingress bind/unbind messages.
The NFP driver supports the offload of some egress filters when
registering to a block with qdisc of type clsact. However, on unregister,
if the block is still active, it will not receive an unbind egress
notification which can prevent proper cleanup of other registered
callbacks.
Modify the indirect block callback command in TC to send messages of
ingress and/or egress bind depending on the qdisc in use. NFP currently
supports egress offload for TC flower offload so the changes are only
added to TC.
Fixes: 4d12ba4278 ("nfp: flower: allow offloading of matches on 'internal' ports")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dbad340889 ]
With indirect blocks, a driver can register for callbacks from a device
that is does not 'own', for example, a tunnel device. When registering to
or unregistering from a new device, a callback is triggered to generate
a bind/unbind event. This, in turn, allows the driver to receive any
existing rules or to properly clean up installed rules.
When first added, it was assumed that all indirect block registrations
would be for ingress offloads. However, the NFP driver can, in some
instances, support clsact qdisc binds for egress offload.
Change the name of the indirect block callback command in flow_offload to
remove the 'ingress' identifier from it. While this does not change
functionality, a follow up patch will implement a more more generic
callback than just those currently just supporting ingress offload.
Fixes: 4d12ba4278 ("nfp: flower: allow offloading of matches on 'internal' ports")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 721c8dafad ]
Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
timestamp of the last synflood. Protect them with READ_ONCE() and
WRITE_ONCE() since reads and writes aren't serialised.
Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
introduced by a0f82f64e2 ("syncookies: remove last_synq_overflow from
struct tcp_sock"). But unprotected accesses were already there when
timestamp was stored in .last_synq_overflow.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cb44a08f86 ]
When no synflood occurs, the synflood timestamp isn't updated.
Therefore it can be so old that time_after32() can consider it to be
in the future.
That's a problem for tcp_synq_no_recent_overflow() as it may report
that a recent overflow occurred while, in fact, it's just that jiffies
has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.
Spurious detection of recent overflows lead to extra syncookie
verification in cookie_v[46]_check(). At that point, the verification
should fail and the packet dropped. But we should have dropped the
packet earlier as we didn't even send a syncookie.
Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
only if jiffies is within the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
way, no spurious recent overflow is reported when jiffies wraps and
'last_overflow' becomes in the future from the point of view of
time_after32().
However, if jiffies wraps and enters the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
'last_overflow' being a stale synflood timestamp), then
tcp_synq_no_recent_overflow() still erroneously reports an
overflow. In such cases, we have to rely on syncookie verification
to drop the packet. We unfortunately have no way to differentiate
between a fresh and a stale syncookie timestamp.
In practice, using last_overflow as lower bound is problematic.
If the synflood timestamp is concurrently updated between the time
we read jiffies and the moment we store the timestamp in
'last_overflow', then 'now' becomes smaller than 'last_overflow' and
tcp_synq_no_recent_overflow() returns true, potentially dropping a
valid syncookie.
Reading jiffies after loading the timestamp could fix the problem,
but that'd require a memory barrier. Let's just accommodate for
potential timestamp growth instead and extend the interval using
'last_overflow - HZ' as lower bound.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 04d26e7b15 ]
If no synflood happens for a long enough period of time, then the
synflood timestamp isn't refreshed and jiffies can advance so much
that time_after32() can't accurately compare them any more.
Therefore, we can end up in a situation where time_after32(now,
last_overflow + HZ) returns false, just because these two values are
too far apart. In that case, the synflood timestamp isn't updated as
it should be, which can trick tcp_synq_no_recent_overflow() into
rejecting valid syncookies.
For example, let's consider the following scenario on a system
with HZ=1000:
* The synflood timestamp is 0, either because that's the timestamp
of the last synflood or, more commonly, because we're working with
a freshly created socket.
* We receive a new SYN, which triggers synflood protection. Let's say
that this happens when jiffies == 2147484649 (that is,
'synflood timestamp' + HZ + 2^31 + 1).
* Then tcp_synq_overflow() doesn't update the synflood timestamp,
because time_after32(2147484649, 1000) returns false.
With:
- 2147484649: the value of jiffies, aka. 'now'.
- 1000: the value of 'last_overflow' + HZ.
* A bit later, we receive the ACK completing the 3WHS. But
cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
says that we're not under synflood. That's because
time_after32(2147484649, 120000) returns false.
With:
- 2147484649: the value of jiffies, aka. 'now'.
- 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.
Of course, in reality jiffies would have increased a bit, but this
condition will last for the next 119 seconds, which is far enough
to accommodate for jiffie's growth.
Fix this by updating the overflow timestamp whenever jiffies isn't
within the [last_overflow, last_overflow + HZ] range. That shouldn't
have any performance impact since the update still happens at most once
per second.
Now we're guaranteed to have fresh timestamps while under synflood, so
tcp_synq_no_recent_overflow() can safely use it with time_after32() in
such situations.
Stale timestamps can still make tcp_synq_no_recent_overflow() return
the wrong verdict when not under synflood. This will be handled in the
next patch.
For 64 bits architectures, the problem was introduced with the
conversion of ->tw_ts_recent_stamp to 32 bits integer by commit
cca9bab1b7 ("tcp: use monotonic timestamps for PAWS").
The problem has always been there on 32 bits architectures.
Fixes: cca9bab1b7 ("tcp: use monotonic timestamps for PAWS")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6c8991f415 ]
ipv6_stub uses the ip6_dst_lookup function to allow other modules to
perform IPv6 lookups. However, this function skips the XFRM layer
entirely.
All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the
ip_route_output_key and ip_route_output helpers) for their IPv4 lookups,
which calls xfrm_lookup_route(). This patch fixes this inconsistent
behavior by switching the stub to ip6_dst_lookup_flow, which also calls
xfrm_lookup_route().
This requires some changes in all the callers, as these two functions
take different arguments and have different return types.
Fixes: 5f81bd2e5d ("ipv6: export a stub for IPv6 symbols used by vxlan")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c4e85f73af ]
This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow,
as some modules currently pass a net argument without a socket to
ip6_dst_lookup. This is equivalent to commit 343d60aada ("ipv6: change
ipv6_stub_impl.ipv6_dst_lookup to take net argument").
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 73e6551699 ]
When the user changes prio2buffer mapping while global pause is
enabled, mlx5 driver incorrectly sets all active buffers
(buffer that has at least one priority mapped) to lossy.
Solution:
If global pause is enabled, set all the active buffers to lossless
in prio2buffer command.
Also, add error message when buffer size is not enough to meet
xoff threshold.
Fixes: 0696d60853 ("net/mlx5e: Receive buffer configuration")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>