Commit Graph

121010 Commits

Author SHA1 Message Date
Ingo Molnar
b01b214199 mm/swap: Use local_lock for protection
The various struct pagevec per CPU variables are protected by disabling
either preemption or interrupts across the critical sections. Inside
these sections spinlocks have to be acquired.

These spinlocks are regular spinlock_t types which are converted to
"sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
locks cannot be acquired in preemption or interrupt disabled sections.

local locks provide a trivial way to substitute preempt and interrupt
disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
to preempt_disable() and local_lock_irq() to local_irq_disable().

Create lru_rotate_pvecs containing the pagevec and the locallock.
Create lru_pvecs containing the remaining pagevecs and the locallock.
Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
exporting the pvec structure.

Change the relevant call sites to acquire these locks instead of using
preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
local_irq_save().

There is neither a functional change nor a change in the generated
binary code for non PREEMPT_RT enabled non-debug kernels.

When lockdep is enabled local locks have lockdep maps embedded. These
allow lockdep to validate the protections, i.e. inappropriate usage of a
preemption only protected sections would result in a lockdep warning
while the same problem would not be noticed with a plain
preempt_disable() based protection.

local locks also improve readability as they provide a named scope for
the protections while preempt/interrupt disable are opaque scopeless.

Finally local locks allow PREEMPT_RT to substitute them with real
locking primitives to ensure the correctness of operation in a fully
preemptible kernel.

[ bigeasy: Adopted to use local_lock ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de
2020-05-28 10:31:10 +02:00
Sebastian Andrzej Siewior
cfa6705d89 radix-tree: Use local_lock for protection
The radix-tree and idr preload mechanisms use preempt_disable() to protect
the complete operation between xxx_preload() and xxx_preload_end().

As the code inside the preempt disabled section acquires regular spinlocks,
which are converted to 'sleeping' spinlocks on a PREEMPT_RT kernel and
eventually calls into a memory allocator, this conflicts with the RT
semantics.

Convert it to a local_lock which allows RT kernels to substitute them with
a real per CPU lock. On non RT kernels this maps to preempt_disable() as
before, but provides also lockdep coverage of the critical region.
No functional change.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200527201119.1692513-3-bigeasy@linutronix.de
2020-05-28 10:31:09 +02:00
Thomas Gleixner
91710728d1 locking: Introduce local_lock()
preempt_disable() and local_irq_disable/save() are in principle per CPU big
kernel locks. This has several downsides:

  - The protection scope is unknown

  - Violation of protection rules is hard to detect by instrumentation

  - For PREEMPT_RT such sections, unless in low level critical code, can
    violate the preemptability constraints.

To address this PREEMPT_RT introduced the concept of local_locks which are
strictly per CPU.

The lock operations map to preempt_disable(), local_irq_disable/save() and
the enabling counterparts on non RT enabled kernels.

If lockdep is enabled local locks gain a lock map which tracks the usage
context. This will catch cases where an area is protected by
preempt_disable() but the access also happens from interrupt context. local
locks have identified quite a few such issues over the years, the most
recent example is:

  b7d5dc2107 ("random: add a spinlock_t to struct batched_entropy")

Aside of the lockdep coverage this also improves code readability as it
precisely annotates the protection scope.

PREEMPT_RT substitutes these local locks with 'sleeping' spinlocks to
protect such sections while maintaining preemtability and CPU locality.

local locks can replace:

  - preempt_enable()/disable() pairs
  - local_irq_disable/enable() pairs
  - local_irq_save/restore() pairs

They are also used to replace code which implicitly disables preemption
like:

  - get_cpu()/put_cpu()
  - get_cpu_var()/put_cpu_var()

with PREEMPT_RT friendly constructs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200527201119.1692513-2-bigeasy@linutronix.de
2020-05-28 10:31:09 +02:00
Takashi Iwai
6d28484026 Merge branch 'for-linus' into for-next
Back-merge 5.7-devel branch for further development.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
2020-05-28 10:08:50 +02:00
Tudor Ambarus
1ac71ec013 mtd: spi-nor: Fix SPI NOR acronym
The correct terminology is serial NOR flash or SPI NOR.
s/SPI-NOR/SPI NOR and s/spi-nor/SPI NOR across the subsystem.

Signed-off-by: Tudor Ambarus <tudor.ambarus@microchip.com>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
2020-05-28 10:46:23 +03:00
Sebastian Reichel
6c2fe5cae5 Merge tag 'tags/ib-mfd-iio-power-v5.8' into psy-next
This merges the MP2629 battery charge management immutable branch
between MFD, IIO and power-supply due for the v5.8 merge window
into power-supply for-next branch.

Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2020-05-28 09:04:26 +02:00
Ingo Molnar
0bffedbce9 Merge tag 'v5.7-rc7' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 07:58:12 +02:00
Lubomir Rintel
e787c5b725 dt-bindings: clock: Add Marvell MMP Audio Clock Controller binding
This describes the bindings for a controller that generates master and bit
clocks for the I2S interface.

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Link: https://lkml.kernel.org/r/20200519224151.2074597-13-lkundrak@v3.sk
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2020-05-27 17:55:12 -07:00
Lubomir Rintel
17d43046fd dt-bindings: marvell,mmp2: Add ids for the power domains
On MMP2 the audio and GPU blocks are on separate power islands. On MMP3
the camera block's power is also controlled separately.

Add the numbers that we could use to refer to the power domains for
respective power islands from the device tree.

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lkml.kernel.org/r/20200519224151.2074597-11-lkundrak@v3.sk
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2020-05-27 17:55:12 -07:00
Lubomir Rintel
c227df7a09 dt-bindings: marvell,mmp2: Add clock id for the Audio clock
This clocks the Audio block.

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lkml.kernel.org/r/20200519224151.2074597-5-lkundrak@v3.sk
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2020-05-27 17:55:11 -07:00
Lubomir Rintel
edcec4a869 dt-bindings: marvell,mmp2: Add clock id for the I2S clocks
There are two of these on a MMP2.

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lkml.kernel.org/r/20200519224151.2074597-4-lkundrak@v3.sk
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2020-05-27 17:55:11 -07:00
Christoph Hellwig
7a15b2e013 net: remove kernel_getsockopt
No users left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:11:33 -07:00
Jonas Falkevik
50ce4c099b sctp: fix typo sctp_ulpevent_nofity_peer_addr_change
change typo in function name "nofity" to "notify"
sctp_ulpevent_nofity_peer_addr_change ->
sctp_ulpevent_notify_peer_addr_change

Signed-off-by: Jonas Falkevik <jonas.falkevik@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:08:02 -07:00
Tariq Toukan
b3ae2459f8 net/tls: Add force_resync for driver resync
This patch adds a field to the tls rx offload context which enables
drivers to force a send_resync call.

This field can be used by drivers to request a resync at the next
possible tls record. It is beneficial for hardware that provides the
resync sequence number asynchronously. In such cases, the packet that
triggered the resync does not contain the information required for a
resync. Instead, the driver requests resync for all the following
TLS record until the asynchronous notification with the resync request
TCP sequence arrives.

A following series for mlx5e ConnectX-6DX TLS RX offload support will
use this mechanism.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:07:13 -07:00
Cong Wang
f5a7833e83 net_sched: add a tracepoint for qdisc creation
With this tracepoint, we could know when qdisc's are created,
especially those default qdisc's.

Sample output:

  tc-736   [001] ...1    56.230107: qdisc_create: dev=ens3 kind=pfifo parent=1:0
  tc-736   [001] ...1    56.230113: qdisc_create: dev=ens3 kind=hfsc parent=ffff:ffff
  tc-738   [001] ...1    56.256816: qdisc_create: dev=ens3 kind=pfifo parent=1:100
  tc-739   [001] ...1    56.267584: qdisc_create: dev=ens3 kind=pfifo parent=1:200
  tc-740   [001] ...1    56.279649: qdisc_create: dev=ens3 kind=fq_codel parent=1:100
  tc-741   [001] ...1    56.289996: qdisc_create: dev=ens3 kind=pfifo_fast parent=1:200
  tc-745   [000] .N.1   111.687483: qdisc_create: dev=ens3 kind=ingress parent=ffff:fff1

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Cong Wang
a34dac0b90 net_sched: add tracepoints for qdisc_reset() and qdisc_destroy()
Add two tracepoints for qdisc_reset() and qdisc_destroy() to track
qdisc resetting and destroying.

Sample output:

  tc-756   [000] ...3   138.355662: qdisc_reset: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-756   [000] ...1   138.355720: qdisc_reset: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-756   [000] ...1   138.355867: qdisc_reset: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-756   [000] ...1   138.355930: qdisc_destroy: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-757   [000] ...2   143.073780: qdisc_reset: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0
  tc-757   [000] ...1   143.073878: qdisc_reset: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0
  tc-757   [000] ...1   143.074114: qdisc_reset: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0
  tc-757   [000] ...1   143.074228: qdisc_destroy: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Krzysztof Wilczyński
3910ebaca8 PCI: Rename _DSM constants to align with spec
Rename PCI-related _DSM constants to align them with the PCI Firmware Spec,
r3.2, sec 4.6.  No functional change intended.

Link: https://lore.kernel.org/r/20200526213905.2479381-1-kw@linux.com
Signed-off-by: Krzysztof Wilczyński <kw@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2020-05-27 16:48:21 -05:00
Pablo Neira Ayuso
abadb2f865 netfilter: nf_tables: delete devices from flowtable
This patch allows users to delete devices from existing flowtables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
78d9f48f7f netfilter: nf_tables: add devices to existing flowtable
This patch allows users to add devices to an existing flowtable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Romain Bellan
cb8aa9a3af netfilter: ctnetlink: add kernel side filtering for dump
Conntrack dump does not support kernel side filtering (only get exists,
but it returns only one entry. And user has to give a full valid tuple)

It means that userspace has to implement filtering after receiving many
irrelevant entries, consuming resources (conntrack table is sometimes
very huge, much more than a routing table for example).

This patch adds filtering in kernel side. To achieve this goal, we:

 * Add a new CTA_FILTER netlink attributes, actually a flag list to
   parametize filtering
 * Convert some *nlattr_to_tuple() functions, to allow a partial parsing
   of CTA_TUPLE_ORIG and CTA_TUPLE_REPLY (so nf_conntrack_tuple it not
   fully set)

Filtering is now possible on:
 * IP SRC/DST values
 * Ports for TCP and UDP flows
 * IMCP(v6) codes types and IDs

Filtering is done as an "AND" operator. For example, when flags
PROTO_SRC_PORT, PROTO_NUM and IP_SRC are sets, only entries matching all
values are dumped.

Changes since v1:
  Set NLM_F_DUMP_FILTERED in nlm flags if entries are filtered

Changes since v2:
  Move several constants to nf_internals.h
  Move a fix on netlink values check in a separate patch
  Add a check on not-supported flags
  Return EOPNOTSUPP if CDA_FILTER is set in ctnetlink_flush_conntrack
  (not yet implemented)
  Code style issues

Changes since v3:
  Fix compilation warning reported by kbuild test robot

Changes since v4:
  Fix a regression introduced in v3 (returned EINVAL for valid netlink
  messages without CTA_MARK)

Changes since v5:
  Change definition of CTA_FILTER_F_ALL
  Fix a regression when CTA_TUPLE_ZONE is not set

Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Leon Romanovsky
50aec2c313 RDMA/mlx5: Return ECE data after modify QP
After users sets the ECE option, FW will return the agreed/supported bits
through an output structures of modify QP stages for regular QPs or
through create QP for the DCT.

Link: https://lore.kernel.org/r/20200526115440.205922-9-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
5f62a521ff RDMA/mlx5: Set ECE options during modify QP
The most common way to set ECE option will be during modify QP command in
INIT2RTR, RTR2RTS and RTS2RTS stages, so update mlx5 to support it.

The new bit in the comp_mask is needed to mark that kernel supports ECE
and can receive data instead of "reserved" field in the struct
mlx5_ib_modify_qp.

Link: https://lore.kernel.org/r/20200526115440.205922-8-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
f18e26af6a RDMA/mlx5: Convert modify QP to use MLX5_SET macros
Instead of hand crafted mlx5_qp_context and mlx5_qp_path use common
MLX5_SET() macros.

Link: https://lore.kernel.org/r/20200526115440.205922-7-leon@kernel.org
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
e383085c24 RDMA/mlx5: Set ECE options during QP create
Allow users to ask creation of QPs with specific ECE options.  Such early
set even before RDMA-CM connection is established is useful if user knows
exactly which option he needs.

Link: https://lore.kernel.org/r/20200526115440.205922-4-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
3e09a427ae RDMA/mlx5: Get ECE options from FW during create QP
Supported ECE options are returned from FW in the create_qp phase and zero
means that field is not valid. Such default value allows us to reuse
reserved field without worries about comp_mask.

Update create QP API to return ECE options.

Link: https://lore.kernel.org/r/20200526115440.205922-3-leon@kernel.org
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:07:49 -03:00
Leon Romanovsky
8094ba0ace RDMA/cma: Provide ECE reject reason
IBTA declares "vendor option not supported" reject reason in REJ messages
if passive side doesn't want to accept proposed ECE options.

Due to the fact that ECE is managed by userspace, there is a need to let
users to provide such rejected reason.

Link: https://lore.kernel.org/r/20200526103304.196371-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
0cb15372a6 RDMA/cma: Connect ECE to rdma_accept
The rdma_accept() is called by both passive and active sides of CMID
connection to mark readiness to start data transfer. For passive side,
this is called explicitly, for active side, it is called implicitly while
receiving REP message.

Provide ECE data to rdma_accept function needed for passive side to send
that REP message.

Link: https://lore.kernel.org/r/20200526103304.196371-6-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
a20652e175 RDMA/cm: Send and receive ECE parameter over the wire
ECE parameters are exchanged through REQ->REP/SIDR_REP messages, this
patch adds the data to provide to other side of CMID communication
channel.

Link: https://lore.kernel.org/r/20200526103304.196371-5-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
93531ee7b9 RDMA/ucma: Deliver ECE parameters through UCMA events
Passive side of CMID connection receives ECE request through REQ message
and needs to respond with relevant REP message which will be forwarded to
active side.

The UCMA events interface is responsible for such communication with the
user space (librdmacm). Extend it to provide ECE wire data.

Link: https://lore.kernel.org/r/20200526103304.196371-4-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
34e2ab57a9 RDMA/ucma: Extend ucma_connect to receive ECE parameters
Active side of CMID initiates connection through librdmacm's
rdma_connect() and kernel's ucma_connect(). Extend UCMA interface to
handle those new parameters.

Link: https://lore.kernel.org/r/20200526103304.196371-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Leon Romanovsky
fef17f91da RDMA/cm: Add Enhanced Connection Establishment (ECE) bits
Extend REQ (request for communications), REP (reply to request for
communication), rejected reason and SIDR_REP (service ID resolution
response) structures with hardware vendor ID bits according to IBTA v1.4.

Link: https://lore.kernel.org/r/20200526103304.196371-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Jason Gunthorpe
e4fdf7625b Merge branch 'mellanox/mlx5-next' into rdma.git for/next
From the mlx5-next branch at
  git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Required for dependencies in following patches

* branch 'mellanox/mlx5-next':
  net/mlx5: Add ability to read and write ECE options
  net/mlx5: Add support for RDMA TX FT headers modifying
  net/mlx5: Move iseg access helper routines close to mlx5_core driver
  net/mlx5: Cleanup mlx5_ifc_fte_match_set_misc2_bits

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:01:17 -03:00
Horatiu Vultur
20f6a05ef6 bridge: mrp: Rework the MRP netlink interface
This patch reworks the MRP netlink interface. Before, each attribute
represented a binary structure which made it hard to be extended.
Therefore update the MRP netlink interface such that each existing
attribute to be a nested attribute which contains the fields of the
binary structures.
In this way the MRP netlink interface can be extended without breaking
the backwards compatibility. It is also using strict checking for
attributes under the MRP top attribute.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 11:30:43 -07:00
Linus Torvalds
b0c3ba31be Merge tag 'fsnotify_for_v5.7-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fanotify FAN_DIR_MODIFY disabling from Jan Kara:
 "A single patch that disables FAN_DIR_MODIFY support that was merged in
  this merge window.

  When discussing further functionality we realized it may be more
  logical to guard it with a feature flag or to call things slightly
  differently (or maybe not) so let's not set the API in stone for now."

* tag 'fsnotify_for_v5.7-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: turn off support for FAN_DIR_MODIFY
2020-05-27 11:03:24 -07:00
Linus Torvalds
3301f6ae2d Merge branch 'for-5.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - Reverted stricter synchronization for cgroup recursive stats which
   was prepping it for event counter usage which never got merged. The
   change was causing performation regressions in some cases.

 - Restore bpf-based device-cgroup operation even when cgroup1 device
   cgroup is disabled.

 - An out-param init fix.

* 'for-5.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  device_cgroup: Cleanup cgroup eBPF device filter code
  xattr: fix uninitialized out-param
  Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"
2020-05-27 10:58:19 -07:00
Jason Gunthorpe
c85f4abe66 RDMA/core: Fix double destruction of uobject
Fix use after free when user user space request uobject concurrently for
the same object, within the RCU grace period.

In that case, remove_handle_idr_uobject() is called twice and we will have
an extra put on the uobject which cause use after free.  Fix it by leaving
the uobject write locked after it was removed from the idr.

Call to rdma_lookup_put_uobject with UVERBS_LOOKUP_DESTROY instead of
UVERBS_LOOKUP_WRITE will do the work.

  refcount_t: underflow; use-after-free.
  WARNING: CPU: 0 PID: 1381 at lib/refcount.c:28 refcount_warn_saturate+0xfe/0x1a0
  Kernel panic - not syncing: panic_on_warn set ...
  CPU: 0 PID: 1381 Comm: syz-executor.0 Not tainted 5.5.0-rc3 #8
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x94/0xce
   panic+0x234/0x56f
   __warn+0x1cc/0x1e1
   report_bug+0x200/0x310
   fixup_bug.part.11+0x32/0x80
   do_error_trap+0xd3/0x100
   do_invalid_op+0x31/0x40
   invalid_op+0x1e/0x30
  RIP: 0010:refcount_warn_saturate+0xfe/0x1a0
  Code: 0f 0b eb 9b e8 23 f6 6d ff 80 3d 6c d4 19 03 00 75 8d e8 15 f6 6d ff 48 c7 c7 c0 02 55 bd c6 05 57 d4 19 03 01 e8 a2 58 49 ff <0f> 0b e9 6e ff ff ff e8 f6 f5 6d ff 80 3d 42 d4 19 03 00 0f 85 5c
  RSP: 0018:ffffc90002df7b98 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: ffff88810f6a193c RCX: ffffffffba649009
  RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88811b0283cc
  RBP: 0000000000000003 R08: ffffed10236060e3 R09: ffffed10236060e3
  R10: 0000000000000001 R11: ffffed10236060e2 R12: ffff88810f6a193c
  R13: ffffc90002df7d60 R14: 0000000000000000 R15: ffff888116ae6a08
   uverbs_uobject_put+0xfd/0x140
   __uobj_perform_destroy+0x3d/0x60
   ib_uverbs_close_xrcd+0x148/0x170
   ib_uverbs_write+0xaa5/0xdf0
   __vfs_write+0x7c/0x100
   vfs_write+0x168/0x4a0
   ksys_write+0xc8/0x200
   do_syscall_64+0x9c/0x390
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x465b49
  Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007f759d122c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 000000000073bfa8 RCX: 0000000000465b49
  RDX: 000000000000000c RSI: 0000000020000080 RDI: 0000000000000003
  RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00007f759d1236bc
  R13: 00000000004ca27c R14: 000000000070de40 R15: 00000000ffffffff
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Kernel Offset: 0x39400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Fixes: 7452a3c745 ("IB/uverbs: Allow RDMA_REMOVE_DESTROY to work concurrently with disassociate")
Link: https://lore.kernel.org/r/20200527135534.482279-1-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 14:22:57 -03:00
Amir Goldstein
f17936993a fanotify: turn off support for FAN_DIR_MODIFY
FAN_DIR_MODIFY has been enabled by commit 44d705b037 ("fanotify:
report name info for FAN_DIR_MODIFY event") in 5.7-rc1. Now we are
planning further extensions to the fanotify API and during that we
realized that FAN_DIR_MODIFY may behave slightly differently to be more
consistent with extensions we plan. So until we finalize these
extensions, let's not bind our hands with exposing FAN_DIR_MODIFY to
userland.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-05-27 18:55:54 +02:00
Gustavo A. R. Silva
bebcfe85f4 RDMA/core: Use sizeof_field() helper
Make use of the sizeof_field() helper instead of an open-coded version.

Link: https://lore.kernel.org/r/20200527144152.GA22605@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 13:46:05 -03:00
Domenico Andreoli
ad1e4f74c0 PM: hibernate: Restrict writes to the resume device
Hibernation via snapshot device requires write permission to the swap
block device, the one that more often (but not necessarily) is used to
store the hibernation image.

With this patch, such permissions are granted iff:

 1) snapshot device config option is enabled
 2) swap partition is used as resume device

In other circumstances the swap device is not writable from userspace.

In order to achieve this, every write attempt to a swap device is
checked against the device configured as part of the uswsusp API [0]
using a pointer to the inode struct in memory. If the swap device being
written was not configured for resuming, the write request is denied.

NOTE: this implementation works only for swap block devices, where the
inode configured by swapon (which sets S_SWAPFILE) is the same used
by SNAPSHOT_SET_SWAP_AREA.

In case of swap file, SNAPSHOT_SET_SWAP_AREA indeed receives the inode
of the block device containing the filesystem where the swap file is
located (+ offset in it) which is never passed to swapon and then has
not set S_SWAPFILE.

As result, the swap file itself (as a file) has never an option to be
written from userspace. Instead it remains writable if accessed directly
from the containing block device, which is always writeable from root.

[0] Documentation/power/userland-swsusp.rst

v2:
 - rename is_hibernate_snapshot_dev() to is_hibernate_resume_dev()
 - fix description so to correctly refer to the resume device

Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-27 17:55:59 +02:00
Linus Walleij
ce1d966a30 Merge tag 'v5.7-rc7' into devel
Linux 5.7-rc7
2020-05-27 16:15:52 +02:00
Gustavo A. R. Silva
1ad8dd939a NFS: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-05-27 10:10:12 -04:00
Zheng Bin
d04659db7b nfs4: Remove unneeded semicolon
Fixes coccicheck warning:

include/linux/nfs4.h:298:2-3: Unneeded semicolon

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-05-27 10:08:26 -04:00
Jean-Philippe Brucker
521376741b PCI/ATS: Only enable ATS for trusted devices
Add pci_ats_supported(), which checks whether a device has an ATS
capability, and whether it is trusted.  A device is untrusted if it is
plugged into an external-facing port such as Thunderbolt and could be
spoofing an existing device to exploit weaknesses in the IOMMU
configuration.  PCIe ATS is one such weaknesses since it allows
endpoints to cache IOMMU translations and emit transactions with
'Translated' Address Type (10b) that partially bypass the IOMMU
translation.

The SMMUv3 and VT-d IOMMU drivers already disallow ATS and transactions
with 'Translated' Address Type for untrusted devices.  Add the check to
pci_enable_ats() to let other drivers (AMD IOMMU for now) benefit from
it.

By checking ats_cap, the pci_ats_supported() helper also returns whether
ATS was globally disabled with pci=noats, and could later include more
things, for example whether the whole PCIe hierarchy down to the
endpoint supports ATS.

Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20200520152201.3309416-2-jean-philippe@linaro.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-05-27 14:35:41 +02:00
Charles Keepax
f4aa5e214e ASoC: dapm: Move dai_link widgets to runtime to fix use after free
The newly added CODEC to CODEC DAI link widget pointers in
snd_soc_dai_link are better placed in snd_soc_pcm_runtime.
snd_soc_dai_link is really intended for static configuration of
the DAI, and the runtime for dynamic data.  The snd_soc_dai_link
structures are not destroyed if the card is unbound. The widgets
are cleared up on unbind, however if the card is rebound as the
snd_soc_dai_link structures are reused these pointers will be left at
their old values, causing access to freed memory.

Fixes: 595571cca4 ("ASoC: dapm: Fix regression introducing multiple copies of DAI widgets")
Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Link: https://lore.kernel.org/r/20200526161930.30759-1-ckeepax@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2020-05-27 13:24:13 +01:00
Pablo Neira Ayuso
4946ea5c12 netfilter: nf_conntrack_pptp: fix compilation warning with W=1 build
>> include/linux/netfilter/nf_conntrack_pptp.h:13:20: warning: 'const' type qualifier on return type has no effect [-Wignored-qualifiers]
extern const char *const pptp_msg_name(u_int16_t msg);
^~~~~~

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 4c559f15ef ("netfilter: nf_conntrack_pptp: prevent buffer overflows in debug code")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 13:39:08 +02:00
Konstantin Khlebnikov
b2d76adbc0 block: use __this_cpu_add() instead of access by smp_processor_id()
Most architectures have fast path to access percpu for current cpu.
The required preempt_disable() is provided by part_stat_lock().

[hch: rebased]

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Konstantin Khlebnikov
8ab1d40a64 block: remove rcu_read_lock() from part_stat_lock()
The RCU lock is required only in disk_map_sector_rcu() to lookup the
partition.  After that request holds reference to related hd_struct.

Replace get_cpu() with preempt_disable() - returned cpu index is unused.

[hch: rebased]

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Christoph Hellwig
58d4f14fc3 block: always use a percpu variable for disk stats
percpu variables have a perfectly fine working stub implementation
for UP kernels, so use that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Christoph Hellwig
e722fff238 block: remove generic_{start,end}_io_acct
Remove these now unused functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00
Christoph Hellwig
956d510ee7 block: add disk/bio-based accounting helpers
Add two new helpers to simplify I/O accounting for bio based drivers.
Currently these drivers use the generic_start_io_acct and
generic_end_io_acct helpers which have very cumbersome calling
conventions, don't actually return the time they started accounting,
and try to deal with accounting for partitions, which can't happen
for bio based drivers.  The new helpers will be used to subsequently
replace uses of the old helpers.

The main API is the bio based wrappes in blkdev.h, but for zram
which wants to account rw_page based I/O lower level routines are
provided as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-27 05:21:23 -06:00