Commit Graph

122951 Commits

Author SHA1 Message Date
Michael S. Tsirkin
cacaf775c6 virtio_config: cread/write cleanup
Use vars of the correct type instead of casting.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
452639a64a vdpa: make sure set_features is invoked for legacy
Some legacy guests just assume features are 0 after reset.
We detect that config space is accessed before features are
set and set features to 0 automatically.
Note: some legacy guests might not even access config space, if this is
reported in the field we might need to catch a kick to handle these.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
4a04cfb0eb virtio_config: disallow native type fields
Transitional devices should all use __virtioXX types (and __leXX for
fields not present in legacy devices).
Modern ones should use __leXX.
_uXX type would be a bug.
Let's prevent that.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
965b535051 virtio_scsi: correct tags for config space fields
Tag config space fields as having virtio endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
a28feb855c virtio_pmem: correct tags for config space fields
Since this is a modern-only device,
tag config space fields as having little endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
577e677a78 virtio_net: correct tags for config space fields
Tag config space fields as having virtio endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
7926895442 virtio_mem: correct tags for config space fields
Since this is a modern-only device,
tag config space fields as having little endian-ness.

TODO: check other uses of __virtioXX types in this header,
should probably be __leXX.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
0ebcffcc27 virtio_iommu: correct tags for config space fields
Since this is a modern-only device,
tag config space fields as having little endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
924b59a6df virtio_input: correct tags for config space fields
Since this is a modern-only device,
tag config space fields as having little endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
f378444b7c virtio_gpu: correct tags for config space fields
Since gpu is a modern-only device,
tag config space fields as having little endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 11:08:40 -04:00
Michael S. Tsirkin
fc4a1accbb virtio_fs: correct tags for config space fields
Since fs is a modern-only device,
tag config space fields as having little endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 11:08:39 -04:00
Michael S. Tsirkin
24bcf35b69 virtio_crypto: correct tags for config space fields
Since crypto is a modern-only device,
tag config space fields as having little endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 11:08:39 -04:00
Michael S. Tsirkin
dbe2dc8c58 virtio_console: correct tags for config space fields
Tag config space fields as having virtio endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 11:08:39 -04:00
Michael S. Tsirkin
40e04c488b virtio_blk: correct tags for config space fields
Tag config space fields as having virtio endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
2020-08-05 11:08:39 -04:00
Vinod Koul
0b5ad7b952 Merge branch 'for-linus' into fixes
Signed-off-by: Vinod Koul <vkoul@kernel.org>

 Conflicts:
	drivers/dma/idxd/sysfs.c
2020-08-05 19:02:07 +05:30
Michael S. Tsirkin
c73cb10cc4 virtio_balloon: correct tags for config space fields
Tag config space fields as having little endian-ness.
Note that balloon is special: LE even when using
the legacy interface.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 09:30:20 -04:00
Michael S. Tsirkin
cae19a6386 virtio_9p: correct tags for config space fields
Tag config space fields as having virtio endian-ness.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 09:30:19 -04:00
Michael S. Tsirkin
a4235ec06a virtio: allow __virtioXX, __leXX in config space
Currently all config space fields are of the type __uXX.
This confuses people and some drivers (notably vdpa)
access them using CPU endian-ness - which only
works well for legacy or LE platforms.

Update virtio_cread/virtio_cwrite macros to allow __virtioXX
and __leXX field types. Follow-up patches will convert
config space to use these types.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 09:30:19 -04:00
Michael S. Tsirkin
5487196878 virtio_ring: sparse warning fixup
virtio_store_mb was built with split ring in mind so it accepts
__virtio16 arguments. Packed ring uses __le16 values, so sparse
complains.  It's just a store with some barriers so let's convert it to
a macro, we don't loose too much type safety by doing that.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
2020-08-05 09:30:19 -04:00
Mohan Kumar
4106820b90 ALSA: hda: Add dma stop delay variable
A variable dma_stop_delay is added as a new member in hdac_bus
structure to avoid memory decode error incase DMA RUN bit is not
disabled in the given timeout from snd_hdac_stream_sync function and
followed by stream reset which results in memory decode error between
reset set and clear operation.

Signed-off-by: Mohan Kumar <mkumard@nvidia.com>
Link: https://lore.kernel.org/r/20200805095221.5476-3-mkumard@nvidia.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2020-08-05 12:27:47 +02:00
Codrin Ciubotariu
75820314de i2c: core: add generic I2C GPIO recovery
Multiple I2C bus drivers use similar bindings to obtain information needed
for I2C recovery. For example, for platforms using device-tree, the
properties look something like this:

&i2c {
	...
	pinctrl-names = "default", "gpio";
	pinctrl-0 = <&pinctrl_i2c_default>;
	pinctrl-1 = <&pinctrl_i2c_gpio>;
	sda-gpios = <&pio 0 GPIO_ACTIVE_HIGH>;
	scl-gpios = <&pio 1 (GPIO_ACTIVE_HIGH | GPIO_OPEN_DRAIN)>;
	...
}

For this reason, we can add this common initialization in the core. This
way, other I2C bus drivers will be able to support GPIO recovery just by
providing a pointer to platform's pinctrl and calling i2c_recover_bus()
when SDA is stuck low.

Signed-off-by: Codrin Ciubotariu <codrin.ciubotariu@microchip.com>
[wsa: inverted one logic for better readability, minor update to kdoc]
Signed-off-by: Wolfram Sang <wsa@kernel.org>
2020-08-05 11:52:27 +02:00
Christoph Hellwig
262e6ae708 modules: inherit TAINT_PROPRIETARY_MODULE
If a TAINT_PROPRIETARY_MODULE exports symbol, inherit the taint flag
for all modules importing these symbols, and don't allow loading
symbols from TAINT_PROPRIETARY_MODULE modules if the module previously
imported gplonly symbols.  Add a anti-circumvention devices so people
don't accidentally get themselves into trouble this way.

Comment from Greg:
  "Ah, the proven-to-be-illegal "GPL Condom" defense :)"

[jeyu: pr_info -> pr_err and pr_warn as per discussion]
Link: http://lore.kernel.org/r/20200730162957.GA22469@lst.de
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-08-05 10:31:28 +02:00
Linus Torvalds
2324d50d05 Merge tag 'docs-5.9' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
 "It's been a busy cycle for documentation - hopefully the busiest for a
  while to come. Changes include:

   - Some new Chinese translations

   - Progress on the battle against double words words and non-HTTPS
     URLs

   - Some block-mq documentation

   - More RST conversions from Mauro. At this point, that task is
     essentially complete, so we shouldn't see this kind of churn again
     for a while. Unless we decide to switch to asciidoc or
     something...:)

   - Lots of typo fixes, warning fixes, and more"

* tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits)
  scripts/kernel-doc: optionally treat warnings as errors
  docs: ia64: correct typo
  mailmap: add entry for <alobakin@marvell.com>
  doc/zh_CN: add cpu-load Chinese version
  Documentation/admin-guide: tainted-kernels: fix spelling mistake
  MAINTAINERS: adjust kprobes.rst entry to new location
  devices.txt: document rfkill allocation
  PCI: correct flag name
  docs: filesystems: vfs: correct flag name
  docs: filesystems: vfs: correct sync_mode flag names
  docs: path-lookup: markup fixes for emphasis
  docs: path-lookup: more markup fixes
  docs: path-lookup: fix HTML entity mojibake
  CREDITS: Replace HTTP links with HTTPS ones
  docs: process: Add an example for creating a fixes tag
  doc/zh_CN: add Chinese translation prefer section
  doc/zh_CN: add clearing-warn-once Chinese version
  doc/zh_CN: add admin-guide index
  doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label
  futex: MAINTAINERS: Re-add selftests directory
  ...
2020-08-04 22:47:54 -07:00
Linus Torvalds
a754292348 Merge tag 'printk-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:

 - Herbert Xu made printk header file self-contained.

 - Andy Shevchenko and Sergey Senozhatsky cleaned up console->setup()
   error handling.

 - Andy Shevchenko did some cleanups (e.g. sparse warning) in vsprintf
   code.

 - Minor documentation updates.

* tag 'printk-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  lib/vsprintf: Force type of flags value for gfp_t
  lib/vsprintf: Replace custom spec to print decimals with generic one
  lib/vsprintf: Replace hidden BUILD_BUG_ON() with static_assert()
  printk: Make linux/printk.h self-contained
  doc:kmsg: explicitly state the return value in case of SEEK_CUR
  Replace HTTP links with HTTPS ones: vsprintf
  hvc: unify console setup naming
  console: Fix trivia typo 'change' -> 'chance'
  console: Propagate error code from console ->setup()
  tty: hvc: Return proper error code from console ->setup() hook
  serial: sunzilog: Return proper error code from console ->setup() hook
  serial: sunsab: Return proper error code from console ->setup() hook
  mips: Return proper error code from console ->setup() hook
2020-08-04 22:22:25 -07:00
Linus Torvalds
3f0d6ecdf1 Merge tag 'core-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull generic kernel entry/exit code from Thomas Gleixner:
 "Generic implementation of common syscall, interrupt and exception
  entry/exit functionality based on the recent X86 effort to ensure
  correctness of entry/exit vs RCU and instrumentation.

  As this functionality and the required entry/exit sequences are not
  architecture specific, sharing them allows other architectures to
  benefit instead of copying the same code over and over again.

  This branch was kept standalone to allow others to work on it. The
  conversion of x86 comes in a seperate pull request which obviously is
  based on this branch"

* tag 'core-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Correct __secure_computing() stub
  entry: Correct 'noinstr' attributes
  entry: Provide infrastructure for work before transitioning to guest mode
  entry: Provide generic interrupt entry/exit code
  entry: Provide generic syscall exit function
  entry: Provide generic syscall entry functionality
  seccomp: Provide stub for __secure_computing()
2020-08-04 21:00:11 -07:00
Olga Kornievskaia
7de62bc09f SUNRPC dont update timeout value on connection reset
Current behaviour: every time a v3 operation is re-sent to the server
we update (double) the timeout. There is no distinction between whether
or not the previous timer had expired before the re-sent happened.

Here's the scenario:
1. Client sends a v3 operation
2. Server RST-s the connection (prior to the timeout) (eg., connection
is immediately reset)
3. Client re-sends a v3 operation but the timeout is now 120sec.

As a result, an application sees 2mins pause before a retry in case
server again does not reply.

Instead, this patch proposes to keep track off when the minor timeout
should happen and if it didn't, then don't update the new timeout.
Value is updated based on the previous value to make timeouts
predictable.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-08-04 23:17:11 -04:00
Siddharth Gupta
4476770881 remoteproc: Add remoteproc character device interface
Add the character device interface into remoteproc framework.
This interface can be used in order to boot/shutdown remote
subsystems and provides a basic ioctl based interface to implement
supplementary functionality. An ioctl call is implemented to enable
the shutdown on release feature which will allow remote processors to
be shutdown when the controlling userspace application crashes or hangs.

Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org>
Signed-off-by: Siddharth Gupta <sidgup@codeaurora.org>
Link: https://lore.kernel.org/r/1596044401-22083-2-git-send-email-sidgup@codeaurora.org
[bjorn: s/int32_t/s32/ per checkpatch]
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2020-08-04 20:16:37 -07:00
Linus Torvalds
442489c219 Merge tag 'timers-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Time, timers and related driver updates:

   - Prevent unnecessary timer softirq invocations by extending the
     tracking of the next expiring timer in the timer wheel beyond the
     existing NOHZ functionality.

     The tracking overhead at enqueue time is within the noise, but on
     sensitive workloads the avoidance of the soft interrupt invocation
     is a measurable improvement.

   - The obligatory new clocksource driver for Ingenic X100 OST

   - The usual fixes, improvements, cleanups and extensions for newer
     chip variants all over the driver space"

* tag 'timers-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  timers: Recalculate next timer interrupt only when necessary
  clocksource/drivers/ingenic: Add support for the Ingenic X1000 OST.
  dt-bindings: timer: Add Ingenic X1000 OST bindings.
  clocksource/drivers: Replace HTTP links with HTTPS ones
  clocksource/drivers/nomadik-mtu: Handle 32kHz clock
  clocksource/drivers/sh_cmt: Use "kHz" for kilohertz
  clocksource/drivers/imx: Add support for i.MX TPM driver with ARM64
  clocksource/drivers/ingenic: Add high resolution timer support for SMP/SMT.
  timers: Lower base clock forwarding threshold
  timers: Remove must_forward_clk
  timers: Spare timer softirq until next expiry
  timers: Expand clk forward logic beyond nohz
  timers: Reuse next expiry cache after nohz exit
  timers: Always keep track of next expiry
  timers: Optimize _next_timer_interrupt() level iteration
  timers: Add comments about calc_index() ceiling work
  timers: Move trigger_dyntick_cpu() to enqueue_timer()
  timers: Use only bucket expiry for base->next_expiry value
  timers: Preserve higher bits of expiration on index calculation
  clocksource/drivers/timer-atmel-tcb: Add sama5d2 support
  ...
2020-08-04 18:17:37 -07:00
Linus Torvalds
f8b036a7fc Merge tag 'irq-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The usual boring updates from the interrupt subsystem:

   - Infrastructure to allow building irqchip drivers as modules

   - Consolidation of irqchip ACPI probing

   - Removal of the EOI-preflow interrupt handler which was required for
     SPARC support and became obsolete after SPARC was converted to use
     sparse interrupts.

   - Cleanups, fixes and improvements all over the place"

* tag 'irq-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  irqchip/loongson-pch-pic: Fix the misused irq flow handler
  irqchip/loongson-htvec: Support 8 groups of HT vectors
  irqchip/loongson-liointc: Fix misuse of gc->mask_cache
  dt-bindings: interrupt-controller: Update Loongson HTVEC description
  irqchip/imx-intmux: Fix irqdata regs save in imx_intmux_runtime_suspend()
  irqchip/imx-intmux: Implement intmux runtime power management
  irqchip/gic-v4.1: Use GFP_ATOMIC flag in allocate_vpe_l1_table()
  irqchip: Fix IRQCHIP_PLATFORM_DRIVER_* compilation by including module.h
  irqchip/stm32-exti: Map direct event to irq parent
  irqchip/mtk-cirq: Convert to a platform driver
  irqchip/mtk-sysirq: Convert to a platform driver
  irqchip/qcom-pdc: Switch to using IRQCHIP_PLATFORM_DRIVER helper macros
  irqchip: Add IRQCHIP_PLATFORM_DRIVER_BEGIN/END and IRQCHIP_MATCH helper macros
  irqchip: irq-bcm2836.h: drop a duplicated word
  irqchip/gic-v4.1: Ensure accessing the correct RD when writing INVALLR
  irqchip/irq-bcm7038-l1: Guard uses of cpu_logical_map
  irqchip/gic-v3: Remove unused register definition
  irqchip/qcom-pdc: Allow QCOM_PDC to be loadable as a permanent module
  genirq: Export irq_chip_retrigger_hierarchy and irq_chip_set_vcpu_affinity_parent
  irqdomain: Export irq_domain_update_bus_token
  ...
2020-08-04 18:11:58 -07:00
Christoph Hellwig
f073531070 init: add an init_dup helper
Add a simple helper to grab a reference to a file and install it at
the next available fd, and switch the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-04 21:02:38 -04:00
Max Gurtovoy
a8ac78357d scsi: target: Make iscsit_register_transport() return void
This function always returns 0. We can make it return void to simplify the
code. Also, no caller ever checks the return value of this function.

Link: https://lore.kernel.org/r/20200803150008.83920-1-maxg@mellanox.com
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-08-04 20:56:56 -04:00
Linus Torvalds
2ed90dbbf7 Merge tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:

 - make support for dma_ops optional

 - move more code out of line

 - add generic support for a dma_ops bypass mode

 - misc cleanups

* tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping:
  dma-contiguous: cleanup dma_alloc_contiguous
  dma-debug: use named initializers for dir2name
  powerpc: use the generic dma_ops_bypass mode
  dma-mapping: add a dma_ops_bypass flag to struct device
  dma-mapping: make support for dma ops optional
  dma-mapping: inline the fast path dma-direct calls
  dma-mapping: move the remaining DMA API calls out of line
2020-08-04 17:29:57 -07:00
Linus Torvalds
9fa867d2ac Merge tag 'uuid-for-5.9' of git://git.infradead.org/users/hch/uuid
Pull uuid update from Christoph Hellwig:
 "Remove a now unused helper (Andy Shevchenko)"

* tag 'uuid-for-5.9' of git://git.infradead.org/users/hch/uuid:
  uuid: remove unused uuid_le_to_bin() definition
2020-08-04 17:10:11 -07:00
Linus Torvalds
4f30a60aa7 Merge tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull close_range() implementation from Christian Brauner:
 "This adds the close_range() syscall. It allows to efficiently close a
  range of file descriptors up to all file descriptors of a calling
  task.

  This is coordinated with the FreeBSD folks which have copied our
  version of this syscall and in the meantime have already merged it in
  April 2019:

    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836

  The syscall originally came up in a discussion around the new mount
  API and making new file descriptor types cloexec by default. During
  this discussion, Al suggested the close_range() syscall.

  First, it helps to close all file descriptors of an exec()ing task.
  This can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

  The code snippet above is one way of working around the problem that
  file descriptors are not cloexec by default. This is aggravated by the
  fact that we can't just switch them over without massively regressing
  userspace. For a whole class of programs having an in-kernel method of
  closing all file descriptors is very helpful (e.g. demons, service
  managers, programming language standard libraries, container managers
  etc.).

  Second, it allows userspace to avoid implementing closing all file
  descriptors by parsing through /proc/<pid>/fd/* and calling close() on
  each file descriptor and other hacks. From looking at various
  large(ish) userspace code bases this or similar patterns are very
  common in service managers, container runtimes, and programming
  language runtimes/standard libraries such as Python or Rust.

  In addition, the syscall will also work for tasks that do not have
  procfs mounted and on kernels that do not have procfs support compiled
  in. In such situations the only way to make sure that all file
  descriptors are closed is to call close() on each file descriptor up
  to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.

  Based on Linus' suggestion close_range() also comes with a new flag
  CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
  right before exec. This would usually be expressed in the sequence:

        unshare(CLONE_FILES);
        close_range(3, ~0U);

  as pointed out by Linus it might be desirable to have this be a part
  of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
  gets especially handy when we're closing all file descriptors above a
  certain threshold.

  Test-suite as always included"

* tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: add CLOSE_RANGE_UNSHARE tests
  close_range: add CLOSE_RANGE_UNSHARE
  tests: add close_range() tests
  arch: wire-up close_range()
  open: add close_range()
2020-08-04 15:12:02 -07:00
Linus Torvalds
74858abbb1 Merge tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull checkpoint-restore updates from Christian Brauner:
 "This enables unprivileged checkpoint/restore of processes.

  Given that this work has been going on for quite some time the first
  sentence in this summary is hopefully more exciting than the actual
  final code changes required. Unprivileged checkpoint/restore has seen
  a frequent increase in interest over the last two years and has thus
  been one of the main topics for the combined containers &
  checkpoint/restore microconference since at least 2018 (cf. [1]).

  Here are just the three most frequent use-cases that were brought forward:

   - The JVM developers are integrating checkpoint/restore into a Java
     VM to significantly decrease the startup time.

   - In high-performance computing environment a resource manager will
     typically be distributing jobs where users are always running as
     non-root. Long-running and "large" processes with significant
     startup times are supposed to be checkpointed and restored with
     CRIU.

   - Container migration as a non-root user.

  In all of these scenarios it is either desirable or required to run
  without CAP_SYS_ADMIN. The userspace implementation of
  checkpoint/restore CRIU already has the pull request for supporting
  unprivileged checkpoint/restore up (cf. [2]).

  To enable unprivileged checkpoint/restore a new dedicated capability
  CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
  discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
  "Update on Task Migration at Google Using CRIU") with Adrian and
  Nicolas providing the implementation now over the last months. In
  essence, this allows the CRIU binary to be installed with the
  CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
  unprivileged users to restore processes.

  To make this possible the following permissions are altered:

   - Selecting a specific PID via clone3() set_tid relaxed from userns
     CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
     from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Accessing /proc/pid/map_files relaxed from init userns
     CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.

   - Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
     CAP_CHECKPOINT_RESTORE.

  Of these four changes the /proc/self/exe change deserves a few words
  because the reasoning behind even restricting /proc/self/exe changes
  in the first place is just full of historical quirks and tracking this
  down was a questionable version of fun that I'd like to spare others.

  In short, it is trivial to change /proc/self/exe as an unprivileged
  user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
  or by simply intercepting the elf loader in userspace during exec.
  Nicolas was nice enough to even provide a POC for the latter (cf. [3])
  to illustrate this fact.

  The original patchset which introduced PR_SET_MM_MAP had no
  permissions around changing the exe link. They too argued that it is
  trivial to spoof the exe link already which is true. The argument
  brought up against this was that the Tomoyo LSM uses the exe link in
  tomoyo_manager() to detect whether the calling process is a policy
  manager. This caused changing the exe links to be guarded by userns
  CAP_SYS_ADMIN.

  All in all this rather seems like a "better guard it with something
  rather than nothing" argument which imho doesn't qualify as a great
  security policy. Again, because spoofing the exe link is possible for
  the calling process so even if this were security relevant it was
  broken back then and would be broken today. So technically, dropping
  all permissions around changing the exe link would probably be
  possible and would send a clearer message to any userspace that relies
  on /proc/self/exe for security reasons that they should stop doing
  this but for now we're only relaxing the exe link permissions from
  userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

  There's a final uapi change in here. Changing the exe link used to
  accidently return EINVAL when the caller lacked the necessary
  permissions instead of the more correct EPERM. This pr contains a
  commit fixing this. I assume that userspace won't notice or care and
  if they do I will revert this commit. But since we are changing the
  permissions anyway it seems like a good opportunity to try this fix.

  With these changes merged unprivileged checkpoint/restore will be
  possible and has already been tested by various users"

[1] LPC 2018
     1. "Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
     2. "Securely Migrating Untrusted Workloads with CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
     LPC 2019
     1. "CRIU and the PID dance"
         https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
     2. "Update on Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s

[2] https://github.com/checkpoint-restore/criu/pull/1155

[3] https://github.com/nviennot/run_as_exe

* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  prctl: exe link permission error changed from -EINVAL to -EPERM
  prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
  proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  pid: use checkpoint_restore_ns_capable() for set_tid
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
2020-08-04 15:02:07 -07:00
Linus Torvalds
9ba27414f2 Merge tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull fork cleanups from Christian Brauner:
 "This is cleanup series from when we reworked a chunk of the process
  creation paths in the kernel and switched to struct
  {kernel_}clone_args.

  High-level this does two main things:

   - Remove the double export of both do_fork() and _do_fork() where
     do_fork() used the incosistent legacy clone calling convention.

     Now we only export _do_fork() which is based on struct
     kernel_clone_args.

   - Remove the copy_thread_tls()/copy_thread() split making the
     architecture specific HAVE_COYP_THREAD_TLS config option obsolete.

  This switches all remaining architectures to select
  HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling
  convention. The current split makes the process creation codepaths
  more convoluted than they need to be. Each architecture has their own
  copy_thread() function unless it selects HAVE_COPY_THREAD_TLS then it
  has a copy_thread_tls() function.

  The split is not needed anymore nowadays, all architectures support
  CLONE_SETTLS but quite a few of them never bothered to select
  HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread()
  and use the old calling convention. Removing this split cleans up the
  process creation codepaths and paves the way for implementing clone3()
  on such architectures since it requires the copy_thread_tls() calling
  convention.

  After having made each architectures support copy_thread_tls() this
  series simply renames that function back to copy_thread(). It also
  switches all architectures that call do_fork() directly over to
  _do_fork() and the struct kernel_clone_args calling convention. This
  is a corollary of switching the architectures that did not yet support
  it over to copy_thread_tls() since do_fork() is conditional on not
  supporting copy_thread_tls() (Mostly because it lacks a separate
  argument for tls which is trivial to fix but there's no need for this
  function to exist.).

  The do_fork() removal is in itself already useful as it allows to to
  remove the export of both do_fork() and _do_fork() we currently have
  in favor of only _do_fork(). This has already been discussed back when
  we added clone3(). The legacy clone() calling convention is - as is
  probably well-known - somewhat odd:

    #
    # ABI hall of shame
    #
    config CLONE_BACKWARDS
    config CLONE_BACKWARDS2
    config CLONE_BACKWARDS3

  that is aggravated by the fact that some architectures such as sparc
  follow the CLONE_BACKWARDSx calling convention but don't really select
  the corresponding config option since they call do_fork() directly.

  So do_fork() enforces a somewhat arbitrary calling convention in the
  first place that doesn't really help the individual architectures that
  deviate from it. They can thus simply be switched to _do_fork()
  enforcing a single calling convention. (I really hope that any new
  architectures will __not__ try to implement their own calling
  conventions...)

  Most architectures already have made a similar switch (m68k comes to
  mind).

  Overall this removes more code than it adds even with a good portion
  of added comments. It simplifies a chunk of arch specific assembly
  either by moving the code into C or by simply rewriting the assembly.

  Architectures that have been touched in non-trivial ways have all been
  actually boot and stress tested: sparc and ia64 have been tested with
  Debian 9 images. They are the two architectures which have been
  touched the most. All non-trivial changes to architectures have seen
  acks from the relevant maintainers. nios2 with a custom built
  buildroot image. h8300 I couldn't get something bootable to test on
  but the changes have been fairly automatic and I'm sure we'll hear
  people yell if I broke something there.

  All other architectures that have been touched in trivial ways have
  been compile tested for each single patch of the series via git rebase
  -x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested
  even though they have just been trivially touched (removal of the
  HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are
  basically "core architectures" and since it is trivial to get your
  hands on a useable image"

* tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: rename copy_thread_tls() back to copy_thread()
  arch: remove HAVE_COPY_THREAD_TLS
  unicore: switch to copy_thread_tls()
  sh: switch to copy_thread_tls()
  nds32: switch to copy_thread_tls()
  microblaze: switch to copy_thread_tls()
  hexagon: switch to copy_thread_tls()
  c6x: switch to copy_thread_tls()
  alpha: switch to copy_thread_tls()
  fork: remove do_fork()
  h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  sparc: unconditionally enable HAVE_COPY_THREAD_TLS
  sparc: share process creation helpers between sparc and sparc64
  sparc64: enable HAVE_COPY_THREAD_TLS
  fork: fold legacy_clone_args_valid() into _do_fork()
2020-08-04 14:47:45 -07:00
Linus Torvalds
0a72761b27 Merge tag 'threads-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull thread updates from Christian Brauner:
 "This contains the changes to add the missing support for attaching to
  time namespaces via pidfds.

  Last cycle setns() was changed to support attaching to multiple
  namespaces atomically. This requires all namespaces to have a point of
  no return where they can't fail anymore.

  Specifically, <namespace-type>_install() is allowed to perform
  permission checks and install the namespace into the new struct nsset
  that it has been given but it is not allowed to make visible changes
  to the affected task. Once <namespace-type>_install() returns,
  anything that the given namespace type additionally requires to be
  setup needs to ideally be done in a function that can't fail or if it
  fails the failure must be non-fatal.

  For time namespaces the relevant functions that fell into this
  category were timens_set_vvar_page() and vdso_join_timens(). The
  latter could still fail although it didn't need to. This function is
  only implemented for vdso_join_timens() in current mainline. As
  discussed on-list (cf. [1]), in order to make setns() support time
  namespaces when attaching to multiple namespaces at once properly we
  changed vdso_join_timens() to always succeed. So vdso_join_timens()
  replaces the mmap_write_lock_killable() with mmap_read_lock().

  Please note that arm is about to grow vdso support for time namespaces
  (possibly this merge window). We've synced on this change and arm64
  also uses mmap_read_lock(), i.e. makes vdso_join_timens() a function
  that can't fail. Once the changes here and the arm64 changes have
  landed, vdso_join_timens() should be turned into a void function so
  it's obvious to callers and implementers on other architectures that
  the expectation is that it can't fail.

  We didn't do this right away because it would've introduced
  unnecessary merge conflicts between the two trees for no major gain.

  As always, tests included"

[1]: https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein

* tag 'threads-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: add CLONE_NEWTIME setns tests
  nsproxy: support CLONE_NEWTIME with setns()
  timens: add timens_commit() helper
  timens: make vdso_join_timens() always succeed
2020-08-04 14:40:07 -07:00
Linus Torvalds
3950e97543 Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "During the development of v5.7 I ran into bugs and quality of
  implementation issues related to exec that could not be easily fixed
  because of the way exec is implemented. So I have been diggin into
  exec and cleaning up what I can.

  This cycle I have been looking at different ideas and different
  implementations to see what is possible to improve exec, and cleaning
  the way exec interfaces with in kernel users. Only cleaning up the
  interfaces of exec with rest of the kernel has managed to stabalize
  and make it through review in time for v5.9-rc1 resulting in 2 sets of
  changes this cycle.

   - Implement kernel_execve

   - Make the user mode driver code a better citizen

  With kernel_execve the code size got a little larger as the copying of
  parameters from userspace and copying of parameters from userspace is
  now separate. The good news is kernel threads no longer need to play
  games with set_fs to use exec. Which when combined with the rest of
  Christophs set_fs changes should security bugs with set_fs much more
  difficult"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
  exec: Implement kernel_execve
  exec: Factor bprm_stack_limits out of prepare_arg_pages
  exec: Factor bprm_execve out of do_execve_common
  exec: Move bprm_mm_init into alloc_bprm
  exec: Move initialization of bprm->filename into alloc_bprm
  exec: Factor out alloc_bprm
  exec: Remove unnecessary spaces from binfmts.h
  umd: Stop using split_argv
  umd: Remove exit_umh
  bpfilter: Take advantage of the facilities of struct pid
  exit: Factor thread_group_exited out of pidfd_poll
  umd: Track user space drivers with struct pid
  bpfilter: Move bpfilter_umh back into init data
  exec: Remove do_execve_file
  umh: Stop calling do_execve_file
  umd: Transform fork_usermode_blob into fork_usermode_driver
  umd: Rename umd_info.cmdline umd_info.driver_name
  umd: For clarity rename umh_info umd_info
  umh: Separate the user mode driver and the user mode helper support
  umh: Remove call_usermodehelper_setup_file.
  ...
2020-08-04 14:27:25 -07:00
Linus Torvalds
fd76a74d94 Merge tag 'audit-pr-20200803' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
 "Aside from some smaller bug fixes, here are the highlights:

   - add a new backlog wait metric to the audit status message, this is
     intended to help admins determine how long processes have been
     waiting for the audit backlog queue to clear

   - generate audit records for nftables configuration changes

   - generate CWD audit records for for the relevant LSM audit records"

* tag 'audit-pr-20200803' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: report audit wait metric in audit status reply
  audit: purge audit_log_string from the intra-kernel audit API
  audit: issue CWD record to accompany LSM_AUDIT_DATA_* records
  audit: use the proper gfp flags in the audit_log_nfcfg() calls
  audit: remove unused !CONFIG_AUDITSYSCALL __audit_inode* stubs
  audit: add gfp parameter to audit_log_nfcfg
  audit: log nftables configuration change events
  audit: Use struct_size() helper in alloc_chunk
2020-08-04 14:20:26 -07:00
Linus Torvalds
9ecc6ea491 Merge tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
 "There are a bunch of clean ups and selftest improvements along with
  two major updates to the SECCOMP_RET_USER_NOTIF filter return:
  EPOLLHUP support to more easily detect the death of a monitored
  process, and being able to inject fds when intercepting syscalls that
  expect an fd-opening side-effect (needed by both container folks and
  Chrome). The latter continued the refactoring of __scm_install_fd()
  started by Christoph, and in the process found and fixed a handful of
  bugs in various callers.

   - Improved selftest coverage, timeouts, and reporting

   - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)

   - Refactor __scm_install_fd() into __receive_fd() and fix buggy
     callers

   - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
     Dhillon)"

* tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
  selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
  seccomp: Introduce addfd ioctl to seccomp user notifier
  fs: Expand __receive_fd() to accept existing fd
  pidfd: Replace open-coded receive_fd()
  fs: Add receive_fd() wrapper for __receive_fd()
  fs: Move __scm_install_fd() to __receive_fd()
  net/scm: Regularize compat handling of scm_detach_fds()
  pidfd: Add missing sock updates for pidfd_getfd()
  net/compat: Add missing sock updates for SCM_RIGHTS
  selftests/seccomp: Check ENOSYS under tracing
  selftests/seccomp: Refactor to use fixture variants
  selftests/harness: Clean up kern-doc for fixtures
  seccomp: Use -1 marker for end of mode 1 syscall list
  seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
  selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
  selftests/seccomp: Make kcmp() less required
  seccomp: Use pr_fmt
  selftests/seccomp: Improve calibration loop
  selftests/seccomp: use 90s as timeout
  selftests/seccomp: Expand benchmark to per-filter measurements
  ...
2020-08-04 14:11:08 -07:00
Linus Torvalds
99ea1521a0 Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull uninitialized_var() macro removal from Kees Cook:
 "This is long overdue, and has hidden too many bugs over the years. The
  series has several "by hand" fixes, and then a trivial treewide
  replacement.

   - Clean up non-trivial uses of uninitialized_var()

   - Update documentation and checkpatch for uninitialized_var() removal

   - Treewide removal of uninitialized_var()"

* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  compiler: Remove uninitialized_var() macro
  treewide: Remove uninitialized_var() usage
  checkpatch: Remove awareness of uninitialized_var() macro
  mm/debug_vm_pgtable: Remove uninitialized_var() usage
  f2fs: Eliminate usage of uninitialized_var() macro
  media: sur40: Remove uninitialized_var() usage
  KVM: PPC: Book3S PR: Remove uninitialized_var() usage
  clk: spear: Remove uninitialized_var() usage
  clk: st: Remove uninitialized_var() usage
  spi: davinci: Remove uninitialized_var() usage
  ide: Remove uninitialized_var() usage
  rtlwifi: rtl8192cu: Remove uninitialized_var() usage
  b43: Remove uninitialized_var() usage
  drbd: Remove uninitialized_var() usage
  x86/mm/numa: Remove uninitialized_var() usage
  docs: deprecated.rst: Add uninitialized_var()
2020-08-04 13:49:43 -07:00
Linus Torvalds
427714f258 Merge tag 'tasklets-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull tasklets API update from Kees Cook:
 "These are the infrastructure updates needed to support converting the
  tasklet API to something more modern (and hopefully for removal
  further down the road).

  There is a 300-patch series waiting in the wings to get set out to
  subsystem maintainers, but these changes need to be present in the
  kernel first. Since this has some treewide changes, I carried this
  series for -next instead of paining Thomas with it in -tip, but it's
  got his Ack.

  This is similar to the timer_struct modernization from a while back,
  but not nearly as messy (I hope). :)

   - Prepare for tasklet API modernization (Romain Perier, Allen Pais,
     Kees Cook)"

* tag 'tasklets-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  tasklet: Introduce new initialization API
  treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD()
  usb: gadget: udc: Avoid tasklet passing a global
2020-08-04 13:40:35 -07:00
David S. Miller
ee895a30ef Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Flush the cleanup xtables worker to make sure destructors
   have completed, from Florian Westphal.

2) iifgroup is matching erroneously, also from Florian.

3) Add selftest for meta interface matching, from Florian Westphal.

4) Move nf_ct_offload_timeout() to header, from Roi Dayan.

5) Call nf_ct_offload_timeout() from flow_offload_add() to
   make sure garbage collection does not evict offloaded flow,
   from Roi Dayan.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04 13:32:39 -07:00
Linus Torvalds
3e4a12a1ba Merge tag 'gcc-plugins-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc plugin updates from Kees Cook:
 "Primarily improvements to STACKLEAK from Alexander Popov, along with
  some additional cleanups.

    - Update URLs for HTTPS scheme where available (Alexander A. Klimov)

   - Improve STACKLEAK code generation on x86 (Alexander Popov)"

* tag 'gcc-plugins-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  gcc-plugins: Replace HTTP links with HTTPS ones
  gcc-plugins/stackleak: Add 'verbose' plugin parameter
  gcc-plugins/stackleak: Use asm instrumentation to avoid useless register saving
  ARM: vdso: Don't use gcc plugins for building vgettimeofday.c
  gcc-plugins/stackleak: Don't instrument itself
2020-08-04 13:26:06 -07:00
Stefano Brivio
4cb47a8644 tunnels: PMTU discovery support for directly bridged IP packets
It's currently possible to bridge Ethernet tunnels carrying IP
packets directly to external interfaces without assigning them
addresses and routes on the bridged network itself: this is the case
for UDP tunnels bridged with a standard bridge or by Open vSwitch.

PMTU discovery is currently broken with those configurations, because
the encapsulation effectively decreases the MTU of the link, and
while we are able to account for this using PMTU discovery on the
lower layer, we don't have a way to relay ICMP or ICMPv6 messages
needed by the sender, because we don't have valid routes to it.

On the other hand, as a tunnel endpoint, we can't fragment packets
as a general approach: this is for instance clearly forbidden for
VXLAN by RFC 7348, section 4.3:

   VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
   fragment encapsulated VXLAN packets due to the larger frame size.
   The destination VTEP MAY silently discard such VXLAN fragments.

The same paragraph recommends that the MTU over the physical network
accomodates for encapsulations, but this isn't a practical option for
complex topologies, especially for typical Open vSwitch use cases.

Further, it states that:

   Other techniques like Path MTU discovery (see [RFC1191] and
   [RFC1981]) MAY be used to address this requirement as well.

Now, PMTU discovery already works for routed interfaces, we get
route exceptions created by the encapsulation device as they receive
ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
we already rebuild those messages with the appropriate MTU and route
them back to the sender.

Add the missing bits for bridged cases:

- checks in skb_tunnel_check_pmtu() to understand if it's appropriate
  to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
  RFC 4443 section 2.4 for ICMPv6. This function is already called by
  UDP tunnels

- a new function generating those ICMP or ICMPv6 replies. We can't
  reuse icmp_send() and icmp6_send() as we don't see the sender as a
  valid destination. This doesn't need to be generic, as we don't
  cover any other type of ICMP errors given that we only provide an
  encapsulation function to the sender

While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
we might receive GSO buffers here, and the passed headroom already
includes the inner MAC length, so we don't have to account for it
a second time (that would imply three MAC headers on the wire, but
there are just two).

This issue became visible while bridging IPv6 packets with 4500 bytes
of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
bytes of encapsulation headroom, we would advertise MTU as 3950, and
we would reject fragmented IPv6 datagrams of 3958 bytes size on the
wire. We're exclusively dealing with network MTU here, though, so we
could get Ethernet frames up to 3964 octets in that case.

v2:
- moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
- split IPv4/IPv6 functions (David Ahern)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04 13:01:45 -07:00
Stefano Brivio
df23bb18b4 ipv4: route: Ignore output interface in FIB lookup for PMTU route
Currently, processes sending traffic to a local bridge with an
encapsulation device as a port don't get ICMP errors if they exceed
the PMTU of the encapsulated link.

David Ahern suggested this as a hack, but it actually looks like
the correct solution: when we update the PMTU for a given destination
by means of updating or creating a route exception, the encapsulation
might trigger this because of PMTU discovery happening either on the
encapsulation device itself, or its lower layer. This happens on
bridged encapsulations only.

The output interface shouldn't matter, because we already have a
valid destination. Drop the output interface restriction from the
associated route lookup.

For UDP tunnels, we will now have a route exception created for the
encapsulation itself, with a MTU value reflecting its headroom, which
allows a bridge forwarding IP packets originated locally to deliver
errors back to the sending socket.

The behaviour is now consistent with IPv6 and verified with selftests
pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
this series.

v2:
- reset output interface only for bridge ports (David Ahern)
- add and use netif_is_any_bridge_port() helper (David Ahern)

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04 13:01:45 -07:00
David S. Miller
cabf06e5a2 Merge tag 'wireless-drivers-next-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:

====================
wireless-drivers-next patches for v5.9

Second set of patches for v5.9. mt76 has most of patches this time.
Otherwise it's just smaller fixes and cleanups to other drivers.

There was a major conflict in mt76 driver between wireless-drivers and
wireless-drivers-next. I solved that by merging the former to the
latter.

Major changes:

rtw88

* add support for ieee80211_ops::change_interface

* add support for enabling and disabling beacon

* add debugfs file for testing h2c

mt76

* ARP filter offload for 7663

* runtime power management for 7663

* testmode support for mfg calibration

* support for more channels
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04 12:57:02 -07:00
Jeff Layton
a0102bda5b ceph: move sb->wb_pagevec_pool to be a global mempool
When doing some testing recently, I hit some page allocation failures
on mount, when creating the wb_pagevec_pool for the mount. That
requires 128k (32 contiguous pages), and after thrashing the memory
during an xfstests run, sometimes that would fail.

128k for each mount seems like a lot to hold in reserve for a rainy
day, so let's change this to a global mempool that gets allocated
when the module is plugged in.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-08-04 19:41:12 +02:00
Rob Herring
669cbc7081 PCI: Move DT resource setup into devm_pci_alloc_host_bridge()
Now that pci_parse_request_of_pci_ranges() callers just setup
pci_host_bridge.windows and dma_ranges directly and don't need the bus
range returned, we can just initialize them when allocating the
pci_host_bridge struct.

With this, pci_parse_request_of_pci_ranges() becomes a static function.

Link: https://lore.kernel.org/r/20200722022514.1283916-19-robh@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
2020-08-04 16:36:30 +01:00
Rafael J. Wysocki
9ac1fb156a Merge branch 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Pull ARM cpufreq driver changes for v5.9-rc1 from Viresh Kumar:

"Here are the details:

- Adaptive voltage scaling (AVS) support and minor cleanups for
  brcmstb driver (Florian Fainelli and Markus Mayer).

- A new tegra driver and cleanup for the existing one (Sumit Gupta and
  Jon Hunter).

- Bandwidth level support for Qcom driver along with OPP changes (Sibi
  Sankar).

- Cleanups to sti, cpufreq-dt, ap806, CPPC drivers (Viresh Kumar, Lee
  Jones, Ivan Kokshaysky, Sven Auhagen, and Xin Hao).

- Make schedutil default governor for ARM (Valentin Schneider).

- Fix dependency issues for imx (Walter Lozano).

- Cleanup around cached_resolved_idx in cpufreq core (Viresh Kumar)."

* 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
  cpufreq: make schedutil the default for arm and arm64
  cpufreq: cached_resolved_idx can not be negative
  cpufreq: Add Tegra194 cpufreq driver
  dt-bindings: arm: Add NVIDIA Tegra194 CPU Complex binding
  cpufreq: imx: Select NVMEM_IMX_OCOTP
  cpufreq: sti-cpufreq: Fix some formatting and misspelling issues
  cpufreq: tegra186: Simplify probe return path
  cpufreq: CPPC: Reuse caps variable in few routines
  cpufreq: ap806: fix cpufreq driver needs ap cpu clk
  cpufreq: cppc: Reorder code and remove apply_hisi_workaround variable
  cpufreq: dt: fix oops on armada37xx
  cpufreq: brcmstb-avs-cpufreq: send S2_ENTER / S2_EXIT commands to AVS
  cpufreq: brcmstb-avs-cpufreq: Support polling AVS firmware
  cpufreq: brcmstb-avs-cpufreq: more flexible interface for __issue_avs_command()
  cpufreq: qcom: Disable fast switch when scaling DDR/L3
  cpufreq: qcom: Update the bandwidth levels on frequency change
  OPP: Add and export helper to set bandwidth
  cpufreq: blacklist SC7180 in cpufreq-dt-platdev
  cpufreq: blacklist SDM845 in cpufreq-dt-platdev
2020-08-04 12:44:53 +02:00