Commit Graph

72799 Commits

Author SHA1 Message Date
Jens Axboe
58c6a581de Merge branch 'io_uring-5.8' into for-5.9/io_uring
Pull in task_work changes from the 5.8 series, as we'll need to apply
the same kind of changes to other parts in the 5.9 branch.

* io_uring-5.8:
  io_uring: fix regression with always ignoring signals in io_cqring_wait()
  io_uring: use signal based task_work running
  task_work: teach task_work_add() to do signal_wake_up()
2020-07-05 15:04:17 -06:00
Oleg Nesterov
e91b481623 task_work: teach task_work_add() to do signal_wake_up()
So that the target task will exit the wait_event_interruptible-like
loop and call task_work_run() asap.

The patch turns "bool notify" into 0,TWA_RESUME,TWA_SIGNAL enum, the
new TWA_SIGNAL flag implies signal_wake_up().  However, it needs to
avoid the race with recalc_sigpending(), so the patch also adds the
new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK.

TODO: once this patch is merged we need to change all current users
of task_work_add(notify = true) to use TWA_RESUME.

Cc: stable@vger.kernel.org # v5.7
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 12:18:08 -06:00
Jens Axboe
d1932dc3dc mm: add kiocb_wait_page_queue_init() helper
Checks if the file supports it, and initializes the values that we need.
Caller passes in 'data' pointer, if any, and the callback function to
be used.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:26 -06:00
Jens Axboe
c2a25ec0f1 fs: add FMODE_BUF_RASYNC
If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Jens Axboe
dd3e6d5039 mm: add support for async page locking
Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.

We add a new iocb field, ki_waitq, to pass in the necessary data for this
to happen. We can unionize this with ki_cookie, since that is only used
for polled IO. Polled IO can never co-exist with async callbacks, as it is
(by definition) polled completions. struct wait_page_key is made public,
and we define struct wait_page_async as the interface between the caller
and the core.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Jens Axboe
c7510ab2cf mm: abstract out wake_page_match() from wake_page_function()
No functional changes in this patch, just in preparation for allowing
more callers.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Jens Axboe
5a473e8311 block: provide plug based way of signaling forced no-wait semantics
Provide a way for the caller to specify that IO should be marked
with REQ_NOWAIT to avoid blocking on allocation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-21 20:44:25 -06:00
Linus Torvalds
7561393908 Merge tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:

 - One fix for the interrupt rework we did last release which broke
   KVM-PR

 - Three commits fixing some fallout from the READ_ONCE() changes
   interacting badly with our 8xx 16K pages support, which uses a pte_t
   that is a structure of 4 actual PTEs

 - A cleanup of the 8xx pte_update() to use the newly added pmd_off()

 - A fix for a crash when handling an oops if CONFIG_DEBUG_VIRTUAL is
   enabled

 - A minor fix for the SPU syscall generation

Thanks to Aneesh Kumar K.V, Christian Zigotzky, Christophe Leroy, Mike
Rapoport, Nicholas Piggin.

* tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/8xx: Provide ptep_get() with 16k pages
  mm: Allow arches to provide ptep_get()
  mm/gup: Use huge_ptep_get() in gup_hugepte()
  powerpc/syscalls: Use the number when building SPU syscall table
  powerpc/8xx: use pmd_off() to access a PMD entry in pte_update()
  powerpc/64s: Fix KVM interrupt using wrong save area
  powerpc: Fix kernel crash in show_instructions() w/DEBUG_VIRTUAL
2020-06-21 10:02:53 -07:00
Linus Torvalds
93bbca271a Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:

 - NULL dereference in octeontx

 - PM reference imbalance in ks-sa

 - deadlock in crypto manager

 - memory leak in drbg

 - missing socket limit check on receive SG list size in algif_skcipher

 - typos in caam

 - warnings in ccp and hisilicon

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: drbg - always try to free Jitter RNG instance
  crypto: marvell/octeontx - Fix a potential NULL dereference
  crypto: algboss - don't wait during notifier callback
  crypto: caam - fix typos
  crypto: ccp - Fix sparse warnings in sev-dev
  crypto: hisilicon - Cap block size at 2^31
  crypto: algif_skcipher - Cap recv SG list at ctx->used
  hwrng: ks-sa - Fix runtime PM imbalance on error
2020-06-21 10:01:03 -07:00
Linus Torvalds
64677779e8 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "One minor fix and two patches reworking the ata dma drain for the
  !CONFIG_LIBATA case. The latter is a 5.7 regression fix"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: Wire up ata_scsi_dma_need_drain for SAS HBA drivers
  scsi: libata: Provide an ata_scsi_dma_need_drain stub for !CONFIG_ATA
  scsi: ufs-bsg: Fix runtime PM imbalance on error
2020-06-20 19:23:13 -07:00
Linus Torvalds
a5c6a1f0fe Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:

 - a small collection of remaining API conversion patches (all acked)
   which allow to finally remove the deprecated API

 - some documentation fixes and a MAINTAINERS addition

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  MAINTAINERS: Add robert and myself as qcom i2c cci maintainers
  i2c: smbus: Fix spelling mistake in the comments
  Documentation/i2c: SMBus start signal is S not A
  i2c: remove deprecated i2c_new_device API
  Documentation: media: convert to use i2c_new_client_device()
  video: backlight: tosa_lcd: convert to use i2c_new_client_device()
  x86/platform/intel-mid: convert to use i2c_new_client_device()
  drm: encoder_slave: use new I2C API
  drm: encoder_slave: fix refcouting error for modules
2020-06-20 19:18:27 -07:00
Linus Torvalds
8b6ddd10d6 Merge tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:

 - Have recordmcount work with > 64K sections (to support LTO)

 - kprobe RCU fixes

 - Correct a kprobe critical section with missing mutex

 - Remove redundant arch_disarm_kprobe() call

 - Fix lockup when kretprobe triggers within kprobe_flush_task()

 - Fix memory leak in fetch_op_data operations

 - Fix sleep in atomic in ftrace trace array sample code

 - Free up memory on failure in sample trace array code

 - Fix incorrect reporting of function_graph fields in format file

 - Fix quote within quote parsing in bootconfig

 - Fix return value of bootconfig tool

 - Add testcases for bootconfig tool

 - Fix maybe uninitialized warning in ftrace pid file code

 - Remove unused variable in tracing_iter_reset()

 - Fix some typos

* tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix maybe-uninitialized compiler warning
  tools/bootconfig: Add testcase for show-command and quotes test
  tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig
  tools/bootconfig: Fix to use correct quotes for value
  proc/bootconfig: Fix to use correct quotes for value
  tracing: Remove unused event variable in tracing_iter_reset
  tracing/probe: Fix memleak in fetch_op_data operations
  trace: Fix typo in allocate_ftrace_ops()'s comment
  tracing: Make ftrace packed events have align of 1
  sample-trace-array: Remove trace_array 'sample-instance'
  sample-trace-array: Fix sleeping function called from invalid context
  kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
  kprobes: Remove redundant arch_disarm_kprobe() call
  kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
  kprobes: Use non RCU traversal APIs on kprobe_tables if possible
  kprobes: Suppress the suspicious RCU warning on kprobes
  recordmcount: support >64k sections
2020-06-20 13:17:47 -07:00
Christophe Leroy
481e980a7c mm: Allow arches to provide ptep_get()
Since commit 9e343b467c ("READ_ONCE: Enforce atomicity for
{READ,WRITE}_ONCE() memory accesses") it is not possible anymore to
use READ_ONCE() to access complex page table entries like the one
defined for powerpc 8xx with 16k size pages.

Define a ptep_get() helper that architectures can override instead
of performing a READ_ONCE() on the page table entry pointer.

Fixes: 9e343b467c ("READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/087fa12b6e920e32315136b998aa834f99242695.1592225558.git.christophe.leroy@csgroup.eu
2020-06-20 22:14:53 +10:00
Linus Torvalds
d2b1c81f5f Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - Use import_uuid() where appropriate (Andy)

 - bcache fixes (Coly, Mauricio, Zhiqiang)

 - blktrace sparse warnings fix (Jan)

 - blktrace concurrent setup fix (Luis)

 - blkdev_get use-after-free fix (Jason)

 - Ensure all blk-mq maps are updated (Weiping)

 - Loop invalidate bdev fix (Zheng)

* tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
  block: make function 'kill_bdev' static
  loop: replace kill_bdev with invalidate_bdev
  partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense
  block: update hctx map when use multiple maps
  blktrace: Avoid sparse warnings when assigning q->blk_trace
  blktrace: break out of blktrace setup on concurrent calls
  block: Fix use-after-free in blkdev_get()
  trace/events/block.h: drop kernel-doc for dropped function parameter
  blk-mq: Remove redundant 'return' statement
  bcache: pr_info() format clean up in bcache_device_init()
  bcache: use delayed kworker fo asynchronous devices registration
  bcache: check and adjust logical block size for backing devices
  bcache: fix potential deadlock problem in btree_gc_coalesce
2020-06-19 13:11:26 -07:00
Linus Torvalds
592be758f1 Merge tag 'libata-5.8-2020-06-19' of git://git.kernel.dk/linux-block
Pull libata fixes from Jens Axboe:
 "A few minor changes that should go into this release"

* tag 'libata-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
  libata: Use per port sync for detach
  ata/libata: Fix usage of page address by page_address in ata_scsi_mode_select_xlat function
  sata_rcar: handle pm_runtime_get_sync failure cases
2020-06-19 13:09:40 -07:00
Linus Torvalds
672f9255a7 Merge tag 'ceph-for-5.8-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
 "An important follow-up for replica reads support that went into -rc1
  and two target_copy() fixups"

* tag 'ceph-for-5.8-rc2' of git://github.com/ceph/ceph-client:
  libceph: don't omit used_replica in target_copy()
  libceph: don't omit recovery_deletes in target_copy()
  libceph: move away from global osd_req_flags
2020-06-19 12:25:04 -07:00
Linus Torvalds
98b769942c Merge tag 'overflow-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull flex-array size helper from Kees Cook:
 "During the treewide clean-ups of zero-length "flexible arrays", the
  struct_size() helper was heavily used, but it was noticed that many
  times it would have been nice to have an additional helper to get the
  size of just the flexible array itself.

  This need appears to be even more common when cleaning up the 1-byte
  array "flexible arrays", so Gustavo implemented it.

  I'd love to get this landed early so it can be used during the v5.9
  dev cycle to ease the 1-byte array cleanups."

* tag 'overflow-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  overflow.h: Add flex_array_size() helper
2020-06-19 11:45:03 -07:00
Wolfram Sang
390fd0475a i2c: remove deprecated i2c_new_device API
All in-tree users have been converted to the new i2c_new_client_device
function, so remove this deprecated one.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
2020-06-19 09:20:28 +02:00
Linus Torvalds
5e857ce6ea Merge branch 'hch' (maccess patches from Christoph Hellwig)
Merge non-faulting memory access cleanups from Christoph Hellwig:
 "Andrew and I decided to drop the patches implementing your suggested
  rename of the probe_kernel_* and probe_user_* helpers from -mm as
  there were way to many conflicts.

  After -rc1 might be a good time for this as all the conflicts are
  resolved now"

This also adds a type safety checking patch on top of the renaming
series to make the subtle behavioral difference between 'get_user()' and
'get_kernel_nofault()' less potentially dangerous and surprising.

* emailed patches from Christoph Hellwig <hch@lst.de>:
  maccess: make get_kernel_nofault() check for minimal type compatibility
  maccess: rename probe_kernel_address to get_kernel_nofault
  maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault
  maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
2020-06-18 12:35:51 -07:00
Linus Torvalds
0c389d89ab maccess: make get_kernel_nofault() check for minimal type compatibility
Now that we've renamed probe_kernel_address() to get_kernel_nofault()
and made it look and behave more in line with get_user(), some of the
subtle type behavior differences end up being more obvious and possibly
dangerous.

When you do

        get_user(val, user_ptr);

the type of the access comes from the "user_ptr" part, and the above
basically acts as

        val = *user_ptr;

by design (except, of course, for the fact that the actual dereference
is done with a user access).

Note how in the above case, the type of the end result comes from the
pointer argument, and then the value is cast to the type of 'val' as
part of the assignment.

So the type of the pointer is ultimately the more important type both
for the access itself.

But 'get_kernel_nofault()' may now _look_ similar, but it behaves very
differently.  When you do

        get_kernel_nofault(val, kernel_ptr);

it behaves like

        val = *(typeof(val) *)kernel_ptr;

except, of course, for the fact that the actual dereference is done with
exception handling so that a faulting access is suppressed and returned
as the error code.

But note how different the casting behavior of the two superficially
similar accesses are: one does the actual access in the size of the type
the pointer points to, while the other does the access in the size of
the target, and ignores the pointer type entirely.

Actually changing get_kernel_nofault() to act like get_user() is almost
certainly the right thing to do eventually, but in the meantime this
patch adds logit to at least verify that the pointer type is compatible
with the type of the result.

In many cases, this involves just casting the pointer to 'void *' to
make it obvious that the type of the pointer is not the important part.
It's not how 'get_user()' acts, but at least the behavioral difference
is now obvious and explicit.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-18 12:10:37 -07:00
Christoph Hellwig
25f12ae45f maccess: rename probe_kernel_address to get_kernel_nofault
Better describe what this helper does, and match the naming of
copy_from_kernel_nofault.

Also switch the argument order around, so that it acts and looks
like get_user().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-18 11:14:40 -07:00
Luc Van Oostenryck
670d0a4b10 sparse: use identifiers to define address spaces
Currently, address spaces in warnings are displayed as '<asn:X>' with
'X' being the address space's arbitrary number.

But since sparse v0.6.0-rc1 (late December 2018), sparse allows you to
define the address spaces using an identifier instead of a number.  This
identifier is then directly used in the warnings.

So, use the identifiers '__user', '__iomem', '__percpu' & '__rcu' for
the corresponding address spaces.  The default address space, __kernel,
being not displayed in warnings, stays defined as '0'.

With this change, warnings that used to be displayed as:

	cast removes address space '<asn:1>' of expression
	... void [noderef] <asn:2> *

will now be displayed as:

	cast removes address space '__user' of expression
	... void [noderef] __iomem *

This also moves the __kernel annotation to be the first one, since it is
quite different from the others because it's the default one, and so:

 - it's never displayed

 - it's normally not needed, nor in type annotations, nor in cast
   between address spaces. The only time it's needed is when it's
   combined with a typeof to express "the same type as this one but
   without the address space"

 - it can't be defined with a name, '0' must be used.

So, it seemed strange to me to have it in the middle of the other
ones.

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-18 10:05:23 -07:00
Zheng Bin
3373a3461a block: make function 'kill_bdev' static
kill_bdev does not have any external user, so make it static.

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-18 09:24:35 -06:00
Kai-Heng Feng
b5292111de libata: Use per port sync for detach
Commit 130f4caf14 ("libata: Ensure ata_port probe has completed before
detach") may cause system freeze during suspend.

Using async_synchronize_full() in PM callbacks is wrong, since async
callbacks that are already scheduled may wait for not-yet-scheduled
callbacks, causes a circular dependency.

Instead of using big hammer like async_synchronize_full(), use async
cookie to make sure port probe are synced, without affecting other
scheduled PM callbacks.

Fixes: 130f4caf14 ("libata: Ensure ata_port probe has completed before detach")
Suggested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: John Garry <john.garry@huawei.com>
BugLink: https://bugs.launchpad.net/bugs/1867983
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-18 09:21:40 -06:00
Christoph Hellwig
c0ee37e85e maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault
Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17 10:57:41 -07:00
Christoph Hellwig
fe557319aa maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17 10:57:41 -07:00
Gustavo A. R. Silva
b19d57d0f3 overflow.h: Add flex_array_size() helper
Add flex_array_size() helper for the calculation of the size, in bytes,
of a flexible array member contained within an enclosing structure.

Example of usage:

struct something {
	size_t count;
	struct foo items[];
};

struct something *instance;

instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
instance->count = count;
memcpy(instance->items, src, flex_array_size(instance, items, instance->count));

The helper returns SIZE_MAX on overflow instead of wrapping around.

Additionally replaces parameter "n" with "count" in struct_size() helper
for greater clarity and unification.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20200609012233.GA3371@embeddedor
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-16 20:45:08 -07:00
Jiri Olsa
9b38cc704e kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
Ziqian reported lockup when adding retprobe on _raw_spin_lock_irqsave.
My test was also able to trigger lockdep output:

 ============================================
 WARNING: possible recursive locking detected
 5.6.0-rc6+ #6 Not tainted
 --------------------------------------------
 sched-messaging/2767 is trying to acquire lock:
 ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0

 but task is already holding lock:
 ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&(kretprobe_table_locks[i].lock));
   lock(&(kretprobe_table_locks[i].lock));

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by sched-messaging/2767:
  #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50

 stack backtrace:
 CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6
 Call Trace:
  dump_stack+0x96/0xe0
  __lock_acquire.cold.57+0x173/0x2b7
  ? native_queued_spin_lock_slowpath+0x42b/0x9e0
  ? lockdep_hardirqs_on+0x590/0x590
  ? __lock_acquire+0xf63/0x4030
  lock_acquire+0x15a/0x3d0
  ? kretprobe_hash_lock+0x52/0xa0
  _raw_spin_lock_irqsave+0x36/0x70
  ? kretprobe_hash_lock+0x52/0xa0
  kretprobe_hash_lock+0x52/0xa0
  trampoline_handler+0xf8/0x940
  ? kprobe_fault_handler+0x380/0x380
  ? find_held_lock+0x3a/0x1c0
  kretprobe_trampoline+0x25/0x50
  ? lock_acquired+0x392/0xbc0
  ? _raw_spin_lock_irqsave+0x50/0x70
  ? __get_valid_kprobe+0x1f0/0x1f0
  ? _raw_spin_unlock_irqrestore+0x3b/0x40
  ? finish_task_switch+0x4b9/0x6d0
  ? __switch_to_asm+0x34/0x70
  ? __switch_to_asm+0x40/0x70

The code within the kretprobe handler checks for probe reentrancy,
so we won't trigger any _raw_spin_lock_irqsave probe in there.

The problem is in outside kprobe_flush_task, where we call:

  kprobe_flush_task
    kretprobe_table_lock
      raw_spin_lock_irqsave
        _raw_spin_lock_irqsave

where _raw_spin_lock_irqsave triggers the kretprobe and installs
kretprobe_trampoline handler on _raw_spin_lock_irqsave return.

The kretprobe_trampoline handler is then executed with already
locked kretprobe_table_locks, and first thing it does is to
lock kretprobe_table_locks ;-) the whole lockup path like:

  kprobe_flush_task
    kretprobe_table_lock
      raw_spin_lock_irqsave
        _raw_spin_lock_irqsave ---> probe triggered, kretprobe_trampoline installed

        ---> kretprobe_table_locks locked

        kretprobe_trampoline
          trampoline_handler
            kretprobe_hash_lock(current, &head, &flags);  <--- deadlock

Adding kprobe_busy_begin/end helpers that mark code with fake
probe installed to prevent triggering of another kprobe within
this code.

Using these helpers in kprobe_flush_task, so the probe recursion
protection check is hit and the probe is never set to prevent
above lockup.

Link: http://lkml.kernel.org/r/158927059835.27680.7011202830041561604.stgit@devnote2

Fixes: ef53d9c5e4 ("kprobes: improve kretprobe scalability with hashed locking")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Reported-by: "Ziqian SUN (Zamir)" <zsun@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:01 -04:00
Linus Torvalds
ffbc93768e Merge tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull flexible-array member conversions from Gustavo A. R. Silva:
 "Replace zero-length arrays with flexible-array members.

  Notice that all of these patches have been baking in linux-next for
  two development cycles now.

  There is a regular need in the kernel to provide a way to declare
  having a dynamically sized set of trailing elements in a structure.
  Kernel code should always use “flexible array members”[1] for these
  cases. The older style of one-element or zero-length arrays should no
  longer be used[2].

  C99 introduced “flexible array members”, which lacks a numeric size
  for the array declaration entirely:

        struct something {
                size_t count;
                struct foo items[];
        };

  This is the way the kernel expects dynamically sized trailing elements
  to be declared. It allows the compiler to generate errors when the
  flexible array does not occur last in the structure, which helps to
  prevent some kind of undefined behavior[3] bugs from being
  inadvertently introduced to the codebase.

  It also allows the compiler to correctly analyze array sizes (via
  sizeof(), CONFIG_FORTIFY_SOURCE, and CONFIG_UBSAN_BOUNDS). For
  instance, there is no mechanism that warns us that the following
  application of the sizeof() operator to a zero-length array always
  results in zero:

        struct something {
                size_t count;
                struct foo items[0];
        };

        struct something *instance;

        instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
        instance->count = count;

        size = sizeof(instance->items) * instance->count;
        memcpy(instance->items, source, size);

  At the last line of code above, size turns out to be zero, when one
  might have thought it represents the total size in bytes of the
  dynamic memory recently allocated for the trailing array items. Here
  are a couple examples of this issue[4][5].

  Instead, flexible array members have incomplete type, and so the
  sizeof() operator may not be applied[6], so any misuse of such
  operators will be immediately noticed at build time.

  The cleanest and least error-prone way to implement this is through
  the use of a flexible array member:

        struct something {
                size_t count;
                struct foo items[];
        };

        struct something *instance;

        instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL);
        instance->count = count;

        size = sizeof(instance->items[0]) * instance->count;
        memcpy(instance->items, source, size);

  instead"

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
[4] commit f2cd32a443 ("rndis_wlan: Remove logically dead code")
[5] commit ab91c2a89f ("tpm: eventlog: Replace zero-length array with flexible-array member")
[6] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html

* tag 'flex-array-conversions-5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (41 commits)
  w1: Replace zero-length array with flexible-array
  tracing/probe: Replace zero-length array with flexible-array
  soc: ti: Replace zero-length array with flexible-array
  tifm: Replace zero-length array with flexible-array
  dmaengine: tegra-apb: Replace zero-length array with flexible-array
  stm class: Replace zero-length array with flexible-array
  Squashfs: Replace zero-length array with flexible-array
  ASoC: SOF: Replace zero-length array with flexible-array
  ima: Replace zero-length array with flexible-array
  sctp: Replace zero-length array with flexible-array
  phy: samsung: Replace zero-length array with flexible-array
  RxRPC: Replace zero-length array with flexible-array
  rapidio: Replace zero-length array with flexible-array
  media: pwc: Replace zero-length array with flexible-array
  firmware: pcdp: Replace zero-length array with flexible-array
  oprofile: Replace zero-length array with flexible-array
  block: Replace zero-length array with flexible-array
  tools/testing/nvdimm: Replace zero-length array with flexible-array
  libata: Replace zero-length array with flexible-array
  kprobes: Replace zero-length array with flexible-array
  ...
2020-06-16 17:23:57 -07:00
Ilya Dryomov
22d2cfdffa libceph: move away from global osd_req_flags
osd_req_flags is overly general and doesn't suit its only user
(read_from_replica option) well:

- applying osd_req_flags in account_request() affects all OSD
  requests, including linger (i.e. watch and notify).  However,
  linger requests should always go to the primary even though
  some of them are reads (e.g. notify has side effects but it
  is a read because it doesn't result in mutation on the OSDs).

- calls to class methods that are reads are allowed to go to
  the replica, but most such calls issued for "rbd map" and/or
  exclusive lock transitions are requested to be resent to the
  primary via EAGAIN, doubling the latency.

Get rid of global osd_req_flags and set read_from_replica flag
only on specific OSD requests instead.

Fixes: 8ad44d5e0d ("libceph: read_from_replica option")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-06-16 16:01:53 +02:00
Gustavo A. R. Silva
5cab1634e4 tifm: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:32 -05:00
Gustavo A. R. Silva
af6bb61cc0 sctp: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:32 -05:00
Gustavo A. R. Silva
9c5fbf05cb libata: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:31 -05:00
Gustavo A. R. Silva
67a862a94d kprobes: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:31 -05:00
Gustavo A. R. Silva
50b6951feb kexec: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:31 -05:00
Gustavo A. R. Silva
764e515f41 KVM: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:31 -05:00
Gustavo A. R. Silva
67cd462446 FS-Cache: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:31 -05:00
Gustavo A. R. Silva
6b5679d237 cb710: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:31 -05:00
Gustavo A. R. Silva
d6562f1ca8 can: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:31 -05:00
Gustavo A. R. Silva
466f966b1f dmaengine: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:30 -05:00
Christoph Hellwig
7bb7ee8704 scsi: libata: Provide an ata_scsi_dma_need_drain stub for !CONFIG_ATA
SAS drivers can be compiled with ata support disabled.  Provide a stub so
that the drivers don't have to ifdef around wiring up
ata_scsi_dma_need_drain.

Link: https://lore.kernel.org/r/20200615064624.37317-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-06-15 23:40:41 -04:00
Linus Torvalds
3be20b6fc1 Merge tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull more ext4 updates from Ted Ts'o:
 "This is the second round of ext4 commits for 5.8 merge window [1].

  It includes the per-inode DAX support, which was dependant on the DAX
  infrastructure which came in via the XFS tree, and a number of
  regression and bug fixes; most notably the "BUG: using
  smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
  by syzkaller"

[1] The pull request actually came in 15 minutes after I had tagged the
    rc1 release. Tssk, tssk, late..   - Linus

* tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers
  ext4: support xattr gnu.* namespace for the Hurd
  ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr
  ext4: avoid utf8_strncasecmp() with unstable name
  ext4: stop overwrite the errcode in ext4_setup_super
  ext4: fix partial cluster initialization when splitting extent
  ext4: avoid race conditions when remounting with options that change dax
  Documentation/dax: Update DAX enablement for ext4
  fs/ext4: Introduce DAX inode flag
  fs/ext4: Remove jflag variable
  fs/ext4: Make DAX mount option a tri-state
  fs/ext4: Only change S_DAX on inode load
  fs/ext4: Update ext4_should_use_dax()
  fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
  fs/ext4: Disallow verity if inode is DAX
  fs/ext4: Narrow scope of DAX check in setflags
2020-06-15 09:32:10 -07:00
Herbert Xu
376bd28d03 crypto: ccp - Fix sparse warnings in sev-dev
This patch fixes a bunch of sparse warnings in sev-dev where the
__user marking is incorrectly handled.

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 7360e4b143 ("crypto: ccp: Implement SEV_PEK_CERT_IMPORT...")
Fixes: e799035609 ("crypto: ccp: Implement SEV_PEK_CSR ioctl...")
Fixes: 76a2b524a4 ("crypto: ccp: Implement SEV_PDH_CERT_EXPORT...")
Fixes: d6112ea0cb ("crypto: ccp - introduce SEV_GET_ID2 command")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-06-15 17:38:53 +10:00
Linus Torvalds
4a87b197c1 Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux
Pull SafeSetID update from Micah Morton:
 "Add additional LSM hooks for SafeSetID

  SafeSetID is capable of making allow/deny decisions for set*uid calls
  on a system, and we want to add similar functionality for set*gid
  calls.

  The work to do that is not yet complete, so probably won't make it in
  for v5.8, but we are looking to get this simple patch in for v5.8
  since we have it ready.

  We are planning on the rest of the work for extending the SafeSetID
  LSM being merged during the v5.9 merge window"

* tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux:
  security: Add LSM hooks to set*gid syscalls
2020-06-14 11:39:31 -07:00
Thomas Cedeno
39030e1351 security: Add LSM hooks to set*gid syscalls
The SafeSetID LSM uses the security_task_fix_setuid hook to filter
set*uid() syscalls according to its configured security policy. In
preparation for adding analagous support in the LSM for set*gid()
syscalls, we add the requisite hook here. Tested by putting print
statements in the security_task_fix_setgid hook and seeing them get hit
during kernel boot.

Signed-off-by: Thomas Cedeno <thomascedeno@google.com>
Signed-off-by: Micah Morton <mortonm@chromium.org>
2020-06-14 10:52:02 -07:00
Linus Torvalds
9d645db853 Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "This reverts the direct io port to iomap infrastructure of btrfs
  merged in the first pull request. We found problems in invalidate page
  that don't seem to be fixable as regressions or without changing iomap
  code that would not affect other filesystems.

  There are four reverts in total, but three of them are followup
  cleanups needed to revert a43a67a2d7 cleanly. The result is the
  buffer head based implementation of direct io.

  Reverts are not great, but under current circumstances I don't see
  better options"

* tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: switch to iomap_dio_rw() for dio"
  Revert "fs: remove dio_end_io()"
  Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK"
  Revert "btrfs: split btrfs_direct_IO to read and write part"
2020-06-14 09:47:25 -07:00
Linus Torvalds
96144c58ab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix cfg80211 deadlock, from Johannes Berg.

 2) RXRPC fails to send norigications, from David Howells.

 3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
    Geliang Tang.

 4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.

 5) The ucc_geth driver needs __netdev_watchdog_up exported, from
    Valentin Longchamp.

 6) Fix hashtable memory leak in dccp, from Wang Hai.

 7) Fix how nexthops are marked as FDB nexthops, from David Ahern.

 8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.

 9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.

10) Fix link speed reporting in iavf driver, from Brett Creeley.

11) When a channel is used for XSK and then reused again later for XSK,
    we forget to clear out the relevant data structures in mlx5 which
    causes all kinds of problems. Fix from Maxim Mikityanskiy.

12) Fix memory leak in genetlink, from Cong Wang.

13) Disallow sockmap attachments to UDP sockets, it simply won't work.
    From Lorenz Bauer.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  net: ethernet: ti: ale: fix allmulti for nu type ale
  net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
  net: atm: Remove the error message according to the atomic context
  bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
  libbpf: Support pre-initializing .bss global variables
  tools/bpftool: Fix skeleton codegen
  bpf: Fix memlock accounting for sock_hash
  bpf: sockmap: Don't attach programs to UDP sockets
  bpf: tcp: Recv() should return 0 when the peer socket is closed
  ibmvnic: Flush existing work items before device removal
  genetlink: clean up family attributes allocations
  net: ipa: header pad field only valid for AP->modem endpoint
  net: ipa: program upper nibbles of sequencer type
  net: ipa: fix modem LAN RX endpoint id
  net: ipa: program metadata mask differently
  ionic: add pcie_print_link_status
  rxrpc: Fix race between incoming ACK parser and retransmitter
  net/mlx5: E-Switch, Fix some error pointer dereferences
  net/mlx5: Don't fail driver on failure to create debugfs
  net/mlx5e: CT: Fix ipv6 nat header rewrite actions
  ...
2020-06-13 16:27:13 -07:00
Linus Torvalds
91fa58840a Merge branch 'i2c/for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
 "I2C has quite some patches for you this time. I hope it is the move to
  per-driver-maintainers which is now showing results. We will see.

  The big news is two new drivers (Nuvoton NPCM and Qualcomm CCI),
  larger refactoring of the Designware, Tegra, and PXA drivers, the
  Cadence driver supports being a slave now, and there is support to
  instanciate SPD eeproms for well-known cases (which will be
  user-visible because the i801 driver supports it), and some
  devm_platform_ioremap_resource() conversions which blow up the
  diffstat.

  Note that I applied the Nuvoton driver quite late, so some minor fixup
  patches arrived during the merge window. I chose to apply them right
  away because they were trivial"

* 'i2c/for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (109 commits)
  i2c: Drop stray comma in MODULE_AUTHOR statements
  i2c: npcm7xx: npcm_i2caddr[] can be static
  MAINTAINERS: npcm7xx: Add maintainer for Nuvoton NPCM BMC
  i2c: npcm7xx: Fix a couple of error codes in probe
  i2c: icy: Fix build with CONFIG_AMIGA_PCMCIA=n
  i2c: npcm7xx: Remove unnecessary parentheses
  i2c: npcm7xx: Add support for slave mode for Nuvoton
  i2c: npcm7xx: Add Nuvoton NPCM I2C controller driver
  dt-bindings: i2c: npcm7xx: add NPCM I2C controller
  i2c: pxa: don't error out if there's no pinctrl
  i2c: add 'single-master' property to generic bindings
  i2c: designware: Add Baikal-T1 System I2C support
  i2c: designware: Move reg-space remapping into a dedicated function
  i2c: designware: Retrieve quirk flags as early as possible
  i2c: designware: Convert driver to using regmap API
  i2c: designware: Discard Cherry Trail model flag
  i2c: designware: Add Baytrail sem config DW I2C platform dependency
  i2c: designware: slave: Set DW I2C core module dependency
  i2c: designware: Use `-y` to build multi-object modules
  dt-bindings: i2c: dw: Add Baikal-T1 SoC I2C controller
  ...
2020-06-13 13:12:38 -07:00
Linus Torvalds
a9429089d3 Merge tag 'ras-core-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Thomas Gleixner:
 "RAS updates from Borislav Petkov:

   - Unmap a whole guest page if an MCE is encountered in it to avoid
     follow-on MCEs leading to the guest crashing, by Tony Luck.

     This change collided with the entry changes and the merge
     resolution would have been rather unpleasant. To avoid that the
     entry branch was merged in before applying this. The resulting code
     did not change over the rebase.

   - AMD MCE error thresholding machinery cleanup and hotplug
     sanitization, by Thomas Gleixner.

   - Change the MCE notifiers to denote whether they have handled the
     error and not break the chain early by returning NOTIFY_STOP, thus
     giving the opportunity for the later handlers in the chain to see
     it. By Tony Luck.

   - Add AMD family 0x17, models 0x60-6f support, by Alexander Monakov.

   - Last but not least, the usual round of fixes and improvements"

* tag 'ras-core-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/mce/dev-mcelog: Fix -Wstringop-truncation warning about strncpy()
  x86/{mce,mm}: Unmap the entire page if the whole page is affected and poisoned
  EDAC/amd64: Add AMD family 17h model 60h PCI IDs
  hwmon: (k10temp) Add AMD family 17h model 60h PCI match
  x86/amd_nb: Add AMD family 17h model 60h PCI IDs
  x86/mcelog: Add compat_ioctl for 32-bit mcelog support
  x86/mce: Drop bogus comment about mce.kflags
  x86/mce: Fixup exception only for the correct MCEs
  EDAC: Drop the EDAC report status checks
  x86/mce: Add mce=print_all option
  x86/mce: Change default MCE logger to check mce->kflags
  x86/mce: Fix all mce notifiers to update the mce->kflags bitmask
  x86/mce: Add a struct mce.kflags field
  x86/mce: Convert the CEC to use the MCE notifier
  x86/mce: Rename "first" function as "early"
  x86/mce/amd, edac: Remove report_gart_errors
  x86/mce/amd: Make threshold bank setting hotplug robust
  x86/mce/amd: Cleanup threshold device remove path
  x86/mce/amd: Straighten CPU hotplug path
  x86/mce/amd: Sanitize thresholding device creation hotplug path
  ...
2020-06-13 10:21:00 -07:00
Linus Torvalds
076f14be7f Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry updates from Thomas Gleixner:
 "The x86 entry, exception and interrupt code rework

  This all started about 6 month ago with the attempt to move the Posix
  CPU timer heavy lifting out of the timer interrupt code and just have
  lockless quick checks in that code path. Trivial 5 patches.

  This unearthed an inconsistency in the KVM handling of task work and
  the review requested to move all of this into generic code so other
  architectures can share.

  Valid request and solved with another 25 patches but those unearthed
  inconsistencies vs. RCU and instrumentation.

  Digging into this made it obvious that there are quite some
  inconsistencies vs. instrumentation in general. The int3 text poke
  handling in particular was completely unprotected and with the batched
  update of trace events even more likely to expose to endless int3
  recursion.

  In parallel the RCU implications of instrumenting fragile entry code
  came up in several discussions.

  The conclusion of the x86 maintainer team was to go all the way and
  make the protection against any form of instrumentation of fragile and
  dangerous code pathes enforcable and verifiable by tooling.

  A first batch of preparatory work hit mainline with commit
  d5f744f9a2 ("Pull x86 entry code updates from Thomas Gleixner")

  That (almost) full solution introduced a new code section
  '.noinstr.text' into which all code which needs to be protected from
  instrumentation of all sorts goes into. Any call into instrumentable
  code out of this section has to be annotated. objtool has support to
  validate this.

  Kprobes now excludes this section fully which also prevents BPF from
  fiddling with it and all 'noinstr' annotated functions also keep
  ftrace off. The section, kprobes and objtool changes are already
  merged.

  The major changes coming with this are:

    - Preparatory cleanups

    - Annotating of relevant functions to move them into the
      noinstr.text section or enforcing inlining by marking them
      __always_inline so the compiler cannot misplace or instrument
      them.

    - Splitting and simplifying the idtentry macro maze so that it is
      now clearly separated into simple exception entries and the more
      interesting ones which use interrupt stacks and have the paranoid
      handling vs. CR3 and GS.

    - Move quite some of the low level ASM functionality into C code:

       - enter_from and exit to user space handling. The ASM code now
         calls into C after doing the really necessary ASM handling and
         the return path goes back out without bells and whistels in
         ASM.

       - exception entry/exit got the equivivalent treatment

       - move all IRQ tracepoints from ASM to C so they can be placed as
         appropriate which is especially important for the int3
         recursion issue.

    - Consolidate the declaration and definition of entry points between
      32 and 64 bit. They share a common header and macros now.

    - Remove the extra device interrupt entry maze and just use the
      regular exception entry code.

    - All ASM entry points except NMI are now generated from the shared
      header file and the corresponding macros in the 32 and 64 bit
      entry ASM.

    - The C code entry points are consolidated as well with the help of
      DEFINE_IDTENTRY*() macros. This allows to ensure at one central
      point that all corresponding entry points share the same
      semantics. The actual function body for most entry points is in an
      instrumentable and sane state.

      There are special macros for the more sensitive entry points, e.g.
      INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
      They allow to put the whole entry instrumentation and RCU handling
      into safe places instead of the previous pray that it is correct
      approach.

    - The INT3 text poke handling is now completely isolated and the
      recursion issue banned. Aside of the entry rework this required
      other isolation work, e.g. the ability to force inline bsearch.

    - Prevent #DB on fragile entry code, entry relevant memory and
      disable it on NMI, #MC entry, which allowed to get rid of the
      nested #DB IST stack shifting hackery.

    - A few other cleanups and enhancements which have been made
      possible through this and already merged changes, e.g.
      consolidating and further restricting the IDT code so the IDT
      table becomes RO after init which removes yet another popular
      attack vector

    - About 680 lines of ASM maze are gone.

  There are a few open issues:

   - An escape out of the noinstr section in the MCE handler which needs
     some more thought but under the aspect that MCE is a complete
     trainwreck by design and the propability to survive it is low, this
     was not high on the priority list.

   - Paravirtualization

     When PV is enabled then objtool complains about a bunch of indirect
     calls out of the noinstr section. There are a few straight forward
     ways to fix this, but the other issues vs. general correctness were
     more pressing than parawitz.

   - KVM

     KVM is inconsistent as well. Patches have been posted, but they
     have not yet been commented on or picked up by the KVM folks.

   - IDLE

     Pretty much the same problems can be found in the low level idle
     code especially the parts where RCU stopped watching. This was
     beyond the scope of the more obvious and exposable problems and is
     on the todo list.

  The lesson learned from this brain melting exercise to morph the
  evolved code base into something which can be validated and understood
  is that once again the violation of the most important engineering
  principle "correctness first" has caused quite a few people to spend
  valuable time on problems which could have been avoided in the first
  place. The "features first" tinkering mindset really has to stop.

  With that I want to say thanks to everyone involved in contributing to
  this effort. Special thanks go to the following people (alphabetical
  order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian
  Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai
  Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra,
  Vitaly Kuznetsov, and Will Deacon"

* tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits)
  x86/entry: Force rcu_irq_enter() when in idle task
  x86/entry: Make NMI use IDTENTRY_RAW
  x86/entry: Treat BUG/WARN as NMI-like entries
  x86/entry: Unbreak __irqentry_text_start/end magic
  x86/entry: __always_inline CR2 for noinstr
  lockdep: __always_inline more for noinstr
  x86/entry: Re-order #DB handler to avoid *SAN instrumentation
  x86/entry: __always_inline arch_atomic_* for noinstr
  x86/entry: __always_inline irqflags for noinstr
  x86/entry: __always_inline debugreg for noinstr
  x86/idt: Consolidate idt functionality
  x86/idt: Cleanup trap_init()
  x86/idt: Use proper constants for table size
  x86/idt: Add comments about early #PF handling
  x86/idt: Mark init only functions __init
  x86/entry: Rename trace_hardirqs_off_prepare()
  x86/entry: Clarify irq_{enter,exit}_rcu()
  x86/entry: Remove DBn stacks
  x86/entry: Remove debug IDT frobbing
  x86/entry: Optimize local_db_save() for virt
  ...
2020-06-13 10:05:47 -07:00