Initializing struct pages is a long task and keeping interrupts disabled
for the duration of this operation introduces a number of problems.
1. jiffies are not updated for long period of time, and thus incorrect time
is reported. See proposed solution and discussion here:
lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
2. It prevents farther improving deferred page initialization by allowing
intra-node multi-threading.
We are keeping interrupts disabled to solve a rather theoretical problem
that was never observed in real world (See 3a2d7fa8a3).
Let's keep interrupts enabled. In case we ever encounter a scenario where
an interrupt thread wants to allocate large amount of memory this early in
boot we can deal with that by growing zone (see deferred_grow_zone()) by
the needed amount before starting deferred_init_memmap() threads.
Before:
[ 1.232459] node 0 initialised, 12058412 pages in 1ms
After:
[ 1.632580] node 0 initialised, 12051227 pages in 436ms
Fixes: 3a2d7fa8a3 ("mm: disable interrupts while initializing deferred pages")
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Yiqian Wei <yiwei@redhat.com>
Cc: <stable@vger.kernel.org> [4.17+]
Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some architectures (e.g. ARC) have the ZONE_HIGHMEM zone below the
ZONE_NORMAL. Allowing free_area_init() parse max_zone_pfn array even it
is sorted in descending order allows using free_area_init() on such
architectures.
Add top -> down traversal of max_zone_pfn array in free_area_init() and
use the latter in ARC node/zone initialization.
[rppt@kernel.org: ARC fix]
Link: http://lkml.kernel.org/r/20200504153901.GM14260@kernel.org
[rppt@linux.ibm.com: arc: free_area_init(): take into account PAE40 mode]
Link: http://lkml.kernel.org/r/20200507205900.GH683243@linux.ibm.com
[akpm@linux-foundation.org: declare arch_has_descending_max_zone_pfns()]
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Hoan Tran <hoan@os.amperecomputing.com> [arm64]
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Guenter Roeck <linux@roeck-us.net>
Link: http://lkml.kernel.org/r/20200412194859.12663-18-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memcmp KASAN self-test fails on a kernel with both KASAN and
FORTIFY_SOURCE.
When FORTIFY_SOURCE is on, a number of functions are replaced with
fortified versions, which attempt to check the sizes of the operands.
However, these functions often directly invoke __builtin_foo() once they
have performed the fortify check. Using __builtins may bypass KASAN
checks if the compiler decides to inline it's own implementation as
sequence of instructions, rather than emit a function call that goes out
to a KASAN-instrumented implementation.
Why is only memcmp affected?
============================
Of the string and string-like functions that kasan_test tests, only memcmp
is replaced by an inline sequence of instructions in my testing on x86
with gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2).
I believe this is due to compiler heuristics. For example, if I annotate
kmalloc calls with the alloc_size annotation (and disable some fortify
compile-time checking!), the compiler will replace every memset except the
one in kmalloc_uaf_memset with inline instructions. (I have some WIP
patches to add this annotation.)
Does this affect other functions in string.h?
=============================================
Yes. Anything that uses __builtin_* rather than __real_* could be
affected. This looks like:
- strncpy
- strcat
- strlen
- strlcpy maybe, under some circumstances?
- strncat under some circumstances
- memset
- memcpy
- memmove
- memcmp (as noted)
- memchr
- strcpy
Whether a function call is emitted always depends on the compiler. Most
bugs should get caught by FORTIFY_SOURCE, but the missed memcmp test shows
that this is not always the case.
Isn't FORTIFY_SOURCE disabled with KASAN?
========================================-
The string headers on all arches supporting KASAN disable fortify with
kasan, but only when address sanitisation is _also_ disabled. For example
from x86:
#if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
/*
* For files that are not instrumented (e.g. mm/slub.c) we
* should use not instrumented version of mem* functions.
*/
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
#define memset(s, c, n) __memset(s, c, n)
#ifndef __NO_FORTIFY
#define __NO_FORTIFY /* FORTIFY_SOURCE uses __builtin_memcpy, etc. */
#endif
#endif
This comes from commit 6974f0c455 ("include/linux/string.h: add the
option of fortified string.h functions"), and doesn't work when KASAN is
enabled and the file is supposed to be sanitised - as with test_kasan.c
I'm pretty sure this is not wrong, but not as expansive it should be:
* we shouldn't use __builtin_memcpy etc in files where we don't have
instrumentation - it could devolve into a function call to memcpy,
which will be instrumented. Rather, we should use __memcpy which
by convention is not instrumented.
* we also shouldn't be using __builtin_memcpy when we have a KASAN
instrumented file, because it could be replaced with inline asm
that will not be instrumented.
What is correct behaviour?
==========================
Firstly, there is some overlap between fortification and KASAN: both
provide some level of _runtime_ checking. Only fortify provides
compile-time checking.
KASAN and fortify can pick up different things at runtime:
- Some fortify functions, notably the string functions, could easily be
modified to consider sub-object sizes (e.g. members within a struct),
and I have some WIP patches to do this. KASAN cannot detect these
because it cannot insert poision between members of a struct.
- KASAN can detect many over-reads/over-writes when the sizes of both
operands are unknown, which fortify cannot.
So there are a couple of options:
1) Flip the test: disable fortify in santised files and enable it in
unsanitised files. This at least stops us missing KASAN checking, but
we lose the fortify checking.
2) Make the fortify code always call out to real versions. Do this only
for KASAN, for fear of losing the inlining opportunities we get from
__builtin_*.
(We can't use kasan_check_{read,write}: because the fortify functions are
_extern inline_, you can't include _static_ inline functions without a
compiler warning. kasan_check_{read,write} are static inline so we can't
use them even when they would otherwise be suitable.)
Take approach 2 and call out to real versions when KASAN is enabled.
Use __underlying_foo to distinguish from __real_foo: __real_foo always
refers to the kernel's implementation of foo, __underlying_foo could be
either the kernel implementation or the __builtin_foo implementation.
This is sometimes enough to make the memcmp test succeed with
FORTIFY_SOURCE enabled. It is at least enough to get the function call
into the module. One more fix is needed to make it reliable: see the next
patch.
Fixes: 6974f0c455 ("include/linux/string.h: add the option of fortified string.h functions")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: David Gow <davidgow@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Link: http://lkml.kernel.org/r/20200423154503.5103-3-dja@axtens.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.
2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.
5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.
6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.
10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.
12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.
13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.
14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.
15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.
16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.
18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.
20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.
21) Add BPF iterators, from Yonghang Song.
22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.
23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.
25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...
Pull kvm updates from Paolo Bonzini:
"ARM:
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested
virtualization
- Nested AMD event injection facelift, building on the rework of
generic code and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch
with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host
side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page
fault work, will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
KVM: check userspace_addr for all memslots
KVM: selftests: update hyperv_cpuid with SynDBG tests
x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
x86/kvm/hyper-v: Add support for synthetic debugger interface
x86/hyper-v: Add synthetic debugger definitions
KVM: selftests: VMX preemption timer migration test
KVM: nVMX: Fix VMX preemption timer migration
x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
KVM: x86/pmu: Support full width counting
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
KVM: x86: acknowledgment mechanism for async pf page ready notifications
KVM: x86: interrupt based APF 'page ready' event delivery
KVM: introduce kvm_read_guest_offset_cached()
KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
KVM: VMX: Replace zero-length array with flexible-array
...
Pull hyper-v updates from Wei Liu:
- a series from Andrea to support channel reassignment
- a series from Vitaly to clean up Vmbus message handling
- a series from Michael to clean up and augment hyperv-tlfs.h
- patches from Andy to clean up GUID usage in Hyper-V code
- a few other misc patches
* tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (29 commits)
Drivers: hv: vmbus: Resolve more races involving init_vp_index()
Drivers: hv: vmbus: Resolve race between init_vp_index() and CPU hotplug
vmbus: Replace zero-length array with flexible-array
Driver: hv: vmbus: drop a no long applicable comment
hyper-v: Switch to use UUID types directly
hyper-v: Replace open-coded variant of %*phN specifier
hyper-v: Supply GUID pointer to printf() like functions
hyper-v: Use UUID API for exporting the GUID (part 2)
asm-generic/hyperv: Add definitions for Get/SetVpRegister hypercalls
x86/hyperv: Split hyperv-tlfs.h into arch dependent and independent files
x86/hyperv: Remove HV_PROCESSOR_POWER_STATE #defines
KVM: x86: hyperv: Remove duplicate definitions of Reference TSC Page
drivers: hv: remove redundant assignment to pointer primary_channel
scsi: storvsc: Re-init stor_chns when a channel interrupt is re-assigned
Drivers: hv: vmbus: Introduce the CHANNELMSG_MODIFYCHANNEL message type
Drivers: hv: vmbus: Synchronize init_vp_index() vs. CPU hotplug
Drivers: hv: vmbus: Remove the unused HV_LOCALIZED channel affinity logic
PCI: hv: Prepare hv_compose_msi_msg() for the VMBus-channel-interrupt-to-vCPU reassignment functionality
Drivers: hv: vmbus: Use a spin lock for synchronizing channel scheduling vs. channel removal
hv_utils: Always execute the fcopy and vss callbacks in a tasklet
...
Pull kgdb updates from Daniel Thompson:
"By far the biggest change in this cycle are the changes that allow
much earlier debug of systems that are hooked up via UART by taking
advantage of the earlycon framework to implement the kgdb I/O hooks
before handing over to the regular polling I/O drivers once they are
available. When discussing Doug's work we also found and fixed an
broken raw_smp_processor_id() sequence in in_dbg_master().
Also included are a collection of much smaller fixes and tweaks: a
couple of tweaks to ged rid of doc gen or coccicheck warnings, future
proof some internal calculations that made implicit power-of-2
assumptions and eliminate some rather weird handling of magic
environment variables in kdb"
* tag 'kgdb-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Remove the misfeature 'KDBFLAGS'
kdb: Cleanup math with KDB_CMD_HISTORY_COUNT
serial: amba-pl011: Support kgdboc_earlycon
serial: 8250_early: Support kgdboc_earlycon
serial: qcom_geni_serial: Support kgdboc_earlycon
serial: kgdboc: Allow earlycon initialization to be deferred
Documentation: kgdboc: Document new kgdboc_earlycon parameter
kgdb: Don't call the deinit under spinlock
kgdboc: Disable all the early code when kgdboc is a module
kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles
kgdboc: Remove useless #ifdef CONFIG_KGDB_SERIAL_CONSOLE in kgdboc
kgdb: Prevent infinite recursive entries to the debugger
kgdb: Delay "kgdbwait" to dbg_late_init() by default
kgdboc: Use a platform device to handle tty drivers showing up late
Revert "kgdboc: disable the console lock when in kgdb"
kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb
kgdb: Return true in kgdb_nmi_poll_knock()
kgdb: Drop malformed kernel doc comment
kgdb: Fix spurious true from in_dbg_master()
that's the only caller of __clear_user() in generic code, and it's
not hot enough to bother with skipping access_ok().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull MIPS updates from Thomas Bogendoerfer:
- added support for MIPSr5 and P5600 cores
- converted Loongson PCI driver into a PCI host driver using the
generic PCI framework
- added emulation of CPUCFG command for Loogonson64 cpus
- removed of LASAT, PMC MSP71xx and NEC MARKEINS/EMMA
- ioremap cleanup
- fix for a race between two threads faulting the same page
- various cleanups and fixes
* tag 'mips_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (143 commits)
MIPS: ralink: drop ralink_clk_init for mt7621
MIPS: ralink: bootrom: mark a function as __init to save some memory
MIPS: Loongson64: Reorder CPUCFG model match arms
MIPS: Expose Loongson CPUCFG availability via HWCAP
MIPS: Loongson64: Guard against future cores without CPUCFG
MIPS: Fix build warning about "PTR_STR" redefinition
MIPS: Loongson64: Remove not used pci.c
MIPS: Loongson64: Define PCI_IOBASE
MIPS: CPU_LOONGSON2EF need software to maintain cache consistency
MIPS: DTS: Fix build errors used with various configs
MIPS: Loongson64: select NO_EXCEPT_FILL
MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe()
MIPS: mm: add page valid judgement in function pte_modify
mm/memory.c: Add memory read privilege on page fault handling
mm/memory.c: Update local TLB if PTE entry exists
MIPS: Do not flush tlb page when updating PTE entry
MIPS: ingenic: Default to a generic board
MIPS: ingenic: Add support for GCW Zero prototype
MIPS: ingenic: DTS: Add memory info of GCW Zero
MIPS: Loongson64: Switch to generic PCI driver
...
Pull thread updates from Christian Brauner:
"We have been discussing using pidfds to attach to namespaces for quite
a while and the patches have in one form or another already existed
for about a year. But I wanted to wait to see how the general api
would be received and adopted.
This contains the changes to make it possible to use pidfds to attach
to the namespaces of a process, i.e. they can be passed as the first
argument to the setns() syscall.
When only a single namespace type is specified the semantics are
equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
equals setns(pidfd, CLONE_NEWNET).
However, when a pidfd is passed, multiple namespace flags can be
specified in the second setns() argument and setns() will attach the
caller to all the specified namespaces all at once or to none of them.
Specifying 0 is not valid together with a pidfd. Here are just two
obvious examples:
setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);
Allowing to also attach subsets of namespaces supports various
use-cases where callers setns to a subset of namespaces to retain
privilege, perform an action and then re-attach another subset of
namespaces.
Apart from significantly reducing the number of syscalls needed to
attach to all currently supported namespaces (eight "open+setns"
sequences vs just a single "setns()"), this also allows atomic setns
to a set of namespaces, i.e. either attaching to all namespaces
succeeds or we fail without having changed anything.
This is centered around a new internal struct nsset which holds all
information necessary for a task to switch to a new set of namespaces
atomically. Fwiw, with this change a pidfd becomes the only token
needed to interact with a container. I'm expecting this to be
picked-up by util-linux for nsenter rather soon.
Associated with this change is a shiny new test-suite dedicated to
setns() (for pidfds and nsfds alike)"
* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests/pidfd: add pidfd setns tests
nsproxy: attach to namespaces via pidfds
nsproxy: add struct nsset
Pull scheduler updates from Ingo Molnar:
"The changes in this cycle are:
- Optimize the task wakeup CPU selection logic, to improve
scalability and reduce wakeup latency spikes
- PELT enhancements
- CFS bandwidth handling fixes
- Optimize the wakeup path by remove rq->wake_list and replacing it
with ->ttwu_pending
- Optimize IPI cross-calls by making flush_smp_call_function_queue()
process sync callbacks first.
- Misc fixes and enhancements"
* tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too
sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
sched: Replace rq::wake_list
sched: Add rq::ttwu_pending
irq_work, smp: Allow irq_work on call_single_queue
smp: Optimize send_call_function_single_ipi()
smp: Move irq_work_run() out of flush_smp_call_function_queue()
smp: Optimize flush_smp_call_function_queue()
sched: Fix smp_call_function_single_async() usage for ILB
sched/core: Offload wakee task activation if it the wakee is descheduling
sched/core: Optimize ttwu() spinning on p->on_cpu
sched: Defend cfs and rt bandwidth quota against overflow
sched/cpuacct: Fix charge cpuacct.usage_sys
sched/fair: Replace zero-length array with flexible-array
sched/pelt: Sync util/runnable_sum with PELT window when propagating
sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
sched/fair: Optimize enqueue_task_fair()
sched: Make scheduler_ipi inline
sched: Clean up scheduler_ipi()
sched/core: Simplify sched_init()
...
Pull irq updates from Thomas Gleixner:
"The generic interrupt departement provides:
- Cleanup of the irq_domain API
- Overhaul of the interrupt chip simulator
- The usual pile of new interrupt chip drivers
- Cleanups, improvements and fixes all over the place"
* tag 'irq-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
irqchip: Fix "Loongson HyperTransport Vector support" driver build on all non-MIPS platforms
dt-bindings: interrupt-controller: Add Loongson PCH MSI
irqchip: Add Loongson PCH MSI controller
dt-bindings: interrupt-controller: Add Loongson PCH PIC
irqchip: Add Loongson PCH PIC controller
dt-bindings: interrupt-controller: Add Loongson HTVEC
irqchip: Add Loongson HyperTransport Vector support
genirq: Check irq_data_get_irq_chip() return value before use
irqchip/sifive-plic: Improve boot prints for multiple PLIC instances
irqchip/sifive-plic: Setup cpuhp once after boot CPU handler is present
irqchip/sifive-plic: Set default irq affinity in plic_irqdomain_map()
irqchip/gic-v2, v3: Drop extra IRQ_NOAUTOEN setting for (E)PPIs
irqdomain: Allow software nodes for IRQ domain creation
irqdomain: Get rid of special treatment for ACPI in __irq_domain_add()
irqdomain: Make __irq_domain_add() less OF-dependent
iio: dummy_evgen: Fix use after free on error in iio_dummy_evgen_create()
irqchip/gic-v3-its: Balance initial LPI affinity across CPUs
irqchip/gic-v3-its: Track LPI distribution on a per CPU basis
genirq/irq_sim: Simplify the API
irqdomain: Make irq_domain_reset_irq_data() available to non-hierarchical users
...
This region provides a mechanism to pass a Channel Report Word
that affect vfio-ccw devices, and needs to be passed to the guest
for its awareness and/or processing.
The base driver (see crw_collect_info()) provides space for two
CRWs, as a subchannel event may have two CRWs chained together
(one for the ssid, one for the subchannel). As vfio-ccw will
deal with everything at the subchannel level, provide space
for a single CRW to be transferred in one shot.
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Message-Id: <20200505122745.53208-7-farman@linux.ibm.com>
[CH: added padding to ccw_crw_region]
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
There are quite a lot simple GPIO controller which are using regmap to
access the hardware. This driver tries to be a base to unify existing
code into one place. This won't cover everything but it should be a good
starting point.
It does not implement its own irq_chip because there is already a
generic one for regmap based devices. Instead, the irq_chip will be
instantiated in the parent driver and its irq domain will be associate
to this driver.
For now it consists of the usual registers, like set (and an optional
clear) data register, an input register and direction registers.
Out-of-the-box, it supports consecutive register mappings and mappings
where the registers have gaps between them with a linear mapping between
GPIO offset and bit position. For weirder mappings the user can register
its own .xlate().
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20200528145845.31436-3-michael@walle.cc
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
The function connects an IRQ domain to a gpiochip and reuses
gpiochip_to_irq() which is provided by gpiolib.
gpiochip_irqchip_* and regmap_irq partially provide the same
functionality. This function will help to connect just the
minimal functionality of the gpiochip_irqchip which is needed to
work together with regmap-irq.
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20200528145845.31436-2-michael@walle.cc
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Pull btrfs updates from David Sterba:
"Highlights:
- speedup dead root detection during orphan cleanup, eg. when there
are many deleted subvolumes waiting to be cleaned, the trees are
now looked up in radix tree instead of a O(N^2) search
- snapshot creation with inherited qgroup will mark the qgroup
inconsistent, requires a rescan
- send will emit file capabilities after chown, this produces a
stream that does not need postprocessing to set the capabilities
again
- direct io ported to iomap infrastructure, cleaned up and simplified
code, notably removing last use of struct buffer_head in btrfs code
Core changes:
- factor out backreference iteration, to be used by ordinary
backreferences and relocation code
- improved global block reserve utilization
* better logic to serialize requests
* increased maximum available for unlink
* improved handling on large pages (64K)
- direct io cleanups and fixes
* simplify layering, where cloned bios were unnecessarily created
for some cases
* error handling fixes (submit, endio)
* remove repair worker thread, used to avoid deadlocks during
repair
- refactored block group reading code, preparatory work for new type
of block group storage that should improve mount time on large
filesystems
Cleanups:
- cleaned up (and slightly sped up) set/get helpers for metadata data
structure members
- root bit REF_COWS got renamed to SHAREABLE to reflect the that the
blocks of the tree get shared either among subvolumes or with the
relocation trees
Fixes:
- when subvolume deletion fails due to ENOSPC, the filesystem is not
turned read-only
- device scan deals with devices from other filesystems that changed
ownership due to overwrite (mkfs)
- fix a race between scrub and block group removal/allocation
- fix long standing bug of a runaway balance operation, printing the
same line to the syslog, caused by a stale status bit on a reloc
tree that prevented progress
- fix corrupt log due to concurrent fsync of inodes with shared
extents
- fix space underflow for NODATACOW and buffered writes when it for
some reason needs to fallback to COW mode"
* tag 'for-5.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (133 commits)
btrfs: fix space_info bytes_may_use underflow during space cache writeout
btrfs: fix space_info bytes_may_use underflow after nocow buffered write
btrfs: fix wrong file range cleanup after an error filling dealloc range
btrfs: remove redundant local variable in read_block_for_search
btrfs: open code key_search
btrfs: split btrfs_direct_IO to read and write part
btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
fs: remove dio_end_io()
btrfs: switch to iomap_dio_rw() for dio
iomap: remove lockdep_assert_held()
iomap: add a filesystem hook for direct I/O bio submission
fs: export generic_file_buffered_read()
btrfs: turn space cache writeout failure messages into debug messages
btrfs: include error on messages about failure to write space/inode caches
btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks()
btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
btrfs: make checksum item extension more efficient
btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents
btrfs: unexport btrfs_compress_set_level()
btrfs: simplify iget helpers
...
Pull DAX updates part two from Darrick Wong:
"This time around, we're hoisting the DONTCACHE flag from XFS into the
VFS so that we can make the incore DAX mode changes become effective
sooner.
We can't change the file data access mode on a live inode because we
don't have a safe way to change the file ops pointers. The incore
state change becomes effective at inode loading time, which can happen
if the inode is evicted. Therefore, we're making it so that
filesystems can ask the VFS to evict the inode as soon as the last
holder drops.
The per-fs changes to make this call this will be in subsequent pull
requests from Ted and myself.
Summary:
- Introduce DONTCACHE flags for dentries and inodes. This hint will
cause the VFS to drop the associated objects immediately after the
last put, so that we can change the file access mode (DAX or page
cache) on the fly"
* tag 'vfs-5.8-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: Introduce DCACHE_DONTCACHE
fs: Lift XFS_IDONTCACHE to the VFS layer
Pull DAX updates part one from Darrick Wong:
"After many years of LKML-wrangling about how to enable programs to
query and influence the file data access mode (DAX) when a filesystem
resides on storage devices such as persistent memory, Ira Weiny has
emerged with a proposed set of standard behaviors that has not been
shot down by anyone! We're more or less standardizing on the current
XFS behavior and adapting ext4 to do the same.
This is the first of a handful pull requests that will make ext4 and
XFS present a consistent interface for user programs that care about
DAX. We add a statx attribute that programs can check to see if DAX is
enabled on a particular file. Then, we update the DAX documentation to
spell out the user-visible behaviors that filesystems will guarantee
(until the next storage industry shakeup). The on-disk inode flag has
been in XFS for a few years now.
Summary:
- Clean up io_is_direct.
- Add a new statx flag to indicate when file data access is being
done via DAX (as opposed to the page cache).
- Update the documentation for how system administrators and
application programmers can take advantage of the (still
experimental DAX) feature"
Link: https://lore.kernel.org/lkml/20200505002016.1085071-1-ira.weiny@intel.com/
* tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
Documentation/dax: Update Usage section
fs/stat: Define DAX statx attribute
fs: Remove unneeded IS_DAX() check in io_is_direct()
Pull lockdown update from James Morris:
"An update for the security subsystem to allow unprivileged users
to see the status of the lockdown feature. From Jeremy Cline"
Also an added comment to describe CAP_SETFCAP.
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
capabilities: add description for CAP_SETFCAP
lockdown: Allow unprivileged users to see lockdown status
Pull audit updates from Paul Moore:
"Summary of the significant patches:
- Record information about binds/unbinds to the audit multicast
socket. This helps identify which processes have/had access to the
information in the audit stream.
- Cleanup and add some additional information to the netfilter
configuration events collected by audit.
- Fix some of the audit error handling code so we don't leak network
namespace references"
* tag 'audit-pr-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: add subj creds to NETFILTER_CFG record to
audit: Replace zero-length array with flexible-array
audit: make symbol 'audit_nfcfgs' static
netfilter: add audit table unregister actions
audit: tidy and extend netfilter_cfg x_tables
audit: log audit netlink multicast bind and unbind
audit: fix a net reference leak in audit_list_rules_send()
audit: fix a net reference leak in audit_send_reply()