If LLD was built with -DLLD_VENDOR="xyz", ld.lld --version output
will prefix LLD_VENDOR. Since LLD_VENDOR can contain spaces, the
LLD identifier isn't guaranteed to be $2 either.
Adjust the version checker to handle such versions of lld.
Link: https://lore.kernel.org/lkml/20210302221211.1620858-1-bero@lindev.ch/
Signed-off-by: Bernhard Rosenkränzer <bero@lindev.ch>
[masahiro yamada: refactor the code]
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
(cherry picked from commit 1f09af0625)
Bug: 210043760
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Change-Id: Ic9a5b75cf0e7ffebcc77e81b4402c579dd13f3e2
Unify the two scripts/ld-version.sh and scripts/lld-version.sh, and
check the minimum linker version like scripts/cc-version.sh did.
I tested this script for some corner cases reported in the past:
- GNU ld version 2.25-15.fc23
as reported by commit 8083013fc3 ("ld-version: Fix it on Fedora")
- GNU ld (GNU Binutils) 2.20.1.20100303
as reported by commit 0d61ed17dd ("ld-version: Drop the 4th and
5th version components")
This script show an error message if the linker is too old:
$ make LD=ld.lld-9
SYNC include/config/auto.conf
***
*** Linker is too old.
*** Your LLD version: 9.0.1
*** Minimum LLD version: 10.0.1
***
scripts/Kconfig.include:50: Sorry, this linker is not supported.
make[2]: *** [scripts/kconfig/Makefile:71: syncconfig] Error 1
make[1]: *** [Makefile:600: syncconfig] Error 2
make: *** [Makefile:708: include/config/auto.conf] Error 2
I also moved the check for gold to this script, so gold is still rejected:
$ make LD=gold
SYNC include/config/auto.conf
gold linker is not supported as it is not capable of linking the kernel proper.
scripts/Kconfig.include:50: Sorry, this linker is not supported.
make[2]: *** [scripts/kconfig/Makefile:71: syncconfig] Error 1
make[1]: *** [Makefile:600: syncconfig] Error 2
make: *** [Makefile:708: include/config/auto.conf] Error 2
Thanks to David Laight for suggesting shell script improvements.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
[nd: conflict in scripts/lld-version.sh due to df58fb431a
("scripts/lld-version.sh: Rewrite based on upstream ld-version.sh")
being 5.10 only]
Bug: 210043760
(cherry picked from commit 02aff85922)
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Change-Id: I1618f31566b84dd601ca2c9665f6d341df56537d
Paul Gortmaker reported a regression in the GCC version check. [1]
If you use GCC 4.8, the build breaks before showing the error message
"error Sorry, your version of GCC is too old - please use 4.9 or newer."
I do not want to apply his fix-up since it implies we would not be able
to remove any cc-option test. Anyway, I admit checking the GCC version
in <linux/compiler-gcc.h> is too late.
Almost at the same time, Linus also suggested to move the compiler
version error to Kconfig time. [2]
I unified the two similar scripts, gcc-version.sh and clang-version.sh
into cc-version.sh. The old scripts invoked the compiler multiple times
(3 times for gcc-version.sh, 4 times for clang-version.sh). I refactored
the code so the new one invokes the compiler just once, and also tried
my best to use shell-builtin commands where possible.
The new script runs faster.
$ time ./scripts/clang-version.sh clang
120000
real 0m0.029s
user 0m0.012s
sys 0m0.021s
$ time ./scripts/cc-version.sh clang
Clang 120000
real 0m0.009s
user 0m0.006s
sys 0m0.004s
cc-version.sh also shows an error message if the compiler is too old:
$ make defconfig CC=clang-9
*** Default configuration is based on 'x86_64_defconfig'
***
*** Compiler is too old.
*** Your Clang version: 9.0.1
*** Minimum Clang version: 10.0.1
***
scripts/Kconfig.include:46: Sorry, this compiler is not supported.
make[1]: *** [scripts/kconfig/Makefile:81: defconfig] Error 1
make: *** [Makefile:602: defconfig] Error 2
The new script takes care of ICC because we have <linux/compiler-intel.h>
although I am not sure if building the kernel with ICC is well-supported.
[1]: https://lore.kernel.org/r/20210110190807.134996-1-paul.gortmaker@windriver.com
[2]: https://lore.kernel.org/r/CAHk-=wh-+TMHPTFo1qs-MYyK7tZh-OQovA=pP3=e06aCVp6_kA@mail.gmail.com
Fixes: 87de84c914 ("kbuild: remove cc-option test of -Werror=date-time")
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Tested-by: Miguel Ojeda <ojeda@kernel.org>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Bug: 210043760
(cherry picked from commit aec6c60a01)
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Change-Id: I61e4eb8b4ef5466d0ee8c5892973734c98529b17
Commit ccbef1674a ("Kbuild, lto: add ld-version and ld-ifversion
macros") introduced scripts/ld-version.sh for GCC LTO.
At that time, this script handled 5 version fields because GCC LTO
needed the downstream binutils. (https://lkml.org/lkml/2014/4/8/272)
The code snippet from the submitted patch was as follows:
# We need HJ Lu's Linux binutils because mainline binutils does not
# support mixing assembler and LTO code in the same ld -r object.
# XXX check if the gcc plugin ld is the expected one too
# XXX some Fedora binutils should also support it. How to check for that?
ifeq ($(call ld-ifversion,-ge,22710001,y),y)
...
However, GCC LTO was not merged into the mainline after all.
(https://lkml.org/lkml/2014/4/8/272)
So, the 4th and 5th fields were never used, and finally removed by
commit 0d61ed17dd ("ld-version: Drop the 4th and 5th version
components").
Since then, the last 4-digits returned by this script is always zeros.
Remove the meaningless last 4-digits. This makes the version format
consistent with GCC_VERSION, CLANG_VERSION, LLD_VERSION.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Bug: 210043760
(cherry picked from commit 052c805a18)
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Change-Id: I544ebf950d7cd2b58e5f3c0f89b38869ace80146
kvm_arm_init_arch_resources() was renamed to kvm_arm_init_sve() in
commit a3be836df7 ("KVM: arm/arm64: Demote
kvm_arm_init_arch_resources() to just set up SVE"). Fix the function
name in comment of kvm_vcpu_finalize_sve().
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211230141535.1389-1-yuzenghui@huawei.com
(cherry picked from commit e938eddbeb
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209777660
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Ia4dd418253ee40891559a61430583c3c3188f47a
Ganapatrao reported that the kvm_pgtable->mmu pointer is more or
less hardcoded to the main S2 mmu structure, while the nested
code needs it to point to other instances (as we have one instance
per nested context).
Rework the initialisation of the kvm_pgtable structure so that
this assumtion doesn't hold true anymore. This requires some
minor changes to the order in which things are initialised
(the mmu->arch pointer being the critical one).
Reported-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211129200150.351436-5-maz@kernel.org
(cherry picked from commit 9d8604b285
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209777660
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: If942bf4a01a975a60c37ad8483eafc52185234d5
Running the KVM selftests results in these messages being dumped
in the kernel console:
[ 188.051073] kvm [469]: VGIC redist and dist frames overlap
[ 188.056820] kvm [469]: VGIC redist and dist frames overlap
[ 188.076199] kvm [469]: VGIC redist and dist frames overlap
Being amle to trigger this from userspace is definitely not on,
so demote these warnings to kvm_debug().
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211216104507.1482017-1-maz@kernel.org
(cherry picked from commit 440523b92b
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209777660
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I0c6c94795a86705925dc1c9519bb0dec831b641c
When handling an error at the point where we try and register
all the redistributors, we unregister all the previously
registered frames by counting down from the failing index.
However, the way the code is written relies on that index
being a signed value. Which won't be true once we switch to
an xarray-based vcpu set.
Since this code is pretty awkward the first place, and that the
failure mode is hard to spot, rewrite this loop to iterate
over the vcpus upwards rather than downwards.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211216104526.1482124-1-maz@kernel.org
(cherry picked from commit c95b1d7ca7
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209777660
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I9df3874b4a6542e5d20125d436108d8c1235b8e5
workaround_flags is a leftover from our earlier Spectre-v4 workaround
implementation, and now serves no purpose.
Get rid of the field and the corresponding asm-offset definition.
Fixes: 29e8910a56 ("KVM: arm64: Simplify handling of ARCH_WORKAROUND_2")
Signed-off-by: Marc Zyngier <maz@kernel.org>
(cherry picked from commit 142ff9bddb
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209777660
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Ie059d34e32f35718abc13b321fbf9f762a5585c1
Fix the following warning issued when CONFIG_PERF_EVENTS is not
defined:
kernel/sysctl.c:124:12: error: ‘one_thousand’ defined but not used [-Werror=unused-variable]
These definitions in upstream has been changed [1] and therefore
the issue does not exist there.
[1] https://lore.kernel.org/all/20211124220801.ip01WsWPQ%25akpm@linux-foundation.org/
Fixes: 8fded571e7 ("FROMGIT: mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%")
Bug: 194652782
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I5539a2d0d27a126f7405455a8cf08c23b80d2e0b
Make use of the newly introduced unshare hypercall during guest teardown
to unmap guest-related data structures from the hyp stage-1.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-15-qperret@google.com
(cherry picked from commit 52b28657eb
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: Ic8ec2c74d3e803e4d5f0df37d144d2c2eb0e9ad3
Introduce an unshare hypercall which can be used to unmap memory from
the hypervisor stage-1 in nVHE protected mode. This will be useful to
update the EL2 ownership state of pages during guest teardown, and
avoids keeping dangling mappings to unreferenced portions of memory.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-14-qperret@google.com
(cherry picked from commit b8cc6eb5bd
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I0b2839046914a0fd9a6d0077383d53dc84a8ac3d
Tearing down a previously shared memory region results in the borrower
losing access to the underlying pages and returning them to the "owned"
state in the owner.
Implement a do_unshare() helper, along the same lines as do_share(), to
provide this functionality for the host-to-hyp case.
Reviewed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-13-qperret@google.com
(cherry picked from commit 376a240f03
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I07aa6194099325c955f48ab608c4b601664419e5
__pkvm_host_share_hyp() shares memory between the host and the
hypervisor so implement it as an invocation of the new do_share()
mechanism.
Note that double-sharing is no longer permitted (as this allows us to
reduce the number of page-table walks significantly), but is thankfully
no longer relied upon by the host.
[ qperret: BACKPORT due to conflict caused by the MMIO handler
introduced with the S2MPU support ]
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-12-qperret@google.com
(cherry picked from commit 1ee32109fd
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I9f5877b6ea7b9bf735a90492fbca16f485e73110
By default, protected KVM isolates memory pages so that they are
accessible only to their owner: be it the host kernel, the hypervisor
at EL2 or (in future) the guest. Establishing shared-memory regions
between these components therefore involves a transition for each page
so that the owner can share memory with a borrower under a certain set
of permissions.
Introduce a do_share() helper for safely sharing a memory region between
two components. Currently, only host-to-hyp sharing is implemented, but
the code is easily extended to handle other combinations and the
permission checks for each component are reusable.
Reviewed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-11-qperret@google.com
(cherry picked from commit e82edcc75c
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I3ee4486ab08faa24015ac608979e3596b27806a2
In preparation for adding additional locked sections for manipulating
page-tables at EL2, introduce some simple wrappers around the host and
hypervisor locks so that it's a bit easier to read and bit more difficult
to take the wrong lock (or even take them in the wrong order).
[ qperret: BACKPORT due to conflict with S2MPU code ]
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-10-qperret@google.com
(cherry picked from commit 61d99e33e7
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I413e2d7f514db507a1aca9a21f04115e97f72c00
Explicitly name the combination of SW0 | SW1 as reserved in the pte and
introduce a new PKVM_NOPAGE meta-state which, although not directly
stored in the software bits of the pte, can be used to represent an
entry for which there is no underlying page. This is distinct from an
invalid pte, as stage-2 identity mappings for the host are created
lazily and so an invalid pte there is the same as a valid mapping for
the purposes of ownership information.
This state will be used for permission checking during page transitions
in later patches.
Reviewed-by: Andrew Walbran <qwandor@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-9-qperret@google.com
(cherry picked from commit 3d467f7b8c
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I2e803a6ae83694be178a3d3eaf75fe918c881659
In order to simplify the page tracking infrastructure at EL2 in nVHE
protected mode, move the responsibility of refcounting pages that are
shared multiple times on the host. In order to do so, let's create a
red-black tree tracking all the PFNs that have been shared, along with
a refcount.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-8-qperret@google.com
(cherry picked from commit a83e2191b7
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I58f39d139c84acb545676c99144547cd080cdda3
The create_hyp_mappings() function can currently be called at any point
in time. However, its behaviour in protected mode changes widely
depending on when it is being called. Prior to KVM init, it is used to
create the temporary page-table used to bring-up the hypervisor, and
later on it is transparently turned into a 'share' hypercall when the
kernel has lost control over the hypervisor stage-1. In order to prepare
the ground for also unsharing pages with the hypervisor during guest
teardown, introduce a kvm_share_hyp() function to make it clear in which
places a share hypercall should be expected, as we will soon need a
matching unshare hypercall in all those places.
[ qperret: BACKPORT due to resolution in kvm_arch_vcpu_run_map_fp() in
merge commit 2d761dbf7f ("Merge branch kvm-arm64/fpsimd-tracking
into kvmarm-master/next") ]
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-7-qperret@google.com
(cherry picked from commit 3f868e142c
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: Id3ba033a51550e5aada862d6af06773d0be833fb
kvm_pgtable_hyp_unmap() relies on the ->page_count() function callback
being provided by the memory-management operations for the page-table.
Wire up this callback for the hypervisor stage-1 page-table.
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-5-qperret@google.com
(cherry picked from commit 34ec7cbf1e
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I481b1f722f4dddda72e35481aca725a937a52090
In nVHE-protected mode, the hyp stage-1 page-table refcount is broken
due to the lack of refcount support in the early allocator. Fix-up the
refcount in the finalize walker, once the 'hyp_vmemmap' is up and running.
[ qperret: BACKPORT due to conflict with S2MPU code ]
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-4-qperret@google.com
(cherry picked from commit d6b4bd3f48
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I6c48e4e5a0ae969e1d1ac6be0ce601faf138cfd3
To prepare the ground for allowing hyp stage-1 mappings to be removed at
run-time, update the KVM page-table code to maintain a correct refcount
using the ->{get,put}_page() function callbacks.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-3-qperret@google.com
(cherry picked from commit 2ea2ff91e8
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: Ic4ce9177a4381a01e1f2c3b00e9f376e14015718
In nVHE protected mode, the EL2 code uses a temporary allocator during
boot while re-creating its stage-1 page-table. Unfortunately, the
hyp_vmmemap is not ready to use at this stage, so refcounting pages
is not possible. That is not currently a problem because hyp stage-1
mappings are never removed, which implies refcounting of page-table
pages is unnecessary.
In preparation for allowing hypervisor stage-1 mappings to be removed,
provide stub implementations for {get,put}_page() in the early allocator.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-2-qperret@google.com
(cherry picked from commit 1fac3cfb9c
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git next)
Bug: 209599700
Change-Id: I9602cb4db9cef6f52e8beefa52e53cf71f2e0cd1
This reverts commit b66c10e133.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I6e353b0243a86e5590ac0533e29f90dde9a418c4
This reverts commit e96c599591.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I152dd0611638f7ceab361b586f3fa93172f17c9e
This reverts commit edffd3888c.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I86105b5a7baeb1223501457666f0628fa25b55b8
This reverts commit 3239508319.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I1cb6f413cf09a3b56a6e4fd2b83d28bfdc59984f
This reverts commit 446ab9f9b4.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Icae6c1df119bb9ba757c8c53e07729e7e7f4f16f
This reverts commit 21fb63c709.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I042fbb20865571916ba90684ffbbefb980bfa0ef
This reverts commit 33fa24cc3b.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I66488e19919bc3714c54874ac1918e8a1fd119b4
This reverts commit f80bdbb276.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I81ee72a29d8884474b17e944183e9d3372dfdab8
This reverts commit fb29cc8de3.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ia67b33b36bcaf49a77b654ff62fbd604eaa959fc
This reverts commit 455e17002b.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ibbcf494525b1ee746e5e2019172db66edc7276e9
This reverts commit 50e7557b36.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Iad125052e89012b0ad374a2874a83ef608e86141
This reverts commit 0a4821ecc2.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I199989c20d63e61f167fa55f270147e10cc64016
This reverts commit fee11d0f41.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I4b61e149f82874557904f6e6b09421f294f996fe
This reverts commit bcf3fd91be.
This will be replaced by a FROMGIT patch shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I51bd7eff34a9345275ca0957082456b5ec5c9d3f
To allow process_mrelease to reap targeted mm in parallel with exit_mmap
mark the victim with MMF_OOM_VICTIM flag.
Bug: 189803002
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I89cf5f8fbeeb18b93a340b9ebe7f200837ebe846
With exit_mmap holding mmap_write_lock during free_pgtables call,
process_mrelease does not need to elevate mm->mm_users in order to
prevent exit_mmap from destrying pagetables while __oom_reap_task_mm
is walking the VMA tree. The change prevents process_mrelease from
calling the last mmput, which can lead to waiting for IO completion
in exit_aio.
Fixes: 337546e83f ("mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/all/20211124235906.14437-2-surenb@google.com/
Bug: 130172058
Bug: 189803002
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I1e2728e0c477af9cc20e9e0b715ee67dee760618
oom-reaper and process_mrelease system call should protect against
races with exit_mmap which can destroy page tables while they
walk the VMA tree. oom-reaper protects from that race by setting
MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
before taking and releasing mmap_write_lock. process_mrelease has
to elevate mm->mm_users to prevent such race. Both oom-reaper and
process_mrelease hold mmap_read_lock when walking the VMA tree.
The locking rules and mechanisms could be simpler if exit_mmap takes
mmap_write_lock while executing destructive operations such as
free_pgtables.
Change exit_mmap to hold the mmap_write_lock when calling
free_pgtables. Operations like unmap_vmas() and unlock_range() are not
destructive and could run under mmap_read_lock but for simplicity we
take one mmap_write_lock during almost the entire operation. Note
also that because oom-reaper checks VM_LOCKED flag, unlock_range()
should not be allowed to race with it.
In most cases this lock should be uncontended. Previously, Kirill
reported ~4% regression caused by a similar change [1]. We reran the
same test and although the individual results are quite noisy, the
percentiles show lower regression with 1.6% being the worst case [2].
The change allows oom-reaper and process_mrelease to execute safely
under mmap_read_lock without worries that exit_mmap might destroy page
tables from under them.
[1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/all/20211124235906.14437-1-surenb@google.com/
Bug: 130172058
Bug: 189803002
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: Ic87272d09a0b68a1b0e968e8f1a1510fd6fc776a
Race between process_mrelease and exit_mmap, where free_pgtables is
called while __oom_reap_task_mm is in progress, leads to kernel crash
during pte_offset_map_lock call. oom-reaper avoids this race by setting
MMF_OOM_VICTIM flag and causing exit_mmap to take and release
mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
fix this race, however that would be considered a hack. Fix this race
by elevating mm->mm_users and preventing exit_mmap from executing until
process_mrelease is finished. Patch slightly refactors the code to
adapt for a possible mmget_not_zero failure.
This fix has considerable negative impact on process_mrelease
performance and will likely need later optimization.
Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com
Fixes: 884a7e5964 ("mm: introduce process_mrelease system call")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Christian Brauner <christian@brauner.io>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 337546e83f)
Bug: 189803002
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I7cf9c869faa7b746995a94ea93f6a617104385aa
In modern systems it's not unusual to have a system component monitoring
memory conditions of the system and tasked with keeping system memory
pressure under control. One way to accomplish that is to kill
non-essential processes to free up memory for more important ones.
Examples of this are Facebook's OOM killer daemon called oomd and
Android's low memory killer daemon called lmkd.
For such system component it's important to be able to free memory quickly
and efficiently. Unfortunately the time process takes to free up its
memory after receiving a SIGKILL might vary based on the state of the
process (uninterruptible sleep), size and OPP level of the core the
process is running. A mechanism to free resources of the target process
in a more predictable way would improve system's ability to control its
memory pressure.
Introduce process_mrelease system call that releases memory of a dying
process from the context of the caller. This way the memory is freed in a
more controllable way with CPU affinity and priority of the caller. The
workload of freeing the memory will also be charged to the caller. The
operation is allowed only on a dying process.
After previous discussions [1, 2, 3] the decision was made [4] to
introduce a dedicated system call to cover this use case.
The API is as follows,
int process_mrelease(int pidfd, unsigned int flags);
DESCRIPTION
The process_mrelease() system call is used to free the memory of
an exiting process.
The pidfd selects the process referred to by the PID file
descriptor.
(See pidfd_open(2) for further information)
The flags argument is reserved for future use; currently, this
argument must be specified as 0.
RETURN VALUE
On success, process_mrelease() returns 0. On error, -1 is
returned and errno is set to indicate the error.
ERRORS
EBADF pidfd is not a valid PID file descriptor.
EAGAIN Failed to release part of the address space.
EINTR The call was interrupted by a signal; see signal(7).
EINVAL flags is not 0.
EINVAL The memory of the task cannot be released because the
process is not exiting, the address space is shared
with another live process or there is a core dump in
progress.
ENOSYS This system call is not supported, for example, without
MMU support built into Linux.
ESRCH The target process does not exist (i.e., it has terminated
and been waited on).
[1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/
[2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/
[3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/
[4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/
Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 884a7e5964)
Bug: 189803002
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I60d37051acaeff1b7eb7d10aeca23dfa1f2469a3
Changes in 5.10.90
Input: i8042 - add deferred probe support
Input: i8042 - enable deferred probe quirk for ASUS UM325UA
tomoyo: Check exceeded quota early in tomoyo_domain_quota_is_ok().
tomoyo: use hwight16() in tomoyo_domain_quota_is_ok()
parisc: Clear stale IIR value on instruction access rights trap
platform/x86: apple-gmux: use resource_size() with res
memblock: fix memblock_phys_alloc() section mismatch error
recordmcount.pl: fix typo in s390 mcount regex
selinux: initialize proto variable in selinux_ip_postroute_compat()
scsi: lpfc: Terminate string in lpfc_debugfs_nvmeio_trc_write()
net/mlx5: DR, Fix NULL vs IS_ERR checking in dr_domain_init_resources
net/mlx5e: Wrap the tx reporter dump callback to extract the sq
net/mlx5e: Fix ICOSQ recovery flow for XSK
udp: using datalen to cap ipv6 udp max gso segments
selftests: Calculate udpgso segment count without header adjustment
sctp: use call_rcu to free endpoint
net/smc: fix using of uninitialized completions
net: usb: pegasus: Do not drop long Ethernet frames
net: ag71xx: Fix a potential double free in error handling paths
net: lantiq_xrx200: fix statistics of received bytes
NFC: st21nfca: Fix memory leak in device probe and remove
net/smc: improved fix wait on already cleared link
net/smc: don't send CDC/LLC message if link not ready
net/smc: fix kernel panic caused by race of smc_sock
igc: Fix TX timestamp support for non-MSI-X platforms
ionic: Initialize the 'lif->dbid_inuse' bitmap
net/mlx5e: Fix wrong features assignment in case of error
selftests/net: udpgso_bench_tx: fix dst ip argument
net/ncsi: check for error return from call to nla_put_u32
fsl/fman: Fix missing put_device() call in fman_port_probe
i2c: validate user data in compat ioctl
nfc: uapi: use kernel size_t to fix user-space builds
uapi: fix linux/nfc.h userspace compilation errors
drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly enabled
drm/amdgpu: add support for IP discovery gc_info table v2
xhci: Fresco FL1100 controller should not have BROKEN_MSI quirk set.
usb: gadget: f_fs: Clear ffs_eventfd in ffs_data_clear.
usb: mtu3: add memory barrier before set GPD's HWO
usb: mtu3: fix list_head check warning
usb: mtu3: set interval of FS intr and isoc endpoint
binder: fix async_free_space accounting for empty parcels
scsi: vmw_pvscsi: Set residual data length conditionally
Input: appletouch - initialize work before device registration
Input: spaceball - fix parsing of movement data packets
net: fix use-after-free in tw_timer_handler
perf script: Fix CPU filtering of a script's switch events
bpf: Add kconfig knob for disabling unpriv bpf by default
Linux 5.10.90
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I277de1626e4275a3c3b1294a2a106e473a62de6c