Previous commits have introduced infrastructure to enable the EL2 code
to manage its own stage 1 mappings. However, this was preliminary work,
and none of it is currently in use.
Put all of this together by elevating the mapping creation at EL2 when
memory protection is enabled. In this case, the host kernel running
at EL1 still creates _temporary_ EL2 mappings, only used while
initializing the hypervisor, but frees them right after.
As such, all calls to create_hyp_mappings() after kvm init has finished
turn into hypercalls, as the host now has no 'legal' way to modify the
hypevisor page tables directly.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-19-qperret@google.com
Bug: 178098380
Change-Id: I0c93c298bc16fc913c6e0faf51c395aa0215c444
When memory protection is enabled, the EL2 code needs the ability to
create and manage its own page-table. To do so, introduce a new set of
hypercalls to bootstrap a memory management system at EL2.
This leads to the following boot flow in nVHE Protected mode:
1. the host allocates memory for the hypervisor very early on, using
the memblock API;
2. the host creates a set of stage 1 page-table for EL2, installs the
EL2 vectors, and issues the __pkvm_init hypercall;
3. during __pkvm_init, the hypervisor re-creates its stage 1 page-table
and stores it in the memory pool provided by the host;
4. the hypervisor then extends its stage 1 mappings to include a
vmemmap in the EL2 VA space, hence allowing to use the buddy
allocator introduced in a previous patch;
5. the hypervisor jumps back in the idmap page, switches from the
host-provided page-table to the new one, and wraps up its
initialization by enabling the new allocator, before returning to
the host.
6. the host can free the now unused page-table created for EL2, and
will now need to issue hypercalls to make changes to the EL2 stage 1
mappings instead of modifying them directly.
Note that for the sake of simplifying the review, this patch focuses on
the hypervisor side of things. In other words, this only implements the
new hypercalls, but does not make use of them from the host yet. The
host-side changes will follow in a subsequent patch.
Credits to Will for __pkvm_init_switch_pgd.
Acked-by: Will Deacon <will@kernel.org>
Co-authored-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-18-qperret@google.com
Bug: 178098380
Change-Id: I039096f049ad3fa083f56e19fb66ea09645d749a
In order to re-map the guest vectors at EL2 when pKVM is enabled,
refactor __kvm_vector_slot2idx() and kvm_init_vector_slot() to move all
the address calculation logic in a static inline function.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-16-qperret@google.com
Bug: 178098380
Change-Id: I52442e4b873cb881b2306dc2aacd3060812f5520
We will need to do cache maintenance at EL2 soon, so compile a copy of
__flush_dcache_area at EL2, and provide a copy of arm64_ftr_reg_ctrel0
as it is needed by the read_ctr macro.
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-15-qperret@google.com
Bug: 178098380
Change-Id: Icdf042dc83a4486b615514a9bd6e27af0859ce75
Introduce the infrastructure in KVM enabling to copy CPU feature
registers into EL2-owned data-structures, to allow reading sanitised
values directly at EL2 in nVHE.
Given that only a subset of these features are being read by the
hypervisor, the ones that need to be copied are to be listed under
<asm/kvm_cpufeature.h> together with the name of the nVHE variable that
will hold the copy. This introduces only the infrastructure enabling
this copy. The first users will follow shortly.
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-14-qperret@google.com
Bug: 178098380
Change-Id: I20b08439434b1f6f1c48230b60ce113063c0757f
When memory protection is enabled, the hyp code will require a basic
form of memory management in order to allocate and free memory pages at
EL2. This is needed for various use-cases, including the creation of hyp
mappings or the allocation of stage 2 page tables.
To address these use-case, introduce a simple memory allocator in the
hyp code. The allocator is designed as a conventional 'buddy allocator',
working with a page granularity. It allows to allocate and free
physically contiguous pages from memory 'pools', with a guaranteed order
alignment in the PA space. Each page in a memory pool is associated
with a struct hyp_page which holds the page's metadata, including its
refcount, as well as its current order, hence mimicking the kernel's
buddy system in the GFP infrastructure. The hyp_page metadata are made
accessible through a hyp_vmemmap, following the concept of
SPARSE_VMEMMAP in the kernel.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-13-qperret@google.com
Bug: 178098380
Change-Id: Id06e90de959a4e23df6d62f9072eab954f5b783f
With nVHE, the host currently creates all stage 1 hypervisor mappings at
EL1 during boot, installs them at EL2, and extends them as required
(e.g. when creating a new VM). But in a world where the host is no
longer trusted, it cannot have full control over the code mapped in the
hypervisor.
In preparation for enabling the hypervisor to create its own stage 1
mappings during boot, introduce an early page allocator, with minimal
functionality. This allocator is designed to be used only during early
bootstrap of the hyp code when memory protection is enabled, which will
then switch to using a full-fledged page allocator after init.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-11-qperret@google.com
Bug: 178098380
Change-Id: Ibe4c77634a293565171dc22b196065699a2a1f06
In order to allow the usage of code shared by the host and the hyp in
static inline library functions, allow the usage of kvm_nvhe_sym() at
EL2 by defaulting to the raw symbol name.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-10-qperret@google.com
Bug: 178098380
Change-Id: If82eb4fadb95772646d2215028dbb70a05a6a671
kvm_call_hyp() has some logic to issue a function call or a hypercall
depending on the EL at which the kernel is running. However, all the
code compiled under __KVM_NVHE_HYPERVISOR__ is guaranteed to only run
at EL2 which allows us to simplify.
Add ifdefery to kvm_host.h to simplify kvm_call_hyp() in .hyp.text.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-9-qperret@google.com
Bug: 178098380
Change-Id: I8bfe2e5f8febeb7f9e75cfe60e18f24af7b9797c
Currently, the hyp code cannot make full use of a bss, as the kernel
section is mapped read-only.
While this mapping could simply be changed to read-write, it would
intermingle even more the hyp and kernel state than they currently are.
Instead, introduce a __hyp_bss section, that uses reserved pages, and
create the appropriate RW hyp mappings during KVM init.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-8-qperret@google.com
Bug: 178098380
Change-Id: Iec51e75deebf8db9feb7358a7c6e4892c2624c5d
In preparation for enabling the creation of page-tables at EL2, factor
all memory allocation out of the page-table code, hence making it
re-usable with any compatible memory allocator.
No functional changes intended.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-7-qperret@google.com
Bug: 178098380
Change-Id: Ic72f50d58498435d3490931a3e1d695cd19d3b9d
Currently, the KVM page-table allocator uses a mix of put_page() and
free_page() calls depending on the context even though page-allocation
is always achieved using variants of __get_free_page().
Make the code consistent by using put_page() throughout, and reduce the
memory management API surface used by the page-table code. This will
ease factoring out page-allocation from pgtable.c, which is a
pre-requisite to creating page-tables at EL2.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-6-qperret@google.com
Bug: 178098380
Change-Id: I998e88dd31decc7dd1141af9dbda5e60c8370baa
Move the initialization of kvm_nvhe_init_params in a dedicated function
that is run early, and only once during KVM init, rather than every time
the KVM vectors are set and reset.
This also opens the opportunity for the hypervisor to change the init
structs during boot, hence simplifying the replacement of host-provided
page-table by the one the hypervisor will create for itself.
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-5-qperret@google.com
Bug: 178098380
Change-Id: I83ee2ac4889ce39f8d02da1d9682f813713a1540
We will soon need to synchronise multiple CPUs in the hyp text at EL2.
The qspinlock-based locking used by the host is overkill for this purpose
and relies on the kernel's "percpu" implementation for the MCS nodes.
Implement a simple ticket locking scheme based heavily on the code removed
by commit c11090474d ("arm64: locking: Replace ticket lock implementation
with qspinlock").
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-4-qperret@google.com
Bug: 178098380
Change-Id: Iecc108a89c71689ba281bf3e15dd0b1015595f1a
Pull clear_page(), copy_page(), memcpy() and memset() into the nVHE hyp
code and ensure that we always execute the '__pi_' entry point on the
offchance that it changes in future.
[ qperret: Commit title nits and added linker script alias ]
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20210315143536.214621-3-qperret@google.com
Bug: 178098380
Change-Id: Ie4a82f71d87f3151f8d5e1054b0fd1f8ed44d583
Remove the set_sugov_sched_attr hook which is no longer needed with a
modular governor. The IOWait hook must stay, however.
Bug: 171598214
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ie68df673bc78ca76c90ba1e6c32ecaa4bba10c89
Add logic in enable_thermal_genl_check to avoid
thermal-hal being woken up all the time by on-die sensor.
When the temperature of the on-die sensor thermal_zone updates,
the thermal_zone_device_update() will be called to handle thermal
and then the netlink events will be sent
through thermal_notify_tz_trip_().
But there is currently no way to filter thermal_zone
in the thermal_netlink, so many netlink events will be sent
in a short amount of time. Because the on-die sensors
are very sensitive. Their temperature may jump up and down
quickly in a few ms.
Bug: 170682696
Test: boot and thermal-hal can receive thermal genl events from kernel
Change-Id: Idb3f4b07a2a2740c01d8785910878bfe6edc832d
Signed-off-by: davidchao <davidchao@google.com>
Currently when an MMC host driver declares crypto support, the MMC core
passes a keyslot and 32-bit DUN down to the driver in each request.
This is enough for the "standard" eMMC v5.2 crypto (cqhci-crypto).
However some SoCs don't follow this standard. They don't have keyslots
but rather take keys directly, and they support longer DUNs.
To allow such hardware to be supported, modify the MMC core to pass a
pointer to the bio_crypt_ctx along with each request, replacing the
crypto_enabled and data_unit_num fields. This way is more flexible.
Also update cqhci-crypto accordingly to keep it working.
(Not being sent upstream yet because this isn't really useful by itself;
it would need support for hardware that needs it to be upstreamed too.)
Bug: 180886435
Bug: 182283899
Change-Id: I8faf3135f570ca3f2798cf812b4245a5df5515ba
Signed-off-by: Eric Biggers <ebiggers@google.com>
commit 0d8359620d ("zram: support page writeback") introduced two
problems. It overwrites writeback_store's return value as kstrtol's
return value, which makes return value zero so user could see zero as
return value of write syscall even though it wrote data successfully.
It also breaks index value in the loop in that it doesn't increase the
index any longer. It means it can write only first starting block index
so user couldn't write all idle pages in the zram so lose memory saving
chance.
This patch fixes those issues.
Link: https://lkml.kernel.org/r/20210312173949.2197662-2-minchan@kernel.org
Fixes: 0d8359620d9b("zram: support page writeback")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Amos Bianchi <amosbianchi@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: John Dias <joaodias@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 2766f18216)
Bug: 182850185
Signed-off-by: Amos Bianchi <amosbianchi@google.com>
Change-Id: I9836b6350727dc4fffc1cc939647235781f0a4ee
Vendor may have its own estimated utilization.
Bug: 170508405
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: I6055907de75ace4586c3ad854d40f42e3bf40147
Enable DTPM framework, so that userspace daemons can control the max
power cap for a device.
Bug: 182396925
Change-Id: I7a1bdb658289843af8f4bb6824b424eb76b32143
Signed-off-by: Ram Chandrasekar <quic_rkumbako@quicinc.com>
This change adds the symbols that allow subsystems to collect minidumps.
Bug: 180426943
Change-Id: Idfda557600606aeca0c912ca0b3cedab8ff7c23e
Signed-off-by: Siddharth Gupta <quic_sidgup@quicinc.com>
Commit b0841eefd9 ("configfs: provide exclusion between IO and removals")
uses ->frag_dead to mark the fragment state, thus no bothering with extra
refcount on config_item when opening a file. The configfs_get_config_item
was removed in __configfs_open_file, but not with config_item_put. So the
refcount on config_item will lost its balance, causing use-after-free
issues in some occasions like this:
Test:
1. Mount configfs on /config with read-only items:
drwxrwx--- 289 root root 0 2021-04-01 11:55 /config
drwxr-xr-x 2 root root 0 2021-04-01 11:54 /config/a
--w--w--w- 1 root root 4096 2021-04-01 11:53 /config/a/1.txt
......
2. Then run:
for file in /config
do
echo $file
grep -R 'key' $file
done
3. __configfs_open_file will be called in parallel, the first one
got called will do:
if (file->f_mode & FMODE_READ) {
if (!(inode->i_mode & S_IRUGO))
goto out_put_module;
config_item_put(buffer->item);
kref_put()
package_details_release()
kfree()
the other one will run into use-after-free issues like this:
BUG: KASAN: use-after-free in __configfs_open_file+0x1bc/0x3b0
Read of size 8 at addr fffffff155f02480 by task grep/13096
CPU: 0 PID: 13096 Comm: grep VIP: 00 Tainted: G W 4.14.116-kasan #1
TGID: 13096 Comm: grep
Call trace:
dump_stack+0x118/0x160
kasan_report+0x22c/0x294
__asan_load8+0x80/0x88
__configfs_open_file+0x1bc/0x3b0
configfs_open_file+0x28/0x34
do_dentry_open+0x2cc/0x5c0
vfs_open+0x80/0xe0
path_openat+0xd8c/0x2988
do_filp_open+0x1c4/0x2fc
do_sys_open+0x23c/0x404
SyS_openat+0x38/0x48
Allocated by task 2138:
kasan_kmalloc+0xe0/0x1ac
kmem_cache_alloc_trace+0x334/0x394
packages_make_item+0x4c/0x180
configfs_mkdir+0x358/0x740
vfs_mkdir2+0x1bc/0x2e8
SyS_mkdirat+0x154/0x23c
el0_svc_naked+0x34/0x38
Freed by task 13096:
kasan_slab_free+0xb8/0x194
kfree+0x13c/0x910
package_details_release+0x524/0x56c
kref_put+0xc4/0x104
config_item_put+0x24/0x34
__configfs_open_file+0x35c/0x3b0
configfs_open_file+0x28/0x34
do_dentry_open+0x2cc/0x5c0
vfs_open+0x80/0xe0
path_openat+0xd8c/0x2988
do_filp_open+0x1c4/0x2fc
do_sys_open+0x23c/0x404
SyS_openat+0x38/0x48
el0_svc_naked+0x34/0x38
To fix this issue, remove the config_item_put in
__configfs_open_file to balance the refcount of config_item.
Fixes: b0841eefd9 ("configfs: provide exclusion between IO and removals")
Signed-off-by: Daiyue Zhang <zhangdaiyue1@huawei.com>
Signed-off-by: Yi Chen <chenyi77@huawei.com>
Signed-off-by: Ge Qiu <qiuge@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Bug: 174049066
(cherry picked from commit 14fbbc8297
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master)
Cc: Daniel Rosenberg <drosen@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I745711ed546c1be97360670b77aedfb929bffcec
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Add vendor hook for module init, so we can get memory type and
use it to do memory type check for architecture
dependent page table setting.
Bug: 181639260
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: I95b70d7a57994f2548fddfb2290d4c9136f58785
Add vendor hook for bpf, so we can get memory type and
use it to do memory type check for architecture
dependent page table setting.
Bug: 181639260
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Icac325a040fb88c7f6b04b2409029b623bd8515f
Add vendor hook for creds, so we get the cred information
to monitor cred lifetime.
Bug: 181639260
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: I8f254464e07f9c88336995152479ce91deb13c75
Add vendor hook for avc, so we can get avc_node information
to monitor avc lifetime.
Bug: 181639260
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Idbebeca926c2cb407264f2872b032e1f18462697
When registering a memslot, we check the size and location of that
memslot against the IPA size to ensure that we can provide guest
access to the whole of the memory.
Unfortunately, this check rejects memslot that end-up at the exact
limit of the addressing capability for a given IPA size. For example,
it refuses the creation of a 2GB memslot at 0x8000000 with a 32bit
IPA space.
Fix it by relaxing the check to accept a memslot reaching the
limit of the IPA space.
Fixes: c3058d5da2 ("arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE")
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Andrew Jones <drjones@redhat.com>
Link: https://lore.kernel.org/r/20210311100016.3830038-3-maz@kernel.org
(cherry picked from commit 262b003d05)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 178098380
Test: atest VirtualizationHostTestCases on an EL2-enabled device
Change-Id: I82fcfb372b375c2e35997fdc4b33dbd33f056887
KVM/arm64 has forever used a 40bit default IPA space, partially
due to its 32bit heritage (where the only choice is 40bit).
However, there are implementations in the wild that have a *cough*
much smaller *cough* IPA space, which leads to a misprogramming of
VTCR_EL2, and a guest that is stuck on its first memory access
if userspace dares to ask for the default IPA setting (which most
VMMs do).
Instead, blundly reject the creation of such VM, as we can't
satisfy the requirements from userspace (with a one-off warning).
Also clarify the boot warning, and document that the VM creation
will fail when an unsupported IPA size is provided.
Although this is an ABI change, it doesn't really change much
for userspace:
- the guest couldn't run before this change, but no error was
returned. At least userspace knows what is happening.
- a memory slot that was accepted because it did fit the default
IPA space now doesn't even get a chance to be registered.
The other thing that is left doing is to convince userspace to
actually use the IPA space setting instead of relying on the
antiquated default.
Fixes: 233a7cb235 ("kvm: arm64: Allow tuning the physical address size for VM")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20210311100016.3830038-2-maz@kernel.org
(cherry picked from commit 7d717558dd)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 178098380
Test: atest VirtualizationHostTestCases on an EL2-enabled device
Change-Id: I36332fbe5606affd151976a6354f3e9c45fc214a
This header file is to be used for various macros to help make keeping
the kernel ABI "stable" during an "ABI Freeze" period.
They are to be used both before the freeze (to anticipate places where
there will be changes), and after the freeze (to keep the abi stable for
structures where there were changes due to LTS or other changes to the
kernel tree.)
Strongly based on rh_kabi.h from Red Hat's RHEL kernel tree.
This adds support for "real" padding and the ability to replace fields
with other fields.
But, note that ABI changes will still be caught by libabigail at this
point in time, work on that is still ongoing. When that is completed,
all that will be needed is to modify the _ANDROID_KABI_RESERVE() macro
in this file. No other file changes should be needed.
Bug: 151154716
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I77038cc251c819c3ed22a9cb8843b185416b6727
This commit leaves a TODO in the code so we can more easily keep track
of the workaround.
Bug: 182572011
Fixes: be409db652 ("ANDROID: Clang LTO: Only set -fvisibility=hidden for x86")
Signed-off-by: Giuliano Procida <gprocida@google.com>
Change-Id: I0277a8c0704e5caf8885884ec4e3642ebe03667f
Pages containing buffer_heads that are in one of the per-CPU
buffer_head LRU caches will be pinned and thus cannot be migrated.
This can prevent CMA allocations from succeeding, which are often used
on platforms with co-processors (such as a DSP) that can only use
physically contiguous memory. It can also prevent memory
hot-unplugging from succeeding, which involves migrating at least
MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
GiB based on the architecture in use.
Correspondingly, invalidate the BH LRU caches before a migration
starts and stop any buffer_head from being cached in the LRU caches,
until migration has finished.
Bug: 180018981
Link: https://lore.kernel.org/linux-mm/20210310161429.399432-3-minchan@kernel.org/
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I7ac085c2ec14a81c3c4d7b65a7eeedb0cfba4ea6
LRU pagevec holds refcount of pages until the pagevec are drained.
It could prevent migration since the refcount of the page is greater
than the expection in migration logic. To mitigate the issue,
callers of migrate_pages drains LRU pagevec via migrate_prep or
lru_add_drain_all before migrate_pages call.
However, it's not enough because pages coming into pagevec after the
draining call still could stay at the pagevec so it could keep
preventing page migration. Since some callers of migrate_pages have
retrial logic with LRU draining, the page would migrate at next trail
but it is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages into pagvec and migration so the migration
failure could cause contiguous memory allocation failure in the end.
To close the race, this patch disables lru caches(i.e, pagevec)
during ongoing migration until migrate is done.
Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode(it is about a fallback to a
sync migration) with below debug code.
int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..
if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}
The test was repeating android apps launching with cma allocation
in background every five seconds. Total cma allocation count was
about 500 during the testing. With this patch, the dump_page count
was reduced from 400 to 30.
The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with pcp allocator cache which are disabled for the offlining as
well.
Bug: 180018981
Link: https://lore.kernel.org/linux-mm/20210310161429.399432-2-minchan@kernel.org/
[minchan: Resolved conflict in mm/memory_hotplug.c, mm/swap.c]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ie1e09cf26e3105b674a9aed4ac65070efee608af
Currently, migrate_prep is merely a wrapper of lru_cache_add_all.
There is not much to gain from having additional abstraction.
Use lru_add_drain_all instead of migrate_prep, which would be more
descriptive.
note: migrate_prep_local in compaction.c changed into lru_add_drain
to avoid CPU schedule cost with involving many other CPUs to keep
keep old behavior.
Bug: 180018981
Link: https://lore.kernel.org/linux-mm/20210310161429.399432-1-minchan@kernel.org/
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I1bd3fcb13993e8a7a7961ceec817ac17304364cb
GKI has CONFIG_DYNAMIC_DEBUG_CORE. Thus, to enable only the
specific alloc_contig_dump_pages without needing all pr_debug
in every source files is using -DCONFIG_DYNAMIC_MODULE
when the page_alloc.o compiled.
Bug: 182195592
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I93266eb4161b3653389c737d4c767fd5d1083339
Currently, debugging CMA allocation failures is quite limited.
The most common source of these failures seems to be page
migration which doesn't provide any useful information on the
reason of the failure by itself. alloc_contig_range can report
those failures as it holds a list of migrate-failed pages.
The information logged by dump_page() has already proven helpful
for debugging allocation issues, like identifying long-term
pinnings on ZONE_MOVABLE or MIGRATE_CMA.
Let's use the dynamic debugging infrastructure, such that we
avoid flooding the logs and creating a lot of noise on frequent
alloc_contig_range() calls. This information is helpful for
debugging only.
There are two ifdefery conditions to support common dyndbg options:
- CONFIG_DYNAMIC_DEBUG_CORE && DYNAMIC_DEBUG_MODULE
It aims for supporting the feature with only specific file
with adding ccflags.
- CONFIG_DYNAMIC_DEBUG
It aims for supporting the feature with system wide globally.
A simple example to enable the feature:
Admin could enable the dump like this(by default, disabled)
echo "func alloc_contig_dump_pages +p" > control
Admin could disable it.
echo "func alloc_contig_dump_pages =_" > control
Detail goes Documentation/admin-guide/dynamic-debug-howto.rst
A concern is utility functions in dump_page use inconsistent
loglevels. In the future, we might want to make the loglevels
used inside dump_page() consistent and eventually rework the way
we log the information here. See [1].
[1] https://lore.kernel.org/linux-mm/YEh4doXvyuRl5BDB@google.com/
Bug: 182195592
Link: https://lore.kernel.org/linux-mm/20210311194042.825152-1-minchan@kernel.org/
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Change-Id: I4db99a76a543d559a05515d6800c66ab48196978