To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
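
As a rough userspace sketch (not the code added by this patch), the
conditions under which the aging and the eviction move the two ends of
the window can be modeled as below. The helper names are hypothetical,
"approaches MIN_NR_GENS" is read here as "is no larger than
MIN_NR_GENS", and the guard that keeps at least MIN_NR_GENS generations
in the eviction path is an assumption drawn from the sliding window
description.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2UL	/* two generations, as the second-chance protocol needs */

/* Number of generations currently inside the sliding window. */
static inline unsigned long nr_gens(unsigned long min_seq, unsigned long max_seq)
{
	return max_seq - min_seq + 1;
}

/* Aging: a new youngest generation is due once the window gets this small. */
static inline bool aging_should_inc_max_seq(unsigned long min_seq,
					    unsigned long max_seq)
{
	return nr_gens(min_seq, max_seq) <= MIN_NR_GENS;
}

/* Eviction: retire the oldest generation once its lists are empty (assumed guard). */
static inline bool eviction_may_inc_min_seq(unsigned long min_seq,
					    unsigned long max_seq,
					    unsigned long nr_pages_oldest)
{
	return nr_pages_oldest == 0 && nr_gens(min_seq, max_seq) > MIN_NR_GENS;
}

static inline void window_example(void)
{
	assert(aging_should_inc_max_seq(5, 6));		/* 2 generations left */
	assert(!aging_should_inc_max_seq(5, 8));	/* 4 generations */
	assert(eviction_may_inc_min_seq(5, 8, 0));	/* oldest is empty */
	assert(!eviction_may_inc_min_seq(5, 8, 3));	/* oldest still has pages */
}
```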
Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
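
The mapping from access counts to tiers described above can be
illustrated with the following userspace sketch. order_base_2_sketch()
is a hypothetical stand-in for the kernel's order_base_2(), written out
here only to make the rounding visible.

```c
#include <assert.h>

/* Smallest k such that (1 << k) >= n; both 0 and 1 map to 0. */
static inline unsigned int order_base_2_sketch(unsigned long n)
{
	unsigned int order = 0;

	while (order < 8 * sizeof(n) - 1 && (1UL << order) < n)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
static inline unsigned int tier_from_fd_accesses(unsigned long n)
{
	return order_base_2_sketch(n);
}

static inline void tier_example(void)
{
	assert(tier_from_fd_accesses(0) == 0);	/* first tier: N=0,1 */
	assert(tier_from_fd_accesses(1) == 0);
	assert(tier_from_fd_accesses(2) == 1);
	assert(tier_from_fd_accesses(4) == 2);	/* N=3,4 share a tier */
	assert(tier_from_fd_accesses(5) == 3);
}
```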
Server benchmark results:
  Single workload:
    fio (buffered I/O): +[38, 40]%
                          IOPS         BW
      5.18-ed4643521e6a:  2547k        9989MiB/s
      patch1-6:           3540k        13.5GiB/s

  Single workload:
    memcached (anon): +[103, 107]%
                          Ops/sec      KB/sec
      5.18-ed4643521e6a:  469048.66    18243.91
      patch1-6:           964656.80    37520.88

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < page = alloc_page(gfp_flags);
    ---
    > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.18-ed4643521e6a
      39.56%  page_vma_mapped_walk
      19.32%  lzo1x_1_do_compress (real work)
       7.18%  do_raw_spin_lock
       4.23%  _raw_spin_unlock_irq
       2.26%  vma_interval_tree_subtree_search
       2.12%  vma_interval_tree_iter_next
       2.11%  folio_referenced_one
       1.90%  anon_vma_interval_tree_iter_first
       1.47%  ptep_clear_flush
       0.97%  __anon_vma_interval_tree_subtree_search

    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
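
A minimal userspace sketch of the truncation follows. MAX_NR_GENS is
given an assumed value of 4, the helper names are hypothetical, and
storing "index + 1" in the gen counter is an inference from the [1,
MAX_NR_GENS] range stated above rather than a quote of the patch.

```c
#include <assert.h>

#define MAX_NR_GENS 4U	/* assumed value, for illustration only */

/* Truncate a monotonically increasing sequence number into a list index. */
static inline unsigned int gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* Gen counter value: index + 1 while on a list, 0 otherwise (assumed encoding). */
static inline unsigned int gen_counter_value(unsigned long seq, int on_list)
{
	return on_list ? gen_from_seq(seq) + 1 : 0;
}

static inline void gen_example(void)
{
	assert(gen_from_seq(5) == 1);		/* 5 % 4 */
	assert(gen_counter_value(7, 1) == 4);	/* index 3, stored as 4 */
	assert(gen_counter_value(7, 0) == 0);	/* not on any list */
}
```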
There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].
To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.
The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking the rendering
threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed [3][4].
The next patch will address the "outlying refaults". Three macros,
i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
added in this patch to make the entire patchset less diffy.
A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/
Link: https://lore.kernel.org/r/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512
Some architectures support the accessed bit in non-leaf PMD entries,
e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
as part of linear address translation [1]. Page table walkers that
clear the accessed bit may use this capability to reduce their search
space.
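
As a toy model of the reduced search space (not kernel code), assume
each non-leaf PMD entry covers 512 PTEs and that a walker may skip every
PMD entry whose accessed bit is clear. The names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_PMD 512UL	/* 4 KiB pages with 8-byte entries; assumed */

/* Count the PTEs a walker visits when it skips PMD entries that are not young. */
static inline unsigned long ptes_to_scan(const bool *pmd_young, size_t nr_pmds)
{
	unsigned long total = 0;

	for (size_t i = 0; i < nr_pmds; i++)
		if (pmd_young[i])
			total += PTES_PER_PMD;
	return total;
}

static inline void pmd_young_example(void)
{
	const bool young[4] = { true, false, false, true };

	/* Only two of the four PMD ranges need a PTE-level scan. */
	assert(ptes_to_scan(young, 4) == 2 * PTES_PER_PMD);
}
```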
Note that:
1. Although an inline function is preferable, this capability is added
as a configuration option for consistency with the existing macros.
2. Due to the little interest in other varieties, this capability was
only tested on Intel and AMD CPUs.
[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 3 (June 2021), section 4.8
Link: https://lore.kernel.org/r/20220309021230.721028-3-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I73f84a21fd315192eaa3e6443334ed1bccb4e99e
Some architectures automatically set the accessed bit in PTEs, e.g.,
x86 and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault
following the TLB miss of this PTE (to emulate the accessed bit).
Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty
page faults when trying to clear the accessed bit in many PTEs.
Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it
should not be used in architecture-independent code that involves
correctness, e.g., to determine whether TLB flushes are required (in
combination with the accessed bit).
Link: https://lore.kernel.org/r/20220309021230.721028-2-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ie81175d7e0d239f688d31487b298cf9b4fb66707
The naming convention used in include/linux/page-flags-layout.h:
  *_SHIFT: the number of bits trying to allocate
  *_WIDTH: the number of bits successfully allocated
So when it comes to LAST_CPUPID_WIDTH, we need to check whether all
previous *_WIDTH and LAST_CPUPID_SHIFT can fit into page flags. This
means we need to use NODES_WIDTH, not NODES_SHIFT.
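
The check can be sketched as a plain function, assuming the fallback
width is 0 when the bits do not fit. The function name and the numbers
in the example are made up for illustration and do not correspond to a
real configuration.

```c
#include <assert.h>

/* Bits granted to last_cpupid: the full request if everything fits, else 0. */
static inline int last_cpupid_width(int sections_width, int zones_width,
				    int nodes_width, int last_cpupid_shift,
				    int bits_per_long, int nr_pageflags)
{
	int used = sections_width + zones_width + nodes_width + last_cpupid_shift;

	return used <= bits_per_long - nr_pageflags ? last_cpupid_shift : 0;
}

static inline void width_example(void)
{
	/* Nodes got 0 bits (width) although 4 were requested (shift). */
	assert(last_cpupid_width(0, 2, 0, 8, 32, 20) == 8);	/* using NODES_WIDTH */
	assert(last_cpupid_width(0, 2, 4, 8, 32, 20) == 0);	/* using NODES_SHIFT */
}
```

Counting the requested node bits instead of the allocated ones makes the
sum look larger than it is, so last_cpupid can be denied space that is
actually free.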
Link: https://lkml.kernel.org/r/20210303071609.797782-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f73c6c8805)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I6d7c58cf5d10e302adc818ac7e1fd727208d23c8
We are capable of SetPageWorkingset based on refault distances after
commit aae466b005 ("mm/swap: implement workingset detection for
anonymous LRU"). This is done by workingset_refault(), which is right
above the unconditional SetPageWorkingset deleted by this patch.
The unconditional SetPageWorkingset miscategorizes pages that are read
ahead or never belonged to the working set (e.g., tmpfs pages accessed
only once by fd). When those pages are swapped in (after they were
swapped out) for the first time, they skew PSI (when using async swap).
When this happens again, depending on their refault distances, they might
skew workingset_restore_anon counter in addition to PSI because their
shadows indicate they were part of the working set.
Historically, SetPageWorkingset was added as part of the PSI series, and
Johannes said:
"It was meant to mark incoming pages under IO with SetPageWorkingset
when waiting for them constituted a memory stall.
On the page cache side, because we HAVE workingset detection, this was
specific to recently evicted pages that had been active in their
previous life. On the anon side, the aging algorithm had no
distinction between workingset and sporadically used pages. Given the
choice between a) no swapin stalls are pressure and b) all swapin
stalls are pressure, I went with the latter in order to detect swap
storms. The false positive case - high rate of swapin without severe
memory pressure - was relatively unlikely, because we tried to avoid
swapping until everything was completely on fire in the first place."
Link: https://lkml.kernel.org/r/20201209012400.1771150-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20201214231253.62313-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cad8320b4b)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ifa9c5fa2e875e6ccee6c3f7e2d2983278d54c220
Patch series "mm: lru related cleanups", v2.
The cleanups are intended to reduce the verbosity in lru list operations
and make them less error-prone. A typical example would be how the
patches change __activate_page():
   static void __activate_page(struct page *page, struct lruvec *lruvec)
   {
           if (!PageActive(page) && !PageUnevictable(page)) {
  -                int lru = page_lru_base_type(page);
                   int nr_pages = thp_nr_pages(page);

  -                del_page_from_lru_list(page, lruvec, lru);
  +                del_page_from_lru_list(page, lruvec);
                   SetPageActive(page);
  -                lru += LRU_ACTIVE;
  -                add_page_to_lru_list(page, lruvec, lru);
  +                add_page_to_lru_list(page, lruvec);
                   trace_mm_lru_activate(page);

There are a few more places like __activate_page() and they are
unnecessarily repetitive in terms of figuring out which list a page should
be added onto or deleted from. And with the duplicated code removed, they
are easier to read, IMO.
Patch 1 to 5 basically cover the above. Patch 6 and 7 make code more
robust by improving bug reporting. Patch 8, 9 and 10 take care of some
dangling helpers left in header files.
This patch (of 10):
There is add_page_to_lru_list(), and move_pages_to_lru() should reuse it,
not duplicate it.
Link: https://lkml.kernel.org/r/20210122220600.906146-1-yuzhao@google.com
Link: https://lore.kernel.org/linux-mm/20201207220949.830352-2-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 42895ea73b)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I7e09be6bedcd451c4e8c790c969306b6ca3adebd
This allows Bazel to load the value of $BRANCH in order
to determine the value of --dist_dir of copy_to_dist_dir
statically.
Test: TH
Bug: 229268271
Change-Id: Iff759b8188360ea1b2bc204d29750eece9095582
Signed-off-by: Yifan Hong <elsk@google.com>
Currently trying to move or delete a memslot results in a warning
and a failure. Userspace shouldn't be able to trigger kernel
warnings.
The cause is that in protected mode, stage-2 is managed by hyp.
Modifying a memslot flushes the shadow memslot, which tries to
unmap any stage-2 mapped pages.
Bug: 226890762
Signed-off-by: Fuad Tabba <tabba@google.com>
Change-Id: Icc6a0aada76e8492285cd5509bad1ee57700af7c
We had a size mismatch for the return value, leading to EIOCBQUEUED
getting interpreted as a return size instead of an error code.
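
One way such a mismatch can turn an error code into a size is sketched
below. The exact types involved in the fuse code are not shown in this
message, so the 32-bit unsigned intermediate is an assumption and the
helper names are hypothetical; 529 is the value of EIOCBQUEUED in
include/linux/errno.h.

```c
#include <assert.h>
#include <stdint.h>

#define EIOCBQUEUED_SKETCH 529	/* EIOCBQUEUED in include/linux/errno.h */

/* Buggy shape: the status passes through an unsigned 32-bit slot. */
static inline int64_t widen_through_u32(int32_t status)
{
	uint32_t narrow = (uint32_t)status;

	return (int64_t)narrow;	/* zero-extended, so the sign is lost */
}

/* Fixed shape: the status stays signed all the way. */
static inline int64_t widen_signed(int32_t status)
{
	return (int64_t)status;
}

static inline void retval_example(void)
{
	/* The error now looks like a huge positive byte count. */
	assert(widen_through_u32(-EIOCBQUEUED_SKETCH) == 4294966767LL);
	assert(widen_signed(-EIOCBQUEUED_SKETCH) == -529);
}
```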
Test: generic/467, generic/013, and fuse_test
Bug: 217570523
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I64f9d5263f8b37d3c0e286467f9351997b294cc2
Allocate the iocb we create for asynchronous I/O from a cache instead of
with a regular kzalloc().
Test: generic/467 and fuse_test
Bug: 217570523
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I27dcec89cd585835f6a8e80e1ae30c503f4038c8
The current name is a bit confusing. iocb_fuse could refer to the iocb
passed to fuse or created by fuse. The new name unambiguously refers to
the one passed in to fuse.
Test: compiles, behavior unchanged
Bug: 217570523
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I955500eb8a3186252427fd06ca6e99b4fec469b6
Existing fixattr was adjusting the same node twice.
Bug: 226655982
Test: generic/241 generic/269
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I4b1cb6d626ee6bd9010012ac126b78f14d6157d0
Fuse uses generic_file_llseek, so we must account for that in readdir to
ensure we read from the correct offset in the lower filesystem.
Bug: 226655281
Test: generic/257, fuse_test
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ie752c1c645e95b7c03ef9497562758a5c42b514a
MIGRATE_CMA is defined only when CONFIG_CMA is enabled. Thus, we
cannot use MIGRATE_CMA directly in code that has to build with both
!CONFIG_CMA and CONFIG_CMA.
Let's use MIGRATE_RECLAIMABLE in that case.
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Idb4fc6f4ea02ab074f270ce62001182c8fff3b37
Since Android has a pcp list for MIGRATE_CMA [1], a MIGRATE_ISOLATE
page may not be freed immediately, which can add latency to CMA
allocation.
Originally, a MIGRATE_ISOLATE page is supposed to go straight to the
buddy list, skipping the pcp list. Otherwise, the page could be
reallocated from the pcp list, or stay on the pcp list until the pcp is
drained, so CMA keeps retrying because it cannot find the freed page on
the buddy list. That used to work because CMA pageblocks only changed
from MIGRATE_CMA to MIGRATE_ISOLATE, and the free path in the page
allocator checked every CMA page for MIGRATE_ISOLATE using the logic
below.
    free_unref_page_commit
      if (migratetype >= MIGRATE_PCPTYPES)
        if (is_migrate_isolate(migratetype))
          free_one_page(page);
It worked because MIGRATE_CMA was greater than MIGRATE_PCPTYPES in the
enum, but since [1] MIGRATE_CMA is less than MIGRATE_PCPTYPES, so the
logic above no longer works.
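
The effect of the enum ordering on the outer check can be sketched as
follows. Both enums are simplified and the Android ordering in
particular is an assumption based on the description here, not a copy
of mmzone.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Upstream-style ordering (simplified): CMA sits above MIGRATE_PCPTYPES. */
enum upstream_mt {
	U_UNMOVABLE, U_MOVABLE, U_RECLAIMABLE,
	U_PCPTYPES, U_HIGHATOMIC = U_PCPTYPES,
	U_CMA, U_ISOLATE
};

/* Assumed Android ordering with a CMA pcp list: CMA sits below it. */
enum android_mt {
	A_UNMOVABLE, A_MOVABLE, A_RECLAIMABLE, A_CMA,
	A_PCPTYPES, A_HIGHATOMIC = A_PCPTYPES,
	A_ISOLATE
};

/* Does a page with this cached migratetype reach the isolate check? */
static inline bool reaches_isolate_check_upstream(int mt)
{
	return mt >= U_PCPTYPES;
}

static inline bool reaches_isolate_check_android(int mt)
{
	return mt >= A_PCPTYPES;
}

static inline void ordering_example(void)
{
	assert(reaches_isolate_check_upstream(U_CMA));	/* CMA pages were checked */
	assert(!reaches_isolate_check_android(A_CMA));	/* now they bypass it */
}
```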
It could cause the following race:

        CPU 0                                   CPU 1

    free_unref_page
    migratetype = get_pfnblock_migratetype()
    set_pcppage_migratetype(MIGRATE_CMA)
                                            cma_alloc
                                            alloc_contig_range
                                            set_migrate_isolate(MIGRATE_ISOLATE)
    add the page into pcp list
    the page could be reallocated

This patch cannot fix the race completely because the order-0 page free
path does not take zone->lock (for performance reasons). However, that
is not a new problem, so it needs to be dealt with separately.
[1] ANDROID: mm: add cma pcp list
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ibea20085ce5bfb4b74b83b041f9bda9a380120f9
Ensure that the FFA memory range to be checked and annotated in the host
stage-2 page-table is page-aligned and that its size is calculated using
64-bit arithmetic to avoid the host triggering overflow and subsequent
truncation.
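
The two checks can be sketched in userspace C as below. The helper
names, the 4 KiB page size and the 32-bit page count are assumptions
made for illustration; the hypervisor code may be structured
differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096U	/* assumed page size */

static inline bool is_page_aligned(uint64_t addr)
{
	return (addr & (SKETCH_PAGE_SIZE - 1)) == 0;
}

/* Size computed in 32 bits: wraps for large page counts. */
static inline uint32_t range_size_32(uint32_t pg_cnt)
{
	return pg_cnt * SKETCH_PAGE_SIZE;
}

/* Size computed in 64 bits: widen before multiplying. */
static inline uint64_t range_size_64(uint32_t pg_cnt)
{
	return (uint64_t)pg_cnt * SKETCH_PAGE_SIZE;
}

static inline void range_example(void)
{
	assert(is_page_aligned(0x40000000));
	assert(!is_page_aligned(0x40000800));
	assert(range_size_32(0x100000) == 0);		/* 4 GiB truncated to 0 */
	assert(range_size_64(0x100000) == 0x100000000ULL);
}
```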
Bug: 228889679
Reported-by: Gulshan Singh <gsgx@google.com>
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Ifc51ee9598905cf2926d19c53159804f89d74040
Gulshan reports that the hypervisor is not pinning the host FFA mailbox
pages, therefore allowing the host to unshare them after registration
and to later donate them for things like page-table pages.
Pin the host FFA mailboxes to prevent the host from unsharing them while
they are in use.
Bug: 228931886
Reported-by: Gulshan Singh <gsgx@google.com>
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I18ecad6ccaa3ef89015a71d97890fad55f0568f2
There are more vfs-only symbols that OEMs want to use, so place them in
the proper vfs-only namespace.
Bug: 157965270
Bug: 210074446
Bug: 227656251
Cc: Matthias Maennich <maennich@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I99b9facc8da45fb329f6627d204180d1f89bcf97
Xiling reports that the hypervisor dereferences the host memcache struct
twice when refilling its own memcache. This allows the host to change its
memcache head after it has been admitted and before it is consumed,
leading to an arbitrary write in hypervisor memory.
Fix this by copying the host memcache onto the stack before starting to
refill, hence guaranteeing its stability.
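
The snapshot pattern looks roughly like the sketch below. The struct
layout, the validation rule and all names are hypothetical; the point
is only that the shared structure is read once and every later step
works on the private copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of a memcache descriptor shared with the host. */
struct host_memcache {
	uint64_t head;
	uint32_t nr_pages;
};

/* Hypothetical admission rule: non-zero and 4 KiB aligned. */
static inline bool head_is_acceptable(uint64_t head)
{
	return head != 0 && (head & 0xfff) == 0;
}

/* Returns the head that would be consumed, or 0 if the request is rejected. */
static inline uint64_t refill_from_host(const volatile struct host_memcache *shared)
{
	struct host_memcache copy;

	/* Read the shared fields exactly once, into private memory. */
	copy.head = shared->head;
	copy.nr_pages = shared->nr_pages;

	/* Validate and consume the same snapshot. */
	if (!copy.nr_pages || !head_is_acceptable(copy.head))
		return 0;
	return copy.head;
}

static inline void memcache_example(void)
{
	struct host_memcache good = { .head = 0x1000, .nr_pages = 2 };
	struct host_memcache bad = { .head = 0x1234, .nr_pages = 2 };

	assert(refill_from_host(&good) == 0x1000);
	assert(refill_from_host(&bad) == 0);
}
```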
Bug: 228435321
Reported-by: Xiling Gong <xiling@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ib7c5db203e4a4a7f27eb9f0c0083f4b5c726b4d9
This patch removes dump_page_pinner since it was not useful (IOW,
the page_pinner buffer that keeps the history is enough).
This patch also fixes a mismatched printf format specifier.
Bug: 218731671
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I80c6f5ad656b3b0d27a50eabff4d1382559aa105
[Backport: resolve conflicts caused by CONFIG_CMA.]
KASAN changes that added new GFP flags mistakenly updated
__GFP_BITS_SHIFT as the total number of GFP bits instead of as a shift
used to define __GFP_BITS_MASK.
This broke LOCKDEP, as __GFP_BITS_MASK now gets the 25th bit enabled
instead of the 28th for __GFP_NOLOCKDEP.
Update __GFP_BITS_SHIFT to always count KASAN GFP bits.
In the future, we could handle all combinations of KASAN and LOCKDEP to
occupy as few bits as possible. For now, we have enough GFP bits to be
inefficient in this quick fix.
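
The relationship between the shift and the mask can be illustrated as
follows. The bit positions are chosen only to mirror the "25th" and
"28th" bits mentioned above, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* A mask built from a shift covers bit indices 0 .. shift - 1. */
static inline unsigned int bits_mask(unsigned int shift)
{
	return (1U << shift) - 1;
}

/* Is the flag at the given bit index inside the mask? */
static inline bool mask_covers_bit(unsigned int shift, unsigned int bit)
{
	return (bits_mask(shift) & (1U << bit)) != 0;
}

static inline void gfp_mask_example(void)
{
	/* A flag living in the 28th bit (index 27). */
	assert(mask_covers_bit(28, 27));	/* shift counts every bit below it */
	assert(!mask_covers_bit(25, 27));	/* shift too small: flag masked off */
}
```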
Link: https://lkml.kernel.org/r/462ff52742a1fcc95a69778685737f723ee4dfb3.1648400273.git.andreyknvl@google.com
Fixes: 9353ffa6e9 ("kasan, page_alloc: allow skipping memory init for HW_TAGS")
Fixes: 53ae233c30 ("kasan, page_alloc: allow skipping unpoisoning for HW_TAGS")
Fixes: f49d9c5bb1 ("kasan, mm: only define ___GFP_SKIP_KASAN_POISON with HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 78d104f8b401c81d140adad91e027d7d83b3315c)
Bug: 217222520
Change-Id: I82484635012c5773c6ef9164a9368d9e61157f87
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Currently the generic IOMMU code lets the driver initialize its PT and
then invokes callbacks to set the permissions across the entire PA
range. Optimize this by making it a requirement on the driver to
initialize its PTs to all memory owned by the host. snapshot_host_stage2
then only calls the driver's callback for memory regions not owned by
the host.
Bug: 190463801
Bug: 218012133
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I51ff38cb4f4e28e19903af942776b401504c363e
Change the permissions that MPTs are initialized with from PROT_NONE to
PROT_RW. No functional change intended, as the generic IOMMU code
sets permissions for the entire address space later. This will allow
optimizing boot time by only unmapping pages not available to the host.
Bug: 190463801
Bug: 218012133
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: Ic29ec690a84cde22a2ce8fe33e7127711c6f0f3e
The second argument of the kvm_pgtable_walker callback was
misinterpreted as the end of the current entry, where in fact it is
the end of the walked memory region. Fix this by computing the end of
the current entry from the start and the level.
This did not affect correctness, as the code iterates linearly over
the entire address space, but it did affect boot time.
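
A sketch of computing the end of the current entry from its start and
level follows, assuming 4 KiB pages and levels numbered 0 (top) to 3
(leaf). The helper names are hypothetical and stand in for whatever the
driver actually uses.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12U	/* assumed 4 KiB pages */

/* Address bits covered by one entry at the given level. */
static inline unsigned int granule_shift(unsigned int level)
{
	return (SKETCH_PAGE_SHIFT - 3) * (4 - level) + 3;
}

/* End of the entry that contains addr at the given level. */
static inline uint64_t entry_end(uint64_t addr, unsigned int level)
{
	uint64_t size = 1ULL << granule_shift(level);

	return (addr & ~(size - 1)) + size;
}

static inline void entry_end_example(void)
{
	assert(granule_shift(3) == 12);			/* 4 KiB leaf */
	assert(granule_shift(2) == 21);			/* 2 MiB block */
	assert(entry_end(0x1000, 3) == 0x2000);
	assert(entry_end(0x201000, 2) == 0x400000);
}
```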
Bug: 190463801
Bug: 218012133
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I6d189b87645f47cd215a783c1bc9e1f032ff8c62
When vendor hooks are added to a file that previously didn't have any
vendor hooks, we end up indirectly including linux/tracepoint.h. This
causes some data types that used to be opaque (forward declared) to the
code to become visible to the code.
Modversions correctly catches this change in visibility, but we don't
really care about the data types made visible when linux/tracepoint.h is
included. So, hide this from modversions in the central vendor_hooks.h file
instead of having to fix this on a case by case basis.
This change itself will cause a one time CRC breakage/churn because it's
fixing the existing vendor hook headers, but should reduce unnecessary CRC
churns in the future.
To avoid future pointless CRC churn, vendor hook header files that include
vendor_hooks.h should not include linux/tracepoint.h directly.
Bug: 227513263
Bug: 226140073
Signed-off-by: Saravana Kannan <saravanak@google.com>
Change-Id: Ia88e6af11dd94fe475c464eb30a6e5e1e24c938b
An additional struct page **pages parameter is needed to judge whether
it is possible to migrate pages out of CMA.
Bug: 227475444
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I9a8aa57ff91228baf0fc970b8499464c07872c09
Introduce $debugfs/page_pinner/buffer_size to change
buffer_size on demand. Changing buffer_size resets
the buffer.
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I505cdc2ee29aa0c6ed4e2dc2c0b6fcff77c388e4
We shouldn't waste memory for vendors who don't use
page_pinner so remove the page_pinner static buffer.
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I46ae2fb5000c4eb59253159032182ca106b39eb9
Experience shows that longterm_pinner is not worth maintaining
considering how much it churns MM. Just drop the feature;
alloc_contig_failed is sufficient.
The visible effects of this patch are:
1. drop $debugfs/page_pinner/longterm_pinner
2. drop the put_user_page exported API
3. rename alloc_contig_failed to buffer
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I68cc11db448260987a9e26b99647ecb55f571616
Currently, the output format makes it hard to tell how long a page has
been pinned, since the user needs to reconstruct the timeline from
migration failure detection to the put event. Sometimes the log buffer
even overflows, so the migration failure event is lost from the
timeline. This patch stores the page pinning time on the kernel side
and keeps the information whenever a page is released. Thus, the user
can understand the output more easily and never loses the information.
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I396f0c12438e0ff8a3497253b750a7e5bb342f57
Currently, the page_pinner code is too ugly because we missed
refactoring it last time due to the GKI deadline. Let's make it better
this time before GKI freezes.
What this patch cleans up is __reset_page_pinner, which is used both
for freeing pages and for putting pages, depending on the free
parameter. That hurts readability and makes further changes hard, so
split it into separate put and free functions.
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I610ffc629eea5e996b7d55340b5589a3f49574d7
For various reasons based on the allocator behaviour and typical
use-cases at the time, when the max32_alloc_size optimisation was
introduced it seemed reasonable to couple the reset of the tracked
size to the update of cached32_node upon freeing a relevant IOVA.
However, since subsequent optimisations focused on helping genuine
32-bit devices make best use of even more limited address spaces, it
is now a lot more likely for cached32_node to be anywhere in a "full"
32-bit address space, and as such more likely for space to become
available from IOVAs below that node being freed.
At this point, the short-cut in __cached_rbnode_delete_update() really
doesn't hold up any more, and we need to fix the logic to reliably
provide the expected behaviour. We still want cached32_node to only move
upwards, but we should reset the allocation size if *any* 32-bit space
has become available.
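
The change in the reset condition can be modeled with two predicates.
This is a simplified reading of the description above, not the driver
code: the names are hypothetical, and the case where the freed IOVA is
the cached node itself is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Before (simplified): the size reset was tied to cached32_node moving. */
static inline bool old_resets_max32(unsigned long free_lo, unsigned long free_hi,
				    unsigned long cached_lo,
				    unsigned long dma_32bit_pfn)
{
	return free_hi < dma_32bit_pfn && free_lo >= cached_lo;
}

/* After (simplified): any freed 32-bit space resets the tracked size. */
static inline bool new_resets_max32(unsigned long free_lo,
				    unsigned long dma_32bit_pfn)
{
	return free_lo < dma_32bit_pfn;
}

static inline void iova_example(void)
{
	const unsigned long limit = 0x100000;	/* 4 GiB in 4 KiB pfns */
	const unsigned long cached = 0x80000;	/* cached node in mid-space */

	/* Freeing below the cached node: missed before, caught now. */
	assert(!old_resets_max32(0x1000, 0x1fff, cached, limit));
	assert(new_resets_max32(0x1000, limit));

	/* Freeing above the cached node: caught in both versions. */
	assert(old_resets_max32(0x90000, 0x90fff, cached, limit));
	assert(new_resets_max32(0x90000, limit));
}
```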
Reported-by: Yunfei Wang <yf.wang@mediatek.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Miles Chen <miles.chen@mediatek.com>
Link: https://lore.kernel.org/r/033815732d83ca73b13c11485ac39336f15c3b40.1646318408.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Bug: 223712131
(cherry picked from commit 5b61343b50 https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git core)
Signed-off-by: Yunfei Wang <yf.wang@mediatek.com>
Change-Id: I5026411dd022c6ddea5c0e4da6e69c7b14162c3f
(cherry picked from commit ec48b1892e)
When dealing with a guest with SVE enabled, we don't populate
the shadow SVE state, nor pin the SVE state at S1 EL2.
Fix both issues in one go.
Bug: 227292021
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Marc Zyngier <mzyngier@google.com>
Change-Id: I88dc7e9c84e5970ec2466a0aa98ad4e3c94711a0