linux

mirror of https://github.com/hardkernel/linux.git synced 2026-06-06 19:08:57 +09:00

Author	SHA1	Message	Date
Yu Zhao	082bc8296a	FROMLIST: mm: multi-gen LRU: optimize multiple memcgs When multiple memcgs are available, it is possible to make better choices based on generations and tiers and therefore improve the overall performance under global memory pressure. This patch adds a rudimentary optimization to select memcgs that can drop single-use unmapped clean pages first. Doing so reduces the chance of going into the aging path or swapping. These two operations can be costly. A typical example that benefits from this optimization is a server running mixed types of workloads, e.g., heavy anon workload in one memcg and heavy buffered I/O workload in the other. Though this optimization can be applied to both kswapd and direct reclaim, it is only added to kswapd to keep the patchset manageable. Later improvements will cover the direct reclaim path. Server benchmark results: Mixed workloads: fio (buffered I/O): -[23, 25]% IOPS BW patch1-8: 2960k 11.3GiB/s patch1-9: 2248k 8783MiB/s memcached (anon): +[210, 214]% Ops/sec KB/sec patch1-8: 606940.09 23576.89 patch1-9: 1895197.49 73619.93 Mixed workloads: fio (buffered I/O): -[4, 6]% IOPS BW 5.18-ed4643521e6a: 2369k 9255MiB/s patch1-9: 2248k 8783MiB/s memcached (anon): +[510, 516]% Ops/sec KB/sec 5.18-ed4643521e6a: 309189.58 12010.61 patch1-9: 1895197.49 73619.93 Configurations: (changes since patch 6) cat mixed.sh modprobe brd rd_nr=2 rd_size=56623104 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 mkfs.ext4 /dev/ram1 mount -t ext4 /dev/ram1 /mnt memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=90m --group_reporting & pid=$! 
sleep 200 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed kill -INT $pid wait Client benchmark results: no change (CONFIG_MEMCG=n) Link: https://lore.kernel.org/r/20220309021230.721028-10-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I0641467dbd7c5ba0645602cec7fe8d6fdb750edb	2022-04-18 10:11:55 -07:00
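The bracketed percentage ranges quoted in commit 082bc8296a can be checked against the raw figures in the same message. The following is a minimal arithmetic sketch; the helper name `percent_change` is hypothetical, and the inputs are copied from the benchmark results above (with "2960k" read as 2960 × 10³ IOPS).

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the commit message; labels are illustrative only.
results = {
    "fio IOPS, patch1-8 -> patch1-9": (2960e3, 2248e3),
    "memcached ops/sec, patch1-8 -> patch1-9": (606940.09, 1895197.49),
    "fio IOPS, 5.18-ed4643521e6a -> patch1-9": (2369e3, 2248e3),
    "memcached ops/sec, 5.18-ed4643521e6a -> patch1-9": (309189.58, 1895197.49),
}

for label, (before, after) in results.items():
    print(f"{label}: {percent_change(before, after):+.1f}%")
```

Each computed value should fall inside the corresponding range stated in the message, e.g. the fio regression against patch1-8 lands between -25% and -23%.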
Yu Zhao	93c4f86793	FROMLIST: mm: multi-gen LRU: support page table walks To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. 
It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[5.5, 7.5]% Ops/sec KB/sec patch1-7: 1014393.57 39455.42 patch1-8: 1078507.59 41949.15 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 45.54% lzo1x_1_do_compress (real work) 9.56% page_vma_mapped_walk 6.70% _raw_spin_unlock_irq 2.78% ptep_clear_flush 2.47% do_raw_spin_lock 2.22% __zram_bvec_write 1.87% lru_gen_look_around 1.78% memmove 1.77% obj_malloc 1.44% free_unref_page_list patch1-8 47.02% lzo1x_1_do_compress (real work) 6.73% page_vma_mapped_walk 6.14% _raw_spin_unlock_irq 3.39% walk_pte_range 2.63% ptep_clear_flush 2.29% __zram_bvec_write 2.10% do_raw_spin_lock 1.81% memmove 1.73% obj_malloc 1.53% free_unref_page_list Configurations: no change [1] https://lwn.net/Articles/23732/ [2] https://source.android.com/devices/tech/debug/scudo Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh 
<kaleshsingh@google.com> Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e	2022-04-18 10:11:55 -07:00
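Optimization 2 in commit 93c4f86793 (generational Bloom filters that record populated page table branches) can be illustrated with a small model. This is only a sketch in Python, not the kernel code: the two-filter layout indexed by sequence number, the filter size, the hash function, and all names (`BloomFilter`, `GenerationalBloomFilter`, `start_walk`, `should_walk`, `record`) are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter with two hash functions (illustrative)."""

    def __init__(self, nr_bits: int = 1 << 15) -> None:
        self.nr_bits = nr_bits
        self.bits = bytearray(nr_bits // 8)

    def _positions(self, key: int):
        # Derive two bit positions from one 16-byte digest.
        digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=16).digest()
        yield int.from_bytes(digest[:8], "little") % self.nr_bits
        yield int.from_bytes(digest[8:], "little") % self.nr_bits

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key: int) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))

    def clear(self) -> None:
        self.bits = bytearray(len(self.bits))


class GenerationalBloomFilter:
    """Two filters: one queried by the current walk, one filled for the next."""

    NR_FILTERS = 2  # assumption for this sketch

    def __init__(self) -> None:
        self.filters = [BloomFilter() for _ in range(self.NR_FILTERS)]
        self.ready = [False] * self.NR_FILTERS

    def start_walk(self, seq: int) -> None:
        # Recycle the filter that the walk for `seq` will populate.
        idx = (seq + 1) % self.NR_FILTERS
        self.filters[idx].clear()
        self.ready[idx] = True

    def should_walk(self, seq: int, branch: int) -> bool:
        idx = seq % self.NR_FILTERS
        if not self.ready[idx]:
            return True  # nothing recorded yet: walk everything
        return self.filters[idx].might_contain(branch)

    def record(self, seq: int, branch: int) -> None:
        # Remember a populated branch for the walk that follows `seq`.
        self.filters[(seq + 1) % self.NR_FILTERS].add(branch)
```

In this model a walker calls `should_walk()` before descending into a branch and `record()` when the branch turned out to be worth visiting, so the next iteration can skip branches that were mostly holes or misplaced pages. A Bloom filter can return false positives but never false negatives, which only costs an unnecessary walk.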
Yu Zhao	c8356f7573	FROMLIST: mm: multi-gen LRU: exploit locality in rmap Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, the rmap has a high CPU cost in the reclaim path. This patch exploits spatial locality to reduce the trips into the rmap. When shrink_page_list() walks the rmap and finds a young PTE, a new function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent PTEs. On finding another young PTE, it clears the accessed bit and updates the gen counter of the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[4, 6]% Ops/sec KB/sec patch1-6: 964656.80 37520.88 patch1-7: 1014393.57 39455.42 Configurations: no change Client benchmark results: kswapd profiles: patch1-6 36.13% lzo1x_1_do_compress (real work) 19.16% page_vma_mapped_walk 6.55% _raw_spin_unlock_irq 4.02% do_raw_spin_lock 2.32% anon_vma_interval_tree_iter_first 2.11% ptep_clear_flush 1.76% __zram_bvec_write 1.64% folio_referenced_one 1.40% memmove 1.35% obj_malloc patch1-7 45.54% lzo1x_1_do_compress (real work) 9.56% page_vma_mapped_walk 6.70% _raw_spin_unlock_irq 2.78% ptep_clear_flush 2.47% do_raw_spin_lock 2.22% __zram_bvec_write 1.87% lru_gen_look_around 1.78% memmove 1.77% obj_malloc 1.44% free_unref_page_list Configurations: no change Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr 
<d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c	2022-04-18 10:11:55 -07:00
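The look-around step described in commit c8356f7573 can be modeled as follows. This is a hedged Python sketch, not the kernel's `lru_gen_look_around()`: the aligned window of `BITS_PER_LONG` entries, the values `BITS_PER_LONG = 64` and `MAX_NR_GENS = 4`, and the list-based representation of PTEs are all assumptions. What comes from the message is the bound of at most `BITS_PER_LONG-1` adjacent PTEs and the new gen counter value `(max_seq%MAX_NR_GENS)+1`.

```python
BITS_PER_LONG = 64  # assumption: 64-bit kernel
MAX_NR_GENS = 4     # assumption: not stated in the commit message


def look_around(young: list, gens: list, hit: int, max_seq: int) -> int:
    """Scan the neighbours of the young PTE at index `hit`.

    `young[i]` models the accessed bit of PTE i and `gens[i]` the gen
    counter of the page it maps. Returns the number of promoted pages.
    """
    # Assumed window: the BITS_PER_LONG-aligned block containing `hit`,
    # which gives at most BITS_PER_LONG - 1 neighbours.
    start = hit - hit % BITS_PER_LONG
    end = min(start + BITS_PER_LONG, len(young))
    new_gen = max_seq % MAX_NR_GENS + 1

    promoted = 0
    for i in range(start, end):
        if i == hit or not young[i]:
            continue  # the hit itself is handled by the rmap walk
        young[i] = False   # clear the accessed bit
        gens[i] = new_gen  # move the page to the youngest generation
        promoted += 1
    return promoted
```

The point of the design is that one trip into the rmap pays for up to `BITS_PER_LONG-1` extra accessed-bit checks on PTEs that are already close in the page table, which matches the drop in `page_vma_mapped_walk` time shown in the kswapd profiles.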
Yu Zhao	436dff20eb	FROMLIST: mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. Promotion in the aging path does not require any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The eviction consumes old generations. Given an lruvec, it increments min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. Each generation is divided into multiple tiers. Tiers represent different ranges of numbers of accesses through file descriptors. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in page->flags. In contrast to moving across generations, which requires the LRU lock, moving across tiers only involves operations on page->flags. The feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. 
The eviction moves a page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[38, 40]% IOPS BW 5.18-ed4643521e6a: 2547k 9989MiB/s patch1-6: 3540k 13.5GiB/s Single workload: memcached (anon): +[103, 107]% Ops/sec KB/sec 5.18-ed4643521e6a: 469048.66 18243.91 patch1-6: 964656.80 37520.88 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. 
patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.18-ed4643521e6a 39.56% page_vma_mapped_walk 19.32% lzo1x_1_do_compress (real work) 7.18% do_raw_spin_lock 4.23% _raw_spin_unlock_irq 2.26% vma_interval_tree_subtree_search 2.12% vma_interval_tree_iter_next 2.11% folio_referenced_one 1.90% anon_vma_interval_tree_iter_first 1.47% ptep_clear_flush 0.97% __anon_vma_interval_tree_subtree_search patch1-6 36.13% lzo1x_1_do_compress (real work) 19.16% page_vma_mapped_walk 6.55% _raw_spin_unlock_irq 
4.02% do_raw_spin_lock 2.32% anon_vma_interval_tree_iter_first 2.11% ptep_clear_flush 1.76% __zram_bvec_write 1.64% folio_referenced_one 1.40% memmove 1.35% obj_malloc Configurations: CPU: single Snapdragon 7c Mem: total 4G Chrome OS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86	2022-04-18 10:11:54 -07:00
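Commit 436dff20eb places a page accessed N times through file descriptors in tier `order_base_2(N)`. The mapping can be sketched in Python as below; `order_base_2` here is a reimplementation of the kernel macro's rounding-up base-2 logarithm (with 0 and 1 both mapping to 0), and `tier_of` is a hypothetical helper name.

```python
def order_base_2(n: int) -> int:
    """Smallest k such that 2**k >= n; 0 for n <= 1."""
    return (n - 1).bit_length() if n > 1 else 0


def tier_of(nr_fd_accesses: int) -> int:
    """Tier of a page accessed `nr_fd_accesses` times through file descriptors."""
    return order_base_2(nr_fd_accesses)


for n in range(9):
    print(f"N={n}: tier {tier_of(n)}")
```

This reproduces the baseline described in the message: N=0 and N=1 both fall in the first tier, which also holds pages accessed only through page tables (N=0), and each further tier covers twice as many access counts as the one before it.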
Yu Zhao	fe302bd1f9	FROMLIST: mm: multi-gen LRU: groundwork Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing. Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in page->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0. There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These features are required to optimize job scheduling (bin packing) in data centers. The variable size of the sliding window is designed for such use cases [1][2]. To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual. The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because: 1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit. 2. 
The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit. 3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking the rendering threads. There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy. A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS. 
[1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 [3] https://lwn.net/Articles/495543/ [4] https://lwn.net/Articles/815342/ Link: https://lore.kernel.org/r/20220309021230.721028-6-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512	2022-04-18 10:11:54 -07:00
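The generation bookkeeping in commit fe302bd1f9 (truncated sequence numbers as list indices, a gen counter in [1, MAX_NR_GENS] while a page is on a list, and a sliding window of generations) can be sketched as follows. `MIN_NR_GENS = 2` follows from the second-chance argument in the message; `MAX_NR_GENS = 4` is an assumption, and the function names are illustrative rather than the kernel's.

```python
MIN_NR_GENS = 2  # "requires a minimum of two generations"
MAX_NR_GENS = 4  # assumption: the message does not give the value


def order_base_2(n: int) -> int:
    """Smallest k such that 2**k >= n; 0 for n <= 1."""
    return (n - 1).bit_length() if n > 1 else 0


# Bits needed for the gen counter, which must also encode 0 (not on a list).
GEN_WIDTH = order_base_2(MAX_NR_GENS + 1)


def list_index(seq: int) -> int:
    """Truncated generation number, used as an index into the lists."""
    return seq % MAX_NR_GENS


def gen_counter(seq: int) -> int:
    """Value stored in page->flags while the page is on a list."""
    return list_index(seq) + 1


def window_is_valid(min_seq: int, max_seq: int) -> bool:
    """The sliding window tracks between MIN_NR_GENS and MAX_NR_GENS generations."""
    return MIN_NR_GENS <= max_seq - min_seq + 1 <= MAX_NR_GENS
```

Because the sequence numbers only ever increase while the counter wraps, the window check is what guarantees that two live generations never share a list index.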
Yu Zhao	4c6c817249	FROMLIST: mm/vmscan.c: refactor shrink_node() This patch refactors shrink_node() to improve readability for the upcoming changes to mm/vmscan.c. Link: https://lore.kernel.org/r/20220309021230.721028-4-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I186f43f946de0d40d54883fb31114840fc749a57	2022-04-18 10:11:54 -07:00
Yu Zhao	95acc9c28b	FROMLIST: mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Some architectures support the accessed bit in non-leaf PMD entries, e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it as part of linear address translation [1]. Page table walkers that clear the accessed bit may use this capability to reduce their search space. Note that: 1. Although an inline function is preferable, this capability is added as a configuration option for consistency with the existing macros. 2. Due to the little interest in other varieties, this capability was only tested on Intel and AMD CPUs. [1]: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (June 2021), section 4.8 Link: https://lore.kernel.org/r/20220309021230.721028-3-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I73f84a21fd315192eaa3e6443334ed1bccb4e99e	2022-04-18 10:11:54 -07:00
Yu Zhao	1ed19b562b	FROMLIST: mm: x86, arm64: add arch_has_hw_pte_young() Some architectures automatically set the accessed bit in PTEs, e.g., x86 and arm64 v8.2. On architectures that do not have this capability, clearing the accessed bit in a PTE usually triggers a page fault following the TLB miss of this PTE (to emulate the accessed bit). Being aware of this capability can help make better decisions, e.g., whether to spread the work out over a period of time to reduce bursty page faults when trying to clear the accessed bit in many PTEs. Note that theoretically this capability can be unreliable, e.g., hotplugged CPUs might be different from builtin ones. Therefore it should not be used in architecture-independent code that involves correctness, e.g., to determine whether TLB flushes are required (in combination with the accessed bit). Link: https://lore.kernel.org/r/20220309021230.721028-2-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Acked-by: Will Deacon <will@kernel.org> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: Ie81175d7e0d239f688d31487b298cf9b4fb66707	2022-04-18 10:11:54 -07:00
Yu Zhao	b4f3b6ac71	UPSTREAM: include/linux/page-flags-layout.h: cleanups Tidy things up and delete comments stating the obvious with typos or making no sense. Link: https://lkml.kernel.org/r/20210303071609.797782-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `1587db62d8`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I1d57992dd4c68d89c1b9180f280e09d5d08482b6	2022-04-18 10:11:53 -07:00
Yu Zhao	2b286703d9	UPSTREAM: include/linux/page-flags-layout.h: correctly determine LAST_CPUPID_WIDTH The naming convention used in include/linux/page-flags-layout.h: _SHIFT: the number of bits trying to allocate _WIDTH: the number of bits successfully allocated So when it comes to LAST_CPUPID_WIDTH, we need to check whether all previous *_WIDTH and LAST_CPUPID_SHIFT can fit into page flags. This means we need to use NODES_WIDTH, not NODES_SHIFT. Link: https://lkml.kernel.org/r/20210303071609.797782-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `f73c6c8805`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I6d7c58cf5d10e302adc818ac7e1fd727208d23c8	2022-04-18 10:11:53 -07:00
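The `_SHIFT` versus `_WIDTH` distinction fixed by commit 2b286703d9 can be shown with a small model. This is a Python sketch of the idea only, not the preprocessor logic in `include/linux/page-flags-layout.h`; the helper `field_width` and every bit count in the usage example are hypothetical.

```python
def field_width(shift: int, used_bits: int, available_bits: int) -> int:
    """Bits actually allocated (_WIDTH) for a field requesting `shift` bits (_SHIFT)."""
    return shift if used_bits + shift <= available_bits else 0


def last_cpupid_width(zones_width: int, nodes_term: int,
                      last_cpupid_shift: int, available_bits: int) -> int:
    """Width of the last-cpupid field, given what earlier fields occupy.

    `nodes_term` is NODES_WIDTH in the fixed version and NODES_SHIFT in
    the version before the fix.
    """
    return field_width(last_cpupid_shift, zones_width + nodes_term, available_bits)


# Hypothetical layout: 12 free bits, 2 for zones, nodes ask for 11.
available = 12
zones_width = field_width(2, 0, available)
nodes_shift = 11
nodes_width = field_width(nodes_shift, zones_width, available)  # does not fit

before_fix = last_cpupid_width(zones_width, nodes_shift, 8, available)
after_fix = last_cpupid_width(zones_width, nodes_width, 8, available)
print(before_fix, after_fix)
```

With these made-up numbers the node field fails to fit, so it occupies no bits; counting its requested size anyway makes the last-cpupid field appear not to fit even though the space is free, which is the miscalculation the commit message describes.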
Yu Zhao	80343eeaf3	UPSTREAM: mm/swap: don't SetPageWorkingset unconditionally during swapin We are capable of SetPageWorkingset based on refault distances after commit `aae466b005` ("mm/swap: implement workingset detection for anonymous LRU"). This is done by workingset_refault(), which is right above the unconditional SetPageWorkingset deleted by this patch. The unconditional SetPageWorkingset miscategorizes pages that are read ahead or never belonged to the working set (e.g., tmpfs pages accessed only once by fd). When those pages are swapped in (after they were swapped out) for the first time, they skew PSI (when using async swap). When this happens again, depending on their refault distances, they might skew workingset_restore_anon counter in addition to PSI because their shadows indicate they were part of the working set. Historically, SetPageWorkingset was added as part of the PSI series, and Johannes said: "It was meant to mark incoming pages under IO with SetPageWorkingset when waiting for them constituted a memory stall. On the page cache side, because we HAVE workingset detection, this was specific to recently evicted pages that had been active in their previous life. On the anon side, the aging algorithm had no distinction between workingset and sporadically used pages. Given the choice between a) no swapin stalls are pressure and b) all swapin stalls are pressure, I went with the latter in order to detect swap storms. The false positive case - high rate of swapin without severe memory pressure - was relatively unlikely, because we tried to avoid swapping until everything was completely on fire in the first place." 
Link: https://lkml.kernel.org/r/20201209012400.1771150-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20201214231253.62313-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `cad8320b4b`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: Ifa9c5fa2e875e6ccee6c3f7e2d2983278d54c220	2022-04-18 10:11:53 -07:00
Yu Zhao	0c20cff831	UPSTREAM: include/linux/mm_inline.h: fold page_lru_base_type() into its sole caller We've removed all other references to this function. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-9-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-9-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `c1770e34f3`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: If229fa7a09e5be79cc28dc5a780b900e69f4ce64	2022-04-18 10:11:53 -07:00
Yu Zhao	aadc45fae6	BACKPORT: mm: VM_BUG_ON lru page flags Move scattered VM_BUG_ONs to two essential places that cover all lru list additions and deletions. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-8-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-8-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `bc7112719e`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I950ad4171f973c740d9fc3778d44efc020d0e12c	2022-04-18 10:11:53 -07:00
Yu Zhao	bcc2f50f7b	BACKPORT: mm: add __clear_page_lru_flags() to replace page_off_lru() Similar to page_off_lru(), the new function does non-atomic clearing of PageLRU() in addition to PageActive() and PageUnevictable(), on a page that has no references left. If PageActive() and PageUnevictable() are both set, refuse to clear either and leave them to bad_page(). This is a behavior change that is meant to help debug. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-7-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `8756017962`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I0290916fa08277c50e228a8d3f39af67d62ff9d0	2022-04-18 10:11:53 -07:00
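The flag handling described in commit bcc2f50f7b can be modeled in a few lines. This is a Python sketch of the described behavior, not the kernel function: the bit values and the name `clear_page_lru_flags` are illustrative, and page flags are represented as a plain integer bitmask.

```python
# Illustrative bit assignments; the kernel's actual values differ.
PG_LRU = 1 << 0
PG_ACTIVE = 1 << 1
PG_UNEVICTABLE = 1 << 2


def clear_page_lru_flags(flags: int) -> int:
    """Return `flags` after taking a page with no references left off the LRU."""
    flags &= ~PG_LRU

    both = PG_ACTIVE | PG_UNEVICTABLE
    if (flags & both) == both:
        # Inconsistent state: leave both bits set so that a later sanity
        # check (bad_page() in the commit message) can report the page.
        return flags

    return flags & ~both
```

Leaving the contradictory combination untouched is the behavior change called out in the message: the page keeps its evidence instead of being silently normalized.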
Yu Zhao	552f416558	BACKPORT: mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list() The parameter is redundant in the sense that it can be potentially extracted from the "struct page" parameter by page_lru(). We need to make sure that existing PageActive() or PageUnevictable() remains until the function returns. A few places don't conform, and simple reordering fixes them. This patch may have left page_off_lru() seemingly odd, and we'll take care of it in the next patch. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-6-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `46ae6b2cc2`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I1e14dcbf4111b39cf155ed3512423448865eb324	2022-04-18 10:11:53 -07:00
Yu Zhao	10899adee3	UPSTREAM: mm/swap.c: don't pass "enum lru_list" to trace_mm_lru_insertion() The parameter is redundant in the sense that it can be extracted from the "struct page" parameter by page_lru() correctly. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-5-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-5-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `861404536a`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: Ia02c0c65dd427a98ffa39e9dc3e2ae701e85fad8	2022-04-18 10:11:52 -07:00
Yu Zhao	c18b4f50ce	BACKPORT: mm: don't pass "enum lru_list" to lru list addition functions The "enum lru_list" parameter to add_page_to_lru_list() and add_page_to_lru_list_tail() is redundant in the sense that it can be extracted from the "struct page" parameter by page_lru(). A caveat is that we need to make sure PageActive() or PageUnevictable() is correctly set or cleared before calling these two functions. And they are indeed. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-4-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-4-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `3a9c9788a3`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I0d92b845d18e6ab3bcb5645f22e3cedb04257d98	2022-04-18 10:11:52 -07:00
Yu Zhao	32ebee4382	BACKPORT: include/linux/mm_inline.h: shuffle lru list addition and deletion functions These functions will call page_lru() in the following patches. Move them below page_lru() to avoid the forward declaration. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-3-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-3-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `f90d8191ac`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I32b8565107c9134e656b43886c00105eb07b34dd	2022-04-18 10:11:52 -07:00
Yu Zhao	885e11e970	BACKPORT: mm/vmscan.c: use add_page_to_lru_list() Patch series "mm: lru related cleanups", v2. The cleanups are intended to reduce the verbosity in lru list operations and make them less error-prone. A typical example would be how the patches change __activate_page(): static void __activate_page(struct page *page, struct lruvec *lruvec) { if (!PageActive(page) && !PageUnevictable(page)) { - int lru = page_lru_base_type(page); int nr_pages = thp_nr_pages(page); - del_page_from_lru_list(page, lruvec, lru); + del_page_from_lru_list(page, lruvec); SetPageActive(page); - lru += LRU_ACTIVE; - add_page_to_lru_list(page, lruvec, lru); + add_page_to_lru_list(page, lruvec); trace_mm_lru_activate(page); There are a few more places like __activate_page() and they are unnecessarily repetitive in terms of figuring out which list a page should be added onto or deleted from. And with the duplicated code removed, they are easier to read, IMO. Patch 1 to 5 basically cover the above. Patch 6 and 7 make code more robust by improving bug reporting. Patch 8, 9 and 10 take care of some dangling helpers left in header files. This patch (of 10): There is add_page_to_lru_list(), and move_pages_to_lru() should reuse it, not duplicate it. Link: https://lkml.kernel.org/r/20210122220600.906146-1-yuzhao@google.com Link: https://lore.kernel.org/linux-mm/20201207220949.830352-2-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `42895ea73b`) Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I7e09be6bedcd451c4e8c790c969306b6ca3adebd	2022-04-18 10:11:52 -07:00
Yifan Hong	75020bfbe2	ANDROID: Move BRANCH from build.config.common to .constants. This allows Bazel to load the value of $BRANCH in order to determine the value of --dist_dir of copy_to_dist_dir statically. Test: TH Bug: 229268271 Change-Id: Iff759b8188360ea1b2bc204d29750eece9095582 Signed-off-by: Yifan Hong <elsk@google.com>	2022-04-14 14:20:35 -07:00
Woody Lin	5ef1198a15	ANDROID: Update the ABI symbol list Leaf changes summary: 5 artifacts changed Changed leaf types summary: 0 leaf type changed Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 5 Added functions Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable 5 Added functions: [A] 'function void interval_tree_insert(interval_tree_node, rb_root_cached)' [A] 'function interval_tree_node* interval_tree_iter_first(rb_root_cached, unsigned long int, unsigned long int)' [A] 'function interval_tree_node interval_tree_iter_next(interval_tree_node, unsigned long int, unsigned long int)' [A] 'function void interval_tree_remove(interval_tree_node, rb_root_cached)' [A] 'function void suspend_set_ops(const platform_suspend_ops)' Bug: 226105845 Bug: 226167799 Signed-off-by: Woody Lin <woodylin@google.com> Change-Id: I5da0ec8c678e36a46418c0f440fad87de1ac7a52	2022-04-14 16:15:37 +00:00
Fuad Tabba	0a227f89cf	ANDROID: KVM: arm64: Do not allow memslot modifications once a PVM has run Currently trying to move or delete a memslot results in a warning and a failure. Userspace shouldn't be able to trigger kernel warnings. The cause is that in protected mode, stage-2 is managed by hyp. Modifying a memslot flushes the shadow memslot, which tries to unmap any stage-2 mapped pages. Bug: 226890762 Signed-off-by: Fuad Tabba <tabba@google.com> Change-Id: Icc6a0aada76e8492285cd5509bad1ee57700af7c	2022-04-14 11:59:20 +01:00
Daniel Rosenberg	8be6e93244	ANDROID: fuse-bpf: Fix read_iter We had a size mismatch for the return value, leading to EIOCBQUEUED getting interpreted as a return size instead of an error code. Test: generic/467, generic/013, and fuse_test Bug: 217570523 Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I64f9d5263f8b37d3c0e286467f9351997b294cc2	2022-04-13 21:21:57 +00:00
Daniel Rosenberg	128ed57bca	ANDROID: fuse-bpf: Use cache and refcount Allocates the iocb we create for asynchronous IO from a cache instead of a regular kzalloc Test: generic/467 and fuse_test Bug: 217570523 Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I27dcec89cd585835f6a8e80e1ae30c503f4038c8	2022-04-13 21:21:49 +00:00
Daniel Rosenberg	8e24eb9a2d	ANDROID: fuse-bpf: Rename iocb_fuse to iocb_orig The current name is a bit confusing. iocb_fuse could refer to the iocb passed to fuse or created by fuse. The new name unambiguously refers to the one passed in to fuse. Test: compiles, behavior unchanged Bug: 217570523 Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I955500eb8a3186252427fd06ca6e99b4fec469b6	2022-04-13 21:21:39 +00:00
Daniel Rosenberg	0f51319527	ANDROID: fuse-bpf: Fix fixattr in rename Existing fixattr was adjusting the same node twice. Bug: 226655982 Test: generic/241 generic/269 Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I4b1cb6d626ee6bd9010012ac126b78f14d6157d0	2022-04-13 21:21:33 +00:00
Daniel Rosenberg	0c37c1459a	ANDROID: fuse-bpf: Fix readdir Fuse uses generic_file_llseek, so we must account for that in readdir to ensure we read from the correct offset in the lower filesystem. Bug: 226655281 Test: generic/257, fuse_test Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: Ie752c1c645e95b7c03ef9497562758a5c42b514a	2022-04-13 21:21:18 +00:00
Yi Kong	68c9936883	ANDROID: clang: update to 14.0.4 Bug: 225394140 Signed-off-by: Yi Kong <yikong@google.com> Change-Id: I9561e11768217b1ea9ab7c90d87445843784f8e9	2022-04-13 04:35:38 +00:00
Minchan Kim	7a197aa504	ANDROID: mm: fix build break MIGRATE_CMA is defined only when CONFIG_CMA. Thus, we couldn't use MIGRATE_CMA directly to build for both !CONFIG_CMA and CONFIG_CMA. Let's use MIGRATE_RECLAIMABLE in the case. Bug: 218731671 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: Idb4fc6f4ea02ab074f270ce62001182c8fff3b37	2022-04-12 15:00:39 -07:00
Minchan Kim	d9e4b67784	ANDROID: mm: freeing MIGRATE_ISOLATE page instantly Since Android has pcp list for MIGRATE_CMA[1], it could cause CMA allocation latency due to not freeing the MIGRATE_ISOLATE page immediately. Originally, MIGRATE_ISOLATED page is supposed to go buddy list with skipping pcp list. Otherwise, the page could be reallocated from pcp list or staying on the pcp list until the pcp is drained so that CMA keeps retrying since it couldn't find the freed page from buddy list. That worked before since the CMA pfnblocks changed only from MIGRATE_CMA to MIGRATE_ISOLATE and free function logic in page allocator has checked MIGRATE_ISOLATEness on every CMA pages using below. free_unref_page_commit if (migratetype >= MIGRATE_PCPTYPES) if(is_migrate_isolate(migratetype)) free_one_page(page); It worked since enum MIGRATE_CMA was bigger than enum MIGRATE_PCPTYPES but since [1], the enum MIGRATE_CMA is less than MIGRATE_PCPTYPES so the logic above doesn't work any more. It could cause following race CPU 0 CPU 1 free_unref_page migratetype = get_pfnblock_migratetype() set_pcppage_migratetype(MIGRATE_CMA) cma_alloc alloc_contig_range set_migrate_isolate(MIGRATE_ISOLATE) add the page into pcp list the page could be reallocated This patch couldn't fix the race completely due to missing zone->lock in order-0 page free(for performance reason). However, it's not a new problem so we need to deal with the issue separately. [1] ANDROID: mm: add cma pcp list Bug: 218731671 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: Ibea20085ce5bfb4b74b83b041f9bda9a380120f9	2022-04-12 15:50:50 +00:00
Will Deacon	83aa7ef838	ANDROID: KVM: arm64: Fix size calculation of FFA memory range Ensure that the FFA memory range to be checked and annotated in the host stage-2 page-table is page-aligned and that its size is calculated using 64-bit arithmetic to avoid the host triggering overflow and subsequent truncation. Bug: 228889679 Reported-by: Gulshan Singh <gsgx@google.com> Signed-off-by: Will Deacon <willdeacon@google.com> Change-Id: Ifc51ee9598905cf2926d19c53159804f89d74040	2022-04-12 11:28:50 +01:00
Will Deacon	2d2e0ad1d1	ANDROID: KVM: arm64: Pin FFA mailboxes shared by the host Gulshan reports that the hypervisor is not pinning the host FFA mailbox pages, therefore allowing the host to unshare them after registration and to later donate them for things like page-table pages. Pin the host FFA mailboxes to prevent the host from unsharing them while they are in use. Bug: 228931886 Reported-by: Gulshan Singh <gsgx@google.com> Signed-off-by: Will Deacon <willdeacon@google.com> Change-Id: I18ecad6ccaa3ef89015a71d97890fad55f0568f2	2022-04-12 11:21:52 +01:00
Paul Lawrence	b196350f2a	ANDROID: fuse-bpf: Fix lseek return value for offset 0 Bug: 227160050 Test: audible app now works Signed-off-by: Paul Lawrence <paullawrence@google.com> Change-Id: Ib14765285190b5838f28c25a69c91935d02c34f4	2022-04-11 15:11:24 -07:00
Will McVicker	bba21782c8	ANDROID: Update the ABI symbol list and xml Leaf changes summary: 1 artifact changed Changed leaf types summary: 0 leaf type changed Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 1 Added function Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable 1 Added function: [A] 'function void __drm_printfn_debug(drm_printer, va_format)' Bug: 202781851 Change-Id: I8c0270ac538462cc64246195e20f5c653f5894cc Signed-off-by: Midas Chien <midaschieh@google.com> Signed-off-by: Will McVicker <willmcvicker@google.com>	2022-04-11 11:20:46 -07:00
Greg Kroah-Hartman	e5765b86ce	ANDROID: GKI: set more vfs-only exports into their own namespace There are more vfs-only symbols that OEMs want to use, so place them in the proper vfs-only namespace. Bug: 157965270 Bug: 210074446 Bug: 227656251 Cc: Matthias Maennich <maennich@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I99b9facc8da45fb329f6627d204180d1f89bcf97	2022-04-08 15:46:37 +02:00
Quentin Perret	74ff6e66d2	ANDROID: KVM: arm64: Fix ToCToU issue when refilling the hyp memcache Xiling reports that the hypervisor dereferences the host memcache struct twice when refilling its own memcache. This allows the host to change its memcache head after it has been admitted and before it is consumed, leading to an arbitrary write in hypervisor memory. Fix this by copying the host memcache on the stack before starting to refill hence guaranteeing its stability. Bug: 228435321 Reported-by: Xiling Gong <xiling@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Change-Id: Ib7c5db203e4a4a7f27eb9f0c0083f4b5c726b4d9	2022-04-08 12:34:52 +00:00
Minchan Kim	8fe46774c6	ANDROID: mm: page_pinner: remove dump_page_pinner This patch removes dump_page_pinner since it was not useful(IOW, the page_pinner buffer to keep the history is enough). This patch also changes mismatched printf format specifier. Bug: 218731671 Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I80c6f5ad656b3b0d27a50eabff4d1382559aa105	2022-04-07 23:15:24 +00:00
Andrey Konovalov	94c6c10c39	BACKPORT: mm, kasan: fix __GFP_BITS_SHIFT definition breaking LOCKDEP [Backport: resolve conflicts caused by CONFIG_CMA.] KASAN changes that added new GFP flags mistakenly updated __GFP_BITS_SHIFT as the total number of GFP bits instead of as a shift used to define __GFP_BITS_MASK. This broke LOCKDEP, as __GFP_BITS_MASK now gets the 25th bit enabled instead of the 28th for __GFP_NOLOCKDEP. Update __GFP_BITS_SHIFT to always count KASAN GFP bits. In the future, we could handle all combinations of KASAN and LOCKDEP to occupy as few bits as possible. For now, we have enough GFP bits to be inefficient in this quick fix. Link: https://lkml.kernel.org/r/462ff52742a1fcc95a69778685737f723ee4dfb3.1648400273.git.andreyknvl@google.com Fixes: `9353ffa6e9` ("kasan, page_alloc: allow skipping memory init for HW_TAGS") Fixes: `53ae233c30` ("kasan, page_alloc: allow skipping unpoisoning for HW_TAGS") Fixes: `f49d9c5bb1` ("kasan, mm: only define ___GFP_SKIP_KASAN_POISON with HW_TAGS") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 78d104f8b401c81d140adad91e027d7d83b3315c) Bug: 217222520 Change-Id: I82484635012c5773c6ef9164a9368d9e61157f87 Signed-off-by: Andrey Konovalov <andreyknvl@google.com>	2022-04-07 17:51:52 +02:00
Andrey Konovalov	7bfa608df5	UPSTREAM: kasan: test: support async (again) and asymm modes for HW_TAGS Async mode support has already been implemented in commit `e80a76aa1a` ("kasan, arm64: tests supports for HW_TAGS async mode") but then got accidentally broken in commit `99734b535d` ("kasan: detect false-positives in tests"). Restore the changes removed by the latter patch and adapt them for asymm mode: add a sync_fault flag to kunit_kasan_expectation that only get set if the MTE fault was synchronous, and reenable MTE on such faults in tests. Also rename kunit_kasan_expectation to kunit_kasan_status and move its definition to mm/kasan/kasan.h from include/linux/kasan.h, as this structure is only internally used by KASAN. Also put the structure definition under IS_ENABLED(CONFIG_KUNIT). Link: https://lkml.kernel.org/r/133970562ccacc93ba19d754012c562351d4a8c8.1645033139.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit `ed6d74446c`) Bug: 217222520 Change-Id: I8be7f20e72efe7ad81999dc75c848fb89664602c Signed-off-by: Andrey Konovalov <andreyknvl@google.com>	2022-04-07 17:51:51 +02:00
David Brazdil	4e56697b42	ANDROID: KVM: arm64: iommu: Optimize snapshot_host_stage2 Currently the generic IOMMU code lets the driver initialize its PT and then invokes callbacks to set the permissions across the entire PA range. Optimize this by making it a requirement on the driver to initialize its PTs to all memory owned by the host. snapshot_host_stage2 then only calls the driver's callback for memory regions not owned by the host. Bug: 190463801 Bug: 218012133 Signed-off-by: David Brazdil <dbrazdil@google.com> Change-Id: I51ff38cb4f4e28e19903af942776b401504c363e	2022-04-07 09:25:16 +01:00
David Brazdil	174ac5b7c5	ANDROID: KVM: arm64: s2mpu: Initialize MPTs to PROT_RW Change the permissions that MPTs are initialized with from PROT_NONE to PROT_RW. No functional change intended as the generic IOMMU code sets permissions for the entire address space later. This will allow to optimize boot time by only unmapping pages not available to host. Bug: 190463801 Bug: 218012133 Signed-off-by: David Brazdil <dbrazdil@google.com> Change-Id: Ic29ec690a84cde22a2ce8fe33e7127711c6f0f3e	2022-04-07 09:25:15 +01:00
David Brazdil	a946ac5ff5	ANDROID: KVM: arm64: iommu: Fix upper bound of PT walk The second argument of the kvm_pgtable_walker callback was misinterpreted as the end of the current entry, where in fact it is the end of the walked memory region. Fix this by computing the end of the current entry from the start and the level. This did not affect correctness, as the code iterates linearly over the entire address space, but it did affect boot time. Bug: 190463801 Bug: 218012133 Signed-off-by: David Brazdil <dbrazdil@google.com> Change-Id: I6d189b87645f47cd215a783c1bc9e1f032ff8c62	2022-04-07 09:25:15 +01:00
Todd Kjos	a63ec2bcac	ANDROID: GKI: 4/6/2022 KMI update Set KMI_GENERATION=3 for 4/6 KMI update Leaf changes summary: 26 artifacts changed Changed leaf types summary: 0 leaf type changed Removed/Changed/Added functions summary: 23 Removed, 0 Changed, 1 Added function Removed/Changed/Added variables summary: 2 Removed, 0 Changed, 0 Added variable 23 Removed functions: [D] 'function file* anon_inode_getfile(const char, const file_operations, void, int)' [D] 'function int compat_only_sysfs_link_entry_to_kobj(kobject, kobject, const char, const char)' [D] 'function int device_match_name(device, void)' [D] 'function gnss_device gnss_allocate_device(device)' [D] 'function void gnss_deregister_device(gnss_device)' [D] 'function int gnss_insert_raw(gnss_device, const unsigned char, size_t)' [D] 'function void gnss_put_device(gnss_device)' [D] 'function int gnss_register_device(gnss_device)' [D] 'function void* idr_replace(idr, void, unsigned long int)' [D] 'function void led_set_brightness_nosleep(led_classdev, led_brightness)' [D] 'function void led_trigger_event(led_trigger, led_brightness)' [D] 'function int led_trigger_register(led_trigger)' [D] 'function void led_trigger_unregister(led_trigger)' [D] 'function dentry* securityfs_create_dir(const char, dentry)' [D] 'function dentry* securityfs_create_file(const char, umode_t, dentry, void, const file_operations)' [D] 'function void securityfs_remove(dentry)' [D] 'function void serdev_device_close(serdev_device)' [D] 'function int serdev_device_open(serdev_device)' [D] 'function unsigned int serdev_device_set_baudrate(serdev_device, unsigned int)' [D] 'function void serdev_device_set_flow_control(serdev_device, bool)' [D] 'function void serdev_device_wait_until_sent(serdev_device, long int)' [D] 'function int serdev_device_write(serdev_device, const unsigned char, size_t, long int)' [D] 'function void serdev_device_write_wakeup(serdev_device)' 1 Added function: [A] 'function void __page_pinner_put_page(page)' 2 Removed variables: [D] 'int efi_tpm_final_log_size' [D] 'const int hash_digest_size[20]' Bug: 228318757 Signed-off-by: Todd Kjos <tkjos@google.com> Change-Id: I947875f13a75de7cb0c2765057cc468cc6810875	2022-04-07 00:54:47 +00:00
Saravana Kannan	ac3d413511	ANDROID: vendor_hooks: Reduce pointless modversions CRC churn When vendor hooks are added to a file that previously didn't have any vendor hooks, we end up indirectly including linux/tracepoint.h. This causes some data types that used to be opaque (forward declared) to the code to become visible to the code. Modversions correctly catches this change in visibility, but we don't really care about the data types made visible when linux/tracepoint.h is included. So, hide this from modversions in the central vendor_hooks.h file instead of having to fix this on a case by case basis. This change itself will cause a one time CRC breakage/churn because it's fixing the existing vendor hook headers, but should reduce unnecessary CRC churns in the future. To avoid future pointless CRC churn, vendor hook header files that include vendor_hooks.h should not include linux/tracepoint.h directly. Bug: 227513263 Bug: 226140073 Signed-off-by: Saravana Kannan <saravanak@google.com> Change-Id: Ia88e6af11dd94fe475c464eb30a6e5e1e24c938b	2022-04-06 15:41:56 -07:00
Minchan Kim	f33dc31c48	ANDROID: mm: gup: additional param in vendor hooks It needs an additional struct page **pages param to judge whether it's possible to migrate pages out of CMA. Bug: 227475444 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I9a8aa57ff91228baf0fc970b8499464c07872c09	2022-04-06 15:41:56 -07:00
Minchan Kim	16b4583a99	ANDROID: mm: page_pinner: fix build warning Remove the build warning below. mm/page_pinner.c:201:28: warning: comparison of distinct pointer types ('typeof ((ts_usec)) *' (aka 'long long *') and 'uint64_t *' (aka 'unsigned long long *')) [-Wcompare-distinct-pointer-types] unsigned long rem_usec = do_div(ts_usec, 1000000); Bug: 218731671 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I4d7b24998c3288f4066b5f88d5ebbf59e04b9873	2022-04-06 15:41:55 -07:00
Minchan Kim	01edbc91e2	ANDROID: mm: page_pinner: change pinner buffer size Introduce $debugfs/page_pinner/buffer_size to change buffer_size on demand. The change of buffer_size will reset the buffer. Bug: 218731671 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I505cdc2ee29aa0c6ed4e2dc2c0b6fcff77c388e4	2022-04-06 15:41:55 -07:00
Minchan Kim	b8a18e852e	ANDROID: mm: page_pinner: remove static buffer We shouldn't waste memory for vendors who don't use page_pinner so remove the page_pinner static buffer. Bug: 218731671 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I46ae2fb5000c4eb59253159032182ca106b39eb9	2022-04-06 15:41:55 -07:00
Minchan Kim	5c70ecb399	ANDROID: mm: page_pinner: remove longterm_pinner From experience, longterm_pinner is not worth maintaining considering how much it churns MM. Just drop the feature and we are good with alloc_contig_failed. The visible effects of this patch are 1. drop $debugfs/page_pinner/longterm_pinner 2. drop put_user_page exported API 3. rename alloc_contig_failed to buffer Bug: 218731671 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I68cc11db448260987a9e26b99647ecb55f571616	2022-04-06 15:41:55 -07:00
Minchan Kim	e17f903a92	ANDROID: mm: page_pinner: change output format for alloc_contig_failed Currently, output format is a little hard to parse how long the page has been pinned since user need to figure out the timeline from migration failure detection to put event. Sometimes, the log buffer would be overflowed so we lost the migration failure event timeline, even. This patch stores the page pinning time in kernel side and keep the information whenever page was released. Thus, user could understand the output easier and never lose the information. Bug: 218731671 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I396f0c12438e0ff8a3497253b750a7e5bb342f57	2022-04-06 15:41:55 -07:00

988040 Commits