Commit Graph

988040 Commits

Author SHA1 Message Date
Yu Zhao
082bc8296a FROMLIST: mm: multi-gen LRU: optimize multiple memcgs
When multiple memcgs are available, it is possible to make better
choices based on generations and tiers and therefore improve the
overall performance under global memory pressure. This patch adds a
rudimentary optimization to select memcgs that can drop single-use
unmapped clean pages first. Doing so reduces the chance of going into
the aging path or swapping. These two operations can be costly.

A typical example that benefits from this optimization is a server
running mixed types of workloads, e.g., heavy anon workload in one
memcg and heavy buffered I/O workload in the other.

Though this optimization can be applied to both kswapd and direct
reclaim, it is only added to kswapd to keep the patchset manageable.
Later improvements will cover the direct reclaim path.

Server benchmark results:
  Mixed workloads:
    fio (buffered I/O): -[23, 25]%
                         IOPS         BW
      patch1-8:          2960k        11.3GiB/s
      patch1-9:          2248k        8783MiB/s

    memcached (anon): +[210, 214]%
                         Ops/sec      KB/sec
      patch1-8:          606940.09    23576.89
      patch1-9:          1895197.49   73619.93

  Mixed workloads:
    fio (buffered I/O): -[4, 6]%
                         IOPS         BW
      5.18-ed4643521e6a: 2369k        9255MiB/s
      patch1-9:          2248k        8783MiB/s

    memcached (anon): +[510, 516]%
                         Ops/sec      KB/sec
      5.18-ed4643521e6a: 309189.58    12010.61
      patch1-9:          1895197.49   73619.93
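
The bracketed ranges above read as relative changes between the two rows of each table. A minimal sketch of that arithmetic, assuming the change is computed against the first row; the helper name percent_change is hypothetical:

```c
#include <assert.h>

/* Relative change from a baseline to a result, in percent. */
double percent_change(double baseline, double result)
{
	assert(baseline != 0.0);
	return (result - baseline) / baseline * 100.0;
}
```

Feeding in the memcached Ops/sec figures (606940.09 and 1895197.49) gives a value inside the reported [210, 214] range, and the fio IOPS figures (2960k and 2248k) give a value inside [-25, -23].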

  Configurations:
    (changes since patch 6)

    cat mixed.sh
    modprobe brd rd_nr=2 rd_size=56623104

    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    mkfs.ext4 /dev/ram1
    mount -t ext4 /dev/ram1 /mnt

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=90m --group_reporting &
    pid=$!

    sleep 200

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    kill -INT $pid
    wait

Client benchmark results:
  no change (CONFIG_MEMCG=n)

Link: https://lore.kernel.org/r/20220309021230.721028-10-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I0641467dbd7c5ba0645602cec7fe8d6fdb750edb
2022-04-18 10:11:55 -07:00
Yu Zhao
93c4f86793 FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.

NB: this behavior has nothing in common with the page table scanning
in the 2.4 kernel [1], which searches page tables for old PTEs, adds
cold pages to the swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
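
One way to picture "each of them gets a unique mm_struct" is a shared cursor over the list. The sketch below is an assumption for illustration only: struct mm_list, mm_list_take() and the integer IDs standing in for mm_struct pointers are hypothetical, and real code would hold a lock around the cursor update.

```c
#include <assert.h>
#include <stddef.h>

#define NR_MM 4   /* illustrative list length */

/* Hypothetical stand-in for a memcg's mm_struct list plus a shared cursor. */
struct mm_list {
	int mm_ids[NR_MM];   /* placeholders for mm_struct pointers */
	size_t next;         /* index of the next entry to hand out */
};

/*
 * Hand the caller the next entry of the current iteration, or -1 once the
 * list is exhausted. Each entry is handed out exactly once, so two walkers
 * never work on the same one.
 */
int mm_list_take(struct mm_list *list)
{
	assert(list != NULL);
	if (list->next >= NR_MM)
		return -1;
	return list->mm_ids[list->next++];
}
```

Two walkers calling mm_list_take() in turn on the same list receive different entries until the list runs out, which is what lets them proceed concurrently.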

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_structs between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
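
Optimization 2 can be sketched as follows, assuming two filters that alternate by generation number. The filter size, the hash and all names here (struct gen_bloom, bloom_add() and so on) are hypothetical and chosen only for illustration; they are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FILTER_BITS 1024u   /* illustrative size */
#define NR_FILTERS  2u      /* one per alternating generation */

struct gen_bloom {
	uint8_t bits[NR_FILTERS][FILTER_BITS / 8];
};

/* Derive two bit positions from one 64-bit multiplicative hash. */
static void bloom_keys(uint64_t item, uint32_t key[2])
{
	uint64_t h = item * 0x9E3779B97F4A7C15ull;

	key[0] = (uint32_t)(h >> 32) % FILTER_BITS;
	key[1] = (uint32_t)h % FILTER_BITS;
}

/* Clear the filter that belongs to generation seq. */
void bloom_reset(struct gen_bloom *f, unsigned long seq)
{
	assert(f != NULL);
	memset(f->bits[seq % NR_FILTERS], 0, sizeof(f->bits[0]));
}

/* Record item (e.g. the address of a populated branch) for generation seq. */
void bloom_add(struct gen_bloom *f, unsigned long seq, uint64_t item)
{
	uint8_t *b = f->bits[seq % NR_FILTERS];
	uint32_t key[2];

	bloom_keys(item, key);
	b[key[0] / 8] |= (uint8_t)(1u << (key[0] % 8));
	b[key[1] / 8] |= (uint8_t)(1u << (key[1] % 8));
}

/* False means "definitely not recorded"; true means "possibly recorded". */
bool bloom_test(const struct gen_bloom *f, unsigned long seq, uint64_t item)
{
	const uint8_t *b = f->bits[seq % NR_FILTERS];
	uint32_t key[2];

	bloom_keys(item, key);
	return ((b[key[0] / 8] >> (key[0] % 8)) & 1) &&
	       ((b[key[1] / 8] >> (key[1] % 8)) & 1);
}
```

Under this assumption a walker would query the filter of the current generation to decide whether a branch is worth descending into, record what it finds into the filter of the next generation, and reset a filter before reusing it.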

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[5.5, 7.5]%
                         Ops/sec      KB/sec
      patch1-7:          1014393.57   39455.42
      patch1-8:          1078507.59   41949.15

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

    patch1-8
      47.02%  lzo1x_1_do_compress (real work)
       6.73%  page_vma_mapped_walk
       6.14%  _raw_spin_unlock_irq
       3.39%  walk_pte_range
       2.63%  ptep_clear_flush
       2.29%  __zram_bvec_write
       2.10%  do_raw_spin_lock
       1.81%  memmove
       1.73%  obj_malloc
       1.53%  free_unref_page_list

  Configurations:
    no change

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo

Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
2022-04-18 10:11:55 -07:00
Yu Zhao
c8356f7573 FROMLIST: mm: multi-gen LRU: exploit locality in rmap
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
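
A minimal userspace sketch of the look-around idea, assuming a flat array in place of a PTE table, MAX_NR_GENS of 4, and a window centered on the hit. struct fake_pte and both helpers are hypothetical; only the window bound and the gen counter formula come from the description above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NR_GENS   4u   /* assumed value for illustration */
#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

struct fake_pte {
	bool young;         /* stand-in for the accessed bit */
	unsigned int gen;   /* stand-in for the gen counter in page->flags */
};

/* Gen counter of the youngest generation: a value within [1, MAX_NR_GENS]. */
unsigned int youngest_gen(unsigned long max_seq)
{
	return (unsigned int)(max_seq % MAX_NR_GENS) + 1;
}

/*
 * Scan a window of at most BITS_PER_LONG entries that contains ptes[hit],
 * i.e. at most BITS_PER_LONG-1 neighbours. Each young neighbour has its
 * accessed bit cleared and its gen counter updated. Returns the number of
 * young neighbours found; ptes[hit] itself is left to the caller.
 */
size_t look_around(struct fake_pte *ptes, size_t nr, size_t hit,
		   unsigned long max_seq)
{
	size_t half = BITS_PER_LONG / 2;
	size_t start, end, i, found = 0;

	assert(ptes != NULL && hit < nr);

	start = hit > half ? hit - half : 0;
	end = start + BITS_PER_LONG < nr ? start + BITS_PER_LONG : nr;

	for (i = start; i < end; i++) {
		if (i == hit || !ptes[i].young)
			continue;
		ptes[i].young = false;
		ptes[i].gen = youngest_gen(max_seq);
		found++;
	}
	return found;
}
```

Every young neighbour handled here is one rmap walk that the reclaim path no longer needs to make for that page.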

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[4, 6]%
                         Ops/sec      KB/sec
      patch1-6:          964656.80    37520.88
      patch1-7:          1014393.57   39455.42

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

  Configurations:
    no change

Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
2022-04-18 10:11:55 -07:00
Yu Zhao
436dff20eb FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
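
How the two sequence numbers might move can be sketched as below. This is a simplification under stated assumptions: a single min_seq instead of one per type, MAX_NR_GENS of 4, "approaches MIN_NR_GENS" read as "is at most MIN_NR_GENS", and a plain page count per generation. struct lruvec_sketch and both helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIN_NR_GENS 2u
#define MAX_NR_GENS 4u   /* assumed value for illustration */

struct lruvec_sketch {
	unsigned long max_seq;
	unsigned long min_seq;
	unsigned long nr_pages[MAX_NR_GENS];   /* indexed by seq % MAX_NR_GENS */
};

/* Aging: produce a new youngest generation when the window has shrunk. */
bool try_inc_max_seq(struct lruvec_sketch *v)
{
	assert(v != NULL && v->max_seq >= v->min_seq);
	if (v->max_seq - v->min_seq + 1 > MIN_NR_GENS)
		return false;
	v->max_seq++;
	return true;
}

/* Eviction: retire the oldest generation once its list is empty. */
bool try_inc_min_seq(struct lruvec_sketch *v)
{
	assert(v != NULL && v->max_seq >= v->min_seq);
	if (v->nr_pages[v->min_seq % MAX_NR_GENS] != 0)
		return false;   /* the oldest generation still has pages */
	if (v->max_seq - v->min_seq + 1 <= MIN_NR_GENS)
		return false;   /* keep at least MIN_NR_GENS generations */
	v->min_seq++;
	return true;
}
```

In this sketch the eviction shrinks the window from the old end and the aging grows it again from the young end, so the number of generations stays between MIN_NR_GENS and MAX_NR_GENS.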

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.
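
The mapping from access count to tier can be sketched in userspace as below. order_base_2_sketch() is a stand-in written for this example that mirrors the rounding of the kernel's order_base_2() (ceil(log2(n)), with 0 and 1 both giving 0); tier_of() is a hypothetical wrapper.

```c
#include <assert.h>

/* ceil(log2(n)), with n = 0 and n = 1 both mapping to 0. */
unsigned int order_base_2_sketch(unsigned long n)
{
	unsigned long v = n > 1 ? n - 1 : 0;
	unsigned int order = 0;

	while (v) {
		order++;
		v >>= 1;
	}
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
unsigned int tier_of(unsigned long n)
{
	return order_base_2_sketch(n);
}
```

With this mapping N=0 and N=1 both land in tier 0 (the baseline tier), N=2 in tier 1, N=3 and N=4 in tier 2, and N=5 through N=8 in tier 3.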

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[38, 40]%
                         IOPS         BW
      5.18-ed4643521e6a: 2547k        9989MiB/s
      patch1-6:          3540k        13.5GiB/s

  Single workload:
    memcached (anon): +[103, 107]%
                         Ops/sec      KB/sec
      5.18-ed4643521e6a: 469048.66    18243.91
      patch1-6:          964656.80    37520.88

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.18-ed4643521e6a
      39.56%  page_vma_mapped_walk
      19.32%  lzo1x_1_do_compress (real work)
       7.18%  do_raw_spin_lock
       4.23%  _raw_spin_unlock_irq
       2.26%  vma_interval_tree_subtree_search
       2.12%  vma_interval_tree_iter_next
       2.11%  folio_referenced_one
       1.90%  anon_vma_interval_tree_iter_first
       1.47%  ptep_clear_flush
       0.97%  __anon_vma_interval_tree_subtree_search

    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
2022-04-18 10:11:54 -07:00
Yu Zhao
fe302bd1f9 FROMLIST: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
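
The relation between sequence numbers, list indices and the gen counter can be sketched as follows, assuming MAX_NR_GENS is 4 and a single min_seq for brevity. struct lrugen_sketch and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIN_NR_GENS 2u
#define MAX_NR_GENS 4u   /* assumed value for illustration */

struct lrugen_sketch {
	unsigned long max_seq;   /* youngest generation number */
	unsigned long min_seq;   /* oldest generation number */
};

/* Truncated generation number: an index into lists[]. */
unsigned int gen_from_seq(unsigned long seq)
{
	return (unsigned int)(seq % MAX_NR_GENS);
}

/* Value kept in the gen counter while the page is on a list; 0 means off-list. */
unsigned int gen_counter(unsigned long seq)
{
	return gen_from_seq(seq) + 1;
}

/* Number of generations currently inside the sliding window. */
unsigned long nr_gens(const struct lrugen_sketch *g)
{
	assert(g != NULL && g->max_seq >= g->min_seq);
	return g->max_seq - g->min_seq + 1;
}

/* The window must hold at least MIN_NR_GENS and at most MAX_NR_GENS. */
bool window_valid(const struct lrugen_sketch *g)
{
	unsigned long n = nr_gens(g);

	return n >= MIN_NR_GENS && n <= MAX_NR_GENS;
}
```

With MAX_NR_GENS assumed to be 4, order_base_2(MAX_NR_GENS+1) is 3, so a 3-bit gen counter is enough to hold the values 0 through 4.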

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking the rendering
   threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros,
i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lore.kernel.org/r/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512
2022-04-18 10:11:54 -07:00
Yu Zhao
4c6c817249 FROMLIST: mm/vmscan.c: refactor shrink_node()
This patch refactors shrink_node() to improve readability for the
upcoming changes to mm/vmscan.c.

Link: https://lore.kernel.org/r/20220309021230.721028-4-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I186f43f946de0d40d54883fb31114840fc749a57
2022-04-18 10:11:54 -07:00
Yu Zhao
95acc9c28b FROMLIST: mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Some architectures support the accessed bit in non-leaf PMD entries,
e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
as part of linear address translation [1]. Page table walkers that
clear the accessed bit may use this capability to reduce their search
space.
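
A userspace sketch of how a walker might use this capability, assuming a two-level toy layout in which each PMD entry carries its own accessed bit. struct pmd_sketch, PTES_PER_TABLE and clear_young() are hypothetical and only illustrate the pruning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTES_PER_TABLE 8   /* illustrative; real tables are much larger */

struct pmd_sketch {
	bool young;                       /* accessed bit of the non-leaf PMD entry */
	bool pte_young[PTES_PER_TABLE];   /* accessed bits of the PTEs below it */
};

/*
 * Clear the accessed bit in all young PTEs, descending only into PTE tables
 * whose PMD entry is young. Returns the number of PTEs cleared and reports
 * the number of PTE tables actually scanned through *tables_scanned.
 */
size_t clear_young(struct pmd_sketch *pmds, size_t nr, size_t *tables_scanned)
{
	size_t i, j, cleared = 0;

	assert(pmds != NULL && tables_scanned != NULL);
	*tables_scanned = 0;

	for (i = 0; i < nr; i++) {
		if (!pmds[i].young)
			continue;   /* nothing below was accessed; skip the table */
		pmds[i].young = false;
		(*tables_scanned)++;
		for (j = 0; j < PTES_PER_TABLE; j++) {
			if (pmds[i].pte_young[j]) {
				pmds[i].pte_young[j] = false;
				cleared++;
			}
		}
	}
	return cleared;
}
```

The saving comes from the skipped tables: a PMD entry whose accessed bit is clear rules out its whole PTE table in one check.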

Note that:
1. Although an inline function is preferable, this capability is added
   as a configuration option for consistency with the existing macros.
2. Due to the little interest in other varieties, this capability was
   only tested on Intel and AMD CPUs.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
     Volume 3 (June 2021), section 4.8

Link: https://lore.kernel.org/r/20220309021230.721028-3-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I73f84a21fd315192eaa3e6443334ed1bccb4e99e
2022-04-18 10:11:54 -07:00
Yu Zhao
1ed19b562b FROMLIST: mm: x86, arm64: add arch_has_hw_pte_young()
Some architectures automatically set the accessed bit in PTEs, e.g.,
x86 and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault
following the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty
page faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it
should not be used in architecture-independent code that involves
correctness, e.g., to determine whether TLB flushes are required (in
combination with the accessed bit).

Link: https://lore.kernel.org/r/20220309021230.721028-2-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ie81175d7e0d239f688d31487b298cf9b4fb66707
2022-04-18 10:11:54 -07:00
Yu Zhao
b4f3b6ac71 UPSTREAM: include/linux/page-flags-layout.h: cleanups
Tidy things up and delete comments that state the obvious, contain
typos or make no sense.

Link: https://lkml.kernel.org/r/20210303071609.797782-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1587db62d8)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I1d57992dd4c68d89c1b9180f280e09d5d08482b6
2022-04-18 10:11:53 -07:00
Yu Zhao
2b286703d9 UPSTREAM: include/linux/page-flags-layout.h: correctly determine LAST_CPUPID_WIDTH
The naming convention used in include/linux/page-flags-layout.h:
  *_SHIFT: the number of bits we are trying to allocate
  *_WIDTH: the number of bits successfully allocated

So when it comes to LAST_CPUPID_WIDTH, we need to check whether all
previous *_WIDTH and LAST_CPUPID_SHIFT can fit into page flags. This
means we need to use NODES_WIDTH, not NODES_SHIFT.
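
The fit check can be sketched as a plain function, assuming the fields competing for page flags are the section, node and zone fields plus the last-cpupid field. struct layout_sketch, its members and all numbers used with it are hypothetical; the header itself does this with preprocessor conditionals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a page-flags layout, all sizes in bits. */
struct layout_sketch {
	int bits_per_long;
	int nr_pageflags;
	int sections_width;      /* bits actually allocated */
	int nodes_width;         /* bits actually allocated, may be 0 */
	int zones_width;         /* bits actually allocated */
	int last_cpupid_shift;   /* bits we would like to allocate */
};

/* Bits granted to last_cpupid: all of them if they fit, otherwise none. */
int last_cpupid_width(const struct layout_sketch *l)
{
	int used, avail;

	assert(l != NULL);
	used = l->sections_width + l->nodes_width + l->zones_width;
	avail = l->bits_per_long - l->nr_pageflags;
	return used + l->last_cpupid_shift <= avail ? l->last_cpupid_shift : 0;
}
```

Summing the *_WIDTH values matters when an earlier field was wanted but not granted: its *_SHIFT would still be non-zero and could make the check fail even though the bits are free.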

Link: https://lkml.kernel.org/r/20210303071609.797782-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f73c6c8805)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I6d7c58cf5d10e302adc818ac7e1fd727208d23c8
2022-04-18 10:11:53 -07:00
Yu Zhao
80343eeaf3 UPSTREAM: mm/swap: don't SetPageWorkingset unconditionally during swapin
We are capable of SetPageWorkingset based on refault distances after
commit aae466b005 ("mm/swap: implement workingset detection for
anonymous LRU").  This is done by workingset_refault(), which is right
above the unconditional SetPageWorkingset deleted by this patch.

The unconditional SetPageWorkingset miscategorizes pages that are read
ahead or never belonged to the working set (e.g., tmpfs pages accessed
only once by fd).  When those pages are swapped in (after they were
swapped out) for the first time, they skew PSI (when using async swap).
When this happens again, depending on their refault distances, they might
skew workingset_restore_anon counter in addition to PSI because their
shadows indicate they were part of the working set.

Historically, SetPageWorkingset was added as part of the PSI series, and
Johannes said:
 "It was meant to mark incoming pages under IO with SetPageWorkingset
  when waiting for them constituted a memory stall.

  On the page cache side, because we HAVE workingset detection, this was
  specific to recently evicted pages that had been active in their
  previous life. On the anon side, the aging algorithm had no
  distinction between workingset and sporadically used pages. Given the
  choice between a) no swapin stalls are pressure and b) all swapin
  stalls are pressure, I went with the latter in order to detect swap
  storms. The false positive case - high rate of swapin without severe
  memory pressure - was relatively unlikely, because we tried to avoid
  swapping until everything was completely on fire in the first place."

Link: https://lkml.kernel.org/r/20201209012400.1771150-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20201214231253.62313-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cad8320b4b)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ifa9c5fa2e875e6ccee6c3f7e2d2983278d54c220
2022-04-18 10:11:53 -07:00
Yu Zhao
0c20cff831 UPSTREAM: include/linux/mm_inline.h: fold page_lru_base_type() into its sole caller
We've removed all other references to this function.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-9-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c1770e34f3)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: If229fa7a09e5be79cc28dc5a780b900e69f4ce64
2022-04-18 10:11:53 -07:00
Yu Zhao
aadc45fae6 BACKPORT: mm: VM_BUG_ON lru page flags
Move scattered VM_BUG_ONs to two essential places that cover all
lru list additions and deletions.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-8-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-8-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit bc7112719e)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I950ad4171f973c740d9fc3778d44efc020d0e12c
2022-04-18 10:11:53 -07:00
Yu Zhao
bcc2f50f7b BACKPORT: mm: add __clear_page_lru_flags() to replace page_off_lru()
Similar to page_off_lru(), the new function does non-atomic clearing
of PageLRU() in addition to PageActive() and PageUnevictable(), on a
page that has no references left.

If PageActive() and PageUnevictable() are both set, refuse to clear
either and leave them to bad_page(). This is a behavior change that
is meant to help debug.
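
A minimal user-space sketch of the behavior described above, assuming a
toy page structure with three flag bits. The flag constants, the struct
and the function name are hypothetical stand-ins, not the kernel's
definitions:

```c
#include <stdbool.h>

/* Hypothetical flag bits for a toy page; the kernel's layout differs. */
#define TOY_PG_LRU         (1u << 0)
#define TOY_PG_ACTIVE      (1u << 1)
#define TOY_PG_UNEVICTABLE (1u << 2)

struct toy_page {
	unsigned int flags;
};

/*
 * Non-atomically clear the LRU flag, then clear Active and Unevictable
 * unless both are set. That combination should never happen, so it is
 * left untouched for a later sanity check to report.
 * Returns true when Active/Unevictable were cleared.
 */
static bool toy_clear_page_lru_flags(struct toy_page *page)
{
	page->flags &= ~TOY_PG_LRU;

	if ((page->flags & TOY_PG_ACTIVE) && (page->flags & TOY_PG_UNEVICTABLE))
		return false;

	page->flags &= ~(TOY_PG_ACTIVE | TOY_PG_UNEVICTABLE);
	return true;
}
```

The boolean return value exists only to make the sketch easy to check;
the commit message does not say what the real helper returns.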

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-7-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8756017962)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I0290916fa08277c50e228a8d3f39af67d62ff9d0
2022-04-18 10:11:53 -07:00
Yu Zhao
552f416558 BACKPORT: mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list()
The parameter is redundant in the sense that it can be potentially
extracted from the "struct page" parameter by page_lru(). We need to
make sure that existing PageActive() or PageUnevictable() remains
until the function returns. A few places don't conform, and simple
reordering fixes them.

This patch may have left page_off_lru() seemingly odd, and we'll take
care of it in the next patch.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-6-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 46ae6b2cc2)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I1e14dcbf4111b39cf155ed3512423448865eb324
2022-04-18 10:11:53 -07:00
Yu Zhao
10899adee3 UPSTREAM: mm/swap.c: don't pass "enum lru_list" to trace_mm_lru_insertion()
The parameter is redundant in the sense that it can be extracted
from the "struct page" parameter by page_lru() correctly.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-5-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-5-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 861404536a)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ia02c0c65dd427a98ffa39e9dc3e2ae701e85fad8
2022-04-18 10:11:52 -07:00
Yu Zhao
c18b4f50ce BACKPORT: mm: don't pass "enum lru_list" to lru list addition functions
The "enum lru_list" parameter to add_page_to_lru_list() and
add_page_to_lru_list_tail() is redundant in the sense that it can
be extracted from the "struct page" parameter by page_lru().

A caveat is that we need to make sure PageActive() or
PageUnevictable() is correctly set or cleared before calling
these two functions. And they are indeed.
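
As a rough illustration of why the parameter is redundant, the list can
be derived from the page state alone. This is a user-space sketch with a
hypothetical page struct and enum, assuming the list is chosen from
three per-page properties; it is not the kernel's page_lru():

```c
#include <stdbool.h>

/* Hypothetical list indices; each active list follows its inactive one. */
enum demo_lru_list {
	DEMO_LRU_INACTIVE_ANON,
	DEMO_LRU_ACTIVE_ANON,
	DEMO_LRU_INACTIVE_FILE,
	DEMO_LRU_ACTIVE_FILE,
	DEMO_LRU_UNEVICTABLE,
};

/* Toy page: three booleans stand in for the real page flags. */
struct demo_page {
	bool active;
	bool unevictable;
	bool file_backed;
};

/* Derive the list from the page state, so callers need not pass it. */
static enum demo_lru_list demo_page_lru(const struct demo_page *page)
{
	int lru;

	if (page->unevictable)
		return DEMO_LRU_UNEVICTABLE;

	lru = page->file_backed ? DEMO_LRU_INACTIVE_FILE : DEMO_LRU_INACTIVE_ANON;
	if (page->active)
		lru += 1; /* step from the inactive list to the active one */

	return (enum demo_lru_list)lru;
}
```

The sketch also shows the caveat: the result is only right if the
active and unevictable state is already correct when the helper runs.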

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-4-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-4-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3a9c9788a3)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I0d92b845d18e6ab3bcb5645f22e3cedb04257d98
2022-04-18 10:11:52 -07:00
Yu Zhao
32ebee4382 BACKPORT: include/linux/mm_inline.h: shuffle lru list addition and deletion functions
These functions will call page_lru() in the following patches.  Move them
below page_lru() to avoid the forward declaration.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-3-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-3-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f90d8191ac)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I32b8565107c9134e656b43886c00105eb07b34dd
2022-04-18 10:11:52 -07:00
Yu Zhao
885e11e970 BACKPORT: mm/vmscan.c: use add_page_to_lru_list()
Patch series "mm: lru related cleanups", v2.

The cleanups are intended to reduce the verbosity in lru list operations
and make them less error-prone.  A typical example would be how the
patches change __activate_page():

 static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
 	if (!PageActive(page) && !PageUnevictable(page)) {
-		int lru = page_lru_base_type(page);
 		int nr_pages = thp_nr_pages(page);

-		del_page_from_lru_list(page, lruvec, lru);
+		del_page_from_lru_list(page, lruvec);
 		SetPageActive(page);
-		lru += LRU_ACTIVE;
-		add_page_to_lru_list(page, lruvec, lru);
+		add_page_to_lru_list(page, lruvec);
 		trace_mm_lru_activate(page);

There are a few more places like __activate_page() and they are
unnecessarily repetitive in terms of figuring out which list a page should
be added onto or deleted from.  And with the duplicated code removed, they
are easier to read, IMO.

Patch 1 to 5 basically cover the above.  Patch 6 and 7 make code more
robust by improving bug reporting.  Patch 8, 9 and 10 take care of some
dangling helpers left in header files.

This patch (of 10):

There is add_page_to_lru_list(), and move_pages_to_lru() should reuse it,
not duplicate it.

Link: https://lkml.kernel.org/r/20210122220600.906146-1-yuzhao@google.com
Link: https://lore.kernel.org/linux-mm/20201207220949.830352-2-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 42895ea73b)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I7e09be6bedcd451c4e8c790c969306b6ca3adebd
2022-04-18 10:11:52 -07:00
Yifan Hong
75020bfbe2 ANDROID: Move BRANCH from build.config.common to .constants.
This allows Bazel to load the value of $BRANCH in order
to determine the value of --dist_dir of copy_to_dist_dir
statically.

Test: TH
Bug: 229268271

Change-Id: Iff759b8188360ea1b2bc204d29750eece9095582
Signed-off-by: Yifan Hong <elsk@google.com>
2022-04-14 14:20:35 -07:00
Woody Lin
5ef1198a15 ANDROID: Update the ABI symbol list
Leaf changes summary: 5 artifacts changed
Changed leaf types summary: 0 leaf type changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 5 Added functions
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable

5 Added functions:

  [A] 'function void interval_tree_insert(interval_tree_node*, rb_root_cached*)'
  [A] 'function interval_tree_node* interval_tree_iter_first(rb_root_cached*, unsigned long int, unsigned long int)'
  [A] 'function interval_tree_node* interval_tree_iter_next(interval_tree_node*, unsigned long int, unsigned long int)'
  [A] 'function void interval_tree_remove(interval_tree_node*, rb_root_cached*)'
  [A] 'function void suspend_set_ops(const platform_suspend_ops*)'

Bug: 226105845
Bug: 226167799
Signed-off-by: Woody Lin <woodylin@google.com>
Change-Id: I5da0ec8c678e36a46418c0f440fad87de1ac7a52
2022-04-14 16:15:37 +00:00
Fuad Tabba
0a227f89cf ANDROID: KVM: arm64: Do not allow memslot modifications once a PVM has run
Currently trying to move or delete a memslot results in a warning
and a failure. Userspace shouldn't be able to trigger kernel
warnings.

The cause is that in protected mode, stage-2 is managed by hyp.
Modifying a memslot flushes the shadow memslot, which tries to
unmap any stage-2 mapped pages.

Bug: 226890762
Signed-off-by: Fuad Tabba <tabba@google.com>
Change-Id: Icc6a0aada76e8492285cd5509bad1ee57700af7c
2022-04-14 11:59:20 +01:00
Daniel Rosenberg
8be6e93244 ANDROID: fuse-bpf: Fix read_iter
We had a size mismatch for the return value, leading to EIOCBQUEUED
getting interpreted as a return size instead of an error code.
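
The message does not say which types were mismatched. As one possible
illustration, assuming the negative code passed through a 32-bit
unsigned value before being widened again, the error turns into a large
positive "size". The helper names below are hypothetical and int64_t
stands in for ssize_t:

```c
#include <stdint.h>

#define EIOCBQUEUED 529 /* kernel-internal code: the iocb has been queued */

/* Assumed buggy pattern: a 32-bit unsigned cannot keep the sign. */
static int64_t widen_through_u32(int64_t ret)
{
	uint32_t narrow = (uint32_t)ret; /* -529 becomes 4294966767 */

	return (int64_t)narrow;          /* widened as a positive size */
}

/* Matching widths: the negative error code survives unchanged. */
static int64_t keep_width(int64_t ret)
{
	int64_t same = ret;

	return same;
}
```

With -EIOCBQUEUED as input, the first helper yields a positive number
that a caller would treat as a byte count, while the second one still
reports the error.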

Test: generic/467, generic/013, and fuse_test
Bug: 217570523
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I64f9d5263f8b37d3c0e286467f9351997b294cc2
2022-04-13 21:21:57 +00:00
Daniel Rosenberg
128ed57bca ANDROID: fuse-bpf: Use cache and refcount
Allocates the iocb we create for asynchronous IO from a cache instead of
a regular kzalloc

Test: generic/467 and fuse_test
Bug: 217570523
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I27dcec89cd585835f6a8e80e1ae30c503f4038c8
2022-04-13 21:21:49 +00:00
Daniel Rosenberg
8e24eb9a2d ANDROID: fuse-bpf: Rename iocb_fuse to iocb_orig
The current name is a bit confusing. iocb_fuse could refer to the iocb
passed to fuse or created by fuse. The new name unambiguously refers to
the one passed in to fuse.

Test: compiles, behavior unchanged
Bug: 217570523
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I955500eb8a3186252427fd06ca6e99b4fec469b6
2022-04-13 21:21:39 +00:00
Daniel Rosenberg
0f51319527 ANDROID: fuse-bpf: Fix fixattr in rename
Existing fixattr was adjusting the same node twice.

Bug: 226655982
Test: generic/241 generic/269
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I4b1cb6d626ee6bd9010012ac126b78f14d6157d0
2022-04-13 21:21:33 +00:00
Daniel Rosenberg
0c37c1459a ANDROID: fuse-bpf: Fix readdir
Fuse uses generic_file_llseek, so we must account for that in readdir to
ensure we read from the correct offset in the lower filesystem.

Bug: 226655281
Test: generic/257, fuse_test
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ie752c1c645e95b7c03ef9497562758a5c42b514a
2022-04-13 21:21:18 +00:00
Yi Kong
68c9936883 ANDROID: clang: update to 14.0.4
Bug: 225394140
Signed-off-by: Yi Kong <yikong@google.com>
Change-Id: I9561e11768217b1ea9ab7c90d87445843784f8e9
2022-04-13 04:35:38 +00:00
Minchan Kim
7a197aa504 ANDROID: mm: fix build break
MIGRATE_CMA is defined only when CONFIG_CMA. Thus, we
couldn't use MIGRATE_CMA directly to build for both
!CONFIG_CMA and CONFIG_CMA.

Let's use MIGRATE_RECLAIMABLE in the case.

Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Idb4fc6f4ea02ab074f270ce62001182c8fff3b37
2022-04-12 15:00:39 -07:00
Minchan Kim
d9e4b67784 ANDROID: mm: freeing MIGRATE_ISOLATE page instantly
Since Android has pcp list for MIGRATE_CMA[1], it could cause
CMA allocation latency due to not freeing the MIGRATE_ISOLATE
page immediately.

Originally, a MIGRATE_ISOLATE page is supposed to go to the buddy list,
skipping the pcp list. Otherwise, the page could be reallocated
from the pcp list or stay on the pcp list until the pcp is drained,
so CMA keeps retrying because it cannot find the freed page
in the buddy list. That worked before since the CMA pfnblocks changed
only from MIGRATE_CMA to MIGRATE_ISOLATE and the free function logic
in the page allocator checked for MIGRATE_ISOLATE on every CMA
page using the code below.

  free_unref_page_commit
    if (migratetype >= MIGRATE_PCPTYPES)
      if(is_migrate_isolate(migratetype))
        free_one_page(page);

It worked since enum MIGRATE_CMA was bigger than enum
MIGRATE_PCPTYPES but since [1], the enum MIGRATE_CMA is less than
MIGRATE_PCPTYPES so the logic above doesn't work any more.
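
The ordering dependency can be sketched as follows. Both enum layouts
are illustrative assumptions (the real enums contain more entries), and
reaches_isolate_check() is a hypothetical helper that models only the
outer comparison from the snippet above:

```c
#include <stdbool.h>

/* Assumed layout before [1]: CMA sorts after the pcp types. */
enum layout_before {
	B_UNMOVABLE,
	B_MOVABLE,
	B_RECLAIMABLE,
	B_PCPTYPES,
	B_CMA,
	B_ISOLATE,
};

/* Assumed layout after [1]: CMA has its own pcp list. */
enum layout_after {
	A_UNMOVABLE,
	A_MOVABLE,
	A_RECLAIMABLE,
	A_CMA,
	A_PCPTYPES,
	A_ISOLATE,
};

/* Models "if (migratetype >= MIGRATE_PCPTYPES)" for a given layout. */
static bool reaches_isolate_check(int migratetype, int pcptypes)
{
	return migratetype >= pcptypes;
}
```

With the first layout a page recorded as CMA reaches the isolate check;
with the second it does not, which is the situation described here.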

It could cause the following race:

         CPU 0                            CPU 1
  free_unref_page
  migratetype = get_pfnblock_migratetype()
  set_pcppage_migratetype(MIGRATE_CMA)

                                cma_alloc
                                alloc_contig_range
                                set_migrate_isolate(MIGRATE_ISOLATE)
  add the page into pcp list
  the page could be reallocated

This patch cannot fix the race completely because order-0 page free
does not take zone->lock (for performance reasons). However, that is
not a new problem, so it needs to be dealt with separately.

[1] ANDROID: mm: add cma pcp list

Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ibea20085ce5bfb4b74b83b041f9bda9a380120f9
2022-04-12 15:50:50 +00:00
Will Deacon
83aa7ef838 ANDROID: KVM: arm64: Fix size calculation of FFA memory range
Ensure that the FFA memory range to be checked and annotated in the host
stage-2 page-table is page-aligned and that its size is calculated using
64-bit arithmetic to avoid the host triggering overflow and subsequent
truncation.
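
A small sketch of the arithmetic problem, assuming a range described by
a 32-bit page count and 4 KiB pages. The names and the page size are
hypothetical; the commit message does not show the actual fields:

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* 32-bit multiply: the product wraps before it is widened. */
static uint64_t range_size_32(uint32_t pg_cnt)
{
	uint32_t size = pg_cnt * SKETCH_PAGE_SIZE;

	return size;
}

/* 64-bit multiply: widen first, so the full size is preserved. */
static uint64_t range_size_64(uint32_t pg_cnt)
{
	return (uint64_t)pg_cnt * SKETCH_PAGE_SIZE;
}

/* The range start must sit on a page boundary. */
static bool is_page_aligned(uint64_t addr)
{
	return (addr & (SKETCH_PAGE_SIZE - 1)) == 0;
}
```

For a page count of 0x100001, the 32-bit version reports a single page
(0x1000 bytes) while the 64-bit version reports 0x100001000 bytes, so a
check based on the truncated value would cover far less memory than the
caller asked for.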

Bug: 228889679
Reported-by: Gulshan Singh <gsgx@google.com>
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: Ifc51ee9598905cf2926d19c53159804f89d74040
2022-04-12 11:28:50 +01:00
Will Deacon
2d2e0ad1d1 ANDROID: KVM: arm64: Pin FFA mailboxes shared by the host
Gulshan reports that the hypervisor is not pinning the host FFA mailbox
pages, therefore allowing the host to unshare them after registration
and to later donate them for things like page-table pages.

Pin the host FFA mailboxes to prevent the host from unsharing them while
they are in use.

Bug: 228931886
Reported-by: Gulshan Singh <gsgx@google.com>
Signed-off-by: Will Deacon <willdeacon@google.com>
Change-Id: I18ecad6ccaa3ef89015a71d97890fad55f0568f2
2022-04-12 11:21:52 +01:00
Paul Lawrence
b196350f2a ANDROID: fuse-bpf: Fix lseek return value for offset 0
Bug: 227160050
Test: audible app now works
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Change-Id: Ib14765285190b5838f28c25a69c91935d02c34f4
2022-04-11 15:11:24 -07:00
Will McVicker
bba21782c8 ANDROID: Update the ABI symbol list and xml
Leaf changes summary: 1 artifact changed
Changed leaf types summary: 0 leaf type changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 1 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable

1 Added function:

  [A] 'function void __drm_printfn_debug(drm_printer*, va_format*)'

Bug: 202781851
Change-Id: I8c0270ac538462cc64246195e20f5c653f5894cc
Signed-off-by: Midas Chien <midaschieh@google.com>
Signed-off-by: Will McVicker <willmcvicker@google.com>
2022-04-11 11:20:46 -07:00
Greg Kroah-Hartman
e5765b86ce ANDROID: GKI: set more vfs-only exports into their own namespace
There are more vfs-only symbols that OEMs want to use, so place them in
the proper vfs-only namespace.

Bug: 157965270
Bug: 210074446
Bug: 227656251
Cc: Matthias Maennich <maennich@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I99b9facc8da45fb329f6627d204180d1f89bcf97
2022-04-08 15:46:37 +02:00
Quentin Perret
74ff6e66d2 ANDROID: KVM: arm64: Fix ToCToU issue when refilling the hyp memcache
Xiling reports that the hypervisor dereferences the host memcache struct
twice when refilling its own memcache. This allows the host to change its
memcache head after it has been admitted and before it is consumed,
leading to an arbitrary write in hypervisor memory.

Fix this by copying the host memcache on the stack before starting to
refill hence guaranteeing its stability.
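
The pattern can be sketched in user space as follows. The struct, the
admission check and the callback that stands in for a concurrent host
write are all hypothetical; only the "copy once, then check and use the
copy" idea comes from the description above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a memcache header living in host memory. */
struct sketch_memcache {
	uint64_t head;     /* address of the first free page */
	uint32_t nr_pages; /* number of pages in the cache */
};

/* Stand-in for a host write landing between the check and the use. */
typedef void (*sketch_racer_fn)(struct sketch_memcache *shared);

static void sketch_hostile_host(struct sketch_memcache *shared)
{
	shared->head = 0xdead0001; /* a value the check would reject */
}

/* Toy admission check: the head must be non-zero and 4 KiB aligned. */
static bool sketch_head_ok(uint64_t head)
{
	return head != 0 && (head & 0xfff) == 0;
}

/* Double fetch: checks one read of the shared struct, consumes another. */
static uint64_t sketch_refill_double_fetch(struct sketch_memcache *host,
					    sketch_racer_fn racer)
{
	if (!sketch_head_ok(host->head))
		return 0;
	racer(host);       /* the host changes the head inside the window */
	return host->head; /* second fetch: this value was never checked */
}

/* Snapshot: copy once to the stack, then check and consume the copy. */
static uint64_t sketch_refill_snapshot(struct sketch_memcache *host,
				       sketch_racer_fn racer)
{
	struct sketch_memcache snap = *host;

	if (!sketch_head_ok(snap.head))
		return 0;
	racer(host);       /* later host writes no longer matter */
	return snap.head;
}
```

Starting from a valid head of 0x1000, the double-fetch variant ends up
consuming 0xdead0001, whereas the snapshot variant still consumes the
value that passed the check.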

Bug: 228435321
Reported-by: Xiling Gong <xiling@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: Ib7c5db203e4a4a7f27eb9f0c0083f4b5c726b4d9
2022-04-08 12:34:52 +00:00
Minchan Kim
8fe46774c6 ANDROID: mm: page_pinner: remove dump_page_pinner
This patch removes dump_page_pinner since it was not useful (IOW,
the page_pinner buffer that keeps the history is enough).

This patch also fixes a mismatched printf format specifier.

Bug: 218731671
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I80c6f5ad656b3b0d27a50eabff4d1382559aa105
2022-04-07 23:15:24 +00:00
Andrey Konovalov
94c6c10c39 BACKPORT: mm, kasan: fix __GFP_BITS_SHIFT definition breaking LOCKDEP
[Backport: resolve conflicts caused by CONFIG_CMA.]

KASAN changes that added new GFP flags mistakenly updated
__GFP_BITS_SHIFT as the total number of GFP bits instead of as a shift
used to define __GFP_BITS_MASK.

This broke LOCKDEP, as __GFP_BITS_MASK now gets the 25th bit enabled
instead of the 28th for __GFP_NOLOCKDEP.

Update __GFP_BITS_SHIFT to always count KASAN GFP bits.
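
The relationship between the shift and the mask can be sketched like
this. The bit positions used in the usage note are illustrative
assumptions and the helper names are hypothetical:

```c
#include <stdbool.h>

/* A mask built from a shift covers bits 0 .. bits_shift - 1. */
static unsigned int sketch_bits_mask(unsigned int bits_shift)
{
	return (1u << bits_shift) - 1; /* requires bits_shift < 32 */
}

/* True when the flag at zero-based position flag_bit is inside the mask. */
static bool sketch_mask_covers_bit(unsigned int bits_shift,
				   unsigned int flag_bit)
{
	return (sketch_bits_mask(bits_shift) & (1u << flag_bit)) != 0;
}
```

With a shift of 25, a flag that lives in the 28th bit (zero-based bit
27) is masked off; with a shift of 28 it is kept. A shift that is too
small therefore silently drops the highest flags.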

In the future, we could handle all combinations of KASAN and LOCKDEP to
occupy as few bits as possible.  For now, we have enough GFP bits to be
inefficient in this quick fix.

Link: https://lkml.kernel.org/r/462ff52742a1fcc95a69778685737f723ee4dfb3.1648400273.git.andreyknvl@google.com
Fixes: 9353ffa6e9 ("kasan, page_alloc: allow skipping memory init for HW_TAGS")
Fixes: 53ae233c30 ("kasan, page_alloc: allow skipping unpoisoning for HW_TAGS")
Fixes: f49d9c5bb1 ("kasan, mm: only define ___GFP_SKIP_KASAN_POISON with HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 78d104f8b401c81d140adad91e027d7d83b3315c)
Bug: 217222520
Change-Id: I82484635012c5773c6ef9164a9368d9e61157f87
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
2022-04-07 17:51:52 +02:00
Andrey Konovalov
7bfa608df5 UPSTREAM: kasan: test: support async (again) and asymm modes for HW_TAGS
Async mode support has already been implemented in commit e80a76aa1a
("kasan, arm64: tests supports for HW_TAGS async mode") but then got
accidentally broken in commit 99734b535d ("kasan: detect false-positives
in tests").

Restore the changes removed by the latter patch and adapt them for asymm
mode: add a sync_fault flag to kunit_kasan_expectation that only gets set
if the MTE fault was synchronous, and reenable MTE on such faults in
tests.

Also rename kunit_kasan_expectation to kunit_kasan_status and move its
definition to mm/kasan/kasan.h from include/linux/kasan.h, as this
structure is only internally used by KASAN.  Also put the structure
definition under IS_ENABLED(CONFIG_KUNIT).

Link: https://lkml.kernel.org/r/133970562ccacc93ba19d754012c562351d4a8c8.1645033139.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ed6d74446c)
Bug: 217222520
Change-Id: I8be7f20e72efe7ad81999dc75c848fb89664602c
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
2022-04-07 17:51:51 +02:00
David Brazdil
4e56697b42 ANDROID: KVM: arm64: iommu: Optimize snapshot_host_stage2
Currently the generic IOMMU code lets the driver initialize its PT and
then invokes callbacks to set the permissions across the entire PA
range. Optimize this by making it a requirement on the driver to
initialize its PTs to all memory owned by the host. snapshot_host_stage2
then only calls the driver's callback for memory regions not owned by
the host.

Bug: 190463801
Bug: 218012133
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I51ff38cb4f4e28e19903af942776b401504c363e
2022-04-07 09:25:16 +01:00
David Brazdil
174ac5b7c5 ANDROID: KVM: arm64: s2mpu: Initialize MPTs to PROT_RW
Change the permissions that MPTs are initialized with from PROT_NONE to
PROT_RW. No functional change intended as the generic IOMMU code
sets permissions for the entire address space later. This will allow
optimizing boot time by only unmapping pages not available to the host.

Bug: 190463801
Bug: 218012133
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: Ic29ec690a84cde22a2ce8fe33e7127711c6f0f3e
2022-04-07 09:25:15 +01:00
David Brazdil
a946ac5ff5 ANDROID: KVM: arm64: iommu: Fix upper bound of PT walk
The second argument of the kvm_pgtable_walker callback was
misinterpreted as the end of the current entry, where in fact it is
the end of the walked memory region. Fix this by computing the end of
the current entry from the start and the level.
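
A sketch of that computation, assuming 4 KiB pages and page-table
levels numbered 0 to 3 with level 3 as the last level. The helper names
are hypothetical, and aligning the start down to the entry boundary is
an extra assumption made here for safety:

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define SKETCH_LAST_LEVEL 3  /* assumption: levels run from 0 to 3 */

/* Size of the region mapped by one entry at the given level (0..3). */
static uint64_t sketch_granule_size(unsigned int level)
{
	/* Each level resolves PAGE_SHIFT - 3 bits above the page offset. */
	unsigned int shift =
		(SKETCH_PAGE_SHIFT - 3) * (SKETCH_LAST_LEVEL + 1 - level) + 3;

	return UINT64_C(1) << shift;
}

/* End of the entry containing addr, from the address and level alone. */
static uint64_t sketch_entry_end(uint64_t addr, unsigned int level)
{
	uint64_t size = sketch_granule_size(level);

	return (addr & ~(size - 1)) + size;
}
```

Under these assumptions a level-3 entry spans 4 KiB, a level-2 entry
2 MiB and a level-1 entry 1 GiB, so the end no longer depends on where
the walked region as a whole stops.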

This did not affect correctness, as the code iterates linearly over
the entire address space, but it did affect boot time.

Bug: 190463801
Bug: 218012133
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I6d189b87645f47cd215a783c1bc9e1f032ff8c62
2022-04-07 09:25:15 +01:00
Todd Kjos
a63ec2bcac ANDROID: GKI: 4/6/2022 KMI update
Set KMI_GENERATION=3 for 4/6 KMI update

Leaf changes summary: 26 artifacts changed
Changed leaf types summary: 0 leaf type changed
Removed/Changed/Added functions summary: 23 Removed, 0 Changed, 1 Added function
Removed/Changed/Added variables summary: 2 Removed, 0 Changed, 0 Added variable

23 Removed functions:

  [D] 'function file* anon_inode_getfile(const char*, const file_operations*, void*, int)'
  [D] 'function int compat_only_sysfs_link_entry_to_kobj(kobject*, kobject*, const char*, const char*)'
  [D] 'function int device_match_name(device*, void*)'
  [D] 'function gnss_device* gnss_allocate_device(device*)'
  [D] 'function void gnss_deregister_device(gnss_device*)'
  [D] 'function int gnss_insert_raw(gnss_device*, const unsigned char*, size_t)'
  [D] 'function void gnss_put_device(gnss_device*)'
  [D] 'function int gnss_register_device(gnss_device*)'
  [D] 'function void* idr_replace(idr*, void*, unsigned long int)'
  [D] 'function void led_set_brightness_nosleep(led_classdev*, led_brightness)'
  [D] 'function void led_trigger_event(led_trigger*, led_brightness)'
  [D] 'function int led_trigger_register(led_trigger*)'
  [D] 'function void led_trigger_unregister(led_trigger*)'
  [D] 'function dentry* securityfs_create_dir(const char*, dentry*)'
  [D] 'function dentry* securityfs_create_file(const char*, umode_t, dentry*, void*, const file_operations*)'
  [D] 'function void securityfs_remove(dentry*)'
  [D] 'function void serdev_device_close(serdev_device*)'
  [D] 'function int serdev_device_open(serdev_device*)'
  [D] 'function unsigned int serdev_device_set_baudrate(serdev_device*, unsigned int)'
  [D] 'function void serdev_device_set_flow_control(serdev_device*, bool)'
  [D] 'function void serdev_device_wait_until_sent(serdev_device*, long int)'
  [D] 'function int serdev_device_write(serdev_device*, const unsigned char*, size_t, long int)'
  [D] 'function void serdev_device_write_wakeup(serdev_device*)'

1 Added function:

  [A] 'function void __page_pinner_put_page(page*)'

2 Removed variables:

  [D] 'int efi_tpm_final_log_size'
  [D] 'const int hash_digest_size[20]'

Bug: 228318757
Signed-off-by: Todd Kjos <tkjos@google.com>
Change-Id: I947875f13a75de7cb0c2765057cc468cc6810875
2022-04-07 00:54:47 +00:00
Saravana Kannan
ac3d413511 ANDROID: vendor_hooks: Reduce pointless modversions CRC churn
When vendor hooks are added to a file that previously didn't have any
vendor hooks, we end up indirectly including linux/tracepoint.h.  This
causes some data types that used to be opaque (forward declared) to the
code to become visible to the code.

Modversions correctly catches this change in visibility, but we don't
really care about the data types made visible when linux/tracepoint.h is
included. So, hide this from modversions in the central vendor_hooks.h file
instead of having to fix this on a case by case basis.

This change itself will cause a one time CRC breakage/churn because it's
fixing the existing vendor hook headers, but should reduce unnecessary CRC
churns in the future.

To avoid future pointless CRC churn, vendor hook header files that include
vendor_hooks.h should not include linux/tracepoint.h directly.

Bug: 227513263
Bug: 226140073
Signed-off-by: Saravana Kannan <saravanak@google.com>
Change-Id: Ia88e6af11dd94fe475c464eb30a6e5e1e24c938b
2022-04-06 15:41:56 -07:00
Minchan Kim
f33dc31c48 ANDROID: mm: gup: additional param in vendor hooks
It needs an additional struct page **pages parameter to judge whether
it's possible to migrate pages out of CMA.

Bug: 227475444
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I9a8aa57ff91228baf0fc970b8499464c07872c09
2022-04-06 15:41:56 -07:00
Minchan Kim
16b4583a99 ANDROID: mm: page_pinner: fix build warning
Remove the build warning below.

mm/page_pinner.c:201:28: warning: comparison of distinct pointer types ('typeof ((ts_usec)) *' (aka 'long long *') and 'uint64_t *' (aka 'unsigned long long *')) [-Wcompare-distinct-pointer-types]
                unsigned long rem_usec = do_div(ts_usec, 1000000);

Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I4d7b24998c3288f4066b5f88d5ebbf59e04b9873
2022-04-06 15:41:55 -07:00
Minchan Kim
01edbc91e2 ANDROID: mm: page_pinner: change pinner buffer size
Introduce $debugfs/page_pinner/buffer_size to change
buffer_size on demand. The change of buffer_size will
reset the buffer.

Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I505cdc2ee29aa0c6ed4e2dc2c0b6fcff77c388e4
2022-04-06 15:41:55 -07:00
Minchan Kim
b8a18e852e ANDROID: mm: page_pinner: remove static buffer
We shouldn't waste memory for vendors who don't use
page_pinner so remove the page_pinner static buffer.

Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I46ae2fb5000c4eb59253159032182ca106b39eb9
2022-04-06 15:41:55 -07:00
Minchan Kim
5c70ecb399 ANDROID: mm: page_pinner: remove longterm_pinner
From experience, longterm_pinner is not worth maintaining
considering how much it churns MM. Just drop the feature and
we are good with alloc_contig_failed.

The visible effect from this patch is
  1. drop $debugfs/page_pinner/longterm_pinner
  2. drop put_user_page exported API
  3. rename alloc_contig_failed to buffer

Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I68cc11db448260987a9e26b99647ecb55f571616
2022-04-06 15:41:55 -07:00
Minchan Kim
e17f903a92 ANDROID: mm: page_pinner: change output format for alloc_contig_failed
Currently, the output format makes it a little hard to work out how long
a page has been pinned, since the user needs to reconstruct the timeline
from migration failure detection to the put event. Sometimes the log
buffer overflows, so the migration failure event timeline is lost
altogether. This patch stores the page pinning time on the kernel
side and keeps the information whenever a page is released. Thus,
the user can understand the output more easily and never loses the information.

Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I396f0c12438e0ff8a3497253b750a7e5bb342f57
2022-04-06 15:41:55 -07:00