Commit Graph

988084 Commits

Author SHA1 Message Date
David Brazdil
c43dfe89fe ANDROID: KVM: arm64: s2mpu: Extract L1ENTRY_* consts
Extract the L1ENTRY_ATTR_{PRON,GRAN}_MASK constants out of macros that
create the corresponding constants. This will allow EL1 users to use the
masks to get the fields out of register values. Also extract
L1ENTRY_L2TABLE_ADDR_SHIFT for adjusting the L2 table address.

Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I45578857694ca39266fe45b3c00dbea33738167f
2022-04-21 16:33:28 +01:00
Theodore Ts'o
7a9a532432 BACKPORT: ext4: don't BUG if someone dirty pages without asking ext4 first
[ Upstream commit cc5095747e ]

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance.  A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].

This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com

Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: Ifa528c386540d70eafb7ec55b238add0c6ba7387
2022-04-21 07:54:02 +00:00
Zhang Qilong
c383610d0f UPSTREAM: binder: change error code from postive to negative in binder_transaction
Depending on the context, the error return value
here (extra_buffers_size < added_size) should be
negative.

Acked-by: Martijn Coenen <maco@android.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Link: https://lore.kernel.org/r/20201026110314.135481-1-zhangqilong3@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 88f6c77927)
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Change-Id: If90760a8469f33bb121f0f71098b9415f6d4d783
2022-04-21 00:37:07 +00:00
Daniel Rosenberg
d4d78c7278 ANDROID: fuse-bpf: Fix non-fusebpf build
Added #ifdefs around fuse-bpf init/cleanup code

Bug: 202785178
Test: builds with and without CONFIG_FUSE_BPF
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ie15bb04e439b496e4842303437b3f55c3da14f2c
2022-04-20 14:05:59 -07:00
Daniel Rosenberg
9a5023967b ANDROID: fuse-bpf: Use fuse_bpf_args in uapi
fuse_args is not suitable for use in the uapi - it is not stable, and
contains internal pointers. Replace with stable equivalent.

The end_offset values are currently unused and unset, but will be used
in a follow up patch by the verifier.

Test: fuse_test, atest ScopedStorageDeviceTest pass
Bug: 202785178
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ic1c12f9706aeae233cc30a0d68ed2533030e485b
2022-04-20 20:57:26 +00:00
Johannes Berg
92c8c21ad0 BACKPORT: nl80211: correctly check NL80211_ATTR_REG_ALPHA2 size
commit 6624bb34b4 upstream.

We need this to be at least two bytes, so we can access
alpha2[0] and alpha2[1]. It may be three in case some
userspace used NUL-termination since it was NLA_STRING
(and we also push it out with NUL-termination).

Cc: stable@vger.kernel.org
Reported-by: Lee Jones <lee.jones@linaro.org>
Link: https://lore.kernel.org/r/20220411114201.fd4a31f06541.Ie7ff4be2cf348d8cc28ed0d626fc54becf7ea799@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: I6f2338c409076f0960bb7305cee4959a22aa9280
2022-04-20 11:21:04 +01:00
Midas Chien
65533e0212 ANDROID: Update the ABI representation
Leaf changes summary: 1 artifact changed
Changed leaf types summary: 0 leaf type changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 1 Added variable

1 Added variable:

  [A] 'int console_set_on_cmdline'

Bug: 202781851
Signed-off-by: Midas Chien <midaschieh@google.com>
Change-Id: I302b16e6eeec60b070721c54980d989fb1d31c26
2022-04-20 03:53:28 +00:00
Andrey Konovalov
a1013fd19b FROMLIST: kasan: mark KASAN_VMALLOC flags as kasan_vmalloc_flags_t
Fix sparse warning:

mm/kasan/shadow.c:496:15: warning: restricted kasan_vmalloc_flags_t degrades to integer

Link: https://lkml.kernel.org/r/52d8fccdd3a48d4bdfd0ff522553bac2a13f1579.1649351254.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/all/52d8fccdd3a48d4bdfd0ff522553bac2a13f1579.1649351254.git.andreyknvl@google.com/T/#u
Bug: 217222520
Change-Id: I04133e8e9610b81fd0c856ece4f566110094bcb1
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
2022-04-20 00:58:47 +00:00
Vincenzo Frascino
c098614509 FROMLIST: kasan: fix hw tags enablement when KUNIT tests are disabled
Kasan enables hw tags via kasan_enable_tagging() which based on the mode
passed via kernel command line selects the correct hw backend.
kasan_enable_tagging() is meant to be invoked indirectly via the cpu
features framework of the architectures that support these backends.
Currently the invocation of this function is guarded by
CONFIG_KASAN_KUNIT_TEST which allows the enablement of the correct backend
only when KUNIT tests are enabled in the kernel.

This inconsistency was introduced in commit:

  ed6d74446c ("kasan: test: support async (again) and asymm modes for HW_TAGS")

... and prevents to enable MTE on arm64 when KUNIT tests for kasan hw_tags are
disabled.

Fix the issue making sure that the CONFIG_KASAN_KUNIT_TEST guard does not
prevent the correct invocation of kasan_enable_tagging().

Link: https://lkml.kernel.org/r/20220408124323.10028-1-vincenzo.frascino@arm.com
Fixes: ed6d74446c ("kasan: test: support async (again) and asymm modes for HW_TAGS")
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/all/20220408124323.10028-1-vincenzo.frascino@arm.com/T/#u
Bug: 217222520
Change-Id: Ib4f05d74e091db57d2a8d5000d67137105d59a4c
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
2022-04-20 00:58:37 +00:00
Fabio Aiuto
f60a0b3285 UPSTREAM: usb: dwc3: leave default DMA for PCI devices
in case of a PCI dwc3 controller, leave the default DMA
mask. Calling of a 64 bit DMA mask breaks the driver on
cherrytrail based tablets like Cyberbook T116.

Fixes: 45d39448b4 ("usb: dwc3: support 64 bit DMA in platform driver")
Cc: stable <stable@vger.kernel.org>
Reported-by: Hans De Goede <hdegoede@redhat.com>
Tested-by: Fabio Aiuto <fabioaiuto83@gmail.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Fabio Aiuto <fabioaiuto83@gmail.com>
Link: https://lore.kernel.org/r/20211113142959.27191-1-fabioaiuto83@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


(cherry picked from commit 47ce45906c)

Bug: 228575378
Signed-off-by: Albert Wang <albertccwang@google.com>
Change-Id: Id4e73116d4f8f603b8282e7bd7717cf28b5a3bf2
2022-04-20 00:40:38 +00:00
Sven Peter
3b508e8fe4 UPSTREAM: usb: dwc3: support 64 bit DMA in platform driver
Currently, the dwc3 platform driver does not explicitly ask for
a DMA mask. This makes it fall back to the default 32-bit mask which
breaks the driver on systems that only have RAM starting above the
first 4G like the Apple M1 SoC.

Fix this by calling dma_set_mask_and_coherent with a 64bit mask.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20210607061751.89752-1-sven@svenpeter.dev
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

(cherry picked from commit 45d39448b4)

Bug: 228575378
Signed-off-by: Albert Wang <albertccwang@google.com>
Change-Id: Icd994fde819fdd6fb0896d1bec1777fb73f454ee
2022-04-20 00:40:28 +00:00
Rick Yiu
03f40d5252 ANDROID: Update the ABI representation
Leaf changes summary: 4 artifacts changed
Changed leaf types summary: 0 leaf type changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 2 Added functions
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 2 Added variables

2 Added functions:

  [A] 'function int __traceiter_android_rvh_set_task_cpu(void*, task_struct*, unsigned int)'
  [A] 'function int __traceiter_android_rvh_update_rt_rq_load_avg(void*, u64, rq*, task_struct*, int)'

2 Added variables:

  [A] 'tracepoint __tracepoint_android_rvh_set_task_cpu'
  [A] 'tracepoint __tracepoint_android_rvh_update_rt_rq_load_avg'

Bug: 201261299
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: Ie1265a9d638e7826b6185bfde0ab8f900b51c6b0
2022-04-19 23:56:45 +00:00
Kalesh Singh
6db38c5bbc FROMGIT: EXP rcu: Move expedited grace period (GP) work to RT kthread_worker
Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period
latency because its workqueues run at SCHED_OTHER, and thus can be
delayed by normal processes.  This commit avoids these delays by moving
the expedited GP work items to a real-time-priority kthread_worker.

This option is controlled by CONFIG_RCU_EXP_KTHREAD and disabled by
default on PREEMPT_RT=y kernels which disable expedited grace periods
after boot by unconditionally setting rcupdate.rcu_normal_after_boot=1.

The results were evaluated on arm64 Android devices (6GB ram) running
5.10 kernel, and capturing trace data in critical user-level code.

The table below shows the resulting order-of-magnitude improvements
in synchronize_rcu_expedited() latency:

------------------------------------------------------------------------
|                          |   workqueues  |  kthread_worker |  Diff   |
------------------------------------------------------------------------
| Count                    |          725  |            688  |         |
------------------------------------------------------------------------
| Min Duration       (ns)  |          326  |            447  |  37.12% |
------------------------------------------------------------------------
| Q1                 (ns)  |       39,428  |         38,971  |  -1.16% |
------------------------------------------------------------------------
| Q2 - Median        (ns)  |       98,225  |         69,743  | -29.00% |
------------------------------------------------------------------------
| Q3                 (ns)  |      342,122  |        126,638  | -62.98% |
------------------------------------------------------------------------
| Max Duration       (ns)  |  372,766,967  |      2,329,671  | -99.38% |
------------------------------------------------------------------------
| Avg Duration       (ns)  |    2,746,353  |        151,242  | -94.49% |
------------------------------------------------------------------------
| Standard Deviation (ns)  |   19,327,765  |        294,408  |         |
------------------------------------------------------------------------

The below table show the range of maximums/minimums for
synchronize_rcu_expedited() latency from all experiments:

------------------------------------------------------------------------
|                          |   workqueues  |  kthread_worker |  Diff   |
------------------------------------------------------------------------
| Total No. of Experiments |           25  |             23  |         |
------------------------------------------------------------------------
| Largest  Maximum   (ns)  |  372,766,967  |      2,329,671  | -99.38% |
------------------------------------------------------------------------
| Smallest Maximum   (ns)  |       38,819  |         86,954  | 124.00% |
------------------------------------------------------------------------
| Range of Maximums  (ns)  |  372,728,148  |      2,242,717  |         |
------------------------------------------------------------------------
| Largest  Minimum   (ns)  |       88,623  |         27,588  | -68.87% |
------------------------------------------------------------------------
| Smallest Minimum   (ns)  |          326  |            447  |  37.12% |
------------------------------------------------------------------------
| Range of Minimums  (ns)  |       88,297  |         27,141  |         |
------------------------------------------------------------------------

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Reported-by: Tim Murray <timmurray@google.com>
Reported-by: Wei Wang <wvw@google.com>
Tested-by: Kyle Lin <kylelin@google.com>
Tested-by: Chunwei Lu <chunweilu@google.com>
Tested-by: Lulu Wang <luluw@google.com>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220409003527.1587028-1-kaleshsingh@google.com/
(cherry picked from commit 3902dd17a29bd3ed1ead364a331a1761edb7162b
git: //git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git fastexp.2022.04.11a)
Bug: 224791892
Change-Id: I4cc5d28f9ae99e44e92afc43ff4db4b71c415d6c
2022-04-19 23:20:24 +00:00
Sajid Dalvi
68c87a277c ANDROID: Update the ABI representation
Leaf changes summary: 2 artifacts changed
Changed leaf types summary: 0 leaf type changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 1 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 1 Added variable

1 Added function:

  [A] 'function int __traceiter_android_rvh_pci_d3_sleep(void*, pci_dev*, unsigned int*)'

1 Added variable:

  [A] 'tracepoint __tracepoint_android_rvh_pci_d3_sleep'

Bug: 229125931
Signed-off-by: Sajid Dalvi <sdalvi@google.com>
Change-Id: I7db067bd3d468b11826e5e59ee9c96706fbad760
2022-04-19 12:45:34 -05:00
Jens Axboe
699e6e3211 UPSTREAM: block: fix async_depth sysfs interface for mq-deadline
A previous commit added this feature, but it inadvertently used the wrong
variable to show/store the setting from/to, victimized by copy/paste. Fix
it up so that the async_depth sysfs interface reads and writes from the
right setting.

Fixes: 07757588e5 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215485
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change-Id: I28273a92ac7cbebc830df8b80ad461948402bcd2
(cherry picked from commit 46cdc45acb)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-19 15:34:11 +00:00
Sajid Dalvi
53ff5efb2c ANDROID: PCI/PM: Use usleep_range for d3hot_delay
This patch implements a vendor hook that changes d3hot_delay to use
usleep_range() instead of msleep() to reduce the resume time from
20ms to 10ms.

The call sequence is as follows:
pci_pm_resume_noirq()
pci_pm_default_resume_early()
pci_power_up()
pci_raw_set_power_state() --> msleep(10)

The default d3hot_delay is 10ms. Using msleep for delays less than 20ms
could result in delays up to 20ms.
Reference: Documentation/timers/timers-howto.rst

Using usleep_range() results in the delay being closer to 10ms and this
reduces the resume time.

Bug: 194231641
Change-Id: If3e4dcfb99edad302371273933fa6784854cf892
Signed-off-by: Sajid Dalvi <sdalvi@google.com>
2022-04-19 07:17:39 +00:00
Minchan Kim
609fa1be7a ANDROID: mm: page_pinner: fix elapsed time
Put the elapsed time instead of zero all the time.

Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ibb319e7dfce2d47481e2462bfb8423fbd2ddad66
2022-04-18 23:50:40 +00:00
Minchan Kim
d5d9a23576 ANDROID: mm: retry GUP with orignal gup_flags on failure
If GUP fails due to modified flags by vendor hook,
try one more time with original flag to keep the
API semantic.

Bug: 229391920
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: If2c20fe3e1752c9cc0d7d350ad84b38d8325f9ae
2022-04-18 23:50:32 +00:00
Todd Kjos
6acb261444 ANDROID: GKI: 4/15/2022 KMI freeze
Set KMI_GENERATION=4 for 4/15 KMI freeze

Leaf changes summary: 2734 artifacts changed
Changed leaf types summary: 8 leaf types changed
Removed/Changed/Added functions summary: 0 Removed, 2677 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 49 Changed, 0 Added variable

2677 functions with some sub-type change:

  [C] 'function void* PDE_DATA(const inode*)' at generic.c:799:1 has some sub-type changes:
    CRC (modversions) changed from 0x830fd868 to 0xb70c2d59

  [C] 'function void __ClearPageMovable(page*)' at compaction.c:138:1 has some sub-type changes:
    CRC (modversions) changed from 0x274b4312 to 0x1e91976d

  [C] 'function void __SetPageMovable(page*, address_space*)' at compaction.c:130:1 has some sub-type changes:
    CRC (modversions) changed from 0xaf251a50 to 0xb1221a47

  ... 2674 omitted; 2677 symbols have only CRC changes

49 Changed variables:

  [C] 'pglist_data contig_page_data' was changed at memblock.c:96:1:
    size of symbol changed from 5696 to 6976
    CRC (modversions) changed from 0xa8156534 to 0x7007215
    type of variable changed:
      type size changed from 45568 to 55808 (in bits)
      1 data member insertion:
        'lru_gen_mm_walk mm_walk', at offset 51520 (in bits) at mmzone.h:1039:1
      there are data member changes:
        type 'struct lruvec' of 'pglist_data::__lruvec' changed:
          type size changed from 1088 to 9664 (in bits)
          3 data member insertions:
            'lru_gen_struct lrugen', at offset 1024 (in bits) at mmzone.h:497:1
            'lru_gen_mm_state mm_state', at offset 8576 (in bits) at mmzone.h:499:1
            'u64 android_vendor_data1', at offset 9600 (in bits) at mmzone.h:504:1
          there are data member changes:
            'pglist_data* pgdat' offset changed (by +8512 bits)
          2977 impacted interfaces
        'unsigned long int flags' offset changed (by +8576 bits)
        3 ('zone_padding _pad2_' .. 'atomic_long_t vm_stat[38]') offsets changed (by +10240 bits)
      2977 impacted interfaces

  [C] 'task_struct init_task' was changed at init_task.c:64:1:
    CRC (modversions) changed from 0x124472e1 to 0xf5fdc492
    type of variable changed:
      type size hasn't changed
      1 data member insertion:
        'unsigned int in_lru_fault', at offset 11332 (in bits) at sched.h:840:1
      there are data member changes:
        4 ('unsigned int no_cgroup_migration' .. 'unsigned int in_memstall') offsets changed (by +1 bits)
      2977 impacted interfaces

  [C] 'bus_type amba_bustype' was changed at bus.c:215:1:
    CRC (modversions) changed from 0x55933f58 to 0xefd95b38

  [C] 'const clk_ops clk_fixed_factor_ops' was changed at clk-fixed-factor.c:60:1:
    CRC (modversions) changed from 0x38f07e1d to 0xb94d81d6

  [C] 'const clk_ops clk_fixed_rate_ops' was changed at clk-fixed-rate.c:46:1:
    CRC (modversions) changed from 0x47fbebbe to 0x5299a868

  ... 44 omitted; 47 symbols have only CRC changes

'struct lruvec at mmzone.h:280:1' changed:
  details were reported earlier

'struct mem_cgroup at memcontrol.h:211:1' changed:
  type size hasn't changed
  1 data member insertion:
    'lru_gen_mm_list mm_list', at offset 23168 (in bits) at memcontrol.h:337:1
  there are data member changes:
    2 ('u64 android_oem_data1' .. 'mem_cgroup_per_node* nodeinfo[]') offsets changed (by +192 bits)
  2977 impacted interfaces

'struct mem_cgroup_per_node at memcontrol.h:107:1' changed:
  type size changed from 5184 to 13760 (in bits)
  there are data member changes:
    type 'struct lruvec' of 'mem_cgroup_per_node::lruvec' changed, as reported earlier
    10 ('lruvec_stat* lruvec_stat_local' .. 'mem_cgroup* memcg') offsets changed (by +8576 bits)
  2977 impacted interfaces

'struct mm_struct at mm_types.h:419:1' changed:
  type size changed from 7680 to 7936 (in bits)
  there are data member changes:
    anonymous data member at offset 0 (in bits) changed from:
      struct {vm_area_struct* mmap; rb_root mm_rb; u64 vmacache_seqnum; rwlock_t mm_rb_lock; unsigned long int (file*, unsigned long int, unsigned long int, unsigned long int, unsigned long int)* get_unmapped_area; unsigned long int mmap_base; unsigned long int mmap_legacy_base; unsigned long int task_size; unsigned long int highest_vm_end; pgd_t* pgd; atomic_t membarrier_state; atomic_t mm_users; atomic_t mm_count; atomic_t has_pinned; atomic_long_t pgtables_bytes; int map_count; spinlock_t page_table_lock; rw_semaphore mmap_lock; list_head mmlist; unsigned long int hiwater_rss; unsigned long int hiwater_vm; unsigned long int total_vm; unsigned long int locked_vm; atomic64_t pinned_vm; unsigned long int data_vm; unsigned long int exec_vm; unsigned long int stack_vm; unsigned long int def_flags; seqcount_t write_protect_seq; spinlock_t arg_lock; unsigned long int start_code; unsigned long int end_code; unsigned long int start_data; unsigned long int end_data; unsigned long int start_brk; unsigned long int brk; unsigned long int start_stack; unsigned long int arg_start; unsigned long int arg_end; unsigned long int env_start; unsigned long int env_end; unsigned long int saved_auxv[46]; mm_rss_stat rss_stat; linux_binfmt* binfmt; mm_context_t context; unsigned long int flags; core_state* core_state; spinlock_t ioctx_lock; kioctx_table* ioctx_table; task_struct* owner; user_namespace* user_ns; file* exe_file; mmu_notifier_subscriptions* notifier_subscriptions; percpu_rw_semaphore* mmu_notifier_lock; atomic_t tlb_flush_pending; uprobes_state uprobes_state; work_struct async_put_work; u32 pasid; u64 android_kabi_reserved1;}
    to:
      struct {vm_area_struct* mmap; rb_root mm_rb; u64 vmacache_seqnum; rwlock_t mm_rb_lock; unsigned long int (file*, unsigned long int, unsigned long int, unsigned long int, unsigned long int)* get_unmapped_area; unsigned long int mmap_base; unsigned long int mmap_legacy_base; unsigned long int task_size; unsigned long int highest_vm_end; pgd_t* pgd; atomic_t membarrier_state; atomic_t mm_users; atomic_t mm_count; atomic_t has_pinned; atomic_long_t pgtables_bytes; int map_count; spinlock_t page_table_lock; rw_semaphore mmap_lock; list_head mmlist; unsigned long int hiwater_rss; unsigned long int hiwater_vm; unsigned long int total_vm; unsigned long int locked_vm; atomic64_t pinned_vm; unsigned long int data_vm; unsigned long int exec_vm; unsigned long int stack_vm; unsigned long int def_flags; seqcount_t write_protect_seq; spinlock_t arg_lock; unsigned long int start_code; unsigned long int end_code; unsigned long int start_data; unsigned long int end_data; unsigned long int start_brk; unsigned long int brk; unsigned long int start_stack; unsigned long int arg_start; unsigned long int arg_end; unsigned long int env_start; unsigned long int env_end; unsigned long int saved_auxv[46]; mm_rss_stat rss_stat; linux_binfmt* binfmt; mm_context_t context; unsigned long int flags; core_state* core_state; spinlock_t ioctx_lock; kioctx_table* ioctx_table; task_struct* owner; user_namespace* user_ns; file* exe_file; mmu_notifier_subscriptions* notifier_subscriptions; percpu_rw_semaphore* mmu_notifier_lock; atomic_t tlb_flush_pending; uprobes_state uprobes_state; work_struct async_put_work; u32 pasid; struct {list_head list; mem_cgroup* memcg; nodemask_t nodes;} lru_gen; u64 android_kabi_reserved1;}
    and size changed from 7680 to 7936 (in bits) (by +256 bits)
    'unsigned long int cpu_bitmap[]' offset changed (by +256 bits)
  2977 impacted interfaces

'struct pglist_data at mmzone.h:729:1' changed:
  details were reported earlier

'struct reclaim_state at swap.h:131:1' changed:
  type size changed from 64 to 128 (in bits)
  1 data member insertion:
    'lru_gen_mm_walk* mm_walk', at offset 64 (in bits) at swap.h:135:1
  2977 impacted interfaces

'struct scsi_device at scsi_device.h:102:1' changed:
  type size hasn't changed
  1 data member insertion:
    'unsigned int silence_suspend', at offset 2448 (in bits) at scsi_device.h:209:1
  there are data member changes:
    'bool offline_already' offset changed (by +8 bits)
  45 impacted interfaces

'struct task_struct at sched.h:660:1' changed:
  details were reported earlier

Bug: 229630433
Signed-off-by: Todd Kjos <tkjos@google.com>
Change-Id: Iacd70a1553401ead91351db0b5b8ec6dfee6e6ec
2022-04-18 11:58:28 -07:00
Bing Han
a034320a68 ANDROID: add vendor fields to swap_slots_cache to support multiple swap devices
struct swap_slots_cache  :: ANDROID_VENDOR_DATA(1)
1) Multiple swap devices can be supported;
2) There are different kinds of data;
3) During data reclamation, different types of data are exchanged
   to different swap devices;
4) Each swap device has corresponding arrays of slots and slots_ret;
5) Each swap device has corresponding indexes of nr, cur and n_ret;
6) This field is a pointer, it points to a struct which contains
   all the other arrays and indexes;

Bug: 225795494
Change-Id: Icf116135926be98449a2d96fc458e58e5ad3b7e9
Signed-off-by: Bing Han <bing.han@transsion.com>
2022-04-18 10:20:37 -07:00
Bing Han
1b14ae01b0 ANDROID: add vendor fields to lruvec to record refault stats
struct lruvec :: ANDROID_VENDOR_DATA(1)
It is pointer to a struct to record the following message:
1)the account of workingset_restore pages of cached anonymous and
   file pages
This is used to adjust the strategy and amount of reclaiming data.

Bug: 225795494
Change-Id: I34e57ee23b6c97ac91effa5b72513d238335a996
Signed-off-by: Bing Han <bing.han@transsion.com>
2022-04-18 10:20:23 -07:00
Bing Han
af4eb0e377 ANDROID: add vendor fields to swap_info_struct to record swap stats
struct swap_info_struct :: ANDROID_VENDOR_DATA(1)
	It is pointer to a struct to record the following message:
	1) total swapin pages;
	2) total swapout pages;
	3) total number of cold pages swapin;
	4) total number of swapout pages, specified by userspace;
	5) total number of swapout pages, specified by kernel;
	6) the maxmium number of swapout pages;
	7) the maxmium number of swapout pages allowed by kernel;
	8) the maxmium number of swapout pages allowed by framework;

Bug: 225795494
Change-Id: I779145a83d87e339db86ec81c7f962be99946afb
Signed-off-by: Bing Han <bing.han@transsion.com>
2022-04-18 10:20:06 -07:00
Bart Van Assche
fae5207ecc ANDROID: scsi: ufs: Add suspend/resume SCSI command processing support
This functionality is needed by UFS drivers to e.g. suspend SCSI command
processing while reprogramming encryption keys if the hardware does not
support concurrent I/O and key reprogramming.

Bug: 227177294
Change-Id: I10f11e67da81fae7063674838760903d2c178baf
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:18:44 -07:00
Bart Van Assche
64293a57f1 ANDROID: scsi: ufs: Pass the clock scaling timeout as an argument
Prepare for adding an additional ufshcd_clock_scaling_prepare() call
with a different timeout.

Bug: 227177294
Change-Id: I67a569b074c292a3c37f20a1b1e36f95b682c5e8
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:18:30 -07:00
Bart Van Assche
69014b2b36 ANDROID: scsi: ufs: Move a clock scaling check
Move a check related to clock scaling into ufshcd_devfreq_scale(). This
patch prepares for adding a second ufshcd_clock_scaling_prepare()
caller.

Bug: 227177294
Change-Id: I928d4cbe64823960a6112ba7f98c18da6244a77c
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:18:15 -07:00
Bart Van Assche
aca52cabdb ANDROID: scsi: ufs: Reduce the clock scaling latency
Wait at most 20 ms before rechecking the doorbells instead of waiting
for a potentially long time between doorbell checks.

Bug: 227177294
Change-Id: I8a4dd0e93ca02435264961851a095a9c83c68240
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:18:03 -07:00
Peter Wang
00ed95fe93 FROMGIT: scsi: ufs: core: scsi_get_lba() error fix
When ufs initializes without scmd->device->sector_size set, scsi_get_lba()
will get a wrong shift number and trigger an ubsan error.  The shift
exponent 4294967286 is too large for the 64-bit type 'sector_t' (aka
'unsigned long long').

Call scsi_get_lba() only when opcode is READ_10/WRITE_10/UNMAP.

Link: https://lore.kernel.org/r/20220307111752.10465-1-peter.wang@mediatek.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Wang <peter.wang@mediatek.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 2bd3b6b759 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I3457fcc88d7c4164c55010e440d9f274c169553e
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:17:48 -07:00
Adrian Hunter
c0a4aeb7aa FROMGIT: scsi: ufs: Fix runtime PM messages never-ending cycle
Kernel messages produced during runtime PM can cause a never-ending cycle
because user space utilities (e.g. journald or rsyslog) write the messages
back to storage, causing runtime resume, more messages, and so on.

Messages that tell of things that are expected to happen, are arguably
unnecessary, so suppress them.

UFS driver messages are changes to from dev_err() to dev_dbg() which means
they will not display unless activated by dynamic debug of building with
-DDEBUG.

sdev->silence_suspend is set to skip messages from sd_suspend_common()
"Synchronizing SCSI cache", "Stopping disk" and scsi_report_sense()
"Power-on or device reset occurred" message (Note, that message appears
when the LUN is accessed after runtime PM, not during runtime PM)

 Example messages from Ubuntu 21.10:

 $ dmesg | tail
 [ 1620.380071] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
 [ 1620.408825] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[4, 4], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
 [ 1620.409020] ufshcd 0000:00:12.5: ufshcd_find_max_sup_active_icc_level: Regulator capability was not set, actvIccLevel=0
 [ 1620.409524] sd 0:0:0:0: Power-on or device reset occurred
 [ 1622.938794] sd 0:0:0:0: [sda] Synchronizing SCSI cache
 [ 1622.939184] ufs_device_wlun 0:0:0:49488: Power-on or device reset occurred
 [ 1625.183175] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
 [ 1625.208041] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[4, 4], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
 [ 1625.208311] ufshcd 0000:00:12.5: ufshcd_find_max_sup_active_icc_level: Regulator capability was not set, actvIccLevel=0
 [ 1625.209035] sd 0:0:0:0: Power-on or device reset occurred

Note for stable: depends on patch "scsi: core: sd: Add silence_suspend flag
to suppress some PM messages".

Link: https://lore.kernel.org/r/20220228113652.970857-3-adrian.hunter@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bug: 204438323
(cherry picked from commit 71bb9ab6e3 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change-Id: I2a50283162aa1dc100e1269533ac61056172bd1d
2022-04-18 10:17:34 -07:00
Adrian Hunter
0cd3abcaa4 FROMGIT: scsi: core: sd: Add silence_suspend flag to suppress some PM messages
Kernel messages produced during runtime PM can cause a never-ending cycle
because user space utilities (e.g. journald or rsyslog) write the messages
back to storage, causing runtime resume, more messages, and so on.

Messages that tell of things that are expected to happen are arguably
unnecessary, so add a flag to suppress them. This flag is used by the UFS
driver.

Link: https://lore.kernel.org/r/20220228113652.970857-2-adrian.hunter@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit af4edb1d50 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Change-Id: I8834c9d71618fd04635804779a41117629a75166
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:17:21 -07:00
Keoseong Park
e46eb26194 FROMGIT: scsi: ufs: core: Remove wlun_dev_to_hba()
Commit edc0596cc0 ("scsi: ufs: core: Stop clearing UNIT ATTENTIONS")
removed all callers of wlun_dev_to_hba(). Hence also remove the macro
itself.

Link: https://lore.kernel.org/r/1891546521.01644927481711.JavaMail.epsvc@epcpadp4
Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keoseong Park <keosung.park@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from 482dcaa1c9 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I1eee9827255305f5567ee21c65dbfca897e9daa5
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:17:03 -07:00
Jinyoung Choi
85d759e39a FROMGIT: scsi: ufs: Add checking lifetime attribute for WriteBooster
Because WB performs writes in SLC mode, it is not possible to use
WriteBooster indefinitely. Vendors can set a lifetime limit in the device.
If the lifetime exceeds this limit, the device ican disable the WB feature.

The feature is defined in the "bWriteBoosterBufferLifeTimeEst (IDN = 1E)"
attribute.

With lifetime exceeding the limit value, the current driver continuously
performs the following query:

	- Write Flag: WB_ENABLE / DISABLE
	- Read attr: Available Buffer Size
	- Read attr: Current Buffer Size

This patch recognizes that WriteBooster is no longer supported by the
device, and prevents unnecessary queries.

Link: https://lore.kernel.org/r/1891546521.01643252701746.JavaMail.epsvc@epcpadp3
Reviewed-by: Asutosh Das <quic_asutoshd@quicinc.com>
Acked-by: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit f681d1078d git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I9178b31aaeb75ef157aa8e12b1dd0f5a646f0579
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:16:38 -07:00
Kiwoong Kim
44b7a4f00f FROMGIT: scsi: ufs: Use generic error code in ufshcd_set_dev_pwr_mode()
The return value of ufshcd_set_dev_pwr_mode() is passed to device PM
core. However, the function currently returns a SCSI result which the PM
core doesn't understand.  This might lead to unexpected behaviors in
userland; a platform reset was observed in Android.

Use a generic error code for SSU failures.

Link: https://lore.kernel.org/r/1642743182-54098-1-git-send-email-kwmad.kim@samsung.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Kiwoong Kim <kwmad.kim@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit ad6c8a4264 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I051742cd8ea2215cbd94ac3dedc4fab2863a9c6e
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:16:16 -07:00
Miaoqian Lin
aeedc78679 FROMGIT: scsi: ufs: ufs-mediatek: Fix error checking in ufs_mtk_init_va09_pwr_ctrl()
The function regulator_get() returns an error pointer. Use IS_ERR() to
validate the return value.

Link: https://lore.kernel.org/r/20211222070930.9449-1-linmq006@gmail.com
Fixes: cf137b3ea4 ("scsi: ufs-mediatek: Support VA09 regulator operations")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 3ba880a12d git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I9b6d2fbb9d1d795f30e7873837b59dbf76569d1b
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:15:59 -07:00
SEO HOYOUNG
1fc4aef3d5 FROMGIT: scsi: ufs: Modify Tactive time setting conditions
The Tactive time determines the waiting time before burst at hibern8 exit
and is determined by hardware at linkup time. However, in the case of
Samsung devices, increase host's Tactive time +100us for stability. If the
HCI's Tactive time is equal or greater than the device, +100us should be
set.

Link: https://lore.kernel.org/r/20220106213924.186263-1-hy50.seo@samsung.com
Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com>
Acked-by: Avri Altman <Avri.Altman@wdc.com>
Signed-off-by: SEO HOYOUNG <hy50.seo@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 9008661e19 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I6ffe1c279cab9b780558de763e94cf01cfd4be3e
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:15:39 -07:00
Adrian Hunter
d87405c2fe FROMGIT: scsi: ufs: ufs-pci: Add support for Intel ADL
Add PCI ID and callbacks to support Intel Alder Lake.

Link: https://lore.kernel.org/r/20211124204218.1784559-1-adrian.hunter@intel.com
Cc: stable@vger.kernel.org # v5.15+
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 7dc9fb47bc git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I370ad2e01bc67fec675f4c2ebf7f5d222d0afb07
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:15:19 -07:00
Ye Guojin
b65cfd7b92 FROMGIT: scsi: ufs: ufs-mediatek: Add put_device() after of_find_device_by_node()
This was found by coccicheck:

./drivers/scsi/ufs/ufs-mediatek.c, 211, 1-7, ERROR missing put_device;
call of_find_device_by_node on line 1185, but without a corresponding
object release within this function.

Link: https://lore.kernel.org/r/20211110105133.150171-1-ye.guojin@zte.com.cn
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Peter Wang <peter.wang@mediatek.com>
Signed-off-by: Ye Guojin <ye.guojin@zte.com.cn>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bug: 204438323
(cherry picked from commit cc03facb1c git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Change-Id: I0aa3e8be7f7bc84d5c9b4dc3fb1e5586bc7e5ddf
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:14:56 -07:00
Bean Huo
4f4bf31d39 FROMGIT: scsi: ufs: ufshpb: Fix warning in ufshpb_set_hpb_read_to_upiu()
Fix the following sparse warnings in ufshpb_set_hpb_read_to_upiu():

sparse warnings: (new ones prefixed by >>)
drivers/scsi/ufs/ufshpb.c:335:27: sparse: sparse: cast from restricted __be64
drivers/scsi/ufs/ufshpb.c:335:25: sparse: expected restricted __be64 [usertype] ppn_tmp
drivers/scsi/ufs/ufshpb.c:335:25: sparse: got unsigned long long [usertype]

Link: https://lore.kernel.org/r/20211111222452.384089-1-huobean@gmail.com
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 73185a1377 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I6770255dab00a3ff8c5f7b4499efdfa0dd53839c
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:14:37 -07:00
Bart Van Assche
acb0ef885c ANDROID: scsi: ufs: Minimize the difference with the upstream code
Make the order of function declarations match the order in the upstream
code.

Bug: 204438323
Change-Id: Iac453cdb5ae67184c4218639ab8b91da03fabc66
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:14:20 -07:00
Yu Zhao
321995d280 ANDROID: GKI: build multi-gen LRU
CONFIG_LRU_GEN=y to build multi-gen LRU.

To enable it, echo y >/sys/kernel/mm/lru_gen/enabled.

Bug: 227651406
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: If5f6ece8373f1da2eb1eb96d809a2e216ebc0fbc
2022-04-18 10:11:56 -07:00
Yu Zhao
306dbfb34c FROMLIST: mm: multi-gen LRU: design doc
Add a design doc.

Link: https://lore.kernel.org/r/20220309021230.721028-15-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I1d66302e618416291ebf9647e20625fb76613c89
2022-04-18 10:11:56 -07:00
Yu Zhao
8b006e4d1c FROMLIST: mm: multi-gen LRU: admin guide
Add an admin guide.

Link: https://lore.kernel.org/r/20220309021230.721028-14-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I6fafbd7eb3ef6819cfcd30376459f14893f17c63
2022-04-18 10:11:56 -07:00
Yu Zhao
3cf1dfaaa5 FROMLIST: mm: multi-gen LRU: debugfs interface
Add /sys/kernel/debug/lru_gen for working set estimation and proactive
reclaim. These features are required to optimize job scheduling (bin
packing) in data centers [1][2].

Compared with the page table-based approach and the PFN-based
approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
the following advantages:
1. It offers better choices because it is aware of memcgs, NUMA nodes,
   shared mappings and unmapped page cache.
2. It is more scalable because it is O(nr_hot_pages), whereas the
   PFN-based approach is O(nr_total_pages).

Add /sys/kernel/debug/lru_gen_full for debugging.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731

Link: https://lore.kernel.org/r/20220309021230.721028-13-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ie558098e0a24a647f77f4eacc4d72576173fc0b8
2022-04-18 10:11:56 -07:00
Yu Zhao
96f4a592d3 FROMLIST: mm: multi-gen LRU: thrashing prevention
Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
requested by many desktop users [1].

When set to value N, it prevents the working set of N milliseconds
from getting evicted. The OOM killer is triggered if this working set
cannot be kept in memory. Based on the average human detectable lag
(~100ms), N=1000 usually eliminates intolerable lags due to thrashing.
Larger values like N=3000 make lags less noticeable at the risk of
premature OOM kills.

Compared with the size-based approach, e.g., [2], this time-based
approach has the following advantages:
1. It is easier to configure because it is agnostic to applications
   and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.

[1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/
[2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/

Link: https://lore.kernel.org/r/20220309021230.721028-12-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I482d33f3beaf7723d2f3eeaaa5b4f12bcb9b48a1
2022-04-18 10:11:55 -07:00
Yu Zhao
76fdc1010b FROMLIST: mm: multi-gen LRU: kill switch
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
  0x0001: the multi-gen LRU core
  0x0002: walking page table, when arch_has_hw_pte_young() returns
          true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005

NB: the page table walks happen on the scale of seconds under heavy
memory pressure, in which case the mmap_lock contention is a lesser
concern, compared with the LRU lock contention and the I/O congestion.
So far the only well-known case of the mmap_lock contention happens on
Android, due to Scudo [1] which allocates several thousand VMAs for
merely a few hundred MBs. The SPF and the Maple Tree also have
provided their own assessments [2][3]. However, if walking page tables
does worsen the mmap_lock contention, the kill switch can be used to
disable it. In this case the multi-gen LRU will suffer a minor
performance degradation, as shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be
disabled, since this behavior was not tested on x86 varieties other
than Intel and AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/

Link: https://lore.kernel.org/r/20220309021230.721028-11-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I71801d9470a2588cad8bfd14fbcfafc7b010aa03
2022-04-18 10:11:55 -07:00
Yu Zhao
082bc8296a FROMLIST: mm: multi-gen LRU: optimize multiple memcgs
When multiple memcgs are available, it is possible to make better
choices based on generations and tiers and therefore improve the
overall performance under global memory pressure. This patch adds a
rudimentary optimization to select memcgs that can drop single-use
unmapped clean pages first. Doing so reduces the chance of going into
the aging path or swapping. These two operations can be costly.

A typical example that benefits from this optimization is a server
running mixed types of workloads, e.g., heavy anon workload in one
memcg and heavy buffered I/O workload in the other.

Though this optimization can be applied to both kswapd and direct
reclaim, it is only added to kswapd to keep the patchset manageable.
Later improvements will cover the direct reclaim path.

Server benchmark results:
  Mixed workloads:
    fio (buffered I/O): -[23, 25]%
                         IOPS         BW
      patch1-8:          2960k        11.3GiB/s
      patch1-9:          2248k        8783MiB/s

    memcached (anon): +[210, 214]%
                         Ops/sec      KB/sec
      patch1-8:          606940.09    23576.89
      patch1-9:          1895197.49   73619.93

  Mixed workloads:
    fio (buffered I/O): -[4, 6]%
                         IOPS         BW
      5.18-ed4643521e6a: 2369k        9255MiB/s
      patch1-9:          2248k        8783MiB/s

    memcached (anon): +[510, 516]%
                         Ops/sec      KB/sec
      5.18-ed4643521e6a: 309189.58    12010.61
      patch1-9:          1895197.49   73619.93

  Configurations:
    (changes since patch 6)

    cat mixed.sh
    modprobe brd rd_nr=2 rd_size=56623104

    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    mkfs.ext4 /dev/ram1
    mount -t ext4 /dev/ram1 /mnt

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=90m --group_reporting &
    pid=$!

    sleep 200

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    kill -INT $pid
    wait

Client benchmark results:
  no change (CONFIG_MEMCG=n)

Link: https://lore.kernel.org/r/20220309021230.721028-10-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I0641467dbd7c5ba0645602cec7fe8d6fdb750edb
2022-04-18 10:11:55 -07:00
Yu Zhao
93c4f86793 FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.

NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[5.5, 7.5]%
                         Ops/sec      KB/sec
      patch1-7:          1014393.57   39455.42
      patch1-8:          1078507.59   41949.15

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

    patch1-8
      47.02%  lzo1x_1_do_compress (real work)
       6.73%  page_vma_mapped_walk
       6.14%  _raw_spin_unlock_irq
       3.39%  walk_pte_range
       2.63%  ptep_clear_flush
       2.29%  __zram_bvec_write
       2.10%  do_raw_spin_lock
       1.81%  memmove
       1.73%  obj_malloc
       1.53%  free_unref_page_list

  Configurations:
    no change

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo

Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
2022-04-18 10:11:55 -07:00
Yu Zhao
c8356f7573 FROMLIST: mm: multi-gen LRU: exploit locality in rmap
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[4, 6]%
                         Ops/sec      KB/sec
      patch1-6:          964656.80    37520.88
      patch1-7:          1014393.57   39455.42

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

  Configurations:
    no change

Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
2022-04-18 10:11:55 -07:00
Yu Zhao
436dff20eb FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[38, 40]%
                         IOPS         BW
      5.18-ed4643521e6a: 2547k        9989MiB/s
      patch1-6:          3540k        13.5GiB/s

  Single workload:
    memcached (anon): +[103, 107]%
                         Ops/sec      KB/sec
      5.18-ed4643521e6a: 469048.66    18243.91
      patch1-6:          964656.80    37520.88

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.18-ed4643521e6a
      39.56%  page_vma_mapped_walk
      19.32%  lzo1x_1_do_compress (real work)
       7.18%  do_raw_spin_lock
       4.23%  _raw_spin_unlock_irq
       2.26%  vma_interval_tree_subtree_search
       2.12%  vma_interval_tree_iter_next
       2.11%  folio_referenced_one
       1.90%  anon_vma_interval_tree_iter_first
       1.47%  ptep_clear_flush
       0.97%  __anon_vma_interval_tree_subtree_search

    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
2022-04-18 10:11:54 -07:00
Yu Zhao
fe302bd1f9 FROMLIST: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking the rendering
   threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros,
i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lore.kernel.org/r/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512
2022-04-18 10:11:54 -07:00
Yu Zhao
4c6c817249 FROMLIST: mm/vmscan.c: refactor shrink_node()
This patch refactors shrink_node() to improve readability for the
upcoming changes to mm/vmscan.c.

Link: https://lore.kernel.org/r/20220309021230.721028-4-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I186f43f946de0d40d54883fb31114840fc749a57
2022-04-18 10:11:54 -07:00