linux

mirror of https://github.com/hardkernel/linux.git synced 2026-06-07 03:15:31 +09:00

Author	SHA1	Message	Date
David Brazdil	c43dfe89fe	ANDROID: KVM: arm64: s2mpu: Extract L1ENTRY_* consts Extract the L1ENTRY_ATTR_{PRON,GRAN}_MASK constants out of macros that create the corresponding constants. This will allow EL1 users to use the masks to get the fields out of register values. Also extract L1ENTRY_L2TABLE_ADDR_SHIFT for adjusting the L2 table address. Bug: 190463801 Signed-off-by: David Brazdil <dbrazdil@google.com> Change-Id: I45578857694ca39266fe45b3c00dbea33738167f	2022-04-21 16:33:28 +01:00
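A minimal sketch of why standalone mask constants are useful: a caller can pull a field out of a raw register value without going through the value-building macros. The bit positions, the `EXAMPLE_*` names, and the `example_field_get()` helper below are hypothetical and are not taken from the s2mpu header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's GENMASK(h, l) on 32-bit values. */
#define EXAMPLE_GENMASK(h, l) (((~0u) << (l)) & ((~0u) >> (31 - (h))))

/* Hypothetical field layouts; the real L1ENTRY_* positions may differ. */
#define EXAMPLE_ATTR_FIELD_A_MASK EXAMPLE_GENMASK(2, 1)
#define EXAMPLE_ATTR_FIELD_B_MASK EXAMPLE_GENMASK(5, 4)

/* Extract the field selected by a non-zero mask, shifted down to bit 0. */
static inline uint32_t example_field_get(uint32_t mask, uint32_t reg)
{
	uint32_t shift = 0;

	/* Find the position of the lowest set bit in the mask. */
	while (((mask >> shift) & 1u) == 0)
		shift++;

	return (reg & mask) >> shift;
}
```

In kernel code the same extraction would normally be written with `FIELD_GET()` from `linux/bitfield.h`; the loop here only keeps the sketch free of kernel headers.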
Theodore Ts'o	7a9a532432	BACKPORT: ext4: don't BUG if someone dirty pages without asking ext4 first [ Upstream commit `cc5095747e` ] [un]pin_user_pages_remote is dirtying pages without properly warning the file system in advance. A related race was noted by Jan Kara in 2018[1]; however, more recently instead of it being a very hard-to-hit race, it could be reliably triggered by process_vm_writev(2) which was discovered by Syzbot[2]. This is technically a bug in mm/gup.c, but arguably ext4 is fragile in that if some other kernel subsystem dirty pages without properly notifying the file system using page_mkwrite(), ext4 will BUG, while other file systems will not BUG (although data will still be lost). So instead of crashing with a BUG, issue a warning (since there may be potential data loss) and just mark the page as clean to avoid unprivileged denial of service attacks until the problem can be properly fixed. More discussion and background can be found in the thread starting at [2]. [1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz [2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com Reported-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: Ifa528c386540d70eafb7ec55b238add0c6ba7387	2022-04-21 07:54:02 +00:00
Zhang Qilong	c383610d0f	UPSTREAM: binder: change error code from postive to negative in binder_transaction Depending on the context, the error return value here (extra_buffers_size < added_size) should be negative. Acked-by: Martijn Coenen <maco@android.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com> Link: https://lore.kernel.org/r/20201026110314.135481-1-zhangqilong3@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit `88f6c77927`) Signed-off-by: Carlos Llamas <cmllamas@google.com> Change-Id: If90760a8469f33bb121f0f71098b9415f6d4d783	2022-04-21 00:37:07 +00:00
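The convention behind this fix can be sketched as follows: kernel-style functions report failure with a negative errno value, so callers that test `ret < 0` would miss a positive `EINVAL`. The function name `example_check_extra_buffers()` is hypothetical; only the `extra_buffers_size < added_size` condition comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper mirroring the condition named in the commit message.
 * Returns 0 on success and a negative errno on failure.
 */
static inline int example_check_extra_buffers(size_t extra_buffers_size,
					      size_t added_size)
{
	if (extra_buffers_size < added_size)
		return -EINVAL;	/* negative, so "ret < 0" checks catch it */

	return 0;
}
```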
Daniel Rosenberg	d4d78c7278	ANDROID: fuse-bpf: Fix non-fusebpf build Added #ifdefs around fuse-bpf init/cleanup code Bug: 202785178 Test: builds with and without CONFIG_FUSE_BPF Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: Ie15bb04e439b496e4842303437b3f55c3da14f2c	2022-04-20 14:05:59 -07:00
Daniel Rosenberg	9a5023967b	ANDROID: fuse-bpf: Use fuse_bpf_args in uapi fuse_args is not suitable for use in the uapi - it is not stable, and contains internal pointers. Replace with stable equivalent. The end_offset values are currently unused and unset, but will be used in a follow up patch by the verifier. Test: fuse_test, atest ScopedStorageDeviceTest pass Bug: 202785178 Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: Ic1c12f9706aeae233cc30a0d68ed2533030e485b	2022-04-20 20:57:26 +00:00
Johannes Berg	92c8c21ad0	BACKPORT: nl80211: correctly check NL80211_ATTR_REG_ALPHA2 size commit `6624bb34b4` upstream. We need this to be at least two bytes, so we can access alpha2[0] and alpha2[1]. It may be three in case some userspace used NUL-termination since it was NLA_STRING (and we also push it out with NUL-termination). Cc: stable@vger.kernel.org Reported-by: Lee Jones <lee.jones@linaro.org> Link: https://lore.kernel.org/r/20220411114201.fd4a31f06541.Ie7ff4be2cf348d8cc28ed0d626fc54becf7ea799@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: I6f2338c409076f0960bb7305cee4959a22aa9280	2022-04-20 11:21:04 +01:00
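The size rule described in this message can be sketched as a simple bounds check: the payload must hold at least the two country-code bytes, and may carry one extra byte for a NUL terminator. The helper name is hypothetical, and the real kernel expresses this through the netlink attribute policy rather than a standalone function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check: accept 2 bytes (alpha2[0], alpha2[1]) or 3 bytes
 * (the same two bytes followed by a NUL terminator).
 */
static inline bool example_alpha2_len_ok(size_t len)
{
	return len >= 2 && len <= 3;
}
```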
Midas Chien	65533e0212	ANDROID: Update the ABI representation Leaf changes summary: 1 artifact changed Changed leaf types summary: 0 leaf type changed Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 0 Added function Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 1 Added variable 1 Added variable: [A] 'int console_set_on_cmdline' Bug: 202781851 Signed-off-by: Midas Chien <midaschieh@google.com> Change-Id: I302b16e6eeec60b070721c54980d989fb1d31c26	2022-04-20 03:53:28 +00:00
Andrey Konovalov	a1013fd19b	FROMLIST: kasan: mark KASAN_VMALLOC flags as kasan_vmalloc_flags_t Fix sparse warning: mm/kasan/shadow.c:496:15: warning: restricted kasan_vmalloc_flags_t degrades to integer Link: https://lkml.kernel.org/r/52d8fccdd3a48d4bdfd0ff522553bac2a13f1579.1649351254.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/all/52d8fccdd3a48d4bdfd0ff522553bac2a13f1579.1649351254.git.andreyknvl@google.com/T/#u Bug: 217222520 Change-Id: I04133e8e9610b81fd0c856ece4f566110094bcb1 Signed-off-by: Andrey Konovalov <andreyknvl@google.com>	2022-04-20 00:58:47 +00:00
Vincenzo Frascino	c098614509	FROMLIST: kasan: fix hw tags enablement when KUNIT tests are disabled Kasan enables hw tags via kasan_enable_tagging() which based on the mode passed via kernel command line selects the correct hw backend. kasan_enable_tagging() is meant to be invoked indirectly via the cpu features framework of the architectures that support these backends. Currently the invocation of this function is guarded by CONFIG_KASAN_KUNIT_TEST which allows the enablement of the correct backend only when KUNIT tests are enabled in the kernel. This inconsistency was introduced in commit: `ed6d74446c` ("kasan: test: support async (again) and asymm modes for HW_TAGS") ... and prevents to enable MTE on arm64 when KUNIT tests for kasan hw_tags are disabled. Fix the issue making sure that the CONFIG_KASAN_KUNIT_TEST guard does not prevent the correct invocation of kasan_enable_tagging(). Link: https://lkml.kernel.org/r/20220408124323.10028-1-vincenzo.frascino@arm.com Fixes: `ed6d74446c` ("kasan: test: support async (again) and asymm modes for HW_TAGS") Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/all/20220408124323.10028-1-vincenzo.frascino@arm.com/T/#u Bug: 217222520 Change-Id: Ib4f05d74e091db57d2a8d5000d67137105d59a4c Signed-off-by: Andrey Konovalov <andreyknvl@google.com>	2022-04-20 00:58:37 +00:00
Fabio Aiuto	f60a0b3285	UPSTREAM: usb: dwc3: leave default DMA for PCI devices in case of a PCI dwc3 controller, leave the default DMA mask. Calling of a 64 bit DMA mask breaks the driver on cherrytrail based tablets like Cyberbook T116. Fixes: `45d39448b4` ("usb: dwc3: support 64 bit DMA in platform driver") Cc: stable <stable@vger.kernel.org> Reported-by: Hans De Goede <hdegoede@redhat.com> Tested-by: Fabio Aiuto <fabioaiuto83@gmail.com> Tested-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Fabio Aiuto <fabioaiuto83@gmail.com> Link: https://lore.kernel.org/r/20211113142959.27191-1-fabioaiuto83@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit `47ce45906c`) Bug: 228575378 Signed-off-by: Albert Wang <albertccwang@google.com> Change-Id: Id4e73116d4f8f603b8282e7bd7717cf28b5a3bf2	2022-04-20 00:40:38 +00:00
Sven Peter	3b508e8fe4	UPSTREAM: usb: dwc3: support 64 bit DMA in platform driver Currently, the dwc3 platform driver does not explicitly ask for a DMA mask. This makes it fall back to the default 32-bit mask which breaks the driver on systems that only have RAM starting above the first 4G like the Apple M1 SoC. Fix this by calling dma_set_mask_and_coherent with a 64bit mask. Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sven Peter <sven@svenpeter.dev> Link: https://lore.kernel.org/r/20210607061751.89752-1-sven@svenpeter.dev Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit `45d39448b4`) Bug: 228575378 Signed-off-by: Albert Wang <albertccwang@google.com> Change-Id: Icd994fde819fdd6fb0896d1bec1777fb73f454ee	2022-04-20 00:40:28 +00:00
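To illustrate why a 32-bit DMA mask fails on a system whose RAM starts above the first 4G, the sketch below re-creates the arithmetic in plain C. `EXAMPLE_DMA_BIT_MASK` follows the shape of the kernel's `DMA_BIT_MASK()` macro, while `example_addr_fits_mask()` and the sample address are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace copy of the DMA_BIT_MASK() idea: the low n bits set. */
#define EXAMPLE_DMA_BIT_MASK(n) \
	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical helper: true if every set bit of addr lies within the mask. */
static inline bool example_addr_fits_mask(uint64_t addr, uint64_t mask)
{
	return (addr & ~mask) == 0;
}
```

With a sample bus address of `0x800000000` (above 4 GiB), the 32-bit mask rejects the address and the 64-bit mask accepts it, which is the situation the commit addresses by requesting a 64-bit mask.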
Rick Yiu	03f40d5252	ANDROID: Update the ABI representation Leaf changes summary: 4 artifacts changed Changed leaf types summary: 0 leaf type changed Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 2 Added functions Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 2 Added variables 2 Added functions: [A] 'function int __traceiter_android_rvh_set_task_cpu(void, task_struct, unsigned int)' [A] 'function int __traceiter_android_rvh_update_rt_rq_load_avg(void, u64, rq, task_struct*, int)' 2 Added variables: [A] 'tracepoint __tracepoint_android_rvh_set_task_cpu' [A] 'tracepoint __tracepoint_android_rvh_update_rt_rq_load_avg' Bug: 201261299 Signed-off-by: Rick Yiu <rickyiu@google.com> Change-Id: Ie1265a9d638e7826b6185bfde0ab8f900b51c6b0	2022-04-19 23:56:45 +00:00
Kalesh Singh	6db38c5bbc	FROMGIT: EXP rcu: Move expedited grace period (GP) work to RT kthread_worker Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period latency because its workqueues run at SCHED_OTHER, and thus can be delayed by normal processes. This commit avoids these delays by moving the expedited GP work items to a real-time-priority kthread_worker. This option is controlled by CONFIG_RCU_EXP_KTHREAD and disabled by default on PREEMPT_RT=y kernels which disable expedited grace periods after boot by unconditionally setting rcupdate.rcu_normal_after_boot=1. The results were evaluated on arm64 Android devices (6GB ram) running 5.10 kernel, and capturing trace data in critical user-level code. The table below shows the resulting order-of-magnitude improvements in synchronize_rcu_expedited() latency:

|                         | workqueues  | kthread_worker | Diff    |
|-------------------------|-------------|----------------|---------|
| Count                   | 725         | 688            |         |
| Min Duration (ns)       | 326         | 447            | 37.12%  |
| Q1 (ns)                 | 39,428      | 38,971         | -1.16%  |
| Q2 - Median (ns)        | 98,225      | 69,743         | -29.00% |
| Q3 (ns)                 | 342,122     | 126,638        | -62.98% |
| Max Duration (ns)       | 372,766,967 | 2,329,671      | -99.38% |
| Avg Duration (ns)       | 2,746,353   | 151,242        | -94.49% |
| Standard Deviation (ns) | 19,327,765  | 294,408        |         |

The table below shows the range of maximums/minimums for synchronize_rcu_expedited() latency from all experiments:

|                          | workqueues  | kthread_worker | Diff    |
|--------------------------|-------------|----------------|---------|
| Total No. of Experiments | 25          | 23             |         |
| Largest Maximum (ns)     | 372,766,967 | 2,329,671      | -99.38% |
| Smallest Maximum (ns)    | 38,819      | 86,954         | 124.00% |
| Range of Maximums (ns)   | 372,728,148 | 2,242,717      |         |
| Largest Minimum (ns)     | 88,623      | 27,588         | -68.87% |
| Smallest Minimum (ns)    | 326         | 447            | 37.12%  |
| Range of Minimums (ns)   | 88,297      | 27,141         |         |

Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Tejun Heo <tj@kernel.org> Reported-by: Tim Murray <timmurray@google.com> Reported-by: Wei Wang <wvw@google.com> Tested-by: Kyle Lin <kylelin@google.com> Tested-by: Chunwei Lu <chunweilu@google.com> Tested-by: Lulu Wang <luluw@google.com> Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20220409003527.1587028-1-kaleshsingh@google.com/ (cherry picked from commit 3902dd17a29bd3ed1ead364a331a1761edb7162b git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git fastexp.2022.04.11a) Bug: 224791892 Change-Id: I4cc5d28f9ae99e44e92afc43ff4db4b71c415d6c	2022-04-19 23:20:24 +00:00
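The Diff column in the latency tables of the commit above appears to be the relative change from the workqueues figure to the kthread_worker figure. A small sketch of that arithmetic, under that assumption and with a hypothetical helper name, reproduces the reported values to within rounding:

```c
#include <assert.h>

/* Relative change from "before" to "after", in percent. */
static inline double example_pct_diff(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For instance, the median row (98,225 ns to 69,743 ns) comes out at roughly -29%, and the minimum row (326 ns to 447 ns) at roughly +37%, matching the table.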
Sajid Dalvi	68c87a277c	ANDROID: Update the ABI representation Leaf changes summary: 2 artifacts changed Changed leaf types summary: 0 leaf type changed Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 1 Added function Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 1 Added variable 1 Added function: [A] 'function int __traceiter_android_rvh_pci_d3_sleep(void, pci_dev, unsigned int*)' 1 Added variable: [A] 'tracepoint __tracepoint_android_rvh_pci_d3_sleep' Bug: 229125931 Signed-off-by: Sajid Dalvi <sdalvi@google.com> Change-Id: I7db067bd3d468b11826e5e59ee9c96706fbad760	2022-04-19 12:45:34 -05:00
Jens Axboe	699e6e3211	UPSTREAM: block: fix async_depth sysfs interface for mq-deadline A previous commit added this feature, but it inadvertently used the wrong variable to show/store the setting from/to, victimized by copy/paste. Fix it up so that the async_depth sysfs interface reads and writes from the right setting. Fixes: `07757588e5` ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Link: https://bugzilla.kernel.org/show_bug.cgi?id=215485 Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Change-Id: I28273a92ac7cbebc830df8b80ad461948402bcd2 (cherry picked from commit `46cdc45acb`) Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-19 15:34:11 +00:00
Sajid Dalvi	53ff5efb2c	ANDROID: PCI/PM: Use usleep_range for d3hot_delay This patch implements a vendor hook that changes d3hot_delay to use usleep_range() instead of msleep() to reduce the resume time from 20ms to 10ms. The call sequence is as follows: pci_pm_resume_noirq() pci_pm_default_resume_early() pci_power_up() pci_raw_set_power_state() --> msleep(10) The default d3hot_delay is 10ms. Using msleep for delays less than 20ms could result in delays up to 20ms. Reference: Documentation/timers/timers-howto.rst Using usleep_range() results in the delay being closer to 10ms and this reduces the resume time. Bug: 194231641 Change-Id: If3e4dcfb99edad302371273933fa6784854cf892 Signed-off-by: Sajid Dalvi <sdalvi@google.com>	2022-04-19 07:17:39 +00:00
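The "up to 20ms" figure can be illustrated with a simplified model of how `msleep()` picks its timeout: the requested milliseconds are rounded up to whole jiffies and one extra jiffy is added. The sketch below is a userspace approximation with hypothetical helper names, not the kernel implementation, and it ignores scheduling latency.

```c
#include <assert.h>

/* Simplified model of msecs_to_jiffies(): round up to whole jiffies. */
static inline unsigned long example_msecs_to_jiffies(unsigned int ms,
						     unsigned int hz)
{
	return ((unsigned long)ms * hz + 999) / 1000;
}

/*
 * Modelled msleep() timeout in milliseconds: the rounded-up jiffy count
 * plus one extra jiffy, converted back to milliseconds.
 */
static inline unsigned int example_msleep_timeout_ms(unsigned int ms,
						     unsigned int hz)
{
	unsigned long timeout = example_msecs_to_jiffies(ms, hz) + 1;

	return (unsigned int)(timeout * 1000 / hz);
}
```

Under this model a 10 ms request with HZ=100 becomes a 2-jiffy (20 ms) timeout, while with HZ=250 it becomes 16 ms. `usleep_range()` is backed by high-resolution timers, so it is not tied to the jiffy granularity in the same way.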
Minchan Kim	609fa1be7a	ANDROID: mm: page_pinner: fix elapsed time Put the elapsed time instead of zero all the time. Bug: 218731671 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: Ibb319e7dfce2d47481e2462bfb8423fbd2ddad66	2022-04-18 23:50:40 +00:00
Minchan Kim	d5d9a23576	ANDROID: mm: retry GUP with orignal gup_flags on failure If GUP fails due to modified flags by vendor hook, try one more time with original flag to keep the API semantic. Bug: 229391920 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: If2c20fe3e1752c9cc0d7d350ad84b38d8325f9ae	2022-04-18 23:50:32 +00:00
Todd Kjos	6acb261444	ANDROID: GKI: 4/15/2022 KMI freeze Set KMI_GENERATION=4 for 4/15 KMI freeze Leaf changes summary: 2734 artifacts changed Changed leaf types summary: 8 leaf types changed Removed/Changed/Added functions summary: 0 Removed, 2677 Changed, 0 Added function Removed/Changed/Added variables summary: 0 Removed, 49 Changed, 0 Added variable 2677 functions with some sub-type change: [C] 'function void* PDE_DATA(const inode)' at generic.c:799:1 has some sub-type changes: CRC (modversions) changed from 0x830fd868 to 0xb70c2d59 [C] 'function void __ClearPageMovable(page)' at compaction.c:138:1 has some sub-type changes: CRC (modversions) changed from 0x274b4312 to 0x1e91976d [C] 'function void __SetPageMovable(page, address_space)' at compaction.c:130:1 has some sub-type changes: CRC (modversions) changed from 0xaf251a50 to 0xb1221a47 ... 2674 omitted; 2677 symbols have only CRC changes 49 Changed variables: [C] 'pglist_data contig_page_data' was changed at memblock.c:96:1: size of symbol changed from 5696 to 6976 CRC (modversions) changed from 0xa8156534 to 0x7007215 type of variable changed: type size changed from 45568 to 55808 (in bits) 1 data member insertion: 'lru_gen_mm_walk mm_walk', at offset 51520 (in bits) at mmzone.h:1039:1 there are data member changes: type 'struct lruvec' of 'pglist_data::__lruvec' changed: type size changed from 1088 to 9664 (in bits) 3 data member insertions: 'lru_gen_struct lrugen', at offset 1024 (in bits) at mmzone.h:497:1 'lru_gen_mm_state mm_state', at offset 8576 (in bits) at mmzone.h:499:1 'u64 android_vendor_data1', at offset 9600 (in bits) at mmzone.h:504:1 there are data member changes: 'pglist_data* pgdat' offset changed (by +8512 bits) 2977 impacted interfaces 'unsigned long int flags' offset changed (by +8576 bits) 3 ('zone_padding _pad2_' .. 
'atomic_long_t vm_stat[38]') offsets changed (by +10240 bits) 2977 impacted interfaces [C] 'task_struct init_task' was changed at init_task.c:64:1: CRC (modversions) changed from 0x124472e1 to 0xf5fdc492 type of variable changed: type size hasn't changed 1 data member insertion: 'unsigned int in_lru_fault', at offset 11332 (in bits) at sched.h:840:1 there are data member changes: 4 ('unsigned int no_cgroup_migration' .. 'unsigned int in_memstall') offsets changed (by +1 bits) 2977 impacted interfaces [C] 'bus_type amba_bustype' was changed at bus.c:215:1: CRC (modversions) changed from 0x55933f58 to 0xefd95b38 [C] 'const clk_ops clk_fixed_factor_ops' was changed at clk-fixed-factor.c:60:1: CRC (modversions) changed from 0x38f07e1d to 0xb94d81d6 [C] 'const clk_ops clk_fixed_rate_ops' was changed at clk-fixed-rate.c:46:1: CRC (modversions) changed from 0x47fbebbe to 0x5299a868 ... 44 omitted; 47 symbols have only CRC changes 'struct lruvec at mmzone.h:280:1' changed: details were reported earlier 'struct mem_cgroup at memcontrol.h:211:1' changed: type size hasn't changed 1 data member insertion: 'lru_gen_mm_list mm_list', at offset 23168 (in bits) at memcontrol.h:337:1 there are data member changes: 2 ('u64 android_oem_data1' .. 'mem_cgroup_per_node* nodeinfo[]') offsets changed (by +192 bits) 2977 impacted interfaces 'struct mem_cgroup_per_node at memcontrol.h:107:1' changed: type size changed from 5184 to 13760 (in bits) there are data member changes: type 'struct lruvec' of 'mem_cgroup_per_node::lruvec' changed, as reported earlier 10 ('lruvec_stat* lruvec_stat_local' .. 
'mem_cgroup* memcg') offsets changed (by +8576 bits) 2977 impacted interfaces 'struct mm_struct at mm_types.h:419:1' changed: type size changed from 7680 to 7936 (in bits) there are data member changes: anonymous data member at offset 0 (in bits) changed from: struct {vm_area_struct* mmap; rb_root mm_rb; u64 vmacache_seqnum; rwlock_t mm_rb_lock; unsigned long int (file, unsigned long int, unsigned long int, unsigned long int, unsigned long int) get_unmapped_area; unsigned long int mmap_base; unsigned long int mmap_legacy_base; unsigned long int task_size; unsigned long int highest_vm_end; pgd_t* pgd; atomic_t membarrier_state; atomic_t mm_users; atomic_t mm_count; atomic_t has_pinned; atomic_long_t pgtables_bytes; int map_count; spinlock_t page_table_lock; rw_semaphore mmap_lock; list_head mmlist; unsigned long int hiwater_rss; unsigned long int hiwater_vm; unsigned long int total_vm; unsigned long int locked_vm; atomic64_t pinned_vm; unsigned long int data_vm; unsigned long int exec_vm; unsigned long int stack_vm; unsigned long int def_flags; seqcount_t write_protect_seq; spinlock_t arg_lock; unsigned long int start_code; unsigned long int end_code; unsigned long int start_data; unsigned long int end_data; unsigned long int start_brk; unsigned long int brk; unsigned long int start_stack; unsigned long int arg_start; unsigned long int arg_end; unsigned long int env_start; unsigned long int env_end; unsigned long int saved_auxv[46]; mm_rss_stat rss_stat; linux_binfmt* binfmt; mm_context_t context; unsigned long int flags; core_state* core_state; spinlock_t ioctx_lock; kioctx_table* ioctx_table; task_struct* owner; user_namespace* user_ns; file* exe_file; mmu_notifier_subscriptions* notifier_subscriptions; percpu_rw_semaphore* mmu_notifier_lock; atomic_t tlb_flush_pending; uprobes_state uprobes_state; work_struct async_put_work; u32 pasid; u64 android_kabi_reserved1;} to: struct {vm_area_struct* mmap; rb_root mm_rb; u64 vmacache_seqnum; rwlock_t mm_rb_lock; unsigned 
long int (file, unsigned long int, unsigned long int, unsigned long int, unsigned long int) get_unmapped_area; unsigned long int mmap_base; unsigned long int mmap_legacy_base; unsigned long int task_size; unsigned long int highest_vm_end; pgd_t* pgd; atomic_t membarrier_state; atomic_t mm_users; atomic_t mm_count; atomic_t has_pinned; atomic_long_t pgtables_bytes; int map_count; spinlock_t page_table_lock; rw_semaphore mmap_lock; list_head mmlist; unsigned long int hiwater_rss; unsigned long int hiwater_vm; unsigned long int total_vm; unsigned long int locked_vm; atomic64_t pinned_vm; unsigned long int data_vm; unsigned long int exec_vm; unsigned long int stack_vm; unsigned long int def_flags; seqcount_t write_protect_seq; spinlock_t arg_lock; unsigned long int start_code; unsigned long int end_code; unsigned long int start_data; unsigned long int end_data; unsigned long int start_brk; unsigned long int brk; unsigned long int start_stack; unsigned long int arg_start; unsigned long int arg_end; unsigned long int env_start; unsigned long int env_end; unsigned long int saved_auxv[46]; mm_rss_stat rss_stat; linux_binfmt* binfmt; mm_context_t context; unsigned long int flags; core_state* core_state; spinlock_t ioctx_lock; kioctx_table* ioctx_table; task_struct* owner; user_namespace* user_ns; file* exe_file; mmu_notifier_subscriptions* notifier_subscriptions; percpu_rw_semaphore* mmu_notifier_lock; atomic_t tlb_flush_pending; uprobes_state uprobes_state; work_struct async_put_work; u32 pasid; struct {list_head list; mem_cgroup* memcg; nodemask_t nodes;} lru_gen; u64 android_kabi_reserved1;} and size changed from 7680 to 7936 (in bits) (by +256 bits) 'unsigned long int cpu_bitmap[]' offset changed (by +256 bits) 2977 impacted interfaces 'struct pglist_data at mmzone.h:729:1' changed: details were reported earlier 'struct reclaim_state at swap.h:131:1' changed: type size changed from 64 to 128 (in bits) 1 data member insertion: 'lru_gen_mm_walk* mm_walk', at offset 64 (in 
bits) at swap.h:135:1 2977 impacted interfaces 'struct scsi_device at scsi_device.h:102:1' changed: type size hasn't changed 1 data member insertion: 'unsigned int silence_suspend', at offset 2448 (in bits) at scsi_device.h:209:1 there are data member changes: 'bool offline_already' offset changed (by +8 bits) 45 impacted interfaces 'struct task_struct at sched.h:660:1' changed: details were reported earlier Bug: 229630433 Signed-off-by: Todd Kjos <tkjos@google.com> Change-Id: Iacd70a1553401ead91351db0b5b8ec6dfee6e6ec	2022-04-18 11:58:28 -07:00
Bing Han	a034320a68	ANDROID: add vendor fields to swap_slots_cache to support multiple swap devices struct swap_slots_cache :: ANDROID_VENDOR_DATA(1) 1) Multiple swap devices can be supported; 2) There are different kinds of data; 3) During data reclamation, different types of data are exchanged to different swap devices; 4) Each swap device has corresponding arrays of slots and slots_ret; 5) Each swap device has corresponding indexes of nr, cur and n_ret; 6) This field is a pointer, it points to a struct which contains all the other arrays and indexes; Bug: 225795494 Change-Id: Icf116135926be98449a2d96fc458e58e5ad3b7e9 Signed-off-by: Bing Han <bing.han@transsion.com>	2022-04-18 10:20:37 -07:00
Bing Han	1b14ae01b0	ANDROID: add vendor fields to lruvec to record refault stats struct lruvec :: ANDROID_VENDOR_DATA(1) It is a pointer to a struct that records the following: 1) the count of workingset_restore pages for cached anonymous and file pages. This is used to adjust the strategy and amount of reclaiming data. Bug: 225795494 Change-Id: I34e57ee23b6c97ac91effa5b72513d238335a996 Signed-off-by: Bing Han <bing.han@transsion.com>	2022-04-18 10:20:23 -07:00
Bing Han	af4eb0e377	ANDROID: add vendor fields to swap_info_struct to record swap stats struct swap_info_struct :: ANDROID_VENDOR_DATA(1) It is a pointer to a struct that records the following: 1) total swapin pages; 2) total swapout pages; 3) total number of cold pages swapin; 4) total number of swapout pages, specified by userspace; 5) total number of swapout pages, specified by kernel; 6) the maximum number of swapout pages; 7) the maximum number of swapout pages allowed by kernel; 8) the maximum number of swapout pages allowed by framework; Bug: 225795494 Change-Id: I779145a83d87e339db86ec81c7f962be99946afb Signed-off-by: Bing Han <bing.han@transsion.com>	2022-04-18 10:20:06 -07:00
Bart Van Assche	fae5207ecc	ANDROID: scsi: ufs: Add suspend/resume SCSI command processing support This functionality is needed by UFS drivers to e.g. suspend SCSI command processing while reprogramming encryption keys if the hardware does not support concurrent I/O and key reprogramming. Bug: 227177294 Change-Id: I10f11e67da81fae7063674838760903d2c178baf Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:18:44 -07:00
Bart Van Assche	64293a57f1	ANDROID: scsi: ufs: Pass the clock scaling timeout as an argument Prepare for adding an additional ufshcd_clock_scaling_prepare() call with a different timeout. Bug: 227177294 Change-Id: I67a569b074c292a3c37f20a1b1e36f95b682c5e8 Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:18:30 -07:00
Bart Van Assche	69014b2b36	ANDROID: scsi: ufs: Move a clock scaling check Move a check related to clock scaling into ufshcd_devfreq_scale(). This patch prepares for adding a second ufshcd_clock_scaling_prepare() caller. Bug: 227177294 Change-Id: I928d4cbe64823960a6112ba7f98c18da6244a77c Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:18:15 -07:00
Bart Van Assche	aca52cabdb	ANDROID: scsi: ufs: Reduce the clock scaling latency Wait at most 20 ms before rechecking the doorbells instead of waiting for a potentially long time between doorbell checks. Bug: 227177294 Change-Id: I8a4dd0e93ca02435264961851a095a9c83c68240 Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:18:03 -07:00
Peter Wang	00ed95fe93	FROMGIT: scsi: ufs: core: scsi_get_lba() error fix When ufs initializes without scmd->device->sector_size set, scsi_get_lba() will get a wrong shift number and trigger an ubsan error. The shift exponent 4294967286 is too large for the 64-bit type 'sector_t' (aka 'unsigned long long'). Call scsi_get_lba() only when opcode is READ_10/WRITE_10/UNMAP. Link: https://lore.kernel.org/r/20220307111752.10465-1-peter.wang@mediatek.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Peter Wang <peter.wang@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from commit `2bd3b6b759` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Bug: 204438323 Change-Id: I3457fcc88d7c4164c55010e440d9f274c169553e Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:17:48 -07:00
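The odd-looking shift exponent 4294967286 can be reproduced with a few lines of arithmetic. The sketch assumes the LBA shift is computed as `ilog2(sector_size) - 9`, that the log of 0 evaluates to -1, and that `unsigned int` is 32 bits wide; the helper names are hypothetical and this is a userspace model rather than the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Model of an integer log2 that yields -1 for an input of 0. */
static inline int example_ilog2(uint32_t v)
{
	int log = -1;

	while (v) {
		v >>= 1;
		log++;
	}
	return log;
}

/* Shift from 512-byte sectors to logical blocks, held in an unsigned int. */
static inline unsigned int example_lba_shift(uint32_t sector_size)
{
	return (unsigned int)(example_ilog2(sector_size) - 9);
}
```

With `sector_size` still 0, the result is -10 reinterpreted as an unsigned value, that is 2^32 - 10 = 4294967286, which is far larger than the width of a 64-bit `sector_t`. A 512-byte sector size gives a shift of 0 and a 4096-byte sector size gives 3.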
Adrian Hunter	c0a4aeb7aa	FROMGIT: scsi: ufs: Fix runtime PM messages never-ending cycle Kernel messages produced during runtime PM can cause a never-ending cycle because user space utilities (e.g. journald or rsyslog) write the messages back to storage, causing runtime resume, more messages, and so on. Messages that tell of things that are expected to happen are arguably unnecessary, so suppress them. UFS driver messages are changed from dev_err() to dev_dbg(), which means they will not display unless activated by dynamic debug or by building with -DDEBUG. sdev->silence_suspend is set to skip messages from sd_suspend_common() "Synchronizing SCSI cache", "Stopping disk" and scsi_report_sense() "Power-on or device reset occurred" message (Note, that message appears when the LUN is accessed after runtime PM, not during runtime PM) Example messages from Ubuntu 21.10:

$ dmesg | tail
[ 1620.380071] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
[ 1620.408825] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[4, 4], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
[ 1620.409020] ufshcd 0000:00:12.5: ufshcd_find_max_sup_active_icc_level: Regulator capability was not set, actvIccLevel=0
[ 1620.409524] sd 0:0:0:0: Power-on or device reset occurred
[ 1622.938794] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 1622.939184] ufs_device_wlun 0:0:0:49488: Power-on or device reset occurred
[ 1625.183175] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
[ 1625.208041] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[4, 4], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
[ 1625.208311] ufshcd 0000:00:12.5: ufshcd_find_max_sup_active_icc_level: Regulator capability was not set, actvIccLevel=0
[ 1625.209035] sd 0:0:0:0: Power-on or device reset occurred

Note for stable: depends on patch "scsi: core: sd: Add silence_suspend flag to suppress some PM messages". Link: https://lore.kernel.org/r/20220228113652.970857-3-adrian.hunter@intel.com Cc: stable@vger.kernel.org Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Bug: 204438323 (cherry picked from commit `71bb9ab6e3` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Signed-off-by: Bart Van Assche <bvanassche@google.com> Change-Id: I2a50283162aa1dc100e1269533ac61056172bd1d	2022-04-18 10:17:34 -07:00
Adrian Hunter	0cd3abcaa4	FROMGIT: scsi: core: sd: Add silence_suspend flag to suppress some PM messages Kernel messages produced during runtime PM can cause a never-ending cycle because user space utilities (e.g. journald or rsyslog) write the messages back to storage, causing runtime resume, more messages, and so on. Messages that tell of things that are expected to happen are arguably unnecessary, so add a flag to suppress them. This flag is used by the UFS driver. Link: https://lore.kernel.org/r/20220228113652.970857-2-adrian.hunter@intel.com Cc: stable@vger.kernel.org Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from commit `af4edb1d50` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Change-Id: I8834c9d71618fd04635804779a41117629a75166 Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:17:21 -07:00
Keoseong Park	e46eb26194	FROMGIT: scsi: ufs: core: Remove wlun_dev_to_hba() Commit `edc0596cc0` ("scsi: ufs: core: Stop clearing UNIT ATTENTIONS") removed all callers of wlun_dev_to_hba(). Hence also remove the macro itself. Link: https://lore.kernel.org/r/1891546521.01644927481711.JavaMail.epsvc@epcpadp4 Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Keoseong Park <keosung.park@samsung.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from `482dcaa1c9` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Bug: 204438323 Change-Id: I1eee9827255305f5567ee21c65dbfca897e9daa5 Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:17:03 -07:00
Jinyoung Choi	85d759e39a	FROMGIT: scsi: ufs: Add checking lifetime attribute for WriteBooster Because WB performs writes in SLC mode, it is not possible to use WriteBooster indefinitely. Vendors can set a lifetime limit in the device. If the lifetime exceeds this limit, the device can disable the WB feature. The feature is defined in the "bWriteBoosterBufferLifeTimeEst (IDN = 1E)" attribute. With lifetime exceeding the limit value, the current driver continuously performs the following query: - Write Flag: WB_ENABLE / DISABLE - Read attr: Available Buffer Size - Read attr: Current Buffer Size This patch recognizes that WriteBooster is no longer supported by the device, and prevents unnecessary queries. Link: https://lore.kernel.org/r/1891546521.01643252701746.JavaMail.epsvc@epcpadp3 Reviewed-by: Asutosh Das <quic_asutoshd@quicinc.com> Acked-by: Avri Altman <avri.altman@wdc.com> Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from commit `f681d1078d` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Bug: 204438323 Change-Id: I9178b31aaeb75ef157aa8e12b1dd0f5a646f0579 Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:16:38 -07:00
Kiwoong Kim	44b7a4f00f	FROMGIT: scsi: ufs: Use generic error code in ufshcd_set_dev_pwr_mode() The return value of ufshcd_set_dev_pwr_mode() is passed to device PM core. However, the function currently returns a SCSI result which the PM core doesn't understand. This might lead to unexpected behaviors in userland; a platform reset was observed in Android. Use a generic error code for SSU failures. Link: https://lore.kernel.org/r/1642743182-54098-1-git-send-email-kwmad.kim@samsung.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Kiwoong Kim <kwmad.kim@samsung.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from commit `ad6c8a4264` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Bug: 204438323 Change-Id: I051742cd8ea2215cbd94ac3dedc4fab2863a9c6e Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:16:16 -07:00
Miaoqian Lin	aeedc78679	FROMGIT: scsi: ufs: ufs-mediatek: Fix error checking in ufs_mtk_init_va09_pwr_ctrl() The function regulator_get() returns an error pointer. Use IS_ERR() to validate the return value. Link: https://lore.kernel.org/r/20211222070930.9449-1-linmq006@gmail.com Fixes: `cf137b3ea4` ("scsi: ufs-mediatek: Support VA09 regulator operations") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from commit `3ba880a12d` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Bug: 204438323 Change-Id: I9b6d2fbb9d1d795f30e7873837b59dbf76569d1b Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:15:59 -07:00
SEO HOYOUNG	1fc4aef3d5	FROMGIT: scsi: ufs: Modify Tactive time setting conditions The Tactive time determines the waiting time before burst at hibern8 exit and is determined by hardware at linkup time. However, for Samsung devices, increase the host's Tactive time by 100us for stability. If the HCI's Tactive time is equal to or greater than the device's, the additional 100us should be set. Link: https://lore.kernel.org/r/20220106213924.186263-1-hy50.seo@samsung.com Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Acked-by: Avri Altman <Avri.Altman@wdc.com> Signed-off-by: SEO HOYOUNG <hy50.seo@samsung.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from commit `9008661e19` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Bug: 204438323 Change-Id: I6ffe1c279cab9b780558de763e94cf01cfd4be3e Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:15:39 -07:00
Adrian Hunter	d87405c2fe	FROMGIT: scsi: ufs: ufs-pci: Add support for Intel ADL Add PCI ID and callbacks to support Intel Alder Lake. Link: https://lore.kernel.org/r/20211124204218.1784559-1-adrian.hunter@intel.com Cc: stable@vger.kernel.org # v5.15+ Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from commit `7dc9fb47bc` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Bug: 204438323 Change-Id: I370ad2e01bc67fec675f4c2ebf7f5d222d0afb07 Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:15:19 -07:00
Ye Guojin	b65cfd7b92	FROMGIT: scsi: ufs: ufs-mediatek: Add put_device() after of_find_device_by_node() This was found by coccicheck: ./drivers/scsi/ufs/ufs-mediatek.c, 211, 1-7, ERROR missing put_device; call of_find_device_by_node on line 1185, but without a corresponding object release within this function. Link: https://lore.kernel.org/r/20211110105133.150171-1-ye.guojin@zte.com.cn Reported-by: Zeal Robot <zealci@zte.com.cn> Reviewed-by: Peter Wang <peter.wang@mediatek.com> Signed-off-by: Ye Guojin <ye.guojin@zte.com.cn> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Bug: 204438323 (cherry picked from commit `cc03facb1c` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Change-Id: I0aa3e8be7f7bc84d5c9b4dc3fb1e5586bc7e5ddf Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:14:56 -07:00
Bean Huo	4f4bf31d39	FROMGIT: scsi: ufs: ufshpb: Fix warning in ufshpb_set_hpb_read_to_upiu() Fix the following sparse warnings in ufshpb_set_hpb_read_to_upiu(): sparse warnings: (new ones prefixed by >>) drivers/scsi/ufs/ufshpb.c:335:27: sparse: sparse: cast from restricted __be64 drivers/scsi/ufs/ufshpb.c:335:25: sparse: expected restricted __be64 [usertype] ppn_tmp drivers/scsi/ufs/ufshpb.c:335:25: sparse: got unsigned long long [usertype] Link: https://lore.kernel.org/r/20211111222452.384089-1-huobean@gmail.com Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Bean Huo <beanhuo@micron.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from commit `73185a1377` git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next) Bug: 204438323 Change-Id: I6770255dab00a3ff8c5f7b4499efdfa0dd53839c Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:14:37 -07:00
Bart Van Assche	acb0ef885c	ANDROID: scsi: ufs: Minimize the difference with the upstream code Make the order of function declarations match the order in the upstream code. Bug: 204438323 Change-Id: Iac453cdb5ae67184c4218639ab8b91da03fabc66 Signed-off-by: Bart Van Assche <bvanassche@google.com>	2022-04-18 10:14:20 -07:00
Yu Zhao	321995d280	ANDROID: GKI: build multi-gen LRU Set CONFIG_LRU_GEN=y to build the multi-gen LRU. To enable it at runtime, run echo y >/sys/kernel/mm/lru_gen/enabled. Bug: 227651406 Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: If5f6ece8373f1da2eb1eb96d809a2e216ebc0fbc	2022-04-18 10:11:56 -07:00
Yu Zhao	306dbfb34c	FROMLIST: mm: multi-gen LRU: design doc Add a design doc. Link: https://lore.kernel.org/r/20220309021230.721028-15-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I1d66302e618416291ebf9647e20625fb76613c89	2022-04-18 10:11:56 -07:00
Yu Zhao	8b006e4d1c	FROMLIST: mm: multi-gen LRU: admin guide Add an admin guide. Link: https://lore.kernel.org/r/20220309021230.721028-14-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I6fafbd7eb3ef6819cfcd30376459f14893f17c63	2022-04-18 10:11:56 -07:00
Yu Zhao	3cf1dfaaa5	FROMLIST: mm: multi-gen LRU: debugfs interface Add /sys/kernel/debug/lru_gen for working set estimation and proactive reclaim. These features are required to optimize job scheduling (bin packing) in data centers [1][2]. Compared with the page table-based approach and the PFN-based approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has the following advantages: 1. It offers better choices because it is aware of memcgs, NUMA nodes, shared mappings and unmapped page cache. 2. It is more scalable because it is O(nr_hot_pages), whereas the PFN-based approach is O(nr_total_pages). Add /sys/kernel/debug/lru_gen_full for debugging. [1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 Link: https://lore.kernel.org/r/20220309021230.721028-13-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: Ie558098e0a24a647f77f4eacc4d72576173fc0b8	2022-04-18 10:11:56 -07:00
Yu Zhao	96f4a592d3	FROMLIST: mm: multi-gen LRU: thrashing prevention Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as requested by many desktop users [1]. When set to value N, it prevents the working set of N milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory. Based on the average human detectable lag (~100ms), N=1000 usually eliminates intolerable lags due to thrashing. Larger values like N=3000 make lags less noticeable at the risk of premature OOM kills. Compared with the size-based approach, e.g., [2], this time-based approach has the following advantages: 1. It is easier to configure because it is agnostic to applications and memory sizes. 2. It is more reliable because it is directly wired to the OOM killer. [1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/ [2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/ Link: https://lore.kernel.org/r/20220309021230.721028-12-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I482d33f3beaf7723d2f3eeaaa5b4f12bcb9b48a1	2022-04-18 10:11:55 -07:00
Yu Zhao	76fdc1010b	FROMLIST: mm: multi-gen LRU: kill switch Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that can be disabled include: 0x0001: the multi-gen LRU core 0x0002: walking page table, when arch_has_hw_pte_young() returns true 0x0004: clearing the accessed bit in non-leaf PMD entries, when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y [yYnN]: apply to all the components above E.g., echo y >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0007 echo 5 >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0005 NB: the page table walks happen on the scale of seconds under heavy memory pressure, in which case the mmap_lock contention is a lesser concern, compared with the LRU lock contention and the I/O congestion. So far the only well-known case of the mmap_lock contention happens on Android, due to Scudo [1] which allocates several thousand VMAs for merely a few hundred MBs. The SPF and the Maple Tree also have provided their own assessments [2][3]. However, if walking page tables does worsen the mmap_lock contention, the kill switch can be used to disable it. In this case the multi-gen LRU will suffer a minor performance degradation, as shown previously. Clearing the accessed bit in non-leaf PMD entries can also be disabled, since this behavior was not tested on x86 varieties other than Intel and AMD. 
[1] https://source.android.com/devices/tech/debug/scudo [2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/ [3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/ Link: https://lore.kernel.org/r/20220309021230.721028-11-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I71801d9470a2588cad8bfd14fbcfafc7b010aa03	2022-04-18 10:11:55 -07:00
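The component values listed for /sys/kernel/mm/lru_gen/enabled in the kill switch commit behave like a bitmask. The following is a minimal Python model of how a written value could map to the value read back. It is a sketch, not the kernel's parser: the constant and function names are hypothetical, it assumes all three components are available on the running architecture, and it assumes that numeric input is parsed with automatic base detection and that unknown bits are ignored.

```python
# Component bits as listed in the kill switch commit message.
CORE = 0x0001           # the multi-gen LRU core
MM_WALK = 0x0002        # walking page tables
NONLEAF_YOUNG = 0x0004  # clearing the accessed bit in non-leaf PMD entries
ALL = CORE | MM_WALK | NONLEAF_YOUNG


def parse_enabled(text: str) -> int:
    """Model a write to the kill switch: [yYnN] or a number."""
    text = text.strip()
    if text in ("y", "Y"):
        return ALL
    if text in ("n", "N"):
        return 0
    # Assumption: numbers use automatic base detection ("5", "0x5"),
    # and bits outside the three known components are dropped.
    return int(text, 0) & ALL


def show_enabled(mask: int) -> str:
    """Model a read of the kill switch: four hex digits."""
    return f"0x{mask:04x}"
```

With this model, writing `y` reads back as `0x0007` and writing `5` reads back as `0x0005`, matching the example in the commit message; `5` keeps the core and the non-leaf PMD handling while turning off page table walks.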
Yu Zhao	082bc8296a	FROMLIST: mm: multi-gen LRU: optimize multiple memcgs When multiple memcgs are available, it is possible to make better choices based on generations and tiers and therefore improve the overall performance under global memory pressure. This patch adds a rudimentary optimization to select memcgs that can drop single-use unmapped clean pages first. Doing so reduces the chance of going into the aging path or swapping. These two operations can be costly. A typical example that benefits from this optimization is a server running mixed types of workloads, e.g., heavy anon workload in one memcg and heavy buffered I/O workload in the other. Though this optimization can be applied to both kswapd and direct reclaim, it is only added to kswapd to keep the patchset manageable. Later improvements will cover the direct reclaim path. Server benchmark results: Mixed workloads: fio (buffered I/O): -[23, 25]% IOPS BW patch1-8: 2960k 11.3GiB/s patch1-9: 2248k 8783MiB/s memcached (anon): +[210, 214]% Ops/sec KB/sec patch1-8: 606940.09 23576.89 patch1-9: 1895197.49 73619.93 Mixed workloads: fio (buffered I/O): -[4, 6]% IOPS BW 5.18-ed4643521e6a: 2369k 9255MiB/s patch1-9: 2248k 8783MiB/s memcached (anon): +[510, 516]% Ops/sec KB/sec 5.18-ed4643521e6a: 309189.58 12010.61 patch1-9: 1895197.49 73619.93 Configurations: (changes since patch 6) cat mixed.sh modprobe brd rd_nr=2 rd_size=56623104 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 mkfs.ext4 /dev/ram1 mount -t ext4 /dev/ram1 /mnt memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=90m --group_reporting & pid=$! 
sleep 200 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed kill -INT $pid wait Client benchmark results: no change (CONFIG_MEMCG=n) Link: https://lore.kernel.org/r/20220309021230.721028-10-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I0641467dbd7c5ba0645602cec7fe8d6fdb750edb	2022-04-18 10:11:55 -07:00
Yu Zhao	93c4f86793	FROMLIST: mm: multi-gen LRU: support page table walks To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing in common with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. 
It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[5.5, 7.5]% Ops/sec KB/sec patch1-7: 1014393.57 39455.42 patch1-8: 1078507.59 41949.15 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 45.54% lzo1x_1_do_compress (real work) 9.56% page_vma_mapped_walk 6.70% _raw_spin_unlock_irq 2.78% ptep_clear_flush 2.47% do_raw_spin_lock 2.22% __zram_bvec_write 1.87% lru_gen_look_around 1.78% memmove 1.77% obj_malloc 1.44% free_unref_page_list patch1-8 47.02% lzo1x_1_do_compress (real work) 6.73% page_vma_mapped_walk 6.14% _raw_spin_unlock_irq 3.39% walk_pte_range 2.63% ptep_clear_flush 2.29% __zram_bvec_write 2.10% do_raw_spin_lock 1.81% memmove 1.73% obj_malloc 1.53% free_unref_page_list Configurations: no change [1] https://lwn.net/Articles/23732/ [2] https://source.android.com/devices/tech/debug/scudo Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh 
<kaleshsingh@google.com> Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e	2022-04-18 10:11:55 -07:00
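Optimization 2 in the page table walk commit relies on generational Bloom filters: branches found worth visiting during one iteration are recorded so that the next iteration can skip everything else. The Python sketch below illustrates the general idea with two alternating filters. The filter size, the SHA-256 hash, the two bits per key, and all class and method names are assumptions chosen for illustration and are not taken from the patch.

```python
import hashlib


class GenerationalBloom:
    """Alternating Bloom filters indexed by an iteration sequence number."""

    def __init__(self, bits: int = 1 << 15, nr_filters: int = 2):
        self.bits = bits
        self.filters = [bytearray(bits // 8) for _ in range(nr_filters)]

    def _positions(self, key: int):
        # Derive two bit positions per key (illustrative hash choice).
        digest = hashlib.sha256(key.to_bytes(8, "little")).digest()
        first = int.from_bytes(digest[:4], "little") % self.bits
        second = int.from_bytes(digest[4:8], "little") % self.bits
        return first, second

    def _filter(self, seq: int) -> bytearray:
        return self.filters[seq % len(self.filters)]

    def reset(self, seq: int) -> None:
        """Clear the filter that the given iteration will populate."""
        flt = self._filter(seq)
        flt[:] = bytes(len(flt))

    def record(self, seq: int, key: int) -> None:
        """Remember that the branch identified by key was populated."""
        flt = self._filter(seq)
        for pos in self._positions(key):
            flt[pos // 8] |= 1 << (pos % 8)

    def maybe_populated(self, seq: int, key: int) -> bool:
        """False means definitely not recorded; True may be a false positive."""
        flt = self._filter(seq)
        return all(flt[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

In this model, a walker running iteration `seq` would query the filter for `seq` to decide whether a branch is worth descending into, and record promising branches into the filter for `seq + 1` after resetting it. A false positive only costs an unnecessary walk, which is why a Bloom filter is an acceptable trade-off here.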
Yu Zhao	c8356f7573	FROMLIST: mm: multi-gen LRU: exploit locality in rmap Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, the rmap has a high CPU cost in the reclaim path. This patch exploits spatial locality to reduce the trips into the rmap. When shrink_page_list() walks the rmap and finds a young PTE, a new function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent PTEs. On finding another young PTE, it clears the accessed bit and updates the gen counter of the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[4, 6]% Ops/sec KB/sec patch1-6: 964656.80 37520.88 patch1-7: 1014393.57 39455.42 Configurations: no change Client benchmark results: kswapd profiles: patch1-6 36.13% lzo1x_1_do_compress (real work) 19.16% page_vma_mapped_walk 6.55% _raw_spin_unlock_irq 4.02% do_raw_spin_lock 2.32% anon_vma_interval_tree_iter_first 2.11% ptep_clear_flush 1.76% __zram_bvec_write 1.64% folio_referenced_one 1.40% memmove 1.35% obj_malloc patch1-7 45.54% lzo1x_1_do_compress (real work) 9.56% page_vma_mapped_walk 6.70% _raw_spin_unlock_irq 2.78% ptep_clear_flush 2.47% do_raw_spin_lock 2.22% __zram_bvec_write 1.87% lru_gen_look_around 1.78% memmove 1.77% obj_malloc 1.44% free_unref_page_list Configurations: no change Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr 
<d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c	2022-04-18 10:11:55 -07:00
Yu Zhao	436dff20eb	FROMLIST: mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. Promotion in the aging path does not require any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The eviction consumes old generations. Given an lruvec, it increments min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. Each generation is divided into multiple tiers. Tiers represent different ranges of numbers of accesses through file descriptors. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in page->flags. In contrast to moving across generations, which requires the LRU lock, moving across tiers only involves operations on page->flags. The feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. 
The eviction moves a page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[38, 40]% IOPS BW 5.18-ed4643521e6a: 2547k 9989MiB/s patch1-6: 3540k 13.5GiB/s Single workload: memcached (anon): +[103, 107]% Ops/sec KB/sec 5.18-ed4643521e6a: 469048.66 18243.91 patch1-6: 964656.80 37520.88 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. 
patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO \| __GFP_ZERO \| __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO \| __GFP_ZERO \| __GFP_HIGHMEM \| __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.18-ed4643521e6a 39.56% page_vma_mapped_walk 19.32% lzo1x_1_do_compress (real work) 7.18% do_raw_spin_lock 4.23% _raw_spin_unlock_irq 2.26% vma_interval_tree_subtree_search 2.12% vma_interval_tree_iter_next 2.11% folio_referenced_one 1.90% anon_vma_interval_tree_iter_first 1.47% ptep_clear_flush 0.97% __anon_vma_interval_tree_subtree_search patch1-6 36.13% lzo1x_1_do_compress (real work) 19.16% page_vma_mapped_walk 6.55% _raw_spin_unlock_irq 
4.02% do_raw_spin_lock 2.32% anon_vma_interval_tree_iter_first 2.11% ptep_clear_flush 1.76% __zram_bvec_write 1.64% folio_referenced_one 1.40% memmove 1.35% obj_malloc Configurations: CPU: single Snapdragon 7c Mem: total 4G Chrome OS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86	2022-04-18 10:11:54 -07:00
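The minimal implementation commit places a page accessed N times through file descriptors in tier order_base_2(N). The Python sketch below models that mapping, assuming order_base_2() follows the usual kernel definition (the smallest k with 2**k >= n, and 0 for n <= 1). The function `tier_of` is a hypothetical helper name, not a kernel symbol.

```python
def order_base_2(n: int) -> int:
    """Smallest k such that 2**k >= n; 0 for n <= 1."""
    return (n - 1).bit_length() if n > 1 else 0


def tier_of(fd_accesses: int) -> int:
    """Tier of a page accessed fd_accesses times through file descriptors."""
    return order_base_2(fd_accesses)
```

For N = 0..8 this gives tiers 0, 0, 1, 2, 2, 3, 3, 3, 3. N = 0 and N = 1 both land in the first tier, which is the baseline the feedback loop compares the other tiers against, and each higher tier covers a range of access counts twice as wide as the one before.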
Yu Zhao	fe302bd1f9	FROMLIST: mm: multi-gen LRU: groundwork Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they are aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages can be evicted regardless of swap constraints. These three variables are monotonically increasing. Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in page->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations. The gen counter stores a value within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it stores 0. There are two conceptually independent procedures: "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both procedures can be invoked from userspace for the purposes of working set estimation and proactive reclaim. These features are required to optimize job scheduling (bin packing) in data centers. The variable size of the sliding window is designed for such use cases [1][2]. To avoid confusion, the terms "hot" and "cold" will be applied to the multi-gen LRU, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive LRU, as usual. The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one through page tables and the other through file descriptors. The protection of the former channel is by design stronger because: 1. The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit. 2. 
The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit. 3. The penalty of underprotecting the former channel is higher because applications usually do not prepare themselves for major page faults like they do for blocked I/O. E.g., GUI applications commonly use dedicated I/O threads to avoid blocking the rendering threads. There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present; the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed [3][4]. The next patch will address the "outlying refaults". Three macros, i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in this patch to make the entire patchset less diffy. A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page has not been used since then. This protocol, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS. 
[1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 [3] https://lwn.net/Articles/495543/ [4] https://lwn.net/Articles/815342/ Link: https://lore.kernel.org/r/20220309021230.721028-6-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512	2022-04-18 10:11:54 -07:00
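The generation bookkeeping described in the commit message above (truncated generation numbers, a gen counter in [1, MAX_NR_GENS], and a sliding window of MIN_NR_GENS to MAX_NR_GENS generations) can be sketched in plain userspace C. This is an illustrative sketch, not the patch's code: MAX_NR_GENS is assumed to be 4 (the message does not state its value), MIN_NR_GENS is 2 per the second-chance argument, and every helper name below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* MIN_NR_GENS follows from the second-chance protocol in the message. */
#define MIN_NR_GENS 2U
/* ASSUMPTION: the message does not give this value; 4 is used for illustration. */
#define MAX_NR_GENS 4U

/* Hypothetical stand-in for order_base_2(): smallest b with (1 << b) >= n. */
static unsigned int order_base_2_sketch(unsigned int n)
{
	unsigned int bits = 0;

	while ((1U << bits) < n)
		bits++;
	return bits;
}

/* Truncate a monotonically increasing sequence number into a list index. */
static unsigned int gen_from_seq_sketch(unsigned long seq)
{
	return (unsigned int)(seq % MAX_NR_GENS);
}

/*
 * Value kept in the gen counter while a page is on one of the lists:
 * the index plus one, so that 0 can mean "not on any list".
 */
static unsigned int gen_counter_sketch(unsigned long seq)
{
	unsigned int counter = gen_from_seq_sketch(seq) + 1;

	assert(counter >= 1 && counter <= MAX_NR_GENS);
	return counter;
}

/* The sliding window [min_seq, max_seq] must hold MIN..MAX generations. */
static bool window_ok_sketch(unsigned long min_seq, unsigned long max_seq)
{
	unsigned long nr_gens;

	if (max_seq < min_seq)
		return false;
	nr_gens = max_seq - min_seq + 1;
	return nr_gens >= MIN_NR_GENS && nr_gens <= MAX_NR_GENS;
}
```

Under these assumptions the counter needs order_base_2_sketch(MAX_NR_GENS + 1) bits, because it must represent MAX_NR_GENS list positions plus the "not on a list" value 0. Sequence numbers 3 and 4 map to counter values 4 and 1, which shows how the truncated index wraps while the sequence numbers themselves keep increasing.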
Yu Zhao	4c6c817249	FROMLIST: mm/vmscan.c: refactor shrink_node() This patch refactors shrink_node() to improve readability for the upcoming changes to mm/vmscan.c. Link: https://lore.kernel.org/r/20220309021230.721028-4-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 227651406 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I186f43f946de0d40d54883fb31114840fc749a57	2022-04-18 10:11:54 -07:00


988084 Commits