Extract the L1ENTRY_ATTR_{PRON,GRAN}_MASK constants out of macros that
create the corresponding constants. This will allow EL1 users to use the
masks to get the fields out of register values. Also extract
L1ENTRY_L2TABLE_ADDR_SHIFT for adjusting the L2 table address.
Bug: 190463801
Signed-off-by: David Brazdil <dbrazdil@google.com>
Change-Id: I45578857694ca39266fe45b3c00dbea33738167f
[ Upstream commit cc5095747e ]
[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance. A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].
This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).
So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed. More discussion and background can be found in the
thread starting at [2].
[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz
[2] https://lore.kernel.org/r/Yg0m6IjcNmfaSokM@google.com
Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a8db@syzkaller.appspotmail.com
Reported-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/YiDS9wVfq4mM2jGK@mit.edu
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: Ifa528c386540d70eafb7ec55b238add0c6ba7387
Added #ifdefs around fuse-bpf init/cleanup code
Bug: 202785178
Test: builds with and without CONFIG_FUSE_BPF
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ie15bb04e439b496e4842303437b3f55c3da14f2c
fuse_args is not suitable for use in the uapi - it is not stable, and
contains internal pointers. Replace with stable equivalent.
The end_offset values are currently unused and unset, but will be used
in a follow up patch by the verifier.
Test: fuse_test, atest ScopedStorageDeviceTest pass
Bug: 202785178
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ic1c12f9706aeae233cc30a0d68ed2533030e485b
Kasan enables hw tags via kasan_enable_tagging() which based on the mode
passed via kernel command line selects the correct hw backend.
kasan_enable_tagging() is meant to be invoked indirectly via the cpu
features framework of the architectures that support these backends.
Currently the invocation of this function is guarded by
CONFIG_KASAN_KUNIT_TEST which allows the enablement of the correct backend
only when KUNIT tests are enabled in the kernel.
This inconsistency was introduced in commit:
ed6d74446c ("kasan: test: support async (again) and asymm modes for HW_TAGS")
... and prevents to enable MTE on arm64 when KUNIT tests for kasan hw_tags are
disabled.
Fix the issue making sure that the CONFIG_KASAN_KUNIT_TEST guard does not
prevent the correct invocation of kasan_enable_tagging().
Link: https://lkml.kernel.org/r/20220408124323.10028-1-vincenzo.frascino@arm.com
Fixes: ed6d74446c ("kasan: test: support async (again) and asymm modes for HW_TAGS")
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/all/20220408124323.10028-1-vincenzo.frascino@arm.com/T/#u
Bug: 217222520
Change-Id: Ib4f05d74e091db57d2a8d5000d67137105d59a4c
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Currently, the dwc3 platform driver does not explicitly ask for
a DMA mask. This makes it fall back to the default 32-bit mask which
breaks the driver on systems that only have RAM starting above the
first 4G like the Apple M1 SoC.
Fix this by calling dma_set_mask_and_coherent with a 64bit mask.
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20210607061751.89752-1-sven@svenpeter.dev
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 45d39448b4)
Bug: 228575378
Signed-off-by: Albert Wang <albertccwang@google.com>
Change-Id: Icd994fde819fdd6fb0896d1bec1777fb73f454ee
A previous commit added this feature, but it inadvertently used the wrong
variable to show/store the setting from/to, victimized by copy/paste. Fix
it up so that the async_depth sysfs interface reads and writes from the
right setting.
Fixes: 07757588e5 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215485
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change-Id: I28273a92ac7cbebc830df8b80ad461948402bcd2
(cherry picked from commit 46cdc45acb)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
This patch implements a vendor hook that changes d3hot_delay to use
usleep_range() instead of msleep() to reduce the resume time from
20ms to 10ms.
The call sequence is as follows:
pci_pm_resume_noirq()
pci_pm_default_resume_early()
pci_power_up()
pci_raw_set_power_state() --> msleep(10)
The default d3hot_delay is 10ms. Using msleep for delays less than 20ms
could result in delays up to 20ms.
Reference: Documentation/timers/timers-howto.rst
Using usleep_range() results in the delay being closer to 10ms and this
reduces the resume time.
Bug: 194231641
Change-Id: If3e4dcfb99edad302371273933fa6784854cf892
Signed-off-by: Sajid Dalvi <sdalvi@google.com>
Put the elapsed time instead of zero all the time.
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ibb319e7dfce2d47481e2462bfb8423fbd2ddad66
If GUP fails due to modified flags by vendor hook,
try one more time with original flag to keep the
API semantic.
Bug: 229391920
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: If2c20fe3e1752c9cc0d7d350ad84b38d8325f9ae
Set KMI_GENERATION=4 for 4/15 KMI freeze
Leaf changes summary: 2734 artifacts changed
Changed leaf types summary: 8 leaf types changed
Removed/Changed/Added functions summary: 0 Removed, 2677 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 49 Changed, 0 Added variable
2677 functions with some sub-type change:
[C] 'function void* PDE_DATA(const inode*)' at generic.c:799:1 has some sub-type changes:
CRC (modversions) changed from 0x830fd868 to 0xb70c2d59
[C] 'function void __ClearPageMovable(page*)' at compaction.c:138:1 has some sub-type changes:
CRC (modversions) changed from 0x274b4312 to 0x1e91976d
[C] 'function void __SetPageMovable(page*, address_space*)' at compaction.c:130:1 has some sub-type changes:
CRC (modversions) changed from 0xaf251a50 to 0xb1221a47
... 2674 omitted; 2677 symbols have only CRC changes
49 Changed variables:
[C] 'pglist_data contig_page_data' was changed at memblock.c:96:1:
size of symbol changed from 5696 to 6976
CRC (modversions) changed from 0xa8156534 to 0x7007215
type of variable changed:
type size changed from 45568 to 55808 (in bits)
1 data member insertion:
'lru_gen_mm_walk mm_walk', at offset 51520 (in bits) at mmzone.h:1039:1
there are data member changes:
type 'struct lruvec' of 'pglist_data::__lruvec' changed:
type size changed from 1088 to 9664 (in bits)
3 data member insertions:
'lru_gen_struct lrugen', at offset 1024 (in bits) at mmzone.h:497:1
'lru_gen_mm_state mm_state', at offset 8576 (in bits) at mmzone.h:499:1
'u64 android_vendor_data1', at offset 9600 (in bits) at mmzone.h:504:1
there are data member changes:
'pglist_data* pgdat' offset changed (by +8512 bits)
2977 impacted interfaces
'unsigned long int flags' offset changed (by +8576 bits)
3 ('zone_padding _pad2_' .. 'atomic_long_t vm_stat[38]') offsets changed (by +10240 bits)
2977 impacted interfaces
[C] 'task_struct init_task' was changed at init_task.c:64:1:
CRC (modversions) changed from 0x124472e1 to 0xf5fdc492
type of variable changed:
type size hasn't changed
1 data member insertion:
'unsigned int in_lru_fault', at offset 11332 (in bits) at sched.h:840:1
there are data member changes:
4 ('unsigned int no_cgroup_migration' .. 'unsigned int in_memstall') offsets changed (by +1 bits)
2977 impacted interfaces
[C] 'bus_type amba_bustype' was changed at bus.c:215:1:
CRC (modversions) changed from 0x55933f58 to 0xefd95b38
[C] 'const clk_ops clk_fixed_factor_ops' was changed at clk-fixed-factor.c:60:1:
CRC (modversions) changed from 0x38f07e1d to 0xb94d81d6
[C] 'const clk_ops clk_fixed_rate_ops' was changed at clk-fixed-rate.c:46:1:
CRC (modversions) changed from 0x47fbebbe to 0x5299a868
... 44 omitted; 47 symbols have only CRC changes
'struct lruvec at mmzone.h:280:1' changed:
details were reported earlier
'struct mem_cgroup at memcontrol.h:211:1' changed:
type size hasn't changed
1 data member insertion:
'lru_gen_mm_list mm_list', at offset 23168 (in bits) at memcontrol.h:337:1
there are data member changes:
2 ('u64 android_oem_data1' .. 'mem_cgroup_per_node* nodeinfo[]') offsets changed (by +192 bits)
2977 impacted interfaces
'struct mem_cgroup_per_node at memcontrol.h:107:1' changed:
type size changed from 5184 to 13760 (in bits)
there are data member changes:
type 'struct lruvec' of 'mem_cgroup_per_node::lruvec' changed, as reported earlier
10 ('lruvec_stat* lruvec_stat_local' .. 'mem_cgroup* memcg') offsets changed (by +8576 bits)
2977 impacted interfaces
'struct mm_struct at mm_types.h:419:1' changed:
type size changed from 7680 to 7936 (in bits)
there are data member changes:
anonymous data member at offset 0 (in bits) changed from:
struct {vm_area_struct* mmap; rb_root mm_rb; u64 vmacache_seqnum; rwlock_t mm_rb_lock; unsigned long int (file*, unsigned long int, unsigned long int, unsigned long int, unsigned long int)* get_unmapped_area; unsigned long int mmap_base; unsigned long int mmap_legacy_base; unsigned long int task_size; unsigned long int highest_vm_end; pgd_t* pgd; atomic_t membarrier_state; atomic_t mm_users; atomic_t mm_count; atomic_t has_pinned; atomic_long_t pgtables_bytes; int map_count; spinlock_t page_table_lock; rw_semaphore mmap_lock; list_head mmlist; unsigned long int hiwater_rss; unsigned long int hiwater_vm; unsigned long int total_vm; unsigned long int locked_vm; atomic64_t pinned_vm; unsigned long int data_vm; unsigned long int exec_vm; unsigned long int stack_vm; unsigned long int def_flags; seqcount_t write_protect_seq; spinlock_t arg_lock; unsigned long int start_code; unsigned long int end_code; unsigned long int start_data; unsigned long int end_data; unsigned long int start_brk; unsigned long int brk; unsigned long int start_stack; unsigned long int arg_start; unsigned long int arg_end; unsigned long int env_start; unsigned long int env_end; unsigned long int saved_auxv[46]; mm_rss_stat rss_stat; linux_binfmt* binfmt; mm_context_t context; unsigned long int flags; core_state* core_state; spinlock_t ioctx_lock; kioctx_table* ioctx_table; task_struct* owner; user_namespace* user_ns; file* exe_file; mmu_notifier_subscriptions* notifier_subscriptions; percpu_rw_semaphore* mmu_notifier_lock; atomic_t tlb_flush_pending; uprobes_state uprobes_state; work_struct async_put_work; u32 pasid; u64 android_kabi_reserved1;}
to:
struct {vm_area_struct* mmap; rb_root mm_rb; u64 vmacache_seqnum; rwlock_t mm_rb_lock; unsigned long int (file*, unsigned long int, unsigned long int, unsigned long int, unsigned long int)* get_unmapped_area; unsigned long int mmap_base; unsigned long int mmap_legacy_base; unsigned long int task_size; unsigned long int highest_vm_end; pgd_t* pgd; atomic_t membarrier_state; atomic_t mm_users; atomic_t mm_count; atomic_t has_pinned; atomic_long_t pgtables_bytes; int map_count; spinlock_t page_table_lock; rw_semaphore mmap_lock; list_head mmlist; unsigned long int hiwater_rss; unsigned long int hiwater_vm; unsigned long int total_vm; unsigned long int locked_vm; atomic64_t pinned_vm; unsigned long int data_vm; unsigned long int exec_vm; unsigned long int stack_vm; unsigned long int def_flags; seqcount_t write_protect_seq; spinlock_t arg_lock; unsigned long int start_code; unsigned long int end_code; unsigned long int start_data; unsigned long int end_data; unsigned long int start_brk; unsigned long int brk; unsigned long int start_stack; unsigned long int arg_start; unsigned long int arg_end; unsigned long int env_start; unsigned long int env_end; unsigned long int saved_auxv[46]; mm_rss_stat rss_stat; linux_binfmt* binfmt; mm_context_t context; unsigned long int flags; core_state* core_state; spinlock_t ioctx_lock; kioctx_table* ioctx_table; task_struct* owner; user_namespace* user_ns; file* exe_file; mmu_notifier_subscriptions* notifier_subscriptions; percpu_rw_semaphore* mmu_notifier_lock; atomic_t tlb_flush_pending; uprobes_state uprobes_state; work_struct async_put_work; u32 pasid; struct {list_head list; mem_cgroup* memcg; nodemask_t nodes;} lru_gen; u64 android_kabi_reserved1;}
and size changed from 7680 to 7936 (in bits) (by +256 bits)
'unsigned long int cpu_bitmap[]' offset changed (by +256 bits)
2977 impacted interfaces
'struct pglist_data at mmzone.h:729:1' changed:
details were reported earlier
'struct reclaim_state at swap.h:131:1' changed:
type size changed from 64 to 128 (in bits)
1 data member insertion:
'lru_gen_mm_walk* mm_walk', at offset 64 (in bits) at swap.h:135:1
2977 impacted interfaces
'struct scsi_device at scsi_device.h:102:1' changed:
type size hasn't changed
1 data member insertion:
'unsigned int silence_suspend', at offset 2448 (in bits) at scsi_device.h:209:1
there are data member changes:
'bool offline_already' offset changed (by +8 bits)
45 impacted interfaces
'struct task_struct at sched.h:660:1' changed:
details were reported earlier
Bug: 229630433
Signed-off-by: Todd Kjos <tkjos@google.com>
Change-Id: Iacd70a1553401ead91351db0b5b8ec6dfee6e6ec
struct swap_slots_cache :: ANDROID_VENDOR_DATA(1)
1) Multiple swap devices can be supported;
2) There are different kinds of data;
3) During data reclamation, different types of data are exchanged
to different swap devices;
4) Each swap device has corresponding arrays of slots and slots_ret;
5) Each swap device has corresponding indexes of nr, cur and n_ret;
6) This field is a pointer, it points to a struct which contains
all the other arrays and indexes;
Bug: 225795494
Change-Id: Icf116135926be98449a2d96fc458e58e5ad3b7e9
Signed-off-by: Bing Han <bing.han@transsion.com>
struct lruvec :: ANDROID_VENDOR_DATA(1)
It is pointer to a struct to record the following message:
1)the account of workingset_restore pages of cached anonymous and
file pages
This is used to adjust the strategy and amount of reclaiming data.
Bug: 225795494
Change-Id: I34e57ee23b6c97ac91effa5b72513d238335a996
Signed-off-by: Bing Han <bing.han@transsion.com>
struct swap_info_struct :: ANDROID_VENDOR_DATA(1)
It is pointer to a struct to record the following message:
1) total swapin pages;
2) total swapout pages;
3) total number of cold pages swapin;
4) total number of swapout pages, specified by userspace;
5) total number of swapout pages, specified by kernel;
6) the maxmium number of swapout pages;
7) the maxmium number of swapout pages allowed by kernel;
8) the maxmium number of swapout pages allowed by framework;
Bug: 225795494
Change-Id: I779145a83d87e339db86ec81c7f962be99946afb
Signed-off-by: Bing Han <bing.han@transsion.com>
This functionality is needed by UFS drivers to e.g. suspend SCSI command
processing while reprogramming encryption keys if the hardware does not
support concurrent I/O and key reprogramming.
Bug: 227177294
Change-Id: I10f11e67da81fae7063674838760903d2c178baf
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Prepare for adding an additional ufshcd_clock_scaling_prepare() call
with a different timeout.
Bug: 227177294
Change-Id: I67a569b074c292a3c37f20a1b1e36f95b682c5e8
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Move a check related to clock scaling into ufshcd_devfreq_scale(). This
patch prepares for adding a second ufshcd_clock_scaling_prepare()
caller.
Bug: 227177294
Change-Id: I928d4cbe64823960a6112ba7f98c18da6244a77c
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Wait at most 20 ms before rechecking the doorbells instead of waiting
for a potentially long time between doorbell checks.
Bug: 227177294
Change-Id: I8a4dd0e93ca02435264961851a095a9c83c68240
Signed-off-by: Bart Van Assche <bvanassche@google.com>
When ufs initializes without scmd->device->sector_size set, scsi_get_lba()
will get a wrong shift number and trigger an ubsan error. The shift
exponent 4294967286 is too large for the 64-bit type 'sector_t' (aka
'unsigned long long').
Call scsi_get_lba() only when opcode is READ_10/WRITE_10/UNMAP.
Link: https://lore.kernel.org/r/20220307111752.10465-1-peter.wang@mediatek.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Wang <peter.wang@mediatek.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 2bd3b6b759 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I3457fcc88d7c4164c55010e440d9f274c169553e
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Kernel messages produced during runtime PM can cause a never-ending cycle
because user space utilities (e.g. journald or rsyslog) write the messages
back to storage, causing runtime resume, more messages, and so on.
Messages that tell of things that are expected to happen, are arguably
unnecessary, so suppress them.
UFS driver messages are changes to from dev_err() to dev_dbg() which means
they will not display unless activated by dynamic debug of building with
-DDEBUG.
sdev->silence_suspend is set to skip messages from sd_suspend_common()
"Synchronizing SCSI cache", "Stopping disk" and scsi_report_sense()
"Power-on or device reset occurred" message (Note, that message appears
when the LUN is accessed after runtime PM, not during runtime PM)
Example messages from Ubuntu 21.10:
$ dmesg | tail
[ 1620.380071] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
[ 1620.408825] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[4, 4], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
[ 1620.409020] ufshcd 0000:00:12.5: ufshcd_find_max_sup_active_icc_level: Regulator capability was not set, actvIccLevel=0
[ 1620.409524] sd 0:0:0:0: Power-on or device reset occurred
[ 1622.938794] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 1622.939184] ufs_device_wlun 0:0:0:49488: Power-on or device reset occurred
[ 1625.183175] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
[ 1625.208041] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[4, 4], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
[ 1625.208311] ufshcd 0000:00:12.5: ufshcd_find_max_sup_active_icc_level: Regulator capability was not set, actvIccLevel=0
[ 1625.209035] sd 0:0:0:0: Power-on or device reset occurred
Note for stable: depends on patch "scsi: core: sd: Add silence_suspend flag
to suppress some PM messages".
Link: https://lore.kernel.org/r/20220228113652.970857-3-adrian.hunter@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bug: 204438323
(cherry picked from commit 71bb9ab6e3 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change-Id: I2a50283162aa1dc100e1269533ac61056172bd1d
Kernel messages produced during runtime PM can cause a never-ending cycle
because user space utilities (e.g. journald or rsyslog) write the messages
back to storage, causing runtime resume, more messages, and so on.
Messages that tell of things that are expected to happen are arguably
unnecessary, so add a flag to suppress them. This flag is used by the UFS
driver.
Link: https://lore.kernel.org/r/20220228113652.970857-2-adrian.hunter@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit af4edb1d50 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Change-Id: I8834c9d71618fd04635804779a41117629a75166
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Because WB performs writes in SLC mode, it is not possible to use
WriteBooster indefinitely. Vendors can set a lifetime limit in the device.
If the lifetime exceeds this limit, the device ican disable the WB feature.
The feature is defined in the "bWriteBoosterBufferLifeTimeEst (IDN = 1E)"
attribute.
With lifetime exceeding the limit value, the current driver continuously
performs the following query:
- Write Flag: WB_ENABLE / DISABLE
- Read attr: Available Buffer Size
- Read attr: Current Buffer Size
This patch recognizes that WriteBooster is no longer supported by the
device, and prevents unnecessary queries.
Link: https://lore.kernel.org/r/1891546521.01643252701746.JavaMail.epsvc@epcpadp3
Reviewed-by: Asutosh Das <quic_asutoshd@quicinc.com>
Acked-by: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit f681d1078d git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I9178b31aaeb75ef157aa8e12b1dd0f5a646f0579
Signed-off-by: Bart Van Assche <bvanassche@google.com>
The return value of ufshcd_set_dev_pwr_mode() is passed to device PM
core. However, the function currently returns a SCSI result which the PM
core doesn't understand. This might lead to unexpected behaviors in
userland; a platform reset was observed in Android.
Use a generic error code for SSU failures.
Link: https://lore.kernel.org/r/1642743182-54098-1-git-send-email-kwmad.kim@samsung.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Kiwoong Kim <kwmad.kim@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit ad6c8a4264 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I051742cd8ea2215cbd94ac3dedc4fab2863a9c6e
Signed-off-by: Bart Van Assche <bvanassche@google.com>
The Tactive time determines the waiting time before burst at hibern8 exit
and is determined by hardware at linkup time. However, in the case of
Samsung devices, increase host's Tactive time +100us for stability. If the
HCI's Tactive time is equal or greater than the device, +100us should be
set.
Link: https://lore.kernel.org/r/20220106213924.186263-1-hy50.seo@samsung.com
Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com>
Acked-by: Avri Altman <Avri.Altman@wdc.com>
Signed-off-by: SEO HOYOUNG <hy50.seo@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 9008661e19 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I6ffe1c279cab9b780558de763e94cf01cfd4be3e
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Fix the following sparse warnings in ufshpb_set_hpb_read_to_upiu():
sparse warnings: (new ones prefixed by >>)
drivers/scsi/ufs/ufshpb.c:335:27: sparse: sparse: cast from restricted __be64
drivers/scsi/ufs/ufshpb.c:335:25: sparse: expected restricted __be64 [usertype] ppn_tmp
drivers/scsi/ufs/ufshpb.c:335:25: sparse: got unsigned long long [usertype]
Link: https://lore.kernel.org/r/20211111222452.384089-1-huobean@gmail.com
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 73185a1377 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I6770255dab00a3ff8c5f7b4499efdfa0dd53839c
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Make the order of function declarations match the order in the upstream
code.
Bug: 204438323
Change-Id: Iac453cdb5ae67184c4218639ab8b91da03fabc66
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
0x0001: the multi-gen LRU core
0x0002: walking page table, when arch_has_hw_pte_young() returns
true
0x0004: clearing the accessed bit in non-leaf PMD entries, when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
[yYnN]: apply to all the components above
E.g.,
echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005
NB: the page table walks happen on the scale of seconds under heavy
memory pressure, in which case the mmap_lock contention is a lesser
concern, compared with the LRU lock contention and the I/O congestion.
So far the only well-known case of the mmap_lock contention happens on
Android, due to Scudo [1] which allocates several thousand VMAs for
merely a few hundred MBs. The SPF and the Maple Tree also have
provided their own assessments [2][3]. However, if walking page tables
does worsen the mmap_lock contention, the kill switch can be used to
disable it. In this case the multi-gen LRU will suffer a minor
performance degradation, as shown previously.
Clearing the accessed bit in non-leaf PMD entries can also be
disabled, since this behavior was not tested on x86 varieties other
than Intel and AMD.
[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/
Link: https://lore.kernel.org/r/20220309021230.721028-11-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I71801d9470a2588cad8bfd14fbcfafc7b010aa03
When multiple memcgs are available, it is possible to make better
choices based on generations and tiers and therefore improve the
overall performance under global memory pressure. This patch adds a
rudimentary optimization to select memcgs that can drop single-use
unmapped clean pages first. Doing so reduces the chance of going into
the aging path or swapping. These two operations can be costly.
A typical example that benefits from this optimization is a server
running mixed types of workloads, e.g., heavy anon workload in one
memcg and heavy buffered I/O workload in the other.
Though this optimization can be applied to both kswapd and direct
reclaim, it is only added to kswapd to keep the patchset manageable.
Later improvements will cover the direct reclaim path.
Server benchmark results:
Mixed workloads:
fio (buffered I/O): -[23, 25]%
IOPS BW
patch1-8: 2960k 11.3GiB/s
patch1-9: 2248k 8783MiB/s
memcached (anon): +[210, 214]%
Ops/sec KB/sec
patch1-8: 606940.09 23576.89
patch1-9: 1895197.49 73619.93
Mixed workloads:
fio (buffered I/O): -[4, 6]%
IOPS BW
5.18-ed4643521e6a: 2369k 9255MiB/s
patch1-9: 2248k 8783MiB/s
memcached (anon): +[510, 516]%
Ops/sec KB/sec
5.18-ed4643521e6a: 309189.58 12010.61
patch1-9: 1895197.49 73619.93
Configurations:
(changes since patch 6)
cat mixed.sh
modprobe brd rd_nr=2 rd_size=56623104
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
mkfs.ext4 /dev/ram1
mount -t ext4 /dev/ram1 /mnt
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=90m --group_reporting &
pid=$!
sleep 200
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
kill -INT $pid
wait
Client benchmark results:
no change (CONFIG_MEMCG=n)
Link: https://lore.kernel.org/r/20220309021230.721028-10-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I0641467dbd7c5ba0645602cec7fe8d6fdb750edb
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.
When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[5.5, 7.5]%
Ops/sec KB/sec
patch1-7: 1014393.57 39455.42
patch1-8: 1078507.59 41949.15
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
patch1-8
47.02% lzo1x_1_do_compress (real work)
6.73% page_vma_mapped_walk
6.14% _raw_spin_unlock_irq
3.39% walk_pte_range
2.63% ptep_clear_flush
2.29% __zram_bvec_write
2.10% do_raw_spin_lock
1.81% memmove
1.73% obj_malloc
1.53% free_unref_page_list
Configurations:
no change
[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo
Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.
This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[4, 6]%
Ops/sec KB/sec
patch1-6: 964656.80 37520.88
patch1-7: 1014393.57 39455.42
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
Configurations:
no change
Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
Server benchmark results:
Single workload:
fio (buffered I/O): +[38, 40]%
IOPS BW
5.18-ed4643521e6a: 2547k 9989MiB/s
patch1-6: 3540k 13.5GiB/s
Single workload:
memcached (anon): +[103, 107]%
Ops/sec KB/sec
5.18-ed4643521e6a: 469048.66 18243.91
patch1-6: 964656.80 37520.88
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.18-ed4643521e6a
39.56% page_vma_mapped_walk
19.32% lzo1x_1_do_compress (real work)
7.18% do_raw_spin_lock
4.23% _raw_spin_unlock_irq
2.26% vma_interval_tree_subtree_search
2.12% vma_interval_tree_iter_next
2.11% folio_referenced_one
1.90% anon_vma_interval_tree_iter_first
1.47% ptep_clear_flush
0.97% __anon_vma_interval_tree_subtree_search
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
Chrome OS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].
To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.
The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking the rendering
threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed [3][4].
The next patch will address the "outlying refaults". Three macros,
i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
added in this patch to make the entire patchset less diffy.
A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/
Link: https://lore.kernel.org/r/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512