Commit Graph

988065 Commits

Author SHA1 Message Date
Bing Han
a034320a68 ANDROID: add vendor fields to swap_slots_cache to support multiple swap devices
struct swap_slots_cache  :: ANDROID_VENDOR_DATA(1)
1) Multiple swap devices can be supported;
2) There are different kinds of data;
3) During data reclamation, different types of data are exchanged
   to different swap devices;
4) Each swap device has corresponding arrays of slots and slots_ret;
5) Each swap device has corresponding indexes of nr, cur and n_ret;
6) This field is a pointer, it points to a struct which contains
   all the other arrays and indexes;

Bug: 225795494
Change-Id: Icf116135926be98449a2d96fc458e58e5ad3b7e9
Signed-off-by: Bing Han <bing.han@transsion.com>
2022-04-18 10:20:37 -07:00
Bing Han
1b14ae01b0 ANDROID: add vendor fields to lruvec to record refault stats
struct lruvec :: ANDROID_VENDOR_DATA(1)
It is pointer to a struct to record the following message:
1)the account of workingset_restore pages of cached anonymous and
   file pages
This is used to adjust the strategy and amount of reclaiming data.

Bug: 225795494
Change-Id: I34e57ee23b6c97ac91effa5b72513d238335a996
Signed-off-by: Bing Han <bing.han@transsion.com>
2022-04-18 10:20:23 -07:00
Bing Han
af4eb0e377 ANDROID: add vendor fields to swap_info_struct to record swap stats
struct swap_info_struct :: ANDROID_VENDOR_DATA(1)
	It is pointer to a struct to record the following message:
	1) total swapin pages;
	2) total swapout pages;
	3) total number of cold pages swapin;
	4) total number of swapout pages, specified by userspace;
	5) total number of swapout pages, specified by kernel;
	6) the maxmium number of swapout pages;
	7) the maxmium number of swapout pages allowed by kernel;
	8) the maxmium number of swapout pages allowed by framework;

Bug: 225795494
Change-Id: I779145a83d87e339db86ec81c7f962be99946afb
Signed-off-by: Bing Han <bing.han@transsion.com>
2022-04-18 10:20:06 -07:00
Bart Van Assche
fae5207ecc ANDROID: scsi: ufs: Add suspend/resume SCSI command processing support
This functionality is needed by UFS drivers to e.g. suspend SCSI command
processing while reprogramming encryption keys if the hardware does not
support concurrent I/O and key reprogramming.

Bug: 227177294
Change-Id: I10f11e67da81fae7063674838760903d2c178baf
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:18:44 -07:00
Bart Van Assche
64293a57f1 ANDROID: scsi: ufs: Pass the clock scaling timeout as an argument
Prepare for adding an additional ufshcd_clock_scaling_prepare() call
with a different timeout.

Bug: 227177294
Change-Id: I67a569b074c292a3c37f20a1b1e36f95b682c5e8
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:18:30 -07:00
Bart Van Assche
69014b2b36 ANDROID: scsi: ufs: Move a clock scaling check
Move a check related to clock scaling into ufshcd_devfreq_scale(). This
patch prepares for adding a second ufshcd_clock_scaling_prepare()
caller.

Bug: 227177294
Change-Id: I928d4cbe64823960a6112ba7f98c18da6244a77c
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:18:15 -07:00
Bart Van Assche
aca52cabdb ANDROID: scsi: ufs: Reduce the clock scaling latency
Wait at most 20 ms before rechecking the doorbells instead of waiting
for a potentially long time between doorbell checks.

Bug: 227177294
Change-Id: I8a4dd0e93ca02435264961851a095a9c83c68240
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:18:03 -07:00
Peter Wang
00ed95fe93 FROMGIT: scsi: ufs: core: scsi_get_lba() error fix
When ufs initializes without scmd->device->sector_size set, scsi_get_lba()
will get a wrong shift number and trigger an ubsan error.  The shift
exponent 4294967286 is too large for the 64-bit type 'sector_t' (aka
'unsigned long long').

Call scsi_get_lba() only when opcode is READ_10/WRITE_10/UNMAP.

Link: https://lore.kernel.org/r/20220307111752.10465-1-peter.wang@mediatek.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Wang <peter.wang@mediatek.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 2bd3b6b759 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I3457fcc88d7c4164c55010e440d9f274c169553e
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:17:48 -07:00
Adrian Hunter
c0a4aeb7aa FROMGIT: scsi: ufs: Fix runtime PM messages never-ending cycle
Kernel messages produced during runtime PM can cause a never-ending cycle
because user space utilities (e.g. journald or rsyslog) write the messages
back to storage, causing runtime resume, more messages, and so on.

Messages that tell of things that are expected to happen, are arguably
unnecessary, so suppress them.

UFS driver messages are changes to from dev_err() to dev_dbg() which means
they will not display unless activated by dynamic debug of building with
-DDEBUG.

sdev->silence_suspend is set to skip messages from sd_suspend_common()
"Synchronizing SCSI cache", "Stopping disk" and scsi_report_sense()
"Power-on or device reset occurred" message (Note, that message appears
when the LUN is accessed after runtime PM, not during runtime PM)

 Example messages from Ubuntu 21.10:

 $ dmesg | tail
 [ 1620.380071] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
 [ 1620.408825] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[4, 4], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
 [ 1620.409020] ufshcd 0000:00:12.5: ufshcd_find_max_sup_active_icc_level: Regulator capability was not set, actvIccLevel=0
 [ 1620.409524] sd 0:0:0:0: Power-on or device reset occurred
 [ 1622.938794] sd 0:0:0:0: [sda] Synchronizing SCSI cache
 [ 1622.939184] ufs_device_wlun 0:0:0:49488: Power-on or device reset occurred
 [ 1625.183175] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[1, 1], lane[1, 1], pwr[SLOWAUTO_MODE, SLOWAUTO_MODE], rate = 0
 [ 1625.208041] ufshcd 0000:00:12.5: ufshcd_print_pwr_info:[RX, TX]: gear=[4, 4], lane[2, 2], pwr[FAST MODE, FAST MODE], rate = 2
 [ 1625.208311] ufshcd 0000:00:12.5: ufshcd_find_max_sup_active_icc_level: Regulator capability was not set, actvIccLevel=0
 [ 1625.209035] sd 0:0:0:0: Power-on or device reset occurred

Note for stable: depends on patch "scsi: core: sd: Add silence_suspend flag
to suppress some PM messages".

Link: https://lore.kernel.org/r/20220228113652.970857-3-adrian.hunter@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bug: 204438323
(cherry picked from commit 71bb9ab6e3 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change-Id: I2a50283162aa1dc100e1269533ac61056172bd1d
2022-04-18 10:17:34 -07:00
Adrian Hunter
0cd3abcaa4 FROMGIT: scsi: core: sd: Add silence_suspend flag to suppress some PM messages
Kernel messages produced during runtime PM can cause a never-ending cycle
because user space utilities (e.g. journald or rsyslog) write the messages
back to storage, causing runtime resume, more messages, and so on.

Messages that tell of things that are expected to happen are arguably
unnecessary, so add a flag to suppress them. This flag is used by the UFS
driver.

Link: https://lore.kernel.org/r/20220228113652.970857-2-adrian.hunter@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit af4edb1d50 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Change-Id: I8834c9d71618fd04635804779a41117629a75166
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:17:21 -07:00
Keoseong Park
e46eb26194 FROMGIT: scsi: ufs: core: Remove wlun_dev_to_hba()
Commit edc0596cc0 ("scsi: ufs: core: Stop clearing UNIT ATTENTIONS")
removed all callers of wlun_dev_to_hba(). Hence also remove the macro
itself.

Link: https://lore.kernel.org/r/1891546521.01644927481711.JavaMail.epsvc@epcpadp4
Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keoseong Park <keosung.park@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from 482dcaa1c9 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I1eee9827255305f5567ee21c65dbfca897e9daa5
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:17:03 -07:00
Jinyoung Choi
85d759e39a FROMGIT: scsi: ufs: Add checking lifetime attribute for WriteBooster
Because WB performs writes in SLC mode, it is not possible to use
WriteBooster indefinitely. Vendors can set a lifetime limit in the device.
If the lifetime exceeds this limit, the device ican disable the WB feature.

The feature is defined in the "bWriteBoosterBufferLifeTimeEst (IDN = 1E)"
attribute.

With lifetime exceeding the limit value, the current driver continuously
performs the following query:

	- Write Flag: WB_ENABLE / DISABLE
	- Read attr: Available Buffer Size
	- Read attr: Current Buffer Size

This patch recognizes that WriteBooster is no longer supported by the
device, and prevents unnecessary queries.

Link: https://lore.kernel.org/r/1891546521.01643252701746.JavaMail.epsvc@epcpadp3
Reviewed-by: Asutosh Das <quic_asutoshd@quicinc.com>
Acked-by: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit f681d1078d git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I9178b31aaeb75ef157aa8e12b1dd0f5a646f0579
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:16:38 -07:00
Kiwoong Kim
44b7a4f00f FROMGIT: scsi: ufs: Use generic error code in ufshcd_set_dev_pwr_mode()
The return value of ufshcd_set_dev_pwr_mode() is passed to device PM
core. However, the function currently returns a SCSI result which the PM
core doesn't understand.  This might lead to unexpected behaviors in
userland; a platform reset was observed in Android.

Use a generic error code for SSU failures.

Link: https://lore.kernel.org/r/1642743182-54098-1-git-send-email-kwmad.kim@samsung.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Kiwoong Kim <kwmad.kim@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit ad6c8a4264 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I051742cd8ea2215cbd94ac3dedc4fab2863a9c6e
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:16:16 -07:00
Miaoqian Lin
aeedc78679 FROMGIT: scsi: ufs: ufs-mediatek: Fix error checking in ufs_mtk_init_va09_pwr_ctrl()
The function regulator_get() returns an error pointer. Use IS_ERR() to
validate the return value.

Link: https://lore.kernel.org/r/20211222070930.9449-1-linmq006@gmail.com
Fixes: cf137b3ea4 ("scsi: ufs-mediatek: Support VA09 regulator operations")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 3ba880a12d git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I9b6d2fbb9d1d795f30e7873837b59dbf76569d1b
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:15:59 -07:00
SEO HOYOUNG
1fc4aef3d5 FROMGIT: scsi: ufs: Modify Tactive time setting conditions
The Tactive time determines the waiting time before burst at hibern8 exit
and is determined by hardware at linkup time. However, in the case of
Samsung devices, increase host's Tactive time +100us for stability. If the
HCI's Tactive time is equal or greater than the device, +100us should be
set.

Link: https://lore.kernel.org/r/20220106213924.186263-1-hy50.seo@samsung.com
Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com>
Acked-by: Avri Altman <Avri.Altman@wdc.com>
Signed-off-by: SEO HOYOUNG <hy50.seo@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 9008661e19 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I6ffe1c279cab9b780558de763e94cf01cfd4be3e
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:15:39 -07:00
Adrian Hunter
d87405c2fe FROMGIT: scsi: ufs: ufs-pci: Add support for Intel ADL
Add PCI ID and callbacks to support Intel Alder Lake.

Link: https://lore.kernel.org/r/20211124204218.1784559-1-adrian.hunter@intel.com
Cc: stable@vger.kernel.org # v5.15+
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 7dc9fb47bc git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I370ad2e01bc67fec675f4c2ebf7f5d222d0afb07
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:15:19 -07:00
Ye Guojin
b65cfd7b92 FROMGIT: scsi: ufs: ufs-mediatek: Add put_device() after of_find_device_by_node()
This was found by coccicheck:

./drivers/scsi/ufs/ufs-mediatek.c, 211, 1-7, ERROR missing put_device;
call of_find_device_by_node on line 1185, but without a corresponding
object release within this function.

Link: https://lore.kernel.org/r/20211110105133.150171-1-ye.guojin@zte.com.cn
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Peter Wang <peter.wang@mediatek.com>
Signed-off-by: Ye Guojin <ye.guojin@zte.com.cn>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bug: 204438323
(cherry picked from commit cc03facb1c git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Change-Id: I0aa3e8be7f7bc84d5c9b4dc3fb1e5586bc7e5ddf
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:14:56 -07:00
Bean Huo
4f4bf31d39 FROMGIT: scsi: ufs: ufshpb: Fix warning in ufshpb_set_hpb_read_to_upiu()
Fix the following sparse warnings in ufshpb_set_hpb_read_to_upiu():

sparse warnings: (new ones prefixed by >>)
drivers/scsi/ufs/ufshpb.c:335:27: sparse: sparse: cast from restricted __be64
drivers/scsi/ufs/ufshpb.c:335:25: sparse: expected restricted __be64 [usertype] ppn_tmp
drivers/scsi/ufs/ufshpb.c:335:25: sparse: got unsigned long long [usertype]

Link: https://lore.kernel.org/r/20211111222452.384089-1-huobean@gmail.com
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 73185a1377 git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next)
Bug: 204438323
Change-Id: I6770255dab00a3ff8c5f7b4499efdfa0dd53839c
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:14:37 -07:00
Bart Van Assche
acb0ef885c ANDROID: scsi: ufs: Minimize the difference with the upstream code
Make the order of function declarations match the order in the upstream
code.

Bug: 204438323
Change-Id: Iac453cdb5ae67184c4218639ab8b91da03fabc66
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2022-04-18 10:14:20 -07:00
Yu Zhao
321995d280 ANDROID: GKI: build multi-gen LRU
CONFIG_LRU_GEN=y to build multi-gen LRU.

To enable it, echo y >/sys/kernel/mm/lru_gen/enabled.

Bug: 227651406
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: If5f6ece8373f1da2eb1eb96d809a2e216ebc0fbc
2022-04-18 10:11:56 -07:00
Yu Zhao
306dbfb34c FROMLIST: mm: multi-gen LRU: design doc
Add a design doc.

Link: https://lore.kernel.org/r/20220309021230.721028-15-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I1d66302e618416291ebf9647e20625fb76613c89
2022-04-18 10:11:56 -07:00
Yu Zhao
8b006e4d1c FROMLIST: mm: multi-gen LRU: admin guide
Add an admin guide.

Link: https://lore.kernel.org/r/20220309021230.721028-14-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I6fafbd7eb3ef6819cfcd30376459f14893f17c63
2022-04-18 10:11:56 -07:00
Yu Zhao
3cf1dfaaa5 FROMLIST: mm: multi-gen LRU: debugfs interface
Add /sys/kernel/debug/lru_gen for working set estimation and proactive
reclaim. These features are required to optimize job scheduling (bin
packing) in data centers [1][2].

Compared with the page table-based approach and the PFN-based
approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
the following advantages:
1. It offers better choices because it is aware of memcgs, NUMA nodes,
   shared mappings and unmapped page cache.
2. It is more scalable because it is O(nr_hot_pages), whereas the
   PFN-based approach is O(nr_total_pages).

Add /sys/kernel/debug/lru_gen_full for debugging.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731

Link: https://lore.kernel.org/r/20220309021230.721028-13-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ie558098e0a24a647f77f4eacc4d72576173fc0b8
2022-04-18 10:11:56 -07:00
Yu Zhao
96f4a592d3 FROMLIST: mm: multi-gen LRU: thrashing prevention
Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
requested by many desktop users [1].

When set to value N, it prevents the working set of N milliseconds
from getting evicted. The OOM killer is triggered if this working set
cannot be kept in memory. Based on the average human detectable lag
(~100ms), N=1000 usually eliminates intolerable lags due to thrashing.
Larger values like N=3000 make lags less noticeable at the risk of
premature OOM kills.

Compared with the size-based approach, e.g., [2], this time-based
approach has the following advantages:
1. It is easier to configure because it is agnostic to applications
   and memory sizes.
2. It is more reliable because it is directly wired to the OOM killer.

[1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/
[2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/

Link: https://lore.kernel.org/r/20220309021230.721028-12-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I482d33f3beaf7723d2f3eeaaa5b4f12bcb9b48a1
2022-04-18 10:11:55 -07:00
Yu Zhao
76fdc1010b FROMLIST: mm: multi-gen LRU: kill switch
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
  0x0001: the multi-gen LRU core
  0x0002: walking page table, when arch_has_hw_pte_young() returns
          true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005

NB: the page table walks happen on the scale of seconds under heavy
memory pressure, in which case the mmap_lock contention is a lesser
concern, compared with the LRU lock contention and the I/O congestion.
So far the only well-known case of the mmap_lock contention happens on
Android, due to Scudo [1] which allocates several thousand VMAs for
merely a few hundred MBs. The SPF and the Maple Tree also have
provided their own assessments [2][3]. However, if walking page tables
does worsen the mmap_lock contention, the kill switch can be used to
disable it. In this case the multi-gen LRU will suffer a minor
performance degradation, as shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be
disabled, since this behavior was not tested on x86 varieties other
than Intel and AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/

Link: https://lore.kernel.org/r/20220309021230.721028-11-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I71801d9470a2588cad8bfd14fbcfafc7b010aa03
2022-04-18 10:11:55 -07:00
Yu Zhao
082bc8296a FROMLIST: mm: multi-gen LRU: optimize multiple memcgs
When multiple memcgs are available, it is possible to make better
choices based on generations and tiers and therefore improve the
overall performance under global memory pressure. This patch adds a
rudimentary optimization to select memcgs that can drop single-use
unmapped clean pages first. Doing so reduces the chance of going into
the aging path or swapping. These two operations can be costly.

A typical example that benefits from this optimization is a server
running mixed types of workloads, e.g., heavy anon workload in one
memcg and heavy buffered I/O workload in the other.

Though this optimization can be applied to both kswapd and direct
reclaim, it is only added to kswapd to keep the patchset manageable.
Later improvements will cover the direct reclaim path.

Server benchmark results:
  Mixed workloads:
    fio (buffered I/O): -[23, 25]%
                         IOPS         BW
      patch1-8:          2960k        11.3GiB/s
      patch1-9:          2248k        8783MiB/s

    memcached (anon): +[210, 214]%
                         Ops/sec      KB/sec
      patch1-8:          606940.09    23576.89
      patch1-9:          1895197.49   73619.93

  Mixed workloads:
    fio (buffered I/O): -[4, 6]%
                         IOPS         BW
      5.18-ed4643521e6a: 2369k        9255MiB/s
      patch1-9:          2248k        8783MiB/s

    memcached (anon): +[510, 516]%
                         Ops/sec      KB/sec
      5.18-ed4643521e6a: 309189.58    12010.61
      patch1-9:          1895197.49   73619.93

  Configurations:
    (changes since patch 6)

    cat mixed.sh
    modprobe brd rd_nr=2 rd_size=56623104

    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    mkfs.ext4 /dev/ram1
    mount -t ext4 /dev/ram1 /mnt

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=90m --group_reporting &
    pid=$!

    sleep 200

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    kill -INT $pid
    wait

Client benchmark results:
  no change (CONFIG_MEMCG=n)

Link: https://lore.kernel.org/r/20220309021230.721028-10-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I0641467dbd7c5ba0645602cec7fe8d6fdb750edb
2022-04-18 10:11:55 -07:00
Yu Zhao
93c4f86793 FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.

NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[5.5, 7.5]%
                         Ops/sec      KB/sec
      patch1-7:          1014393.57   39455.42
      patch1-8:          1078507.59   41949.15

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

    patch1-8
      47.02%  lzo1x_1_do_compress (real work)
       6.73%  page_vma_mapped_walk
       6.14%  _raw_spin_unlock_irq
       3.39%  walk_pte_range
       2.63%  ptep_clear_flush
       2.29%  __zram_bvec_write
       2.10%  do_raw_spin_lock
       1.81%  memmove
       1.73%  obj_malloc
       1.53%  free_unref_page_list

  Configurations:
    no change

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo

Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
2022-04-18 10:11:55 -07:00
Yu Zhao
c8356f7573 FROMLIST: mm: multi-gen LRU: exploit locality in rmap
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[4, 6]%
                         Ops/sec      KB/sec
      patch1-6:          964656.80    37520.88
      patch1-7:          1014393.57   39455.42

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

  Configurations:
    no change

Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
2022-04-18 10:11:55 -07:00
Yu Zhao
436dff20eb FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[38, 40]%
                         IOPS         BW
      5.18-ed4643521e6a: 2547k        9989MiB/s
      patch1-6:          3540k        13.5GiB/s

  Single workload:
    memcached (anon): +[103, 107]%
                         Ops/sec      KB/sec
      5.18-ed4643521e6a: 469048.66    18243.91
      patch1-6:          964656.80    37520.88

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.18-ed4643521e6a
      39.56%  page_vma_mapped_walk
      19.32%  lzo1x_1_do_compress (real work)
       7.18%  do_raw_spin_lock
       4.23%  _raw_spin_unlock_irq
       2.26%  vma_interval_tree_subtree_search
       2.12%  vma_interval_tree_iter_next
       2.11%  folio_referenced_one
       1.90%  anon_vma_interval_tree_iter_first
       1.47%  ptep_clear_flush
       0.97%  __anon_vma_interval_tree_subtree_search

    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
2022-04-18 10:11:54 -07:00
Yu Zhao
fe302bd1f9 FROMLIST: mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in page->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of
working set estimation and proactive reclaim. These features are
required to optimize job scheduling (bin packing) in data centers. The
variable size of the sliding window is designed for such use cases
[1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive"
will be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking the rendering
   threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or
VM_RAND_READ is present; the latter channel is assumed to follow the
latter pattern unless outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults". Three macros,
i.e., LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are
added in this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting. The aging
needs to check the accessed bit at least twice before handing this
page over to the eviction. The first check takes care of the accessed
bit set on the initial fault; the second check makes sure this page
has not been used since then. This protocol, AKA second chance,
requires a minimum of two generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lore.kernel.org/r/20220309021230.721028-6-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I333ec6a1d2abfa60d93d6adc190ed3eefe441512
2022-04-18 10:11:54 -07:00
Yu Zhao
4c6c817249 FROMLIST: mm/vmscan.c: refactor shrink_node()
This patch refactors shrink_node() to improve readability for the
upcoming changes to mm/vmscan.c.

Link: https://lore.kernel.org/r/20220309021230.721028-4-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I186f43f946de0d40d54883fb31114840fc749a57
2022-04-18 10:11:54 -07:00
Yu Zhao
95acc9c28b FROMLIST: mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Some architectures support the accessed bit in non-leaf PMD entries,
e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it
as part of linear address translation [1]. Page table walkers that
clear the accessed bit may use this capability to reduce their search
space.

Note that:
1. Although an inline function is preferable, this capability is added
   as a configuration option for consistency with the existing macros.
2. Due to the little interest in other varieties, this capability was
   only tested on Intel and AMD CPUs.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
     Volume 3 (June 2021), section 4.8

Link: https://lore.kernel.org/r/20220309021230.721028-3-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I73f84a21fd315192eaa3e6443334ed1bccb4e99e
2022-04-18 10:11:54 -07:00
Yu Zhao
1ed19b562b FROMLIST: mm: x86, arm64: add arch_has_hw_pte_young()
Some architectures automatically set the accessed bit in PTEs, e.g.,
x86 and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault
following the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty
page faults when trying to clear the accessed bit in many PTEs.

Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it
should not be used in architecture-independent code that involves
correctness, e.g., to determine whether TLB flushes are required (in
combination with the accessed bit).

Link: https://lore.kernel.org/r/20220309021230.721028-2-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ie81175d7e0d239f688d31487b298cf9b4fb66707
2022-04-18 10:11:54 -07:00
Yu Zhao
b4f3b6ac71 UPSTREAM: include/linux/page-flags-layout.h: cleanups
Tidy things up and delete comments stating the obvious with typos or
making no sense.

Link: https://lkml.kernel.org/r/20210303071609.797782-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1587db62d8)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I1d57992dd4c68d89c1b9180f280e09d5d08482b6
2022-04-18 10:11:53 -07:00
Yu Zhao
2b286703d9 UPSTREAM: include/linux/page-flags-layout.h: correctly determine LAST_CPUPID_WIDTH
The naming convention used in include/linux/page-flags-layout.h:
  *_SHIFT: the number of bits trying to allocate
  *_WIDTH: the number of bits successfully allocated

So when it comes to LAST_CPUPID_WIDTH, we need to check whether all
previous *_WIDTH and LAST_CPUPID_SHIFT can fit into page flags. This
means we need to use NODES_WIDTH, not NODES_SHIFT.

Link: https://lkml.kernel.org/r/20210303071609.797782-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f73c6c8805)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I6d7c58cf5d10e302adc818ac7e1fd727208d23c8
2022-04-18 10:11:53 -07:00
Yu Zhao
80343eeaf3 UPSTREAM: mm/swap: don't SetPageWorkingset unconditionally during swapin
We are capable of SetPageWorkingset based on refault distances after
commit aae466b005 ("mm/swap: implement workingset detection for
anonymous LRU").  This is done by workingset_refault(), which is right
above the unconditional SetPageWorkingset deleted by this patch.

The unconditional SetPageWorkingset miscategorizes pages that are read
ahead or never belonged to the working set (e.g., tmpfs pages accessed
only once by fd).  When those pages are swapped in (after they were
swapped out) for the first time, they skew PSI (when using async swap).
When this happens again, depending on their refault distances, they might
skew workingset_restore_anon counter in addition to PSI because their
shadows indicate they were part of the working set.

Historically, SetPageWorkingset was added as part of the PSI series, and
Johannes said:
 "It was meant to mark incoming pages under IO with SetPageWorkingset
  when waiting for them constituted a memory stall.

  On the page cache side, because we HAVE workingset detection, this was
  specific to recently evicted pages that had been active in their
  previous life. On the anon side, the aging algorithm had no
  distinction between workingset and sporadically used pages. Given the
  choice between a) no swapin stalls are pressure and b) all swapin
  stalls are pressure, I went with the latter in order to detect swap
  storms. The false positive case - high rate of swapin without severe
  memory pressure - was relatively unlikely, because we tried to avoid
  swapping until everything was completely on fire in the first place."

Link: https://lkml.kernel.org/r/20201209012400.1771150-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20201214231253.62313-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cad8320b4b)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ifa9c5fa2e875e6ccee6c3f7e2d2983278d54c220
2022-04-18 10:11:53 -07:00
Yu Zhao
0c20cff831 UPSTREAM: include/linux/mm_inline.h: fold page_lru_base_type() into its sole caller
We've removed all other references to this function.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-9-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c1770e34f3)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: If229fa7a09e5be79cc28dc5a780b900e69f4ce64
2022-04-18 10:11:53 -07:00
Yu Zhao
aadc45fae6 BACKPORT: mm: VM_BUG_ON lru page flags
Move scattered VM_BUG_ONs to two essential places that cover all
lru list additions and deletions.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-8-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-8-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit bc7112719e)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I950ad4171f973c740d9fc3778d44efc020d0e12c
2022-04-18 10:11:53 -07:00
Yu Zhao
bcc2f50f7b BACKPORT: mm: add __clear_page_lru_flags() to replace page_off_lru()
Similar to page_off_lru(), the new function does non-atomic clearing
of PageLRU() in addition to PageActive() and PageUnevictable(), on a
page that has no references left.

If PageActive() and PageUnevictable() are both set, refuse to clear
either and leave them to bad_page(). This is a behavior change that
is meant to help debug.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-7-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8756017962)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I0290916fa08277c50e228a8d3f39af67d62ff9d0
2022-04-18 10:11:53 -07:00
Yu Zhao
552f416558 BACKPORT: mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list()
The parameter is redundant in the sense that it can be potentially
extracted from the "struct page" parameter by page_lru(). We need to
make sure that existing PageActive() or PageUnevictable() remains
until the function returns. A few places don't conform, and simple
reordering fixes them.

This patch may have left page_off_lru() seemingly odd, and we'll take
care of it in the next patch.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-6-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 46ae6b2cc2)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I1e14dcbf4111b39cf155ed3512423448865eb324
2022-04-18 10:11:53 -07:00
Yu Zhao
10899adee3 UPSTREAM: mm/swap.c: don't pass "enum lru_list" to trace_mm_lru_insertion()
The parameter is redundant in the sense that it can be extracted
from the "struct page" parameter by page_lru() correctly.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-5-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-5-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 861404536a)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ia02c0c65dd427a98ffa39e9dc3e2ae701e85fad8
2022-04-18 10:11:52 -07:00
Yu Zhao
c18b4f50ce BACKPORT: mm: don't pass "enum lru_list" to lru list addition functions
The "enum lru_list" parameter to add_page_to_lru_list() and
add_page_to_lru_list_tail() is redundant in the sense that it can
be extracted from the "struct page" parameter by page_lru().

A caveat is that we need to make sure PageActive() or
PageUnevictable() is correctly set or cleared before calling
these two functions. And they are indeed.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-4-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-4-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3a9c9788a3)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I0d92b845d18e6ab3bcb5645f22e3cedb04257d98
2022-04-18 10:11:52 -07:00
Yu Zhao
32ebee4382 BACKPORT: include/linux/mm_inline.h: shuffle lru list addition and deletion functions
These functions will call page_lru() in the following patches.  Move them
below page_lru() to avoid the forward declaration.

Link: https://lore.kernel.org/linux-mm/20201207220949.830352-3-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-3-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f90d8191ac)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I32b8565107c9134e656b43886c00105eb07b34dd
2022-04-18 10:11:52 -07:00
Yu Zhao
885e11e970 BACKPORT: mm/vmscan.c: use add_page_to_lru_list()
Patch series "mm: lru related cleanups", v2.

The cleanups are intended to reduce the verbosity in lru list operations
and make them less error-prone.  A typical example would be how the
patches change __activate_page():

 static void __activate_page(struct page *page, struct lruvec *lruvec)
 {
 	if (!PageActive(page) && !PageUnevictable(page)) {
-		int lru = page_lru_base_type(page);
 		int nr_pages = thp_nr_pages(page);

-		del_page_from_lru_list(page, lruvec, lru);
+		del_page_from_lru_list(page, lruvec);
 		SetPageActive(page);
-		lru += LRU_ACTIVE;
-		add_page_to_lru_list(page, lruvec, lru);
+		add_page_to_lru_list(page, lruvec);
 		trace_mm_lru_activate(page);

There are a few more places like __activate_page() and they are
unnecessarily repetitive in terms of figuring out which list a page should
be added onto or deleted from.  And with the duplicated code removed, they
are easier to read, IMO.

Patch 1 to 5 basically cover the above.  Patch 6 and 7 make code more
robust by improving bug reporting.  Patch 8, 9 and 10 take care of some
dangling helpers left in header files.

This patch (of 10):

There is add_page_to_lru_list(), and move_pages_to_lru() should reuse it,
not duplicate it.

Link: https://lkml.kernel.org/r/20210122220600.906146-1-yuzhao@google.com
Link: https://lore.kernel.org/linux-mm/20201207220949.830352-2-yuzhao@google.com/
Link: https://lkml.kernel.org/r/20210122220600.906146-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 42895ea73b)
Bug: 227651406
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I7e09be6bedcd451c4e8c790c969306b6ca3adebd
2022-04-18 10:11:52 -07:00
Yifan Hong
75020bfbe2 ANDROID: Move BRANCH from build.config.common to .constants.
This allows Bazel to load the value of $BRANCH in order
to determine the value of --dist_dir of copy_to_dist_dir
statically.

Test: TH
Bug: 229268271

Change-Id: Iff759b8188360ea1b2bc204d29750eece9095582
Signed-off-by: Yifan Hong <elsk@google.com>
2022-04-14 14:20:35 -07:00
Woody Lin
5ef1198a15 ANDROID: Update the ABI symbol list
Leaf changes summary: 5 artifacts changed
Changed leaf types summary: 0 leaf type changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 5 Added functions
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable

5 Added functions:

  [A] 'function void interval_tree_insert(interval_tree_node*, rb_root_cached*)'
  [A] 'function interval_tree_node* interval_tree_iter_first(rb_root_cached*, unsigned long int, unsigned long int)'
  [A] 'function interval_tree_node* interval_tree_iter_next(interval_tree_node*, unsigned long int, unsigned long int)'
  [A] 'function void interval_tree_remove(interval_tree_node*, rb_root_cached*)'
  [A] 'function void suspend_set_ops(const platform_suspend_ops*)'

Bug: 226105845
Bug: 226167799
Signed-off-by: Woody Lin <woodylin@google.com>
Change-Id: I5da0ec8c678e36a46418c0f440fad87de1ac7a52
2022-04-14 16:15:37 +00:00
Fuad Tabba
0a227f89cf ANDROID: KVM: arm64: Do not allow memslot modifications once a PVM has run
Currently trying to move or delete a memslot results in a warning
and a failure. Userspace shouldn't be able to trigger kernel
warnings.

The cause is that in protected mode, stage-2 is managed by hyp.
Modifying a memslot flushes the shadow memslot, which tries to
unmap any stage-2 mapped pages.

Bug: 226890762
Signed-off-by: Fuad Tabba <tabba@google.com>
Change-Id: Icc6a0aada76e8492285cd5509bad1ee57700af7c
2022-04-14 11:59:20 +01:00
Daniel Rosenberg
8be6e93244 ANDROID: fuse-bpf: Fix read_iter
We had a size mismatch for the return value, leading to EIOCBQUEUED
getting interpreted as a return size instead of an error code.

Test: generic/467, generic/013, and fuse_test
Bug: 217570523
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I64f9d5263f8b37d3c0e286467f9351997b294cc2
2022-04-13 21:21:57 +00:00
Daniel Rosenberg
128ed57bca ANDROID: fuse-bpf: Use cache and refcount
Allocates the iocb we create for asynchronous IO from a cache instead of
a regular kzalloc

Test: generic/467 and fuse_test
Bug: 217570523
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I27dcec89cd585835f6a8e80e1ae30c503f4038c8
2022-04-13 21:21:49 +00:00
Daniel Rosenberg
8e24eb9a2d ANDROID: fuse-bpf: Rename iocb_fuse to iocb_orig
The current name is a bit confusing. iocb_fuse could refer to the iocb
passed to fuse or created by fuse. The new name unambiguously refers to
the one passed in to fuse.

Test: compiles, behavior unchanged
Bug: 217570523
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I955500eb8a3186252427fd06ca6e99b4fec469b6
2022-04-13 21:21:39 +00:00