This reverts commit db88745171.
Reason for revert: This changes the KMI, and thus must be submitted later
Change-Id: I879033c8df10e31d245d50718edccae99f12df38
Previously, errors from the daemon in FUSE_CANONICAL_PATH were simply
ignored. In order to block inotify, it is useful to be able to return
errors from this opcode.
Bug: 238619640
Test: inotify no longer works on /storage/emulated/0/Android/media but
still works on child folders
Change-Id: I1c65814f4ad0ccef330bca9764c2db15c71bf2be
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Set KMI_GENERATION=3 for 3/29 KMI update
1 function symbol(s) removed
  'bool tcpm_is_debouncing(struct tcpm_port*)'
function symbol changed from 'void cfg80211_port_authorized(struct net_device*, const u8*, gfp_t)' to 'void cfg80211_port_authorized(struct net_device*, const u8*, const u8*, u8, gfp_t)'
  CRC changed from 0x3d557d66 to 0x48ea813d
  type changed from 'void(struct net_device*, const u8*, gfp_t)' to 'void(struct net_device*, const u8*, const u8*, u8, gfp_t)'
    parameter 3 type changed from 'gfp_t' = 'unsigned int' to 'const u8*'
      resolved type changed from 'unsigned int' to 'const u8*'
    parameter 4 of type 'u8' was added
    parameter 5 of type 'gfp_t' was added
function symbol 'void __ClearPageMovable(struct page*)' changed
  CRC changed from 0xf49feb35 to 0x6cd6821a
function symbol 'void __SetPageMovable(struct page*, struct address_space*)' changed
  CRC changed from 0x7021ca08 to 0x98af4ede
function symbol 'int ___pskb_trim(struct sk_buff*, unsigned int)' changed
  CRC changed from 0xa25b9470 to 0x9a468f39
... 1210 omitted; 1213 symbols have only CRC changes
type 'enum nl80211_attrs' changed
  enumerator 'NL80211_ATTR_TD_BITMAP' (321) was added
  enumerator '__NL80211_ATTR_AFTER_LAST' value changed from 321 to 322
  enumerator 'NUM_NL80211_ATTR' value changed from 321 to 322
  enumerator 'NL80211_ATTR_MAX' value changed from 320 to 321
type 'struct ufs_hba' changed
  member 'unsigned int android_quirks' was added
  member 'unsigned int dev_quirks' changed
    offset changed by 32
type 'struct pglist_data' changed
  byte size changed from 6976 to 7168
  2 members ('unsigned long flags' .. 'struct lru_gen_mm_walk mm_walk') changed
    offset changed by 128
  member 'struct lru_gen_memcg memcg_lru' was added
  3 members ('struct zone_padding _pad2_' .. 'atomic_long_t vm_stat[41]') changed
    offset changed by 1536
type 'struct tcpm_port' changed
  member 'bool potential_contaminant' was added
  member 'bool debouncing' was removed
type 'struct tcpci_data' changed
  byte size changed from 64 to 72
  member 'void(* check_contaminant)(struct tcpci*, struct tcpci_data*)' was added
type 'struct lruvec' changed
  byte size changed from 1208 to 1224
  member changed from 'struct lru_gen_struct lrugen' to 'struct lru_gen_page lrugen'
    type changed from 'struct lru_gen_struct' to 'struct lru_gen_page'
  2 members ('struct lru_gen_mm_state mm_state' .. 'struct pglist_data* pgdat') changed
    offset changed by 128
type 'struct pci_host_bridge' changed
  member 'unsigned int no_inc_mrrs:1' was added
  9 members ('unsigned int native_aer:1' .. 'unsigned int msi_domain:1') changed
    offset changed by 1
type 'struct tcpc_dev' changed
  member changed from 'int(* check_contaminant)(struct tcpc_dev*)' to 'void(* check_contaminant)(struct tcpc_dev*)'
    type changed from 'int(*)(struct tcpc_dev*)' to 'void(*)(struct tcpc_dev*)'
      pointed-to type changed from 'int(struct tcpc_dev*)' to 'void(struct tcpc_dev*)'
        return type changed from 'int' to 'void'
type 'enum tcpm_state' changed
  enumerator 'CHECK_CONTAMINANT' (2) was added
  enumerator 'SRC_UNATTACHED' value changed from 2 to 3
  enumerator 'SRC_ATTACH_WAIT' value changed from 3 to 4
  enumerator 'SRC_ATTACHED' value changed from 4 to 5
  enumerator 'SRC_STARTUP' value changed from 5 to 6
  enumerator 'SRC_SEND_CAPABILITIES' value changed from 6 to 7
  enumerator 'SRC_SEND_CAPABILITIES_TIMEOUT' value changed from 7 to 8
  enumerator 'SRC_NEGOTIATE_CAPABILITIES' value changed from 8 to 9
  enumerator 'SRC_TRANSITION_SUPPLY' value changed from 9 to 10
  enumerator 'SRC_READY' value changed from 10 to 11
  enumerator 'SRC_WAIT_NEW_CAPABILITIES' value changed from 11 to 12
  enumerator 'SNK_UNATTACHED' value changed from 12 to 13
  enumerator 'SNK_ATTACH_WAIT' value changed from 13 to 14
  enumerator 'SNK_DEBOUNCED' value changed from 14 to 15
  enumerator 'SNK_ATTACHED' value changed from 15 to 16
  enumerator 'SNK_STARTUP' value changed from 16 to 17
  enumerator 'SNK_DISCOVERY' value changed from 17 to 18
  enumerator 'SNK_DISCOVERY_DEBOUNCE' value changed from 18 to 19
  enumerator 'SNK_DISCOVERY_DEBOUNCE_DONE' value changed from 19 to 20
  enumerator 'SNK_WAIT_CAPABILITIES' value changed from 20 to 21
  enumerator 'SNK_NEGOTIATE_CAPABILITIES' value changed from 21 to 22
  enumerator 'SNK_NEGOTIATE_PPS_CAPABILITIES' value changed from 22 to 23
  enumerator 'SNK_TRANSITION_SINK' value changed from 23 to 24
  enumerator 'SNK_TRANSITION_SINK_VBUS' value changed from 24 to 25
  enumerator 'SNK_READY' value changed from 25 to 26
  enumerator 'ACC_UNATTACHED' value changed from 26 to 27
  enumerator 'DEBUG_ACC_ATTACHED' value changed from 27 to 28
  enumerator 'AUDIO_ACC_ATTACHED' value changed from 28 to 29
  enumerator 'AUDIO_ACC_DEBOUNCE' value changed from 29 to 30
  enumerator 'HARD_RESET_SEND' value changed from 30 to 31
  enumerator 'HARD_RESET_START' value changed from 31 to 32
  enumerator 'SRC_HARD_RESET_VBUS_OFF' value changed from 32 to 33
  enumerator 'SRC_HARD_RESET_VBUS_ON' value changed from 33 to 34
  enumerator 'SNK_HARD_RESET_SINK_OFF' value changed from 34 to 35
  enumerator 'SNK_HARD_RESET_WAIT_VBUS' value changed from 35 to 36
  enumerator 'SNK_HARD_RESET_SINK_ON' value changed from 36 to 37
  enumerator 'SOFT_RESET' value changed from 37 to 38
  enumerator 'SRC_SOFT_RESET_WAIT_SNK_TX' value changed from 38 to 39
  enumerator 'SNK_SOFT_RESET' value changed from 39 to 40
  enumerator 'SOFT_RESET_SEND' value changed from 40 to 41
  enumerator 'DR_SWAP_ACCEPT' value changed from 41 to 42
  enumerator 'DR_SWAP_SEND' value changed from 42 to 43
  enumerator 'DR_SWAP_SEND_TIMEOUT' value changed from 43 to 44
  enumerator 'DR_SWAP_CANCEL' value changed from 44 to 45
  enumerator 'DR_SWAP_CHANGE_DR' value changed from 45 to 46
  enumerator 'PR_SWAP_ACCEPT' value changed from 46 to 47
  enumerator 'PR_SWAP_SEND' value changed from 47 to 48
  enumerator 'PR_SWAP_SEND_TIMEOUT' value changed from 48 to 49
  enumerator 'PR_SWAP_CANCEL' value changed from 49 to 50
  enumerator 'PR_SWAP_START' value changed from 50 to 51
  enumerator 'PR_SWAP_SRC_SNK_TRANSITION_OFF' value changed from 51 to 52
  enumerator 'PR_SWAP_SRC_SNK_SOURCE_OFF' value changed from 52 to 53
  enumerator 'PR_SWAP_SRC_SNK_SOURCE_OFF_CC_DEBOUNCED' value changed from 53 to 54
  enumerator 'PR_SWAP_SRC_SNK_SINK_ON' value changed from 54 to 55
  enumerator 'PR_SWAP_SNK_SRC_SINK_OFF' value changed from 55 to 56
  enumerator 'PR_SWAP_SNK_SRC_SOURCE_ON' value changed from 56 to 57
  enumerator 'PR_SWAP_SNK_SRC_SOURCE_ON_VBUS_RAMPED_UP' value changed from 57 to 58
  enumerator 'VCONN_SWAP_ACCEPT' value changed from 58 to 59
  enumerator 'VCONN_SWAP_SEND' value changed from 59 to 60
  enumerator 'VCONN_SWAP_SEND_TIMEOUT' value changed from 60 to 61
  enumerator 'VCONN_SWAP_CANCEL' value changed from 61 to 62
  enumerator 'VCONN_SWAP_START' value changed from 62 to 63
  enumerator 'VCONN_SWAP_WAIT_FOR_VCONN' value changed from 63 to 64
  enumerator 'VCONN_SWAP_TURN_ON_VCONN' value changed from 64 to 65
  enumerator 'VCONN_SWAP_TURN_OFF_VCONN' value changed from 65 to 66
  enumerator 'FR_SWAP_SEND' value changed from 66 to 67
  enumerator 'FR_SWAP_SEND_TIMEOUT' value changed from 67 to 68
  enumerator 'FR_SWAP_SNK_SRC_TRANSITION_TO_OFF' value changed from 68 to 69
  enumerator 'FR_SWAP_SNK_SRC_NEW_SINK_READY' value changed from 69 to 70
  enumerator 'FR_SWAP_SNK_SRC_SOURCE_VBUS_APPLIED' value changed from 70 to 71
  enumerator 'FR_SWAP_CANCEL' value changed from 71 to 72
  enumerator 'SNK_TRY' value changed from 72 to 73
  enumerator 'SNK_TRY_WAIT' value changed from 73 to 74
  enumerator 'SNK_TRY_WAIT_DEBOUNCE' value changed from 74 to 75
  enumerator 'SNK_TRY_WAIT_DEBOUNCE_CHECK_VBUS' value changed from 75 to 76
  enumerator 'SRC_TRYWAIT' value changed from 76 to 77
  enumerator 'SRC_TRYWAIT_DEBOUNCE' value changed from 77 to 78
  enumerator 'SRC_TRYWAIT_UNATTACHED' value changed from 78 to 79
  enumerator 'SRC_TRY' value changed from 79 to 80
  enumerator 'SRC_TRY_WAIT' value changed from 80 to 81
  enumerator 'SRC_TRY_DEBOUNCE' value changed from 81 to 82
  enumerator 'SNK_TRYWAIT' value changed from 82 to 83
  enumerator 'SNK_TRYWAIT_DEBOUNCE' value changed from 83 to 84
  enumerator 'SNK_TRYWAIT_VBUS' value changed from 84 to 85
  enumerator 'BIST_RX' value changed from 85 to 86
  enumerator 'GET_STATUS_SEND' value changed from 86 to 87
  enumerator 'GET_STATUS_SEND_TIMEOUT' value changed from 87 to 88
  enumerator 'GET_PPS_STATUS_SEND' value changed from 88 to 89
  enumerator 'GET_PPS_STATUS_SEND_TIMEOUT' value changed from 89 to 90
  enumerator 'GET_SINK_CAP' value changed from 90 to 91
  enumerator 'GET_SINK_CAP_TIMEOUT' value changed from 91 to 92
  enumerator 'ERROR_RECOVERY' value changed from 92 to 93
  enumerator 'PORT_RESET' value changed from 93 to 94
  enumerator 'PORT_RESET_WAIT_OFF' value changed from 94 to 95
  enumerator 'AMS_START' value changed from 95 to 96
  enumerator 'CHUNK_NOT_SUPP' value changed from 96 to 97
type 'struct mem_cgroup_per_node' changed
  byte size changed from 2064 to 2080
  9 members ('struct lruvec_stats_percpu* lruvec_stats_percpu' .. 'struct mem_cgroup* memcg') changed
    offset changed by 128
type 'struct pkvm_module_ops' changed
  byte size changed from 208 to 496
  member 'int(* host_share_hyp)(u64)' was added
  member 'int(* host_unshare_hyp)(u64)' was added
  member 'int(* pin_shared_mem)(void*, void*)' was added
  member 'void(* unpin_shared_mem)(void*, void*)' was added
  5 members ('void*(* memcpy)(void*, const void*, size_t)' .. 'unsigned long(* kern_hyp_va)(unsigned long)') changed
    offset changed by 256
  member 'u64 android_kabi_reserved1' was added
  member 'u64 android_kabi_reserved2' was added
  member 'u64 android_kabi_reserved3' was added
  member 'u64 android_kabi_reserved4' was added
  member 'u64 android_kabi_reserved5' was added
  member 'u64 android_kabi_reserved6' was added
  member 'u64 android_kabi_reserved7' was added
  member 'u64 android_kabi_reserved8' was added
  member 'u64 android_kabi_reserved9' was added
  member 'u64 android_kabi_reserved10' was added
  member 'u64 android_kabi_reserved11' was added
  member 'u64 android_kabi_reserved12' was added
  member 'u64 android_kabi_reserved13' was added
  member 'u64 android_kabi_reserved14' was added
  member 'u64 android_kabi_reserved15' was added
  member 'u64 android_kabi_reserved16' was added
  member 'u64 android_kabi_reserved17' was added
  member 'u64 android_kabi_reserved18' was added
  member 'u64 android_kabi_reserved19' was added
  member 'u64 android_kabi_reserved20' was added
  member 'u64 android_kabi_reserved21' was added
  member 'u64 android_kabi_reserved22' was added
  member 'u64 android_kabi_reserved23' was added
  member 'u64 android_kabi_reserved24' was added
  member 'u64 android_kabi_reserved25' was added
  member 'u64 android_kabi_reserved26' was added
  member 'u64 android_kabi_reserved27' was added
  member 'u64 android_kabi_reserved28' was added
  member 'u64 android_kabi_reserved29' was added
  member 'u64 android_kabi_reserved30' was added
  member 'u64 android_kabi_reserved31' was added
  member 'u64 android_kabi_reserved32' was added
type 'struct kvm_vcpu_arch' changed
  member 'struct task_struct* parent_task' was removed
  12 members ('struct { struct kvm_guest_debug_arch regs; u64 pmscr_el1; u64 trfcr_el1; } host_debug_state' .. 'struct { u64 last_steal; gpa_t base; } steal') changed
    offset changed by -64
Bug: 273751441
Change-Id: I9617a576b8535d879ba077e980d22e4195af13c7
Signed-off-by: Carlos Llamas <cmllamas@google.com>
The following symbol no longer exists after reverting an out-of-tree
typec contaminant patch in favor of its upstream version:
- tcpm_is_debouncing
Removing the symbol fixes the following KMI build issue:
ERROR: Checking for kmi_symbol_list_strict_mode
[...]
Symbols missing from the ksymtab:
tcpm_is_debouncing
Bug: 273751441
Change-Id: I0f842e61237187c54dc471567c2b6df455778548
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Prevent the imminent collision between the upstream quirk bits (now up
to '1 << 19') and the Android quirk bits (starting at '1 << 20') by
moving the Android quirk bits into their own field in struct ufs_hba.
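To make the collision concrete, here is a minimal userspace sketch. The quirk names, the `hba_shared`/`hba_split` structs and `has_android_quirk()` are all hypothetical; only the bit positions (upstream up to `1 << 19`, Android from `1 << 20`) and the idea of a separate `android_quirks` field come from the message above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; only the bit positions come from the commit text. */
#define UPSTREAM_QUIRK_LAST (1U << 19) /* highest upstream bit in use today */
#define ANDROID_QUIRK_FIRST (1U << 20) /* old placement of Android quirks */

/* Before: one shared word, so the next upstream quirk (bit 20) would
 * alias ANDROID_QUIRK_FIRST. */
struct hba_shared {
	unsigned int quirks;
};

/* After: Android quirks get their own word (cf. the new
 * 'unsigned int android_quirks' member in struct ufs_hba). */
struct hba_split {
	unsigned int quirks;
	unsigned int android_quirks;
};

/* Hypothetical Android quirk; it can start from bit 0 in its own field. */
#define ANDROID_QUIRK_EXAMPLE (1U << 0)

static bool has_android_quirk(const struct hba_split *hba, unsigned int q)
{
	return (hba->android_quirks & q) != 0;
}
```

With separate fields, an upstream quirk at bit 20 can no longer be mistaken for an Android one, whatever numbering either side picks.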
Bug: 162257402
Change-Id: I5373c092734d16f693300d9bd73c7c9063ac921e
Signed-off-by: Eric Biggers <ebiggers@google.com>
Recall that the per-node memcg LRU has two generations and they alternate
when the last memcg (of a given node) is moved from one to the other.
Each generation is also sharded into multiple bins to improve scalability.
A reclaimer starts with a random bin (in the old generation) and, if it
fails, it will retry, i.e., try the rest of the bins.
If a reclaimer fails with the last memcg, it should move this memcg to the
young generation first, which causes the generations to alternate, and
then retry. Otherwise, the retries will be futile because all other bins
are empty.
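The fixed retry order can be illustrated with a toy model. It is not the kernel's data structure: bins are small FIFOs of memcg ids, `reclaimable[]` stands in for "reclaim from this memcg succeeded", a rotated memcg keeps its bin index instead of getting a random one, and all names are hypothetical. The key step is that a failed memcg is moved to the young generation *before* the retry, so removing the last one from the old generation flips the generations and the retry finds non-empty bins.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_BINS 4  /* hypothetical shard count */
#define MAX_IDS 16 /* bin capacity and range of memcg ids */

/* Toy per-node memcg LRU: two generations, each sharded into bins. */
struct toy_bin { int ids[MAX_IDS]; int n; };
struct toy_memcg_lru {
	unsigned long seq; /* seq % 2 indexes the old generation */
	struct toy_bin gen[2][NR_BINS];
};

static int old_gen(const struct toy_memcg_lru *l) { return (int)(l->seq % 2); }
static int young_gen(const struct toy_memcg_lru *l) { return (int)((l->seq + 1) % 2); }

static void push_tail(struct toy_bin *b, int id)
{
	assert(b->n < MAX_IDS);
	b->ids[b->n++] = id;
}

static int pop_head(struct toy_bin *b)
{
	int id = b->ids[0];

	for (int i = 1; i < b->n; i++)
		b->ids[i - 1] = b->ids[i];
	b->n--;
	return id;
}

static bool gen_empty(const struct toy_memcg_lru *l, int g)
{
	for (int i = 0; i < NR_BINS; i++)
		if (l->gen[g][i].n)
			return false;
	return true;
}

/* Move a failed memcg to the young generation; if it was the last one in
 * the old generation, the generations alternate (seq++). */
static void rotate_to_young(struct toy_memcg_lru *l, int bin, int id)
{
	push_tail(&l->gen[young_gen(l)][bin], id);
	if (gen_empty(l, old_gen(l)))
		l->seq++;
}

/* Start at first_bin of the old generation and walk the bins, rotating
 * each failed memcg before retrying. Returns the id reclaimed from, or -1
 * once 'budget' steps are used up. */
static int toy_shrink_many(struct toy_memcg_lru *l, int first_bin, int budget,
			   const bool *reclaimable)
{
	int bin = first_bin;

	while (budget-- > 0) {
		struct toy_bin *b = &l->gen[old_gen(l)][bin];

		if (b->n == 0) {
			bin = (bin + 1) % NR_BINS; /* try the rest of the bins */
			continue;
		}

		int id = pop_head(b);

		if (reclaimable[id]) {
			push_tail(b, id); /* stays in the old generation */
			return id;
		}
		rotate_to_young(l, bin, id);
	}
	return -1;
}
```

In the scenario the commit describes, the old generation holds a single memcg that fails. Rotating it first flips the generations, so the retry reaches memcgs that were in the young generation. Without the rotation, the retry would only visit empty old-generation bins.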
Link: https://lkml.kernel.org/r/20230213075322.1416966-1-yuzhao@google.com
Fixes: e4dde56cd2 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit 9f550d78b4)
[Yu: Resolve conflicts over absence of folios on 5.15]
Change-Id: I4a34f39b5ed9fc7f3deb2d7695eac8298d6fbc35
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Among the flags in scan_control:
1. sc->may_swap, which indicates a swap constraint due to memsw.max, is
supported as usual.
2. sc->proactive, which indicates reclaim by memory.reclaim, may not
opportunistically skip the aging path, since it is considered less
latency sensitive.
3. !(sc->gfp_mask & __GFP_IO), which indicates IO constraint, lowers
swappiness to prioritize file LRU, since clean file pages are more
likely to exist.
4. sc->may_writepage and sc->may_unmap, which indicate opportunistic
reclaim, are rejected, since unmapped clean pages are already
prioritized. Scanning for more of them is likely futile and can
cause high reclaim latency when there is a large number of memcgs.
The rest are handled by the existing code.
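A minimal sketch of rules 1, 3 and 4 as plain functions follows. It uses a hypothetical `toy_scan_control` and a placeholder `TOY_GFP_IO` bit rather than the kernel's types, and the real logic in mm/vmscan.c may differ in detail. Rule 2 (`sc->proactive`) is left out, in line with the backport note that 5.15 lacks proactive reclaim.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_GFP_IO (1U << 0) /* placeholder for the kernel's __GFP_IO */

/* Simplified stand-in for the scan_control fields discussed above. */
struct toy_scan_control {
	unsigned int gfp_mask;
	bool may_swap;
	bool may_writepage;
	bool may_unmap;
};

/* Rule 4: opportunistic reclaim (either flag cleared) is rejected. */
static bool toy_reject_reclaim(const struct toy_scan_control *sc)
{
	return !sc->may_writepage || !sc->may_unmap;
}

/* Rules 1 and 3: effective swappiness for this reclaim attempt. */
static int toy_swappiness(const struct toy_scan_control *sc, int memcg_swappiness)
{
	if (!sc->may_swap)
		return 0; /* swap constrained, e.g. by memsw.max */
	if (memcg_swappiness && !(sc->gfp_mask & TOY_GFP_IO))
		return 1; /* IO constrained: lean on the file LRU */
	return memcg_swappiness;
}
```

The IO case lowers swappiness to 1 rather than 0, so anon pages are deprioritized without being excluded entirely. That is one plausible reading of "lowers swappiness to prioritize file LRU".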
Link: https://lkml.kernel.org/r/20221222041905.2431096-8-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit e9d4e1ee78)
[TJ: Resolved conflict over
650081b837 ("ANDROID: MGLRU: Don't skip anon reclaim if swap low")]
[Yu: Resolve conflicts over absence of folios and proactive reclaim on 5.15]
Change-Id: I75dd554826ccd4b5c0e4fc497f40d9e06d6c3673
Signed-off-by: T.J. Mercier <tjmercier@google.com>
For each node, memcgs are divided into two generations: the old and
the young. For each generation, memcgs are randomly sharded into
multiple bins to improve scalability. For each bin, an RCU hlist_nulls
is virtually divided into three segments: the head, the tail and the
default.
An onlining memcg is added to the tail of a random bin in the old
generation. The eviction starts at the head of a random bin in the old
generation. The per-node memcg generation counter, whose remainder (mod
2) indexes the old generation, is incremented when all its bins become
empty.
There are four operations:
1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in
its current generation (old or young) and updates its "seg" to
"head";
2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in
its current generation (old or young) and updates its "seg" to
"tail";
3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in
the old generation, updates its "gen" to "old" and resets its "seg"
to "default";
4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin
in the young generation, updates its "gen" to "young" and resets
its "seg" to "default".
The events that trigger the above operations are:
1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
2. The first attempt to reclaim an memcg below low, which triggers
MEMCG_LRU_TAIL;
3. The first attempt to reclaim an memcg below reclaimable size
threshold, which triggers MEMCG_LRU_TAIL;
4. The second attempt to reclaim an memcg below reclaimable size
threshold, which triggers MEMCG_LRU_YOUNG;
5. Attempting to reclaim an memcg below min, which triggers
MEMCG_LRU_YOUNG;
6. Finishing the aging on the eviction path, which triggers
MEMCG_LRU_YOUNG;
7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
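The four operations and their triggers can be sketched with a toy array-based model. All names are hypothetical, `bin` is passed in where the kernel would pick a random bin, and the generation flip when the old generation empties is omitted here because the text covers it separately. Index 0 of a bin is its head.

```c
#include <assert.h>
#include <string.h>

#define NR_BINS 2 /* hypothetical shard count */
#define MAX_IDS 8 /* bin capacity and range of memcg ids */

enum seg { SEG_DEFAULT, SEG_HEAD, SEG_TAIL };
enum op { OP_HEAD, OP_TAIL, OP_OLD, OP_YOUNG };

/* Triggers from the list above, mapped to the operation they cause. */
enum event { EV_SOFT_LIMIT, EV_LOW_FIRST, EV_SIZE_FIRST, EV_SIZE_SECOND,
	     EV_BELOW_MIN, EV_AGING_DONE, EV_OFFLINE };
static const enum op op_for_event[] = {
	[EV_SOFT_LIMIT] = OP_HEAD, [EV_LOW_FIRST] = OP_TAIL,
	[EV_SIZE_FIRST] = OP_TAIL, [EV_SIZE_SECOND] = OP_YOUNG,
	[EV_BELOW_MIN] = OP_YOUNG, [EV_AGING_DONE] = OP_YOUNG,
	[EV_OFFLINE] = OP_OLD,
};

struct toy_memcg { int gen, bin; enum seg seg; };

struct toy_lru {
	unsigned long seq;            /* seq % 2 indexes the old generation */
	int ids[2][NR_BINS][MAX_IDS]; /* index 0 is the head of a bin */
	int n[2][NR_BINS];
	struct toy_memcg memcg[MAX_IDS];
};

static void link_id(struct toy_lru *l, int id, int gen, int bin, int at_head)
{
	int *a = l->ids[gen][bin], *n = &l->n[gen][bin];

	assert(*n < MAX_IDS);
	if (at_head) {
		memmove(&a[1], &a[0], (size_t)*n * sizeof(*a));
		a[0] = id;
	} else {
		a[*n] = id;
	}
	(*n)++;
	l->memcg[id].gen = gen;
	l->memcg[id].bin = bin;
}

static void unlink_id(struct toy_lru *l, int id)
{
	struct toy_memcg *m = &l->memcg[id];
	int *a = l->ids[m->gen][m->bin], *n = &l->n[m->gen][m->bin];

	for (int i = 0; i < *n; i++) {
		if (a[i] == id) {
			memmove(&a[i], &a[i + 1], (size_t)(*n - i - 1) * sizeof(*a));
			(*n)--;
			return;
		}
	}
	assert(!"memcg not found in its bin");
}

/* Onlining: tail of a bin in the old generation. */
static void toy_online(struct toy_lru *l, int id, int bin)
{
	link_id(l, id, (int)(l->seq % 2), bin, 0);
	l->memcg[id].seg = SEG_DEFAULT;
}

/* 'bin' stands in for the random bin the kernel would choose. */
static void toy_rotate(struct toy_lru *l, int id, enum op op, int bin)
{
	int old = (int)(l->seq % 2), young = (int)((l->seq + 1) % 2);
	int cur = l->memcg[id].gen;

	unlink_id(l, id);
	switch (op) {
	case OP_HEAD:  link_id(l, id, cur, bin, 1);   l->memcg[id].seg = SEG_HEAD;    break;
	case OP_TAIL:  link_id(l, id, cur, bin, 0);   l->memcg[id].seg = SEG_TAIL;    break;
	case OP_OLD:   link_id(l, id, old, bin, 1);   l->memcg[id].seg = SEG_DEFAULT; break;
	case OP_YOUNG: link_id(l, id, young, bin, 0); l->memcg[id].seg = SEG_DEFAULT; break;
	}
}
```

HEAD and TAIL keep the memcg in its current generation and only mark its segment. OLD and YOUNG change the generation and reset the segment to "default", matching the list of operations.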
Note that memcg LRU only applies to global reclaim, and the
round-robin incrementing of their max_seq counters ensures the
eventual fairness to all eligible memcgs. For memcg reclaim, it still
relies on mem_cgroup_iter().
Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit e4dde56cd2)
[Yu: Resolve conflicts over absence of folios and proactive reclaim on
5.15]
[TJ: resolve conflict due to trace_android_vh_mem_cgroup_css_online in
mem_cgroup_css_online]
Change-Id: I5bad7125eca86df446594d9aedfd856340f75ef0
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Recall that the aging produces the youngest generation: first it scans
for accessed pages and updates their gen counters; then it increments
lrugen->max_seq.
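The sequence counters behind this recap can be shown with a toy model. `MIN_NR_GENS` and `MAX_NR_GENS` are assumed to be 2 and 4, believed to match include/linux/mmzone.h, but treat them as assumptions. The kernel also tracks `min_seq` per LRU type and handles a full ring differently (it can force the oldest generation forward), so this is only a sketch with hypothetical `toy_*` names.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2UL /* assumed */
#define MAX_NR_GENS 4UL /* assumed */

struct toy_lrugen {
	unsigned long max_seq; /* youngest generation */
	unsigned long min_seq; /* oldest generation */
};

/* Sequence numbers map onto a ring of generation slots. */
static unsigned long toy_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

static unsigned long toy_nr_gens(const struct toy_lrugen *g)
{
	return g->max_seq - g->min_seq + 1;
}

/* Aging: the scan for accessed pages is omitted; it ends with max_seq++.
 * This toy refuses when every slot is already in use. */
static bool toy_try_age(struct toy_lrugen *g)
{
	if (toy_nr_gens(g) >= MAX_NR_GENS)
		return false;
	g->max_seq++;
	return true;
}

/* Eviction of the oldest generation ends with min_seq++, but at least
 * MIN_NR_GENS generations must remain. */
static bool toy_try_evict(struct toy_lrugen *g)
{
	if (toy_nr_gens(g) <= MIN_NR_GENS)
		return false;
	g->min_seq++;
	return true;
}
```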
The current aging fairness safeguard for kswapd uses two passes to
ensure the fairness to multiple eligible memcgs. On the first pass,
which is shared with the eviction, it checks whether all eligible
memcgs are low on cold pages. If so, it requires a second pass, on
which it ages all those memcgs at the same time.
With memcg LRU, the aging, while ensuring eventual fairness, will run
when necessary. Therefore the current aging fairness safeguard for
kswapd will not be needed.
Note that memcg LRU only applies to global reclaim. For memcg reclaim,
the aging can be unfair to different memcgs, i.e., their
lrugen->max_seq can be incremented at different paces.
Link: https://lkml.kernel.org/r/20221222041905.2431096-5-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit 7348cc9182)
[Yu: Resolve conflicts over absence of folios and proactive reclaim on 5.15]
Change-Id: Iad1847f586f713cc2b4ee0fac12265cf9462477a
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Recall that the eviction consumes the oldest generation: first it
bucket-sorts pages whose gen counters were updated by the aging and
reclaims the rest; then it increments lrugen->min_seq.
The current eviction fairness safeguard for global reclaim has a
dilemma: when there are multiple eligible memcgs, should it continue
or stop upon meeting the reclaim goal? If it continues, it overshoots
and increases direct reclaim latency; if it stops, it loses fairness
between memcgs it has taken memory away from and those it has yet to.
With memcg LRU, the eviction, while ensuring eventual fairness, will
stop upon meeting its goal. Therefore the current eviction fairness
safeguard for global reclaim will not be needed.
Note that memcg LRU only applies to global reclaim. For memcg reclaim,
the eviction will continue, even if it is overshooting. This becomes
unconditional due to code simplification.
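The difference between the two stopping rules can be sketched with a toy loop. `avail[i]` is the reclaimable page count of memcg i, and the persistent `cursor` is a crude, hypothetical stand-in for the memcg LRU's rotation that provides eventual fairness. This is not how the kernel walks memcgs.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns the number of pages reclaimed by this invocation. */
static long toy_reclaim(long *avail, int nr, int *cursor, long batch,
			long goal, bool global)
{
	long done = 0;

	assert(nr > 0 && batch > 0);
	for (int k = 0; k < nr; k++) {
		int i = (*cursor + k) % nr;
		long take = avail[i] < batch ? avail[i] : batch;

		avail[i] -= take;
		done += take;
		/* Global reclaim: stop as soon as the goal is met and let the
		 * next invocation start with the next memcg. */
		if (global && done >= goal) {
			*cursor = (i + 1) % nr;
			return done;
		}
	}
	/* memcg reclaim (or a global pass that fell short): every memcg
	 * was visited once, possibly overshooting the goal. */
	return done;
}
```

With four memcgs, a batch of 32 and a goal of 64, two global invocations each reclaim exactly 64 pages and together touch every memcg once. A single memcg-reclaim invocation reclaims 128, overshooting by 2x.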
Link: https://lkml.kernel.org/r/20221222041905.2431096-4-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit a579086c99)
[Yu: Resolve conflicts over absence of folios and proactive reclaim on 5.15]
Change-Id: I008664a847b10a7990325c0a3cb2d707f1a1bc2a
Signed-off-by: T.J. Mercier <tjmercier@google.com>
The page reclaim isolates a batch of pages from the tail of one of the
LRU lists and works on those pages one by one. For a suitable
swap-backed page, if the swap device is async, it queues that page for
writeback. After the page reclaim finishes an entire batch, it puts back
the pages it queued for writeback to the head of the original LRU list.
In the meantime, the page writeback flushes the queued pages, also in
batches. Its batching logic is independent of that of the page reclaim.
For each of the pages it writes back, the page writeback calls
rotate_reclaimable_page() which tries to rotate a page to the tail.
rotate_reclaimable_page() only works for a page after the page reclaim
has put it back. If an async swap device is fast enough, the page
writeback can finish with that page while the page reclaim is still
working on the rest of the batch containing it. In this case, that page
will remain at the head and the page reclaim will not retry it before
reaching there.
This patch adds a retry to evict_pages(). After evict_pages() has
finished an entire batch and before it puts back pages it cannot free
immediately, it retries those that may have missed the rotation.
Before this patch, ~60% of pages swapped to an Intel Optane missed
rotate_reclaimable_page(). After this patch, ~99% of missed pages were
reclaimed upon retry.
This problem affects relatively slow async swap devices like Samsung 980
Pro much less and does not affect sync swap devices like zram or zswap at
all.
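The added retry can be modelled on a single isolated batch. `toy_page` and its flags are hypothetical and are not the kernel's page flags. The test then simulates the async device finishing two pages before the batch is put back.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page state; not the kernel's page flags. */
struct toy_page {
	bool dirty;     /* swap-backed page that must be written out first */
	bool writeback; /* async writeback in flight */
	bool freed;
};

/* First pass over an isolated batch: clean pages are freed, dirty
 * swap-backed pages are queued for async writeback and stay isolated. */
static int toy_first_pass(struct toy_page *b, int n)
{
	int freed = 0;

	for (int i = 0; i < n; i++) {
		if (b[i].dirty) {
			b[i].writeback = true;
		} else {
			b[i].freed = true;
			freed++;
		}
	}
	return freed;
}

/* The device finished page p while it was still isolated, so the rotation
 * to the LRU tail that would normally follow is missed. */
static void toy_writeback_done(struct toy_page *p)
{
	p->writeback = false;
	p->dirty = false; /* contents are now in swap */
}

/* The retry: before putting pages back, free those that finished writeback
 * in the meantime. Returns the number freed. */
static int toy_retry_pass(struct toy_page *b, int n)
{
	int freed = 0;

	for (int i = 0; i < n; i++) {
		if (!b[i].freed && !b[i].writeback && !b[i].dirty) {
			b[i].freed = true;
			freed++;
		}
	}
	return freed;
}
```

Pages still under writeback after the retry are the only ones put back, and `rotate_reclaimable_page()` can rotate them once their writeback completes.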
Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit 359a5e1416)
[Yu: Resolve conflicts over absence of folios on 5.15]
Change-Id: Ife11b13e2612c84a2de1727781983f66a06141bb
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Patch series "mm: multi-gen LRU: memcg LRU", v3.
Overview
========
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of pages (see
mem_cgroup_lruvec()).
Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.
Its memory overhead is a pointer added to each lruvec, plus a negligible
amount added to each pglist_data. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not affect
the worst-case complexity O(n). Therefore, on average, it has a sublinear
complexity in contrast to the current linear complexity.
The basic structure of an memcg LRU can be understood by an analogy to
the active/inactive LRU (of pages):
1. It has the young and the old (generations), i.e., the counterparts
to the active and the inactive;
2. The increment of max_seq triggers promotion, i.e., the counterpart
to activation;
3. Other events trigger similar operations, e.g., offlining an memcg
triggers demotion, i.e., the counterpart to deactivation.
In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out at will
and reduces latency without affecting fairness over some time.
The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/
The following is a simple test to quickly verify its effectiveness.
Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.
Desired outcome:
1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
over mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
to 100%.
Actual outcome [1]:
                                   MGLRU off   MGLRU on
stddev(pgsteal) / mean(pgsteal)    75%         20%
sum(pgsteal) / sum(requested)      425%        95%
####################################################################
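#!/bin/bash
# Requires cgroup v2 at /sys/fs/cgroup (memory.reclaim is v2-only) with the
# memory controller enabled for child cgroups, and fio installed.
MEMCGS=128
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir "/sys/fs/cgroup/memcg$memcg"
done
# Move this (background) subshell into its memcg, then run one fio job there.
start() {
    echo "$BASHPID" > "/sys/fs/cgroup/memcg$memcg/cgroup.procs"
    fio --name="memcg$memcg" --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done
# Let the jobs warm up, then request 256m of reclaim from the root every 6s.
sleep 600
for ((i = 0; i < 600; i++)); do
    echo 256m > /sys/fs/cgroup/memory.reclaim
    sleep 6
done
# Per-memcg reclaim counts.
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " "/sys/fs/cgroup/memcg$memcg/memory.stat"
done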
####################################################################
[1]: This was obtained from running the above script (touches less
than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
hour.
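The two ratios in the outcome table can be computed from the per-memcg `pgsteal` values the script prints. The sketch below assumes a population standard deviation and a caller-supplied page size. `pgsteal` counts pages while memory.reclaim takes bytes; the script requests 600 × 256 MiB = 150 GiB in total. The function names are hypothetical, and a small Newton iteration replaces `sqrt()` so the sketch does not need libm.

```c
#include <assert.h>

/* Newton's method for sqrt, to avoid linking libm. Starting at or above
 * sqrt(x), the iteration converges monotonically. */
static double toy_sqrt(double x)
{
	double r = x > 1.0 ? x : 1.0;

	if (x <= 0.0)
		return 0.0;
	for (int i = 0; i < 100; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/* stddev(pgsteal) / mean(pgsteal), in percent (population stddev). */
static double pgsteal_spread_percent(const double *pgsteal, int n)
{
	double mean = 0.0, var = 0.0;

	assert(n > 0);
	for (int i = 0; i < n; i++)
		mean += pgsteal[i];
	mean /= n;
	for (int i = 0; i < n; i++)
		var += (pgsteal[i] - mean) * (pgsteal[i] - mean);
	var /= n;
	return 100.0 * toy_sqrt(var) / mean;
}

/* sum(pgsteal) / sum(requested), in percent. */
static double pgsteal_vs_requested_percent(const double *pgsteal, int n,
					   double requested_bytes, double page_size)
{
	double stolen = 0.0;

	for (int i = 0; i < n; i++)
		stolen += pgsteal[i];
	return 100.0 * stolen * page_size / requested_bytes;
}
```

Read this way, the 425% in the "MGLRU off" column means reclaim took more than four times the memory that was requested.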
This patch (of 8):
The new name lru_gen_page will be more distinct from the coming
lru_gen_memcg.
Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit 391655fe08)
[Yu: Resolve conflicts over absence of folios on 5.15]
Change-Id: Ie92535676b005ec9e7987632b742fdde8d54436f
Signed-off-by: T.J. Mercier <tjmercier@google.com>
In the case of 4-way handshake offload, the transition disable policy
updated by the AP during EAPOL message 3/4 is not passed to the upper layer.
This results in a mismatch in the transition disable policy
between the upper layer and the driver. This patch addresses the
issue by updating the transition disable policy as part of the port
authorization indication.
Signed-off-by: Vinayak Yadawad <vinayak.yadawad@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Bug: 272227555
Change-Id: Iac5d22a2c3999c7bdddc3a1f683fef82ed8ff918
(cherry picked from commit 0ff57171d6)
Signed-off-by: Shivani Baranwal <quic_shivbara@quicinc.com>
Signed-off-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
This reverts commit c9d17c24b9.
It was preserving the ABI, but that is no longer needed at this point
in time.
Change-Id: I571a879d78bcbb7f1be4554456ea2ac6ebcc53cc
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit cf76e85064.
It was preserving the ABI, but that is no longer needed at this point
in time.
Change-Id: Ie8de065eb07476140971d0684de0460ce391d52c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit 408db6d88d.
It was preserving the ABI, but that is no longer needed at this point
in time.
Change-Id: If129ead534970cd3a634ac9dcf563441c0c19a01
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit b923dd1052.
It was preserving the ABI, but that is no longer needed at this point
in time.
Change-Id: Ib7087614a16570125233f26d582d449fe5ead163
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Non-protected mode relies on the host to restore its SVE state if
necessary. However, protected VMs shouldn't reveal any
information to the host, including whether they have potentially
dirtied the host's sve state. Therefore, save and restore the
host's sve state at hyp in protected mode.
Currently this behavior applies to protected and non-protected
VMs in protected mode. It could be optimised for non-protected
VMs by applying the same behavior as non-protected mode, which is
to inform the host that it should restore its sve state. But for
now it's kept this way to maintain the same behavior for all VMs
in protected mode.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: Ifbcc64b387c3f821a6c1047e8c843f6250a3f690
The code for deactivating traps, to be able to update the fpsimd
registers, is the only code in this file that is n/vhe specific.
Move it to specialized functions.
This is also needed for the subsequent patch, since the logic for
deciding which traps to enable/disable will get more complex.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: Ia0477450aa9319a46a91b3c31c1910ad02fbe246
In subsequent patches, vhe/pKVM(nvhe) will diverge significantly
on saving the host fpsimd/sve state when taking a guest fpsimd
trap. Add a specialized helper to handle that.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: Ib6b13cafad8bf568694804e3b55e0a5a4fcd70a4
Allocate memory and donate it to hyp at setup time for tracking
the host sve state at hyp in protected mode. This memory is used
in the subsequent patch.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: If07eec9ea9c7b216d02e2d1ea69bd62d99f08081
The code to determine the maximum SVE vector length supported by the
system isn't trivial. In subsequent patches, hyp needs to know it to
allocate memory for the host SVE state.
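Here is why the maximum vector length drives the allocation. The sketch counts only the architectural SVE register payload: 32 Z registers of VL bits each, plus 16 P registers and FFR of VL/8 bits each. It assumes VL is a multiple of 128 bits up to 2048. The kernel's actual save-area layout may add control registers and padding, and `sve_regs_bytes()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of SVE register state for a vector length of vl_bits. */
static size_t sve_regs_bytes(unsigned int vl_bits)
{
	size_t z = vl_bits / 8;  /* one Z register, in bytes */
	size_t p = vl_bits / 64; /* one P register (and FFR), in bytes */

	assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
	return 32 * z + 16 * p + p; /* Z0-Z31, P0-P15, FFR */
}
```

The payload grows linearly, from 546 bytes at 128 bits to 8736 bytes at 2048 bits. A per-CPU buffer that must fit any vector length the system supports therefore has to be sized from the maximum.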
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I2561af67722a99d8a989b26cb47d073eba3869ff
Subsequent patches will augment this state to allocate space for
tracking the host sve state. SVE state size is not static, and
there isn't support for dynamic per_cpu allocation in hyp.
This is done as a first step in allowing us to allocate SVE state
under the same umbrella.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I0902623a5ab81a80105f5b00a26765d257bc1ceb
The state will be augmented in future patches and accessed in
more than one location. It makes it easier to reason about the
code.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: If3a3a9266c201f63c126860b61da9698be9b9faa
Subsequent patches will change how the fpsimd state is allocated,
and add tracking of sve state. Moving this to a helper makes
future code cleaner and patches easier to reason about.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: Ic46b8889c1fe11f0cfdd7b5f3d2b98bf412183f0
Before the conversion of the various booleans into an enum
representing the state, this helper clarified things. Since the
introduction of the enum, the helper obfuscates rather than
helps.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I83c870146ed2d910bf10d625d1048b95c8b23736
pKVM maintains its own state for tracking the host fpsimd state.
Therefore, there is no need to map and share the host's view with it.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I5e5164a7694881ffa641b5b6a8691a542fd55a14
Expand the comment clarifying why the host value for the sve vector
length that is restored to ZCR_EL1 on guest exit isn't the same as it
was on guest entry.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I5889407b4391a80dfcf77b31375c3a17705b68da
The GKI policy allows the addition of new symbols to a frozen KMI as
long as doing so has no impact on existing frozen symbols. Interestingly,
the hypervisor's ABI is defined by the pkvm_module_ops structure, so any
addition to this struct will be flagged as a type change, which equates
to a KMI breakage in the GKI world. In the long term, this could become a
major problem if it prevented backports of (security) fixes to KMI-frozen
kernels.
To allow such backports, add a set of reserved ABI slots to the
pkvm_module_ops struct. Such slots are usually reserved for fixing LTS
merges, but given that none of the pKVM module code is upstream yet,
these slots are likely to be used by Android-specific fixes.
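The reserved-slot idea can be sketched as follows: a frozen ops table carries spare 64-bit fields, and a later backport repurposes one of them for a new callback through a union, so the size and every member offset stay the same. The struct names, callback signatures and the union technique are illustrative assumptions; this does not reproduce the actual pkvm_module_ops layout or the reserve macros used in the Android tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ops table as frozen in the KMI, with two spare slots. */
struct module_ops_v1 {
	int (*create_mapping)(uint64_t pfn, uint64_t nr_pages);
	int (*host_share_hyp)(uint64_t pfn);
	uint64_t reserved1;
	uint64_t reserved2;
};

/* Hypothetical backport: a new callback takes over the first spare slot. */
struct module_ops_v2 {
	int (*create_mapping)(uint64_t pfn, uint64_t nr_pages);
	int (*host_share_hyp)(uint64_t pfn);
	union {
		int (*host_unshare_hyp)(uint64_t pfn);
		uint64_t reserved1; /* keeps the slot at least 64 bits wide */
	};
	uint64_t reserved2;
};

/* The layout seen by already-built modules must not change. */
_Static_assert(sizeof(struct module_ops_v1) == sizeof(struct module_ops_v2),
	       "using a reserved slot must not change the struct size");
_Static_assert(offsetof(struct module_ops_v1, reserved2) ==
	       offsetof(struct module_ops_v2, reserved2),
	       "members after the used slot must not move");
```

Because the union contains the original 64-bit reserved field, the slot keeps its width and alignment even on targets where function pointers are narrower than 64 bits.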
Bug: 233587962
Change-Id: I61a00a09947ccff153c96a4829e083ef9ede19d3
Signed-off-by: Quentin Perret <qperret@google.com>
pKVM modules may need to access memory that is kept mapped in the host's
stage-2 page-table. Expose the host_{un}share_hyp() API to support this
use case, as well as the pinning API that goes with it.
Bug: 245034629
Change-Id: I1b5abacfcd2f066b1cbb1bbac43b77e6808f559c
Signed-off-by: Quentin Perret <qperret@google.com>
DWARFv5 is the latest iteration of the debug info spec; it contains many
encoding tricks to optimize for space.
For example, with this patch applied (DWARFv5), for
build.config.gki.aarch64:
$ du -h out/android-mainline/dist/vmlinux
304M out/android-mainline/dist/vmlinux
Before (DWARFv4):
$ du -h out/android-mainline/dist/vmlinux
339M out/android-mainline/dist/vmlinux
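As a quick check of the figures above, the sketch below computes the relative size reduction. It treats the rounded du -h values as exact, so the result is approximate: a 35M reduction, roughly 10%.

```c
#include <assert.h>

/* Relative size reduction, in percent, going from before_mib to after_mib. */
static double percent_reduction(double before_mib, double after_mib)
{
	return (before_mib - after_mib) * 100.0 / before_mib;
}
```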
Bug: 192694378
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Change-Id: I6644482d9b12eb3e0d1d3676c53ee2eee97a6573
If blk_crypto_evict_key() sees that the key is still in-use (due to a
bug) or that ->keyslot_evict failed, it currently just returns while
leaving the key linked into the keyslot management structures.
However, blk_crypto_evict_key() is only called in contexts such as inode
eviction where failure is not an option. So actually the caller
proceeds with freeing the blk_crypto_key regardless of the return value
of blk_crypto_evict_key().
These two assumptions don't match, and the result is that there can be a
use-after-free in blk_crypto_reprogram_all_keys() after one of these
errors occurs. (Note, these errors *shouldn't* happen; we're just
talking about what happens if they do anyway.)
Fix this by making blk_crypto_evict_key() unlink the key from the
keyslot management structures even on failure.
Also improve some comments.
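The fix can be modeled in userspace as follows: eviction still reports an error when the slot is busy or the hardware eviction fails, but it always unlinks the key from its slot, so a later pass that reprograms every linked key can never touch a key the caller has already freed. This is a simplified sketch, not the blk-crypto code; the types, the fixed slot array and the hw_evict hook are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct key { int id; };

struct slot {
	const struct key *key; /* NULL when the slot is unused */
	int refs;              /* in-flight users of this slot */
};

#define NUM_SLOTS 4
static struct slot slots[NUM_SLOTS];

/* Hypothetical hardware eviction hook; may fail. NULL means "always works". */
static int (*hw_evict)(unsigned int slot_idx);

static int evict_key(const struct key *key)
{
	for (unsigned int i = 0; i < NUM_SLOTS; i++) {
		struct slot *s = &slots[i];
		int err = 0;

		if (s->key != key)
			continue;
		if (s->refs != 0)
			err = -EBUSY; /* key still in use: a bug, but report it */
		else if (hw_evict && hw_evict(i) != 0)
			err = -EIO;
		/* Unlink even on error: the caller frees the key regardless. */
		s->key = NULL;
		return err;
	}
	return 0; /* key was not programmed into any slot */
}

/* Number of keys a "reprogram all keys" pass would dereference. */
static unsigned int count_linked_keys(void)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < NUM_SLOTS; i++)
		if (slots[i].key)
			n++;
	return n;
}
```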
Fixes: 1b26283970 ("block: Keyslot Manager for Inline Encryption")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 270098322
(cherry picked from commit 5c7cb94452 https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-next)
Change-Id: I4e8983ad7db94ea8cd422743196da8854adda552
Signed-off-by: Eric Biggers <ebiggers@google.com>
Once all I/O using a blk_crypto_key has completed, filesystems can call
blk_crypto_evict_key(). However, the block layer currently doesn't call
blk_crypto_put_keyslot() until the request is being freed, which happens
after upper layers have been told (via bio_endio()) the I/O has
completed. This causes a race condition where blk_crypto_evict_key()
can see 'slot_refs != 0' without there being an actual bug.
This makes __blk_crypto_evict_key() hit the
'WARN_ON_ONCE(atomic_read(&slot->slot_refs) != 0)' and return without
doing anything, eventually causing a use-after-free in
blk_crypto_reprogram_all_keys(). (This is a very rare bug and has only
been seen when per-file keys are being used with fscrypt.)
There are two options to fix this: either release the keyslot before
bio_endio() is called on the request's last bio, or make
__blk_crypto_evict_key() ignore slot_refs. Let's go with the first
solution, since it preserves the ability to report bugs (via
WARN_ON_ONCE) where a key is evicted while still in-use.
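The ordering chosen here can be sketched as a simplified userspace model: when a request completes, its keyslot reference is dropped before the upper layer's completion callback runs, so an eviction triggered from that callback observes a reference count of zero. The request and keyslot types, and the demo callback, are hypothetical stand-ins rather than the blk-mq structures.

```c
#include <assert.h>
#include <stddef.h>

struct keyslot { int refs; };

struct request {
	struct keyslot *slot;            /* NULL if no inline encryption */
	void (*endio)(struct request *); /* tells the upper layer I/O is done */
};

static void put_keyslot(struct keyslot *slot)
{
	slot->refs--;
}

/*
 * Release the keyslot *before* reporting completion; freeing the request
 * happens later and no longer touches the slot.
 */
static void complete_request(struct request *rq)
{
	if (rq->slot) {
		put_keyslot(rq->slot);
		rq->slot = NULL;
	}
	rq->endio(rq);
}

/* Example upper-layer callback: records what an eviction would observe. */
static struct keyslot demo_slot;
static int refs_seen_at_endio = -1;

static void demo_endio(struct request *rq)
{
	(void)rq;
	refs_seen_at_endio = demo_slot.refs;
}
```

With the opposite order (calling endio first and releasing the slot when the request is freed), demo_endio would see a reference count of 1, which is the spurious 'slot_refs != 0' condition described above.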
Fixes: a892c8d52c ("block: Inline encryption support for blk-mq")
Cc: stable@vger.kernel.org
Reviewed-by: Nathan Huckleberry <nhuck@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 270098322
(cherry picked from commit 9cd1e56667 https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-next)
Change-Id: Ic2c2426db7693a06901c7893d481471f30de03b2
Signed-off-by: Eric Biggers <ebiggers@google.com>
Enable the ARMv8 Crypto Extensions implementation of AES-GCM, as it's an
order of magnitude faster than the generic implementation and is more
secure. AES-GCM is used by Android's IPsec support
(https://developer.android.com/reference/android/net/IpSecAlgorithm#AUTH_CRYPT_AES_GCM)
and is often the first choice of algorithm for new purposes as well.
This also makes GKI on arm64 consistent with GKI on x86, as the AES-NI
accelerated AES-GCM is already enabled on x86. (It is not its own
option on x86, but rather is included in CONFIG_CRYPTO_AES_NI_INTEL.)
Bug: 274721410
Change-Id: I2877192dad8f71a961d6f6f465b62b6aeee69540
Signed-off-by: Eric Biggers <ebiggers@google.com>
Simply map the shadow of the vmalloc area on demand.
Since the vmalloc virtual address range on Arm also lies between
MODULE_VADDR and 0x100000000 (ZONE_HIGHMEM), its shadow addresses are
already included between KASAN_SHADOW_START and KASAN_SHADOW_END.
Thus nothing needs to change in the memory map for Arm.
This can fix ARM_MODULE_PLTS with KASan, support KASan for highmem
and support CONFIG_VMAP_STACK with KASan.
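The reasoning above relies on the generic KASAN mapping, shadow = (addr >> 3) + offset, being monotonic: if the shadow of the whole range [MODULE_VADDR, 0x100000000) is already covered, so is the shadow of any vmalloc sub-range. The sketch below checks that property. The shadow offset, the MODULE_VADDR value of 0xbf000000 and the vmalloc range used in the example are hypothetical, and the shadow bounds are taken as the shadow of those two endpoints, which may differ from the exact Arm definitions of KASAN_SHADOW_START and KASAN_SHADOW_END.

```c
#include <assert.h>
#include <stdint.h>

/* Generic KASAN: one shadow byte covers 8 bytes of memory. */
#define KASAN_SHADOW_SCALE_SHIFT 3
/* Hypothetical offset; the real value comes from the kernel configuration. */
#define SHADOW_OFFSET 0x1f000000ULL

static uint64_t kasan_mem_to_shadow(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + SHADOW_OFFSET;
}

/*
 * Returns 1 if the shadow of [start, end) lies inside
 * [shadow_start, shadow_end). Assumes end > start.
 */
static int shadow_covered(uint64_t start, uint64_t end,
			  uint64_t shadow_start, uint64_t shadow_end)
{
	return kasan_mem_to_shadow(start) >= shadow_start &&
	       kasan_mem_to_shadow(end - 1) < shadow_end;
}
```

For instance, with these assumed values a vmalloc range of [0xf0000000, 0xff800000) falls within the shadow of [0xbf000000, 0x100000000), while a lowmem range such as [0x80000000, 0x90000000) does not.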
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Tested-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Bug: 275526617
(cherry picked from commit 565cbaad83)
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Change-Id: Ic2cb62e294dad96ba5a98b2ca48fa5efea2c2e57
This patch fixes a bug in the previous version, closing the gap with
the upstream version.
Fixes: fcc385fd44 ("FROMGIT: f2fs: factor out discard_cmd usage from general rb_tree use")
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
(cherry picked from commit e39836183be8
https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Change-Id: I4dbfb9f1f2cc956685a7c4de5fcfbba705c30cfb
Add a vendor hook for pagecache hit/miss and other
vendor specific functions.
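The general register-and-call pattern behind such a hook can be sketched in userspace: core code calls a hook site on every page cache lookup, which does nothing unless a vendor module has registered a handler. Android's vendor hooks are built on tracepoints; this sketch does not reproduce those macros, and every name and signature in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook signature: called on every page cache lookup. */
typedef void (*pagecache_hook_fn)(void *data, bool hit);

static pagecache_hook_fn pagecache_hook;
static void *pagecache_hook_data;

/* A vendor module registers its handler once, at load time. */
static int register_pagecache_hook(pagecache_hook_fn fn, void *data)
{
	if (pagecache_hook)
		return -1; /* only one handler in this sketch */
	pagecache_hook = fn;
	pagecache_hook_data = data;
	return 0;
}

/* Hook site in core code; a no-op when no handler is registered. */
static void pagecache_lookup_hook_site(bool hit)
{
	if (pagecache_hook)
		pagecache_hook(pagecache_hook_data, hit);
}

/* Example vendor handler: counts page cache hits and misses. */
struct hit_stats { unsigned long hits, misses; };

static void count_hits(void *data, bool hit)
{
	struct hit_stats *s = data;

	if (hit)
		s->hits++;
	else
		s->misses++;
}
```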
Bug: 174088128
Bug: 172987241
Signed-off-by: Chiawei Wang <chiaweiwang@google.com>
Change-Id: Ie9f14a69a86b8ed81de766e44e30f2eba1d9bd84
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit db158b4ae0)
Add a vendor hook for costly order page counting
and other vendor specific functions.
Bug: 174521902
Bug: 172987241
Signed-off-by: Chiawei Wang <chiaweiwang@google.com>
Change-Id: I89206727a462548cc3500b695d85c83ff003eec7
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 369de37804)