Recall that the aging produces the youngest generation: first it scans
for accessed pages and updates their gen counters; then it increments
lrugen->max_seq.
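The two aging steps can be pictured with a toy model. This is only a
sketch, not the kernel code: Page, LruGen and age() are hypothetical
names, the accessed flag stands in for the accessed bit, and tagging
accessed pages with the current max_seq is an assumption made for the
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    name: str
    gen: int                 # seq of the generation this page belongs to
    accessed: bool = False   # stand-in for the accessed bit


@dataclass
class LruGen:
    min_seq: int = 0
    max_seq: int = 1
    pages: list = field(default_factory=list)


def age(lrugen):
    """Toy aging pass over one lruvec."""
    # Step 1: scan for accessed pages and update their gen counters.
    # Assumption: they are tagged with the current youngest generation.
    for page in lrugen.pages:
        if page.accessed:
            page.gen = lrugen.max_seq
            page.accessed = False
    # Step 2: increment max_seq, which produces a new youngest generation.
    lrugen.max_seq += 1


lrugen = LruGen(pages=[Page("a", 0, accessed=True), Page("b", 0)])
age(lrugen)
print([(p.name, p.gen) for p in lrugen.pages], lrugen.max_seq)
```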
The current aging fairness safeguard for kswapd uses two passes to
ensure fairness among multiple eligible memcgs. On the first pass,
which is shared with the eviction, it checks whether all eligible
memcgs are low on cold pages. If so, it requires a second pass, on
which it ages all those memcgs at the same time.
With memcg LRU, the aging, while ensuring eventual fairness, will run
when necessary. Therefore the current aging fairness safeguard for
kswapd will not be needed.
Note that memcg LRU only applies to global reclaim. For memcg reclaim,
the aging can be unfair to different memcgs, i.e., their
lrugen->max_seq can be incremented at different paces.
Link: https://lkml.kernel.org/r/20221222041905.2431096-5-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit 7348cc9182)
[Yu: Resolve conflicts over absence of folios and proactive reclaim on 5.15]
Change-Id: Iad1847f586f713cc2b4ee0fac12265cf9462477a
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Recall that the eviction consumes the oldest generation: first it
bucket-sorts pages whose gen counters were updated by the aging and
reclaims the rest; then it increments lrugen->min_seq.
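A toy model of these two eviction steps follows. It is a sketch under
stated assumptions, not the kernel implementation: evict_oldest(), the
lists mapping and the gen_of mapping are hypothetical, and the rule
that min_seq never passes max_seq is a simplification.

```python
from collections import defaultdict


def evict_oldest(lists, gen_of, min_seq, max_seq):
    """Toy eviction pass.

    lists:  seq -> names of pages physically on that generation's list
    gen_of: page name -> gen counter (the aging may have made it newer
            than the list the page still sits on)
    Returns (reclaimed page names, new min_seq).
    """
    reclaimed = []
    for page in lists.pop(min_seq, []):
        if gen_of[page] > min_seq:
            # Bucket-sort: the aging updated this page's gen counter,
            # so move it to the list of the generation it belongs to.
            lists[gen_of[page]].append(page)
        else:
            # Everything else in the oldest generation is reclaimed.
            reclaimed.append(page)
            del gen_of[page]
    # The oldest generation is consumed; increment min_seq.
    # Assumption: min_seq is never pushed past max_seq.
    if min_seq < max_seq:
        min_seq += 1
    return reclaimed, min_seq


lists = defaultdict(list, {0: ["a", "b", "c"], 1: ["d"]})
gen_of = {"a": 0, "b": 1, "c": 0, "d": 1}
print(evict_oldest(lists, gen_of, min_seq=0, max_seq=1))
```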
The current eviction fairness safeguard for global reclaim has a
dilemma: when there are multiple eligible memcgs, should it continue
or stop upon meeting the reclaim goal? If it continues, it overshoots
and increases direct reclaim latency; if it stops, it loses fairness
between memcgs it has taken memory away from and those it has yet to.
With memcg LRU, the eviction, while ensuring eventual fairness, will
stop upon meeting its goal. Therefore the current eviction fairness
safeguard for global reclaim will not be needed.
Note that memcg LRU only applies to global reclaim. For memcg reclaim,
the eviction will continue, even if it is overshooting. This becomes
unconditional due to code simplification.
Link: https://lkml.kernel.org/r/20221222041905.2431096-4-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit a579086c99)
[Yu: Resolve conflicts over absence of folios and proactive reclaim on 5.15]
Change-Id: I008664a847b10a7990325c0a3cb2d707f1a1bc2a
Signed-off-by: T.J. Mercier <tjmercier@google.com>
The page reclaim isolates a batch of pages from the tail of one of the
LRU lists and works on those pages one by one. For a suitable
swap-backed page, if the swap device is async, it queues that page for
writeback. After the page reclaim finishes an entire batch, it puts back
the pages it queued for writeback to the head of the original LRU list.
In the meantime, the page writeback flushes the queued pages, also in
batches. Its batching logic is independent of that of the page reclaim.
For each of the pages it writes back, the page writeback calls
rotate_reclaimable_page() which tries to rotate a page to the tail.
rotate_reclaimable_page() only works for a page after the page reclaim
has put it back. If an async swap device is fast enough, the page
writeback can finish with that page while the page reclaim is still
working on the rest of the batch containing it. In this case, that page
will remain at the head and the page reclaim will not retry it before
reaching there.
This patch adds a retry to evict_pages(). After evict_pages() has
finished an entire batch and before it puts back pages it cannot free
immediately, it retries those that may have missed the rotation.
Before this patch, ~60% of pages swapped to an Intel Optane missed
rotate_reclaimable_page(). After this patch, ~99% of missed pages were
reclaimed upon retry.
This problem affects relatively slow async swap devices like Samsung 980
Pro much less and does not affect sync swap devices like zram or zswap at
all.
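The effect of the retry can be illustrated with a toy model. This is a
sketch only: evict_batch() and its arguments are hypothetical, and the
whole race is reduced to one boolean per page saying whether writeback
finished while the rest of the batch was still being processed.

```python
def evict_batch(batch, retry):
    """Toy model of one reclaim batch of swap-backed pages.

    batch: list of (name, writeback_done) pairs; every page was queued
           for async writeback during the first pass, so none could be
           freed immediately.
    retry: whether pages are retried before being put back.
    Returns (freed, put_back_at_head).
    """
    freed, put_back = [], []
    for name, writeback_done in batch:
        if retry and writeback_done:
            # Writeback already finished, so the rotation was missed;
            # the retry can reclaim the page right away.
            freed.append(name)
        else:
            # Put back at the head. A page still under writeback will
            # be rotated to the tail later; a page whose writeback
            # already finished will not.
            put_back.append(name)
    return freed, put_back


batch = [("p0", True), ("p1", False), ("p2", True)]
print(evict_batch(batch, retry=False))
print(evict_batch(batch, retry=True))
```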
Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit 359a5e1416)
[Yu: Resolve conflicts over absence of folios on 5.15]
Change-Id: Ife11b13e2612c84a2de1727781983f66a06141bb
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Patch series "mm: multi-gen LRU: memcg LRU", v3.
Overview
========
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of pages (see
mem_cgroup_lruvec()).
Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.
Its memory overhead is one pointer per lruvec and a negligible amount per
pglist_data. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1) and does not affect
the worst-case complexity O(n). Therefore, on average, it has a sublinear
complexity in contrast to the current linear complexity.
The basic structure of an memcg LRU can be understood by an analogy to
the active/inactive LRU (of pages):
1. It has the young and the old (generations), i.e., the counterparts
to the active and the inactive;
2. The increment of max_seq triggers promotion, i.e., the counterpart
to activation;
3. Other events trigger similar operations, e.g., offlining an memcg
triggers demotion, i.e., the counterpart to deactivation.
In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out at will
and reduces latency without affecting fairness over some time.
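The two features can be sketched with a toy model. The workflow in
patch 6 is considerably richer; reclaim_memcg_lru() and its rules are
assumptions made for illustration: a visited memcg moves to the young
generation, and the young generation becomes the old one once the old
one is empty.

```python
import random


def reclaim_memcg_lru(old_gen, young_gen, goal, rng):
    """Toy global reclaim over an LRU of memcgs.

    old_gen, young_gen: lists of [name, reclaimable_pages] entries.
    Returns (pages reclaimed, memcgs visited).
    """
    reclaimed = visited = 0
    # Sharding: start at a random memcg in the old generation.
    start = rng.randrange(len(old_gen)) if old_gen else 0
    order = old_gen[start:] + old_gen[:start]
    for memcg in order:
        if reclaimed >= goal:
            break  # bail out at will once the goal is met
        take = min(memcg[1], goal - reclaimed)
        memcg[1] -= take
        reclaimed += take
        visited += 1
        # Assumption: a visited memcg is moved to the young generation,
        # so later reclaimers pick other memcgs first.
        old_gen.remove(memcg)
        young_gen.append(memcg)
    if not old_gen:
        # Assumption: stands in for the max_seq increment, after which
        # the young generation is treated as the old one.
        old_gen.extend(young_gen)
        young_gen.clear()
    return reclaimed, visited


old = [[f"memcg{i}", 200] for i in range(4)]
young = []
rng = random.Random(0)
for _ in range(4):
    print(reclaim_memcg_lru(old, young, goal=100, rng=rng))
print(sorted(old))
```

In this model each call visits a single memcg (the O(1) best case), and
after four calls every memcg has given up the same number of pages.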
The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/
The following is a simple test to quickly verify its effectiveness.
Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.
Desired outcome:
1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal)
over mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close
to 100%.
Actual outcome [1]:
                                     MGLRU off  MGLRU on
    stddev(pgsteal) / mean(pgsteal)  75%        20%
    sum(pgsteal) / sum(requested)    425%       95%
####################################################################
MEMCGS=128

# Create one memcg per job.
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir /sys/fs/cgroup/memcg$memcg
done

# Move the current subshell into its memcg, then run fio there.
start() {
    echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

    fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done

# Let the jobs warm up before reclaiming.
sleep 600

# Periodically request reclaim from the root memcg.
for ((i = 0; i < 600; i++)); do
    echo 256m >/sys/fs/cgroup/memory.reclaim
    sleep 6
done

# Collect the per-memcg pgsteal counts.
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
done
####################################################################
[1]: This was obtained from running the above script (touches less
than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an
hour.
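The two ratios in the table can be computed from the per-memcg pgsteal
counts roughly as follows. This is a sketch, not the tooling used for
the numbers above: fairness_metrics() is a hypothetical helper, the
population standard deviation is an assumption, and so is the 4 KiB
page size used to convert the requested bytes into pages.

```python
import statistics

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def fairness_metrics(pgsteal, requested_pages):
    """Return (stddev/mean of pgsteal, sum(pgsteal)/requested)."""
    spread = statistics.pstdev(pgsteal) / statistics.mean(pgsteal)
    ratio = sum(pgsteal) / requested_pages
    return spread, ratio


# 600 writes of 256m to memory.reclaim, as in the script above.
requested_pages = 600 * (256 << 20) // PAGE_SIZE

# Ideal case: 128 memcgs, each giving up an equal share.
ideal = [requested_pages // 128] * 128
print(fairness_metrics(ideal, requested_pages))
```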
This patch (of 8):
The new name lru_gen_page will be more distinct from the coming
lru_gen_memcg.
Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit 391655fe08)
[Yu: Resolve conflicts over absence of folios on 5.15]
Change-Id: Ie92535676b005ec9e7987632b742fdde8d54436f
Signed-off-by: T.J. Mercier <tjmercier@google.com>
In case of 4-way handshake offload, the transition disable policy
updated by the AP during EAPOL 3/4 is not propagated to the upper layer.
This results in a mismatch in the transition disable policy
between the upper layer and the driver. This patch addresses this
issue by updating the transition disable policy as part of the port
authorization indication.
Signed-off-by: Vinayak Yadawad <vinayak.yadawad@broadcom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Bug: 272227555
Change-Id: Iac5d22a2c3999c7bdddc3a1f683fef82ed8ff918
(cherry picked from commit 0ff57171d6)
Signed-off-by: Shivani Baranwal <quic_shivbara@quicinc.com>
Signed-off-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
This reverts commit c9d17c24b9.
It was preserving the ABI, but that is not needed anymore at this point
in time.
Change-Id: I571a879d78bcbb7f1be4554456ea2ac6ebcc53cc
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit cf76e85064.
It was preserving the ABI, but that is not needed anymore at this point
in time.
Change-Id: Ie8de065eb07476140971d0684de0460ce391d52c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit 408db6d88d.
It was preserving the ABI, but that is not needed anymore at this point
in time.
Change-Id: If129ead534970cd3a634ac9dcf563441c0c19a01
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit b923dd1052.
It was preserving the ABI, but that is not needed anymore at this point
in time.
Change-Id: Ib7087614a16570125233f26d582d449fe5ead163
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Non-protected mode relies on the host to restore its SVE state if
necessary. However, protected VMs shouldn't reveal any
information to the host, including whether they have potentially
dirtied the host's sve state. Therefore, save and restore the
host's sve state at hyp in protected mode.
Currently this behavior applies to protected and non-protected
VMs in protected mode. It could be optimised for non-protected
VMs by applying the same behavior as non-protected mode, which is
to inform the host that it should restore its sve state. But for
now it's kept this way to maintain the same behavior for all VMs
in protected mode.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: Ifbcc64b387c3f821a6c1047e8c843f6250a3f690
The code for deactivating traps, to be able to update the fpsimd
registers, is the only code in this file that is n/vhe specific.
Move it to specialized functions.
This is also needed for the subsequent patch, since the logic for
deciding which traps to enable/disable will get more complex.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: Ia0477450aa9319a46a91b3c31c1910ad02fbe246
In subsequent patches, vhe/pKVM(nvhe) will diverge significantly
on saving the host fpsimd/sve state when taking a guest fpsimd
trap. Add a specialized helper to handle that.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: Ib6b13cafad8bf568694804e3b55e0a5a4fcd70a4
Allocate memory and donate it to hyp at setup time for tracking
the host sve state at hyp in protected mode. This memory is used
in the subsequent patch.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: If07eec9ea9c7b216d02e2d1ea69bd62d99f08081
The code to determine the maximum sve vector length supported by the
system isn't trivial. In subsequent patches hyp needs to know it for
allocating memory for the host sve state.
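To show why the maximum vector length matters for the allocation, the
sketch below computes the size of the architectural SVE register file
(32 Z registers, 16 P registers and FFR) for a given vector length.
sve_regs_size() is a hypothetical helper; what hyp actually allocates
may include more than this.

```python
def sve_regs_size(vl_bytes):
    """Bytes needed for Z0-Z31, P0-P15 and FFR at a given vector length.

    vl_bytes: vector length in bytes, a multiple of 16 from 16 to 256.
    """
    if vl_bytes % 16 or not 16 <= vl_bytes <= 256:
        raise ValueError("invalid SVE vector length")
    zregs = 32 * vl_bytes         # each Z register is VL bytes wide
    pregs = 16 * (vl_bytes // 8)  # each P register holds VL/8 bytes
    ffr = vl_bytes // 8           # FFR has the size of a P register
    return zregs + pregs + ffr


for vl in (16, 32, 256):
    print(vl, sve_regs_size(vl))
```

The size grows linearly with the vector length, so a buffer sized for a
smaller length cannot hold the state of a larger one.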
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I2561af67722a99d8a989b26cb47d073eba3869ff
Subsequent patches will augment this state to allocate space for
tracking the host sve state. SVE state size is not static, and
there isn't support for dynamic per_cpu allocation in hyp.
This is done as a first step in allowing us to allocate SVE state
under the same umbrella.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I0902623a5ab81a80105f5b00a26765d257bc1ceb
The state will be augmented in future patches and accessed in
more than one location. It makes it easier to reason about the
code.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: If3a3a9266c201f63c126860b61da9698be9b9faa
Subsequent patches will change how the fpsimd state is allocated,
and add tracking of sve state. Moving this to a helper makes
future code cleaner and patches easier to reason about.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: Ic46b8889c1fe11f0cfdd7b5f3d2b98bf412183f0
Before the conversion of the various booleans into an enum
representing the state, this helper clarified things. Since the
introduction of the enum, the helper obfuscates rather than
helps.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I83c870146ed2d910bf10d625d1048b95c8b23736
pKVM maintains its own state for tracking the host fpsimd state.
Therefore, there is no need to map and share the host's view with it.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I5e5164a7694881ffa641b5b6a8691a542fd55a14
Expand comment clarifying why the host value representing sve
vector length being restored for ZCR_EL1 on guest exit isn't the
same as it was on guest entry.
Signed-off-by: Fuad Tabba <tabba@google.com>
Bug: 267291591
Change-Id: I5889407b4391a80dfcf77b31375c3a17705b68da
The GKI policy allows the addition of new symbols to a frozen KMI as
long as doing so has no impact on existing frozen symbols. Interestingly
the hypervisor's ABI is defined by the pkvm_module_ops structure. Any
addition to this struct will be flagged as a type change, which equates
to a KMI breakage in the GKI world. This could become a major problem
long term if it prevented backport of (security) fixes to KMI-frozen
kernels.
To allow such backports, add a set of reserved ABI slots to the
pkvm_module_ops struct. These slots are usually reserved to fix LTS
merges, but given that none of the pKVM module code is upstream yet,
these slots are likely to be used by Android-specific fixes.
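The idea behind the reserved slots can be illustrated with ctypes. This
is a sketch only: the field names are hypothetical and do not reflect
the real pkvm_module_ops layout, and pointer-sized reserved slots are
an assumption.

```python
import ctypes


class OpsFrozen(ctypes.Structure):
    """Struct as frozen, with two reserved slots at the end."""
    _fields_ = [
        ("op_a", ctypes.c_void_p),       # hypothetical existing op
        ("op_b", ctypes.c_void_p),       # hypothetical existing op
        ("reserved1", ctypes.c_void_p),  # reserved ABI slot
        ("reserved2", ctypes.c_void_p),  # reserved ABI slot
    ]


class OpsWithFix(ctypes.Structure):
    """Same struct after a backported fix took over one reserved slot."""
    _fields_ = [
        ("op_a", ctypes.c_void_p),
        ("op_b", ctypes.c_void_p),
        ("op_fix", ctypes.c_void_p),     # hypothetical new op in reserved1
        ("reserved2", ctypes.c_void_p),
    ]


# Size and the offsets of the pre-existing members are unchanged.
print(ctypes.sizeof(OpsFrozen), ctypes.sizeof(OpsWithFix))
print(OpsFrozen.op_b.offset, OpsWithFix.op_b.offset)
```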
Bug: 233587962
Change-Id: I61a00a09947ccff153c96a4829e083ef9ede19d3
Signed-off-by: Quentin Perret <qperret@google.com>
pKVM modules may need to access memory that is kept mapped in the host's
stage-2 page-table. Expose the host_{un}share_hyp() API to allow the
use-case, as well as the pinning API that goes with it.
Bug: 245034629
Change-Id: I1b5abacfcd2f066b1cbb1bbac43b77e6808f559c
Signed-off-by: Quentin Perret <qperret@google.com>
DWARFv5 is the latest iteration of the debug info spec; it contains many
encoding tricks to optimize for space.
For example, with this patch applied (DWARFv5), for
build.config.gki.aarch64:
$ du -h out/android-mainline/dist/vmlinux
304M out/android-mainline/dist/vmlinux
Before (DWARFv4):
$ du -h out/android-mainline/dist/vmlinux
339M out/android-mainline/dist/vmlinux
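As a quick worked calculation from the two du figures above (taken as
reported, in the units du -h prints):

```python
before = 339  # vmlinux size with DWARFv4, from du -h
after = 304   # vmlinux size with DWARFv5, from du -h

saved = before - after
saved_pct = 100 * saved / before
print(f"saved {saved}M ({saved_pct:.1f}% smaller)")
```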
Bug: 192694378
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Change-Id: I6644482d9b12eb3e0d1d3676c53ee2eee97a6573
If blk_crypto_evict_key() sees that the key is still in-use (due to a
bug) or that ->keyslot_evict failed, it currently just returns while
leaving the key linked into the keyslot management structures.
However, blk_crypto_evict_key() is only called in contexts such as inode
eviction where failure is not an option. So actually the caller
proceeds with freeing the blk_crypto_key regardless of the return value
of blk_crypto_evict_key().
These two assumptions don't match, and the result is that there can be a
use-after-free in blk_crypto_reprogram_all_keys() after one of these
errors occurs. (Note, these errors *shouldn't* happen; we're just
talking about what happens if they do anyway.)
Fix this by making blk_crypto_evict_key() unlink the key from the
keyslot management structures even on failure.
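The mismatch and the fix can be pictured with a toy model. It is a
sketch, not the block layer code: KeyslotManager and its methods are
hypothetical, and a key is reduced to a name plus an in-use flag.

```python
class KeyslotManager:
    """Toy stand-in for the keyslot management structures."""

    def __init__(self):
        self.linked = {}  # key name -> in_use flag

    def program(self, key, in_use=False):
        self.linked[key] = in_use

    def evict_old(self, key, hw_ok=True):
        """Old behavior: on failure, leave the key linked."""
        if self.linked[key] or not hw_ok:
            return False
        del self.linked[key]
        return True

    def evict_fixed(self, key, hw_ok=True):
        """Fixed behavior: unlink the key even on failure."""
        ok = not self.linked[key] and hw_ok
        del self.linked[key]
        return ok

    def reprogram_all(self, freed):
        """Return the freed keys that would still be dereferenced."""
        return [key for key in self.linked if key in freed]


for evict in ("evict_old", "evict_fixed"):
    mgr = KeyslotManager()
    mgr.program("k", in_use=True)   # the "shouldn't happen" case
    getattr(mgr, evict)("k")        # fails either way
    freed = {"k"}                   # the caller frees the key regardless
    print(evict, mgr.reprogram_all(freed))
```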
Also improve some comments.
Fixes: 1b26283970 ("block: Keyslot Manager for Inline Encryption")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 270098322
(cherry picked from commit 5c7cb94452 https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-next)
Change-Id: I4e8983ad7db94ea8cd422743196da8854adda552
Signed-off-by: Eric Biggers <ebiggers@google.com>
Once all I/O using a blk_crypto_key has completed, filesystems can call
blk_crypto_evict_key(). However, the block layer currently doesn't call
blk_crypto_put_keyslot() until the request is being freed, which happens
after upper layers have been told (via bio_endio()) the I/O has
completed. This causes a race condition where blk_crypto_evict_key()
can see 'slot_refs != 0' without there being an actual bug.
This makes __blk_crypto_evict_key() hit the
'WARN_ON_ONCE(atomic_read(&slot->slot_refs) != 0)' and return without
doing anything, eventually causing a use-after-free in
blk_crypto_reprogram_all_keys(). (This is a very rare bug and has only
been seen when per-file keys are being used with fscrypt.)
There are two options to fix this: either release the keyslot before
bio_endio() is called on the request's last bio, or make
__blk_crypto_evict_key() ignore slot_refs. Let's go with the first
solution, since it preserves the ability to report bugs (via
WARN_ON_ONCE) where a key is evicted while still in-use.
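The ordering issue can be shown with a toy model. It is a sketch under
stated assumptions: Slot, complete_request() and slot_refs_seen_at_endio()
are hypothetical, and the upper layer is reduced to a callback that
looks at slot_refs the moment it is told the I/O has completed.

```python
class Slot:
    def __init__(self):
        self.slot_refs = 1  # held by the in-flight request


def complete_request(slot, release_first, on_endio):
    """Toy request completion with the two possible orderings."""
    if release_first:
        slot.slot_refs -= 1  # release the keyslot before bio_endio()
    on_endio()               # upper layers learn the I/O has completed
    if not release_first:
        slot.slot_refs -= 1  # old order: released when the request is freed


def slot_refs_seen_at_endio(release_first):
    """What an eviction triggered from the completion would observe."""
    slot = Slot()
    seen = []
    complete_request(slot, release_first,
                     lambda: seen.append(slot.slot_refs))
    return seen[0]


print(slot_refs_seen_at_endio(release_first=False))
print(slot_refs_seen_at_endio(release_first=True))
```

With the old order the eviction can observe a nonzero slot_refs although
nothing is wrong; with the keyslot released first it observes zero.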
Fixes: a892c8d52c ("block: Inline encryption support for blk-mq")
Cc: stable@vger.kernel.org
Reviewed-by: Nathan Huckleberry <nhuck@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 270098322
(cherry picked from commit 9cd1e56667 https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-next)
Change-Id: Ic2c2426db7693a06901c7893d481471f30de03b2
Signed-off-by: Eric Biggers <ebiggers@google.com>
Enable the ARMv8 Crypto Extensions implementation of AES-GCM, as it's an
order of magnitude faster than the generic implementation and is more
secure. AES-GCM is used by Android's IPsec support
(https://developer.android.com/reference/android/net/IpSecAlgorithm#AUTH_CRYPT_AES_GCM)
and often is the first choice of algorithm for new purposes as well.
This also makes GKI on arm64 consistent with GKI on x86, as the AES-NI
accelerated AES-GCM is already enabled on x86. (It is not its own
option on x86, but rather is included in CONFIG_CRYPTO_AES_NI_INTEL.)
Bug: 274721410
Change-Id: I2877192dad8f71a961d6f6f465b62b6aeee69540
Signed-off-by: Eric Biggers <ebiggers@google.com>
Simply map the shadow of the vmalloc area on demand.
The virtual addresses of vmalloc on Arm also lie between
MODULE_VADDR and 0x100000000 (ZONE_HIGHMEM), which means their shadow
addresses are already included between KASAN_SHADOW_START and
KASAN_SHADOW_END.
Thus nothing needs to change in the memory map of Arm.
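The address arithmetic behind this can be checked with a short sketch.
The constants are assumptions chosen for illustration (a 3G/1G split
with PAGE_OFFSET 0xC0000000, modules 16 MiB below it, and a shadow
offset of 0x9f000000); other configurations use different values. Only
the 1:8 shadow scale is generic KASan behavior.

```python
KASAN_SHADOW_SCALE_SHIFT = 3       # one shadow byte covers 8 bytes
KASAN_SHADOW_OFFSET = 0x9F000000   # assumption: example configuration
PAGE_OFFSET = 0xC0000000           # assumption: 3G/1G split
MODULE_VADDR = PAGE_OFFSET - (16 << 20)  # assumption: 16 MiB module area


def mem_to_shadow(addr):
    """Shadow address of a virtual address."""
    return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET


# Shadow region bounds derived from the offset, assuming a 32-bit
# address space whose shadow ends where the shadow of 2**32 would be.
KASAN_SHADOW_END = (1 << (32 - KASAN_SHADOW_SCALE_SHIFT)) + KASAN_SHADOW_OFFSET
KASAN_SHADOW_START = mem_to_shadow(KASAN_SHADOW_END)

lowest = mem_to_shadow(MODULE_VADDR)
highest = mem_to_shadow(0xFFFFFFFF)
print(hex(KASAN_SHADOW_START), hex(KASAN_SHADOW_END))
print(hex(lowest), hex(highest))
```

Under these assumptions every address from MODULE_VADDR up to
0x100000000 has its shadow inside the existing shadow region.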
This can fix ARM_MODULE_PLTS with KASan, support KASan for highmem
and support CONFIG_VMAP_STACK with KASan.
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Tested-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Bug: 275526617
(cherry picked from commit 565cbaad83)
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Change-Id: Ic2cb62e294dad96ba5a98b2ca48fa5efea2c2e57
I found a bug in the previous version and this patch fixes the gap from
the upstream version.
Fixes: fcc385fd44 ("FROMGIT: f2fs: factor out discard_cmd usage from general rb_tree use")
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
(cherry picked from commit e39836183be8
https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Change-Id: I4dbfb9f1f2cc956685a7c4de5fcfbba705c30cfb
Add a vendor hook for pagecache hit/miss and other
vendor specific functions.
Bug: 174088128
Bug: 172987241
Signed-off-by: Chiawei Wang <chiaweiwang@google.com>
Change-Id: Ie9f14a69a86b8ed81de766e44e30f2eba1d9bd84
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit db158b4ae0)
Add a vendor hook for costly order page counting
and other vendor specific functions.
Bug: 174521902
Bug: 172987241
Signed-off-by: Chiawei Wang <chiaweiwang@google.com>
Change-Id: I89206727a462548cc3500b695d85c83ff003eec7
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 369de37804)
This reverts commit 3df32812eb which is
commit b1a37ed00d upstream.
It breaks the Android KABI and if needed, should come back in an
abi-safe way.
Bug: 161946584
Cc: Lee Jones <joneslee@google.com>
Change-Id: I1f160797720e8bdf4960542e711fd17940a975d9
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This reverts commit 02904e8a2f which is
commit 1c5d422124 upstream.
It breaks the Android KABI and if needed, should come back in an
abi-safe way.
Bug: 161946584
Cc: Lee Jones <joneslee@google.com>
Change-Id: I9a460d9dbc41512ee71ff607e875f2da9be7f9f6
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Even if we have multiple queues in the plug list, the chances that they
are very interspersed are minimal. Don't bother spending CPU cycles
sorting the list.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Change-Id: Ia85d5c75ef4f2bf3f90e4d3408cffec5c41dcfe2
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 274474142
(cherry picked from commit df87eb0fce)
Signed-off-by: Bart Van Assche <bvanassche@google.com>