Get direct reclaim info.
Bug: 190795589
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: Ie66a3c87484a364a918c19b8e044c82f1afd6749
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 12902c9996)
Do not account a __GFP_NORETRY allocation failure as cma_alloc_fail
since it is not a critical failure (i.e., a caller using __GFP_NORETRY
should always have a fallback plan). It also keeps the counter
compatible with upstream: upstream cma_alloc_fail only counts
failures without __GFP_NORETRY, since upstream doesn't support
__GFP_NORETRY yet.
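The accounting rule above can be sketched in plain C. This is a userspace
model, not the kernel code: MODEL_GFP_NORETRY, struct cma_model and
cma_model_account() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's __GFP_NORETRY bit. */
#define MODEL_GFP_NORETRY 0x1u

/* Hypothetical per-area counters. */
struct cma_model {
	unsigned long alloc_success;
	unsigned long alloc_fail;
};

/* Account one allocation attempt; failfast failures are not counted. */
static void cma_model_account(struct cma_model *cma, bool ok, unsigned int gfp)
{
	if (ok)
		cma->alloc_success++;
	else if (!(gfp & MODEL_GFP_NORETRY))
		cma->alloc_fail++;
}
```

With this rule, a failed attempt made with the failfast flag leaves
alloc_fail untouched, while the same failure without the flag increments it.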
Bug: 220669548
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I377e6b033c3786e10b6b1c814037a4fc40e20a73
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 8ffc7ff817)
Export cma_get_size to report a cma instance's size, which is needed
to allocate all pages of the cma.
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ifb2769f60250ce605236342b950907218e1c28a5
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 7a44906686)
Do not sleep before retrying for __GFP_NORETRY since it is a failfast
approach. Users can retry the allocation without the flag themselves
if they see the failure.
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ic6a857978fda8e353b9ed770d1e0ba1808fd201e
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 12f48605e8)
alloc_contig_range is supposed to work on a range aligned to
max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity. If it fails
at a page and returns an error to the user, the user doesn't know which
page caused the allocation failure and keeps retrying with a new range
that still includes the failed page, hitting the same error again and
again until the range finally moves out of that granularity block.
Instead, make CMA aware of which pfn caused trouble in the previous
attempt and continue from a new pageblock past the failed page so it
doesn't hit the same error repeatedly.
Currently, this option works only for the __GFP_NORETRY case, to stay
safe for existing CMA users.
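A minimal sketch of the "skip past the failed page" step, as a userspace
model. MODEL_ALIGN_NR_PAGES is an assumed granularity (the real value
depends on the kernel configuration) and next_block_after() is a
hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Assumed alignment granularity in pages; config dependent in the kernel. */
#define MODEL_ALIGN_NR_PAGES 1024UL

/* First pfn of the aligned block that follows the one containing failed_pfn. */
static unsigned long next_block_after(unsigned long failed_pfn)
{
	return (failed_pfn / MODEL_ALIGN_NR_PAGES + 1) * MODEL_ALIGN_NR_PAGES;
}
```

Restarting the search at next_block_after(failed_pfn) guarantees that the
next candidate range no longer shares a block with the page that failed.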
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I0959c9df3d4b36408a68920abbb4d52d31026079
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 0e688e972d)
lru_cache_disable is not cheap since it has to run work on every
core in the system. Thus, calling the function every time
alloc_contig_range is called in the cma allocation loop is
expensive.
This patch makes lru_cache_disable smarter in that it will not run
__lru_add_drain_all when it knows the cache was already disabled by
someone else.
With that, a user of alloc_contig_range can disable the lru cache
in advance in their own context so that subsequent alloc_contig_range
calls for the user's operation avoid the costly function call.
This patch moves the lru_cache APIs from swap.h to swap.c and exports
them for vendor users.
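The idea can be modeled with a nesting counter. This is a single-threaded
userspace sketch under the assumption that only the first disabler needs
to pay for the drain; the *_model names are hypothetical and the real
kernel code has to deal with concurrency, which is left out here.

```c
#include <assert.h>

static int lru_disable_depth_model;  /* callers currently holding it disabled */
static int drain_all_calls_model;    /* how many times the costly drain ran */

static void lru_cache_disable_model(void)
{
	/* Only the first disabler pays for the per-CPU drain. */
	if (lru_disable_depth_model++ == 0)
		drain_all_calls_model++;
}

static void lru_cache_enable_model(void)
{
	lru_disable_depth_model--;
}
```

A caller that disables the cache once around a whole allocation loop makes
every nested disable inside the loop cheap.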
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I23da8599c55db49dc80226285972e4cd80dedcff
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit c8578a3e90)
Currently, alloc_contig_range expects that even though a page fails
with -EBUSY from __alloc_contig_migrate_range, it should check those
failed pages again in test_pages_isolated in the hope that those pages
will be freed soon so the cma allocation can succeed.
However, that depends on luck, and I found that test_pages_isolated
sometimes fails at the same page every time cma_alloc retries.
Rather than burning CPU to check the page's status in
test_pages_isolated for a __GFP_NORETRY allocation, just bail out and
let the user decide what to do.
Currently, this option works only for the __GFP_NORETRY case, to stay
safe for other existing users.
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I9211452be06960dc7d8f854537e53b3fc5262c8e
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit c01ce3b5ef)
alloc_contig_range is the core worker function for CMA allocation
so it has all the information needed to understand allocation
latency. For example, how many pages were migrated, how many times
unmap was needed to migrate pages, and how many times it encountered
errors for various reasons.
This patch adds such statistics to alloc_contig_range and returns
them to the user so the user can use that information to analyze
latency. cma_alloc is the first user of the statistics; it exports
them as a new trace event (i.e., cma_alloc_info).
It was really useful for optimizing cma allocation work.
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I7be43cc89d11078e2a324d2d06aada6d8e9e1cc9
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 675e504598)
Since CMA is getting used more widely, it's more important to
keep monitoring CMA statistics for system health since it's
directly related to user experience.
This feature introduces sysfs statistics for CMA, in order to provide
some basic monitoring of the CMA allocator.
* the number of CMA page successful allocations
* the number of CMA page allocation failures
These two values allow the user to calculate the allocation
failure rate for each CMA area.
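For example, a monitoring tool could derive the failure rate from the two
counters as sketched below. cma_failure_rate() is a hypothetical helper;
how the counters are read from sysfs is left out.

```c
#include <assert.h>

/* Fraction of failed attempts; 0.0 when nothing was attempted yet. */
static double cma_failure_rate(unsigned long success, unsigned long fail)
{
	unsigned long total = success + fail;

	return total ? (double)fail / (double)total : 0.0;
}
```

With 3 successful and 1 failed allocation the rate is 0.25.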
Bug: 179256052
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I5c8dc58a5d195d2e1b2e25628545f7d2a9c3b7df
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit f45afb4508)
GKI has CONFIG_DYNAMIC_DEBUG_CORE. Thus, the way to enable only the
specific alloc_contig_dump_pages output, without enabling every
pr_debug in every source file, is to use -DCONFIG_DYNAMIC_MODULE
when page_alloc.o is compiled.
Bug: 182195592
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I93266eb4161b3653389c737d4c767fd5d1083339
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 8d03e49505)
Contiguous memory allocation can be stalled due to waiting
on page writeback and/or page lock, which causes unpredictable
delay. It's an unavoidable cost for the requestor to get *big*
contiguous memory but it's expensive for *small* contiguous
memory (e.g., order-4) because the caller could retry the request
in a different range that has easily migratable pages
without stalling.
This patch introduces __GFP_NORETRY as the compaction gfp_mask in
alloc_contig_range so it will fail fast without blocking
when it encounters pages that require waiting.
Bug: 170340257
Bug: 120293424
Link: https://lore.kernel.org/linux-mm/YAnM5PbNJZlk%2F%2FiX@google.com/T/#m1362218ebb69e6e10c20d9361008b079745c4e6f
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I42ba8dd5aeb065d936978ab205e4baf84bf9a321
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 20512940b8)
The upcoming patch will introduce __GFP_NORETRY semantics
in alloc_contig_range, which is a failfast mode of the API.
Instead of adding an additional parameter for gfp, replace
no_warn with a gfp flag.
To keep the old behavior, it follows the rules below.
  no_warn               gfp_flags
  false                 GFP_KERNEL
  true                  GFP_KERNEL|__GFP_NOWARN
  gfp & __GFP_NOWARN    GFP_KERNEL | (gfp & __GFP_NOWARN)
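The mapping rules can be sketched as follows. The MODEL_* bit values are
made up for this userspace model and do not match the kernel's gfp bits;
both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up bit values, for illustration only. */
#define MODEL_GFP_KERNEL 0x1u
#define MODEL_GFP_NOWARN 0x2u
#define MODEL_GFP_OTHER  0x4u

/* Old bool parameter expressed as gfp flags (first two table rows). */
static unsigned int gfp_from_no_warn(bool no_warn)
{
	return MODEL_GFP_KERNEL | (no_warn ? MODEL_GFP_NOWARN : 0u);
}

/* Caller-supplied gfp reduced to GFP_KERNEL plus NOWARN (last table row). */
static unsigned int gfp_keep_nowarn_only(unsigned int gfp)
{
	return MODEL_GFP_KERNEL | (gfp & MODEL_GFP_NOWARN);
}
```

Only the no-warn bit of the caller's mask survives, so existing callers
see the same behavior as with the old boolean.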
Bug: 170340257
Bug: 120293424
Link: https://lore.kernel.org/linux-mm/YAnM5PbNJZlk%2F%2FiX@google.com/T/#m36b144ff81fe0a8f0ecaf6813de4819ecc41f8fe
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I1ce020ab5d5fff34eb6464be4632ddef72fb43eb
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 23ba990a3e)
In the two cases below, accessing SM_I(sbi)->fcc_info in
f2fs_issue_flush() causes a NULL pointer dereference.
a) If kthread_run() fails in f2fs_create_flush_cmd_control(), it will
release SM_I(sbi)->fcc_info,
- mount -o noflush_merge /dev/vda /mnt/f2fs
- mount -o remount,flush_merge /dev/vda /mnt/f2fs -- kthread_run() fails
- dd if=/dev/zero of=/mnt/f2fs/file bs=4k count=1 conv=fsync
b) We will never allocate memory for SM_I(sbi)->fcc_info with the
testcase below,
- mount -o ro /dev/vda /mnt/f2fs
- mount -o rw,remount /dev/vda /mnt/f2fs
- dd if=/dev/zero of=/mnt/f2fs/file bs=4k count=1 conv=fsync
To fix this issue, change as below:
- fix error path handling in f2fs_create_flush_cmd_control().
- allocate SM_I(sbi)->fcc_info even if readonly is on.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 168f912893 upstream.
When calling setattr_prepare() to determine the validity of the
attributes the ia_{g,u}id fields contain the value that will be written
to inode->i_{g,u}id. This is exactly the same for idmapped and
non-idmapped mounts and allows callers to pass in the values they want
to see written to inode->i_{g,u}id.
When group ownership is changed a caller whose fsuid owns the inode can
change the group of the inode to any group they are a member of. When
searching through the caller's groups we need to use the gid mapped
according to the idmapped mount otherwise we will fail to change
ownership for unprivileged users.
Consider a caller running with fsuid and fsgid 1000 using an idmapped
mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file
owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the
idmapped mount.
The caller now requests the gid of the file to be changed to 1000 going
through the idmapped mount. In the vfs we will immediately map the
requested gid to the value that will need to be written to inode->i_gid
and place it in attr->ia_gid. Since this idmapped mount maps 65534 to
1000 we place 65534 in attr->ia_gid.
When we check whether the caller is allowed to change group ownership we
first validate that their fsuid matches the inode's uid. The
inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount.
Since the caller's fsuid is 1000 we pass the check.
We now check whether the caller is allowed to change inode->i_gid to the
requested gid by calling in_group_p(). This will compare the passed in
gid to the caller's fsgid and search the caller's additional groups.
Since we're dealing with an idmapped mount we need to pass in the gid
mapped according to the idmapped mount. This is akin to checking whether
a caller is privileged over the future group the inode is owned by. And
that needs to take the idmapped mount into account. Note, all helpers
are nops without idmapped mounts.
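The example above can be replayed with a small userspace model. It assumes
a single contiguous mapping extent; struct idmap_model and the helpers are
hypothetical and only mirror the arithmetic, not the kernel's helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_INVALID_ID ((unsigned int)-1)

/* One extent: filesystem ids [fs_first, fs_first + count) are seen as
 * [mnt_first, mnt_first + count) through the idmapped mount. */
struct idmap_model {
	unsigned int fs_first;
	unsigned int mnt_first;
	unsigned int count;
};

/* Translate an on-disk id into the id seen through the mount. */
static unsigned int map_into_mount(const struct idmap_model *m, unsigned int fs_id)
{
	if (fs_id >= m->fs_first && fs_id - m->fs_first < m->count)
		return m->mnt_first + (fs_id - m->fs_first);
	return MODEL_INVALID_ID;
}

/* Group check from the text: compare the mapped gid, not the raw ia_gid,
 * against the caller's groups (fsgid plus supplementary groups). */
static bool may_chgrp_model(const struct idmap_model *m, unsigned int ia_gid,
			    const unsigned int *groups, size_t ngroups)
{
	unsigned int mapped = map_into_mount(m, ia_gid);

	if (mapped == MODEL_INVALID_ID)
		return false;
	for (size_t i = 0; i < ngroups; i++)
		if (groups[i] == mapped)
			return true;
	return false;
}
```

With the mapping 65534 -> 1000 and 65535 -> 1001 and a caller in group
1000, a request that places 65534 in ia_gid passes because 65534 maps to
1000, whereas comparing the raw 65534 against the caller's groups would
fail.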
New regression test sent to xfstests.
Link: https://github.com/lxc/lxd/issues/10537
Link: https://lore.kernel.org/r/20220613111517.2186646-1-brauner@kernel.org
Fixes: 2f221d6f7b ("attr: handle idmapped mounts")
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org # 5.15+
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 705191b03d upstream.
Last cycle we extended the idmapped mounts infrastructure to support
idmapped mounts of idmapped filesystems (no such filesystem exists yet).
Since then, an idmapped mount is a mount whose idmapping
is different from the filesystem's idmapping.
While doing that work we failed to adapt the acl translation helpers.
They still assume that checking for the identity mapping is enough. But
they need to use the no_idmapping() helper instead.
Note, POSIX ACLs are always translated right at the userspace-kernel
boundary using the caller's current idmapping and the initial idmapping.
The order depends on whether we're coming from or going to userspace.
The filesystem's idmapping doesn't matter at the border.
Consequently, if a non-idmapped mount is passed we need to make sure to
always pass the initial idmapping as the mount's idmapping and not the
filesystem idmapping. Since it's irrelevant here it would yield invalid
ids and prevent setting acls for filesystems that are mountable in a
userns and support posix acls (tmpfs and fuse).
I verified the regression reported in [1] and verified that this patch
fixes it. A regression test will be added to xfstests in parallel.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
Fixes: bd303368b7 ("fs: support mapped mounts of mapped filesystems")
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org> # 5.15+
Cc: <regressions@lists.linux.dev>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bd303368b7 upstream.
In previous patches we added new and modified existing helpers to handle
idmapped mounts of filesystems mounted with an idmapping. In this final
patch we convert all relevant places in the vfs to actually pass the
filesystem's idmapping into these helpers.
With this the vfs is in shape to handle idmapped mounts of filesystems
mounted with an idmapping. Note that this is just the generic
infrastructure. Actually adding support for idmapped mounts to a
filesystem mountable with an idmapping is follow-up work.
In this patch we extend the definition of an idmapped mount from a mount
that has an idmapping other than the initial idmapping attached to it to
a mount that has an idmapping attached to it which is not the same as the
idmapping the filesystem was mounted with.
As before we do not allow the initial idmapping to be attached to a
mount. In addition this patch prevents the idmapping the filesystem
was mounted with from being attached to a mount created based on this
filesystem.
This has multiple reasons and advantages. First, attaching the initial
idmapping or the filesystem's idmapping doesn't make much sense as in
both cases the values of the i_{g,u}id and other places where k{g,u}ids
are used do not change. Second, a user that really wants to do this for
whatever reason can just create a separate dedicated identical idmapping
to attach to the mount. Third, we can continue to use the initial
idmapping as an indicator that a mount is not idmapped allowing us to
continue to keep passing the initial idmapping into the mapping helpers
to tell them that something isn't an idmapped mount even if the
filesystem is mounted with an idmapping.
Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1ac2a41049 upstream.
Currently we only support idmapped mounts for filesystems mounted
without an idmapping. This was a conscious decision mentioned in
multiple places (cf. e.g. [1]).
As explained at length in [3] it is perfectly fine to extend support for
idmapped mounts to filesystems mounted with an idmapping should the
need arise. The need has been there for some time now. Various container
projects in userspace need this to run unprivileged and nested
unprivileged containers (cf. [2]).
Before we can port any filesystem that is mountable with an idmapping to
support idmapped mounts we need to first extend the mapping helpers to
account for the filesystem's idmapping. This, again, is explained at
length in our documentation at [3] but I'll give an overview here.
Currently, the low-level mapping helpers implement the remapping
algorithms described in [3] in a simplified manner. Because we could
rely on the fact that all filesystems supporting idmapped mounts are
mounted without an idmapping the translation step from or into the
filesystem idmapping could be skipped.
In order to support idmapped mounts of filesystems mountable with an
idmapping the translation step we were able to skip before cannot be
skipped anymore. A filesystem mounted with an idmapping is very likely
to not use an identity mapping and will instead use a non-identity
mapping. So the translation step from or into the filesystem's idmapping
in the remapping algorithm cannot be skipped for such filesystems. More
details with examples can be found in [3].
This patch adds a few new and prepares some already existing low-level
mapping helpers to perform the full translation algorithm explained in
[3]. The low-level helpers can be written in a way that they only
perform the additional translation step when the filesystem is indeed
mounted with an idmapping.
If the low-level helpers detect that they are not dealing with an
idmapped mount they can simply return the relevant k{g,u}id unchanged;
no remapping needs to be performed at all. The no_idmapping() helper
detects whether the shortcut can be used.
If the low-level helpers detect that they are dealing with an idmapped
mount but the underlying filesystem is mounted without an idmapping we
can rely on the previous shortcut and can continue to skip the
translation step from or into the filesystem's idmapping.
These checks guarantee that only the minimal amount of work is
performed. As before, if idmapped mounts aren't used the low-level
helpers are idempotent and no work is performed at all.
This patch adds the helpers mapped_k{g,u}id_fs() and
mapped_k{g,u}id_user(). Following patches will port all places to
replace the old k{g,u}id_into_mnt() and k{g,u}id_from_mnt() with these
two new helpers. After the conversion is done k{g,u}id_into_mnt() and
k{g,u}id_from_mnt() will be removed. This also concludes the renaming of
the mapping helpers we started in [4]. Now, all mapping helpers will
start with the "mapped_" prefix making everything nice and consistent.
The mapped_k{g,u}id_fs() helpers replace the k{g,u}id_into_mnt()
helpers. They are to be used when k{g,u}ids are to be mapped from the
vfs, e.g. from struct inode's i_{g,u}id. Conversely, the
mapped_k{g,u}id_user() helpers replace the k{g,u}id_from_mnt() helpers.
They are to be used when k{g,u}ids are to be written to disk, e.g. when
entering from a system call to change ownership of a file.
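The "from the vfs" direction, including both shortcuts described above,
might look like the following userspace sketch. Everything here is a
model: struct idmapping_model, the single-extent mapping, and the
model_* functions are hypothetical, and the shortcut condition (initial
idmapping on the mount, or mount and filesystem sharing one idmapping) is
an assumption about what no_idmapping() checks.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_INVALID_ID ((unsigned int)-1)

/* Single extent: userspace ids [u_first, u_first + count) map to kernel
 * ids [k_first, k_first + count). "initial" marks the identity mapping. */
struct idmapping_model {
	bool initial;
	unsigned int u_first;
	unsigned int k_first;
	unsigned int count;
};

/* Map a kernel id down to a userspace id in this idmapping. */
static unsigned int model_from_kid(const struct idmapping_model *m, unsigned int kid)
{
	if (m->initial)
		return kid;
	if (kid >= m->k_first && kid - m->k_first < m->count)
		return m->u_first + (kid - m->k_first);
	return MODEL_INVALID_ID;
}

/* Map a userspace id up to a kernel id in this idmapping. */
static unsigned int model_make_kid(const struct idmapping_model *m, unsigned int uid)
{
	if (m->initial)
		return uid;
	if (uid >= m->u_first && uid - m->u_first < m->count)
		return m->k_first + (uid - m->u_first);
	return MODEL_INVALID_ID;
}

/* Model of mapping an id taken from the vfs, e.g. an inode's i_uid. */
static unsigned int model_mapped_kid_fs(const struct idmapping_model *mnt,
					const struct idmapping_model *fs,
					unsigned int kid)
{
	unsigned int uid;

	/* Shortcut: not an idmapped mount, return the id unchanged. */
	if (mnt->initial || mnt == fs)
		return kid;

	/* Skip the filesystem translation when the fs has no idmapping. */
	uid = fs->initial ? kid : model_from_kid(fs, kid);
	if (uid == MODEL_INVALID_ID)
		return MODEL_INVALID_ID;

	return model_make_kid(mnt, uid);
}
```

The extra step through the filesystem's idmapping only runs when the
filesystem itself is mounted with a non-initial idmapping, which matches
the "minimal amount of work" goal stated above.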
This patch only introduces the helpers. It doesn't yet convert the
relevant places to account for filesystems mounted with an idmapping.
[1]: commit 2ca4dcc490 ("fs/mount_setattr: tighten permission checks")
[2]: https://github.com/containers/podman/issues/10374
[3]: Documentation/filesystems/idmappings.rst
[4]: commit a65e58e791 ("fs: document and rename fsid helpers")
Link: https://lore.kernel.org/r/20211123114227.3124056-5-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-5-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-5-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The hypervisor memory pool is sized to allow mapping up to 1GiB of data
in the 'private' range of the hypervisor. However, this is currently
not enforced in any way, which might become a problem as private range
mappings are used more and more (e.g. from pKVM modules).
Enforce the 1GiB limit at allocation time, and while at it, rename
__io_map_base to __private_range_base for consistency.
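Enforcing the limit at allocation time can be sketched as a bump
allocator with a bound check. This is a userspace model with hypothetical
names; only the 1GiB figure comes from the text.

```c
#include <assert.h>

#define MODEL_PRIVATE_RANGE_SIZE (1UL << 30)  /* 1GiB, from the text */

static unsigned long private_range_used_model;

/* Returns 0 and stores the offset on success, -1 if the limit is hit. */
static int private_range_alloc_model(unsigned long size, unsigned long *offset)
{
	/* Written as a subtraction so the check itself cannot overflow. */
	if (size > MODEL_PRIVATE_RANGE_SIZE - private_range_used_model)
		return -1;

	*offset = private_range_used_model;
	private_range_used_model += size;
	return 0;
}
```

Once the range is exhausted, further requests fail instead of silently
mapping more than the memory pool was sized for.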
Bug: 244543039
Change-Id: I32c9145ba331309b49428ff461a41c94ea0c1512
Signed-off-by: Quentin Perret <qperret@google.com>
Parse the devicetree during pKVM init to find nodes with the
"pkvm,protected-region" compatible string. These nodes specify a
physical address range in reg that must always be mapped as invalid in
the host stage-2 page table when running under pKVM.
Example DT:
	pkvm_prot_reg: pkvm_prot_reg@80000000 {
		compatible = "pkvm,protected-region";
		reg = <0x00 0x80000000 0x00 0x200000>;
	};
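To read the example, and assuming the parent node uses #address-cells = <2>
and #size-cells = <2> (not stated in the example), the reg property decodes
to a base of 0x80000000 and a size of 0x200000 (2MiB). A userspace sketch of
that decoding, with hypothetical names and cells already converted to host
byte order:

```c
#include <assert.h>
#include <stdint.h>

struct region_model {
	uint64_t base;
	uint64_t size;
};

/* Combine two 32-bit devicetree cells (high, low) into one 64-bit value. */
static uint64_t cells_to_u64(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Decode reg = <addr_hi addr_lo size_hi size_lo>, assuming 2 + 2 cells. */
static struct region_model decode_reg(const uint32_t cells[4])
{
	struct region_model r = {
		.base = cells_to_u64(cells[0], cells[1]),
		.size = cells_to_u64(cells[2], cells[3]),
	};

	return r;
}
```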
Bug: 244543039
Bug: 244373730
Change-Id: I102cd16c91d96e5283cdd1a4fa58836cc4834eac
Signed-off-by: Quentin Perret <qperret@google.com>
The pKVM memory pool is currently sized to allow page-granularity
mapping in the host stage-2 page-table of all the memory as well as up
to 1GiB of MMIO range. Indeed, pKVM currently assumes that MMIO regions
are completely and solely owned by the host for the entire lifetime of
the system. As such, the pages used to map MMIO regions can always be
recycled to allow forward progress if the memory pool runs out of
pages -- pKVM can unmap MMIO ranges at stage-2 without fear of losing
important information about the state of the underlying page, and those
mappings can always be reconstructed later.
In order to allow transitioning the ownership of non-memory regions,
introduce a concept of pKVM 'moveable' regions, which represent regions
of the physical address space which can be 'moved' from an ownership
perspective. These moveable regions are used to size the hyp memory
pool. In a first step, the list of moveable regions is equal to the
memblock list, but it will be extended in subsequent changes.
No functional changes intended.
Bug: 244543039
Bug: 244373730
Change-Id: I7f451924b1eed9579868e6ff8c7adc7b4a5a0ae1
Signed-off-by: Quentin Perret <qperret@google.com>
The host_get_page_state() logic currently has a baked-in assumption that
it will only be used on memory, and checks against the default memory
permissions to flag pages as having a RESTRICTED_PROT state.
Add support for correctly flagging non-memory pages to prepare the
ground for future patches.
Bug: 244543039
Bug: 244373730
Change-Id: Idaaef96cb98c147c8b793059438064cf770af525
Signed-off-by: Quentin Perret <qperret@google.com>
pKVM uses different default permissions for memory and non-memory
regions of the PA space. To avoid scattering this logic around,
introduce a default_host_prot() helper function.
No functional changes intended.
Bug: 244543039
Bug: 244373730
Change-Id: I36cdbb26a2cb0d54b5641f945f6ede4ffe371045
Signed-off-by: Quentin Perret <qperret@google.com>
pKVM modules may need to be notified in case of unexpected same-level
EL2 exceptions, which result in a hyp panic. To do so, introduce a new
notifier on the hyp_panic path.
Bug: 244373730
Change-Id: I144609a933d648ddf2aebcd950e64d6035bf8be3
Signed-off-by: Quentin Perret <qperret@google.com>
pKVM modules may need to temporarily map large-ish physically contiguous
regions of memory when bootstrapping themselves. In order to support
this use-case, introduce two new APIs in the module_ops struct that allow
modules to map and unmap pages in pKVM's linear map range. Since pKVM's page
ownership infrastructure relies on linear map PTEs, this needs to be
done with special care. To avoid any problem, let's count the number of
pages mapped by modules and ensure they have been unmapped before
reaching the point of deprivilege.
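The bookkeeping described above can be sketched as a simple page counter.
This is a userspace model; the *_model names are hypothetical and do not
correspond to the actual module_ops callbacks.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long module_mapped_pages_model;

/* Record that a module mapped nr_pages in the linear map range. */
static void module_map_model(unsigned long nr_pages)
{
	module_mapped_pages_model += nr_pages;
}

/* Record an unmap; reject attempts to unmap more than was mapped. */
static int module_unmap_model(unsigned long nr_pages)
{
	if (nr_pages > module_mapped_pages_model)
		return -1;
	module_mapped_pages_model -= nr_pages;
	return 0;
}

/* Deprivilege may only proceed once every module mapping is gone. */
static bool can_deprivilege_model(void)
{
	return module_mapped_pages_model == 0;
}
```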
Bug: 244373730
Change-Id: I4aecb93f5c9ba08d9f830d1f0976704688b98509
Signed-off-by: Quentin Perret <qperret@google.com>