crypto_memneq() is one of the utility functions that was intentionally
included in the fips140 module, out of concerns that it would be seen as
"cryptographic" and thus would be required to be included the module for
the FIPS certification. It should not have been removed from the
module, so add it back.
Bug: 188620248
Fixes: 916c8c836f ("ANDROID: FIPS: remove memneq.o from the list of objects for the fips module")
Change-Id: I8a19dfd73390f8c1348885f97fa42d900e47b82b
Signed-off-by: Eric Biggers <ebiggers@google.com>
This reverts commit aeac190b3b.
Bug: 264333547
Test: /data/local/tmp/sebastianene/tests/test_host_app
Change-Id: Id88b705dd725cc8720913fd2909030c2f2fb597f
Signed-off-by: Sebastian Ene <sebastianene@google.com>
This reverts commit 8a57e65659.
Reason for revert: This symbol "PageMovable" should not be used by
any driver and is being removed from upstream.
Bug: 264353898
Bug: 261718474
Cc: Jörg Wagner <jorwag@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I368f30bd4ab1f28480ee37e3ee10df9dbb2f5744
For the entropy analysis, we must provide some output from the Jitter
RNG: a large amount of output from one instance, and a smaller amount of
output from each of a certain number of instances.
The original plan was to use a build of the userspace jitterentropy
library that matches the kernel's jitterentropy_rng as closely as
possible. However, it's now being requested that the output be gotten
from the kernel instead.
Now that fips140_lab_util depends on AF_ALG anyway, it's straightforward
to dump output from jitterentropy_rng instances using AF_ALG.
Therefore, add a command dump_jitterentropy which supports this.
Bug: 188620248
Change-Id: I78eb26250e88f2fc28fc44aa201acbe5b84df8bb
Signed-off-by: Eric Biggers <ebiggers@google.com>
(cherry picked from commit dc01503266)
There is a race window between page_pinner_inited set and the pp_buffer
initialization which cause accessing the pp_buffer->lock. Avoid this by
moving the pp_buffer initialization to page_ext_ops->init() which sets
the page_pinner_inited only after the pp_buffer is initialized.
Race scenario:
1) init_page_pinner is called --> page_pinner_inited is set.
2) __alloc_contig_migrate_range --> __page_pinner_failure_detect()
accesses the pp_buffer->lock(yet to be initialized).
3) Then the pp_buffer is allocated and initialized.
Below is the issue call stack:
spin_bug+0x0
_raw_spin_lock_irqsave+0x3c
__page_pinner_failure_detect+0x110
__alloc_contig_migrate_range+0x1c4
alloc_contig_range+0x130
cma_alloc+0x170
dma_alloc_contiguous+0xa0
__dma_direct_alloc_pages+0x16c
dma_direct_alloc+0x88
Bug: 259024332
Change-Id: I6849ac4d944498b9a431b47cad7adc7903c9bbaa
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Get direct reclaim info.
Bug: 190795589
Signed-off-by: Martin Liu <liumartin@google.com>
Change-Id: Ie66a3c87484a364a918c19b8e044c82f1afd6749
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 12902c9996)
Do not account __GFP_NORETRY allocation failure as cma_alloc_fail
since it's not critical failure(i.e., the caller with __GFP_NORETRY
should always carry on the fallback plan). It's also good for
compatibility POV with upstream since upstream cma_alloc_fail
only counts cma_alloc_fail with !__GFP_NORETRY since upstream
doesn't support __GFP_NORTRY yet.
Bug: 220669548
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I377e6b033c3786e10b6b1c814037a4fc40e20a73
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 8ffc7ff817)
Export cma_get_size to tell cma instance's size, which is needed
to allocate entire pages of the cma.
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ifb2769f60250ce605236342b950907218e1c28a5
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 7a44906686)
Do not sleep for retrying for __GFP_NORERY since it's failfast
mode approach. User could retry the allocation without the flag
by themselves if they see the failure.
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ic6a857978fda8e353b9ed770d1e0ba1808fd201e
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 12f48605e8)
alloc_contig_range is supposed to work on max(MAX_ORDER_NR_PAGES,
or pageblock_nr_pages) granularity aligned range. If it fails
at a page and return error to user, user doesn't know what page
makes the allocation failure and keep retrying another allocation
with new range including the failed page and encountered error
again and again until it could escape the out of the granularity
block. Instead, let's make CMA aware of what pfn was troubled in
previous trial and then continue to work new pageblock out of the
failed page so it doesn't see the repeated error repeatedly.
Currently, this option works for only __GFP_NORETRY case for
safe for existing CMA users.
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I0959c9df3d4b36408a68920abbb4d52d31026079
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 0e688e972d)
lru_cache_disable is not trivial cost since it should run work
from every cores in the system. Thus, repeated call of the
function whenever alloc_contig_range in the cma's allocation loop
is called is expensive.
This patch makes the lru_cache_disable smarter in that it will
not run __lru_add_drain_all since it knows the cache was already
disabled by someone else.
With that, user of alloc_contig_range can disable the lru cache
in advance in their context so that subsequent alloc_contig_range
for user's operation will avoid the costly function call.
This patch moves lru_cache APIs from swap.h to swap.c and export
it for vendor users.
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I23da8599c55db49dc80226285972e4cd80dedcff
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit c8578a3e90)
Currently, alloc_contig_range expects that even though a page fails
with -EBUSY from __alloc_contig_migrate_range, it want to check
those failed pages in test_pages_isolated again with hope that
those page would be freed soon so cma allocatoin would be succeeded.
However, it depends on the luck and I found sometimes test_page_isolated
constantly fails at the page repeatedly whenever cma_alloc retried.
Rather than burning out CPU to check the page's status in
test_pages_isolated for GFP_NORETRY allocation, just bail out and
relies on the user what they want to do.
Currently, this option works for only __GFP_NORETRY case for safe
of existing other users.
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I9211452be06960dc7d8f854537e53b3fc5262c8e
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit c01ce3b5ef)
alloc_contig_range is the core worker function for CMA allocation
so it has every information to be able to understand allocation
latency. For example, how many pages are migrated, how many time
unmap was needed to migrate pages, how many times it encountered
errors by some reasons.
This patch adds such statistics in the alloc_contig_range and
return it to user so user can use those information to analyize
latency. The cma_alloc is first user for the statistics, which
export the statistics as new trace event(i.e., cma_alloc_info).
It was really usefuli to optimize cma allocation work.
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I7be43cc89d11078e2a324d2d06aada6d8e9e1cc9
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 675e504598)
Since CMA is getting used more widely, it's more important to
keep monitoring CMA statistics for system health since it's
directly related to user experience.
This feature introduces sysfs statistics for CMA, in order to provide
some basic monitoring of the CMA allocator.
* the number of CMA page successful allocations
* the number of CMA page allocation failures
These two values allow the user to calculate the allocation
failure rate for each CMA area.
Bug: 179256052
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I5c8dc58a5d195d2e1b2e25628545f7d2a9c3b7df
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit f45afb4508)
GKI has CONFIG_DYNAMIC_DEBUG_CORE. Thus, to enable only the
specific alloc_contig_dump_pages without needing all pr_debug
in every source files is using -DCONFIG_DYNAMIC_MODULE
when the page_alloc.o compiled.
Bug: 182195592
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I93266eb4161b3653389c737d4c767fd5d1083339
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 8d03e49505)
Contiguous memory allocation can be stalled due to waiting
on page writeback and/or page lock which causes unpredictable
delay. It's a unavoidable cost for the requestor to get *big*
contiguous memory but it's expensive for *small* contiguous
memory(e.g., order-4) because caller could retry the request
in different range where would have easy migratable pages
without stalling.
This patch introduce __GFP_NORETRY as compaction gfp_mask in
alloc_contig_range so it will fail fast without blocking
when it encounters pages needed waiting.
Bug: 170340257
Bug: 120293424
Link: https://lore.kernel.org/linux-mm/YAnM5PbNJZlk%2F%2FiX@google.com/T/#m1362218ebb69e6e10c20d9361008b079745c4e6f
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I42ba8dd5aeb065d936978ab205e4baf84bf9a321
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 20512940b8)
The upcoming patch will introduce __GFP_NORETRY semantic
in alloc_contig_range which is a failfast mode of the API.
Instead of adding a additional parameter for gfp, replace
no_warn with gfp flag.
To keep old behaviors, it follows the rule below.
no_warn gfp_flags
false GFP_KERNEL
true GFP_KERNEL|__GFP_NOWARN
gfp & __GFP_NOWARN GFP_KERNEL | (gfp & __GFP_NOWARN)
Bug: 170340257
Bug: 120293424
Link: https://lore.kernel.org/linux-mm/YAnM5PbNJZlk%2F%2FiX@google.com/T/#m36b144ff81fe0a8f0ecaf6813de4819ecc41f8fe
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I1ce020ab5d5fff34eb6464be4632ddef72fb43eb
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 23ba990a3e)
With below two cases, it will cause NULL pointer dereference when
accessing SM_I(sbi)->fcc_info in f2fs_issue_flush().
a) If kthread_run() fails in f2fs_create_flush_cmd_control(), it will
release SM_I(sbi)->fcc_info,
- mount -o noflush_merge /dev/vda /mnt/f2fs
- mount -o remount,flush_merge /dev/vda /mnt/f2fs -- kthread_run() fails
- dd if=/dev/zero of=/mnt/f2fs/file bs=4k count=1 conv=fsync
b) we will never allocate memory for SM_I(sbi)->fcc_info w/ below
testcase,
- mount -o ro /dev/vda /mnt/f2fs
- mount -o rw,remount /dev/vda /mnt/f2fs
- dd if=/dev/zero of=/mnt/f2fs/file bs=4k count=1 conv=fsync
In order to fix this issue, let change as below:
- fix error path handling in f2fs_create_flush_cmd_control().
- allocate SM_I(sbi)->fcc_info even if readonly is on.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
commit 168f912893 upstream.
When calling setattr_prepare() to determine the validity of the
attributes the ia_{g,u}id fields contain the value that will be written
to inode->i_{g,u}id. This is exactly the same for idmapped and
non-idmapped mounts and allows callers to pass in the values they want
to see written to inode->i_{g,u}id.
When group ownership is changed a caller whose fsuid owns the inode can
change the group of the inode to any group they are a member of. When
searching through the caller's groups we need to use the gid mapped
according to the idmapped mount otherwise we will fail to change
ownership for unprivileged users.
Consider a caller running with fsuid and fsgid 1000 using an idmapped
mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file
owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the
idmapped mount.
The caller now requests the gid of the file to be changed to 1000 going
through the idmapped mount. In the vfs we will immediately map the
requested gid to the value that will need to be written to inode->i_gid
and place it in attr->ia_gid. Since this idmapped mount maps 65534 to
1000 we place 65534 in attr->ia_gid.
When we check whether the caller is allowed to change group ownership we
first validate that their fsuid matches the inode's uid. The
inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount.
Since the caller's fsuid is 1000 we pass the check.
We now check whether the caller is allowed to change inode->i_gid to the
requested gid by calling in_group_p(). This will compare the passed in
gid to the caller's fsgid and search the caller's additional groups.
Since we're dealing with an idmapped mount we need to pass in the gid
mapped according to the idmapped mount. This is akin to checking whether
a caller is privileged over the future group the inode is owned by. And
that needs to take the idmapped mount into account. Note, all helpers
are nops without idmapped mounts.
New regression test sent to xfstests.
Link: https://github.com/lxc/lxd/issues/10537
Link: https://lore.kernel.org/r/20220613111517.2186646-1-brauner@kernel.org
Fixes: 2f221d6f7b ("attr: handle idmapped mounts")
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org # 5.15+
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 705191b03d upstream.
Last cycle we extended the idmapped mounts infrastructure to support
idmapped mounts of idmapped filesystems (No such filesystem yet exist.).
Since then, the meaning of an idmapped mount is a mount whose idmapping
is different from the filesystems idmapping.
While doing that work we missed to adapt the acl translation helpers.
They still assume that checking for the identity mapping is enough. But
they need to use the no_idmapping() helper instead.
Note, POSIX ACLs are always translated right at the userspace-kernel
boundary using the caller's current idmapping and the initial idmapping.
The order depends on whether we're coming from or going to userspace.
The filesystem's idmapping doesn't matter at the border.
Consequently, if a non-idmapped mount is passed we need to make sure to
always pass the initial idmapping as the mount's idmapping and not the
filesystem idmapping. Since it's irrelevant here it would yield invalid
ids and prevent setting acls for filesystems that are mountable in a
userns and support posix acls (tmpfs and fuse).
I verified the regression reported in [1] and verified that this patch
fixes it. A regression test will be added to xfstests in parallel.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
Fixes: bd303368b7 ("fs: support mapped mounts of mapped filesystems")
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org> # 5.15+
Cc: <regressions@lists.linux.dev>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bd303368b7 upstream.
In previous patches we added new and modified existing helpers to handle
idmapped mounts of filesystems mounted with an idmapping. In this final
patch we convert all relevant places in the vfs to actually pass the
filesystem's idmapping into these helpers.
With this the vfs is in shape to handle idmapped mounts of filesystems
mounted with an idmapping. Note that this is just the generic
infrastructure. Actually adding support for idmapped mounts to a
filesystem mountable with an idmapping is follow-up work.
In this patch we extend the definition of an idmapped mount from a mount
that that has the initial idmapping attached to it to a mount that has
an idmapping attached to it which is not the same as the idmapping the
filesystem was mounted with.
As before we do not allow the initial idmapping to be attached to a
mount. In addition this patch prevents that the idmapping the filesystem
was mounted with can be attached to a mount created based on this
filesystem.
This has multiple reasons and advantages. First, attaching the initial
idmapping or the filesystem's idmapping doesn't make much sense as in
both cases the values of the i_{g,u}id and other places where k{g,u}ids
are used do not change. Second, a user that really wants to do this for
whatever reason can just create a separate dedicated identical idmapping
to attach to the mount. Third, we can continue to use the initial
idmapping as an indicator that a mount is not idmapped allowing us to
continue to keep passing the initial idmapping into the mapping helpers
to tell them that something isn't an idmapped mount even if the
filesystem is mounted with an idmapping.
Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1ac2a41049 upstream.
Currently we only support idmapped mounts for filesystems mounted
without an idmapping. This was a conscious decision mentioned in
multiple places (cf. e.g. [1]).
As explained at length in [3] it is perfectly fine to extend support for
idmapped mounts to filesystem's mounted with an idmapping should the
need arise. The need has been there for some time now. Various container
projects in userspace need this to run unprivileged and nested
unprivileged containers (cf. [2]).
Before we can port any filesystem that is mountable with an idmapping to
support idmapped mounts we need to first extend the mapping helpers to
account for the filesystem's idmapping. This again, is explained at
length in our documentation at [3] but I'll give an overview here again.
Currently, the low-level mapping helpers implement the remapping
algorithms described in [3] in a simplified manner. Because we could
rely on the fact that all filesystems supporting idmapped mounts are
mounted without an idmapping the translation step from or into the
filesystem idmapping could be skipped.
In order to support idmapped mounts of filesystem's mountable with an
idmapping the translation step we were able to skip before cannot be
skipped anymore. A filesystem mounted with an idmapping is very likely
to not use an identity mapping and will instead use a non-identity
mapping. So the translation step from or into the filesystem's idmapping
in the remapping algorithm cannot be skipped for such filesystems. More
details with examples can be found in [3].
This patch adds a few new and prepares some already existing low-level
mapping helpers to perform the full translation algorithm explained in
[3]. The low-level helpers can be written in a way that they only
perform the additional translation step when the filesystem is indeed
mounted with an idmapping.
If the low-level helpers detect that they are not dealing with an
idmapped mount they can simply return the relevant k{g,u}id unchanged;
no remapping needs to be performed at all. The no_idmapping() helper
detects whether the shortcut can be used.
If the low-level helpers detected that they are dealing with an idmapped
mount but the underlying filesystem is mounted without an idmapping we
can rely on the previous shorcut and can continue to skip the
translation step from or into the filesystem's idmapping.
These checks guarantee that only the minimal amount of work is
performed. As before, if idmapped mounts aren't used the low-level
helpers are idempotent and no work is performed at all.
This patch adds the helpers mapped_k{g,u}id_fs() and
mapped_k{g,u}id_user(). Following patches will port all places to
replace the old k{g,u}id_into_mnt() and k{g,u}id_from_mnt() with these
two new helpers. After the conversion is done k{g,u}id_into_mnt() and
k{g,u}id_from_mnt() will be removed. This also concludes the renaming of
the mapping helpers we started in [4]. Now, all mapping helpers will
started with the "mapped_" prefix making everything nice and consistent.
The mapped_k{g,u}id_fs() helpers replace the k{g,u}id_into_mnt()
helpers. They are to be used when k{g,u}ids are to be mapped from the
vfs, e.g. from from struct inode's i_{g,u}id. Conversely, the
mapped_k{g,u}id_user() helpers replace the k{g,u}id_from_mnt() helpers.
They are to be used when k{g,u}ids are to be written to disk, e.g. when
entering from a system call to change ownership of a file.
This patch only introduces the helpers. It doesn't yet convert the
relevant places to account for filesystem mounted with an idmapping.
[1]: commit 2ca4dcc490 ("fs/mount_setattr: tighten permission checks")
[2]: https://github.com/containers/podman/issues/10374
[3]: Documentations/filesystems/idmappings.rst
[4]: commit a65e58e791 ("fs: document and rename fsid helpers")
Link: https://lore.kernel.org/r/20211123114227.3124056-5-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-5-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-5-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The hypervisor memory pool is sized to allow mapping up to 1GiB of data
in the 'private' range of the hypervisor. However, this is currently
not enforced in any way, which might become a problem as private range
mappings are used more and more (e.g. from pKVM modules).
Enforce the 1GiB limit at allocation time, and while at it, rename
__io_map_base to __private_range_base for consistency.
Bug: 244543039
Change-Id: I32c9145ba331309b49428ff461a41c94ea0c1512
Signed-off-by: Quentin Perret <qperret@google.com>