Patch series "mm/mprotect: Fix soft-dirty checks", v4.
This patch (of 3):
The check wanted to make sure when soft-dirty tracking is enabled we won't
grant write bit by accident, as a page fault is needed for dirty tracking.
The intention is correct but we didn't check it right because
VM_SOFTDIRTY set actually means soft-dirty tracking disabled. Fix it.
Another tricky aspect of soft-dirty is that we can't check the vma flag
with !(vma_flags & VM_SOFTDIRTY) directly; we may only do so after
checking CONFIG_MEM_SOFT_DIRTY, because otherwise VM_SOFTDIRTY is
defined as zero and !(vma_flags & VM_SOFTDIRTY) always evaluates to
true. To avoid misuse, introduce a helper for checking whether a vma has
soft-dirty tracking enabled.
We can easily verify this with any exclusive anonymous page, for example
with the program below:
=======8<======
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>

#define BIT_ULL(nr)   (1ULL << (nr))
#define PM_SOFT_DIRTY BIT_ULL(55)

unsigned int psize;
char *page;

/* Read the 64-bit pagemap entry for vaddr (the shift assumes 4 KiB pages). */
uint64_t pagemap_read_vaddr(int fd, void *vaddr)
{
    uint64_t value;
    int ret;

    ret = pread(fd, &value, sizeof(uint64_t),
                ((uint64_t)vaddr >> 12) * sizeof(uint64_t));
    assert(ret == sizeof(uint64_t));

    return value;
}

/* Writing "4" to clear_refs clears the soft-dirty bits of this process. */
void clear_refs_write(void)
{
    int fd = open("/proc/self/clear_refs", O_RDWR);

    assert(fd >= 0);
    write(fd, "4", 2);
    close(fd);
}

/* Uses the pagemap fd and the page pointer from the caller's scope. */
#define check_soft_dirty(str, expect) do {                              \
        bool dirty = pagemap_read_vaddr(fd, page) & PM_SOFT_DIRTY;      \
        if (dirty != expect) {                                          \
            printf("ERROR: %s, soft-dirty=%d (expect: %d)\n",           \
                   str, dirty, expect);                                 \
            exit(-1);                                                   \
        }                                                               \
    } while (0)

int main(void)
{
    int fd = open("/proc/self/pagemap", O_RDONLY);

    assert(fd >= 0);
    psize = getpagesize();
    page = mmap(NULL, psize, PROT_READ|PROT_WRITE,
                MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    assert(page != MAP_FAILED);

    *page = 1;
    check_soft_dirty("Just faulted in page", 1);
    clear_refs_write();
    check_soft_dirty("Clear_refs written", 0);
    mprotect(page, psize, PROT_READ);
    check_soft_dirty("Marked RO", 0);
    mprotect(page, psize, PROT_READ|PROT_WRITE);
    check_soft_dirty("Marked RW", 0);
    *page = 2;
    check_soft_dirty("Wrote page again", 1);

    munmap(page, psize);
    close(fd);
    printf("Test passed.\n");

    return 0;
}
=======8<======
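The pitfall with a flag that compiles to zero can be illustrated with a small userspace sketch. The names TOY_MEM_SOFT_DIRTY, TOY_VM_SOFTDIRTY, toy_naive_check() and toy_vma_soft_dirty_enabled() are stand-ins invented for this example, and the bit value is arbitrary; the kernel helper works on a real vma and may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for CONFIG_MEM_SOFT_DIRTY; build with 0 to model a kernel without it. */
#ifndef TOY_MEM_SOFT_DIRTY
#define TOY_MEM_SOFT_DIRTY 1
#endif

/* The flag collapses to zero when the feature is compiled out. */
#if TOY_MEM_SOFT_DIRTY
#define TOY_VM_SOFTDIRTY 0x08000000UL   /* arbitrary bit chosen for the sketch */
#else
#define TOY_VM_SOFTDIRTY 0UL
#endif

/* Naive check: always true when TOY_VM_SOFTDIRTY is 0, which is the misuse to avoid. */
static bool toy_naive_check(unsigned long vma_flags)
{
    return !(vma_flags & TOY_VM_SOFTDIRTY);
}

/* Helper: test the config first, then the flag. Flag set means tracking is disabled. */
static bool toy_vma_soft_dirty_enabled(unsigned long vma_flags)
{
    if (!TOY_MEM_SOFT_DIRTY)
        return false;
    return !(vma_flags & TOY_VM_SOFTDIRTY);
}
```

Building the sketch with -DTOY_MEM_SOFT_DIRTY=0 makes toy_naive_check() return true for every input, while the helper returns false.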
Here we attach a Fixes tag to commit 64fe24a3e0 only for easy tracking, as
this patch won't apply to a tree before that point. However, that commit
wasn't the source of the problem; 64e455079e was. It's just that after
64fe24a3e0 anonymous memory also suffers from this problem with
mprotect().
Link: https://lkml.kernel.org/r/20220725142048.30450-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220725142048.30450-2-peterx@redhat.com
Fixes: 64e455079e ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
Fixes: 64fe24a3e0 ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The basic interaction for setting up a userfaultfd is that userspace issues
a UFFDIO_API ioctl and passes in a set of zero or more feature flags,
indicating the features it would prefer to use.
Of course, different kernels may support different sets of features
(depending on kernel version, kconfig options, architecture, etc).
Userspace's expectations may also not match: perhaps it was built
against newer kernel headers, which defined some features the kernel
it's running on doesn't support.
Currently, if userspace passes in a flag we don't recognize, the
initialization fails and we return -EINVAL. This isn't great, though.
Userspace doesn't have an obvious way to react to this; sure, one of the
features I asked for was unavailable, but which one? The only option it
has is to turn off things "at random" and hope something works.
Instead, modify UFFDIO_API to just ignore any unrecognized feature
flags. The interaction is now that the initialization will succeed, and
as always we return the *subset* of feature flags that can actually be
used back to userspace.
Now userspace has an obvious way to react: it checks if any flags it
asked for are missing. If so, it can conclude this kernel doesn't
support those, and it can either resign itself to not using them, or
fail with an error on its own, or whatever else.
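A minimal sketch of how userspace might react under the new behaviour, assuming it already has the requested and returned feature masks in hand. The helper names uffd_missing_features() and uffd_check_features(), and the split into mandatory and optional features, are hypothetical and only illustrate one possible policy.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the feature bits userspace asked for and the
 * bits the kernel reported back, return the ones that were requested but
 * are not available.
 */
static uint64_t uffd_missing_features(uint64_t requested, uint64_t granted)
{
    return requested & ~granted;
}

/*
 * Hypothetical policy helper: the caller marks some requested features as
 * mandatory; everything else is optional and can be dropped silently.
 * Returns 0 if the caller can proceed, -1 if a mandatory feature is missing.
 */
static int uffd_check_features(uint64_t requested, uint64_t mandatory,
                               uint64_t granted)
{
    uint64_t missing = uffd_missing_features(requested, granted);

    assert((mandatory & ~requested) == 0);  /* mandatory must be a subset */
    return (missing & mandatory) ? -1 : 0;
}
```

With the real interface, requested would be the value stored in the features field of struct uffdio_api before the UFFDIO_API ioctl, and granted the value the kernel wrote back into that field.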
Link: https://lkml.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We can use unlock label to unlock ptl and return ret directly to remove
the unneeded out label and reduce the size of mempolicy.o. No functional
change intended.
[Before]
text data bss dec hex filename
26702 3972 6168 36842 8fea mm/mempolicy.o
[After]
text data bss dec hex filename
26662 3972 6168 36802 8fc2 mm/mempolicy.o
Link: https://lkml.kernel.org/r/20220719115233.6706-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zs_malloc returns 0 if it fails. zs_zpool_malloc will return -1 when
zs_malloc returns 0. But -1 makes the return value unclear.
For example, when zswap_frontswap_store calls zs_malloc through
zs_zpool_malloc, it will return -1 to its caller, while its other return
values are -EINVAL, -ENODEV or some other error code.
This commit changes zs_malloc to return ERR_PTR on failure. It doesn't
simply make zs_zpool_malloc return -ENOMEM because zs_malloc has two types
of failure:
- size is not OK: return -EINVAL
- memory allocation fails: return -ENOMEM
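The idea of encoding an errno into the returned handle can be sketched in userspace as follows. This is not the kernel's ERR_PTR implementation; toy_err_handle(), toy_is_err(), toy_handle_err(), toy_malloc() and the TOY_* limits are names and values made up for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_ERRNO 4095      /* largest errno that may be encoded */
#define TOY_MAX_SIZE  4096      /* arbitrary size limit for this sketch */

/* Encode a negative errno into a handle value. */
static uintptr_t toy_err_handle(long err)
{
    assert(err < 0 && err >= -TOY_MAX_ERRNO);
    return (uintptr_t)err;
}

/* The top TOY_MAX_ERRNO values of the handle space are reserved for errors. */
static int toy_is_err(uintptr_t handle)
{
    return handle > UINTPTR_MAX - TOY_MAX_ERRNO;
}

/* Recover the negative errno from an error handle. */
static long toy_handle_err(uintptr_t handle)
{
    return (long)(intptr_t)handle;
}

/* Toy allocator with the two failure modes listed above. */
static uintptr_t toy_malloc(size_t size)
{
    void *p;

    if (size == 0 || size > TOY_MAX_SIZE)
        return toy_err_handle(-EINVAL);     /* size is not OK */

    p = malloc(size);
    if (!p)
        return toy_err_handle(-ENOMEM);     /* memory allocation failed */

    return (uintptr_t)p;
}
```

A caller can then forward toy_handle_err(handle) unchanged instead of collapsing every failure into -1.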
Link: https://lkml.kernel.org/r/20220714080757.12161-1-teawater@gmail.com
Signed-off-by: Hui Zhu <teawater@antgroup.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On a system (Huawei Ascend ARM64 SoC) using HBM, when a multi-bit ECC error
occurs the BIOS marks the corresponding area (for example, 2 MB) as
unusable. On the next restart, these areas are either not reported or
reported as EFI_UNUSABLE_MEMORY. Both cases lead to an increase in the
number of memblocks, with EFI_UNUSABLE_MEMORY leading to the larger
number.
For example, if the EFI_UNUSABLE_MEMORY type is reported:
...
memory[0x92] [0x0000200834a00000-0x0000200835bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
memory[0x93] [0x0000200835c00000-0x0000200835dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x94] [0x0000200835e00000-0x00002008367fffff], 0x0000000000a00000 bytes on node 7 flags: 0x0
memory[0x95] [0x0000200836800000-0x00002008369fffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x96] [0x0000200836a00000-0x0000200837bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
memory[0x97] [0x0000200837c00000-0x0000200837dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x98] [0x0000200837e00000-0x000020087fffffff], 0x0000000048200000 bytes on node 7 flags: 0x0
memory[0x99] [0x0000200880000000-0x0000200bcfffffff], 0x0000000350000000 bytes on node 6 flags: 0x0
memory[0x9a] [0x0000200bd0000000-0x0000200bd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9b] [0x0000200bd0200000-0x0000200bd07fffff], 0x0000000000600000 bytes on node 6 flags: 0x0
memory[0x9c] [0x0000200bd0800000-0x0000200bd09fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9d] [0x0000200bd0a00000-0x0000200fcfffffff], 0x00000003ff600000 bytes on node 6 flags: 0x0
memory[0x9e] [0x0000200fd0000000-0x0000200fd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9f] [0x0000200fd0200000-0x0000200fffffffff], 0x000000002fe00000 bytes on node 6 flags: 0x0
...
The EFI memory map is parsed to construct the memblock arrays before the
memblock arrays can be resized. As a result, memory regions beyond
INIT_MEMBLOCK_REGIONS are lost.
Add a new macro INIT_MEMBLOCK_MEMORY_REGIONS to replace
INIT_MEMBLOCK_REGIONS to define the size of the static memblock.memory
array.
Allow overriding the memblock.memory array size with an architecture-defined
INIT_MEMBLOCK_MEMORY_REGIONS and make arm64 set
INIT_MEMBLOCK_MEMORY_REGIONS to 1024 when CONFIG_EFI is enabled.
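The override mechanism can be sketched as an #ifndef fallback. The default of 128 for INIT_MEMBLOCK_REGIONS, the struct toy_region type and the toy_* names are assumptions made for this example rather than a copy of the kernel sources.

```c
#include <assert.h>
#include <stddef.h>

#define INIT_MEMBLOCK_REGIONS 128   /* assumed default for this sketch */

/*
 * An architecture header included earlier may already have defined
 * INIT_MEMBLOCK_MEMORY_REGIONS (for example to 1024); otherwise fall
 * back to the generic value.
 */
#ifndef INIT_MEMBLOCK_MEMORY_REGIONS
#define INIT_MEMBLOCK_MEMORY_REGIONS INIT_MEMBLOCK_REGIONS
#endif

/* Toy region descriptor; the real structure carries more fields. */
struct toy_region {
    unsigned long long base;
    unsigned long long size;
};

/* Static array sized by the overridable macro. */
static struct toy_region toy_memory_regions[INIT_MEMBLOCK_MEMORY_REGIONS];

/* Number of regions that fit before the array would need resizing. */
static size_t toy_memory_region_capacity(void)
{
    return sizeof(toy_memory_regions) / sizeof(toy_memory_regions[0]);
}
```

Compiling with -DINIT_MEMBLOCK_MEMORY_REGIONS=1024 models the arm64 override without touching the generic definition.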
Link: https://lkml.kernel.org/r/20220615102742.96450-1-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Darren Hart <darren@os.amperecomputing.com>
Acked-by: Will Deacon <will@kernel.org> [arm64]
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Xu Qiang <xuqiang36@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The number of scanned pages can be lower than the number of isolated pages
when isolating migratable or free pageblocks. The metric is reported in
trace events and also used in vmstat.
Some example output from trace, showing that nr_taken can be greater
than nr_scanned:
Produced by kernel v5.19-rc6
kcompactd0-42 [001] ..... 1210.268022: mm_compaction_isolate_migratepages: range=(0x107ae4 ~ 0x107c00) nr_scanned=265 nr_taken=255
[...]
kcompactd0-42 [001] ..... 1210.268382: mm_compaction_isolate_freepages: range=(0x215800 ~ 0x215a00) nr_scanned=13 nr_taken=128
kcompactd0-42 [001] ..... 1210.268383: mm_compaction_isolate_freepages: range=(0x215600 ~ 0x215680) nr_scanned=1 nr_taken=128
mm_compaction_isolate_migratepages does not seem to have this
behaviour, but for consistency nr_scanned should be handled the same
way on that side.
This behaviour is confusing because the count of isolated pages currently
takes compound pages into account while the count of scanned pages does
not. Given that the number of isolated pages (nr_taken) reported in the
mm_compaction_isolate_template trace event is on a single-page basis, the
ambiguity in the reported number of scanned pages can be removed by
also including the compound page count.
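As a rough model of the accounting described above, the sketch below counts both metrics on a single-page basis. The struct toy_page and struct toy_counts types and toy_scan() are invented for the illustration and do not mirror the compaction code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a page or compound page met by the scanner. */
struct toy_page {
    unsigned int order;   /* 0 for a single page, >0 for a compound page */
    int isolated;         /* nonzero if the scanner managed to isolate it */
};

struct toy_counts {
    unsigned long nr_scanned;
    unsigned long nr_taken;
};

/*
 * Count both metrics on a single-page basis: a compound page of order N
 * contributes 1 << N to nr_scanned, and also to nr_taken if isolated.
 */
static struct toy_counts toy_scan(const struct toy_page *pages, size_t n)
{
    struct toy_counts c = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        unsigned long nr = 1UL << pages[i].order;

        c.nr_scanned += nr;
        if (pages[i].isolated)
            c.nr_taken += nr;
    }
    return c;
}
```

With this accounting nr_taken can never exceed nr_scanned, which is the property the trace output above violates.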
Link: https://lkml.kernel.org/r/20220711202806.22296-1-william.lam@bytedance.com
Signed-off-by: William Lam <william.lam@bytedance.com>
Reviewed-by: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The test va_128TBswitch.c exercises a feature only supported on PPC and
x86_64, but it's run on other 64-bit archs as well. Before this patch,
the test did nothing and returned 0 for KSFT_PASS. This patch makes it
return the KSFT codes from kselftest.h, including KSFT_SKIP when
appropriate.
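The exit-code mapping can be sketched as below. The TOY_KSFT_* constants are local copies of the values from kselftest.h, and toy_arch_supported() and toy_exit_code() are hypothetical helpers; the way the real test detects support may differ from the compiler macros assumed here.

```c
#include <assert.h>

/* Local copies of the exit codes; the real definitions live in kselftest.h. */
#define TOY_KSFT_PASS 0
#define TOY_KSFT_FAIL 1
#define TOY_KSFT_SKIP 4

/* Nonzero when built for an architecture the test is meant to exercise. */
static int toy_arch_supported(void)
{
#if defined(__powerpc64__) || defined(__x86_64__)
    return 1;
#else
    return 0;
#endif
}

/* Map the outcome to an exit code instead of returning 0 unconditionally. */
static int toy_exit_code(int supported, int test_passed)
{
    if (!supported)
        return TOY_KSFT_SKIP;
    return test_passed ? TOY_KSFT_PASS : TOY_KSFT_FAIL;
}
```

A main() built on this would end with return toy_exit_code(toy_arch_supported(), passed), so that an arm64 run is reported as skipped rather than passed.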
Verified on arm64 and x86_64.
Link: https://lkml.kernel.org/r/20220704123813.427625-1-adam@wowsignal.io
Signed-off-by: Adam Sindelar <adam@wowsignal.io>
Cc: David Vernet <void@manifault.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yafang Shao reported an issue related to the accounting of bpf memory:
if a bpf map is charged indirectly for memory consumed from an
interrupt context and allocations are enforced, MEMCG_MAX events are
not raised.
It is less of an issue in the generic case because subsequent
allocations from a process context will trigger direct reclaim and
MEMCG_MAX events will be raised. However, a bpf map can belong to a
dying/abandoned memory cgroup, so there will be no allocations from a
process context and no MEMCG_MAX events will be triggered.
Link: https://lkml.kernel.org/r/20220702033521.64630-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reported-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since the beginning, charged has been set to 0 to avoid calling
vm_unacct_memory twice, because vm_unacct_memory would be called by the
unmap_region call above. But since commit 4f74d2c8e8 ("vm: remove
'nr_accounted' calculations from the unmap_vmas() interfaces"),
unmap_region doesn't call vm_unacct_memory anymore. So charged shouldn't be
set to 0 now; otherwise the paired vm_unacct_memory call is missed, which
leads to imbalanced accounting.
Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com
Fixes: 4f74d2c8e8 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Do not record a pointer to a VMA outside of the mmap_lock for later use.
This is unsafe, and there are a number of failure paths during setup in
which the VMA may be freed *after* its pointer has been recorded. There is
no callback to the driver to clear the saved pointer from generic mm code.
Furthermore, the VMA pointer may become stale if any number of VMA
operations end up freeing the VMA, so saving it was fragile to begin with.
Instead, change the binder_alloc struct to record the start address of the
VMA and use vma_lookup() to get the vma when needed. Add lockdep
mmap_lock checks on updates to the vma pointer to ensure the lock is held
and depend on that lock for synchronization of readers and writers - which
was already the case anyways, so the smp_wmb()/smp_rmb() was not
necessary.
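The difference between caching a pointer and caching an address can be shown with a toy model. Everything here (struct toy_vma, toy_vmas, toy_vma_lookup(), struct toy_alloc, toy_alloc_get_vma()) is invented for the illustration, and locking is left out entirely; in the real code the lookup is only valid while the mmap_lock is held.

```c
#include <assert.h>
#include <stddef.h>

/* Toy mapping; in_use models whether the mapping still exists. */
struct toy_vma {
    unsigned long start;
    unsigned long end;    /* exclusive */
    int in_use;
};

#define TOY_MAX_VMAS 4
static struct toy_vma toy_vmas[TOY_MAX_VMAS];

/* Return the mapping containing addr, or NULL if none exists any more. */
static struct toy_vma *toy_vma_lookup(unsigned long addr)
{
    for (size_t i = 0; i < TOY_MAX_VMAS; i++) {
        struct toy_vma *v = &toy_vmas[i];

        if (v->in_use && addr >= v->start && addr < v->end)
            return v;
    }
    return NULL;
}

/* What the allocator keeps: a start address, not a pointer. */
struct toy_alloc {
    unsigned long vma_addr;
};

/* Look the mapping up on every use, so a freed mapping yields NULL. */
static struct toy_vma *toy_alloc_get_vma(const struct toy_alloc *alloc)
{
    if (!alloc->vma_addr)
        return NULL;
    return toy_vma_lookup(alloc->vma_addr);
}
```

Once the mapping goes away, toy_alloc_get_vma() returns NULL instead of handing back a stale pointer.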
[akpm@linux-foundation.org: fix drivers/android/binder_alloc_selftest.c]
Link: https://lkml.kernel.org/r/20220621140212.vpkio64idahetbyf@revolver
Fixes: da1b9564e8 ("android: binder: fix the race mmap and alloc_new_buf_locked")
Reported-by: syzbot+58b51ac2b04e388ab7b0@syzkaller.appspotmail.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Martijn Coenen <maco@android.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Todd Kjos <tkjos@android.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
nfsd_setattr() now sets a security label if provided, and nfsv4 provides
it in the 'open' and 'create' paths and the 'setattr' path.
If setting the label failed (including because the kernel doesn't
support labels), an error field in 'struct nfsd_attrs' is set, and the
caller can respond. The open/create callers clear
FATTR4_WORD2_SECURITY_LABEL in the returned attr set in this case.
The setattr caller returns the error.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The NFS protocol includes attributes when creating symlinks.
Linux does store attributes for symlinks and allows them to be set,
though they are not used for permission checking.
NFSD currently doesn't set standard (struct iattr) attributes when
creating symlinks, but for NFSv4 it does set ACLs and security labels.
This is inconsistent.
To improve consistency, pass the provided attributes into nfsd_symlink()
and call nfsd_create_setattr() to set them.
NOTE: this results in a behaviour change for all NFS versions when the
client sends non-default attributes with a SYMLINK request. With the
Linux client, the only attributes are:
attr.ia_mode = S_IFLNK | S_IRWXUGO;
attr.ia_valid = ATTR_MODE;
so the final outcome will be unchanged. Other clients might send
different attributes, and if they do they probably expect them to be
honoured.
We ignore any error from nfsd_create_setattr(). It isn't really clear
what should be done if a file is successfully created, but the
attributes cannot be set. NFS doesn't allow partial success to be
reported. Reporting failure is probably more misleading than reporting
success, so the status is ignored.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The attributes that nfsd might want to set on a file include 'struct
iattr' as well as an ACL and security label.
The latter two are passed around quite separately from the first, in
part because they are only needed for NFSv4. This leads to some
clumsiness in the code, such as the attributes NOT being set in
nfsd_create_setattr().
We need to keep the directory locked until all attributes are set to
ensure the file is never visible without all its attributes. This need
combined with the inconsistent handling of attributes leads to more
clumsiness.
As a first step towards tidying this up, introduce 'struct nfsd_attrs'.
This is passed (by reference) to vfs.c functions that work with
attributes, and is assembled by the various nfs*proc functions which
call them. As yet only iattr is included, but future patches will
expand this.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Between opening a file and setting a delegation on it, someone could
rename or unlink the dentry. If this happens, we do not want to grant a
delegation on the open.
On a CLAIM_NULL open, we're opening by filename, and we may (in the
non-create case) or may not (in the create case) be holding i_rwsem
when attempting to set a delegation. The latter case allows a
race.
After getting a lease, redo the lookup of the file being opened and
validate that the resulting dentry matches the one in the open file
description.
To properly redo the lookup we need an rqst pointer to pass to
nfsd_lookup_dentry(), so make sure that is available.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Currently, we pass the fh of the opened file down through several
functions so that alloc_init_deleg can pass it to delegation_blocked.
The filehandle of the open file is available in the nfs4_file however,
so there's no need to pass it in a separate argument.
Drop the argument from alloc_init_deleg, nfs4_open_delegation and
nfs4_set_delegation.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor so that CB_OFFLOAD arguments can be passed without
allocating a whole struct nfsd4_copy object. On my system (x86_64)
this removes another 96 bytes from struct nfsd4_copy.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Instead of manufacturing a phony struct nfsd_file, pass the
struct file returned by nfs42_ssc_open() directly to
nfsd4_do_copy().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Refactor: Now that nfsd4_do_copy() no longer calls the cleanup
helpers, plumb the use of struct file pointers all the way down to
_nfsd_copy_file_range().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Move the nfsd4_cleanup_*() call sites out of nfsd4_do_copy(). A
subsequent patch will modify one of the new call sites to avoid
the need to manufacture the phony struct nfsd_file.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The @src parameter is sometimes a pointer to a struct nfsd_file and
sometimes a pointer to struct file hiding in a phony struct
nfsd_file. Refactor nfsd4_cleanup_inter_ssc() so the @src parameter
is always an explicit struct file.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>