In preparation of passing the vma state through split, the pre-allocation
that occurs before the split has to be moved to after. Since the
preallocation would then live right next to the store, just call store
instead of preallocating. This effectively restores the potential error
path of splitting and not munmap'ing which pre-dates the maple tree.
Link: https://lkml.kernel.org/r/20230120162650.984577-12-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0378c0a0e9)
Bug: 274059236
Change-Id: I3539fb3a08043dae1bc8aaa6c7f285711a0b5548
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Add vendor hook in smaps_pte_entry for swap entry
- android_vh_smaps_pte_entry
- android_vh_show_smap
This vendor hook is to show more information for
swap entries of a process based on the
characteristics, such as written-back, same-filled
or huge (uncompressed).
Bug: 284059805
Change-Id: Ie4a48ae42212c056992d34a10b026b60439d0012
Signed-off-by: Sooyong Suk <s.suk@samsung.com>
we try to adjust page reclaim operations based on the running task
and kernel memory pressure. Thus, we want to create some vendor hooks
into kernel6.1.
Firstly, we add ADNRROID_VENDOR_DATA into the struct scan_control,
special operations would be performed based on this special scan option.
We measure the importance of the current process in the system and
obtain its weight, which is recorded in ANDROID_VENDOR_DATA.
The hook function: trace_android_vh_modify_scan_control is added inside
of the function modify_scan_control() to adjust reclaim operations based
on memory pressure.
The hook function: trace_android_vh_should_continue_reclaim is added inside
of the function shrink_node() to decide if page_reclaim would continue
or not based on memory pressure.
The hook function: trace_android_vh_file_is_tiny_bypass is added into the
function prepare_scan_count() to decide if the file pages should be skipped
in condition to file refualts and memory pressure.
Bug: 279793370
Change-Id: I1efe9d3e866f37b0295c7cd94ec8ca0117a9bd4a
Signed-off-by: Dezhi Huang <huangdezhi@hihonor.com>
Haibo Li reported:
| Unable to handle kernel paging request at virtual address
| ffffff802a0d8d7171
| Mem abort info⭕
| ESR = 0x9600002121
| EC = 0x25: DABT (current EL), IL = 32 bitsts
| SET = 0, FnV = 0 0
| EA = 0, S1PTW = 0 0
| FSC = 0x21: alignment fault
| Data abort info⭕
| ISV = 0, ISS = 0x0000002121
| CM = 0, WnR = 0 0
| swapper pgtable: 4k pages, 39-bit VAs, pgdp=000000002835200000
| [ffffff802a0d8d71] pgd=180000005fbf9003, p4d=180000005fbf9003,
| pud=180000005fbf9003, pmd=180000005fbe8003, pte=006800002a0d8707
| Internal error: Oops: 96000021 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 2 PID: 45 Comm: kworker/u8:2 Not tainted
| 5.15.78-android13-8-g63561175bbda-dirty #1
| ...
| pc : kcsan_setup_watchpoint+0x26c/0x6bc
| lr : kcsan_setup_watchpoint+0x88/0x6bc
| sp : ffffffc00ab4b7f0
| x29: ffffffc00ab4b800 x28: ffffff80294fe588 x27: 0000000000000001
| x26: 0000000000000019 x25: 0000000000000001 x24: ffffff80294fdb80
| x23: 0000000000000000 x22: ffffffc00a70fb68 x21: ffffff802a0d8d71
| x20: 0000000000000002 x19: 0000000000000000 x18: ffffffc00a9bd060
| x17: 0000000000000001 x16: 0000000000000000 x15: ffffffc00a59f000
| x14: 0000000000000001 x13: 0000000000000000 x12: ffffffc00a70faa0
| x11: 00000000aaaaaaab x10: 0000000000000054 x9 : ffffffc00839adf8
| x8 : ffffffc009b4cf00 x7 : 0000000000000000 x6 : 0000000000000007
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffc00a70fb70
| x2 : 0005ff802a0d8d71 x1 : 0000000000000000 x0 : 0000000000000000
| Call trace:
| kcsan_setup_watchpoint+0x26c/0x6bc
| __tsan_read2+0x1f0/0x234
| inflate_fast+0x498/0x750
| zlib_inflate+0x1304/0x2384
| __gunzip+0x3a0/0x45c
| gunzip+0x20/0x30
| unpack_to_rootfs+0x2a8/0x3fc
| do_populate_rootfs+0xe8/0x11c
| async_run_entry_fn+0x58/0x1bc
| process_one_work+0x3ec/0x738
| worker_thread+0x4c4/0x838
| kthread+0x20c/0x258
| ret_from_fork+0x10/0x20
| Code: b8bfc2a8 2a0803f7 14000007 d503249f (78bfc2a8) )
| ---[ end trace 613a943cb0a572b6 ]-----
The reason for this is that on certain arm64 configuration since
e35123d83e ("arm64: lto: Strengthen READ_ONCE() to acquire when
CONFIG_LTO=y"), READ_ONCE() may be promoted to a full atomic acquire
instruction which cannot be used on unaligned addresses.
Fix it by avoiding READ_ONCE() in read_instrumented_memory(), and simply
forcing the compiler to do the required access by casting to the
appropriate volatile type. In terms of generated code this currently
only affects architectures that do not use the default READ_ONCE()
implementation.
The only downside is that we are not guaranteed atomicity of the access
itself, although on most architectures a plain load up to machine word
size should still be atomic (a fact the default READ_ONCE() still relies
on itself).
BUG: 285794521
(cherry picked from commit 8dec88070d)
Reported-by: Haibo Li <haibo.li@mediatek.com>
Tested-by: Haibo Li <haibo.li@mediatek.com>
Cc: <stable@vger.kernel.org> # 5.17+
Change-Id: I16c9f83c3b4e28021a936249cafb1501760aa59d
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: 杨辉 <yanghui10@xiaomi.corp-partner.google.com>
Ok to commit this before KMI update since CRC change only affects the broken
hooks which are only used by the partner that introduced the hooks.
INFO: variable symbol 'struct tracepoint __tracepoint_android_rvh_psci_cpu_suspend' changed
CRC changed from 0x4628ef5b to 0xf9b81cca
variable symbol 'struct tracepoint __tracepoint_android_rvh_psci_tos_resident_on' changed
CRC changed from 0x477813d5 to 0xb163a362
Fixes: b7a7fd15ed ("ANDROID: vendor_hooks: psci: add hook to check if cpu is allowed to power off")
Bug: 285477556
Change-Id: I0539ac8ff1d26a6ba8dd0f13fc09b53f5ee0335b
Signed-off-by: Todd Kjos <tkjos@google.com>
syscore ops in gic-v3 takes care of invoking gic_v3_resume() when
exiting from "deep" suspend. However for "s2idle" suspend syscore
ops will not get invoked.
Vendor modules can register for s2idle notifications and
invoke gic_v3_resume() when the first cpu is waking up from s2idle.
Bug: 279879797
Change-Id: Ifd48d676a5bc907eb957c2002934e18bd1c9c095
Signed-off-by: Maulik Shah <mkshah@codeaurora.org>
Signed-off-by: Shreyas K K <quic_shrekk@quicinc.com>
This change adds vendor hook for gic suspend syscore ops callback.
And it is invoked during deepsleep and hibernation to store
gic register snapshot.
Bug: 279879797
Change-Id: I4e3729afa4daf18d73e00ee9601b6da72a578b4a
Signed-off-by: Nagireddy Annem <quic_nannem@quicinc.com>
Signed-off-by: Shreyas K K <quic_shrekk@quicinc.com>
While TOS is running alongside with linux, cpu power off operation by linux
may need be denied by TOS in some scenarios.
This patch added two hooks in psci_tos_resident_on and psci_cpu_suspend
to hook cpu off operation.
The function psci_tos_resident_on originally is used to check if TOS is resident on
a specific cpu and that cpu is dedicated for running TOS exclusively. If so, that
cpu can not be power off. Actually if TOS supports SMP, TOS may need deny any
other cpu to power down in some cases, i.e. there are no-expired timers in TOS.
Thus the first hook for psci_tos_resident_on is used to determine if
the specific cpu is allowed to power off in the cpu hotplug path.
Besides cpu hotplug, a cpu also can power off by cpu_suspend.
The second hook for psci_cpu_suspend determines if cpu suspend should go through
or not. When the same conditions described above meets, cpu suspend will break up.
The hook cherry-pick from commit 88d88955ae0b8b1f1a555d7810beb6c8ca4ca0f1
and changed vh to rvh according to commit 949edf7539b60058cf2da98f24db2b6d4d89eaa0
Bug: 284797902
Change-Id: Ib329beeff20f0cfef263f6a7813280d33f6a5eaa
Signed-off-by: Jian Gong <Jian.Gong@unisoc.com>
Signed-off-by: Cixi Geng <cixi.geng1@unisoc.com>
android_rvh_effective_cpu_util:
To perform vendor-specific cpu util, it is used in EAS/schedutil/thermal.
The effective_cpu_util would be called when thermal calc the dynamic power,
it's non-atomic context, so set the hook be restricted.
Bug: 226686099
Test: build pass
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Change-Id: I6fd77f44ca4328f5ef37d96989aa2e08d65e29bb
For SoC's skin temperature, we have to use more stringent temperature
control to make IPA can monitor and mitigate temperature control earlier
and faster, so add it to meet platform thermal requirement.
Bug: 211564753
Signed-off-by: Jeson Gao <jeson.gao@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Change-Id: Iaef87287eef93d6fdbc3c58c93f70c1525e38296
(cherry picked from commit 6709f52325)
(cherry picked from commit 97a290b0e5)
Need to get temperature data and config info from thermal zone device.
Bug: 208946028
Signed-off-by: Di Shen <di.shen@unisoc.com>
Signed-off-by: Jeson Gao <jeson.gao@unisoc.com>
Change-Id: I5945df5258181b4a441b6bbe09327099491418b3
(cherry picked from commit c53f0e3530)
(cherry picked from commit 12b8ef18b2)
Add hook to get cpufreq policy data after registering and unregistering
cpufreq thermal for platform thermal requirement.
Bug: 228423762
Signed-off-by: Jeson Gao <jeson.gao@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Change-Id: I9c6bc88f348f252c428560427bd8bca91092edfa
(cherry picked from commit fbe6f8708d)
Kfence only needs its pool to be mapped as page granularity, if it is
inited early. Previous judgement was a bit over protected. From [1], Mark
suggested to "just map the KFENCE region a page granularity". So I
decouple it from judgement and do page granularity mapping for kfence
pool only. Need to be noticed that late init of kfence pool still requires
page granularity mapping.
Page granularity mapping in theory cost more(2M per 1GB) memory on arm64
platform. Like what I've tested on QEMU(emulated 1GB RAM) with
gki_defconfig, also turning off rodata protection:
Before:
[root@liebao ]# cat /proc/meminfo
MemTotal: 999484 kB
After:
[root@liebao ]# cat /proc/meminfo
MemTotal: 1001480 kB
To implement this, also relocate the kfence pool allocation before the
linear mapping setting up, arm64_kfence_alloc_pool is to allocate phys
addr, __kfence_pool is to be set after linear mapping set up.
LINK: [1] https://lore.kernel.org/linux-arm-kernel/Y+IsdrvDNILA59UN@FVFF77S0Q05N/
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/1679066974-690-1-git-send-email-quic_zhenhuah@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
BUG: 284812202
Change-Id: I8e7c565d3f4d6349a028a6a060259d62cf5beee7
(cherry picked from commit bfa7965b33)
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
commit 1007843a91 upstream.
syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.
One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
CPU0
----
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
// e.g. timer interrupt handler runs at this moment
some_timer_func() {
kmalloc(GFP_ATOMIC) {
__alloc_pages_slowpath() {
read_seqbegin(&zonelist_update_seq) {
// spins forever because zonelist_update_seq.seqcount is odd
}
}
}
}
// e.g. timer interrupt handler finishes
write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
}
This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.
Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.
CPU0 CPU1
---- ----
pty_write() {
tty_insert_flip_string_and_push_buffer() {
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq);
build_zonelists() {
printk() {
vprintk() {
vprintk_default() {
vprintk_emit() {
console_unlock() {
console_flush_all() {
console_emit_next_record() {
con->write() = serial8250_console_write() {
spin_lock_irqsave(&port->lock, flags);
tty_insert_flip_string() {
tty_insert_flip_string_fixed_flag() {
__tty_buffer_request_room() {
tty_buffer_alloc() {
kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
__alloc_pages_slowpath() {
zonelist_iter_begin() {
read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
}
}
}
}
}
}
}
}
spin_unlock_irqrestore(&port->lock, flags);
// message is printed to console
spin_unlock_irqrestore(&port->lock, flags);
}
}
}
}
}
}
}
}
}
write_sequnlock(&zonelist_update_seq);
}
}
}
This deadlock scenario can be eliminated by
preventing interrupt context from calling kmalloc(GFP_ATOMIC)
and
preventing printk() from calling console_flush_all()
while zonelist_update_seq.seqcount is odd.
Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by
disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
and
disabling synchronous printk() in order to avoid console_flush_all()
.
As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...
Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
Fixes: 3d36424b3b ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Change-Id: Ifc0c6ed9be6d36166367811ad412bedc66ed713e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Patrick Daly <quic_pdaly@quicinc.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b528537d13)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 4d73ba5fa7 upstream.
A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
taking an excessive amount of time for large amounts of memory. Further
testing allocating huge pages that the cost is linear i.e. if allocating
1G pages in batches of 10 then the time to allocate nr_hugepages from
10->20->30->etc increases linearly even though 10 pages are allocated at
each step. Profiles indicated that much of the time is spent checking the
validity within already existing huge pages and then attempting a
migration that fails after isolating the range, draining pages and a whole
lot of other useless work.
Commit eb14d4eefd ("mm,page_alloc: drop unnecessary checks from
pfn_range_valid_contig") removed two checks, one which ignored huge pages
for contiguous allocations as huge pages can sometimes migrate. While
there may be value on migrating a 2M page to satisfy a 1G allocation, it's
potentially expensive if the 1G allocation fails and it's pointless to try
moving a 1G page for a new 1G allocation or scan the tail pages for valid
PFNs.
Reintroduce the PageHuge check and assume any contiguous region with
hugetlbfs pages is unsuitable for a new 1G allocation.
The hpagealloc test allocates huge pages in batches and reports the
average latency per page over time. This test happens just after boot
when fragmentation is not an issue. Units are in milliseconds.
hpagealloc
6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla hugeallocrevert-v1r1 hugeallocsimple-v1r2
Min Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
1st-qrtle Latency 356.61 ( 0.00%) 5.34 ( 98.50%) 19.85 ( 94.43%)
2nd-qrtle Latency 697.26 ( 0.00%) 5.47 ( 99.22%) 20.44 ( 97.07%)
3rd-qrtle Latency 972.94 ( 0.00%) 5.50 ( 99.43%) 20.81 ( 97.86%)
Max-1 Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
Max-5 Latency 82.14 ( 0.00%) 5.11 ( 93.78%) 19.31 ( 76.49%)
Max-10 Latency 150.54 ( 0.00%) 5.20 ( 96.55%) 19.43 ( 87.09%)
Max-90 Latency 1164.45 ( 0.00%) 5.53 ( 99.52%) 20.97 ( 98.20%)
Max-95 Latency 1223.06 ( 0.00%) 5.55 ( 99.55%) 21.06 ( 98.28%)
Max-99 Latency 1278.67 ( 0.00%) 5.57 ( 99.56%) 22.56 ( 98.24%)
Max Latency 1310.90 ( 0.00%) 8.06 ( 99.39%) 26.62 ( 97.97%)
Amean Latency 678.36 ( 0.00%) 5.44 * 99.20%* 20.44 * 96.99%*
6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla revert-v1 hugeallocfix-v2
Duration User 0.28 0.27 0.30
Duration System 808.66 17.77 35.99
Duration Elapsed 830.87 18.08 36.33
The vanilla kernel is poor, taking up to 1.3 second to allocate a huge
page and almost 10 minutes in total to run the test. Reverting the
problematic commit reduces it to 8ms at worst and the patch takes 26ms.
This patch fixes the main issue with skipping huge pages but leaves the
page_count() out because a page with an elevated count potentially can
migrate.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
Link: https://lkml.kernel.org/r/20230414141429.pwgieuwluxwez3rj@techsingularity.net
Fixes: eb14d4eefd ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
Change-Id: I552f0631f15e41038219e207c994fa7702b269fa
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yuanxi Liu <y.liu@naruida.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 059f24aff6)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
While exercising the unbind path, with the current implementation
the functionfs_unbind would be calling which waits for the ffs->mutex
to be available, however within the same time ffs_ep0_read is invoked
& if no setup packets are pending, it will invoke function
wait_event_interruptible_exclusive_locked_irq which by definition waits
for the ev.count to be increased inside the same mutex for which
functionfs_unbind is waiting.
This creates deadlock situation because the functionfs_unbind won't
get the lock until ev.count is increased which can only happen if
the caller ffs_func_unbind can proceed further.
Following is the illustration:
CPU1 CPU2
ffs_func_unbind() ffs_ep0_read()
mutex_lock(ffs->mutex)
wait_event(ffs->ev.count)
functionfs_unbind()
mutex_lock(ffs->mutex)
mutex_unlock(ffs->mutex)
ffs_event_add()
<deadlock>
Fix this by moving the event unbind before functionfs_unbind
to ensure the ev.count is incrased properly.
Fixes: 6a19da1110 ("usb: gadget: f_fs: Prevent race during ffs_ep0_queue_wait")
Cc: stable <stable@kernel.org>
Signed-off-by: Uttkarsh Aggarwal <quic_uaggarwa@quicinc.com>
Link: https://lore.kernel.org/r/20230525092854.7992-1-quic_uaggarwa@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 285072336
(cherry picked from commit efb6b53520)
Change-Id: I1a001606f62f1966825d47809cd1c887e3d6fb71
Signed-off-by: Uttkarsh Aggarwal <quic_uaggarwa@quicinc.com>
This reverts commit b9bb33b73c.
Reason: This patch breaks any USB gadget function that deactivates the
gadget on bind (by setting bind_deactivated = true).
Bug: 285019584
Change-Id: I2885819dd75e9d65de8258b7d2f6fc5d98de6c68
Signed-off-by: Avichal Rakesh <arakesh@google.com>
When servicemanager process added service proxy from other process
register the service, we want to know the matching relation between
handle in the process and service name. When binder transaction
happened, We want to know what process calls what method on what service.
Bug: 186604985
Signed-off-by: zhengding chen <chenzhengding@oppo.com>
Change-Id: I813d1cde10294d8665f899f7fef0d444ec1f1f5e
commit 24bf08c437 upstream.
Looks like what we fixed for hugetlb in commit 44f86392bd ("mm/hugetlb:
fix uffd-wp handling for migration entries in
hugetlb_change_protection()") similarly applies to THP.
Setting/clearing uffd-wp on THP migration entries is not implemented
properly. Further, while removing migration PMDs considers the uffd-wp
bit, inserting migration PMDs does not consider the uffd-wp bit.
We have to set/clear independently of the migration entry type in
change_huge_pmd() and properly copy the uffd-wp bit in
set_pmd_migration_entry().
Verified using a simple reproducer that triggers migration of a THP, that
the set_pmd_migration_entry() no longer loses the uffd-wp bit.
Link: https://lkml.kernel.org/r/20230405160236.587705-2-david@redhat.com
Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Change-Id: I263a9fd8a6695f546fe5c5279a439f4f1c151c48
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit cc647e05db)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit dd47ac428c upstream.
Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd
round done in __collapse_huge_page_isolate() after
hpage_collapse_scan_pmd(), during which all the locks will be released
temporarily. It means the pgtable can change during this phase before 2nd
round starts.
It's logically possible some ptes got wr-protected during this phase, and
we can errornously collapse a thp without noticing some ptes are
wr-protected by userfault. e1e267c792 wanted to avoid it but it only
did that for the 1st phase, not the 2nd phase.
Since __collapse_huge_page_isolate() happens after a round of small page
swapins, we don't need to worry on any !present ptes - if it existed
khugepaged will already bail out. So we only need to check present ptes
with uffd-wp bit set there.
This is something I found only but never had a reproducer, I thought it
was one caused a bug in Muhammad's recent pagemap new ioctl work, but it
turns out it's not the cause of that but an userspace bug. However this
seems to still be a real bug even with a very small race window, still
worth to have it fixed and copy stable.
Link: https://lkml.kernel.org/r/20230405155120.3608140-1-peterx@redhat.com
Fixes: e1e267c792 ("khugepaged: skip collapse if uffd-wp detected")
Change-Id: Iab7f0ac5b9b6d055485ca244b2fa1e13f0dbc570
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 519dbe737f)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit ccc031e26a upstream.
The previous commit df8629af29 ("fuse: always revalidate if exclusive
create") ensures that the dentries are revalidated on O_EXCL creates. This
commit complements it by also performing revalidation for rename target
dentries. Otherwise, a rename target file that only exists in kernel
dentry cache but not in the filesystem will result in EEXIST if
RENAME_NOREPLACE flag is used.
Change-Id: I3500c168b37469e0fcf5664a3deb4d54e45b926d
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Yang Bo <yb203166@antfin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7ca973d830)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>