Keep a reference to the node when possible with mas_prev(). This will
avoid re-walking the tree. In keeping a reference to the node, keep the
last/index accurate to the range being referenced. This means the limit
may be within the range, but the range may extend outside of the limit.
Also fix the single entry tree to respect the range (of 0), or set the
node to MAS_NONE in the case of shifting beyond 0.
Link: https://lkml.kernel.org/r/20230518145544.1722059-25-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 20e9433710317ab0278c1d76821e213fb2d11e19
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
Bug: 274059236
Change-Id: If0b40925884dac6e334474249098d03175ba6dd6
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Calculate the number of nodes based on the pending write action instead
of assuming the worst case.
This addresses a performance regression introduced in platforms that
have longer allocation timing.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: https://lore.kernel.org/lkml/20230601021605.2823123-14-Liam.Howlett@oracle.com/
[surenb: adjust node_size calculation, allow to store a slot when
possible]
Bug: 274059236
Change-Id: I1db402fb463ee1e391081d2d81c34619f15713ac
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Create a new function to get the size of the mas_wr_node_size() since it
will be used elsewhere soon.
Drop the incrementing of the node size if this is the left-most node.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: d781921bc8
[surenb: due to the differences with upstream kernel where the node size
can be obtained using mas_wr_new_end() function, this patch is not
applicable upstream. The patch was obtained from the author's tree]
Bug: 274059236
Change-Id: I9f0b5238294d0842b4c2717437ed7288b17c7617
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Relocate it and call mas_wr_extend_null() from within mas_wr_end_piv().
Extending the NULL may affect the end pivot value so call
mas_wr_endtend_null() from within mas_wr_end_piv() to keep it all
together.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: https://lore.kernel.org/lkml/20230601021605.2823123-12-Liam.Howlett@oracle.com/
[surenb: moved additional wr_mas->end_piv assignment missing in later
kernel versions]
Bug: 274059236
Change-Id: I55c5843273e7a679aef918e66d4b4ed034d493da
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Only write when necessary to the maple tree. This should only occur
when the VMA changes. In the __vma_adjust() case, it is either the vma
when it is expanded, the next vma when the boundary expands into 'vma',
writing the 'insert', or when vma expands/shrinks for shift_arg_pages().
The mas_preallocate() setup should track the intended write to ensure
the correct number of nodes are preallocated for the pending write.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: 61b337f650
[surenb: __vma_adjust was removed in 6.3, therefore these fixes are
not applicable upstream anymore. The patch was obtained from the
author's tree]
Bug: 274059236
Change-Id: I69d68a5b4ff11c40985f7b03b31eec4bb24dcbb6
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The maple tree node limits are implied by the parent. When walking up the
tree, the limit may not be known until a slot that does not have implied
limits are encountered. However, if the node is the left-most or
right-most node, the walking up to find that limit can be skipped.
This commit also fixes the debug/testing code that was not setting the
limit on walking down the tree as that optimization is not compatible with
this change.
Link: https://lkml.kernel.org/r/20230518145544.1722059-4-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0f4e7f5fc2122534ae0573b37224ddfa367fa7ac
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
Bug: 274059236
Change-Id: I4a5e852906692b27ea598fdf38eba8e1a69355d9
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The majority of the calls to munmap a VMA is for a single vma. The
maple tree is able to store a single entry at 0, with a size of 1 as a
pointer and avoid any allocations. Change do_vmi_align_munmap() to
store the VMAs being munmap()'ed into a tree indexed by the count. This
will leverage the ability to store the first entry without a node
allocation.
Storing the entries into a tree by the count and not the vma start and
end means changing the functions which iterate over the entries. Update
unmap_vmas() and free_pgtables() to take a maple state and a tree end
address to support this functionality.
Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls. This happens in the
static unmap_region() and exit_mmap().
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: https://lore.kernel.org/lkml/20230601021605.2823123-5-Liam.Howlett@oracle.com/
[surenb: skip changes passing maple state to unmap_vmas() and
free_pgtables()]
Bug: 274059236
Change-Id: If38cfecd51da884bcfdbdfdfbf955a0b338d3d60
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
In preparation of passing the vma state through split, the pre-allocation
that occurs before the split has to be moved to after. Since the
preallocation would then live right next to the store, just call store
instead of preallocating. This effectively restores the potential error
path of splitting and not munmap'ing which pre-dates the maple tree.
Link: https://lkml.kernel.org/r/20230120162650.984577-12-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0378c0a0e9)
Bug: 274059236
Change-Id: I3539fb3a08043dae1bc8aaa6c7f285711a0b5548
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Add vendor hook in smaps_pte_entry for swap entry
- android_vh_smaps_pte_entry
- android_vh_show_smap
This vendor hook is to show more information for
swap entries of a process based on the
characteristics, such as written-back, same-filled
or huge (uncompressed).
Bug: 284059805
Change-Id: Ie4a48ae42212c056992d34a10b026b60439d0012
Signed-off-by: Sooyong Suk <s.suk@samsung.com>
we try to adjust page reclaim operations based on the running task
and kernel memory pressure. Thus, we want to create some vendor hooks
into kernel6.1.
Firstly, we add ADNRROID_VENDOR_DATA into the struct scan_control,
special operations would be performed based on this special scan option.
We measure the importance of the current process in the system and
obtain its weight, which is recorded in ANDROID_VENDOR_DATA.
The hook function: trace_android_vh_modify_scan_control is added inside
of the function modify_scan_control() to adjust reclaim operations based
on memory pressure.
The hook function: trace_android_vh_should_continue_reclaim is added inside
of the function shrink_node() to decide if page_reclaim would continue
or not based on memory pressure.
The hook function: trace_android_vh_file_is_tiny_bypass is added into the
function prepare_scan_count() to decide if the file pages should be skipped
in condition to file refualts and memory pressure.
Bug: 279793370
Change-Id: I1efe9d3e866f37b0295c7cd94ec8ca0117a9bd4a
Signed-off-by: Dezhi Huang <huangdezhi@hihonor.com>
Haibo Li reported:
| Unable to handle kernel paging request at virtual address
| ffffff802a0d8d7171
| Mem abort info⭕
| ESR = 0x9600002121
| EC = 0x25: DABT (current EL), IL = 32 bitsts
| SET = 0, FnV = 0 0
| EA = 0, S1PTW = 0 0
| FSC = 0x21: alignment fault
| Data abort info⭕
| ISV = 0, ISS = 0x0000002121
| CM = 0, WnR = 0 0
| swapper pgtable: 4k pages, 39-bit VAs, pgdp=000000002835200000
| [ffffff802a0d8d71] pgd=180000005fbf9003, p4d=180000005fbf9003,
| pud=180000005fbf9003, pmd=180000005fbe8003, pte=006800002a0d8707
| Internal error: Oops: 96000021 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 2 PID: 45 Comm: kworker/u8:2 Not tainted
| 5.15.78-android13-8-g63561175bbda-dirty #1
| ...
| pc : kcsan_setup_watchpoint+0x26c/0x6bc
| lr : kcsan_setup_watchpoint+0x88/0x6bc
| sp : ffffffc00ab4b7f0
| x29: ffffffc00ab4b800 x28: ffffff80294fe588 x27: 0000000000000001
| x26: 0000000000000019 x25: 0000000000000001 x24: ffffff80294fdb80
| x23: 0000000000000000 x22: ffffffc00a70fb68 x21: ffffff802a0d8d71
| x20: 0000000000000002 x19: 0000000000000000 x18: ffffffc00a9bd060
| x17: 0000000000000001 x16: 0000000000000000 x15: ffffffc00a59f000
| x14: 0000000000000001 x13: 0000000000000000 x12: ffffffc00a70faa0
| x11: 00000000aaaaaaab x10: 0000000000000054 x9 : ffffffc00839adf8
| x8 : ffffffc009b4cf00 x7 : 0000000000000000 x6 : 0000000000000007
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffc00a70fb70
| x2 : 0005ff802a0d8d71 x1 : 0000000000000000 x0 : 0000000000000000
| Call trace:
| kcsan_setup_watchpoint+0x26c/0x6bc
| __tsan_read2+0x1f0/0x234
| inflate_fast+0x498/0x750
| zlib_inflate+0x1304/0x2384
| __gunzip+0x3a0/0x45c
| gunzip+0x20/0x30
| unpack_to_rootfs+0x2a8/0x3fc
| do_populate_rootfs+0xe8/0x11c
| async_run_entry_fn+0x58/0x1bc
| process_one_work+0x3ec/0x738
| worker_thread+0x4c4/0x838
| kthread+0x20c/0x258
| ret_from_fork+0x10/0x20
| Code: b8bfc2a8 2a0803f7 14000007 d503249f (78bfc2a8) )
| ---[ end trace 613a943cb0a572b6 ]-----
The reason for this is that on certain arm64 configuration since
e35123d83e ("arm64: lto: Strengthen READ_ONCE() to acquire when
CONFIG_LTO=y"), READ_ONCE() may be promoted to a full atomic acquire
instruction which cannot be used on unaligned addresses.
Fix it by avoiding READ_ONCE() in read_instrumented_memory(), and simply
forcing the compiler to do the required access by casting to the
appropriate volatile type. In terms of generated code this currently
only affects architectures that do not use the default READ_ONCE()
implementation.
The only downside is that we are not guaranteed atomicity of the access
itself, although on most architectures a plain load up to machine word
size should still be atomic (a fact the default READ_ONCE() still relies
on itself).
BUG: 285794521
(cherry picked from commit 8dec88070d)
Reported-by: Haibo Li <haibo.li@mediatek.com>
Tested-by: Haibo Li <haibo.li@mediatek.com>
Cc: <stable@vger.kernel.org> # 5.17+
Change-Id: I16c9f83c3b4e28021a936249cafb1501760aa59d
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: 杨辉 <yanghui10@xiaomi.corp-partner.google.com>
Ok to commit this before KMI update since CRC change only affects the broken
hooks which are only used by the partner that introduced the hooks.
INFO: variable symbol 'struct tracepoint __tracepoint_android_rvh_psci_cpu_suspend' changed
CRC changed from 0x4628ef5b to 0xf9b81cca
variable symbol 'struct tracepoint __tracepoint_android_rvh_psci_tos_resident_on' changed
CRC changed from 0x477813d5 to 0xb163a362
Fixes: b7a7fd15ed ("ANDROID: vendor_hooks: psci: add hook to check if cpu is allowed to power off")
Bug: 285477556
Change-Id: I0539ac8ff1d26a6ba8dd0f13fc09b53f5ee0335b
Signed-off-by: Todd Kjos <tkjos@google.com>
syscore ops in gic-v3 takes care of invoking gic_v3_resume() when
exiting from "deep" suspend. However for "s2idle" suspend syscore
ops will not get invoked.
Vendor modules can register for s2idle notifications and
invoke gic_v3_resume() when the first cpu is waking up from s2idle.
Bug: 279879797
Change-Id: Ifd48d676a5bc907eb957c2002934e18bd1c9c095
Signed-off-by: Maulik Shah <mkshah@codeaurora.org>
Signed-off-by: Shreyas K K <quic_shrekk@quicinc.com>
This change adds vendor hook for gic suspend syscore ops callback.
And it is invoked during deepsleep and hibernation to store
gic register snapshot.
Bug: 279879797
Change-Id: I4e3729afa4daf18d73e00ee9601b6da72a578b4a
Signed-off-by: Nagireddy Annem <quic_nannem@quicinc.com>
Signed-off-by: Shreyas K K <quic_shrekk@quicinc.com>
While TOS is running alongside with linux, cpu power off operation by linux
may need be denied by TOS in some scenarios.
This patch added two hooks in psci_tos_resident_on and psci_cpu_suspend
to hook cpu off operation.
The function psci_tos_resident_on originally is used to check if TOS is resident on
a specific cpu and that cpu is dedicated for running TOS exclusively. If so, that
cpu can not be power off. Actually if TOS supports SMP, TOS may need deny any
other cpu to power down in some cases, i.e. there are no-expired timers in TOS.
Thus the first hook for psci_tos_resident_on is used to determine if
the specific cpu is allowed to power off in the cpu hotplug path.
Besides cpu hotplug, a cpu also can power off by cpu_suspend.
The second hook for psci_cpu_suspend determines if cpu suspend should go through
or not. When the same conditions described above meets, cpu suspend will break up.
The hook cherry-pick from commit 88d88955ae0b8b1f1a555d7810beb6c8ca4ca0f1
and changed vh to rvh according to commit 949edf7539b60058cf2da98f24db2b6d4d89eaa0
Bug: 284797902
Change-Id: Ib329beeff20f0cfef263f6a7813280d33f6a5eaa
Signed-off-by: Jian Gong <Jian.Gong@unisoc.com>
Signed-off-by: Cixi Geng <cixi.geng1@unisoc.com>
android_rvh_effective_cpu_util:
To perform vendor-specific cpu util, it is used in EAS/schedutil/thermal.
The effective_cpu_util would be called when thermal calc the dynamic power,
it's non-atomic context, so set the hook be restricted.
Bug: 226686099
Test: build pass
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Change-Id: I6fd77f44ca4328f5ef37d96989aa2e08d65e29bb
For SoC's skin temperature, we have to use more stringent temperature
control to make IPA can monitor and mitigate temperature control earlier
and faster, so add it to meet platform thermal requirement.
Bug: 211564753
Signed-off-by: Jeson Gao <jeson.gao@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Change-Id: Iaef87287eef93d6fdbc3c58c93f70c1525e38296
(cherry picked from commit 6709f52325)
(cherry picked from commit 97a290b0e5)
Need to get temperature data and config info from thermal zone device.
Bug: 208946028
Signed-off-by: Di Shen <di.shen@unisoc.com>
Signed-off-by: Jeson Gao <jeson.gao@unisoc.com>
Change-Id: I5945df5258181b4a441b6bbe09327099491418b3
(cherry picked from commit c53f0e3530)
(cherry picked from commit 12b8ef18b2)
Add hook to get cpufreq policy data after registering and unregistering
cpufreq thermal for platform thermal requirement.
Bug: 228423762
Signed-off-by: Jeson Gao <jeson.gao@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Change-Id: I9c6bc88f348f252c428560427bd8bca91092edfa
(cherry picked from commit fbe6f8708d)
Kfence only needs its pool to be mapped as page granularity, if it is
inited early. Previous judgement was a bit over protected. From [1], Mark
suggested to "just map the KFENCE region a page granularity". So I
decouple it from judgement and do page granularity mapping for kfence
pool only. Need to be noticed that late init of kfence pool still requires
page granularity mapping.
Page granularity mapping in theory cost more(2M per 1GB) memory on arm64
platform. Like what I've tested on QEMU(emulated 1GB RAM) with
gki_defconfig, also turning off rodata protection:
Before:
[root@liebao ]# cat /proc/meminfo
MemTotal: 999484 kB
After:
[root@liebao ]# cat /proc/meminfo
MemTotal: 1001480 kB
To implement this, also relocate the kfence pool allocation before the
linear mapping setting up, arm64_kfence_alloc_pool is to allocate phys
addr, __kfence_pool is to be set after linear mapping set up.
LINK: [1] https://lore.kernel.org/linux-arm-kernel/Y+IsdrvDNILA59UN@FVFF77S0Q05N/
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/1679066974-690-1-git-send-email-quic_zhenhuah@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
BUG: 284812202
Change-Id: I8e7c565d3f4d6349a028a6a060259d62cf5beee7
(cherry picked from commit bfa7965b33)
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
commit 1007843a91 upstream.
syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.
One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
CPU0
----
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
// e.g. timer interrupt handler runs at this moment
some_timer_func() {
kmalloc(GFP_ATOMIC) {
__alloc_pages_slowpath() {
read_seqbegin(&zonelist_update_seq) {
// spins forever because zonelist_update_seq.seqcount is odd
}
}
}
}
// e.g. timer interrupt handler finishes
write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
}
This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.
Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.
CPU0 CPU1
---- ----
pty_write() {
tty_insert_flip_string_and_push_buffer() {
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq);
build_zonelists() {
printk() {
vprintk() {
vprintk_default() {
vprintk_emit() {
console_unlock() {
console_flush_all() {
console_emit_next_record() {
con->write() = serial8250_console_write() {
spin_lock_irqsave(&port->lock, flags);
tty_insert_flip_string() {
tty_insert_flip_string_fixed_flag() {
__tty_buffer_request_room() {
tty_buffer_alloc() {
kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
__alloc_pages_slowpath() {
zonelist_iter_begin() {
read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
}
}
}
}
}
}
}
}
spin_unlock_irqrestore(&port->lock, flags);
// message is printed to console
spin_unlock_irqrestore(&port->lock, flags);
}
}
}
}
}
}
}
}
}
write_sequnlock(&zonelist_update_seq);
}
}
}
This deadlock scenario can be eliminated by
preventing interrupt context from calling kmalloc(GFP_ATOMIC)
and
preventing printk() from calling console_flush_all()
while zonelist_update_seq.seqcount is odd.
Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by
disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
and
disabling synchronous printk() in order to avoid console_flush_all()
.
As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...
Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
Fixes: 3d36424b3b ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Change-Id: Ifc0c6ed9be6d36166367811ad412bedc66ed713e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Patrick Daly <quic_pdaly@quicinc.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b528537d13)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 4d73ba5fa7 upstream.
A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
taking an excessive amount of time for large amounts of memory. Further
testing allocating huge pages that the cost is linear i.e. if allocating
1G pages in batches of 10 then the time to allocate nr_hugepages from
10->20->30->etc increases linearly even though 10 pages are allocated at
each step. Profiles indicated that much of the time is spent checking the
validity within already existing huge pages and then attempting a
migration that fails after isolating the range, draining pages and a whole
lot of other useless work.
Commit eb14d4eefd ("mm,page_alloc: drop unnecessary checks from
pfn_range_valid_contig") removed two checks, one which ignored huge pages
for contiguous allocations as huge pages can sometimes migrate. While
there may be value on migrating a 2M page to satisfy a 1G allocation, it's
potentially expensive if the 1G allocation fails and it's pointless to try
moving a 1G page for a new 1G allocation or scan the tail pages for valid
PFNs.
Reintroduce the PageHuge check and assume any contiguous region with
hugetlbfs pages is unsuitable for a new 1G allocation.
The hpagealloc test allocates huge pages in batches and reports the
average latency per page over time. This test happens just after boot
when fragmentation is not an issue. Units are in milliseconds.
hpagealloc
6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla hugeallocrevert-v1r1 hugeallocsimple-v1r2
Min Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
1st-qrtle Latency 356.61 ( 0.00%) 5.34 ( 98.50%) 19.85 ( 94.43%)
2nd-qrtle Latency 697.26 ( 0.00%) 5.47 ( 99.22%) 20.44 ( 97.07%)
3rd-qrtle Latency 972.94 ( 0.00%) 5.50 ( 99.43%) 20.81 ( 97.86%)
Max-1 Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
Max-5 Latency 82.14 ( 0.00%) 5.11 ( 93.78%) 19.31 ( 76.49%)
Max-10 Latency 150.54 ( 0.00%) 5.20 ( 96.55%) 19.43 ( 87.09%)
Max-90 Latency 1164.45 ( 0.00%) 5.53 ( 99.52%) 20.97 ( 98.20%)
Max-95 Latency 1223.06 ( 0.00%) 5.55 ( 99.55%) 21.06 ( 98.28%)
Max-99 Latency 1278.67 ( 0.00%) 5.57 ( 99.56%) 22.56 ( 98.24%)
Max Latency 1310.90 ( 0.00%) 8.06 ( 99.39%) 26.62 ( 97.97%)
Amean Latency 678.36 ( 0.00%) 5.44 * 99.20%* 20.44 * 96.99%*
6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla revert-v1 hugeallocfix-v2
Duration User 0.28 0.27 0.30
Duration System 808.66 17.77 35.99
Duration Elapsed 830.87 18.08 36.33
The vanilla kernel is poor, taking up to 1.3 second to allocate a huge
page and almost 10 minutes in total to run the test. Reverting the
problematic commit reduces it to 8ms at worst and the patch takes 26ms.
This patch fixes the main issue with skipping huge pages but leaves the
page_count() out because a page with an elevated count potentially can
migrate.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
Link: https://lkml.kernel.org/r/20230414141429.pwgieuwluxwez3rj@techsingularity.net
Fixes: eb14d4eefd ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
Change-Id: I552f0631f15e41038219e207c994fa7702b269fa
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yuanxi Liu <y.liu@naruida.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 059f24aff6)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
While exercising the unbind path, with the current implementation
the functionfs_unbind would be calling which waits for the ffs->mutex
to be available, however within the same time ffs_ep0_read is invoked
& if no setup packets are pending, it will invoke function
wait_event_interruptible_exclusive_locked_irq which by definition waits
for the ev.count to be increased inside the same mutex for which
functionfs_unbind is waiting.
This creates deadlock situation because the functionfs_unbind won't
get the lock until ev.count is increased which can only happen if
the caller ffs_func_unbind can proceed further.
Following is the illustration:
CPU1 CPU2
ffs_func_unbind() ffs_ep0_read()
mutex_lock(ffs->mutex)
wait_event(ffs->ev.count)
functionfs_unbind()
mutex_lock(ffs->mutex)
mutex_unlock(ffs->mutex)
ffs_event_add()
<deadlock>
Fix this by moving the event unbind before functionfs_unbind
to ensure the ev.count is incrased properly.
Fixes: 6a19da1110 ("usb: gadget: f_fs: Prevent race during ffs_ep0_queue_wait")
Cc: stable <stable@kernel.org>
Signed-off-by: Uttkarsh Aggarwal <quic_uaggarwa@quicinc.com>
Link: https://lore.kernel.org/r/20230525092854.7992-1-quic_uaggarwa@quicinc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 285072336
(cherry picked from commit efb6b53520)
Change-Id: I1a001606f62f1966825d47809cd1c887e3d6fb71
Signed-off-by: Uttkarsh Aggarwal <quic_uaggarwa@quicinc.com>