On x86, batched and deferred tlb shootdown has lead to 90% performance
increase on tlb shootdown. on arm64, HW can do tlb shootdown without
software IPI. But sync tlbi is still quite expensive.
Even running a simplest program which requires swapout can
prove this is true,
#include <sys/types.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
int main()
{
#define SIZE (1 * 1024 * 1024)
volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
memset(p, 0x88, SIZE);
for (int k = 0; k < 10000; k++) {
/* swap in */
for (int i = 0; i < SIZE; i += 4096) {
(void)p[i];
}
/* swap out */
madvise(p, SIZE, MADV_PAGEOUT);
}
}
Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.
~ # perf record taskset -c 4 ./a.out
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
~ # perf report
# To display the perf.data header info, please use --header/--header-only options.
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 60K of event 'cycles'
# Event count (approx.): 35706225414
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ......
#
21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
3.49% a.out [kernel.kallsyms] [k] memset64
1.63% a.out [kernel.kallsyms] [k] clear_page
1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
1.23% a.out [kernel.kallsyms] [k] xas_load
1.15% a.out [kernel.kallsyms] [k] zram_slot_lock
ptep_clear_flush() takes 5.36% CPU in the micro-benchmark swapping in/out
a page mapped by only one process. If the page is mapped by multiple
processes, typically, like more than 100 on a phone, the overhead would be
much higher as we have to run tlb flush 100 times for one single page.
Plus, tlb flush overhead will increase with the number of CPU cores due to
the bad scalability of tlb shootdown in HW, so those ARM64 servers should
expect much higher overhead.
Further perf annonate shows 95% cpu time of ptep_clear_flush is actually
used by the final dsb() to wait for the completion of tlb flush. This
provides us a very good chance to leverage the existing batched tlb in
kernel. The minimum modification is that we only send async tlbi in the
first stage and we send dsb while we have to sync in the second stage.
With the above simplest micro benchmark, collapsed time to finish the
program decreases around 5%.
Typical collapsed time w/o patch:
~ # time taskset -c 4 ./a.out
0.21user 14.34system 0:14.69elapsed
w/ patch:
~ # time taskset -c 4 ./a.out
0.22user 13.45system 0:13.80elapsed
Also tested with benchmark in the commit on Kunpeng920 arm64 server
and observed an improvement around 12.5% with command
`time ./swap_bench`.
w/o w/
real 0m13.460s 0m11.771s
user 0m0.248s 0m0.279s
sys 0m12.039s 0m11.458s
Originally it's noticed a 16.99% overhead of ptep_clear_flush()
which has been eliminated by this patch:
[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush
It is tested on 4,8,128 CPU platforms and shows to be beneficial on
large systems but may not have improvement on small systems like on
a 4 CPU platform.
Also this patch improve the performance of page migration. Using pmbench
and tries to migrate the pages of pmbench between node 0 and node 1 for
100 times for 1G memory, this patch decrease the time used around 20%
(prev 18.338318910 sec after 13.981866350 sec) and saved the time used
by ptep_clear_flush().
Link: https://lkml.kernel.org/r/20230717131004.12662-5-yangyicong@huawei.com
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "arm64: support batched/deferred tlb shootdown during page
reclamation/migration", v11.
Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free. A simplest micro benchmark shows even on
snapdragon 888 with only 8 cores, the overhead for ptep_clear_flush is
huge even for paging out one page mapped by only one process: 5.36% a.out
[kernel.kallsyms] [k] ptep_clear_flush
While pages are mapped by multiple processes or HW has more CPUs, the cost
should become even higher due to the bad scalability of tlb shootdown.
The same benchmark can result in 16.99% CPU consumption on ARM64 server
with around 100 cores according to the test on patch 4/4.
This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only send tlbi instructions in the first stage -
arch_tlbbatch_add_mm()
2. wait for the completion of tlbi by dsb while doing tlbbatch
sync in arch_tlbbatch_flush()
Testing on snapdragon shows the overhead of ptep_clear_flush is removed by
the patchset. The micro benchmark becomes 5% faster even for one page
mapped by single process on snapdragon 888.
Since BATCHED_UNMAP_TLB_FLUSH is implemented only on x86, the patchset
does some renaming/extension for the current implementation first (Patch
1-3), then add the support on arm64 (Patch 4).
This patch (of 4):
The entire scheme of deferred TLB flush in reclaim path rests on the fact
that the cost to refill TLB entries is less than flushing out individual
entries by sending IPI to remote CPUs. But architecture can have
different ways to evaluate that. Hence apart from checking
TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be
architecture specific.
[yangyicong@hisilicon.com: rebase and fix incorrect return value type]
Link: https://lkml.kernel.org/r/20230717131004.12662-1-yangyicong@huawei.com
Link: https://lkml.kernel.org/r/20230717131004.12662-2-yangyicong@huawei.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Punit Agrawal <punit.agrawal@bytedance.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Darren Hart <darren@os.amperecomputing.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: lipeifeng <lipeifeng@oppo.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zeng Tao <prime.zeng@hisilicon.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: ioremap: Convert architectures to take GENERIC_IOREMAP
way", v8.
Motivation and implementation:
==============================
Currently, many architecutres have't taken the standard GENERIC_IOREMAP
way to implement ioremap_prot(), iounmap(), and ioremap_xx(), but make
these functions specifically under each arch's folder. Those cause many
duplicated code of ioremap() and iounmap().
In this patchset, firstly introduce generic_ioremap_prot() and
generic_iounmap() to extract the generic code for GENERIC_IOREMAP. By
taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(),
generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and
iounmap() are all visible and available to arch. Arch needs to provide
wrapper functions to override the generic version if there's arch specific
handling in its corresponding ioremap_prot(), ioremap() or iounmap().
With these changes, duplicated ioremap/iounmap() code uder ARCH-es are
removed, and the equivalent functioality is kept as before.
Background info:
================
1) The converting more architectures to take GENERIC_IOREMAP way is
suggested by Christoph in below discussion:
https://lore.kernel.org/all/Yp7h0Jv6vpgt6xdZ@infradead.org/T/#u
2) In the previous v1 to v3, it's basically further action after arm64
has converted to GENERIC_IOREMAP way in below patchset. It's done by
adding hook ioremap_allowed() and iounmap_allowed() in ARCH to add ARCH
specific handling the middle of ioremap_prot() and iounmap().
[PATCH v5 0/6] arm64: Cleanup ioremap() and support ioremap_prot()
https://lore.kernel.org/all/20220607125027.44946-1-wangkefeng.wang@huawei.com/T/#u
Later, during v3 reviewing, Christophe Leroy suggested to introduce
generic_ioremap_prot() and generic_iounmap() to generic codes, and ARCH
can provide wrapper function ioremap_prot(), ioremap() or iounmap() if
needed. Christophe made a RFC patchset as below to specially demonstrate
his idea. This is what v4 and now v5 is doing.
[RFC PATCH 0/8] mm: ioremap: Convert architectures to take GENERIC_IOREMAP way
https://lore.kernel.org/all/cover.1665568707.git.christophe.leroy@csgroup.eu/T/#u
Testing:
========
In v8, I only applied this patchset onto the latest linus's tree to build
and run on arm64 and s390.
This patch (of 19):
Let's use '#define ioremap_xx' and "#ifdef ioremap_xx" instead.
To remove defined ARCH_HAS_IOREMAP_xx macros in <asm/io.h> of each ARCH,
the ARCH's own ioremap_wc|wt|np definition need be above "#include
<asm-generic/iomap.h>. Otherwise the redefinition error would be seen
during compiling. So the relevant adjustments are made to avoid compiling
error:
loongarch:
- doesn't include <asm-generic/iomap.h>, defining ARCH_HAS_IOREMAP_WC
is redundant, so simply remove it.
m68k:
- selected GENERIC_IOMAP, <asm-generic/iomap.h> has been added in
<asm-generic/io.h>, and <asm/kmap.h> is included above
<asm-generic/iomap.h>, so simply remove ARCH_HAS_IOREMAP_WT defining.
mips:
- move "#include <asm-generic/iomap.h>" below ioremap_wc definition
in <asm/io.h>
powerpc:
- remove "#include <asm-generic/iomap.h>" in <asm/io.h> because it's
duplicated with the one in <asm-generic/io.h>, let's rely on the
latter.
x86:
- selected GENERIC_IOMAP, remove #include <asm-generic/iomap.h> in
the middle of <asm/io.h>. Let's rely on <asm-generic/io.h>.
Link: https://lkml.kernel.org/r/20230706154520.11257-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Helge Deller <deller@gmx.de>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
test_pages() tests the page allocator by calling alloc_pages() with
different orders up to order 10.
However, different architectures and platforms support different maximum
contiguous allocation sizes. The default maximum allocation order
(MAX_ORDER) is 10, but architectures can use CONFIG_ARCH_FORCE_MAX_ORDER
to override this. On platforms where this is less than 10, test_meminit()
will blow up with a WARN(). This is expected, so let's not do that.
Replace the hardcoded "10" with the MAX_ORDER macro so that we test
allocations up to the expected platform limit.
Link: https://lkml.kernel.org/r/20230714015238.47931-1-ajd@linux.ibm.com
Fixes: 5015a300a5 ("lib: introduce test_meminit module")
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Xiaoke Wang <xkernel.wang@foxmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If init_section_page_ext failed, we only need rollback for mem_section
before failed mem_section. Make rollback end point to failed mem_section
to remove unnecessary rollback.
As pfn += PAGES_PER_SECTION will be executed even if init_section_page_ext
failed. So pfn points to mem_section after failed mem_section. Subtract
one mem_section from pfn to get failed mem_section.
Link: https://lkml.kernel.org/r/20230714114749.1743032-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add the functionality, is_raw_hwpoison_page_in_hugepage, to tell if a raw
page in a hugetlb folio is HWPOISON. This functionality relies on
RawHwpUnreliable to be not set; otherwise hugepage's raw HWPOISON list
becomes meaningless.
is_raw_hwpoison_page_in_hugepage holds mf_mutex in order to synchronize
with folio_set_hugetlb_hwpoison and folio_free_raw_hwp who iterate,
insert, or delete entry in raw_hwp_list. llist itself doesn't ensure
insertion and removal are synchornized with the llist_for_each_entry used
by is_raw_hwpoison_page_in_hugepage (unless iterated entries are already
deleted from the list). Caller can minimize the overhead of lock cycles
by first checking HWPOISON flag of the folio.
Exports this functionality to be immediately used in the read operation
for hugetlbfs.
Link: https://lkml.kernel.org/r/20230713001833.3778937-3-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Improve hugetlbfs read on HWPOISON hugepages", v4.
Today when hardware memory is corrupted in a hugetlb hugepage, kernel
leaves the hugepage in pagecache [1]; otherwise future mmap or read will
suject to silent data corruption. This is implemented by returning -EIO
from hugetlb_read_iter immediately if the hugepage has HWPOISON flag set.
Since memory_failure already tracks the raw HWPOISON subpages in a
hugepage, a natural improvement is possible: if userspace only asks for
healthy subpages in the pagecache, kernel can return these data.
This patchset implements this improvement. It consist of three parts.
The 1st commit exports the functionality to tell if a subpage inside a
hugetlb hugepage is a raw HWPOISON page. The 2nd commit teaches
hugetlbfs_read_iter to return as many healthy bytes as possible. The 3rd
commit properly tests this new feature.
[1] commit 8625147caf ("hugetlbfs: don't delete error page from pagecache")
This patch (of 4):
Traversal on llist (e.g. llist_for_each_safe) is only safe AFTER entries
are deleted from the llist. Correct the way __folio_free_raw_hwp deletes
and frees raw_hwp_page entries in raw_hwp_list: first llist_del_all, then
kfree within llist_for_each_safe.
As of today, concurrent adding, deleting, and traversal on raw_hwp_list
from hugetlb.c and/or memory-failure.c are fine with each other. Note
this is guaranteed partly by the lock-free nature of llist, and partly by
holding hugetlb_lock and/or mf_mutex. For example, as llist_del_all is
lock-free with itself, folio_clear_hugetlb_hwpoison()s from
__update_and_free_hugetlb_folio and memory_failure won't need explicit
locking when freeing the raw_hwp_list. New code that manipulates
raw_hwp_list must be careful to ensure the concurrency correctness.
Link: https://lkml.kernel.org/r/20230713001833.3778937-1-jiaqiyan@google.com
Link: https://lkml.kernel.org/r/20230713001833.3778937-2-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs. When running multiple parallel
tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them
come from load_elf_binary through the call chain
"execl->do_execveat_common->bprm_execve->load_elf_binary".
In do_mmap,it will call mmap_region to create vma node, initialize it and
insert it to vma maintain structure in mm_struct and i_mmap tree of the
mapping file, then increase map_count to record the number of vma nodes
used. The hot osq_lock is to protect operations on file's i_mmap tree.
For the mm_struct member change like vma insertion and map_count update,
they do not affect i_mmap tree. Move those operations out of the lock's
critical section, to reduce hold time on the lock.
With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.0-rc6, the 160 parallel score improves by 12%. The patch has no
obvious performance gain on v6.5-rc1 due to regression of this benchmark
from this commit f1a7941243 (mm: convert
mm's rss stats into percpu_counter). Related discussion and conclusion
can be referred at the mail thread initiated by 0day as below: Link:
https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-7e8fb108def0@intel.com/
Link: https://lkml.kernel.org/r/20230712145739.604215-1-yu.ma@intel.com
Signed-off-by: Yu Ma <yu.ma@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A . Shutemov <kirill@shutemov.name>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhu, Lipeng <lipeng.zhu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>