We've got a bunch of special swap entries that stores PFN inside the swap
offset fields. To fetch the PFN, normally the user just calls
swp_offset() assuming that'll be the PFN.
Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
max possible length of a PFN on the host, meanwhile doing proper check
with MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the
PFNs properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
One reason to do so is we never tried to sanitize whether swap offset can
really fit for storing PFN. At the meantime, this patch also prepares us
with the future possibility to store more information inside the swp
offset field, so assuming "swp_offset(entry)" to be the PFN will not stand
any more very soon.
Replace many of the swp_offset() callers to use swp_offset_pfn() where
proper. Note that many of the existing users are not candidates for the
replacement, e.g.:
(1) When the swap entry is not a pfn swap entry at all, or,
(2) when we wanna keep the whole swp_offset but only change the swp type.
For the latter, it can happen when fork() triggered on a write-migration
swap entry pte, we may want to only change the migration type from
write->read but keep the rest, so it's not "fetching PFN" but "changing
swap type only". They're left aside so that when there're more
information within the swp offset they'll be carried over naturally in
those cases.
Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
the new swp_offset_pfn() is about.
Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: Remember a/d bits for migration entries", v4.
Problem
=======
When migrating a page, right now we always mark the migrated page as old &
clean.
However that could lead to at least two problems:
(1) We lost the real hot/cold information while we could have persisted.
That information shouldn't change even if the backing page is changed
after the migration,
(2) There can be always extra overhead on the immediate next access to
any migrated page, because hardware MMU needs cycles to set the young
bit again for reads, and dirty bits for write, as long as the
hardware MMU supports these bits.
Many of the recent upstream works showed that (2) is not something trivial
and actually very measurable. In my test case, reading 1G chunk of memory
- jumping in page size intervals - could take 99ms just because of the
extra setting on the young bit on a generic x86_64 system, comparing to
4ms if young set.
This issue is originally reported by Andrea Arcangeli.
Solution
========
To solve this problem, this patchset tries to remember the young/dirty
bits in the migration entries and carry them over when recovering the
ptes.
We have the chance to do so because in many systems the swap offset is not
really fully used. Migration entries use swp offset to store PFN only,
while the PFN is normally not as large as swp offset and normally smaller.
It means we do have some free bits in swp offset that we can use to store
things like A/D bits, and that's how this series tried to approach this
problem.
max_swapfile_size() is used here to detect per-arch offset length in swp
entries. We'll automatically remember the A/D bits when we find that we
have enough swp offset field to keep both the PFN and the extra bits.
Since max_swapfile_size() can be slow, the last two patches cache the
results for it and also swap_migration_ad_supported as a whole.
Known Issues / TODOs
====================
We still haven't taught madvise() to recognize the new A/D bits in
migration entries, namely MADV_COLD/MADV_FREE. E.g. when MADV_COLD upon
a migration entry. It's not clear yet on whether we should clear the A
bit, or we should just drop the entry directly.
We didn't teach idle page tracking on the new migration entries, because
it'll need larger rework on the tree on rmap pgtable walk. However it
should make it already better because before this patchset page will be
old page after migration, so the series will fix potential false negative
of idle page tracking when pages were migrated before observing.
The other thing is migration A/D bits will not start to working for
private device swap entries. The code is there for completeness but since
private device swap entries do not yet have fields to store A/D bits, even
if we'll persistent A/D across present pte switching to migration entry,
we'll lose it again when the migration entry converted to private device
swap entry.
Tests
=====
After the patchset applied, the immediate read access test [1] of above 1G
chunk after migration can shrink from 99ms to 4ms. The test is done by
moving 1G pages from node 0->1->0 then read it in page size jumps. The
test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
Similar effect can also be measured when writting the memory the 1st time
after migration.
After applying the patchset, both initial immediate read/write after page
migrated will perform similarly like before migration happened.
Patch Layout
============
Patch 1-2: Cleanups from either previous versions or on swapops.h macros.
Patch 3-4: Prepare for the introduction of migration A/D bits
Patch 5: The core patch to remember young/dirty bit in swap offsets.
Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast.
[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
This patch (of 7):
Replace all the magic "5" with the macro.
Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The syzbot reported the below problem:
BUG: Bad page map in process syz-executor198 pte:8000000071c00227 pmd:74b30067
addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
file:(null) fault:0x0 mmap:0x0 read_folio:0x0
CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
vm_normal_page+0x10c/0x2a0 mm/memory.c:636
hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
do_madvise mm/madvise.c:1428 [inline]
__do_sys_madvise mm/madvise.c:1428 [inline]
__se_sys_madvise mm/madvise.c:1426 [inline]
__x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f770ba87929
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
Basically the test program does the below conceptually:
1. mmap 0x2000000 - 0x21000000 as anonymous region
2. mmap io_uring SQ stuff at 0x20563000 with MAP_FIXED, io_uring_mmap()
actually remaps the pages with special PTEs
3. call MADV_COLLAPSE for 0x20000000 - 0x21000000
It actually triggered the below race:
CPU A CPU B
mmap 0x20000000 - 0x21000000 as anon
madvise_collapse is called on this area
Retrieve start and end address from the vma (NEVER updated later!)
Collapsed the first 2M area and dropped mmap_lock
Acquire mmap_lock
mmap io_uring file at 0x20563000
Release mmap_lock
Reacquire mmap_lock
revalidate vma pass since 0x20200000 + 0x200000 > 0x20563000
scan the next 2M (0x20200000 - 0x20400000), but due to whatever reason it didn't release mmap_lock
scan the 3rd 2M area (start from 0x20400000)
get into the vma created by io_uring
The hend should be updated after MADV_COLLAPSE reacquire mmap_lock since
the vma may be shrunk. We don't have to worry about shink from the other
direction since it could be caught by hugepage_vma_revalidate(). Either
no valid vma is found or the vma doesn't fit anymore.
Link: https://lkml.kernel.org/r/20220914162220.787703-1-shy828301@gmail.com
Fixes: 7d8faaf155 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
Reported-by: syzbot+915f3e317adb0e85835f@syzkaller.appspotmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert the driver to use OF match data for providing the Alpha PLL config
per compatible.
This is required for IPQ8074 support since it uses a different Alpha PLL
config.
While we are here rename "ipq_pll_config" to "ipq6018_pll_config" to make
it clear that it is for IPQ6018 only.
Signed-off-by: Robert Marko <robimarko@gmail.com>
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20220818220628.339366-5-robimarko@gmail.com
While working on IPQ8074 APSS driver it was discovered that IPQ6018 and
IPQ8074 use almost the same PLL and APSS clocks, however APSS driver is
currently broken.
More precisely apcs_alias0_clk_src is broken, it was added as regmap_mux
clock.
However after debugging why it was always stuck at 800Mhz, it was figured
out that its not regmap_mux compatible at all.
It is a simple mux but it uses RCG2 register layout and control bits, so
utilize the new clk_rcg2_mux_closest_ops to correctly drive it while not
having to provide a dummy frequency table.
While we are here, use ARRAY_SIZE for number of parents.
Tested on IPQ6018-CP01-C1 reference board and multiple IPQ8074 boards.
Fixes: 5e77b4ef1b ("clk: qcom: Add ipq6018 apss clock controller")
Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20220818220628.339366-2-robimarko@gmail.com
An RCG may act as a mux that switch between 2 parents.
This is the case on IPQ6018 and IPQ8074 where the APCS core clk that feeds
the CPU cluster clock just switches between XO and the PLL that feeds it.
Add the required ops to add support for this special configuration and use
the generic mux function to determine the rate.
This way we dont have to keep a essentially dummy frequency table to use
RCG2 as a mux.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Link: https://lore.kernel.org/r/20220818220628.339366-1-robimarko@gmail.com
Pass the gendisk to blkcg_init_disk and blkcg_exit_disk as part of moving
the blk-cgroup infrastructure to be gendisk based. Also remove the
rather pointless kerneldoc comments for these internal functions with a
single caller each.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sean Anderson says:
====================
net: sunhme: Cleanups and logging improvements
This series is a continuation of [1] with a focus on logging improvements (in
the style of commit b11e5f6a3a ("net: sunhme: output link status with a single
print.")). I have included several of Rolf's patches in the series where
appropriate (with slight modifications). After this series is applied, many more
messages from this driver will come with driver/device information.
Additionally, most messages (especially debug messages) have been condensed onto
one line (as KERN_CONT messages get split).
[1] https://lore.kernel.org/netdev/4686583.GXAFRqVoOG@eto.sf-tec.de/
====================
Link: https://lore.kernel.org/r/20220924015339.1816744-1-seanga2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I have the hardware so at the very least I can test things.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The SXD, TXD, and RXD macros are used only once (or twice). Just use the
vdbg print, which seems to have been devised for these sorts of very
verbose messages.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This driver seems to have been written under the assumption that messages
can be continued arbitrarily. I'm not when this changed (if ever), but such
ad-hoc continuations are liable to be rudely interrupted. Convert all such
instances to single prints. This loses a bit of timing information (such as
when a line was constructed piecemeal as the function executed), but it's
easy to add a few prints if necessary. This also adds newlines to the ends
of any prints without them.
Since (almost every) debug print included the name of the function, include
it automatically.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wherever possible, use the associated netdev (or device) when printing
errors or other messages. This makes it immediately clear what device
caused the error, and provides more information than just the device name.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a mostly-mechanical translation of the existing printks into
pr_foos. In several places, I have pasted messages which were broken over
several lines to allow for easier grepping.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove all the single-use debug conditionals, and just collect the debug
defines at the top of the file. HMD seems like it is used for general debug
info, so just redefine it as pr_debug. Additionally, instead of using the
default loglevel, use the debug loglevel for debugging.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With the power of variadic macros, double parentheses are unnecessary.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This not only removes a lot of code, it also fixes the memleak of the DMA
memory when register_netdev() fails.
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
[ rebased onto net-next/master; fixed error reporting ]
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This fixes several error paths to ensure they return an appropriate error
(instead of ENODEV).
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to differentiate between a missing bridge and an OOM condition,
return ERR_PTRs from quattro_pci_find. This also does some general linting
in the area.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This already returns a proper error value, so pass it to the caller.
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Module versions are not very useful:
> The basic problem is, the version string does not identify the sources
> with enough accuracy. It says nothing about back ported fixes in
> stable kernels. It tells you nothing about vendor patches to the
> network core, etc.
https://lore.kernel.org/all/Yf6mtvA1zO7cdzr7@lunn.ch/
While we're at it, inline the author and use the driver name a bit more.
Signed-off-by: Sean Anderson <seanga2@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>