When a THP is removed from the page cache by reclaim, we replace it with a
shadow entry that occupies all slots of the XArray previously occupied by
the THP. If the user then accesses that page again, we only allocate a
single page, but storing it into the shadow entry replaces all entries
with that one page. That leads to bugs like
page dumped because: VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
------------[ cut here ]------------
kernel BUG at mm/filemap.c:2529!
https://bugzilla.kernel.org/show_bug.cgi?id=206569
This is hard to reproduce with mainline, but happens regularly with the
THP patchset (as so many more THPs are created). This solution is take
from the THP patchset. It splits the shadow entry into order-0 pieces at
the time that we bring a new page into cache.
Fixes: 99cb0dbd47 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/20200903183029.14930-4-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Fix read-only THP for non-tmpfs filesystems".
As described more verbosely in the [3/3] changelog, we can inadvertently
put an order-0 page in the page cache which occupies 512 consecutive
entries. Users are running into this if they enable the
READ_ONLY_THP_FOR_FS config option; see
https://bugzilla.kernel.org/show_bug.cgi?id=206569 and Qian Cai has also
reported it here:
https://lore.kernel.org/lkml/20200616013309.GB815@lca.pw/
This is a rather intrusive way of fixing the problem, but has the
advantage that I've actually been testing it with the THP patches, which
means that it sees far more use than it does upstream -- indeed, Song has
been entirely unable to reproduce it. It also has the advantage that it
removes a few patches from my gargantuan backlog of THP patches.
This patch (of 3):
This function returns the order of the entry at the index. We need this
because there isn't space in the shadow entry to encode its order.
[akpm@linux-foundation.org: export xa_get_order to modules]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Qian Cai <cai@lca.pw>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lkml.kernel.org/r/20200903183029.14930-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20200903183029.14930-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
powerpc used to set the pte specific flags in set_pte_at(). This is
different from other architectures. To be consistent with other
architecture update pfn_pte to set _PAGE_PTE on ppc64. Also, drop now
unused pte_mkpte.
We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without
setting _PAGE_PTE bit. We will remove that after a few releases.
With respect to huge pmd entries, pmd_mkhuge() takes care of adding the
_PAGE_PTE bit.
[akpm@linux-foundation.org: whitespace fix, per Christophe]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lkml.kernel.org/r/20200902114222.181353-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ucma_free_ctx() should call to __destroy_id() on all the connection requests
that have not been delivered to user space. Currently it calls on the
context itself and cause to use after free.
Fixes the trace:
BUG: Unable to handle kernel data access on write at 0x5deadbeef0000108
Faulting instruction address: 0xc0080000002428f4
Oops: Kernel access of bad area, sig: 11 [#1]
Call Trace:
[c000000207f2b680] [c00800000024280c] .__destroy_id+0x28c/0x610 [rdma_ucm] (unreliable)
[c000000207f2b750] [c0080000002429c4] .__destroy_id+0x444/0x610 [rdma_ucm]
[c000000207f2b820] [c008000000242c24] .ucma_close+0x94/0xf0 [rdma_ucm]
[c000000207f2b8c0] [c00000000046fbdc] .__fput+0xac/0x330
[c000000207f2b960] [c00000000015d48c] .task_work_run+0xbc/0x110
[c000000207f2b9f0] [c00000000012fb00] .do_exit+0x430/0xc50
[c000000207f2bae0] [c0000000001303ec] .do_group_exit+0x5c/0xd0
[c000000207f2bb70] [c000000000144a34] .get_signal+0x194/0xe30
[c000000207f2bc60] [c00000000001f6b4] .do_notify_resume+0x124/0x470
[c000000207f2bd60] [c000000000032484] .interrupt_exit_user_prepare+0x1b4/0x240
[c000000207f2be20] [c000000000010034] interrupt_return+0x14/0x1c0
Rename listen_ctx to conn_req_ctx as the poor name was the cause of this
bug.
Fixes: a1d33b70db ("RDMA/ucma: Rework how new connections are passed through event delivery")
Link: https://lore.kernel.org/r/20201012045600.418271-4-leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If skb_clone() is unable to allocate memory for a new sk_buff this is not
detected by the current code.
Check for a NULL return and continue. This is similar to other errors in
this loop over QPs attached to the multicast address and consistent with
the unreliable UD transport.
Fixes: e7ec96fc79 ("RDMA/rxe: Fix skb lifetime in rxe_rcv_mcast_pkt()")
Addresses-Coverity-ID: 1497804: Null pointer dereferences (NULL_RETURNS)
Link: https://lore.kernel.org/r/20201013184236.5231-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
RXE was wrongly using an internal kernel enum as part of its uAPI, split
this out into a dedicated uAPI enum just for RXE. It only uses the IPv4
and IPv6 values.
This was exposed by changing the internal kernel enum definition which
broke RXE.
Fixes: 1c15b4f2a4 ("RDMA/core: Modify enum ib_gid_type and enum rdma_network_type")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The avs drivers are all SoC specific drivers that doesn't share any code.
Instead they are located in a directory, mostly to keep similar
functionality together. From a maintenance point of view, it makes better
sense to collect SoC specific drivers like these, into the SoC specific
directories.
Therefore, let's move the smartreflex driver for OMAP to the ti directory.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The avs drivers are all SoC specific drivers that doesn't share any code.
Instead they are located in a directory, mostly to keep similar
functionality together. From a maintenance point of view, it makes better
sense to collect SoC specific drivers like these, into the SoC specific
directories.
Therefore, let's move the rockchip-io driver to the rockchip directory.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Previously we computed L1.2 parameters in the enumeration path, saved them
in struct pcie_link_state.l1ss, and programmed them into the devices
whenever we enabled or disabled L1.2 on the link. But these parameters are
constant and don't need to be updated when enabling/disabling L1.2.
Compute and program the L1.2 parameters once during enumeration and remove
the struct pcie_link_state.l1ss member. No functional change intended.
[bhelgaas: rework to program L1.2 parameters during enumeration]
Link: https://lore.kernel.org/r/20201015193039.12585-13-helgaas@kernel.org
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Previously we stored the L1SS Capabilities value in the struct
aspm_register_info.
We only need this information in one place, so read it there and remove
struct aspm_register_info completely, since it's now empty. No functional
change intended.
[bhelgaas: split up, don't cache l1ss_cap in pci_dev]
Link: https://lore.kernel.org/r/20201015193039.12585-12-helgaas@kernel.org
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Save the L1 Substates Capability pointer in struct pci_dev. Then we don't
have to keep track of it in the struct aspm_register_info and struct
pcie_link_state, which makes the code easier to read. No functional change
intended.
[bhelgaas: split to a separate patch]
Link: https://lore.kernel.org/r/20201015193039.12585-8-helgaas@kernel.org
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Previously we stored L0s and L1 Exit Latency information from the Link
Capabilities register in the struct aspm_register_info.
We only need these latencies when we already have the Link Capabilities
values, so use those directly and remove the latencies from struct
aspm_register_info. No functional change intended.
Link: https://lore.kernel.org/r/20201015193039.12585-7-helgaas@kernel.org
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Previously we stored the "ASPM Control" bits from the Link Control register
in the struct aspm_register_info.
Read PCI_EXP_LNKCTL directly when needed. This means we can use the
PCI_EXP_LNKCTL_ASPM_* bits directly instead of the similar but different
PCIE_LINK_STATE_* bits. No functional change intended.
[bhelgaas: drop get_aspm_enable() and read LNKCTL once directly]
Link: https://lore.kernel.org/r/20201015193039.12585-6-helgaas@kernel.org
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Previously we stored the "ASPM Support" field from the Link Capabilities
register in the struct aspm_register_info.
Read the Link Capabilities directly when needed and remove it from the
struct aspm_register_info. No functional change intended.
[bhelgaas: remove pci_dev cached copy since LNKCAP isn't truly read-only,
add PCI_EXP_LNKCAP_ASPM_L0S & PCI_EXP_LNKCAP_ASPM_L1, check them directly
instead of adding aspm_support()]
Link: https://lore.kernel.org/r/20201015193039.12585-5-helgaas@kernel.org
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Other users of link->pdev and link->downstream, e.g., pcie_aspm_cap_init(),
pcie_config_aspm_l1ss(), and pcie_config_aspm_link(), use "parent" and
"child" as local names.
Do the same in aspm_calc_l1ss_info() for readability. No functional change
intended.
Link: https://lore.kernel.org/r/20201015193039.12585-4-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
pcie_get_aspm_reg() mostly reads ASPM-related registers, but in some cases
it also updates the value read from PCI_L1SS_CAP based on LTR properties.
Move this update to the point where the value is used to make the code more
readable.
No functional change intended, although previously we could clear
PCI_L1SS_CAP_ASPM_L1_2 for both ends of the link, and now we'll only do it
for the downstream end of a link. This shouldn't matter because we always
test that bit by ANDing l1ss_cap for the upstream and downstream ends.
Link: https://lore.kernel.org/r/20201015193039.12585-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Add a Kconfig menu for Intel DPTF (Dynamic Platform and Thermal
Framework), put both the existing participant drivers in it and set
them to be built as modules by default.
While at it, do a few assorted cleanups for a good measure.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Change the names of DPTF participant drivers to adhere to the
sysfs file naming conventions (no spaces present in the name in
particular).
Fixes: 2ce6324ead ("ACPI: DPTF: Add PCH FIVR participant driver")
Fixes: 6256ebd5da ("ACPI / DPTF: Add DPTF power participant driver")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
ACPI 6.3 Errata A no longer allows _UID to return a string except for
Itanium (for historical reasons) as stated in section 5.2.12:
"From ACPI Specification 6.3 onward, all processor objects for all
architectures except Itanium must now use Device() objects with an
_HID of ACPI0007, and use only integer _UID values."
Therefore, the "we don't handle string _UIDs yet" comment, which
implies a missing feature, is redundant, so drop it.
Signed-off-by: Alex Hung <alex.hung@canonical.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
According to the ACPI spec, "The system must reset immediately following
the write to the ACPI RESET_REG register.", but there are cases that the
system does not follow this and results in racing with the subsequetial
reboot mechanism, which brings unexpected behavior.
Fix this by adding a 15ms delay after writing to the ACPI RESET_REG.
Reported-by: Ghorai, Sukumar <sukumar.ghorai@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[ rjw: Edit comment in the code and subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If ACPI is disabled then loading the acpi_dbg module will result in the
following splat when lock debugging is enabled.
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: CPU: 0 PID: 1 at kernel/locking/mutex.c:938 __mutex_lock+0xa10/0x1290
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc8+ #103
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x4d8
show_stack+0x34/0x48
dump_stack+0x174/0x1f8
panic+0x360/0x7a0
__warn+0x244/0x2ec
report_bug+0x240/0x398
bug_handler+0x50/0xc0
call_break_hook+0x160/0x1d8
brk_handler+0x30/0xc0
do_debug_exception+0x184/0x340
el1_dbg+0x48/0xb0
el1_sync_handler+0x170/0x1c8
el1_sync+0x80/0x100
__mutex_lock+0xa10/0x1290
mutex_lock_nested+0x6c/0xc0
acpi_register_debugger+0x40/0x88
acpi_aml_init+0xc4/0x114
do_one_initcall+0x24c/0xb10
kernel_init_freeable+0x690/0x728
kernel_init+0x20/0x1e8
ret_from_fork+0x10/0x18
This is because acpi_debugger.lock has not been initialized as
acpi_debugger_init() is not called when ACPI is disabled. Fail module
loading to avoid this and any subsequent problems that might arise by
trying to debug AML when ACPI is disabled.
Fixes: 8cfb0cdf07 ("ACPI / debugger: Add IO interface to access debugger functionalities")
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Jamie Iles <jamie@nuviainc.com>
Cc: 4.10+ <stable@vger.kernel.org> # 4.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To enable better debug of PM domains, keep a track of successful
and failing attempts to enter each domain idle state.
This statistics are exported in debugfs when reading the
idle_states node associated with each PM domain.
Signed-off-by: Lina Iyer <ilina@codeaurora.org>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There is not strict need to group a comment and a single statement in an
if block, as comments are stripped by the pre-processor. However,
adding curly braces does make the code easier to read, and may avoid
mistakes when changing the code later.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The main intention of the max_segment argument to
__sg_alloc_table_from_pages() is to match the DMA layer segment size set
by dma_set_max_seg_size().
Restricting the input to be page aligned makes it impossible to just
connect the DMA layer to this API.
The only reason for a page alignment here is because the algorithm will
overshoot the max_segment if it is not a multiple of PAGE_SIZE. Simply fix
the alignment before starting and don't expose this implementation detail
to the callers.
A future patch will completely remove SCATTERLIST_MAX_SEGMENT.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
From Maor Gottlieb says:
====================
This series extends __sg_alloc_table_from_pages to allow chaining of new
pages to an already initialized SG table.
This allows for drivers to utilize the optimization of merging contiguous
pages without a need to pre allocate all the pages and hold them in a very
large temporary buffer prior to the call to SG table initialization.
The last patch changes the Infiniband core to use the new API. It removes
duplicate functionality from the code and benefits from the optimization
of allocating dynamic SG table from pages.
In huge pages system of 2MB page size, without this change, the SG table
would contain x512 SG entries.
====================
* branch 'dynamic_sg':
RDMA/umem: Move to allocate SG table from pages
lib/scatterlist: Add support in dynamic allocation of SG table from pages
tools/testing/scatterlist: Show errors in human readable form
tools/testing/scatterlist: Rejuvenate bit-rotten test
A device may have specific HW constraints that must be obeyed to, before
its corresponding PM domain (genpd) can be powered off - and vice verse at
power on. These constraints can't be managed through the regular runtime PM
based deployment for a device, because the access pattern for it, isn't
always request based. In other words, using the runtime PM callbacks to
deal with the constraints doesn't work for these cases.
For these reasons, let's instead add a PM domain power on/off notification
mechanism to genpd. To add/remove a notifier for a device, the device must
already have been attached to the genpd, which also means that it needs to
be a part of the PM domain topology.
To add/remove a notifier, let's introduce two genpd specific functions:
- dev_pm_genpd_add|remove_notifier()
Note that, to further clarify when genpd power on/off notifiers may be
used, one can compare with the existing CPU_CLUSTER_PM_ENTER|EXIT
notifiers. In the long run, the genpd power on/off notifiers should be able
to replace them, but that requires additional genpd based platform support
for the current users.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Tested-by: Lina Iyer <ilina@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On multi-package systems, the Psys MSR is only valid for CPUs on
specific package (master package). The current code makes the
assumption that package 0 is the master package, but this is not
true on new platforms like SPR.
Fix the problem by emuerating the Psys RAPL domain for every
package, so CPUs in slave packages will read 0 for the Psys energy
counter and only CPUs in master packages can get a valid reading
and register the Psys RAPL domain.
The sysfs I/F for the Psys RAPL domain is not changed.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
As only the low 32 bits of the RAPL_DOMAIN_REG_STATUS register
represents the energy counter, and the high 32 bits are reserved,
detect the existence of a RAPL domain by checking the low 32 bits only.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>