Currently the kernel force charges allocations which have the __GFP_HIGH
flag set without triggering memory reclaim. __GFP_HIGH indicates that
the caller is high priority and since commit 869712fd3d ("mm:
memcontrol: fix network errors from failing __GFP_ATOMIC charges") the
kernel lets such allocations do force charging. Please note that
__GFP_ATOMIC has been replaced by __GFP_HIGH.
__GFP_HIGH does not tell if the caller can block or can trigger reclaim.
There are separate checks to determine that. So, there is no need to
skip reclaiming for __GFP_HIGH allocations. So, handle __GFP_HIGH
together with __GFP_NOFAIL which also does force charging.
Please note that this is a noop change as there are no __GFP_HIGH
allocators in the kernel which also have __GFP_ACCOUNT (or SLAB_ACCOUNT)
and do not allow reclaim for now.
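The change in ordering can be modelled in user space. The following is a
minimal sketch under stated assumptions, not the kernel's charge path:
the MODEL_GFP_* values, the charge_result enum and both function names
are hypothetical stand-ins, and reclaim is reduced to a single boolean.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the gfp bits; the values are illustrative. */
#define MODEL_GFP_HIGH           0x1u
#define MODEL_GFP_NOFAIL         0x2u
#define MODEL_GFP_DIRECT_RECLAIM 0x4u

enum charge_result {
	CHARGED,               /* fits under the limit */
	CHARGED_AFTER_RECLAIM, /* fits once reclaim made room */
	FORCE_CHARGED,         /* charged beyond the limit */
	CHARGE_FAILED,
};

/* Old ordering: a high priority caller skips reclaim entirely. */
static enum charge_result charge_old(unsigned int gfp, bool over_limit,
				     bool reclaim_helps)
{
	if (!over_limit)
		return CHARGED;
	if (gfp & MODEL_GFP_HIGH)
		return FORCE_CHARGED;
	if ((gfp & MODEL_GFP_DIRECT_RECLAIM) && reclaim_helps)
		return CHARGED_AFTER_RECLAIM;
	if (gfp & MODEL_GFP_NOFAIL)
		return FORCE_CHARGED;
	return CHARGE_FAILED;
}

/* New ordering: reclaim if allowed, then force charge HIGH and NOFAIL. */
static enum charge_result charge_new(unsigned int gfp, bool over_limit,
				     bool reclaim_helps)
{
	if (!over_limit)
		return CHARGED;
	if ((gfp & MODEL_GFP_DIRECT_RECLAIM) && reclaim_helps)
		return CHARGED_AFTER_RECLAIM;
	if (gfp & (MODEL_GFP_NOFAIL | MODEL_GFP_HIGH))
		return FORCE_CHARGED;
	return CHARGE_FAILED;
}
```

In this model the two functions only differ for a caller that sets the
high priority bit and also allows reclaim, which matches the noop claim
above for callers that do not allow reclaim.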
Link: https://lkml.kernel.org/r/20220211064917.2028469-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "memcg: robust enforcement of memory.high", v2.
Due to the semantics of memory.high enforcement i.e. throttle the
workload without oom-kill, we are trying to use it for right sizing the
workloads in our production environment. However we observed that the
mechanism fails for some specific applications which do big chunks of
allocations in a single syscall. The failure is due to a limitation of
the current implementation of memory.high enforcement.
This patch series solves this issue by enforcing the memory.high
synchronously if the current process has accumulated a large amount of
high overcharge.
This patch (of 4):
The function mem_cgroup_oom returns enum which has four possible values
but the caller does not care about such values and only cares if the
return value is OOM_SUCCESS or not. So, remove the enum altogether and
make mem_cgroup_oom return a simple bool.
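The shape of this simplification can be sketched in user space. This is
an illustration only: every model_* name is hypothetical, the enum member
names other than the success value are assumptions, and the conditions
that select each value are invented for the example rather than taken
from the kernel.

```c
#include <stdbool.h>

/* Four-valued result, as before the patch (member names are assumed). */
enum model_oom_status {
	MODEL_OOM_SUCCESS,
	MODEL_OOM_FAILED,
	MODEL_OOM_ASYNC,
	MODEL_OOM_SKIPPED,
};

/* Before: the function distinguishes four outcomes. */
static enum model_oom_status model_oom_old(bool skipped, bool deferred,
					   bool victim_killed)
{
	if (skipped)
		return MODEL_OOM_SKIPPED;
	if (deferred)
		return MODEL_OOM_ASYNC;
	return victim_killed ? MODEL_OOM_SUCCESS : MODEL_OOM_FAILED;
}

/* The only question the caller ever asked of the old return value. */
static bool old_caller_view(bool skipped, bool deferred, bool victim_killed)
{
	return model_oom_old(skipped, deferred, victim_killed) ==
	       MODEL_OOM_SUCCESS;
}

/* After: the function answers that question directly. */
static bool model_oom_new(bool skipped, bool deferred, bool victim_killed)
{
	if (skipped || deferred)
		return false;
	return victim_killed;
}
```

For every input combination, old_caller_view() and model_oom_new() agree,
which is why collapsing the enum loses nothing the caller used.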
Link: https://lkml.kernel.org/r/20220211064917.2028469-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20220211064917.2028469-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently memcg stats show several types of kernel memory: kernel stack,
page tables, sock, vmalloc, and slab. However, there are other
allocations with __GFP_ACCOUNT (or supersets such as GFP_KERNEL_ACCOUNT)
that are not accounted in any of those stats, a few examples are:
- various kvm allocations (e.g. allocated pages to create vcpus)
- io_uring
- tmp_page in pipes during pipe_write()
- bpf ringbuffers
- unix sockets
Keeping track of the total kernel memory is essential for the ease of
migration from cgroup v1 to v2 as there are large discrepancies between
v1's kmem.usage_in_bytes and the sum of the available kernel memory
stats in v2. Adding separate memcg stats for all __GFP_ACCOUNT kernel
allocations is an impractical maintenance burden as there are a lot of
those all over the kernel code, with more use cases likely to show up in
the future.
Therefore, add a "kernel" memcg stat that is analogous to kmem page
counter, with added benefits such as using rstat infrastructure which
aggregates stats more efficiently. Additionally, this provides a
lighter alternative in case the legacy kmem is deprecated in the future.
[yosryahmed@google.com: v2]
Link: https://lkml.kernel.org/r/20220203193856.972500-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20220201200823.3283171-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mikulas asked "Do we still need commit a0ee5ec520 ('tmpfs: allocate
on read when stacked')?" in [1].
Lukas noticed this unusual behavior of loop device backed by tmpfs in [2].
Normally, shmem_file_read_iter() copies the ZERO_PAGE when reading
holes; but if it looks like it might be a read for "a stacking
filesystem", it allocates actual pages to the page cache, and even marks
them as dirty. And reads from the loop device do satisfy the test that
is used.
This oddity was added for an old version of unionfs, to help to limit
its usage to the limited size of the tmpfs mount involved; but about the
same time as the tmpfs mod went in (2.6.25), unionfs was reworked to
proceed differently; and the mod was kept just in case others needed it.
Do we still need it? I cannot answer with more certainty than "Probably
not". It's nasty enough that we really should try to delete it; but if
a regression is reported somewhere, then we might have to revert later.
It's not quite as simple as just removing the test (as Mikulas did):
xfstests generic/013 hung because splice from tmpfs failed on page not
up-to-date and page mapping unset. That can be fixed just by marking
the ZERO_PAGE as Uptodate, which of course it is: do so in
pagecache_init() - it might be useful to others than tmpfs.
My intention, though, was to stop using the ZERO_PAGE here altogether:
surely iov_iter_zero() is better for this case? Sadly not: it relies on
clear_user(), and the x86 clear_user() is slower than its copy_user() [3].
But while we are still using the ZERO_PAGE, let's stop dirtying its
struct page cacheline with unnecessary get_page() and put_page().
Link: https://lore.kernel.org/linux-mm/alpine.LRH.2.02.2007210510230.6959@file01.intranet.prod.int.rdu2.redhat.com/ [1]
Link: https://lore.kernel.org/linux-mm/20211126075100.gd64odg2bcptiqeb@work/ [2]
Link: https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com/ [3]
Link: https://lkml.kernel.org/r/90bc5e69-9984-b5fa-a685-be55f2b64b@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I added page_mapped() resilience in __delete_from_page_cache() for
the mapping_exiting() case, I missed that mapping_set_exiting() is done
in truncate_inode_pages_final(), which is not actually called for shmem.
(Today, it is folio_mapped() resilience in filemap_unaccount_folio().)
So the fixup to avoid a memory leak in this case never worked on shmem:
add a mapping_set_exiting() in shmem_evict_inode() at last. But this is
hardly a candidate for stable, since it's only useful if "Bad page".
Link: https://lkml.kernel.org/r/beefffda-6326-e36d-2d41-ed15b51af872@google.com
Fixes: 06b241f32c ("mm: __delete_from_page_cache show Bad page if mapped")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/gup: some cleanups", v5.
This patch (of 5):
Alex reported invalid page pointer returned with pin_user_pages_remote()
from vfio after upstream commit 4b6c33b322 ("vfio/type1: Prepare for
batched pinning with struct vfio_batch").
It turns out that it's not the fault of the vfio commit; however, after
vfio switched to a full page buffer to store the page pointers, the
problem became easier to expose.
The problem is that for VM_PFNMAP vmas we should normally fail with
-EFAULT, after which vfio carries on to handle the MMIO regions. However
when the bug triggered, follow_page_mask() returned -EEXIST for such a
page, which jumps over the current page, leaving that entry in **pages
untouched. The caller is not aware of this, so it references the page as
usual even though the pointer data can be anything.
We have had that -EEXIST logic since commit 1027e4436b ("mm: make GUP
handle pfn mapping unless FOLL_GET is requested"), which seems very
reasonable. It could be that we overlooked that special path when we
reworked GUP with FOLL_PIN in commit 3faa52c03f ("mm/gup: track FOLL_PIN
pages"), even though that commit rightfully touched up
follow_devmap_pud() to check FOLL_PIN when it needs to return -EEXIST.
Attaching the Fixes to the FOLL_PIN rework commit, as it happened later
than 1027e4436b.
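The failure mode described above can be reproduced with a small user
space model. This is a sketch under assumptions, not the GUP code: all
model_* names and MODEL_FOLL_* values are hypothetical, pages are
represented by pointers into an int array, and the "fixed" parameter
selects whether a pin request counts as asking for a page reference.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the FOLL_* bits; values are illustrative. */
#define MODEL_FOLL_GET 0x1u
#define MODEL_FOLL_PIN 0x2u

/* A PFN-mapped address has no page to return: fail or report "skip". */
static int model_follow_pfnmap(unsigned int flags, bool fixed)
{
	unsigned int wants_page = MODEL_FOLL_GET;

	if (fixed)
		wants_page |= MODEL_FOLL_PIN;
	if (flags & wants_page)
		return -EFAULT;
	return -EEXIST;
}

/* Returns how many entries were processed, or a negative errno if none. */
static long model_gup(unsigned int flags, bool fixed, const bool *is_pfnmap,
		      const int *backing, const int **pages, long nr)
{
	for (long i = 0; i < nr; i++) {
		if (!is_pfnmap[i]) {
			pages[i] = &backing[i]; /* normal page: fill entry */
			continue;
		}
		int ret = model_follow_pfnmap(flags, fixed);
		if (ret == -EEXIST)
			continue;               /* pages[i] stays untouched */
		return i ? i : ret;
	}
	return nr;
}
```

With fixed set to false, a pin request over a PFN-mapped entry reports
full success while that entry keeps whatever value it held before the
call. With fixed set to true, the walk stops at the PFN-mapped entry, so
the caller only trusts the entries that were actually filled in.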
[jhubbard@nvidia.com: added some tags, removed a reference to an out of tree module.]
Link: https://lkml.kernel.org/r/20220207062213.235127-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20220204020010.68930-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20220204020010.68930-2-jhubbard@nvidia.com
Fixes: 3faa52c03f ("mm/gup: track FOLL_PIN pages")
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Debugged-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ever since these macros were introduced in commit b56c0d8937
("kthread: implement kthread_worker"), there has been precisely one user
(commit 4d11542070, "NVMe: Async IO queue deletion"), and that user
went away in 2016 with db3cbfff5b ("NVMe: IO queue deletion
re-write").
Apart from being unused, these macros are also awkward to use (which may
contribute to them not being used): Having a way to statically (or
on-stack) allocate the storage for the struct kthread_worker itself
doesn't help much, since obviously one needs to have some code for
actually _spawning_ the worker thread, which must have error checking.
And these days we have the kthread_create_worker() interface which both
allocates the struct kthread_worker and spawns the kthread.
Link: https://lkml.kernel.org/r/20220314145343.494694-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Finish transition to L1 state in rcar_pcie_config_access() because R-Car
can't do it on its own (Marek Vasut)
- Return PCI_ERROR_RESPONSE for reads that trigger PCIe errors (Marek
Vasut)
* remotes/lorenzo/pci/rcar:
PCI: rcar: Use PCI_SET_ERROR_RESPONSE after read which triggered an exception
PCI: rcar: Finish transition to L1 state in rcar_pcie_config_access()
- Save pointer to device match data instead of copying it (Dmitry
Baryshkov)
- Add ddrss_sf_tbu flag to device match data instead of checking OF
compatible string (Dmitry Baryshkov)
- Add SM8450 SoC PCIe DT bindings (Dmitry Baryshkov)
- Add SM8450 PCIe support (Dmitry Baryshkov)
* remotes/lorenzo/pci/qcom:
PCI: qcom: Add SM8450 PCIe support
PCI: qcom: Add ddrss_sf_tbu flag
PCI: qcom: Remove redundancy between qcom_pcie and qcom_pcie_cfg
dt-bindings: pci: qcom: Document PCIe bindings for SM8450
- Add Pali Rohár as pci-mvebu.c maintainer (Pali Rohár)
- Make struct pci_bridge_emul_ops const (Pali Rohár)
- Rename PCI_BRIDGE_EMUL_NO_PREFETCHABLE_BAR to
PCI_BRIDGE_EMUL_NO_PREFMEM_FORWARD since it doesn't apply to BARs (Pali
Rohár)
- Add new flag PCI_BRIDGE_EMUL_NO_IO_FORWARD for bridges that don't support
IO forwarding (Pali Rohár)
- Add Kconfig help text for CONFIG_PCI_MVEBU (Pali Rohár)
- Remove duplicate nports assignment (Pali Rohár)
- Set PCI_BRIDGE_EMUL_NO_IO_FORWARD when IO is unsupported (Pali Rohár)
- Initialize vendor, device and revision of emulated bridge (Pali Rohár)
- Fix Data Link Layer Link Active reporting on emulated bridge (Pali Rohár)
- Rearrange tests in bridge emulation for easier maintenance (Russell King)
- Add emulated bridge support for PCIe extended capabilities (Russell King)
- Add emulated bridge support for bridge Subsystem Vendor ID capability
(Pali Rohár)
- Configure Maximum Link Width based on DT "num-lanes" property (Pali
Rohár)
- Emulate bridge Subsystem Vendor ID capability (Pali Rohár)
- Emulate AER Capability (Pali Rohár)
- Use PCI core bridge->ops and bridge->child_ops to separate config
accesses to Root Port vs downstream devices (Pali Rohár)
- Unmask all INTx interrupts; they're reported via a single shared GIC
source (Pali Rohár)
- Add INTx support (Pali Rohár)
* remotes/lorenzo/pci/mvebu:
PCI: mvebu: Implement support for legacy INTx interrupts
PCI: mvebu: Fix macro names and comments about legacy interrupts
dt-bindings: PCI: mvebu: Update information about intx interrupts
PCI: mvebu: Use child_ops API
PCI: mvebu: Add support for Advanced Error Reporting registers on emulated bridge
PCI: mvebu: Add support for PCI Bridge Subsystem Vendor ID on emulated bridge
PCI: mvebu: Correctly configure x1/x4 mode
dt-bindings: PCI: mvebu: Add num-lanes property
PCI: pci-bridge-emul: Add support for PCI Bridge Subsystem Vendor ID capability
PCI: pci-bridge-emul: Add support for PCIe extended capabilities
PCI: pci-bridge-emul: Re-arrange register tests
PCI: mvebu: Fix reporting Data Link Layer Link Active on emulated bridge
PCI: mvebu: Update comment for PCI_EXP_LNKCTL register on emulated bridge
PCI: mvebu: Update comment for PCI_EXP_LNKCAP register on emulated bridge
PCI: mvebu: Properly initialize vendor, device and revision of emulated bridge
PCI: mvebu: Set PCI_BRIDGE_EMUL_NO_IO_FORWARD when IO is unsupported
PCI: mvebu: Remove duplicate nports assignment
PCI: mvebu: Add help string for CONFIG_PCI_MVEBU option
PCI: pci-bridge-emul: Add support for new flag PCI_BRIDGE_EMUL_NO_IO_FORWARD
PCI: pci-bridge-emul: Rename PCI_BRIDGE_EMUL_NO_PREFETCHABLE_BAR to PCI_BRIDGE_EMUL_NO_PREFMEM_FORWARD
PCI: pci-bridge-emul: Make struct pci_bridge_emul_ops as const
MAINTAINERS: Add Pali Rohár as pci-mvebu.c maintainer
- Allow host controller driver to probe successfully (as other drivers do)
even if link is currently down (Fabio Estevam)
- Enable i.MX6QP PCIe power management (Richard Zhu)
- Invoke PHY exit function after PHY power off (Richard Zhu)
- Assert i.MX8MM CLKREQ# even if no device present to avoid boot hangs
(Richard Zhu)
* remotes/lorenzo/pci/imx6:
PCI: imx6: Assert i.MX8MM CLKREQ# even if no device present
PCI: imx6: Invoke the PHY exit function after PHY power off
PCI: imx6: Enable i.MX6QP PCIe power management support
PCI: imx6: Allow to probe when dw_pcie_wait_for_link() fails
- Avoid retarget interrupt hypercall in irq_unmask() on ARM64 (Boqun Feng)
* remotes/lorenzo/pci/hv:
PCI: hv: Avoid the retarget interrupt hypercall in irq_unmask() on ARM64
- Drop redundant '-gpios' from DT GPIO lookup (Ben Dooks)
- Force 2.5GT/s for initial device probe to work around an enumeration
issue on SiFive Unmatched board (Ben Dooks)
* pci/host/fu740:
PCI: fu740: Force 2.5GT/s for initial device probe
PCI: fu740: Drop redundant '-gpios' from DT GPIO lookup
- Use PCI_INTERRUPT_* definitions from PCI core instead of custom ones
(Pali Rohár)
- Derive MSI number from bit(s) set in PCIE_MSI_STATUS_REG, not from
PCIE_MSI_PAYLOAD_REG (Pali Rohár)
- Align multi-MSI vectors to power of two (Pali Rohár)
- Rewrite IRQ code to use chained IRQ handler (Pali Rohár)
- Check return value of generic_handle_domain_irq() and warn about spurious
interrupts (Pali Rohár)
- Make MSI irq_chip structures static to driver (Marek Behún)
- Make msi_domain_info structure static to driver (Marek Behún)
- Use dev_fwnode() instead of of_node_to_fwnode(dev->of_node) (Marek Behún)
- Refactor unmasking of summary MSI interrupt (Pali Rohár)
- Add support for masking MSI interrupts and leave them masked at setup
(Pali Rohár)
- Set MSI doorbell address to address of struct advk_pcie (Pali Rohár)
- Enable MSI-X support (Pali Rohár)
- Add support for ERR interrupt on emulated bridge (Pali Rohár)
- Fix read of PCI_EXP_RTSTA_PME bit on emulated bridge (Pali Rohár)
- Optimize writing PCI_EXP_RTCTL_PMEIE and PCI_EXP_RTSTA_PME on emulated
bridge (Pali Rohár)
- Add support for PME interrupts (Pali Rohár)
- Fix support for PME requester on emulated bridge (Pali Rohár)
- Use separate INTA interrupt for emulated Root Port so PME and AER
interrupts are not shared with downstream devices (Pali Rohár)
- Remove irq_mask_ack() callback for INTx interrupts (Pali Rohár)
- Don't mask legacy INTx interrupts when mapping (Pali Rohár)
- Drop unnecessary "__maybe_unused" from advk_pcie_disable_phy() (Marek
Behún)
- Update comment about why we check for link being up before issuing a
config request (Marek Behún)
* remotes/lorenzo/pci/aardvark:
PCI: aardvark: Update comment about link going down after link-up
PCI: aardvark: Drop __maybe_unused from advk_pcie_disable_phy()
PCI: aardvark: Don't mask irq when mapping
PCI: aardvark: Remove irq_mask_ack() callback for INTx interrupts
PCI: aardvark: Use separate INTA interrupt for emulated root bridge
PCI: aardvark: Fix support for PME requester on emulated bridge
PCI: aardvark: Add support for PME interrupts
PCI: aardvark: Optimize writing PCI_EXP_RTCTL_PMEIE and PCI_EXP_RTSTA_PME on emulated bridge
PCI: aardvark: Fix reading PCI_EXP_RTSTA_PME bit on emulated bridge
PCI: aardvark: Add support for ERR interrupt on emulated bridge
PCI: aardvark: Enable MSI-X support
PCI: aardvark: Fix setting MSI address
PCI: aardvark: Add support for masking MSI interrupts
PCI: aardvark: Refactor unmasking summary MSI interrupt
PCI: aardvark: Use dev_fwnode() instead of of_node_to_fwnode(dev->of_node)
PCI: aardvark: Make msi_domain_info structure a static driver structure
PCI: aardvark: Make MSI irq_chip structures static driver structures
PCI: aardvark: Check return value of generic_handle_domain_irq() when processing INTx IRQ
PCI: aardvark: Rewrite IRQ code to chained IRQ handler
PCI: aardvark: Fix support for MSI interrupts
PCI: aardvark: Fix reading MSI interrupt number
PCI: aardvark: Replace custom PCIE_CORE_INT_* macros with PCI_INTERRUPT_*
- Move vgaarb.c from drivers/gpu/vga to drivers/pci (Bjorn Helgaas)
- Factor out default VGA device selection (Huacai Chen)
- Move firmware default device detection to ADD_DEVICE path so we can
select a default device regardless of whether it is enumerated before or
after vga_arb_device_init() (Huacai Chen)
- Move non-legacy VGA detection to ADD_DEVICE path (Huacai Chen)
- Move disabled VGA device detection to ADD_DEVICE path (Huacai Chen)
* pci/vga:
PCI/VGA: Replace full MIT license text with SPDX identifier
PCI/VGA: Use unsigned format string to print lock counts
PCI/VGA: Log bridge control messages when adding devices
PCI/VGA: Remove empty vga_arb_device_card_gone()
PCI/VGA: Move disabled VGA device detection to ADD_DEVICE path
PCI/VGA: Move non-legacy VGA detection to ADD_DEVICE path
PCI/VGA: Move firmware default device detection to ADD_DEVICE path
PCI/VGA: Factor out default VGA device selection
PCI/VGA: Factor out vga_select_framebuffer_device()
PCI/VGA: Move vga_arb_integrated_gpu() earlier in file
PCI/VGA: Move vgaarb to drivers/pci