The high limit is used to throttle the workload without invoking the
oom-killer. Recently we tried to use the high limit to right size our
internal workloads. More specifically dynamically adjusting the limits
of the workload without letting the workload get oom-killed. However
due to the limitation of the implementation of high limit enforcement,
we observed the mechanism fails for some real workloads.
The high limit is enforced on return-to-userspace i.e. the kernel let
the usage goes over the limit and when the execution returns to
userspace, the high reclaim is triggered and the process can get
throttled as well. However this mechanism fails for workloads which do
large allocations in a single kernel entry e.g. applications that
mlock() a large chunk of memory in a single syscall. Such applications
bypass the high limit and can trigger the oom-killer.
To make high limit enforcement more robust, this patch makes the limit
enforcement synchronous only if the accumulated overcharge becomes
larger than MEMCG_CHARGE_BATCH. So, most of the allocations would still
be throttled on the return-to-userspace path but only the extreme
allocations which accumulates large amount of overcharge without
returning to the userspace will be throttled synchronously. The value
MEMCG_CHARGE_BATCH is a bit arbitrary but most of other places in the
memcg codebase uses this constant therefore for now uses the same one.
Link: https://lkml.kernel.org/r/20220211064917.2028469-5-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the kernel force charges the allocations which have __GFP_HIGH
flag without triggering the memory reclaim. __GFP_HIGH indicates that
the caller is high priority and since commit 869712fd3d ("mm:
memcontrol: fix network errors from failing __GFP_ATOMIC charges") the
kernel lets such allocations do force charging. Please note that
__GFP_ATOMIC has been replaced by __GFP_HIGH.
__GFP_HIGH does not tell if the caller can block or can trigger reclaim.
There are separate checks to determine that. So, there is no need to
skip reclaiming for __GFP_HIGH allocations. So, handle __GFP_HIGH
together with __GFP_NOFAIL which also does force charging.
Please note that this is a noop change as there are no __GFP_HIGH
allocators in the kernel which also have __GFP_ACCOUNT (or SLAB_ACCOUNT)
and does not allow reclaim for now.
Link: https://lkml.kernel.org/r/20220211064917.2028469-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "memcg: robust enforcement of memory.high", v2.
Due to the semantics of memory.high enforcement i.e. throttle the
workload without oom-kill, we are trying to use it for right sizing the
workloads in our production environment. However we observed the
mechanism fails for some specific applications which does big chunck of
allocations in a single syscall. The reason behind this failure is due
to the limitation of the memory.high enforcement's current
implementation.
This patch series solves this issue by enforcing the memory.high
synchronously if the current process has accumulated a large amount of
high overcharge.
This patch (of 4):
The function mem_cgroup_oom returns enum which has four possible values
but the caller does not care about such values and only cares if the
return value is OOM_SUCCESS or not. So, remove the enum altogether and
make mem_cgroup_oom returns a simple bool.
Link: https://lkml.kernel.org/r/20220211064917.2028469-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20220211064917.2028469-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently memcg stats show several types of kernel memory: kernel stack,
page tables, sock, vmalloc, and slab. However, there are other
allocations with __GFP_ACCOUNT (or supersets such as GFP_KERNEL_ACCOUNT)
that are not accounted in any of those stats, a few examples are:
- various kvm allocations (e.g. allocated pages to create vcpus)
- io_uring
- tmp_page in pipes during pipe_write()
- bpf ringbuffers
- unix sockets
Keeping track of the total kernel memory is essential for the ease of
migration from cgroup v1 to v2 as there are large discrepancies between
v1's kmem.usage_in_bytes and the sum of the available kernel memory
stats in v2. Adding separate memcg stats for all __GFP_ACCOUNT kernel
allocations is an impractical maintenance burden as there a lot of those
all over the kernel code, with more use cases likely to show up in the
future.
Therefore, add a "kernel" memcg stat that is analogous to kmem page
counter, with added benefits such as using rstat infrastructure which
aggregates stats more efficiently. Additionally, this provides a
lighter alternative in case the legacy kmem is deprecated in the future
[yosryahmed@google.com: v2]
Link: https://lkml.kernel.org/r/20220203193856.972500-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20220201200823.3283171-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mikulas asked in "Do we still need commit a0ee5ec520 ('tmpfs: allocate
on read when stacked')?" in [1]
Lukas noticed this unusual behavior of loop device backed by tmpfs in [2].
Normally, shmem_file_read_iter() copies the ZERO_PAGE when reading
holes; but if it looks like it might be a read for "a stacking
filesystem", it allocates actual pages to the page cache, and even marks
them as dirty. And reads from the loop device do satisfy the test that
is used.
This oddity was added for an old version of unionfs, to help to limit
its usage to the limited size of the tmpfs mount involved; but about the
same time as the tmpfs mod went in (2.6.25), unionfs was reworked to
proceed differently; and the mod kept just in case others needed it.
Do we still need it? I cannot answer with more certainty than "Probably
not". It's nasty enough that we really should try to delete it; but if
a regression is reported somewhere, then we might have to revert later.
It's not quite as simple as just removing the test (as Mikulas did):
xfstests generic/013 hung because splice from tmpfs failed on page not
up-to-date and page mapping unset. That can be fixed just by marking
the ZERO_PAGE as Uptodate, which of course it is: do so in
pagecache_init() - it might be useful to others than tmpfs.
My intention, though, was to stop using the ZERO_PAGE here altogether:
surely iov_iter_zero() is better for this case? Sadly not: it relies on
clear_user(), and the x86 clear_user() is slower than its copy_user() [3].
But while we are still using the ZERO_PAGE, let's stop dirtying its
struct page cacheline with unnecessary get_page() and put_page().
Link: https://lore.kernel.org/linux-mm/alpine.LRH.2.02.2007210510230.6959@file01.intranet.prod.int.rdu2.redhat.com/ [1]
Link: https://lore.kernel.org/linux-mm/20211126075100.gd64odg2bcptiqeb@work/ [2]
Link: https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com/ [3]
Link: https://lkml.kernel.org/r/90bc5e69-9984-b5fa-a685-be55f2b64b@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Lukas Czerner <lczerner@redhat.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Zdenek Kabelac <zkabelac@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I added page_mapped() resilience in __delete_from_page_cache() for
the mapping_exiting() case, I missed that mapping_set_exiting() is done
in truncate_inode_pages_final(), which is not actually called for shmem.
(Today, it is folio_mapped() resilience in filemap_unaccount_folio().)
So the fixup to avoid a memory leak in this case never worked on shmem:
add a mapping_set_exiting() in shmem_evict_inode() at last. But this is
hardly a candidate for stable, since it's only useful if "Bad page".
Link: https://lkml.kernel.org/r/beefffda-6326-e36d-2d41-ed15b51af872@google.com
Fixes: 06b241f32c ("mm: __delete_from_page_cache show Bad page if mapped")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/gup: some cleanups", v5.
This patch (of 5):
Alex reported invalid page pointer returned with pin_user_pages_remote()
from vfio after upstream commit 4b6c33b322 ("vfio/type1: Prepare for
batched pinning with struct vfio_batch").
It turns out that it's not the fault of the vfio commit; however after
vfio switches to a full page buffer to store the page pointers it starts
to expose the problem easier.
The problem is for VM_PFNMAP vmas we should normally fail with an
-EFAULT then vfio will carry on to handle the MMIO regions. However
when the bug triggered, follow_page_mask() returned -EEXIST for such a
page, which will jump over the current page, leaving that entry in
**pages untouched. However the caller is not aware of it, hence the
caller will reference the page as usual even if the pointer data can be
anything.
We had that -EEXIST logic since commit 1027e4436b ("mm: make GUP
handle pfn mapping unless FOLL_GET is requested") which seems very
reasonable. It could be that when we reworked GUP with FOLL_PIN we
could have overlooked that special path in commit 3faa52c03f ("mm/gup:
track FOLL_PIN pages"), even if that commit rightfully touched up
follow_devmap_pud() on checking FOLL_PIN when it needs to return an
-EEXIST.
Attaching the Fixes to the FOLL_PIN rework commit, as it happened later
than 1027e4436b.
[jhubbard@nvidia.com: added some tags, removed a reference to an out of tree module.]
Link: https://lkml.kernel.org/r/20220207062213.235127-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20220204020010.68930-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20220204020010.68930-2-jhubbard@nvidia.com
Fixes: 3faa52c03f ("mm/gup: track FOLL_PIN pages")
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Debugged-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ever since these macros were introduced in commit b56c0d8937
("kthread: implement kthread_worker"), there has been precisely one user
(commit 4d11542070, "NVMe: Async IO queue deletion"), and that user
went away in 2016 with db3cbfff5b ("NVMe: IO queue deletion
re-write").
Apart from being unused, these macros are also awkward to use (which may
contribute to them not being used): Having a way to statically (or
on-stack) allocating the storage for the struct kthread_worker itself
doesn't help much, since obviously one needs to have some code for
actually _spawning_ the worker thread, which must have error checking.
And these days we have the kthread_create_worker() interface which both
allocates the struct kthread_worker and spawns the kthread.
Link: https://lkml.kernel.org/r/20220314145343.494694-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Finish transition to L1 state in rcar_pcie_config_access() because R-Car
can't do it on its own (Marek Vasut)
- Return PCI_ERROR_RESPONSE for reads that trigger PCIe errors (Marek
Vasut)
* remotes/lorenzo/pci/rcar:
PCI: rcar: Use PCI_SET_ERROR_RESPONSE after read which triggered an exception
PCI: rcar: Finish transition to L1 state in rcar_pcie_config_access()
- Save pointer to device match data instead of copying it (Dmitry
Baryshkov)
- Add ddrss_sf_tbu flag to device match data instead of checking OF
compatible string (Dmitry Baryshkov)
- Add SM8450 SoC PCIe DT bindings (Dmitry Baryshkov)
- Add SM8450 PCIe support (Dmitry Baryshkov)
* remotes/lorenzo/pci/qcom:
PCI: qcom: Add SM8450 PCIe support
PCI: qcom: Add ddrss_sf_tbu flag
PCI: qcom: Remove redundancy between qcom_pcie and qcom_pcie_cfg
dt-bindings: pci: qcom: Document PCIe bindings for SM8450
- Add Pali Rohár as pci-mvebu.c maintainer (Pali Rohár)
- Make struct pci_bridge_emul_ops const (Pali Rohár)
- Rename PCI_BRIDGE_EMUL_NO_PREFETCHABLE_BAR to
PCI_BRIDGE_EMUL_NO_PREFMEM_FORWARD since it doesn't apply to BARs (Pali
Rohár)
- Add new flag PCI_BRIDGE_EMUL_NO_IO_FORWARD for bridges that don't support
IO forwarding (Pali Rohár)
- Add Kconfig help text for CONFIG_PCI_MVEBU (Pali Rohár)
- Remove duplicate nports assignment (Pali Rohár)
- Set PCI_BRIDGE_EMUL_NO_IO_FORWARD when IO is unsupported (Pali Rohár)
- Initialize vendor, device and revision of emulated bridge (Pali Rohár)
- Fix Data Link Layer Link Active reporting on emulated bridge (Pali Rohár)
- Rearrange tests in bridge emulation for easier maintenance (Russell King)
- Add emulated bridge support for PCIe extended capabilities (Russell King)
- Add emulated bridge support for bridge Subsystem Vendor ID capability
(Pali Rohár)
- Configure Maximum Link Width based on DT "num-lanes" property (Pali
Rohár)
- Emulate bridge Subsystem Vendor ID capability (Pali Rohár)
- Emulate AER Capability (Pali Rohár)
- Use PCI core bridge->ops and bridge->child_ops to separate config
accesses to Root Port vs downstream devices (Pali Rohár)
- Unmask all INTx interrupts; they're reported via a single shared GIC
source (Pali Rohár)
- Add INTx support (Pali Rohár)
* remotes/lorenzo/pci/mvebu:
PCI: mvebu: Implement support for legacy INTx interrupts
PCI: mvebu: Fix macro names and comments about legacy interrupts
dt-bindings: PCI: mvebu: Update information about intx interrupts
PCI: mvebu: Use child_ops API
PCI: mvebu: Add support for Advanced Error Reporting registers on emulated bridge
PCI: mvebu: Add support for PCI Bridge Subsystem Vendor ID on emulated bridge
PCI: mvebu: Correctly configure x1/x4 mode
dt-bindings: PCI: mvebu: Add num-lanes property
PCI: pci-bridge-emul: Add support for PCI Bridge Subsystem Vendor ID capability
PCI: pci-bridge-emul: Add support for PCIe extended capabilities
PCI: pci-bridge-emul: Re-arrange register tests
PCI: mvebu: Fix reporting Data Link Layer Link Active on emulated bridge
PCI: mvebu: Update comment for PCI_EXP_LNKCTL register on emulated bridge
PCI: mvebu: Update comment for PCI_EXP_LNKCAP register on emulated bridge
PCI: mvebu: Properly initialize vendor, device and revision of emulated bridge
PCI: mvebu: Set PCI_BRIDGE_EMUL_NO_IO_FORWARD when IO is unsupported
PCI: mvebu: Remove duplicate nports assignment
PCI: mvebu: Add help string for CONFIG_PCI_MVEBU option
PCI: pci-bridge-emul: Add support for new flag PCI_BRIDGE_EMUL_NO_IO_FORWARD
PCI: pci-bridge-emul: Rename PCI_BRIDGE_EMUL_NO_PREFETCHABLE_BAR to PCI_BRIDGE_EMUL_NO_PREFMEM_FORWARD
PCI: pci-bridge-emul: Make struct pci_bridge_emul_ops as const
MAINTAINERS: Add Pali Rohár as pci-mvebu.c maintainer
- Allow host controller driver to probe successfully (as other drivers do)
even if link is currently down (Fabio Estevam)
- Enable i.MX6QP PCIe power management (Richard Zhu)
- Invoke PHY exit function after PHY power off (Richard Zhu)
- Assert i.MX8MM CLKREQ# even if no device present to avoid boot hangs
(Richard Zhu)
* remotes/lorenzo/pci/imx6:
PCI: imx6: Assert i.MX8MM CLKREQ# even if no device present
PCI: imx6: Invoke the PHY exit function after PHY power off
PCI: imx6: Enable i.MX6QP PCIe power management support
PCI: imx6: Allow to probe when dw_pcie_wait_for_link() fails
- Avoid retarget interrupt hypercall in irq_unmask() on ARM64 (Boqun Feng)
* remotes/lorenzo/pci/hv:
PCI: hv: Avoid the retarget interrupt hypercall in irq_unmask() on ARM64
- Drop redundant '-gpios' from DT GPIO lookup (Ben Dooks)
- Force 2.5GT/s for initial device probe to workaround enumeration issue on
SiFive Unmatched board (Ben Dooks)
* pci/host/fu740:
PCI: fu740: Force 2.5GT/s for initial device probe
PCI: fu740: Drop redundant '-gpios' from DT GPIO lookup