commit e677edbcab upstream.
io_flush_timeouts() assumes the timeout isn't in progress of triggering
or being removed/canceled, so it unconditionally removes it from the
timeout list and attempts to cancel it.
Leave it on the list and let the normal timeout cancelation take care
of it.
Bug: 231494876
Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: Ie7dba41da32732391f8a85526fe20168bd431be8
When handling host stage-2 faults the hypervisor currently updates the
CPU _and_ IOMMUs page-tables. However, since we currently proactively
map accessible PA ranges into IOMMUs, updating them during stage-2
faults is unnecessary -- it only needs to be done during ownership
transitions. Optimize this by skipping the IOMMU updates from the host
memory abort path, which also reduces contention on the host stage-2
lock during boot and saves up to 1.1 sec of boot time on Pixel 6.
Bug: 232879742
Change-Id: I71f439311fe9573005efcc9529a2be53f21993a4
Signed-off-by: Quentin Perret <qperret@google.com>
The boot.img will be used for GKI testing.
Also removing BUILD_GKI_CERTIFICATION_TOOLS=1, because
we only need to certify GKI boot-*.img for aarch64.
Bug: 232906147
Test: BUILD_CONFIG=common/build.config.gki.x86_64 build/build.sh
Signed-off-by: Bowgo Tsai <bowgotsai@google.com>
Change-Id: Ia6790dc9faddce7c616411d7ec5c1f60a12aea44
(cherry picked from commit a80c9ffa86)
Add vendor hook for compaction begin/end. The first use would be
to measure compaction durations.
Bug: 229927848
Test: local kernel build test
Signed-off-by: Robin Hsu <robinhsu@google.com>
Change-Id: I3d95434bf49b37199056dc9ddfc36a59a7de17b7
Without initialization, it will be random data and hard for
vendor hook to decide.
Bug: 207739506
Change-Id: I278772d87eea38c03a40d4f0bef20ac8644e2ecd
Signed-off-by: Maria Yu <quic_aiquny@quicinc.com>
(cherry picked from commit 898e7ec950)
One may want to have DF set on large packets to support discovering
path mtu and limiting the size of generated packets (hence not
setting the XFRM_STATE_NOPMTUDISC tunnel flag), while still
supporting networks that are incapable of carrying even minimal
sized IPv6 frames (post encapsulation).
Having IPv4 Don't Frag bit set on encapsulated IPv6 frames that
are not larger than the minimum IPv6 mtu of 1280 isn't useful,
because the resulting ICMP Fragmentation Required error isn't
actionable (even assuming you receive it) because IPv6 will not
drop its path mtu below 1280 anyway. While the IPv4 stack
could prefrag the packets post encap, this requires the ICMP
error to be successfully delivered and causes a loss of the
original IPv6 frame (thus requiring a retransmit and latency
hit). Luckily with IPv4 if we simply don't set the DF flag,
we'll just make further fragmenting the packets some other
router's problem.
We'll still learn the correct IPv4 path mtu through encapsulation
of larger IPv6 frames.
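The rule described above can be sketched as a small predicate. This is a simplified user-space illustration, not the kernel change itself: the function name and its parameters are hypothetical, and it models only the IPv6-in-IPv4 tunnel case.

```c
#include <stdbool.h>
#include <stddef.h>

#define IPV6_MIN_MTU 1280 /* minimum IPv6 MTU, as stated in the text */

/*
 * Hypothetical helper: decide whether the outer IPv4 header should carry
 * DF when the inner packet is IPv6.
 *   nopmtudisc - models the XFRM_STATE_NOPMTUDISC tunnel flag
 *   inner_len  - length of the inner IPv6 packet in bytes
 */
static bool outer_df_for_inner_ipv6(bool nopmtudisc, size_t inner_len)
{
    if (nopmtudisc)
        return false; /* path MTU discovery disabled: never set DF */

    /* Frames at or below the IPv6 minimum MTU go out without DF. */
    return inner_len > IPV6_MIN_MTU;
}
```

Larger frames keep DF set, so the IPv4 path MTU is still learned from them.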
I'm still not convinced this patch is entirely sufficient to make
everything happy... but I don't see how it could possibly
make things worse.
See also recent:
4ff2980b6b 'xfrm: fix tunnel model fragmentation behavior'
and friends
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Lina Wang <lina.wang@mediatek.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Maciej Zenczykowski <maze@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
(cherry picked from commit 6821ad8770 https://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec.git master)
Bug: 203183943
Test: TreeHugger
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Change-Id: Ie7701ebc63b1e2a974114538befd278154eb3bc6
If the ring is setup with IORING_SETUP_IOPOLL and we have more than
one task doing submissions on a ring, we can end up in a situation where
we assign the context from the current task rather than the request
originator.
Always use req->task rather than assume it's the same as current.
No upstream patch exists for this issue, as only older kernels with
the non-native workers have this problem.
Bug: 233078742
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Akilesh Kailash <akailash@google.com>
(cherry picked from commit 29f077d070
from linux-5.10.y stable branch)
Change-Id: I4cc543950a95e1df201fa9867c5e9c272fd54b6f
Usually, as a result of the initial fuse lookup with bpf enabled, we have the following dentry:
-----------------------------------------------------------------
| dentry /storage/emulated/0/Android/data |
| inode |
| backing_inode: /pass_through/emulated/0/Android/data |
-----------------------------------------------------------------
Every communication with this folder will have to go
through fuse_dentry_revalidate(dentry, flags) which can move forward by:
1. If the timeout is not reached, just ignore it
2. If entry has backing_inode and bpf is not against it, execute revalidate on backing FS (inside kernel)
3. Move to userspace to revalidate
However, we currently check whether the parent inode (not the one we want to revalidate) has a
backing inode that we can use to execute operations on. Basically, the whole flow looks like this:
1. Receiving revalidate event for fuse_dentry_revalidate(/storage/emulated/0/Android/data, flags)
2. Checking .../0/Android/ inode to have backing inode <------------------------ Primary problem is HERE
3. Moving to the userspace with pf_lookup(/storage/emulated/0/Android, data)
4. Even though successfully handled lookup on the fuse daemon side, kernel cannot interpret
the result due to fuse_simple_request and fuse_lookup_init logic changes <------- Secondary problem is HERE
5. Because of the problems I mentioned before, full lookup is triggered on the kernel side so we receive the second pf_lookup to the userspace
Fix the primary problem by executing the backing revalidate on the current inode (not the parent one).
Bug: 234346312
Test: Manually made sure there are no userspace calls for interactions inside a directory with a backing one.
Test: Manually check youtube app is successfully saving exo cache into the external storage cache folder.
Test: atest --test-mapping packages/providers/MediaProvider
Signed-off-by: Dmitrii Merkurev <dimorinny@google.com>
Change-Id: Id57f1944302076d93ebef255533dfc53e8c30f20
This reverts commit 67bef07aab.
Reason for revert: switching to latest version merged into Linus's tree.
Bug: 231271475
Change-Id: I27745412e9ffbd4685d54c06e3aa975eb23347fa
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Will Deacon <willdeacon@google.com>
Pages in a CMA area can have MIGRATE_ISOLATE as well as MIGRATE_CMA,
so the current is_pinnable_page can miss CMA pages that have
MIGRATE_ISOLATE. It ends up pinning CMA pages long-term through the
pin_user_pages APIs, so CMA allocation keeps failing until the pin is
released.
CPU 0                                   CPU 1 - Task B
cma_alloc
alloc_contig_range
                                        pin_user_pages_fast(FOLL_LONGTERM)
change pageblock as MIGRATE_ISOLATE
                                        internal_get_user_pages_fast
                                        lockless_pages_from_mm
                                        gup_pte_range
                                        try_grab_folio
                                        is_pinnable_page
                                          return true;
                                        So, pinned the page successfully.
page migration failure with pinned page
                                        ..
                                        .. After 30 sec
                                        unpin_user_page(page)
CMA allocation succeeded after 30 sec.
The CMA allocation path protects against the migration type change race
using zone->lock, but all the GUP path needs to know is whether the
page is in a CMA area, not its exact migration type.
Thus, we don't need zone->lock; we just check whether the migration
type is either MIGRATE_ISOLATE or MIGRATE_CMA.
Adding the MIGRATE_ISOLATE check in is_pinnable_page could cause
pinning to be rejected for pages on MIGRATE_ISOLATE pageblocks even
though they are in neither CMA nor the movable zone, if the page is
temporarily unmovable. However, such a migration failure caused by an
unexpected temporary refcount hold is a general issue, not specific to
MIGRATE_ISOLATE, and MIGRATE_ISOLATE is also a transient state like
other temporarily elevated refcount problems.
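The check described above can be sketched in user-space C. This is only an illustration under stated assumptions: the enum, the struct page_info type, and the helper name are hypothetical stand-ins for the kernel's pageblock migratetype and zone lookups.

```c
#include <stdbool.h>

/* Hypothetical subset of the kernel's migratetypes, for illustration. */
enum migratetype {
    MIGRATE_UNMOVABLE,
    MIGRATE_MOVABLE,
    MIGRATE_CMA,
    MIGRATE_ISOLATE,
};

/* Hypothetical stand-in for what the kernel derives from a struct page. */
struct page_info {
    enum migratetype pageblock_mt; /* migratetype of the page's pageblock */
    bool in_movable_zone;          /* page lives in the movable zone */
};

/*
 * Lockless check: the exact migratetype does not matter, only whether it
 * is one of the two values a CMA pageblock can have at this point.
 */
static bool is_longterm_pinnable(const struct page_info *p)
{
    if (p->pageblock_mt == MIGRATE_CMA || p->pageblock_mt == MIGRATE_ISOLATE)
        return false;

    return !p->in_movable_zone;
}
```

A page on an isolated non-CMA pageblock is rejected as well, which is the transient false negative discussed above.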
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Conflicts:
include/linux/mm.h
1. There is no is_pinnable_page in 5.10
Link: https://lore.kernel.org/all/20220524171525.976723-1-minchan@kernel.org/
Bug: 231227007
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I5cdd2b8eefdd7e89658abd21c32aa84876ad7782
The CMA-first allocation policy for movable allocations (which upstream
doesn't have) keeps the CMA area always full. This is good for memory
efficiency since it uses up available CMA memory most of the time.
However, it can make cma_alloc slow since it causes a lot of page
migration all the time.
Add a vendor hook for those who want to restore the upstream CMA
allocation policy so they will see less page migration in cma_alloc.
If the vendor hook returns false, rmqueue_bulk returns 0 without
filling pcp->lists, so get_populated_pcp_list will return NULL.
Once get_populated_pcp_list returns NULL, __rmqueue_pcplist will retry
the page allocation with the original migratetype (currently, the
original migratetype cannot be MIGRATE_CMA), so the retry will find
available pages from the !MIGRATE_CMA free list.
Bug: 231978523
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ia031d9bc6f34085b892a8d9923bf5b9b1794f94a
The boot-img.tar.gz includes boot.img and boot-lz4.img,
with kernel image: Image and Image.lz4, respectively.
Bug: 222078981
Test: BUILD_CONFIG=common/build.config.gki.aarch64 build/build.sh
Signed-off-by: Bowgo Tsai <bowgotsai@google.com>
Change-Id: I7f929a73967ce87d0d653d0b9926198cfeedc973
(cherry picked from commit 3361d46a39)
When dwc3_gadget_ep_cleanup_completed_requests() calls
dwc3_gadget_giveback(), the dwc3 lock is released and another thread is
able to execute. In this situation, usb_ep_disable() gets the chance to
clear the endpoint descriptor pointer, which leads to a null pointer
dereference. So move the null pointer check to a proper place.
Example call stack:
Thread#1:
dwc3_thread_interrupt()
spin_lock
-> dwc3_process_event_buf()
-> dwc3_process_event_entry()
-> dwc3_endpoint_interrupt()
-> dwc3_gadget_endpoint_trbs_complete()
-> dwc3_gadget_ep_cleanup_completed_requests()
...
-> dwc3_giveback()
spin_unlock
Thread#2 executes
Thread#2:
configfs_composite_disconnect()
-> __composite_disconnect()
-> ffs_func_disable()
-> ffs_func_set_alt()
-> ffs_func_eps_disable()
-> usb_ep_disable()
wait for dwc3 spin_lock
Thread#1 released lock
clear endpoint.desc
Fixes: 2628844812 ("usb: dwc3: gadget: Fix null pointer exception")
Cc: stable <stable@kernel.org>
Signed-off-by: Albert Wang <albertccwang@google.com>
Link: https://lore.kernel.org/r/20220518061315.3359198-1-albertccwang@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3c5880745b)
Bug: 224405818
Signed-off-by: Albert Wang <albertccwang@google.com>
Change-Id: I716885b0966711a166d6142417cd6d18fe5c14a8
Device drivers may decide to not load firmware when probed to avoid
slowing down the boot process should the firmware filesystem not be
available yet. In this case, the firmware loading request may be done
when a device file associated with the driver is first accessed. The
credentials of the userspace process accessing the device file may be
used to validate access to the firmware files requested by the driver.
Ensure that the kernel assumes the responsibility of reading the
firmware.
This was observed on Android for a graphic driver loading their firmware
when the device file (e.g. /dev/mali0) was first opened by userspace
(i.e. surfaceflinger). The security context of surfaceflinger was used
to validate the access to the firmware file (e.g.
/vendor/firmware/mali.bin).
Previously, Android configurations were not setting up the
firmware_class.path command line argument and were relying on the
userspace fallback mechanism. In this case, the security context of the
userspace daemon (i.e. ueventd) was consistently used to read firmware
files. More Android devices are now found to set firmware_class.path
which gives the kernel the opportunity to read the firmware directly
(via kernel_read_file_from_path_initns). In this scenario, the current
process credentials were used, even if unrelated to the loading of the
firmware file.
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Cc: <stable@vger.kernel.org> # 5.10
Reviewed-by: Paul Moore <paul@paul-moore.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20220502004952.3970800-1-tweek@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 581dd69830)
[adelva: merged thru LTS, but LTS merges are paused on a13-5.10]
Bug: 232963476
Signed-off-by: Alistair Delva <adelva@google.com>
Change-Id: Ie24b5ec86451e36e7f982f403446161c326d5fe4
The dmabuf file uses get_next_ino()(through dma_buf_getfile() ->
alloc_anon_inode()) to get an inode number and uses the same as a
directory name under /sys/kernel/dmabuf/buffers/<ino>. This directory is
used to collect the dmabuf stats and it is created through
dma_buf_stats_setup(). Currently, failure to create this directory
entry can make dma_buf_export() fail.
Since get_next_ino() can return a repeated inode number, the directory
entry creation can fail with -EEXIST. This is a problem on systems
where dmabuf stats functionality is enabled on production builds:
dma_buf_export() can fail, even though the dmabuf memory was allocated
successfully, just because it couldn't create the stats entry.
We observed this issue on a Snapdragon system within 13 days, where a
directory with inode number "122602" already existed, so
dma_buf_stats_setup() failed with -EEXIST as it was trying to create
the same directory entry.
To make the dentry name unique, use a dmabuf-fs-specific inode number
based on a simple atomic variable increment. The tmpfs subsystem
likewise relies on its own inode generation rather than on
get_next_ino(), for the same reason of avoiding duplicate inodes[1].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=e809d5f0b5c912fe981dce738f3283b2010665f0
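A minimal user-space sketch of the "simple atomic variable increment" idea, using C11 atomics. The counter and function names are hypothetical, and the kernel uses its own atomic primitives rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-filesystem counter; static storage starts at zero. */
static atomic_uint_fast64_t dmabuf_ino_counter;

/*
 * Return the next inode number. A 64-bit counter that only ever increments
 * does not repeat in practice, so a directory named after the returned
 * value cannot collide with an earlier one.
 */
static uint64_t dmabuf_next_ino(void)
{
    /* fetch_add returns the previous value; add 1 so numbering starts at 1 */
    return (uint64_t)atomic_fetch_add(&dmabuf_ino_counter, 1) + 1;
}
```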
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: <stable@vger.kernel.org> # 5.15.x+
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1652441296-1986-1-git-send-email-quic_charante@quicinc.com
(cherry picked from commit 370704e707
git://anongit.freedesktop.org/drm/drm-misc)
Signed-off-by: Christian König <christian.koenig@amd.com>
Bug: 232887194
Change-Id: If244529c4c54086fe9eb5a4e76f6e8a07eaaa6ab
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
The ioremap hook can be called before slab is initialised, at which time
calling into kmalloc() is not allowed.
Signed-off-by: Keir Fraser <keirf@google.com>
Bug: 232894028
Fixes: f89d2055a3 ("ANDROID: arm64: Implement ioremap/iounmap hooks calling into KVM's MMIO guard")
Change-Id: Ieaf5adbdacdb196e37f4629998164a015e15c6d8
Disable CFI on trace hooks, as this improves some lmbench
microbenchmarks by as much as 12%.
Bug: 200542217
Change-Id: I6ad1d12047c4e69743ff94cf0ea8f70f5023c7da
Signed-off-by: Shaleen Agrawal <shalagra@codeaurora.org>
If a different vcpu from the same vm is loaded on the same
physical CPU, we must flush the CPU context.
This patch ensures that by tracking the vcpu that was last loaded
on this CPU, and flushes if that changes. This could lead to
over-invalidation, which could affect performance but not
correctness.
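The tracking scheme can be sketched as follows. This is a simplified user-space model, not the hypervisor code: the array, the struct vcpu layout, and the function name are hypothetical, and the model ignores which VM a vcpu belongs to.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4 /* assumption: small fixed CPU count for the sketch */

struct vcpu {
    int id;
};

/* Last vcpu loaded on each physical CPU; NULL means none yet. */
static const struct vcpu *last_loaded[SKETCH_NR_CPUS];

/*
 * Record that vcpu v is being loaded on the given CPU and report whether
 * the CPU context must be flushed first. Any change of vcpu triggers a
 * flush, including the very first load, which may over-invalidate but is
 * always safe.
 */
static bool vcpu_load_needs_flush(int cpu, const struct vcpu *v)
{
    bool flush = last_loaded[cpu] != v;

    last_loaded[cpu] = v;
    return flush;
}
```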
Bug: 228810735
Signed-off-by: Fuad Tabba <tabba@google.com>
Change-Id: I70976007165ca3b8d293089dbf9c2111b01ca2f7
This field is stale and not being used. Remove it.
Bug: 228810735
Signed-off-by: Fuad Tabba <tabba@google.com>
Change-Id: I5a734c22f246186b81ffd7bc73b46e0b60518306
This reverts commit b9b94e2aca.
Reason for revert: Suspected cause of hyp panic when running suite/user/pkvm_test
Change-Id: I117261a2298c0c59da2b22f8199317cab0635b03
Bug: 232390891
Signed-off-by: Will Deacon <willdeacon@google.com>
Update the generic symbol list.
Bug: 232424854
Change-Id: Ia164a1171bfe4a250e738b885d26f5037408adbb
Signed-off-by: David Kimmel <davidkimmel@google.com>
commit ebe48d368e upstream.
The maximum message size that can be sent is bigger than
the maximum size that skb_page_frag_refill can allocate.
So it is possible to write beyond the allocated buffer.
Fix this by doing a fallback to COW in that case.
v2:
Avoid get_order() costs as suggested by Linus Torvalds.
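The fallback condition can be illustrated with a plain size comparison, which also avoids computing an allocation order. This is a sketch under assumptions, not the actual patch: the 4096-byte limit, the 64-byte alignment, and all names are hypothetical values chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE      4096u /* assumed allocation limit */
#define SKETCH_L1_CACHE_BYTES 64u   /* assumed alignment granule */

/* Round n up to the next multiple of a (a must be non-zero). */
static size_t align_up(size_t n, size_t a)
{
    return (n + a - 1) / a * a;
}

/*
 * Return true when either aligned length exceeds what a single page
 * fragment allocation can provide, meaning the caller should take the
 * copy-on-write path instead of writing into the page fragment.
 */
static bool esp_must_fallback_to_cow(size_t tailen, size_t data_len)
{
    return align_up(tailen, SKETCH_L1_CACHE_BYTES) > SKETCH_PAGE_SIZE ||
           align_up(data_len, SKETCH_L1_CACHE_BYTES) > SKETCH_PAGE_SIZE;
}
```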
Bug: 227452856
Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Reported-by: valis <sec@valis.email>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I2c7f97914138271e7788adfcebbd0b2b8b43cdcb
Signed-off-by: Lee Jones <lee.jones@linaro.org>
This reverts commit f0416df755.
Reason for revert: This was a "temporary" reversion to workaround what is believed to be a user-space issue.
Change-Id: I5322aecfe57cd8237e6657525eb33975c4840059
Bug: 166779391
Signed-off-by: Todd Kjos <tkjos@google.com>
(cherry picked from commit d1c6df6dc8)
[cmllamas: Resolved merge conflict with vendor hook in binder.c]
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Some devices may return invalid or zeroed data during an UIC error
condition. In addition, reading these SFRs will clear them. This means the
subsequent error handling will not be able to see them and therefore no
error handling will be scheduled.
Skip reading these SFRs in ufshcd_dump_regs().
Link: https://lore.kernel.org/r/1648689845-33521-1-git-send-email-kwmad.kim@samsung.com
Fixes: d672475664 ("scsi: ufs: Use explicit access size in ufshcd_dump_regs")
Signed-off-by: Kiwoong Kim <kwmad.kim@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bug: 229358593
(cherry picked from commit ef60031022)
Change-Id: Idc62842c79f948580107f95c65a14e34630a0017
Signed-off-by: Bart Van Assche <bvanassche@google.com>
When dma_buf_stats_setup() fails, it closes the dmabuf file, which
results in a call to dma_buf_file_release(), where it does
list_del(&dmabuf->list_node) without first adding it to the proper
list. This results in a panic in the path below:
__list_del_entry_valid+0x38/0xac
dma_buf_file_release+0x74/0x158
__fput+0xf4/0x428
____fput+0x14/0x24
task_work_run+0x178/0x24c
do_notify_resume+0x194/0x264
work_pending+0xc/0x5f0
Fix it by moving the dma_buf_stats_setup() after dmabuf is added to the
list.
Fixes: bdb8d06dfe ("dmabuf: Add the capability to expose DMA-BUF stats in sysfs")
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
Tested-by: T.J. Mercier <tjmercier@google.com>
Acked-by: T.J. Mercier <tjmercier@google.com>
Cc: <stable@vger.kernel.org> # 5.15.x+
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/1652125797-2043-1-git-send-email-quic_charante@quicinc.com
(cherry picked from commit ef3a6b7050 git://anongit.freedesktop.org/drm/drm-misc)
Bug: 231929173
Change-Id: Iaefbae326175483444eaf5dbd3fdf8eb8fcca2aa
Originally, in the FOLL_LONGTERM case, __get_user_pages_remote
returned __gup_longterm_locked's return value directly,
but [1] broke that behavior, so restore the old behavior.
[1] d5d9a23576, ANDROID: mm: retry GUP with orignal gup_flags on failure
Bug: 231990030
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: If91b01c666cfbeb11d535d282c1ee7eec5700125
[ Upstream commit 4ff2980b6b ]
In tunnel mode, if the outer interface (IPv4) MTU is small, it is easy
for the inner IPv6 MTU to fall below 1280. If so, a Packet Too Big
ICMPv6 message is received. When sending again, packets are fragmented
at 1280 and are still rejected with ICMPv6 (Packet Too Big) by
xfrmi_xmit2().
According to RFC4213 Section 3.2.2:
    if (IPv4 path MTU - 20) is less than 1280
        if packet is larger than 1280 bytes
            Send ICMPv6 "packet too big" with MTU=1280
            Drop packet
        else
            Encapsulate but do not set the Don't Fragment
            flag in the IPv4 header. The resulting IPv4
            packet might be fragmented by the IPv4 layer
            on the encapsulator or by some router along
            the IPv4 path.
        endif
    else
        if packet is larger than (IPv4 path MTU - 20)
            Send ICMPv6 "packet too big" with
            MTU = (IPv4 path MTU - 20).
            Drop packet.
        else
            Encapsulate and set the Don't Fragment flag
            in the IPv4 header.
        endif
    endif
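The RFC pseudocode above translates directly into a small C function. This is a stand-alone sketch for illustration only: the enum, struct, and function names are hypothetical and are not the xfrm code changed by this patch.

```c
#include <assert.h>
#include <stddef.h>

#define IPV6_MIN_MTU 1280 /* minimum IPv6 MTU used by the pseudocode */
#define IPV4_HDR_LEN 20   /* IPv4 header without options */

enum encap_action {
    ENCAP_SEND_PTB, /* drop and send ICMPv6 "packet too big" */
    ENCAP_NO_DF,    /* encapsulate, leave Don't Fragment clear */
    ENCAP_WITH_DF,  /* encapsulate, set Don't Fragment */
};

struct encap_decision {
    enum encap_action action;
    size_t ptb_mtu; /* MTU to report; meaningful only for ENCAP_SEND_PTB */
};

static struct encap_decision rfc4213_decide(size_t ipv4_path_mtu, size_t pkt_len)
{
    struct encap_decision d = { ENCAP_WITH_DF, 0 };
    size_t tunnel_mtu;

    assert(ipv4_path_mtu >= IPV4_HDR_LEN);
    tunnel_mtu = ipv4_path_mtu - IPV4_HDR_LEN;

    if (tunnel_mtu < IPV6_MIN_MTU) {
        if (pkt_len > IPV6_MIN_MTU) {
            d.action = ENCAP_SEND_PTB;
            d.ptb_mtu = IPV6_MIN_MTU;
        } else {
            /* Small packet: let the IPv4 layer fragment if needed. */
            d.action = ENCAP_NO_DF;
        }
    } else if (pkt_len > tunnel_mtu) {
        d.action = ENCAP_SEND_PTB;
        d.ptb_mtu = tunnel_mtu;
    }
    /* Otherwise the packet fits: keep the default ENCAP_WITH_DF. */

    return d;
}
```

For example, with an IPv4 path MTU of 1200, a 1000-byte packet is encapsulated without DF, while a 1400-byte packet is dropped and reported with MTU 1280.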
Packets should be fragmented by the IPv4 outer interface, so change it.
After a packet is fragmented with IPv4, there will be double fragmentation.
No.48 & No.51 are IPv6 fragment packets; No.48 is double fragmented,
then tunneled with IPv4 (No.49 & No.50), which obey spec. And the receiving
peer cannot decrypt it correctly.
48 2002::10 2002::11 1296(length) IPv6 fragment (off=0 more=y ident=0xa20da5bc nxt=50)
49 0x0000 (0) 2002::10 2002::11 1304 IPv6 fragment (off=0 more=y ident=0x7448042c nxt=44)
50 0x0000 (0) 2002::10 2002::11 200 ESP (SPI=0x00035000)
51 2002::10 2002::11 180 Echo (ping) request
52 0x56dc 2002::10 2002::11 248 IPv6 fragment (off=1232 more=n ident=0xa20da5bc nxt=50)
xfrm6_noneed_fragment fixes the above issues. The result looks like this:
1 0x6206 192.168.1.138 192.168.1.1 1316 Fragmented IP protocol (proto=Encap Security Payload 50, off=0, ID=6206) [Reassembled in #2]
2 0x6206 2002::10 2002::11 88 IPv6 fragment (off=0 more=y ident=0x1f440778 nxt=50)
3 0x0000 2002::10 2002::11 248 ICMPv6 Echo (ping) request
Signed-off-by: Lina Wang <lina.wang@mediatek.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Bug: 226699354
Change-Id: Ideec82bea6a1efa26352680cb3113f7c36b945ef
Signed-off-by: Lina Wang <lina.wang@mediatek.com>
The patchset adds a new spin_lock field into per_cpu_pages, which
breaks KMI, so this patch introduces per_cpu_pages_ext and
per_cpu_pageset_ext and changes the relevant functions and code
to use the _ext data structures instead of the original ones.
Bug: 230899966
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ic5156f784223695c9716409036e2973df69ef99b
The patchset includes two additional fields along with lru in struct page,
but they are all in a union, so it shouldn't change the semantics.
However, the ABI is broken, so this patch reverts that change, since the
revert makes no difference to runtime behavior. It only loses some code
readability.
Bug: 230899966
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I4eb1a55a9ca52794e136870bfddbd04175f1134b
Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism
to remotely drain the per-cpu lists. It is made possible by remotely
locking 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this
new scheme is that drain operations are now migration safe.
There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel to triggering the
__drain_all_pages(NULL, true) code path around 100 times per second.
The new scheme performs a bit better (~5%), although the important point
here is there are no performance regressions vs. the previous mechanism.
Per-cpu lists draining happens only in slow paths.
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/all/20220420095906.27349-7-mgorman@techsingularity.net/
Conflicts:
mm/page_alloc.c
1. aosp doesn't need 9c25cbfcb3, skip it
Bug: 230899966
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I8c4120d215836b04c53d0e4950a821fce4c99075
Currently the PCP lists are protected by using local_lock_irqsave to
prevent migration and IRQ reentrancy but this is inconvenient. Remote
draining of the lists is impossible and a workqueue is required and
every task allocation/free must disable then enable interrupts which is
expensive.
As preparation for dealing with both of those problems, protect the
lists with a spinlock. The IRQ-unsafe version of the lock is used
because IRQs are already disabled by local_lock_irqsave. spin_trylock
is used in preparation for a time when local_lock could be used instead
of local_lock_irqsave.
The per_cpu_pages still fits within the same number of cache lines after
this patch relative to before the series.
struct per_cpu_pages {
        spinlock_t                 lock;                 /*     0     4 */
        int                        count;                /*     4     4 */
        int                        high;                 /*     8     4 */
        int                        batch;                /*    12     4 */
        short int                  free_factor;          /*    16     2 */
        short int                  expire;               /*    18     2 */

        /* XXX 4 bytes hole, try to pack */

        struct list_head           lists[13];            /*    24   208 */

        /* size: 256, cachelines: 4, members: 7 */
        /* sum members: 228, holes: 1, sum holes: 4 */
        /* padding: 24 */
} __attribute__((__aligned__(64)));
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/all/20220420095906.27349-6-mgorman@techsingularity.net/
Link: https://lore.kernel.org/all/20220429091321.GB3441@techsingularity.net/
Conflicts:
mm/page_alloc.c
1. per_cpu_pages are updated from 44042b4498 at 5.13 so conflicted
Since we don't need to have high-order page pcp atm, skip the patch.
Bug: 230899966
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I03ff1c22301e7f8735947e71413376ea143e855a