This patch repurposes an ANDROID_KABI_RESERVE slot, originally reserved for
LTS backports, for feature backports. Slot 4 is repurposed because parts of
slot 1 are already used for accept_ra_min_lft on some branches.
Bug: 315069348
Signed-off-by: Patrick Rohr <prohr@google.com>
Change-Id: I19b9dfc16d891fb6fe48ec4379c6fa3dcb6adf89
A kernel panic was observed in do_swap_page() when invoked on a page from
the swap cache that had previously been moved (via the MOVE ioctl). This
happened because [1] was not backported previously, so calling
page_move_anon_rmap() would set the PG_anon_exclusive flag in the source
folio, which shouldn't be done for a swap-cache folio.
[1] https://lore.kernel.org/all/20231002142949.235104-3-david@redhat.com/T/#ma99279cb1eb9d5f8f23540f68ea1244de7294ca0
Bug: 413428616
Change-Id: I867aa9c85fdba111bdecb303614438312038d2fe
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Patch series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()".
Convert page_move_anon_rmap() to folio_move_anon_rmap(), letting the
callers handle PageAnonExclusive. I'm including cleanup patch #3 because
it fits into the picture and can be done cleaner by the conversion.
This patch (of 3):
Let's move it into the caller: there is a difference between whether an
anon folio can only be mapped by one process (e.g., into one VMA), and
whether it is truly exclusive (e.g., no references -- including GUP --
from other processes).
Further, for large folios the page might not actually be pointing at the
head page of the folio, so it better be handled in the caller. This is a
preparation for converting page_move_anon_rmap() to consume a folio.
Link: https://lkml.kernel.org/r/20231002142949.235104-1-david@redhat.com
Link: https://lkml.kernel.org/r/20231002142949.235104-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
1. mm/hugetlb.c
[Due to page_mapcount() instead of folio_mapcount() and folio_test_anon()
instead of PageAnon()]
(cherry picked from commit 5ca432896a4ce6d69fffc3298b24c0dd9bdb871f)
Bug: 413428616
Bug: 313807618
Change-Id: Ibd29fec4d2a521d5ffc0782effd855cde9687a78
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
By recording the workingset refault count of important processes and
passing it to the userspace policy, optimizations can be made
to improve system performance.
Bug: 340146803
Change-Id: Ibf9791d9645e392b49c24480ca0be5e7fe99bebe
Signed-off-by: Lei Liu <liulei.rjpt@vivo.corp-partner.google.com>
(cherry picked from commit c196e17dffdb946434b92410507395a586407be4)
Signed-off-by: DANGJian <dangjian@honor.corp-partner.google.com>
Android has mounted the v1 cpuset controller using filesystem type
"cpuset" (not "cgroup") since 2015 [1], and depends on the resulting
behavior where the controller name is not added as a prefix for cgroupfs
files. [2]
Later, a problem was discovered where cpu hotplug onlining did not
affect the cpuset/cpus files, which Android carried an out-of-tree patch
to address for a while. An attempt was made to upstream this patch, but
the recommendation was to use the "cpuset_v2_mode" mount option
instead. [3]
An effort was made to do so, but this fails with "cgroup: Unknown
parameter 'cpuset_v2_mode'" because commit e1cba4b85d ("cgroup: Add
mount flag to enable cpuset to use v2 behavior in v1 cgroup") did not
update the special cased cpuset_mount(), and only the cgroup (v1)
filesystem type was updated.
Add parameter parsing to the cpuset filesystem type so that
cpuset_v2_mode works like the cgroup filesystem type:
$ mkdir /dev/cpuset
$ mount -t cpuset -ocpuset_v2_mode none /dev/cpuset
$ mount|grep cpuset
none on /dev/cpuset type cgroup (rw,relatime,cpuset,noprefix,cpuset_v2_mode,release_agent=/sbin/cpuset_release_agent)
[1] b769c8d24f
[2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/setup/cgroup_map_write.cpp;drc=2dac5d89a0f024a2d0cc46a80ba4ee13472f1681;l=192
[3] https://lore.kernel.org/lkml/f795f8be-a184-408a-0b5a-553d26061385@redhat.com/T/
Fixes: e1cba4b85d ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Waiman Long <longman@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 1bf67c8fdbda21fadd564a12dbe2b13c1ea5eda7 https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-6.15-fixes)
Bug: 409240872
Change-Id: I24726766d247e2638c719b56bd7d2d536085f6e4
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Export cgroup_rm_cftypes so that modules can remove their cgroup control
files on exit; otherwise undefined behavior may occur.
Bug: 340297716
Change-Id: Ieda8a8ab155aeb71e0f20fdfb5068ac24465061f
Signed-off-by: Jianan Huang <huangjianan@xiaomi.com>
(cherry picked from commit 800f7297b5d0b17f00ad09e345513c4ba30d77d2)
Export mem_cgroup_move_account to migrate folios between different
memcgs. This is to achieve more accurate memory reclamation.
Bug: 373540729
Change-Id: I77ac12fdc25bae90f37f725be1a168da52f02abd
Signed-off-by: Jianan Huang <huangjianan@xiaomi.com>
(cherry picked from commit c031476ae982c66d0f0674eb0a5c1ee03e825fd7)
This is to adjust parameters between different memcgs to achieve
more accurate memory reclamation.
Bug: 373540729
Change-Id: Ifb97a144c057555c5f9181f357fa146f9509be3e
Signed-off-by: Jianan Huang <huangjianan@xiaomi.com>
(cherry picked from commit 9d6f981a89e6e289f114270e2f1738b2b6fdd2ab)
Add a vendor hook for when a folio is charged to a memcg. This allows some
specific folios to be managed in a separate memcg for more accurate memory
reclamation.
Bug: 373540729
Change-Id: I11b1fca279ea9e9e8be1f789bdf1f9d7c1bf001f
Signed-off-by: Jianan Huang <huangjianan@xiaomi.com>
(cherry picked from commit 6e2565c513127c425ddfb84e473dba8161154036)
The merge included
commit bbedc64de0 ("f2fs: factor the read/write tracing logic into a helper")
During merge we accidentally undid a part of the change from
commit fae611f4f0 ("f2fs: allocate trace path buffer from names_cache")
This patch fixes it by using f2fs_getname() to match with f2fs_putname()
at the end.
Bug: 409714766
Fixes: bfad6b019c ("Merge tag 'android14-6.1.115_r00' into android14-6.1")
Change-Id: I56f78e560c0847939773c9773064bc60561effcb
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Update the QCOM ABI symbol list for the display HFI driver, which
facilitates communication with the Display CoProcessor (DCP Firmware).
1 function symbol added
virtqueue_get_vring
Bug: 409461670
Change-Id: I5ad34386609d3dc0a72a2600edc202fcecf0d999
Signed-off-by: Mahadevan <quic_mahap@quicinc.com>
Classic BPF socket filters with SKF_NET_OFF and SKF_LL_OFF fail to
read when these offsets extend into frags.
This has been observed with iwlwifi and reproduced with tun with
IFF_NAPI_FRAGS. The below straightforward socket filter on UDP port,
applied to a RAW socket, will silently miss matching packets.
const int offset_proto = offsetof(struct ip6_hdr, ip6_nxt);
const int offset_dport = sizeof(struct ip6_hdr) + offsetof(struct udphdr, dest);
struct sock_filter filter_code[] = {
BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_AD_OFF + SKF_AD_PKTTYPE),
BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, PACKET_HOST, 0, 4),
BPF_STMT(BPF_LD + BPF_B + BPF_ABS, SKF_NET_OFF + offset_proto),
BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_UDP, 0, 2),
BPF_STMT(BPF_LD + BPF_H + BPF_ABS, SKF_NET_OFF + offset_dport),
This is unexpected behavior. Socket filter programs should be
consistent regardless of environment. Silent misses are
particularly concerning as they are hard to detect.
Use skb_copy_bits for offsets outside linear, same as done for
non-SKF_(LL|NET) offsets.
The offset is always positive after subtracting the reference threshold
SKF_(LL|NET)_OFF, so it is always >= skb_(mac|network)_offset. The sum of
the two is an offset against skb->data, and may be negative, but it
cannot point before skb->head, as skb_(mac|network)_offset would too.
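The fallback described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct pkt`, `pkt_copy_bits()` and `pkt_load_half()` are hypothetical names, the packet has a single fragment, and `pkt_copy_bits()` merely stands in for the role skb_copy_bits() plays in the kernel. It is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an skb: a linear area plus one fragment. */
struct pkt {
    const uint8_t *linear;
    size_t linear_len;
    const uint8_t *frag;
    size_t frag_len;
};

/* Models the role of skb_copy_bits(): copy len bytes starting at offset,
 * crossing from the linear area into the fragment when needed.
 * Returns 0 on success, -1 if the range lies outside the packet. */
static int pkt_copy_bits(const struct pkt *p, size_t offset, void *to, size_t len)
{
    uint8_t *dst = to;
    size_t total = p->linear_len + p->frag_len;

    if (offset > total || len > total - offset)
        return -1;
    for (size_t i = 0; i < len; i++) {
        size_t pos = offset + i;
        dst[i] = pos < p->linear_len ? p->linear[pos]
                                     : p->frag[pos - p->linear_len];
    }
    return 0;
}

/* Load a big-endian 16-bit value (for example a UDP destination port).
 * Fast path: the value is fully inside the linear area.
 * Slow path: fall back to the copy helper instead of failing. */
static int pkt_load_half(const struct pkt *p, size_t offset, uint16_t *out)
{
    uint8_t buf[2];

    assert(p && out);
    if (p->linear_len >= 2 && offset <= p->linear_len - 2)
        memcpy(buf, p->linear + offset, 2);
    else if (pkt_copy_bits(p, offset, buf, 2) < 0)
        return -1;
    *out = (uint16_t)((buf[0] << 8) | buf[1]);
    return 0;
}
```

In this model, a filter that only implemented the fast path would report a failed load for any value that starts in or straddles the fragment, which corresponds to the silent misses described in this commit message.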
This appears to go back to when frag support was introduced to
sk_run_filter in linux-2.4.4, before the introduction of git.
The amount of code change and 8/16/32 bit duplication are unfortunate.
But any attempt I made to be smarter saved very few LoC while
complicating the code.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/20250122200402.3461154-1-maze@google.com/
Link: https://elixir.bootlin.com/linux/2.4.4/source/net/core/filter.c#L244
Reported-by: Matt Moeller <moeller.matt@gmail.com>
Co-developed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20250408132833.195491-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
(cherry picked from commit d4bac0288a2b444e468e6df9cb4ed69479ddf14a)
See: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=d4bac0288a2b444e468e6df9cb4ed69479ddf14a
Bug: 384636719
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Change-Id: I44e2572232f3a3459c49626f0fc5089e3e47d451
While browsing through ChromeOS crash reports, I found one with an
allocation failure that looked like this:
chrome: page allocation failure: order:7,
mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO),
nodemask=(null),cpuset=urgent,mems_allowed=0
CPU: 7 PID: 3295 Comm: chrome Not tainted
5.15.133-20574-g8044615ac35c #1 (HASH:1162 1)
Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT)
Call trace:
...
warn_alloc+0x104/0x174
__alloc_pages+0x5f0/0x6e4
kmalloc_order+0x44/0x98
kmalloc_order_trace+0x34/0x124
__kmalloc+0x228/0x36c
__regset_get+0x68/0xcc
regset_get_alloc+0x1c/0x28
elf_core_dump+0x3d8/0xd8c
do_coredump+0xeb8/0x1378
get_signal+0x14c/0x804
...
An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not
a surprise that this allocation failed on a system that's been running
for a while.
More digging showed that it was fairly easy to see the order 7
allocation by just sending a SIGQUIT to chrome (or other processes) to
generate a core dump. The actual amount being allocated was 279,584
bytes and it was for "core_note_type" NT_ARM_SVE.
There was quite a bit of discussion [1] on the mailing lists in
response to my v1 patch attempting to switch to vmalloc. The overall
conclusion was that we could likely reduce the 279,584 byte allocation
by quite a bit and Mark Brown has sent a patch to that effect [2].
However even with the 279,584 byte allocation gone there are still
65,552 byte allocations. These are just barely more than the 65,536
bytes and thus would require an order 5 allocation.
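The order arithmetic quoted above can be checked with a short sketch. It assumes 4 KiB pages (consistent with order 7 being 512K); `SKETCH_PAGE_SIZE` and `order_for_bytes()` are hypothetical names used only for this illustration, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= bytes,
 * i.e. the power-of-two number of contiguous pages needed. */
static unsigned int order_for_bytes(size_t bytes)
{
    unsigned int order = 0;

    assert(bytes > 0);
    while ((SKETCH_PAGE_SIZE << order) < bytes)
        order++;
    return order;
}
```

Under that assumption, `order_for_bytes(279584)` yields 7, `order_for_bytes(65552)` yields 5, and `order_for_bytes(65536)` yields 4, which is why the 16 extra bytes push the smaller allocation up a full order.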
An order 5 allocation is still something to avoid unless necessary and
nothing needs the memory here to be contiguous. Change the allocation
to kvzalloc() which should still be efficient for small allocations
but doesn't force the memory subsystem to work hard (and maybe fail)
at getting a large contiguous chunk.
[1] https://lore.kernel.org/r/20240201171159.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid
[2] https://lore.kernel.org/r/20240203-arm64-sve-ptrace-regset-size-v1-1-2c3ba1386b9e@kernel.org
Link: https://lkml.kernel.org/r/20240205092626.v2.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 409708978
(cherry picked from commit 6b839b3b76cf17296ebd4a893841f32cae08229c)
Signed-off-by: Seiya Wang <seiya.wang@mediatek.com>
(cherry picked from https://android-review.googlesource.com/q/commit:4f551093f53b449c590bbd44e97bc2cdf528e8d3)
Merged-In: I42c9bcb78bde782b0b52432086c6b3e9e95ab6d3
Change-Id: I42c9bcb78bde782b0b52432086c6b3e9e95ab6d3
This reverts commit 44ee678655
The original change, commit 710da3c8ea ("sched/core: Prevent race
condition between cpuset and __sched_setscheduler()") added potential
rwsem locking inside __sched_setscheduler() and moved the call
to __sched_setscheduler() out of the rcu read lock section in
do_sched_setscheduler(). However, there was a complication with
binder calling sched_setscheduler_nocheck() while holding the node
spin lock as well as potentially the thread->prio_lock.
So in commit 44ee678655 this was reverted in the Android tree,
undoing the rwsem additions and moving __sched_setscheduler() back
under the rcu read lock.
Later, upstream in commit 111cd11bbc ("sched/cpuset: Bring back
cpuset_mutex") and backported via 6.1-stable in commit 9bcfe15278,
the change reverted the original rwsem locking in __sched_setscheduler()
replacing them with mutexes, used only in the SCHED_DEADLINE case.
This resulted in the android tree having do_sched_setscheduler()
code paths take an rcu_read_lock() and then eventually call into a
mutex_lock(), triggering the following warning:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:293
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 13352, name: <test>
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
Call trace:
dump_backtrace+0xf8/0x148
show_stack+0x18/0x24
dump_stack_lvl+0x60/0x7c
dump_stack+0x18/0x38
__might_resched+0x1f0/0x2e8
__might_sleep+0x48/0x7c
mutex_lock+0x24/0xfc
cpuset_lock+0x18/0x28
__sched_setscheduler+0x2ec/0xb38
do_sched_setscheduler+0x180/0x1fc
__arm64_sys_sched_setscheduler+0x20/0x3c
invoke_syscall+0x58/0x118
el0_svc_common+0xb4/0xf4
do_el0_svc+0x24/0x80
el0_svc+0x2c/0x90
el0t_64_sync_handler+0x68/0xb4
el0t_64_sync+0x1a4/0x1a8
In the android-mainline tree, it was noted that the original issue with
binder had been resolved in 6.5-rc1, so the original revert was
undone by commit 4fb867eea029 ("Revert "Revert "sched/core: Prevent
race condition between cpuset and __sched_setscheduler()"").
However, binder is still calling sched_setscheduler_nocheck()
potentially holding spinlocks (see: b/275379975), but as we don't
see major issues (as __sched_setscheduler already may *currently*
sleep), it seems there may be logical restrictions that prevent it
from actually occurring (seemingly due to binder not running as
deadline).
The binder call path however does not use do_sched_setscheduler(),
so revert the remaining portion of commit 44ee678655 ("Revert "sched/core:
Prevent race condition between cpuset and __sched_setscheduler()""),
moving the call to __sched_setscheduler() outside the rcu critical
section. This will address the reported issue above, while not changing
the current situation with binder calling __sched_setscheduler().
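The difference between the two orderings can be sketched with a user-space model. Everything here is hypothetical scaffolding: the `model_*` helpers only count RCU nesting and record when a sleeping lock would be taken inside a read-side section, and the comment about grabbing a task reference during lookup is an assumption about how the task stays valid after the unlock.

```c
#include <assert.h>
#include <stdbool.h>

static int rcu_depth;        /* models "RCU nest depth" from the warning */
static int sleep_violations; /* times a sleeping lock was taken under RCU */

static void model_rcu_read_lock(void) { rcu_depth++; }

static void model_rcu_read_unlock(void)
{
    assert(rcu_depth > 0);
    rcu_depth--;
}

/* Models mutex_lock(): sleeping is invalid inside an RCU read-side section. */
static void model_mutex_lock(void)
{
    if (rcu_depth > 0)
        sleep_violations++;
}

/* Models __sched_setscheduler(): only the deadline case takes the mutex. */
static void model_sched_setscheduler(bool deadline)
{
    if (deadline)
        model_mutex_lock();
}

/* Ordering before this patch: the call is made under rcu_read_lock(). */
static void setscheduler_under_rcu(bool deadline)
{
    model_rcu_read_lock();
    /* task lookup would happen here */
    model_sched_setscheduler(deadline);
    model_rcu_read_unlock();
}

/* Ordering after this patch: only the lookup stays in the RCU section
 * (assumption: a task reference is taken there to keep the task alive). */
static void setscheduler_outside_rcu(bool deadline)
{
    model_rcu_read_lock();
    /* task lookup and reference grab would happen here */
    model_rcu_read_unlock();
    model_sched_setscheduler(deadline);
}
```

In the model, only `setscheduler_under_rcu(true)` increments `sleep_violations`, mirroring the "RCU nest depth: 1, expected: 0" warning quoted earlier.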
Bug: 408888661
Fixes: 44ee678655 ("Revert "sched/core: Prevent race condition between cpuset and __sched_setscheduler()"")
Change-Id: Ibebf364586cc3dda3993e7d685b5fee3566ec806
Signed-off-by: John Stultz <jstultz@google.com>
This reverts commit 1fe91f863a as it
breaks the Android Auto (AA) Desktop Head Unit on MacBooks when connected
with SuperSpeed or faster cables.
Test: Run DHU on mac with Superspeed cables.
Bug: 401274795
Signed-off-by: Badhri Jagan Sridharan <badhri@google.com>
Change-Id: Ibdf6d9360aa65480831127bee1cc6554f4a5beb9
[ Upstream commit bc50835e83f60f56e9bec2b392fb5544f250fb6f ]
Lion Ackermann was able to create a UAF which can be abused for privilege
escalation with the following script:
Step 1. Create the root qdisc
tc qdisc add dev lo root handle 1:0 drr
Step 2. Add a class for packet aggregation to demonstrate the UAF
tc class add dev lo classid 1:1 drr
Step 3. Add a class for nesting
tc class add dev lo classid 1:2 drr
Step 4. Add a class to graft the qdisc to
tc class add dev lo classid 1:3 drr
Step 5.
tc qdisc add dev lo parent 1:1 handle 2:0 plug limit 1024
Step 6.
tc qdisc add dev lo parent 1:2 handle 3:0 drr
Step 7.
tc class add dev lo classid 3:1 drr
Step 8.
tc qdisc add dev lo parent 3:1 handle 4:0 pfifo
Step 9. Display the class/qdisc layout
tc class ls dev lo
class drr 1:1 root leaf 2: quantum 64Kb
class drr 1:2 root leaf 3: quantum 64Kb
class drr 3:1 root leaf 4: quantum 64Kb
tc qdisc ls
qdisc drr 1: dev lo root refcnt 2
qdisc plug 2: dev lo parent 1:1
qdisc pfifo 4: dev lo parent 3:1 limit 1000p
qdisc drr 3: dev lo parent 1:2
Step 10. Trigger the bug <=== prevented by this patch
tc qdisc replace dev lo parent 1:3 handle 4:0
Step 11. Redisplay the qdiscs/classes
tc class ls dev lo
class drr 1:1 root leaf 2: quantum 64Kb
class drr 1:2 root leaf 3: quantum 64Kb
class drr 1:3 root leaf 4: quantum 64Kb
class drr 3:1 root leaf 4: quantum 64Kb
tc qdisc ls
qdisc drr 1: dev lo root refcnt 2
qdisc plug 2: dev lo parent 1:1
qdisc pfifo 4: dev lo parent 3:1 refcnt 2 limit 1000p
qdisc drr 3: dev lo parent 1:2
Observe that a) parent for 4:0 does not change despite the replace request.
There can only be one parent. b) refcount has gone up by two for 4:0 and
c) both class 1:3 and 3:1 are pointing to it.
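Step 10 asks to replace qdisc 4:0 onto parent 1:3 although it already sits under 3:1. A minimal sketch of a check that rejects such a request is shown below. The names `struct graft_qdisc` and `check_replace()` are hypothetical, and the handle encoding (major in the upper 16 bits, minor in the lower 16 bits) is the usual tc convention; the real check in the kernel may be shaped differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a qdisc: its own handle and its current parent. */
struct graft_qdisc {
    uint32_t handle; /* e.g. 4:0 -> 0x00040000 */
    uint32_t parent; /* e.g. 3:1 -> 0x00030001 */
};

/* Returns 0 if a "replace" targets the qdisc's current parent, or
 * -EINVAL if it would move the existing qdisc to a different parent. */
static int check_replace(const struct graft_qdisc *existing,
                         uint32_t requested_parent)
{
    assert(existing);
    if (existing->parent != requested_parent)
        return -EINVAL;
    return 0;
}
```

With qdisc 4:0 attached under 3:1, a replace request naming parent 1:3 is refused in this model, so the state where two classes point at the same qdisc is never created.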
Step 12. Send one packet to the plug qdisc
echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10001))
Step 13. Send one packet to the grafted fifo
echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10003))
Step 14. Trigger the UAF
tc class delete dev lo classid 1:3
tc class delete dev lo classid 1:1
The semantics of "replace" is for a del/add _on the same node_ and not
a delete from one node(3:1) and add to another node (1:3) as in step10.
While we could "fix" with a more complex approach there could be
consequences to expectations so the patch takes the preventive approach of
"disallow such config".
Bug: 393266309
Joint work with Lion Ackermann <nnamrec@gmail.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250116013713.900000-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit deda09c054)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: Id94e8dfb543643e489e33f79af990f23580b9121
Add one vendor hook:
android_vh_do_shrink_slab_ex
Add vendor hook point in do_shrink_slab to optimize for user
experience related threads and time-consuming shrinkers.
Bug: 407420219
Change-Id: I5ee29988eebb53da503f729564946b12deb1d981
Signed-off-by: pengzhongcui <pengzhongcui@xiaomi.corp-partner.google.com>
[ Upstream commit 399a45e5237ca14037120b1b895bd38a3b4492ea ]
device_del() can lead to new work being scheduled in gadget->work
workqueue. This is observed, for example, with the dwc3 driver with the
following call stack:
device_del()
gadget_unbind_driver()
usb_gadget_disconnect_locked()
dwc3_gadget_pullup()
dwc3_gadget_soft_disconnect()
usb_gadget_set_state()
schedule_work(&gadget->work)
Move flush_work() after device_del() to ensure the workqueue is cleaned
up.
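The ordering issue can be illustrated with a toy model. The `model_*` names and `struct model_gadget` are hypothetical; the only behavior modeled is that deleting the device may queue new work, as in the call stack above, and that flushing clears whatever is pending at that moment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a gadget with a single work item. */
struct model_gadget {
    bool work_pending;
};

/* Models flush_work(): waits for (here: clears) any pending work. */
static void model_flush_work(struct model_gadget *g)
{
    g->work_pending = false;
}

/* Models device_del(): the disconnect path may schedule new work. */
static void model_device_del(struct model_gadget *g)
{
    g->work_pending = true;
}

/* Old ordering: flush first, then delete. Returns true if work is
 * still pending once teardown has finished. */
static bool teardown_flush_first(struct model_gadget *g)
{
    assert(g);
    model_flush_work(g);
    model_device_del(g);
    return g->work_pending;
}

/* New ordering: delete first, then flush. */
static bool teardown_flush_last(struct model_gadget *g)
{
    assert(g);
    model_device_del(g);
    model_flush_work(g);
    return g->work_pending;
}
```

In the model, `teardown_flush_first()` leaves work pending after teardown while `teardown_flush_last()` does not, which is the motivation for moving the flush.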
Fixes: 5702f75375 ("usb: gadget: udc-core: move sysfs_notify() to a workqueue")
Cc: stable <stable@kernel.org>
Bug: 406664478
Bug: 400301689
Change-Id: Icf64956f8a17b1876388546b679cfd203d9701dc
Signed-off-by: Roy Luo <royluo@google.com>
Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Reviewed-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
Link: https://lore.kernel.org/r/20250204233642.666991-1-royluo@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 859cb45aef)
Signed-off-by: wei li <sirius.liwei@honor.corp-partner.google.com>
(cherry picked from commit de3fe45104b53290db95363d89fa763b8724e22c)
Signed-off-by: Lianqin Hu <hulianqin@vivo.corp-partner.google.com>
Adding the following symbols:
- __traceiter_android_vh_calculate_totalreserve_pages
- __tracepoint_android_vh_calculate_totalreserve_pages
Bug: 396115949
Change-Id: I0e17cd9359b1bdc1b5de5c63d75681ee3be1366d
Signed-off-by: Martin Liu <liumartin@google.com>
This vendor hook enables or disables updating the LMKD zone watermark level.
Bug: 396115949
Test: build
Change-Id: I0089a0586821120e47c46e08bcfea11a1602d516
Signed-off-by: Martin Liu <liumartin@google.com>
The android_vh_folio_referenced_check_bypass hook reverse-maps and
skips pages with high memory pressure in shrink_active_list,
preferring to recycle them. This helps reduce memory pressure and
improve system performance under high load.
Bug: 404067669
Change-Id: Ic10edcef9761df774d6cf18544e7c044bf78d3ed
Signed-off-by: Marcus Ma <maminghui5@xiaomi.corp-partner.google.com>
This commit introduces a new trace event,
`mm_calculate_totalreserve_pages`, which reports the new reserve value at
the exact time when it takes effect.
The `totalreserve_pages` value represents the total amount of memory
reserved across all zones and nodes in the system. This reserved memory
is crucial for ensuring that critical kernel operations have access to
sufficient memory, even under memory pressure.
By tracing the `totalreserve_pages` value, developers can gain insight
into how the total reserved memory changes over time.
Link: https://lkml.kernel.org/r/20250308034606.2036033-4-liumartin@google.com
Signed-off-by: Martin Liu <liumartin@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 396115949
(cherry picked from commit 15766485e4a51bec2dcce304c089a95550720033
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Change-Id: Iced6ea39ad8a36a50bf4393814b6bca2f64ac3b0
Signed-off-by: Martin Liu <liumartin@google.com>
This commit introduces the `mm_setup_per_zone_lowmem_reserve` trace
event, which provides detailed insights into the kernel's per-zone lowmem
reserve configuration.
The trace event provides precise timestamps, allowing developers to:
1. Correlate lowmem reserve changes with specific kernel events and
diagnose unexpected kswapd or direct reclaim behavior triggered
by dynamic changes in lowmem reserve.
2. Understand memory allocation failures that occur due to insufficient
lowmem reserve, by precisely correlating allocation attempts with
reserve adjustments.
Link: https://lkml.kernel.org/r/20250308034606.2036033-3-liumartin@google.com
Signed-off-by: Martin Liu <liumartin@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 396115949
(cherry picked from commit a293aba4a584709889f77a0ad0c45746aecf1b9f
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Change-Id: I271fc260ec60645230681bf0afbcd10d84453c88
Signed-off-by: Martin Liu <liumartin@google.com>
Patch series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages", v2.
This patchset introduces tracepoints to track changes in the lowmem
reserves, watermarks and totalreserve_pages. This helps to track
the exact timing of such changes and understand their relation to
reclaim activities.
The tracepoints added are:
mm_setup_per_zone_lowmem_reserve
mm_setup_per_zone_wmarks
mm_calculate_totalreserve_pages
This patch (of 3):
This commit introduces the `mm_setup_per_zone_wmarks` trace event,
which provides detailed insights into the kernel's per-zone watermark
configuration, offering precise timing and the ability to correlate
watermark changes with specific kernel events.
While `/proc/zoneinfo` provides some information about zone watermarks,
this trace event offers:
1. The ability to link watermark changes to specific kernel events and
logic.
2. The ability to capture rapid or short-lived changes in watermarks
that may be missed by user-space polling.
3. The ability to diagnose unexpected kswapd activity or excessive direct
reclaim triggered by rapidly changing watermarks.
Link: https://lkml.kernel.org/r/20250308034606.2036033-1-liumartin@google.com
Link: https://lkml.kernel.org/r/20250308034606.2036033-2-liumartin@google.com
Signed-off-by: Martin Liu <liumartin@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Liu <liumartin@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 396115949
(cherry picked from commit 8c02048d1c6126527f15752a5e0849dc49cefeeb
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Change-Id: I7e326e78542abb6fa5f3ccbe5d61a59f42d7cf2f
Signed-off-by: Martin Liu <liumartin@google.com>
Adding the following symbols:
- folio_mapcount
Bug: 404067677
Change-Id: Id8382f108729e23475a652855a75d99ee892c41c
Signed-off-by: Marcus Ma <maminghui5@xiaomi.corp-partner.google.com>
We need to get the number of folio mappings through folio_mapcount. Later,
pages with a mapcount higher than a certain threshold will be skipped for
reverse mapping, to reduce the high load caused by reverse mapping during
reclaim.
Bug: 404067677
Change-Id: I21dd847a07fb4e7bb616a3bc01b7d1cdf46e9b0b
Signed-off-by: Marcus Ma <maminghui5@xiaomi.corp-partner.google.com>
When testing the atomic write fix patches, the f2fs_bug_on was
triggered as below:
------------[ cut here ]------------
kernel BUG at fs/f2fs/inode.c:935!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 3 UID: 0 PID: 257 Comm: bash Not tainted 6.13.0-rc1-00033-gc283a70d3497 #5
RIP: 0010:f2fs_evict_inode+0x50f/0x520
Call Trace:
<TASK>
? __die_body+0x65/0xb0
? die+0x9f/0xc0
? do_trap+0xa1/0x170
? f2fs_evict_inode+0x50f/0x520
? f2fs_evict_inode+0x50f/0x520
? handle_invalid_op+0x65/0x80
? f2fs_evict_inode+0x50f/0x520
? exc_invalid_op+0x39/0x50
? asm_exc_invalid_op+0x1a/0x20
? __pfx_f2fs_get_dquots+0x10/0x10
? f2fs_evict_inode+0x50f/0x520
? f2fs_evict_inode+0x2e5/0x520
evict+0x186/0x2f0
prune_icache_sb+0x75/0xb0
super_cache_scan+0x1a8/0x200
do_shrink_slab+0x163/0x320
shrink_slab+0x2fc/0x470
drop_slab+0x82/0xf0
drop_caches_sysctl_handler+0x4e/0xb0
proc_sys_call_handler+0x183/0x280
vfs_write+0x36d/0x450
ksys_write+0x68/0xd0
do_syscall_64+0xc8/0x1a0
? arch_exit_to_user_mode_prepare+0x11/0x60
? irqentry_exit_to_user_mode+0x7e/0xa0
The root cause is: f2fs uses FI_ATOMIC_DIRTIED to indicate dirty
atomic files during commit. If the inode is dirtied during commit,
such as by f2fs_i_pino_write, the vfs inode stays clean and the
f2fs inode is set to FI_DIRTY_INODE. The FI_DIRTY_INODE flag cannot
be cleared by write_inode later because the vfs inode is clean. Finally,
f2fs_bug_on is triggered by this inconsistent state on eviction.
To reproduce this situation:
- fd = open("/mnt/test.db", O_WRONLY)
- ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE)
- mv /mnt/test.db /mnt/test1.db
- ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE)
- echo 3 > /proc/sys/vm/drop_caches
To fix this problem, clear FI_DIRTY_INODE after commit, then
f2fs_mark_inode_dirty_sync will ensure a consistent dirty state.
Bug: 402645924
Fixes: fccaa81de87e ("f2fs: prevent atomic file from being dirtied before commit")
Change-Id: I2c637b4bc544453b07ab124527efb694da9b757f
Signed-off-by: Yunlei He <heyunlei@xiaomi.com>
Signed-off-by: Jianan Huang <huangjianan@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(cherry picked from commit 03511e936916873bf880e6678c98d5fb59c19742)
(cherry picked from commit 0e0c530475d05e8d91972957761d08ab0f0e931d)
(cherry picked from commit 52d776ea9f68f0101bd6c1b42ac98e9b697bfe7b)
[ Upstream commit 12e070eb6964b341b41677fd260af5a305316a1f ]
The following trace can be seen if a device is being unregistered while
its number of channels is being modified.
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: CPU: 3 PID: 3754 at kernel/locking/mutex.c:564 __mutex_lock+0xc8a/0x1120
CPU: 3 UID: 0 PID: 3754 Comm: ethtool Not tainted 6.13.0-rc6+ #771
RIP: 0010:__mutex_lock+0xc8a/0x1120
Call Trace:
<TASK>
ethtool_check_max_channel+0x1ea/0x880
ethnl_set_channels+0x3c3/0xb10
ethnl_default_set_doit+0x306/0x650
genl_family_rcv_msg_doit+0x1e3/0x2c0
genl_rcv_msg+0x432/0x6f0
netlink_rcv_skb+0x13d/0x3b0
genl_rcv+0x28/0x40
netlink_unicast+0x42e/0x720
netlink_sendmsg+0x765/0xc20
__sys_sendto+0x3ac/0x420
__x64_sys_sendto+0xe0/0x1c0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
This is because unregister_netdevice_many_notify might run before the
rtnl lock section of ethnl operations, e.g. set_channels in the above
example. In this example the rss lock would be destroyed by the device
unregistration path before being used again, but in general running
ethnl operations while dismantle has started is not a good idea.
Fix this by denying any operation on devices being unregistered. A check
was already there in ethnl_ops_begin, but not wide enough.
Note that the same issue cannot be seen on the ioctl version
(__dev_ethtool) because the device reference is retrieved from within
the rtnl lock section there. Once dismantle started, the net device is
unlisted and no reference will be found.
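The widened check can be sketched as follows. This is a model under assumptions: the `MODEL_NETREG_*` states mirror the usual ordering of net device registration states, and `model_ethnl_op_allowed()` is a hypothetical helper, not the kernel function. The point is that the check covers every state from "unregistering" onward, not just one state.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mirror of the device registration states, in order. */
enum model_reg_state {
    MODEL_NETREG_UNINITIALIZED,
    MODEL_NETREG_REGISTERED,
    MODEL_NETREG_UNREGISTERING,
    MODEL_NETREG_UNREGISTERED,
};

struct model_netdev {
    enum model_reg_state reg_state;
};

/* Deny any operation once dismantle has started: returns -ENODEV for
 * UNREGISTERING and every later state, 0 otherwise. */
static int model_ethnl_op_allowed(const struct model_netdev *dev)
{
    assert(dev);
    if (dev->reg_state >= MODEL_NETREG_UNREGISTERING)
        return -ENODEV;
    return 0;
}
```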
Bug: 392852041
Fixes: dde91ccfa2 ("ethtool: do not perform operations on net devices being unregistered")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250116092159.50890-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit b1cb37a31a)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I56dbd897bb6db194d1eab1d5370796d2e3142fe2
commit 647cef20e649c576dff271e018d5d15d998b629d upstream.
Expected behaviour:
When the scheduler's limit is reached, pfifo_tail_enqueue() drops a
packet from the scheduler's queue and decreases the scheduler's qlen by one.
Then pfifo_tail_enqueue() enqueues the new packet and increases the
scheduler's qlen by one. Finally, pfifo_tail_enqueue() returns the
`NET_XMIT_CN` status code.
Weird behaviour:
If we set `sch->limit == 0` and trigger pfifo_tail_enqueue() on a
scheduler that has no packets, the 'drop a packet' step does nothing.
This means the scheduler's qlen is still 0.
Then we go on to enqueue the new packet and increase the scheduler's qlen
by one. In summary, we can leverage pfifo_tail_enqueue() to increase qlen
by one and return the `NET_XMIT_CN` status code.
The problem is:
Let's say we have two qdiscs: Qdisc_A and Qdisc_B.
- Qdisc_A's type must have '->graft()' function to create parent/child relationship.
Let's say Qdisc_A's type is `hfsc`. Enqueue packet to this qdisc will trigger `hfsc_enqueue`.
- Qdisc_B's type is pfifo_head_drop. Enqueue packet to this qdisc will trigger `pfifo_tail_enqueue`.
- Qdisc_B is configured to have `sch->limit == 0`.
- Qdisc_A is configured to route the enqueued's packet to Qdisc_B.
Enqueue packet through Qdisc_A will lead to:
- hfsc_enqueue(Qdisc_A) -> pfifo_tail_enqueue(Qdisc_B)
- Qdisc_B->q.qlen += 1
- pfifo_tail_enqueue() return `NET_XMIT_CN`
- hfsc_enqueue() check for `NET_XMIT_SUCCESS` and see `NET_XMIT_CN` => hfsc_enqueue() don't increase qlen of Qdisc_A.
The whole process lead to a situation where Qdisc_A->q.qlen == 0 and Qdisc_B->q.qlen == 1.
Replace 'hfsc' with other type (for example: 'drr') still lead to the same problem.
This violate the design where parent's qlen should equal to the sum of its childrens'qlen.
Bug impact: This issue can be used for user->kernel privilege escalation when it is reachable.
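The sequence above can be reproduced in a small user-space model. This is
only a sketch: `struct toy_qdisc` and the `toy_*` functions are hypothetical
stand-ins that track nothing but `limit` and `qlen`, and they model the
behaviour described in this message, not the actual kernel code.

```c
#include <assert.h>

/* Status codes with the same names as the kernel's enqueue results. */
enum { NET_XMIT_SUCCESS = 0x00, NET_XMIT_DROP = 0x01, NET_XMIT_CN = 0x02 };

/* Hypothetical, heavily simplified stand-in for struct Qdisc. */
struct toy_qdisc {
	unsigned int limit;
	unsigned int qlen;
};

/* Models the head-drop enqueue path as described above. */
int toy_pfifo_tail_enqueue(struct toy_qdisc *sch)
{
	if (sch->qlen < sch->limit) {
		sch->qlen++;
		return NET_XMIT_SUCCESS;
	}

	/* Queue full: drop one packet from the head, if there is one. */
	if (sch->qlen > 0)
		sch->qlen--;

	/* The new packet is enqueued regardless. */
	sch->qlen++;
	return NET_XMIT_CN;
}

/* Models a parent that only accounts for NET_XMIT_SUCCESS. */
int toy_parent_enqueue(struct toy_qdisc *parent, struct toy_qdisc *child)
{
	int ret = toy_pfifo_tail_enqueue(child);

	if (ret == NET_XMIT_SUCCESS)
		parent->qlen++;
	return ret;
}
```

With a child whose limit is 0, a single call to toy_parent_enqueue() leaves
the parent's qlen at 0 and the child's qlen at 1, which is the mismatch
described above.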
Bug: 395539871
Fixes: 57dbb2d83d ("sched: add head drop fifo queue")
Reported-by: Quang Le <quanglex97@gmail.com>
Signed-off-by: Quang Le <quanglex97@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Link: https://patch.msgid.link/20250204005841.223511-2-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 79a955ea4a)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I94a3851190671bc98666cb659e8419ab2767fb03
commit 78dafe1cf3afa02ed71084b350713b07e72a18fb upstream.
During socket release, sock_orphan() is called without considering that it
sets sk->sk_wq to NULL. Later, if SO_LINGER is enabled, this leads to a
null pointer dereference in virtio_transport_wait_close().
Orphan the socket only after transport release.
Partially reverts the 'Fixes:' commit.
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
lock_acquire+0x19e/0x500
_raw_spin_lock_irqsave+0x47/0x70
add_wait_queue+0x46/0x230
virtio_transport_release+0x4e7/0x7f0
__vsock_release+0xfd/0x490
vsock_release+0x90/0x120
__sock_release+0xa3/0x250
sock_close+0x14/0x20
__fput+0x35e/0xa90
__x64_sys_close+0x78/0xd0
do_syscall_64+0x93/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
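The ordering problem can be illustrated with a minimal user-space sketch.
All `toy_*` names are hypothetical. The model only tracks whether the wait
queue pointer is still valid when the transport release step needs it, and
it reports the would-be NULL dereference as a `false` return value instead
of crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the wait queue and the socket. */
struct toy_wq {
	int waiters;
};

struct toy_sock {
	struct toy_wq *sk_wq;
	bool linger;
};

/* Models sock_orphan(): the wait queue pointer is cleared. */
void toy_sock_orphan(struct toy_sock *sk)
{
	sk->sk_wq = NULL;
}

/*
 * Models the transport release step. With lingering enabled it needs the
 * wait queue; returns false where the kernel would dereference NULL.
 */
bool toy_transport_release(struct toy_sock *sk)
{
	if (!sk->linger)
		return true;
	if (sk->sk_wq == NULL)
		return false;
	sk->sk_wq->waiters++;
	return true;
}

/* Old order: orphan first, then release the transport. */
bool toy_release_buggy(struct toy_sock *sk)
{
	toy_sock_orphan(sk);
	return toy_transport_release(sk);
}

/* New order: orphan only after transport release. */
bool toy_release_fixed(struct toy_sock *sk)
{
	bool ok = toy_transport_release(sk);

	toy_sock_orphan(sk);
	return ok;
}
```

In this model, toy_release_buggy() fails for a lingering socket, while
toy_release_fixed() succeeds and still leaves the socket orphaned.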
Bug: 396331793
Reported-by: syzbot+9d55b199192a4be7d02c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9d55b199192a4be7d02c
Fixes: fcdd2242c023 ("vsock: Keep the binding until socket destruction")
Tested-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250210-vsock-linger-nullderef-v3-1-ef6244d02b54@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 631e00fdac)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I61ef914e5f706ee1c9dd2b9f95cbc69020fe8f00
commit 135ffc7becc82cfb84936ae133da7969220b43b2 upstream.
vsock defines a BPF callback to be invoked when close() is called. However,
this callback is never actually executed. As a result, a closed vsock
socket is not automatically removed from the sockmap/sockhash.
Introduce a dummy vsock_close() and make vsock_release() call proto::close.
Note: changes in __vsock_release() look messy, but it's only due to indent
level reduction and variables xmas tree reorder.
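A rough user-space sketch of the indirection follows. All `toy_*` names are
hypothetical, and the body of the dummy close (forwarding to the existing
release logic) is an assumption made for illustration. The point is that
once release goes through the proto's close pointer, a hook that replaces
the pointer gets to run.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock;

/* Simplified stand-in for struct proto; only the close callback is kept. */
struct toy_proto {
	void (*close)(struct toy_sock *sk, long timeout);
};

struct toy_sock {
	const struct toy_proto *prot;
	int close_calls;
	int released;
};

/* Stand-in for the existing release logic. */
void toy_do_release(struct toy_sock *sk)
{
	sk->released = 1;
}

/* "Dummy" close: assumed to do nothing beyond forwarding to release. */
void toy_vsock_close(struct toy_sock *sk, long timeout)
{
	(void)timeout;
	sk->close_calls++;
	toy_do_release(sk);
}

const struct toy_proto toy_vsock_proto = { .close = toy_vsock_close };

/* A hypothetical hook that wraps close, e.g. to drop a map entry first. */
int toy_map_entries = 1;

void toy_hooked_close(struct toy_sock *sk, long timeout)
{
	toy_map_entries--;
	toy_vsock_close(sk, timeout);
}

const struct toy_proto toy_hooked_proto = { .close = toy_hooked_close };

/* Release now goes through the proto's close pointer. */
void toy_vsock_release(struct toy_sock *sk)
{
	if (sk == NULL)
		return;
	sk->prot->close(sk, 0);
}
```

If toy_vsock_release() called toy_do_release() directly, the hooked close
would never run and the map entry would stay behind.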
Bug: 396331793
Fixes: 634f1a7110 ("vsock: support sockmap")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Link: https://lore.kernel.org/r/20241118-vsock-bpf-poll-close-v1-3-f1b9669cacdc@rbox.co
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
[LL: There is no sockmap support for this kernel version. This patch has
been backported because it helps reduce conflicts on future backports]
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 13a4362ab8)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I8aefa411aa1ef317743deb600aaa4a9cdd52abd3
In the case of the following call stack for an atomic file,
FI_DIRTY_INODE is set, but FI_ATOMIC_DIRTIED is not subsequently set.
f2fs_file_write_iter
  f2fs_map_blocks
    f2fs_reserve_new_blocks
      inc_valid_block_count
        __mark_inode_dirty(dquot)
          f2fs_dirty_inode
If FI_ATOMIC_DIRTIED is not set, an atomic file can encounter corruption
due to a mismatch between the old file size and the new data.
To resolve this issue, set FI_ATOMIC_DIRTIED whenever FI_DIRTY_INODE is
set. This ensures that FI_DIRTY_INODE, which was previously cleared by
the writeback thread during the atomic commit, is set and i_size is
updated.
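The flag coupling can be sketched in user space as follows. The `toy_*`
names and the flag values are hypothetical, and the sketch assumes the
extra flag is applied only to atomic files in the same place where the
dirty flag is set. It does not reproduce the actual f2fs helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the values are arbitrary. */
enum {
	TOY_FI_DIRTY_INODE    = 1u << 0,
	TOY_FI_ATOMIC_DIRTIED = 1u << 1,
};

/* Hypothetical, minimal stand-in for the in-memory inode state. */
struct toy_inode {
	unsigned int flags;
	bool atomic_file;
};

/*
 * Marks the inode dirty. For an atomic file, the "dirtied while atomic"
 * flag is set together with the dirty flag, so the later commit knows the
 * inode has to be written with the updated size.
 */
void toy_inode_dirtied(struct toy_inode *inode)
{
	inode->flags |= TOY_FI_DIRTY_INODE;
	if (inode->atomic_file)
		inode->flags |= TOY_FI_ATOMIC_DIRTIED;
}
```

In this model, every path that dirties an atomic file also records
TOY_FI_ATOMIC_DIRTIED, including a path like the call stack above.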
Cc: <stable@vger.kernel.org>
Fixes: fccaa81de87e ("f2fs: prevent atomic file from being dirtied before commit")
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Sunmin Jeong <s_min.jeong@samsung.com>
Signed-off-by: Yeongjin Gil <youngjin.gil@samsung.com>
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Bug: 381519582
(cherry picked from commit f098aeba04c9328571567dca45159358a250240c
https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Link: https://lore.kernel.org/linux-f2fs-devel/20250314120651.443184-1-youngjin.gil@samsung.com/
Change-Id: I7ce87dfbc2525ae185ae6c22671e98ecf021b988
Add xas_load to qcom abi symbol list.
Bug: 397560786
Change-Id: Ia4a7bab9c2f7670fd62b7aba6a8858a1c1890969
Signed-off-by: Ravi Kumar Bokka <quic_c_rbokka@quicinc.com>
Signed-off-by: Srinivasarao Pathipati <quic_c_spathi@quicinc.com>
If hibernation fails, the user cannot check the logs produced during
hibernation. No log is available for the period from copying the
hibernation image to shutting down the system, for example while the image
is written to storage. A vendor hook copies every log message, at all
loglevels, to a reserved memory address. pstore cannot capture all
loglevels, so a vendor hook is added to copy every log message. After the
system reboots, the user can check the logs at the reserved memory address
where the vendor hook stored them.
Bug: 342523877
Change-Id: I31f61378f555ea65ccecfa5b7a96a3ed3e4061a6
Signed-off-by: Dongbum Kim <dongbum.kim@lge.com>
If checkpoint is disabled, GC cannot reclaim any segments. We need
to detect this condition and bail out from fallocate() of a pinfile,
rather than letting the allocator run out of free segments, which may
cause f2fs to be shut down.
reproducer:
mkfs.f2fs -f /dev/vda 16777216
mount -o checkpoint=disable:10% /dev/vda /mnt/f2fs
for ((i=0;i<4096;i++)) do { dd if=/dev/zero of=/mnt/f2fs/$i bs=1M count=1; } done
sync
for ((i=0;i<4096;i+=2)) do { rm /mnt/f2fs/$i; } done
sync
touch /mnt/f2fs/pinfile
f2fs_io pinfile set /mnt/f2fs/pinfile
f2fs_io fallocate 0 0 4201644032 /mnt/f2fs/pinfile
cat /sys/kernel/debug/f2fs/status
output:
- Free: 0 (0)
Fixes: f5a53edcf0 ("f2fs: support aligned pinned file")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Bug: 399583169
(cherry picked from commit f7f8932ca6bb22494ef6db671633ad3b4d982271
https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Link: https://lore.kernel.org/linux-f2fs-devel/20250312090125.4014447-1-chao@kernel.org/
[Jaegeuk Kim: replace f2fs_warn_ratelimited with f2fs_warn]
Change-Id: If19aa65412e6ed59f1c15a4a29e210679ec260a0