linux

mirror of https://github.com/hardkernel/linux.git synced 2026-03-25 12:00:22 +09:00

Author	SHA1	Message	Date
Chen Ridong	06a5e91764	cpuset: Treat cpusets in attaching as populated [ Upstream commit b1bcaed1e39a9e0dfbe324a15d2ca4253deda316 ] Currently, the check for whether a partition is populated does not account for tasks in the cpuset of attaching. This is a corner case that can leave a task stuck in a partition with no effective CPUs. The race condition occurs as follows: cpu0 cpu1 //cpuset A with cpu N migrate task p to A cpuset_can_attach // with effective cpus // check ok // cpuset_mutex is not held // clear cpuset.cpus.exclusive // making effective cpus empty update_exclusive_cpumask // tasks_nocpu_error check ok // empty effective cpus, partition valid cpuset_attach ... // task p stays in A, with non-effective cpus. To fix this issue, this patch introduces cs_is_populated, which considers tasks in the attaching cpuset. This new helper is used in validate_change and partition_is_populated. Fixes: `e2d59900d9` ("cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2026-01-11 15:21:27 +01:00
Kumar Kartikeya Dwivedi	edf3b82887	bpf: Do not limit bpf_cgroup_from_id to current's namespace [ Upstream commit 2c895133950646f45e5cf3900b168c952c8dbee8 ] The bpf_cgroup_from_id kfunc relies on cgroup_get_from_id to obtain the cgroup corresponding to a given cgroup ID. This helper can be called in a lot of contexts where the current thread can be random. A recent example was its use in sched_ext's ops.tick(), to obtain the root cgroup pointer. Since the current task can be whatever random user space task preempted by the timer tick, this makes the behavior of the helper unreliable. Refactor out __cgroup_get_from_id as the non-namespace aware version of cgroup_get_from_id, and change bpf_cgroup_from_id to make use of it. There is no compatibility breakage here, since changing the namespace against which the lookup is being done to the root cgroup namespace only permits a wider set of lookups to succeed now. The cgroup IDs across namespaces are globally unique, and thus don't need to be retranslated. Reported-by: Dan Schatzberg <dschatzberg@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250915032618.1551762-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-11-24 10:29:21 +01:00
Chen Ridong	4a1e3ec28e	cgroup: split cgroup_destroy_wq into 3 workqueues [ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ] A hung task can occur during [1] LTP cgroup testing when repeatedly mounting/unmounting perf_event and net_prio controllers with systemd.unified_cgroup_hierarchy=1. The hang manifests in cgroup_lock_and_drain_offline() during root destruction. Related case: cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio Call Trace: cgroup_lock_and_drain_offline+0x14c/0x1e8 cgroup_destroy_root+0x3c/0x2c0 css_free_rwork_fn+0x248/0x338 process_one_work+0x16c/0x3b8 worker_thread+0x22c/0x3b0 kthread+0xec/0x100 ret_from_fork+0x10/0x20 Root Cause: CPU0 CPU1 mount perf_event umount net_prio cgroup1_get_tree cgroup_kill_sb rebind_subsystems // root destruction enqueues // cgroup_destroy_wq // kill all perf_event css // one perf_event css A is dying // css A offline enqueues cgroup_destroy_wq // root destruction will be executed first css_free_rwork_fn cgroup_destroy_root cgroup_lock_and_drain_offline // some perf descendants are dying // cgroup_destroy_wq max_active = 1 // waiting for css A to die Problem scenario: 1. CPU0 mounts perf_event (rebind_subsystems) 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work 3. A dying perf_event CSS gets queued for offline after root destruction 4. Root destruction waits for offline completion, but offline work is blocked behind root destruction in cgroup_destroy_wq (max_active=1) Solution: Split cgroup_destroy_wq into three dedicated workqueues: cgroup_offline_wq – Handles CSS offline operations cgroup_release_wq – Manages resource release cgroup_free_wq – Performs final memory deallocation This separation eliminates blocking in the CSS free path while waiting for offline operations to complete. [1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers Fixes: `334c3679ec` ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends") Reported-by: Gao Yingjie <gaoyingjie@uniontech.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Suggested-by: Teju Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-09-25 11:00:05 +02:00
Waiman Long	34c3bc762b	cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key [ Upstream commit 65f97cc81b0adc5f49cf6cff5d874be0058e3f41 ] The following lockdep splat was observed. [ 812.359086] ============================================ [ 812.359089] WARNING: possible recursive locking detected [ 812.359097] -------------------------------------------- [ 812.359100] runtest.sh/30042 is trying to acquire lock: [ 812.359105] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0xe/0x20 [ 812.359131] [ 812.359131] but task is already holding lock: [ 812.359134] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_write_resmask+0x98/0xa70 : [ 812.359267] Call Trace: [ 812.359272] <TASK> [ 812.359367] cpus_read_lock+0x3c/0xe0 [ 812.359382] static_key_enable+0xe/0x20 [ 812.359389] check_insane_mems_config.part.0+0x11/0x30 [ 812.359398] cpuset_write_resmask+0x9f2/0xa70 [ 812.359411] cgroup_file_write+0x1c7/0x660 [ 812.359467] kernfs_fop_write_iter+0x358/0x530 [ 812.359479] vfs_write+0xabe/0x1250 [ 812.359529] ksys_write+0xf9/0x1d0 [ 812.359558] do_syscall_64+0x5f/0xe0 Since commit `d74b27d63a` ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order"), the ordering of cpu hotplug lock and cpuset_mutex had been reversed. That patch correctly used the cpuslocked version of the static branch API to enable cpusets_pre_enable_key and cpusets_enabled_key, but it didn't do the same for cpusets_insane_config_key. The cpusets_insane_config_key can be enabled in the check_insane_mems_config() which is called from update_nodemask() or cpuset_hotplug_update_tasks() with both cpu hotplug lock and cpuset_mutex held. Deadlock can happen with a pending hotplug event that tries to acquire the cpu hotplug write lock which will block further cpus_read_lock() attempt from check_insane_mems_config(). Fix that by switching to use static_branch_enable_cpuslocked(). Fixes: `d74b27d63a` ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-08-28 16:28:47 +02:00
Chen Ridong	f371ad6471	Revert "cgroup_freezer: cgroup_freezing: Check if not frozen" [ Upstream commit 14a67b42cb6f3ab66f41603c062c5056d32ea7dd ] This reverts commit cff5f49d433fcd0063c8be7dd08fa5bf190c6c37. Commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen") modified the cgroup_freezing() logic to verify that the FROZEN flag is not set, affecting the return value of the freezing() function, in order to address a warning in __thaw_task. A race condition exists that may allow tasks to escape being frozen. The following scenario demonstrates this issue: CPU 0 (get_signal path) CPU 1 (freezer.state reader) try_to_freeze read freezer.state __refrigerator freezer_read update_if_frozen WRITE_ONCE(current->__state, TASK_FROZEN); ... /* Task is now marked frozen / / frozen(task) == true / / Assuming other tasks are frozen / freezer->state \|= CGROUP_FROZEN; / freezing(current) returns false / / because cgroup is frozen (not freezing) / break out __set_current_state(TASK_RUNNING); / Bug: Task resumes running when it should remain frozen */ The existing !frozen(p) check in __thaw_task makes the WARN_ON_ONCE(freezing(p)) warning redundant. Removing this warning enables reverting the commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen") to resolve the issue. The warning has been removed in the previous patch. This patch revert the commit cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen") to complete the fix. Fixes: cff5f49d433f ("cgroup_freezer: cgroup_freezing: Check if not frozen") Reported-by: Zhong Jiawei<zhongjiawei1@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-07-24 08:53:20 +02:00
Chen Ridong	48f35a3294	cgroup,freezer: fix incomplete freezing when attaching tasks commit 37fb58a7273726e59f9429c89ade5116083a213d upstream. An issue was found: # cd /sys/fs/cgroup/freezer/ # mkdir test # echo FROZEN > test/freezer.state # cat test/freezer.state FROZEN # sleep 1000 & [1] 863 # echo 863 > test/cgroup.procs # cat test/freezer.state FREEZING When tasks are migrated to a frozen cgroup, the freezer fails to immediately freeze the tasks, causing the cgroup to remain in the "FREEZING". The freeze_task() function is called before clearing the CGROUP_FROZEN flag. This causes the freezing() check to incorrectly return false, preventing __freeze_task() from being invoked for the migrated task. To fix this issue, clear the CGROUP_FROZEN state before calling freeze_task(). Fixes: `f5d39b0208` ("freezer,sched: Rewrite core freezer logic") Cc: stable@vger.kernel.org # v6.1+ Reported-by: Zhong Jiawei <zhongjiawei1@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-06-27 11:08:47 +01:00
gaoxu	20fb292ab5	cgroup: Fix compilation issue due to cgroup_mutex not being exported [ Upstream commit 87c259a7a359e73e6c52c68fcbec79988999b4e6 ] When adding folio_memcg function call in the zram module for Android16-6.12, the following error occurs during compilation: ERROR: modpost: "cgroup_mutex" [../soc-repo/zram.ko] undefined! This error is caused by the indirect call to lockdep_is_held(&cgroup_mutex) within folio_memcg. The export setting for cgroup_mutex is controlled by the CONFIG_PROVE_RCU macro. If CONFIG_LOCKDEP is enabled while CONFIG_PROVE_RCU is not, this compilation error will occur. To resolve this issue, add a parallel macro CONFIG_LOCKDEP control to ensure cgroup_mutex is properly exported when needed. Signed-off-by: gao xu <gaoxu2@honor.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-06-04 14:41:53 +02:00
Waiman Long	4e63b6907d	cgroup/cpuset: Extend kthread_is_per_cpu() check to all PF_NO_SETAFFINITY tasks [ Upstream commit 39b5ef791d109dd54c7c2e6e87933edfcc0ad1ac ] Commit `ec5fbdfb99` ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset") enabled us to pull CPUs dedicated to child partitions from tasks in top_cpuset by ignoring per cpu kthreads. However, there can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY flag set to indicate that we shouldn't mess with their CPU affinity. For other kthreads, their affinity will be changed to skip CPUs dedicated to child partitions whether it is an isolating or a scheduling one. As all the per cpu kthreads have PF_NO_SETAFFINITY set, the PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads. Fix this issue by dropping the kthread_is_per_cpu() check and checking the PF_NO_SETAFFINITY flag instead. Fixes: `ec5fbdfb99` ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-05-22 14:12:12 +02:00
Shakeel Butt	a00e607102	cgroup: fix race between fork and cgroup.kill commit b69bb476dee99d564d65d418e9a20acca6f32c3f upstream. Tejun reported the following race between fork() and cgroup.kill at [1]. Tejun: I was looking at cgroup.kill implementation and wondering whether there could be a race window. So, __cgroup_kill() does the following: k1. Set CGRP_KILL. k2. Iterate tasks and deliver SIGKILL. k3. Clear CGRP_KILL. The copy_process() does the following: c1. Copy a bunch of stuff. c2. Grab siglock. c3. Check fatal_signal_pending(). c4. Commit to forking. c5. Release siglock. c6. Call cgroup_post_fork() which puts the task on the css_set and tests CGRP_KILL. The intention seems to be that either a forking task gets SIGKILL and terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I don't see what guarantees that k3 can't happen before c6. ie. After a forking task passes c5, k2 can take place and then before the forking task reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child. What am I missing? This is indeed a race. One way to fix this race is by taking cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork() side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork() to cgroup_post_fork(). However that would be heavy handed as this adds one more potential stall scenario for cgroup.kill which is usually called under extreme situation like memory pressure. To fix this race, let's maintain a sequence number per cgroup which gets incremented on __cgroup_kill() call. On the fork() side, the cgroup_can_fork() will cache the sequence number locally and recheck it against the cgroup's sequence number at cgroup_post_fork() site. If the sequence numbers mismatch, it means __cgroup_kill() can been called and we should send SIGKILL to the newly created task. Reported-by: Tejun Heo <tj@kernel.org> Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1] Fixes: `661ee62809` ("cgroup: introduce cgroup.kill") Cc: stable@vger.kernel.org # v5.14+ Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-02-21 13:57:17 +01:00
Muhammad Adeel	b25ba45fcf	cgroup: Remove steal time from usage_usec [ Upstream commit db5fd3cf8bf41b84b577b8ad5234ea95f327c9be ] The CPU usage time is the time when user, system or both are using the CPU. Steal time is the time when CPU is waiting to be run by the Hypervisor. It should not be added to the CPU usage time, hence removing it from the usage_usec entry. Fixes: `936f2a70f2` ("cgroup: add cpu.stat file to root cgroup") Acked-by: Axel Busch <axel.busch@ibm.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Muhammad Adeel <muhammad.adeel@ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-02-21 13:57:08 +01:00
Chen Ridong	9e67b05419	cgroup/bpf: only cgroup v2 can be attached by bpf programs [ Upstream commit 2190df6c91373fdec6db9fc07e427084f232f57e ] Only cgroup v2 can be attached by bpf programs, so this patch introduces that cgroup_bpf_inherit and cgroup_bpf_offline can only be called in cgroup v2, and this can fix the memleak mentioned by commit `04f8ef5643` ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline"), which has been reverted. Fixes: `2b0d3d3e4f` ("percpu_ref: reduce memory footprint of percpu_ref in fast path") Fixes: `4bfc0bb2c6` ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-09 10:31:54 +01:00
Chen Ridong	92031d6601	Revert "cgroup: Fix memory leak caused by missing cgroup_bpf_offline" [ Upstream commit feb301c60970bd2a1310a53ce2d6e4375397a51b ] This reverts commit `04f8ef5643`. Only cgroup v2 can be attached by cgroup by BPF programs. Revert this commit and cgroup_bpf_inherit and cgroup_bpf_offline won't be called in cgroup v1. The memory leak issue will be fixed with next patch. Fixes: `04f8ef5643` ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline") Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/ Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-09 10:31:54 +01:00
Xiu Jianfeng	fb384669cb	cgroup: Fix potential overflow issue when checking max_depth [ Upstream commit 3cc4e13bb1617f6a13e5e6882465984148743cf4 ] cgroup.max.depth is the maximum allowed descent depth below the current cgroup. If the actual descent depth is equal or larger, an attempt to create a new child cgroup will fail. However due to the cgroup->max_depth is of int type and having the default value INT_MAX, the condition 'level > cgroup->max_depth' will never be satisfied, and it will cause an overflow of the level after it reaches to INT_MAX. Fix it by starting the level from 0 and using '>=' instead. It's worth mentioning that this issue is unlikely to occur in reality, as it's impossible to have a depth of INT_MAX hierarchy, but should be be avoided logically. Fixes: `1a926e0bba` ("cgroup: implement hierarchy limits") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-08 16:28:16 +01:00
Waiman Long	84a6b76b28	cgroup: Protect css->cgroup write under css_set_lock [ Upstream commit 57b56d16800e8961278ecff0dc755d46c4575092 ] The writing of css->cgroup associated with the cgroup root in rebind_subsystems() is currently protected only by cgroup_mutex. However, the reading of css->cgroup in both proc_cpuset_show() and proc_cgroup_show() is protected just by css_set_lock. That makes the readers susceptible to racing problems like data tearing or caching. It is also a problem that can be reported by KCSAN. This can be fixed by using READ_ONCE() and WRITE_ONCE() to access css->cgroup. Alternatively, the writing of css->cgroup can be moved under css_set_lock as well which is done by this patch. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-09-12 11:11:35 +02:00
Kamalesh Babulal	a2225b7af5	cgroup: Avoid extra dereference in css_populate_dir() [ Upstream commit d24f05987ce8bf61e62d86fedbe47523dc5c3393 ] Use css directly instead of dereferencing it from &cgroup->self, while adding the cgroup v2 cft base and psi files in css_populate_dir(). Both points to the same css, when css->ss is NULL, this avoids extra deferences and makes code consistent in usage across the function. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-29 17:33:24 +02:00
Yafang Shao	dd9542ae7c	cgroup: Make operations on the cgroup root_list RCU safe commit d23b5c577715892c87533b13923306acc6243f93 upstream. At present, when we perform operations on the cgroup root_list, we must hold the cgroup_mutex, which is a relatively heavyweight lock. In reality, we can make operations on this list RCU-safe, eliminating the need to hold the cgroup_mutex during traversal. Modifications to the list only occur in the cgroup root setup and destroy paths, which should be infrequent in a production environment. In contrast, traversal may occur frequently. Therefore, making it RCU-safe would be beneficial. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> To: Michal Koutný <mkoutny@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-19 06:04:25 +02:00
Chen Ridong	96226fbed5	cgroup/cpuset: Prevent UAF in proc_cpuset_show() [ Upstream commit 1be59c97c83ccd67a519d8a49486b3a8a73ca28a ] An UAF can happen when /proc/cpuset is read as reported in [1]. This can be reproduced by the following methods: 1.add an mdelay(1000) before acquiring the cgroup_lock In the cgroup_path_ns function. 2.$cat /proc/<pid>/cpuset repeatly. 3.$mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset/ $umount /sys/fs/cgroup/cpuset/ repeatly. The race that cause this bug can be shown as below: (umount) \| (cat /proc/<pid>/cpuset) css_release \| proc_cpuset_show css_release_work_fn \| css = task_get_css(tsk, cpuset_cgrp_id); css_free_rwork_fn \| cgroup_path_ns(css->cgroup, ...); cgroup_destroy_root \| mutex_lock(&cgroup_mutex); rebind_subsystems \| cgroup_free_root \| \| // cgrp was freed, UAF \| cgroup_path_ns_locked(cgrp,..); When the cpuset is initialized, the root node top_cpuset.css.cgrp will point to &cgrp_dfl_root.cgrp. In cgroup v1, the mount operation will allocate cgroup_root, and top_cpuset.css.cgrp will point to the allocated &cgroup_root.cgrp. When the umount operation is executed, top_cpuset.css.cgrp will be rebound to &cgrp_dfl_root.cgrp. The problem is that when rebinding to cgrp_dfl_root, there are cases where the cgroup_root allocated by setting up the root for cgroup v1 is cached. This could lead to a Use-After-Free (UAF) if it is subsequently freed. The descendant cgroups of cgroup v1 can only be freed after the css is released. However, the css of the root will never be released, yet the cgroup_root should be freed when it is unmounted. This means that obtaining a reference to the css of the root does not guarantee that css.cgrp->root will not be freed. Fix this problem by using rcu_read_lock in proc_cpuset_show(). As cgroup_root is kfree_rcu after commit d23b5c577715 ("cgroup: Make operations on the cgroup root_list RCU safe"), css->cgroup won't be freed during the critical section. To call cgroup_path_ns_locked, css_set_lock is needed, so it is safe to replace task_get_css with task_css. [1] https://syzkaller.appspot.com/bug?extid=9b1ff7be974a403aa4cd Fixes: `a79a908fd2` ("cgroup: introduce cgroup namespaces") Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-03 08:53:22 +02:00
Kees Cook	aea95c68b7	kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy() [ Upstream commit ff6d413b0b59466e5acf2e42f294b1842ae130a1 ] One of the last remaining users of strlcpy() in the kernel is kernfs_path_from_node_locked(), which passes back the problematic "length we _would_ have copied" return value to indicate truncation. Convert the chain of all callers to use the negative return value (some of which already doing this explicitly). All callers were already also checking for negative return values, so the risk to missed checks looks very low. In this analysis, it was found that cgroup1_release_agent() actually didn't handle the "too large" condition, so this is technically also a bug fix. :) Here's the chain of callers, and resolution identifying each one as now handling the correct return value: kernfs_path_from_node_locked() kernfs_path_from_node() pr_cont_kernfs_path() returns void kernfs_path() sysfs_warn_dup() return value ignored cgroup_path() blkg_path() bfq_bic_update_cgroup() return value ignored TRACE_IOCG_PATH() return value ignored TRACE_CGROUP_PATH() return value ignored perf_event_cgroup() return value ignored task_group_path() return value ignored damon_sysfs_memcg_path_eq() return value ignored get_mm_memcg_path() return value ignored lru_gen_seq_show() return value ignored cgroup_path_from_kernfs_id() return value ignored cgroup_show_path() already converted "too large" error to negative value cgroup_path_ns_locked() cgroup_path_ns() bpf_iter_cgroup_show_fdinfo() return value ignored cgroup1_release_agent() wasn't checking "too large" error proc_cgroup_show() already converted "too large" to negative value Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Waiman Long <longman@redhat.com> Cc: <cgroups@vger.kernel.org> Co-developed-by: Azeem Shaikh <azeemshaikh38@gmail.com> Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Link: https://lore.kernel.org/r/20231116192127.1558276-3-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20231212211741.164376-3-keescook@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 1be59c97c83c ("cgroup/cpuset: Prevent UAF in proc_cpuset_show()") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-03 08:53:21 +02:00
Vitalii Bursov	4d9d099ab2	sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level [ Upstream commit a1fd0b9d751f840df23ef0e75b691fc00cfd4743 ] Change relax_domain_level checks so that it would be possible to include or exclude all domains from newidle balancing. This matches the behavior described in the documentation: -1 no request. use system default or follow request of others. 0 no search. 1 search siblings (hyperthreads in a core). "2" enables levels 0 and 1, level_max excludes the last (level_max) level, and level_max+1 includes all levels. Fixes: `1d3504fcf5` ("sched, cpuset: customize sched domains, core") Signed-off-by: Vitalii Bursov <vitaly@bursov.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-06-12 11:12:13 +02:00
Kamalesh Babulal	d9f400dc3e	cgroup/cpuset: Fix retval in update_cpumask() commit 25125a4762835d62ba1e540c1351d447fc1f6c7c upstream. The update_cpumask(), checks for newly requested cpumask by calling validate_change(), which returns an error on passing an invalid set of cpu(s). Independent of the error returned, update_cpumask() always returns zero, suppressing the error and returning success to the user on writing an invalid cpu range for a cpuset. Fix it by returning retval instead, which is returned by validate_change(). Fixes: `99fe36ba6f` ("cgroup/cpuset: Improve temporary cpumasks handling") Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-04-03 15:28:40 +02:00
Tim Van Patten	9ec2d92673	cgroup_freezer: cgroup_freezing: Check if not frozen commit cff5f49d433fcd0063c8be7dd08fa5bf190c6c37 upstream. __thaw_task() was recently updated to warn if the task being thawed was part of a freezer cgroup that is still currently freezing: void __thaw_task(struct task_struct *p) { ... if (WARN_ON_ONCE(freezing(p))) goto unlock; This has exposed a bug in cgroup1 freezing where when CGROUP_FROZEN is asserted, the CGROUP_FREEZING bits are not also cleared at the same time. Meaning, when a cgroup is marked FROZEN it continues to be marked FREEZING as well. This causes the WARNING to trigger, because cgroup_freezing() thinks the cgroup is still freezing. There are two ways to fix this: 1. Whenever FROZEN is set, clear FREEZING for the cgroup and all children cgroups. 2. Update cgroup_freezing() to also verify that FROZEN is not set. This patch implements option (2), since it's smaller and more straightforward. Signed-off-by: Tim Van Patten <timvp@google.com> Tested-by: Mark Hasemeyer <markhas@chromium.org> Fixes: `f5d39b0208` ("freezer,sched: Rewrite core freezer logic") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-12-13 18:45:22 +01:00
Johannes Weiner	04d2fea2b7	sched: psi: fix unprivileged polling against cgroups commit 8b39d20eceeda6c4eb23df1497f9ed2fffdc8f69 upstream. `519fabc7aa` ("psi: remove 500ms min window size limitation for triggers") breaks unprivileged psi polling on cgroups. Historically, we had a privilege check for polling in the open() of a pressure file in /proc, but were erroneously missing it for the open() of cgroup pressure files. When unprivileged polling was introduced in `d82caa2735` ("sched/psi: Allow unprivileged polling of N*2s period"), it needed to filter privileges depending on the exact polling parameters, and as such moved the CAP_SYS_RESOURCE check from the proc open() callback to psi_trigger_create(). Both the proc files as well as cgroup files go through this during write(). This implicitly added the missing check for privileges required for HT polling for cgroups. When `519fabc7aa` ("psi: remove 500ms min window size limitation for triggers") followed right after to remove further restrictions on the RT polling window, it incorrectly assumed the cgroup privilege check was still missing and added it to the cgroup open(), mirroring what we used to do for proc files in the past. As a result, unprivileged poll requests that would be supported now get rejected when opening the cgroup pressure file for writing. Remove the cgroup open() check. psi_trigger_create() handles it. Fixes: `519fabc7aa` ("psi: remove 500ms min window size limitation for triggers") Reported-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Luca Boccassi <bluca@debian.org> Acked-by: Suren Baghdasaryan <surenb@google.com> Cc: stable@vger.kernel.org # 6.5+ Link: https://lore.kernel.org/r/20231026164114.2488682-1-hannes@cmpxchg.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-11-28 17:19:56 +00:00
Waiman Long	d6e21bf76e	cgroup/cpuset: Fix load balance state in update_partition_sd_lb() [ Upstream commit 6fcdb0183bf024a70abccb0439321c25891c708d ] Commit `a86ce68078` ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling") adds a new helper function update_partition_sd_lb() to update the load balance state of the cpuset. However the new load balance is determined by just looking at whether the cpuset is a valid isolated partition root or not. That is not enough if the cpuset is not a valid partition root but its parent is in the isolated state (load balance off). Update the function to set the new state to be the same as its parent in this case like what has been done in commit `c8c926200c` ("cgroup/cpuset: Inherit parent's load balance state in v2"). Fixes: `a86ce68078` ("cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-11-20 11:58:53 +01:00
Michal Koutný	1ca0b60515	cgroup: Remove duplicates in cgroup v1 tasks file One PID may appear multiple times in a preloaded pidlist. (Possibly due to PID recycling but we have reports of the same task_struct appearing with different PIDs, thus possibly involving transfer of PID via de_thread().) Because v1 seq_file iterator uses PIDs as position, it leads to a message: > seq_file: buggy .next function kernfs_seq_next did not update position index Conservative and quick fix consists of removing duplicates from `tasks` file (as opposed to removing pidlists altogether). It doesn't affect correctness (it's sufficient to show a PID once), performance impact would be hidden by unconditional sorting of the pidlist already in place (asymptotically). Link: https://lore.kernel.org/r/20230823174804.23632-1-mkoutny@suse.com/ Suggested-by: Firo Yang <firo.yang@suse.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org	2023-10-09 06:42:05 -10:00
Linus Torvalds	76be05d4fd	cgroup: fix build when CGROUP_SCHED is not enabled Sudip Mukherjee reports that the mips sb1250_swarm_defconfig build fails with the current kernel. It isn't actually MIPS-specific, it's just that that defconfig does not have CGROUP_SCHED enabled like most configs do, and as such shows this error: kernel/cgroup/cgroup.c: In function 'cgroup_local_stat_show': kernel/cgroup/cgroup.c:3699:15: error: implicit declaration of function 'cgroup_tryget_css'; did you mean 'cgroup_tryget'? [-Werror=implicit-function-declaration] 3699 \| css = cgroup_tryget_css(cgrp, ss); \| ^~~~~~~~~~~~~~~~~ \| cgroup_tryget kernel/cgroup/cgroup.c:3699:13: warning: assignment to 'struct cgroup_subsys_state *' from 'int' makes pointer from integer without a cast [-Wint-conversion] 3699 \| css = cgroup_tryget_css(cgrp, ss); \| ^ because cgroup_tryget_css() only exists when CGROUP_SCHED is enabled, and the cgroup_local_stat_show() function should similarly be guarded by that config option. Move things around a bit to fix this all. Fixes: `d1d4ff5d11` ("cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED") Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-09-02 08:27:17 -07:00
Linus Torvalds	7716f383a5	Merge tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Per-cpu cpu usage stats are now tracked This currently isn't printed out in the cgroupfs interface and can only be accessed through e.g. BPF. Should decide on a not-too-ugly way to show per-cpu stats in cgroupfs - cpuset received some cleanups and prepatory patches for the pending cpus.exclusive patchset which will allow cpuset partitions to be created below non-partition parents, which should ease the management of partition cpusets - A lot of code and documentation cleanup patches - tools/testing/selftests/cgroup/test_cpuset.c added * tag 'cgroup-for-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (32 commits) cgroup: Avoid -Wstringop-overflow warnings cgroup:namespace: Remove unused cgroup_namespaces_init() cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants cgroup: clean up if condition in cgroup_pidlist_start() cgroup: fix obsolete function name in cgroup_destroy_locked() Documentation: cgroup-v2.rst: Correct number of stats entries cgroup: fix obsolete function name above css_free_rwork_fn() cgroup/cpuset: fix kernel-doc cgroup: clean up printk() cgroup: fix obsolete comment above cgroup_create() docs: cgroup-v1: fix typo docs: cgroup-v1: correct the term of Page Cache organization in inode cgroup/misc: Store atomic64_t reads to u64 cgroup/misc: Change counters to be explicit 64bit types cgroup/misc: update struct members descriptions cgroup: remove cgrp->kn check in css_populate_dir() cgroup: fix obsolete function name cgroup: use cached local variable parent in for loop cgroup: remove obsolete comment above struct cgroupstats cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED ...	2023-09-01 15:58:21 -07:00
Gustavo A. R. Silva	78d44b824e	cgroup: Avoid -Wstringop-overflow warnings Change the notation from pointer-to-array to pointer-to-pointer. With this, we avoid the compiler complaining about trying to access a region of size zero as an argument during function calls. This is a workaround to prevent the compiler complaining about accessing an array of size zero when evaluating the arguments of a couple of function calls. See below: kernel/cgroup/cgroup.c: In function 'find_css_set': kernel/cgroup/cgroup.c:1206:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] 1206 \| cset = find_existing_css_set(old_cset, cgrp, template); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/cgroup/cgroup.c:1206:16: note: referencing argument 3 of type 'struct cgroup_subsys_state [0]' kernel/cgroup/cgroup.c:1071:24: note: in a call to function 'find_existing_css_set' 1071 \| static struct css_set find_existing_css_set(struct css_set *old_cset, \| ^~~~~~~~~~~~~~~~~~~~~ With the change to pointer-to-pointer, the functions are not prevented from being executed, and they will do what they have to do when CGROUP_SUBSYS_COUNT == 0. Address the following -Wstringop-overflow warnings seen when built with ARM architecture and aspeed_g4_defconfig configuration (notice that under this configuration CGROUP_SUBSYS_COUNT == 0): kernel/cgroup/cgroup.c:1208:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:1258:15: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:6089:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:6153:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] This results in no differences in binary output. Link: https://github.com/KSPP/linux/issues/316 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-17 11:55:05 -10:00
Lu Jialin	82b90b6c5b	cgroup:namespace: Remove unused cgroup_namespaces_init() cgroup_namspace_init() just return 0. Therefore, there is no need to call it during start_kernel. Just remove it. Fixes: `a79a908fd2` ("cgroup: introduce cgroup namespaces") Signed-off-by: Lu Jialin <lujialin4@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-14 14:29:47 -10:00
Hao Jia	0437719c1a	cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants The member variable bstat of the structure cgroup_rstat_cpu records the per-cpu time of the cgroup itself, but does not include the per-cpu time of its descendants. The per-cpu time including descendants is very useful for calculating the per-cpu usage of cgroups. Although we can indirectly obtain the total per-cpu time of the cgroup and its descendants by accumulating the per-cpu bstat of each descendant of the cgroup. But after a child cgroup is removed, we will lose its bstat information. This will cause the cumulative value to be non-monotonic, thus affecting the accuracy of cgroup per-cpu usage. So we add the subtree_bstat variable to record the total per-cpu time of this cgroup and its descendants, which is similar to "cpuacct.usage*" in cgroup v1. And this is also helpful for the migration from cgroup v1 to cgroup v2. After adding this variable, we can obtain the per-cpu time of cgroup and its descendants in user mode through eBPF/drgn, etc. And we are still trying to determine how to expose it in the cgroupfs interface. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 08:41:25 -10:00
Miaohe Lin	e7e64a1bff	cgroup: clean up if condition in cgroup_pidlist_start() There's no need to use '<=' when knowing 'l->list[mid] != pid' already. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 08:30:06 -10:00
Miaohe Lin	7f828eacc4	cgroup: fix obsolete function name in cgroup_destroy_locked() Since commit `e76ecaeef6` ("cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods"), cgroup_kn_lock_live() is used in cgroup kernfs methods. Update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-03 14:13:33 -10:00
Miaohe Lin	a2c15fece4	cgroup: fix obsolete function name above css_free_rwork_fn() Since commit `8f36aaec9c` ("cgroup: Use rcu_work instead of explicit rcu and work item"), css_free_work_fn has been renamed to css_free_rwork_fn. Update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-02 09:37:59 -10:00
Cai Xinchen	05f76ae95e	cgroup/cpuset: fix kernel-doc Add kernel-doc of param @rotor to fix warnings: kernel/cgroup/cpuset.c:4162: warning: Function parameter or member 'rotor' not described in 'cpuset_spread_node' kernel/cgroup/cpuset.c:3771: warning: Function parameter or member 'work' not described in 'cpuset_hotplug_workfn' Signed-off-by: Cai Xinchen <caixinchen1@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-02 09:37:03 -10:00
Kamalesh Babulal	55a5956a55	cgroup: clean up printk() Convert the only printk() to use pr_*() helper. No functional change. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-02 09:36:24 -10:00
Miaohe Lin	a3fdeeb3f1	cgroup: fix obsolete comment above cgroup_create() Since commit `743210386c` ("cgroup: use cgrp->kn->id as the cgroup ID"), cgrp is associated with its kernfs_node. Update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-21 09:54:48 -10:00
Haitao Huang	714e08cc3e	cgroup/misc: Store atomic64_t reads to u64 Change 'new_usage' type to u64 so it can be compared with unsigned 'max' and 'capacity' properly even if the value crosses the signed boundary. Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-21 08:10:06 -10:00
Ingo Molnar	752182b24b	Merge tag 'v6.5-rc2' into sched/core, to pick up fixes Sync with upstream fixes before applying EEVDF. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2023-07-19 09:43:25 +02:00
Haitao Huang	32bf85c60c	cgroup/misc: Change counters to be explicit 64bit types So the variables can account for resources of huge quantities even on 32-bit machines. Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-18 12:10:00 -10:00
Kamalesh Babulal	c25ff4b911	cgroup: remove cgrp->kn check in css_populate_dir() cgroup_create() creates cgrp and assigns the kernfs_node to cgrp->kn, then cgroup_mkdir() populates base and csses cft file by calling css_populate_dir() and cgroup_apply_control_enable() with a valid cgrp->kn. Check for NULL cgrp->kn, will always be false, remove it. Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-17 08:44:56 -10:00
Miaohe Lin	6f71780e7f	cgroup: fix obsolete function name cgroup_taskset_migrate() has been renamed to cgroup_migrate_execute() since commit `e595cd7069` ("cgroup: track migration context in cgroup_mgctx"). Update the corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-17 08:38:58 -10:00
Miaohe Lin	fcbb485d9f	cgroup: use cached local variable parent in for loop Use local variable parent to initialize iter tcgrp in for loop so the size of cgroup.o can be reduced by 64 bytes. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-17 08:34:41 -10:00
Josh Don	677ea015f2	sched: add throttled time stat for throttled children We currently export the total throttled time for cgroups that are given a bandwidth limit. This patch extends this accounting to also account the total time that each children cgroup has been throttled. This is useful to understand the degree to which children have been affected by the throttling control. Children which are not runnable during the entire throttled period, for example, will not show any self-throttling time during this period. Expose this in a new interface, 'cpu.stat.local', which is similar to how non-hierarchical events are accounted in 'memory.events.local'. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com	2023-07-13 15:21:49 +02:00
Miaohe Lin	d1d4ff5d11	cgroup: put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED Put cgroup_tryget_css() inside CONFIG_CGROUP_SCHED to fix the warning of 'cgroup_tryget_css' defined but not used [-Wunused-function] when CONFIG_CGROUP_SCHED is disabled. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-11 11:46:00 -10:00
Waiman Long	3ae0b77321	cgroup/cpuset: Allow suppression of sched domain rebuild in update_cpumasks_hier() A single partition setup and tear-down operation can lead to multiple rebuild_sched_domains_locked() calls which is a waste of effort. This can partly be mitigated by adding a flag to suppress the rebuild_sched_domains_locked() call in update_cpumasks_hier(). Since a Boolean flag has already been passed as the 3rd argument to update_cpumasks_hier(), we can extend that to a full flag word. The sched domain rebuild suppression is now enabled in update_sibling_cpumasks() as all it callers will do the sched domain rebuild after its return later on anyway. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-10 11:01:23 -10:00
Waiman Long	99fe36ba6f	cgroup/cpuset: Improve temporary cpumasks handling The limitation that update_parent_subparts_cpumask() can only use addmask & delmask in the given tmp cpumasks is fragile and may lead to unexpected error. Fix this problem by allocating/freeing a struct tmpmasks in update_cpumask() to avoid reusing the cpumasks in trial_cs. With this change, we can move the update_tasks_cpumask() for the parent and update_sibling_cpumasks() for the sibling to inside update_parent_subparts_cpumask(). Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-10 11:01:03 -10:00
Waiman Long	a86ce68078	cgroup/cpuset: Extract out CS_CPU_EXCLUSIVE & CS_SCHED_LOAD_BALANCE handling Extract out the setting of CS_CPU_EXCLUSIVE and CS_SCHED_LOAD_BALANCE flags as well as the rebuilding of scheduling domains into the new update_partition_exclusive() and update_partition_sd_lb() helper functions to simplify the logic. The update_partition_exclusive() helper is called mainly at the beginning of the caller, but it may be called at the end too. The update_partition_sd_lb() helper is called at the end of the caller. This patch should reduce the chance that cpuset partition will end up in an incorrect state. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-10 11:00:12 -10:00
Waiman Long	c8c926200c	cgroup/cpuset: Inherit parent's load balance state in v2 Since commit `f28e22441f` ("cgroup/cpuset: Add a new isolated cpus.partition type"), the CS_SCHED_LOAD_BALANCE bit of a v2 cpuset can be on or off. The child cpusets of a partition root must have the same setting as its parent or it may screw up the rebuilding of sched domains. Fix this problem by making sure the a child v2 cpuset will follows its parent cpuset load balance state unless the child cpuset is a new partition root itself. Fixes: `f28e22441f` ("cgroup/cpuset: Add a new isolated cpus.partition type") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-10 10:59:27 -10:00
Miaohe Lin	868f87b375	cgroup: fix obsolete comment above for_each_css() cgroup_tree_mutex is removed since commit `8353da1f91` ("cgroup: remove cgroup_tree_mutex"), update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-10 10:47:25 -10:00
Michal Koutný	0a67b847e1	cpuset: Allow setscheduler regardless of manipulated task When we migrate a task between two cgroups, one of the checks is a verification whether we can modify task's scheduler settings (cap_task_setscheduler()). An implicit migration occurs also when enabling a controller on the unified hierarchy (think of parent to child migration). The aforementioned check may be problematic if the caller of the migration (enabling a controller) has no permissions over migrated tasks. For instance, a user's cgroup that ends up running a process of a different user. Although cgroup permissions are configured favorably, the enablement fails due to the foreign process [1]. Change the behavior by relaxing the permissions check on the unified hierarchy when no effective change would happen. This is in accordance with unified hierarchy attachment behavior when permissions of the source to target cgroups are decisive whereas the migrated task is opaque (as opposed to more restrictive check in __cgroup1_procs_write()). Notice that foreign task's affinity may still be modified if the user can modify destination cgroup's cpuset attributes (update_tasks_cpumask() does no permissions check). The permissions check could thus be skipped on v2 even when affinity changes. Stay conservative in this patch though. [1] https://github.com/systemd/systemd/issues/18293#issuecomment-831205649 Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-10 10:28:43 -10:00
Miaohe Lin	48f074565b	cgroup/cpuset: avoid unneeded cpuset_mutex re-lock cpuset_mutex unlock and lock pair is only needed when transferring tasks out of empty cpuset. Avoid unneeded cpuset_mutex re-lock when !is_empty to save cpu cycles. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-07-10 10:26:06 -10:00

1 2 3 4 5 ...

646 Commits