linux

mirror of https://github.com/hardkernel/linux.git synced 2026-04-01 18:53:02 +09:00

Author	SHA1	Message	Date
Kevin Hao	5b5c346f5a	cpufreq: schedutil: Use kobject release() method to free sugov_tunables [ Upstream commit `e5c6b312ce` ] The struct sugov_tunables is protected by the kobject, so we can't free it directly. Otherwise we would get a call trace like this: ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x30 WARNING: CPU: 3 PID: 720 at lib/debugobjects.c:505 debug_print_object+0xb8/0x100 Modules linked in: CPU: 3 PID: 720 Comm: a.sh Tainted: G W 5.14.0-rc1-next-20210715-yocto-standard+ #507 Hardware name: Marvell OcteonTX CN96XX board (DT) pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--) pc : debug_print_object+0xb8/0x100 lr : debug_print_object+0xb8/0x100 sp : ffff80001ecaf910 x29: ffff80001ecaf910 x28: ffff00011b10b8d0 x27: ffff800011043d80 x26: ffff00011a8f0000 x25: ffff800013cb3ff0 x24: 0000000000000000 x23: ffff80001142aa68 x22: ffff800011043d80 x21: ffff00010de46f20 x20: ffff800013c0c520 x19: ffff800011d8f5b0 x18: 0000000000000010 x17: 6e6968207473696c x16: 5f72656d6974203a x15: 6570797420746365 x14: 6a626f2029302065 x13: 303378302f307830 x12: 2b6e665f72656d69 x11: ffff8000124b1560 x10: ffff800012331520 x9 : ffff8000100ca6b0 x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 0000000000000001 x5 : ffff800011d8c000 x4 : ffff800011d8c740 x3 : 0000000000000000 x2 : ffff0001108301c0 x1 : ab3c90eedf9c0f00 x0 : 0000000000000000 Call trace: debug_print_object+0xb8/0x100 __debug_check_no_obj_freed+0x1c0/0x230 debug_check_no_obj_freed+0x20/0x88 slab_free_freelist_hook+0x154/0x1c8 kfree+0x114/0x5d0 sugov_exit+0xbc/0xc0 cpufreq_exit_governor+0x44/0x90 cpufreq_set_policy+0x268/0x4a8 store_scaling_governor+0xe0/0x128 store+0xc0/0xf0 sysfs_kf_write+0x54/0x80 kernfs_fop_write_iter+0x128/0x1c0 new_sync_write+0xf0/0x190 vfs_write+0x2d4/0x478 ksys_write+0x74/0x100 __arm64_sys_write+0x24/0x30 invoke_syscall.constprop.0+0x54/0xe0 do_el0_svc+0x64/0x158 el0_svc+0x2c/0xb0 el0t_64_sync_handler+0xb0/0xb8 el0t_64_sync+0x198/0x19c irq event stamp: 5518 hardirqs last enabled at (5517): [<ffff8000100cbd7c>] console_unlock+0x554/0x6c8 hardirqs last disabled at (5518): [<ffff800010fc0638>] el1_dbg+0x28/0xa0 softirqs last enabled at (5504): [<ffff8000100106e0>] __do_softirq+0x4d0/0x6c0 softirqs last disabled at (5483): [<ffff800010049548>] irq_exit+0x1b0/0x1b8 So split the original sugov_tunables_free() into two functions, sugov_clear_global_tunables() is just used to clear the global_tunables and the new sugov_tunables_free() is used as kobj_type::release to release the sugov_tunables safely. Fixes: `9bdcb44e39` ("cpufreq: schedutil: New governor based on scheduler utilization data") Cc: 4.7+ <stable@vger.kernel.org> # 4.7+ Signed-off-by: Kevin Hao <haokexin@gmail.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 11:53:39 +09:00
Odin Ugedal	60f0b82677	sched/fair: Fix CFS bandwidth hrtimer expiry type [ Upstream commit `72d0ad7cb5` ] The time remaining until expiry of the refresh_timer can be negative. Casting the type to an unsigned 64-bit value will cause integer underflow, making the runtime_refresh_within return false instead of true. These situations are rare, but they do happen. This does not cause user-facing issues or errors; other than possibly unthrottling cfs_rq's using runtime from the previous period(s), making the CFS bandwidth enforcement less strict in those (special) situations. Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 11:33:40 +09:00
Vincent Guittot	1ec0012772	sched/fair: handle case of task_h_load() returning 0 commit `01cfcde9c2` upstream. task_h_load() can return 0 in some situations like running stress-ng mmapfork, which forks thousands of threads, in a sched group on a 224 cores system. The load balance doesn't handle this correctly because env->imbalance never decreases and it will stop pulling tasks only after reaching loop_max, which can be equal to the number of running tasks of the cfs. Make sure that imbalance will be decreased by at least 1. misfit task is the other feature that doesn't handle correctly such situation although it's probably more difficult to face the problem because of the smaller number of CPUs and running tasks on heterogenous system. We can't simply ensure that task_h_load() returns at least one because it would imply to handle underflow in other places. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: <stable@vger.kernel.org> # v4.4+ Link: https://lkml.kernel.org/r/20200710152426.16981-1-vincent.guittot@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-16 08:31:08 +09:00
Shile Zhang	6cd73be5b6	sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds [ Upstream commit `975e155ed8` ] We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit: `ce0dbbbb30` ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice") ... which name suggests to users that it's in milliseconds, while in reality it's being set in milliseconds but the result is shown in jiffies. This is obviously confusing when HZ is not 1000, it makes it appear like the value set failed, such as HZ=100: root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms root# cat /proc/sys/kernel/sched_rr_timeslice_ms 10 Fix this to be milliseconds all around. Signed-off-by: Shile Zhang <shile.zhang@nokia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:29:07 +09:00
Juri Lelli	c7afe2f482	sched/core: Fix PI boosting between RT and DEADLINE tasks [ Upstream commit `740797ce3a` ] syzbot reported the following warning: WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628 enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504 At deadline.c:628 we have: 623 static inline void setup_new_dl_entity(struct sched_dl_entity dl_se) 624 { 625 struct dl_rq dl_rq = dl_rq_of_se(dl_se); 626 struct rq *rq = rq_of_dl_rq(dl_rq); 627 628 WARN_ON(dl_se->dl_boosted); 629 WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline)); [...] } Which means that setup_new_dl_entity() has been called on a task currently boosted. This shouldn't happen though, as setup_new_dl_entity() is only called when the 'dynamic' deadline of the new entity is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this condition. Digging through the PI code I noticed that what above might in fact happen if an RT tasks blocks on an rt_mutex hold by a DEADLINE task. In the first branch of boosting conditions we check only if a pi_task 'dynamic' deadline is earlier than mutex holder's and in this case we set mutex holder to be dl_boosted. However, since RT 'dynamic' deadlines are only initialized if such tasks get boosted at some point (or if they become DEADLINE of course), in general RT 'dynamic' deadlines are usually equal to 0 and this verifies the aforementioned condition. Fix it by checking that the potential donor task is actually (even if temporary because in turn boosted) running at DEADLINE priority before using its 'dynamic' deadline value. Fixes: `2d3d891d33` ("sched/deadline: Add SCHED_DEADLINE inheritance logic") Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Tested-by: Daniel Wagner <dwagner@suse.de> Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-16 08:21:43 +09:00
Jens Axboe	3a641a476e	sched/fair: Don't NUMA balance for kthreads [ Upstream commit `18f855e574` ] Stefano reported a crash with using SQPOLL with io_uring: BUG: kernel NULL pointer dereference, address: 00000000000003b0 CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11 RIP: 0010:task_numa_work+0x4f/0x2c0 Call Trace: task_work_run+0x68/0xa0 io_sq_thread+0x252/0x3d0 kthread+0xf9/0x130 ret_from_fork+0x35/0x40 which is task_numa_work() oopsing on current->mm being NULL. The task work is queued by task_tick_numa(), which checks if current->mm is NULL at the time of the call. But this state isn't necessarily persistent, if the kthread is using use_mm() to temporarily adopt the mm of a task. Change the task_tick_numa() check to exclude kernel threads in general, as it doesn't make sense to attempt ot balance for kthreads anyway. Reported-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:32:04 +09:00
Michael Wang	f6edd8ce6f	sched: Avoid scale real weight down to zero [ Upstream commit `26cf52229e` ] During our testing, we found a case that shares no longer working correctly, the cgroup topology is like: /sys/fs/cgroup/cpu/A (shares=102400) /sys/fs/cgroup/cpu/A/B (shares=2) /sys/fs/cgroup/cpu/A/B/C (shares=1024) /sys/fs/cgroup/cpu/D (shares=1024) /sys/fs/cgroup/cpu/D/E (shares=1024) /sys/fs/cgroup/cpu/D/E/F (shares=1024) The same benchmark is running in group C & F, no other tasks are running, the benchmark is capable to consumed all the CPUs. We suppose the group C will win more CPU resources since it could enjoy all the shares of group A, but it's F who wins much more. The reason is because we have group B with shares as 2, since A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus, so A->cfs_rq.load.weight become very small. And in calc_group_shares() we calculate shares as: load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg); shares = (tg_shares * load) / tg_weight; Since the 'cfs_rq->load.weight' is too small, the load become 0 after scale down, although 'tg_shares' is 102400, shares of the se which stand for group A on root cfs_rq become 2. While the se of D on root cfs_rq is far more bigger than 2, so it wins the battle. Thus when scale_load_down() scale real weight down to 0, it's no longer telling the real story, the caller will have the wrong information and the calculation will be buggy. This patch add check in scale_load_down(), so the real weight will be >= MIN_SHARES after scale, after applied the group C wins as expected. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 17:13:50 +09:00
Xuewei Zhang	9ae6c47d3c	sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision commit `4929a4e6fa` upstream. The quota/period ratio is used to ensure a child task group won't get more bandwidth than the parent task group, and is calculated as: normalized_cfs_quota() = [(quota_us << 20) / period_us] If the quota/period ratio was changed during this scaling due to precision loss, it will cause inconsistency between parent and child task groups. See below example: A userspace container manager (kubelet) does three operations: 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us. 2) Create a few children cgroups. 3) Set quota to 1,000us and period to 10,000us on a child cgroup. These operations are expected to succeed. However, if the scaling of 147/128 happens before step 3, quota and period of the parent cgroup will be changed: new_quota: 1148437ns, 1148us new_period: 11484375ns, 11484us And when step 3 comes in, the ratio of the child cgroup will be 104857, which will be larger than the parent cgroup ratio (104821), and will fail. Scaling them by a factor of 2 will fix the problem. Tested-by: Phil Auld <pauld@redhat.com> Signed-off-by: Xuewei Zhang <xueweiz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Phil Auld <pauld@redhat.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: `2e8e192263` ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup") Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 16:10:08 +09:00
Valentin Schneider	f18fe54d12	sched/fair: Don't increase sd->balance_interval on newidle balance [ Upstream commit `3f130a37c4` ] When load_balance() fails to move some load because of task affinity, we end up increasing sd->balance_interval to delay the next periodic balance in the hopes that next time we look, that annoying pinned task(s) will be gone. However, idle_balance() pays no attention to sd->balance_interval, yet it will still lead to an increase in balance_interval in case of pinned tasks. If we're going through several newidle balances (e.g. we have a periodic task), this can lead to a huge increase of the balance_interval in a very small amount of time. To prevent that, don't increase the balance interval when going through a newidle balance. This is a similar approach to what is done in commit `58b26c4c02` ("sched: Increment cache_nice_tries only on periodic lb"), where we disregard newidle balance and rely on periodic balance for more stable results. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: patrick.bellasi@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1537974727-30788-2-git-send-email-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 15:18:39 +09:00
KeMeng Shi	4dd4d296e4	sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() [ Upstream commit `714e501e16` ] An oops can be triggered in the scheduler when running qemu on arm64: Unable to handle kernel paging request at virtual address ffff000008effe40 Internal error: Oops: 96000007 [#1] SMP Process migration/0 (pid: 12, stack limit = 0x00000000084e3736) pstate: 20000085 (nzCv daIf -PAN -UAO) pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20 lr : move_queued_task.isra.21+0x124/0x298 ... Call trace: __ll_sc___cmpxchg_case_acq_4+0x4/0x20 __migrate_task+0xc8/0xe0 migration_cpu_stop+0x170/0x180 cpu_stopper_thread+0xec/0x178 smpboot_thread_fn+0x1ac/0x1e8 kthread+0x134/0x138 ret_from_fork+0x10/0x18 __set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to migrage the process if process is not currently running on any one of the CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if corresponding bit is coincidentally set. As a consequence, kernel will access an invalid rq address associate with the invalid CPU in migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs. The reproduce the crash: 1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling sched_setaffinity. 2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online" and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn. 3) Oops appears if the invalid CPU is set in memory after tested cpumask. Signed-off-by: KeMeng Shi <shikemeng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 14:47:59 +09:00
Juri Lelli	379d8495bb	sched/core: Fix CPU controller for !RT_GROUP_SCHED [ Upstream commit `a07db5c086` ] On !CONFIG_RT_GROUP_SCHED configurations it is currently not possible to move RT tasks between cgroups to which CPU controller has been attached; but it is oddly possible to first move tasks around and then make them RT (setschedule to FIFO/RR). E.g.: # mkdir /sys/fs/cgroup/cpu,cpuacct/group1 # chrt -fp 10 $$ # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks bash: echo: write error: Invalid argument # chrt -op 0 $$ # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks # chrt -fp 10 $$ # cat /sys/fs/cgroup/cpu,cpuacct/group1/tasks 2345 2598 # chrt -p 2345 pid 2345's current scheduling policy: SCHED_FIFO pid 2345's current scheduling priority: 10 Also, as Michal noted, it is currently not possible to enable CPU controller on unified hierarchy with !CONFIG_RT_GROUP_SCHED (if there are any kernel RT threads in root cgroup, they can't be migrated to the newly created CPU controller's root in cgroup_update_dfl_csses()). Existing code comes with a comment saying the "we don't support RT-tasks being in separate groups". Such comment is however stale and belongs to pre-RT_GROUP_SCHED times. Also, it doesn't make much sense for !RT_GROUP_ SCHED configurations, since checks related to RT bandwidth are not performed at all in these cases. Make moving RT tasks between CPU controller groups viable by removing special case check for RT (and DEADLINE) tasks. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: lizefan@huawei.com Cc: longman@redhat.com Cc: luca.abeni@santannapisa.it Cc: rostedt@goodmis.org Link: https://lkml.kernel.org/r/20190719063455.27328-1-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 14:33:01 +09:00
Vincent Guittot	9fb9eaa490	sched/fair: Fix imbalance due to CPU affinity [ Upstream commit `f6cad8df6b` ] The load_balance() has a dedicated mecanism to detect when an imbalance is due to CPU affinity and must be handled at parent level. In this case, the imbalance field of the parent's sched_group is set. The description of sg_imbalanced() gives a typical example of two groups of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first group and 3 CPUs of the second group. Something like: { 0 1 2 3 } { 4 5 6 7 } * * * * But the load_balance fails to fix this UC on my octo cores system made of 2 clusters of quad cores. Whereas the load_balance is able to detect that the imbalanced is due to CPU affinity, it fails to fix it because the imbalance field is cleared before letting parent level a chance to run. In fact, when the imbalance is detected, the load_balance reruns without the CPU with pinned tasks. But there is no other running tasks in the situation described above and everything looks balanced this time so the imbalance field is immediately cleared. The imbalance field should not be cleared if there is no other task to move when the imbalance is detected. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 14:32:59 +09:00
Liangyan	a199ce2af3	sched/fair: Don't assign runtime for throttled cfs_rq commit `5e2d2cc258` upstream. do_sched_cfs_period_timer() will refill cfs_b runtime and call distribute_cfs_runtime to unthrottle cfs_rq, sometimes cfs_b->runtime will allocate all quota to one cfs_rq incorrectly, then other cfs_rqs attached to this cfs_b can't get runtime and will be throttled. We find that one throttled cfs_rq has non-negative cfs_rq->runtime_remaining and cause an unexpetced cast from s64 to u64 in snippet: distribute_cfs_runtime() { runtime = -cfs_rq->runtime_remaining + 1; } The runtime here will change to a large number and consume all cfs_b->runtime in this cfs_b period. According to Ben Segall, the throttled cfs_rq can have account_cfs_rq_runtime called on it because it is throttled before idle_balance, and the idle_balance calls update_rq_clock to add time that is accounted to the task. This commit prevents cfs_rq to be assgined new runtime if it has been throttled until that distribute_cfs_runtime is called. Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: shanpeic@linux.alibaba.com Cc: stable@vger.kernel.org Cc: xlpang@linux.alibaba.com Fixes: `d3d9dc3302` ("sched: Throttle entities exceeding their allowed bandwidth") Link: https://lkml.kernel.org/r/20190826121633.6538-1-liangyan.peng@linux.alibaba.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 14:29:00 +09:00
Jann Horn	9ebc8cd5f7	sched/fair: Don't free p->numa_faults with concurrent readers commit `16d51a590a` upstream. When going through execve(), zero out the NUMA fault statistics instead of freeing them. During execve, the task is reachable through procfs and the scheduler. A concurrent /proc/*/sched reader can read data from a freed ->numa_faults allocation (confirmed by KASAN) and write it back to userspace. I believe that it would also be possible for a use-after-free read to occur through a race between a NUMA fault and execve(): task_numa_fault() can lead to task_numa_compare(), which invokes task_weight() on the currently running task of a different CPU. Another way to fix this would be to make ->numa_faults RCU-managed or add extra locking, but it seems easier to wipe the NUMA fault statistics on execve. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: `82727018b0` ("sched/numa: Call task_numa_free() from do_execve()") Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 14:08:49 +09:00
Konstantin Khlebnikov	20d5853eb1	sched/core: Handle overflow in cpu_shares_write_u64 [ Upstream commit `5b61d50ab4` ] Bit shift in scale_load() could overflow shares. This patch saturates it to MAX_SHARES like following sched_group_set_shares(). Example: # echo 9223372036854776832 > cpu.shares # cat cpu.shares Before patch: 1024 After pattch: 262144 Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 13:34:43 +09:00
Konstantin Khlebnikov	4b9fcac57f	sched/core: Check quota and period overflow at usec to nsec conversion [ Upstream commit `1a8b4540db` ] Large values could overflow u64 and pass following sanity checks. # echo 18446744073750000 > cpu.cfs_period_us # cat cpu.cfs_period_us 40448 # echo 18446744073750000 > cpu.cfs_quota_us # cat cpu.cfs_quota_us 40448 After this patch they will fail with -EINVAL. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 13:34:41 +09:00
Ben Hutchings	4f87cf8ebb	sched: Add sched_smt_active() Add the sched_smt_active() function needed for some x86 speculation mitigations. This was introduced upstream by commits `1b568f0aab` "sched/core: Optimize SCHED_SMT", `ba2591a599` "sched/smt: Update sched_smt_present at runtime", `c5511d03ec` "sched/smt: Make sched_smt_present track topology", and `321a874a7e` "sched/smt: Expose sched_smt_present static key". The upstream implementation uses the static_key_{disable,enable}_cpuslocked() functions, which aren't practical to backport. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 12:47:12 +09:00
Xie XiuQi	a67712d343	sched/numa: Fix a possible divide-by-zero commit `a860fa7b96` upstream. sched_clock_cpu() may not be consistent between CPUs. If a task migrates to another CPU, then se.exec_start is set to that CPU's rq_clock_task() by update_stats_curr_start(). Specifically, the new value might be before the old value due to clock skew. So then if in numa_get_avg_runtime() the expression: 'now - p->last_task_numa_placement' ends up as -1, then the divider '*period + 1' in task_numa_placement() is 0 and things go bang. Similar to update_curr(), check if time goes backwards to avoid this. [ peterz: Wrote new changelog. ] [ mingo: Tweaked the code comment. ] Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: cj.chengjian@huawei.com Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20190425080016.GX11158@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 12:35:05 +09:00
Phil Auld	8023469412	sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup [ Upstream commit `2e8e192263` ] With extremely short cfs_period_us setting on a parent task group with a large number of children the for loop in sched_cfs_period_timer() can run until the watchdog fires. There is no guarantee that the call to hrtimer_forward_now() will ever return 0. The large number of children can make do_sched_cfs_period_timer() take longer than the period. NMI watchdog: Watchdog detected hard LOCKUP on cpu 24 RIP: 0010:tg_nop+0x0/0x10 <IRQ> walk_tg_tree_from+0x29/0xb0 unthrottle_cfs_rq+0xe0/0x1a0 distribute_cfs_runtime+0xd3/0xf0 sched_cfs_period_timer+0xcb/0x160 ? sched_cfs_slack_timer+0xd0/0xd0 __hrtimer_run_queues+0xfb/0x270 hrtimer_interrupt+0x122/0x270 smp_apic_timer_interrupt+0x6a/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> To prevent this we add protection to the loop that detects when the loop has run too many times and scales the period and quota up, proportionally, so that the timer can complete before then next period expires. This preserves the relative runtime quota while preventing the hard lockup. A warning is issued reporting this state and the new values. Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-15 12:28:05 +09:00
Mel Gorman	fd57ab642c	sched/fair: Do not re-read ->h_load_next during hierarchical load calculation commit `0e9f02450d` upstream. A NULL pointer dereference bug was reported on a distribution kernel but the same issue should be present on mainline kernel. It occured on s390 but should not be arch-specific. A partial oops looks like: Unable to handle kernel pointer dereference in virtual kernel address space ... Call Trace: ... try_to_wake_up+0xfc/0x450 vhost_poll_wakeup+0x3a/0x50 [vhost] __wake_up_common+0xbc/0x178 __wake_up_common_lock+0x9e/0x160 __wake_up_sync_key+0x4e/0x60 sock_def_readable+0x5e/0x98 The bug hits any time between 1 hour to 3 days. The dereference occurs in update_cfs_rq_h_load when accumulating h_load. The problem is that cfq_rq->h_load_next is not protected by any locking and can be updated by parallel calls to task_h_load. Depending on the compiler, code may be generated that re-reads cfq_rq->h_load_next after the check for NULL and then oops when reading se->avg.load_avg. The dissassembly showed that it was possible to reread h_load_next after the check for NULL. While this does not appear to be an issue for later compilers, it's still an accident if the correct code is generated. Full locking in this path would have high overhead so this patch uses READ_ONCE to read h_load_next only once and check for NULL before dereferencing. It was confirmed that there were no further oops after 10 days of testing. As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any potential problems with store tearing. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Fixes: `685207963b` ("sched: Move h_load calculation to task_h_load()") Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 12:22:49 +09:00
Steven Rostedt (VMware)	5e8e590433	sched/core: Allow __sched_setscheduler() in interrupts when PI is not used commit `896bbb2522` upstream. When priority inheritance was added back in 2.6.18 to sched_setscheduler(), it added a path to taking an rt-mutex wait_lock, which is not IRQ safe. As PI is not a common occurrence, lockdep will likely never trigger if sched_setscheduler was called from interrupt context. A BUG_ON() was added to trigger if __sched_setscheduler() was ever called from interrupt context because there was a possibility to take the wait_lock. Today the wait_lock is irq safe, but the path to taking it in sched_setscheduler() is the same as the path to taking it from normal context. The wait_lock is taken with raw_spin_lock_irq() and released with raw_spin_unlock_irq() which will indiscriminately enable interrupts, which would be bad in interrupt context. The problem is that normalize_rt_tasks, which is called by triggering the sysrq nice-all-RT-tasks was changed to call __sched_setscheduler(), and this is done from interrupt context! Now __sched_setscheduler() takes a "pi" parameter that is used to know if the priority inheritance should be called or not. As the BUG_ON() only cares about calling the PI code, it should only bug if called from interrupt context with the "pi" parameter set to true. Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@osdl.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `dbc7f069b9` ("sched: Use replace normalize_task() with __sched_setscheduler()") Link: http://lkml.kernel.org/r/20170308124654.10e598f2@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 09:59:19 +09:00
Song Muchun	b9cc54e80d	sched/fair: Fix the min_vruntime update logic in dequeue_entity() [ Upstream commit `9845c49cc9` ] The comment and the code around the update_min_vruntime() call in dequeue_entity() are not in agreement. >From commit: `b60205c7c5` ("sched/fair: Fix min_vruntime tracking") I think that we want to update min_vruntime when a task is sleeping/migrating. So, the check is inverted there - fix it. Signed-off-by: Song Muchun <smuchun@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `b60205c7c5` ("sched/fair: Fix min_vruntime tracking") Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 09:19:12 +09:00
Phil Auld	d296f53c35	sched/fair: Fix throttle_list starvation with low CFS quota commit `baa9be4ffb` upstream. With a very low cpu.cfs_quota_us setting, such as the minimum of 1000, distribute_cfs_runtime may not empty the throttled_list before it runs out of runtime to distribute. In that case, due to the change from `c06f04c704` to put throttled entries at the head of the list, later entries on the list will starve. Essentially, the same X processes will get pulled off the list, given CPU time and then, when expired, get put back on the head of the list where distribute_cfs_runtime will give runtime to the same set of processes leaving the rest. Fix the issue by setting a bit in struct cfs_bandwidth when distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can decide to put the throttled entry on the tail or the head of the list. The bit is set/cleared by the callers of distribute_cfs_runtime while they hold cfs_bandwidth->lock. This is easy to reproduce with a handful of CPU consumers. I use 'crash' on the live system. In some cases you can simply look at the throttled list and see the later entries are not changing: crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining \| paste - - \| awk '{print $1" "$4}' \| pr -t -n3 1 ffff90b56cb2d200 -976050 2 ffff90b56cb2cc00 -484925 3 ffff90b56cb2bc00 -658814 4 ffff90b56cb2ba00 -275365 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining \| paste - - \| awk '{print $1" "$4}' \| pr -t -n3 1 ffff90b56cb2d200 -994147 2 ffff90b56cb2cc00 -306051 3 ffff90b56cb2bc00 -961321 4 ffff90b56cb2ba00 -24490 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 Sometimes it is easier to see by finding a process getting starved and looking at the sched_info: crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, Signed-off-by: Phil Auld <pauld@redhat.com> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: `c06f04c704` ("sched: Fix potential near-infinite distribute_cfs_runtime() loop") Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 09:15:24 +09:00
Frederic Weisbecker	915239a7f3	sched/cputime: Fix ksoftirqd cputime accounting regression commit `25e2d8c1b9` upstream. irq_time_read() returns the irqtime minus the ksoftirqd time. This is necessary because irq_time_read() is used to substract the IRQ time from the sum_exec_runtime of a task. If we were to include the softirq time of ksoftirqd, this task would substract its own CPU time everytime it updates ksoftirqd->sum_exec_runtime which would therefore never progress. But this behaviour got broken by: `a499a5a14d` ("sched/cputime: Increment kcpustat directly on irqtime account") ... which now includes ksoftirqd softirq time in the time returned by irq_time_read(). This has resulted in wrong ksoftirqd cputime reported to userspace through /proc/stat and thus "top" not showing ksoftirqd when it should after intense networking load. ksoftirqd->stime happens to be correct but it gets scaled down by sum_exec_runtime through task_cputime_adjusted(). To fix this, just account the strict IRQ time in a separate counter and use it to report the IRQ time. Reported-and-tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ivan Delalande <colona@arista.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 09:10:27 +09:00
Frederic Weisbecker	8b39fed704	sched/cputime: Increment kcpustat directly on irqtime account commit `a499a5a14d` upstream. The irqtime is accounted is nsecs and stored in cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the accumulated amount reaches a new jiffy, this one gets accounted to the kcpustat. This was necessary when kcpustat was stored in cputime_t, which could at worst have jiffies granularity. But now kcpustat is stored in nsecs so this whole discretization game with temporary irqtime storage has become unnecessary. We can now directly account the irqtime to the kcpustat. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ivan Delalande <colona@arista.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 09:10:26 +09:00
Frederic Weisbecker	ba3247f9a0	sched/cputime: Convert kcpustat to nsecs commit `7fb1327ee9` upstream. Kernel CPU stats are stored in cputime_t which is an architecture defined type, and hence a bit opaque and requiring accessors and mutators for any operation. Converting them to nsecs simplifies the code and is one step toward the removal of cputime_t in the core code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [colona: minor conflict as `527b0a76f4` ("sched/cpuacct: Avoid %lld seq_printf warning") is missing from v4.9] Signed-off-by: Ivan Delalande <colona@arista.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 09:10:24 +09:00
Steve Muckle	5282278601	sched/fair: Fix vruntime_normalized() for remote non-migration wakeup commit `d0cdb3ce88` upstream. When a task which previously ran on a given CPU is remotely queued to wake up on that same CPU, there is a period where the task's state is TASK_WAKING and its vruntime is not normalized. This is not accounted for in vruntime_normalized() which will cause an error in the task's vruntime if it is switched from the fair class during this time. For example if it is boosted to RT priority via rt_mutex_setprio(), rq->min_vruntime will not be subtracted from the task's vruntime but it will be added again when the task returns to the fair class. The task's vruntime will have been erroneously doubled and the effective priority of the task will be reduced. Note this will also lead to inflation of all vruntimes since the doubled vruntime value will become the rq's min_vruntime when other tasks leave the rq. This leads to repeated doubling of the vruntime and priority penalty. Fix this by recognizing a WAKING task's vruntime as normalized only if sched_remote_wakeup is true. This indicates a migration, in which case the vruntime would have been normalized in migrate_task_rq_fair(). Based on a similar patch from John Dias <joaodias@google.com>. Suggested-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Steve Muckle <smuckle@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: John Dias <joaodias@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Miguel de Dios <migueldedios@google.com> Cc: Morten Rasmussen <Morten.Rasmussen@arm.com> Cc: Patrick Bellasi <Patrick.Bellasi@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: kernel-team@android.com Fixes: `b5179ac70d` ("sched/fair: Prepare to fix fairness problems on migration") Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-15 08:25:26 +09:00
Boqun Feng	e19c7e764f	sched/wait: Remove the lockless swait_active() check in swake_up() commit `35a2897c2a` upstream. Steven Rostedt reported a potential race in RCU core because of swake_up(): CPU0 CPU1 ---- ---- __call_rcu_core() { spin_lock(rnp_root) need_wake = __rcu_start_gp() { rcu_start_gp_advanced() { gp_flags = FLAG_INIT } } rcu_gp_kthread() { swait_event_interruptible(wq, gp_flags & FLAG_INIT) { spin_lock(q->lock) fetch wq->task_list here! * list_add(wq->task_list, q->task_list) spin_unlock(q->lock); fetch old value of gp_flags here spin_unlock(rnp_root) rcu_gp_kthread_wake() { swake_up(wq) { swait_active(wq) { list_empty(wq->task_list) } * return false * if (condition) * false * schedule(); In this case, a wakeup is missed, which could cause the rcu_gp_kthread waits for a long time. The reason of this is that we do a lockless swait_active() check in swake_up(). To fix this, we can either 1) add a smp_mb() in swake_up() before swait_active() to provide the proper order or 2) simply remove the swait_active() in swake_up(). The solution 2 not only fixes this problem but also keeps the swait and wait API as close as possible, as wake_up() doesn't provide a full barrier and doesn't do a lockless check of the wait queue either. Moreover, there are users already using swait_active() to do their quick checks for the wait queues, so it make less sense that swake_up() and swake_up_all() do this on their own. This patch then removes the lockless swait_active() check in swake_up() and swake_up_all(). Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Krister Johansen <kjlx@templeofstupid.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170615041828.zk3a3sfyudm5p6nl@tardis Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: David Chen <david.chen@nutanix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-05-12 16:45:19 +09:00
Luan Yuan	695cede0cc	Amlogic: sync the code from mainline. [1/1] PD#SWPL-17246 Problem: sync the code from mainline. Solution: sync the code from mainline. 7c03859983c2 OSS vulnerability found in [boot.img]:[linux_kernel] (CVE-2018-12232) Risk:[] [1/1] ba89a3d9c791 OSS vulnerability found in [boot.img]:[linux_kernel] (CVE-2019-8912) Risk:[] [1/1] c434d0530610 Android Security Bulletin - November 2019-11 - Kernel components binder driver - CVE-2019-2214 [1/1] ff8d9012fbd4 Android Security Bulletin - November 2019-11 - Kernel components ext4 filesystem - CVE-2019-11833 [1/1] 3c52e964495e cec: store msg after bootup from st [1/2] 94198a56ee10 lcd: support tcon vac and demura data [2/2] 1add1a008a03 vout: spi: porting lcd driver and SPI to Linux [1/1] 3e8d7b0e5f97 hdmirx: add hpd recovery logic when input clk is unstable [1/1] f92e7ba21c62 ppmgr: Add 10bit, dolby and HDR video rotation. [1/1] dab2cc37cd95 dvb: fix dmx2 interrupt bug [1/1] 9d31efae4a55 dv: add dv target output mode [1/1] e86eb9d1b5c5 hdmirx: add rx phy tdr enable control [1/1] 8ea66f645bf6 dts: enable spi for gva [1/1] baf6e74528ef drm: add drm support for tm2 [1/1] Verify: verify by newton Change-Id: I9415060a4b39895b5d624117271a72fc6a1fd187 Signed-off-by: Luan Yuan <luan.yuan@amlogic.com>	2020-02-04 13:48:58 +09:00
Jianxin Pan	31e3a8ac1f	schedtune: fix crash when there is cpu boot fail in hmp [1/1] PD#SWPL-7656 Problem: crash when there is cpu boot fail in hmp Solution: fix crash when there is cpu boot fail in hmp Verify: W400 Change-Id: I0153975593adb0bfcbc3c3bd6543f0fb2e6bf2e0 Signed-off-by: Jianxin Pan <jianxin.pan@amlogic.com>	2019-05-02 17:43:45 +08:00
Victor Wan	cc7b1eac54	Merge branch 'android-4.9' into amlogic-4.9-dev Signed-off-by: Victor Wan <victor.wan@amlogic.com> Conflicts: drivers/md/dm-bufio.c drivers/media/dvb-core/dvb_frontend.c drivers/usb/dwc3/core.c drivers/usb/gadget/function/f_fs.c	2018-08-07 14:43:24 +08:00
Sultan Alsawaf	47bbcd6bf8	ANDROID: Fix massive cpufreq_times memory leaks Every time _cpu_up() is called for a CPU, idle_thread_get() is called which then re-initializes a CPU's idle thread that was already previously created and cached in a global variable in smpboot.c. idle_thread_get() calls init_idle() which then calls __sched_fork(). __sched_fork() is where cpufreq_task_times_init() is, and cpufreq_task_times_init() allocates memory for the task struct's time_in_state array. Since idle_thread_get() reuses a task struct instance that was already previously created, this means that every time it calls init_idle(), cpufreq_task_times_init() allocates this array again and overwrites the existing allocation that the idle thread already had. This causes memory to be leaked every time a CPU is onlined. In order to fix this, move allocation of time_in_state into _do_fork to avoid allocating it at all for idle threads. The cpufreq times interface is intended to be used for tracking userspace tasks, so we can safely remove it from the kernel's idle threads without killing any functionality. But that's not all! Task structs can be freed outside of release_task(), which creates another memory leak because a task struct can be freed without having its cpufreq times allocation freed. To fix this, free the cpufreq times allocation at the same time that task struct allocations are freed, in free_task(). Since free_task() can also be called in error paths of copy_process() after dup_task_struct(), set time_in_state to NULL immediately after calling dup_task_struct() to avoid possible double free. Bug description and fix adapted from patch submitted by Sultan Alsawaf <sultanxda@gmail.com> at https://android-review.googlesource.com/c/kernel/msm/+/700134 Bug: 110044919 Test: Hikey960 builds, boots & reports /proc/<pid>/time_in_state correctly Change-Id: I12fe7611fc88eb7f6c39f8f7629ad27b6ec4722c Signed-off-by: Connor O'Brien <connoro@google.com>	2018-07-18 13:22:08 +00:00
Connor O'Brien	23a1412b82	ANDROID: Reduce use of #ifdef CONFIG_CPU_FREQ_TIMES Add empty versions of functions to cpufreq_times.h to cut down on use of #ifdef in .c files. Test: kernel builds with and without CONFIG_CPU_FREQ_TIMES=y Change-Id: I49ac364fac3d42bba0ca1801e23b15081094fb12 Signed-off-by: Connor O'Brien <connoro@google.com>	2018-07-18 13:21:52 +00:00
jianxin.pan	1efcf84db4	schedutil: limit up/down init rate PD#165090: schedutil: limit up/down min rate Change-Id: Ib3aa6653d56056298df05bdede2e2bf6aea46882 Signed-off-by: jianxin.pan <jianxin.pan@amlogic.com>	2018-06-29 00:35:25 -07:00
jianxin.pan	874ce1dbf4	sched: disable SD_WAKE_AFFINE PD#165090: remove WAKE_AFFINE to get better balance when heavy loading Change-Id: Id5650e9c3fd12b23be04f8f52a0f5c2e11c49199 Signed-off-by: jianxin.pan <jianxin.pan@amlogic.com>	2018-06-29 00:10:07 -07:00
Juri Lelli	e7fd5b1876	UPSTREAM: cpufreq: schedutil: use now as reference when aggregating shared policy requests Currently, sugov_next_freq_shared() uses last_freq_update_time as a reference to decide when to start considering CPU contributions as stale. However, since last_freq_update_time is set by the last CPU that issued a frequency transition, this might cause problems in certain cases. In practice, the detection of stale utilization values fails whenever the CPU with such values was the last to update the policy. For example (and please note again that the SCHED_CPUFREQ_RT flag is not the problem here, but only the detection of after how much time that flag has to be considered stale), suppose a policy with 2 CPUs: CPU0 \| CPU1 \| \| RT task scheduled \| SCHED_CPUFREQ_RT is set \| CPU1->last_update = now \| freq transition to max \| last_freq_update_time = now \| more than TICK_NSEC nsecs \| a small CFS wakes up \| CPU0->last_update = now1 \| delta_ns(CPU0) < TICK_NSEC* \| CPU0's util is considered \| delta_ns(CPU1) = \| last_freq_update_time - \| CPU1->last_update = 0 \| < TICK_NSEC \| CPU1 is still considered \| CPU1->SCHED_CPUFREQ_RT is set \| we stay at max (until CPU1 \| exits from idle) \| * delta_ns is actually negative as now1 > last_freq_update_time While last_freq_update_time is a sensible reference for rate limiting, it doesn't seem to be useful for working around stale CPU states. Fix the problem by always considering now (time) as the reference for deciding when CPUs have stale contributions. Bug: 109836581 Signed-off-by: Juri Lelli <juri.lelli@arm.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit `d86ab9cff8`)	2018-06-22 15:45:45 +00:00
Patrick Bellasi	db2c520bb5	ANDROID: sched/tune: fix boost_group spin_lock re-initialization In: `c5616f2f87` ANDROID: sched/tune: Initialize raw_spin_lock in boosted_groups a spin_lock is initialized on each CPU every time a boost_group is activated (i.e. a cgroup created). However, those spin_lock are already initialized at boot time by schedtune_init_cgroups() which also set schedtune_initialized to true thus enabling the tasks accounting done by schedtune_{en,de}queue_task(). This means that an already initialize and used spin_lock is wrongly re-initialized thus potentially leading to a: BUG: spinlock already unlocked on CPU in the (unlikely in AOSP) case we have a race between schedtune cgroups creation and tasks enqueue/dequeue. This probably happened because the fix provided by `c5616f2f87` was just the wrong cure for a different issues: the missing one-time initialization of the per-CPU spinlocks in schedtune_init_cgroups(). All these fixes happened on v4.4 and have been forward ported to the current v4.9 base. Let's better fix this by: - removing the not necessary spinlock re-initialization in: schedtune_boostgroup_init() - add a new "valid" flag to better flag which boost_groups are currently used to track a valid cgroup. This patch adds also a better documentation of the used data structures and the locking strategy in use to synchronize fast and slow paths. Change-Id: I3c2a256693b12b317373bbc032ed46e620f79ee8 Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Reported-by: Stanley Shih <stanley.shih@mstarsemi.com> [ The first of the two fixes listed above has been already merged by: commit `f6bec4e8c7` ("Revert "ANDROID: sched/tune: Initialize raw_spin_lock in boosted_groups") which, in conjunction with: commit `751e509391` ("ANDROID: sched: tune: Fix lacking spinlock initialization") provides the correct initialization of the boostgroups spinlocks. Let's keep the changelog as it is to better keep track of the original intended fix as well as to better document the required locking strategy. ] Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>	2018-06-08 10:01:57 +00:00
Patrick Bellasi	eafebcabec	ANDROID: sched/tune: cleanup schedtune_boostgroup_{init,release} This update the definition of these two methods to make more clear their role. In this refactored version, the slow path entry functions: schedtune_css_alloc() schedtune_css_free() are in charge just to allocate and release the memory used to track schedtune boost groups. These two functions rely respectively on: schedtune_boostgroup_init() schedtune_boostgroup_release() for everything related to setting up data structures and properly initializing them. Change-Id: I9336102b5c6a6b5726fd466e99b7d6b28d38f455 Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>	2018-06-07 13:46:11 +00:00
Patrick Bellasi	ca17b83d9c	ANDROID: sched/tune: remove unused variable This value was added in: `e71c425516` ANDROID: sched/tune: Add support for negative boost values and it's likely coming from a conflict resolution or a merge of some previous patches used to track negative boost values. Change-Id: I3b63e99db9232eb6117a561aa61c527986ade8e4 Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>	2018-06-07 13:45:04 +00:00
Patrick Bellasi	c964a2ba92	ANDROID: sched/fair: cosmetics Change-Id: Ibbefdf1cdb4d72c7075c7b3edf6789630475a8e2 Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>	2018-06-07 13:44:47 +00:00
Satya Durga Srinivasu Prabhala	a56900fabe	ANDROID: sched/walt: Fix compilation issue for x86_64 Below compilation errors observed when SCHED_WALT enabled and FAIR_GROUP_SCHED is disabled, fix it. CC kernel/sched/walt.o kernel/sched/walt.c: In function .walt_inc_cfs_cumulative_runnable_avg.: kernel/sched/walt.c:157:8: error: .struct cfs_rq. has no member named .cumulative_runnable_avg. cfs_rq->cumulative_runnable_avg += p->ravg.demand; ^ kernel/sched/walt.c: In function .walt_dec_cfs_cumulative_runnable_avg.: kernel/sched/walt.c:163:8: error: .struct cfs_rq. has no member named .cumulative_runnable_avg. cfs_rq->cumulative_runnable_avg -= p->ravg.demand; ^ make[2]: * [kernel/sched/walt.o] Error 1 make[1]: * [kernel/sched] Error 2 make: *** [kernel] Error 2 Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>	2018-05-31 17:15:04 +00:00
Greg Kroah-Hartman	9797dcb8c7	Merge 4.9.104 into android-4.9 Changes in 4.9.104 MIPS: c-r4k: Fix data corruption related to cache coherence MIPS: ptrace: Expose FIR register through FP regset MIPS: Fix ptrace(2) PTRACE_PEEKUSR and PTRACE_POKEUSR accesses to o32 FGRs KVM: Fix spelling mistake: "cop_unsuable" -> "cop_unusable" affs_lookup(): close a race with affs_remove_link() aio: fix io_destroy(2) vs. lookup_ioctx() race ALSA: timer: Fix pause event notification do d_instantiate/unlock_new_inode combinations safely mmc: sdhci-iproc: remove hard coded mmc cap 1.8v mmc: sdhci-iproc: fix 32bit writes for TRANSFER_MODE register libata: Blacklist some Sandisk SSDs for NCQ libata: blacklist Micron 500IT SSD with MU01 firmware xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent drm/vmwgfx: Fix 32-bit VMW_PORT_HB_[IN\|OUT] macros IB/hfi1: Use after free race condition in send context error path Revert "ipc/shm: Fix shmat mmap nil-page protection" ipc/shm: fix shmat() nil address after round-down when remapping kasan: fix memory hotplug during boot kernel/sys.c: fix potential Spectre v1 issue kernel/signal.c: avoid undefined behaviour in kill_something_info KVM/VMX: Expose SSBD properly to guests KVM: s390: vsie: fix < 8k check for the itdba KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed kvm: x86: IA32_ARCH_CAPABILITIES is always supported firewire-ohci: work around oversized DMA reads on JMicron controllers x86/tsc: Allow TSC calibration without PIT NFSv4: always set NFS_LOCK_LOST when a lock is lost. ALSA: hda - Use IS_REACHABLE() for dependency on input kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460 tracing/hrtimer: Fix tracing bugs by taking all clock bases and modes into account PCI: Add function 1 DMA alias quirk for Marvell 9128 Input: psmouse - fix Synaptics detection when protocol is disabled i40iw: Zero-out consumer key on allocate stag for FMR tools lib traceevent: Simplify pointer print logic and fix %pF perf callchain: Fix attr.sample_max_stack setting tools lib traceevent: Fix get_field_str() for dynamic strings perf record: Fix failed memory allocation for get_cpuid_str iommu/vt-d: Use domain instead of cache fetching dm thin: fix documentation relative to low water mark threshold net: stmmac: dwmac-meson8b: fix setting the RGMII TX clock on Meson8b net: stmmac: dwmac-meson8b: propagate rate changes to the parent clock nfs: Do not convert nfs_idmap_cache_timeout to jiffies watchdog: sp5100_tco: Fix watchdog disable bit kconfig: Don't leak main menus during parsing kconfig: Fix automatic menu creation mem leak kconfig: Fix expr_free() E_NOT leak mac80211_hwsim: fix possible memory leak in hwsim_new_radio_nl() ipmi/powernv: Fix error return code in ipmi_powernv_probe() Btrfs: set plug for fsync btrfs: Fix out of bounds access in btrfs_search_slot Btrfs: fix scrub to repair raid6 corruption btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP HID: roccat: prevent an out of bounds read in kovaplus_profile_activated() fm10k: fix "failed to kill vid" message for VF device property: Define type of PROPERTY_ENRTY_() macros jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes powerpc/numa: Ensure nodes initialized for hotplug RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure ntb_transport: Fix bug with max_mw_size parameter gianfar: prevent integer wrapping in the rx handler tcp_nv: fix potential integer overflow in tcpnv_acked kvm: Map PFN-type memory regions as writable (if possible) ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute ocfs2: return error when we attempt to access a dirty bh in jbd2 mm/mempolicy: fix the check of nodemask from user mm/mempolicy: add nodes_empty check in SYSC_migrate_pages asm-generic: provide generic_pmdp_establish() sparc64: update pmdp_invalidate() to return old pmd value mm: thp: use down_read_trylock() in khugepaged to avoid long block mm: pin address_space before dereferencing it while isolating an LRU page mm/fadvise: discard partial page if endbyte is also EOF openvswitch: Remove padding from packet before L3+ conntrack processing IB/ipoib: Fix for potential no-carrier state drm/nouveau/pmu/fuc: don't use movw directly anymore netfilter: ipv6: nf_defrag: Kill frag queue on RFC2460 failure x86/power: Fix swsusp_arch_resume prototype firmware: dmi_scan: Fix handling of empty DMI strings ACPI: processor_perflib: Do not send _PPC change notification if not ready ACPI / scan: Use acpi_bus_get_status() to initialize ACPI_TYPE_DEVICE devs bpf: fix selftests/bpf test_kmod.sh failure when CONFIG_BPF_JIT_ALWAYS_ON=y MIPS: generic: Fix machine compatible matching MIPS: TXx9: use IS_BUILTIN() for CONFIG_LEDS_CLASS xen-netfront: Fix race between device setup and open xen/grant-table: Use put_page instead of free_page RDS: IB: Fix null pointer issue arm64: spinlock: Fix theoretical trylock() A-B-A with LSE atomics proc: fix /proc//map_files lookup cifs: silence compiler warnings showing up with gcc-8.0.0 bcache: properly set task state in bch_writeback_thread() bcache: fix for allocator and register thread race bcache: fix for data collapse after re-attaching an attached device bcache: return attach error when no cache set exist tools/libbpf: handle issues with bpf ELF objects containing .eh_frames bpf: fix rlimit in reuseport net selftest vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page locking/qspinlock: Ensure node->count is updated before initialising node irqchip/gic-v3: Ignore disabled ITS nodes cpumask: Make for_each_cpu_wrap() available on UP as well irqchip/gic-v3: Change pr_debug message to pr_devel ARC: Fix malformed ARC_EMUL_UNALIGNED default ptr_ring: prevent integer overflow when calculating size libata: Fix compile warning with ATA_DEBUG enabled selftests: pstore: Adding config fragment CONFIG_PSTORE_RAM=m selftests: memfd: add config fragment for fuse ARM: OMAP2+: timer: fix a kmemleak caused in omap_get_timer_dt ARM: OMAP3: Fix prm wake interrupt for resume ARM: OMAP1: clock: Fix debugfs_create_*() usage ibmvnic: Free RX socket buffer in case of adapter error iwlwifi: mvm: fix security bug in PN checking iwlwifi: mvm: always init rs with 20mhz bandwidth rates NFC: llcp: Limit size of SDP URI rxrpc: Work around usercopy check mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4 mac80211: fix a possible leak of station stats mac80211: fix calling sleeping function in atomic context mac80211: Do not disconnect on invalid operating class md raid10: fix NULL deference in handle_write_completed() drm/exynos: g2d: use monotonic timestamps drm/exynos: fix comparison to bitshift when dealing with a mask locking/xchg/alpha: Add unconditional memory barrier to cmpxchg() md: raid5: avoid string overflow warning kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE powerpc/bpf/jit: Fix 32-bit JIT for seccomp_data access s390/cio: fix ccw_device_start_timeout API s390/cio: fix return code after missing interrupt s390/cio: clear timer when terminating driver I/O PKCS#7: fix direct verification of SignerInfo signature ARM: OMAP: Fix dmtimer init for omap1 smsc75xx: fix smsc75xx_set_features() regulatory: add NUL to request alpha2 integrity/security: fix digsig.c build error with header file locking/xchg/alpha: Fix xchg() and cmpxchg() memory ordering bugs x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations mac80211: drop frames with unexpected DS bits from fast-rx to slow path arm64: fix unwind_frame() for filtered out fn for function graph tracing macvlan: fix use-after-free in macvlan_common_newlink() kvm: fix warning for CONFIG_HAVE_KVM_EVENTFD builds fs: dcache: Avoid livelock between d_alloc_parallel and __d_add fs: dcache: Use READ_ONCE when accessing i_dir_seq md: fix a potential deadlock of raid5/raid10 reshape md/raid1: fix NULL pointer dereference batman-adv: fix packet checksum in receive path batman-adv: invalidate checksum on fragment reassembly netfilter: ebtables: convert BUG_ONs to WARN_ONs batman-adv: Ignore invalid batadv_iv_gw during netlink send batman-adv: Ignore invalid batadv_v_gw during netlink send batman-adv: Fix netlink dumping of BLA claims batman-adv: Fix netlink dumping of BLA backbones nvme-pci: Fix nvme queue cleanup if IRQ setup fails clocksource/drivers/fsl_ftm_timer: Fix error return checking ceph: fix dentry leak when failing to init debugfs ARM: orion5x: Revert commit `4904dbda41`. qrtr: add MODULE_ALIAS macro to smd r8152: fix tx packets accounting virtio-gpu: fix ioctl and expose the fixed status to userspace. dmaengine: rcar-dmac: fix max_chunk_size for R-Car Gen3 bcache: fix kcrashes with fio in RAID5 backend dev ip6_tunnel: fix IFLA_MTU ignored on NEWLINK sit: fix IFLA_MTU ignored on NEWLINK ARM: dts: NSP: Fix amount of RAM on BCM958625HR powerpc/boot: Fix random libfdt related build errors gianfar: Fix Rx byte accounting for ndev stats net/tcp/illinois: replace broken algorithm reference link nvmet: fix PSDT field check in command format xen/pirq: fix error path cleanup when binding MSIs drm/sun4i: Fix dclk_set_phase Btrfs: send, fix issuing write op when processing hole in no data mode selftests/powerpc: Skip the subpage_prot tests if the syscall is unavailable KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing iwlwifi: mvm: fix TX of CCMP 256 watchdog: f71808e_wdt: Fix magic close handling watchdog: sbsa: use 32-bit read for WCV batman-adv: Fix multicast packet loss with a single WANT_ALL_IPV4/6 flag e1000e: Fix check_for_link return value with autoneg off e1000e: allocate ring descriptors with dma_zalloc_coherent ia64/err-inject: Use get_user_pages_fast() RDMA/qedr: Fix kernel panic when running fio over NFSoRDMA RDMA/qedr: Fix iWARP write and send with immediate IB/mlx4: Fix corruption of RoCEv2 IPv4 GIDs IB/mlx4: Include GID type when deleting GIDs from HW table under RoCE IB/mlx5: Fix an error code in __mlx5_ib_modify_qp() fbdev: Fixing arbitrary kernel leak in case FBIOGETCMAP_SPARC in sbusfb_ioctl_helper(). fsl/fman: avoid sleeping in atomic context while adding an address net: qcom/emac: Use proper free methods during TX net: smsc911x: Fix unload crash when link is up IB/core: Fix possible crash to access NULL netdev xen: xenbus: use put_device() instead of kfree() arm64: Relax ARM_SMCCC_ARCH_WORKAROUND_1 discovery dmaengine: mv_xor_v2: Fix clock resource by adding a register clock netfilter: ebtables: fix erroneous reject of last rule bnxt_en: Check valid VNIC ID in bnxt_hwrm_vnic_set_tpa(). workqueue: use put_device() instead of kfree() ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu sunvnet: does not support GSO for sctp drm/imx: move arming of the vblank event to atomic_flush net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off batman-adv: fix header size check in batadv_dbg_arp() batman-adv: Fix skbuff rcsum on packet reroute vti4: Don't count header length twice on tunnel setup vti4: Don't override MTU passed on link creation via IFLA_MTU perf/cgroup: Fix child event counting bug brcmfmac: Fix check for ISO3166 code kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races RDMA/ucma: Correct option size check using optlen RDMA/qedr: fix QP's ack timeout configuration RDMA/qedr: Fix rc initialization on CNQ allocation failure mm/mempolicy.c: avoid use uninitialized preferred_node mm, thp: do not cause memcg oom for thp selftests: ftrace: Add probe event argument syntax testcase selftests: ftrace: Add a testcase for string type with kprobe_event selftests: ftrace: Add a testcase for probepoint batman-adv: fix multicast-via-unicast transmission with AP isolation batman-adv: fix packet loss for broadcasted DHCP packets to a server ARM: 8748/1: mm: Define vdso_start, vdso_end as array net: qmi_wwan: add BroadMobi BM806U 2020:2033 perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs llc: properly handle dev_queue_xmit() return value builddeb: Fix header package regarding dtc source links mm/kmemleak.c: wait for scan completion before disabling free net: Fix untag for vlan packets without ethernet header net: mvneta: fix enable of all initialized RXQs sh: fix debug trap failure to process signals before return to user nvme: don't send keep-alives to the discovery controller x86/pgtable: Don't set huge PUD/PMD on non-leaf entries x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl table swap: divide-by-zero when zero length swap file on ssd sr: get/drop reference to device in revalidate and check_events Force log to disk before reading the AGF during a fstrim cpufreq: CPPC: Initialize shared perf capabilities of CPUs dp83640: Ensure against premature access to PHY registers after reset mm/ksm: fix interaction with THP mm: fix races between address_space dereference and free in page_evicatable Btrfs: bail out on error during replay_dir_deletes Btrfs: fix NULL pointer dereference in log_dir_items btrfs: Fix possible softlock on single core machines ocfs2/dlm: don't handle migrate lockres if already in shutdown sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning KVM: VMX: raise internal error for exception during invalid protected mode state fscache: Fix hanging wait on page discarded by writeback sparc64: Make atomic_xchg() an inline function rather than a macro. net: bgmac: Fix endian access in bgmac_dma_tx_ring_free() btrfs: tests/qgroup: Fix wrong tree backref level Btrfs: fix copy_items() return value when logging an inode btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers rxrpc: Fix Tx ring annotation after initial Tx failure rxrpc: Don't treat call aborts as conn aborts xen/acpi: off by one in read_acpi_id() drivers: macintosh: rack-meter: really fix bogus memsets ACPI: acpi_pad: Fix memory leak in power saving threads powerpc/mpic: Check if cpu_possible() in mpic_physmask() m68k: set dma and coherent masks for platform FEC ethernets parisc/pci: Switch LBA PCI bus from Hard Fail to Soft Fail mode hwmon: (nct6775) Fix writing pwmX_mode powerpc/perf: Prevent kernel address leak to userspace via BHRB buffer powerpc/perf: Fix kernel address leak via sampling registers tools/thermal: tmon: fix for segfault selftests: Print the test we're running to /dev/kmsg net/mlx5: Protect from command bit overflow ath10k: Fix kernel panic while using worker (ath10k_sta_rc_update_wk) cxgb4: Setup FW queues before registering netdev ima: Fallback to the builtin hash algorithm virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS arm: dts: socfpga: fix GIC PPI warning cpufreq: cppc_cpufreq: Fix cppc_cpufreq_init() failure path zorro: Set up z->dev.dma_mask for the DMA API bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set ACPICA: Events: add a return on failure from acpi_hw_register_read ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c cxgb4: Fix queue free path of ULD drivers i2c: mv64xxx: Apply errata delay only in standard mode KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use perf top: Fix top.call-graph config option reading perf stat: Fix core dump when flag T is used IB/core: Honor port_num while resolving GID for IB link layer regulator: gpio: Fix some error handling paths in 'gpio_regulator_probe()' spi: bcm-qspi: fIX some error handling paths MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset PCI: Restore config space on runtime resume despite being unbound ipmi_ssif: Fix kernel panic at msg_done_handler powerpc: Add missing prototype for arch_irq_work_raise() f2fs: fix to check extent cache in f2fs_drop_extent_tree perf/core: Fix perf_output_read_group() drm/panel: simple: Fix the bus format for the Ontat panel hwmon: (pmbus/max8688) Accept negative page register values hwmon: (pmbus/adm1275) Accept negative page register values perf/x86/intel: Properly save/restore the PMU state in the NMI handler cdrom: do not call check_disk_change() inside cdrom_open() perf/x86/intel: Fix large period handling on Broadwell CPUs perf/x86/intel: Fix event update for auto-reload arm64: dts: qcom: Fix SPI5 config on MSM8996 soc: qcom: wcnss_ctrl: Fix increment in NV upload gfs2: Fix fallocate chunk size x86/devicetree: Initialize device tree before using it x86/devicetree: Fix device IRQ settings in DT ALSA: vmaster: Propagate slave error dmaengine: pl330: fix a race condition in case of threaded irqs dmaengine: rcar-dmac: Check the done lists in rcar_dmac_chan_get_residue() enic: enable rq before updating rq descriptors hwrng: stm32 - add reset during probe dmaengine: qcom: bam_dma: get num-channels and num-ees from dt net: stmmac: ensure that the device has released ownership before reading data net: stmmac: ensure that the MSS desc is the last desc to set the own bit cpufreq: Reorder cpufreq_online() error code path PCI: Add function 1 DMA alias quirk for Marvell 88SE9220 udf: Provide saner default for invalid uid / gid ARM: dts: bcm283x: Fix probing of bcm2835-i2s audit: return on memory error to avoid null pointer dereference rcu: Call touch_nmi_watchdog() while printing stall warnings pinctrl: sh-pfc: r8a7796: Fix MOD_SEL register pin assignment for SSI pins group MIPS: Octeon: Fix logging messages with spurious periods after newlines drm/rockchip: Respect page offset for PRIME mmap calls x86/apic: Set up through-local-APIC mode on the boot CPU if 'noapic' specified perf tests: Use arch__compare_symbol_names to compare symbols perf report: Fix memory corruption in --branch-history mode --branch-history selftests/net: fixes psock_fanout eBPF test case netlabel: If PF_INET6, check sk_buff ip header version regmap: Correct comparison in regmap_cached ARM: dts: imx7d: cl-som-imx7: fix pinctrl_enet ARM: dts: porter: Fix HDMI output routing regulator: of: Add a missing 'of_node_put()' in an error handling path of 'of_regulator_match()' pinctrl: msm: Use dynamic GPIO numbering kdb: make "mdr" command repeat Linux 4.9.104 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2018-05-30 13:19:56 +02:00
Davidlohr Bueso	3d0632557e	sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning [ Upstream commit `d29a20645d` ] While running rt-tests' pi_stress program I got the following splat: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20 [...] <IRQ> enqueue_top_rt_rq+0xf4/0x150 ? cpufreq_dbs_governor_start+0x170/0x170 sched_rt_rq_enqueue+0x65/0x80 sched_rt_period_timer+0x156/0x360 ? sched_rt_rq_enqueue+0x80/0x80 __hrtimer_run_queues+0xfa/0x260 hrtimer_interrupt+0xcb/0x220 smp_apic_timer_interrupt+0x62/0x120 apic_timer_interrupt+0xf/0x20 </IRQ> [...] do_idle+0x183/0x1e0 cpu_startup_entry+0x5f/0x70 start_secondary+0x192/0x1d0 secondary_startup_64+0xa5/0xb0 We can get rid of it be the "traditional" means of adding an update_rq_clock() call after acquiring the rq->lock in do_sched_rt_period_timer(). The case for the RT task throttling (which this workload also hits) can be ignored in that the skip_update call is actually bogus and quite the contrary (the request bits are removed/reverted). By setting RQCF_UPDATED we really don't care if the skip is happening or not and will therefore make the assert_clock_updated() check happy. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Cc: linux-kernel@vger.kernel.org Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-05-30 07:50:41 +02:00
Joel Fernandes	7fd40752c3	UPSTREAM: sched/fair: Consider RT/IRQ pressure in capacity_spare_wake capacity_spare_wake in the slow path influences choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a sub optimal group can be chosen and hurt performance of the task being woken up. Several tests with results are included below to show improvements with this change. 1) Hackbench on Pixel 2 Android device (4x4 ARM64 Octa core) ------------------------------------------------------------ Here we have RT activity running on big CPU cluster induced with rt-app, and running hackbench in parallel. The RT tasks are bound to 4 CPUs on the big cluster (cpu 4,5,6,7) and have 100ms periodicity with runtime=20ms sleep=80ms. Hackbench shows big benefit (30%) improvement when number of tasks is 8 and 32: Note: data is completion time in seconds (lower is better). Number of loops for 8 and 16 tasks is 50000, and for 32 tasks its 20000. +--------+-----+-------+-------------------+---------------------------+ \| groups \| fds \| tasks \| Without Patch \| With Patch \| +--------+-----+-------+---------+---------+-----------------+---------+ \| \| \| \| Mean \| Stdev \| Mean \| Stdev \| \| \| \| +-------------------+-----------------+---------+ \| 1 \| 8 \| 8 \| 1.0534 \| 0.13722 \| 0.7293 (+30.7%) \| 0.02653 \| \| 2 \| 8 \| 16 \| 1.6219 \| 0.16631 \| 1.6391 (-1%) \| 0.24001 \| \| 4 \| 8 \| 32 \| 1.2538 \| 0.13086 \| 1.1080 (+11.6%) \| 0.16201 \| +--------+-----+-------+---------+---------+-----------------+---------+ 2) Rohit ran barrier.c test (details below) with following improvements: ------------------------------------------------------------------------ This was Rohit's original use case for a patch he posted at [1] however from his recent tests he showed my patch can replace his slow path changes [1] and there's no need to selectively scan/skip CPUs in find_idlest_group_cpu in the slow path to get the improvement he sees. barrier.c (open_mp code) as a micro-benchmark. It does a number of iterations and barrier sync at the end of each for loop. Here barrier,c is running in along with ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' barrier.c can be found at: http://www.spinics.net/lists/kernel/msg2506955.html Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads Intel x86 machine: +--------+------------------+---------------------------+ \|Threads \| Without patch \| With patch \| \| \| \| \| +--------+--------+---------+-----------------+---------+ \| \| Mean \| Std Dev \| Mean \| Std Dev \| +--------+--------+---------+-----------------+---------+ \|1 \| 539.36 \| 60.16 \| 572.54 (+6.15%) \| 40.95 \| \|2 \| 481.01 \| 19.32 \| 530.64 (+10.32%)\| 56.16 \| \|4 \| 474.78 \| 22.28 \| 479.46 (+0.99%) \| 18.89 \| \|8 \| 450.06 \| 24.91 \| 447.82 (-0.50%) \| 12.36 \| \|16 \| 436.99 \| 22.57 \| 441.88 (+1.12%) \| 7.39 \| \|32 \| 388.28 \| 55.59 \| 429.4 (+10.59%)\| 31.14 \| \|64 \| 314.62 \| 6.33 \| 311.81 (-0.89%) \| 11.99 \| +--------+--------+---------+-----------------+---------+ 3) ping+hackbench test on bare-metal sever (Rohit ran this test) ---------------------------------------------------------------- Here hackbench is running in threaded mode along with, running ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' This test is running on 2 socket, 20 core and 40 threads Intel x86 machine: Number of loops is 10000 and runtime is in seconds (Lower is better). +--------------+-----------------+--------------------------+ \|Task Groups \| Without patch \| With patch \| \| +-------+---------+----------------+---------+ \|(Groups of 40)\| Mean \| Std Dev \| Mean \| Std Dev \| +--------------+-------+---------+----------------+---------+ \|1 \| 0.851 \| 0.007 \| 0.828 (+2.77%)\| 0.032 \| \|2 \| 1.083 \| 0.203 \| 1.087 (-0.37%)\| 0.246 \| \|4 \| 1.601 \| 0.051 \| 1.611 (-0.62%)\| 0.055 \| \|8 \| 2.837 \| 0.060 \| 2.827 (+0.35%)\| 0.031 \| \|16 \| 5.139 \| 0.133 \| 5.107 (+0.63%)\| 0.085 \| \|25 \| 7.569 \| 0.142 \| 7.503 (+0.88%)\| 0.143 \| +--------------+-------+---------+----------------+---------+ [1] https://patchwork.kernel.org/patch/9991635/ Matt Fleming also ran cyclictest and several different hackbench tests on his test machines to santiy-check that the patch doesn't harm any of his usecases. Change-Id: I1de1e40a3a92a54a9adc1303ccc98104704c1e35 Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Brendan Jackman <brendan.jackman@arm.com> Tested-by: Rohit Jain <rohit.k.jain@oracle.com> Tested-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Joel Fernandes <joelaf@google.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>	2018-05-29 09:00:30 -07:00
Victor Wan	324524de04	Merge branch 'android-4.9' into amlogic-4.9-dev	2018-05-22 10:48:42 +08:00
Victor Wan	810c6dd972	Merge branch 'android-4.9' into amlogic-4.9-dev Signed-off-by: Victor Wan <victor.wan@amlogic.com> Conflicts: arch/arm/configs/bcm2835_defconfig arch/arm/configs/sunxi_defconfig include/linux/cpufreq.h init/main.c	2018-04-24 17:43:19 +08:00
Greg Kroah-Hartman	8683408f8e	Merge 4.9.94 into android-4.9 Changes in 4.9.94 qed: Fix overriding of supported autoneg value. cfg80211: make RATE_INFO_BW_20 the default md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lock rtc: snvs: fix an incorrect check of return value x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic() x86/mm/kaslr: Use the _ASM_MUL macro for multiplication to work around Clang incompatibility ovl: persistent inode numbers for upper hardlinks NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION x86/boot: Declare error() as noreturn IB/srpt: Fix abort handling IB/srpt: Avoid that aborting a command triggers a kernel warning af_key: Fix slab-out-of-bounds in pfkey_compile_policy. mac80211: bail out from prep_connection() if a reconfig is ongoing bna: Avoid reading past end of buffer qlge: Avoid reading past end of buffer ubi: fastmap: Fix slab corruption ipmi_ssif: unlock on allocation failure net: cdc_ncm: Fix TX zero padding net: ethernet: ti: cpsw: adjust cpsw fifos depth for fullduplex flow control lockd: fix lockd shutdown race drivers/misc/vmw_vmci/vmci_queue_pair.c: fix a couple integer overflow tests pidns: disable pid allocation if pid_ns_prepare_proc() is failed in alloc_pid() s390: move _text symbol to address higher than zero net/mlx4_en: Avoid adding steering rules with invalid ring qed: Correct doorbell configuration for !4Kb pages NFSv4.1: Work around a Linux server bug... CIFS: silence lockdep splat in cifs_relock_file() perf/callchain: Force USER_DS when invoking perf_callchain_user() blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op net: qca_spi: Fix alignment issues in rx path netxen_nic: set rcode to the return status from the call to netxen_issue_cmd mdio: mux: Correct mdio_mux_init error path issues Input: elan_i2c - check if device is there before really probing Input: elantech - force relative mode on a certain module KVM: PPC: Book3S PR: Check copy_to/from_user return values irqchip/mbigen: Fix the clear register offset calculation vmxnet3: ensure that adapter is in proper state during force_close mm, vmstat: Remove spurious WARN() during zoneinfo print SMB2: Fix share type handling bus: brcmstb_gisb: Use register offsets with writes too bus: brcmstb_gisb: correct support for 64-bit address output PowerCap: Fix an error code in powercap_register_zone() iio: pressure: zpa2326: report interrupted case as failure ARM: dts: imx53-qsrb: Pulldown PMIC IRQ pin staging: wlan-ng: prism2mgmt.c: fixed a double endian conversion before calling hfa384x_drvr_setconfig16, also fixes relative sparse warning clk: renesas: rcar-gen2: Fix PLL0 on R-Car V2H and E2 x86/tsc: Provide 'tsc=unstable' boot parameter powerpc/modules: If mprofile-kernel is enabled add it to vermagic ARM: dts: imx6qdl-wandboard: Fix audio channel swap i2c: mux: reg: put away the parent i2c adapter on probe failure arm64: perf: Ignore exclude_hv when kernel is running in HYP mdio: mux: fix device_node_continue.cocci warnings ipv6: avoid dad-failures for addresses with NODAD async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome() KVM: arm: Restore banked registers and physical timer access on hyp_panic() KVM: arm64: Restore host physical timer access on hyp_panic() usb: dwc3: keystone: check return value btrfs: fix incorrect error return ret being passed to mapping_set_error ata: libahci: properly propagate return value of platform_get_irq() ipmr: vrf: Find VIFs using the actual device uio: fix incorrect memory leak cleanup neighbour: update neigh timestamps iff update is effective arp: honour gratuitous ARP _replies_ ARM: dts: rockchip: fix rk322x i2s1 pinctrl error usb: chipidea: properly handle host or gadget initialization failure pxa_camera: fix module remove codepath for v4l2 clock USB: ene_usb6250: fix first command execution net: x25: fix one potential use-after-free issue USB: ene_usb6250: fix SCSI residue overwriting serial: 8250: omap: Disable DMA for console UART serial: sh-sci: Fix race condition causing garbage during shutdown net/wan/fsl_ucc_hdlc: fix unitialized variable warnings net/wan/fsl_ucc_hdlc: fix incorrect memory allocation fsl/qe: add bit description for SYNL register for GUMR sh_eth: Use platform device for printing before register_netdev() mlxsw: spectrum: Avoid possible NULL pointer dereference scsi: csiostor: fix use after free in csio_hw_use_fwconfig() powerpc/mm: Fix virt_addr_valid() etc. on 64-bit hash ath5k: fix memory leak on buf on failed eeprom read selftests/powerpc: Fix TM resched DSCR test with some compilers xfrm: fix state migration copy replay sequence numbers ASoC: simple-card: fix mic jack initialization iio: hi8435: avoid garbage event at first enable iio: hi8435: cleanup reset gpio iio: light: rpr0521 poweroff for probe fails ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors md-cluster: fix potential lock issue in add_new_disk ARM: davinci: da8xx: Create DSP device only when assigned memory ray_cs: Avoid reading past end of buffer net/wan/fsl_ucc_hdlc: fix muram allocation error leds: pca955x: Correct I2C Functionality perf/core: Fix error handling in perf_event_alloc() sched/numa: Use down_read_trylock() for the mmap_sem gpio: crystalcove: Do not write regular gpio registers for virtual GPIOs net/mlx5: Tolerate irq_set_affinity_hint() failures selinux: do not check open permission on sockets block: fix an error code in add_partition() mlx5: fix bug reading rss_hash_type from CQE net: ieee802154: fix net_device reference release too early libceph: NULL deref on crush_decode() error path perf report: Fix off-by-one for non-activation frames netfilter: ctnetlink: fix incorrect nf_ct_put during hash resize pNFS/flexfiles: missing error code in ff_layout_alloc_lseg() ASoC: rsnd: SSI PIO adjust to 24bit mode scsi: bnx2fc: fix race condition in bnx2fc_get_host_stats() fix race in drivers/char/random.c:get_reg() ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff() ARM64: PCI: Fix struct acpi_pci_root_ops allocation failure path tcp: better validation of received ack sequences net: move somaxconn init from sysctl code Input: elan_i2c - clear INT before resetting controller bonding: Don't update slave->link until ready to commit cpuhotplug: Link lock stacks for hotplug callbacks PCI/msi: fix the pci_alloc_irq_vectors_affinity stub KVM: X86: Fix preempt the preemption timer cancel KVM: nVMX: Fix handling of lmsw instruction net: llc: add lock_sock in llc_ui_bind to avoid a race condition drm/msm: Take the mutex before calling msm_gem_new_impl i40iw: Fix sequence number for the first partial FPDU i40iw: Correct Q1/XF object count equation ARM: dts: ls1021a: add "fsl,ls1021a-esdhc" compatible string to esdhc node thermal: power_allocator: fix one race condition issue for thermal_instances list perf probe: Add warning message if there is unexpected event name l2tp: fix missing print session offset info rds; Reset rs->rs_bound_addr in rds_add_bound() failure path ACPI / video: Default lcd_only to true on Win8-ready and newer machines net/mlx4_en: Change default QoS settings VFS: close race between getcwd() and d_move() PM / devfreq: Fix potential NULL pointer dereference in governor_store hwmon: (ina2xx) Make calibration register value fixed media: videobuf2-core: don't go out of the buffer range ASoC: Intel: Skylake: Disable clock gating during firmware and library download ASoC: Intel: cht_bsw_rt5645: Analog Mic support scsi: libiscsi: Allow sd_shutdown on bad transport scsi: mpt3sas: Proper handling of set/clear of "ATA command pending" flag. irqchip/gic-v3: Fix the driver probe() fail due to disabled GICC entry ACPI: EC: Fix debugfs_create_*() usage mac80211: Fix setting TX power on monitor interfaces vfb: fix video mode and line_length being set when loaded gpio: label descriptors using the device name IB/rdmavt: Allocate CQ memory on the correct node blk-mq: fix race between updating nr_hw_queues and switching io sched backlight: tdo24m: Fix the SPI CS between transfers pinctrl: baytrail: Enable glitch filter for GPIOs used as interrupts ASoC: Intel: sst: Fix the return value of 'sst_send_byte_stream_mrfld()' rt2x00: do not pause queue unconditionally on error path wl1251: check return from call to wl1251_acx_arp_ip_filter hdlcdrv: Fix divide by zero in hdlcdrv_ioctl x86/efi: Disable runtime services on kexec kernel if booted with efi=old_map netfilter: conntrack: don't call iter for non-confirmed conntracks HID: i2c: Call acpi_device_fix_up_power for ACPI-enumerated devices ovl: filter trusted xattr for non-admin powerpc/[booke\|4xx]: Don't clobber TCR[WP] when setting TCR[DIE] dmaengine: imx-sdma: Handle return value of clk_prepare_enable backlight: Report error on failure arm64: futex: Fix undefined behaviour with FUTEX_OP_OPARG_SHIFT usage net/mlx5: avoid build warning for uniprocessor cxgb4: FW upgrade fixes cxgb4: Fix netdev_features flag rtc: m41t80: fix SQW dividers override when setting a date i40evf: fix merge error in older patch rtc: opal: Handle disabled TPO in opal_get_tpo_time() rtc: interface: Validate alarm-time before handling rollover SUNRPC: ensure correct error is reported by xs_tcp_setup_socket() net: freescale: fix potential null pointer dereference clk: at91: fix clk-generated parenting drm/sun4i: Ignore the generic connectors for components dt-bindings: display: sun4i: Add allwinner,tcon-channel property mtd: nand: gpmi: Fix gpmi_nand_init() error path mtd: nand: check ecc->total sanity in nand_scan_tail KVM: SVM: do not zero out segment attributes if segment is unusable or not present clk: scpi: fix return type of __scpi_dvfs_round_rate clk: Fix __set_clk_rates error print-string powerpc/spufs: Fix coredump of SPU contexts drm/amdkfd: NULL dereference involving create_process() ath10k: add BMI parameters to fix calibration from DT/pre-cal perf trace: Add mmap alias for s390 qlcnic: Fix a sleep-in-atomic bug in qlcnic_82xx_hw_write_wx_2M and qlcnic_82xx_hw_read_wx_2M arm64: kernel: restrict /dev/mem read() calls to linear region mISDN: Fix a sleep-in-atomic bug net: phy: micrel: Restore led_mode and clk_sel on resume RDMA/iw_cxgb4: Avoid touch after free error in ARP failure handlers RDMA/hfi1: fix array termination by appending NULL to attr array drm/omap: fix tiled buffer stride calculations powerpc/8xx: fix mpc8xx_get_irq() return on no irq cxgb4: fix incorrect cim_la output for T6 Fix serial console on SNI RM400 machines bio-integrity: Do not allocate integrity context for bio w/o data ip6_tunnel: fix traffic class routing for tunnels skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow macsec: check return value of skb_to_sgvec always sit: reload iphdr in ipip6_rcv net/mlx4: Fix the check in attaching steering rules net/mlx4: Check if Granular QoS per VF has been enabled before updating QP qos_vport perf header: Set proper module name when build-id event found perf report: Ensure the perf DSO mapping matches what libdw sees iwlwifi: mvm: fix firmware debug restart recording watchdog: f71808e_wdt: Add F71868 support iwlwifi: mvm: Fix command queue number on d0i3 flow iwlwifi: tt: move ucode_loaded check under mutex iwlwifi: pcie: only use d0i3 in suspend/resume if system_pm is set to d0i3 iwlwifi: fix min API version for 7265D, 3168, 8000 and 8265 tags: honor COMPILED_SOURCE with apart output directory ARM: dts: qcom: ipq4019: fix i2c_0 node e1000e: fix race condition around skb_tstamp_tx() igb: fix race condition with PTP_TX_IN_PROGRESS bits cxl: Unlock on error in probe cx25840: fix unchecked return values mceusb: sporadic RX truncation corruption fix net: phy: avoid genphy_aneg_done() for PHYs without clause 22 support ARM: imx: Add MXC_CPU_IMX6ULL and cpu_is_imx6ull nvme-pci: fix multiple ctrl removal scheduling nvme: fix hang in remove path KVM: nVMX: Update vmcs12->guest_linear_address on nested VM-exit e1000e: Undo e1000e_pm_freeze if __e1000_shutdown fails perf/core: Correct event creation with PERF_FORMAT_GROUP sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks MIPS: mm: fixed mappings: correct initialisation MIPS: mm: adjust PKMAP location MIPS: kprobes: flush_insn_slot should flush only if probe initialised ARM: dts: armadillo800eva: Split LCD mux and gpio Fix loop device flush before configure v3 net: emac: fix reset timeout with AR8035 phy perf tools: Decompress kernel module when reading DSO data perf tests: Decompress kernel module before objdump skbuff: only inherit relevant tx_flags xen: avoid type warning in xchg_xen_ulong X.509: Fix error code in x509_cert_parse() pinctrl: meson-gxbb: remove non-existing pin GPIOX_22 coresight: Fix reference count for software sources coresight: tmc: Configure DMA mask appropriately stmmac: fix ptp header for GMAC3 hw timestamp geneve: add missing rx stats accounting crypto: omap-sham - buffer handling fixes for hashing later crypto: omap-sham - fix closing of hash with separate finalize call bnx2x: Allow vfs to disable txvlan offload sctp: fix recursive locking warning in sctp_do_peeloff net: fec: Add a fec_enet_clear_ethtool_stats() stub for CONFIG_M5272 sparc64: ldc abort during vds iso boot iio: magnetometer: st_magn_spi: fix spi_device_id table net: ena: fix rare uncompleted admin command false alarm net: ena: fix race condition between submit and completion admin command net: ena: add missing return when ena_com_get_io_handlers() fails net: ena: add missing unmap bars on device removal net: ena: disable admin msix while working in polling mode clk: meson: meson8b: add compatibles for Meson8 and Meson8m2 Bluetooth: Send HCI Set Event Mask Page 2 command only when needed cpuidle: dt: Add missing 'of_node_put()' ACPICA: OSL: Add support to exclude stdarg.h ACPICA: Events: Add runtime stub support for event APIs ACPICA: Disassembler: Abort on an invalid/unknown AML opcode s390/dasd: fix hanging safe offline vxlan: dont migrate permanent fdb entries during learn hsr: fix incorrect warning selftests: kselftest_harness: Fix compile warning drm/vc4: Fix resource leak in 'vc4_get_hang_state_ioctl()' in error handling path bcache: stop writeback thread after detaching bcache: segregate flash only volume write streams scsi: libsas: fix memory leak in sas_smp_get_phy_events() scsi: libsas: fix error when getting phy events scsi: libsas: initialize sas_phy status according to response of DISCOVER blk-mq: fix kernel oops in blk_mq_tag_idle() tty: n_gsm: Allow ADM response in addition to UA for control dlci EDAC, mv64x60: Fix an error handling path cxgb4vf: Fix SGE FL buffer initialization logic for 64K pages sdhci: Advertise 2.0v supply on SDIO host controller Input: goodix - disable IRQs while suspended mtd: mtd_oobtest: Handle bitflips during reads perf tools: Fix copyfile_offset update of output offset ipsec: check return value of skb_to_sgvec always rxrpc: check return value of skb_to_sgvec always virtio_net: check return value of skb_to_sgvec always virtio_net: check return value of skb_to_sgvec in one more location random: use lockless method of accessing and updating f->reg_idx clk: at91: fix clk-generated compilation arp: fix arp_filter on l3slave devices ipv6: the entire IPv6 header chain must fit the first fragment net: fix possible out-of-bound read in skb_network_protocol() net/ipv6: Fix route leaking between VRFs net/ipv6: Increment OUTxxx counters after netfilter hook netlink: make sure nladdr has correct size in netlink_connect() net/sched: fix NULL dereference in the error path of tcf_bpf_init() pptp: remove a buggy dst release in pptp_connect() r8169: fix setting driver_data after register_netdev sctp: do not leak kernel memory to user space sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6 sky2: Increase D3 delay to sky2 stops working after suspend vhost: correctly remove wait queue during poll failure vlan: also check phy_driver ts_info for vlan's real device bonding: fix the err path for dev hwaddr sync in bond_enslave bonding: move dev_mc_sync after master_upper_dev_link in bond_enslave bonding: process the err returned by dev_set_allmulti properly in bond_enslave net: fool proof dev_valid_name() ip_tunnel: better validate user provided tunnel names ipv6: sit: better validate user provided tunnel names ip6_gre: better validate user provided tunnel names ip6_tunnel: better validate user provided tunnel names vti6: better validate user provided tunnel names net/mlx5e: Sync netdev vxlan ports at open net/sched: fix NULL dereference in the error path of tunnel_key_init() net/sched: fix NULL dereference on the error path of tcf_skbmod_init() net/mlx4_en: Fix mixed PFC and Global pause user control requests vhost: validate log when IOTLB is enabled route: check sysctl_fib_multipath_use_neigh earlier than hash team: move dev_mc_sync after master_upper_dev_link in team_port_add vhost_net: add missing lock nesting notation net/mlx4_core: Fix memory leak while delete slave's resources strparser: Fix sign of err codes net sched actions: fix dumping which requires several messages to user space vrf: Fix use after free and double free in vrf_finish_output Revert "xhci: plat: Register shutdown for xhci_plat" Linux 4.9.94 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2018-04-14 15:40:56 +02:00
Daniel Bristot de Oliveira	0559ea3414	sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks [ Upstream commit `3effcb4247` ] We have been facing some problems with self-suspending constrained deadline tasks. The main reason is that the original CBS was not designed for such sort of tasks. One problem reported by Xunlei Pang takes place when a task suspends, and then is awakened before the deadline, but so close to the deadline that its remaining runtime can cause the task to have an absolute density higher than allowed. In such situation, the original CBS assumes that the task is facing an early activation, and so it replenishes the task and set another deadline, one deadline in the future. This rule works fine for implicit deadline tasks. Moreover, it allows the system to adapt the period of a task in which the external event source suffered from a clock drift. However, this opens the window for bandwidth leakage for constrained deadline tasks. For instance, a task with the following parameters: runtime = 5 ms deadline = 7 ms [density] = 5 / 7 = 0.71 period = 1000 ms If the task runs for 1 ms, and then suspends for another 1ms, it will be awakened with the following parameters: remaining runtime = 4 laxity = 5 presenting a absolute density of 4 / 5 = 0.80. In this case, the original CBS would assume the task had an early wakeup. Then, CBS will reset the runtime, and the absolute deadline will be postponed by one relative deadline, allowing the task to run. The problem is that, if the task runs this pattern forever, it will keep receiving bandwidth, being able to run 1ms every 2ms. Following this behavior, the task would be able to run 500 ms in 1 sec. Thus running more than the 5 ms / 1 sec the admission control allowed it to run. Trying to address the self-suspending case, Luca Abeni, Giuseppe Lipari, and Juri Lelli [1] revisited the CBS in order to deal with self-suspending tasks. In the new approach, rather than replenishing/postponing the absolute deadline, the revised wakeup rule adjusts the remaining runtime, reducing it to fit into the allowed density. A revised version of the idea is: At a given time t, the maximum absolute density of a task cannot be higher than its relative density, that is: runtime / (deadline - t) <= dl_runtime / dl_deadline Knowing the laxity of a task (deadline - t), it is possible to move it to the other side of the equality, thus enabling to define max remaining runtime a task can use within the absolute deadline, without over-running the allowed density: runtime = (dl_runtime / dl_deadline) * (deadline - t) For instance, in our previous example, the task could still run: runtime = ( 5 / 7 ) * 5 runtime = 3.57 ms Without causing damage for other deadline tasks. It is note worthy that the laxity cannot be negative because that would cause a negative runtime. Thus, this patch depends on the patch: `df8eac8caf` ("sched/deadline: Throttle a constrained deadline task activated after the deadline") Which throttles a constrained deadline task activated after the deadline. Finally, it is also possible to use the revised wakeup rule for all other tasks, but that would require some more discussions about pros and cons. Reported-by: Xunlei Pang <xpang@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> [peterz: replaced dl_is_constrained with dl_is_implicit] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-04-13 19:48:23 +02:00
Vlastimil Babka	a1e7a9e2e3	sched/numa: Use down_read_trylock() for the mmap_sem [ Upstream commit `8655d54977` ] A customer has reported a soft-lockup when running an intensive memory stress test, where the trace on multiple CPU's looks like this: RIP: 0010:[<ffffffff810c53fe>] [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190 ... Call Trace: [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa [<ffffffff811bc331>] change_protection_range+0x3b1/0x930 [<ffffffff811d4be8>] change_prot_numa+0x18/0x30 [<ffffffff810adefe>] task_numa_work+0x1fe/0x310 [<ffffffff81098322>] task_work_run+0x72/0x90 Further investigation showed that the lock contention here is pmd_lock(). The task_numa_work() function makes sure that only one thread is let to perform the work in a single scan period (via cmpxchg), but if there's a thread with mmap_sem locked for writing for several periods, multiple threads in task_numa_work() can build up a convoy waiting for mmap_sem for read and then all get unblocked at once. This patch changes the down_read() to the trylock version, which prevents the build up. For a workload experiencing mmap_sem contention, it's probably better to postpone the NUMA balancing work anyway. This seems to have fixed the soft lockups involving pmd_lock(), which is in line with the convoy theory. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2018-04-13 19:48:04 +02:00
Vikram Mulukutla	f6bec4e8c7	Revert "ANDROID: sched/tune: Initialize raw_spin_lock in boosted_groups" This reverts commit `c5616f2f87`. If we re-init the per-cpu boostgroup spinlock every time that we add a new boosted cgroup, we can easily wipe out (reinit) a spinlock struct while in a critical section. We should only be setting up the per-cpu boostgroup data, and the spin_lock initialization need only happen once - which we're already doing in a postcore_initcall. For example: -------- CPU 0 -------- \| -------- CPU1 -------- cgroupX boost group added \| schedtune_enqueue_task \| acquires(bg->lock) \| cgroupY boost group added \| for_each_cpu() \| raw_spin_lock_init(bg->lock) releases(bg->lock) \| BUG (already unlocked) \| \| This results in the following BUG from the debug spinlock code: BUG: spinlock already unlocked on CPU#5, rcuop/6/68 Bug: 32668852 Change-Id: I3016702780b461a0cd95e26c538cd18df27d6316 Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>	2018-04-10 18:00:01 +00:00

1 2 3 4 5 ...

1995 Commits