PD#138714: too much log info for cpu capacity updating
Change-Id: I14164903950afc2f29d7446d743ba0b0062edc19
Signed-off-by: Jiamin Ma <jiamin.ma@amlogic.com>
Changes in 4.9.42
parisc: Handle vma's whose context is not current in flush_cache_range
cgroup: create dfl_root files on subsys registration
cgroup: fix error return value from cgroup_subtree_control()
libata: array underflow in ata_find_dev()
workqueue: restore WQ_UNBOUND/max_active==1 to be ordered
iwlwifi: dvm: prevent an out of bounds access
brcmfmac: fix memleak due to calling brcmf_sdiod_sgtable_alloc() twice
NFSv4: Fix EXCHANGE_ID corrupt verifier issue
mmc: sdhci-of-at91: force card detect value for non removable devices
device property: Make dev_fwnode() public
mmc: core: Fix access to HS400-ES devices
mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()
ALSA: hda - Fix speaker output from VAIO VPCL14M1R
drm/amdgpu: Fix undue fallthroughs in golden registers initialization
ASoC: do not close shared backend dailink
KVM: async_pf: make rcu irq exit if not triggered from idle task
mm/page_alloc: Remove kernel address exposure in free_reserved_area()
timers: Fix overflow in get_next_timer_interrupt
powerpc/tm: Fix saving of TM SPRs in core dump
powerpc/64: Fix __check_irq_replay missing decrementer interrupt
iommu/amd: Enable ga_log_intr when enabling guest_mode
gpiolib: skip unwanted events, don't convert them to opposite edge
ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize
ext4: fix overflow caused by missing cast in ext4_resize_fs()
ARM: dts: armada-38x: Fix irq type for pca955
ARM: dts: tango4: Request RGMII RX and TX clock delays
media: platform: davinci: return -EINVAL for VPFE_CMD_S_CCDC_RAW_PARAMS ioctl
iscsi-target: Fix initial login PDU asynchronous socket close OOPs
mmc: dw_mmc: Use device_property_read instead of of_property_read
mmc: core: Use device_property_read instead of of_property_read
media: lirc: LIRC_GET_REC_RESOLUTION should return microseconds
f2fs: sanity check checkpoint segno and blkoff
Btrfs: fix early ENOSPC due to delalloc
saa7164: fix double fetch PCIe access condition
tcp_bbr: cut pacing rate only if filled pipe
tcp_bbr: introduce bbr_bw_to_pacing_rate() helper
tcp_bbr: introduce bbr_init_pacing_rate_from_rtt() helper
tcp_bbr: remove sk_pacing_rate=0 transient during init
tcp_bbr: init pacing rate on first RTT sample
ipv4: ipv6: initialize treq->txhash in cookie_v[46]_check()
net: Zero terminate ifr_name in dev_ifname().
ipv6: avoid overflow of offset in ip6_find_1stfragopt
net: dsa: b53: Add missing ARL entries for BCM53125
ipv4: initialize fib_trie prior to register_netdev_notifier call.
rtnetlink: allocate more memory for dev_set_mac_address()
mcs7780: Fix initialization when CONFIG_VMAP_STACK is enabled
openvswitch: fix potential out of bound access in parse_ct
packet: fix use-after-free in prb_retire_rx_blk_timer_expired()
ipv6: Don't increase IPSTATS_MIB_FRAGFAILS twice in ip6_fragment()
net: ethernet: nb8800: Handle all 4 RGMII modes identically
dccp: fix a memleak that dccp_ipv6 doesn't put reqsk properly
dccp: fix a memleak that dccp_ipv4 doesn't put reqsk properly
dccp: fix a memleak for dccp_feat_init err process
sctp: don't dereference ptr before leaving _sctp_walk_{params, errors}()
sctp: fix the check for _sctp_walk_params and _sctp_walk_errors
net/mlx5: Consider tx_enabled in all modes on remap
net/mlx5: Fix command bad flow on command entry allocation failure
net/mlx5e: Fix outer_header_zero() check size
net/mlx5e: Fix wrong delay calculation for overflow check scheduling
net/mlx5e: Schedule overflow check work to mlx5e workqueue
net: phy: Correctly process PHY_HALTED in phy_stop_machine()
xen-netback: correctly schedule rate-limited queues
sparc64: Measure receiver forward progress to avoid send mondo timeout
sparc64: Fix exception handling in UltraSPARC-III memcpy.
wext: handle NULL extra data in iwe_stream_add_point better
sh_eth: fix EESIPR values for SH77{34|63}
sh_eth: R8A7740 supports packet checksumming
net: phy: dp83867: fix irq generation
tg3: Fix race condition in tg3_get_stats64().
x86/boot: Add missing declaration of string functions
spi: spi-axi: Free resources on error path
ASoC: rt5645: set sel_i2s_pre_div1 to 2
netfilter: use fwmark_reflect in nf_send_reset
phy state machine: failsafe leave invalid RUNNING state
ipv4: make tcp_notsent_lowat sysctl knob behave as true unsigned int
clk/samsung: exynos542x: mark some clocks as critical
scsi: qla2xxx: Get mutex lock before checking optrom_state
drm/virtio: fix framebuffer sparse warning
ARM: dts: sun8i: Support DTB build for NanoPi M1
ARM: dts: sunxi: Change node name for pwrseq pin on Olinuxino-lime2-emmc
iw_cxgb4: do not send RX_DATA_ACK CPLs after close/abort
nbd: blk_mq_init_queue returns an error code on failure, not NULL
virtio_blk: fix panic in initialization error path
ARM: 8632/1: ftrace: fix syscall name matching
mm, slab: make sure that KMALLOC_MAX_SIZE will fit into MAX_ORDER
lib/Kconfig.debug: fix frv build failure
signal: protect SIGNAL_UNKILLABLE from unintentional clearing.
mm: don't dereference struct page fields of invalid pages
net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy
ipv4: Should use consistent conditional judgement for ip fragment in __ip_append_data and ip_finish_output
net: account for current skb length when deciding about UFO
net: phy: Fix PHY unbind crash
workqueue: implicit ordered attribute should be overridable
Linux 4.9.42
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 0a94efb5ac upstream.
5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered") automatically enabled ordered attribute for unbound
workqueues w/ max_active == 1. Because ordered workqueues reject
max_active and some attribute changes, this implicit ordered mode
broke cases where the user creates an unbound workqueue w/ max_active
== 1 and later explicitly changes the related attributes.
This patch distinguishes between explicit and implicit ordered settings
and lets attribute changes override the ordered mode when it was set
implicitly.
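For reference, a sketch of how the distinction can be drawn (simplified
from the upstream patch; the 4.9 backport may differ in detail):

    /* alloc_ordered_workqueue() now marks its queues explicitly ordered */
    #define alloc_ordered_workqueue(fmt, flags, args...)                \
            alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED |            \
                            __WQ_ORDERED_EXPLICIT | (flags), 1, ##args)

    /* in apply_workqueue_attrs_locked(): implicit ordering may be dropped */
    if (!list_empty(&wq->pwqs)) {
            if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
                    return -EINVAL;
            wq->flags &= ~__WQ_ORDERED;
    }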
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5c0338c687 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Cc: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2d39b3cd34 ]
Since commit 00cd5c37af ("ptrace: permit ptracing of /sbin/init") we
can now trace init processes. init is initially protected with
SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
there are a number of paths during tracing where SIGNAL_UNKILLABLE can
be implicitly cleared.
This can result in init becoming stoppable/killable after tracing. For
example, running:
while true; do kill -STOP 1; done &
strace -p 1
and then stopping strace and the kill loop will result in init being
left in state TASK_STOPPED. Sending SIGCONT to init will resume it, but
init will now respond to future SIGSTOP signals rather than ignoring
them.
Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
we don't clear SIGNAL_UNKILLABLE.
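The shape of the fix, as a minimal sketch (the signal_set_stop_flags()
helper comes from the upstream patch; placement may differ in this
backport):

    #define SIGNAL_STOP_MASK (SIGNAL_CLD_MASK | SIGNAL_STOP_STOPPED | \
                              SIGNAL_STOP_CONTINUED)

    static inline void signal_set_stop_flags(struct signal_struct *sig,
                                             unsigned int flags)
    {
            /* preserve SIGNAL_UNKILLABLE instead of overwriting all flags */
            sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
    }

    /* callers in kernel/signal.c then do, e.g.: */
    signal_set_stop_flags(signal, SIGNAL_STOP_CONTINUED);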
Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 89affbf5d9 upstream.
In codepaths that use the begin/retry interface for reading
mems_allowed_seq with irqs disabled, there exists a race condition that
stalls the patch process after only modifying a subset of the
static_branch call sites.
This problem manifested itself as a deadlock in the slub allocator,
inside get_any_partial. The loop reads mems_allowed_seq value (via
read_mems_allowed_begin), performs the defrag operation, and then
verifies the consistency of mem_allowed via the read_mems_allowed_retry
and the cookie returned by xxx_begin.
The issue here is that both begin and retry first check if cpusets are
enabled via cpusets_enabled() static branch. This branch can be
rewritten dynamically (via cpuset_inc) if a new cpuset is created. The
x86 jump label code fully synchronizes across all CPUs for every entry
it rewrites. If it rewrites only one of the callsites (specifically the
one in read_mems_allowed_retry) and then waits for the
smp_call_function(do_sync_core) to complete while a CPU is inside the
begin/retry section with IRQs off and the mems_allowed value is changed,
we can hang.
This is because begin() will always return 0 (since it wasn't patched
yet) while retry() will test the 0 against the actual value of the seq
counter.
The fix is to use two different static keys: one for begin
(pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we
first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
always returns a valid seqcount while we are enabling cpusets. Similarly, when
disabling cpusets via cpuset_dec(), we first ensure that callers of
cpuset_mems_allowed_retry() will start ignoring the seqcount value
before we let cpuset_mems_allowed_begin() return 0.
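A minimal sketch of the two-key scheme (modeled on the upstream patch;
the begin/retry helpers live in include/linux/cpuset.h):

    static inline void cpuset_inc(void)
    {
            /* begin() must see a live seqcount before retry() checks it */
            static_branch_inc(&cpusets_pre_enable_key);
            static_branch_inc(&cpusets_enabled_key);
    }

    static inline unsigned int read_mems_allowed_begin(void)
    {
            if (!static_branch_unlikely(&cpusets_pre_enable_key))
                    return 0;
            return read_seqcount_begin(&current->mems_allowed_seq);
    }

    static inline bool read_mems_allowed_retry(unsigned int seq)
    {
            if (!static_branch_unlikely(&cpusets_enabled_key))
                    return false;
            return read_seqcount_retry(&current->mems_allowed_seq, seq);
    }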
The relevant stack traces of the two stuck threads:
CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
RIP: smp_call_function_many+0x1f9/0x260
Call Trace:
smp_call_function+0x3b/0x70
on_each_cpu+0x2f/0x90
text_poke_bp+0x87/0xd0
arch_jump_label_transform+0x93/0x100
__jump_label_update+0x77/0x90
jump_label_update+0xaa/0xc0
static_key_slow_inc+0x9e/0xb0
cpuset_css_online+0x70/0x2e0
online_css+0x2c/0xa0
cgroup_apply_control_enable+0x27f/0x3d0
cgroup_mkdir+0x2b7/0x420
kernfs_iop_mkdir+0x5a/0x80
vfs_mkdir+0xf6/0x1a0
SyS_mkdir+0xb7/0xe0
entry_SYSCALL_64_fastpath+0x18/0xad
...
CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4
Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
task: ffff8818087c0000 task.stack: ffffc90000030000
RIP: int3+0x39/0x70
Call Trace:
<#DB> ? ___slab_alloc+0x28b/0x5a0
<EOE> ? copy_process.part.40+0xf7/0x1de0
__slab_alloc.isra.80+0x54/0x90
copy_process.part.40+0xf7/0x1de0
copy_process.part.40+0xf7/0x1de0
kmem_cache_alloc_node+0x8a/0x280
copy_process.part.40+0xf7/0x1de0
_do_fork+0xe7/0x6c0
_raw_spin_unlock_irq+0x2d/0x60
trace_hardirqs_on_caller+0x136/0x1d0
entry_SYSCALL_64_fastpath+0x5/0xad
do_syscall_64+0x27/0x350
SyS_clone+0x19/0x20
do_syscall_64+0x60/0x350
entry_SYSCALL64_slow_path+0x25/0x25
Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
Fixes: 46e700abc4 ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
Reported-by: Cliff Spradlin <cspradlin@waymo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5c0338c687 upstream.
The combination of WQ_UNBOUND and max_active == 1 used to imply
ordered execution. After NUMA affinity 4c16bd327c ("workqueue:
implement NUMA affinity for unbound workqueues"), this is no longer
true due to per-node worker pools.
While the right way to create an ordered workqueue is
alloc_ordered_workqueue(), the documentation has been misleading for a
long time and people do use WQ_UNBOUND and max_active == 1 for ordered
workqueues which can lead to subtle bugs which are very difficult to
trigger.
It's unlikely that we'd see noticeable performance impact by enforcing
ordering on WQ_UNBOUND / max_active == 1 workqueues. Let's
automatically set __WQ_ORDERED for those workqueues.
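The change itself is a two-line hunk in __alloc_workqueue_key(), roughly:

    /* unbound && max_active == 1 used to imply ordered; restore that */
    if ((flags & WQ_UNBOUND) && max_active == 1)
            flags |= __WQ_ORDERED;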
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: Alexei Potashnik <alexei@purestorage.com>
Fixes: 4c16bd327c ("workqueue: implement NUMA affinity for unbound workqueues")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3c74541777 upstream.
While refactoring, f7b2814bb9 ("cgroup: factor out
cgroup_{apply|finalize}_control() from
cgroup_subtree_control_write()") broke the error return value from the
function. The return value from the last operation is always
overridden to zero. Fix it.
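A sketch of the fixed control flow at the end of
cgroup_subtree_control_write(), reconstructed from the description above
(exact labels may differ):

    ret = cgroup_apply_control(cgrp);
    cgroup_finalize_control(cgrp, ret);
    if (ret)        /* previously fell through and was clobbered below */
            goto out_unlock;

    kernfs_activate(cgrp->kn);
    ret = 0;
    out_unlock:
    cgroup_kn_unlock(of->kn);
    return ret ?: nbytes;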
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7af608e4f9 upstream.
On subsystem registration, css_populate_dir() is not called on the new
root css, so the interface files for the subsystem on cgrp_dfl_root
aren't created on registration. This is a residue from the days when
cgrp_dfl_root was used only as the parking spot for unused subsystems,
which no longer is true as it's used as the root for cgroup2.
This is often fine as later operations tend to create them as a part
of mount (cgroup1) or subtree_control operations (cgroup2); however,
it's not difficult to mount cgroup2 with the controller interface
files missing as Waiman found out.
Fix it by invoking css_populate_dir() on the root css on subsys
registration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.9.41
af_key: Add lock to key dump
pstore: Make spinlock per zone instead of global
net: reduce skb_warn_bad_offload() noise
jfs: Don't clear SGID when inheriting ACLs
ALSA: fm801: Initialize chip after IRQ handler is registered
ALSA: hda - Add missing NVIDIA GPU codec IDs to patch table
parisc: Prevent TLB speculation on flushed pages on CPUs that only support equivalent aliases
parisc: Extend disabled preemption in copy_user_page
parisc: Suspend lockup detectors before system halt
powerpc/pseries: Fix of_node_put() underflow during reconfig remove
NFS: invalidate file size when taking a lock.
NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter
crypto: authencesn - Fix digest_null crash
KVM: PPC: Book3S HV: Enable TM before accessing TM registers
md/raid5: add thread_group worker async_tx_issue_pending_all
drm/vmwgfx: Fix gcc-7.1.1 warning
drm/nouveau/disp/nv50-: bump max chans to 21
drm/nouveau/bar/gf100: fix access to upper half of BAR2
KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit
KVM: PPC: Book3S HV: Save/restore host values of debug registers
Revert "powerpc/numa: Fix percpu allocations to be NUMA aware"
Staging: comedi: comedi_fops: Avoid orphaned proc entry
drm: rcar-du: Simplify and fix probe error handling
smp/hotplug: Move unparking of percpu threads to the control CPU
smp/hotplug: Replace BUG_ON and react useful
nfc: Fix hangup of RC-S380* in port100_send_ack()
nfc: fdp: fix NULL pointer dereference
net: phy: Do not perform software reset for Generic PHY
isdn: Fix a sleep-in-atomic bug
isdn/i4l: fix buffer overflow
ath10k: fix null deref on wmi-tlv when trying spectral scan
wil6210: fix deadlock when using fw_no_recovery option
mailbox: always wait in mbox_send_message for blocking Tx mode
mailbox: skip complete wait event if timer expired
mailbox: handle empty message in tx_tick
sched/cgroup: Move sched_online_group() back into css_online() to fix crash
RDMA/uverbs: Fix the check for port number
ipmi/watchdog: fix watchdog timeout set on reboot
dentry name snapshots
v4l: s5c73m3: fix negation operator
pstore: Allow prz to control need for locking
pstore: Correctly initialize spinlock and flags
pstore: Use dynamic spinlock initializer
net: skb_needs_check() accepts CHECKSUM_NONE for tx
device-dax: fix sysfs duplicate warnings
x86/mce/AMD: Make the init code more robust
r8169: add support for RTL8168 series add-on card.
ARM: omap2+: fixing wrong strcat for Non-NULL terminated string
dt-bindings: power/supply: Update TPS65217 properties
dt-bindings: input: Specify the interrupt number of TPS65217 power button
ARM: dts: am57xx-idk: Put USB2 port in peripheral mode
ARM: dts: n900: Mark eMMC slot with no-sdio and no-sd flags
net/mlx5: Disable RoCE on the e-switch management port under switchdev mode
ipv6: Should use consistent conditional judgement for ip6 fragment between __ip6_append_data and ip6_finish_output
net/mlx4_core: Use-after-free causes a resource leak in flow-steering detach
net/mlx4: Remove BUG_ON from ICM allocation routine
net/mlx4_core: Fix raw qp flow steering rules under SRIOV
drm/msm: Ensure that the hardware write pointer is valid
drm/msm: Put back the vaddr in submit_reloc()
drm/msm: Verify that MSM_SUBMIT_BO_FLAGS are set
vfio-pci: use 32-bit comparisons for register address for gcc-4.5
irqchip/keystone: Fix "scheduling while atomic" on rt
ASoC: tlv320aic3x: Mark the RESET register as volatile
spi: dw: Make debugfs name unique between instances
ASoC: nau8825: fix invalid configuration in Pre-Scalar of FLL
irqchip/mxs: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
openrisc: Add _text symbol to fix ksym build error
dmaengine: ioatdma: Add Skylake PCI Dev ID
dmaengine: ioatdma: workaround SKX ioatdma version
l2tp: consider '::' as wildcard address in l2tp_ip6 socket lookup
dmaengine: ti-dma-crossbar: Add some 'of_node_put()' in error path.
usb: dwc3: omap: fix race of pm runtime with irq handler in probe
ARM64: zynqmp: Fix W=1 dtc 1.4 warnings
ARM64: zynqmp: Fix i2c node's compatible string
perf probe: Fix to get correct modname from elf header
ARM: s3c2410_defconfig: Fix invalid values for NF_CT_PROTO_*
ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
usb: gadget: Fix copy/pasted error message
Btrfs: use down_read_nested to make lockdep silent
Btrfs: fix lockdep warning about log_mutex
benet: stricter vxlan offloading check in be_features_check
Btrfs: adjust outstanding_extents counter properly when dio write is split
Xen: ARM: Zero reserved fields of xatp before making hypervisor call
tools lib traceevent: Fix prev/next_prio for deadline tasks
xfrm: Don't use sk_family for socket policy lookups
perf tools: Install tools/lib/traceevent plugins with install-bin
perf symbols: Robustify reading of build-id from sysfs
video: fbdev: cobalt_lcdfb: Handle return NULL error from devm_ioremap
vfio-pci: Handle error from pci_iomap
arm64: mm: fix show_pte KERN_CONT fallout
nvmem: imx-ocotp: Fix wrong register size
net: usb: asix_devices: add .reset_resume for USB PM
ASoC: fsl_ssi: set fifo watermark to more reliable value
sh_eth: enable RX descriptor word 0 shift on SH7734
ARCv2: IRQ: Call entry/exit functions for chained handlers in MCIP
ALSA: usb-audio: test EP_FLAG_RUNNING at urb completion
x86/platform/intel-mid: Rename 'spidev' to 'mrfld_spidev'
perf/x86: Set pmu->module in Intel PMU modules
ASoC: Intel: bytcr-rt5640: fix settings in internal clock mode
HID: ignore Petzl USB headlamp
scsi: fnic: Avoid sending reset to firmware when another reset is in progress
scsi: snic: Return error code on memory allocation failure
scsi: bfa: Increase requested firmware version to 3.2.5.1
ASoC: Intel: Skylake: Release FW ctx in cleanup
ASoC: dpcm: Avoid putting stream state to STOP when FE stream is paused
Linux 4.9.41
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 96b777452d upstream.
Commit:
2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")
.. moved sched_online_group() from css_online() to css_alloc().
It exposes a half-baked task group in the global lists before the
generic cgroup infrastructure has initialized it.
An LTP testcase (the third in cgroup_regression_test), written to test a
similar race in kernels 2.6.26-2.6.28, easily triggers this oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: kernfs_path_from_node_locked+0x260/0x320
CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
Call Trace:
? kernfs_path_from_node+0x4f/0x60
kernfs_path_from_node+0x3e/0x60
print_rt_rq+0x44/0x2b0
print_rt_stats+0x7a/0xd0
print_cpu+0x2fc/0xe80
? __might_sleep+0x4a/0x80
sched_debug_show+0x17/0x30
seq_read+0xf2/0x3b0
proc_reg_read+0x42/0x70
__vfs_read+0x28/0x130
? security_file_permission+0x9b/0xc0
? rw_verify_area+0x4e/0xb0
vfs_read+0xa5/0x170
SyS_read+0x46/0xa0
entry_SYSCALL_64_fastpath+0x1e/0xad
Here the task group is already linked into the global RCU-protected 'task_groups'
list, but the css->cgroup pointer is still NULL.
This patch reverts this chunk and moves online back to css_online().
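A sketch of the move (per the upstream fix):

    static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
    {
            struct task_group *tg = css_tg(css);
            struct task_group *parent = css_tg(css->parent);

            /* moved back here from cpu_cgroup_css_alloc() */
            if (parent)
                    sched_online_group(tg, parent);
            return 0;
    }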
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")
Link: http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dea1d0f5f1 upstream.
The move of the unpark functions to the control thread moved the BUG_ON()
there as well. While it made some sense in the idle thread of the upcoming
CPU, it's bogus to crash the control thread on the already online CPU,
especially as the function has a return value and the callsite is prepared
to handle an error return.
Replace it with a WARN_ON_ONCE() and return a proper error code.
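The resulting check in bringup_wait_for_ap(), roughly:

    /* Wait for the CPU to reach CPUHP_AP_ONLINE_IDLE */
    wait_for_completion(&st->done);
    if (WARN_ON_ONCE(!cpu_online(cpu)))
            return -ECANCELED;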
Fixes: 9cd4f1a4e7 ("smp/hotplug: Move unparking of percpu threads to the control CPU")
Rightfully-ranted-at-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9cd4f1a4e7 upstream.
Vikram reported the following backtrace:
BUG: scheduling while atomic: swapper/7/0/0x00000002
CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
schedule
schedule_hrtimeout_range_clock
schedule_hrtimeout
wait_task_inactive
__kthread_bind_mask
__kthread_bind
__kthread_unpark
kthread_unpark
cpuhp_online_idle
cpu_startup_entry
secondary_start_kernel
He analyzed correctly that a parked cpu hotplug thread of an offlined CPU
was still on the runqueue when the CPU came back online and tried to unpark
it. This causes the thread which invoked kthread_unpark() to call
wait_task_inactive() and subsequently schedule() with preemption disabled.
His proposed workaround was to "make sure" that a parked thread has
scheduled out when the CPU goes offline, so the situation cannot happen.
But that's still wrong because the root cause is not the fact that the
percpu thread is still on the runqueue and neither that preemption is
disabled, which could be simply solved by enabling preemption before
calling kthread_unpark().
The real issue is that the calling thread is the idle task of the upcoming
CPU, which is not supposed to call anything which might sleep. The moron,
who wrote that code, missed completely that kthread_unpark() might end up
in schedule().
The solution is simpler than expected. The thread which controls the
hotplug operation is waiting for the CPU to call complete() on the hotplug
state completion. So the idle task of the upcoming CPU can set its state to
CPUHP_AP_ONLINE_IDLE and invoke complete(). This in turn wakes the control
task on a different CPU, which then can safely do the unpark and kick the
now unparked hotplug thread of the upcoming CPU to complete the bringup to
the final target state.
Control CPU                     AP

bringup_cpu();
  __cpu_up()  ------------>
                                bringup_ap();
  bringup_wait_for_ap()
    wait_for_completion();
                                cpuhp_online_idle();
            <------------       complete();
    unpark(AP->stopper);
    unpark(AP->hotplugthread);
                                while(1)
                                  do_idle();
    kick(AP->hotplugthread);
    wait_for_completion();      hotplug_thread()
                                  run_online_callbacks();
                                  complete();
Fixes: 8df3e07e7f ("cpu/hotplug: Let upcoming cpu bring itself fully up")
Reported-by: Vikram Mulukutla <markivx@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Sewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707042218020.2131@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sometimes we find a target cpu but then do not use it, as the
energy_diff indicates that we would increase energy usage or save
nothing. To offer an additional option for those cases, we return a
second choice: the CPU we would have selected if the target CPU had
not been found. This gives us another chance to save some energy.
Change-Id: I42c4f20aba10e4cf65b51ac4153e2e00e534c8c7
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
(cherry picked from commit 426e11af9e415437d857e0f3cc5215b43a7cee08)
[Fixed cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
In the current EAS group energy calculations, we only use the idle
state of the group as it is right now. This means that there are times
when EAS cannot see that we are about to remove all utilization from a
group, which would likely let us idle that entire group.
This is an attempt to detect that situation and at least
allow the energy calculation to include savings in that
scenario, regardless of what we might be able to actually
achieve in the real world. If a cluster or cpu looks like
it will have some idle time available to it, we try to
map the utilization onto an idle state.
Change-Id: I8fcb1e507f65ae6a2c5647eeef75a4bf28c7a0c0
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 6a43b3d6d4f0ce4c978a7b3170f130146a3dbc12)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Before using a task's util_avg signal in EAS, we need to ensure that
it has been synced up to the last_update_time of prev_cpu's root
cfs_rq.
We previously relied on the side effect of wake_cap to do that,
however that does not happen when the waking CPU has the same
capacity as the prev_cpu. Therefore just explicitly call
sync_entity_load_avg. This may result in calling that function twice
within the same select_task_rq_fair, but since last_update_time
hasn't changed the second call will bail out very quickly.
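A minimal sketch of the explicit sync; its exact placement inside
select_task_rq_fair() is an assumption based on the description above:

    /* wakeup path in select_task_rq_fair() */
    if (sd_flag & SD_BALANCE_WAKE) {
            /*
             * Sync p's util_avg with prev_cpu's root cfs_rq
             * last_update_time; cheap if wake_cap() already did it.
             */
            sync_entity_load_avg(&p->se);
    }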
Change-Id: I91f1fcd71dfeb96b7f5b73418f1cf9ac311d4655
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
(cherry picked from commit 595d8e0d207b9dfb0aa412d24293ed80e6478c88)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
If there is a misfit task on a CPU, the current code does not handle
this situation during nohz idle balance. As a result, the misfit task
can stay running on a little core for a long time.
So this patch checks whether the CPU has a misfit task and, if so,
kicks nohz idle balance so that active balance can finally run.
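A hypothetical sketch of the kick condition (the rq->misfit_task field
name is assumed from the EAS tree):

    /* in nohz_kick_needed() */
    if (rq->misfit_task)
            return true;    /* kick nohz idle balance; active balance follows */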
Change-Id: I117d3b7404296f8de11cb960a87a6b9a54a9f348
Signed-off-by: Leo Yan <leo.yan@linaro.org>
[taken from https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
(cherry picked from commit 07bb98f399e2df79c79f6ad563fc69eff8c32967)
[Fixed cpu_load_update_idle call]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Stale cpu utilization signals can cause havoc for energy-aware systems,
and they are caused by no updates being performed for cpus which have
no tick running. There is open debate about when the correct time to
update these cpus is, and general recognition that something needs to be
done.
This is an attempt to do something useful.
When we are looking for a task to pull for a newly-idle cpu, we have
an opportunity to update the stats for any cpu which has no tick running
without causing too much disturbance to the system or waking it up.
Change-Id: I0280104ea9c53e56c26f1c56a62bacab5d3e951b
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
(cherry picked from commit 352ffffddc25e6237395210439f2c49e68606703)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The find_best_target() code has evolved over time to integrate different
micro-optimizations, to the point that it is now quite difficult to
follow exactly what it is doing.
This patch refactors the existing code to make it more readable and
easier to maintain. It does that by properly identifying the three main
use-cases and addressing them in priority order:
A) latency sensitive tasks
B) non latency sensitive tasks on IDLE CPUs
C) non latency sensitive tasks on ACTIVE CPUs
The original behaviors are preserved. Some tests comparing power and
performance before and after this patch were run using Jankbench and
YouTube, and we did not notice significant differences.
The only difference with respect to the original code is a small update
to favor lower-capacity idle CPUs in case B. The same preference is not
enforced in case A, since that can lead to the selection of a
non-reserved CPU for TOP_APP tasks, which ultimately can lead to
undesirable co-scheduling side effects.
Change-Id: I871e5d95af89176217e4e239b64d44a420baabe8
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
(removed checkpatch whitespace error)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[Fixed cherry-pick issue]
(cherry picked from commit 6b0008b7d7be7d141f9855337eb09ad0c9217cfb)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
sugov_update_commit() calls trace_cpu_frequency() to record the
current CPU frequency if it has not changed in the fast switch case
to prevent utilities from getting confused (they may report that the
CPU is idle if the frequency has not been recorded for too long, for
example).
However, that may cause the tracepoint to be triggered quite often
for no real reason (if the frequency doesn't change, we will not
modify the last update time stamp and governor computations may
run again shortly when that happens), so don't do that (arguably, it
is done to work around a utilities bug anyway).
That allows code duplication in sugov_update_commit() to be reduced
somewhat too.
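After the change, sugov_update_commit() bails out early when nothing
changed; a sketch (matching the upstream commit, modulo the rate-limit
conflict noted below):

    if (sg_policy->next_freq == next_freq)
            return;         /* no trace, no last_freq_update_time update */

    sg_policy->next_freq = next_freq;
    sg_policy->last_freq_update_time = time;

    if (policy->fast_switch_enabled) {
            next_freq = cpufreq_driver_fast_switch(policy, next_freq);
            if (next_freq == CPUFREQ_ENTRY_INVALID)
                    return;
            policy->cur = next_freq;
            trace_cpu_frequency(next_freq, smp_processor_id());
    } else {
            sg_policy->work_in_progress = true;
            irq_work_queue(&sg_policy->irq_work);
    }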
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
(cherry picked from commit 38d4ea229d)
(conflicts with sugov_up_down_rate_limit resolved)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia019dda29b8c1c4cf3553da75c88d066eb5674e9
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.
That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register. Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again. That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.
That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task. If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away. That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.
This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway. On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.
On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.
To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.
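A sketch of the mechanism, as in the upstream commit (the helper samples
the per-CPU idle-calls counter maintained by the timekeeping code):

    #ifdef CONFIG_NO_HZ_COMMON
    static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
    {
            unsigned long idle_calls = tick_nohz_get_idle_calls();
            bool ret = idle_calls == sg_cpu->saved_idle_calls;

            sg_cpu->saved_idle_calls = idle_calls;
            return ret;     /* unchanged counter => CPU was busy throughout */
    }
    #else
    static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
    #endif

    /* in sugov_update_single(): refuse to lower the frequency of a busy CPU */
    if (busy && next_f < sg_policy->next_freq)
            next_f = sg_policy->next_freq;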
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Joel Fernandes <joelaf@google.com>
(cherry picked from commit b7eaf1aab9)
(simple CPUFREQ_RT_DL vs CPUFREQ_DL usage conflicts)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I531ec02c052944ee07a904dc2a25c59948ee762b
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The loop in sugov_next_freq_shared() contains an if block to skip the
loop for the current CPU. This turns out to be an unnecessary
conditional in the scheduler's hot-path for every CPU in the policy.
It would be better to drop the conditional and make the loop treat all
the CPUs in the same way. That would eliminate the need of calling
sugov_iowait_boost() at the top of the routine.
To keep the code optimized to return early if the current CPU has RT/DL
flags set, move the flags check to sugov_update_shared() instead in
order to avoid the function call entirely.
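A sketch of where the check ends up (per the upstream commit; this tree
uses SCHED_CPUFREQ_DL/SCHED_CPUFREQ_RT instead, as noted below):

    /* in sugov_update_shared(): RT/DL handled before any function call */
    if (sugov_should_update_freq(sg_policy, time)) {
            if (flags & SCHED_CPUFREQ_RT_DL)
                    next_f = sg_policy->policy->cpuinfo.max_freq;
            else
                    next_f = sugov_next_freq_shared(sg_cpu);
            sugov_update_commit(sg_policy, time, next_f);
    }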
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit cba1dfb57b)
(modified for SCHED_CPUFREQ_DL vs SCHED_CPUFREQ_RT)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ie046fdc8eda46821356750edd0fb6f7d077af363
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
get_next_freq() uses sg_cpu only to get sg_policy, which the callers of
get_next_freq() already have. Pass sg_policy instead of sg_cpu to
get_next_freq(), to make it more efficient.
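The signature change, roughly:

    /* before: every caller passed sg_cpu just to reach sg_policy */
    static unsigned int get_next_freq(struct sugov_cpu *sg_cpu,
                                      unsigned long util, unsigned long max);

    /* after: pass the policy directly */
    static unsigned int get_next_freq(struct sugov_policy *sg_policy,
                                      unsigned long util, unsigned long max);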
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 655cb1ebff)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia210058da32930a6cdb18258aa679cd1a44a747e
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Execute the irq-work specific initialization/exit code only when the
fast path isn't available.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 21ef57297b)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Icfd68f455ef71846d799fcd2d8ec6aa1bf59573e
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The fast_switch_enabled flag will be used by both sugov_policy_alloc()
and sugov_policy_free() in a later patch.
Prepare for that by moving the calls to enable and disable it to the
beginning of sugov_init() and end of sugov_exit().
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 4a71ce4348)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: Ia174f423ca02d59360657ac2e77a5098ce5cf99c
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The usage of conditionally compiled code is discouraged in fair.c.
This patch cleans up fair.c a bit by moving the schedtune_{cpu,task}_boost
definitions into tune.h.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
(cherry picked from commit 274bbcfbe4)
[Fixed cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I7543ca0786fa5cbc2490e5c91598976bbdce61f3
The SchedTune tasks accounting is used to identify how many tasks are in
a boostgroup and thus to bias the selection of an OPP based on the
maximum boost value of the active boostgroups.
The current implementation, however, updates the accounting after CPU
capacity has been updated. This has two effects:
a) when we enqueue a boosted task, we do not immediately boost its CPU
b) when we dequeue a boosted task, we can keep a CPU boosted even if not
required
This patch changes the order of the SchedTune accounting and SchedFreq
updates to ensure there is always an up-to-date representation of which
boosted tasks are runnable on a CPU before its capacity is updated.
Reported-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
(cherry picked from commit a85045c034)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Change-Id: I46ee5795238f93ef5ae1fed5e772e19680892306
Due to rounding error, the hrtimer tick interval becomes 3333333 ns when
HZ=300. Consequently, the tick time stamp nearest to WALT's default
window size of 20 ms will also be 19999998 ns (3333333 * 6).
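For reference, with integer division: NSEC_PER_SEC / HZ = 1000000000 / 300
= 3333333 ns per tick, so six ticks give 6 * 3333333 = 19999998 ns, which
is 2 ns short of the 20 ms (20000000 ns) window.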
Change-Id: I08f9bd2dbecccbb683e4490d06d8b0da703d3ab2
Suggested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit d368c6faa1)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Remove the declarations of capacity_orig_of() and cpu_util() from
fair.c. These are remnants of commit 608d49484e
which is derived from commit 6b6c192453.
Change-Id: I579a58867b2bb707bfe3c5a98f8d4ec65367fe5d
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Add appropriate #ifdef guards to ensure the smp-only easstats structs
are not used when smp is not enabled. Arnd got a report from buildbot,
analysed it, and pointed out exactly what the issue was.
Reported-by: "Arnd Bergmann" <arnd@arndb.de>
Suggested-by: "Arnd Bergmann" <arnd@arndb.de>
Fixes: 4b85765a3d ("sched/fair: Add eas (& cas) specific rq, sd and task stats")
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I60554dea20137f6774db3f59b4afd40a06554cfc
(cherry picked from commit fce0ecf04a)
[Fixed cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
When EAS is enabled during boot, we have to be careful not to use
schedtune from fair.c before it is ready or it will warn us and we'll
get a traceback in the console.
Change-Id: I1a5cf29b18af626545c636c51219f9ed497c19fa
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The sched_energy_diff tracepoint is in a place where it can never trace
payoff or nrg.delta. If CONFIG_SCHED_TUNE is enabled, put it in a place
where those values exist. If it is not enabled, trace from the current
location.
Change-Id: Id5442f2b34ec76625491d27c0f4285433ca12699
Reported-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
We use 5 groups everywhere else; this should default to the same.
Change-Id: I05a20bdcf8046ea90a2e36979940cef11246e735
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
When using WALT, we always use boosted cpu util for OPP selection.
This is the primary purpose of boosted cpu util, but we hadn't changed
the PELT utilization check to do the same thing.
Fix that here.
Change-Id: Id5ffb26eac23b25fe754255221f6d21b8cededfd
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
sched_group_energy() was supposed to support per-cpu capacity states
(DVFS); however, while fixing a hotplug issue this was broken, as we
bail out if the SD_SHARE_CAP_STATES flag is not set.
This patch implements the hotplug race check differently and should
therefore reinstate support for per-cpu capacity states.
Change-Id: I5b865666c9ce833dcfa6514c574580d75aa0a195
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
In some cases, the new_util of a task can be the same on several
CPUs. This causes an issue because the target_util is only updated
if the current new_util is strictly smaller than target_util.
To fix that, the cpu_util_wake() return value is used alongside the
new_util value. If two CPUs compute the same new_util value,
we'll now also look at their cpu_util_wake() return value. In this
case, the CPU that last ran the task will be chosen in priority.
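A hypothetical sketch of the candidate test in find_best_target()
(variable names are illustrative, not the actual ones):

    /* prefer strictly lower new_util; break ties with cpu_util_wake() */
    if (new_util < target_util ||
        (new_util == target_util && cpu_util_wake(i, p) < best_wake_util)) {
            /* prev_cpu carries p's util, so its cpu_util_wake() is lowest */
            best_wake_util = cpu_util_wake(i, p);
            target_util = new_util;
            target_cpu = i;
    }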
Change-Id: Ia1ea2c4b3ec39621372c2f748862317d5b497723
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
wake_cap performs task and cpu utilization synchronization which is
what allows us to subtract current task util from prev_cpu util and
have a sensible number to work with.
It looks as though if wake_wide returns 0, we could potentially not
execute wake_cap, which would result in unsynced signals we then use
for energy calculations.
This is not necessarily an issue we've seen in traces, but it looks
as though it should be changed.
Change-Id: Ic54a3cba2a10d946ea20113a04371dea04115e82
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[Remove _wake_cap variable to match commit cf6ed9a668]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
When we place a waking task with find_best_target, we calculate the
existing and new utilisation of each candidate cpu. However, we do
not remove any blocked load resulting from the waking task on the
previous cpu which might cause unnecessary migrations.
Switch to using cpu_util_wake which does this for us, which requires
moving cpu_util_wake a few functions earlier.
Also, we have multiple potential cpu utilization signals here, so
update the necessary bits to allow WALT to work properly (including
not subtracting task util for WALT).
When WALT is in use, cpu utilization is the utilization
in the previous completed window, whilst the task utilization
ignores fully idle windows. There seems to be no way to have a
decently accurate estimate of how much (if any) utilization from
this task remains on the prev cpu.
Instead, just return cpu_util when we're using WALT.
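A sketch of cpu_util_wake() with the WALT special case (simplified; the
mainline helper subtracts task_util() from the root cfs_rq util_avg):

    static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
    {
            unsigned long util, capacity;

    #ifdef CONFIG_SCHED_WALT
            /* no decent estimate of p's residual util under WALT */
            if (!walt_disabled && sysctl_sched_use_walt_cpu_util)
                    return cpu_util(cpu);
    #endif
            /* task has no contribution here, or is brand new */
            if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
                    return cpu_util(cpu);

            capacity = capacity_orig_of(cpu);
            util = max_t(long, cpu_util(cpu) - task_util(p), 0);

            return (util >= capacity) ? capacity : util;
    }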
Change-Id: I448203ab98ffb5c020dfb6b218581eef1f5601f7
Reported-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
With the ability to choose between WALT and PELT for utilisation tracking
we can have the situation where we're using WALT to make all the
decisions and reporting PELT figures in the sched_load_avg_(cpu|task)
trace points. This is not too much of an issue, but when analysing a
trace it is nice to see numbers representing what the scheduler is
using, rather than needing to add additional sched_walt_* traces to
figure it out.
Add reporting for both types, and make the util_avg member reflect what
will be seen from cpu or task_util functions in the scheduler.
Change-Id: I2abbd2c5fa70822096d0f3372b4c12b1c6af1590
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
[renamed macros according to 172895e6b5]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>