PD#157069: skip SUBLEVEL and crc when ver check durning insmod
When CONFIG_MODVERSIONS enabled, vermagic and crc are checked
durning insmod.
Change-Id: I6eb7bdda5b771afa754f7b783a7bbfe1be7cedd1
Signed-off-by: jianxin.pan <jianxin.pan@amlogic.com>
Changes in 4.9.68
bcache: only permit to recovery read error when cache device is clean
bcache: recover data from backing when data is clean
drm/fsl-dcu: avoid disabling pixel clock twice on suspend
drm/fsl-dcu: enable IRQ before drm_atomic_helper_resume()
Revert "crypto: caam - get rid of tasklet"
mm, oom_reaper: gather each vma to prevent leaking TLB entry
uas: Always apply US_FL_NO_ATA_1X quirk to Seagate devices
usb: quirks: Add no-lpm quirk for KY-688 USB 3.1 Type-C Hub
serial: 8250_pci: Add Amazon PCI serial device ID
s390/runtime instrumentation: simplify task exit handling
USB: serial: option: add Quectel BG96 id
ima: fix hash algorithm initialization
s390/pci: do not require AIS facility
selftests/x86/ldt_get: Add a few additional tests for limits
staging: greybus: loopback: Fix iteration count on async path
m68k: fix ColdFire node shift size calculation
serial: 8250_fintek: Fix rs485 disablement on invalid ioctl()
staging: rtl8188eu: avoid a null dereference on pmlmepriv
spi: sh-msiof: Fix DMA transfer size check
spi: spi-axi: fix potential use-after-free after deregistration
mmc: sdhci-msm: fix issue with power irq
usb: phy: tahvo: fix error handling in tahvo_usb_probe()
serial: 8250: Preserve DLD[7:4] for PORT_XR17V35X
x86/entry: Use SYSCALL_DEFINE() macros for sys_modify_ldt()
EDAC, sb_edac: Fix missing break in switch
sysrq : fix Show Regs call trace on ARM
usbip: tools: Install all headers needed for libusbip development
perf test attr: Fix ignored test case result
kprobes/x86: Disable preemption in ftrace-based jprobes
tools include: Do not use poison with C++
iio: adc: ti-ads1015: add 10% to conversion wait time
dax: Avoid page invalidation races and unnecessary radix tree traversals
net/mlx4_en: Fix type mismatch for 32-bit systems
l2tp: take remote address into account in l2tp_ip and l2tp_ip6 socket lookups
dmaengine: stm32-dma: Set correct args number for DMA request from DT
dmaengine: stm32-dma: Fix null pointer dereference in stm32_dma_tx_status
usb: gadget: f_fs: Fix ExtCompat descriptor validation
libcxgb: fix error check for ip6_route_output()
net: systemport: Utilize skb_put_padto()
net: systemport: Pad packet before inserting TSB
ARM: OMAP2+: Fix WL1283 Bluetooth Baud Rate
ARM: OMAP1: DMA: Correct the number of logical channels
vti6: fix device register to report IFLA_INFO_KIND
be2net: fix accesses to unicast list
be2net: fix unicast list filling
net/appletalk: Fix kernel memory disclosure
libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
net: qrtr: Mark 'buf' as little endian
mm: fix remote numa hits statistics
mac80211: calculate min channel width correctly
ravb: Remove Rx overflow log messages
nfs: Don't take a reference on fl->fl_file for LOCK operation
drm/exynos/decon5433: update shadow registers iff there are active windows
drm/exynos/decon5433: set STANDALONE_UPDATE_F also if planes are disabled
KVM: arm/arm64: Fix occasional warning from the timer work function
mac80211: prevent skb/txq mismatch
NFSv4: Fix client recovery when server reboots multiple times
perf/x86/intel: Account interrupts for PEBS errors
powerpc/mm: Fix memory hotplug BUG() on radix
qla2xxx: Fix wrong IOCB type assumption
drm/amdgpu: fix bug set incorrect value to vce register
drm/exynos/decon5433: set STANDALONE_UPDATE_F on output enablement
net: sctp: fix array overrun read on sctp_timer_tbl
x86/fpu: Set the xcomp_bv when we fake up a XSAVES area
drm/amdgpu: fix unload driver issue for virtual display
mac80211: don't try to sleep in rate_control_rate_init()
RDMA/qedr: Return success when not changing QP state
RDMA/qedr: Fix RDMA CM loopback
tipc: fix nametbl_lock soft lockup at module exit
tipc: fix cleanup at module unload
dmaengine: pl330: fix double lock
tcp: correct memory barrier usage in tcp_check_space()
i2c: i2c-cadence: Initialize configuration before probing devices
nvmet: cancel fatal error and flush async work before free controller
gtp: clear DF bit on GTP packet tx
gtp: fix cross netns recv on gtp socket
net: phy: micrel: KSZ8795 do not set SUPPORTED_[Asym_]Pause
net: thunderx: avoid dereferencing xcv when NULL
be2net: fix initial MAC setting
vfio/spapr: Fix missing mutex unlock when creating a window
mm: avoid returning VM_FAULT_RETRY from ->page_mkwrite handlers
xen-netfront: Improve error handling during initialization
cec: initiator should be the same as the destination for, poll
xen-netback: vif counters from int/long to u64
net: fec: fix multicast filtering hardware setup
dma-buf/dma-fence: Extract __dma_fence_is_later()
dma-buf/sw-sync: Fix the is-signaled test to handle u32 wraparound
dma-buf/sw-sync: Prevent user overflow on timeline advance
dma-buf/sw-sync: Reduce irqsave/irqrestore from known context
dma-buf/sw-sync: sync_pt is private and of fixed size
dma-buf/sw-sync: Fix locking around sync_timeline lists
dma-buf/sw-sync: Use an rbtree to sort fences in the timeline
dma-buf/sw_sync: move timeline_fence_ops around
dma-buf/sw_sync: clean up list before signaling the fence
dma-fence: Clear fence->status during dma_fence_init()
dma-fence: Wrap querying the fence->status
dma-fence: Introduce drm_fence_set_error() helper
dma-buf/sw_sync: force signal all unsignaled fences on dying timeline
dma-buf/sync_file: hold reference to fence when creating sync_file
dma-buf: Update kerneldoc for sync_file_create
usb: hub: Cycle HUB power when initialization fails
usb: xhci: fix panic in xhci_free_virt_devices_depth_first
USB: core: Add type-specific length check of BOS descriptors
USB: Increase usbfs transfer limit
USB: devio: Prevent integer overflow in proc_do_submiturb()
USB: usbfs: Filter flags passed in from user space
usb: host: fix incorrect updating of offset
xen-netfront: avoid crashing on resume after a failure in talk_to_netback()
Linux 4.9.68
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 475113d937 ]
It's possible to set up PEBS events to get only errors and not
any data, like on SNB-X (model 45) and IVB-EP (model 62)
via 2 perf commands running simultaneously:
taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
This leads to a soft lock up, because the error path of the
intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
for error PEBS interrupts, so in case you're getting ONLY
errors you don't have a way to stop the event when it's over
the max_samples_per_tick limit:
NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
...
RIP: 0010:[<ffffffff81159232>] [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
...
Call Trace:
? trace_hardirqs_on_caller+0xf5/0x1b0
? perf_cgroup_attach+0x70/0x70
perf_install_in_context+0x199/0x1b0
? ctx_resched+0x90/0x90
SYSC_perf_event_open+0x641/0xf90
SyS_perf_event_open+0x9/0x10
do_syscall_64+0x6c/0x1f0
entry_SYSCALL64_slow_path+0x25/0x25
Add perf_event_account_interrupt() which does the interrupt
and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
error path.
We keep the pending_kill and pending_wakeup logic only in the
__perf_event_overflow() path, because they make sense only if
there's any data to deliver.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
'cached_raw_freq' is used to get the next frequency quickly but should
always be in sync with sg_policy->next_freq. There are cases where it is
not and in such cases it should be reset to avoid switching to incorrect
frequencies.
Consider this case for example:
- policy->cur is 1.2 GHz (Max)
- New request comes for 780 MHz and we store that in cached_raw_freq.
- Based on 780 MHz, we calculate the effective frequency as 800 MHz.
- We then decide not to update the frequency as
sugov_up_down_rate_limit() return true.
- Here cached_raw_freq is 780 MHz and sg_policy->next_freq is 1.2 GHz.
- Now if the utilization doesn't change in next request, then the next
target frequency will still be 780 MHz and it will match with
cached_raw_freq and so we will directly return 1.2 GHz instead of 800
MHz.
BACKPORT of upstream commit 07458f6a51 ("cpufreq: schedutil: Reset
cached_raw_freq when not in sync with next_freq").
This also updates sugov_update_commit() for handling up/down tunables,
which aren't present in mainline.
Change-Id: Ie86465231e7cb265e5b4c26f59d6faf8d9630b0a
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
For upmigrating misfit running task case, the currently running
task's util has been counted into cpu_util(). Thus currently
__cpu_overutilized() which add task's uitl twice is overestimated.
Change-Id: I5326f4c736a55679009d2e7293f8792311c04294
Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
WALT's function cpu_util(cpu) reports CPU's load without taking into
account of waking task's load. Thus currently cpu_overutilized()
underestimates load on the previous CPU of waking task.
Take into account of task's load to determine whether previous CPU is
overutilzed to bail out early without running energy_diff() which is
expensive.
Change-Id: I30f146984a880ad2cc1b8a4ce35bd239a8c9a607
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(minor rebase conflicts)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 94e5c96507)
[trivial cherry-pick issues]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Upmigrate misfit current task upon scheduler tick with stopper.
We can kick an random (not necessarily big CPU) NOHZ idle CPU when a
CPU bound task is in need of upmigration. But it's not efficient as that
way needs following unnecessary wakeups:
1. Busy little CPU A to kick idle B
2. B runs idle balancer and enqueue migration/A
3. B goes idle
4. A runs migration/A, enqueues busy task on B.
5. B wakes up again.
This change makes active upmigration more efficiently by doing:
1. Busy little CPU A find target CPU B upon tick.
2. CPU A enqueues migration/A.
Change-Id: Ie865738054ea3296f28e6ba01710635efa7193c0
[joonwoop: The original version had logic to reserve CPU. The logic is
omitted in this version.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(cherry picked from commit 9e293db052)
[trivial cherry-pick issues]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Currently active_load_balance_cpu_stop is run by cpu stopper and it
pushes running tasks off the busiest CPU onto idle target CPU. But
there is no check to see whether target cpu is offline or not before
pushing the tasks. With the introduction of active migration in the
scheduler tick path (see check_for_migration()) there have been
instances of attempts to migrate tasks to offline CPUs.
Add a check as to whether the target cpu is online or not to prevent
scheduling on offline CPUs.
Change-Id: Ib8ac7f8aeabd3ca7365f3eae977075952dab4f21
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(cherry picked from commit dc626b28ee)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Active balance currently picks one task to migrate from busy cpu to
a chosen cpu (push_cpu). This patch extends active load balance to
recognize a particular task ('push_task') that needs to be migrated to
'push_cpu'. This capability will be leveraged by HMP-aware task
placement in a subsequent patch.
Change-Id: If31320111e6cc7044e617b5c3fd6d8e0c0e16952
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
(cherry picked from commit 2da014c0d8)
[ trivial cherry-pick issues, fix "may be used uninitialized" warning
for push_task ]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
It is preferable that WALT window rollover occurs just
before a tick, since the tick is an opportune moment
to record a complete window's statistics, as well as report
those stats to the cpu frequency governor. When CONFIG_HZ
results in a TICK_NSEC that isn't a integral number, this
requirement may be violated. Account for this by reducing
the WALT window size to the nearest multiple of TICK_NSEC.
Commit d368c6faa1 ("sched: walt: fix window misalignment
when HZ=300") attempted to do this but WALT isn't using
MIN_SCHED_RAVG_WINDOW as the window size and the patch was
doing nothing.
Also, change the type of 'walt_disabled' to bool and warn
if an invalid window size causes WALT to be disabled.
Change-Id: Ie3dcfc21a3df4408254ca1165a355bbe391ed5c7
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(cherry picked from commit e79f447a97)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Energy cost estimation has been a long lasting challenge for WALT
because WALT guides CPU frequency based on the CPU utilization of
previous window. Consequently it's not possible to know newly
waking-up task's energy cost until WALT's end of the current window.
The WALT already tracks 'Previous Runnable Sum' (prev_runnable_sum)
and 'Cumulative Runnable Average' (cr_avg). They are designed for
CPU frequency guidance and task placement but unfortunately both
are not suitable for the energy cost estimation.
It's because using prev_runnable_sum for energy cost calculation would
make us to account CPU and task's energy solely based on activity in the
previous window so for example, any task didn't have an activity in the
previous window will be accounted as a 'zero energy cost' task.
Energy estimation with cr_avg is what energy_diff() relies on at present.
However cr_avg can only represent instantaneous picture of energy cost
thus for example, if a CPU was fully occupied for an entire WALT window
and became idle just before window boundary, and if there is a wake-up,
energy_diff() accounts that CPU is a 'zero energy cost' CPU.
As a result, introduce a new accounting unit 'Cumulative Window Demand'.
The cumulative window demand tracks all the tasks' demands have seen in
current window which is neither instantaneous nor actual execution time.
Because task demand represents estimated scaled execution time when the
task runs a full window, accumulation of all the demands represents
predicted CPU load at the end of window.
Thus we can estimate CPU's frequency at the end of current WALT window
with the cumulative window demand.
The use of prev_runnable_sum for the CPU frequency guidance and cr_avg
for the task placement have not changed and these are going to be used
for both purpose while this patch aims to add an additional statistics.
Change-Id: I9908c77ead9973a26dea2b36c001c2baf944d4f5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit 43bd960dfe)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
In order to set rq->misfit_task in time, call update_task_ravg() prior
to task_tick. This reduces upmigration delay by 1 scheduler window.
Change-Id: I7cc80badd423f2e7684125fbfd853b0a3610f0e8
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(cherry picked from commit ed9e749668)
[Trivial cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
At present need_active_balance() determines whether an active
upmigration is needed by using capacity_of(). A CPU's capacity
may be reduced by RT pressure, and therefore distinguishing
capability differences with capacity_of() may lead to suboptimal
active migrations to less capable CPUs. Use capacity_orig_of
to distinguish differently capable CPUs in addition to
capacity_of(), thus avoiding placing tasks on less capable CPUs
due to instantaneous RT pressure.
Change-Id: I3e1435246a8edc3ad618ef98a34866cfbd8c16a5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
[markivx: Reworked the commit text a bit]
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(cherry picked from commit 7ab48e4c8d)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
There's no need for a separate hierarchy of notifiers, APIs
and variables in walt.c for the purpose of applying frequency
and IPC invariance. Let's just use capacity_curr_of and get
rid of a lot of the infrastructure relating to capacity,
load_scale_factor etc.
Change-Id: Ia220e2c896373fa535db05bff60f9aa33aefc978
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
(cherry picked from commit be832f69a9)
[Trivial cherry pick issues]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Changes in 4.9.66
s390: fix transactional execution control register handling
s390/runtime instrumention: fix possible memory corruption
s390/disassembler: add missing end marker for e7 table
s390/disassembler: increase show_code buffer size
ACPI / EC: Fix regression related to triggering source of EC event handling
x86/mm: fix use-after-free of vma during userfaultfd fault
ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
vsock: use new wait API for vsock_stream_sendmsg()
sched: Make resched_cpu() unconditional
lib/mpi: call cond_resched() from mpi_powm() loop
x86/decoder: Add new TEST instruction pattern
x86/entry/64: Add missing irqflags tracing to native_load_gs_index()
arm64: Implement arch-specific pte_access_permitted()
ARM: 8722/1: mm: make STRICT_KERNEL_RWX effective for LPAE
ARM: 8721/1: mm: dump: check hardware RO bit for LPAE
MIPS: ralink: Fix MT7628 pinmux
MIPS: ralink: Fix typo in mt7628 pinmux function
PCI: Set Cavium ACS capability quirk flags to assert RR/CR/SV/UF
ALSA: hda: Add Raven PCI ID
dm bufio: fix integer overflow when limiting maximum cache size
dm: allocate struct mapped_device with kvzalloc
MIPS: pci: Remove KERN_WARN instance inside the mt7620 driver
dm: fix race between dm_get_from_kobject() and __dm_destroy()
MIPS: Fix odd fp register warnings with MIPS64r2
MIPS: dts: remove bogus bcm96358nb4ser.dtb from dtb-y entry
MIPS: Fix an n32 core file generation regset support regression
MIPS: BCM47XX: Fix LED inversion for WRT54GSv1
rt2x00usb: mark device removed when get ENOENT usb error
autofs: don't fail mount for transient error
nilfs2: fix race condition that causes file system corruption
eCryptfs: use after free in ecryptfs_release_messaging()
libceph: don't WARN() if user tries to add invalid key
bcache: check ca->alloc_thread initialized before wake up it
isofs: fix timestamps beyond 2027
NFS: Fix typo in nomigration mount option
nfs: Fix ugly referral attributes
NFS: Avoid RCU usage in tracepoints
nfsd: deal with revoked delegations appropriately
rtlwifi: rtl8192ee: Fix memory leak when loading firmware
rtlwifi: fix uninitialized rtlhal->last_suspend_sec time
ata: fixes kernel crash while tracing ata_eh_link_autopsy event
ext4: fix interaction between i_size, fallocate, and delalloc after a crash
ALSA: pcm: update tstamp only if audio_tstamp changed
ALSA: usb-audio: Add sanity checks to FE parser
ALSA: usb-audio: Fix potential out-of-bound access at parsing SU
ALSA: usb-audio: Add sanity checks in v2 clock parsers
ALSA: timer: Remove kernel warning at compat ioctl error paths
ALSA: hda: Fix too short HDMI/DP chmap reporting
ALSA: hda/realtek - Fix ALC700 family no sound issue
fix a page leak in vhost_scsi_iov_to_sgl() error recovery
fs/9p: Compare qid.path in v9fs_test_inode
iscsi-target: Fix non-immediate TMR reference leak
target: Fix QUEUE_FULL + SCSI task attribute handling
mtd: nand: omap2: Fix subpage write
mtd: nand: Fix writing mtdoops to nand flash.
mtd: nand: mtk: fix infinite ECC decode IRQ issue
p54: don't unregister leds when they are not initialized
block: Fix a race between blk_cleanup_queue() and timeout handling
irqchip/gic-v3: Fix ppi-partitions lookup
lockd: double unregister of inetaddr notifiers
KVM: nVMX: set IDTR and GDTR limits when loading L1 host state
KVM: SVM: obey guest PAT
SUNRPC: Fix tracepoint storage issues with svc_recv and svc_rqst_status
clk: ti: dra7-atl-clock: fix child-node lookups
libnvdimm, pfn: make 'resource' attribute only readable by root
libnvdimm, namespace: fix label initialization to use valid seq numbers
libnvdimm, namespace: make 'resource' attribute only readable by root
IB/srpt: Do not accept invalid initiator port names
IB/srp: Avoid that a cable pull can trigger a kernel crash
NFC: fix device-allocation error return
i40e: Use smp_rmb rather than read_barrier_depends
igb: Use smp_rmb rather than read_barrier_depends
igbvf: Use smp_rmb rather than read_barrier_depends
ixgbevf: Use smp_rmb rather than read_barrier_depends
i40evf: Use smp_rmb rather than read_barrier_depends
fm10k: Use smp_rmb rather than read_barrier_depends
ixgbe: Fix skb list corruption on Power systems
parisc: Fix validity check of pointer size argument in new CAS implementation
powerpc/signal: Properly handle return value from uprobe_deny_signal()
media: Don't do DMA on stack for firmware upload in the AS102 driver
media: rc: check for integer overflow
cx231xx-cards: fix NULL-deref on missing association descriptor
media: v4l2-ctrl: Fix flags field on Control events
sched/rt: Simplify the IPI based RT balancing logic
fscrypt: lock mutex before checking for bounce page pool
net/9p: Switch to wait_event_killable()
PM / OPP: Add missing of_node_put(np)
Revert "drm/i915: Do not rely on wm preservation for ILK watermarks"
e1000e: Fix error path in link detection
e1000e: Fix return value test
e1000e: Separate signaling for link check/link up
e1000e: Avoid receiver overrun interrupt bursts
RDS: make message size limit compliant with spec
RDS: RDMA: return appropriate error on rdma map failures
RDS: RDMA: fix the ib_map_mr_sg_zbva() argument
PCI: Apply _HPX settings only to relevant devices
drm/sun4i: Fix a return value in case of error
clk: sunxi-ng: A31: Fix spdif clock register
clk: sunxi-ng: fix PLL_CPUX adjusting on A33
dmaengine: zx: set DMA_CYCLIC cap_mask bit
fscrypt: use ENOKEY when file cannot be created w/o key
fscrypt: use ENOTDIR when setting encryption policy on nondirectory
net: Allow IP_MULTICAST_IF to set index to L3 slave
net: 3com: typhoon: typhoon_init_one: make return values more specific
net: 3com: typhoon: typhoon_init_one: fix incorrect return values
drm/armada: Fix compile fail
rt2800: set minimum MPDU and PSDU lengths to sane values
adm80211: return an error if adm8211_alloc_rings() fails
mwifiex: sdio: fix use after free issue for save_adapter
ath10k: fix incorrect txpower set by P2P_DEVICE interface
ath10k: ignore configuring the incorrect board_id
ath10k: fix potential memory leak in ath10k_wmi_tlv_op_pull_fw_stats()
pinctrl: sirf: atlas7: Add missing 'of_node_put()'
bnxt_en: Set default completion ring for async events.
ath10k: set CTS protection VDEV param only if VDEV is up
ALSA: hda - Apply ALC269_FIXUP_NO_SHUTUP on HDA_FIXUP_ACT_PROBE
gpio: mockup: dynamically allocate memory for chip name
drm: Apply range restriction after color adjustment when allocation
clk: qcom: ipq4019: Add all the frequencies for apss cpu
drm/mediatek: don't use drm_put_dev
mac80211: Remove invalid flag operations in mesh TSF synchronization
mac80211: Suppress NEW_PEER_CANDIDATE event if no room
adm80211: add checks for dma mapping errors
iio: light: fix improper return value
staging: iio: cdc: fix improper return value
spi: SPI_FSL_DSPI should depend on HAS_DMA
netfilter: nft_queue: use raw_smp_processor_id()
netfilter: nf_tables: fix oob access
ASoC: rsnd: don't double free kctrl
crypto: marvell - Copy IVDIG before launching partial DMA ahash requests
btrfs: return the actual error value from from btrfs_uuid_tree_iterate
ASoC: wm_adsp: Don't overrun firmware file buffer when reading region data
s390/kbuild: enable modversions for symbols exported from asm
cec: when canceling a message, don't overwrite old status info
cec: CEC_MSG_GIVE_FEATURES should abort for CEC version < 2
cec: update log_addr[] before finishing configuration
nvmet: fix KATO offset in Set Features
xen: xenbus driver must not accept invalid transaction ids
Linux 4.9.66
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 4bdced5c9a upstream.
When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.
When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:
b6366f048e ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.
Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.
This greatly simplifies the code!
The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:
rto_push_work - the irq work to process each CPU set in rto_mask
rto_lock - the lock to protect some of the other rto fields
rto_loop_start - an atomic that keeps contention down on rto_lock
the first CPU scheduling in a lower priority task
is the one to kick off the process.
rto_loop_next - an atomic that gets incremented for each CPU that
schedules in a lower priority task.
rto_loop - a variable protected by rto_lock that is used to
compare against rto_loop_next
rto_cpu - The cpu to send the next IPI to, also protected by
the rto_lock.
When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.
If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.
When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.
Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.
If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.
This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?
Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c2102e56a upstream.
The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not. This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):
o CPU1 is waiting for expedited wait to complete:
sync_rcu_exp_select_cpus
rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
IPI sent to CPU5
synchronize_sched_expedited_wait
ret = swait_event_timeout(rsp->expedited_wq,
sync_rcu_preempt_exp_done(rnp_root),
jiffies_stall);
expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())
o CPU5 handles IPI and fails to acquire rq lock.
Handles IPI
sync_sched_exp_handler
resched_cpu
returns while failing to try lock acquire rq->lock
need_resched is not set
o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
idle (schedule() is not called).
o CPU 1 reports RCU stall.
Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.
Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.9.62
adv7604: Initialize drive strength to default when using DT
video: fbdev: pmag-ba-fb: Remove bad `__init' annotation
PCI: mvebu: Handle changes to the bridge windows while enabled
sched/core: Add missing update_rq_clock() call in sched_move_task()
xen/netback: set default upper limit of tx/rx queues to 8
ARM: dts: imx53-qsb-common: fix FEC pinmux config
dt-bindings: clockgen: Add compatible string for LS1012A
EDAC, amd64: Add x86cpuid sanity check during init
PM / OPP: Error out on failing to add static OPPs for v1 bindings
clk: samsung: exynos5433: Add IDs for PHYCLK_MIPIDPHY0_* clocks
drm: drm_minor_register(): Clean up debugfs on failure
KVM: PPC: Book 3S: XICS: correct the real mode ICP rejecting counter
iommu/arm-smmu-v3: Clear prior settings when updating STEs
pinctrl: baytrail: Fix debugfs offset output
powerpc/corenet: explicitly disable the SDHC controller on kmcoge4
cxl: Force psl data-cache flush during device shutdown
ARM: omap2plus_defconfig: Fix probe errors on UARTs 5 and 6
arm64: dma-mapping: Only swizzle DMA ops for IOMMU_DOMAIN_DMA
crypto: vmx - disable preemption to enable vsx in aes_ctr.c
drm: mali-dp: fix Lx_CONTROL register fields clobber
iio: trigger: free trigger resource correctly
iio: pressure: ms5611: claim direct mode during oversampling changes
iio: magnetometer: mag3110: claim direct mode during raw writes
iio: proximity: sx9500: claim direct mode during raw proximity reads
dt-bindings: Add LEGO MINDSTORMS EV3 compatible specification
dt-bindings: Add vendor prefix for LEGO
phy: increase size of MII_BUS_ID_SIZE and bus_id
serial: sh-sci: Fix register offsets for the IRDA serial port
libertas: fix improper return value
usb: hcd: initialize hcd->flags to 0 when rm hcd
netfilter: nft_meta: deal with PACKET_LOOPBACK in netdev family
brcmfmac: setup wiphy bands after registering it first
rt2800usb: mark tx failure on timeout
apparmor: fix undefined reference to `aa_g_hash_policy'
IPsec: do not ignore crypto err in ah4 input
EDAC, amd64: Save and return err code from probe_one_instance()
s390/topology: make "topology=off" parameter work
Input: mpr121 - handle multiple bits change of status register
Input: mpr121 - set missing event capability
sched/cputime, powerpc32: Fix stale scaled stime on context switch
IB/ipoib: Change list_del to list_del_init in the tx object
ARM: dts: STiH410-family: fix wrong parent clock frequency
s390/qeth: fix retrieval of vipa and proxy-arp addresses
s390/qeth: issue STARTLAN as first IPA command
wcn36xx: Don't use the destroyed hal_mutex
IB/rxe: Fix reference leaks in memory key invalidation code
clk: mvebu: adjust AP806 CPU clock frequencies to production chip
net: dsa: select NET_SWITCHDEV
platform/x86: hp-wmi: Fix detection for dock and tablet mode
cdc_ncm: Set NTB format again after altsetting switch for Huawei devices
KEYS: trusted: sanitize all key material
KEYS: trusted: fix writing past end of buffer in trusted_read()
platform/x86: hp-wmi: Fix error value for hp_wmi_tablet_state
platform/x86: hp-wmi: Do not shadow error values
x86/uaccess, sched/preempt: Verify access_ok() context
workqueue: Fix NULL pointer dereference
crypto: ccm - preserve the IV buffer
crypto: x86/sha1-mb - fix panic due to unaligned access
crypto: x86/sha256-mb - fix panic due to unaligned access
KEYS: fix NULL pointer dereference during ASN.1 parsing [ver #2]
ARM: 8720/1: ensure dump_instr() checks addr_limit
ALSA: seq: Fix OSS sysex delivery in OSS emulation
ALSA: seq: Avoid invalid lockdep class warning
drm/i915: Do not rely on wm preservation for ILK watermarks
MIPS: microMIPS: Fix incorrect mask in insn_table_MM
MIPS: Fix CM region target definitions
MIPS: SMP: Use a completion event to signal CPU up
MIPS: Fix race on setting and getting cpu_online_mask
MIPS: SMP: Fix deadlock & online race
selftests: firmware: send expected errors to /dev/null
tools: firmware: check for distro fallback udev cancel rule
ASoC: sun4i-spdif: remove legacy dapm components
MIPS: BMIPS: Fix missing cbr address
MIPS: AR7: Defer registration of GPIO
MIPS: AR7: Ensure that serial ports are properly set up
Input: elan_i2c - add ELAN060C to the ACPI table
rbd: use GFP_NOIO for parent stat and data requests
drm/vmwgfx: Fix Ubuntu 17.10 Wayland black screen issue
drm/bridge: adv7511: Rework adv7511_power_on/off() so they can be reused internally
drm/bridge: adv7511: Reuse __adv7511_power_on/off() when probing EDID
drm/bridge: adv7511: Re-write the i2c address before EDID probing
can: sun4i: handle overrun in RX FIFO
can: ifi: Fix transmitter delay calculation
can: c_can: don't indicate triple sampling support for D_CAN
x86/smpboot: Make optimization of delay calibration work correctly
x86/oprofile/ppro: Do not use __this_cpu*() in preemptible context
Linux 4.9.62
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit cef572ad9b upstream.
When queue_work() is used in irq (not in task context), there is
a potential case that trigger NULL pointer dereference.
----------------------------------------------------------------
worker_thread()
|-spin_lock_irq()
|-process_one_work()
|-worker->current_pwq = pwq
|-spin_unlock_irq()
|-worker->current_func(work)
|-spin_lock_irq()
|-worker->current_pwq = NULL
|-spin_unlock_irq()
//interrupt here
|-irq_handler
|-__queue_work()
//assuming that the wq is draining
|-is_chained_work(wq)
|-current_wq_worker()
//Here, 'current' is the interrupted worker!
|-current->current_pwq is NULL here!
|-schedule()
----------------------------------------------------------------
Avoid it by checking for task context in current_wq_worker(), and
if not in task context, we shouldn't use the 'current' to check the
condition.
Reported-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Li Bin <huawei.libin@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8d03ecfe47 ("workqueue: reimplement is_chained_work() using current_wq_worker()")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Task demand and CPU util are in u64.
Change-Id: If7ec1623e723026d3346201122aab0303a6d2ba2
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit c8b8c92bbc)
[trivial cherry-pick issues]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Task->on_rq has three states:
0 - Task is not on runqueue (rq)
1 (TASK_ON_RQ_QUEUED) - Task is on rq
2 (TASK_ON_RQ_MIGRATING) - Task is on rq but in the
process of being migrated to another rq
When a task is moving between rqs task->on_rq state should be
TASK_ON_RQ_MIGRATING in order for WALT to account rq's cumulative
runnable average correctly. Without such state marking for all the
classes, WALT's update_history() would try to fixup task's demand
which was never contributed to any of CPUs during migration.
Change-Id: I65e74a8f176c3ed4b8577577f6da8897ecda7bb8
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[joonwoop: Reinforced changelog to explain why this is needed by WALT.
Fixed conflicts in deadline.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit 0f8791c90a99f718c43ab8214076c0c671a36667)
[fixed cherry-pick issue due to missing fab5cc59bf]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The initial window start needs to be close to ktime ns = 0 to be
aligned with scheduler tick.
Change-Id: Ia91f74efce2f910106622a054a6fcd507e763ca5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit 0caf1df0c5)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
EAS won't allow NOHZ idle balancer until CPU's over utilized. However
nohz_kick_needed() can return true. This causes idle CPU wake up for
nothing.
Change-Id: I6e548442e29e4f85cda695e4c7101dd591b12fe6
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit 3989a247e2)
[trivial cherry-pick issues]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
In order to calculate energy difference we currently iterates CPUs under
the same sched doamin to accumulate total energy cost and compare before
and after :
for_each_domain(cpu)
total_energy_before += (cpu_util * power) >> SCHED_CAPACITY_SHIFT;
for_each_domain(cpu)
total_energy_after += (cpu_util * power) >> SCHED_CAPACITY_SHIFT;
Doing such can incorrectly calculate and report abs(delta) > 0 when
there is actually no energy delta between before and after because the
same total accumulated cpu_util of all the CPUs can be distributed
differently before and after and it causes different amount of rounding
error.
Fix such incorrectness by shifting just once with accumulated
total_energy.
Change-Id: I82f1e2e358367058960938b4ef81714f57e921cf
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(moved part to another commit)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 11b618a0b2)
[trivial cherry-pick issues]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
WALT accounts two major statistics; CPU load and cumulative tasks
demand.
The CPU load which is account of accumulated each CPU's absolute
execution time is for CPU frequency guidance. Whereas cumulative
tasks demand which is each CPU's instantaneous load to reflect
CPU's load at given time is for task placement decision.
Use cumulative tasks demand for cpu_util() for task placement and
introduce cpu_util_freq() for frequency guidance.
Change-Id: Id928f01dbc8cb2a617cdadc584c1f658022565c5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit ee4cebd75e)
[removed schedfreq dependency]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
When running tasks's ravg.demand is changed update_history() adjusts
rq->cumulative_runnable_avg to reflect change of CPU load. Currently
this fixup is broken by accumulating task's new demand without
subtracting the task's old demand.
Fix the fixup logic to subtract the task's old demand.
Change-Id: I61beb32a4850879ccb39b733f5564251e465bfeb
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit 48f67ea85d)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Account cumulative runnable average for WALT CPU utilization accounting.
Change-Id: I56934894e626dec183740eeaf89a57d2ef638143
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit 26b37261ea)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Using WALT, the utilisation of a task is computed with a resolution
scaling factor that has been used inconsistently in the code with either
hardcoded values or macros (NICE_0_LOAD_SHIFT in this case). Changes in
these macros (as the 32 to 64 bits resolution shift of 2159197d66)
happened to break the utilisation calculation wherever they have been
used whilst results remained correct in other places. This commit fixes
this issue by using SCHED_CAPACITY_SCALE as resolution scaling factor
consistently.
Change-Id: Ic5418f8a5dfc455a22bafbebb4142b4665b61c6f
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
PD#154008: the log output is in disorder
The defination for a continues line in printk is much more strict
from 3.14 to 4.9:
in kernel 3.14, if the first fragment does not end with CR, and the
next fragment does not start with LOG_PREFIX(KERN_ALERT, KERN_ERR and
so on), then they are in a continues line
eg. pr_err("foo "); printk("bar\n")
or pr_err("foo "); pr_cont("bar\n"); both are printing a continues
line
in kernel 4.9, if only the first fragment does not end with CR, and the
next fragment start with LOG_CONT, then they are in a continues line
eg. pr_err("foo "); printk("bar\n"); are not printing a continues line
and pr_err("foo "); pr_cont("bar\n"); are printing a continues line
but in the code path of crash info dumping in kernel 4.9.y, not all of
the continues line printing has been switched to the 4.9 way(aka.
calling pr_cont). Only in kernel 4.13.y all of that has been updated.
so in this commit, we lose the definition of continues line back to
3.14 way to sovle current issue, and need to revert it when we sync
with upstream kernel 4.13.y
Change-Id: I64403d3a18531ceb832b41d96dff4b79a6d7fb5a
Signed-off-by: Jiamin Ma <jiamin.ma@amlogic.com>
Introduce a bpf object related check when sending and receiving files
through unix domain socket as well as binder. It checks if the receiving
process have privilege to read/write the bpf map or use the bpf program.
This check is necessary because the bpf maps and programs are using a
anonymous inode as their shared inode so the normal way of checking the
files and sockets when passing between processes cannot work properly on
eBPF object. This check only works when the BPF_SYSCALL is configured.
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-pick from net-next: f66e448cfd)
Bug: 30950746
Change-Id: I5b2cf4ccb4eab7eda91ddd7091d6aa3e7ed9f2cd
Introduce several LSM hooks for the syscalls that will allow the
userspace to access to eBPF object such as eBPF programs and eBPF maps.
The security check is aimed to enforce a per object security protection
for eBPF object so only processes with the right priviliges can
read/write to a specific map or use a specific eBPF program. Besides
that, a general security hook is added before the multiplexer of bpf
syscall to check the cmd and the attribute used for the command. The
actual security module can decide which command need to be checked and
how the cmd should be checked.
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added the LIST_HEAD_INIT call for security hooks, it nolonger exist in
uptream code.
(cherry-pick from net-next: afdb09c720)
Bug: 30950746
Change-Id: Ieb3ac74392f531735fc7c949b83346a5f587a77b
Introduce the map read/write flags to the eBPF syscalls that returns the
map fd. The flags is used to set up the file mode when construct a new
file descriptor for bpf maps. To not break the backward capability, the
f_flags is set to O_RDWR if the flag passed by syscall is 0. Otherwise
it should be O_RDONLY or O_WRONLY. When the userspace want to modify or
read the map content, it will check the file mode to see if it is
allowed to make the change.
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Deleted the file mode configuration code in unsupported map type and
removed the file mode check in non-existing helper functions.
(cherry-pick from net-next: 6e71b04a82)
Bug: 30950746
Change-Id: Icfad20f1abb77f91068d244fb0d87fa40824dd1b
We all should be using (and improving) the schedutil governor now. Get
rid of the non-upstream governor.
Tested on Hikey.
Change-Id: I2104558b03118b0a9c5f099c23c42cd9a6c2a963
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Merge in the EAS r1.4 patches from eas-dev to 4.9 common branch.
There is one patch in android-4.9-eas-dev which is not part of the 1.4
patches
ANDROID: sched/fair: Select correct capacity state for energy_diff
but we have already merged it into android-4.4 so in the interests
of keeping aligned, let's include that in the merge.
Merge Log:
* ack/android-4.9-eas-dev:
sched: EAS: Fix the condition to distinguish energy before/after
sched: EAS: update trg_cpu to backup_cpu if no energy saving for target_cpu
sched/fair: consider task utilization in group_max_util()
sched/fair: consider task utilization in group_norm_util()
sched/fair: enforce EAS mode
sched/fair: ignore backup CPU when not valid
sched/fair: trace energy_diff for non boosted tasks
UPSTREAM: sched/fair: Sync task util before slow-path wakeup
UPSTREAM: sched/core: Add missing update_rq_clock() call in set_user_nice()
UPSTREAM: sched/core: Add missing update_rq_clock() call for task_hot()
UPSTREAM: sched/core: Add missing update_rq_clock() in detach_task_cfs_rq()
UPSTREAM: sched/core: Add missing update_rq_clock() in post_init_entity_util_avg()
UPSTREAM: sched/fair: Fix task group initialization
cpufreq/sched: Consider max cpu capacity when choosing frequencies
cpufreq/sched: Use cpu max freq rather than policy max
sched/fair: remove erroneous RCU_LOCKDEP_WARN from start_cpu()
ANDROID: sched/fair: Select correct capacity state for energy_diff
UPSTREAM: sched/fair: Fix usage of find_idlest_group() when the local group is idlest
UPSTREAM: sched/fair: Fix usage of find_idlest_group() when no groups are allowed
UPSTREAM: sched/fair: Fix find_idlest_group() when local group is not allowed
UPSTREAM: sched/fair: Remove unnecessary comparison with -1
UPSTREAM: sched/fair: Move select_task_rq_fair() slow-path into its own function
UPSTREAM: sched/fair: Force balancing on NOHZ balance if local group has capacity
UPSTREAM: sched: use load_avg for selecting idlest group
UPSTREAM: sched: fix find_idlest_group for fork
Change-Id: I57bc516f9c804bfc7144a6a5bcf70572d82f7321
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Before commit 5f8b3a757d ("sched/fair: consider task utilization in
group_norm_util()"), eenv->util_delta is used to distinguish energy
before and energy after in sched_group_energy(). After that commit,
eenv->util_delta can not do that any more.
In this commit, use trg_cpu to distinguish energy before/after in
sched_group_energy().
Before apply this commit, cap_before/cap_delta is not correct:
<idle>-0 [001] 147504.608920: sched_energy_diff: pid=7 comm=rcu_preempt
src_cpu=1 dst_cpu=3 usage_delta=7 nrg_before=250 nrg_after=250 nrg_diff=0
cap_before=0 cap_after=528 cap_delta=1056 nrg_delta=0 nrg_payoff=0
After apply this commit, cap_before/cap_delta retrun to normal:
<idle>-0 [001] 220.494011: sched_energy_diff: pid=7 comm=rcu_preempt
src_cpu=1 dst_cpu=2 usage_delta=3 nrg_before=248 nrg_after=248 nrg_diff=0
cap_before=528 cap_after=528 cap_delta=0 nrg_delta=0 nrg_payoff=0
Change-Id: I7b5f7ccce56e93af7ea4e87d8e0ea6e2405f9c27
Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
(cherry picked from commit 0da783a605cd20d5a37c2a840e8a1fa641c09768)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
If no energy saving for target_cpu in the calculation of energy_diff(),
backup_cpu will be set as the new dst_cpu for the next calculation. At this
point, we also need update the new trg_cpu as backup_cpu to make sure the
subsequent calculation of energy_diff() is correct.
Change-Id: If3b35b6dc54865f1cb4b1603134102d4422227d5
Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
(cherry picked from commit b1923e22f4eca3e537e015d6ea3dce1187edea37)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The group_max_util() function is used to compute the maximum utilization
across the CPUs of a certain energy_env configuration.
Its main client is the energy_diff function when it needs to compute the
SG capacity for one of the before/after scheduling candidates.
Currently, the energy_diff function sets util_delta = 0 when it wants to
compute the energy corresponding to the scheduling candidate where the
task runs in the previous CPU. This implies that, for the task waking up
in the previous CPU we consider only its blocked load tracked by the CPU
RQ. However, in case of a medium-big task which is waking up on a long
time idle CPU, this blocked load can be already completely decayed.
More in general, the current approach is biased towards under-estimating
the capacity requirements for the "before" scheduling candidate.
This patch fixes this by:
- always use the cpu_util_wake() to properly get the utilization of a CPU
without any (partially decayed) contribution of the waking up task
- adding the task utilization to the cpu_util_wake just for the target
cpu
The "target CPU" is defined by the energy_env to be either the src_cpu or
the dst_cpu, depending on which scheduling candidate we are considering.
Finally, since this update removes the last usage of calc_util_delta()
this function is now safely removed.
Change-Id: I20ee1bcf40cee6bf6e265fb2d32ef79061ad6ced
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 52d70152fade678e304683bd1a5842af95e83558)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The group_norm_util() function is used to compute the normalized
utilization of a SG given a certain energy_env configuration.
The main client of this function is the energy_diff function when it
comes to compute the SG energy for one of the before/after scheduling
candidates.
Currently, the energy_diff function sets util_delta = 0 when it wants to
compute the energy corresponding to the scheduling candidate where the
task runs in the previous CPU. This implies that, for the task waking up
in the previous CPU we consider only its blocked load tracked by the CPU
RQ. However, in case of a medium-big task which is waking up on a long
time idle CPU, this blocked load can be already completely decayed.
More in general, the current approach is biased towards under-estimating
the energy consumption for the "before" scheduling candidate.
This patch fixes this by:
- always use the cpu_util_wake() to properly get the utilization of a CPU
without any (partially decayed) contribution of the waking up task
- adding the task utilization to the cpu_util_wake just for the
target cpu
The "target CPU" is defined by the energy_env to be either the src_cpu
or the dst_cpu, depending on which scheduling candidate we are
considering.
This patch update also the definition of __cpu_norm_util(), which is
currently called just by the group_norm_util() function. This allows to
simplify the code by using this function just to normalize a specified
utilization with respect to a given capacity.
This update allows to completely remove any dependency of
group_norm_util() from calc_util_delta().
Change-Id: I3b6ec50ce8decb1521faae660e326ab3319d3c82
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit ef34ea830347ca175ba8a1baf05357ed98d5728c)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
For non latency sensitive tasks the goal is to optimize for energy efficiency.
Thus, we should try our best to avoid moving a task on a CPU which is then
going to be marked as overutilized.
Let's use the capacity_margin metric to verify if a candidate target CPU
should be considered without risking to bail out of EAS mode.
Change-Id: Ib3697106f4073aedf4a6c6ce42bd5d000fa8c007
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit f95753da4ba7e1c1eee0500ffe41b3e5fa68b347)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The find_best_target can sometimes not return a valid backup CPU, either
because it cannot find one or just becasue it returns prev_cpu as a backup.
In these cases we should skip the energy_diff evaluation for the backup CPU.
Change-Id: I3787dbdfe74122348dd7a7485b88c4679051bd32
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit d9bcf5b88594a0225b51878236e49305f272eadc)
[trivial cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
In systems where SchedTune is enabled, we do not report energy diff for non
boosted tasks. Let's fix this by always genereting an energy_diff event where
however:
nrg.delta = 0, since we skip energy normalization
payoff = nrg.diff, since the payoff is defined just by the energy difference
Change-Id: I9a11ec19b6f56da04147f5ae5b47daf1dd180445
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 13e2e3c7f7d09f619444631a0962fd8020660b8d)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
We use task_util() in find_idlest_group() via capacity_spare_wake().
This task_util() updated in wake_cap(). However wake_cap() is not the
only reason for ending up in find_idlest_group() - we could have been sent
there by wake_wide(). So explicitly sync the task util with prev_cpu
when we are about to head to find_idlest_group().
We could simply do this at the beginning of
select_task_rq_fair() (i.e. irrespective of whether we're heading to
select_idle_sibling() or find_idlest_group() & co), but I didn't want to
slow down the select_idle_sibling() path more than necessary.
Don't do this during fork balancing, we won't need the task_util and
we'd just clobber the last_update_time, which is supposed to be 0.
Change-Id: I56113d8d67cf338f3fb1a692422289cf659399a3
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andres Oportus <andresoportus@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ea16f0ea6c in tip:sched/core)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Add the update_rq_clock() call at the top of the callstack instead of
at the bottom where we find it missing, this to aid later effort to
minimize the number of update_rq_lock() calls.
WARNING: CPU: 30 PID: 194 at ../kernel/sched/sched.h:797 assert_clock_updated()
rq->clock_update_flags < RQCF_ACT_SKIP
Call Trace:
dump_stack()
__warn()
warn_slowpath_fmt()
assert_clock_updated.isra.63.part.64()
can_migrate_task()
load_balance()
pick_next_task_fair()
__schedule()
schedule()
worker_thread()
kthread()
Change-Id: I17716141789f9fac709495951cd2e079cf49d6d8
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3bed5e2166)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Instead of adding the update_rq_clock() all the way at the bottom of
the callstack, add one at the top, this to aid later effort to
minimize update_rq_lock() calls.
WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
rq->clock_update_flags < RQCF_ACT_SKIP
Call Trace:
dump_stack()
__warn()
warn_slowpath_fmt()
detach_task_cfs_rq()
switched_from_fair()
__sched_setscheduler()
_sched_setscheduler()
sched_set_stop_task()
cpu_stop_create()
__smpboot_create_thread.part.2()
smpboot_register_percpu_thread_cpumask()
cpu_stop_init()
do_one_initcall()
? print_cpu_info()
kernel_init_freeable()
? rest_init()
kernel_init()
ret_from_fork()
Change-Id: Iee08c2ed3303ae8f0c527658f13646b02a412cad
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 80f5c1b84b)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>