Commit Graph

114 Commits

Author SHA1 Message Date
Shile Zhang
6cd73be5b6 sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
[ Upstream commit 975e155ed8 ]

We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit:

  ce0dbbbb30 ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice")

... which name suggests to users that it's in milliseconds, while in reality
it's being set in milliseconds but the result is shown in jiffies.

This is obviously confusing when HZ is not 1000, it makes it appear like the
value set failed, such as HZ=100:

  root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms
  root# cat /proc/sys/kernel/sched_rr_timeslice_ms
  10

Fix this to be milliseconds all around.

Signed-off-by: Shile Zhang <shile.zhang@nokia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-16 08:29:07 +09:00
Greg Kroah-Hartman
9797dcb8c7 Merge 4.9.104 into android-4.9
Changes in 4.9.104
	MIPS: c-r4k: Fix data corruption related to cache coherence
	MIPS: ptrace: Expose FIR register through FP regset
	MIPS: Fix ptrace(2) PTRACE_PEEKUSR and PTRACE_POKEUSR accesses to o32 FGRs
	KVM: Fix spelling mistake: "cop_unsuable" -> "cop_unusable"
	affs_lookup(): close a race with affs_remove_link()
	aio: fix io_destroy(2) vs. lookup_ioctx() race
	ALSA: timer: Fix pause event notification
	do d_instantiate/unlock_new_inode combinations safely
	mmc: sdhci-iproc: remove hard coded mmc cap 1.8v
	mmc: sdhci-iproc: fix 32bit writes for TRANSFER_MODE register
	libata: Blacklist some Sandisk SSDs for NCQ
	libata: blacklist Micron 500IT SSD with MU01 firmware
	xen-swiotlb: fix the check condition for xen_swiotlb_free_coherent
	drm/vmwgfx: Fix 32-bit VMW_PORT_HB_[IN|OUT] macros
	IB/hfi1: Use after free race condition in send context error path
	Revert "ipc/shm: Fix shmat mmap nil-page protection"
	ipc/shm: fix shmat() nil address after round-down when remapping
	kasan: fix memory hotplug during boot
	kernel/sys.c: fix potential Spectre v1 issue
	kernel/signal.c: avoid undefined behaviour in kill_something_info
	KVM/VMX: Expose SSBD properly to guests
	KVM: s390: vsie: fix < 8k check for the itdba
	KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed
	kvm: x86: IA32_ARCH_CAPABILITIES is always supported
	firewire-ohci: work around oversized DMA reads on JMicron controllers
	x86/tsc: Allow TSC calibration without PIT
	NFSv4: always set NFS_LOCK_LOST when a lock is lost.
	ALSA: hda - Use IS_REACHABLE() for dependency on input
	kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl
	netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460
	tracing/hrtimer: Fix tracing bugs by taking all clock bases and modes into account
	PCI: Add function 1 DMA alias quirk for Marvell 9128
	Input: psmouse - fix Synaptics detection when protocol is disabled
	i40iw: Zero-out consumer key on allocate stag for FMR
	tools lib traceevent: Simplify pointer print logic and fix %pF
	perf callchain: Fix attr.sample_max_stack setting
	tools lib traceevent: Fix get_field_str() for dynamic strings
	perf record: Fix failed memory allocation for get_cpuid_str
	iommu/vt-d: Use domain instead of cache fetching
	dm thin: fix documentation relative to low water mark threshold
	net: stmmac: dwmac-meson8b: fix setting the RGMII TX clock on Meson8b
	net: stmmac: dwmac-meson8b: propagate rate changes to the parent clock
	nfs: Do not convert nfs_idmap_cache_timeout to jiffies
	watchdog: sp5100_tco: Fix watchdog disable bit
	kconfig: Don't leak main menus during parsing
	kconfig: Fix automatic menu creation mem leak
	kconfig: Fix expr_free() E_NOT leak
	mac80211_hwsim: fix possible memory leak in hwsim_new_radio_nl()
	ipmi/powernv: Fix error return code in ipmi_powernv_probe()
	Btrfs: set plug for fsync
	btrfs: Fix out of bounds access in btrfs_search_slot
	Btrfs: fix scrub to repair raid6 corruption
	btrfs: fail mount when sb flag is not in BTRFS_SUPER_FLAG_SUPP
	HID: roccat: prevent an out of bounds read in kovaplus_profile_activated()
	fm10k: fix "failed to kill vid" message for VF
	device property: Define type of PROPERTY_ENRTY_*() macros
	jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
	powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
	powerpc/numa: Ensure nodes initialized for hotplug
	RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure
	ntb_transport: Fix bug with max_mw_size parameter
	gianfar: prevent integer wrapping in the rx handler
	tcp_nv: fix potential integer overflow in tcpnv_acked
	kvm: Map PFN-type memory regions as writable (if possible)
	ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid
	ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute
	ocfs2: return error when we attempt to access a dirty bh in jbd2
	mm/mempolicy: fix the check of nodemask from user
	mm/mempolicy: add nodes_empty check in SYSC_migrate_pages
	asm-generic: provide generic_pmdp_establish()
	sparc64: update pmdp_invalidate() to return old pmd value
	mm: thp: use down_read_trylock() in khugepaged to avoid long block
	mm: pin address_space before dereferencing it while isolating an LRU page
	mm/fadvise: discard partial page if endbyte is also EOF
	openvswitch: Remove padding from packet before L3+ conntrack processing
	IB/ipoib: Fix for potential no-carrier state
	drm/nouveau/pmu/fuc: don't use movw directly anymore
	netfilter: ipv6: nf_defrag: Kill frag queue on RFC2460 failure
	x86/power: Fix swsusp_arch_resume prototype
	firmware: dmi_scan: Fix handling of empty DMI strings
	ACPI: processor_perflib: Do not send _PPC change notification if not ready
	ACPI / scan: Use acpi_bus_get_status() to initialize ACPI_TYPE_DEVICE devs
	bpf: fix selftests/bpf test_kmod.sh failure when CONFIG_BPF_JIT_ALWAYS_ON=y
	MIPS: generic: Fix machine compatible matching
	MIPS: TXx9: use IS_BUILTIN() for CONFIG_LEDS_CLASS
	xen-netfront: Fix race between device setup and open
	xen/grant-table: Use put_page instead of free_page
	RDS: IB: Fix null pointer issue
	arm64: spinlock: Fix theoretical trylock() A-B-A with LSE atomics
	proc: fix /proc/*/map_files lookup
	cifs: silence compiler warnings showing up with gcc-8.0.0
	bcache: properly set task state in bch_writeback_thread()
	bcache: fix for allocator and register thread race
	bcache: fix for data collapse after re-attaching an attached device
	bcache: return attach error when no cache set exist
	tools/libbpf: handle issues with bpf ELF objects containing .eh_frames
	bpf: fix rlimit in reuseport net selftest
	vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page
	locking/qspinlock: Ensure node->count is updated before initialising node
	irqchip/gic-v3: Ignore disabled ITS nodes
	cpumask: Make for_each_cpu_wrap() available on UP as well
	irqchip/gic-v3: Change pr_debug message to pr_devel
	ARC: Fix malformed ARC_EMUL_UNALIGNED default
	ptr_ring: prevent integer overflow when calculating size
	libata: Fix compile warning with ATA_DEBUG enabled
	selftests: pstore: Adding config fragment CONFIG_PSTORE_RAM=m
	selftests: memfd: add config fragment for fuse
	ARM: OMAP2+: timer: fix a kmemleak caused in omap_get_timer_dt
	ARM: OMAP3: Fix prm wake interrupt for resume
	ARM: OMAP1: clock: Fix debugfs_create_*() usage
	ibmvnic: Free RX socket buffer in case of adapter error
	iwlwifi: mvm: fix security bug in PN checking
	iwlwifi: mvm: always init rs with 20mhz bandwidth rates
	NFC: llcp: Limit size of SDP URI
	rxrpc: Work around usercopy check
	mac80211: round IEEE80211_TX_STATUS_HEADROOM up to multiple of 4
	mac80211: fix a possible leak of station stats
	mac80211: fix calling sleeping function in atomic context
	mac80211: Do not disconnect on invalid operating class
	md raid10: fix NULL deference in handle_write_completed()
	drm/exynos: g2d: use monotonic timestamps
	drm/exynos: fix comparison to bitshift when dealing with a mask
	locking/xchg/alpha: Add unconditional memory barrier to cmpxchg()
	md: raid5: avoid string overflow warning
	kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
	powerpc/bpf/jit: Fix 32-bit JIT for seccomp_data access
	s390/cio: fix ccw_device_start_timeout API
	s390/cio: fix return code after missing interrupt
	s390/cio: clear timer when terminating driver I/O
	PKCS#7: fix direct verification of SignerInfo signature
	ARM: OMAP: Fix dmtimer init for omap1
	smsc75xx: fix smsc75xx_set_features()
	regulatory: add NUL to request alpha2
	integrity/security: fix digsig.c build error with header file
	locking/xchg/alpha: Fix xchg() and cmpxchg() memory ordering bugs
	x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
	mac80211: drop frames with unexpected DS bits from fast-rx to slow path
	arm64: fix unwind_frame() for filtered out fn for function graph tracing
	macvlan: fix use-after-free in macvlan_common_newlink()
	kvm: fix warning for CONFIG_HAVE_KVM_EVENTFD builds
	fs: dcache: Avoid livelock between d_alloc_parallel and __d_add
	fs: dcache: Use READ_ONCE when accessing i_dir_seq
	md: fix a potential deadlock of raid5/raid10 reshape
	md/raid1: fix NULL pointer dereference
	batman-adv: fix packet checksum in receive path
	batman-adv: invalidate checksum on fragment reassembly
	netfilter: ebtables: convert BUG_ONs to WARN_ONs
	batman-adv: Ignore invalid batadv_iv_gw during netlink send
	batman-adv: Ignore invalid batadv_v_gw during netlink send
	batman-adv: Fix netlink dumping of BLA claims
	batman-adv: Fix netlink dumping of BLA backbones
	nvme-pci: Fix nvme queue cleanup if IRQ setup fails
	clocksource/drivers/fsl_ftm_timer: Fix error return checking
	ceph: fix dentry leak when failing to init debugfs
	ARM: orion5x: Revert commit 4904dbda41.
	qrtr: add MODULE_ALIAS macro to smd
	r8152: fix tx packets accounting
	virtio-gpu: fix ioctl and expose the fixed status to userspace.
	dmaengine: rcar-dmac: fix max_chunk_size for R-Car Gen3
	bcache: fix kcrashes with fio in RAID5 backend dev
	ip6_tunnel: fix IFLA_MTU ignored on NEWLINK
	sit: fix IFLA_MTU ignored on NEWLINK
	ARM: dts: NSP: Fix amount of RAM on BCM958625HR
	powerpc/boot: Fix random libfdt related build errors
	gianfar: Fix Rx byte accounting for ndev stats
	net/tcp/illinois: replace broken algorithm reference link
	nvmet: fix PSDT field check in command format
	xen/pirq: fix error path cleanup when binding MSIs
	drm/sun4i: Fix dclk_set_phase
	Btrfs: send, fix issuing write op when processing hole in no data mode
	selftests/powerpc: Skip the subpage_prot tests if the syscall is unavailable
	KVM: PPC: Book3S HV: Fix VRMA initialization with 2MB or 1GB memory backing
	iwlwifi: mvm: fix TX of CCMP 256
	watchdog: f71808e_wdt: Fix magic close handling
	watchdog: sbsa: use 32-bit read for WCV
	batman-adv: Fix multicast packet loss with a single WANT_ALL_IPV4/6 flag
	e1000e: Fix check_for_link return value with autoneg off
	e1000e: allocate ring descriptors with dma_zalloc_coherent
	ia64/err-inject: Use get_user_pages_fast()
	RDMA/qedr: Fix kernel panic when running fio over NFSoRDMA
	RDMA/qedr: Fix iWARP write and send with immediate
	IB/mlx4: Fix corruption of RoCEv2 IPv4 GIDs
	IB/mlx4: Include GID type when deleting GIDs from HW table under RoCE
	IB/mlx5: Fix an error code in __mlx5_ib_modify_qp()
	fbdev: Fixing arbitrary kernel leak in case FBIOGETCMAP_SPARC in sbusfb_ioctl_helper().
	fsl/fman: avoid sleeping in atomic context while adding an address
	net: qcom/emac: Use proper free methods during TX
	net: smsc911x: Fix unload crash when link is up
	IB/core: Fix possible crash to access NULL netdev
	xen: xenbus: use put_device() instead of kfree()
	arm64: Relax ARM_SMCCC_ARCH_WORKAROUND_1 discovery
	dmaengine: mv_xor_v2: Fix clock resource by adding a register clock
	netfilter: ebtables: fix erroneous reject of last rule
	bnxt_en: Check valid VNIC ID in bnxt_hwrm_vnic_set_tpa().
	workqueue: use put_device() instead of kfree()
	ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu
	sunvnet: does not support GSO for sctp
	drm/imx: move arming of the vblank event to atomic_flush
	net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off
	batman-adv: fix header size check in batadv_dbg_arp()
	batman-adv: Fix skbuff rcsum on packet reroute
	vti4: Don't count header length twice on tunnel setup
	vti4: Don't override MTU passed on link creation via IFLA_MTU
	perf/cgroup: Fix child event counting bug
	brcmfmac: Fix check for ISO3166 code
	kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races
	RDMA/ucma: Correct option size check using optlen
	RDMA/qedr: fix QP's ack timeout configuration
	RDMA/qedr: Fix rc initialization on CNQ allocation failure
	mm/mempolicy.c: avoid use uninitialized preferred_node
	mm, thp: do not cause memcg oom for thp
	selftests: ftrace: Add probe event argument syntax testcase
	selftests: ftrace: Add a testcase for string type with kprobe_event
	selftests: ftrace: Add a testcase for probepoint
	batman-adv: fix multicast-via-unicast transmission with AP isolation
	batman-adv: fix packet loss for broadcasted DHCP packets to a server
	ARM: 8748/1: mm: Define vdso_start, vdso_end as array
	net: qmi_wwan: add BroadMobi BM806U 2020:2033
	perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs
	llc: properly handle dev_queue_xmit() return value
	builddeb: Fix header package regarding dtc source links
	mm/kmemleak.c: wait for scan completion before disabling free
	net: Fix untag for vlan packets without ethernet header
	net: mvneta: fix enable of all initialized RXQs
	sh: fix debug trap failure to process signals before return to user
	nvme: don't send keep-alives to the discovery controller
	x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
	x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init
	fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl table
	swap: divide-by-zero when zero length swap file on ssd
	sr: get/drop reference to device in revalidate and check_events
	Force log to disk before reading the AGF during a fstrim
	cpufreq: CPPC: Initialize shared perf capabilities of CPUs
	dp83640: Ensure against premature access to PHY registers after reset
	mm/ksm: fix interaction with THP
	mm: fix races between address_space dereference and free in page_evicatable
	Btrfs: bail out on error during replay_dir_deletes
	Btrfs: fix NULL pointer dereference in log_dir_items
	btrfs: Fix possible softlock on single core machines
	ocfs2/dlm: don't handle migrate lockres if already in shutdown
	sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
	KVM: VMX: raise internal error for exception during invalid protected mode state
	fscache: Fix hanging wait on page discarded by writeback
	sparc64: Make atomic_xchg() an inline function rather than a macro.
	net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
	btrfs: tests/qgroup: Fix wrong tree backref level
	Btrfs: fix copy_items() return value when logging an inode
	btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers
	rxrpc: Fix Tx ring annotation after initial Tx failure
	rxrpc: Don't treat call aborts as conn aborts
	xen/acpi: off by one in read_acpi_id()
	drivers: macintosh: rack-meter: really fix bogus memsets
	ACPI: acpi_pad: Fix memory leak in power saving threads
	powerpc/mpic: Check if cpu_possible() in mpic_physmask()
	m68k: set dma and coherent masks for platform FEC ethernets
	parisc/pci: Switch LBA PCI bus from Hard Fail to Soft Fail mode
	hwmon: (nct6775) Fix writing pwmX_mode
	powerpc/perf: Prevent kernel address leak to userspace via BHRB buffer
	powerpc/perf: Fix kernel address leak via sampling registers
	tools/thermal: tmon: fix for segfault
	selftests: Print the test we're running to /dev/kmsg
	net/mlx5: Protect from command bit overflow
	ath10k: Fix kernel panic while using worker (ath10k_sta_rc_update_wk)
	cxgb4: Setup FW queues before registering netdev
	ima: Fallback to the builtin hash algorithm
	virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
	arm: dts: socfpga: fix GIC PPI warning
	cpufreq: cppc_cpufreq: Fix cppc_cpufreq_init() failure path
	zorro: Set up z->dev.dma_mask for the DMA API
	bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
	ACPICA: Events: add a return on failure from acpi_hw_register_read
	ACPICA: acpi: acpica: fix acpi operand cache leak in nseval.c
	cxgb4: Fix queue free path of ULD drivers
	i2c: mv64xxx: Apply errata delay only in standard mode
	KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use
	perf top: Fix top.call-graph config option reading
	perf stat: Fix core dump when flag T is used
	IB/core: Honor port_num while resolving GID for IB link layer
	regulator: gpio: Fix some error handling paths in 'gpio_regulator_probe()'
	spi: bcm-qspi: fIX some error handling paths
	MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
	PCI: Restore config space on runtime resume despite being unbound
	ipmi_ssif: Fix kernel panic at msg_done_handler
	powerpc: Add missing prototype for arch_irq_work_raise()
	f2fs: fix to check extent cache in f2fs_drop_extent_tree
	perf/core: Fix perf_output_read_group()
	drm/panel: simple: Fix the bus format for the Ontat panel
	hwmon: (pmbus/max8688) Accept negative page register values
	hwmon: (pmbus/adm1275) Accept negative page register values
	perf/x86/intel: Properly save/restore the PMU state in the NMI handler
	cdrom: do not call check_disk_change() inside cdrom_open()
	perf/x86/intel: Fix large period handling on Broadwell CPUs
	perf/x86/intel: Fix event update for auto-reload
	arm64: dts: qcom: Fix SPI5 config on MSM8996
	soc: qcom: wcnss_ctrl: Fix increment in NV upload
	gfs2: Fix fallocate chunk size
	x86/devicetree: Initialize device tree before using it
	x86/devicetree: Fix device IRQ settings in DT
	ALSA: vmaster: Propagate slave error
	dmaengine: pl330: fix a race condition in case of threaded irqs
	dmaengine: rcar-dmac: Check the done lists in rcar_dmac_chan_get_residue()
	enic: enable rq before updating rq descriptors
	hwrng: stm32 - add reset during probe
	dmaengine: qcom: bam_dma: get num-channels and num-ees from dt
	net: stmmac: ensure that the device has released ownership before reading data
	net: stmmac: ensure that the MSS desc is the last desc to set the own bit
	cpufreq: Reorder cpufreq_online() error code path
	PCI: Add function 1 DMA alias quirk for Marvell 88SE9220
	udf: Provide saner default for invalid uid / gid
	ARM: dts: bcm283x: Fix probing of bcm2835-i2s
	audit: return on memory error to avoid null pointer dereference
	rcu: Call touch_nmi_watchdog() while printing stall warnings
	pinctrl: sh-pfc: r8a7796: Fix MOD_SEL register pin assignment for SSI pins group
	MIPS: Octeon: Fix logging messages with spurious periods after newlines
	drm/rockchip: Respect page offset for PRIME mmap calls
	x86/apic: Set up through-local-APIC mode on the boot CPU if 'noapic' specified
	perf tests: Use arch__compare_symbol_names to compare symbols
	perf report: Fix memory corruption in --branch-history mode --branch-history
	selftests/net: fixes psock_fanout eBPF test case
	netlabel: If PF_INET6, check sk_buff ip header version
	regmap: Correct comparison in regmap_cached
	ARM: dts: imx7d: cl-som-imx7: fix pinctrl_enet
	ARM: dts: porter: Fix HDMI output routing
	regulator: of: Add a missing 'of_node_put()' in an error handling path of 'of_regulator_match()'
	pinctrl: msm: Use dynamic GPIO numbering
	kdb: make "mdr" command repeat
	Linux 4.9.104

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-05-30 13:19:56 +02:00
Davidlohr Bueso
3d0632557e sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning
[ Upstream commit d29a20645d ]

While running rt-tests' pi_stress program I got the following splat:

  rq->clock_update_flags < RQCF_ACT_SKIP
  WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20

  [...]

  <IRQ>
  enqueue_top_rt_rq+0xf4/0x150
  ? cpufreq_dbs_governor_start+0x170/0x170
  sched_rt_rq_enqueue+0x65/0x80
  sched_rt_period_timer+0x156/0x360
  ? sched_rt_rq_enqueue+0x80/0x80
  __hrtimer_run_queues+0xfa/0x260
  hrtimer_interrupt+0xcb/0x220
  smp_apic_timer_interrupt+0x62/0x120
  apic_timer_interrupt+0xf/0x20
  </IRQ>

  [...]

  do_idle+0x183/0x1e0
  cpu_startup_entry+0x5f/0x70
  start_secondary+0x192/0x1d0
  secondary_startup_64+0xa5/0xb0

We can get rid of it be the "traditional" means of adding an
update_rq_clock() call after acquiring the rq->lock in
do_sched_rt_period_timer().

The case for the RT task throttling (which this workload also hits)
can be ignored in that the skip_update call is actually bogus and
quite the contrary (the request bits are removed/reverted).

By setting RQCF_UPDATED we really don't care if the skip is happening
or not and will therefore make the assert_clock_updated() check happy.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: linux-kernel@vger.kernel.org
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30 07:50:41 +02:00
Greg Kroah-Hartman
960923fdc2 Merge 4.9.89 into android-4.9
Changes in 4.9.89
	blkcg: fix double free of new_blkg in blkcg_init_queue
	Input: tsc2007 - check for presence and power down tsc2007 during probe
	perf stat: Issue a HW watchdog disable hint
	staging: speakup: Replace BUG_ON() with WARN_ON().
	staging: wilc1000: add check for kmalloc allocation failure.
	HID: reject input outside logical range only if null state is set
	drm: qxl: Don't alloc fbdev if emulation is not supported
	ARM: dts: r8a7791: Remove unit-address and reg from integrated cache
	ARM: dts: r8a7792: Remove unit-address and reg from integrated cache
	ARM: dts: r8a7793: Remove unit-address and reg from integrated cache
	ARM: dts: r8a7794: Remove unit-address and reg from integrated cache
	arm64: dts: r8a7796: Remove unit-address and reg from integrated cache
	drm/sun4i: Fix up error path cleanup for master bind function
	drm/sun4i: Set drm_crtc.port to the underlying TCON's output port node
	ath10k: fix a warning during channel switch with multiple vaps
	drm/sun4i: Fix TCON clock and regmap initialization sequence
	PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()
	selinux: check for address length in selinux_socket_bind()
	x86/mm: Make mmap(MAP_32BIT) work correctly
	perf sort: Fix segfault with basic block 'cycles' sort dimension
	x86/mce: Handle broadcasted MCE gracefully with kexec
	eventpoll.h: fix epoll event masks
	i40e: Acquire NVM lock before reads on all devices
	i40e: fix ethtool to get EEPROM data from X722 interface
	perf tools: Make perf_event__synthesize_mmap_events() scale
	ARM: brcmstb: Enable ZONE_DMA for non 64-bit capable peripherals
	drivers: net: xgene: Fix hardware checksum setting
	drivers: net: phy: xgene: Fix mdio write
	drivers: net: xgene: Fix wrong logical operation
	drivers: net: xgene: Fix Rx checksum validation logic
	drm: Defer disabling the vblank IRQ until the next interrupt (for instant-off)
	ath10k: disallow DFS simulation if DFS channel is not enabled
	ath10k: fix fetching channel during potential radar detection
	usb: misc: lvs: fix race condition in disconnect handling
	ARM: bcm2835: Enable missing CMA settings for VC4 driver
	net: ethernet: bgmac: Allow MAC address to be specified in DTB
	netem: apply correct delay when rate throttling
	x86/mce: Init some CPU features early
	omapfb: dss: Handle return errors in dss_init_ports()
	perf probe: Fix concat_probe_trace_events
	perf probe: Return errno when not hitting any event
	HID: clamp input to logical range if no null state
	net/8021q: create device with all possible features in wanted_features
	ARM: dts: Adjust moxart IRQ controller and flags
	qed: Always publish VF link from leading hwfn
	s390/topology: fix typo in early topology code
	zd1211rw: fix NULL-deref at probe
	batman-adv: handle race condition for claims between gateways
	of: fix of_device_get_modalias returned length when truncating buffers
	solo6x10: release vb2 buffers in solo_stop_streaming()
	x86/boot/32: Defer resyncing initial_page_table until per-cpu is set up
	scsi: fnic: Fix for "Number of Active IOs" in fnicstats becoming negative
	scsi: ipr: Fix missed EH wakeup
	media: i2c/soc_camera: fix ov6650 sensor getting wrong clock
	timers, sched_clock: Update timeout for clock wrap
	sysrq: Reset the watchdog timers while displaying high-resolution timers
	Input: qt1070 - add OF device ID table
	sched: act_csum: don't mangle TCP and UDP GSO packets
	PCI: hv: Properly handle PCI bus remove
	PCI: hv: Lock PCI bus on device eject
	ASoC: rcar: ssi: don't set SSICR.CKDV = 000 with SSIWSR.CONT
	spi: omap2-mcspi: poll OMAP2_MCSPI_CHSTAT_RXS for PIO transfer
	tcp: sysctl: Fix a race to avoid unexpected 0 window from space
	dmaengine: imx-sdma: add 1ms delay to ensure SDMA channel is stopped
	usb: dwc3: make sure UX_EXIT_PX is cleared
	ARM: dts: bcm2835: add index to the ethernet alias
	perf annotate: Fix a bug following symbolic link of a build-id file
	perf buildid: Do not assume that readlink() returns a null terminated string
	i40e/i40evf: Fix use after free in Rx cleanup path
	scsi: be2iscsi: Check tag in beiscsi_mccq_compl_wait
	driver: (adm1275) set the m,b and R coefficients correctly for power
	bonding: make speed, duplex setting consistent with link state
	mm: Fix false-positive VM_BUG_ON() in page_cache_{get,add}_speculative()
	ALSA: firewire-lib: add a quirk of packet without valid EOH in CIP format
	ARM: dts: r8a7794: Add DU1 clock to device tree
	ARM: dts: r8a7794: Correct clock of DU1
	ARM: dts: silk: Correct clock of DU1
	blk-throttle: make sure expire time isn't too big
	regulator: core: Limit propagation of parent voltage count and list
	perf trace: Handle unpaired raw_syscalls:sys_exit event
	f2fs: relax node version check for victim data in gc
	drm/ttm: never add BO that failed to validate to the LRU list
	bonding: refine bond_fold_stats() wrap detection
	PCI: Apply Cavium ACS quirk only to CN81xx/CN83xx/CN88xx devices
	powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
	braille-console: Fix value returned by _braille_console_setup
	drm/vmwgfx: Fixes to vmwgfx_fb
	vxlan: vxlan dev should inherit lowerdev's gso_max_size
	NFC: nfcmrvl: Include unaligned.h instead of access_ok.h
	NFC: nfcmrvl: double free on error path
	NFC: pn533: change order of free_irq and dev unregistration
	ARM: dts: r7s72100: fix ethernet clock parent
	ARM: dts: r8a7790: Correct parent of SSI[0-9] clocks
	ARM: dts: r8a7791: Correct parent of SSI[0-9] clocks
	ARM: dts: r8a7793: Correct parent of SSI[0-9] clocks
	powerpc: Avoid taking a data miss on every userspace instruction miss
	net: hns: Correct HNS RSS key set function
	net/faraday: Add missing include of of.h
	qed: Fix TM block ILT allocation
	rtmutex: Fix PI chain order integrity
	printk: Correctly handle preemption in console_unlock()
	drm: rcar-du: Handle event when disabling CRTCs
	ARM: dts: koelsch: Correct clock frequency of X2 DU clock input
	reiserfs: Make cancel_old_flush() reliable
	ASoC: rt5677: Add OF device ID table
	IB/hfi1: Check for QSFP presence before attempting reads
	ALSA: firewire-digi00x: add support for console models of Digi00x series
	ALSA: firewire-digi00x: handle all MIDI messages on streaming packets
	fm10k: correctly check if interface is removed
	EDAC, altera: Fix peripheral warnings for Cyclone5
	scsi: ses: don't get power status of SES device slot on probe
	qed: Correct MSI-x for storage
	apparmor: Make path_max parameter readonly
	iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range
	kvm/svm: Setup MCG_CAP on AMD properly
	kvm: nVMX: Disallow userspace-injected exceptions in guest mode
	video: ARM CLCD: fix dma allocation size
	drm/radeon: Fail fb creation from imported dma-bufs.
	drm/amdgpu: Fail fb creation from imported dma-bufs. (v2)
	drm/rockchip: vop: Enable pm domain before vop_initial
	i40e: only register client on iWarp-capable devices
	coresight: Fixes coresight DT parse to get correct output port ID.
	lkdtm: turn off kcov for lkdtm_rodata_do_nothing:
	tty: amba-pl011: Fix spurious TX interrupts
	serial: imx: setup DCEDTE early and ensure DCD and RI irqs to be off
	MIPS: BPF: Quit clobbering callee saved registers in JIT code.
	MIPS: BPF: Fix multiple problems in JIT skb access helpers.
	MIPS: r2-on-r6-emu: Fix BLEZL and BGTZL identification
	MIPS: r2-on-r6-emu: Clear BLTZALL and BGEZALL debugfs counters
	v4l: vsp1: Prevent multiple streamon race commencing pipeline early
	v4l: vsp1: Register pipe with output WPF
	regulator: isl9305: fix array size
	md/raid6: Fix anomily when recovering a single device in RAID6.
	md.c:didn't unlock the mddev before return EINVAL in array_size_store
	powerpc/nohash: Fix use of mmu_has_feature() in setup_initial_memory_limit()
	usb: dwc2: Make sure we disconnect the gadget state
	usb: gadget: dummy_hcd: Fix wrong power status bit clear/reset in dummy_hub_control()
	perf evsel: Return exact sub event which failed with EPERM for wildcards
	iwlwifi: mvm: fix RX SKB header size and align it properly
	drivers/perf: arm_pmu: handle no platform_device
	perf inject: Copy events when reordering events in pipe mode
	net: fec: add phy-reset-gpios PROBE_DEFER check
	perf session: Don't rely on evlist in pipe mode
	vfio/powerpc/spapr_tce: Enforce IOMMU type compatibility check
	vfio/spapr_tce: Check kzalloc() return when preregistering memory
	scsi: sg: check for valid direction before starting the request
	scsi: sg: close race condition in sg_remove_sfp_usercontext()
	ALSA: hda: Add Geminilake id to SKL_PLUS
	kprobes/x86: Fix kprobe-booster not to boost far call instructions
	kprobes/x86: Set kprobes pages read-only
	pwm: tegra: Increase precision in PWM rate calculation
	clk: qcom: msm8996: Fix the vfe1 powerdomain name
	Bluetooth: Avoid bt_accept_unlink() double unlinking
	Bluetooth: 6lowpan: fix delay work init in add_peer_chan()
	mac80211_hwsim: use per-interface power level
	ath10k: fix compile time sanity check for CE4 buffer size
	wil6210: fix protection against connections during reset
	wil6210: fix memory access violation in wil_memcpy_from/toio_32
	perf stat: Fix bug in handling events in error state
	mwifiex: Fix invalid port issue
	drm/edid: set ELD connector type in drm_edid_to_eld()
	video/hdmi: Allow "empty" HDMI infoframes
	HID: elo: clear BTN_LEFT mapping
	iwlwifi: mvm: rs: don't override the rate history in the search cycle
	clk: meson: gxbb: fix wrong clock for SARADC/SANA
	ARM: dts: exynos: Correct Trats2 panel reset line
	sched: Stop switched_to_rt() from sending IPIs to offline CPUs
	sched: Stop resched_cpu() from sending IPIs to offline CPUs
	test_firmware: fix setting old custom fw path back on exit
	net: ieee802154: adf7242: Fix bug if defined DEBUG
	net: xfrm: allow clearing socket xfrm policies.
	mtd: nand: fix interpretation of NAND_CMD_NONE in nand_command[_lp]()
	net: thunderx: Set max queue count taking XDP_TX into account
	ARM: dts: am335x-pepper: Fix the audio CODEC's reset pin
	ARM: dts: omap3-n900: Fix the audio CODEC's reset pin
	mtd: nand: ifc: update bufnum mask for ver >= 2.0.0
	userns: Don't fail follow_automount based on s_user_ns
	leds: pm8058: Silence pointer to integer size warning
	power: supply: ab8500_charger: Fix an error handling path
	power: supply: ab8500_charger: Bail out in case of error in 'ab8500_charger_init_hw_registers()'
	ath10k: update tdls teardown state to target
	scsi: ses: don't ask for diagnostic pages repeatedly during probe
	pwm: stmpe: Fix wrong register offset for hwpwm=2 case
	clk: qcom: msm8916: fix mnd_width for codec_digcodec
	mwifiex: cfg80211: do not change virtual interface during scan processing
	ath10k: fix invalid STS_CAP_OFFSET_MASK
	tools/usbip: fixes build with musl libc toolchain
	spi: sun6i: disable/unprepare clocks on remove
	bnxt_en: Don't print "Link speed -1 no longer supported" messages.
	scsi: core: scsi_get_device_flags_keyed(): Always return device flags
	scsi: devinfo: apply to HP XP the same flags as Hitachi VSP
	scsi: dh: add new rdac devices
	media: vsp1: Prevent suspending and resuming DRM pipelines
	media: cpia2: Fix a couple off by one bugs
	veth: set peer GSO values
	drm/amdkfd: Fix memory leaks in kfd topology
	powerpc/modules: Don't try to restore r2 after a sibling call
	agp/intel: Flush all chipset writes after updating the GGTT
	mac80211_hwsim: enforce PS_MANUAL_POLL to be set after PS_ENABLED
	mac80211: remove BUG() when interface type is invalid
	ASoC: nuc900: Fix a loop timeout test
	ipvlan: add L2 check for packets arriving via virtual devices
	rcutorture/configinit: Fix build directory error message
	locking/locktorture: Fix num reader/writer corner cases
	ima: relax requiring a file signature for new files with zero length
	net: hns: Some checkpatch.pl script & warning fixes
	x86/boot/32: Fix UP boot on Quark and possibly other platforms
	x86/cpufeatures: Add Intel PCONFIG cpufeature
	selftests/x86/entry_from_vm86: Exit with 1 if we fail
	selftests/x86: Add tests for User-Mode Instruction Prevention
	selftests/x86: Add tests for the STR and SLDT instructions
	selftests/x86/entry_from_vm86: Add test cases for POPF
	x86/vm86/32: Fix POPF emulation
	x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32-bit kernels
	x86/speculation: Remove Skylake C2 from Speculation Control microcode blacklist
	x86/mm: Fix vmalloc_fault to use pXd_large
	parisc: Handle case where flush_cache_range is called with no context
	ALSA: pcm: Fix UAF in snd_pcm_oss_get_formats()
	ALSA: hda - Revert power_save option default value
	ALSA: seq: Fix possible UAF in snd_seq_check_queue()
	ALSA: seq: Clear client entry before deleting else at closing
	drm/amdgpu: fix prime teardown order
	drm/amdgpu/dce: Don't turn off DP sink when disconnected
	fs: Teach path_connected to handle nfs filesystems with multiple roots.
	lock_parent() needs to recheck if dentry got __dentry_kill'ed under it
	fs/aio: Add explicit RCU grace period when freeing kioctx
	fs/aio: Use RCU accessors for kioctx_table->table[]
	irqchip/gic-v3-its: Ensure nr_ites >= nr_lpis
	scsi: sg: fix SG_DXFER_FROM_DEV transfers
	scsi: sg: fix static checker warning in sg_is_valid_dxfer
	scsi: sg: only check for dxfer_len greater than 256M
	btrfs: alloc_chunk: fix DUP stripe size handling
	btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device
	scsi: qla2xxx: Fix extraneous ref on sp's after adapter break
	USB: gadget: udc: Add missing platform_device_put() on error in bdc_pci_probe()
	usb: dwc3: Fix GDBGFIFOSPACE_TYPE values
	usb: gadget: bdc: 64-bit pointer capability check
	Linux 4.9.89

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-03-22 09:54:47 +01:00
Paul E. McKenney
bac7bb1849 sched: Stop switched_to_rt() from sending IPIs to offline CPUs
[ Upstream commit 2fe2582649 ]

The rcutorture test suite occasionally provokes a splat due to invoking
rt_mutex_lock() which needs to boost the priority of a task currently
sitting on a runqueue that belongs to an offline CPU:

WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
Modules linked in:
CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000
RIP: 0010:native_smp_send_reschedule+0x37/0x40
RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082
RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004
RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff
RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d
R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d
R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200
FS:  0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0
Call Trace:
 resched_curr+0x61/0xd0
 switched_to_rt+0x8f/0xa0
 rt_mutex_setprio+0x25c/0x410
 task_blocks_on_rt_mutex+0x1b3/0x1f0
 rt_mutex_slowlock+0xa9/0x1e0
 rt_mutex_lock+0x29/0x30
 rcu_boost_kthread+0x127/0x3c0
 kthread+0x104/0x140
 ? rcu_report_unblock_qs_rnp+0x90/0x90
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x22/0x30
Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 a2 d1 13 02 85 c0 75 38 55 48

But the target task's priority has already been adjusted, so the only
purpose of switched_to_rt() invoking resched_curr() is to wake up the
CPU running some task that needs to be preempted by the boosted task.
But the CPU is offline, which presumably means that the task must be
migrated to some other CPU, and that this other CPU will undertake any
needed preemption at the time of migration.  Because the runqueue lock
is held when resched_curr() is invoked, we know that the boosted task
cannot go anywhere, so it is not necessary to invoke resched_curr()
in this particular case.

This commit therefore makes switched_to_rt() refrain from invoking
resched_curr() when the target CPU is offline.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22 09:17:54 +01:00
Greg Kroah-Hartman
cdfc8df1d2 Merge 4.9.82 into android-4.9
Changes in 4.9.82
	powerpc/pseries: include linux/types.h in asm/hvcall.h
	cifs: Fix missing put_xid in cifs_file_strict_mmap
	cifs: Fix autonegotiate security settings mismatch
	CIFS: zero sensitive data when freeing
	dmaengine: dmatest: fix container_of member in dmatest_callback
	kaiser: fix compile error without vsyscall
	posix-timer: Properly check sigevent->sigev_notify
	usb: gadget: uvc: Missing files for configfs interface
	sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()
	sched/rt: Up the root domain ref count when passing it around via IPIs
	dccp: CVE-2017-8824: use-after-free in DCCP code
	media: dvb-usb-v2: lmedm04: Improve logic checking of warm start
	media: dvb-usb-v2: lmedm04: move ts2020 attach to dm04_lme2510_tuner
	media: hdpvr: Fix an error handling path in hdpvr_probe()
	mtd: cfi: convert inline functions to macros
	mtd: nand: brcmnand: Disable prefetch by default
	mtd: nand: Fix nand_do_read_oob() return value
	mtd: nand: sunxi: Fix ECC strength choice
	ubi: fastmap: Erase outdated anchor PEBs during attach
	ubi: block: Fix locking for idr_alloc/idr_remove
	ubifs: Massage assert in ubifs_xattr_set() wrt. init_xattrs
	nfs/pnfs: fix nfs_direct_req ref leak when i/o falls back to the mds
	NFS: Add a cond_resched() to nfs_commit_release_pages()
	NFS: commit direct writes even if they fail partially
	NFS: reject request for id_legacy key without auxdata
	NFS: Fix a race between mmap() and O_DIRECT
	kernfs: fix regression in kernfs_fop_write caused by wrong type
	ahci: Annotate PCI ids for mobile Intel chipsets as such
	ahci: Add PCI ids for Intel Bay Trail, Cherry Trail and Apollo Lake AHCI
	ahci: Add Intel Cannon Lake PCH-H PCI ID
	crypto: hash - introduce crypto_hash_alg_has_setkey()
	crypto: cryptd - pass through absence of ->setkey()
	crypto: mcryptd - pass through absence of ->setkey()
	crypto: poly1305 - remove ->setkey() method
	nsfs: mark dentry with DCACHE_RCUACCESS
	media: v4l2-ioctl.c: don't copy back the result for -ENOTTY
	media: v4l2-compat-ioctl32.c: add missing VIDIOC_PREPARE_BUF
	media: v4l2-compat-ioctl32.c: fix the indentation
	media: v4l2-compat-ioctl32.c: move 'helper' functions to __get/put_v4l2_format32
	media: v4l2-compat-ioctl32.c: avoid sizeof(type)
	media: v4l2-compat-ioctl32.c: copy m.userptr in put_v4l2_plane32
	media: v4l2-compat-ioctl32.c: fix ctrl_is_pointer
	media: v4l2-compat-ioctl32.c: make ctrl_is_pointer work for subdevs
	media: v4l2-compat-ioctl32: Copy v4l2_window->global_alpha
	media: v4l2-compat-ioctl32.c: copy clip list in put_v4l2_window32
	media: v4l2-compat-ioctl32.c: drop pr_info for unknown buffer type
	media: v4l2-compat-ioctl32.c: don't copy back the result for certain errors
	media: v4l2-compat-ioctl32.c: refactor compat ioctl32 logic
	crypto: caam - fix endless loop when DECO acquire fails
	crypto: sha512-mb - initialize pending lengths correctly
	arm: KVM: Fix SMCCC handling of unimplemented SMC/HVC calls
	KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2
	KVM: arm/arm64: Handle CPU_PM_ENTER_FAILED
	ASoC: rockchip: i2s: fix playback after runtime resume
	ASoC: skl: Fix kernel warning due to zero NHTL entry
	watchdog: imx2_wdt: restore previous timeout after suspend+resume
	media: dvb-frontends: fix i2c access helpers for KASAN
	media: ts2020: avoid integer overflows on 32 bit machines
	media: cxusb, dib0700: ignore XC2028_I2C_FLUSH
	fs/proc/kcore.c: use probe_kernel_read() instead of memcpy()
	kernel/async.c: revert "async: simplify lowest_in_progress()"
	kernel/relay.c: revert "kernel/relay.c: fix potential memory leak"
	pipe: actually allow root to exceed the pipe buffer limits
	pipe: fix off-by-one error when checking buffer limits
	HID: quirks: Fix keyboard + touchpad on Toshiba Click Mini not working
	Bluetooth: btsdio: Do not bind to non-removable BCM43341
	Revert "Bluetooth: btusb: fix QCA Rome suspend/resume"
	Bluetooth: btusb: Restore QCA Rome suspend/resume fix with a "rewritten" version
	signal/openrisc: Fix do_unaligned_access to send the proper signal
	signal/sh: Ensure si_signo is initialized in do_divide_error
	alpha: fix crash if pthread_create races with signal delivery
	alpha: fix reboot on Avanti platform
	alpha: fix formating of stack content
	xtensa: fix futex_atomic_cmpxchg_inatomic
	EDAC, octeon: Fix an uninitialized variable warning
	pinctrl: intel: Initialize GPIO properly when used through irqchip
	pktcdvd: Fix pkt_setup_dev() error path
	clocksource/drivers/stm32: Fix kernel panic with multiple timers
	lib/ubsan.c: s/missaligned/misaligned/
	lib/ubsan: add type mismatch handler for new GCC/Clang
	btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker
	drm/i915: Avoid PPS HW/SW state mismatch due to rounding
	ACPI: sbshc: remove raw pointer from printk() message
	acpi, nfit: fix register dimm error handling
	ovl: fix failure to fsync lower dir
	mn10300/misalignment: Use SIGSEGV SEGV_MAPERR to report a failed user copy
	ftrace: Remove incorrect setting of glob search field
	Linux 4.9.82

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-02-17 14:52:07 +01:00
Steven Rostedt (VMware)
a384e5437f sched/rt: Up the root domain ref count when passing it around via IPIs
commit 364f566537 upstream.

When issuing an IPI RT push, where an IPI is sent to each CPU that has more
than one RT task scheduled on it, it references the root domain's rto_mask,
that contains all the CPUs within the root domain that has more than one RT
task in the runable state. The problem is, after the IPIs are initiated, the
rq->lock is released. This means that the root domain that is associated to
the run queue could be freed while the IPIs are going around.

Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
the root domain's ref count respectively. This way when initiating the IPIs,
the scheduler will up the root domain's ref count before releasing the
rq->lock, ensuring that the root domain does not go away until the IPI round
is complete.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4bdced5c9a ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-17 13:21:13 +01:00
Steven Rostedt (VMware)
1c67998130 sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()
commit ad0f1d9d65 upstream.

When the rto_push_irq_work_func() is called, it looks at the RT overloaded
bitmask in the root domain via the runqueue (rq->rd). The problem is that
during CPU up and down, nothing here stops rq->rd from changing between
taking the rq->rd->rto_lock and releasing it. That means the lock that is
released is not the same lock that was taken.

Instead of using this_rq()->rd to get the root domain, as the irq work is
part of the root domain, we can simply get the root domain from the irq work
that is passed to the routine:

 container_of(work, struct root_domain, rto_push_work)

This keeps the root domain consistent.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4bdced5c9a ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-17 13:21:13 +01:00
Greg Kroah-Hartman
319c8e1bc7 Merge 4.9.71 into android-4.9
Changes in 4.9.71
	mfd: fsl-imx25: Clean up irq settings during removal
	crypto: rsa - fix buffer overread when stripping leading zeroes
	crypto: hmac - require that the underlying hash algorithm is unkeyed
	crypto: salsa20 - fix blkcipher_walk API usage
	autofs: fix careless error in recent commit
	tracing: Allocate mask_str buffer dynamically
	USB: uas and storage: Add US_FL_BROKEN_FUA for another JMicron JMS567 ID
	USB: core: prevent malicious bNumInterfaces overflow
	usbip: fix stub_rx: get_pipe() to validate endpoint number
	usb: add helper to extract bits 12:11 of wMaxPacketSize
	usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input
	usbip: fix stub_send_ret_submit() vulnerability to null transfer_buffer
	ceph: drop negative child dentries before try pruning inode's alias
	usb: xhci: fix TDS for MTK xHCI1.1
	Bluetooth: btusb: driver to enable the usb-wakeup feature
	xhci: Don't add a virt_dev to the devs array before it's fully allocated
	nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
	sched/rt: Do not pull from current CPU if only one CPU to pull
	eeprom: at24: change nvmem stride to 1
	dmaengine: dmatest: move callback wait queue to thread context
	ext4: fix fdatasync(2) after fallocate(2) operation
	ext4: fix crash when a directory's i_size is too small
	mac80211: Fix addition of mesh configuration element
	usb: phy: isp1301: Add OF device ID table
	KVM: nVMX: do not warn when MSR bitmap address is not backed
	usb: xhci-mtk: check hcc_params after adding primary hcd
	md-cluster: free md_cluster_info if node leave cluster
	userfaultfd: shmem: __do_fault requires VM_FAULT_NOPAGE
	userfaultfd: selftest: vm: allow to build in vm/ directory
	net: initialize msg.msg_flags in recvfrom
	bnxt_en: Ignore 0 value in autoneg supported speed from firmware.
	net: bcmgenet: correct the RBUF_OVFL_CNT and RBUF_ERR_CNT MIB values
	net: bcmgenet: correct MIB access of UniMAC RUNT counters
	net: bcmgenet: reserved phy revisions must be checked first
	net: bcmgenet: power down internal phy if open or resume fails
	net: bcmgenet: synchronize irq0 status between the isr and task
	net: bcmgenet: Power up the internal PHY before probing the MII
	rxrpc: Wake up the transmitter if Rx window size increases on the peer
	net/mlx5: Fix create autogroup prev initializer
	net/mlx5: Don't save PCI state when PCI error is detected
	iommu/io-pgtable-arm-v7s: Check for leaf entry before dereferencing it
	drm/amdgpu: fix parser init error path to avoid crash in parser fini
	NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
	NFSD: fix nfsd_reset_versions for NFSv4.
	Input: i8042 - add TUXEDO BU1406 (N24_25BU) to the nomux list
	drm/omap: fix dmabuf mmap for dma_alloc'ed buffers
	netfilter: bridge: honor frag_max_size when refragmenting
	ASoC: rsnd: fix sound route path when using SRC6/SRC9
	blk-mq: Fix tagset reinit in the presence of cpu hot-unplug
	writeback: fix memory leak in wb_queue_work()
	net: wimax/i2400m: fix NULL-deref at probe
	dmaengine: Fix array index out of bounds warning in __get_unmap_pool()
	irqchip/mvebu-odmi: Select GENERIC_MSI_IRQ_DOMAIN
	net: Resend IGMP memberships upon peer notification.
	mlxsw: reg: Fix SPVM max record count
	mlxsw: reg: Fix SPVMLR max record count
	qed: Align CIDs according to DORQ requirement
	qed: Fix mapping leak on LL2 rx flow
	qed: Fix interrupt flags on Rx LL2
	drm: amd: remove broken include path
	intel_th: pci: Add Gemini Lake support
	openrisc: fix issue handling 8 byte get_user calls
	ASoC: rcar: clear DE bit only in PDMACHCR when it stops
	scsi: hpsa: update check for logical volume status
	scsi: hpsa: limit outstanding rescans
	scsi: hpsa: do not timeout reset operations
	fjes: Fix wrong netdevice feature flags
	drm/radeon/si: add dpm quirk for Oland
	Drivers: hv: util: move waiting for release to hv_utils_transport itself
	iwlwifi: mvm: cleanup pending frames in DQA mode
	sched/deadline: Add missing update_rq_clock() in dl_task_timer()
	sched/deadline: Make sure the replenishment timer fires in the next period
	sched/deadline: Throttle a constrained deadline task activated after the deadline
	sched/deadline: Use deadline instead of period when calculating overflow
	mmc: mediatek: Fixed bug where clock frequency could be set wrong
	drm/radeon: reinstate oland workaround for sclk
	afs: Fix missing put_page()
	afs: Populate group ID from vnode status
	afs: Adjust mode bits processing
	afs: Deal with an empty callback array
	afs: Flush outstanding writes when an fd is closed
	afs: Migrate vlocation fields to 64-bit
	afs: Prevent callback expiry timer overflow
	afs: Fix the maths in afs_fs_store_data()
	afs: Invalid op ID should abort with RXGEN_OPCODE
	afs: Better abort and net error handling
	afs: Populate and use client modification time
	afs: Fix page leak in afs_write_begin()
	afs: Fix afs_kill_pages()
	afs: Fix abort on signal while waiting for call completion
	nvme-loop: fix a possible use-after-free when destroying the admin queue
	nvmet: confirm sq percpu has scheduled and switched to atomic
	nvmet-rdma: Fix a possible uninitialized variable dereference
	net/mlx4_core: Avoid delays during VF driver device shutdown
	net: mpls: Fix nexthop alive tracking on down events
	rxrpc: Ignore BUSY packets on old calls
	tty: don't panic on OOM in tty_set_ldisc()
	tty: fix data race in tty_ldisc_ref_wait()
	perf symbols: Fix symbols__fixup_end heuristic for corner cases
	efi/esrt: Cleanup bad memory map log messages
	NFSv4.1 respect server's max size in CREATE_SESSION
	btrfs: add missing memset while reading compressed inline extents
	target: Use system workqueue for ALUA transitions
	target: fix ALUA transition timeout handling
	target: fix race during implicit transition work flushes
	Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting"
	HID: cp2112: fix broken gpio_direction_input callback
	sfc: don't warn on successful change of MAC
	fbdev: controlfb: Add missing modes to fix out of bounds access
	video: udlfb: Fix read EDID timeout
	video: fbdev: au1200fb: Release some resources if a memory allocation fails
	video: fbdev: au1200fb: Return an error code if a memory allocation fails
	rtc: pcf8563: fix output clock rate
	ASoC: Intel: Skylake: Fix uuid_module memory leak in failure case
	dmaengine: ti-dma-crossbar: Correct am335x/am43xx mux value type
	PCI/PME: Handle invalid data when reading Root Status
	powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfo
	PCI: Do not allocate more buses than available in parent
	iommu/mediatek: Fix driver name
	netfilter: ipvs: Fix inappropriate output of procfs
	powerpc/opal: Fix EBUSY bug in acquiring tokens
	powerpc/ipic: Fix status get and status clear
	platform/x86: intel_punit_ipc: Fix resource ioremap warning
	target/iscsi: Fix a race condition in iscsit_add_reject_from_cmd()
	iscsi-target: fix memory leak in lio_target_tiqn_addtpg()
	target:fix condition return in core_pr_dump_initiator_port()
	target/file: Do not return error for UNMAP if length is zero
	badblocks: fix wrong return value in badblocks_set if badblocks are disabled
	iommu/amd: Limit the IOVA page range to the specified addresses
	xfs: truncate pagecache before writeback in xfs_setattr_size()
	arm-ccn: perf: Prevent module unload while PMU is in use
	crypto: tcrypt - fix buffer lengths in test_aead_speed()
	mm: Handle 0 flags in _calc_vm_trans() macro
	clk: mediatek: add the option for determining PLL source clock
	clk: imx6: refine hdmi_isfr's parent to make HDMI work on i.MX6 SoCs w/o VPU
	clk: hi6220: mark clock cs_atb_syspll as critical
	clk: tegra: Fix cclk_lp divisor register
	ppp: Destroy the mutex when cleanup
	ASoC: rsnd: rsnd_ssi_run_mods() needs to care ssi_parent_mod
	thermal/drivers/step_wise: Fix temperature regulation misbehavior
	scsi: scsi_debug: write_same: fix error report
	GFS2: Take inode off order_write list when setting jdata flag
	bcache: explicitly destroy mutex while exiting
	bcache: fix wrong cache_misses statistics
	Ib/hfi1: Return actual operational VLs in port info query
	arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27
	btrfs: tests: Fix a memory leak in error handling path in 'run_test()'
	platform/x86: hp_accel: Add quirk for HP ProBook 440 G4
	nvme: use kref_get_unless_zero in nvme_find_get_ns
	l2tp: cleanup l2tp_tunnel_delete calls
	xfs: fix log block underflow during recovery cycle verification
	xfs: fix incorrect extent state in xfs_bmap_add_extent_unwritten_real
	RDMA/cxgb4: Declare stag as __be32
	PCI: Detach driver before procfs & sysfs teardown on device remove
	scsi: hpsa: cleanup sas_phy structures in sysfs when unloading
	scsi: hpsa: destroy sas transport properties before scsi_host
	powerpc/perf/hv-24x7: Fix incorrect comparison in memord
	soc: mediatek: pwrap: fix compiler errors
	tty fix oops when rmmod 8250
	usb: musb: da8xx: fix babble condition handling
	pinctrl: adi2: Fix Kconfig build problem
	raid5: Set R5_Expanded on parity devices as well as data.
	scsi: scsi_devinfo: Add REPORTLUN2 to EMC SYMMETRIX blacklist entry
	IB/core: Fix calculation of maximum RoCE MTU
	vt6655: Fix a possible sleep-in-atomic bug in vt6655_suspend
	rtl8188eu: Fix a possible sleep-in-atomic bug in rtw_createbss_cmd
	rtl8188eu: Fix a possible sleep-in-atomic bug in rtw_disassoc_cmd
	scsi: sd: change manage_start_stop to bool in sysfs interface
	scsi: sd: change allow_restart to bool in sysfs interface
	scsi: bfa: integer overflow in debugfs
	udf: Avoid overflow when session starts at large offset
	macvlan: Only deliver one copy of the frame to the macvlan interface
	RDMA/cma: Avoid triggering undefined behavior
	IB/ipoib: Grab rtnl lock on heavy flush when calling ndo_open/stop
	icmp: don't fail on fragment reassembly time exceeded
	ath9k: fix tx99 potential info leak
	Linux 4.9.71

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-12-20 10:51:15 +01:00
Steven Rostedt
d266817f50 sched/rt: Do not pull from current CPU if only one CPU to pull
commit f73c52a5bc upstream.

Daniel Wagner reported a crash on the BeagleBone Black SoC.

This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.

As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.

Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.

Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.

Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.

Reported-by: Daniel Wagner <wagi@monom.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:07:17 +01:00
Greg Kroah-Hartman
c1a286429a Merge 4.9.66 into android-4.9
Changes in 4.9.66
	s390: fix transactional execution control register handling
	s390/runtime instrumention: fix possible memory corruption
	s390/disassembler: add missing end marker for e7 table
	s390/disassembler: increase show_code buffer size
	ACPI / EC: Fix regression related to triggering source of EC event handling
	x86/mm: fix use-after-free of vma during userfaultfd fault
	ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
	vsock: use new wait API for vsock_stream_sendmsg()
	sched: Make resched_cpu() unconditional
	lib/mpi: call cond_resched() from mpi_powm() loop
	x86/decoder: Add new TEST instruction pattern
	x86/entry/64: Add missing irqflags tracing to native_load_gs_index()
	arm64: Implement arch-specific pte_access_permitted()
	ARM: 8722/1: mm: make STRICT_KERNEL_RWX effective for LPAE
	ARM: 8721/1: mm: dump: check hardware RO bit for LPAE
	MIPS: ralink: Fix MT7628 pinmux
	MIPS: ralink: Fix typo in mt7628 pinmux function
	PCI: Set Cavium ACS capability quirk flags to assert RR/CR/SV/UF
	ALSA: hda: Add Raven PCI ID
	dm bufio: fix integer overflow when limiting maximum cache size
	dm: allocate struct mapped_device with kvzalloc
	MIPS: pci: Remove KERN_WARN instance inside the mt7620 driver
	dm: fix race between dm_get_from_kobject() and __dm_destroy()
	MIPS: Fix odd fp register warnings with MIPS64r2
	MIPS: dts: remove bogus bcm96358nb4ser.dtb from dtb-y entry
	MIPS: Fix an n32 core file generation regset support regression
	MIPS: BCM47XX: Fix LED inversion for WRT54GSv1
	rt2x00usb: mark device removed when get ENOENT usb error
	autofs: don't fail mount for transient error
	nilfs2: fix race condition that causes file system corruption
	eCryptfs: use after free in ecryptfs_release_messaging()
	libceph: don't WARN() if user tries to add invalid key
	bcache: check ca->alloc_thread initialized before wake up it
	isofs: fix timestamps beyond 2027
	NFS: Fix typo in nomigration mount option
	nfs: Fix ugly referral attributes
	NFS: Avoid RCU usage in tracepoints
	nfsd: deal with revoked delegations appropriately
	rtlwifi: rtl8192ee: Fix memory leak when loading firmware
	rtlwifi: fix uninitialized rtlhal->last_suspend_sec time
	ata: fixes kernel crash while tracing ata_eh_link_autopsy event
	ext4: fix interaction between i_size, fallocate, and delalloc after a crash
	ALSA: pcm: update tstamp only if audio_tstamp changed
	ALSA: usb-audio: Add sanity checks to FE parser
	ALSA: usb-audio: Fix potential out-of-bound access at parsing SU
	ALSA: usb-audio: Add sanity checks in v2 clock parsers
	ALSA: timer: Remove kernel warning at compat ioctl error paths
	ALSA: hda: Fix too short HDMI/DP chmap reporting
	ALSA: hda/realtek - Fix ALC700 family no sound issue
	fix a page leak in vhost_scsi_iov_to_sgl() error recovery
	fs/9p: Compare qid.path in v9fs_test_inode
	iscsi-target: Fix non-immediate TMR reference leak
	target: Fix QUEUE_FULL + SCSI task attribute handling
	mtd: nand: omap2: Fix subpage write
	mtd: nand: Fix writing mtdoops to nand flash.
	mtd: nand: mtk: fix infinite ECC decode IRQ issue
	p54: don't unregister leds when they are not initialized
	block: Fix a race between blk_cleanup_queue() and timeout handling
	irqchip/gic-v3: Fix ppi-partitions lookup
	lockd: double unregister of inetaddr notifiers
	KVM: nVMX: set IDTR and GDTR limits when loading L1 host state
	KVM: SVM: obey guest PAT
	SUNRPC: Fix tracepoint storage issues with svc_recv and svc_rqst_status
	clk: ti: dra7-atl-clock: fix child-node lookups
	libnvdimm, pfn: make 'resource' attribute only readable by root
	libnvdimm, namespace: fix label initialization to use valid seq numbers
	libnvdimm, namespace: make 'resource' attribute only readable by root
	IB/srpt: Do not accept invalid initiator port names
	IB/srp: Avoid that a cable pull can trigger a kernel crash
	NFC: fix device-allocation error return
	i40e: Use smp_rmb rather than read_barrier_depends
	igb: Use smp_rmb rather than read_barrier_depends
	igbvf: Use smp_rmb rather than read_barrier_depends
	ixgbevf: Use smp_rmb rather than read_barrier_depends
	i40evf: Use smp_rmb rather than read_barrier_depends
	fm10k: Use smp_rmb rather than read_barrier_depends
	ixgbe: Fix skb list corruption on Power systems
	parisc: Fix validity check of pointer size argument in new CAS implementation
	powerpc/signal: Properly handle return value from uprobe_deny_signal()
	media: Don't do DMA on stack for firmware upload in the AS102 driver
	media: rc: check for integer overflow
	cx231xx-cards: fix NULL-deref on missing association descriptor
	media: v4l2-ctrl: Fix flags field on Control events
	sched/rt: Simplify the IPI based RT balancing logic
	fscrypt: lock mutex before checking for bounce page pool
	net/9p: Switch to wait_event_killable()
	PM / OPP: Add missing of_node_put(np)
	Revert "drm/i915: Do not rely on wm preservation for ILK watermarks"
	e1000e: Fix error path in link detection
	e1000e: Fix return value test
	e1000e: Separate signaling for link check/link up
	e1000e: Avoid receiver overrun interrupt bursts
	RDS: make message size limit compliant with spec
	RDS: RDMA: return appropriate error on rdma map failures
	RDS: RDMA: fix the ib_map_mr_sg_zbva() argument
	PCI: Apply _HPX settings only to relevant devices
	drm/sun4i: Fix a return value in case of error
	clk: sunxi-ng: A31: Fix spdif clock register
	clk: sunxi-ng: fix PLL_CPUX adjusting on A33
	dmaengine: zx: set DMA_CYCLIC cap_mask bit
	fscrypt: use ENOKEY when file cannot be created w/o key
	fscrypt: use ENOTDIR when setting encryption policy on nondirectory
	net: Allow IP_MULTICAST_IF to set index to L3 slave
	net: 3com: typhoon: typhoon_init_one: make return values more specific
	net: 3com: typhoon: typhoon_init_one: fix incorrect return values
	drm/armada: Fix compile fail
	rt2800: set minimum MPDU and PSDU lengths to sane values
	adm80211: return an error if adm8211_alloc_rings() fails
	mwifiex: sdio: fix use after free issue for save_adapter
	ath10k: fix incorrect txpower set by P2P_DEVICE interface
	ath10k: ignore configuring the incorrect board_id
	ath10k: fix potential memory leak in ath10k_wmi_tlv_op_pull_fw_stats()
	pinctrl: sirf: atlas7: Add missing 'of_node_put()'
	bnxt_en: Set default completion ring for async events.
	ath10k: set CTS protection VDEV param only if VDEV is up
	ALSA: hda - Apply ALC269_FIXUP_NO_SHUTUP on HDA_FIXUP_ACT_PROBE
	gpio: mockup: dynamically allocate memory for chip name
	drm: Apply range restriction after color adjustment when allocation
	clk: qcom: ipq4019: Add all the frequencies for apss cpu
	drm/mediatek: don't use drm_put_dev
	mac80211: Remove invalid flag operations in mesh TSF synchronization
	mac80211: Suppress NEW_PEER_CANDIDATE event if no room
	adm80211: add checks for dma mapping errors
	iio: light: fix improper return value
	staging: iio: cdc: fix improper return value
	spi: SPI_FSL_DSPI should depend on HAS_DMA
	netfilter: nft_queue: use raw_smp_processor_id()
	netfilter: nf_tables: fix oob access
	ASoC: rsnd: don't double free kctrl
	crypto: marvell - Copy IVDIG before launching partial DMA ahash requests
	btrfs: return the actual error value from from btrfs_uuid_tree_iterate
	ASoC: wm_adsp: Don't overrun firmware file buffer when reading region data
	s390/kbuild: enable modversions for symbols exported from asm
	cec: when canceling a message, don't overwrite old status info
	cec: CEC_MSG_GIVE_FEATURES should abort for CEC version < 2
	cec: update log_addr[] before finishing configuration
	nvmet: fix KATO offset in Set Features
	xen: xenbus driver must not accept invalid transaction ids
	Linux 4.9.66

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-11-30 16:24:14 +00:00
Steven Rostedt (Red Hat)
1c37ff7829 sched/rt: Simplify the IPI based RT balancing logic
commit 4bdced5c9a upstream.

When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.

When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:

  b6366f048e ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:

  rto_push_work	 - the irq work to process each CPU set in rto_mask
  rto_lock	 - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock
		    the first CPU scheduling in a lower priority task
		    is the one to kick off the process.
  rto_loop_next	 - an atomic that gets incremented for each CPU that
		    schedules in a lower priority task.
  rto_loop	 - a variable protected by rto_lock that is used to
		    compare against rto_loop_next
  rto_cpu	 - The cpu to send the next IPI to, also protected by
		    the rto_lock.

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.

This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:39:09 +00:00
Olav Haugan
77ba2b9f0b sched: Update task->on_rq when tasks are moving between runqueues
Task->on_rq has three states:
	0 - Task is not on runqueue (rq)
	1 (TASK_ON_RQ_QUEUED) - Task is on rq
	2 (TASK_ON_RQ_MIGRATING) - Task is on rq but in the
	process of being migrated to another rq

When a task is moving between rqs task->on_rq state should be
TASK_ON_RQ_MIGRATING in order for WALT to account rq's cumulative
runnable average correctly.  Without such state marking for all the
classes, WALT's update_history() would try to fixup task's demand
which was never contributed to any of CPUs during migration.

Change-Id: I65e74a8f176c3ed4b8577577f6da8897ecda7bb8
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[joonwoop: Reinforced changelog to explain why this is needed by WALT.
           Fixed conflicts in deadline.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(cherry picked from commit 0f8791c90a99f718c43ab8214076c0c671a36667)
[fixed cherry-pick issue due to missing fab5cc59bf]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
2017-11-14 16:39:00 +00:00
Greg Kroah-Hartman
c96f651270 Merge 4.9.20 into android-4.9
Changes in 4.9.20:
	xfrm: policy: init locks early
	xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
	xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder
	KVM: x86: cleanup the page tracking SRCU instance
	virtio_balloon: init 1st buffer in stats vq
	pinctrl: qcom: Don't clear status bit on irq_unmask
	c6x/ptrace: Remove useless PTRACE_SETREGSET implementation
	h8300/ptrace: Fix incorrect register transfer count
	mips/ptrace: Preserve previous registers for short regset write
	sparc/ptrace: Preserve previous registers for short regset write
	metag/ptrace: Preserve previous registers for short regset write
	metag/ptrace: Provide default TXSTATUS for short NT_PRSTATUS
	metag/ptrace: Reject partial NT_METAG_RPIPE writes
	fscrypt: remove broken support for detecting keyring key revocation
	sched/rt: Add a missing rescheduling point
	usb: musb: fix possible spinlock deadlock
	Linux 4.9.20

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2017-03-31 11:07:16 +02:00
Sebastian Andrzej Siewior
916c5cfeab sched/rt: Add a missing rescheduling point
commit 619bd4a718 upstream.

Since the change in commit:

  fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks")

... we don't reschedule a task under certain circumstances:

Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.

The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.

The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks")
Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-31 10:31:46 +02:00
Matt Wagantall
345eb978a6 ANDROID: sched/rt: Add Kconfig option to enable panicking for RT throttling
This may be useful for detecting and debugging RT throttling issues.

Change-Id: I5807a897d11997d76421c1fcaa2918aad988c6c9
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[jstultz: forwardported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andres Oportus <andresoportus@google.com>
2017-01-31 10:47:10 -08:00
Matt Wagantall
989a768d7c ANDROID: sched/rt: print RT tasks when RT throttling is activated
Existing debug prints do not provide any clues about which tasks
may have triggered RT throttling. Print the names and PIDs of
all tasks on the throttled rt_rq to help narrow down the source
of the problem.

Change-Id: I180534c8a647254ed38e89d0c981a8f8bccd741c
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Signed-off-by: Andres Oportus <andresoportus@google.com>
2017-01-31 10:47:09 -08:00
Srivatsa Vaddagiri
26c2154816 ANDROID: sched: Introduce Window Assisted Load Tracking (WALT)
use a window based view of time in order to track task
demand and CPU utilization in the scheduler.

Window Assisted Load Tracking (WALT) implementation credits:
 Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
 Pavan Kumar Kondeti, Olav Haugan

2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
            and Todd Kjos

Change-Id: I21408236836625d4e7d7de1843d20ed5ff36c708

Includes fixes for issues:

eas/walt: Use walt_ktime_clock() instead of ktime_get_ns() to avoid a
race resulting in watchdog resets
BUG: 29353986
Change-Id: Ic1820e22a136f7c7ebd6f42e15f14d470f6bbbdb

Handle walt accounting anomoly during resume

During resume, there is a corner case where on wakeup, a task's
prev_runnable_sum can go negative. This is a workaround that
fixes the condition and warns (instead of crashing).

BUG: 29464099
Change-Id: I173e7874324b31a3584435530281708145773508

Signed-off-by: Todd Kjos <tkjos@google.com>
Signed-off-by: Srinath Sridharan <srinathsr@google.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andres Oportus <andresoportus@google.com>
2017-01-31 10:46:58 -08:00
Rafael J. Wysocki
12bde33dbb cpufreq / sched: Pass runqueue pointer to cpufreq_update_util()
All of the callers of cpufreq_update_util() pass rq_clock(rq) to it
as the time argument and some of them check whether or not cpu_of(rq)
is equal to smp_processor_id() before calling it, so rework it to
take a runqueue pointer as the argument and move the rq_clock(rq)
evaluation into it.

Additionally, provide a wrapper checking cpu_of(rq) against
smp_processor_id() for the cpufreq_update_util() callers that
need it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-08-16 22:16:03 +02:00
Rafael J. Wysocki
58919e83c8 cpufreq / sched: Pass flags to cpufreq_update_util()
It is useful to know the reason why cpufreq_update_util() has just
been called and that can be passed as flags to cpufreq_update_util()
and to the ->func() callback in struct update_util_data.  However,
doing that in addition to passing the util and max arguments they
already take would be clumsy, so avoid it.

Instead, use the observation that the schedutil governor is part
of the scheduler proper, so it can access scheduler data directly.
This allows the util and max arguments of cpufreq_update_util()
and the ->func() callback in struct update_util_data to be replaced
with a flags one, but schedutil has to be modified to follow.

Thus make the schedutil governor obtain the CFS utilization
information from the scheduler and use the "RT" and "DL" flags
instead of the special utilization value of ULONG_MAX to track
updates from the RT and DL sched classes.  Make it non-modular
too to avoid having to export scheduler variables to modules at
large.

Next, update all of the other users of cpufreq_update_util()
and the ->func() callback in struct update_util_data accordingly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2016-08-16 22:14:55 +02:00
Thomas Gleixner
50605ffbda sched/core: Provide a tsk_nr_cpus_allowed() helper
tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows
us to change the representation of ->nr_cpus_allowed if required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1462969411-17735-2-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:55:36 +02:00
Ingo Molnar
eb60b3e5e8 Merge branch 'sched/urgent' into sched/core to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-12 09:18:13 +02:00
Xunlei Pang
13b5ab02ae sched/rt, sched/dl: Don't push if task's scheduling class was changed
We got this warning:

    WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 set_task_cpu+0x1af/0x1c0
    [...]
    Call Trace:

    dump_stack+0x63/0x87
    __warn+0xd1/0xf0
    warn_slowpath_null+0x1d/0x20
    set_task_cpu+0x1af/0x1c0
    push_dl_task.part.34+0xea/0x180
    push_dl_tasks+0x17/0x30
    __balance_callback+0x45/0x5c
    __sched_setscheduler+0x906/0xb90
    SyS_sched_setattr+0x150/0x190
    do_syscall_64+0x62/0x110
    entry_SYSCALL64_slow_path+0x25/0x25

This corresponds to:

    WARN_ON_ONCE(p->state == TASK_RUNNING &&
             p->sched_class == &fair_sched_class &&
             (p->on_rq && !task_on_rq_migrating(p)))

It happens because in find_lock_later_rq(), the task whose scheduling
class was changed to fair class is still pushed away as if it were
a deadline task ...

So, check in find_lock_later_rq() after double_lock_balance(), if the
scheduling class of the deadline task was changed, break and retry.

Apply the same logic to RT tasks.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/1462767091-1215-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-10 10:02:46 +02:00
Peter Zijlstra
e7904a28f5 locking/lockdep, sched/core: Implement a better lock pinning scheme
The problem with the existing lock pinning is that each pin is of
value 1; this mean you can simply unpin if you know its pinned,
without having any extra information.

This scheme generates a random (16 bit) cookie for each pin and
requires this same cookie to unpin. This means you have to keep the
cookie in context.

No objsize difference for !LOCKDEP kernels.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 09:23:59 +02:00
Wanpeng Li
594dd290cf sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times
Sometimes delta_exec is 0 due to update_curr() is called multiple times,
this is captured by:

	u64 delta_exec = rq_clock_task(rq) - curr->se.exec_start;

This patch optimizes the cpufreq update kicker by bailing out when nothing
changed, it will benefit the upcoming schedutil, since otherwise it will
(over)react to the special util/max combination.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1461316044-9520-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28 10:39:54 +02:00
Linus Torvalds
277edbabf6 Merge tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI updates from Rafael Wysocki:
 "This time the majority of changes go into cpufreq and they are
  significant.

  First off, the way CPU frequency updates are triggered is different
  now.  Instead of having to set up and manage a deferrable timer for
  each CPU in the system to evaluate and possibly change its frequency
  periodically, cpufreq governors set up callbacks to be invoked by the
  scheduler on a regular basis (basically on utilization updates).  The
  "old" governors, "ondemand" and "conservative", still do all of their
  work in process context (although that is triggered by the scheduler
  now), but intel_pstate does it all in the callback invoked by the
  scheduler with no need for any additional asynchronous processing.

  Of course, this eliminates the overhead related to the management of
  all those timers, but also it allows the cpufreq governor code to be
  simplified quite a bit.  On top of that, the common code and data
  structures used by the "ondemand" and "conservative" governors are
  cleaned up and made more straightforward and some long-standing and
  quite annoying problems are addressed.  In particular, the handling of
  governor sysfs attributes is modified and the related locking becomes
  more fine grained which allows some concurrency problems to be avoided
  (particularly deadlocks with the core cpufreq code).

  In principle, the new mechanism for triggering frequency updates
  allows utilization information to be passed from the scheduler to
  cpufreq.  Although the current code doesn't make use of it, in the
  works is a new cpufreq governor that will make decisions based on the
  scheduler's utilization data.  That should allow the scheduler and
  cpufreq to work more closely together in the long run.

  In addition to the core and governor changes, cpufreq drivers are
  updated too.  Fixes and optimizations go into intel_pstate, the
  cpufreq-dt driver is updated on top of some modification in the
  Operating Performance Points (OPP) framework and there are fixes and
  other updates in the powernv cpufreq driver.

  Apart from the cpufreq updates there is some new ACPICA material,
  including a fix for a problem introduced by previous ACPICA updates,
  and some less significant changes in the ACPI code, like CPPC code
  optimizations, ACPI processor driver cleanups and support for loading
  ACPI tables from initrd.

  Also updated are the generic power domains framework, the Intel RAPL
  power capping driver and the turbostat utility and we have a bunch of
  traditional assorted fixes and cleanups.

  Specifics:

   - Redesign of cpufreq governors and the intel_pstate driver to make
     them use callbacks invoked by the scheduler to trigger CPU
     frequency evaluation instead of using per-CPU deferrable timers for
     that purpose (Rafael Wysocki).

   - Reorganization and cleanup of cpufreq governor code to make it more
     straightforward and fix some concurrency problems in it (Rafael
     Wysocki, Viresh Kumar).

   - Cleanup and improvements of locking in the cpufreq core (Viresh
     Kumar).

   - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
     Kumar, Eric Biggers).

   - intel_pstate driver updates including fixes, optimizations and a
     modification to make it enable enable hardware-coordinated P-state
     selection (HWP) by default if supported by the processor (Philippe
     Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
     Franciosi).

   - Operating Performance Points (OPP) framework updates to improve its
     handling of voltage regulators and device clocks and updates of the
     cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).

   - Updates of the powernv cpufreq driver to fix initialization and
     cleanup problems in it and correct its worker thread handling with
     respect to CPU offline, new powernv_throttle tracepoint (Shilpasri
     Bhat).

   - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).

   - ACPICA updates including one fix for a regression introduced by
     previos changes in the ACPICA code (Bob Moore, Lv Zheng, David Box,
     Colin Ian King).

   - Support for installing ACPI tables from initrd (Lv Zheng).

   - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
     Chaugule).

   - Support for _HID(ACPI0010) devices (ACPI processor containers) and
     ACPI processor driver cleanups (Sudeep Holla).

   - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
     Aleksey Makarov).

   - Modification of the ACPI PCI IRQ management code to make it treat
     255 in the Interrupt Line register as "not connected" on x86 (as
     per the specification) and avoid attempts to use that value as a
     valid interrupt vector (Chen Fan).

   - ACPI APEI fixes related to resource leaks (Josh Hunt).

   - Removal of modularity from a few ACPI drivers (BGRT, GHES,
     intel_pmic_crc) that cannot be built as modules in practice (Paul
     Gortmaker).

   - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
     as a valid resource type (Harb Abdulhamid).

   - New device ID (future AMD I2C controller) in the ACPI driver for
     AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).

   - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).

   - cpuidle menu governor optimization to avoid a square root
     computation in it (Rasmus Villemoes).

   - Fix for potential use-after-free in the generic device properties
     framework (Heikki Krogerus).

   - Updates of the generic power domains (genpd) framework including
     support for multiple power states of a domain, fixes and debugfs
     output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
     Geert Uytterhoeven).

   - Intel RAPL power capping driver updates to reduce IPI overhead in
     it (Jacob Pan).

   - System suspend/hibernation code cleanups (Eric Biggers, Saurabh
     Sengar).

   - Year 2038 fix for the process freezer (Abhilash Jindal).

   - turbostat utility updates including new features (decoding of more
     registers and CPUID fields, sub-second intervals support, GFX MHz
     and RC6 printout, --out command line option), fixes (syscall jitter
     detection and workaround, reductioin of the number of syscalls
     made, fixes related to Xeon x200 processors, compiler warning
     fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)"

* tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits)
  tools/power turbostat: bugfix: TDP MSRs print bits fixing
  tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
  tools/power turbostat: call __cpuid() instead of __get_cpuid()
  tools/power turbostat: indicate SMX and SGX support
  tools/power turbostat: detect and work around syscall jitter
  tools/power turbostat: show GFX%rc6
  tools/power turbostat: show GFXMHz
  tools/power turbostat: show IRQs per CPU
  tools/power turbostat: make fewer systems calls
  tools/power turbostat: fix compiler warnings
  tools/power turbostat: add --out option for saving output in a file
  tools/power turbostat: re-name "%Busy" field to "Busy%"
  tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
  tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
  tools/power turbostat: allow sub-sec intervals
  ACPI / APEI: ERST: Fixed leaked resources in erst_init
  ACPI / APEI: Fix leaked resources
  intel_pstate: Do not skip samples partially
  intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
  intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
  ...
2016-03-16 14:10:53 -07:00
Linus Torvalds
e23604edac Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ updates from Ingo Molnar:
 "NOHZ enhancements, by Frederic Weisbecker, which reorganizes/refactors
  the NOHZ 'can the tick be stopped?' infrastructure and related code to
  be data driven, and harmonizes the naming and handling of all the
  various properties"

[ This makes the ugly "fetch_or()" macro that the scheduler used
  internally a new generic helper, and does a bad job at it.

  I'm pulling it, but I've asked Ingo and Frederic to get this
  fixed up ]

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched-clock: Migrate to use new tick dependency mask model
  posix-cpu-timers: Migrate to use new tick dependency mask model
  sched: Migrate sched to use new tick dependency mask model
  sched: Account rr tasks
  perf: Migrate perf to use new tick dependency mask model
  nohz: Use enum code for tick stop failure tracing message
  nohz: New tick dependency mask
  nohz: Implement wide kick on top of irq work
  atomic: Export fetch_or()
2016-03-14 19:44:38 -07:00
Rafael J. Wysocki
34e2c555f3 cpufreq: Add mechanism for registering utilization update callbacks
Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
2016-03-09 14:39:19 +01:00
Frederic Weisbecker
01d36d0ac3 sched: Account rr tasks
In order to evaluate the scheduler tick dependency without probing
context switches, we need to know how much SCHED_RR and SCHED_FIFO tasks
are enqueued as those policies don't have the same preemption
requirements.

To prepare for that, let's account SCHED_RR tasks, we'll be able to
deduce SCHED_FIFO tasks as well from it and the total RT tasks in the
runqueue.

Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2016-03-02 16:43:04 +01:00
Steven Rostedt
c3a990dc9f sched/rt: Kick RT bandwidth timer immediately on start up
I've been debugging why deadline tasks can cause the RT scheduler to
throttle, even when the deadline tasks are only taking up 50% of the
CPU and RT tasks are not even using 1% of the CPU. Here's what I found.

In order to keep a CPU from being hogged by RT tasks, the deadline
scheduler adds its run time (delta_exec) to the rt_time of the RT
bandwidth. That way, if the two use more than 95% of the CPU within one
second (default settings), the RT tasks are throttled to allow non RT
tasks to run.

Although the deadline tasks add their run time to the RT bandwidth, it
lets the RT tasks do the accounting. This is where the problem lies. If
a deadline task runs for a bit, and no RT tasks are running, then it
will continually add to the RT rt_time that is used to calculate how
much CPU the RT tasks use. But no RT period is in play, and this
accumulation of the runtime never gets reset.

When an RT task finally gets to run, and the watchdog goes off, it can
see that the RT task has used more than it should of, because the
deadline task added all this runtime to its rt_time. Then the RT task
that just woke up gets throttled for no good reason.

I also noticed that when an RT task is queued, it starts the timer to
account for overload and such. But that timer goes off one period
later, which may be too late and the extra rt_time will trigger a
throttle.

This is a quick work around to the problem. When a new RT task is
queued, the bandwidth timer is set to go off immediately. Then the
timer can clear out the extra time added to the rt_time while there was
no RT task running. This stops my tests from triggering the throttle,
and it will still throttle if an RT task runs too much, even while a
deadline task is running.

A better solution may be to subtract the bandwidth that the deadline
task uses from the rt_runtime, and add it back when its finished. Then
there wont be a need for runtime tracking of the time used by deadline
tasks.

I may play with that solution tomorrow.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <juri.lelli@gmail.com>
Cc: <williams@redhat.com>
Cc: Clark Williams
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160216183746.349ec98b@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:07 +01:00
Peter Zijlstra
ff77e46853 sched/rt: Fix PI handling vs. sched_setscheduler()
Andrea Parri reported:

> I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not
> handled correctly:
>
>     T1 (prio = 20)
>        lock(rtmutex);
>
>     T2 (prio = 20)
>        blocks on rtmutex  (rt_nr_boosted = 0 on T1's rq)
>
>     T1 (prio = 20)
>        sys_set_scheduler(prio = 0)
>           [new_effective_prio == oldprio]
>           T1 prio = 20    (rt_nr_boosted = 0 on T1's rq)
>
> The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted());
> in particular, if we continue with
>
>    T1 (prio = 20)
>       unlock(rtmutex)
>          wakeup(T2)
>          adjust_prio(T1)
>             [prio != rt_mutex_getprio(T1)]
>	    dequeue(T1)
>	       rt_nr_boosted = (unsigned long)(-1)
>	       ...
>             T1 prio = 0
>
> then we end up leaving rt_nr_boosted in an "inconsistent" state.
>
> The simple program attached could reproduce the previous scenario; note
> that, as a consequence of the presence of this state, the "assertion"
>
>     WARN_ON(!rt_nr_running && rt_nr_boosted)
>
> from dec_rt_group() may trigger.

So normally we dequeue/enqueue tasks in sched_setscheduler(), which
would ensure the accounting stays correct. However in the early PI path
we fail to do so.

So this was introduced at around v3.14, by:

  c365c292d0 ("sched: Consider pi boosting in setscheduler()")

which fixed another problem exactly because that dequeue/enqueue, joy.

Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it
preserve runqueue location with that option. This requires decoupling
the on_rt_rq() state from being on the list.

In order to allow for explicit movement during the SAVE/RESTORE,
introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these
cases to preserve other invariants.

Respecting the SAVE/RESTORE flags also has the (nice) side-effect that
things like sys_nice()/sys_sched_setaffinity() also do not reorder
FIFO tasks (whereas they used to before this patch).

Reported-by: Andrea Parri <parri.andrea@gmail.com>
Tested-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:53:05 +01:00
Arnd Bergmann
89b411081d sched/rt: Hide the push_irq_work_func() declaration
The push_irq_work_func() function is conditionally defined only
when both CONFIG_SMP and HAVE_RT_PUSH_IPI are defined, but the
forward declaration remains visibile without HAVE_RT_PUSH_IPI,
causing a gcc warning in ARM64 allnoconfig:

  kernel/sched/rt.c:68:13: warning: 'push_irq_work_func' declared 'static' but never defined [-Wunused-function]

This changes the code to use the same condition for both the
declaration and the function definition, which gets rid of the
warning.

As Peter Zijlstra, we can possibly get rid of the whole HAVE_RT_PUSH_IPI
thing after:

  8053871d0f ("smp: Fix smp_call_function_single_async() locking")

Until that is done, this patch can be used to avoid the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b6366f048e ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Link: http://lkml.kernel.org/r/3828565.oKfGk7yNIT@wuerfel
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-23 09:25:08 +01:00
Juri Lelli
269b26a5ef sched/rt: Make (do_)balance_runtime() return void
The return value of (do_)balance_runtime() is not consumed by anybody.
Make them return void.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441188096-23021-5-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 09:51:26 +02:00
Peter Zijlstra
6c37067e27 sched: Change the sched_class::set_cpus_allowed() calling context
Change the calling context of sched_class::set_cpus_allowed() such
that we can assume the task is inactive.

This allows us to easily make changes that affect accounting done by
enqueue/dequeue. This does in fact completely remove
set_cpus_allowed_rt() and greatly reduces set_cpus_allowed_dl().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.667516139@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:10 +02:00
Peter Zijlstra
c5b2803840 sched: Make sched_class::set_cpus_allowed() unconditional
Give every class a set_cpus_allowed() method, this enables some small
optimization in the RT,DL implementation by avoiding a double
cpumask_weight() call.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.614517487@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:09 +02:00
Xunlei Pang
8fd373548e sched/rt: Remove a redundant condition from task_woken_rt()
'p' has been already queued at this point, so "!task_running(rq, p)"
and "p->nr_cpus_allowed > 1" imply that "has_pushable_tasks(rq)" is
true, so it can be removed.

Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435995563-3723-1-git-send-email-xlpang@126.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-03 12:21:19 +02:00
Peter Zijlstra
cbce1a6867 sched,lockdep: Employ lock pinning
Employ the new lockdep lock pinning annotation to ensure no
'accidental' lock-breaks happen with rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:27 +02:00
Peter Zijlstra
fd7a4bed18 sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks
Remove the direct {push,pull} balancing operations from
switched_{from,to}_rt() / prio_changed_rt() and use the balance
callback queue.

Again, err on the side of too many reschedules; since too few is a
hard bug while too many is just annoying.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124742.766832367@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:26 +02:00
Peter Zijlstra
8046d68062 sched,rt: Remove return value from pull_rt_task()
In order to be able to use pull_rt_task() from a callback, we need to
do away with the return value.

Since the return value indicates if we should reschedule, do this
inside the function. Since not all callers currently do this, this can
increase the number of reschedules due rt balancing.

Too many reschedules is not a correctness issues, too few are.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124742.679002000@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:26 +02:00
Peter Zijlstra
e3fca9e7cb sched: Replace post_schedule with a balance callback list
Generalize the post_schedule() stuff into a balance callback list.
This allows us to more easily use it outside of schedule() and cross
sched_class.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124742.424032725@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-06-19 00:25:26 +02:00
Thomas Gleixner
624bbdfac9 Merge branch 'timers/core' into sched/hrtimers
Merge sched/core and timers/core so we can apply the sched balancing
patch queue, which depends on both.
2015-06-19 00:17:47 +02:00
Peter Zijlstra
4cfafd3082 sched,perf: Fix periodic timers
In the below two commits (see Fixes) we have periodic timers that can
stop themselves when they're no longer required, but need to be
(re)-started when their idle condition changes.

Further complications is that we want the timer handler to always do
the forward such that it will always correctly deal with the overruns,
and we do not want to race such that the handler has already decided
to stop, but the (external) restart sees the timer still active and we
end up with a 'lost' timer.

The problem with the current code is that the re-start can come before
the callback does the forward, at which point the forward from the
callback will WARN about forwarding an enqueued timer.

Now, conceptually its easy to detect if you're before or after the fwd
by comparing the expiration time against the current time. Of course,
that's expensive (and racy) because we don't have the current time.

Alternatively one could cache this state inside the timer, but then
everybody pays the overhead of maintaining this extra state, and that
is undesired.

The only other option that I could see is the external timer_active
variable, which I tried to kill before. I would love a nicer interface
for this seemingly simple 'problem' but alas.

Fixes: 272325c482 ("perf: Fix mux_interval hrtimer wreckage")
Fixes: 77a4d1a1b9 ("sched: Cleanup bandwidth timers")
Cc: pjt@google.com
Cc: tglx@linutronix.de
Cc: klamm@yandex-team.ru
Cc: mingo@kernel.org
Cc: bsegall@google.com
Cc: hpa@zytor.com
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
2015-05-18 17:17:42 +02:00
Jason Low
316c1608d1 sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE()
ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use
the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 12:11:32 +02:00
Peter Zijlstra
77a4d1a1b9 sched: Cleanup bandwidth timers
Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().

The more I look at that code the more I'm convinced its crack, that
entire __start_cfs_bandwidth() thing is brain melting, we don't need to
cancel a timer before starting it, *hrtimer_start*() will happily remove
the timer for you if its still enqueued.

Removing that, removes a big part of the problem, no more ugly cancel
loop to get stuck in.

So now, if I understand things right, the entire reason you have this
cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
accidentally lose the timer.

It appears to me that it should be possible to guarantee that same by
unconditionally (re)starting the timer when !queued. Because regardless
what hrtimer::function will return, if we beat it to (re)enqueue the
timer, it doesn't matter.

Now, because hrtimers don't come with any serialization guarantees we
must ensure both handler and (re)start loop serialize their access to
the hrtimer to avoid both trying to forward the timer at the same
time.

Update the rt bandwidth timer to match.

This effectively reverts: 09dc4ab039 ("sched/fair: Fix
tg_set_cfs_bandwidth() deadlock on rq->lock").

Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-04-22 17:06:53 +02:00
Abel Vesa
07c54f7a7f sched/core: Remove unused argument from init_[rt|dl]_rq()
Obviously, 'rq' is not used in these two functions, therefore,
there is no reason for it to be passed as an argument.

Signed-off-by: Abel Vesa <abelvesa@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02 17:42:55 +02:00
Steven Rostedt
b6366f048e sched/rt: Use IPI to trigger RT task push migration instead of pulling
When debugging the latencies on a 40 core box, where we hit 300 to
500 microsecond latencies, I found there was a huge contention on the
runqueue locks.

Investigating it further, running ftrace, I found that it was due to
the pulling of RT tasks.

The test that was run was the following:

 cyclictest --numa -p95 -m -d0 -i100

This created a thread on each CPU, that would set its wakeup in iterations
of 100 microseconds. The -d0 means that all the threads had the same
interval (100us). Each thread sleeps for 100us and wakes up and measures
its latencies.

cyclictest is maintained at:
 git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git

What happened was another RT task would be scheduled on one of the CPUs
that was running our test, when the other CPU tests went to sleep and
scheduled idle. This caused the "pull" operation to execute on all
these CPUs. Each one of these saw the RT task that was overloaded on
the CPU of the test that was still running, and each one tried
to grab that task in a thundering herd way.

To grab the task, each thread would do a double rq lock grab, grabbing
its own lock as well as the rq of the overloaded CPU. As the sched
domains on this box was rather flat for its size, I saw up to 12 CPUs
block on this lock at once. This caused a ripple affect with the
rq locks especially since the taking was done via a double rq lock, which
means that several of the CPUs had their own rq locks held while trying
to take this rq lock. As these locks were blocked, any wakeups or load
balanceing on these CPUs would also block on these locks, and the wait
time escalated.

I've tried various methods to lessen the load, but things like an
atomic counter to only let one CPU grab the task wont work, because
the task may have a limited affinity, and we may pick the wrong
CPU to take that lock and do the pull, to only find out that the
CPU we picked isn't in the task's affinity.

Instead of doing the PULL, I now have the CPUs that want the pull to
send over an IPI to the overloaded CPU, and let that CPU pick what
CPU to push the task to. No more need to grab the rq lock, and the
push/pull algorithm still works fine.

With this patch, the latency dropped to just 150us over a 20 hour run.
Without the patch, the huge latencies would trigger in seconds.

I've created a new sched feature called RT_PUSH_IPI, which is enabled
by default.

When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
is enabled, the IPI is sent to the overloaded CPU to do a push.

To enabled or disable this at run time:

 # mount -t debugfs nodev /sys/kernel/debug
 # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
or
 # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features

Update: This original patch would send an IPI to all CPUs in the RT overload
list. But that could theoretically cause the reverse issue. That is, there
could be lots of overloaded RT queues and one CPU lowers its priority. It would
then send an IPI to all the overloaded RT queues and they could then all try
to grab the rq lock of the CPU lowering its priority, and then we have the
same problem.

The latest design sends out only one IPI to the first overloaded CPU. It tries to
push any tasks that it can, and then looks for the next overloaded CPU that can
push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
tasks that have priorities greater than the source CPU are covered. In case the
source CPU lowers its priority again, a flag is set to tell the IPI traversal to
restart with the first RT overloaded CPU after the source CPU.

Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joern Engel <joern@purestorage.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-23 10:55:22 +01:00
Tim Chen
80e3d87b2c sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target
This patch adds checks that prevens futile attempts to move rt tasks
to a CPU with active tasks of equal or higher priority.

This reduces run queue lock contention and improves the performance of
a well known OLTP benchmark by 0.7%.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Cc: Suruchi Kadu <suruchi.a.kadu@intel.com>
Cc: Doug Nelson<doug.nelson@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1421430374.2399.27.camel@schen9-desk2.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-30 19:38:49 +01:00
Peter Zijlstra
9edfbfed3f sched/core: Rework rq->clock update skips
The original purpose of rq::skip_clock_update was to avoid 'costly' clock
updates for back to back wakeup-preempt pairs. The big problem with it
has always been that the rq variable is unaware of the context and
causes indiscrimiate clock skips.

Rework the entire thing and create a sense of context by only allowing
schedule() to skip clock updates. (XXX can we measure the cost of the
added store?)

By ensuring only schedule can ever skip an update, we guarantee we're
never more than 1 tick behind on the update.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-14 13:34:20 +01:00
Wanpeng Li
6c1d9410f0 sched: Move p->nr_cpus_allowed check to select_task_rq()
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.

Suggested-and-Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:58:55 +01:00
Ingo Molnar
e9ac5f0fa8 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying more changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-11-16 10:50:25 +01:00