Commit Graph

743 Commits

Author SHA1 Message Date
Tao Huang
3430c68a33 Merge branch 'linux-linaro-lsk-v4.4-android' of git://git.linaro.org/kernel/linux-linaro-stable.git
* linux-linaro-lsk-v4.4-android: (660 commits)
  ANDROID: keychord: Check for write data size
  ANDROID: sdcardfs: Set num in extension_details during make_item
  ANDROID: sdcardfs: Hold i_mutex for i_size_write
  BACKPORT, FROMGIT: crypto: speck - add test vectors for Speck64-XTS
  BACKPORT, FROMGIT: crypto: speck - add test vectors for Speck128-XTS
  BACKPORT, FROMGIT: crypto: arm/speck - add NEON-accelerated implementation of Speck-XTS
  FROMGIT: crypto: speck - export common helpers
  BACKPORT, FROMGIT: crypto: speck - add support for the Speck block cipher
  UPSTREAM: ANDROID: binder: synchronize_rcu() when using POLLFREE.
  f2fs: updates on v4.16-rc1
  BACKPORT: tee: shm: Potential NULL dereference calling tee_shm_register()
  BACKPORT: tee: shm: don't put_page on null shm->pages
  BACKPORT: tee: shm: make function __tee_shm_alloc static
  BACKPORT: tee: optee: check type of registered shared memory
  BACKPORT: tee: add start argument to shm_register callback
  BACKPORT: tee: optee: fix header dependencies
  BACKPORT: tee: shm: inline tee_shm_get_id()
  BACKPORT: tee: use reference counting for tee_context
  BACKPORT: tee: optee: enable dynamic SHM support
  BACKPORT: tee: optee: add optee-specific shared pool implementation
  ...

Conflicts:
	drivers/irqchip/Kconfig
	drivers/media/i2c/tc35874x.c
	drivers/media/v4l2-core/v4l2-compat-ioctl32.c
	drivers/usb/gadget/function/f_fs.c
	fs/f2fs/node.c

Change-Id: Icecd73a515821b536fa3d81ea91b63d9b3699916
2018-03-09 19:10:14 +08:00
Joel Fernandes
6c48002863 ANDROID: sched/rt: schedtune: Add boost retention to RT
Boosted RT tasks can be deboosted quickly, this makes boost usless
for RT tasks and causes lots of glitching. Use timers to prevent
de-boost too soon and wait for long enough such that next enqueue
happens after a threshold.

While this can be solved in the governor, there are following
advantages:
- The approach used is governor-independent
- Reduces boost group lock contention for frequently sleepers/wakers

Note:
Fixed build breakage due to schedfreq dependency which isn't used
for RT anymore.

Bug: 30210506

Change-Id: I428a2695cac06cc3458cdde0dea72315e4e66c00
Signed-off-by: Joel Fernandes <joelaf@google.com>
2018-03-05 21:52:26 +05:30
Amit Pundir
24740dab5c Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>

Conflicts:
    fs/f2fs/extent_cache.c
        Pick changes from AOSP Change-Id: Icd8a85ac0c19a8aa25cd2591a12b4e9b85bdf1c5
        ("f2fs: catch up to v4.14-rc1")

    fs/f2fs/namei.c
        Pick changes from AOSP F2FS backport commit 7d5c08fd91
        ("f2fs: backport from (4c1fad64 - Merge tag 'for-f2fs-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs)")
2018-03-05 20:20:17 +05:30
Steven Rostedt (VMware)
911357aed6 sched/rt: Up the root domain ref count when passing it around via IPIs
commit 364f566537 upstream.

When issuing an IPI RT push, where an IPI is sent to each CPU that has more
than one RT task scheduled on it, it references the root domain's rto_mask,
that contains all the CPUs within the root domain that has more than one RT
task in the runable state. The problem is, after the IPIs are initiated, the
rq->lock is released. This means that the root domain that is associated to
the run queue could be freed while the IPIs are going around.

Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
the root domain's ref count respectively. This way when initiating the IPIs,
the scheduler will up the root domain's ref count before releasing the
rq->lock, ensuring that the root domain does not go away until the IPI round
is complete.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4bdced5c9a ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16 20:09:40 +01:00
Tao Huang
8d2d0b6a51 Merge tag 'lsk-v4.4-18.02-android' of git://git.linaro.org/kernel/linux-linaro-stable.git
LSK 18.02 v4.4-android

* tag 'lsk-v4.4-18.02-android': (131 commits)
  Linux 4.4.114
  nfsd: auth: Fix gid sorting when rootsquash enabled
  net: tcp: close sock if net namespace is exiting
  flow_dissector: properly cap thoff field
  ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY
  net: Allow neigh contructor functions ability to modify the primary_key
  vmxnet3: repair memory leak
  sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf
  sctp: do not allow the v4 socket to bind a v4mapped v6 address
  r8169: fix memory corruption on retrieval of hardware statistics.
  pppoe: take ->needed_headroom of lower device into account on xmit
  net: qdisc_pkt_len_init() should be more robust
  tcp: __tcp_hdrlen() helper
  net: igmp: fix source address check for IGMPv3 reports
  lan78xx: Fix failure in USB Full Speed
  ipv6: ip6_make_skb() needs to clear cork.base.dst
  ipv6: fix udpv6 sendmsg crash caused by too small MTU
  ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL
  dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state
  hrtimer: Reset hrtimer cpu base proper on CPU hotplug
  ...
2018-02-07 20:59:20 +08:00
Alex Shi
59e35359ec Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2018-02-01 12:02:38 +08:00
Daniel Bristot de Oliveira
1d00e3d9b7 sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks
commit 3effcb4247 upstream.

We have been facing some problems with self-suspending constrained
deadline tasks. The main reason is that the original CBS was not
designed for such sort of tasks.

One problem reported by Xunlei Pang takes place when a task
suspends, and then is awakened before the deadline, but so close
to the deadline that its remaining runtime can cause the task
to have an absolute density higher than allowed. In such situation,
the original CBS assumes that the task is facing an early activation,
and so it replenishes the task and set another deadline, one deadline
in the future. This rule works fine for implicit deadline tasks.
Moreover, it allows the system to adapt the period of a task in which
the external event source suffered from a clock drift.

However, this opens the window for bandwidth leakage for constrained
deadline tasks. For instance, a task with the following parameters:

  runtime   = 5 ms
  deadline  = 7 ms
  [density] = 5 / 7 = 0.71
  period    = 1000 ms

If the task runs for 1 ms, and then suspends for another 1ms,
it will be awakened with the following parameters:

  remaining runtime = 4
  laxity = 5

presenting a absolute density of 4 / 5 = 0.80.

In this case, the original CBS would assume the task had an early
wakeup. Then, CBS will reset the runtime, and the absolute deadline will
be postponed by one relative deadline, allowing the task to run.

The problem is that, if the task runs this pattern forever, it will keep
receiving bandwidth, being able to run 1ms every 2ms. Following this
behavior, the task would be able to run 500 ms in 1 sec. Thus running
more than the 5 ms / 1 sec the admission control allowed it to run.

Trying to address the self-suspending case, Luca Abeni, Giuseppe
Lipari, and Juri Lelli [1] revisited the CBS in order to deal with
self-suspending tasks. In the new approach, rather than
replenishing/postponing the absolute deadline, the revised wakeup rule
adjusts the remaining runtime, reducing it to fit into the allowed
density.

A revised version of the idea is:

At a given time t, the maximum absolute density of a task cannot be
higher than its relative density, that is:

  runtime / (deadline - t) <= dl_runtime / dl_deadline

Knowing the laxity of a task (deadline - t), it is possible to move
it to the other side of the equality, thus enabling to define max
remaining runtime a task can use within the absolute deadline, without
over-running the allowed density:

  runtime = (dl_runtime / dl_deadline) * (deadline - t)

For instance, in our previous example, the task could still run:

  runtime = ( 5 / 7 ) * 5
  runtime = 3.57 ms

Without causing damage for other deadline tasks. It is note worthy
that the laxity cannot be negative because that would cause a negative
runtime. Thus, this patch depends on the patch:

  df8eac8caf ("sched/deadline: Throttle a constrained deadline task activated after the deadline")

Which throttles a constrained deadline task activated after the
deadline.

Finally, it is also possible to use the revised wakeup rule for
all other tasks, but that would require some more discussions
about pros and cons.

[The main difference from the original commit is that
 the BW_SHIFT define was not present yet. As BW_SHIFT was
 introduced in a new feature, I just used the value (20),
 likewise we used to use before the #define.
 Other changes were required because of comments. - bistrot]

Reported-by: Xunlei Pang <xpang@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[peterz: replaced dl_is_constrained with dl_is_implicit]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31 12:06:07 +01:00
Tao Huang
640193f76b Merge branch 'linux-linaro-lsk-v4.4-android' of git://git.linaro.org/kernel/linux-linaro-stable.git
* linux-linaro-lsk-v4.4-android: (733 commits)
  LSK-ANDROID: memcg: Remove wrong ->attach callback
  LSK-ANDROID: arm64: mm: Fix __create_pgd_mapping() call
  ANDROID: sdcardfs: Move default_normal to superblock
  blkdev: Refactoring block io latency histogram codes
  FROMLIST: arm64: kpti: Fix the interaction between ASID switching and software PAN
  FROMLIST: arm64: Move post_ttbr_update_workaround to C code
  FROMLIST: arm64: mm: Rename post_ttbr0_update_workaround
  sched: EAS: Initialize push_task as NULL to avoid direct reference on out_unlock path
  fscrypt: updates on 4.15-rc4
  ANDROID: uid_sys_stats: fix the comment
  BACKPORT: tee: indicate privileged dev in gen_caps
  BACKPORT: tee: optee: sync with new naming of interrupts
  BACKPORT: tee: tee_shm: Constify dma_buf_ops structures.
  BACKPORT: tee: optee: interruptible RPC sleep
  BACKPORT: tee: optee: add const to tee_driver_ops and tee_desc structures
  BACKPORT: tee.txt: standardize document format
  BACKPORT: tee: add forward declaration for struct device
  BACKPORT: tee: optee: fix uninitialized symbol 'parg'
  BACKPORT: tee: add ARM_SMCCC dependency
  BACKPORT: selinux: nlmsgtab: add SOCK_DESTROY to the netlink mapping tables
  ...

Conflicts:
	arch/arm64/kernel/vdso.c
	drivers/usb/host/xhci-plat.c
	include/drm/drmP.h
	include/linux/kasan.h
	kernel/time/timekeeping.c
	mm/kasan/kasan.c
	security/selinux/nlmsgtab.c

Also add this commit:
0bcdc0987c ("time: Fix ktime_get_raw() incorrect base accumulation")
2018-01-26 19:26:47 +08:00
Amit Pundir
395ae9f4e6 Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>

Conflicts:
    kernel/fork.c
        Conflict due to Kaiser implementation in LTS 4.4.110.
    net/ipv4/raw.c
        Minor conflict due to LTS commit
        be27b620a8 ("net: ipv4: fix for a race condition in raw_sendmsg")
2018-01-22 11:50:22 +05:30
Andy Lutomirski
18a5348d49 sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
commit 252d2a4117 upstream.

idle_task_exit() can be called with IRQs on x86 on and therefore
should use switch_mm(), not switch_mm_irqs_off().

This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes.  Nonetheless, I think it
should be backported because it's trivial.  There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: f98db6013c ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:22:09 +01:00
Andy Lutomirski
425f13a366 sched/core: Add switch_mm_irqs_off() and use it in the scheduler
commit f98db6013c upstream.

By default, this is the same thing as switch_mm().

x86 will override it as an optimization.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:22:09 +01:00
Tao Huang
afd240d168 Merge branch 'linux-linaro-lsk-v4.4-android' of git://git.linaro.org/kernel/linux-linaro-stable.git
* linux-linaro-lsk-v4.4-android: (510 commits)
  Linux 4.4.103
  Revert "sctp: do not peel off an assoc from one netns to another one"
  xen: xenbus driver must not accept invalid transaction ids
  s390/kbuild: enable modversions for symbols exported from asm
  ASoC: wm_adsp: Don't overrun firmware file buffer when reading region data
  btrfs: return the actual error value from from btrfs_uuid_tree_iterate
  ASoC: rsnd: don't double free kctrl
  netfilter: nf_tables: fix oob access
  netfilter: nft_queue: use raw_smp_processor_id()
  spi: SPI_FSL_DSPI should depend on HAS_DMA
  staging: iio: cdc: fix improper return value
  iio: light: fix improper return value
  mac80211: Suppress NEW_PEER_CANDIDATE event if no room
  mac80211: Remove invalid flag operations in mesh TSF synchronization
  drm: Apply range restriction after color adjustment when allocation
  ALSA: hda - Apply ALC269_FIXUP_NO_SHUTUP on HDA_FIXUP_ACT_PROBE
  ath10k: set CTS protection VDEV param only if VDEV is up
  ath10k: fix potential memory leak in ath10k_wmi_tlv_op_pull_fw_stats()
  ath10k: ignore configuring the incorrect board_id
  ath10k: fix incorrect txpower set by P2P_DEVICE interface
  ...

Conflicts:
	drivers/media/v4l2-core/v4l2-ctrls.c
	kernel/sched/fair.c

Change-Id: I48152b2a0ab1f9f07e1da7823119b94f9b9e1751
2017-12-01 11:04:13 +08:00
Alex Shi
0f646885c3 Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2017-12-01 01:02:04 +08:00
Steven Rostedt (Red Hat)
cb1831a83e sched/rt: Simplify the IPI based RT balancing logic
commit 4bdced5c9a upstream.

When a CPU lowers its priority (schedules out a high priority task for a
lower priority one), a check is made to see if any other CPU has overloaded
RT tasks (more than one). It checks the rto_mask to determine this and if so
it will request to pull one of those tasks to itself if the non running RT
task is of higher priority than the new priority of the next task to run on
the current CPU.

When we deal with large number of CPUs, the original pull logic suffered
from large lock contention on a single CPU run queue, which caused a huge
latency across all CPUs. This was caused by only having one CPU having
overloaded RT tasks and a bunch of other CPUs lowering their priority. To
solve this issue, commit:

  b6366f048e ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

changed the way to request a pull. Instead of grabbing the lock of the
overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

Although the IPI logic worked very well in removing the large latency build
up, it still could suffer from a large number of IPIs being sent to a single
CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
when I tested this on a 120 CPU box, with a stress test that had lots of
RT tasks scheduling on all CPUs, it actually triggered the hard lockup
detector! One CPU had so many IPIs sent to it, and due to the restart
mechanism that is triggered when the source run queue has a priority status
change, the CPU spent minutes! processing the IPIs.

Thinking about this further, I realized there's no reason for each run queue
to send its own IPI. As all CPUs with overloaded tasks must be scanned
regardless if there's one or many CPUs lowering their priority, because
there's no current way to find the CPU with the highest priority task that
can schedule to one of these CPUs, there really only needs to be one IPI
being sent around at a time.

This greatly simplifies the code!

The new approach is to have each root domain have its own irq work, as the
rto_mask is per root domain. The root domain has the following fields
attached to it:

  rto_push_work	 - the irq work to process each CPU set in rto_mask
  rto_lock	 - the lock to protect some of the other rto fields
  rto_loop_start - an atomic that keeps contention down on rto_lock
		    the first CPU scheduling in a lower priority task
		    is the one to kick off the process.
  rto_loop_next	 - an atomic that gets incremented for each CPU that
		    schedules in a lower priority task.
  rto_loop	 - a variable protected by rto_lock that is used to
		    compare against rto_loop_next
  rto_cpu	 - The cpu to send the next IPI to, also protected by
		    the rto_lock.

When a CPU schedules in a lower priority task and wants to make sure
overloaded CPUs know about it. It increments the rto_loop_next. Then it
atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
then it is done, as another CPU is kicking off the IPI loop. If the old
value is "0", then it will take the rto_lock to synchronize with a possible
IPI being sent around to the overloaded CPUs.

If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
IPI being sent around, or one is about to finish. Then rto_cpu is set to the
first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
in rto_mask, then there's nothing to be done.

When the CPU receives the IPI, it will first try to push any RT tasks that is
queued on the CPU but can't run because a higher priority RT task is
currently running on that CPU.

Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
finds one, it simply sends an IPI to that CPU and the process continues.

If there's no more CPUs in the rto_mask, then rto_loop is compared with
rto_loop_next. If they match, everything is done and the process is over. If
they do not match, then a CPU scheduled in a lower priority task as the IPI
was being passed around, and the process needs to start again. The first CPU
in rto_mask is sent the IPI.

This change removes this duplication of work in the IPI logic, and greatly
lowers the latency caused by the IPIs. This removed the lockup happening on
the 120 CPU machine. It also simplifies the code tremendously. What else
could anyone ask for?

Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
supplying me with the rto_start_trylock() and rto_start_unlock() helper
functions.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:37:25 +00:00
Paul E. McKenney
8ff3471878 sched: Make resched_cpu() unconditional
commit 7c2102e56a upstream.

The current implementation of synchronize_sched_expedited() incorrectly
assumes that resched_cpu() is unconditional, which it is not.  This means
that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
fails as follows (analysis by Neeraj Upadhyay):

o	CPU1 is waiting for expedited wait to complete:

	sync_rcu_exp_select_cpus
	     rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
	     IPI sent to CPU5

	synchronize_sched_expedited_wait
		 ret = swait_event_timeout(rsp->expedited_wq,
					   sync_rcu_preempt_exp_done(rnp_root),
					   jiffies_stall);

	expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

o	CPU5 handles IPI and fails to acquire rq lock.

	Handles IPI
	     sync_sched_exp_handler
		 resched_cpu
		     returns while failing to try lock acquire rq->lock
		 need_resched is not set

o	CPU5 calls  rcu_idle_enter() and as need_resched is not set, goes to
	idle (schedule() is not called).

o	CPU 1 reports RCU stall.

Given that resched_cpu() is now used only by RCU, this commit fixes the
assumption by making resched_cpu() unconditional.

Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30 08:37:19 +00:00
Todd Kjos
f353dca59d Revert "ANDROID: sched/rt: schedtune: Add boost retention to RT"
This reverts commit d194ba5d71.

Reason for revert: Broke some builds. Will fix and resubmit.

Change-Id: I4e6fa1562346eda1bbf058f1d5ace5ba6256ce07
2017-11-20 21:15:59 +05:30
Joel Fernandes
4053577123 ANDROID: sched/rt: schedtune: Add boost retention to RT
Boosted RT tasks can be deboosted quickly, this makes boost usless
for RT tasks and causes lots of glitching. Use timers to prevent
de-boost too soon and wait for long enough such that next enqueue
happens after a threshold.

While this can be solved in the governor, there are following
advantages:
- The approach used is governor-independent
- Reduces boost group lock contention for frequently sleepers/wakers
- Works with schedfreq without any other schedfreq hacks.

Bug: 30210506

Change-Id: I41788b235586988be446505deb7c0529758a9898
Signed-off-by: Joel Fernandes <joelaf@google.com>
2017-11-20 21:15:59 +05:30
Viresh Kumar
0dae60cb4e cpufreq: Drop schedfreq governor
We all should be using (and improving) the schedutil governor now. Get
rid of the non-upstream governor.

Tested on Hikey.

Change-Id: Ic660756536e5da51952738c3c18b94e31f58cd57
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-11-20 21:15:59 +05:30
Joonwoo Park
dbf572f159 sched: EAS: upmigrate misfit current task
Upmigrate misfit current task upon scheduler tick with stopper.

We can kick an random (not necessarily big CPU) NOHZ idle CPU when a
CPU bound task is in need of upmigration.  But it's not efficient as that
way needs following unnecessary wakeups:

  1. Busy little CPU A to kick idle B
  2. B runs idle balancer and enqueue migration/A
  3. B goes idle
  4. A runs migration/A, enqueues busy task on B.
  5. B wakes up again.

This change makes active upmigration more efficiently by doing:

  1. Busy little CPU A find target CPU B upon tick.
  2. CPU A enqueues migration/A.

Change-Id: Ie865738054ea3296f28e6ba01710635efa7193c0
[joonwoop: The original version had logic to reserve CPU.  The logic is
 omitted in this version.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
2017-11-20 21:15:59 +05:30
Srivatsa Vaddagiri
a80b8c7559 sched: Extend active balance to accept 'push_task' argument
Active balance currently picks one task to migrate from busy cpu to
a chosen cpu (push_cpu). This patch extends active load balance to
recognize a particular task ('push_task') that needs to be migrated to
'push_cpu'. This capability will be leveraged by HMP-aware task
placement in a subsequent patch.

Change-Id: If31320111e6cc7044e617b5c3fd6d8e0c0e16952
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
[rameezmustafa@codeaurora.org]: Port to msm-3.18]
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2017-11-20 21:15:59 +05:30
Joel Fernandes
ace670d5c6 Revert "sched/core: Warn if ENERGY_AWARE is enabled but data is missing"
This reverts commit a21299785a.

Change-Id: Idb707c80788d8b6d26d400a59d9b14f854cce89f
2017-11-20 21:15:59 +05:30
Joel Fernandes
05a1b4ba04 Revert "sched/core: fix have_sched_energy_data build warning"
This reverts commit a899b9085c.

Reverting for now due to suspend warning issues.

Change-Id: I9387297b52552a73d5a253b9e8e4467b366479a5
2017-11-20 21:15:59 +05:30
Joel Fernandes
27f6af2701 sched/core: fix have_sched_energy_data build warning
have_sched_energy_data is defined only for CONFIG_SMP, so declare it
only with CONFIG_SMP.

Fixes warning from intel bot:

tree:   https://android.googlesource.com/kernel/msm android-4.4
head:   a21299785a
commit: a21299785a [5/5] sched/core: Warn
if ENERGY_AWARE is enabled but data is missing
config: i386-randconfig-x002-201743 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        git checkout a21299785a
        # save the attached .config to linux build tree
        make ARCH=i386

All warnings (new ones prefixed by >>):

>> kernel//sched/core.c:94:13: warning: 'have_sched_energy_data' used
but never defined
    static bool have_sched_energy_data(void);
                ^~~~~~~~~~~~~~~~~~~~~~

vim +/have_sched_energy_data +94 kernel//sched/core.c

    93
  > 94  static bool have_sched_energy_data(void);
    95

Change-Id: I266b63ece6fb31d2b5b11821a8244e147ba6d3a4
Signed-off-by: Joel Fernandes <joelaf@google.com>
2017-11-20 21:15:59 +05:30
Brendan Jackman
3dcbf5e447 sched/core: Warn if ENERGY_AWARE is enabled but data is missing
If the EAS energy model is missing or incomplete, i.e. sd_scs is NULL, then
sched_group_energy will return -EINVAL on the assumption that it raced with a
CPU hotplug event. In that case, energy_diff will return 0 and the energy-aware
wake path will silently fail to trigger any migrations.

This case can be triggered by disabling CONFIG_SCHED_MC on existing platforms,
so that there are no sched_groups with the SD_SHARE_CAP_STATES flag, so that
sd_scs is NULL.

Add checks so that a warning is printed if EAS is ever enabled while the
necessary data is not present.

Change-Id: Id233a510b5ad8b7fcecac0b1d789e730bbfc7c4a
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
2017-11-20 21:15:59 +05:30
Brendan Jackman
2b022c3747 FROMLIST: sched/fair: Use wake_q length as a hint for wake_wide
(from https://patchwork.kernel.org/patch/9895261/)

This patch adds a parameter to select_task_rq, sibling_count_hint
allowing the caller, where it has this information, to inform the
sched_class the number of tasks that are being woken up as part of
the same event.

The wake_q mechanism is one case where this information is available.

select_task_rq_fair can then use the information to detect that it
needs to widen the search space for task placement in order to avoid
overloading the last-level cache domain's CPUs.

                               * * *

The reason I am investigating this change is the following use case
on ARM big.LITTLE (asymmetrical CPU capacity): 1 task per CPU, which
all repeatedly do X amount of work then
pthread_barrier_wait (i.e. sleep until the last task finishes its X
and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
finish faster, and then those CPUs pull over the tasks that are still
running:

     v CPU v           ->time->

                    -------------
   0  (big)         11111  /333
                    -------------
   1  (big)         22222   /444|
                    -------------
   2  (LITTLE)      333333/
                    -------------
   3  (LITTLE)      444444/
                    -------------

Now when task 4 hits the barrier (at |) and wakes the others up,
there are 4 tasks with prev_cpu=<big> and 0 tasks with
prev_cpu=<little>. want_affine therefore means that we'll only look
in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
on the bigs until the next load balance, something like this:

     v CPU v           ->time->

                    ------------------------
   0  (big)         11111  /333  31313\33333
                    ------------------------
   1  (big)         22222   /444|424\4444444
                    ------------------------
   2  (LITTLE)      333333/          \222222
                    ------------------------
   3  (LITTLE)      444444/            \1111
                    ------------------------
                                 ^^^
                           underutilization

So, I'm trying to get want_affine = 0 for these tasks.

I don't _think_ any incarnation of the wakee_flips mechanism can help
us here because which task is waker and which tasks are wakees
generally changes with each iteration.

However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
nice property that we know exactly how many tasks are being woken, so
we can cheat.

It might be a disadvantage that we "widen" _every_ task that's woken in
an event, while select_idle_sibling would work fine for the first
sd_llc_size - 1 tasks.

IIUC, if wake_affine() behaves correctly this trick wouldn't be
necessary on SMP systems, so it might be best guarded by the presence
of SD_ASYM_CPUCAPACITY?

                               * * *

Final note..

In order to observe "perfect" behaviour for this use case, I also had
to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
above we are working through the work queue and have placed tasks 3
and 2, and are about to place task 1:

     v CPU v           ->time->

                    --------------
   0  (big)         11111  /333  3
                    --------------
   1  (big)         22222   /444|4
                    --------------
   2  (LITTLE)      333333/      2
                    --------------
   3  (LITTLE)      444444/          <- Task 1 should go here
                    --------------

If TTWU_QUEUE is enabled, we will not yet have enqueued task
2 (having instead sent a reschedule IPI) or attached its load to CPU
2. So we are likely to also place task 1 on cpu 2. Disabling
TTWU_QUEUE means that we enqueue task 2 before placing task 1,
solving this issue. TTWU_QUEUE is there to minimise rq lock
contention, and I guess that this contention is less of an issue on
big.LITTLE systems since they have relatively few CPUs, which
suggests the trade-off makes sense here.

Change-Id: I2080302839a263e0841a89efea8589ea53bbda9c
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
2017-11-20 21:15:59 +05:30
Joonwoo Park
ae0177aba1 sched: WALT: account cumulative window demand
Energy cost estimation has been a long lasting challenge for WALT
because WALT guides CPU frequency based on the CPU utilization of
previous window.  Consequently it's not possible to know newly
waking-up task's energy cost until WALT's end of the current window.

The WALT already tracks 'Previous Runnable Sum' (prev_runnable_sum)
and 'Cumulative Runnable Average' (cr_avg).  They are designed for
CPU frequency guidance and task placement but unfortunately both
are not suitable for the energy cost estimation.

It's because using prev_runnable_sum for energy cost calculation would
make us to account CPU and task's energy solely based on activity in the
previous window so for example, any task didn't have an activity in the
previous window will be accounted as a 'zero energy cost' task.
Energy estimation with cr_avg is what energy_diff() relies on at present.
However cr_avg can only represent instantaneous picture of energy cost
thus for example, if a CPU was fully occupied for an entire WALT window
and became idle just before window boundary, and if there is a wake-up,
energy_diff() accounts that CPU is a 'zero energy cost' CPU.

As a result, introduce a new accounting unit 'Cumulative Window Demand'.
The cumulative window demand tracks all the tasks' demands have seen in
current window which is neither instantaneous nor actual execution time.
Because task demand represents estimated scaled execution time when the
task runs a full window, accumulation of all the demands represents
predicted CPU load at the end of window.

Thus we can estimate CPU's frequency at the end of current WALT window
with the cumulative window demand.

The use of prev_runnable_sum for the CPU frequency guidance and cr_avg
for the task placement have not changed and these are going to be used
for both purpose while this patch aims to add an additional statistics.

Change-Id: I9908c77ead9973a26dea2b36c001c2baf944d4f5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2017-11-20 21:15:59 +05:30
Peter Zijlstra
98ac5c4cd3 UPSTREAM: sched/core: Add missing update_rq_clock() call in set_user_nice()
Address this rq-clock update bug:

  WARNING: CPU: 30 PID: 195 at ../kernel/sched/sched.h:797 set_next_entity()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    set_next_entity()
    ? _raw_spin_lock()
    set_curr_task_fair()
    set_user_nice.part.85()
    set_user_nice()
    create_worker()
    worker_thread()
    kthread()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 2fb8d36787)
Change-Id: I53ba056e72820c7fadb3f022e4ee3b821c0de17d
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-11-20 21:15:59 +05:30
Peter Zijlstra
d996ec007f UPSTREAM: sched/core: Add missing update_rq_clock() in detach_task_cfs_rq()
Instead of adding the update_rq_clock() all the way at the bottom of
the callstack, add one at the top, this to aid later effort to
minimize update_rq_lock() calls.

  WARNING: CPU: 0 PID: 1 at ../kernel/sched/sched.h:797 detach_task_cfs_rq()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    dump_stack()
    __warn()
    warn_slowpath_fmt()
    detach_task_cfs_rq()
    switched_from_fair()
    __sched_setscheduler()
    _sched_setscheduler()
    sched_set_stop_task()
    cpu_stop_create()
    __smpboot_create_thread.part.2()
    smpboot_register_percpu_thread_cpumask()
    cpu_stop_init()
    do_one_initcall()
    ? print_cpu_info()
    kernel_init_freeable()
    ? rest_init()
    kernel_init()
    ret_from_fork()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 80f5c1b84b)
Change-Id: Ibffde077d18eabec4c2984158bd9d6d73bd0fb96
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-11-20 21:15:59 +05:30
Peter Zijlstra
fc63c5c792 UPSTREAM: sched/core: Add missing update_rq_clock() in post_init_entity_util_avg()
Address this rq-clock update bug:

  WARNING: CPU: 0 PID: 0 at ../kernel/sched/sched.h:797 post_init_entity_util_avg()
  rq->clock_update_flags < RQCF_ACT_SKIP

  Call Trace:
    __warn()
    post_init_entity_util_avg()
    wake_up_new_task()
    _do_fork()
    kernel_thread()
    rest_init()
    start_kernel()

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 4126bad671)
Change-Id: Ibe9a73386896377f96483d195e433259218755a5
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-11-20 21:15:59 +05:30
Peter Zijlstra
630948e7ac BACKPORT: sched/fair: Fix PELT integrity for new tasks
Vincent and Yuyang found another few scenarios in which entity
tracking goes wobbly.

The scenarios are basically due to the fact that new tasks are not
immediately attached and thereby differ from the normal situation -- a
task is always attached to a cfs_rq load average (such that it
includes its blocked contribution) and are explicitly
detached/attached on migration to another cfs_rq.

Scenario 1: switch to fair class

  p->sched_class = fair_class;
  if (queued)
    enqueue_task(p);
      ...
        enqueue_entity()
	  enqueue_entity_load_avg()
	    migrated = !sa->last_update_time (true)
	    if (migrated)
	      attach_entity_load_avg()
  check_class_changed()
    switched_from() (!fair)
    switched_to()   (fair)
      switched_to_fair()
        attach_entity_load_avg()

If @p is a new task that hasn't been fair before, it will have
!last_update_time and, per the above, end up in
attach_entity_load_avg() _twice_.

Scenario 2: change between cgroups

  sched_move_group(p)
    if (queued)
      dequeue_task()
    task_move_group_fair()
      detach_task_cfs_rq()
        detach_entity_load_avg()
      set_task_rq()
      attach_task_cfs_rq()
        attach_entity_load_avg()
    if (queued)
      enqueue_task();
        ...
          enqueue_entity()
	    enqueue_entity_load_avg()
	      migrated = !sa->last_update_time (true)
	      if (migrated)
	        attach_entity_load_avg()

Similar as with scenario 1, if @p is a new task, it will have
!load_update_time and we'll end up in attach_entity_load_avg()
_twice_.

Furthermore, notice how we do a detach_entity_load_avg() on something
that wasn't attached to begin with.

As stated above; the problem is that the new task isn't yet attached
to the load tracking and thereby violates the invariant assumption.

This patch remedies this by ensuring a new task is indeed properly
attached to the load tracking on creation, through
post_init_entity_util_avg().

Of course, this isn't entirely as straightforward as one might think,
since the task is hashed before we call wake_up_new_task() and thus
can be poked at. We avoid this by adding TASK_NEW and teaching
cpu_cgroup_can_attach() to refuse such tasks.

.:: BACKPORT

Complicated by the fact that mch of the lines changed by the original
of this commit were then changed by:

df217913e7 sched/fair: Factorize attach/detach entity <Vincent Guittot>

and then

d31b1a66cb sched/fair: Factorize PELT update <Vincent Guittot>

, which have both already been backported here.

Reported-by: Yuyang Du <yuyang.du@intel.com>
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7dc603c902)
Change-Id: Ibc59eb52310a62709d49a744bd5a24e8b97c4ae8
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-11-20 21:15:59 +05:30
Vincent Guittot
73f39e489e BACKPORT: sched/cgroup: Fix cpu_cgroup_fork() handling
A new fair task is detached and attached from/to task_group with:

  cgroup_post_fork()
    ss->fork(child) := cpu_cgroup_fork()
      sched_move_task()
        task_move_group_fair()

Which is wrong, because at this point in fork() the task isn't fully
initialized and it cannot 'move' to another group, because its not
attached to any group as yet.

In fact, cpu_cgroup_fork() needs a small part of sched_move_task() so we
can just call this small part directly instead sched_move_task(). And
the task doesn't really migrate because it is not yet attached so we
need the following sequence:

  do_fork()
    sched_fork()
      __set_task_cpu()

    cgroup_post_fork()
      set_task_rq() # set task group and runqueue

    wake_up_new_task()
      select_task_rq() can select a new cpu
      __set_task_cpu
      post_init_entity_util_avg
        attach_task_cfs_rq()
      activate_task
        enqueue_task

This patch makes that happen.

BACKPORT: Difference from original commit:

- Removed use of DEQUEUE_MOVE (which isn't defined in 4.4) in
  dequeue_task flags
- Replaced "struct rq_flags rf" with "unsigned long flags".

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Added TASK_SET_GROUP to set depth properly. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ea86cb4b76)
Change-Id: I8126fd923288acf961218431ffd29d6bf6fd8d72
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-11-20 21:15:59 +05:30
Peter Zijlstra
b8cb77612b UPSTREAM: sched/fair: Fix and optimize the fork() path
The task_fork_fair() callback already calls __set_task_cpu() and takes
rq->lock.

If we move the sched_class::task_fork callback in sched_fork() under
the existing p->pi_lock, right after its set_task_cpu() call, we can
avoid doing two such calls and omit the IRQ disabling on the rq->lock.

Change to __set_task_cpu() to skip the migration bits, this is a new
task, not a migration. Similarly, make wake_up_new_task() use
__set_task_cpu() for the same reason, the task hasn't actually
migrated as it hasn't ever ran.

This cures the problem of calling migrate_task_rq_fair(), which does
remove_entity_from_load_avg() on tasks that have never been added to
the load avg to begin with.

This bug would result in transiently messed up load_avg values, averaged
out after a few dozen milliseconds. This is probably the reason why
this bug was not found for such a long time.

Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit e210bffd39)
Change-Id: Icbddbaa6e8c1071859673d8685bc3f38955cf144
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-11-20 21:15:59 +05:30
Chris Redpath
7beab85489 cpufreq/sched: Consider max cpu capacity when choosing frequencies
When using schedfreq on cpus with max capacity significantly smaller than
1024, the tick update uses non-normalised capacities - this leads to
selecting an incorrect OPP as we were scaling the frequency as if the
max capacity achievable was 1024 rather than the max for that particular
cpu or group. This could result in a cpu being stuck at the lowest OPP
and unable to generate enough utilisation to climb out if the max
capacity is significantly smaller than 1024.

Instead, normalize the capacity to be in the range 0-1024 in the tick
so that when we later select a frequency, we get the correct one.

Also comments updated to be clearer about what is needed.

Change-Id: Id84391c7ac015311002ada21813a353ee13bee60
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-11-20 21:15:59 +05:30
Chris Redpath
926b5b1536 BACKPORT: sched/fair: Make it possible to account fair load avg consistently
While set_task_rq_fair() is introduced in mainline by commit ad936d8658
("sched/fair: Make it possible to account fair load avg consistently"),
the function results to be introduced here by the backport of
commit 09a43ace1f ("sched/fair: Propagate load during synchronous
attach/detach"). The problem (apart from the confusion introduced by the
backport) is actually that set_task_rq_fair() is currently not called at
all.

Fix the problem by backporting again commit ad936d8658
("sched/fair: Make it possible to account fair load avg consistently").

Original change log:

The current code accounts for the time a task was absent from the fair
class (per ATTACH_AGE_LOAD). However it does not work correctly when a
task got migrated or moved to another cgroup while outside of the fair
class.

This patch tries to address that by aging on migration. We locklessly
read the 'last_update_time' stamp from both the old and new cfs_rq,
ages the load upto the old time, and sets it to the new time.

These timestamps should in general not be more than 1 tick apart from
one another, so there is a definite bound on things.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Changelog, a few edits and !SMP build fix ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry-picked from ad936d8658)
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Change-Id: I17294ab0ada3901d35895014715fd60952949358
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
2017-11-20 21:15:59 +05:30
Joonwoo Park
cd46b9e102 sched: EAS/WALT: finish accounting prior to task_tick
In order to set rq->misfit_task in time, call update_task_ravg() prior
to task_tick.  This reduces upmigration delay by 1 scheduler window.

Change-Id: I7cc80badd423f2e7684125fbfd853b0a3610f0e8
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
2017-11-20 21:15:59 +05:30
Joonwoo Park
4ffc773858 cpufreq: sched: update capacity request upon tick always
At present, sched_freq_tick() skips updating of capacity update when
current frequency is fmax.  This can cause incorrect frequency drop
when a CPU bound task goes into sleep for example :
  1) A task (A) enqueues onto CPU 0 and executes for long time.
  2) A new task (B) which has low task demand enqueues onto CPU 1 and
     executes long so becomes a CPU bound task.
  3) Both CPU 0 and 1 gets scheduler tick but skip sched_freq_tick()
     since current frequency is fmax.
  4) Task (A) sleeps and lower the CPU 0's capacity request.
  5) Because task (B) voted CPU capacity at step 2 with low demand and
     skipped to request afterwards, cluster frequency for both CPU 0
     and 1 drops to match capacity voted by CPU 1 at step 2 even though
     task (B) on CPU 1 requires max capacity.

Fix such incorrectness by not skipping CPU capacity voting at tick
path.

Change-Id: Ieb46af1ac96ffce7a5532c58c7f07bf1ada06b86
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
2017-11-20 21:15:59 +05:30
Vikram Mulukutla
1765eacee8 sched: walt: Leverage existing helper APIs to apply invariance
There's no need for a separate hierarchy of notifiers, APIs
and variables in walt.c for the purpose of applying frequency
and IPC invariance. Let's just use capacity_curr_of and get
rid of a lot of the infrastructure relating to capacity,
load_scale_factor etc.

Change-Id: Ia220e2c896373fa535db05bff60f9aa33aefc978
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
2017-11-20 21:15:59 +05:30
Olav Haugan
b7d6f8f22b sched: Update task->on_rq when tasks are moving between runqueues
Task->on_rq has three states:
	0 - Task is not on runqueue (rq)
	1 (TASK_ON_RQ_QUEUED) - Task is on rq
	2 (TASK_ON_RQ_MIGRATING) - Task is on rq but in the
	process of being migrated to another rq

When a task is moving between rqs task->on_rq state should be
TASK_ON_RQ_MIGRATING in order for WALT to account rq's cumulative
runnable average correctly.  Without such state marking for all the
classes, WALT's update_history() would try to fixup task's demand
which was never contributed to any of CPUs during migration.

Change-Id: Iced3428f3924fe8ab5d0075698273ead04f12d5b
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
[joonwoop: Reinforced changelog to explain why this is needed by WALT.
           Fixed conflicts in deadline.c]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2017-11-20 21:15:59 +05:30
Mark Rutland
7b63774564 UPSTREAM: sched/kasan: remove stale KASAN poison after hotplug
Functions which the compiler has instrumented for KASAN place poison on
the stack shadow upon entry and remove this poision prior to returning.

In the case of CPU hotplug, CPUs exit the kernel a number of levels deep
in C code.  Any instrumented functions on this critical path will leave
portions of the stack shadow poisoned.

When a CPU is subsequently brought back into the kernel via a different
path, depending on stackframe, layout calls to instrumented functions
may hit this stale poison, resulting in (spurious) KASAN splats to the
console.

To avoid this, clear any stale poison from the idle thread for a CPU
prior to bringing a CPU online.

Change-Id: Idd24e933ce0a93b500d17de8262afe6e43d565c8
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tao Huang <huangtao@rock-chips.com>
(cherry picked from commit e1b77c9298)
2017-11-03 18:04:44 +08:00
Alex Shi
80cd9be34a Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android
Conflicts:
	included wakeup_reason.h file of 57caa2ad5c in kernel/power/process.c
2017-10-13 23:14:45 +08:00
Peter Zijlstra
90fd673873 sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs
commit 50e7663233 upstream.

Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.

On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.

But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask and this it will not in fact do _anything_.

So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers what do use
cpusets tend to not suspend-resume much.

An addition problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.

Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb7aa308e ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-12 11:27:35 +02:00
Joonwoo Park
d6614c4e2c cpufreq: sched: WALT: don't apply capacity margin twice
With WALT all the scheduler classes' load are accounted in scr->cfs and
update_cpu_capacity_request() adds capacity margin.  At present, at tick
path, scheduler also adds capacity margin.  Therefore the margin applied
twice.

Fix such error by using margin applied cpu utilization only for checking
whether frequency increase is needed.

Change-Id: Id7d8cc73b2e4eec70b274ca66e09bb0b16bf6f09
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
(trivial rebase conflict)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
2017-09-18 21:14:33 +01:00
Joonwoo Park
cf9ee81955 sched: EAS/WALT: use cr_avg instead of prev_runnable_sum
WALT accounts two major statistics; CPU load and cumulative tasks
demand.

The CPU load which is account of accumulated each CPU's absolute
execution time is for CPU frequency guidance.  Whereas cumulative
tasks demand which is each CPU's instantaneous load to reflect
CPU's load at given time is for task placement decision.

Use cumulative tasks demand for cpu_util() for task placement and
introduce cpu_util_freq() for frequency guidance.

Change-Id: Id928f01dbc8cb2a617cdadc584c1f658022565c5
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2017-09-18 21:14:32 +01:00
Amit Pundir
b6488ff4cb Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android
Conflicts:
        kernel/sched/sched.h
        Refactor the changes from LTS commit 62208707b4
        ("sched/cputime: Fix prev steal time accouting during CPU hotplug")
        to align with the changes from AOSP commit dee8fa1552
        ("sched: backport cpufreq hooks from 4.9-rc4").

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2017-08-11 19:28:33 +05:30
Wanpeng Li
62208707b4 sched/cputime: Fix prev steal time accouting during CPU hotplug
commit 3d89e5478b upstream.

Commit:

  e9532e69b8 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")

... set rq->prev_* to 0 after a CPU hotplug comes back, in order to
fix the case where (after CPU hotplug) steal time is smaller than
rq->prev_steal_time.

However, this should never happen. Steal time was only smaller because of the
KVM-specific bug fixed by the previous patch.  Worse, the previous patch
triggers a bug on CPU hot-unplug/plug operation: because
rq->prev_steal_time is cleared, all of the CPU's past steal time will be
accounted again on hot-plug.

Since the root cause has been fixed, we can just revert commit e9532e69b8.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 'commit e9532e69b8 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")'
Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andres Oportus <andresoportus@google.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:43 -07:00
Konstantin Khlebnikov
0e0967e262 sched/cgroup: Move sched_online_group() back into css_online() to fix crash
commit 96b777452d upstream.

Commit:

  2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")

.. moved sched_online_group() from css_online() to css_alloc().
It exposes half-baked task group into global lists before initializing
generic cgroup stuff.

LTP testcase (third in cgroup_regression_test) written for testing
similar race in kernels 2.6.26-2.6.28 easily triggers this oops:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: kernfs_path_from_node_locked+0x260/0x320
  CPU: 1 PID: 30346 Comm: cat Not tainted 4.10.0-rc5-test #4
  Call Trace:
  ? kernfs_path_from_node+0x4f/0x60
  kernfs_path_from_node+0x3e/0x60
  print_rt_rq+0x44/0x2b0
  print_rt_stats+0x7a/0xd0
  print_cpu+0x2fc/0xe80
  ? __might_sleep+0x4a/0x80
  sched_debug_show+0x17/0x30
  seq_read+0xf2/0x3b0
  proc_reg_read+0x42/0x70
  __vfs_read+0x28/0x130
  ? security_file_permission+0x9b/0xc0
  ? rw_verify_area+0x4e/0xb0
  vfs_read+0xa5/0x170
  SyS_read+0x46/0xa0
  entry_SYSCALL_64_fastpath+0x1e/0xad

Here the task group is already linked into the global RCU-protected 'task_groups'
list, but the css->cgroup pointer is still NULL.

This patch reverts this chunk and moves online back to css_online().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2f5177f0fd ("sched/cgroup: Fix/cleanup cgroup teardown/init")
Link: http://lkml.kernel.org/r/148655324740.424917.5302984537258726349.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-06 19:19:42 -07:00
Alex Shi
1aaeb498dc Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android 2017-07-22 12:02:06 +08:00
Lauro Ramos Venancio
988067ec96 sched/topology: Optimize build_group_mask()
commit f32d782e31 upstream.

The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:59 +02:00
Peter Zijlstra
5c34f49776 sched/topology: Fix overlapping sched_group_mask
commit 73bb059f9b upstream.

The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.

The current code gets it wrong, as can be readily demonstrated with a
topology like:

  node   0   1   2   3
    0:  10  20  30  20
    1:  20  10  20  30
    2:  30  20  10  20
    3:  20  30  20  10

Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.

The fixed topology looks like:

  [] CPU1 attaching sched-domain:
  []  domain 0: span 0-2 level NUMA
  []   groups: 1 (mask: 1), 2, 0
  []   domain 1: span 0-3 level NUMA
  []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)

(note: this relies on OVERLAP domains to always have children, this is
 true because the regular topology domains are still here -- this is
 before degenerate trimming)

Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:59 +02:00
Andres Oportus
9955347b2e sched/core: Fix PELT jump to max OPP upon util increase
Change-Id: Ic80b588ec466ef707f658dcea039fd0d6b384b63
Signed-off-by: Andres Oportus <andresoportus@google.com>
2017-06-21 16:37:38 +05:30