When using schedfreq on cpus with max capacity significantly smaller than
1024, the tick update uses non-normalised capacities - this leads to
selecting an incorrect OPP as we were scaling the frequency as if the
max capacity achievable was 1024 rather than the max for that particular
cpu or group. This could result in a cpu being stuck at the lowest OPP
and unable to generate enough utilisation to climb out if the max
capacity is significantly smaller than 1024.
Instead, normalize the capacity to be in the range 0-1024 in the tick
so that when we later select a frequency, we get the correct one.
Also comments updated to be clearer about what is needed.
Change-Id: Id84391c7ac015311002ada21813a353ee13bee60
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit bc3b0db024f3d0b323e7d93994655e9e3d9f8d68)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
When we convert capacity into frequency, we used policy->max to get
the max freq of the cpu. Since this can be changed by userspace policy
or thermal events, we are potentially asking for a lower frequency
than the utilization demands.
Change over to using cpuinfo.max which is the max freq supported by
that cpu rather than the currently-chosen max. Frequency granted still
honours the max policy.
Tested by setting a userspace policy and observing the relevant vars
in a trace. In this instance, we ask for around 1ghz instead of 620MHz.
freq_new=1013512
unfixed_freq_new=624487
capacity=546
cpuinfo_max=1900800
policy_max=1171200
Change-Id: I8c5694db42243c6fb78bb9be9046b06ac81295e7
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 114c6ceb2e5b0e30442e2ba5f78cfaedf76bf1f9)
[trivial cherry-pick issue]
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
On systems using ACPI cpufreq, at boot-time only CPU0 max_freq has been
populated when load_scale_factor/capacity recomputation is triggered on
all possible cpus. This leads to the following divide by zero panic when
rq->max_freq for any cpu other than CPU0 is accessed:
[ 4.827597] divide error: 0000 [#1] PREEMPT SMP
[ 4.827604] [<ffffffff862ad88e>] ? notifier_call_chain+0x37/0x57
[ 4.827607] [<ffffffff862adaf3>] ? __blocking_notifier_call_chain+0x46/0x5c
[ 4.827611] [<ffffffff8684b9f4>] ? cpufreq_set_policy+0xa6/0x2dc
[ 4.827614] [<ffffffff869e4d55>] ? cpufreq_init_policy+0xba/0xde
[ 4.827617] [<ffffffff8684bd4c>] ? cpufreq_update_policy+0x122/0x122
[ 4.827620] [<ffffffff8684c51c>] ? cpufreq_online+0x5a7/0x64e
[ 4.827623] [<ffffffff8684c63b>] ? cpufreq_add_dev+0x78/0xed
[ 4.827627] [<ffffffff866ce412>] ? subsys_interface_register+0xd6/0x117
[ 4.827630] [<ffffffff8684b410>] ? cpufreq_register_driver+0x168/0x2b2
[ 4.827632] [<ffffffff8684b410>] ? cpufreq_register_driver+0x168/0x2b2
[ 4.827636] [<ffffffff86f83b2e>] ? cpufreq_gov_dbs_init+0xc/0xc
[ 4.827639] [<ffffffff86f83d43>] ? acpi_cpufreq_init+0x215/0x282
[ 4.827641] [<ffffffff86f83b2e>] ? cpufreq_gov_dbs_init+0xc/0xc
[ 4.827645] [<ffffffff8620046e>] ? do_one_initcall+0x9c/0x12e
[ 4.827649] [<ffffffff86f3d058>] ? kernel_init_freeable+0x190/0x22b
[ 4.827651] [<ffffffff869e59d9>] ? rest_init+0x7c/0x7c
[ 4.827653] [<ffffffff869e59e3>] ? kernel_init+0xa/0xf3
[ 4.827656] [<ffffffff869ed482>] ? ret_from_fork+0x22/0x30
On ACPI based systems, for CPUs with uninitialized rq->max_freq skip the
re-calculation.
Reference discussion on a similar issue here:
https://www.spinics.net/lists/arm-kernel/msg557612.html
Change-Id: I5025e3f6db671e9e72369f85708d3b0482d5dad2
Signed-off-by: Abhilash Kesavan <abhilash.kesavan@intel.com>
The util returned from group_max_util is not capped at the max util
present in the group, so it can be larger than the capacity stored in
the array. Ensure that when this happens, we always use the last entry
in the array to fetch energy from.
Tested with synthetics on Juno board.
Change-Id: I89fb52fb7e68fa3e682e308acc232596672d03f7
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Make the schedutil governor take the initial (default) value of the
rate_limit_us sysfs attribute from the (new) transition_delay_us
policy parameter (to be set by the scaling driver).
That will allow scaling drivers to make schedutil use smaller default
values of rate_limit_us and reduce the default average time interval
between consecutive frequency changes.
Make intel_pstate set transition_delay_us to 500.
BACKPORT: Modified to support the separate up_rate_limit_us and
down_rate_limit_us (upstream just has a single rate_limit_us). Also
dropped the changes for intel_pstate as there's a merge conflict.
Change-Id: I62a8543879a4d8582cdcb31ebd55607705d1c8b1
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
(cherry picked from commit 1b72e7fd30)
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Changes in 4.9.58
MIPS: Fix minimum alignment requirement of IRQ stack
Revert "bsg-lib: don't free job in bsg_prepare_job"
xen-netback: Use GFP_ATOMIC to allocate hash
locking/lockdep: Add nest_lock integrity test
watchdog: kempld: fix gcc-4.3 build
irqchip/crossbar: Fix incorrect type of local variables
initramfs: finish fput() before accessing any binary from initramfs
mac80211_hwsim: check HWSIM_ATTR_RADIO_NAME length
ALSA: hda: Add Geminilake HDMI codec ID
qed: Don't use attention PTT for configuring BW
mac80211: fix power saving clients handling in iwlwifi
net/mlx4_en: fix overflow in mlx4_en_init_timestamp()
staging: vchiq_2835_arm: Make cache-line-size a required DT property
netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value.
iio: adc: xilinx: Fix error handling
f2fs: do SSR for data when there is enough free space
sched/fair: Update rq clock before changing a task's CPU affinity
Btrfs: send, fix failure to rename top level inode due to name collision
f2fs: do not wait for writeback in write_begin
md/linear: shutup lockdep warnning
sparc64: Migrate hvcons irq to panicked cpu
net/mlx4_core: Fix VF overwrite of module param which disables DMFS on new probed PFs
crypto: xts - Add ECB dependency
mm/memory_hotplug: set magic number to page->freelist instead of page->lru.next
ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock
slub: do not merge cache if slub_debug contains a never-merge flag
scsi: scsi_dh_emc: return success in clariion_std_inquiry()
ASoC: mediatek: add I2C dependency for CS42XX8
drm/amdgpu: refuse to reserve io mem for split VRAM buffers
net: mvpp2: release reference to txq_cpu[] entry after unmapping
qede: Prevent index problems in loopback test
qed: Reserve doorbell BAR space for present CPUs
qed: Read queue state before releasing buffer
i2c: at91: ensure state is restored after suspending
ceph: don't update_dentry_lease unless we actually got one
ceph: fix bogus endianness change in ceph_ioctl_set_layout
ceph: clean up unsafe d_parent accesses in build_dentry_path
uapi: fix linux/rds.h userspace compilation errors
uapi: fix linux/mroute6.h userspace compilation errors
IB/hfi1: Use static CTLE with Preset 6 for integrated HFIs
IB/hfi1: Allocate context data on memory node
target/iscsi: Fix unsolicited data seq_end_offset calculation
hrtimer: Catch invalid clockids again
nfsd/callback: Cleanup callback cred on shutdown
powerpc/perf: Add restrictions to PMC5 in power9 DD1
drm/nouveau/gr/gf100-: fix ccache error logging
regulator: core: Resolve supplies before disabling unused regulators
btmrvl: avoid double-disable_irq() race
EDAC, mce_amd: Print IPID and Syndrome on a separate line
cpufreq: CPPC: add ACPI_PROCESSOR dependency
usb: dwc3: gadget: Correct ISOC DATA PIDs for short packets
Linux 4.9.58
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 336a9cde10 ]
commit 82e88ff1ea ("hrtimer: Revert CLOCK_MONOTONIC_RAW support") removed
unfortunately a sanity check in the hrtimer code which was part of that
MONOTONIC_RAW patch series.
It would have caught the bogus usage of CLOCK_MONOTONIC_RAW in the wireless
code. So bring it back.
It is way too easy to take any random clockid and feed it to the hrtimer
subsystem. At best, it gets mapped to a monotonic base, but it would be
better to just catch illegal values as early as possible.
Detect invalid clockids, map them to CLOCK_MONOTONIC and emit a warning.
[ tglx: Replaced the BUG by a WARN and gracefully map to CLOCK_MONOTONIC ]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Tomasz Nowicki <tn@semihalf.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Link: http://lkml.kernel.org/r/1452879670-16133-3-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a499c3ead8 ]
This is triggered during boot when CONFIG_SCHED_DEBUG is enabled:
------------[ cut here ]------------
WARNING: CPU: 6 PID: 81 at kernel/sched/sched.h:812 set_next_entity+0x11d/0x380
rq->clock_update_flags < RQCF_ACT_SKIP
CPU: 6 PID: 81 Comm: torture_shuffle Not tainted 4.10.0+ #1
Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
Call Trace:
dump_stack+0x85/0xc2
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5f/0x80
set_next_entity+0x11d/0x380
set_curr_task_fair+0x2b/0x60
do_set_cpus_allowed+0x139/0x180
__set_cpus_allowed_ptr+0x113/0x260
set_cpus_allowed_ptr+0x10/0x20
torture_shuffle+0xfd/0x180
kthread+0x10f/0x150
? torture_shutdown_init+0x60/0x60
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x31/0x40
---[ end trace dd94d92344cea9c6 ]---
The task is running && !queued, so there is no rq clock update before calling
set_curr_task().
This patch fixes it by updating rq clock after holding rq->lock/pi_lock
just as what other dequeue + put_prev + enqueue + set_curr story does.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1487749975-5994-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.9.57
ext4: in ext4_seek_{hole,data}, return -ENXIO for negative offsets
CIFS: Reconnect expired SMB sessions
nl80211: Define policy for packet pattern attributes
rcu: Allow for page faults in NMI handlers
USB: dummy-hcd: Fix deadlock caused by disconnect detection
MIPS: math-emu: Remove pr_err() calls from fpu_emu()
dmaengine: edma: Align the memcpy acnt array size with the transfer
dmaengine: ti-dma-crossbar: Fix possible race condition with dma_inuse
HID: usbhid: fix out-of-bounds bug
crypto: shash - Fix zero-length shash ahash digest crash
KVM: MMU: always terminate page walks at level 1
KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit
usb: renesas_usbhs: Fix DMAC sequence for receiving zero-length packet
pinctrl/amd: Fix build dependency on pinmux code
iommu/amd: Finish TLB flush in amd_iommu_unmap()
device property: Track owner device of device property
fs/mpage.c: fix mpage_writepage() for pages with buffers
ALSA: usb-audio: Kill stray URB at exiting
ALSA: seq: Fix use-after-free at creating a port
ALSA: seq: Fix copy_from_user() call inside lock
ALSA: caiaq: Fix stray URB at probe error path
ALSA: line6: Fix missing initialization before error path
ALSA: line6: Fix leftover URB at error-path during probe
drm/i915/edp: Get the Panel Power Off timestamp after panel is off
drm/i915: Read timings from the correct transcoder in intel_crtc_mode_get()
drm/i915/bios: parse DDI ports also for CHV for HDMI DDC pin and DP AUX channel
usb: gadget: configfs: Fix memory leak of interface directory data
usb: gadget: composite: Fix use-after-free in usb_composite_overwrite_options
direct-io: Prevent NULL pointer access in submit_page_section
fix unbalanced page refcounting in bio_map_user_iov
more bio_map_user_iov() leak fixes
bio_copy_user_iov(): don't ignore ->iov_offset
USB: serial: ftdi_sio: add id for Cypress WICED dev board
USB: serial: cp210x: add support for ELV TFD500
USB: serial: option: add support for TP-Link LTE module
USB: serial: qcserial: add Dell DW5818, DW5819
USB: serial: console: fix use-after-free after failed setup
x86/alternatives: Fix alt_max_short macro to really be a max()
KVM: nVMX: update last_nonleaf_level when initializing nested EPT
Linux 4.9.57
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 28585a8326 upstream.
A number of architecture invoke rcu_irq_enter() on exception entry in
order to allow RCU read-side critical sections in the exception handler
when the exception is from an idle or nohz_full CPU. This works, at
least unless the exception happens in an NMI handler. In that case,
rcu_nmi_enter() would already have exited the extended quiescent state,
which would mean that rcu_irq_enter() would (incorrectly) cause RCU
to think that it is again in an extended quiescent state. This will
in turn result in lockdep splats in response to later RCU read-side
critical sections.
This commit therefore causes rcu_irq_enter() and rcu_irq_exit() to
take no action if there is an rcu_nmi_enter() in effect, thus avoiding
the unscheduled return to RCU quiescent state. This in turn should
make the kernel safe for on-demand RCU voyeurism.
Link: http://lkml.kernel.org/r/20170922211022.GA18084@linux.vnet.ibm.com
Fixes: 0be964be0 ("module: Sanitize RCU usage and locking")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The "goto force_balance" here is intended to mitigate the fact that
avg_load calculations can result in bad placement decisions when
priority is asymmetrical.
The original commit that adds it:
fab476228b ("sched: Force balancing on newidle balance if local group has capacity")
explains:
Under certain situations, such as a niced down task (i.e. nice =
-15) in the presence of nr_cpus NICE0 tasks, the niced task lands
on a sched group and kicks away other tasks because of its large
weight. This leads to sub-optimal utilization of the
machine. Even though the sched group has capacity, it does not
pull tasks because sds.this_load >> sds.max_load, and f_b_g()
returns NULL.
A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU
capacity) systems - consider 8 always-running, same-priority tasks on a
system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on
the "big" CPUs (which will be represented by one sched_group in the DIE
sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving
one CPU unused. Because the "big" group has a higher group_capacity its
avg_load may not present an imbalance that would cause migrating a
task to the idle "little".
The force_balance case here solves the problem but currently only for
CPU_NEWLY_IDLE balances, which in theory might never happen on the
unused CPU. Including CPU_IDLE in the force_balance case means
there's an upper bound on the time before we can attempt to solve the
underutilization: after DIE's sd->balance_interval has passed the
next nohz balance kick will help us out.
Change-Id: I4cbc8ced11e6a56f98ec67633e67e09cbb515268
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 583ffd99d7)
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
find_idlest_group() only compares the runnable_load_avg when looking for
the least loaded group. But on fork intensive use case like hackbench
where tasks blocked quickly after the fork, this can lead to selecting the
same CPU instead of other CPUs, which have similar runnable load but a
lower load_avg.
When the runnable_load_avg of 2 CPUs are close, we now take into account
the amount of blocked load as a 2nd selection factor. There is now 3 zones
for the runnable_load of the rq:
-[0 .. (runnable_load - imbalance)] : Select the new rq which has
significantly less runnable_load
-](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The
runnable loads are close so we use load_avg to chose between the 2 rq
-[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
has significantly less runnable_load
The scale factor that is currently used for comparing runnable_load,
doesn't work well with small value. As an example, the use of a scaling
factor fails as soon as this_runnable_load == 0 because we always select
local rq even if min_runnable_load is only 1, which doesn't really make
sense because they are just the same. So instead of scaling factor, we use
an absolute margin for runnable_load to detect CPUs with similar
runnable_load and we keep using scaling factor for blocked load.
For use case like hackbench, this enable the scheduler to select different
CPUs during the fork sequence and to spread tasks across the system.
Tests have been done on a Hikey board (ARM based octo cores) for several
kernel. The result below gives min, max, avg and stdev values of 18 runs
with each configuration.
The v4.8+patches configuration also includes the changes below which is
part of the proposal made by Peter to ensure that the clock will be up to
date when the fork task will be attached to the rq.
@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
rq = __task_rq_lock(p, &rf);
+ update_rq_clock(rq);
post_init_entity_util_avg(&p->se);
activate_task(rq, p, 0);
hackbench -P -g 1
ea86cb4b767dc603c902 v4.8 v4.8+patches
min 0.049 0.050 0.051 0,048
avg 0.057 0.057(0%) 0.057(0%) 0,055(+5%)
max 0.066 0.068 0.070 0,063
stdev +/-9% +/-9% +/-8% +/-9%
Change-Id: I8fe40bbd106454282b1963db825705f3cf377b9e
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
During fork, the utilization of a task is init once the rq has been
selected because the current utilization level of the rq is used to set
the utilization of the fork task. As the task's utilization is still
null at this step of the fork sequence, it doesn't make sense to look for
some spare capacity that can fit the task's utilization.
Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
because the least loaded policy is always bypassed and tasks are not
spread during fork.
With this patch and the fix below, we are back to same performances as
for v4.8. The fix below is only a temporary one used for the test until a
smarter solution is found because we can't simply remove the test which is
useful for others benchmarks
@@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
avg_cost = this_sd->avg_scan_cost;
- /*
- * Due to large variance we need a large fuzz factor; hackbench in
- * particularly is sensitive here.
- */
- if ((avg_idle / 512) < avg_cost)
- return -1;
-
time = local_clock();
for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
Change-Id: If7dd1c2a81fd74419733b8ecff519ae8cb8eb32d
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Changes in 4.9.55
USB: gadgetfs: Fix crash caused by inadequate synchronization
USB: gadgetfs: fix copy_to_user while holding spinlock
usb: gadget: udc: atmel: set vbus irqflags explicitly
usb: gadget: udc: renesas_usb3: fix for no-data control transfer
usb: gadget: udc: renesas_usb3: fix Pn_RAMMAP.Pn_MPKT value
usb: gadget: udc: renesas_usb3: Fix return value of usb3_write_pipe()
usb-storage: unusual_devs entry to fix write-access regression for Seagate external drives
usb-storage: fix bogus hardware error messages for ATA pass-thru devices
usb: renesas_usbhs: fix the BCLR setting condition for non-DCP pipe
usb: renesas_usbhs: fix usbhsf_fifo_clear() for RX direction
ALSA: usb-audio: Check out-of-bounds access by corrupted buffer descriptor
usb: pci-quirks.c: Corrected timeout values used in handshake
USB: cdc-wdm: ignore -EPIPE from GetEncapsulatedResponse
USB: dummy-hcd: fix connection failures (wrong speed)
USB: dummy-hcd: fix infinite-loop resubmission bug
USB: dummy-hcd: Fix erroneous synchronization change
USB: devio: Don't corrupt user memory
usb: gadget: mass_storage: set msg_registered after msg registered
USB: g_mass_storage: Fix deadlock when driver is unbound
USB: uas: fix bug in handling of alternate settings
USB: core: harden cdc_parse_cdc_header
usb: Increase quirk delay for USB devices
USB: fix out-of-bounds in usb_set_configuration
xhci: fix finding correct bus_state structure for USB 3.1 hosts
xhci: Fix sleeping with spin_lock_irq() held in ASmedia 1042A workaround
xhci: set missing SuperSpeedPlus Link Protocol bit in roothub descriptor
Revert "xhci: Limit USB2 port wake support for AMD Promontory hosts"
iio: adc: twl4030: Fix an error handling path in 'twl4030_madc_probe()'
iio: adc: twl4030: Disable the vusb3v1 rugulator in the error handling path of 'twl4030_madc_probe()'
iio: ad_sigma_delta: Implement a dedicated reset function
staging: iio: ad7192: Fix - use the dedicated reset function avoiding dma from stack.
iio: core: Return error for failed read_reg
IIO: BME280: Updates to Humidity readings need ctrl_reg write!
iio: ad7793: Fix the serial interface reset
iio: adc: mcp320x: Fix readout of negative voltages
iio: adc: mcp320x: Fix oops on module unload
uwb: properly check kthread_run return value
uwb: ensure that endpoint is interrupt
staging: vchiq_2835_arm: Fix NULL ptr dereference in free_pagelist
mm, oom_reaper: skip mm structs with mmu notifiers
lib/ratelimit.c: use deferred printk() version
lsm: fix smack_inode_removexattr and xattr_getsecurity memleak
ALSA: compress: Remove unused variable
Revert "ALSA: echoaudio: purge contradictions between dimension matrix members and total number of members"
ALSA: usx2y: Suppress kernel warning at page allocation failures
mlxsw: spectrum: Prevent mirred-related crash on removal
net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker
sctp: potential read out of bounds in sctp_ulpevent_type_enabled()
tcp: update skb->skb_mstamp more carefully
bpf/verifier: reject BPF_ALU64|BPF_END
tcp: fix data delivery rate
udpv6: Fix the checksum computation when HW checksum does not apply
ip6_gre: skb_push ipv6hdr before packing the header in ip6gre_header
net: phy: Fix mask value write on gmii2rgmii converter speed register
ip6_tunnel: do not allow loading ip6_tunnel if ipv6 is disabled in cmdline
net/sched: cls_matchall: fix crash when used with classful qdisc
tcp: fastopen: fix on syn-data transmit failure
net: emac: Fix napi poll list corruption
packet: hold bind lock when rebinding to fanout hook
bpf: one perf event close won't free bpf program attached by another perf event
isdn/i4l: fetch the ppp_write buffer in one shot
net_sched: always reset qdisc backlog in qdisc_reset()
net: qcom/emac: specify the correct size when mapping a DMA buffer
vti: fix use after free in vti_tunnel_xmit/vti6_tnl_xmit
l2tp: Avoid schedule while atomic in exit_net
l2tp: fix race condition in l2tp_tunnel_delete
tun: bail out from tun_get_user() if the skb is empty
net: dsa: Fix network device registration order
packet: in packet_do_bind, test fanout with bind_lock held
packet: only test po->has_vnet_hdr once in packet_snd
net: Set sk_prot_creator when cloning sockets to the right proto
netlink: do not proceed if dump's start() errs
ip6_gre: ip6gre_tap device should keep dst
ip6_tunnel: update mtu properly for ARPHRD_ETHER tunnel device in tx path
tipc: use only positive error codes in messages
net: rtnetlink: fix info leak in RTM_GETSTATS call
socket, bpf: fix possible use after free
powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks
powerpc/tm: Fix illegal TM state in signal handler
percpu: make this_cpu_generic_read() atomic w.r.t. interrupts
driver core: platform: Don't read past the end of "driver_override" buffer
Drivers: hv: fcopy: restore correct transfer length
stm class: Fix a use-after-free
ftrace: Fix kmemleak in unregister_ftrace_graph
HID: i2c-hid: allocate hid buffers for real worst case
HID: wacom: leds: Don't try to control the EKR's read-only LEDs
HID: wacom: Always increment hdev refcount within wacom_get_hdev_data
HID: wacom: bits shifted too much for 9th and 10th buttons
rocker: fix rocker_tlv_put_* functions for KASAN
netlink: fix nla_put_{u8,u16,u32} for KASAN
iwlwifi: mvm: use IWL_HCMD_NOCOPY for MCAST_FILTER_CMD
iwlwifi: add workaround to disable wide channels in 5GHz
scsi: sd: Do not override max_sectors_kb sysfs setting
brcmfmac: add length check in brcmf_cfg80211_escan_handler()
brcmfmac: setup passive scan if requested by user-space
drm/i915/bios: ignore HDMI on port A
nvme-pci: Use PCI bus address for data/queues in CMB
mmc: core: add driver strength selection when selecting hs400es
sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs
vfs: deny copy_file_range() for non regular files
ext4: fix data corruption for mmap writes
ext4: Don't clear SGID when inheriting ACLs
ext4: don't allow encrypted operations without keys
f2fs: don't allow encrypted operations without keys
KVM: x86: fix singlestepping over syscall
Linux 4.9.55
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 50e7663233 upstream.
Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.
On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.
But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask and this it will not in fact do _anything_.
So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers what do use
cpusets tend to not suspend-resume much.
An addition problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.
Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb7aa308e ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2b0b8499ae upstream.
The trampoline allocated by function tracer was overwriten by function_graph
tracer, and caused a memory leak. The save_global_trampoline should have
saved the previous trampoline in register_ftrace_graph() and restored it in
unregister_ftrace_graph(). But as it is implemented, save_global_trampoline was
only used in unregister_ftrace_graph as default value 0, and it overwrote the
previous trampoline's value. Causing the previous allocated trampoline to be
lost.
kmmeleak backtrace:
kmemleak_vmalloc+0x77/0xc0
__vmalloc_node_range+0x1b5/0x2c0
module_alloc+0x7c/0xd0
arch_ftrace_update_trampoline+0xb5/0x290
ftrace_startup+0x78/0x210
register_ftrace_function+0x8b/0xd0
function_trace_init+0x4f/0x80
tracing_set_tracer+0xe6/0x170
tracing_set_trace_write+0x90/0xd0
__vfs_write+0x37/0x170
vfs_write+0xb2/0x1b0
SyS_write+0x55/0xc0
do_syscall_64+0x67/0x180
return_from_SYSCALL_64+0x0/0x6a
[
Looking further into this, I found that this was left over from when the
function and function graph tracers shared the same ftrace_ops. But in
commit 5f151b2401 ("ftrace: Fix function_profiler and function tracer
together"), the two were separated, and the save_global_trampoline no
longer was necessary (and it may have been broken back then too).
-- Steven Rostedt
]
Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com
Fixes: 5f151b2401 ("ftrace: Fix function_profiler and function tracer together")
Signed-off-by: Shu Wang <shuwang@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ec9dd352d5 ]
This patch fixes a bug exhibited by the following scenario:
1. fd1 = perf_event_open with attr.config = ID1
2. attach bpf program prog1 to fd1
3. fd2 = perf_event_open with attr.config = ID1
<this will be successful>
4. user program closes fd2 and prog1 is detached from the tracepoint.
5. user program with fd1 does not work properly as tracepoint
no output any more.
The issue happens at step 4. Multiple perf_event_open can be called
successfully, but only one bpf prog pointer in the tp_event. In the
current logic, any fd release for the same tp_event will free
the tp_event->prog.
The fix is to free tp_event->prog only when the closing fd
corresponds to the one which registered the program.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Preempt and irq trace events can be used for tracing the start and
end of an atomic section which can be used by a trace viewer like
systrace to graphically view the start and end of an atomic section and
correlate them with latencies and scheduling issues.
This also serves as a prelude to using synthetic events or probes to
rewrite the preempt and irqsoff tracers, along with numerous benefits of
using trace events features for these events.
Change-Id: I4f992714c0161a415b17a03aa14cce823d8ca3bb
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zilstra <peterz@infradead.org>
Cc: kernel-team@android.com
Link: https://patchwork.kernel.org/patch/9988157/
Signed-off-by: Joel Fernandes <joelaf@google.com>
In preparation of adding irqsoff and preemptsoff enable and disable trace
events, move required functions and code to make it easier to add these events
in a later patch. This patch is just code movement and no functional change.
Change-Id: I587d411da5efbc4959bcccd7a05c7a66c231e1e0
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel-team@android.com
Link: https://patchwork.kernel.org/patch/9988159/
Signed-off-by: Joel Fernandes <joelaf@google.com>
Changes in 4.9.53
cifs: release cifs root_cred after exit_cifs
cifs: release auth_key.response for reconnect.
fs/proc: Report eip/esp in /prod/PID/stat for coredumping
mac80211: fix VLAN handling with TXQs
mac80211_hwsim: Use proper TX power
mac80211: flush hw_roc_start work before cancelling the ROC
genirq: Make sparse_irq_lock protect what it should protect
KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce()
KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list
tracing: Fix trace_pipe behavior for instance traces
tracing: Erase irqsoff trace with empty write
md/raid5: fix a race condition in stripe batch
md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list
scsi: scsi_transport_iscsi: fix the issue that iscsi_if_rx doesn't parse nlmsg properly
drm/radeon: disable hard reset in hibernate for APUs
crypto: drbg - fix freeing of resources
crypto: talitos - Don't provide setkey for non hmac hashing algs.
crypto: talitos - fix sha224
crypto: talitos - fix hashing
security/keys: properly zero out sensitive key material in big_key
security/keys: rewrite all of big_key crypto
KEYS: fix writing past end of user-supplied buffer in keyring_read()
KEYS: prevent creating a different user's keyrings
KEYS: prevent KEYCTL_READ on negative key
powerpc/pseries: Fix parent_dn reference leak in add_dt_node()
powerpc/tm: Flush TM only if CPU has TM feature
powerpc/ftrace: Pass the correct stack pointer for DYNAMIC_FTRACE_WITH_REGS
s390/mm: fix write access check in gup_huge_pmd()
PM: core: Fix device_pm_check_callbacks()
Fix SMB3.1.1 guest authentication to Samba
SMB3: Warn user if trying to sign connection that authenticated as guest
SMB: Validate negotiate (to protect against downgrade) even if signing off
SMB3: Don't ignore O_SYNC/O_DSYNC and O_DIRECT flags
vfs: Return -ENXIO for negative SEEK_HOLE / SEEK_DATA offsets
nl80211: check for the required netlink attributes presence
bsg-lib: don't free job in bsg_prepare_job
iw_cxgb4: remove the stid on listen create failure
iw_cxgb4: put ep reference in pass_accept_req()
selftests/seccomp: Support glibc 2.26 siginfo_t.h
seccomp: fix the usage of get/put_seccomp_filter() in seccomp_get_filter()
arm64: Make sure SPsel is always set
arm64: fault: Route pte translation faults via do_translation_fault
KVM: VMX: extract __pi_post_block
KVM: VMX: avoid double list add with VT-d posted interrupts
KVM: VMX: simplify and fix vmx_vcpu_pi_load
kvm/x86: Handle async PF in RCU read-side critical sections
KVM: VMX: Do not BUG() on out-of-bounds guest IRQ
kvm: nVMX: Don't allow L2 to access the hardware CR8
xfs: validate bdev support for DAX inode flag
etnaviv: fix gem object list corruption
PCI: Fix race condition with driver_override
btrfs: fix NULL pointer dereference from free_reloc_roots()
btrfs: propagate error to btrfs_cmp_data_prepare caller
btrfs: prevent to set invalid default subvolid
x86/mm: Fix fault error path using unsafe vma pointer
x86/fpu: Don't let userspace set bogus xcomp_bv
gfs2: Fix debugfs glocks dump
timer/sysclt: Restrict timer migration sysctl values to 0 and 1
KVM: VMX: do not change SN bit in vmx_update_pi_irte()
KVM: VMX: remove WARN_ON_ONCE in kvm_vcpu_trigger_posted_interrupt
cxl: Fix driver use count
KVM: VMX: use cmpxchg64
video: fbdev: aty: do not leak uninitialized padding in clk to userspace
swiotlb-xen: implement xen_swiotlb_dma_mmap callback
Linux 4.9.53
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 66a733ea6b upstream.
As Chris explains, get_seccomp_filter() and put_seccomp_filter() can end
up using different filters. Once we drop ->siglock it is possible for
task->seccomp.filter to have been replaced by SECCOMP_FILTER_FLAG_TSYNC.
Fixes: f8e529ed94 ("seccomp, ptrace: add support for dumping seccomp filters")
Reported-by: Chris Salls <chrissalls5@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[tycho: add __get_seccomp_filter vs. open coding refcount_inc()]
Signed-off-by: Tycho Andersen <tycho@docker.com>
[kees: tweak commit log]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 75df6e688c upstream.
When reading data from trace_pipe, tracing_wait_pipe() performs a
check to see if tracing has been turned off after some data was read.
Currently, this check always looks at global trace state, but it
should be checking the trace instance where trace_pipe is located at.
Because of this bug, cat instances/i1/trace_pipe in the following
script will immediately exit instead of waiting for data:
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
mkdir -p instances/i1
echo 1 > instances/i1/tracing_on
echo 1 > instances/i1/events/sched/sched_process_exec/enable
cat instances/i1/trace_pipe
Link: http://lkml.kernel.org/r/20170917102348.1615-1-tahsin@google.com
Fixes: 10246fa35d ("tracing: give easy way to clear trace buffer")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 12ac1d0f6c upstream.
for_each_active_irq() iterates the sparse irq allocation bitmap. The caller
must hold sparse_irq_lock. Several code pathes expect that an active bit in
the sparse bitmap also has a valid interrupt descriptor.
Unfortunately that's not true. The (de)allocation is a two step process,
which holds the sparse_irq_lock only across the queue/remove from the radix
tree and the set/clear in the allocation bitmap.
If a iteration locks sparse_irq_lock between the two steps, then it might
see an active bit but the corresponding irq descriptor is NULL. If that is
dereferenced unconditionally, then the kernel oopses. Of course, all
iterator sites could be audited and fixed, but....
There is no reason why the sparse_irq_lock needs to be dropped between the
two steps, in fact the code becomes simpler when the mutex is held across
both and the semantics become more straight forward, so future problems of
missing NULL pointer checks in the iteration are avoided and all existing
sites are fixed in one go.
Expand the lock held sections so both operations are covered and the bitmap
and the radixtree are in sync.
Fixes: a05a900a51 ("genirq: Make sparse_lock a mutex")
Reported-and-tested-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Android kernel config fragments now live in a separate repository.
To prevent others from having to search for this location, add a script
to fetch and unpack the fragments.
Update .gitignore to include these fragments.
Change-Id: If2d4a59b86e4573b0a9b3190025dfe4191870b46
Signed-off-by: Steve Muckle <smuckle@google.com>
Changes in 4.9.52
SUNRPC: Refactor svc_set_num_threads()
NFSv4: Fix callback server shutdown
mm: prevent double decrease of nr_reserved_highatomic
orangefs: Don't clear SGID when inheriting ACLs
IB/{qib, hfi1}: Avoid flow control testing for RDMA write operation
drm/sun4i: Implement drm_driver lastclose to restore fbdev console
IB/addr: Fix setting source address in addr6_resolve()
tty: improve tty_insert_flip_char() fast path
tty: improve tty_insert_flip_char() slow path
tty: fix __tty_insert_flip_char regression
pinctrl/amd: save pin registers over suspend/resume
Input: i8042 - add Gigabyte P57 to the keyboard reset table
MIPS: math-emu: <MAX|MAXA|MIN|MINA>.<D|S>: Fix quiet NaN propagation
MIPS: math-emu: <MAX|MAXA|MIN|MINA>.<D|S>: Fix cases of both inputs zero
MIPS: math-emu: <MAX|MIN>.<D|S>: Fix cases of both inputs negative
MIPS: math-emu: <MAXA|MINA>.<D|S>: Fix cases of input values with opposite signs
MIPS: math-emu: <MAXA|MINA>.<D|S>: Fix cases of both infinite inputs
MIPS: math-emu: MINA.<D|S>: Fix some cases of infinity and zero inputs
MIPS: math-emu: Handle zero accumulator case in MADDF and MSUBF separately
MIPS: math-emu: <MADDF|MSUBF>.<D|S>: Fix NaN propagation
MIPS: math-emu: <MADDF|MSUBF>.<D|S>: Fix some cases of infinite inputs
MIPS: math-emu: <MADDF|MSUBF>.<D|S>: Fix some cases of zero inputs
MIPS: math-emu: <MADDF|MSUBF>.<D|S>: Clean up "maddf_flags" enumeration
MIPS: math-emu: <MADDF|MSUBF>.S: Fix accuracy (32-bit case)
MIPS: math-emu: <MADDF|MSUBF>.D: Fix accuracy (64-bit case)
crypto: ccp - Fix XTS-AES-128 support on v5 CCPs
crypto: AF_ALG - remove SGL terminator indicator when chaining
ext4: fix incorrect quotaoff if the quota feature is enabled
ext4: fix quota inconsistency during orphan cleanup for read-only mounts
powerpc: Fix DAR reporting when alignment handler faults
block: Relax a check in blk_start_queue()
md/bitmap: disable bitmap_resize for file-backed bitmaps.
skd: Avoid that module unloading triggers a use-after-free
skd: Submit requests to firmware before triggering the doorbell
scsi: zfcp: fix queuecommand for scsi_eh commands when DIX enabled
scsi: zfcp: add handling for FCP_RESID_OVER to the fcp ingress path
scsi: zfcp: fix capping of unsuccessful GPN_FT SAN response trace records
scsi: zfcp: fix passing fsf_req to SCSI trace on TMF to correlate with HBA
scsi: zfcp: fix missing trace records for early returns in TMF eh handlers
scsi: zfcp: fix payload with full FCP_RSP IU in SCSI trace records
scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response
scsi: zfcp: trace high part of "new" 64 bit SCSI LUN
scsi: megaraid_sas: set minimum value of resetwaittime to be 1 secs
scsi: megaraid_sas: Check valid aen class range to avoid kernel panic
scsi: megaraid_sas: Return pended IOCTLs with cmd_status MFI_STAT_WRONG_STATE in case adapter is dead
scsi: storvsc: fix memory leak on ring buffer busy
scsi: sg: remove 'save_scat_len'
scsi: sg: use standard lists for sg_requests
scsi: sg: off by one in sg_ioctl()
scsi: sg: factor out sg_fill_request_table()
scsi: sg: fixup infoleak when using SG_GET_REQUEST_TABLE
scsi: qla2xxx: Correction to vha->vref_count timeout
scsi: qla2xxx: Fix an integer overflow in sysfs code
ftrace: Fix selftest goto location on error
ftrace: Fix memleak when unregistering dynamic ops when tracing disabled
tracing: Add barrier to trace_printk() buffer nesting modification
tracing: Apply trace_clock changes to instance max buffer
ARC: Re-enable MMU upon Machine Check exception
PCI: shpchp: Enable bridge bus mastering if MSI is enabled
PCI: pciehp: Report power fault only once until we clear it
net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()
s390/mm: fix local TLB flushing vs. detach of an mm address space
s390/mm: fix race on mm->context.flush_mm
media: v4l2-compat-ioctl32: Fix timespec conversion
media: uvcvideo: Prevent heap overflow when accessing mapped controls
PM / devfreq: Fix memory leak when fail to register device
bcache: initialize dirty stripes in flash_dev_run()
bcache: Fix leak of bdev reference
bcache: do not subtract sectors_to_gc for bypassed IO
bcache: correct cache_dirty_target in __update_writeback_rate()
bcache: Correct return value for sysfs attach errors
bcache: fix for gc and write-back race
bcache: fix bch_hprint crash and improve output
Linux 4.9.52
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 170b3b1050 upstream.
Currently trace_clock timestamps are applied to both regular and max
buffers only for global trace. For instance trace, trace_clock
timestamps are applied only to regular buffer. But, regular and max
buffers can be swapped, for example, following a snapshot. So, for
instance trace, bad timestamps can be seen following a snapshot.
Let's apply trace_clock timestamps to instance max buffer as well.
Link: http://lkml.kernel.org/r/ebdb168d0be042dcdf51f81e696b17fabe3609c1.1504642143.git.tom.zanussi@linux.intel.com
Fixes: 277ba0446 ("tracing: Add interface to allow multiple trace buffers")
Signed-off-by: Baohong Liu <baohong.liu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3d9622c12c upstream.
trace_printk() uses 4 buffers, one for each context (normal, softirq, irq
and NMI), such that it does not need to worry about one context preempting
the other. There's a nesting counter that gets incremented to figure out
which buffer to use. If the context gets preempted by another context which
calls trace_printk() it will increment the counter and use the next buffer,
and restore the counter when it is finished.
The problem is that gcc may optimize the modification of the buffer nesting
counter and it may not be incremented in memory before the buffer is used.
If this happens, and the context gets interrupted by another context, it
could pick the same buffer and corrupt the one that is being used.
Compiler barriers need to be added after the nesting variable is incremented
and before it is decremented to prevent usage of the context buffers by more
than one context at the same time.
Cc: Andy Lutomirski <luto@kernel.org>
Fixes: e2ace00117 ("tracing: Choose static tp_printk buffer by explicit nesting count")
Hat-tip-to: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit edb096e007 upstream.
If function tracing is disabled by the user via the function-trace option or
the proc sysctl file, and a ftrace_ops that was allocated on the heap is
unregistered, then the shutdown code exits out without doing the proper
clean up. This was found via kmemleak and running the ftrace selftests, as
one of the tests unregisters with function tracing disabled.
# cat kmemleak
unreferenced object 0xffffffffa0020000 (size 4096):
comm "swapper/0", pid 1, jiffies 4294668889 (age 569.209s)
hex dump (first 32 bytes):
55 ff 74 24 10 55 48 89 e5 ff 74 24 18 55 48 89 U.t$.UH...t$.UH.
e5 48 81 ec a8 00 00 00 48 89 44 24 50 48 89 4c .H......H.D$PH.L
backtrace:
[<ffffffff81d64665>] kmemleak_vmalloc+0x85/0xf0
[<ffffffff81355631>] __vmalloc_node_range+0x281/0x3e0
[<ffffffff8109697f>] module_alloc+0x4f/0x90
[<ffffffff81091170>] arch_ftrace_update_trampoline+0x160/0x420
[<ffffffff81249947>] ftrace_startup+0xe7/0x300
[<ffffffff81249bd2>] register_ftrace_function+0x72/0x90
[<ffffffff81263786>] trace_selftest_ops+0x204/0x397
[<ffffffff82bb8971>] trace_selftest_startup_function+0x394/0x624
[<ffffffff81263a75>] run_tracer_selftest+0x15c/0x1d7
[<ffffffff82bb83f1>] init_trace_selftests+0x75/0x192
[<ffffffff81002230>] do_one_initcall+0x90/0x1e2
[<ffffffff82b7d620>] kernel_init_freeable+0x350/0x3fe
[<ffffffff81d61ec3>] kernel_init+0x13/0x122
[<ffffffff81d72c6a>] ret_from_fork+0x2a/0x40
[<ffffffffffffffff>] 0xffffffffffffffff
Fixes: 12cce594fa ("ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 46320a6acc upstream.
In the second iteration of trace_selftest_ops(), the error goto label is
wrong in the case where trace_selftest_test_global_cnt is off. In the
case of error, it leaks the dynamic ops that was allocated.
Fixes: 95950c2e ("ftrace: Add self-tests for multiple function trace users")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bc4278987e upstream.
A regression of the FTQ noise has been reported by Ying Huang,
on the following hardware:
8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory
... which was caused by this commit:
commit 4e5160766f ("sched/fair: Propagate asynchrous detach")
The only part of the patch that can increase the noise is the update
of blocked load of group entity in update_blocked_averages().
We can optimize this call and skip the update of group entity if its load
and utilization are already null and there is no pending propagation of load
in the task group.
This optimization partly restores the noise score. A more agressive
optimization has been tried but has shown worse score.
Reported-by: ying.huang@linux.intel.com
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: ying.huang@intel.com
Fixes: 4e5160766f ("sched/fair: Propagate asynchrous detach")
Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org
[ Fixed typos, improved layout. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change-Id: Iba93bbda71125ce14519aa30b1beac2bf3c52e1e
Fixes: Change-Id: I5d1a9f273111c0bb7503ca3c9ad2406d12867cb5
("UPSTREAM: sched/fair: Propagate asynchrous detach")
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
CONFIG_SYNC_FILE is required in the absence of CONFIG_SYNC.
Change-Id: Ia52488a273c9e7b3080defe8cea19afc4ad21aaa
Signed-off-by: Steve Muckle <smuckle@google.com>
Changes in 4.9.48
irqchip: mips-gic: SYNC after enabling GIC region
i2c: ismt: Don't duplicate the receive length for block reads
i2c: ismt: Return EMSGSIZE for block reads with bogus length
crypto: algif_skcipher - only call put_page on referenced and used pages
mm, uprobes: fix multiple free of ->uprobes_state.xol_area
mm, madvise: ensure poisoned pages are removed from per-cpu lists
ceph: fix readpage from fscache
cpumask: fix spurious cpumask_of_node() on non-NUMA multi-node configs
cpuset: Fix incorrect memory_pressure control file mapping
alpha: uapi: Add support for __SANE_USERSPACE_TYPES__
CIFS: Fix maximum SMB2 header size
CIFS: remove endian related sparse warning
wl1251: add a missing spin_lock_init()
lib/mpi: kunmap after finishing accessing buffer
xfrm: policy: check policy direction value
drm/ttm: Fix accounting error when fail to get pages for pool
kvm: arm/arm64: Force reading uncached stage2 PGD
epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove()
Linux 4.9.48
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 1c08c22c87 upstream.
The memory_pressure control file was incorrectly set up without
a private value (0, by default). As a result, this control
file was treated like memory_migrate on read. By adding back the
FILE_MEMORY_PRESSURE private value, the correct memory pressure value
will be returned.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7dbdb199d3 ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 355627f518 upstream.
Commit 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().
However, it was overlooked that this introduced an new error path before
the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
being copied from the old mm_struct by the memcpy in dup_mm(). For a
task that has previously hit a uprobe tracepoint, this resulted in the
'struct xol_area' being freed multiple times if the task was killed at
just the right time while forking.
Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
than in uprobe_dup_mmap().
With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
program given by commit 2b7e8665b4 ("fork: fix incorrect fput of
->exe_file causing use-after-free"), provided that a uprobe tracepoint
has been set on the fork_thread() function. For example:
$ gcc reproducer.c -o reproducer -lpthread
$ nm reproducer | grep fork_thread
0000000000400719 t fork_thread
$ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
$ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
$ ./reproducer
Here is the use-after-free reported by KASAN:
BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
Call Trace:
dump_stack+0xdb/0x185
print_address_description+0x7e/0x290
kasan_report+0x23b/0x350
__asan_report_load8_noabort+0x19/0x20
uprobe_clear_state+0x1c4/0x200
mmput+0xd6/0x360
do_exit+0x740/0x1670
do_group_exit+0x13f/0x380
get_signal+0x597/0x17d0
do_signal+0x99/0x1df0
exit_to_usermode_loop+0x166/0x1e0
syscall_return_slowpath+0x258/0x2c0
entry_SYSCALL_64_fastpath+0xbc/0xbe
...
Allocated by task 199:
save_stack_trace+0x1b/0x20
kasan_kmalloc+0xfc/0x180
kmem_cache_alloc_trace+0xf3/0x330
__create_xol_area+0x10f/0x780
uprobe_notify_resume+0x1674/0x2210
exit_to_usermode_loop+0x150/0x1e0
prepare_exit_to_usermode+0x14b/0x180
retint_user+0x8/0x20
Freed by task 199:
save_stack_trace+0x1b/0x20
kasan_slab_free+0xa8/0x1a0
kfree+0xba/0x210
uprobe_clear_state+0x151/0x200
mmput+0xd6/0x360
copy_process.part.8+0x605f/0x65d0
_do_fork+0x1a5/0xbd0
SyS_clone+0x19/0x20
do_syscall_64+0x22f/0x660
return_from_SYSCALL_64+0x0/0x7a
Note: without KASAN, you may instead see a "Bad page state" message, or
simply a general protection fault.
Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
Fixes: 7c05126793 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>