It will cause deadlock and while(1) if call printk while schedule
is in progress. The block state like as below:
cpu0(hold the console sem):
printk->console_unlock->up_sem->spin_lock(&sem->lock)->wake_up_process(cpu1)
->try_to_wake_up(cpu1)->while(p->on_cpu).
cpu1(request console sem):
console_lock->down_sem->schedule->idle_banlance->update_cpu_capacity->
printk->console_trylock->spin_lock(&sem->lock).
p->on_cpu will be 1 forever, because the task is still running on cpu1,
so cpu0 is blocked in while(p->on_cpu), but cpu1 could not get
spin_lock(&sem->lock), it is blocked too, it means the task will running
on cpu1 forever.
Change-Id: I60d02d8c957273872f97939632bdd235accdad4e
Signed-off-by: Chen Liang <cl@rock-chips.com>
Changes in 4.19.20
Fix "net: ipv4: do not handle duplicate fragments as overlapping"
drm/msm/gpu: fix building without debugfs
ipv6: Consider sk_bound_dev_if when binding a socket to an address
ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulation
ipvlan, l3mdev: fix broken l3s mode wrt local routes
l2tp: copy 4 more bytes to linear part if necessary
l2tp: fix reading optional fields of L2TPv3
net: ip_gre: always reports o_key to userspace
net: ip_gre: use erspan key field for tunnel lookup
net/mlx4_core: Add masking for a few queries on HCA caps
netrom: switch to sock timer API
net/rose: fix NULL ax25_cb kernel panic
net: set default network namespace in init_dummy_netdev()
ravb: expand rx descriptor data to accommodate hw checksum
sctp: improve the events for sctp stream reset
tun: move the call to tun_set_real_num_queues
ucc_geth: Reset BQL queue when stopping device
vhost: fix OOB in get_rx_bufs()
net: ip6_gre: always reports o_key to userspace
sctp: improve the events for sctp stream adding
net/mlx5e: Allow MAC invalidation while spoofchk is ON
ip6mr: Fix notifiers call on mroute_clean_tables()
Revert "net/mlx5e: E-Switch, Initialize eswitch only if eswitch manager"
sctp: set chunk transport correctly when it's a new asoc
sctp: set flow sport from saddr only when it's 0
virtio_net: Don't enable NAPI when interface is down
virtio_net: Don't call free_old_xmit_skbs for xdp_frames
virtio_net: Fix not restoring real_num_rx_queues
virtio_net: Fix out of bounds access of sq
virtio_net: Don't process redirected XDP frames when XDP is disabled
virtio_net: Use xdp_return_frame to free xdp_frames on destroying vqs
virtio_net: Differentiate sk_buff and xdp_frame on freeing
CIFS: Do not count -ENODATA as failure for query directory
CIFS: Fix trace command logging for SMB2 reads and writes
CIFS: Do not consider -ENODATA as stat failure for reads
fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb()
iommu/vt-d: Fix memory leak in intel_iommu_put_resv_regions()
selftests/seccomp: Enhance per-arch ptrace syscall skip tests
NFS: Fix up return value on fatal errors in nfs_page_async_flush()
ARM: cns3xxx: Fix writing to wrong PCI config registers after alignment
arm64: kaslr: ensure randomized quantities are clean also when kaslr is off
arm64: Do not issue IPIs for user executable ptes
arm64: hyp-stub: Forbid kprobing of the hyp-stub
arm64: hibernate: Clean the __hyp_text to PoC after resume
gpio: altera-a10sr: Set proper output level for direction_output
gpiolib: fix line event timestamps for nested irqs
gpio: pcf857x: Fix interrupts on multiple instances
gpio: sprd: Fix the incorrect data register
gpio: sprd: Fix incorrect irq type setting for the async EIC
gfs2: Revert "Fix loop in gfs2_rbm_find"
mmc: bcm2835: Fix DMA channel leak on probe error
mmc: mediatek: fix incorrect register setting of hs400_cmd_int_delay
ALSA: usb-audio: Add Opus #3 to quirks for native DSD support
ALSA: hda/realtek - Fixed hp_pin no value
IB/hfi1: Remove overly conservative VM_EXEC flag check
platform/x86: asus-nb-wmi: Map 0x35 to KEY_SCREENLOCK
platform/x86: asus-nb-wmi: Drop mapping of 0x33 and 0x34 scan codes
mmc: sdhci-iproc: handle mmc_of_parse() errors during probe
Btrfs: fix deadlock when allocating tree block during leaf/node split
btrfs: On error always free subvol_name in btrfs_mount
kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAIT
oom, oom_reaper: do not enqueue same task twice
mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
mm, oom: fix use-after-free in oom_kill_process
mm: hwpoison: use do_send_sig_info() instead of force_sig()
mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
of: Convert to using %pOFn instead of device_node.name
of: overlay: add tests to validate kfrees from overlay removal
of: overlay: add missing of_node_get() in __of_attach_node_sysfs
of: overlay: use prop add changeset entry for property in new nodes
of: overlay: do not duplicate properties from overlay for new nodes
md/raid5: fix 'out of memory' during raid cache recovery
cifs: Always resolve hostname before reconnecting
Linux 4.19.20
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8fb335e078 upstream.
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then
zap_pid_ns_processes() waits on all tasks in a current pidns, and only
then are tasks from the dead list released.
zap_pid_ns_processes() can get stuck on waiting tasks from the dead
list. In this case, we will have one unkillable process with one or
more dead children.
Thanks to Oleg for the advice to release tasks in find_child_reaper().
Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
Fixes: 7c8bd2322c ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.19.19
amd-xgbe: Fix mdio access for non-zero ports and clause 45 PHYs
net: bridge: Fix ethernet header pointer before check skb forwardable
net: Fix usage of pskb_trim_rcsum
net: phy: marvell: Errata for mv88e6390 internal PHYs
net: phy: mdio_bus: add missing device_del() in mdiobus_register() error handling
net/sched: act_tunnel_key: fix memory leak in case of action replace
net_sched: refetch skb protocol for each filter
openvswitch: Avoid OOB read when parsing flow nlattrs
vhost: log dirty page correctly
mlxsw: pci: Increase PCI SW reset timeout
net: ipv4: Fix memory leak in network namespace dismantle
mlxsw: spectrum_fid: Update dummy FID index
mlxsw: pci: Ring CQ's doorbell before RDQ's
net/sched: cls_flower: allocate mask dynamically in fl_change()
udp: with udp_segment release on error path
ip6_gre: fix tunnel list corruption for x-netns
erspan: build the header with the right proto according to erspan_ver
net: phy: marvell: Fix deadlock from wrong locking
ip6_gre: update version related info when changing link
tcp: allow MSG_ZEROCOPY transmission also in CLOSE_WAIT state
mei: me: mark LBG devices as having dma support
mei: me: add denverton innovation engine device IDs
USB: leds: fix regression in usbport led trigger
USB: serial: simple: add Motorola Tetra TPG2200 device id
USB: serial: pl2303: add new PID to support PL2303TB
ceph: clear inode pointer when snap realm gets dropped by its inode
ASoC: atom: fix a missing check of snd_pcm_lib_malloc_pages
ASoC: rt5514-spi: Fix potential NULL pointer dereference
ASoC: tlv320aic32x4: Kernel OOPS while entering DAPM standby mode
clk: socfpga: stratix10: fix rate calculation for pll clocks
clk: socfpga: stratix10: fix naming convention for the fixed-clocks
inotify: Fix fd refcount leak in inotify_add_watch().
ALSA: hda/realtek - Fix typo for ALC225 model
ALSA: hda - Add mute LED support for HP ProBook 470 G5
ARCv2: lib: memeset: fix doing prefetchw outside of buffer
ARC: adjust memblock_reserve of kernel memory
ARC: perf: map generic branches to correct hardware condition
s390/mm: always force a load of the primary ASCE on context switch
s390/early: improve machine detection
s390/smp: fix CPU hotplug deadlock with CPU rescan
misc: ibmvsm: Fix potential NULL pointer dereference
char/mwave: fix potential Spectre v1 vulnerability
mmc: dw_mmc-bluefield: : Fix the license information
mmc: meson-gx: Free irq in release() callback
staging: rtl8188eu: Add device code for D-Link DWA-121 rev B1
tty: Handle problem if line discipline does not have receive_buf
uart: Fix crash in uart_write and uart_put_char
tty/n_hdlc: fix __might_sleep warning
hv_balloon: avoid touching uninitialized struct page during tail onlining
Drivers: hv: vmbus: Check for ring when getting debug info
vgacon: unconfuse vc_origin when using soft scrollback
CIFS: Fix possible hang during async MTU reads and writes
CIFS: Fix credits calculations for reads with errors
CIFS: Fix credit calculation for encrypted reads with errors
CIFS: Do not reconnect TCP session in add_credits()
smb3: add credits we receive from oplock/break PDUs
Input: xpad - add support for SteelSeries Stratus Duo
Input: input_event - provide override for sparc64
Input: uinput - fix undefined behavior in uinput_validate_absinfo()
acpi/nfit: Block function zero DSMs
acpi/nfit: Fix command-supported detection
scsi: ufs: Use explicit access size in ufshcd_dump_regs
dm thin: fix passdown_double_checking_shared_status()
dm crypt: fix parsing of extended IV arguments
drm/amdgpu: Add APTX quirk for Lenovo laptop
KVM: x86: Fix single-step debugging
KVM: x86: Fix PV IPIs for 32-bit KVM host
KVM: x86: WARN_ONCE if sending a PV IPI returns a fatal error
kvm: x86/vmx: Use kzalloc for cached_vmcs12
KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
x86/pkeys: Properly copy pkey state at fork()
x86/selftests/pkeys: Fork() to check for state being preserved
x86/kaslr: Fix incorrect i8254 outb() parameters
x86/entry/64/compat: Fix stack switching for XEN PV
posix-cpu-timers: Unbreak timer rearming
net: sun: cassini: Cleanup license conflict
irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size
can: dev: __can_get_echo_skb(): fix bogous check for non-existing skb by removing it
can: bcm: check timer values before ktime conversion
can: flexcan: fix NULL pointer exception during bringup
vt: make vt_console_print() compatible with the unicode screen buffer
vt: always call notifier with the console lock held
vt: invoke notifier on screen size change
drm/meson: Fix atomic mode switching regression
bpf: improve verifier branch analysis
bpf: add per-insn complexity limit
bpf: move {prev_,}insn_idx into verifier env
bpf: move tmp variable into ax register in interpreter
bpf: enable access to ax register also from verifier rewrite
bpf: restrict map value pointer arithmetic for unprivileged
bpf: restrict stack pointer arithmetic for unprivileged
bpf: restrict unknown scalars of mixed signed bounds for unprivileged
bpf: fix check_map_access smin_value test when pointer contains offset
bpf: prevent out of bounds speculation on pointer arithmetic
bpf: fix sanitation of alu op with pointer / scalar type from different paths
bpf: fix inner map masking to prevent oob under speculation
s390/smp: Fix calling smp_call_ipl_cpu() from ipl CPU
nvmet-rdma: Add unlikely for response allocated check
nvmet-rdma: fix null dereference under heavy load
Revert "mm, memory_hotplug: initialize struct pages for the full memory section"
usb: dwc3: gadget: Clear req->needs_extra_trb flag on cleanup
ide: fix a typo in the settings proc file name
Input: input_event - fix the CONFIG_SPARC64 mixup
Linux 4.19.19
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ commit d3bd7413e0 upstream ]
While 979d63d50c ("bpf: prevent out of bounds speculation on pointer
arithmetic") took care of rejecting alu op on pointer when e.g. pointer
came from two different map values with different map properties such as
value size, Jann reported that a case was not covered yet when a given
alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
different branches where we would incorrectly try to sanitize based
on the pointer's limit. Catch this corner case and reject the program
instead.
Fixes: 979d63d50c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit 979d63d50c upstream ]
Jann reported that the original commit back in b2157399cc
("bpf: prevent out-of-bounds speculation") was not sufficient
to stop CPU from speculating out of bounds memory access:
While b2157399cc only focussed on masking array map access
for unprivileged users for tail calls and data access such
that the user provided index gets sanitized from BPF program
and syscall side, there is still a more generic form affected
from BPF programs that applies to most maps that hold user
data in relation to dynamic map access when dealing with
unknown scalars or "slow" known scalars as access offset, for
example:
- Load a map value pointer into R6
- Load an index into R7
- Do a slow computation (e.g. with a memory dependency) that
loads a limit into R8 (e.g. load the limit from a map for
high latency, then mask it to make the verifier happy)
- Exit if R7 >= R8 (mispredicted branch)
- Load R0 = R6[R7]
- Load R0 = R6[R0]
For unknown scalars there are two options in the BPF verifier
where we could derive knowledge from in order to guarantee
safe access to the memory: i) While </>/<=/>= variants won't
allow to derive any lower or upper bounds from the unknown
scalar where it would be safe to add it to the map value
pointer, it is possible through ==/!= test however. ii) another
option is to transform the unknown scalar into a known scalar,
for example, through ALU ops combination such as R &= <imm>
followed by R |= <imm> or any similar combination where the
original information from the unknown scalar would be destroyed
entirely leaving R with a constant. The initial slow load still
precedes the latter ALU ops on that register, so the CPU
executes speculatively from that point. Once we have the known
scalar, any compare operation would work then. A third option
only involving registers with known scalars could be crafted
as described in [0] where a CPU port (e.g. Slow Int unit)
would be filled with many dependent computations such that
the subsequent condition depending on its outcome has to wait
for evaluation on its execution port and thereby executing
speculatively if the speculated code can be scheduled on a
different execution port, or any other form of mistraining
as described in [1], for example. Given this is not limited
to only unknown scalars, not only map but also stack access
is affected since both is accessible for unprivileged users
and could potentially be used for out of bounds access under
speculation.
In order to prevent any of these cases, the verifier is now
sanitizing pointer arithmetic on the offset such that any
out of bounds speculation would be masked in a way where the
pointer arithmetic result in the destination register will
stay unchanged, meaning offset masked into zero similar as
in array_index_nospec() case. With regards to implementation,
there are three options that were considered: i) new insn
for sanitation, ii) push/pop insn and sanitation as inlined
BPF, iii) reuse of ax register and sanitation as inlined BPF.
Option i) has the downside that we end up using from reserved
bits in the opcode space, but also that we would require
each JIT to emit masking as native arch opcodes meaning
mitigation would have slow adoption till everyone implements
it eventually which is counter-productive. Option ii) and iii)
have both in common that a temporary register is needed in
order to implement the sanitation as inlined BPF since we
are not allowed to modify the source register. While a push /
pop insn in ii) would be useful to have in any case, it
requires once again that every JIT needs to implement it
first. While possible, amount of changes needed would also
be unsuitable for a -stable patch. Therefore, the path which
has fewer changes, less BPF instructions for the mitigation
and does not require anything to be changed in the JITs is
option iii) which this work is pursuing. The ax register is
already mapped to a register in all JITs (modulo arm32 where
it's mapped to stack as various other BPF registers there)
and used in constant blinding for JITs-only so far. It can
be reused for verifier rewrites under certain constraints.
The interpreter's tmp "register" has therefore been remapped
into extending the register set with hidden ax register and
reusing that for a number of instructions that needed the
prior temporary variable internally (e.g. div, mod). This
allows for zero increase in stack space usage in the interpreter,
and enables (restricted) generic use in rewrites otherwise as
long as such a patchlet does not make use of these instructions.
The sanitation mask is dynamic and relative to the offset the
map value or stack pointer currently holds.
There are various cases that need to be taken under consideration
for the masking, e.g. such operation could look as follows:
ptr += val or val += ptr or ptr -= val. Thus, the value to be
sanitized could reside either in source or in destination
register, and the limit is different depending on whether
the ALU op is addition or subtraction and depending on the
current known and bounded offset. The limit is derived as
follows: limit := max_value_size - (smin_value + off). For
subtraction: limit := umax_value + off. This holds because
we do not allow any pointer arithmetic that would
temporarily go out of bounds or would have an unknown
value with mixed signed bounds where it is unclear at
verification time whether the actual runtime value would
be either negative or positive. For example, we have a
derived map pointer value with constant offset and bounded
one, so limit based on smin_value works because the verifier
requires that statically analyzed arithmetic on the pointer
must be in bounds, and thus it checks if resulting
smin_value + off and umax_value + off is still within map
value bounds at time of arithmetic in addition to time of
access. Similarly, for the case of stack access we derive
the limit as follows: MAX_BPF_STACK + off for subtraction
and -off for the case of addition where off := ptr_reg->off +
ptr_reg->var_off.value. Subtraction is a special case for
the masking which can be in form of ptr += -val, ptr -= -val,
or ptr -= val. In the first two cases where we know that
the value is negative, we need to temporarily negate the
value in order to do the sanitation on a positive value
where we later swap the ALU op, and restore original source
register if the value was in source.
The sanitation of pointer arithmetic alone is still not fully
sufficient as is, since a scenario like the following could
happen ...
PTR += 0x1000 (e.g. K-based imm)
PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
PTR += 0x1000
PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
[...]
... which under speculation could end up as ...
PTR += 0x1000
PTR -= 0 [ truncated by mitigation ]
PTR += 0x1000
PTR -= 0 [ truncated by mitigation ]
[...]
... and therefore still access out of bounds. To prevent such
case, the verifier is also analyzing safety for potential out
of bounds access under speculative execution. Meaning, it is
also simulating pointer access under truncation. We therefore
"branch off" and push the current verification state after the
ALU operation with known 0 to the verification stack for later
analysis. Given the current path analysis succeeded it is
likely that the one under speculation can be pruned. In any
case, it is also subject to existing complexity limits and
therefore anything beyond this point will be rejected. In
terms of pruning, it needs to be ensured that the verification
state from speculative execution simulation must never prune
a non-speculative execution path, therefore, we mark verifier
state accordingly at the time of push_stack(). If verifier
detects out of bounds access under speculative execution from
one of the possible paths that includes a truncation, it will
reject such program.
Given we mask every reg-based pointer arithmetic for
unprivileged programs, we've been looking into how it could
affect real-world programs in terms of size increase. As the
majority of programs are targeted for privileged-only use
case, we've unconditionally enabled masking (with its alu
restrictions on top of it) for privileged programs for the
sake of testing in order to check i) whether they get rejected
in its current form, and ii) by how much the number of
instructions and size will increase. We've tested this by
using Katran, Cilium and test_l4lb from the kernel selftests.
For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
we've used test_l4lb.o as well as test_l4lb_noinline.o. We
found that none of the programs got rejected by the verifier
with this change, and that impact is rather minimal to none.
balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
7,797 bytes JITed before and after the change. Most complex
program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
and 18,538 bytes JITed before and after and none of the other
tail call programs in bpf_lxc.o had any changes either. For
the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
after the change. Other programs from that object file had
similar small increase. Both test_l4lb.o had no change and
remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
JITed and for test_l4lb_noinline.o constant at 5,080 bytes
(634 insns) xlated and 3,313 bytes JITed. This can be explained
in that LLVM typically optimizes stack based pointer arithmetic
by using K-based operations and that use of dynamic map access
is not overly frequent. However, in future we may decide to
optimize the algorithm further under known guarantees from
branch and value speculation. Latter seems also unclear in
terms of prediction heuristics that today's CPUs apply as well
as whether there could be collisions in e.g. the predictor's
Value History/Pattern Table for triggering out of bounds access,
thus masking is performed unconditionally at this point but could
be subject to relaxation later on. We were generally also
brainstorming various other approaches for mitigation, but the
blocker was always lack of available registers at runtime and/or
overhead for runtime tracking of limits belonging to a specific
pointer. Thus, we found this to be minimally intrusive under
given constraints.
With that in place, a simple example with sanitized access on
unprivileged load at post-verification time looks as follows:
# bpftool prog dump xlated id 282
[...]
28: (79) r1 = *(u64 *)(r7 +0)
29: (79) r2 = *(u64 *)(r7 +8)
30: (57) r1 &= 15
31: (79) r3 = *(u64 *)(r0 +4608)
32: (57) r3 &= 1
33: (47) r3 |= 1
34: (2d) if r2 > r3 goto pc+19
35: (b4) (u32) r11 = (u32) 20479 |
36: (1f) r11 -= r2 | Dynamic sanitation for pointer
37: (4f) r11 |= r2 | arithmetic with registers
38: (87) r11 = -r11 | containing bounded or known
39: (c7) r11 s>>= 63 | scalars in order to prevent
40: (5f) r11 &= r2 | out of bounds speculation.
41: (0f) r4 += r11 |
42: (71) r4 = *(u8 *)(r4 +0)
43: (6f) r4 <<= r1
[...]
For the case where the scalar sits in the destination register
as opposed to the source register, the following code is emitted
for the above example:
[...]
16: (b4) (u32) r11 = (u32) 20479
17: (1f) r11 -= r2
18: (4f) r11 |= r2
19: (87) r11 = -r11
20: (c7) r11 s>>= 63
21: (5f) r2 &= r11
22: (0f) r2 += r0
23: (61) r0 = *(u32 *)(r2 +0)
[...]
JIT blinding example with non-conflicting use of r10:
[...]
d5: je 0x0000000000000106 _
d7: mov 0x0(%rax),%edi |
da: mov $0xf153246,%r10d | Index load from map value and
e0: xor $0xf153259,%r10 | (const blinded) mask with 0x1f.
e7: and %r10,%rdi |_
ea: mov $0x2f,%r10d |
f0: sub %rdi,%r10 | Sanitized addition. Both use r10
f3: or %rdi,%r10 | but do not interfere with each
f6: neg %r10 | other. (Neither do these instructions
f9: sar $0x3f,%r10 | interfere with the use of ax as temp
fd: and %r10,%rdi | in interpreter.)
100: add %rax,%rdi |_
103: mov 0x0(%rdi),%eax
[...]
Tested that it fixes Jann's reproducer, and also checked that test_verifier
and test_progs suite with interpreter, JIT and JIT with hardening enabled
on x86-64 and arm64 runs successfully.
[0] Speculose: Analyzing the Security Implications of Speculative
Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
https://arxiv.org/pdf/1801.04084.pdf
[1] A Systematic Evaluation of Transient Execution Attacks and
Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
Dmitry Evtyushkin, Daniel Gruss,
https://arxiv.org/pdf/1811.05441.pdf
Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit b7137c4eab upstream ]
In check_map_access() we probe actual bounds through __check_map_access()
with offset of reg->smin_value + off for lower bound and offset of
reg->umax_value + off for the upper bound. However, even though the
reg->smin_value could have a negative value, the final result of the
sum with off could be positive when pointer arithmetic with known and
unknown scalars is combined. In this case we reject the program with
an error such as "R<x> min value is negative, either use unsigned index
or do a if (index >=0) check." even though the access itself would be
fine. Therefore extend the check to probe whether the actual resulting
reg->smin_value + off is less than zero.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit 9d7eceede7 upstream ]
For unknown scalars of mixed signed bounds, meaning their smin_value is
negative and their smax_value is positive, we need to reject arithmetic
with pointer to map value. For unprivileged the goal is to mask every
map pointer arithmetic and this cannot reliably be done when it is
unknown at verification time whether the scalar value is negative or
positive. Given this is a corner case, the likelihood of breaking should
be very small.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit e4298d2583 upstream ]
Restrict stack pointer arithmetic for unprivileged users in that
arithmetic itself must not go out of bounds as opposed to the actual
access later on. Therefore after each adjust_ptr_min_max_vals() with
a stack pointer as a destination we simulate a check_stack_access()
of 1 byte on the destination and once that fails the program is
rejected for unprivileged program loads. This is analog to map
value pointer arithmetic and needed for masking later on.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit 0d6303db79 upstream ]
Restrict map value pointer arithmetic for unprivileged users in that
arithmetic itself must not go out of bounds as opposed to the actual
access later on. Therefore after each adjust_ptr_min_max_vals() with a
map value pointer as a destination it will simulate a check_map_access()
of 1 byte on the destination and once that fails the program is rejected
for unprivileged program loads. We use this later on for masking any
pointer arithmetic with the remainder of the map value space. The
likelihood of breaking any existing real-world unprivileged eBPF
program is very small for this corner case.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit 9b73bfdd08 upstream ]
Right now we are using BPF ax register in JIT for constant blinding as
well as in interpreter as temporary variable. Verifier will not be able
to use it simply because its use will get overridden from the former in
bpf_jit_blind_insn(). However, it can be made to work in that blinding
will be skipped if there is prior use in either source or destination
register on the instruction. Taking constraints of ax into account, the
verifier is then open to use it in rewrites under some constraints. Note,
ax register already has mappings in every eBPF JIT.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit 144cd91c4c upstream ]
This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
into the hidden ax register. The latter is currently only used in JITs
for constant blinding as a temporary scratch register, meaning the BPF
interpreter will never see the use of ax. Therefore it is safe to use
it for the cases where tmp has been used earlier. This is needed to later
on allow restricted hidden use of ax in both interpreter and JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit c08435ec7f upstream ]
Move prev_insn_idx and insn_idx from the do_check() function into
the verifier environment, so they can be read inside the various
helper functions for handling the instructions. It's easier to put
this into the environment rather than changing all call-sites only
to pass it along. insn_idx is useful in particular since this later
on allows to hold state in env->insn_aux_data[env->insn_idx].
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit ceefbc96fa upstream ]
malicious bpf program may try to force the verifier to remember
a lot of distinct verifier states.
Put a limit to number of per-insn 'struct bpf_verifier_state'.
Note that hitting the limit doesn't reject the program.
It potentially makes the verifier do more steps to analyze the program.
It means that malicious programs will hit BPF_COMPLEXITY_LIMIT_INSNS sooner
instead of spending cpu time walking long link list.
The limit of BPF_COMPLEXITY_LIMIT_STATES==64 affects cilium progs
with slight increase in number of "steps" it takes to successfully verify
the programs:
before after
bpf_lb-DLB_L3.o 1940 1940
bpf_lb-DLB_L4.o 3089 3089
bpf_lb-DUNKNOWN.o 1065 1065
bpf_lxc-DDROP_ALL.o 28052 | 28162
bpf_lxc-DUNKNOWN.o 35487 | 35541
bpf_netdev.o 10864 10864
bpf_overlay.o 6643 6643
bpf_lcx_jit.o 38437 38437
But it also makes malicious program to be rejected in 0.4 seconds vs 6.5
Hence apply this limit to unprivileged programs only.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ commit 4f7b3e8258 upstream ]
pathological bpf programs may try to force verifier to explode in
the number of branch states:
20: (d5) if r1 s<= 0x24000028 goto pc+0
21: (b5) if r0 <= 0xe1fa20 goto pc+2
22: (d5) if r1 s<= 0x7e goto pc+0
23: (b5) if r0 <= 0xe880e000 goto pc+0
24: (c5) if r0 s< 0x2100ecf4 goto pc+0
25: (d5) if r1 s<= 0xe880e000 goto pc+1
26: (c5) if r0 s< 0xf4041810 goto pc+0
27: (d5) if r1 s<= 0x1e007e goto pc+0
28: (b5) if r0 <= 0xe86be000 goto pc+0
29: (07) r0 += 16614
30: (c5) if r0 s< 0x6d0020da goto pc+0
31: (35) if r0 >= 0x2100ecf4 goto pc+0
Teach verifier to recognize always taken and always not taken branches.
This analysis is already done for == and != comparison.
Expand it to all other branches.
It also helps real bpf programs to be verified faster:
before after
bpf_lb-DLB_L3.o 2003 1940
bpf_lb-DLB_L4.o 3173 3089
bpf_lb-DUNKNOWN.o 1080 1065
bpf_lxc-DDROP_ALL.o 29584 28052
bpf_lxc-DUNKNOWN.o 36916 35487
bpf_netdev.o 11188 10864
bpf_overlay.o 6679 6643
bpf_lcx_jit.o 39555 38437
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 93ad0fc088 upstream.
The recent commit which prevented a division by 0 issue in the alarm timer
code broke posix CPU timers as an unwanted side effect.
The reason is that the common rearm code checks for timer->it_interval
being 0 now. What went unnoticed is that the posix cpu timer setup does not
initialize timer->it_interval as it stores the interval in CPU timer
specific storage. The reason for the separate storage is historical as the
posix CPU timers always had a 64bit nanoseconds representation internally
while timer->it_interval is type ktime_t which used to be a modified
timespec representation on 32bit machines.
Instead of reverting the offending commit and fixing the alarmtimer issue
in the alarmtimer code, store the interval in timer->it_interval at CPU
timer setup time so the common code check works. This also repairs the
existing inconistency of the posix CPU timer code which kept a single shot
timer armed despite of the interval being 0.
The separate storage can be removed in mainline, but that needs to be a
separate commit as the current one has to be backported to stable kernels.
Fixes: 0e334db6bb ("posix-timers: Fix division by zero bug")
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190111133500.840117406@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The recently introduced Energy Model (EM) framework manages power cost
tables of CPUs. These tables are currently only visible from kernel
space. However, in order to debug the behaviour of subsystems that use
the EM (EAS for example), it is often required to know what the power
costs are from userspace.
For this reason, introduce under /sys/kernel/debug/energy_model a set of
directories representing the performance domains of the system. Each
performance domain contains a set of sub-directories representing the
different capacity states (cs) and their attributes, as well as a file
exposing the related CPUs.
The resulting hierarchy is as follows on Arm juno r0 for example:
/sys/kernel/debug/energy_model
├── pd0
│ ├── cpus
│ ├── cs:450000
│ │ ├── cost
│ │ ├── frequency
│ │ └── power
│ ├── cs:575000
│ │ ├── cost
│ │ ├── frequency
│ │ └── power
│ ├── cs:700000
│ │ ├── cost
│ │ ├── frequency
│ │ └── power
│ ├── cs:775000
│ │ ├── cost
│ │ ├── frequency
│ │ └── power
│ └── cs:850000
│ ├── cost
│ ├── frequency
│ └── power
└── pd1
├── cpus
├── cs:1100000
│ ├── cost
│ ├── frequency
│ └── power
├── cs:450000
│ ├── cost
│ ├── frequency
│ └── power
├── cs:625000
│ ├── cost
│ ├── frequency
│ └── power
├── cs:800000
│ ├── cost
│ ├── frequency
│ └── power
└── cs:950000
├── cost
├── frequency
└── power
Bug: 120440300
Change-Id: I4098c3d3e07fc4a14514751ac881422131b759e9
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 9cac42d064
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git bleeding-edge)
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Changes in 4.19.18
ipv6: Consider sk_bound_dev_if when binding a socket to a v4 mapped address
mlxsw: spectrum: Disable lag port TX before removing it
mlxsw: spectrum_switchdev: Set PVID correctly during VLAN deletion
net: dsa: mv88x6xxx: mv88e6390 errata
net, skbuff: do not prefer skb allocation fails early
qmi_wwan: add MTU default to qmap network interface
r8169: Add support for new Realtek Ethernet
ipv6: Take rcu_read_lock in __inet6_bind for mapped addresses
net: clear skb->tstamp in bridge forwarding path
netfilter: ipset: Allow matching on destination MAC address for mac and ipmac sets
gpio: pl061: Move irq_chip definition inside struct pl061
drm/amd/display: Guard against null stream_state in set_crc_source
drm/amdkfd: fix interrupt spin lock
ixgbe: allow IPsec Tx offload in VEPA mode
platform/x86: asus-wmi: Tell the EC the OS will handle the display off hotkey
e1000e: allow non-monotonic SYSTIM readings
usb: typec: tcpm: Do not disconnect link for self powered devices
selftests/bpf: enable (uncomment) all tests in test_libbpf.sh
of: overlay: add missing of_node_put() after add new node to changeset
writeback: don't decrement wb->refcnt if !wb->bdi
serial: set suppress_bind_attrs flag only if builtin
bpf: Allow narrow loads with offset > 0
ALSA: oxfw: add support for APOGEE duet FireWire
x86/mce: Fix -Wmissing-prototypes warnings
MIPS: SiByte: Enable swiotlb for SWARM, LittleSur and BigSur
crypto: ecc - regularize scalar for scalar multiplication
arm64: perf: set suppress_bind_attrs flag to true
drm/atomic-helper: Complete fake_commit->flip_done potentially earlier
clk: meson: meson8b: fix incorrect divider mapping in cpu_scale_table
samples: bpf: fix: error handling regarding kprobe_events
usb: gadget: udc: renesas_usb3: add a safety connection way for forced_b_device
fpga: altera-cvp: fix probing for multiple FPGAs on the bus
selinux: always allow mounting submounts
ASoC: pcm3168a: Don't disable pcm3168a when CONFIG_PM defined
scsi: qedi: Check for session online before getting iSCSI TLV data.
drm/amdgpu: Reorder uvd ring init before uvd resume
rxe: IB_WR_REG_MR does not capture MR's iova field
efi/libstub: Disable some warnings for x86{,_64}
jffs2: Fix use of uninitialized delayed_work, lockdep breakage
clk: imx: make mux parent strings const
pstore/ram: Do not treat empty buffers as valid
media: uvcvideo: Refactor teardown of uvc on USB disconnect
powerpc/xmon: Fix invocation inside lock region
powerpc/pseries/cpuidle: Fix preempt warning
media: firewire: Fix app_info parameter type in avc_ca{,_app}_info
ASoC: use dma_ops of parent device for acp_audio_dma
media: venus: core: Set dma maximum segment size
staging: erofs: fix use-after-free of on-stack `z_erofs_vle_unzip_io'
net: call sk_dst_reset when set SO_DONTROUTE
scsi: target: use consistent left-aligned ASCII INQUIRY data
scsi: target/core: Make sure that target_wait_for_sess_cmds() waits long enough
selftests: do not macro-expand failed assertion expressions
arm64: kasan: Increase stack size for KASAN_EXTRA
clk: imx6q: reset exclusive gates on init
arm64: Fix minor issues with the dcache_by_line_op macro
bpf: relax verifier restriction on BPF_MOV | BPF_ALU
kconfig: fix file name and line number of warn_ignored_character()
kconfig: fix memory leak when EOF is encountered in quotation
mmc: atmel-mci: do not assume idle after atmci_request_end
btrfs: volumes: Make sure there is no overlap of dev extents at mount time
btrfs: alloc_chunk: fix more DUP stripe size handling
btrfs: fix use-after-free due to race between replace start and cancel
btrfs: improve error handling of btrfs_add_link
tty/serial: do not free trasnmit buffer page under port lock
perf intel-pt: Fix error with config term "pt=0"
perf tests ARM: Disable breakpoint tests 32-bit
perf svghelper: Fix unchecked usage of strncpy()
perf parse-events: Fix unchecked usage of strncpy()
perf vendor events intel: Fix Load_Miss_Real_Latency on SKL/SKX
netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set
netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine
netfilter: ipt_CLUSTERIP: fix deadlock in netns exit routine
x86/topology: Use total_cpus for max logical packages calculation
dm crypt: use u64 instead of sector_t to store iv_offset
dm kcopyd: Fix bug causing workqueue stalls
perf stat: Avoid segfaults caused by negated options
tools lib subcmd: Don't add the kernel sources to the include path
dm snapshot: Fix excessive memory usage and workqueue stalls
perf cs-etm: Correct packets swapping in cs_etm__flush()
perf tools: Add missing sigqueue() prototype for systems lacking it
perf tools: Add missing open_memstream() prototype for systems lacking it
quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls.
clocksource/drivers/integrator-ap: Add missing of_node_put()
dm: Check for device sector overflow if CONFIG_LBDAF is not set
Bluetooth: btusb: Add support for Intel bluetooth device 8087:0029
ALSA: bebob: fix model-id of unit for Apogee Ensemble
sysfs: Disable lockdep for driver bind/unbind files
IB/usnic: Fix potential deadlock
scsi: mpt3sas: fix memory ordering on 64bit writes
scsi: smartpqi: correct lun reset issues
ath10k: fix peer stats null pointer dereference
scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown()
scsi: megaraid: fix out-of-bound array accesses
iomap: don't search past page end in iomap_is_partially_uptodate
ocfs2: fix panic due to unrecovered local alloc
mm/page-writeback.c: don't break integrity writeback on ->writepage() error
mm/swap: use nr_node_ids for avail_lists in swap_info_struct
userfaultfd: clear flag if remap event not enabled
mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps
iwlwifi: mvm: Send LQ command as async when necessary
Bluetooth: Fix unnecessary error message for HCI request completion
ipmi: fix use-after-free of user->release_barrier.rda
ipmi: msghandler: Fix potential Spectre v1 vulnerabilities
ipmi: Prevent use-after-free in deliver_response
ipmi:ssif: Fix handling of multi-part return messages
ipmi: Don't initialize anything in the core until something uses it
Linux 4.19.18
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit e434b8cdf7 ]
Currently, the destination register is marked as unknown for 32-bit
sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is
SCALAR_VALUE.
This is too conservative that some valid cases will be rejected.
Especially, this may turn a constant scalar value into unknown value that
could break some assumptions of verifier.
For example, test_l4lb_noinline.c has the following C code:
struct real_definition *dst
1: if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
2: return TC_ACT_SHOT;
3:
4: if (dst->flags & F_IPV6) {
get_packet_dst is responsible for initializing "dst" into valid pointer and
return true (1), otherwise return false (0). The compiled instruction
sequence using alu32 will be:
412: (54) (u32) r7 &= (u32) 1
413: (bc) (u32) r0 = (u32) r7
414: (95) exit
insn 413, a BPF_MOV | BPF_ALU, however will turn r0 into unknown value even
r7 contains SCALAR_VALUE 1.
This causes trouble when verifier is walking the code path that hasn't
initialized "dst" inside get_packet_dst, for which case 0 is returned and
we would then expect verifier concluding line 1 in the above C code pass
the "if" check, therefore would skip fall through path starting at line 4.
Now, because r0 returned from callee has became unknown value, so verifier
won't skip analyzing path starting at line 4 and "dst->flags" requires
dereferencing the pointer "dst" which actually hasn't be initialized for
this path.
This patch relaxed the code marking sub-register move destination. For a
SCALAR_VALUE, it is safe to just copy the value from source then truncate
it into 32-bit.
A unit test also included to demonstrate this issue. This test will fail
before this patch.
This relaxation could let verifier skipping more paths for conditional
comparison against immediate. It also let verifier recording a more
accurate/strict value for one register at one state, if this state end up
with going through exit without rejection and it is used for state
comparison later, then it is possible an inaccurate/permissive value is
better. So the real impact on verifier processed insn number is complex.
But in all, without this fix, valid program could be rejected.
>From real benchmarking on kernel selftests and Cilium bpf tests, there is
no impact on processed instruction number when tests ares compiled with
default compilation options. There is slightly improvements when they are
compiled with -mattr=+alu32 after this patch.
Also, test_xdp_noinline/-mattr=+alu32 now passed verification. It is
rejected before this fix.
Insn processed before/after this patch:
default -mattr=+alu32
Kernel selftest
===
test_xdp.o 371/371 369/369
test_l4lb.o 6345/6345 5623/5623
test_xdp_noinline.o 2971/2971 rejected/2727
test_tcp_estates.o 429/429 430/430
Cilium bpf
===
bpf_lb-DLB_L3.o: 2085/2085 1685/1687
bpf_lb-DLB_L4.o: 2287/2287 1986/1982
bpf_lb-DUNKNOWN.o: 690/690 622/622
bpf_lxc.o: 95033/95033 N/A
bpf_netdev.o: 7245/7245 N/A
bpf_overlay.o: 2898/2898 3085/2947
NOTE:
- bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by
verifier due to another issue inside verifier on supporting alu32
binary.
- Each cilium bpf program could generate several processed insn number,
above number is sum of them.
v1->v2:
- Restrict the change on SCALAR_VALUE.
- Update benchmark numbers on Cilium bpf tests.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 46f53a65d2 ]
Currently BPF verifier allows narrow loads for a context field only with
offset zero. E.g. if there is a __u32 field then only the following
loads are permitted:
* off=0, size=1 (narrow);
* off=0, size=2 (narrow);
* off=0, size=4 (full).
On the other hand LLVM can generate a load with offset different than
zero that make sense from program logic point of view, but verifier
doesn't accept it.
E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:
#define DST_IP4 0xC0A801FEU /* 192.168.1.254 */
...
if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
where ctx is struct bpf_sock_addr.
Some versions of LLVM can produce the following byte code for it:
8: 71 12 07 00 00 00 00 00 r2 = *(u8 *)(r1 + 7)
9: 67 02 00 00 18 00 00 00 r2 <<= 24
10: 18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00 r3 = 4261412864 ll
12: 5d 32 07 00 00 00 00 00 if r2 != r3 goto +7 <LBB0_6>
where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1
and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
rejected by verifier.
Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok()
what means any is_valid_access implementation, that uses the function,
works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or
sock_addr_is_valid_access() for bpf_sock_addr.
The patch makes such loads supported. Offset can be in [0; size_default)
but has to be multiple of load size. E.g. for __u32 field the following
loads are supported now:
* off=0, size=1 (narrow);
* off=1, size=1 (narrow);
* off=2, size=1 (narrow);
* off=3, size=1 (narrow);
* off=0, size=2 (narrow);
* off=2, size=2 (narrow);
* off=0, size=4 (full).
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 4.19.15
ARM: dts: sun8i: a83t: bananapi-m3: increase vcc-pd voltage to 3.3V
pinctrl: meson: fix pull enable register calculation
arm64: dts: mt7622: fix no more console output on rfb1
powerpc: Fix COFF zImage booting on old powermacs
powerpc/mm: Fix linux page tables build with some configs
HID: ite: Add USB id match for another ITE based keyboard rfkill key quirk
ARM: dts: imx7d-pico: Describe the Wifi clock
ARM: imx: update the cpu power up timing setting on i.mx6sx
ARM: dts: imx7d-nitrogen7: Fix the description of the Wifi clock
IB/mlx5: Block DEVX umem from the non applicable cases
Input: restore EV_ABS ABS_RESERVED
powerpc/mm: Fallback to RAM if the altmap is unusable
drm/amdgpu: Fix DEBUG_LOCKS_WARN_ON(depth <= 0) in amdgpu_ctx.lock
IB/core: Fix oops in netdev_next_upper_dev_rcu()
checkstack.pl: fix for aarch64
xfrm: Fix error return code in xfrm_output_one()
xfrm: Fix bucket count reported to userspace
xfrm: Fix NULL pointer dereference in xfrm_input when skb_dst_force clears the dst_entry.
ieee802154: hwsim: fix off-by-one in parse nested
netfilter: nf_tables: fix suspicious RCU usage in nft_chain_stats_replace()
netfilter: seqadj: re-load tcp header pointer after possible head reallocation
Revert "scsi: qla2xxx: Fix NVMe Target discovery"
scsi: bnx2fc: Fix NULL dereference in error handling
Input: omap-keypad - fix idle configuration to not block SoC idle states
Input: synaptics - enable RMI on ThinkPad T560
ibmvnic: Convert reset work item mutex to spin lock
ibmvnic: Fix non-atomic memory allocation in IRQ context
ieee802154: ca8210: fix possible u8 overflow in ca8210_rx_done
x86/mm: Fix guard hole handling
x86/dump_pagetables: Fix LDT remap address marker
i40e: fix mac filter delete when setting mac address
ixgbe: Fix race when the VF driver does a reset
netfilter: ipset: do not call ipset_nest_end after nla_nest_cancel
netfilter: nat: can't use dst_hold on noref dst
netfilter: nf_conncount: use rb_link_node_rcu() instead of rb_link_node()
bnx2x: Clear fip MAC when fcoe offload support is disabled
bnx2x: Remove configured vlans as part of unload sequence.
bnx2x: Send update-svid ramrod with retry/poll flags enabled
scsi: target: iscsi: cxgbit: fix csk leak
scsi: target: iscsi: cxgbit: add missing spin_lock_init()
mt76: fix potential NULL pointer dereference in mt76_stop_tx_queues
x86, hyperv: remove PCI dependency
drivers: net: xgene: Remove unnecessary forward declarations
net/tls: Init routines in create_ctx
w90p910_ether: remove incorrect __init annotation
net: hns: Incorrect offset address used for some registers.
net: hns: All ports can not work when insmod hns ko after rmmod.
net: hns: Some registers use wrong address according to the datasheet.
net: hns: Fixed bug that netdev was opened twice
net: hns: Clean rx fbd when ae stopped.
net: hns: Free irq when exit from abnormal branch
net: hns: Avoid net reset caused by pause frames storm
net: hns: Fix ntuple-filters status error.
net: hns: Add mac pcs config when enable|disable mac
net: hns: Fix ping failed when use net bridge and send multicast
mac80211: fix a kernel panic when TXing after TXQ teardown
SUNRPC: Fix a race with XPRT_CONNECTING
qed: Fix an error code qed_ll2_start_xmit()
net: macb: fix random memory corruption on RX with 64-bit DMA
net: macb: fix dropped RX frames due to a race
net: macb: add missing barriers when reading descriptors
lan743x: Expand phy search for LAN7431
lan78xx: Resolve issue with changing MAC address
vxge: ensure data0 is initialized in when fetching firmware version information
nl80211: fix memory leak if validate_pae_over_nl80211() fails
mac80211: free skb fraglist before freeing the skb
kbuild: fix false positive warning/error about missing libelf
m68k: Fix memblock-related crashes
virtio: fix test build after uio.h change
lan743x: Remove MAC Reset from initialization
gpio: mvebu: only fail on missing clk if pwm is actually to be used
Input: synaptics - enable SMBus for HP EliteBook 840 G4
net: netxen: fix a missing check and an uninitialized use
qmi_wwan: Fix qmap header retrieval in qmimux_rx_fixup
serial/sunsu: fix refcount leak
auxdisplay: charlcd: fix x/y command parsing
scsi: zfcp: fix posting too many status read buffers leading to adapter shutdown
scsi: lpfc: do not set queue->page_count to 0 if pc_sli4_params.wqpcnt is invalid
fork: record start_time late
zram: fix double free backing device
hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined
mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL
mm, devm_memremap_pages: kill mapping "System RAM" support
mm, devm_memremap_pages: fix shutdown handling
mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support
mm, hmm: use devm semantics for hmm_devmem_{add, remove}
mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL
mm, swap: fix swapoff with KSM pages
memcg, oom: notify on oom killer invocation from the charge path
sunrpc: fix cache_head leak due to queued request
sunrpc: use SVC_NET() in svcauth_gss_* functions
powerpc: remove old GCC version checks
powerpc: consolidate -mno-sched-epilog into FTRACE flags
powerpc: avoid -mno-sched-epilog on GCC 4.9 and newer
powerpc: Disable -Wbuiltin-requires-header when setjmp is used
kbuild: add -no-integrated-as Clang option unconditionally
kbuild: consolidate Clang compiler flags
Makefile: Export clang toolchain variables
powerpc/boot: Set target when cross-compiling for clang
raid6/ppc: Fix build for clang
dma-direct: do not include SME mask in the DMA supported check
mt76x0: init hw capabilities
media: cx23885: only reset DMA on problematic CPUs
ALSA: cs46xx: Potential NULL dereference in probe
ALSA: usb-audio: Avoid access before bLength check in build_audio_procunit()
ALSA: usb-audio: Check mixer unit descriptors more strictly
ALSA: usb-audio: Fix an out-of-bound read in create_composite_quirks
ALSA: usb-audio: Always check descriptor sizes in parser code
srcu: Lock srcu_data structure in srcu_gp_start()
driver core: Add missing dev->bus->need_parent_lock checks
Fix failure path in alloc_pid()
block: deactivate blk_stat timer in wbt_disable_default()
block: mq-deadline: Fix write completion handling
dlm: fixed memory leaks after failed ls_remove_names allocation
dlm: possible memory leak on error path in create_lkb()
dlm: lost put_lkb on error path in receive_convert() and receive_unlock()
dlm: memory leaks on error path in dlm_user_request()
gfs2: Get rid of potential double-freeing in gfs2_create_inode
gfs2: Fix loop in gfs2_rbm_find
b43: Fix error in cordic routine
selinux: policydb - fix byte order and alignment issues
PCI / PM: Allow runtime PM without callback functions
lockd: Show pid of lockd for remote locks
nfsd4: zero-length WRITE should succeed
arm64: drop linker script hack to hide __efistub_ symbols
arm64: relocatable: fix inconsistencies in linker script and options
leds: pwm: silently error out on EPROBE_DEFER
Revert "powerpc/tm: Unset MSR[TS] if not recheckpointing"
powerpc/tm: Set MSR[TS] just prior to recheckpoint
iio: dac: ad5686: fix bit shift read register
9p/net: put a lower bound on msize
rxe: fix error completion wr_id and qp_num
RDMA/srpt: Fix a use-after-free in the channel release code
iommu/vt-d: Handle domain agaw being less than iommu agaw
sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b
ceph: don't update importing cap's mseq when handing cap export
video: fbdev: pxafb: Fix "WARNING: invalid free of devm_ allocated data"
drivers/perf: hisi: Fixup one DDRC PMU register offset
genwqe: Fix size check
intel_th: msu: Fix an off-by-one in attribute store
power: supply: olpc_battery: correct the temperature units
of: of_node_get()/of_node_put() nodes held in phandle cache
of: __of_detach_node() - remove node from phandle cache
lib: fix build failure in CONFIG_DEBUG_VIRTUAL test
drm/nouveau/drm/nouveau: Check rc from drm_dp_mst_topology_mgr_resume()
drm/vc4: Set ->is_yuv to false when num_planes == 1
drm/rockchip: psr: do not dereference encoder before it is null checked.
drm/amd/display: Fix unintialized max_bpc state values
bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
Linux 4.19.15
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit c40f7d74c7 upstream.
Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
scheduler under high loads, starting at around the v4.18 time frame,
and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
manipulation.
Do a (manual) revert of:
a9e7f6544b ("sched/fair: Fix O(nr_cgroups) in load balance path")
It turns out that the list_del_leaf_cfs_rq() introduced by this commit
is a surprising property that was not considered in followup commits
such as:
9c2791f936 ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")
As Vincent Guittot explains:
"I think that there is a bigger problem with commit a9e7f6544b and
cfs_rq throttling:
Let take the example of the following topology TG2 --> TG1 --> root:
1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
one path because it has never been used and can't be throttled so
tmp_alone_branch will point to leaf_cfs_rq_list at the end.
2) Then TG1 is throttled
3) and we add TG3 as a new child of TG1.
4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
cfs_rq and tmp_alone_branch will stay on rq->leaf_cfs_rq_list.
With commit a9e7f6544b, we can del a cfs_rq from rq->leaf_cfs_rq_list.
So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
cfs_rq is removed from the list.
Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
but tmp_alone_branch still points to TG3 cfs_rq because its throttled
parent can't be enqueued when the lock is released.
tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.
So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
points on another TG cfs_rq, the next TG cfs_rq that will be added,
will be linked outside rq->leaf_cfs_rq_list - which is bad.
In addition, we can break the ordering of the cfs_rq in
rq->leaf_cfs_rq_list but this ordering is used to update and
propagate the update from leaf down to root."
Instead of trying to work through all these cases and trying to reproduce
the very high loads that produced the lockup to begin with, simplify
the code temporarily by reverting a9e7f6544b - which change was clearly
not thought through completely.
This (hopefully) gives us a kernel that doesn't lock up so people
can continue to enjoy their holidays without worrying about regressions. ;-)
[ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]
Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Reported-by: Sargun Dhillon <sargun@sargun.me>
Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Tested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org> # v4.13+
Cc: Bin Li <huawei.libin@huawei.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a9e7f6544b ("sched/fair: Fix O(nr_cgroups) in load balance path")
Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit eb4c238227 upstream.
The srcu_gp_start() function is called with the srcu_struct structure's
->lock held, but not with the srcu_data structure's ->lock. This is
problematic because this function accesses and updates the srcu_data
structure's ->srcu_cblist, which is protected by that lock. Failing to
hold this lock can result in corruption of the SRCU callback lists,
which in turn can result in arbitrarily bad results.
This commit therefore makes srcu_gp_start() acquire the srcu_data
structure's ->lock across the calls to rcu_segcblist_advance() and
rcu_segcblist_accelerate(), thus preventing this corruption.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: Sebastian Kuzminsky <seb.kuzminsky@gmail.com>
Signed-off-by: Dennis Krein <Dennis.Krein@netapp.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: Dennis Krein <Dennis.Krein@netapp.com>
Cc: <stable@vger.kernel.org> # 4.16.x
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c92a54cfa0 upstream.
The dma_direct_supported() function intends to check the DMA mask against
specific values. However, the phys_to_dma() function includes the SME
encryption mask, which defeats the intended purpose of the check. This
results in drivers that support less than 48-bit DMA (SME encryption mask
is bit 47) from being able to set the DMA mask successfully when SME is
active, which results in the driver failing to initialize.
Change the function used to check the mask from phys_to_dma() to
__phys_to_dma() so that the SME encryption mask is not part of the check.
Fixes: c1d0af1a1d ("kernel/dma/direct: take DMA offset into account in dma_direct_supported")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a95c90f1e2 upstream.
The last step before devm_memremap_pages() returns success is to allocate
a release action, devm_memremap_pages_release(), to tear the entire setup
down. However, the result from devm_add_action() is not checked.
Checking the error from devm_add_action() is not enough. The api
currently relies on the fact that the percpu_ref it is using is killed by
the time the devm_memremap_pages_release() is run. Rather than continue
this awkward situation, offload the responsibility of killing the
percpu_ref to devm_memremap_pages_release() directly. This allows
devm_memremap_pages() to do the right thing relative to init failures and
shutdown.
Without this change we could fail to register the teardown of
devm_memremap_pages(). The likelihood of hitting this failure is tiny as
small memory allocations almost always succeed. However, the impact of
the failure is large given any future reconfiguration, or disable/enable,
of an nvdimm namespace will fail forever as subsequent calls to
devm_memremap_pages() will fail to setup the pgmap_radix since there will
be stale entries for the physical address range.
An argument could be made to require that the ->kill() operation be set in
the @pgmap arg rather than passed in separately. However, it helps code
readability, tracking the lifetime of a given instance, to be able to grep
the kill routine directly at the devm_memremap_pages() call site.
Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fixes: e8d5134833 ("memremap: change devm_memremap_pages interface...")
Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 808153e118 upstream.
devm_memremap_pages() is a facility that can create struct page entries
for any arbitrary range and give drivers the ability to subvert core
aspects of page management.
Specifically the facility is tightly integrated with the kernel's memory
hotplug functionality. It injects an altmap argument deep into the
architecture specific vmemmap implementation to allow allocating from
specific reserved pages, and it has Linux specific assumptions about page
structure reference counting relative to get_user_pages() and
get_user_pages_fast(). It was an oversight and a mistake that this was
not marked EXPORT_SYMBOL_GPL from the outset.
Again, devm_memremap_pagex() exposes and relies upon core kernel internal
assumptions and will continue to evolve along with 'struct page', memory
hotplug, and support for new memory types / topologies. Only an in-kernel
GPL-only driver is expected to keep up with this ongoing evolution. This
interface, and functionality derived from this interface, is not suitable
for kernel-external drivers.
Link: http://lkml.kernel.org/r/154275557457.76910.16923571232582744134.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7b55851367 upstream.
This changes the fork(2) syscall to record the process start_time after
initializing the basic task structure but still before making the new
process visible to user-space.
Technically, we could record the start_time anytime during fork(2). But
this might lead to scenarios where a start_time is recorded long before
a process becomes visible to user-space. For instance, with
userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
for an indefinite amount of time (and will, if this causes network
access, or similar).
By recording the start_time late, it much closer reflects the point in
time where the process becomes live and can be observed by other
processes.
Lastly, this makes it much harder for user-space to predict and control
the start_time they get assigned. Previously, user-space could fork a
process and stall it in copy_thread_tls() before its pid is allocated,
but after its start_time is recorded. This can be misused to later-on
cycle through PIDs and resume the stalled fork(2) yielding a process
that has the same pid and start_time as a process that existed before.
This can be used to circumvent security systems that identify processes
by their pid+start_time combination.
Even though user-space was always aware that start_time recording is
flaky (but several projects are known to still rely on start_time-based
identification), changing the start_time to be recorded late will help
mitigate existing attacks and make it much harder for user-space to
control the start_time a process gets assigned.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.19.14
ax25: fix a use-after-free in ax25_fillin_cb()
gro_cell: add napi_disable in gro_cells_destroy
ibmveth: fix DMA unmap error in ibmveth_xmit_start error path
ieee802154: lowpan_header_create check must check daddr
ip6mr: Fix potential Spectre v1 vulnerability
ipv4: Fix potential Spectre v1 vulnerability
ipv6: explicitly initialize udp6_addr in udp_sock_create6()
ipv6: tunnels: fix two use-after-free
ip: validate header length on virtual device xmit
isdn: fix kernel-infoleak in capi_unlocked_ioctl
net: clear skb->tstamp in forwarding paths
net/hamradio/6pack: use mod_timer() to rearm timers
net: ipv4: do not handle duplicate fragments as overlapping
net: macb: restart tx after tx used bit read
net: mvpp2: 10G modes aren't supported on all ports
net: phy: Fix the issue that netif always links up after resuming
netrom: fix locking in nr_find_socket()
net/smc: fix TCP fallback socket release
net: stmmac: Fix an error code in probe()
net/tls: allocate tls context using GFP_ATOMIC
net/wan: fix a double free in x25_asy_open_tty()
packet: validate address length
packet: validate address length if non-zero
ptr_ring: wrap back ->producer in __ptr_ring_swap_queue()
qmi_wwan: Added support for Fibocom NL668 series
qmi_wwan: Added support for Telit LN940 series
qmi_wwan: Add support for Fibocom NL678 series
sctp: initialize sin6_flowinfo for ipv6 addrs in sctp_inet6addr_event
sock: Make sock->sk_stamp thread-safe
tcp: fix a race in inet_diag_dump_icsk()
tipc: check tsk->group in tipc_wait_for_cond()
tipc: compare remote and local protocols in tipc_udp_enable()
tipc: fix a double free in tipc_enable_bearer()
tipc: fix a double kfree_skb()
tipc: use lock_sock() in tipc_sk_reinit()
vhost: make sure used idx is seen before log in vhost_add_used_n()
VSOCK: Send reset control packet when socket is partially bound
xen/netfront: tolerate frags with no data
net/mlx5: Typo fix in del_sw_hw_rule
tipc: check group dests after tipc_wait_for_cond()
net/mlx5e: Remove the false indication of software timestamping support
ipv6: frags: Fix bogus skb->sk in reassembled packets
net/ipv6: Fix a test against 'ipv6_find_idev()' return value
nfp: flower: ensure TCP flags can be placed in IPv6 frame
ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error
mscc: Configured MAC entries should be locked.
net/mlx5e: Cancel DIM work on close SQ
net/mlx5e: RX, Verify MPWQE stride size is in range
net: mvpp2: fix the phylink mode validation
qed: Fix command number mismatch between driver and the mfw
mlxsw: core: Increase timeout during firmware flash process
net/mlx5e: Remove unused UDP GSO remaining counter
net/mlx5e: RX, Fix wrong early return in receive queue poll
net: mvneta: fix operation for 64K PAGE_SIZE
net: Use __kernel_clockid_t in uapi net_stamp.h
r8169: fix WoL device wakeup enable
IB/hfi1: Incorrect sizing of sge for PIO will OOPs
ALSA: rme9652: Fix potential Spectre v1 vulnerability
ALSA: emu10k1: Fix potential Spectre v1 vulnerabilities
ALSA: pcm: Fix potential Spectre v1 vulnerability
ALSA: emux: Fix potential Spectre v1 vulnerabilities
powerpc/fsl: Fix spectre_v2 mitigations reporting
mtd: atmel-quadspi: disallow building on ebsa110
mtd: rawnand: marvell: prevent timeouts on a loaded machine
mtd: rawnand: omap2: Pass the parent of pdev to dma_request_chan()
ALSA: hda: add mute LED support for HP EliteBook 840 G4
ALSA: hda/realtek: Enable audio jacks of ASUS UX391UA with ALC294
ALSA: fireface: fix for state to fetch PCM frames
ALSA: firewire-lib: fix wrong handling payload_length as payload_quadlet
ALSA: firewire-lib: fix wrong assignment for 'out_packet_without_header' tracepoint
ALSA: firewire-lib: use the same print format for 'without_header' tracepoints
ALSA: hda/realtek: Enable the headset mic auto detection for ASUS laptops
ALSA: hda/tegra: clear pending irq handlers
usb: dwc2: host: use hrtimer for NAK retries
USB: serial: pl2303: add ids for Hewlett-Packard HP POS pole displays
USB: serial: option: add Fibocom NL678 series
usb: r8a66597: Fix a possible concurrency use-after-free bug in r8a66597_endpoint_disable()
usb: dwc2: disable power_down on Amlogic devices
Revert "usb: dwc3: pci: Use devm functions to get the phy GPIOs"
usb: roles: Add a description for the class to Kconfig
media: dvb-usb-v2: Fix incorrect use of transfer_flags URB_FREE_BUFFER
staging: wilc1000: fix missing read_write setting when reading data
ASoC: intel: cht_bsw_max98090_ti: Add pmc_plt_clk_0 quirk for Chromebook Clapper
ASoC: intel: cht_bsw_max98090_ti: Add pmc_plt_clk_0 quirk for Chromebook Gnawty
s390/pci: fix sleeping in atomic during hotplug
Input: atmel_mxt_ts - don't try to free unallocated kernel memory
Input: elan_i2c - add ACPI ID for touchpad in ASUS Aspire F5-573G
x86/speculation/l1tf: Drop the swap storage limit restriction when l1tf=off
x86/mm: Drop usage of __flush_tlb_all() in kernel_physical_mapping_init()
KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup
arm64: KVM: Make VHE Stage-2 TLB invalidation operations non-interruptible
KVM: nVMX: Free the VMREAD/VMWRITE bitmaps if alloc_kvm_area() fails
platform-msi: Free descriptors in platform_msi_domain_free()
drm/v3d: Skip debugfs dumping GCA on platforms without GCA.
DRM: UDL: get rid of useless vblank initialization
clocksource/drivers/arc_timer: Utilize generic sched_clock
perf machine: Record if a arch has a single user/kernel address space
perf thread: Add fallback functions for cases where cpumode is insufficient
perf tools: Use fallback for sample_addr_correlates_sym() cases
perf script: Use fallbacks for branch stacks
perf pmu: Suppress potential format-truncation warning
perf env: Also consider env->arch == NULL as local operation
ocxl: Fix endiannes bug in ocxl_link_update_pe()
ocxl: Fix endiannes bug in read_afu_name()
ext4: add ext4_sb_bread() to disambiguate ENOMEM cases
ext4: fix possible use after free in ext4_quota_enable
ext4: missing unlock/put_page() in ext4_try_to_write_inline_data()
ext4: fix EXT4_IOC_GROUP_ADD ioctl
ext4: include terminating u32 in size of xattr entries when expanding inodes
ext4: avoid declaring fs inconsistent due to invalid file handles
ext4: force inode writes when nfsd calls commit_metadata()
ext4: check for shutdown and r/o file system in ext4_write_inode()
spi: bcm2835: Fix race on DMA termination
spi: bcm2835: Fix book-keeping of DMA termination
spi: bcm2835: Avoid finishing transfer prematurely in IRQ mode
clk: rockchip: fix typo in rk3188 spdif_frac parent
clk: sunxi-ng: Use u64 for calculation of NM rate
crypto: cavium/nitrox - fix a DMA pool free failure
crypto: chcr - small packet Tx stalls the queue
crypto: testmgr - add AES-CFB tests
crypto: cfb - fix decryption
cgroup: fix CSS_TASK_ITER_PROCS
cdc-acm: fix abnormal DATA RX issue for Mediatek Preloader.
btrfs: dev-replace: go back to suspended state if target device is missing
btrfs: dev-replace: go back to suspend state if another EXCL_OP is running
btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow
Btrfs: fix fsync of files with multiple hard links in new directories
btrfs: run delayed items before dropping the snapshot
Btrfs: send, fix race with transaction commits that create snapshots
brcmfmac: fix roamoff=1 modparam
brcmfmac: Fix out of bounds memory access during fw load
powerpc/tm: Unset MSR[TS] if not recheckpointing
dax: Don't access a freed inode
dax: Use non-exclusive wait in wait_entry_unlocked()
f2fs: read page index before freeing
f2fs: fix validation of the block count in sanity_check_raw_super
f2fs: sanity check of xattr entry size
serial: uartps: Fix interrupt mask issue to handle the RX interrupts properly
media: cec: keep track of outstanding transmits
media: cec-pin: fix broken tx_ignore_nack_until_eom error injection
media: rc: cec devices do not have a lirc chardev
media: imx274: fix stack corruption in imx274_read_reg
media: vivid: free bitmap_cap when updating std/timings/etc.
media: vb2: check memory model for VIDIOC_CREATE_BUFS
media: v4l2-tpg: array index could become negative
tools lib traceevent: Fix processing of dereferenced args in bprintk events
MIPS: math-emu: Write-protect delay slot emulation pages
MIPS: c-r4k: Add r4k_blast_scache_node for Loongson-3
MIPS: Ensure pmd_present() returns false after pmd_mknotpresent()
MIPS: Align kernel load address to 64KB
MIPS: Expand MIPS32 ASIDs to 64 bits
MIPS: OCTEON: mark RGMII interface disabled on OCTEON III
MIPS: Fix a R10000_LLSC_WAR logic in atomic.h
CIFS: Fix error mapping for SMB2_LOCK command which caused OFD lock problem
smb3: fix large reads on encrypted connections
arm64: KVM: Avoid setting the upper 32 bits of VTCR_EL2 to 1
arm/arm64: KVM: vgic: Force VM halt when changing the active state of GICv3 PPIs/SGIs
ARM: dts: exynos: Specify I2S assigned clocks in proper node
rtc: m41t80: Correct alarm month range with RTC reads
KVM: arm/arm64: vgic: Do not cond_resched_lock() with IRQs disabled
KVM: arm/arm64: vgic: Cap SPIs to the VM-defined maximum
KVM: arm/arm64: vgic-v2: Set active_source to 0 when restoring state
KVM: arm/arm64: vgic: Fix off-by-one bug in vgic_get_irq()
iommu/arm-smmu-v3: Fix big-endian CMD_SYNC writes
arm64: compat: Avoid sending SIGILL for unallocated syscall numbers
tpm: tpm_try_transmit() refactor error flow.
tpm: tpm_i2c_nuvoton: use correct command duration for TPM 2.x
spi: bcm2835: Unbreak the build of esoteric configs
MIPS: Only include mmzone.h when CONFIG_NEED_MULTIPLE_NODES=y
Linux 4.19.14
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit e9d81a1bc2 upstream.
CSS_TASK_ITER_PROCS implements process-only iteration by making
css_task_iter_advance() skip tasks which aren't threadgroup leaders;
however, when an iteration is started css_task_iter_start() calls the
inner helper function css_task_iter_advance_css_set() instead of
css_task_iter_advance(). As the helper doesn't have the skip logic,
when the first task to visit is a non-leader thread, it doesn't get
skipped correctly as shown in the following example.
# ps -L 2030
PID LWP TTY STAT TIME COMMAND
2030 2030 pts/0 Sl+ 0:00 ./test-thread
2030 2031 pts/0 Sl+ 0:00 ./test-thread
# mkdir -p /sys/fs/cgroup/x/a/b
# echo threaded > /sys/fs/cgroup/x/a/cgroup.type
# echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
# echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
# cat /sys/fs/cgroup/x/a/cgroup.threads
2030
2031
# cat /sys/fs/cgroup/x/cgroup.procs
2030
# echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
# cat /sys/fs/cgroup/x/cgroup.procs
2031
2030
The last read of cgroup.procs is incorrectly showing non-leader 2031
in cgroup.procs output.
This can be fixed by updating css_task_iter_advance() to handle the
first advance and css_task_iters_tart() to call
css_task_iter_advance() instead of the inner helper. After the fix,
the same commands result in the following (correct) result:
# ps -L 2062
PID LWP TTY STAT TIME COMMAND
2062 2062 pts/0 Sl+ 0:00 ./test-thread
2062 2063 pts/0 Sl+ 0:00 ./test-thread
# mkdir -p /sys/fs/cgroup/x/a/b
# echo threaded > /sys/fs/cgroup/x/a/cgroup.type
# echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
# echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
# cat /sys/fs/cgroup/x/a/cgroup.threads
2062
2063
# cat /sys/fs/cgroup/x/cgroup.procs
2062
# echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
# cat /sys/fs/cgroup/x/cgroup.procs
2062
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Fixes: 8cfd8147df ("cgroup: implement cgroup v2 thread support")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit b8c80a0f42. It has not
been accepted upstream and is useful only for debug purpose. An
alternate debugfs interface will be pushed upstream to expose the Energy
Model. Android should use that too. In the meantime, it should be early
enough to remove the sysfs nodes before partners start basing userspace
tools on them.
Bug: 120440300
Change-Id: I8a5f003550df7df9116d4735c8d9ac10ac0aa79c
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
This reverts commit b78eec5fb7. It has not
been accepted upstream and is of no particular interest in Android since
EAS is always enabled anyway.
Bug: 120440300
Change-Id: I7d1f2ad206c05041d386fc99f18e29c4fa56cbc8
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
The core EAS patches have now been accepted upstream. The patches used
in Android are based on a slightly earlier version of the series. In
order to reduce the delta with mainline and ease backports, align the
EAS code paths with their upstream version.
This basically applies the output of git range-diff on the appropriate
commits, and fixes a conflict in schedutil regarding the integration of
schedtune.
Bug: 120440300
Change-Id: I208ebeb4207e3f4f4bbb5103c606b293a464c20f
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Changes in 4.19.13
iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()"
Revert "vfs: Allow userns root to call mknod on owned filesystems."
USB: hso: Fix OOB memory access in hso_probe/hso_get_config_data
xhci: Don't prevent USB2 bus suspend in state check intended for USB3 only
USB: xhci: fix 'broken_suspend' placement in struct xchi_hcd
USB: serial: option: add GosunCn ZTE WeLink ME3630
USB: serial: option: add HP lt4132
USB: serial: option: add Simcom SIM7500/SIM7600 (MBIM mode)
USB: serial: option: add Fibocom NL668 series
USB: serial: option: add Telit LN940 series
ubifs: Handle re-linking of inodes correctly while recovery
scsi: t10-pi: Return correct ref tag when queue has no integrity profile
scsi: sd: use mempool for discard special page
mmc: core: Reset HPI enabled state during re-init and in case of errors
mmc: core: Allow BKOPS and CACHE ctrl even if no HPI support
mmc: core: Use a minimum 1600ms timeout when enabling CACHE ctrl
mmc: omap_hsmmc: fix DMA API warning
gpio: max7301: fix driver for use with CONFIG_VMAP_STACK
gpiolib-acpi: Only defer request_irq for GpioInt ACPI event handlers
posix-timers: Fix division by zero bug
KVM: X86: Fix NULL deref in vcpu_scan_ioapic
kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs
KVM: Fix UAF in nested posted interrupt processing
Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels
futex: Cure exit race
x86/mtrr: Don't copy uninitialized gentry fields back to userspace
x86/mm: Fix decoy address handling vs 32-bit builds
x86/vdso: Pass --eh-frame-hdr to the linker
x86/intel_rdt: Ensure a CPU remains online for the region's pseudo-locking sequence
panic: avoid deadlocks in re-entrant console drivers
mm: add mm_pxd_folded checks to pgtable_bytes accounting functions
mm: make the __PAGETABLE_PxD_FOLDED defines non-empty
mm: introduce mm_[p4d|pud|pmd]_folded
xfrm_user: fix freeing of xfrm states on acquire
rtlwifi: Fix leak of skb when processing C2H_BT_INFO
iwlwifi: mvm: don't send GEO_TX_POWER_LIMIT to old firmwares
Revert "mwifiex: restructure rx_reorder_tbl_lock usage"
iwlwifi: add new cards for 9560, 9462, 9461 and killer series
media: ov5640: Fix set format regression
mm, memory_hotplug: initialize struct pages for the full memory section
mm: thp: fix flags for pmd migration when split
mm, page_alloc: fix has_unmovable_pages for HugePages
mm: don't miss the last page because of round-off error
Input: elantech - disable elan-i2c for P52 and P72
proc/sysctl: don't return ENOMEM on lookup when a table is unregistering
drm/ioctl: Fix Spectre v1 vulnerabilities
Linux 4.19.13
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit c7c3f05e34 upstream.
From printk()/serial console point of view panic() is special, because
it may force CPU to re-enter printk() or/and serial console driver.
Therefore, some of serial consoles drivers are re-entrant. E.g. 8250:
serial8250_console_write()
{
if (port->sysrq)
locked = 0;
else if (oops_in_progress)
locked = spin_trylock_irqsave(&port->lock, flags);
else
spin_lock_irqsave(&port->lock, flags);
...
}
panic() does set oops_in_progress via bust_spinlocks(1), so in theory
we should be able to re-enter serial console driver from panic():
CPU0
<NMI>
uart_console_write()
serial8250_console_write() // if (oops_in_progress)
// spin_trylock_irqsave()
call_console_drivers()
console_unlock()
console_flush_on_panic()
bust_spinlocks(1) // oops_in_progress++
panic()
<NMI/>
spin_lock_irqsave(&port->lock, flags) // spin_lock_irqsave()
serial8250_console_write()
call_console_drivers()
console_unlock()
printk()
...
However, this does not happen and we deadlock in serial console on
port->lock spinlock. And the problem is that console_flush_on_panic()
called after bust_spinlocks(0):
void panic(const char *fmt, ...)
{
bust_spinlocks(1);
...
bust_spinlocks(0);
console_flush_on_panic();
...
}
bust_spinlocks(0) decrements oops_in_progress, so oops_in_progress
can go back to zero. Thus even re-entrant console drivers will simply
spin on port->lock spinlock. Given that port->lock may already be
locked either by a stopped CPU, or by the very same CPU we execute
panic() on (for instance, NMI panic() on printing CPU) the system
deadlocks and does not reboot.
Fix this by removing bust_spinlocks(0), so oops_in_progress is always
set in panic() now and, thus, re-entrant console drivers will trylock
the port->lock instead of spinning on it forever, when we call them
from console_flush_on_panic().
Link: http://lkml.kernel.org/r/20181025101036.6823-1-sergey.senozhatsky@gmail.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Wang <wonderfly@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: linux-serial@vger.kernel.org
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit da791a6675 upstream.
Stefan reported, that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex_lock_pi():
CPU0 CPU1
sys_exit() sys_futex()
do_exit() futex_lock_pi()
exit_signals(tsk) No waiters:
tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
mm_release(tsk) Set waiter bit
exit_robust_list(tsk) { *uaddr = 0x80000PID;
Set owner died attach_to_pi_owner() {
*uaddr = 0xC0000000; tsk = get_task(PID);
} if (!tsk->flags & PF_EXITING) {
... attach();
tsk->flags |= PF_EXITPIDONE; } else {
if (!(tsk->flags & PF_EXITPIDONE))
return -EAGAIN;
return -ESRCH; <--- FAIL
}
ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.
Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
is set in the task which 'owns' the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.
If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.
Reported-by: Stefan Liebler <stli@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.19.12
locking/qspinlock: Re-order code
locking/qspinlock, x86: Provide liveness guarantee
IB/hfi1: Remove race conditions in user_sdma send path
mac80211_hwsim: fix module init error paths for netlink
Input: hyper-v - fix wakeup from suspend-to-idle
i2c: rcar: check bus state before reinitializing
scsi: libiscsi: Fix NULL pointer dereference in iscsi_eh_session_reset
scsi: vmw_pscsi: Rearrange code to avoid multiple calls to free_irq during unload
tools/bpf: fix two test_btf unit test cases
tools/bpf: add addition type tests to test_btf
net: ethernet: ave: Replace NET_IP_ALIGN with AVE_FRAME_HEADROOM
drm/amd/display: Fix 6x4K displays light-up on Vega20 (v2)
x86/earlyprintk/efi: Fix infinite loop on some screen widths
drm/msm: Fix task dump in gpu recovery
drm/msm/gpu: Fix a couple memory leaks in debugfs
drm/msm: fix handling of cmdstream offset
drm/msm/dsi: configure VCO rate for 10nm PLL driver
drm/msm: Grab a vblank reference when waiting for commit_done
drm/ttm: fix LRU handling in ttm_buffer_object_transfer
drm/amdgpu: wait for IB test on first device open
ARC: io.h: Implement reads{x}()/writes{x}()
net: stmmac: Move debugfs init/exit to ->probe()/->remove()
net: aquantia: fix rx checksum offload bits
bonding: fix 802.3ad state sent to partner when unbinding slave
bpf: Fix verifier log string check for bad alignment.
liquidio: read sc->iq_no before release sc
nfs: don't dirty kernel pages read by direct-io
SUNRPC: Fix a potential race in xprt_connect()
sbus: char: add of_node_put()
drivers/sbus/char: add of_node_put()
drivers/tty: add missing of_node_put()
ide: pmac: add of_node_put()
drm/msm/hdmi: Enable HPD after HDMI IRQ is set up
drm/msm: dpu: Don't set legacy plane->crtc pointer
drm/msm: dpu: Fix "WARNING: invalid free of devm_ allocated data"
drm/msm: Fix error return checking
drm/amd/powerplay: issue pre-display settings for display change event
clk: mvebu: Off by one bugs in cp110_of_clk_get()
clk: mmp: Off by one in mmp_clk_add()
Input: synaptics - enable SMBus for HP 15-ay000
Input: omap-keypad - fix keyboard debounce configuration
libata: whitelist all SAMSUNG MZ7KM* solid-state disks
macvlan: return correct error value
mv88e6060: disable hardware level MAC learning
net/mlx4_en: Fix build break when CONFIG_INET is off
bpf: check pending signals while verifying programs
ARM: 8814/1: mm: improve/fix ARM v7_dma_inv_range() unaligned address handling
ARM: 8815/1: V7M: align v7m_dma_inv_range() with v7 counterpart
ARM: 8816/1: dma-mapping: fix potential uninitialized return
ethernet: fman: fix wrong of_node_put() in probe function
thermal: armada: fix legacy validity test sense
net: mvpp2: fix detection of 10G SFP modules
net: mvpp2: fix phylink handling of invalid PHY modes
drm/amdgpu/vcn: Update vcn.cur_state during suspend
tools/testing/nvdimm: Align test resources to 128M
acpi/nfit: Fix user-initiated ARS to be "ARS-long" rather than "ARS-short"
drm/ast: Fix connector leak during driver unload
cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs)
vhost/vsock: fix reset orphans race with close timeout
mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl
i2c: axxia: properly handle master timeout
i2c: scmi: Fix probe error on devices with an empty SMB0001 ACPI device node
i2c: uniphier: fix violation of tLOW requirement for Fast-mode
i2c: uniphier-f: fix violation of tLOW requirement for Fast-mode
nvme: validate controller state before rescheduling keep alive
nvmet-rdma: fix response use after free
Btrfs: fix missing delayed iputs on unmount
Linux 4.19.12
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit c3494801cd ]
Malicious user space may try to force the verifier to use as much cpu
time and memory as possible. Hence check for pending signals
while verifying the program.
Note that suspend of sys_bpf(PROG_LOAD) syscall will lead to EAGAIN,
since the kernel has to release the resources used for program verification.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7aa54be297 upstream.
On x86 we cannot do fetch_or() with a single instruction and thus end up
using a cmpxchg loop, this reduces determinism. Replace the fetch_or()
with a composite operation: tas-pending + load.
Using two instructions of course opens a window we previously did not
have. Consider the scenario:
CPU0 CPU1 CPU2
1) lock
trylock -> (0,0,1)
2) lock
trylock /* fail */
3) unlock -> (0,0,0)
4) lock
trylock -> (0,0,1)
5) tas-pending -> (0,1,1)
load-val <- (0,1,0) from 3
6) clear-pending-set-locked -> (0,0,1)
FAIL: _2_ owners
where 5) is our new composite operation. When we consider each part of
the qspinlock state as a separate variable (as we can when
_Q_PENDING_BITS == 8) then the above is entirely possible, because
tas-pending will only RmW the pending byte, so the later load is able
to observe prior tail and lock state (but not earlier than its own
trylock, which operates on the whole word, due to coherence).
To avoid this we need 2 things:
- the load must come after the tas-pending (obviously, otherwise it
can trivially observe prior state).
- the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for
example, such that we cannot observe other state prior to setting
pending.
On x86 we can realize this by using "LOCK BTS m32, r32" for
tas-pending followed by a regular load.
Note that observing later state is not a problem:
- if we fail to observe a later unlock, we'll simply spin-wait for
that store to become visible.
- if we observe a later xchg_tail(), there is no difference from that
xchg_tail() having taken place before the tas-pending.
Suggested-by: Will Deacon <will.deacon@arm.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Fixes: 59fb586b4a ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bigeasy: GEN_BINARY_RMWcc macro redo]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 4.19.11
sched/pelt: Fix warning and clean up IRQ PELT config
scsi: raid_attrs: fix unused variable warning
staging: olpc_dcon: add a missing dependency
slimbus: ngd: mark PM functions as __maybe_unused
i2c: aspeed: fix build warning
ARM: dts: qcom-apq8064-arrow-sd-600eval fix graph_endpoint warning
drm/msm: fix address space warning
pinctrl: sunxi: a83t: Fix IRQ offset typo for PH11
aio: fix spectre gadget in lookup_ioctx
scripts/spdxcheck.py: always open files in binary mode
fs/iomap.c: get/put the page in iomap_page_create/release()
userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registered
arm64: dma-mapping: Fix FORCE_CONTIGUOUS buffer clearing
block/bio: Do not zero user pages
ovl: fix decode of dir file handle with multi lower layers
ovl: fix missing override creds in link of a metacopy upper
MMC: OMAP: fix broken MMC on OMAP15XX/OMAP5910/OMAP310
mmc: core: use mrq->sbc when sending CMD23 for RPMB
mmc: sdhci-omap: Fix DCRC error handling during tuning
mmc: sdhci: fix the timeout check window for clock and reset
fuse: continue to send FUSE_RELEASEDIR when FUSE_OPEN returns ENOSYS
ARM: mmp/mmp2: fix cpu_is_mmp2() on mmp2-dt
ARM: dts: bcm2837: Fix polarity of wifi reset GPIOs
dm thin: send event about thin-pool state change _after_ making it
dm cache metadata: verify cache has blocks in blocks_are_clean_separate_dirty()
dm: call blk_queue_split() to impose device limits on bios
tracing: Fix memory leak in create_filter()
tracing: Fix memory leak in set_trigger_filter()
tracing: Fix memory leak of instance function hash filters
media: vb2: don't call __vb2_queue_cancel if vb2_start_streaming failed
powerpc/msi: Fix NULL pointer access in teardown code
powerpc: Look for "stdout-path" when setting up legacy consoles
drm/nouveau/kms: Fix memory leak in nv50_mstm_del()
drm/nouveau/kms/nv50-: also flush fb writes when rewinding push buffer
Revert "drm/rockchip: Allow driver to be shutdown on reboot/kexec"
drm/i915/gvt: Fix tiled memory decoding bug on BDW
drm/i915/execlists: Apply a full mb before execution for Braswell
drm/amdgpu/powerplay: Apply avfs cks-off voltages on VI
drm/amdkfd: add new vega10 pci ids
drm/amdgpu: add some additional vega10 pci ids
drm/amdgpu: update smu firmware images for VI variants (v2)
drm/amdgpu: update SMC firmware image for polaris10 variants
dm zoned: Fix target BIO completion handling
x86/build: Fix compiler support check for CONFIG_RETPOLINE
Linux 4.19.11
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 2840f84f74 upstream.
The following commands will cause a memory leak:
# cd /sys/kernel/tracing
# mkdir instances/foo
# echo schedule > instance/foo/set_ftrace_filter
# rmdir instances/foo
The reason is that the hashes that hold the filters to set_ftrace_filter and
set_ftrace_notrace are not freed if they contain any data on the instance
and the instance is removed.
Found by kmemleak detector.
Cc: stable@vger.kernel.org
Fixes: 591dffdade ("ftrace: Allow for function tracing instance to filter functions")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3cec638b3d upstream.
When create_event_filter() fails in set_trigger_filter(), the filter may
still be allocated and needs to be freed. The caller expects the
data->filter to be updated with the new filter, even if the new filter
failed (we could add an error message by setting set_str parameter of
create_event_filter(), but that's another update).
But because the error would just exit, filter was left hanging and
nothing could free it.
Found by kmemleak detector.
Cc: stable@vger.kernel.org
Fixes: bac5fb97a1 ("tracing: Add and use generic set_trigger_filter() implementation")
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b61c19209c upstream.
The create_filter() calls create_filter_start() which allocates a
"parse_error" descriptor, but fails to call create_filter_finish() that
frees it.
The op_stack and inverts in predicate_parse() were also not freed.
Found by kmemleak detector.
Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>