[ no upstream commit ]
Fix incorrect bounds tracking for RSH opcode. Commit f23cc643f9 ("bpf: fix
range arithmetic for bpf map access") had a wrong assumption about min/max
bounds. The new dst_reg->min_value needs to be derived by right shifting the
max_val bounds, not min_val, and likewise new dst_reg->max_value needs to be
derived by right shifting the min_val bounds, not max_val. Later stable kernels
than 4.9 are not affected since bounds tracking was overall reworked and they
already track this similarly as in the fix.
Fixes: f23cc643f9 ("bpf: fix range arithmetic for bpf map access")
Reported-by: Ryota Shiga (Flatt Security)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8a37963c7a ]
If an element is freed via RCU then recursion into BPF instrumentation
functions is not a concern. The element is already detached from the map
and the RCU callback does not hold any locks on which a kprobe, perf event
or tracepoint attached BPF program could deadlock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.259118710@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fdadd04931 ]
Michael and Sandipan report:
Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
and is adjusted again at init time if MODULES_VADDR is defined.
For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
the compile-time default at boot-time, which is 0x9c400000 when
using 64K page size. This overflows the signed 32-bit bpf_jit_limit
value:
root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
-1673527296
and can cause various unexpected failures throughout the network
stack. In one case `strace dhclient eth0` reported:
setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
16) = -1 ENOTSUPP (Unknown error 524)
and similar failures can be seen with tools like tcpdump. This doesn't
always reproduce however, and I'm not sure why. The more consistent
failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
host would time out on systemd/netplan configuring a virtio-net NIC
with no noticeable errors in the logs.
Given this and also given that in near future some architectures like
arm64 will have a custom area for BPF JIT image allocations we should
get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For
4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec()
so therefore add another overridable bpf_jit_alloc_exec_limit() helper
function which returns the possible size of the memory area for deriving
the default heuristic in bpf_jit_charge_init().
Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
JIT memory provider, and therefore in case archs implement their custom
module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
Additionally, for archs supporting large page sizes, we should change
the sysctl to be handled as long to not run into sysctl restrictions
in future.
Fixes: ede95a63b5 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Reported-by: Sandipan Das <sandipan@linux.ibm.com>
Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ede95a63b5 upstream.
Rick reported that the BPF JIT could potentially fill the entire module
space with BPF programs from unprivileged users which would prevent later
attempts to load normal kernel modules or privileged BPF programs, for
example. If JIT was enabled but unsuccessful to generate the image, then
before commit 290af86629 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
we would always fall back to the BPF interpreter. Nowadays in the case
where the CONFIG_BPF_JIT_ALWAYS_ON could be set, then the load will abort
with a failure since the BPF interpreter was compiled out.
Add a global limit and enforce it for unprivileged users such that in case
of BPF interpreter compiled out we fail once the limit has been reached
or we fall back to BPF interpreter earlier w/o using module mem if latter
was compiled in. In a next step, fair share among unprivileged users can
be resolved in particular for the case where we would fail hard once limit
is reached.
Fixes: 290af86629 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
Fixes: 0a14842f5a ("net: filter: Just In Time compiler for x86-64")
Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
[bwh: Backported to 4.9: adjust context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fa9dd599b4 upstream.
Having a pure_initcall() callback just to permanently enable BPF
JITs under CONFIG_BPF_JIT_ALWAYS_ON is unnecessary and could leave
a small race window in future where JIT is still disabled on boot.
Since we know about the setting at compilation time anyway, just
initialize it properly there. Also consolidate all the individual
bpf_jit_enable variables into a single one and move them under one
location. Moreover, don't allow for setting unspecified garbage
values on them.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
[bwh: Backported to 4.9 as dependency of commit 2e4a30983b
"bpf: restrict access to core bpf sysctls":
- Drop change in arch/mips/net/ebpf_jit.c
- Drop change to bpf_jit_kallsyms
- Adjust filenames, context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4fe8435909 upstream.
when all map elements are pre-allocated one cpu can delete and reuse htab_elem
while another cpu is still walking the hlist. In such case the lookup may
miss the element. Convert hlist to hlist_nulls to avoid such scenario.
When bucket lock is taken there is no need to take such precautions,
so only convert map_lookup and map_get_next to nulls.
The race window is extremely small and only reproducible with explicit
udelay() inside lookup_nulls_elem_raw()
Similar to hlist add hlist_nulls_for_each_entry_safe() and
hlist_nulls_entry_safe() helpers.
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Reported-by: Jonathan Perry <jonperry@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 9f691549f7 upstream.
when htab_elem is removed from the bucket list the htab_elem.hash_node.next
field should not be overridden too early otherwise we have a tiny race window
between lookup and delete.
The bug was discovered by manual code analysis and reproducible
only with explicit udelay() in lookup_elem_raw().
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Reported-by: Jonathan Perry <jonperry@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c3494801cd ]
Malicious user space may try to force the verifier to use as much cpu
time and memory as possible. Hence check for pending signals
while verifying the program.
Note that suspend of sys_bpf(PROG_LOAD) syscall will lead to EAGAIN,
since the kernel has to release the resources used for program verification.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 4.9.99
perf/core: Fix the perf_cpu_time_max_percent check
percpu: include linux/sched.h for cond_resched()
bpf: map_get_next_key to return first key on NULL
arm/arm64: KVM: Add PSCI version selection API
crypto: talitos - fix IPsec cipher in length
serial: imx: ensure UCR3 and UFCR are setup correctly
USB: serial: option: Add support for Quectel EP06
ALSA: pcm: Check PCM state at xfern compat ioctl
ALSA: seq: Fix races at MIDI encoding in snd_virmidi_output_trigger()
ALSA: aloop: Mark paused device as inactive
ALSA: aloop: Add missing cable lock to ctl API callbacks
tracepoint: Do not warn on ENOMEM
Input: leds - fix out of bound access
Input: atmel_mxt_ts - add touchpad button mapping for Samsung Chromebook Pro
xfs: prevent creating negative-sized file via INSERT_RANGE
RDMA/cxgb4: release hw resources on device removal
RDMA/ucma: Allow resolving address w/o specifying source address
RDMA/mlx5: Protect from shift operand overflow
NET: usb: qmi_wwan: add support for ublox R410M PID 0x90b2
IB/mlx5: Use unlimited rate when static rate is not supported
IB/hfi1: Fix NULL pointer dereference when invalid num_vls is used
drm/vmwgfx: Fix a buffer object leak
drm/bridge: vga-dac: Fix edid memory leak
test_firmware: fix setting old custom fw path back on exit, second try
USB: serial: visor: handle potential invalid device configuration
USB: Accept bulk endpoints with 1024-byte maxpacket
USB: serial: option: reimplement interface masking
USB: serial: option: adding support for ublox R410M
usb: musb: host: fix potential NULL pointer dereference
usb: musb: trace: fix NULL pointer dereference in musb_g_tx()
platform/x86: asus-wireless: Fix NULL pointer dereference
s390/facilites: use stfle_fac_list array size for MAX_FACILITY_BIT
Linux 4.9.99
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8fe4592438 upstream.
When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.
This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.
Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chenbo Feng <fengc@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.9.91
MIPS: ralink: Remove ralink_halt()
iio: st_pressure: st_accel: pass correct platform data to init
ALSA: usb-audio: Fix parsing descriptor of UAC2 processing unit
ALSA: aloop: Sync stale timer before release
ALSA: aloop: Fix access to not-yet-ready substream via cable
ALSA: hda/realtek - Always immediately update mute LED with pin VREF
mmc: dw_mmc: fix falling from idmac to PIO mode when dw_mci_reset occurs
PCI: Add function 1 DMA alias quirk for Highpoint RocketRAID 644L
ahci: Add PCI-id for the Highpoint Rocketraid 644L card
clk: bcm2835: Fix ana->maskX definitions
clk: bcm2835: Protect sections updating shared registers
clk: sunxi-ng: a31: Fix CLK_OUT_* clock ops
Bluetooth: btusb: Fix quirk for Atheros 1525/QCA6174
libata: fix length validation of ATAPI-relayed SCSI commands
libata: remove WARN() for DMA or PIO command without data
libata: don't try to pass through NCQ commands to non-NCQ devices
libata: Apply NOLPM quirk to Crucial MX100 512GB SSDs
libata: disable LPM for Crucial BX100 SSD 500GB drive
libata: Enable queued TRIM for Samsung SSD 860
libata: Apply NOLPM quirk to Crucial M500 480 and 960GB SSDs
libata: Make Crucial BX100 500GB LPM quirk apply to all firmware versions
libata: Modify quirks for MX100 to limit NCQ_TRIM quirk to MU01 version
nfsd: remove blocked locks on client teardown
mm/vmalloc: add interfaces to free unmapped page table
x86/mm: implement free pmd/pte page interfaces
mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
mm/thp: do not wait for lock_page() in deferred_split_scan()
mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
drm/vmwgfx: Fix a destoy-while-held mutex problem.
drm/radeon: Don't turn off DP sink when disconnected
drm: udl: Properly check framebuffer mmap offsets
acpi, numa: fix pxm to online numa node associations
ACPI / watchdog: Fix off-by-one error at resource assignment
libnvdimm, {btt, blk}: do integrity setup before add_disk()
brcmfmac: fix P2P_DEVICE ethernet address generation
rtlwifi: rtl8723be: Fix loss of signal
tracing: probeevent: Fix to support minus offset from symbol
mtdchar: fix usage of mtd_ooblayout_ecc()
mtd: nand: fsl_ifc: Fix nand waitfunc return value
mtd: nand: fsl_ifc: Fix eccstat array overflow for IFC ver >= 2.0.0
mtd: nand: fsl_ifc: Read ECCSTAT0 and ECCSTAT1 registers for IFC 2.0
staging: ncpfs: memory corruption in ncp_read_kernel()
can: ifi: Repair the error handling
can: ifi: Check core revision upon probe
can: cc770: Fix stalls on rt-linux, remove redundant IRQ ack
can: cc770: Fix queue stall & dropped RTR reply
can: cc770: Fix use after free in cc770_tx_interrupt()
tty: vt: fix up tabstops properly
selftests/x86/ptrace_syscall: Fix for yet more glibc interference
kvm/x86: fix icebp instruction handling
x86/build/64: Force the linker to use 2MB page size
x86/boot/64: Verify alignment of the LOAD segment
x86/entry/64: Don't use IST entry for #BP stack
perf/x86/intel/uncore: Fix Skylake UPI event format
perf stat: Fix CVS output format for non-supported counters
perf/x86/intel: Don't accidentally clear high bits in bdw_limit_period()
perf/x86/intel/uncore: Fix multi-domain PCI CHA enumeration bug on Skylake servers
iio: ABI: Fix name of timestamp sysfs file
staging: lustre: ptlrpc: kfree used instead of kvfree
selftests, x86, protection_keys: fix wrong offset in siginfo
selftests/x86/protection_keys: Fix syscall NR redefinition warnings
signal/testing: Don't look for __SI_FAULT in userspace
x86/pkeys/selftests: Rename 'si_pkey' to 'siginfo_pkey'
selftests: x86: sysret_ss_attrs doesn't build on a PIE build
kbuild: disable clang's default use of -fmerge-all-constants
bpf: skip unnecessary capability check
bpf, x64: increase number of passes
Linux 4.9.91
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 0fa4fe85f4 upstream.
The current check statement in BPF syscall will do a capability check
for CAP_SYS_ADMIN before checking sysctl_unprivileged_bpf_disabled. This
code path will trigger unnecessary security hooks on capability checking
and cause false alarms on unprivileged process trying to get CAP_SYS_ADMIN
access. This can be resolved by simply switch the order of the statement
and CAP_SYS_ADMIN is not required anyway if unprivileged bpf syscall is
allowed.
Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Descriptor table is a shared object; it's not a place where you can
stick temporary references to files, especially when we don't need
an opened file at all.
Cc: stable@vger.kernel.org # v4.14
Fixes: 98589a0998 ("netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chenbo Feng <fengc@google.com>
Removed the code related to function bpf_prog_get_ok() since it is not
exsit in current android tree.
(cherry picked from commit 040ee69226)
Change-Id: If7a602128cdea4b4b50c8effb215c9bca7449515
Commit 2c16d60332 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.
However this breaks subsequent iptables calls:
# iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
# iptables -A INPUT -s 5.6.7.8 -j ACCEPT
iptables: Invalid argument. Run `dmesg' for more information.
That's because iptables works by loading existing rules using
IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
the replacement set.
However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
(from the initial "iptables -m bpf" invocation) - so when 2nd invocation
occurs, userspace passes a bogus fd number, which leads to
'bpf_mt_check_v1' to fail.
One suggested solution [1] was to hack iptables userspace, to perform a
"entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new,
process-local fd per every 'xt_bpf_info_v1' entry seen.
However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
'.fd' and instead perform an in-kernel lookup for the bpf object given
the provided '.path'.
It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
expected to provide the path of the pinned object.
Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
[2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
Reported-by: Rafael Buchbinder <rafi@rbk.ms>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Chenbo Feng <fengc@google.com>
(cherry picked from commit 98589a0998)
Change-Id: Ia0d15a76823cca3afb38786a3d2c25c13ccf941d
Changes in 4.9.87
tpm: st33zp24: fix potential buffer overruns caused by bit glitches on the bus
tpm_i2c_infineon: fix potential buffer overruns caused by bit glitches on the bus
tpm_i2c_nuvoton: fix potential buffer overruns caused by bit glitches on the bus
tpm_tis: fix potential buffer overruns caused by bit glitches on the bus
tpm: constify transmit data pointers
tpm_tis_spi: Use DMA-safe memory for SPI transfers
tpm-dev-common: Reject too short writes
ALSA: usb-audio: Add a quirck for B&W PX headphones
ALSA: hda: Add a power_save blacklist
ALSA: hda - Fix pincfg at resume on Lenovo T470 dock
timers: Forward timer base before migrating timers
parisc: Fix ordering of cache and TLB flushes
cpufreq: s3c24xx: Fix broken s3c_cpufreq_init()
dax: fix vma_is_fsdax() helper
x86/xen: Zero MSR_IA32_SPEC_CTRL before suspend
x86/platform/intel-mid: Handle Intel Edison reboot correctly
media: m88ds3103: don't call a non-initalized function
nospec: Allow index argument to have const-qualified type
ARM: mvebu: Fix broken PL310_ERRATA_753970 selects
ARM: kvm: fix building with gcc-8
KVM: mmu: Fix overlap between public and private memslots
KVM/x86: Remove indirect MSR op calls from SPEC_CTRL
KVM/VMX: Optimize vmx_vcpu_run() and svm_vcpu_run() by marking the RDMSR path as unlikely()
PCI/ASPM: Deal with missing root ports in link state handling
dm io: fix duplicate bio completion due to missing ref count
ARM: dts: LogicPD SOM-LV: Fix I2C1 pinmux
ARM: dts: LogicPD Torpedo: Fix I2C1 pinmux
x86/mm: Give each mm TLB flush generation a unique ID
x86/speculation: Use Indirect Branch Prediction Barrier in context switch
md: only allow remove_and_add_spares when no sync_thread running.
netlink: put module reference if dump start fails
x86/apic/vector: Handle legacy irq data correctly
bridge: check brport attr show in brport_show
fib_semantics: Don't match route with mismatching tclassid
hdlc_ppp: carrier detect ok, don't turn off negotiation
ipv6 sit: work around bogus gcc-8 -Wrestrict warning
net: fix race on decreasing number of TX queues
net: ipv4: don't allow setting net.ipv4.route.min_pmtu below 68
netlink: ensure to loop over all netns in genlmsg_multicast_allns()
ppp: prevent unregistered channels from connecting to PPP units
udplite: fix partial checksum initialization
sctp: fix dst refcnt leak in sctp_v4_get_dst
mlxsw: spectrum_switchdev: Check success of FDB add operation
net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
tcp: Honor the eor bit in tcp_mtu_probe
rxrpc: Fix send in rxrpc_send_data_packet()
tcp_bbr: better deal with suboptimal GSO
sctp: fix dst refcnt leak in sctp_v6_get_dst()
s390/qeth: fix underestimated count of buffer elements
s390/qeth: fix SETIP command handling
s390/qeth: fix overestimated count of buffer elements
s390/qeth: fix IP removal on offline cards
s390/qeth: fix double-free on IP add/remove race
s390/qeth: fix IP address lookup for L3 devices
s390/qeth: fix IPA command submission race
sctp: verify size of a new chunk in _sctp_make_chunk()
net: mpls: Pull common label check into helper
mpls, nospec: Sanitize array index in mpls_label_ok()
bpf: fix wrong exposure of map_flags into fdinfo for lpm
bpf: fix mlock precharge on arraymaps
bpf, x64: implement retpoline for tail call
bpf, arm64: fix out of bounds access in tail call
bpf: add schedule points in percpu arrays management
bpf, ppc64: fix out of bounds access in tail call
btrfs: preserve i_mode if __btrfs_set_acl() fails
Linux 4.9.87
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ upstream commit 32fff239de ]
syszbot managed to trigger RCU detected stalls in
bpf_array_free_percpu()
It takes time to allocate a huge percpu map, but even more time to free
it.
Since we run in process context, use cond_resched() to yield cpu if
needed.
Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit 9c2d63b843 ]
syzkaller recently triggered OOM during percpu map allocation;
while there is work in progress by Dennis Zhou to add __GFP_NORETRY
semantics for percpu allocator under pressure, there seems also a
missing bpf_map_precharge_memlock() check in array map allocation.
Given today the actual bpf_map_charge_memlock() happens after the
find_and_alloc_map() in syscall path, the bpf_map_precharge_memlock()
is there to bail out early before we go and do the map setup work
when we find that we hit the limits anyway. Therefore add this for
array map as well.
Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit a316338cb7 ]
trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
attr->map_flags, since it does not support preallocation yet. We
check the flag, but we never copy the flag into trie->map.map_flags,
which is later on exposed into fdinfo and used by loaders such as
iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test
whether a pinned map has the same spec as the one from the BPF obj
file and if not, bails out, which is currently the case for lpm
since it exposes always 0 as flags.
Also copy over flags in array_map_alloc() and stack_map_alloc().
They always have to be 0 right now, but we should make sure to not
miss to copy them over at a later point in time when we add actual
flags for them to use.
Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reported-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.9.79
x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
orangefs: use list_for_each_entry_safe in purge_waiting_ops
orangefs: initialize op on loop restart in orangefs_devreq_read
usbip: prevent vhci_hcd driver from leaking a socket pointer address
usbip: Fix implicit fallthrough warning
usbip: Fix potential format overflow in userspace tools
can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
KVM: arm/arm64: Check pagesize when allocating a hugepage at Stage 2
Prevent timer value 0 for MWAITX
drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled
drivers: base: cacheinfo: fix boot error message when acpi is enabled
mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack
hwpoison, memcg: forcibly uncharge LRU pages
cma: fix calculation of aligned offset
mm, page_alloc: fix potential false positive in __zone_watermark_ok
ipc: msg, make msgrcv work with LONG_MIN
ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
ACPICA: Namespace: fix operand cache leak
netfilter: nfnetlink_cthelper: Add missing permission checks
netfilter: xt_osf: Add missing permission checks
reiserfs: fix race in prealloc discard
reiserfs: don't preallocate blocks for extended attributes
fs/fcntl: f_setown, avoid undefined behaviour
scsi: libiscsi: fix shifting of DID_REQUEUE host byte
Revert "module: Add retpoline tag to VERMAGIC"
mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
Input: trackpoint - force 3 buttons if 0 button is reported
orangefs: fix deadlock; do not write i_size in read_iter
um: link vmlinux with -no-pie
vsyscall: Fix permissions for emulate mode with KAISER/PTI
eventpoll.h: add missing epoll event masks
dccp: don't restart ccid2_hc_tx_rto_expire() if sk in closed state
ipv6: Fix getsockopt() for sockets with default IPV6_AUTOFLOWLABEL
ipv6: fix udpv6 sendmsg crash caused by too small MTU
ipv6: ip6_make_skb() needs to clear cork.base.dst
lan78xx: Fix failure in USB Full Speed
net: igmp: fix source address check for IGMPv3 reports
net: qdisc_pkt_len_init() should be more robust
net: tcp: close sock if net namespace is exiting
pppoe: take ->needed_headroom of lower device into account on xmit
r8169: fix memory corruption on retrieval of hardware statistics.
sctp: do not allow the v4 socket to bind a v4mapped v6 address
sctp: return error if the asoc has been peeled off in sctp_wait_for_sndbuf
tipc: fix a memory leak in tipc_nl_node_get_link()
vmxnet3: repair memory leak
net: Allow neigh contructor functions ability to modify the primary_key
ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY
ppp: unlock all_ppp_mutex before registering device
be2net: restore properly promisc mode after queues reconfiguration
ip6_gre: init dev->mtu and dev->hard_header_len correctly
gso: validate gso_type in GSO handlers
mlxsw: spectrum_router: Don't log an error on missing neighbor
tun: fix a memory leak for tfile->tx_array
flow_dissector: properly cap thoff field
perf/x86/amd/power: Do not load AMD power module on !AMD platforms
x86/microcode/intel: Extend BDW late-loading further with LLC size check
hrtimer: Reset hrtimer cpu base proper on CPU hotplug
x86: bpf_jit: small optimization in emit_bpf_tail_call()
bpf: fix bpf_tail_call() x64 JIT
bpf: introduce BPF_JIT_ALWAYS_ON config
bpf: arsh is not supported in 32 bit alu thus reject it
bpf: avoid false sharing of map refcount with max_entries
bpf: fix divides by zero
bpf: fix 32-bit divide by zero
bpf: reject stores into ctx via st and xadd
nfsd: auth: Fix gid sorting when rootsquash enabled
Linux 4.9.79
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ upstream commit f37a8cb84c ]
Alexei found that verifier does not reject stores into context
via BPF_ST instead of BPF_STX. And while looking at it, we
also should not allow XADD variant of BPF_STX.
The context rewriter is only assuming either BPF_LDX_MEM- or
BPF_STX_MEM-type operations, thus reject anything other than
that so that assumptions in the rewriter properly hold. Add
test cases as well for BPF selftests.
Fixes: d691f9e8d4 ("bpf: allow programs to write to certain skb fields")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit c366287ebd ]
Divides by zero are not nice, lets avoid them if possible.
Also do_div() seems not needed when dealing with 32bit operands,
but this seems a minor detail.
Fixes: bd4cf0ed33 ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit 7891a87efc ]
The following snippet was throwing an 'unknown opcode cc' warning
in BPF interpreter:
0: (18) r0 = 0x0
2: (7b) *(u64 *)(r10 -16) = r0
3: (cc) (u32) r0 s>>= (u32) r0
4: (95) exit
Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X}
generation, not all of them do and interpreter does neither. We can
leave existing ones and implement it later in bpf-next for the
remaining ones, but reject this properly in verifier for the time
being.
Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit 290af86629 ]
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ upstream commit 90caccdd8c ]
- bpf prog_array just like all other types of bpf array accepts 32-bit index.
Clarify that in the comment.
- fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
- tighten corresponding check in the interpreter to stay consistent
The JIT bug can be triggered after introduction of BPF_F_NUMA_NODE flag
in commit 96eabe7a40 in 4.14. Before that the map_flags would stay zero and
though JIT code is wrong it will check bounds correctly.
Hence two fixes tags. All other JITs don't have this problem.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 96eabe7a40 ("bpf: Allow selecting numa node during map creation")
Fixes: b52f00e6a7 ("x86: bpf_jit: implement bpf_tail_call() helper")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.9.77
dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
mac80211: Add RX flag to indicate ICV stripped
ath10k: rebuild crypto header in rx data frames
KVM: Fix stack-out-of-bounds read in write_mmio
can: gs_usb: fix return value of the "set_bittiming" callback
IB/srpt: Disable RDMA access by the initiator
MIPS: Validate PR_SET_FP_MODE prctl(2) requests against the ABI of the task
MIPS: Factor out NT_PRFPREG regset access helpers
MIPS: Guard against any partial write attempt with PTRACE_SETREGSET
MIPS: Consistently handle buffer counter with PTRACE_SETREGSET
MIPS: Fix an FCSR access API regression with NT_PRFPREG and MSA
MIPS: Also verify sizeof `elf_fpreg_t' with PTRACE_SETREGSET
MIPS: Disallow outsized PTRACE_SETREGSET NT_PRFPREG regset accesses
kvm: vmx: Scrub hardware GPRs at VM-exit
platform/x86: wmi: Call acpi_wmi_init() later
x86/acpi: Handle SCI interrupts above legacy space gracefully
ALSA: pcm: Remove incorrect snd_BUG_ON() usages
ALSA: pcm: Add missing error checks in OSS emulation plugin builder
ALSA: pcm: Abort properly at pending signal in OSS read/write loops
ALSA: pcm: Allow aborting mutex lock at OSS read/write loops
ALSA: aloop: Release cable upon open error path
ALSA: aloop: Fix inconsistent format due to incomplete rule
ALSA: aloop: Fix racy hw constraints adjustment
x86/acpi: Reduce code duplication in mp_override_legacy_irq()
zswap: don't param_set_charp while holding spinlock
lan78xx: use skb_cow_head() to deal with cloned skbs
sr9700: use skb_cow_head() to deal with cloned skbs
smsc75xx: use skb_cow_head() to deal with cloned skbs
cx82310_eth: use skb_cow_head() to deal with cloned skbs
xhci: Fix ring leak in failure path of xhci_alloc_virt_device()
8021q: fix a memory leak for VLAN 0 device
ip6_tunnel: disable dst caching if tunnel is dual-stack
net: core: fix module type in sock_diag_bind
RDS: Heap OOB write in rds_message_alloc_sgs()
RDS: null pointer dereference in rds_atomic_free_op
sh_eth: fix TSU resource handling
sh_eth: fix SH7757 GEther initialization
net: stmmac: enable EEE in MII, GMII or RGMII only
ipv6: fix possible mem leaks in ipv6_make_skb()
ethtool: do not print warning for applications using legacy API
mlxsw: spectrum_router: Fix NULL pointer deref
net/sched: Fix update of lastuse in act modules implementing stats_update
crypto: algapi - fix NULL dereference in crypto_remove_spawns()
rbd: set max_segments to USHRT_MAX
x86/microcode/intel: Extend BDW late-loading with a revision check
KVM: x86: Add memory barrier on vmcs field lookup
drm/vmwgfx: Potential off by one in vmw_view_add()
kaiser: Set _PAGE_NX only if supported
iscsi-target: Make TASK_REASSIGN use proper se_cmd->cmd_kref
target: Avoid early CMD_T_PRE_EXECUTE failures during ABORT_TASK
bpf: move fixup_bpf_calls() function
bpf: refactor fixup_bpf_calls()
bpf: prevent out-of-bounds speculation
bpf, array: fix overflow in max_entries and undefined behavior in index_mask
USB: serial: cp210x: add IDs for LifeScan OneTouch Verio IQ
USB: serial: cp210x: add new device ID ELV ALC 8xxx
usb: misc: usb3503: make sure reset is low for at least 100us
USB: fix usbmon BUG trigger
usbip: remove kernel addresses from usb device and urb debug msgs
usbip: fix vudc_rx: harden CMD_SUBMIT path to handle malicious input
usbip: vudc_tx: fix v_send_ret_submit() vulnerability to null xfer buffer
staging: android: ashmem: fix a race condition in ASHMEM_SET_SIZE ioctl
Bluetooth: Prevent stack info leak from the EFS element.
uas: ignore UAS for Norelsys NS1068(X) chips
e1000e: Fix e1000_check_for_copper_link_ich8lan return value.
x86/Documentation: Add PTI description
x86/cpu: Factor out application of forced CPU caps
x86/cpufeatures: Make CPU bugs sticky
x86/cpufeatures: Add X86_BUG_CPU_INSECURE
x86/pti: Rename BUG_CPU_INSECURE to BUG_CPU_MELTDOWN
x86/cpufeatures: Add X86_BUG_SPECTRE_V[12]
x86/cpu: Merge bugs.c and bugs_64.c
sysfs/cpu: Add vulnerability folder
x86/cpu: Implement CPU vulnerabilites sysfs functions
x86/cpu/AMD: Make LFENCE a serializing instruction
x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
sysfs/cpu: Fix typos in vulnerability documentation
x86/alternatives: Fix optimize_nops() checking
x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm
x86/mm/32: Move setup_clear_cpu_cap(X86_FEATURE_PCID) earlier
objtool, modules: Discard objtool annotation sections for modules
objtool: Detect jumps to retpoline thunks
objtool: Allow alternatives to be ignored
x86/asm: Use register variable to get stack pointer value
x86/retpoline: Add initial retpoline support
x86/spectre: Add boot time option to select Spectre v2 mitigation
x86/retpoline/crypto: Convert crypto assembler indirect jumps
x86/retpoline/entry: Convert entry assembler indirect jumps
x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
x86/retpoline/hyperv: Convert assembler indirect jumps
x86/retpoline/xen: Convert Xen hypercall indirect jumps
x86/retpoline/checksum32: Convert assembler indirect jumps
x86/retpoline/irq32: Convert assembler indirect jumps
x86/retpoline: Fill return stack buffer on vmexit
selftests/x86: Add test_vsyscall
x86/retpoline: Remove compile time warning
objtool: Fix retpoline support for pre-ORC objtool
x86/pti/efi: broken conversion from efi to kernel page table
Linux 4.9.77
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit bbeb6e4323 upstream.
syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
and thus unprivileged. With the recently added logic in b2157399cc
("bpf: prevent out-of-bounds speculation") we round this up to the next
power of two value for max_entries for unprivileged such that we can
apply proper masking into potentially zeroed out map slots.
However, this will generate an index_mask of 0xffffffff, and therefore
a + 1 will let this overflow into new max_entries of 0. This will pass
allocation, etc, and later on map access we still enforce on the original
attr->max_entries value which was 0xfffffffd, therefore triggering GPF
all over the place. Thus bail out on overflow in such case.
Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
space is undefined. Therefore, do this by hand in a 64 bit variable.
This fixes all the issues triggered by syzkaller's reproducers.
Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b2157399cc upstream.
Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.
To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.
Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index & index_mask' operation.
If prog_array map is created by unpriv user replace
bpf_tail_call(ctx, map, index);
with
if (index >= max_entries) {
index &= map->index_mask;
bpf_tail_call(ctx, map, index);
}
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index >= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.
Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.
That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.
v2->v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Slaby <jslaby@suse.cz>
[ Backported to 4.9 - gregkh ]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 79741b3bde upstream.
reduce indent and make it iterate over instructions similar to
convert_ctx_accesses(). Also convert hard BUG_ON into soft verifier error.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Jiri Slaby <jslaby@suse.cz>
[Backported to 4.9.y - gregkh]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.9.73
ACPI: APEI / ERST: Fix missing error handling in erst_reader()
acpi, nfit: fix health event notification
crypto: mcryptd - protect the per-CPU queue with a lock
mfd: cros ec: spi: Don't send first message too soon
mfd: twl4030-audio: Fix sibling-node lookup
mfd: twl6040: Fix child-node lookup
ALSA: rawmidi: Avoid racy info ioctl via ctl device
ALSA: usb-audio: Add native DSD support for Esoteric D-05X
ALSA: usb-audio: Fix the missing ctl name suffix at parsing SU
PCI / PM: Force devices to D0 in pci_pm_thaw_noirq()
parisc: Hide Diva-built-in serial aux and graphics card
spi: xilinx: Detect stall with Unknown commands
pinctrl: cherryview: Mask all interrupts on Intel_Strago based systems
KVM: X86: Fix load RFLAGS w/o the fixed bit
kvm: x86: fix RSM when PCID is non-zero
clk: sunxi: sun9i-mmc: Implement reset callback for reset controls
powerpc/perf: Dereference BHRB entries safely
libnvdimm, pfn: fix start_pad handling for aligned namespaces
net: mvneta: clear interface link status on port disable
net: mvneta: use proper rxq_number in loop on rx queues
net: mvneta: eliminate wrong call to handle rx descriptor error
bpf/verifier: Fix states_equal() comparison of pointer and UNKNOWN
Linux 4.9.73
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
An UNKNOWN_VALUE is not supposed to be derived from a pointer, unless
pointer leaks are allowed. Therefore, states_equal() must not treat
a state with a pointer in a register as "equal" to a state with an
UNKNOWN_VALUE in that register.
This was fixed differently upstream, but the code around here was
largely rewritten in 4.14 by commit f1174f77b5 "bpf/verifier: rework
value tracking". The bug can be detected by the bpf/verifier sub-test
"pointer/scalar confusion in state equality check (way 1)".
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Edward Cree <ecree@solarflare.com>
Cc: Jann Horn <jannh@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>