Pull USB fixes from Greg KH:
"Let's try this again... Here are some USB fixes for 5.9-rc3.
This differs from the previous pull request for this release in that
the usb gadget patch now does not break some systems, and actually
does what it was intended to do. Many thanks to Marek Szyprowski for
quickly noticing and testing the patch from Andy Shevchenko to resolve
this issue.
Additionally, some more new USB quirks have been added to get some new
devices to work properly based on user reports.
Other than that, the patches are all here, and they contain:
- usb gadget driver fixes
- xhci driver fixes
- typec fixes
- new quirks and ids
- fixes for USB patches that went into 5.9-rc1.
All of these have been tested in linux-next with no reported issues"
* tag 'usb-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (33 commits)
usb: storage: Add unusual_uas entry for Sony PSZ drives
USB: Ignore UAS for JMicron JMS567 ATA/ATAPI Bridge
usb: host: ohci-exynos: Fix error handling in exynos_ohci_probe()
USB: gadget: u_f: Unbreak offset calculation in VLAs
USB: quirks: Ignore duplicate endpoint on Sound Devices MixPre-D
usb: typec: tcpm: Fix Fix source hard reset response for TDA 2.3.1.1 and TDA 2.3.1.2 failures
USB: PHY: JZ4770: Fix static checker warning.
USB: gadget: f_ncm: add bounds checks to ncm_unwrap_ntb()
USB: gadget: u_f: add overflow checks to VLA macros
xhci: Always restore EP_SOFT_CLEAR_TOGGLE even if ep reset failed
xhci: Do warm-reset when both CAS and XDEV_RESUME are set
usb: host: xhci: fix ep context print mismatch in debugfs
usb: uas: Add quirk for PNY Pro Elite
tools: usb: move to tools buildsystem
USB: Fix device driver race
USB: Also match device drivers using the ->match vfunc
usb: host: xhci-tegra: fix tegra_xusb_get_phy()
usb: host: xhci-tegra: otg usb2/usb3 port init
usb: hcd: Fix use after free in usb_hcd_pci_remove()
usb: typec: ucsi: Hold con->lock for the entire duration of ucsi_register_port()
...
Pull EDAC fix from Borislav Petkov:
"A fix to properly clear ghes_edac driver state on driver remove so
that a subsequent load can probe the system properly (Shiju Jose)"
* tag 'edac_urgent_for_v5.9_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()
Pull dma-mapping fix from Christoph Hellwig:
"Fix a possibly uninitialized variable (Dan Carpenter)"
* tag 'dma-mapping-5.9-2' of git://git.infradead.org/users/hch/dma-mapping:
dma-pool: Fix an uninitialized variable bug in atomic_pool_expand()
Most of the CPU mask operations behave the same way, but for_each_cpu() and
its variants ignore the cpumask argument and claim that CPU0 is always in
the mask. This is historical, inconsistent and annoying behaviour.
The matrix allocator uses for_each_cpu() and can be called on UP with an
empty cpumask. The calling code does not expect that this succeeds but
until commit e027fffff7 ("x86/irq: Unbreak interrupt affinity setting")
this went unnoticed. That commit added a WARN_ON() to catch cases which
move an interrupt from one vector to another on the same CPU. The warning
triggers on UP.
Add a check for the cpumask being empty to prevent this.
Fixes: 2f75d9e1c9 ("genirq: Implement bitmap matrix allocator")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
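The empty-cpumask problem described in the matrix allocator fix above can be modelled in user space. The following is a minimal sketch, not the kernel code: toy_cpumask, toy_for_each_cpu_up() and both toy_find_best_cpu*() helpers are hypothetical names, and the mask is assumed to fit in one word. It only shows why a UP-style iterator that ignores its mask needs an explicit emptiness check in the caller.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical one-word CPU mask; the kernel uses struct cpumask. */
typedef unsigned long toy_cpumask;

/* Models the UP iterator: the mask is evaluated but ignored, CPU0 is always visited. */
#define toy_for_each_cpu_up(cpu, mask) \
	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)(mask))

static bool toy_cpumask_empty(toy_cpumask mask)
{
	return mask == 0;
}

/* Without a check: "succeeds" with CPU0 even when the mask is empty. */
unsigned int toy_find_best_cpu_unchecked(toy_cpumask mask)
{
	unsigned int cpu, best = UINT_MAX;

	toy_for_each_cpu_up(cpu, mask)
		best = cpu;
	return best;
}

/* With the check: an empty mask yields UINT_MAX, meaning "no CPU found". */
unsigned int toy_find_best_cpu(toy_cpumask mask)
{
	unsigned int cpu, best = UINT_MAX;

	if (toy_cpumask_empty(mask))
		return UINT_MAX;

	toy_for_each_cpu_up(cpu, mask)
		best = cpu;
	return best;
}
```

With an empty mask the unchecked variant reports CPU0 while the checked variant reports failure, which is the result the calling code expects.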
Commit ef91bb196b ("kernel.h: Silence sparse warning in
lower_32_bits") caused new warnings to show in the fsldma driver, but
that commit was not to blame: it only exposed some very incorrect code
that tried to take the low 32 bits of an address.
That made no sense for multiple reasons, the most notable one being that
that code was intentionally limited to only 32-bit ppc builds, so "only
low 32 bits of an address" was completely nonsensical. There were no
high bits to mask off to begin with.
But even more importantly from a correctness standpoint, turning the
address into an integer then caused the subsequent address arithmetic to
be completely wrong too, and the "+1" actually incremented the address
by one, rather than by four.
Which again was incorrect, since the code was reading two 32-bit values
and trying to make a 64-bit end result of it all. Surprisingly, the
iowrite64() did not suffer from the same odd and incorrect model.
This code has never worked, but it's questionable whether anybody cared:
of the two users that actually read the 64-bit value (by way of some C
preprocessor hackery and eventually the 'get_cdar()' inline function),
one of them explicitly ignored the value, and the other one might just
happen to work despite the incorrect value being read.
This patch at least makes it not fail the build any more, and makes the
logic superficially sane. Whether it makes any difference to the code
_working_ or not shall remain a mystery.
Compile-tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
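The address arithmetic issue in the fsldma message above can be shown with plain C. This is a sketch with hypothetical helper names, not the driver code: it contrasts "+ 1" on an address held as an integer with "+ 1" on a typed pointer, and then combines two adjacent 32-bit words into one 64-bit value. The high-word-first order is an assumption made for the illustration only.

```c
#include <stdint.h>

/* Byte distance covered by "+ 1" when the address is held as an integer. */
uintptr_t step_as_integer(const uint32_t *addr)
{
	uintptr_t a = (uintptr_t)addr;

	return (a + 1) - a;	/* advances by one byte */
}

/* Byte distance covered by "+ 1" on a pointer to 32-bit words. */
uintptr_t step_as_pointer(const uint32_t *addr)
{
	return (uintptr_t)(addr + 1) - (uintptr_t)addr;	/* advances by sizeof(uint32_t) */
}

/*
 * Build a 64-bit value from two adjacent 32-bit words.
 * Assumption: the first word holds the high half.
 */
uint64_t read64_from_two_words(const uint32_t *addr)
{
	return ((uint64_t)addr[0] << 32) | addr[1];
}
```

Keeping the address as a typed pointer lets the compiler scale the increment to the element size, so the second read lands on the next 32-bit word.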
Pull i2c fixes from Wolfram Sang:
"A core fix for ACPI matching and two driver bugfixes"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: iproc: Fix shifting 31 bits
i2c: rcar: in slave mode, clear NACK earlier
i2c: acpi: Remove dead code, i.e. i2c_acpi_match_device()
i2c: core: Don't fail PRP0001 enumeration when no ID table exist
Pull s390 fixes from Vasily Gorbik:
- Disable preemption trace in percpu macros since the lockdep code
itself uses percpu variables now and it causes recursions.
- Fix kernel space 4-level paging broken by recent vmem rework.
* tag 's390-5.9-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/vmem: fix vmem_add_range for 4-level paging
s390: don't trace preemption in percpu macros
Pull xen fixes from Juergen Gross:
"Two fixes for Xen: one needed for ongoing work to support virtio with
Xen, and one for a corner case in IRQ handling with Xen"
* tag 'for-linus-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
arm/xen: Add misuse warning to virt_to_gfn
xen/xenbus: Fix granting of vmalloc'd memory
XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.
Pull hwmon fixes from Guenter Roeck:
- Fix temperature scale in gsc-hwmon driver
- Fix divide by 0 error in nct7904 driver
- Drop non-existing attribute from pmbus/isl68137 driver
- Fix status check in applesmc driver
* tag 'hwmon-for-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (gsc-hwmon) Scale temperature to millidegrees
hwmon: (applesmc) check status earlier.
hwmon: (nct7904) Correct divide by 0
hwmon: (pmbus/isl68137) remove READ_TEMPERATURE_1 telemetry for RAA228228
Pull NVMe fixes from Sagi:
"- instance leak and io boundary fixes from Keith
- fc locking fix from Christophe
- various tcp/rdma reset during traffic fixes from Me
- pci use-after-free fix from Tong
- tcp target null deref fix from Ziye"
* 'nvme-5.9-rc' of git://git.infradead.org/nvme:
nvme-pci: cancel nvme device request before disabling
nvme: only use power of two io boundaries
nvme: fix controller instance leak
nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
nvme: Fix NULL dereference for pci nvme controllers
nvme-rdma: fix reset hang if controller died in the middle of a reset
nvme-rdma: fix timeout handler
nvme-rdma: serialize controller teardown sequences
nvme-tcp: fix reset hang if controller died in the middle of a reset
nvme-tcp: fix timeout handler
nvme-tcp: serialize controller teardown sequences
nvme: have nvme_wait_freeze_timeout return if it timed out
nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
Add NFT_SOCKET_WILDCARD to match wildcard socket listeners.
Signed-off-by: Balazs Scheidler <bazsi77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
It's possible that we have more than one packet with the same ct tuple
simultaneously, e.g. when an application emits n packets on same UDP
socket from multiple threads.
NAT rules might be applied to those packets. With the right set of rules,
n packets will be mapped to m destinations, where at least two packets end
up with the same destination.
When this happens, the existing clash resolution may merge the skb that
is processed after the first has been received with the identical tuple
already in hash table.
However, it's possible that this identical tuple is a NAT_CLASH tuple.
In that case the second skb will be sent, but no reply can be received
since the reply that is processed first removes the NAT_CLASH tuple.
Do not auto-delete; this gives a 1 second window for replies to be passed
back to the originator.
Packets that are coming later (udp stream case) will not be affected:
they match the original ct entry, not a NAT_CLASH one.
Also prevent NAT_CLASH entries from getting offloaded.
Fixes: 6a757c07e5 ("netfilter: conntrack: allow insertion of clashing entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The kernel requires a power of two for boundaries because that's the
only way it can efficiently split commands that cross them. A
controller, however, may report a non-power of two boundary.
The driver had been rounding the controller's value to one the kernel
can use, but splitting on the wrong boundary provides no benefit on the
device side, and incurs additional submission overhead from non-optimal
splits.
Don't provide any boundary hint if the controller's value can't be used
and log a warning when first scanning a disk's unreported IO boundary.
Since the chunk sector logic has grown, move it to a separate function.
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
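The power-of-two rule from the I/O boundary message above can be sketched as follows. The names toy_is_power_of_2(), toy_chunk_sectors() and toy_crosses_boundary() are hypothetical and this is not the driver's code: the sketch drops the hint (returns 0) when the reported boundary is unusable, and shows why a power-of-two boundary makes the crossing test a cheap mask comparison. The warning that the driver logs is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

static bool toy_is_power_of_2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Return the boundary to use as a chunk hint, or 0 for "no hint". */
uint32_t toy_chunk_sectors(uint32_t reported_boundary)
{
	if (!toy_is_power_of_2(reported_boundary))
		return 0;	/* unusable value: provide no hint instead of rounding */
	return reported_boundary;
}

/*
 * True if [start, start + nr_sectors) spans more than one chunk.
 * Requires a power-of-two chunk and nr_sectors > 0.
 */
bool toy_crosses_boundary(uint64_t start, uint32_t nr_sectors, uint32_t chunk)
{
	uint64_t mask = ~((uint64_t)chunk - 1);

	return (start & mask) != ((start + nr_sectors - 1) & mask);
}
```

For a boundary that is not a power of two, the same test would need a division per command, which is the inefficiency the message refers to.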
If the driver has to unbind from the controller for an early failure
before the subsystem has been set up, there won't be a subsystem holding
the controller's instance, so the controller needs to free its own
instance in this case.
Fixes: 733e4b69d5 ("nvme: Assign subsys instance from first ctrl")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent
in this function.
Use 'spin_lock_irqsave()' also here, as there is no guarantee that
interrupts are disabled at that point, according to surrounding code.
Fixes: a97ec51b37 ("nvmet_fc: Rework target side abort handling")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
PCIe controllers do not have fabric opts; verify they exist before
showing ctrl_loss_tmo or reconnect_delay attributes.
Fixes: 764075fdcb ("nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs")
Reported-by: Tobias Markus <tobias@markus-regensburg.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
If the controller becomes unresponsive in the middle of a reset, we
will hang because we are waiting for the freeze to complete, but that
cannot happen since we have commands that are inflight holding the
q_usage_counter, and we can't blindly fail requests that time out.
So give a timeout and if we cannot wait for queue freeze before
unfreezing, fail and have the error handling take care of how to
proceed (either schedule a reconnect or remove the controller).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation,
however when a request times out in a non LIVE state, we make sure to
complete it immediately as it might block controller setup or teardown
and prevent forward progress.
However tearing down the entire set of I/O and admin queues causes
freeze/unfreeze imbalance (q->mq_freeze_depth) and is really
overkill for what we actually need, which is to just fence controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.
Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling extra
queue freeze on controller namespaces, causing unfreeze to not complete
correctly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.
In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.
Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
If the controller becomes unresponsive in the middle of a reset, we will
hang because we are waiting for the freeze to complete, but that cannot
happen since we have commands that are inflight holding the
q_usage_counter, and we can't blindly fail requests that time out.
So give a timeout and if we cannot wait for queue freeze before
unfreezing, fail and have the error handling take care of how to proceed
(either schedule a reconnect or remove the controller).
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
When a request times out in a LIVE state, we simply trigger error
recovery and let the error recovery handle the request cancellation,
however when a request times out in a non LIVE state, we make sure to
complete it immediately as it might block controller setup or teardown
and prevent forward progress.
However tearing down the entire set of I/O and admin queues causes
freeze/unfreeze imbalance (q->mq_freeze_depth) and is really
overkill for what we actually need, which is to just fence controller
teardown that may be running, stop the queue, and cancel the request if
it is not already completed.
Now that we have the controller teardown_lock, we can safely serialize
request cancellation. This addresses a hang caused by calling extra
queue freeze on controller namespaces, causing unfreeze to not complete
correctly.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
In the timeout handler we may need to complete a request because the
request that timed out may be an I/O that is a part of a serial sequence
of controller teardown or initialization. In order to complete the
request, we need to fence any other context that may compete with us
and complete the request that is timing out.
In this case, we could have a potential double completion in case
a hard-irq or a different competing context triggered error recovery
and is running inflight request cancellation concurrently with the
timeout handler.
Protect using a ctrl teardown_lock to serialize contexts that may
complete a cancelled request due to error recovery or a reset.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Users can detect if the wait has completed or not and take appropriate
actions based on this information (e.g. whether to continue
initialization or rather fail and schedule another initialization
attempt).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
NVME_CTRL_NEW should never see any I/O, because in order to start
initialization it has to transition to NVME_CTRL_CONNECTING and from
there it will never return to this state.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
When handling commands without in-capsule data, we assign the ttag
assuming we already have the queue commands array allocated (based
on the queue size information in the connect data payload). However
if the connect itself did not send the connect data in-capsule we
have yet to allocate the queue commands, and we will assign a bogus
ttag and suffer a NULL dereference when we receive the corresponding
h2cdata pdu.
Fix this by checking whether the commands have already been allocated
before dereferencing them when handling h2cdata. If they have not, it is
certainly a connect and we should use the preallocated connect command.
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
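The h2cdata lookup described in the nvmet-tcp message above can be modelled like this. The structures and toy_cmd_for_ttag() are hypothetical stand-ins, not the target driver's types. The sketch assumes the commands array stays NULL until the connect data has been processed, and the bounds check on the tag is an extra precaution added for the illustration, not something the message claims.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_cmd {
	uint16_t tag;
};

struct toy_queue {
	struct toy_cmd *cmds;	/* NULL until the connect data sized the queue */
	size_t nr_cmds;
	struct toy_cmd connect;	/* preallocated connect command */
};

/* Pick the command that an incoming h2cdata PDU refers to. */
struct toy_cmd *toy_cmd_for_ttag(struct toy_queue *q, uint16_t ttag)
{
	if (!q->cmds)
		return &q->connect;	/* nothing allocated yet: must be the connect */
	if (ttag >= q->nr_cmds)
		return NULL;		/* illustrative bounds check, see lead-in */
	return &q->cmds[ttag];
}
```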
Pull block fixes from Jens Axboe:
- nbd timeout fix (Hou)
- device size fix for loop LOOP_CONFIGURE (Martijn)
- MD pull from Song with raid5 stripe size fix (Yufen)
* tag 'block-5.9-2020-08-28' of git://git.kernel.dk/linux-block:
md/raid5: make sure stripe_size as power of two
loop: Set correct device size when using LOOP_CONFIGURE
nbd: restore default timeout when setting it to zero
Pull io_uring fixes from Jens Axboe:
"A few fixes in here, all based on reports and test cases from folks
using it. Most of it is stable material as well:
- Hashed work cancelation fix (Pavel)
- poll wakeup signalfd fix
- memlock accounting fix
- nonblocking poll retry fix
- ensure we never return -ERESTARTSYS for reads
- ensure offset == -1 is consistent with preadv2() as documented
- IOPOLL -EAGAIN handling fixes
- remove useless task_work bounce for block based -EAGAIN retry"
* tag 'io_uring-5.9-2020-08-28' of git://git.kernel.dk/linux-block:
io_uring: don't bounce block based -EAGAIN retry off task_work
io_uring: fix IOPOLL -EAGAIN retries
io_uring: clear req->result on IOPOLL re-issue
io_uring: make offset == -1 consistent with preadv2/pwritev2
io_uring: ensure read requests go through -ERESTART* transformation
io_uring: don't use poll handler if file can't be nonblocking read/written
io_uring: fix imbalanced sqo_mm accounting
io_uring: revert consumed iov_iter bytes on error
io-wq: fix hang after cancelling pending hashed work
io_uring: don't recurse on tsk->sighand->siglock with signalfd
Daniel Borkmann says:
====================
pull-request: bpf 2020-08-28
The following pull-request contains BPF updates for your *net* tree.
We've added 4 non-merge commits during the last 4 day(s) which contain
a total of 4 files changed, 7 insertions(+), 4 deletions(-).
The main changes are:
1) Fix out of bounds access for BPF_OBJ_GET_INFO_BY_FD retrieval, from Yonghong Song.
2) Fix wrong __user annotation in bpf_stats sysctl handler, from Tobias Klauser.
3) Few fixes for BPF selftest scripting in test_{progs,maps}, from Jesper Dangaard Brouer.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull device properties framework fix from Rafael Wysocki:
"Prevent the promotion of the secondary firmware node of a device to
the primary one from leaking a pointer (Heikki Krogerus)"
* tag 'devprop-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
device property: Fix the secondary firmware node handling in set_primary_fwnode()
Pull ACPI fixes from Rafael Wysocki:
"These fix two recent issues in the ACPI memory mappings management
code and tighten up error handling in the ACPI driver for AMD SoCs
(APD).
Specifics:
- Avoid redundant rounding to the page size in acpi_os_map_iomem() to
address a recently introduced issue with the EFI memory map
permission check on ARM64 (Ard Biesheuvel).
- Fix acpi_release_memory() to wait until the memory mappings
released by it have been really unmapped (Rafael Wysocki).
- Make the ACPI driver for AMD SoCs (APD) check the return value of
acpi_dev_get_property() to avoid failures in the cases when the
device property under inspection is missing (Furquan Shaikh)"
* tag 'acpi-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: OSL: Prevent acpi_release_memory() from returning too early
ACPI: ioremap: avoid redundant rounding to OS page size
ACPI: SoC: APD: Check return value of acpi_dev_get_property()
Pull power management fixes from Rafael Wysocki:
"These fix the recently added Tegra194 cpufreq driver and the handling
of devices using runtime PM during system-wide suspend, improve the
intel_pstate driver documentation and clean up the cpufreq core.
Specifics:
- Make the recently added Tegra194 cpufreq driver use
read_cpuid_mpidr() instead of cpu_logical_map() to avoid exporting
logical_cpu_map (Sumit Gupta).
- Drop the automatic system wakeup event reporting for devices with
pending runtime-resume requests during system-wide suspend to avoid
spurious aborts of the suspend flow (Rafael Wysocki).
- Fix build warning in the intel_pstate driver documentation and
improve the wording in there (Randy Dunlap).
- Clean up two pieces of code in the cpufreq core (Viresh Kumar)"
* tag 'pm-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Use WARN_ON_ONCE() for invalid relation
cpufreq: No need to verify cpufreq_driver in show_scaling_cur_freq()
PM: sleep: core: Fix the handling of pending runtime resume requests
Documentation: fix pm/intel_pstate build warning and wording
cpufreq: replace cpu_logical_map() with read_cpuid_mpir()
Alexei Starovoitov says:
====================
v2->v3:
- switched to minimal allowlist approach. Essentially that means that syscall
entry, few btrfs allow_error_inject functions, should_fail_bio(), and two LSM
hooks: file_mprotect and bprm_committed_creds are the only hooks that allow
attaching of sleepable BPF programs. Once a comprehensive analysis of LSM hooks
has been done, this allowlist will be extended.
- added patch 1 that fixes prototypes of two mm functions to reliably work with
error injection. It's also necessary for resolve_btfids tool to recognize
these two funcs, but that's secondary.
v1->v2:
- split fmod_ret fix into separate patch
- added denylist
v1:
This patch set introduces the minimal viable support for sleepable bpf programs.
In this patch only fentry/fexit/fmod_ret and lsm progs can be sleepable.
Only array and pre-allocated hash and lru maps are allowed.
Here is 'perf report' difference of sleepable vs non-sleepable:
3.86% bench [k] __srcu_read_unlock
3.22% bench [k] __srcu_read_lock
0.92% bench [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep
0.50% bench [k] bpf_trampoline_10297
0.26% bench [k] __bpf_prog_exit_sleepable
0.21% bench [k] __bpf_prog_enter_sleepable
vs
0.88% bench [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry
0.84% bench [k] bpf_trampoline_10297
0.13% bench [k] __bpf_prog_enter
0.12% bench [k] __bpf_prog_exit
vs
0.79% bench [k] bpf_prog_740d4210cdcd99a3_bench_trigger_fentry_sleep
0.72% bench [k] bpf_trampoline_10381
0.31% bench [k] __bpf_prog_exit_sleepable
0.29% bench [k] __bpf_prog_enter_sleepable
Sleepable vs non-sleepable program invocation overhead is only marginally higher
due to rcu_trace. srcu approach is much slower.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pass request to load program as sleepable via ".s" suffix in the section name.
If it happens in the future that all map types and helpers are allowed with
BPF_F_SLEEPABLE flag "fmod_ret/" and "lsm/" can be aliased to "fmod_ret.s/" and
"lsm.s/" to make all lsm and fmod_ret programs sleepable by default. The fentry
and fexit programs would always need to have sleepable vs non-sleepable
distinction, since not all fentry/fexit progs will be attached to sleepable
kernel functions.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: KP Singh <kpsingh@google.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-5-alexei.starovoitov@gmail.com
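The ".s" section-name convention from the message above can be illustrated with a small string check. This is a hypothetical helper (toy_sec_is_sleepable() is not a libbpf function, and this is not claimed to be how libbpf parses section names). It assumes the program type is the part of the section name before the first '/', and treats a trailing ".s" on that part as the request for a sleepable program.

```c
#include <stdbool.h>
#include <string.h>

/* True if the prefix before the first '/' ends in ".s", e.g. "lsm.s/...". */
bool toy_sec_is_sleepable(const char *sec)
{
	const char *slash = strchr(sec, '/');
	size_t len = slash ? (size_t)(slash - sec) : strlen(sec);

	return len >= 2 && sec[len - 2] == '.' && sec[len - 1] == 's';
}
```

Under this model "lsm.s/file_mprotect" requests a sleepable program, while "lsm/file_mprotect" does not.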
Introduce sleepable BPF programs that can request such property for themselves
via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
to use helpers like bpf_copy_from_user() that might sleep. At present only
fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
when they are attached to kernel functions that are known to allow sleeping.
The non-sleepable programs are relying on implicit rcu_read_lock() and
migrate_disable() to protect life time of programs, maps that they use and
per-cpu kernel structures used to pass info between bpf programs and the
kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
should not be enclosed in migrate_disable() as well. Therefore
rcu_read_lock_trace is used to protect the life time of sleepable progs.
There are many networking and tracing program types. In many cases the
'struct bpf_prog *' pointer itself is rcu protected within some other kernel
data structure and the kernel code is using rcu_dereference() to load that
program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
Instead sleepable bpf programs are allowed with bpf trampoline only. The
program pointers are hard-coded into generated assembly of bpf trampoline and
synchronize_rcu_tasks_trace() is used to protect the life time of the program.
The same trampoline can hold both sleepable and non-sleepable progs.
When rcu_read_lock_trace is held it means that some sleepable bpf program is
running from bpf trampoline. Those programs can use bpf arrays and preallocated
hash/lru maps. These map types are waiting on programs to complete via
synchronize_rcu_tasks_trace().
Updates to trampoline now have to do synchronize_rcu_tasks_trace() and
synchronize_rcu_tasks() to wait for sleepable progs to finish and for
trampoline assembly to finish.
This is the first step of introducing sleepable progs. Eventually dynamically
allocated hash maps can be allowed and networking program types can become
sleepable too.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
'static' and 'static noinline' function attributes make no guarantees that
gcc/clang won't optimize them. The compiler may decide to inline 'static'
function and in such case ALLOW_ERROR_INJECT becomes meaningless. The compiler
could have inlined __add_to_page_cache_locked() in one callsite and didn't
inline in another. In such case injecting errors into it would cause
unpredictable behavior. It's worse with 'static noinline' which won't be
inlined, but it still can be optimized. For example, the compiler may decide to
remove one argument or constant propagate the value depending on the callsite.
To avoid such issues make sure that these functions are global noinline.
Fixes: af3b854492 ("mm/page_alloc.c: allow error injection")
Fixes: cfcbfb1382 ("mm/filemap.c: enable error injection at add_to_page_cache()")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-2-alexei.starovoitov@gmail.com
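The "global noinline" shape asked for in the message above looks roughly like this in plain C. This is a sketch for GCC or Clang with hypothetical names (toy_noinline, toy_should_fail_alloc(), toy_alloc()). The failure policy is invented for the example. The point is only that the hook has external linkage and a noinline attribute, so it keeps a single out-of-line definition with an unmodified signature.

```c
/* GCC/Clang attribute; assumed to be available in this sketch. */
#define toy_noinline __attribute__((noinline))

/* Declared with external linkage, as a header would do for a global function. */
toy_noinline int toy_should_fail_alloc(unsigned int order);

/*
 * Global and noinline: every caller goes through this one definition,
 * which is what an error-injection hook needs.
 */
toy_noinline int toy_should_fail_alloc(unsigned int order)
{
	return order > 10;	/* hypothetical policy, for illustration only */
}

/* Caller: returns 0 on success, -1 when the hook asks for a failure. */
int toy_alloc(unsigned int order)
{
	if (toy_should_fail_alloc(order))
		return -1;
	return 0;
}
```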
* pm-cpufreq:
cpufreq: Use WARN_ON_ONCE() for invalid relation
cpufreq: No need to verify cpufreq_driver in show_scaling_cur_freq()
Documentation: fix pm/intel_pstate build warning and wording
cpufreq: replace cpu_logical_map() with read_cpuid_mpir()
Pull arm64 fixes from Catalin Marinas:
- Fix kernel build with the integrated LLVM assembler which doesn't see
the -Wa,-march option.
- Fix "make vdso_install" when COMPAT_VDSO is disabled.
- Make KVM more robust if the AT S1E1R instruction triggers an
exception (architecture corner cases).
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
KVM: arm64: Set HCR_EL2.PTW to prevent AT taking synchronous exception
KVM: arm64: Survive synchronous exceptions caused by AT instructions
KVM: arm64: Add kvm_extable for vaxorcism code
arm64: vdso32: make vdso32 install conditional
arm64: use a common .arch preamble for inline assembly
I keep getting sparse warnings in crypto such as:
CHECK drivers/crypto/ccree/cc_hash.c
drivers/crypto/ccree/cc_hash.c:49:9: warning: cast truncates bits from constant value (47b5481dbefa4fa4 becomes befa4fa4)
drivers/crypto/ccree/cc_hash.c:49:26: warning: cast truncates bits from constant value (db0c2e0d64f98fa7 becomes 64f98fa7)
[.. many more ..]
This patch removes the warning by adding a mask to keep sparse
happy.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
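The masking approach described in the sparse-warning message above can be sketched with user-space macros. The toy_ names are hypothetical, but the shape (mask first, then cast) follows what the message describes. The constant in the usage note is the one quoted in the warning text.

```c
#include <stdint.h>

/* Mask explicitly before the cast, so no bits are silently truncated by it. */
#define toy_lower_32_bits(n) ((uint32_t)((n) & 0xffffffff))

/* Two 16-bit shifts stay well defined even if n is only 32 bits wide. */
#define toy_upper_32_bits(n) ((uint32_t)(((n) >> 16) >> 16))
```

For 0x47b5481dbefa4fa4 the lower half is 0xbefa4fa4 and the upper half is 0x47b5481d, the same values as with a bare cast. Only the explicit mask is new.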
Avoid bad command arguments.
Based on tools/power/cpupower/bench/cpufreq-bench_plot.sh
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix some shellcheck SC2181 warnings:
"Check exit code directly with e.g. 'if mycmd;', not indirectly with
$?." as suggested by Stefano Brivio.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
'who' variable was not used in make_file()
Problem found using Shellcheck
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Exit the script with comments when parameters are wrong during address
addition. No need for a message when trying to change MTU with lower
values: output is self-explanatory.
Use short testing sequence to avoid shellcheck warnings
(suggested by Stefano Brivio).
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nft_flowtable.sh is made for bash not sh.
Also give values which do not return "RTNETLINK answers: Invalid argument"
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Frontend callback reports EAGAIN to nfnetlink to retry a command; this
is used to signal that module autoloading is required. Unfortunately,
nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets
full, so it enters a busy-loop.
This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast()
since this is always MSG_DONTWAIT in the existing code which is exactly
what nlmsg_unicast() passes to netlink_unicast() as parameter.
Fixes: 96518518cc ("netfilter: add nftables")
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
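The error translation described in the nfnetlink message above can be modelled in user space as follows. All names here (toy_send_fn, toy_unicast() and the two stub senders) are hypothetical. The sketch only shows the idea of turning a "receive buffer full" -EAGAIN into -ENOBUFS, so that a caller which retries on -EAGAIN for other reasons does not spin.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the low-level unicast send. */
typedef int (*toy_send_fn)(void *ctx);

/* Stub: the receiver's socket buffer is full. */
int toy_send_full(void *ctx)
{
	(void)ctx;
	return -EAGAIN;
}

/* Stub: the message was delivered. */
int toy_send_ok(void *ctx)
{
	(void)ctx;
	return 0;
}

/* Wrapper: -EAGAIN from the send path is reported as -ENOBUFS. */
int toy_unicast(toy_send_fn send, void *ctx)
{
	int err = send(ctx);

	if (err == -EAGAIN)
		err = -ENOBUFS;	/* keep -EAGAIN reserved for "retry the command" */
	return err;
}
```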