We have a board where measurements indicate that the current three
options - leaving IO_IMPEDANCE_CTRL at the reset value (which is
factory calibrated to a value corresponding to approximately 50 ohms)
or using one of the two boolean properties to set it to the min/max
value - are too coarse.
There is no fixed mapping from register values to values in the range
35-70 ohms; it varies from chip to chip, and even that target range is
approximate. So add a DT binding for an nvmem cell which can be
populated during production with a value suitable for each specific
board.
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Even when the eth port is resticted to work with speeds not higher than 1G,
and so the eth driver is requesting the phy (via phylink) to advertise up
to 1000BASET support, the aquantia phy device is still advertising for 2.5G
and 5G speeds.
Clear these advertising defaults when requested.
Cc: Ondrej Spacek <ondrej.spacek@nxp.com>
Fixes: 09c4c57f7b ("net: phy: aquantia: add support for auto-negotiation configuration")
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://lore.kernel.org/r/20220610084037.7625-1-claudiu.manoil@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Olsa says:
====================
hi,
there's bug in kprobe_multi link that makes cookies misplaced when
using symbols to attach. The reason is that we sort symbols by name
but not adjacent cookie values. Current test did not find it because
bpf_fentry_test* are already sorted by name.
v3 changes:
- fixed kprobe_multi bench test to filter out invalid entries
from available_filter_functions
v2 changes:
- rebased on top of bpf/master
- checking if cookies are defined later in swap function [Andrii]
- added acks
thanks,
jirka
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With [1] the available_filter_functions file contains records
starting with __ftrace_invalid_address___ and marking disabled
entries.
We need to filter them out for the bench test to pass only
resolvable symbols to kernel.
[1] commit b39181f7c6 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function")
Fixes: b39181f7c6 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220615112118.497303-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When user specifies symbols and cookies for kprobe_multi link
interface it's very likely the cookies will be misplaced and
returned to wrong functions (via get_attach_cookie helper).
The reason is that to resolve the provided functions we sort
them before passing them to ftrace_lookup_symbols, but we do
not do the same sort on the cookie values.
Fixing this by using sort_r function with custom swap callback
that swaps cookie values as well.
Fixes: 0236fec57a ("bpf: Resolve symbols with ftrace_lookup_symbols for kprobe multi link")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220615112118.497303-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
There's a kernel bug that causes cookies to be misplaced and
the reason we did not catch this with this test is that we
provide bpf_fentry_test* functions already sorted by name.
Shuffling function bpf_fentry_test2 deeper in the list and
keeping the current cookie values as before will trigger
the bug.
The kernel fix is coming in following changes.
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20220615112118.497303-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Delyan Kratunov says:
====================
This series implements support for sleepable uprobe programs.
Key work is in patches 2 and 3, the rest is plumbing and tests.
The main observation is that the only obstacle in the way of sleepable uprobe
programs is not the uprobe infrastructure, which already runs in a user-like
context, but the rcu usage around bpf_prog_array.
Details are in patch 2 but the tl;dr is that we chain trace_tasks and normal rcu
grace periods when releasing to array to accommodate users of either rcu type.
This introduces latency for non-sleepable users (kprobe, tp) but that's deemed
acceptable, given recent benchmarks by Andrii [1]. We're a couple of orders of
magnitude under the rate of bpf_prog_array churn that would raise flags (~1MM/s per Paul).
[1]: https://lore.kernel.org/bpf/CAEf4BzbpjN6ca7D9KOTiFPOoBYkciYvTz0UJNp5c-_3ptm=Mrg@mail.gmail.com/
v3 -> v4:
* Fix kdoc and inline issues
* Rebase
v2 -> v3:
* Inline uprobe_call_bpf into trace_uprobe.c, it's just a bpf_prog_run_array_sleepable call now.
* Do not disable preemption for uprobe non-sleepable programs.
* Add acks.
v1 -> v2:
* Fix lockdep annotations in bpf_prog_run_array_sleepable
* Chain rcu grace periods only for perf_event-attached programs. This limits
the additional latency on the free path to use cases where we know it won't
be a problem.
* Add tests calling helpers only available in sleepable programs.
* Remove kprobe.s support from libbpf.
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
uprobe and kprobe programs have the same program type, KPROBE, which is
currently not allowed to load sleepable programs.
To avoid adding a new UPROBE type, instead allow sleepable KPROBE
programs to load and defer the is-it-actually-a-uprobe-program check
to attachment time, where there's already validation of the
corresponding perf_event.
A corollary of this patch is that you can now load a sleepable kprobe
program but cannot attach it.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/fcd44a7cd204f372f6bb03ef794e829adeaef299.1655248076.git.delyank@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
uprobes work by raising a trap, setting a task flag from within the
interrupt handler, and processing the actual work for the uprobe on the
way back to userspace. As a result, uprobe handlers already execute in a
might_fault/_sleep context. The primary obstacle to sleepable bpf uprobe
programs is therefore on the bpf side.
Namely, the bpf_prog_array attached to the uprobe is protected by normal
rcu. In order for uprobe bpf programs to become sleepable, it has to be
protected by the tasks_trace rcu flavor instead (and kfree() called after
a corresponding grace period).
Therefore, the free path for bpf_prog_array now chains a tasks_trace and
normal grace periods one after the other.
Users who iterate under tasks_trace read section would
be safe, as would users who iterate under normal read sections (from
non-sleepable locations).
The downside is that the tasks_trace latency affects all perf_event-attached
bpf programs (and not just uprobe ones). This is deemed safe given the
possible attach rates for kprobe/uprobe/tp programs.
Separately, non-sleepable programs need access to dynamically sized
rcu-protected maps, so bpf_run_prog_array_sleepables now conditionally takes
an rcu read section, in addition to the overarching tasks_trace section.
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/ce844d62a2fd0443b08c5ab02e95bc7149f9aeb1.1655248076.git.delyank@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In order to add a version of bpf_prog_run_array which accesses the
bpf_prog->aux member, bpf_prog needs to be more than a forward
declaration inside bpf.h.
Given that filter.h already includes bpf.h, this merely reorders
the type declarations for filter.h users. bpf.h users now have access to
bpf_prog internals.
Signed-off-by: Delyan Kratunov <delyank@fb.com>
Link: https://lore.kernel.org/r/3ed7824e3948f22d84583649ccac0ff0d38b6b58.1655248076.git.delyank@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
hugetlbfs fallocate support was originally added with commit 70c3547e36
("hugetlbfs: add hugetlbfs_fallocate()"). Initial support only operated
on whole hugetlb pages. This makes sense for populating files as other
interfaces such as mmap and truncate require hugetlb page size alignment.
Only operating on whole hugetlb pages for the hole punch case was a
simplification and there was no compelling use case to zero partial pages.
In a recent discussion[1] it was assumed that hugetlbfs hole punch would
zero partial hugetlb pages as that is in line with the man page
description saying 'partial filesystem blocks are zeroed'. However, the
hugetlbfs hole punch code actually does this:
hole_start = round_up(offset, hpage_size);
hole_end = round_down(offset + len, hpage_size);
Modify code to zero partial hugetlb pages in hole punch range. It is
possible that application code could note a change in behavior. However,
that would imply the code is passing in an unaligned range and expecting
only whole pages be removed. This is unlikely as the fallocate
documentation states the opposite.
The current hugetlbfs fallocate hole punch behavior is tested with the
libhugetlbfs test fallocate_align[2]. This test will be updated to
validate partial page zeroing.
[1] https://lore.kernel.org/linux-mm/20571829-9d3d-0b48-817c-b6b15565f651@redhat.com/
[2] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/fallocate_align.c
Link: https://lkml.kernel.org/r/YqeiMlZDKI1Kabfe@monkey
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The RNG uses vanilla spinlocks, not raw spinlocks, so kfence should pick
its random numbers before taking its raw spinlocks. This also has the
nice effect of doing less work inside the lock. It should fix a splat
that Geert saw with CONFIG_PROVE_RAW_LOCK_NESTING:
dump_backtrace.part.0+0x98/0xc0
show_stack+0x14/0x28
dump_stack_lvl+0xac/0xec
dump_stack+0x14/0x2c
__lock_acquire+0x388/0x10a0
lock_acquire+0x190/0x2c0
_raw_spin_lock_irqsave+0x6c/0x94
crng_make_state+0x148/0x1e4
_get_random_bytes.part.0+0x4c/0xe8
get_random_u32+0x4c/0x140
__kfence_alloc+0x460/0x5c4
kmem_cache_alloc_trace+0x194/0x1dc
__kthread_create_on_node+0x5c/0x1a8
kthread_create_on_node+0x58/0x7c
printk_start_kthread.part.0+0x34/0xa8
printk_activate_kthreads+0x4c/0x54
do_one_initcall+0xec/0x278
kernel_init_freeable+0x11c/0x214
kernel_init+0x24/0x124
ret_from_fork+0x10/0x20
Link: https://lkml.kernel.org/r/20220609123319.17576-1-Jason@zx2c4.com
Fixes: d4150779e6 ("random32: use real rng for non-deterministic randomness")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 059342d1dd ("mm/damon/reclaim: fix the timer always stays
active") made DAMON_RECLAIM's 'enabled' parameter store callback,
'enabled_store()', to schedule 'damon_reclaim_timer'. The scheduling uses
'system_wq', which is initialized in 'workqueue_init_early()'. As kernel
parameters parsing function ('parse_args()') is called before
'workqueue_init_early()', 'enabled_store()' can be executed before
'workqueue_init_early()' and end up accessing the uninitialized
'system_wq'. As a result, the booting hang[1]. This commit fixes the
issue by checking if the initialization is done before scheduling the
timer.
[1] https://lkml.kernel.org/20220604192222.1488-1-sj@kernel.org/
Link: https://lkml.kernel.org/r/20220604195051.1589-1-sj@kernel.org
Fixes: 059342d1dd ("mm/damon/reclaim: fix the timer always stays active")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Greg White <gwhite@kupulau.com>
Cc: Hailong Tu <tuhailong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The comments in kernel/kthread.c create a feeling that only SIGKILL is
able to terminate the creation of kernel kthreads by
kthread_create()/_on_node()/_on_cpu() APIs.
In reality, wait_for_completion_killable() might be killed by any fatal
signal that does not have a custom handler:
(!siginmask(signr, SIG_KERNEL_IGNORE_MASK|SIG_KERNEL_STOP_MASK) && \
(t)->sighand->action[(signr)-1].sa.sa_handler == SIG_DFL)
static inline void signal_wake_up(struct task_struct *t, bool resume)
{
signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
}
static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
{
[...]
/*
* Found a killable thread. If the signal will be fatal,
* then start taking the whole group down immediately.
*/
if (sig_fatal(p, sig) ...) {
if (!sig_kernel_coredump(sig)) {
[...]
do {
task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
sigaddset(&t->pending.signal, SIGKILL);
signal_wake_up(t, 1);
} while_each_thread(p, t);
return;
}
}
}
Update the comments in kernel/kthread.c to make this more obvious.
The motivation for this change was debugging why a module initialization
failed. The module was being loaded from initrd. It "magically" failed
when systemd was switching to the real root. The clean up operations sent
SIGTERM to various pending processed that were started from initrd.
Link: https://lkml.kernel.org/r/20220315102444.2380-1-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently the back pointer from a queue to the vhost adapter isn't set
until after subcrq interrupt registration. The value is available when a
queue is first allocated and can/should be also set for primary and async
queues as well as subcrqs.
This fixes a crash observed during kexec/kdump on Power 9 with legacy XICS
interrupt controller where a pending subcrq interrupt from the previous
kernel can be replayed immediately upon IRQ registration resulting in
dereference of a garbage backpointer in ibmvfc_interrupt_scsi().
Kernel attempted to read user page (58) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000058
Faulting instruction address: 0xc008000003216a08
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c008000003216a08] ibmvfc_interrupt_scsi+0x40/0xb0 [ibmvfc]
LR [c0000000082079e8] __handle_irq_event_percpu+0x98/0x270
Call Trace:
[c000000047fa3d80] [c0000000123e6180] 0xc0000000123e6180 (unreliable)
[c000000047fa3df0] [c0000000082079e8] __handle_irq_event_percpu+0x98/0x270
[c000000047fa3ea0] [c000000008207d18] handle_irq_event+0x98/0x188
[c000000047fa3ef0] [c00000000820f564] handle_fasteoi_irq+0xc4/0x310
[c000000047fa3f40] [c000000008205c60] generic_handle_irq+0x50/0x80
[c000000047fa3f60] [c000000008015c40] __do_irq+0x70/0x1a0
[c000000047fa3f90] [c000000008016d7c] __do_IRQ+0x9c/0x130
[c000000014622f60] [0000000020000000] 0x20000000
[c000000014622ff0] [c000000008016e50] do_IRQ+0x40/0xa0
[c000000014623020] [c000000008017044] replay_soft_interrupts+0x194/0x2f0
[c000000014623210] [c0000000080172a8] arch_local_irq_restore+0x108/0x170
[c000000014623240] [c000000008eb1008] _raw_spin_unlock_irqrestore+0x58/0xb0
[c000000014623270] [c00000000820b12c] __setup_irq+0x49c/0x9f0
[c000000014623310] [c00000000820b7c0] request_threaded_irq+0x140/0x230
[c000000014623380] [c008000003212a50] ibmvfc_register_scsi_channel+0x1e8/0x2f0 [ibmvfc]
[c000000014623450] [c008000003213d1c] ibmvfc_init_sub_crqs+0xc4/0x1f0 [ibmvfc]
[c0000000146234d0] [c0080000032145a8] ibmvfc_reset_crq+0x150/0x210 [ibmvfc]
[c000000014623550] [c0080000032147c8] ibmvfc_init_crq+0x160/0x280 [ibmvfc]
[c0000000146235f0] [c00800000321a9cc] ibmvfc_probe+0x2a4/0x530 [ibmvfc]
Link: https://lore.kernel.org/r/20220616191126.1281259-2-tyreld@linux.ibm.com
Fixes: 3034ebe263 ("scsi: ibmvfc: Add alloc/dealloc routines for SCSI Sub-CRQ Channels")
Cc: stable@vger.kernel.org
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Currently, the sub-queues and event pool resources are allocated/freed for
every CRQ connection event such as reset and LPM. This exposes the driver
to a couple issues. First the inefficiency of freeing and reallocating
memory that can simply be resued after being sanitized. Further, a system
under memory pressue runs the risk of allocation failures that could result
in a crippled driver. Finally, there is a race window where command
submission/compeletion can try to pull/return elements from/to an event
pool that is being deleted or already has been deleted due to the lack of
host state around freeing/allocating resources. The following is an example
of list corruption following a live partition migration (LPM):
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: vfat fat isofs cdrom ext4 mbcache jbd2 nft_counter nft_compat nf_tables nfnetlink rpadlpar_io rpaphp xsk_diag nfsv3 nfs_acl nfs lockd grace fscache netfs rfkill bonding tls sunrpc pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth vmx_crypto dm_multipath dm_mirror dm_region_hash dm_log dm_mod ipmi_devintf ipmi_msghandler fuse
CPU: 0 PID: 2108 Comm: ibmvfc_0 Kdump: loaded Not tainted 5.14.0-70.9.1.el9_0.ppc64le #1
NIP: c0000000007c4bb0 LR: c0000000007c4bac CTR: 00000000005b9a10
REGS: c00000025c10b760 TRAP: 0700 Not tainted (5.14.0-70.9.1.el9_0.ppc64le)
MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 2800028f XER: 0000000f
CFAR: c0000000001f55bc IRQMASK: 0
GPR00: c0000000007c4bac c00000025c10ba00 c000000002a47c00 000000000000004e
GPR04: c0000031e3006f88 c0000031e308bd00 c00000025c10b768 0000000000000027
GPR08: 0000000000000000 c0000031e3009dc0 00000031e0eb0000 0000000000000000
GPR12: c0000031e2ffffa8 c000000002dd0000 c000000000187108 c00000020fcee2c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c008000002f81300
GPR24: 5deadbeef0000100 5deadbeef0000122 c000000263ba6910 c00000024cc88000
GPR28: 000000000000003c c0000002430a0000 c0000002430ac300 000000000000c300
NIP [c0000000007c4bb0] __list_del_entry_valid+0x90/0x100
LR [c0000000007c4bac] __list_del_entry_valid+0x8c/0x100
Call Trace:
[c00000025c10ba00] [c0000000007c4bac] __list_del_entry_valid+0x8c/0x100 (unreliable)
[c00000025c10ba60] [c008000002f42284] ibmvfc_free_queue+0xec/0x210 [ibmvfc]
[c00000025c10bb10] [c008000002f4246c] ibmvfc_deregister_scsi_channel+0xc4/0x160 [ibmvfc]
[c00000025c10bba0] [c008000002f42580] ibmvfc_release_sub_crqs+0x78/0x130 [ibmvfc]
[c00000025c10bc20] [c008000002f4f6cc] ibmvfc_do_work+0x5c4/0xc70 [ibmvfc]
[c00000025c10bce0] [c008000002f4fdec] ibmvfc_work+0x74/0x1e8 [ibmvfc]
[c00000025c10bda0] [c0000000001872b8] kthread+0x1b8/0x1c0
[c00000025c10be10] [c00000000000cd64] ret_from_kernel_thread+0x5c/0x64
Instruction dump:
40820034 38600001 38210060 4e800020 7c0802a6 7c641b78 3c62fe7a 7d254b78
3863b590 f8010070 4ba309cd 60000000 <0fe00000> 7c0802a6 3c62fe7a 3863b640
---[ end trace 11a2b65a92f8b66c ]---
ibmvfc 30000003: Send warning. Receive queue closed, will retry.
Add registration/deregistration helpers that are called instead during
connection resets to sanitize and reconfigure the queues.
Link: https://lore.kernel.org/r/20220616191126.1281259-3-tyreld@linux.ibm.com
Fixes: 3034ebe263 ("scsi: ibmvfc: Add alloc/dealloc routines for SCSI Sub-CRQ Channels")
Cc: stable@vger.kernel.org
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Current code is based on the idea that the max number of SGL entries
also determines the max size of an I/O request. While this idea was
true in older versions of the storvsc driver when SGL entry length
was limited to 4 Kbytes, commit 3d9c3dcc58 ("scsi: storvsc: Enable
scatterlist entry lengths > 4Kbytes") removed that limitation. It's
now theoretically possible for the block layer to send requests that
exceed the maximum size supported by Hyper-V. This problem doesn't
currently happen in practice because the block layer defaults to a
512 Kbyte maximum, while Hyper-V in Azure supports 2 Mbyte I/O sizes.
But some future configuration of Hyper-V could have a smaller max I/O
size, and the block layer could exceed that max.
Fix this by correctly setting max_sectors as well as sg_tablesize to
reflect the maximum I/O size that Hyper-V reports. While allowing
I/O sizes larger than the block layer default of 512 Kbytes doesn’t
provide any noticeable performance benefit in the tests we ran, it's
still appropriate to report the correct underlying Hyper-V capabilities
to the Linux block layer.
Also tweak the virt_boundary_mask to reflect that the required
alignment derives from Hyper-V communication using a 4 Kbyte page size,
and not on the guest page size, which might be bigger (eg. ARM64).
Link: https://lore.kernel.org/r/1655190355-28722-1-git-send-email-ssengar@linux.microsoft.com
Fixes: 3d9c3dcc58 ("scsi: storvsc: Enable scatter list entry lengths > 4Kbytes")
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Prevent that both the interrupt handler and the reset handler try to
complete a request at the same time. This patch is the result of an
analysis of the following crash:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000120
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 5.10.107-android13-4-00051-g1e48e8970cca-ab8664745 #1
pc : ufshcd_release_scsi_cmd+0x30/0x46c
lr : __ufshcd_transfer_req_compl+0x4fc/0x9c0
Call trace:
ufshcd_release_scsi_cmd+0x30/0x46c
__ufshcd_transfer_req_compl+0x4fc/0x9c0
ufshcd_poll+0xf0/0x208
ufshcd_sl_intr+0xb8/0xf0
ufshcd_intr+0x168/0x2f4
__handle_irq_event_percpu+0xa0/0x30c
handle_irq_event+0x84/0x178
handle_fasteoi_irq+0x150/0x2e8
__handle_domain_irq+0x114/0x1e4
gic_handle_irq.31846+0x58/0x300
el1_irq+0xe4/0x1c0
cpuidle_enter_state+0x3ac/0x8c4
do_idle+0x2fc/0x55c
cpu_startup_entry+0x84/0x90
kernel_init+0x0/0x310
start_kernel+0x0/0x608
start_kernel+0x4ec/0x608
Link: https://lore.kernel.org/r/20220613214442.212466-4-bvanassche@acm.org
Reviewed-by: Stanley Chu <stanley.chu@mediatek.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The code in dm-log rounds up bitset_size to 32 bits. It then uses
find_next_zero_bit_le on the allocated region. find_next_zero_bit_le
accesses the bitmap using unsigned long pointers. So, on 64-bit
architectures, it may access 4 bytes beyond the allocated size.
Fix this bug by rounding up bitset_size to BITS_PER_LONG.
This bug was found by running the lvm2 testsuite with kasan.
Fixes: 29121bd0b0 ("[PATCH] dm mirror log: bitset_size fix")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Starting with the commit 63a225c9fd20, device mapper has an optimization
that it will take cheaper table lock (dm_get_live_table_fast instead of
dm_get_live_table) if the bio has REQ_NOWAIT. The bios with REQ_NOWAIT
must not block in the target request routine, if they did, we would be
blocking while holding rcu_read_lock, which is prohibited.
The targets that are suitable for REQ_NOWAIT optimization (and that don't
block in the map routine) have the flag DM_TARGET_NOWAIT set. Device
mapper will test if all the targets and all the devices in a table
support nowait (see the function dm_table_supports_nowait) and it will set
or clear the QUEUE_FLAG_NOWAIT flag on its request queue according to
this check.
There's a test in submit_bio_noacct: "if ((bio->bi_opf & REQ_NOWAIT) &&
!blk_queue_nowait(q)) goto not_supported" - this will make sure that
REQ_NOWAIT bios can't enter a request queue that doesn't support them.
This mechanism works to prevent REQ_NOWAIT bios from reaching dm targets
that don't support the REQ_NOWAIT flag (and that may block in the map
routine) - except that there is a small race condition:
submit_bio_noacct checks if the queue has the QUEUE_FLAG_NOWAIT without
holding any locks. Immediatelly after this check, the device mapper table
may be reloaded with a table that doesn't support REQ_NOWAIT (for example,
if we start moving the logical volume or if we activate a snapshot).
However the REQ_NOWAIT bio that already passed the check in
submit_bio_noacct would be sent to device mapper, where it could be
redirected to a dm target that doesn't support REQ_NOWAIT - the result is
sleeping while we hold rcu_read_lock.
In order to fix this race, we double-check if the target supports
REQ_NOWAIT while we hold the table lock (so that the table can't change
under us).
Fixes: 563a225c9f ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
dm_put_live_table_bio is called from the end of dm_submit_bio.
However, at this point, the bio may be already finished and the caller
may have freed the bio. Consequently, dm_put_live_table_bio accesses
the stale "bio" pointer.
Fix this bug by loading the bi_opf value and passing it to
dm_get_live_table_bio and dm_put_live_table_bio instead of the bio.
This bug was found by running the lvm2 testsuite with kasan.
Fixes: 563a225c9f ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Perform the same virtual address to file offset translation that libbpf
is doing for executable ELF binaries also for shared libraries.
Currently libbpf is making a simplifying and sometimes wrong assumption
that for shared libraries relative virtual addresses inside ELF are
always equal to file offsets.
Unfortunately, this is not always the case with LLVM's lld linker, which
now by default generates quite more complicated ELF segments layout.
E.g., for liburandom_read.so from selftests/bpf, here's an excerpt from
readelf output listing ELF segments (a.k.a. program headers):
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R 0x8
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x0005e4 0x0005e4 R 0x1000
LOAD 0x0005f0 0x00000000000015f0 0x00000000000015f0 0x000160 0x000160 R E 0x1000
LOAD 0x000750 0x0000000000002750 0x0000000000002750 0x000210 0x000210 RW 0x1000
LOAD 0x000960 0x0000000000003960 0x0000000000003960 0x000028 0x000029 RW 0x1000
Compare that to what is generated by GNU ld (or LLVM lld's with extra
-znoseparate-code argument which disables this cleverness in the name of
file size reduction):
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x000550 0x000550 R 0x1000
LOAD 0x001000 0x0000000000001000 0x0000000000001000 0x000131 0x000131 R E 0x1000
LOAD 0x002000 0x0000000000002000 0x0000000000002000 0x0000ac 0x0000ac R 0x1000
LOAD 0x002dc0 0x0000000000003dc0 0x0000000000003dc0 0x000262 0x000268 RW 0x1000
You can see from the first example above that for executable (Flg == "R E")
PT_LOAD segment (LOAD #2), Offset doesn't match VirtAddr columns.
And it does in the second case (GNU ld output).
This is important because all the addresses, including USDT specs,
operate in a virtual address space, while kernel is expecting file
offsets when performing uprobe attach. So such mismatches have to be
properly taken care of and compensated by libbpf, which is what this
patch is fixing.
Also patch clarifies few function and variable names, as well as updates
comments to reflect this important distinction (virtaddr vs file offset)
and to ephasize that shared libraries are not all that different from
executables in this regard.
This patch also changes selftests/bpf Makefile to force urand_read and
liburand_read.so to be built with Clang and LLVM's lld (and explicitly
request this ELF file size optimization through -znoseparate-code linker
parameter) to validate libbpf logic and ensure regressions don't happen
in the future. I've bundled these selftests changes together with libbpf
changes to keep the above description tied with both libbpf and
selftests changes.
Fixes: 74cc6311ce ("libbpf: Add USDT notes parsing and resolution logic")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220616055543.3285835-1-andrii@kernel.org
In order to debug problems with file size being reported incorrectly
temporarily (in this case xfstest generic/584 intermittent failure)
we need to add trace point for the non-compounded code path where
we set the file size (SMB2_set_eof). The new trace point is:
"smb3_set_eof"
Here is sample output from the tracepoint:
TASK-PID CPU# ||||| TIMESTAMP FUNCTION
| | | ||||| | |
xfs_io-75403 [002] ..... 95219.189835: smb3_set_eof: xid=221 sid=0xeef1cbd2 tid=0x27079ee6 fid=0x52edb58c offset=0x100000
aio-dio-append--75418 [010] ..... 95219.242402: smb3_set_eof: xid=226 sid=0xeef1cbd2 tid=0x27079ee6 fid=0xae89852d offset=0x0
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Commit 17de1e559c ("selftests: clarify common error when running
gup_test") had most of its hunks dropped due to a conflict with another
patch accepted into Linux around the same time that implemented the same
behavior as a subset of other changes.
However, the remaining hunk defines the GUP_TEST_FILE macro without
making use of it. This patch makes use of the macro in the two relevant
places.
Furthermore, the above mentioned commit's log message erroneously describes
the changes that were dropped from the patch.
This patch corrects the record.
Fixes: 17de1e559c ("selftests: clarify common error when running gup_test")
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Nico Pache <npache@redhat.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Pull audit fix from Paul Moore:
"A single audit patch to fix a problem where we were not properly
freeing memory allocated when recording information related to a
module load"
* tag 'audit-pr-20220616' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: free module name
Pull selinux fix from Paul Moore:
"A single SELinux patch to fix memory leaks when mounting filesystems
with SELinux mount options"
* tag 'selinux-pr-20220616' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: free contexts previously transferred in selinux_add_opt()
Some additionals comments and notes from autobuilders received after the
series got applied, warranted some changes.
* commit '924cbb8cbe3460ea192e6243017ceb0ceb255b1b':
riscv: Improve description for RISCV_ISA_SVPBMT Kconfig symbol
riscv: drop cpufeature_apply_feature tracking variable
riscv: fix dependency for t-head errata