mirror of
https://github.com/hardkernel/linux.git
synced 2026-06-11 13:27:06 +09:00
9704669c386f9bbfef2e002e7e690c56b7dcf5de
43745 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9704669c38 |
tracing/probes: Fix to search structure fields correctly
Fix to search a field from the structure which has anonymous union
correctly.
Since the reference `type` pointer was updated in the loop, the search
loop suddenly aborted where it hits an anonymous union. Thus it can not
find the field after the anonymous union. This avoids updating the
cursor `type` pointer in the loop.
Link: https://lore.kernel.org/all/170791694361.389532.10047514554799419688.stgit@devnote2/
Fixes:
|
||
|
|
2766f59ca4 |
Merge tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov: - Make sure a warning is issued when a hrtimer gets queued after the timers have been migrated on the CPU down path and thus said timer will get ignored * tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Report offline hrtimer enqueue |
||
|
|
7521f258ea |
Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton: "21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7 issues or aren't considered to be needed in earlier kernel versions" * tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) nilfs2: fix potential bug in end_buffer_async_write mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup nilfs2: fix hang in nilfs_lookup_dirty_data_buffers() MAINTAINERS: Leo Yan has moved mm/zswap: don't return LRU_SKIP if we have dropped lru lock fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super mailmap: switch email address for John Moon mm: zswap: fix objcg use-after-free in entry destruction mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range() arch/arm/mm: fix major fault accounting when retrying under per-VMA lock selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page mm: memcg: optimize parent iteration in memcg_rstat_updated() nilfs2: fix data corruption in dsync block recovery for small block sizes mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get() exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock) fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand() getrusage: use sig->stats_lock rather than lock_task_sighand() getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand() ... |
||
|
|
ca8a66738a |
Merge tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt: - Fix broken direct trampolines being called when another callback is attached the same function. ARM 64 does not support FTRACE_WITH_REGS, and when it added direct trampoline calls from ftrace, it removed the "WITH_REGS" flag from the ftrace_ops for direct trampolines. This broke x86 as x86 requires direct trampolines to have WITH_REGS. This wasn't noticed because direct trampolines work as long as the function it is attached to is not shared with other callbacks (like the function tracer). When there are other callbacks, a helper trampoline is called, to call all the non direct callbacks and when it returns, the direct trampoline is called. For x86, the direct trampoline sets a flag in the regs field to tell the x86 specific code to call the direct trampoline. But this only works if the ftrace_ops had WITH_REGS set. ARM does things differently that does not require this. For now, set WITH_REGS if the arch supports WITH_REGS (which ARM does not), and this makes it work for both ARM64 and x86. - Fix wasted memory in the saved_cmdlines logic. The saved_cmdlines is a cache that maps PIDs to COMMs that tracing can use. Most trace events only save the PID in the event. The saved_cmdlines file lists PIDs to COMMs so that the tracing tools can show an actual name and not just a PID for each event. There's an array of PIDs that map to a small set of saved COMM strings. The array is set to PID_MAX_DEFAULT which is usually set to 32768. When a PID comes in, it will add itself to this array along with the index into the COMM array (note if the system allows more than PID_MAX_DEFAULT, this cache is similar to cache lines as an update of a PID that has the same PID_MAX_DEFAULT bits set will flush out another task with the same matching bits set). A while ago, the size of this cache was changed to be dynamic and the array was moved into a structure and created with kmalloc(). But this new structure had the size of 131104 bytes, or 0x20020 in hex. As kmalloc allocates in powers of two, it was actually allocating 0x40000 bytes (262144) leaving 131040 bytes of wasted memory. The last element of this structure was a pointer to the COMM string array which defaulted to just saving 128 COMMs. By changing the last field of this structure to a variable length string, and just having it round up to fill the allocated memory, the default size of the saved COMM cache is now 8190. This not only uses the wasted space, but actually saves space by removing the extra allocation for the COMM names. * tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix wasted memory in saved_cmdlines logic ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default |
||
|
|
44dc5c41b5 |
tracing: Fix wasted memory in saved_cmdlines logic
While looking at improving the saved_cmdlines cache I found a huge amount
of wasted memory that should be used for the cmdlines.
The tracing data saves pids during the trace. At sched switch, if a trace
occurred, it will save the comm of the task that did the trace. This is
saved in a "cache" that maps pids to comms and exposed to user space via
the /sys/kernel/tracing/saved_cmdlines file. Currently it only caches by
default 128 comms.
The structure that uses this creates an array to store the pids using
PID_MAX_DEFAULT (which is usually set to 32768). This causes the structure
to be of the size of 131104 bytes on 64 bit machines.
In hex: 131104 = 0x20020, and since the kernel allocates generic memory in
powers of two, the kernel would allocate 0x40000 or 262144 bytes to store
this structure. That leaves 131040 bytes of wasted space.
Worse, the structure points to an allocated array to store the comm names,
which is 16 bytes times the amount of names to save (currently 128), which
is 2048 bytes. Instead of allocating a separate array, make the structure
end with a variable length string and use the extra space for that.
This is similar to a recommendation that Linus had made about eventfs_inode names:
https://lore.kernel.org/all/20240130190355.11486-5-torvalds@linux-foundation.org/
Instead of allocating a separate string array to hold the saved comms,
have the structure end with: char saved_cmdlines[]; and round up to the
next power of two over sizeof(struct saved_cmdline_buffers) + num_cmdlines * TASK_COMM_LEN
It will use this extra space for the saved_cmdline portion.
Now, instead of saving only 128 comms by default, by using this wasted
space at the end of the structure it can save over 8000 comms and even
saves space by removing the need for allocating the other array.
Link: https://lore.kernel.org/linux-trace-kernel/20240209063622.1f7b6d5f@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Mete Durlu <meted@linux.ibm.com>
Fixes:
|
||
|
|
a8b9cf62ad |
ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default
The commit |
||
|
|
9efd24ec55 |
kprobes: Remove unnecessary initial values of variables
ri and sym is assigned first, so it does not need to initialize the assignment. Link: https://lore.kernel.org/all/20230919012823.7815-1-zeming@nfschina.com/ Signed-off-by: Li zeming <zeming@nfschina.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> |
||
|
|
9a571c1e27 |
tracing/probes: Fix to set arg size and fmt after setting type from BTF
Since the BTF type setting updates probe_arg::type, the type size
calculation and setting print-fmt should be done after that.
Without this fix, the argument size and print-fmt can be wrong.
Link: https://lore.kernel.org/all/170602218196.215583.6417859469540955777.stgit@devnote2/
Fixes:
|
||
|
|
8c427cc2fa |
tracing/probes: Fix to show a parse error for bad type for $comm
Fix to show a parse error for bad type (non-string) for $comm/$COMM and
immediate-string. With this fix, error_log file shows appropriate error
message as below.
/sys/kernel/tracing # echo 'p vfs_read $comm:u32' >> kprobe_events
sh: write error: Invalid argument
/sys/kernel/tracing # echo 'p vfs_read \"hoge":u32' >> kprobe_events
sh: write error: Invalid argument
/sys/kernel/tracing # cat error_log
[ 30.144183] trace_kprobe: error: $comm and immediate-string only accepts string type
Command: p vfs_read $comm:u32
^
[ 62.618500] trace_kprobe: error: $comm and immediate-string only accepts string type
Command: p vfs_read \"hoge":u32
^
Link: https://lore.kernel.org/all/170602215411.215583.2238016352271091852.stgit@devnote2/
Fixes:
|
||
|
|
c1be35a16b |
exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock)
After the recent changes nobody use siglock to read the values protected by stats_lock, we can kill spin_lock_irq(¤t->sighand->siglock) and update the comment. With this patch only __exit_signal() and thread_group_start_cputime() take stats_lock under siglock. Link: https://lkml.kernel.org/r/20240123153359.GA21866@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f7ec1cd5cc |
getrusage: use sig->stats_lock rather than lock_task_sighand()
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call getrusage() at the same time and the process has NR_THREADS, spin_lock_irq will spin with irqs disabled O(NR_CPUS * NR_THREADS) time. Change getrusage() to use sig->stats_lock, it was specifically designed for this type of use. This way it runs lockless in the likely case. TODO: - Change do_task_stat() to use sig->stats_lock too, then we can remove spin_lock_irq(siglock) in wait_task_zombie(). - Turn sig->stats_lock into seqcount_rwlock_t, this way the readers in the slow mode won't exclude each other. See https://lore.kernel.org/all/20230913154907.GA26210@redhat.com/ - stats_lock has to disable irqs because ->siglock can be taken in irq context, it would be very nice to change __exit_signal() to avoid the siglock->stats_lock dependency. Link: https://lkml.kernel.org/r/20240122155053.GA26214@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dylan Hatch <dylanbhatch@google.com> Tested-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
daa694e413 |
getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand()
Patch series "getrusage: use sig->stats_lock", v2. This patch (of 2): thread_group_cputime() does its own locking, we can safely shift thread_group_cputime_adjusted() which does another for_each_thread loop outside of ->siglock protected section. This is also preparation for the next patch which changes getrusage() to use stats_lock instead of siglock, thread_group_cputime() takes the same lock. With the current implementation recursive read_seqbegin_or_lock() is fine, thread_group_cputime() can't enter the slow mode if the caller holds stats_lock, yet this looks more safe and better performance-wise. Link: https://lkml.kernel.org/r/20240122155023.GA26169@redhat.com Link: https://lkml.kernel.org/r/20240122155050.GA26205@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Dylan Hatch <dylanbhatch@google.com> Tested-by: Dylan Hatch <dylanbhatch@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dad6a09f31 |
hrtimer: Report offline hrtimer enqueue
The hrtimers migration on CPU-down hotplug process has been moved
earlier, before the CPU actually goes to die. This leaves a small window
of opportunity to queue an hrtimer in a blind spot, leaving it ignored.
For example a practical case has been reported with RCU waking up a
SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that
way a sched/rt timer to the local offline CPU.
Make sure such situations never go unnoticed and warn when that happens.
Fixes:
|
||
|
|
56897d5188 |
Merge tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing and eventfs fixes from Steven Rostedt: - Fix the return code for ring_buffer_poll_wait() It was returing a -EINVAL instead of EPOLLERR. - Zero out the tracefs_inode so that all fields are initialized. The ti->private could have had stale data, but instead of just initializing it to NULL, clear out the entire structure when it is allocated. - Fix a crash in timerlat The hrtimer was initialized at read and not open, but is canceled at close. If the file was opened and never read the close will pass a NULL pointer to hrtime_cancel(). - Rewrite of eventfs. Linus wrote a patch series to remove the dentry references in the eventfs_inode and to use ref counting and more of proper VFS interfaces to make it work. - Add warning to put_ei() if ei is not set to free. That means something is about to free it when it shouldn't. - Restructure the eventfs_inode to make it more compact, and remove the unused llist field. - Remove the fsnotify*() funtions for when the inodes were being created in the lookup code. It doesn't make sense to notify about creation just because something is being looked up. - The inode hard link count was not accurate. It was being updated when a file was looked up. The inodes of directories were updating their parent inode hard link count every time the inode was created. That means if memory reclaim cleaned a stale directory inode and the inode was lookup up again, it would increment the parent inode again as well. Al Viro said to just have all eventfs directories have a hard link count of 1. That tells user space not to trust it. * tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Keep all directory links at 1 eventfs: Remove fsnotify*() functions from lookup() eventfs: Restructure eventfs_inode structure to be more condensed eventfs: Warn if an eventfs_inode is freed without is_freed being set tracing/timerlat: Move hrtimer_init to timerlat_fd open() eventfs: Get rid of dentry pointers without refcounts eventfs: Clean up dentry ops and add revalidate function eventfs: Remove unused d_parent pointer field tracefs: dentry lookup crapectomy tracefs: Avoid using the ei->dentry pointer unnecessarily eventfs: Initialize the tracefs inode properly tracefs: Zero out the tracefs_inode when allocating it ring-buffer: Clean ring_buffer_poll_wait() error return |
||
|
|
1389358bb0 |
tracing/timerlat: Move hrtimer_init to timerlat_fd open()
Currently, the timerlat's hrtimer is initialized at the first read of
timerlat_fd, and destroyed at close(). It works, but it causes an error
if the user program open() and close() the file without reading.
Here's an example:
# echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
# echo timerlat > /sys/kernel/debug/tracing/current_tracer
# cat <<EOF > ./timerlat_load.py
# !/usr/bin/env python3
timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
timerlat_fd.close();
EOF
# ./taskset -c 0 ./timerlat_load.py
<BOOM>
BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 2673 Comm: python3 Not tainted 6.6.13-200.fc39.x86_64 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:hrtimer_active+0xd/0x50
Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
RSP: 0018:ffffb031009b7e10 EFLAGS: 00010286
RAX: 000000000002db00 RBX: ffff9118f786db08 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9117a0e64400 RDI: ffff9118f786db08
RBP: ffff9118f786db80 R08: ffff9117a0ddd420 R09: ffff9117804d4f70
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9118f786db08
R13: ffff91178fdd5e20 R14: ffff9117840978c0 R15: 0000000000000000
FS: 00007f2ffbab1740(0000) GS:ffff9118f7840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 00000001b402e000 CR4: 0000000000750ee0
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x23/0x70
? page_fault_oops+0x171/0x4e0
? srso_alias_return_thunk+0x5/0x7f
? avc_has_extended_perms+0x237/0x520
? exc_page_fault+0x7f/0x180
? asm_exc_page_fault+0x26/0x30
? hrtimer_active+0xd/0x50
hrtimer_cancel+0x15/0x40
timerlat_fd_release+0x48/0xe0
__fput+0xf5/0x290
__x64_sys_close+0x3d/0x80
do_syscall_64+0x60/0x90
? srso_alias_return_thunk+0x5/0x7f
? __x64_sys_ioctl+0x72/0xd0
? srso_alias_return_thunk+0x5/0x7f
? syscall_exit_to_user_mode+0x2b/0x40
? srso_alias_return_thunk+0x5/0x7f
? do_syscall_64+0x6c/0x90
? srso_alias_return_thunk+0x5/0x7f
? exit_to_user_mode_prepare+0x142/0x1f0
? srso_alias_return_thunk+0x5/0x7f
? syscall_exit_to_user_mode+0x2b/0x40
? srso_alias_return_thunk+0x5/0x7f
? do_syscall_64+0x6c/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7f2ffb321594
Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
RSP: 002b:00007ffe8d8eef18 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
RAX: ffffffffffffffda RBX: 00007f2ffba4e668 RCX: 00007f2ffb321594
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007ffe8d8eef40 R08: 0000000000000000 R09: 0000000000000000
R10: 55c926e3167eae79 R11: 0000000000000202 R12: 0000000000000003
R13: 00007ffe8d8ef030 R14: 0000000000000000 R15: 00007f2ffba4e668
</TASK>
CR2: 0000000000000010
---[ end trace 0000000000000000 ]---
Move hrtimer_init to timerlat_fd open() to avoid this problem.
Link: https://lore.kernel.org/linux-trace-kernel/7324dd3fc0035658c99b825204a66049389c56e3.1706798888.git.bristot@kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
66bbea9ed6 |
ring-buffer: Clean ring_buffer_poll_wait() error return
The return type for ring_buffer_poll_wait() is __poll_t. This is behind
the scenes an unsigned where we can set event bits. In case of a
non-allocated CPU, we do return instead -EINVAL (0xffffffea). Lucky us,
this ends up setting few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so
user-space at least is aware something went wrong.
Nonetheless, this is an incorrect code. Replace that -EINVAL with a
proper EPOLLERR to clean that output. As this doesn't change the
behaviour, there's no need to treat this change as a bug fix.
Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
9b7bd05beb |
Merge tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two small fixes for tracefs and eventfs:
- Fix register_snapshot_trigger() on allocation error
If the snapshot fails to allocate, the register_snapshot_trigger()
can still return success. If the call to
tracing_alloc_snapshot_instance() returned anything but 0, it
returned 0, but it should have been returning the error code from
that allocation function.
- Remove leftover code from tracefs doing a dentry walk on remount.
The update_gid() function was called by the tracefs code on remount
to update the gid of eventfs, but that is no longer the case, but
that code wasn't deleted. Nothing calls it. Remove it"
* tag 'trace-v6.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracefs: remove stale 'update_gid' code
tracing/trigger: Fix to return error if failed to alloc snapshot
|
||
|
|
6f3d7d5ced |
Merge tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton: "22 hotfixes. 11 are cc:stable and the remainder address post-6.7 issues or aren't considered appropriate for backporting" * tag 'mm-hotfixes-stable-2024-01-28-23-21' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits) mm: thp_get_unmapped_area must honour topdown preference mm: huge_memory: don't force huge page alignment on 32 bit userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb selftests/mm: ksm_tests should only MADV_HUGEPAGE valid memory scs: add CONFIG_MMU dependency for vfree_atomic() mm/memory: fix folio_set_dirty() vs. folio_mark_dirty() in zap_pte_range() mm/huge_memory: fix folio_set_dirty() vs. folio_mark_dirty() selftests/mm: Update va_high_addr_switch.sh to check CPU for la57 flag selftests: mm: fix map_hugetlb failure on 64K page size systems MAINTAINERS: supplement of zswap maintainers update stackdepot: make fast paths lock-less again stackdepot: add stats counters exported via debugfs mm, kmsan: fix infinite recursion due to RCU critical section mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again selftests/mm: switch to bash from sh MAINTAINERS: add man-pages git trees mm: memcontrol: don't throttle dying tasks on memory.high mm: mmap: map MAP_STACK to VM_NOHUGEPAGE uprobes: use pagesize-aligned virtual address when replacing pages selftests/mm: mremap_test: fix build warning ... |
||
|
|
648f575d5e |
Merge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov: - Prevent an inconsistent futex operation leading to stale state exposure * tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Prevent the reuse of stale pi_state |
||
|
|
0e4363ac1a |
Merge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov: - Initialize the resend node of each IRQ descriptor, not only the first one * tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Initialize resend_node hlist for all interrupt descriptors |
||
|
|
90db544eba |
Merge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Borislav Petkov: - Preserve the number of idle calls and sleep entries across CPU hotplug events in order to be able to compute correct averages - Limit the duration of the clocksource watchdog checking interval as too long intervals lead to wrongly marking the TSC as unstable * tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick/sched: Preserve number of idle sleeps across CPU hotplug events clocksource: Skip watchdog check for large watchdog intervals |
||
|
|
0958b33ef5 |
tracing/trigger: Fix to return error if failed to alloc snapshot
Fix register_snapshot_trigger() to return error code if it failed to
allocate a snapshot instead of 0 (success). Unless that, it will register
snapshot trigger without an error.
Link: https://lore.kernel.org/linux-trace-kernel/170622977792.270660.2789298642759362200.stgit@devnote2
Fixes:
|
||
|
|
4dca82d141 |
uprobes: use pagesize-aligned virtual address when replacing pages
uprobes passes an unaligned page mapping address to folio_add_new_anon_rmap(), which ends up triggering a VM_BUG_ON() we recently extended in commit |
||
|
|
3cb9871f81 |
Merge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux
Pull RCU fix from Neeraj Upadhyay:
"This fixes RCU grace period stalls, which are observed when an
outgoing CPU's quiescent state reporting results in wakeup of one of
the grace period kthreads, to complete the grace period.
If those kthreads have SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU.
Earlier migration of the hrtimers from the CPU introduced in commit
|
||
|
|
9a574ea906 |
tick/sched: Preserve number of idle sleeps across CPU hotplug events
Commit |
||
|
|
6446495535 |
clocksource: Skip watchdog check for large watchdog intervals
There have been reports of the watchdog marking clocksources unstable on machines with 8 NUMA nodes: clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large: clocksource: 'hpet' wd_nsec: 14523447520 clocksource: 'tsc' cs_nsec: 14524115132 The measured clocksource skew - the absolute difference between cs_nsec and wd_nsec - was 668 microseconds: cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612 The kernel used 200 microseconds for the uncertainty_margin of both the clocksource and watchdog, resulting in a threshold of 400 microseconds (the md variable). Both the cs_nsec and the wd_nsec value indicate that the readout interval was circa 14.5 seconds. The observed behaviour is that watchdog checks failed for large readout intervals on 8 NUMA node machines. This indicates that the size of the skew was directly proportinal to the length of the readout interval on those machines. The measured clocksource skew, 668 microseconds, was evaluated against a threshold (the md variable) that is suited for readout intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second. The intention of |
||
|
|
cf10015a24 |
Merge tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve fixes from Kees Cook: - Fix error handling in begin_new_exec() (Bernd Edlinger) - MAINTAINERS: specifically mention ELF (Alexey Dobriyan) - Various cleanups related to earlier open() (Askar Safin, Kees Cook) * tag 'execve-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Distinguish in_execve from in_exec exec: Fix error handling in begin_new_exec() exec: Add do_close_execat() helper exec: remove useless comment ELF, MAINTAINERS: specifically mention ELF |
||
|
|
90383cc078 |
exec: Distinguish in_execve from in_exec
Just to help distinguish the fs->in_exec flag from the current->in_execve flag, add comments in check_unsafe_exec() and copy_fs() for more context. Also note that in_execve is only used by TOMOYO now. Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-fsdevel@vger.kernel.org Cc: linux-mm@kvack.org Signed-off-by: Kees Cook <keescook@chromium.org> |
||
|
|
e787644caf |
rcu: Defer RCU kthreads wakeup when CPU is dying
When the CPU goes idle for the last time during the CPU down hotplug
process, RCU reports a final quiescent state for the current CPU. If
this quiescent state propagates up to the top, some tasks may then be
woken up to complete the grace period: the main grace period kthread
and/or the expedited main workqueue (or kworker).
If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
arm the RT bandwith timer to the local offline CPU. Since this happens
after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
timer gets ignored. Therefore if the RCU kthreads are waiting for RT
bandwidth to be available, they may never be actually scheduled.
This triggers TREE03 rcutorture hangs:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
rcu: (t=21035 jiffies g=938281 q=40787 ncpus=6)
rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x2eb/0xa80
schedule+0x1f/0x90
schedule_timeout+0x163/0x270
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x37c/0x5b0
? __pfx_rcu_gp_kthread+0x10/0x10
rcu_gp_kthread+0x17c/0x200
kthread+0xde/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
The situation can't be solved with just unpinning the timer. The hrtimer
infrastructure and the nohz heuristics involved in finding the best
remote target for an unpinned timer would then also need to handle
enqueues from an offline CPU in the most horrendous way.
So fix this on the RCU side instead and defer the wake up to an online
CPU if it's too late for the local one.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Fixes:
|
||
|
|
b184c8c288 |
genirq: Initialize resend_node hlist for all interrupt descriptors
For a CONFIG_SPARSE_IRQ=n kernel, early_irq_init() is supposed to
initialize all interrupt descriptors.
It does except for irq_desc::resend_node, which ia only initialized for the
first descriptor.
Use the indexed decriptor and not the base pointer to address that.
Fixes:
|
||
|
|
2b44760609 |
tracing: Ensure visibility when inserting an element into tracing_map
Running the following two commands in parallel on a multi-processor
AArch64 machine can sporadically produce an unexpected warning about
duplicate histogram entries:
$ while true; do
echo hist:key=id.syscall:val=hitcount > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
sleep 0.001
done
$ stress-ng --sysbadaddr $(nproc)
The warning looks as follows:
[ 2911.172474] ------------[ cut here ]------------
[ 2911.173111] Duplicates detected: 1
[ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
[ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
[ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
[ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G E 6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
[ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
[ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
[ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
[ 2911.185310] sp : ffff8000a1513900
[ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
[ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
[ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
[ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
[ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
[ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
[ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
[ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
[ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
[ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
[ 2911.194259] Call trace:
[ 2911.194626] tracing_map_sort_entries+0x3e0/0x408
[ 2911.195220] hist_show+0x124/0x800
[ 2911.195692] seq_read_iter+0x1d4/0x4e8
[ 2911.196193] seq_read+0xe8/0x138
[ 2911.196638] vfs_read+0xc8/0x300
[ 2911.197078] ksys_read+0x70/0x108
[ 2911.197534] __arm64_sys_read+0x24/0x38
[ 2911.198046] invoke_syscall+0x78/0x108
[ 2911.198553] el0_svc_common.constprop.0+0xd0/0xf8
[ 2911.199157] do_el0_svc+0x28/0x40
[ 2911.199613] el0_svc+0x40/0x178
[ 2911.200048] el0t_64_sync_handler+0x13c/0x158
[ 2911.200621] el0t_64_sync+0x1a8/0x1b0
[ 2911.201115] ---[ end trace 0000000000000000 ]---
The problem appears to be caused by CPU reordering of writes issued from
__tracing_map_insert().
The check for the presence of an element with a given key in this
function is:
val = READ_ONCE(entry->val);
if (val && keys_match(key, val->key, map->key_size)) ...
The write of a new entry is:
elt = get_free_elt(map);
memcpy(elt->key, key, map->key_size);
entry->val = elt;
The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
stores may become visible in the reversed order on another CPU. This
second CPU might then incorrectly determine that a new key doesn't match
an already present val->key and subsequently insert a new element,
resulting in a duplicate.
Fix the problem by adding a write barrier between
"memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
and the "val->key" check which has an address dependency.
The barrier is placed on a path executed when adding an element for
a new key. Subsequent updates targeting the same key remain unaffected.
From the user's perspective, the issue was introduced by commit
|
||
|
|
4fbbed7872 |
Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for time and clocksources:
- A fix for the idle and iowait time accounting vs CPU hotplug.
The time is reset on CPU hotplug which makes the accumulated
systemwide time jump backwards.
- Assorted fixes and improvements for clocksource/event drivers"
* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
clocksource/drivers/ep93xx: Fix error handling during probe
clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
clocksource/timer-riscv: Add riscv_clock_shutdown callback
dt-bindings: timer: Add StarFive JH8100 clint
dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
|
||
|
|
71fee48fb7 |
tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
When offlining and onlining CPUs the overall reported idle and iowait times as reported by /proc/stat jump backward and forward: cpu 132 0 176 225249 47 6 6 21 0 0 cpu0 80 0 115 112575 33 3 4 18 0 0 cpu1 52 0 60 112673 13 3 1 2 0 0 cpu 133 0 177 226681 47 6 6 21 0 0 cpu0 80 0 116 113387 33 3 4 18 0 0 cpu 133 0 178 114431 33 6 6 21 0 0 <---- jump backward cpu0 80 0 116 114247 33 3 4 18 0 0 cpu1 52 0 61 183 0 3 1 2 0 0 <---- idle + iowait start with 0 cpu 133 0 178 228956 47 6 6 21 0 0 <---- jump forward cpu0 81 0 117 114929 33 3 4 18 0 0 Reason for this is that get_idle_time() in fs/proc/stat.c has different sources for both values depending on if a CPU is online or offline: - if a CPU is online the values may be taken from its per cpu tick_cpu_sched structure - if a CPU is offline the values are taken from its per cpu cpustat structure The problem is that the per cpu tick_cpu_sched structure is set to zero on CPU offline. See tick_cancel_sched_timer() in kernel/time/tick-sched.c. Therefore when a CPU is brought offline and online afterwards both its idle and iowait sleeptime will be zero, causing a jump backward in total system idle and iowait sleeptime. In a similar way if a CPU is then brought offline again the total idle and iowait sleeptimes will jump forward. It looks like this behavior was introduced with commit |
||
|
|
e626cb02ee |
futex: Prevent the reuse of stale pi_state
Jiri Slaby reported a futex state inconsistency resulting in -EINVAL during
a lock operation for a PI futex. It requires that the a lock process is
interrupted by a timeout or signal:
T1 Owns the futex in user space.
T2 Tries to acquire the futex in kernel (futex_lock_pi()). Allocates a
pi_state and attaches itself to it.
T2 Times out and removes its rt_waiter from the rt_mutex. Drops the
rtmutex lock and tries to acquire the hash bucket lock to remove
the futex_q. The lock is contended and T2 schedules out.
T1 Unlocks the futex (futex_unlock_pi()). Finds a futex_q but no
rt_waiter. Unlocks the futex (do_uncontended) and makes it available
to user space.
T3 Acquires the futex in user space.
T4 Tries to acquire the futex in kernel (futex_lock_pi()). Finds the
existing futex_q of T2 and tries to attach itself to the existing
pi_state. This (attach_to_pi_state()) fails with -EINVAL because uval
contains the TID of T3 but pi_state points to T1.
It's incorrect to unlock the futex and make it available for user space to
acquire as long as there is still an existing state attached to it in the
kernel.
T1 cannot hand over the futex to T2 because T2 already gave up and started
to clean up and is blocked on the hash bucket lock, so T2's futex_q with
the pi_state pointing to T1 is still queued.
T2 observes the futex_q, but ignores it as there is no waiter on the
corresponding rt_mutex and takes the uncontended path which allows the
subsequent caller of futex_lock_pi() (T4) to observe that stale state.
To prevent this the unlock path must dequeue all futex_q entries which
point to the same pi_state when there is no waiter on the rt mutex. This
requires obviously to make the dequeue conditional in the locking path to
prevent a double dequeue. With that it's guaranteed that user space cannot
observe an uncontended futex which has kernel state attached.
Fixes:
|
||
|
|
2a668d2176 |
Merge tag 'kgdb-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb update from Daniel Thompson: "The entire changeset for kgdb this cycle is a single two-line change to remove some deadcode that, had it not been dead, would have called strncpy() in an unsafe manner. To be fair there were other modest clean ups were discussed this cycle but they are not finalized and will have to wait until next time" * tag 'kgdb-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kdb: Fix a potential buffer overflow in kdb_local() |
||
|
|
736b5545d3 |
Merge tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
Previous releases - regressions:
- Revert "net: rtnetlink: Enslave device before bringing it up",
breaks the case inverse to the one it was trying to fix
- net: dsa: fix oob access in DSA's netdevice event handler
dereference netdev_priv() before check its a DSA port
- sched: track device in tcf_block_get/put_ext() only for clsact
binder types
- net: tls, fix WARNING in __sk_msg_free when record becomes full
during splice and MORE hint set
- sfp-bus: fix SFP mode detect from bitrate
- drv: stmmac: prevent DSA tags from breaking COE
Previous releases - always broken:
- bpf: fix no forward progress in in bpf_iter_udp if output buffer is
too small
- bpf: reject variable offset alu on registers with a type of
PTR_TO_FLOW_KEYS to prevent oob access
- netfilter: tighten input validation
- net: add more sanity check in virtio_net_hdr_to_skb()
- rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid
infinite loop
- amt: do not use the portion of skb->cb area which may get clobbered
- mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option
Misc:
- spring cleanup of inactive maintainers"
* tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
i40e: Include types.h to some headers
ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
selftests: mlxsw: qos_pfc: Remove wrong description
mlxsw: spectrum_router: Register netdevice notifier before nexthop
mlxsw: spectrum_acl_tcam: Fix stack corruption
mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
ethtool: netlink: Add missing ethnl_ops_begin/complete
selftests: bonding: Add more missing config options
selftests: netdevsim: add a config file
libbpf: warn on unexpected __arg_ctx type when rewriting BTF
selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
bpf: extract bpf_ctx_convert_map logic and make it more reusable
libbpf: feature-detect arg:ctx tag support in kernel
ipvs: avoid stat macros calls from preemptible context
netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
netfilter: nf_tables: skip dead set elements in netlink dump
netfilter: nf_tables: do not allow mismatch field size and set key length
...
|
||
|
|
17e232b6d2 |
Merge tag 'dma-mapping-6.8-2024-01-18' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig: - fix kerneldoc warnings (Randy Dunlap) - better bounds checking in swiotlb (ZhangPeng) * tag 'dma-mapping-6.8-2024-01-18' of git://git.infradead.org/users/hch/dma-mapping: dma-debug: fix kernel-doc warnings swiotlb: check alloc_size before the allocation of a new memory pool |
||
|
|
0dde2bf67b |
Merge tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm
SMMU implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
Intel VT-d driver:
- Cleanup and refactoring
AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes"
* tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (66 commits)
iommu: Don't reserve 0-length IOVA region
iommu/vt-d: Move inline helpers to header files
iommu/vt-d: Remove unused vcmd interfaces
iommu/vt-d: Remove unused parameter of intel_pasid_setup_pass_through()
iommu/vt-d: Refactor device_to_iommu() to retrieve iommu directly
iommu/sva: Fix memory leak in iommu_sva_bind_device()
dt-bindings: iommu: rockchip: Add Rockchip RK3588
iommu/dma: Trace bounce buffer usage when mapping buffers
iommu/arm-smmu: Convert to domain_alloc_paging()
iommu/arm-smmu: Pass arm_smmu_domain to internal functions
iommu/arm-smmu: Implement IOMMU_DOMAIN_BLOCKED
iommu/arm-smmu: Convert to a global static identity domain
iommu/arm-smmu: Reorganize arm_smmu_domain_add_master()
iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent()
iommu/arm-smmu-v3: Add a type for the STE
iommu/arm-smmu-v3: disable stall for quiet_cd
iommu/qcom: restore IOMMU state if needed
iommu/arm-smmu-qcom: Add QCM2290 MDSS compatible
iommu/arm-smmu-qcom: Add missing GMU entry to match table
...
|
||
|
|
a2ded784cd |
Merge tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt: - Allow kernel trace instance creation to specify what events are created Inside the kernel, a subsystem may create a tracing instance that it can use to send events to user space. This sub-system may not care about the thousands of events that exist in eventfs. Allow the sub-system to specify what sub-systems of events it cares about, and only those events are exposed to this instance. - Allow the ring buffer to be broken up into bigger sub-buffers than just the architecture page size. A new tracefs file called "buffer_subbuf_size_kb" is created. The user can now specify a minimum size the sub-buffer may be in kilobytes. Note, that the implementation currently make the sub-buffer size a power of 2 pages (1, 2, 4, 8, 16, ...) but the user only writes in kilobyte size, and the sub-buffer will be updated to the next size that it will can accommodate it. If the user writes in 10, it will change the size to be 4 pages on x86 (16K), as that is the next available size that can hold 10K pages. - Update the debug output when a corrupt time is detected in the ring buffer. If the ring buffer detects inconsistent timestamps, there's a debug config options that will dump the contents of the meta data of the sub-buffer that is used for debugging. Add some more information to this dump that helps with debugging. - Add more timestamp debugging checks (only triggers when the config is enabled) - Increase the trace_seq iterator to 2 page sizes. - Allow strings written into tracefs_marker to be larger. Up to just under 2 page sizes (based on what trace_seq can hold). - Increase the trace_maker_raw write to be as big as a sub-buffer can hold. - Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has been removed. - More selftests were added. - Some code clean ups as well. * tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits) ring-buffer: Remove stale comment from ring_buffer_size() tracing histograms: Simplify parse_actions() function tracing/selftests: Remove exec permissions from trace_marker.tc test ring-buffer: Use subbuf_order for buffer page masking tracing: Update subbuffer with kilobytes not page order ringbuffer/selftest: Add basic selftest to test changing subbuf order ring-buffer: Add documentation on the buffer_subbuf_order file ring-buffer: Just update the subbuffers when changing their allocation order ring-buffer: Keep the same size when updating the order tracing: Stop the tracing while changing the ring buffer subbuf size tracing: Update snapshot order along with main buffer order ring-buffer: Make sure the spare sub buffer used for reads has same size ring-buffer: Do no swap cpu buffers if order is different ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure ring-buffer: Read and write to ring buffers with custom sub buffer size ring-buffer: Set new size of the ring buffer sub page ring-buffer: Add interface for configuring trace sub buffer size ring-buffer: Page size per ring buffer ring-buffer: Have ring_buffer_print_page_header() be able to access ring_buffer_iter ring-buffer: Check if absolute timestamp goes backwards ... |
||
|
|
5b890ad456 |
Merge tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes update from Masami Hiramatsu: - Update the Kprobes trace event to show the actual function name in notrace-symbol warning. Instead of using the user specified symbol name, use "%ps" printk format to show the actual symbol at the probe address. Since kprobe event accepts the offset from symbol which is bigger than the symbol size, the user specified symbol may not be the actual probed symbol. * tag 'probes-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: trace/kprobe: Display the actual notrace function when rejecting a probe |
||
|
|
b0d326da46 |
Merge tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar: "Fix a cpufreq related performance regression on certain systems, where the CPU would remain at the lowest frequency, degrading performance substantially" * tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix frequency selection for non-invariant case |
||
|
|
80955ae955 |
Merge tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are the set of driver core and kernfs changes for 6.8-rc1.
Nothing major in here this release cycle, just lots of small cleanups
and some tweaks on kernfs that in the very end, got reverted and will
come back in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues"
* tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits)
Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"
kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
class: fix use-after-free in class_register()
PM: clk: make pm_clk_add_notifier() take a const pointer
EDAC: constantify the struct bus_type usage
kernfs: fix reference to renamed function
driver core: device.h: fix Excess kernel-doc description warning
driver core: class: fix Excess kernel-doc description warning
driver core: mark remaining local bus_type variables as const
driver core: container: make container_subsys const
driver core: bus: constantify subsys_register() calls
driver core: bus: make bus_sort_breadthfirst() take a const pointer
kernfs: d_obtain_alias(NULL) will do the right thing...
driver core: Better advertise dev_err_probe()
kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy()
initramfs: Expose retained initrd as sysfs file
fs/kernfs/dir: obey S_ISGID
kernel/cgroup: use kernfs_create_dir_ns()
...
|
||
|
|
0ba971511d |
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
Add enforcement of expected types for context arguments tagged with
arg:ctx (__arg_ctx) tag.
First, any program type will accept generic `void *` context type when
combined with __arg_ctx tag.
Besides accepting "canonical" struct names and `void *`, for a bunch of
program types for which program context is actually a named struct, we
allows a bunch of pragmatic exceptions to match real-world and expected
usage:
- for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
canonical context argument type, where `bpf_user_pt_regs_t` is a
*typedef*, not a struct;
- for kprobes, we also always accept `struct pt_regs *`, as that's what
actually is passed as a context to any kprobe program;
- for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
down to actual struct type and accept `struct pt_regs *`, or
`struct user_pt_regs *`, or `struct user_regs_struct *`, depending
on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
expected;
- for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
what's expected with BPF_PROG() usage; otherwise, canonical
`struct bpf_raw_tracepoint_args *` is expected;
- tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
formats, both are coded as expections as tp_btf is actually a TRACING
program type, which has no canonical context type;
- iterator programs accept `struct bpf_iter__xxx *` structs, currently
with no further iterator-type specific enforcement;
- fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
- classic tracepoint programs, as well as syscall and freplace
programs allow any user-provided type.
In all other cases kernel will enforce exact match of struct name to
expected canonical type. And if user-provided type doesn't match that
expectation, verifier will emit helpful message with expected type name.
Note a bit unnatural way the check is done after processing all the
arguments. This is done to avoid conflict between bpf and bpf-next
trees. Once trees converge, a small follow up patch will place a simple
btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
(which bpf-next tree patch refactored already), removing duplicated
arg:ctx detection logic.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240118033143.3384355-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
66967a32d3 |
bpf: extract bpf_ctx_convert_map logic and make it more reusable
Refactor btf_get_prog_ctx_type() a bit to allow reuse of bpf_ctx_convert_map logic in more than one places. Simplify interface by returning btf_type instead of btf_member (field reference in BTF). To do the above we need to touch and start untangling btf_translate_to_vmlinux() implementation. We do the bare minimum to not regress anything for btf_translate_to_vmlinux(), but its implementation is very questionable for what it claims to be doing. Mapping kfunc argument types to kernel corresponding types conceptually is quite different from recognizing program context types. Fixing this is out of scope for this change though. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240118033143.3384355-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
296455ade1 |
Merge tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc and other driver updates from Greg KH: "Here is the big set of char/misc and other driver subsystem changes for 6.8-rc1. Other than lots of binder driver changes (as you can see by the merge conflicts) included in here are: - lots of iio driver updates and additions - spmi driver updates - eeprom driver updates - firmware driver updates - ocxl driver updates - mhi driver updates - w1 driver updates - nvmem driver updates - coresight driver updates - platform driver remove callback api changes - tags.sh script updates - bus_type constant marking cleanups - lots of other small driver updates All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (341 commits) android: removed duplicate linux/errno uio: Fix use-after-free in uio_open drivers: soc: xilinx: add check for platform firmware: xilinx: Export function to use in other module scripts/tags.sh: remove find_sources scripts/tags.sh: use -n to test archinclude scripts/tags.sh: add local annotation scripts/tags.sh: use more portable -path instead of -wholename scripts/tags.sh: Update comment (addition of gtags) firmware: zynqmp: Convert to platform remove callback returning void firmware: turris-mox-rwtm: Convert to platform remove callback returning void firmware: stratix10-svc: Convert to platform remove callback returning void firmware: stratix10-rsu: Convert to platform remove callback returning void firmware: raspberrypi: Convert to platform remove callback returning void firmware: qemu_fw_cfg: Convert to platform remove callback returning void firmware: mtk-adsp-ipc: Convert to platform remove callback returning void firmware: imx-dsp: Convert to platform remove callback returning void firmware: coreboot_table: Convert to platform remove callback returning void firmware: arm_scpi: Convert to platform remove callback returning void firmware: arm_scmi: Convert to platform remove callback returning void ... |
||
|
|
7f5e47f785 |
Merge tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton: "For once not mostly MM-related. 17 hotfixes. 10 address post-6.7 issues and the other 7 are cc:stable" * tag 'mm-hotfixes-stable-2024-01-12-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: userfaultfd: avoid huge_zero_page in UFFDIO_MOVE MAINTAINERS: add entry for shrinker selftests: mm: hugepage-vmemmap fails on 64K page size systems mm/memory_hotplug: fix memmap_on_memory sysfs value retrieval mailmap: switch email for Tanzir Hasan mailmap: add old address mappings for Randy kernel/crash_core.c: make __crash_hotplug_lock static efi: disable mirror feature during crashkernel kexec: do syscore_shutdown() in kernel_kexec mailmap: update entry for Manivannan Sadhasivam fs/proc/task_mmu: move mmu notification mechanism inside mm lock mm: zswap: switch maintainers to recently active developers and reviewers scripts/decode_stacktrace.sh: optionally use LLVM utilities kasan: avoid resetting aux_lock lib/Kconfig.debug: disable CONFIG_DEBUG_INFO_BTF for Hexagon MAINTAINERS: update LTP maintainers kdump: defer the insertion of crashkernel resources |
||
|
|
4f41d30cd6 |
kdb: Fix a potential buffer overflow in kdb_local()
When appending "[defcmd]" to 'kdb_prompt_str', the size of the string
already in the buffer should be taken into account.
An option could be to switch from strncat() to strlcat() which does the
correct test to avoid such an overflow.
However, this actually looks as dead code, because 'defcmd_in_progress'
can't be true here.
See a more detailed explanation at [1].
[1]: https://lore.kernel.org/all/CAD=FV=WSh7wKN7Yp-3wWiDgX4E3isQ8uh0LCzTmd1v9Cg9j+nQ@mail.gmail.com/
Fixes:
|
||
|
|
22c7fa171a |
bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS
For PTR_TO_FLOW_KEYS, check_flow_keys_access() only uses fixed off
for validation. However, variable offset ptr alu is not prohibited
for this ptr kind. So the variable offset is not checked.
The following prog is accepted:
func#0 @0
0: R1=ctx() R10=fp0
0: (bf) r6 = r1 ; R1=ctx() R6_w=ctx()
1: (79) r7 = *(u64 *)(r6 +144) ; R6_w=ctx() R7_w=flow_keys()
2: (b7) r8 = 1024 ; R8_w=1024
3: (37) r8 /= 1 ; R8_w=scalar()
4: (57) r8 &= 1024 ; R8_w=scalar(smin=smin32=0,
smax=umax=smax32=umax32=1024,var_off=(0x0; 0x400))
5: (0f) r7 += r8
mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r8 stack= before 4: (57) r8 &= 1024
mark_precise: frame0: regs=r8 stack= before 3: (37) r8 /= 1
mark_precise: frame0: regs=r8 stack= before 2: (b7) r8 = 1024
6: R7_w=flow_keys(smin=smin32=0,smax=umax=smax32=umax32=1024,var_off
=(0x0; 0x400)) R8_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1024,
var_off=(0x0; 0x400))
6: (79) r0 = *(u64 *)(r7 +0) ; R0_w=scalar()
7: (95) exit
This prog loads flow_keys to r7, and adds the variable offset r8
to r7, and finally causes out-of-bounds access:
BUG: unable to handle page fault for address: ffffc90014c80038
[...]
Call Trace:
<TASK>
bpf_dispatcher_nop_func include/linux/bpf.h:1231 [inline]
__bpf_prog_run include/linux/filter.h:651 [inline]
bpf_prog_run include/linux/filter.h:658 [inline]
bpf_prog_run_pin_on_cpu include/linux/filter.h:675 [inline]
bpf_flow_dissect+0x15f/0x350 net/core/flow_dissector.c:991
bpf_prog_test_run_flow_dissector+0x39d/0x620 net/bpf/test_run.c:1359
bpf_prog_test_run kernel/bpf/syscall.c:4107 [inline]
__sys_bpf+0xf8f/0x4560 kernel/bpf/syscall.c:5475
__do_sys_bpf kernel/bpf/syscall.c:5561 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5559 [inline]
__x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:5559
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Fix this by rejecting ptr alu with variable offset on flow_keys.
Applying the patch rejects the program with "R7 pointer arithmetic
on flow_keys prohibited".
Fixes:
|
||
|
|
e37617c8e5 |
sched/fair: Fix frequency selection for non-invariant case
Linus reported a ~50% performance regression on single-threaded workloads on his AMD Ryzen system, and bisected it to: |
||
|
|
7c65aa3cc0 |
dma-debug: fix kernel-doc warnings
Update the kernel-doc comments to catch up with the code changes and
fix the kernel-doc warnings:
debug.c:83: warning: Excess struct member 'stacktrace' description in 'dma_debug_entry'
debug.c:83: warning: Function parameter or struct member 'stack_len' not described in 'dma_debug_entry'
debug.c:83: warning: Function parameter or struct member 'stack_entries' not described in 'dma_debug_entry'
Fixes:
|